1795 Commits

Author SHA1 Message Date
Marc Nause
3bc5ee6e3d *) added protection against CSRF in update download page
(http://localhost:8090/ConfigUpdate_p.html?releaseinstall=../../test.txt&deleteRelease=Delete+Release
does not work anymore)
2013-02-04 19:57:28 +01:00
Michael Peter Christen
4f270d89e2 another NPE 2013-02-04 18:04:52 +01:00
Michael Peter Christen
921091c3a6 use thread-safe http connection manager for authenticated remote solr
connections
2013-02-04 17:48:04 +01:00
Michael Peter Christen
e8f7b85b98 fixes to internal RWI usage if RWI is switched off (NPE etc) 2013-02-04 17:11:02 +01:00
Michael Peter Christen
3834829b37 bugfixes and more logging for solr connector 2013-02-04 16:42:10 +01:00
Michael Peter Christen
80fe3d7860 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/cora/federate/solr/connector/EmbeddedSolrConnector.java
2013-02-04 10:57:54 +01:00
Michael Peter Christen
4323621a76 update to Solr 4.1.0 2013-02-04 10:55:49 +01:00
reger
160ce568b3 move testing SolrServlet.main to test, making include of jetty*.jar in distribution and classpath obsolete
- move jetty*.jar to test library 
- move SolrServlet.main as is to test, add also a junit test simulating main 
  - add build.xml cleanup for EmbeddedSolrConnectorTest created test/DATA
- adjust some test compile errors
2013-02-03 22:32:38 +01:00
orbiter
07a20e8253 removed unused import 2013-02-02 10:52:39 +01:00
Michael Peter Christen
d1cb4cbc84 enhanced network scanner, is faster and more flexible now
- start more processes
- remove superfluous host name resolution
- better/more flexible subnet ip range calculation
- prefer ipv4 makes better usable ip pre-settings in servlet
- extended servlet by new subnet /20 - option
- redesign of scanner start process in servlet (generalization)
2013-02-02 09:51:43 +01:00
Michael Peter Christen
592adf7ccb fix for domain navigation 2013-02-02 07:21:18 +01:00
Michael Peter Christen
4ca1b76627 less search overhead when first result set is smaller than requested 2013-02-02 07:20:56 +01:00
Michael Peter Christen
f748b0aa7c NPE fix 2013-02-02 07:20:02 +01:00
Michael Peter Christen
7dfcc92b71 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-01-31 13:15:42 +01:00
Michael Peter Christen
0b6566a389 optimizations when starting large crawl requests with many start urls in
one request:
- allow larger match-fields in html interface
- delete all host hashes at once from zurl
- when deleting by host, do not count size of deleted entries since that
was the reason it took so long
2013-01-31 13:15:28 +01:00
orbiter
a2160054d7 ability to create vocabularies also without any objectspace: this
iterates over all urls in the index to create terms
2013-01-30 19:33:48 +01:00
orbiter
ecc10a752c fixes to index enumeration for vocabulary production 2013-01-29 18:14:14 +01:00
sixcooler
3a13906121 clear some more caches if running out of memory 2013-01-25 04:24:36 +01:00
Michael Peter Christen
8651ec35fe turned author_s into the multi-valued field author_sxt 2013-01-24 18:24:31 +01:00
Michael Peter Christen
4589afe056 fix NPE when solr does not deliver snippets 2013-01-24 14:12:31 +01:00
Michael Peter Christen
0fe7b6fd3b migrated the index export methods from the old metadata to solr. Now
exports are done using solr queries. removed superfluous methods and
servlets.
2013-01-24 12:39:19 +01:00
Michael Peter Christen
1768c82010 removed field selection because that created documents with that field
only which was not useful when re-writing the same document
2013-01-24 03:26:38 +01:00
Michael Peter Christen
31e854bef6 Merge remote-tracking branch 'copro/master' 2013-01-23 14:41:17 +01:00
Michael Peter Christen
4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy has now an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore the commit strategy when doing a search from
the web interface was changed (it's done every time before a search is
done).

The softcommit feature was implemented because it was needed for the
following changes (customer demands), which are also included in this
git commit:

- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.

To support the new softcommit strategy, the commitWithinMs option was
set to -1 to disable automatic commit based on document insert times. If
documents are inserted permanently then also a commit would happen
permanently whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
2013-01-23 14:40:58 +01:00
Copro
3ea8380959 Adding Vimeo tag to wiki commands to embed Vimeo video with id 2013-01-23 04:00:15 +01:00
Copro
ee9d7fd93d Added feature to embed YouTube videos to wiki commands for usage in
Wiki, Blog or other servlets
2013-01-23 02:43:58 +01:00
Michael Peter Christen
9ccdd21d76 Merge remote-tracking branch 'aleksejs/fixtrans'
Conflicts:
	locales/ru.lng
	
Tried to merge this but I had to do this 'blind'.
Sorry if I deleted something that was right.
2013-01-22 11:54:38 +01:00
Michael Peter Christen
db024a4e19 added new solr fields (unused yet; implementation will follow) 2013-01-21 18:02:29 +01:00
Michael Peter Christen
f5fd2aea18 removed archaic migration code 2013-01-21 17:59:42 +01:00
Michael Peter Christen
60f2a69331 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-01-17 21:53:19 +01:00
Michael Peter Christen
cba038f97b one more NPE fix 2013-01-17 21:52:56 +01:00
sixcooler
f3e705c4fe bump to httpclient / httpcore 4.2.3 (bugfix-release) 2013-01-17 20:10:49 +01:00
Michael Peter Christen
af465cdca5 fix for wrong robots.txt loading for https protocol
see also: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4579
2013-01-16 17:38:06 +01:00
Michael Peter Christen
c3d50d91f8 relaxing site operator for www prefix:
- when using a site operator search for a domain where the domain has a
www prefix, the domain without the www is also included
- when using a site operator search for a domain where the domain has no
www prefix, the domain with the www is also included
- in the host navigator, all domains with and without a www prefix are
accumulated. That means that the host navigator never shows a host
with a www prefix.
This should prevent usage mistakes of the site operator.
2013-01-16 14:54:35 +01:00
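The www relaxation described in commit c3d50d91f8 could be sketched roughly as follows. This is only an illustration under assumptions: the class `SiteOperatorHosts` and its methods are hypothetical names, not taken from the YaCy sources.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the actual YaCy implementation.
final class SiteOperatorHosts {

    // Returns the given host together with its www / non-www counterpart,
    // so a site operator search can match both forms.
    static Set<String> variants(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        result.add(h);
        if (h.startsWith("www.")) {
            result.add(h.substring(4));
        } else {
            result.add("www." + h);
        }
        return result;
    }

    // Key under which a host is accumulated in the host navigator:
    // the host name without a leading "www." prefix.
    static String navigatorKey(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }
}
```

With this sketch, `variants("www.example.org")` and `variants("example.org")` both contain the two host forms, and `navigatorKey` maps both to `example.org`.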
Michael Peter Christen
db49e91724 fixed an NPE which may appear for freeworld peers without any rwi index
data. The NPE looked like this:
Caused by: java.lang.NullPointerException
	at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:279)
	at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:155)
	at search.respond(search.java:314)
	... 12 more
2013-01-16 11:07:20 +01:00
Michael Peter Christen
4faa07c214 added a timeout for topic computation (solr is here much slower than the
old metadata-db)
2013-01-15 16:20:43 +01:00
Michael Peter Christen
d2d5be032d added an 'inlink' search option according to the suggestion in the YaCy
forum at 
http://forum.yacy-websuche.de/viewtopic.php?f=18&t=4572#p27410

The feature was not called 'haslink' but 'inlink' to have a naming
analogous to 'inurl'. You can now search for
words in links of the document, like:
* inlink:yacy
searches all documents which link to pages which have 'yacy' in the
url.
2013-01-14 12:50:21 +01:00
reger
3897bb4409 added (manual) urldb migration (link on: Index Administration -> Federated Solr Index)
- migrates all entries in old urldb

Metadata coordinate (lat / lon) NumberFormatException still occurs relatively often (see excerpt below):
- added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format)
- removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0  (leaving it to the compiler/optimizer to choose number format)

current log excerpt for NumberFormatException:
W 2013/01/14 00:10:07 StackTrace For input string: "-"
java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
...
Caused by: java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
2013-01-14 03:06:24 +01:00
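The try/catch mentioned in commit 3897bb4409 guards against coordinate strings such as "-" that `Double.parseDouble` rejects. A minimal sketch of that kind of guard is shown below; the class name `CoordinateParser` and the fallback value 0.0 are assumptions for illustration and may differ from what URIMetadataRow actually does.

```java
// Hypothetical helper, not the actual URIMetadataRow code.
final class CoordinateParser {

    // Parses a latitude or longitude string and falls back to 0.0
    // (an assumed "no coordinate" value) when the input is missing
    // or not a valid number, e.g. "-".
    static double parseCoordinate(String raw) {
        if (raw == null) {
            return 0.0d;
        }
        try {
            return Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            return 0.0d;
        }
    }
}
```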
reger
3b6e08b49f prevent checking of urldb if empty
- disconnect urlIndexFile if empty
- add missing lock class in submenuSearchConfiguration
2013-01-12 15:20:23 +01:00
reger
f143804382 fix configuration for search page navigators
- added additional config page (ConfigSearchPage_p) for easy setup of search page layout (to not overload ConfigPortal page)
   - currently redundant setting with part of ConfigPortal page
- added missing config for filetype and protocol navigator
- adjusted init of SearchEvent to check navigation config setting
- renamed RankingProcess.getTopicNavigator to getTopics (to distinguish it from the added SearchEvent.getTopicNavigator)
2013-01-05 19:00:54 +01:00
Michael Peter Christen
becd52a984 added also a re-calculation of reference counts during the
post-processing of clickcount calculations. This is a really nice thing
to have because the reference count affects ranking.
2013-01-05 00:58:27 +01:00
Michael Peter Christen
38d3feae65 added separate delete commands for the local+remote solr index, the old
metadata and old rwi and for the citation index. The important
advancement is the separation of the citation index deletion because
that index is responsible for the linkdepth calculation. Now a search
index can be deleted without the citation index, so fewer
clickdepths should need post-processing.
2013-01-04 16:39:34 +01:00
Michael Peter Christen
6f0baaa309 added the clickdepth post-processing: some links may have 'shortcuts' to
already calculated click depths. These are then calculated if the crawl
buffer is empty and therefore no new 'shortcuts' can be discovered.
The status of the clickdepth stack (to-be-processed) can be seen using a
solr search command like this:
http://localhost:8090/solr/select?q=process_sxt:[*%20TO%20*]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt
2013-01-04 16:37:39 +01:00
Michael Peter Christen
0f5b6f38c1 enhanced root-url detection 2013-01-03 19:21:21 +01:00
Michael Peter Christen
5c0c56cfe1 Preparations to produce a click depth attribute in the search index.
This attribute can be used for ranking and for other purposes (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; i.e. calculating the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
2013-01-02 20:55:43 +01:00
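To illustrate what "shortest click depth" means in commit 5c0c56cfe1, the sketch below runs a breadth-first search over a forward link map. Note that this is a swapped-in technique for illustration: the commit itself describes backtracking through the reverse link index, and the class `ClickDepth` and its in-memory link map are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only; YaCy backtracks over its reverse link index instead.
final class ClickDepth {

    // Computes the minimum number of clicks from the root page to every
    // reachable page. 'links' maps a page url to the urls it links to.
    static Map<String, Integer> shortestDepths(String root, Map<String, List<String>> links) {
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            int d = depth.get(page);
            for (String target : links.getOrDefault(page, Collections.emptyList())) {
                // The first visit in a breadth-first search is the shortest path.
                if (!depth.containsKey(target)) {
                    depth.put(target, d + 1);
                    queue.add(target);
                }
            }
        }
        return depth;
    }
}
```

Pages that are not reachable from the root do not appear in the result, which roughly corresponds to an unknown click depth.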
Michael Peter Christen
6861af87e2 removed warnings 2013-01-02 19:05:48 +01:00
Michael Peter Christen
295884fd54 - Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8'
- fixed conflict in	htroot/yacysearch.java
- removed nedres check because that causes that the remote server is not
called at all in most cases (local index has already results but we want
more)
- fixed a regex bug (a '=' too much)
2013-01-02 15:08:07 +01:00
reger
276e63401e small sanitary fixes
- exclude unix shell scripts in NSIS windows install archive
- replace link to env/grafics/yacy.gif to yacy.png (build.nsi)
- remove unused code lines (Blacklist_p, Response, WordReferenceVars)
- typo & xhtml (RankingSolr_p.html)
2013-01-02 01:59:47 +01:00
reger
f301336adf fix: no results with configuration citation reference index switched off
- urlcitationindex != null check added to ResultEntry.referencesCount
- plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)
2012-12-30 02:13:48 +01:00
orbiter
fe50702eb0 added a filterscannerfail attribute to QueryParams which causes that a
check to the network scanner fail/success status can be used/suppressed
for search results. This is a feature that comes with the port scanner.
2012-12-29 17:47:34 +01:00
reger
168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
- any system supporting opensearch specification can be configured
- search query is only forwarded to remote system if not enough results available on local peer
- discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config
     - sample config file with some general search engines with opensearch support
2012-12-29 08:24:48 +01:00
Michael Peter Christen
eb90d38cd7 added missing extension 'mkv' for navigation 2012-12-27 13:56:13 +01:00
Michael Peter Christen
95712fdc8b update to pdf parser 2012-12-27 04:16:31 +01:00
Michael Peter Christen
4a9182ae16 use the search configuration to default the cacheStrategy to the value
as given in the search configuration
2012-12-27 03:19:21 +01:00
Michael Peter Christen
98819ec3d9 use solr boost configuration to select search fields. At this time it is
possible to enter a negative boost value to switch that value off. This
might be different in the future with a better input interface.
2012-12-27 03:17:45 +01:00
Michael Peter Christen
e1f89efd0d - made image search in interactive search using the ViewImage servlet -
that enables viewing of images for intranet SMB servers.
- added a filter search for protocol, tld and ext again; otherwise p2p
search produces a lot of rubbish
2012-12-26 21:25:27 +01:00
Michael Peter Christen
8f3bd0c387 fix for smb crawl situation (lost too many urls) 2012-12-26 19:15:11 +01:00
reger
d456f69381 SeedUpload url : check to reject localhost url included in saveSeedList (same check as in / copied from Seed.isProper() ), to prevent identity change on next startup (due to rejected seeduploadurl). 2012-12-24 23:29:02 +01:00
reger
4987caf1c9 - apply fix for localhost handling (from yacy2solr) also to metadata2solr 2012-12-23 01:30:52 +01:00
reger
0148f1bb8c fix: exception if default work files don't exist 2012-12-22 23:03:39 +01:00
Michael Peter Christen
9e4033f229 fix for event starter: delete start time when event is removed 2012-12-22 21:16:22 +01:00
Michael Peter Christen
99271ffd13 copy work tables from defaults/data/work if exist there and not in
DATA/WORK
This can be used to create start-up behavior work scripts in the
api.bheap table
2012-12-22 20:54:05 +01:00
Michael Peter Christen
24c9bb35f7 extended the Scheduler: introduced scheduled events
- an event type (once, regular) can be selected
- for this event type, a fixed time can be selected. This may be either
directly after startup or at one of the full hours of a day (==25
options)
The main point about this feature is the opportunity to start an action
directly after startup. That makes it possible to create YaCy
distributions which, when started for the first time, start to index
parts of the intranet/internet by themselves.
2012-12-22 16:27:14 +01:00
Michael Peter Christen
433143ba40 removed protocol, tld, ext from the urlmask and created specific
navigation field for these
2012-12-19 12:45:40 +01:00
Michael Peter Christen
84f82541e8 search process enhancements 2012-12-19 10:41:22 +01:00
Michael Peter Christen
02020b590b - removed all extension types from extension navigation which are not
proper/known
- automatically show the protocol navigation if there is more than http
and https
- automatically show the extension navigation if there is some media
content
2012-12-19 02:38:05 +01:00
Michael Peter Christen
01200f06cc using the author field as solr-native facet. this makes it necessary to
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
2012-12-19 01:56:33 +01:00
Michael Peter Christen
2a4c064c89 using the publisher information for the author field if no author is
given. This applies to cases where only the copyright field in the html
header is filled but not the author field
2012-12-19 01:54:35 +01:00
Michael Peter Christen
bab573361f - using a filter query for facet restriction
- calculating the whole search result in at most two sub-queries from
solr
2012-12-19 01:00:57 +01:00
Michael Peter Christen
eac9650b31 added another solr field clickdepth_i which reflects the number of
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all other with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
2012-12-18 17:20:42 +01:00
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more important than others. But
this is not true for typical link pages like menus. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. This field can contain a formula which computes the
boost according to given field values. After some experiments the
following formula is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that formula.
2012-12-18 14:42:35 +01:00
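The default boost function from commit 1052263af3 can be evaluated by hand with a small sketch like the one below. It assumes that the trailing `^0.4` is a weight applied to the function result (the usual reading of a boost suffix in a Solr `bf` parameter) rather than an exponent; the class and method names are hypothetical.

```java
// Illustration of div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
// Assumption: "^0.4" is a multiplicative weight on the function value.
final class ReferenceBoost {

    static double boost(int references, int inboundLinksCount) {
        double ratio = (1.0 + references) / Math.pow(1.0 + inboundLinksCount, 1.6);
        return 0.4 * ratio;
    }
}
```

Under this reading, a page with many references and few inbound links (as counted in `inboundlinkscount_i`) gets a larger additive boost than a link-heavy page with the same number of references.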
Michael Peter Christen
7c3de8b4cd - fix for localhost detection
- added IPv6 patterns for localhost detection
2012-12-18 12:52:20 +01:00
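A localhost check that also covers IPv6, in the spirit of commit 7c3de8b4cd, might look like the following sketch. The exact patterns used in YaCy are not shown in the commit message, so the pattern list and the class `LocalhostDetector` here are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the real pattern set in YaCy may differ.
final class LocalhostDetector {

    // Assumed patterns: the name "localhost", the IPv4 loopback range 127.x.x.x,
    // and the IPv6 loopback address in short and long form, with or without brackets.
    private static final Pattern LOCALHOST = Pattern.compile(
            "localhost"
            + "|127\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"
            + "|::1|\\[::1\\]"
            + "|0:0:0:0:0:0:0:1|\\[0:0:0:0:0:0:0:1\\]",
            Pattern.CASE_INSENSITIVE);

    static boolean isLocalhost(String host) {
        return host != null && LOCALHOST.matcher(host.trim()).matches();
    }
}
```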
Michael Peter Christen
34f8786508 removed dependency of vocabulary navigation from Jena and its
triplestore; the vocabulary search is now done using generic solr fields
which are created on-the-fly during runtime.
2012-12-18 02:29:03 +01:00
reger
ad71747525 fix: set default language to "en" 2012-12-16 20:53:45 +01:00
Michael Peter Christen
9319b90d8a - fixes for host navigation
- fixes for filetype navigation
- removed unused code
2012-12-15 09:14:49 +01:00
Michael Peter Christen
cb5cbec14d distinguishing modified query string and original query string 2012-12-15 00:05:46 +01:00
Michael Peter Christen
fb0fa9a102 - fixed 'delete from subpath' during crawl start which deleted nothing;
now works;
- changed some crawl start html design details
2012-12-11 13:38:28 +01:00
orbiter
712cc37c40 if maxFileSize < 0 then the file size limit is without limit. 2012-12-10 21:17:45 +01:00
orbiter
1f33c30d7b re-integrating useForHost method (lost sometime?) to get the noProxy
pattern working again. Without using this method all remote urls
including the localhost had been accessed through the configured proxy
2012-12-10 20:44:29 +01:00
reger
f1a9c2e604 fix Servlet template on conditional file include with use of conditional template pattern in included template file (example IndexCreateQueues_p.html)
see bug http://bugs.yacy.net/view.php?id=215
2012-12-10 20:02:35 +01:00
orbiter
a4a780b871 - fix for bad url conversion in bookmarks when using smb urls
- fix for localhost hosts in solr schema host handling
2012-12-10 07:22:42 +01:00
reger
e80dfeca23 - making blacklist path part case insensitive (solving http://bugs.yacy.net/view.php?id=171)
- blacklist test: adding explicit response text "not blocked" if no blacklist match
2012-12-08 06:34:48 +01:00
reger
e2d499be9e remove NOT NEEDED reference to solr.YaCySchema from ConfigurationSet to be able to use ConfigurationSet for other conf files (than solr.keys.default.list). 2012-12-08 00:19:20 +01:00
Michael Peter Christen
a3cd3852ab introduced a better place to update the lastacc time value in latency 2012-12-07 15:49:23 +01:00
Michael Peter Christen
864abcd33d removed Latency update after URL selection because that causes
a completely wrong behaviour when cache fresh cases appear. Makes
re-crawling MUCH faster!
2012-12-07 15:35:44 +01:00
Michael Peter Christen
dd241d03bb latency fix: only set last-visit time if access was actually by the
robot
2012-12-07 02:00:12 +01:00
Michael Peter Christen
118233a7e6 fix for bad xml in gsa result when doing a query with quotes 2012-12-07 01:35:02 +01:00
Michael Peter Christen
1e002ab18e added another blacklist-cleaner into balancer 2012-12-07 01:27:24 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
756772fbd3 fix for waitingtime computation for intranet configuration 2012-12-06 17:40:52 +01:00
Michael Peter Christen
fa27e5820f - check blacklist (again) when taking urls from the crawl stack because
the blacklist may get extended during crawling
- removed debug output
2012-12-06 00:12:16 +01:00
Michael Peter Christen
adfecc6ba8 more robustness during shutdown 2012-12-05 18:20:43 +01:00
Michael Peter Christen
d4bfe9339e Brute-force attempt to start solr in case of a memory problem.
I don't actually know if this is correct. It is a desperate try to get
YaCy running on production servers which must get alive even with
strange hacks like this. This is also related to a forum posting in
http://forum.yacy-websuche.de/viewtopic.php?t=4528&p=27135#p27135
2012-12-05 18:16:06 +01:00
Michael Peter Christen
8aa08261a7 update to Solr Boost handling 2012-12-05 12:26:42 +01:00
Michael Peter Christen
908ad2f174 Added a new servlet to configure the solr ranking using field boosts 2012-12-03 17:01:19 +01:00
Michael Peter Christen
a01e47b992 enhanced exists()-method for solr; should reduce a lot of IO during DHT
target selection
2012-12-02 17:29:37 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
b5ee88c6af added more logging to get info which url causes performance problems 2012-12-02 16:52:12 +01:00
reger
1faa045dc1 fix: prevent regex pattern compile error for blacklist import for path '*' (extend it to '.*') 2012-12-01 22:41:21 +01:00
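The background of commit 1faa045dc1 is that a bare `*` is not a valid Java regular expression: `Pattern.compile("*")` throws a `PatternSyntaxException`. A minimal sketch of the described fix follows; the class `BlacklistPath` is a hypothetical name and the real import code may handle more cases.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the '*' -> '.*' extension for blacklist paths.
final class BlacklistPath {

    // Compiles a blacklist path expression; a lone "*" is extended to ".*"
    // so that it compiles and matches any path.
    static Pattern compilePath(String path) {
        String regex = "*".equals(path) ? ".*" : path;
        return Pattern.compile(regex);
    }
}
```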
reger
6cf33f899c prevent Solr "version conflict" on update by set Solr "_version_" field to 0 (=no version check) 2012-11-28 00:09:53 +01:00
Michael Peter Christen
acd98bebb7 improvements in GSA result writer 2012-11-26 15:18:51 +01:00
Michael Peter Christen
3de784c8dd replaced more split and replaceAll calls that lacked pattern
pre-compilation with pre-compiled patterns
2012-11-26 13:40:53 +01:00
Michael Peter Christen
8fc3679c66 using more pre-compiled patterns for split methods 2012-11-26 13:11:55 +01:00
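The idea behind commits 3de784c8dd and 8fc3679c66 is to compile a regular expression once and reuse it, instead of passing a regex string to `String.split` or `String.replaceAll` on every call. A generic sketch of that pattern, with hypothetical class and constant names:

```java
import java.util.regex.Pattern;

// Generic illustration; the names are not taken from the YaCy sources.
final class PrecompiledPatterns {

    // Compiled once when the class is loaded, then shared by all calls.
    private static final Pattern COMMA = Pattern.compile(",");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Equivalent to s.split(",") but without re-parsing the expression.
    static String[] splitComma(String s) {
        return COMMA.split(s);
    }

    // Equivalent to s.replaceAll("\\s+", " ") with a reused pattern.
    static String collapseWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }
}
```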
Michael Peter Christen
d48e9788d2 enhanced search result processing behavior
- query less at one time; query more often
- in between the small queries, evaluate results
- remove fields from search results which are not needed
2012-11-26 12:24:35 +01:00
Michael Peter Christen
bf512e6350 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-11-26 00:14:57 +01:00
reger
469efcdb9d fix: display and calculate authors and namespace search navigator if configured (otherwise skip overhead)
(leave hosts, topics and the filetype and protocol navigators, which are not included in ConfigPortal, untouched)
2012-11-25 22:49:26 +01:00
Michael Peter Christen
eca68fa197 added debug code to crawler monitor 2012-11-25 15:43:42 +01:00
Michael Peter Christen
205f8b222b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-11-25 14:41:49 +01:00
orbiter
ee612e8b93 start the local search only if this peer is doing a remote search or
when it is doing a local search and the peer is old
2012-11-25 11:58:57 +01:00
Michael Peter Christen
d465773a37 - removed multi-add of documents (not used)
- inserted specialized code for size request
2012-11-25 01:34:39 +01:00
Michael Peter Christen
a1a4d9aa94 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/cora/federate/solr/connector/MirrorSolrConnector.java
2012-11-24 22:31:46 +01:00
Michael Peter Christen
b7004043ea - added a field cache for solr queries which call only for a single
value
- fixed a version conflict exception within a solr add request
2012-11-24 22:30:05 +01:00
orbiter
5aa5202adf fixes for filesystem indexing 2012-11-24 10:27:29 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
5e182a566f - added another enumeration method in kelondro data structure to get a
more random access to data for the balancer
- added random access inside the balancer
2012-11-23 13:58:39 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
a114bb23bb - using edismax in gsa interface
- generating less field data for gsa search results
- using a boost query in gsa interface to move double content to the end
of the result list
2012-11-22 13:03:33 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the signature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adapted and
normalized. This caused that many files had to be changed.
2012-11-21 18:46:49 +01:00
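The "signature plus unique flag" idea from commit d6b82840f8 can be sketched in a strongly simplified form. The example below is not TextProfileSignature: it swaps in a plain 64-bit FNV-1a hash over normalized text and keeps known signatures in an in-memory set, where YaCy checks the Solr index. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for a document signature check; not the YaCy code.
final class DocumentSignatures {

    // Assumption: an in-memory set replaces the lookup in the search index.
    private final Set<Long> seen = new HashSet<>();

    // Computes a 64-bit FNV-1a hash over the lower-cased, whitespace-collapsed text.
    static long signature(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        long hash = 0xcbf29ce484222325L;
        for (byte b : normalized.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Returns true if no document with the same signature was seen before,
    // i.e. the document would be flagged as unique.
    boolean markUnique(String text) {
        return seen.add(signature(text));
    }
}
```

A document whose text differs only in case or whitespace from an earlier one is reported as a duplicate here; a real profile signature tolerates more variation than this hash does.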
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during acquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' 2012-11-18 22:11:04 +01:00
Michael Peter Christen
832eead998 Merge remote-tracking branch 'regerdev/master' 2012-11-18 22:04:11 +01:00
Michael Peter Christen
952e143580 FINALLY YaCy can now search for full strings using double- or
single-quoted strings in the search query line!!!
2012-11-18 16:03:34 +01:00
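How a query line with double- or single-quoted strings (commit 952e143580) might be split into terms can be sketched as below. This is a hypothetical tokenizer for illustration, not the QueryGoal parser itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer; the real parsing lives in YaCy's QueryGoal class.
final class QuotedQueryParser {

    // Alternatives: a double-quoted phrase, a single-quoted phrase,
    // or a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|'([^']*)'|(\\S+)");

    // Splits a query line into terms; quoted phrases stay together
    // and are returned without their quotes.
    static List<String> parse(String query) {
        List<String> terms = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            String term = m.group(1) != null ? m.group(1)
                    : m.group(2) != null ? m.group(2)
                    : m.group(3);
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

For example, the query `yacy "peer to peer" 'web search'` yields the three terms `yacy`, `peer to peer` and `web search`.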
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
cominch
2bb8f045cc content control: use up-to-date definitions 2012-11-13 17:32:19 +01:00
Michael Peter Christen
5fd3b93661 added deletion of hosts during crawl start if deleteold option was given 2012-11-13 16:54:28 +01:00
Michael Peter Christen
d64445c3cb because we have the inurl:<term> - searchmodifier, we don't actually
need regular expressions as search attributes. They had now been removed
from the advanced search page while they are still created internally.
The filter is then expressed against solr as regular expression filter
query. If the expression points out a selection of a specific protocol,
host or filetype this is then translated into a faceted query.
2012-11-13 11:45:56 +01:00
cominch
a67ff1c8ac SMW Import: replaced JSON import routines with stable ones 2012-11-12 11:17:50 +01:00
cominch
d2a94cc55e refactor package 2012-11-09 16:22:24 +01:00
cominch
05742b4562 remove old SMW importer which was part of the ymarks package 2012-11-09 15:44:59 +01:00
cominch
21df1ad9e0 update and generalization of the SMW import and content control routines 2012-11-09 13:48:40 +01:00
Michael Peter Christen
842faf96a2 fixed media search 2012-11-07 17:27:13 +01:00
Michael Peter Christen
93001586a0 removed warnings, removed too-fast pausing of crawls 2012-11-07 15:37:14 +01:00
Michael Peter Christen
8041742e48 added matching of path to query pattern 2012-11-07 15:06:13 +01:00
Michael Peter Christen
8b1c9cba3d fixed a problem with non-terminating crawls 2012-11-07 15:05:44 +01:00
Michael Peter Christen
61a1d32356 fix to ftp client 2012-11-07 14:58:28 +01:00
Michael Peter Christen
5105256927 update to search result logging (this was a remaining issue from the
solr 4.0.0 migration)
2012-11-07 14:15:27 +01:00
Michael Peter Christen
570e42c4e3 fix for filetype navigator 2012-11-07 13:53:29 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler 2012-11-07 12:52:19 +01:00
Michael Peter Christen
12c0db20e5 fixed npe for surrogate import 2012-11-07 02:46:51 +01:00
Michael Peter Christen
52df6ee369 more logging 2012-11-07 02:04:08 +01:00
Michael Peter Christen
158732af37 automatically delete entries from the crawl profile list if crawl is
terminated.
2012-11-07 02:03:44 +01:00
Michael Peter Christen
15d1460b40 added information about the reason of pausing of crawls 2012-11-06 15:21:56 +01:00
Michael Peter Christen
2371ef031c added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
2012-11-06 14:32:08 +01:00
Michael Peter Christen
b30a7162fa added more thread-renaming for search processes 2012-11-06 12:31:23 +01:00
Michael Peter Christen
900445d8e9 set the thread name during solr queries to the solr query to get better
debugging options
2012-11-06 11:48:04 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
a15819fbec fix for some interface problems 2012-11-05 22:14:52 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
Michael Peter Christen
619bf7e875 fixed filetype modifier for media types in text search 2012-11-05 18:08:00 +01:00
Michael Peter Christen
97f82994a6 automatically pause the crawler if there is a problem with solr 2012-11-05 16:34:42 +01:00