Commit Graph

1907 Commits

Author SHA1 Message Date
reger
8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case) 2013-06-23 00:39:15 +02:00
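A minimal sketch of case-insensitive extension detection, for illustration only. The function name and the extension set are hypothetical and are not taken from the YaCy image parser.

```python
from pathlib import PurePosixPath

# Hypothetical extension set; the real parser's list is not given in the commit.
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "bmp"}


def is_image_path(path: str) -> bool:
    """Return True if the path ends in a known image extension, ignoring case."""
    suffix = PurePosixPath(path).suffix  # e.g. ".JPG", or "" if there is none
    return suffix[1:].lower() in IMAGE_EXTENSIONS
```

Lower-casing the suffix before the lookup means `IMG_0001.JPG` and `img_0001.jpg` are treated the same way.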
Michael Peter Christen
f9d859f5dc now writing image alt texts and (camelcase-)parsed urls into a text
search field for a better image retrieval
2013-06-18 16:51:56 +02:00
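One way to picture the camel-case parsing of URLs is the sketch below. It is an assumption about the general technique, not the actual YaCy implementation; the function name `url_to_search_text` is hypothetical.

```python
import re


def url_to_search_text(url: str) -> str:
    """Turn a URL into space-separated words suitable for a text search field."""
    # Split on any non-alphanumeric character first (slashes, dots, dashes, ...).
    tokens = re.split(r"[^A-Za-z0-9]+", url)
    words = []
    for token in tokens:
        # Then split each token at lower-to-upper camel-case boundaries.
        words.extend(w for w in re.split(r"(?<=[a-z])(?=[A-Z])", token) if w)
    return " ".join(words)
```

With this sketch, a path segment such as `redPandaCub` contributes the separate words `red`, `Panda` and `Cub`, so a query for "panda" can match the image URL.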
Michael Peter Christen
e441a9d4c8 to avoid confusion, the gsa api is available at /search? and
/searchresult?
2013-06-18 16:22:06 +02:00
orbiter
8792e6c6e9 stub for better image indexing 2013-06-18 13:28:30 +02:00
orbiter
97f2ac9091 added hint to gsa response writer that the result comes from a yacy peer 2013-06-17 13:29:03 +02:00
Michael Peter Christen
14186e815e npe fix 2013-06-13 22:42:21 +02:00
Michael Peter Christen
bdf306e0a7 increased time-out for loading of seed-lists 2013-06-13 22:32:06 +02:00
Michael Peter Christen
374d2e2a52 removed warning message during crawling 2013-06-13 13:03:56 +02:00
Michael Peter Christen
570511f3c8 removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrer in the host browser is possible
without them. Therefore the host browser now shows not only
internal, but also external referrers for each link.
2013-06-13 13:01:28 +02:00
Michael Peter Christen
fd1776a3b0 added a new 'Citations' function: each search result item can now be
explored for citations within other documents. A click on the
'Citations' link shows an analysis with all text lines in the document
each with a complete list of documents which contain the same line. A
second section shows the linking documents in ascending order of number
of citations from the original document. Because documents from
different hosts are most interesting here, they are listed at the top of
the page as possible 'copypasta' source.
2013-06-12 15:02:49 +02:00
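The line-by-line part of the analysis described above can be sketched as follows. This is a simplified in-memory illustration under the assumption that documents are plain strings keyed by an id; the function name is hypothetical and the real feature works against the search index.

```python
from collections import defaultdict
from typing import Dict, List


def lines_shared_with_others(doc_id: str, docs: Dict[str, str]) -> Dict[str, List[str]]:
    """For each line of one document, list the other documents containing that line."""
    # Index every non-empty, stripped line to the set of documents that contain it.
    index = defaultdict(set)
    for other_id, text in docs.items():
        for line in text.splitlines():
            line = line.strip()
            if line:
                index[line].add(other_id)
    result = {}
    for line in docs[doc_id].splitlines():
        line = line.strip()
        if line:
            # Exclude the document itself from its own citation list.
            result[line] = sorted(index[line] - {doc_id})
    return result
```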
Michael Peter Christen
fc3ff92c69 npe fix 2013-06-12 13:23:58 +02:00
Michael Peter Christen
1762911f57 added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods cause deadlocks inside
solr.
2013-06-12 02:13:18 +02:00
Michael Peter Christen
3e1e358fdc calling pdf cache flush on class initialization because calling of the
methods during runtime can conflict with dynamic solr class loader and
cause a deadlock (seriously!)
2013-06-12 00:17:44 +02:00
Michael Peter Christen
291912ee52 removed misleading http accessGranted message (this is only for
debugging)
2013-06-12 00:16:28 +02:00
Michael Peter Christen
2fd7bbb450 reduced load on solr; no seed update in Status and no exists-check in
HTTPLoader in case of redirects, that can be done using the htcache.
2013-06-12 00:14:55 +02:00
Michael Peter Christen
2648b42b27 added fixed clear method as public method 2013-06-11 16:22:43 +02:00
Michael Peter Christen
ffc570f95f removed forced soft commit since this may be the cause for a performance
problem
2013-06-11 14:51:26 +02:00
Michael Peter Christen
6115bef335 added a 'greedy learning' mechanism which causes a 'fresh'
yacy to load linked web pages from search results until the total
number of web pages reaches 15000. This shall give fresh peers a 'boost'
to get a personalized search index faster.
2013-06-11 14:42:30 +02:00
Michael Peter Christen
f24574b3da use a greeting line which does not sound so beta 2013-06-11 13:12:59 +02:00
Michael Peter Christen
b85db72a73 added another response writer which can present search result with
texts, separated by sentences. Then, these sentences can be used to
search again in the index for the same sentence. This can be used to
provide a tool for plagiarism-search. (not finished yet).
Try the following:
http://localhost:8090/solr/select?q=text_t:flut&grep=wasser&defType=edismax&start=0&rows=3&core=collection1&wt=grephtml
.. to search for 'flut' and show only sentences in the result documents
which contain the word 'wasser'.
Consider this like using a grep-tool on documents: you select the
documents by a search query and you grep sentences inside the found
documents with the 'grep' attribute.
2013-06-10 18:41:00 +02:00
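The grep idea from this commit, selecting documents by query and then keeping only sentences that contain a given word, can be illustrated with a small sketch. The sentence splitting here is deliberately naive and is an assumption; the function name `grep_sentences` is hypothetical and this is not the response writer's code.

```python
import re
from typing import List


def grep_sentences(text: str, word: str) -> List[str]:
    """Return the sentences of text that contain word (case-insensitive)."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = word.lower()
    # Compare against whole words, not substrings.
    return [s for s in sentences if needle in re.findall(r"\w+", s.lower())]
```

Applied to the text of a document found for 'flut', calling `grep_sentences(text, "wasser")` keeps only the sentences that mention 'wasser'.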
Michael Peter Christen
8e965ffd16 fix for host compare in case that the host is null. This happens when
doing a search in the intranet for file resources (they don't have a
host).
2013-06-10 16:23:58 +02:00
orbiter
2b320313d9 replaced yacydoc servlet usage by a solr result output using an html
output writer. This made the creation of a html result writer necessary
which is included in this commit. The yacydoc servlet was used to
present all metadata of a document, but the solr interface can serve for
this purpose in a much better way. All usages (except one) of yacydoc
were replaced by a solr call. This affects also the 'metadata' link
attached to search results.
2013-06-09 12:12:34 +02:00
Michael Peter Christen
f7a4377812 usage of the new normalized link popularity CRn as default ranking
function. This replaces the previous formula, which was bad. Before you
update to this version, please check if you changed the ranking function
yourself before, since it will be overwritten.
2013-06-07 13:22:22 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, also a
backlink-structure can be discovered and written to the index as well.
The host browser has been extended to show such backlinks for each
presented link. The host browser can therefore now show
where a document is linked. The new citation reference is computed as
likelihood for a random click path with recursive usage of previously
computed likelihood. This process is repeated until the likelihood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
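The commit does not spell out the exact formula, so the sketch below swaps in a generic PageRank-style iteration to illustrate the idea of a random-click likelihood that is recomputed until it converges and is then normalized to 0<=CRn<=1. The damping factor, the convergence tolerance, the handling of pages without outgoing links, and the normalization by the maximum value are all assumptions, and the function name is hypothetical.

```python
from typing import Dict, List


def citation_rank(links: Dict[str, List[str]], damping: float = 0.85,
                  tol: float = 1e-9, max_iter: int = 1000) -> Dict[str, float]:
    """Iterate a random-click likelihood to convergence and normalize it to [0, 1]."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Likelihood held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                # Each link passes on an equal share of the previous likelihood.
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    # Assumed normalization: divide by the largest value so the top page gets 1.
    top = max(rank.values())
    return {p: r / top for p, r in rank.items()}
```

For a small intra-domain structure such as `{"a": ["c"], "b": ["c"], "c": ["a"]}`, the page `c` receives the highest value because both other pages link to it, and `b`, which has no backlinks, receives the lowest.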
Michael Peter Christen
e20450e798 patch in HTCache and CitationIndex loading in case that a file is
broken: do not crash; instead ignore the file and delete it.
2013-06-07 12:52:03 +02:00
reger
d367b1f4d9 add null pointer check to stopword fix 2013-06-07 00:13:45 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
Michael Peter Christen
9fc0c4df98 fix for bad exists 'enhancement'; see bug:
http://bugs.yacy.net/view.php?id=245
2013-06-02 13:50:12 +02:00
reger
9ef1fd9bac fix: enable use of solrcore.properties for property substitution of solrconfig.xml 2013-06-01 05:50:03 +02:00
reger
8a7fcb391d enable use of solrcore.properties for property substitution of solrconfig.xml
- move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties
- add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties
 
reason: on 32bit MMapDirectoryFactory may fail with.....
Caused by: java.io.IOException: Map failed
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
	at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
2013-06-01 05:43:08 +02:00
Michael Peter Christen
f7e887bf49 added missing class 2013-05-30 16:39:48 +02:00
Michael Peter Christen
5f92c68f1f removed block rank ranking and all YBR files in /ranking 2013-05-30 13:01:22 +02:00
Michael Peter Christen
164603b946 cleanup 2013-05-30 12:47:22 +02:00
Michael Peter Christen
ba793a32c0 added timeout for remote searches of 10 seconds 2013-05-30 12:39:28 +02:00
Michael Peter Christen
1c4c1c0345 try to commit in case of failure which hopefully frees up some RAM 2013-05-30 12:38:54 +02:00
Michael Peter Christen
409d6edf53 Store node/solr search threads to be able to send them an interrupt
signal in case that a cleanup process wants to remove the search
process. Added also a new cleanup process which can reduce the number of
stored searches to a specific number which can be higher or lower
according to the remaining RAM. The cleanup process is called every time
a search is started.
2013-05-30 12:38:15 +02:00
Michael Peter Christen
2a8b99ea82 remove text_t in search result after snippet has been computed to save
space in search result cache
2013-05-30 12:35:47 +02:00
Michael Peter Christen
a1644ca0fd new workflow processor in Segment to enqueue indexing documents to solr 2013-05-30 12:34:53 +02:00
Michael Peter Christen
a8dc4346e8 default configuration of MMapDirectoryFactory for solr, increased lock
timeout, less documents from remote searches (too many results had
easily blocked a peer)
2013-05-30 12:31:28 +02:00
Michael Peter Christen
0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of
soft commits, reduced caching size of search events, ensured that solr
results are processed before connection is closed to keep that stuff not
too long in RAM
2013-05-29 18:27:27 +02:00
Michael Peter Christen
5344a1c5f7 getting the trash out 2013-05-29 16:09:05 +02:00
Michael Peter Christen
709e9b8ce7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-29 13:49:42 +02:00
Michael Peter Christen
1eb9626cca less logging 2013-05-29 13:30:32 +02:00
Michael Peter Christen
281959a2d7 added option to re-boot the embedded solr during run-time. Added also
API recording for this method so it can be repeated automatically. The
index dump generation is now also available for API recording. Added
some synchronization in backend which was necessary for this.
2013-05-29 13:09:34 +02:00
orbiter
da621e827e prevent NPE in case RWI is disabled 2013-05-28 16:26:38 +02:00
Michael Peter Christen
c2bcfd8afb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-28 11:39:10 +02:00
Michael Peter Christen
67757b425a use a retry handler with retryCount=0 because we usually expect requests
to fail if we access non-permanently available resources (peers, web
pages) and want to fail fast without repeating the same request which is
doomed to fail. The previous http client connection behavior used a
1-2-4-8-second timeout scheme, so that connection attempts
lasted for 16 seconds.
2013-05-28 11:38:45 +02:00
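The fail-fast behaviour of a retry count of 0 can be sketched generically as below. This is an illustration of the concept only, not the HTTP client's retry handler; the function name and the choice of `OSError` as the retryable error are assumptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(request: Callable[[], T], retry_count: int = 0) -> T:
    """Run request once, plus at most retry_count further attempts on failure."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return request()
        except OSError:
            # With retry_count=0 the first failure is re-raised immediately.
            if attempts > retry_count:
                raise
```

With the default `retry_count=0`, a request to an unreachable peer is attempted exactly once, so the caller learns about the failure without waiting for repeated attempts.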
Michael Peter Christen
c2b1075dcf activating pollImmediately in case that DHT receive is off. This will
cause a much faster search result when running in public robinson mode.
2013-05-28 10:36:49 +02:00
orbiter
888a985dc6 set a higher limit for table copy usage 2013-05-27 15:23:12 +02:00
Michael Peter Christen
2b563debbf javadoc of new multiple-exist test 2013-05-27 13:45:09 +02:00