Commit Graph

9598 Commits

Author SHA1 Message Date
Michael Peter Christen
14186e815e npe fix 2013-06-13 22:42:21 +02:00
Michael Peter Christen
4e3007f4a0 typo 2013-06-13 22:40:46 +02:00
Michael Peter Christen
bdf306e0a7 increased time-out for loading of seed-lists 2013-06-13 22:32:06 +02:00
Michael Peter Christen
2cb6b6bc21 added target="_blank" to shutdown links 2013-06-13 22:31:39 +02:00
orbiter
c8e94ad7c7 fix for citation search in case that the citation is very fresh 2013-06-13 18:27:57 +02:00
orbiter
57dcf68665 added a feed-back message inside the shutdown page 2013-06-13 14:44:47 +02:00
Michael Peter Christen
0600d510e1 show the citation report also in ViewFile 2013-06-13 13:22:43 +02:00
Michael Peter Christen
1a92b61d69 fixed usage of ViewFile which needs a commit before showing latest crawl
result pages.
2013-06-13 13:08:24 +02:00
Michael Peter Christen
374d2e2a52 removed warning message during crawling 2013-06-13 13:03:56 +02:00
Michael Peter Christen
570511f3c8 removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. Citing referrers in the host browser is possible
without them. The host browser therefore now shows not only
internal but also external referrers for each link.
2013-06-13 13:01:28 +02:00
Michael Peter Christen
fd1776a3b0 added a new 'Citations' function: each search result item can now be
explored for citations within other documents. A click on the
'Citations' link shows an analysis with all text lines in the document
each with a complete list of documents which contain the same line. A
second section shows the linking documents in ascending order of number
of citations from the original document. Because documents from
different hosts are most interesting here, they are listed at the top of
the page as possible 'copypasta' source.
2013-06-12 15:02:49 +02:00
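The 'Citations' analysis described in fd1776a3b0 can be illustrated with a minimal Python sketch. It is not the code from that commit: the function name `citation_report`, the in-memory `documents` mapping and the rule that a "line" is a stripped, non-empty text line are all assumptions made for illustration.

```python
from collections import defaultdict


def citation_report(doc_id, documents):
    """Return (per_line, ranked) for one document.

    documents: hypothetical mapping of document id -> full text.
    per_line:  each line of the document with the other documents containing it.
    ranked:    other documents in ascending order of shared-line count.
    """
    # Index every non-empty line to the set of documents that contain it.
    line_index = defaultdict(set)
    for other_id, text in documents.items():
        for line in text.splitlines():
            line = line.strip()
            if line:
                line_index[line].add(other_id)

    per_line = []
    counts = defaultdict(int)
    for line in documents[doc_id].splitlines():
        line = line.strip()
        if not line:
            continue
        others = sorted(line_index[line] - {doc_id})
        per_line.append((line, others))
        for other in others:
            counts[other] += 1

    # Ascending by number of shared lines, ties broken by document id.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return per_line, ranked
```

In this sketch the grouping by host that the commit mentions is left out; it would need the URL of each document, which the mapping above does not carry.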
Michael Peter Christen
fc3ff92c69 npe fix 2013-06-12 13:23:58 +02:00
Michael Peter Christen
7754a1263b switching back to the merge factor 10; the solr default. 2013-06-12 11:29:35 +02:00
Michael Peter Christen
1762911f57 added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods cause deadlocks inside
solr.
2013-06-12 02:13:18 +02:00
Michael Peter Christen
3e1e358fdc calling pdf cache flush on class initialization because calling of the
methods during runtime can conflict with dynamic solr class loader and
cause a deadlock (seriously!)
2013-06-12 00:17:44 +02:00
Michael Peter Christen
291912ee52 removed misleading http accessGranted message (this is only for
debugging)
2013-06-12 00:16:28 +02:00
Michael Peter Christen
2fd7bbb450 reduced load on solr; no seed update in Status and no exists-check in
HTTPLoader in case of redirects, that can be done using the htcache.
2013-06-12 00:14:55 +02:00
Michael Peter Christen
7ee71c2354 changed administration page headline to 'admnistration' 2013-06-12 00:12:04 +02:00
Michael Peter Christen
898e14471b changed windows icon again 2013-06-12 00:10:25 +02:00
Michael Peter Christen
959ccc4675 increased the solr merge factor because 4 was too much IO load for
frequent index receiving and re-indexing after clickdepth/cr
calculation.
2013-06-11 16:51:40 +02:00
Michael Peter Christen
efd973d29d changed p2p/stealth mode text and links a bit 2013-06-11 16:50:34 +02:00
Michael Peter Christen
2648b42b27 added fixed clear method as public method 2013-06-11 16:22:43 +02:00
Michael Peter Christen
20fab1feb6 allip net has greedy learning disabled 2013-06-11 14:52:46 +02:00
Michael Peter Christen
ffc570f95f removed forced soft commit since this may be the cause for a performance
problem
2013-06-11 14:51:26 +02:00
Michael Peter Christen
6115bef335 added a 'greedy learning' mechanism which causes a 'fresh'
yacy to load linked web pages from search results until the total
number of web pages reaches 15000. This shall give fresh peers a 'boost'
to get a personalized search index faster.
2013-06-11 14:42:30 +02:00
Michael Peter Christen
a5e328d7c5 new icons 2013-06-11 13:16:46 +02:00
Michael Peter Christen
f24574b3da use a greeting line which does not sound so beta 2013-06-11 13:12:59 +02:00
Michael Peter Christen
b85db72a73 added another response writer which can present search results
with texts, separated by sentences. These sentences can then be used to
search the index again for the same sentence. This can be used to
provide a tool for plagiarism search (not finished yet).
Try the following:
http://localhost:8090/solr/select?q=text_t:flut&grep=wasser&defType=edismax&start=0&rows=3&core=collection1&wt=grephtml
.. to search for 'flut' and show only sentences in the result documents
which contain the word 'wasser'.
Consider this like using a grep-tool on documents: you select the
documents by a search query and you grep sentences inside the found
documents with the 'grep' attribute.
2013-06-10 18:41:00 +02:00
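The request from b85db72a73 can be assembled programmatically. The Python sketch below rebuilds the example URL from its parameters and adds a small local filter that mimics the "grep sentences inside found documents" idea. The helper names `grep_query_url` and `grep_sentences` and the sentence-splitting rule are assumptions for illustration; only the URL parameters come from the commit message.

```python
import re
from urllib.parse import urlencode


def grep_query_url(query, grep, rows=3,
                   base="http://localhost:8090/solr/select"):
    """Build the select URL with the 'grep' attribute and the grephtml writer."""
    params = {
        "q": f"text_t:{query}",   # selects the documents
        "grep": grep,             # selects sentences inside those documents
        "defType": "edismax",
        "start": 0,
        "rows": rows,
        "core": "collection1",
        "wt": "grephtml",
    }
    # Keep ':' unescaped so the URL matches the form shown in the commit.
    return base + "?" + urlencode(params, safe=":")


def grep_sentences(text, word):
    """Local illustration only: keep sentences containing 'word' (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if word.lower() in s.lower()]
```

Calling `grep_query_url("flut", "wasser")` yields the same URL as in the commit message; `grep_sentences` only demonstrates the filtering idea and says nothing about how the response writer splits sentences.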
Michael Peter Christen
856e5c42ae the line "Web Search by the People, for the People" is more generic for
P2P and portal search as default search string. Otherwise, if people
switch to Portal mode, the "P2P Web Search" does not make sense.
2013-06-10 18:36:06 +02:00
Michael Peter Christen
8e965ffd16 fix for host compare in case that the host is null. This happens when
doing a search in the intranet for file resources (they don't have a
host).
2013-06-10 16:23:58 +02:00
Michael Peter Christen
5132bf719c added new buttons to search result page in p2p mode which show the
switch between p2p search and the 'stealth mode' which is simply a
non-p2p search within the p2p network. The functionality was there all
the time, but the switch to this was not very visible.
2013-06-10 16:22:00 +02:00
orbiter
2b320313d9 replaced yacydoc servlet usage by a solr result output using an html
output writer. This made the creation of an html result writer necessary,
which is included in this commit. The yacydoc servlet was used to
present all metadata of a document, but the solr interface can serve
this purpose in a much better way. All usages (except one) of yacydoc
were replaced by a solr call. This also affects the 'metadata' link
attached to search results.
2013-06-09 12:12:34 +02:00
orbiter
200769d0c6 show the cache link in search results only if there is actually a cache
entry stored in HTCACHE
2013-06-09 08:15:23 +02:00
Michael Peter Christen
713a6199ef activated citation ranking by default 2013-06-07 14:26:14 +02:00
Michael Peter Christen
f7a4377812 usage of the new normalized link popularity CRn as default ranking
function. This replaces the previous formula, which was bad. Before you
update to this version, please check whether you changed the ranking
function yourself, since it will be overwritten.
2013-06-07 13:22:22 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, a
backlink structure can also be discovered and written to the index.
The host browser has been extended to show such backlinks for each
presented link. The host browser can therefore now show
where a document is linked from. The new citation reference is computed as
the likelihood of a random click path, with recursive usage of previously
computed likelihoods. This process is repeated until the likelihood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
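The iteration described in f7e77a21bf resembles a random-surfer computation. The Python sketch below is a generic version of that idea, not the formula from the commit: the damping factor, the handling of pages without outgoing links, the convergence threshold and the normalization by the maximum value are all assumptions.

```python
def citation_rank(links, damping=0.85, tol=1e-9, max_iter=1000):
    """Iterate click-path likelihoods until they converge, then normalize to [0, 1].

    links: hypothetical mapping of page -> list of pages it links to
           (intra-domain links only).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if n == 0:
        return {}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page receives a base share from random jumps.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: a page without links spreads its weight evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break

    # Assumption: normalize so that the most popular page gets CRn = 1.
    top = max(rank.values())
    return {p: r / top for p, r in rank.items()}
```

With `links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}` the page `c` ends up with the highest value, because it is linked from both other pages.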
Michael Peter Christen
e20450e798 patch in HTCache and CitationIndex loading in case that a file is
broken: do not crash; instead ignore the file and delete it.
2013-06-07 12:52:03 +02:00
Michael Peter Christen
fdcd4e6a6f fixes to index deletion: quoting of host name (a '-' may be part of the
url) and disabling the engage button when changing the url field at
'Delete by URL matching'
2013-06-07 08:52:07 +02:00
reger
d367b1f4d9 add null pointer check to stopword fix 2013-06-07 00:13:45 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
orbiter
5c7ddc67fe in GSA api enable usage of solr fq-attribute together with GSA
site-attribute
2013-06-06 13:36:58 +02:00
Michael Peter Christen
9fc0c4df98 fix for bad exists 'enhancement'; see bug:
http://bugs.yacy.net/view.php?id=245
2013-06-02 13:50:12 +02:00
reger
9ef1fd9bac fix: enable use of solrcore.properties for property substitution of solrconfig.xml 2013-06-01 05:50:03 +02:00
reger
8a7fcb391d enable use of solrcore.properties for property substitution of solrconfig.xml
- move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties
- add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties
 
reason: on 32bit MMapDirectoryFactory may fail with.....
Caused by: java.io.IOException: Map failed
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
	at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
2013-06-01 05:43:08 +02:00
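The selection logic from 8a7fcb391d can be summarized in a few lines. The Python sketch below only models the decision; the actual change reads the Java system property os.arch. The function name `choose_solrcore_properties`, the `available_files` argument and the test for "64" in the architecture string are assumptions for illustration.

```python
def choose_solrcore_properties(os_arch, available_files):
    """Pick the properties file to use as solrcore.properties.

    os_arch:         architecture string, e.g. "amd64" or "x86".
    available_files: names of the files present in the default directory.
    """
    # Assumption: 64-bit architectures carry "64" in their name (amd64, x86_64, aarch64).
    is_64bit = "64" in os_arch
    if not is_64bit and "solrcore.x86.properties" in available_files:
        # 32-bit fallback, which avoids MMapDirectoryFactory.
        return "solrcore.x86.properties"
    return "solrcore.properties"
```

The fallback is chosen only when the check for a 64-bit system fails and the x86 file exists, matching the "(if exists)" condition in the commit message.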
Michael Peter Christen
f7e887bf49 added missing class 2013-05-30 16:39:48 +02:00
Michael Peter Christen
eb9d0ba5b1 ranking and boost function update, small bugfixes, better default search
field for solr
2013-05-30 16:30:35 +02:00
Michael Peter Christen
5f92c68f1f removed block rank ranking and all YBR files in /ranking 2013-05-30 13:01:22 +02:00
Michael Peter Christen
164603b946 cleanup 2013-05-30 12:47:22 +02:00
Michael Peter Christen
ba793a32c0 added timeout for remote searches of 10 seconds 2013-05-30 12:39:28 +02:00
Michael Peter Christen
1c4c1c0345 try to commit in case of failure which hopefully frees up some RAM 2013-05-30 12:38:54 +02:00