Commit Graph

6383 Commits

Author SHA1 Message Date
orbiter
dac88561ae minimum access time has a tight connection to ClientIdentification,
therefore it is defined there.
2013-07-11 17:04:24 +02:00
Michael Peter Christen
9a29ab469e another patch to prevent CLOSE_WAIT status on solr connections 2013-07-11 12:53:39 +02:00
Michael Peter Christen
5091d627bc fixed parsing of peer flags 2013-07-11 12:53:16 +02:00
Michael Peter Christen
87e9052081 added Connection:close to all http requests in our http client to
prevent CLOSE_WAIT states (as seen in lsof)
2013-07-11 11:54:11 +02:00
Michael Peter Christen
5c6946dd5f replaced usage of log4j by ConcurrentLog where possible 2013-07-09 14:42:39 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
orbiter
f4f6551c66 better handling of time-out at solrj in case that a commit is done in a
fail-over case during add
2013-07-09 11:01:37 +02:00
Michael Peter Christen
07261fe274 Merge remote-tracking branch 'nutomics/blacklist_structure' 2013-07-08 23:32:15 +02:00
Michael Peter Christen
dea71851d2 - better concurrency for network scanner
- network scanner can now start from the list of all hosts in the search
index
2013-07-08 16:29:30 +02:00
Michael Peter Christen
a34e137e27 fix for citation index generation in case that entry.referrerhash() is
null. This is especially the case if ftp sites are crawled
2013-07-08 16:26:11 +02:00
Michael Peter Christen
a2c8116a8f accept (but ignore) a '+' sign in front of search words 2013-07-08 16:20:40 +02:00
orbiter
9f0cc9b401 enhanced network scanner
- textarea input field can now be used to paste in a large list of hosts
- /31er subnet is possible (only one host)
- auto-detect subdomains for ftp and www subdomains
2013-07-08 13:17:09 +02:00
sixcooler
308d73f855 do not use remote proxy if not switched on - regardless of the proto 2013-07-04 19:16:13 +02:00
sixcooler
69906b1d2e Revert "do not use remote proxy if not switched on - regardless of the proto"
This reverts commit 20f452d228.
2013-07-04 19:13:51 +02:00
sixcooler
20f452d228 do not use remote proxy if not switched on - regardless of the proto 2013-07-04 19:12:50 +02:00
sixcooler
9551720d5c re-enable saved setting for proxy-crawl-profile 2013-07-04 19:10:57 +02:00
sixcooler
d5d8936f9d For indexes that are changing rapidly in NRT situations, fcs (stands for
Field Cache per Segment) may be a better choice than the default fc.
(saves memory)
see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
2013-07-04 19:08:53 +02:00
Felix Ableitner
44f8fcf62e Changed class structure of Blacklist. 2013-07-04 18:37:57 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
5a5d411ec0 new robots_i attribute fields 2013-07-02 14:29:13 +02:00
Michael Peter Christen
fa08bd9d5a hack to prevent long waiting times in crawler 2013-07-01 13:24:52 +02:00
Michael Peter Christen
f1c5338210 prepartion for greedy crawl profiles and refactoring 2013-07-01 13:10:09 +02:00
Michael Peter Christen
e6f361f474 adding the canonical tag to crawl queues 2013-07-01 13:09:41 +02:00
reger
a6bf44212e bugfix: location (lat/lon) meta data retrival (Double.NaN check) 2013-06-30 03:50:07 +02:00
Michael Peter Christen
203921006a redesign of citation index storage 2013-06-30 02:11:46 +02:00
reger
83763ee4a4 jpeg parser: extract GPS location from meta data 2013-06-29 00:35:43 +02:00
Michael Peter Christen
32aa1d4569 removed unused option for queries 2013-06-28 15:32:36 +02:00
Michael Peter Christen
9d291764d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-06-28 15:03:25 +02:00
sixcooler
e5abccdfe4 added optimize-option 2013-06-28 14:51:37 +02:00
Michael Peter Christen
64140f35cd fix for solr requests if no query part is given (prevent npe) 2013-06-28 13:16:25 +02:00
Michael Peter Christen
8caaf6203a fixed false multiple-generation of remote facet search which
caused high cpu usage on remote side.
2013-06-28 12:39:36 +02:00
Michael Peter Christen
823ae4d6a7 added url_protocol_s to error documents 2013-06-26 16:51:36 +02:00
Michael Peter Christen
660a196989 refactoring 2013-06-26 09:27:22 +02:00
Michael Peter Christen
c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib 2013-06-26 09:26:34 +02:00
reger
3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments 2013-06-25 23:24:02 +02:00
Michael Peter Christen
9a6fcdf597 npe fix 2013-06-25 16:36:16 +02:00
Michael Peter Christen
16d1d744fa added url_file_name_s in default collection schema for the file name
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which has now not the file name as last
part of the path list.

The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
2013-06-25 16:27:20 +02:00
reger
8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case) 2013-06-23 00:39:15 +02:00
Michael Peter Christen
f9d859f5dc now writing image alt texts and (camelcase-)parsed urls into a text
search field for a better image retrieval
2013-06-18 16:51:56 +02:00
Michael Peter Christen
e441a9d4c8 to avoid confusion, the gsa api is available at /search? and
/searchresult?
2013-06-18 16:22:06 +02:00
orbiter
8792e6c6e9 stub for better image indexing 2013-06-18 13:28:30 +02:00
orbiter
97f2ac9091 added hint to gsa response writer that the result comes from a yacy peer 2013-06-17 13:29:03 +02:00
Michael Peter Christen
14186e815e npe fix 2013-06-13 22:42:21 +02:00
Michael Peter Christen
bdf306e0a7 increased time-out for loading of seed-lists 2013-06-13 22:32:06 +02:00
Michael Peter Christen
374d2e2a52 removed warning message during crawling 2013-06-13 13:03:56 +02:00
Michael Peter Christen
570511f3c8 removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrer in the host browser is possible
without them. Therefore now the host browser does not only show
internal, but also external referrer to each link.
2013-06-13 13:01:28 +02:00
Michael Peter Christen
fd1776a3b0 added a new 'Citations' function: each search result item can now be
explored for citations within other documents. A click on the
'Citations' link shows an analysis with all text lines in the document
each with a complete list of documents which contain the same line. A
second section shows the linking documents in ascending order of number
of citations from the original document. Because documents from
different hosts are most interesting here, they are listed at the top of
the page as possible 'copypasta' source.
2013-06-12 15:02:49 +02:00
Michael Peter Christen
fc3ff92c69 npe fix 2013-06-12 13:23:58 +02:00
Michael Peter Christen
1762911f57 added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods causes deadlocks inside
solr.
2013-06-12 02:13:18 +02:00
Michael Peter Christen
3e1e358fdc calling pdf cache flush on class initialization because calling of the
methods during runtime can conflict with dynamic solr class loader and
cause a deadlock (seriously!)
2013-06-12 00:17:44 +02:00