orbiter
5533fc8e01
fix for bug 260
2013-07-14 17:40:28 +02:00
orbiter
a9c8046c87
do a light optimization at the end of a crawl postprocessing
2013-07-13 19:09:46 +02:00
orbiter
a548354c71
changed type of solr schema object sku from text_en_splitting_tight to string
2013-07-13 18:54:09 +02:00
orbiter
2f1ec8d4a2
npe fix
2013-07-13 11:10:05 +02:00
Michael Peter Christen
bcc623a843
refactoring of load_delay: this is a matter of client identification
2013-07-12 16:24:56 +02:00
orbiter
0d0b3a30f5
activate api actions after postprocessing of crawls
2013-07-12 16:05:48 +02:00
orbiter
3978c5ca5d
fix for http://bugs.yacy.net/view.php?id=255
2013-07-12 14:38:30 +02:00
orbiter
2be456e7fb
added a postprocessing field into api/status_p.xml to show if the
postprocessing task is running at that time (status: busy) or not
(status:idle)
2013-07-12 14:29:22 +02:00
orbiter
dac88561ae
minimum access time has a tight connection to ClientIdentification,
therefore it is defined there.
2013-07-11 17:04:24 +02:00
Michael Peter Christen
9a29ab469e
another patch to prevent CLOSE_WAIT status on solr connections
2013-07-11 12:53:39 +02:00
Michael Peter Christen
5091d627bc
fixed parsing of peer flags
2013-07-11 12:53:16 +02:00
Michael Peter Christen
87e9052081
added Connection:close to all http requests in our http client to
prevent CLOSE_WAIT states (as seen in lsof)
2013-07-11 11:54:11 +02:00
Michael Peter Christen
5c6946dd5f
replaced usage of log4j by ConcurrentLog where possible
2013-07-09 14:42:39 +02:00
Michael Peter Christen
5878c1d599
- refactoring of log to ConcurrentLog:
JDK-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a major performance issue. To overcome
this problem, this is an add-on to JDK logging that puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
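The queue-based logging described in this commit can be sketched roughly as follows. This is a minimal illustration and not the actual ConcurrentLog code: the class name `QueuedLogSketch`, its methods, and the poison-pill shutdown are all hypothetical, and the sketch assumes a single worker thread inside the same JVM where the message speaks of a separate process.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: callers enqueue log entries and return immediately;
// one worker thread drains the queue and calls the JDK logger one entry at a time.
final class QueuedLogSketch {

    // One queued log entry (assumed shape, not taken from the source).
    private static final class Entry {
        final Logger logger;
        final Level level;
        final String message;

        Entry(Logger logger, Level level, String message) {
            this.logger = logger;
            this.level = level;
            this.message = message;
        }
    }

    // Marker entry that tells the worker to stop (compared by reference).
    private static final Entry POISON = new Entry(null, null, null);

    private final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    QueuedLogSketch() {
        worker = new Thread(() -> {
            try {
                Entry e;
                // take() blocks until an entry is available
                while ((e = queue.take()) != POISON) {
                    e.logger.log(e.level, e.message);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }, "queued-log-worker");
        worker.setDaemon(true);
        worker.start();
    }

    // Called from any thread; only enqueues, never touches the JDK logger.
    void log(Logger logger, Level level, String message) {
        queue.add(new Entry(logger, level, message));
    }

    // Writes out everything queued so far, then stops the worker.
    void shutdown() {
        queue.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because only the worker thread calls `Logger.log`, producer threads do not contend inside the JDK logger; the trade-off is that entries are written asynchronously, so `shutdown()` has to be called if pending entries must not be lost.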
orbiter
f4f6551c66
better handling of time-out at solrj in case that a commit is done in a
fail-over case during add
2013-07-09 11:01:37 +02:00
Michael Peter Christen
07261fe274
Merge remote-tracking branch 'nutomics/blacklist_structure'
2013-07-08 23:32:15 +02:00
Michael Peter Christen
dea71851d2
- better concurrency for network scanner
- network scanner can now start from the list of all hosts in the search
index
2013-07-08 16:29:30 +02:00
Michael Peter Christen
a34e137e27
fix for citation index generation in case that entry.referrerhash() is
null. This is especially the case if ftp sites are crawled
2013-07-08 16:26:11 +02:00
Michael Peter Christen
a2c8116a8f
accept (but ignore) a '+' sign in front of search words
2013-07-08 16:20:40 +02:00
orbiter
9f0cc9b401
enhanced network scanner
- textarea input field can now be used to paste in a large list of hosts
- /31 subnet is possible (only one host)
- auto-detect subdomains for ftp and www subdomains
2013-07-08 13:17:09 +02:00
sixcooler
308d73f855
do not use remote proxy if not switched on - regardless of the proto
2013-07-04 19:16:13 +02:00
sixcooler
69906b1d2e
Revert "do not use remote proxy if not switched on - regardless of the proto"
This reverts commit 20f452d228.
2013-07-04 19:13:51 +02:00
sixcooler
20f452d228
do not use remote proxy if not switched on - regardless of the proto
2013-07-04 19:12:50 +02:00
sixcooler
9551720d5c
re-enable saved setting for proxy-crawl-profile
2013-07-04 19:10:57 +02:00
sixcooler
d5d8936f9d
For indexes that are changing rapidly in NRT situations, fcs (stands for
Field Cache per Segment) may be a better choice than the default fc.
(saves memory)
see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
2013-07-04 19:08:53 +02:00
Felix Ableitner
44f8fcf62e
Changed class structure of Blacklist.
2013-07-04 18:37:57 +02:00
Michael Peter Christen
57ffdfad4c
added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
5a5d411ec0
new robots_i attribute fields
2013-07-02 14:29:13 +02:00
Michael Peter Christen
fa08bd9d5a
hack to prevent long waiting times in crawler
2013-07-01 13:24:52 +02:00
Michael Peter Christen
f1c5338210
preparation for greedy crawl profiles and refactoring
2013-07-01 13:10:09 +02:00
Michael Peter Christen
e6f361f474
adding the canonical tag to crawl queues
2013-07-01 13:09:41 +02:00
reger
a6bf44212e
bugfix: location (lat/lon) meta data retrieval (Double.NaN check)
2013-06-30 03:50:07 +02:00
Michael Peter Christen
203921006a
redesign of citation index storage
2013-06-30 02:11:46 +02:00
reger
83763ee4a4
jpeg parser: extract GPS location from meta data
2013-06-29 00:35:43 +02:00
Michael Peter Christen
32aa1d4569
removed unused option for queries
2013-06-28 15:32:36 +02:00
Michael Peter Christen
9d291764d1
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-06-28 15:03:25 +02:00
sixcooler
e5abccdfe4
added optimize-option
2013-06-28 14:51:37 +02:00
Michael Peter Christen
64140f35cd
fix for solr requests if no query part is given (prevent npe)
2013-06-28 13:16:25 +02:00
Michael Peter Christen
8caaf6203a
fixed false multiple-generation of remote facet search which
caused high cpu usage on remote side.
2013-06-28 12:39:36 +02:00
Michael Peter Christen
823ae4d6a7
added url_protocol_s to error documents
2013-06-26 16:51:36 +02:00
Michael Peter Christen
660a196989
refactoring
2013-06-26 09:27:22 +02:00
Michael Peter Christen
c4538d8d91
added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib
2013-06-26 09:26:34 +02:00
reger
3760e2616b
bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments
2013-06-25 23:24:02 +02:00
Michael Peter Christen
9a6fcdf597
npe fix
2013-06-25 16:36:16 +02:00
Michael Peter Christen
16d1d744fa
added url_file_name_s in default collection schema for the file name
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which no longer has the file name as the
last part of the path list.
The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
2013-06-25 16:27:20 +02:00
reger
8d1c4c423d
make imageparser file extension detection case insensitive (extensions are often upper case)
2013-06-23 00:39:15 +02:00
Michael Peter Christen
f9d859f5dc
now writing image alt texts and (camelcase-)parsed urls into a text
search field for a better image retrieval
2013-06-18 16:51:56 +02:00
Michael Peter Christen
e441a9d4c8
to avoid confusion, the gsa api is available at /search? and
/searchresult?
2013-06-18 16:22:06 +02:00
orbiter
8792e6c6e9
stub for better image indexing
2013-06-18 13:28:30 +02:00
orbiter
97f2ac9091
added hint to gsa response writer that the result comes from a yacy peer
2013-06-17 13:29:03 +02:00