reger
3d53da8236
refactor ResultEntry to be based on MetadataNode/SolrDocument
...
to share/reuse common access routines
2015-05-25 21:28:48 +02:00
reger
d882991bc5
Implement sharing of ioDispatcher for term & citation index
...
as proposed in ioDispatcher description
2015-05-25 19:46:26 +02:00
reger
370ba9da71
On imageSearch prefer mime to sort out non-image documents
...
Generalize the hack to prevent urls with just an img extension being returned
improving http://mantis.tokeek.de/view.php?id=528
2015-05-24 21:48:58 +02:00
reger
cd31633369
improve MultiprotocolURL.getFileExtension()
...
prevent string OOB when the query part contains a dot (return just "")
see log snippet in http://mantis.tokeek.de/view.php?id=533
2015-05-24 19:38:04 +02:00
reger
c60ccdfbcf
Increase IODispatcher dumpQueue size to 2 to reduce risk of concurrent emergency dump,
...
skip concurrent emergency merge
dealing with/see http://mantis.tokeek.de/view.php?id=566
2015-05-24 18:03:27 +02:00
reger
8a9622c31c
fix string OoB on getImagelinks with long alttext
...
in description calculation
2015-05-24 01:59:40 +02:00
reger
3e742d1e34
Init remote crawler on demand
...
If the remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of the queue and the idling thread.
Deployment of the remoteCrawlJob is deferred until the option is activated.
2015-05-23 02:06:39 +02:00
reger
13f013f64a
Limit extra sleep of BusyThread on LowMemCycle
2015-05-17 06:21:12 +02:00
reger
cd7c0e0aae
detail optimization of RecrawlThread
2015-05-17 00:13:00 +02:00
reger
ace71a8877
Initial (experimental) implementation of index update/re-crawl job
...
added to IndexReIndexMonitor_p.html
Selects existing documents from the index and feeds them to the crawler.
Currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]).
Documents are added in small chunks (200) to the crawler, only if no other crawl is running.
2015-05-16 01:23:08 +02:00
reger
141cd80456
correct log msg text
2015-05-16 00:01:54 +02:00
reger
f3ce99bfb8
fix extract of inboundlinks_protocol_sxt
...
url counter may be > 999
2015-05-14 00:03:09 +02:00
reger
2bc9cb5828
fix early return in addToCrawler
...
check / handle all supplied urls after error url
2015-05-13 21:58:43 +02:00
Michael Peter Christen
f5f88272e4
Merge branch 'master' of git@github.com:yacy/yacy_search_server.git
2015-05-12 12:06:42 +02:00
Michael Peter Christen
5c67c4d460
fix for latest commit, see
...
f810915717 (commitcomment-11145880)
2015-05-12 12:06:21 +02:00
reger
c37dda8849
fix NPE on MultiProtocolURL on url with parameter value and '='
...
in getAttribute
- added test case for it
2015-05-12 01:09:10 +02:00
Michael Peter Christen
f810915717
added crawl start from a clone with very, very large url: they are now
...
encoded as post submit form inside a javascript creation function.
2015-05-11 16:30:41 +02:00
Michael Peter Christen
51de86c992
disabled debug thread dumps
2015-05-11 14:46:09 +02:00
Michael Peter Christen
d524a9d77c
Merge branch 'master' of git@github.com:yacy/yacy_search_server.git
2015-05-11 14:42:40 +02:00
Michael Peter Christen
0710648c31
enable api calls with very long urls
2015-05-11 14:42:21 +02:00
reger
31346e873b
upd library reference of missing jsch-0.1.21 in seeduploadscp.xml
...
upd to jsch-0.1.52.jar
2015-05-11 01:35:12 +02:00
reger
609c52e987
refactor getBookmark
...
to consistently check existence by != null (w/o throwing exception on not found)
2015-05-11 00:37:04 +02:00
reger
1481a8ab56
add opensearch rss results to dht collection (due to text = snippet)
...
which is used to differentiate meta from full data
- make sure check for dht is not dependent on number of collection entries
2015-05-10 18:52:33 +02:00
reger
f134aa7f7f
persist bookmark timestamp
...
on setTimeStamp()
2015-05-10 15:29:23 +02:00
reger
752eec6697
fix NPE in addToIndex when used outside searchEvent
2015-05-10 05:18:23 +02:00
Michael Peter Christen
fbf85a1561
added temporary debug output in http client
2015-05-08 15:31:01 +02:00
Michael Peter Christen
ff29b0e503
added option to re-index exported xml snapshot dumps to
...
HTCACHE/snapshots by just placing them in the SURROGATES/in path
2015-05-08 15:30:26 +02:00
Michael Peter Christen
6f4fe4b175
revert of 8a7c68e4c7
...
keeping surrogates after processing is essential for some users. If the
space they are taking is too high, please set up an automatic deletion
process (like a cronjob).
2015-05-08 14:01:30 +02:00
Michael Peter Christen
97930a6aad
added must-not-match filter to snapshot generation.
...
also: fixed some bugs
2015-05-08 13:46:27 +02:00
Michael Peter Christen
9d8f426890
adding a try-catch to link graph processing to prevent that a single
...
malformed url interrupts the storage process
2015-05-08 10:38:33 +02:00
reger
8a5b8f8789
on bookmarking of search result, remember orig. query in separate bookmark property
...
(instead of using the description field)
- adjust display and autosearch
- don't overwrite existing bookmark but combine info
2015-05-03 02:31:50 +02:00
reger
7224209486
break out of NormalizeDistributor loop on timeout
2015-05-02 02:36:18 +02:00
reger
47e61f8325
fix typo in image filter query
...
(extra bracket)
2015-04-28 03:12:14 +02:00
reger
4b4ab6799f
fix String out of range in Collection Nav
...
see http://mantis.tokeek.de/view.php?id=573
2015-04-27 22:38:40 +02:00
reger
572cfe8fd4
improve character encoding for urlproxy servlet
...
for non-UTF-8 pages
2015-04-26 17:42:39 +02:00
reger
6bc8a9b11e
make Quality of Service Servlet available to prioritize requests from local host
...
This assigns priorities to incoming requests. Higher priority numbers are served before lower.
(disabled by default in defaults/web.xml,
uncomment or copy entry to DATA/Settings/web.xml)
2015-04-26 04:29:32 +02:00
Ryszard Goń
ca1a70aec8
fix for Accept '?' URLs column in Crawl Profile List
2015-04-19 15:55:49 +02:00
reger
5408448a56
skip redundant add. of keywords to text
...
search uses keywords as default search field
2015-04-17 02:14:13 +02:00
reger
296e97c78e
put https port in peers dna
...
as we flag if a peer is accessible via https, we need to know the port if we want to use it (e.g. for interYaCy communication)
start to provide / transport the port by recording it in peers dna.
- add https link on the Network.html lock symbol
2015-04-16 02:36:12 +02:00
Michael Peter Christen
fed26f33a8
enhanced timezone management for indexed data:
...
to support the new time parser and search functions in YaCy a high
precision detection of date and time of day is necessary. That
requires that the time zone of the document content and the time zone of
the user doing a search are detected. The time zone of the search
request is detected automatically using the browser's time zone offset, which
is delivered with the search request automatically and invisibly to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required a change of the parser Java API. A lot of other changes
have been made which correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are in the UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
Michael Peter Christen
b060ba900d
added parsing of contentprop attribute in html tags for
...
content='startDate' and content='endDate'. The values of these fields are
now written to new solr fields startDates_dts and endDates_dts.
2015-04-13 16:20:00 +02:00
Michael Peter Christen
4cb4f67f38
added parsing of dd, dt and article html fields. The parsed result is
...
written to special solr fields which are deactivated by default.
2015-04-12 22:02:45 +02:00
reger
1395f10e95
fix typecast for css links
2015-04-12 01:11:47 +02:00
Michael Peter Christen
3288489fd2
more logging during start-up
2015-04-11 13:00:32 +02:00
Michael Peter Christen
abaaaef5f1
fix for filter queries
2015-04-11 12:30:29 +02:00
Michael Peter Christen
4d00175157
<experimental> added parsing of <article> html element.
...
Whenever such an element occurs, the complete content of all article
elements replaces the parsed <content> part of documents.
2015-04-10 16:16:20 +02:00
Michael Peter Christen
1df6492019
enhanced suggestions
2015-04-10 15:59:18 +02:00
Michael Peter Christen
ae02c92fd0
logging fix
2015-04-09 14:21:23 +02:00
Michael Peter Christen
5651713134
better debugging of fq
2015-04-07 17:02:02 +02:00
Michael Peter Christen
f5a032f293
split query into filter query and text query to get better ranking
...
results and faster results
2015-04-07 16:10:13 +02:00