11667 Commits

Author SHA1 Message Date
reger
1f0f77bb77 make location facet return results
The location nav facet on field coordinate_p does not return results, so coordinate_p_0_coordinate is now used as an alternative to get facet counts. As the actual facet value is not used, this should not harm any analysis (even if the facet is an incomplete location).
If the facet value is used in the future, a *_geohash field could likely be introduced (for facet and other ... as transport value)
2015-04-01 01:57:56 +02:00
reger
b1ec0644e5 fix NPE in location search on missing/empty PubDate in underlying rss data 2015-03-31 02:20:13 +02:00
reger
c1dcc8c456 fix display and limit of max server connections after startup
(on restart value returned to default=50)
This has no effect on Jetty but the limit is still respected.
2015-03-29 07:12:23 +02:00
reger
2f84b04fa9 add err msg on failure during Load_rss 2015-03-29 05:48:54 +02:00
reger
96292cf3eb shorten exception logging on unavailable connection in Load_RSS_p servlet 2015-03-28 21:12:00 +01:00
reger
839b962c20 correct percent encoding for '%' char 2015-03-28 03:05:21 +01:00
reger
66d0b5046a fix NPE on viewfile of url not in index 2015-03-26 00:21:31 +01:00
Michael Peter Christen
9bf0d7ecb9 added a new collection type 'dht' to all documents from the peer-to-peer
interface to distinguish rich and poor document data.
This also reverts some changes from commit
796770e070 because the firstSeen database
is the wrong method to distinguish these types of data
2015-03-24 12:32:39 +01:00
reger
7fcf0d0b71 fix missing display of CrawlerMonitor -> robots.txt Monitor
revert delete of file api/table_p.html see 3ffe19b85c
(still used in this menu)
2015-03-24 00:13:05 +01:00
reger
796770e070 prevent overwrite of crawled or received full documents by (newer) metadata
To protect rich index data (full resource) from overwriting by metadata gathered during remote search,
the newly introduced "firstSeen" index is used to differentiate between full-resource-doc and metadata,
as a "firstSeen" entry is only added on stores of full-resource-docs (during crawl or remote search).
2015-03-23 03:57:47 +01:00
reger
7cf28c4f94 upd to Jetty 9.2.10 2015-03-22 02:47:12 +01:00
Michael Peter Christen
ee2490ab98 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-19 10:42:57 +01:00
reger
431311df42 fix get fresh_date_dt to allow returned value to be date in future 2015-03-18 22:04:03 +01:00
otter
74c7e8b686 Fixes hanging FlushThread (see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5447)
by replacing put() method by the more robust add() to
add a merge job to the queue.
2015-03-18 21:57:41 +01:00
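The put()/add() distinction named in 74c7e8b686 can be illustrated with a minimal sketch. It assumes the merge queue behaves like a bounded java.util.concurrent.BlockingQueue; the commit does not say which queue class is used, and MergeQueueSketch, enqueue and the job names are hypothetical. On a full queue, put() blocks the caller until space is free, while add() fails immediately with IllegalStateException, so the calling thread cannot hang.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration, not YaCy code.
final class MergeQueueSketch {

    // Tries to enqueue a job without ever blocking the caller.
    // Returns false if the queue is full instead of waiting for space.
    static boolean enqueue(BlockingQueue<String> queue, String job) {
        try {
            queue.add(job); // throws IllegalStateException when no space is left
            return true;
        } catch (IllegalStateException full) {
            return false;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        System.out.println(enqueue(queue, "merge-1")); // queue has room
        System.out.println(enqueue(queue, "merge-2")); // queue is full, returns at once
        // queue.put("merge-2") would block here until another thread takes an element.
    }
}
```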
reger
f63fff9008 fix snippet containing number with comma as decimal point http://mantis.tokeek.de/view.php?id=344
to keep it as one word (by altering the split regex)
- added snippet test case with number
- regex for word split to match multiple split chars
2015-03-16 02:03:40 +01:00
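The idea behind f63fff9008 (keep "1,5" as one word while still splitting on punctuation, and treat a run of several split characters as one separator) can be sketched as follows. The regex below is an assumption written for illustration, not the expression used in YaCy; SnippetSplitSketch and WORD_SPLIT are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration, not YaCy code.
final class SnippetSplitSketch {

    // One or more split characters in a row. A comma or period only counts
    // as a split character when it is not enclosed by digits on both sides.
    static final Pattern WORD_SPLIT =
            Pattern.compile("(?:[\\s;:!?]|[.,](?!\\d)|(?<!\\d)[.,])+");

    static String[] words(String text) {
        return WORD_SPLIT.split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("price 1,5 euro, ok.")));
    }
}
```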
reger
b241264632 fix error on *abc query input
http://mantis.tokeek.de/view.php?id=486
2015-03-15 22:31:47 +01:00
reger
2ef8ffdb60 apply UTF-8 encoding
copied from escape()
2015-03-15 06:02:45 +01:00
reger
7120ea42f1 fix for path with char code > 255
(causing index out of bound exception)
+ test case for it
2015-03-15 03:37:32 +01:00
reger
1d81bd0687 fix url encoding for path see http://mantis.tokeek.de/view.php?id=559
So far we used same escape procedure for all parts of the url (which includes x-www-form-urlencoded for all url components)
Added capability to use different encoding rules for the different url components (through specific bitset for each component).
(this is inspired by the org.apache.http.client and java.net.URI implementations).
- Added test case for http://mantis.tokeek.de/view.php?id=559
2015-03-15 00:46:07 +01:00
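The per-component approach described in 1d81bd0687 can be sketched as one escape routine parameterized by a BitSet of characters that may stay unescaped. The sets below are assumptions loosely modeled on RFC 3986, not the sets YaCy defines, and ComponentEscapeSketch, PATH_SAFE and QUERY_SAFE are hypothetical names. Encoding the UTF-8 bytes rather than the chars also covers code points above 255 and the '%' character itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration, not YaCy code.
final class ComponentEscapeSketch {

    static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Characters that stay unescaped in a path (assumed set).
    static final BitSet PATH_SAFE = new BitSet(128);
    // Characters that stay unescaped in a query value (assumed, stricter set).
    static final BitSet QUERY_SAFE = new BitSet(128);

    static {
        for (char c = 'a'; c <= 'z'; c++) { PATH_SAFE.set(c); QUERY_SAFE.set(c); }
        for (char c = 'A'; c <= 'Z'; c++) { PATH_SAFE.set(c); QUERY_SAFE.set(c); }
        for (char c = '0'; c <= '9'; c++) { PATH_SAFE.set(c); QUERY_SAFE.set(c); }
        for (char c : "-._~".toCharArray()) { PATH_SAFE.set(c); QUERY_SAFE.set(c); }
        for (char c : "/!$&'()*+,;=:@".toCharArray()) { PATH_SAFE.set(c); }
    }

    // Percent-encodes every UTF-8 byte that is not in the given safe set.
    static String escape(String s, BitSet safe) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 128 && safe.get(v)) {
                sb.append((char) v);
            } else {
                sb.append('%').append(HEX[v >> 4]).append(HEX[v & 0xF]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("/a b/\u00FC", PATH_SAFE));
        System.out.println(escape("a+b/c", QUERY_SAFE));
    }
}
```

The same input can therefore be escaped differently depending on the component: '/' and '+' pass through in a path under these assumed sets, but are encoded in a query value.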
reger
62087fb8b2 fix MultiProtocolURL mailto protocol detection 2015-03-13 02:02:53 +01:00
reger
65f8371163 fix link to DeReWo project page 2015-03-11 21:28:57 +01:00
reger
2e8c24e02a fix link to DeReWo download file 2015-03-11 20:02:23 +01:00
reger
706f75ddc2 try to fix hang on index blob merge on shutdown
http://mantis.tokeek.de/view.php?id=505
It happens, but I am not able to reproduce it. This change makes sure the terminate signal is caught at the end of currently running merge jobs
2015-03-11 19:36:23 +01:00
reger
f94e34058c fix url (path) %-decoding http://mantis.tokeek.de/view.php?id=519
- add test case for this
2015-03-11 01:05:14 +01:00
reger
a5d19e2982 update configheuristics_p.html text
to state current opensearch heuristic function
2015-03-09 00:09:36 +01:00
reger
4b63dad88d fix version conflict in pom
for commons-io
2015-03-08 22:10:51 +01:00
reger
7e09bff4a1 exclude default search fields from text copy to text_t
for metadata index documents (reduce text redundancy)
2015-03-08 21:49:23 +01:00
reger
86073a5ba3 For remote crawlReceipt add document abstract/description
enhance the metadata returned to the originator with description_txt to improve fulltext search result hits.
2015-03-08 02:34:48 +01:00
reger
8af70950d9 harmonize snippet computation
to always consider description_txt (solr hl & internal).
For now just added desc to text list for computation, could be further equalized with hl computation.
2015-03-05 02:22:05 +01:00
reger
74ed399180 remove unused statement 2015-03-05 02:09:27 +01:00
Michael Peter Christen
fd4e2c809a Show dates in the content of a document in the search result:
- if an eventDate is given in the search result, replace the document
date with the event date and prefix it with the string "on ".
- the document date is omitted if a date from the content is shown

Added also the date as fields in the json and rss result sets.
2015-03-02 18:00:20 +01:00
Michael Peter Christen
893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
search for: "Berlin on:tomorrow" to find events happening tomorrow in
Berlin
2015-03-02 13:10:05 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
dcfc384eee bugfix for fixed host/port 2015-03-02 04:43:42 +01:00
Michael Peter Christen
d9d3111d10 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-02 04:31:05 +01:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
have been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> modifier for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
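The from:/to: semantics described in 535f1ebe3b (either bound optional, both bounds inclusive) can be sketched as a simple predicate. This is an assumption about the behavior as stated in the commit message, not the query code YaCy runs against its index; DateRangeSketch and matches are hypothetical names, and a null bound stands for an omitted modifier.

```java
import java.time.LocalDate;

// Hypothetical illustration, not YaCy code.
final class DateRangeSketch {

    // Returns true if docDate lies within [from, to]; a null bound is open.
    static boolean matches(LocalDate docDate, LocalDate from, LocalDate to) {
        if (from != null && docDate.isBefore(from)) return false;
        if (to != null && docDate.isAfter(to)) return false;
        return true;
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2015, 3, 2);
        // Closed interval with identical bounds still matches, as both are inclusive.
        System.out.println(matches(day, day, day));
        // Only a 'from' bound: open interval into the future.
        System.out.println(matches(day, LocalDate.of(2015, 3, 1), null));
    }
}
```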
reger
16bc267a32 add test case for snippet html encoding check 2015-03-01 23:50:17 +01:00
reger
a4629ad83b upd pom 2015-02-28 19:48:29 +01:00
reger
d7259419f3 postpone raw snippet html encoding upon use
instead of during init of snippet
addressing http://mantis.tokeek.de/view.php?id=551
2015-02-28 19:02:18 +01:00
Michael Peter Christen
c3aadcf899 Fix for Jetty "JetLeak" bug: update to jetty 9.2.9
The bug was inside the jetty library, for details see:
http://blog.gdssecurity.com/labs/2015/2/25/jetleak-vulnerability-remote-leakage-of-shared-buffers-in-je.html
We recommend updating your YaCy peer with this bugfix.
2015-02-28 15:46:46 +01:00
reger
de56d934b2 apply query parameter getQueryFields() to GSA servlet 2015-02-27 00:53:20 +01:00
Marc Nause
d23f7165ab Next try to fix start script for OpenBSD. 2015-02-25 21:11:59 +01:00
reger
2d2299f484 fix mimetype of rss items in rss parser
- remove self reference as anchor for items
2015-02-25 01:58:42 +01:00
Michael Peter Christen
b432049d59 enhanced date parsing time 2015-02-25 01:05:46 +01:00
reger
9b0de2de64 introduce getQueryFields to return default query fields (query parameter QF)
calculated from boostfields config, making sure title, description, keywords and content are always searched.
- apply change to solrServlet makes sure every remote query uses at least all locally defined boost fields for search
- apply to local solr search
- simplify select query by using QF defaults
2015-02-23 23:12:07 +01:00
Marc Nause
53e4ae65d0 Changes to improve compatibility with OpenBSD. (see
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5503)
2015-02-23 22:54:49 +01:00
reger
ba276d3e64 add description_txt to default query fields,
Dublin Core Metadata field extracted by most parsers.
2015-02-22 05:42:04 +01:00
reger
a0f04db9ea add extracted description/subject to pptParser 2015-02-22 05:31:56 +01:00
reger
8ec1db76ee url unescape add check for inconsistent utf8 multibyte parsing
If the url contains special chars (like umlauts äöü) it's interpreted as a multibyte char and actually not converted at all (removed).
Added a check: if the multibyte conversion is not complete, just add the char as is.

This fixes http://mantis.tokeek.de/view.php?id=200
2015-02-20 02:21:04 +01:00
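The check described in 8ec1db76ee can be sketched as a fallback during unescaping: try to read a run of bytes as UTF-8, and if the multibyte sequence is incomplete or invalid, keep the characters instead of dropping them. The sketch below is an assumption, not YaCy's unescape routine; it works on %XX runs and falls back to ISO-8859-1, and UnescapeSketch, unescape and flush are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not YaCy code.
final class UnescapeSketch {

    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream(); // pending %XX bytes
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int hi = (c == '%' && i + 2 < s.length()) ? Character.digit(s.charAt(i + 1), 16) : -1;
            int lo = (hi >= 0) ? Character.digit(s.charAt(i + 2), 16) : -1;
            if (hi >= 0 && lo >= 0) {
                run.write((hi << 4) | lo);
                i += 3;
            } else {
                flush(run, out);
                out.append(c); // not a valid escape: keep the char as is
                i++;
            }
        }
        flush(run, out);
        return out.toString();
    }

    // Decodes the pending bytes strictly as UTF-8; on failure keeps each byte
    // as one ISO-8859-1 character rather than removing it.
    static void flush(ByteArrayOutputStream run, StringBuilder out) {
        if (run.size() == 0) return;
        byte[] bytes = run.toByteArray();
        run.reset();
        try {
            out.append(StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes)));
        } catch (CharacterCodingException incomplete) {
            out.append(new String(bytes, StandardCharsets.ISO_8859_1));
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("/stra%C3%9Fe")); // complete UTF-8 sequence
        System.out.println(unescape("/m%FCller"));    // lone byte, kept via fallback
    }
}
```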
reger
4b97ddb9ec stop sending crawl receipts if receiver went offline 2015-02-17 03:16:10 +01:00