Commit Graph

3280 Commits

Author SHA1 Message Date
Michael Peter Christen
4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
written to special solr fields which are deactivated by default.
2015-04-12 22:02:45 +02:00
reger
1395f10e95 fix typecast for css links 2015-04-12 01:11:47 +02:00
Michael Peter Christen
3288489fd2 more logging during start-up 2015-04-11 13:00:32 +02:00
Michael Peter Christen
abaaaef5f1 fix for filter queries 2015-04-11 12:30:29 +02:00
Michael Peter Christen
4d00175157 <experimental> added parsing of <article> html element.
Whenever such an element occurs, the complete content of all article
elements replaces the parsed <content> part of documents.
2015-04-10 16:16:20 +02:00
Michael Peter Christen
1df6492019 enhanced suggestions 2015-04-10 15:59:18 +02:00
Michael Peter Christen
ae02c92fd0 logging fix 2015-04-09 14:21:23 +02:00
Michael Peter Christen
5651713134 better debugging of fq 2015-04-07 17:02:02 +02:00
Michael Peter Christen
f5a032f293 split query into filter query and text query to get better ranking
results and faster results
2015-04-07 16:10:13 +02:00
Michael Peter Christen
2e88028c1a when selecting collections in navigation, do show the un-selected
collections in search result. When selecting one of them in another
search, switch off the previously selected collection. This actually
turns the collection navigation modifier into a radio-button like
behaviour
2015-04-07 13:13:58 +02:00
Michael Peter Christen
1de9b21c65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-07 12:40:43 +02:00
reger
5f4cd8d6f5 replace deprecated getIP with getIPs in AbstractRemoteHandler 2015-04-07 00:10:42 +02:00
Michael Peter Christen
fa7edc9f7a refactoring of filter queries (several queries instead only one) 2015-04-02 13:27:47 +02:00
Michael Peter Christen
40389987ec Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-01 18:18:05 +02:00
Michael Peter Christen
f9ba50379d added an expansion option to search facets on result page:
- if less or equal of 8 facet options are present, they are shown by
default
- if more facet options are present, they are hidden
To view or hide all facets, just click on the facet header bar
2015-04-01 18:17:52 +02:00
reger
1f0f77bb77 make location facet return results
for location nav facet of field coordinate_p does not return results, now using coordinate_p_0_coordinate as alternative to get facet counts. As the actual facet value is not used this should not harm any analysis (even if facet is a incomplete location).
If facet value is used in future likely *_geohash field could be introduced (for facet and other ... as transport value)
2015-04-01 01:57:56 +02:00
reger
b1ec0644e5 fix NPE in location search on missing/empty PubDate in underlaying rss data 2015-03-31 02:20:13 +02:00
reger
c1dcc8c456 fix display and limit of max server connections after startup
(on restart value returned to default=50)
This has no effect on Jetty but the limit is still respected.
2015-03-29 07:12:23 +02:00
reger
839b962c20 correct percent encoding for '%' char 2015-03-28 03:05:21 +01:00
Michael Peter Christen
9bf0d7ecb9 added a new collection type 'dht' to all documents from the peer-to-peer
interface to distinguish rich and poor document data.
This also reverts some changes from commit
796770e070 because the firstSeen database
is the wrong method to distinguish these types of data
2015-03-24 12:32:39 +01:00
reger
796770e070 prevent overwrite of crawled or received full documents by (newer) metadata
To protect rich index data (full resource) from overwriting by metadata gathered during remote search,
the newly introduced "firstSeen" index is used to differentiate between full-resource-doc and metadata,
as a "firstSeen" entry is only added on store's of full-resource-docs (during crawl or remote search).
2015-03-23 03:57:47 +01:00
Michael Peter Christen
ee2490ab98 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-19 10:42:57 +01:00
reger
431311df42 fix get fresh_date_dt to allow returned value to be date in future 2015-03-18 22:04:03 +01:00
otter
74c7e8b686 Fixes hanging FlushThread (see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5447)
by replacing put() method by the more robust add() to
add a merge job to the queue.
2015-03-18 21:57:41 +01:00
reger
f63fff9008 fix snippet containig number with comma as desmo point http://mantis.tokeek.de/view.php?id=344
to keep it as one word (by altering the split regex)
- added sniipet test case with number
- regex for word split to match multiple splitcars
2015-03-16 02:03:40 +01:00
reger
b241264632 fix error on *abc query input
http://mantis.tokeek.de/view.php?id=486
2015-03-15 22:31:47 +01:00
reger
2ef8ffdb60 apply UTF-8 encoding
copied from escape()
2015-03-15 06:02:45 +01:00
reger
7120ea42f1 fix for path with char code > 255
(causing index out of bound exception)
+ test cas for it
2015-03-15 03:37:32 +01:00
reger
1d81bd0687 fix url encoding for path see http://mantis.tokeek.de/view.php?id=559
So far we used same escape procedure for all parts of the url (which includes x-www-form-urlencoded for all url components)
Added capability to use different encoding rules for the different url components (through specific bitset for each component).
(this is inspired by org.apache.http.client and java.net.uri implementation).
- Added test case for  http://mantis.tokeek.de/view.php?id=559
2015-03-15 00:46:07 +01:00
reger
62087fb8b2 fix MultiProtocolURL mailto protocol detection 2015-03-13 02:02:53 +01:00
reger
2e8c24e02a fix link to DeReWo download file 2015-03-11 20:02:23 +01:00
reger
706f75ddc2 try to fix hang on index blob merge on shutdown
http://mantis.tokeek.de/view.php?id=505
It happens but not able to reproduce. This change makes sure terminate signal is catched at end of currently running merge jobs
2015-03-11 19:36:23 +01:00
reger
f94e34058c fix url (path) %-decoding http://mantis.tokeek.de/view.php?id=519
- add test case for this
2015-03-11 01:05:14 +01:00
reger
7e09bff4a1 exclude default search fields from text copy to text_t
for metadata index documents (reduce text redundance)
2015-03-08 21:49:23 +01:00
reger
86073a5ba3 For remote crawlReceipt add document abstract/description
enhance the returned metadata returned to the originator by description_txt to improve fulltext search result hits.
2015-03-08 02:34:48 +01:00
reger
8af70950d9 harmonize snippet computation
to considere description_txt always (solr hl & internal).
For now just added desc to text list for computation, could be further equalized with hl computation.
2015-03-05 02:22:05 +01:00
Michael Peter Christen
fd4e2c809a Show dates in the content of a document in the search result:
- if an eventDate is given in the search result, replace the document
date with the event date and prefix it with the string "on ".
- the document date is omitted if a date from the cent is shown

Added also the date as fields in the json and rss result sets.
2015-03-02 18:00:20 +01:00
Michael Peter Christen
893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
search for: "Berlin on:tomorrow" to find events happening tomorrow in
Berlin
2015-03-02 13:10:05 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
d9d3111d10 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-02 04:31:05 +01:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denot weekend
days (saturday and sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove from and date modifier and set a on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
reger
d7259419f3 postpone raw snippet html encoding upon use
instead of during init of snippet 
adressing http://mantis.tokeek.de/view.php?id=551
2015-02-28 19:02:18 +01:00
reger
de56d934b2 apply query parameter getQueryFields() to GSA servlet 2015-02-27 00:53:20 +01:00
reger
2d2299f484 fix mimetype of rss items in rss parser
- remove self reference as anchor for items
2015-02-25 01:58:42 +01:00
Michael Peter Christen
b432049d59 enhanced date parsing time 2015-02-25 01:05:46 +01:00
reger
9b0de2de64 introduce getQueryFields to return default query fields (queryparamter QF)
calculated from boostfields config, making sure title, description, keywords and content is always searched.
- apply change to solrServlet makes sure every remote query uses at least all locally defined boost fields for search
- apply to local solr search
- simplify select query by using QF defaults
2015-02-23 23:12:07 +01:00
reger
a0f04db9ea add extracted description/subject to pptParser 2015-02-22 05:31:56 +01:00
reger
8ec1db76ee url unescape add check for inconsistent utf8 multibyte parsing
If the url contains special chars (like umlaute äöü) it's interpreted as multybyte char and actually not converted at all (removed).
Added a check if the multibyte convesion is not complete, just add the char as is.

This fixes http://mantis.tokeek.de/view.php?id=200
2015-02-20 02:21:04 +01:00
reger
4b97ddb9ec stop sending crawl receipts if receiver got offline 2015-02-17 03:16:10 +01:00
reger
7e35518787 add extracted description/subject to docParser 2015-02-16 00:50:16 +01:00
reger
f0a5188e11 replace depreciated HTTPClient setStaleConnectionCheckEnabled with setValidateAfterInactivity() 2015-02-15 23:09:01 +01:00
reger
7b569d2dbe replace depriciated HTTPClient ALLOW_ALL_HOSTNAME_VERIFIER with NoopHostnameVerifier() 2015-02-15 21:34:01 +01:00
reger
fba34e12ef fix formatting issue if snippet contains html code
replacement for reverted commit
61f42a7928
2015-02-15 20:39:20 +01:00
reger
e48720a58c fix NPE in snippet computation 2015-02-15 05:30:14 +01:00
reger
eda0aeaf26 allow/recognize host in file: protocol crawl target
This is useful in intranet indexing while crawling a intranet file server accessed via hostname while e.g. under Windows mapped to different drive letters on individual clients.
Here you can crawl e.g.  file://fileserver/documents having a valid uri in that intranet environment (while e.g. P:/documents might be client dependant).
2015-02-11 23:26:39 +01:00
reger
df83fcc4fc disable optimistic GC assumption in StandardMemoryStrategy
After several tests found that eom is not prevented. Major reason in testing was assumption future GC will free avg of last 5 GC.
Disabeling this check improved eom exceptions.

Added simplest testcase used for verification
2015-02-11 01:42:01 +01:00
Michael Peter Christen
8ff76f8682 the cleanup process experienced a 100% CPU load situation and the loop
did not terminate:

Occurrences: 100
at java.util.HashMap$KeyIterator.next(HashMap.java:956)
at
net.yacy.cora.protocol.ConnectionInfo.cleanup(ConnectionInfo.java:300)
at
net.yacy.cora.protocol.ConnectionInfo.cleanUp(ConnectionInfo.java:293)
at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2212)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:105)
at
net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:215)

This tries to fix the problem; the problem should be monitored
2015-02-10 08:43:45 +01:00
Michael Peter Christen
1f5b5c0111 npe fix for latest scraper feature 2015-02-10 08:33:30 +01:00
Michael Peter Christen
ee97302a23 hack to make date detection faster (while it becomes a bit incomplete
regarding language alternatives)
2015-02-09 18:46:06 +01:00
Michael Peter Christen
6578ff3ddb enhanced suggest function 2015-02-09 18:45:07 +01:00
reger
fe6f5a395d fix Umlaut handling in blekko heuristic search term
http://mantis.tokeek.de/view.php?id=169
observation: blekko seams to block xxxbot agents (=0 results)
2015-02-08 23:40:33 +01:00
reger
23924348e2 url with semicolon or comma handling in proxy request
apply patch supplied with bugreport http://mantis.tokeek.de/view.php?id=540
2015-02-07 22:01:54 +01:00
reger
9025fe3518 upd error message for proxy
fix http://mantis.tokeek.de/view.php?id=539
2015-02-07 00:37:43 +01:00
Michael Peter Christen
97ba5ddbb7 configuration option for maxload limit for remote search 2015-02-04 01:12:25 +01:00
reger
c454ef69c6 add shortMemory check to heuristic search
and skip operation on shortMemory (no request to remote openserch systems)
2015-02-03 03:08:34 +01:00
reger
9e1ec5fec4 refactor: just some more useages of constant for term ":[* TO *]" 2015-02-01 04:26:33 +01:00
reger
8c491f51a5 remove hardcoded initialization of language nav if not used 2015-02-01 00:29:28 +01:00
Michael Peter Christen
b5ac29c9a5 added a html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
with the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
1cb290170e refactoring of autotagging code (combined same code pieces) 2015-01-29 11:39:47 +01:00
Michael Peter Christen
c3b55455fc enhanced initialization speed of vocabularies by using better
normalization and by removal of unused data structures
2015-01-29 02:45:32 +01:00
Michael Peter Christen
68c605d637 replace with CommonPattern.SPACE for split 2015-01-29 02:28:03 +01:00
Michael Peter Christen
de3e373913 using precompiled CommonPattern.TAB for split 2015-01-29 02:22:28 +01:00
Michael Peter Christen
1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits 2015-01-29 02:19:41 +01:00
Michael Peter Christen
a8a2b7a803 persistency for vocabulary facet switch 2015-01-29 02:16:42 +01:00
Michael Peter Christen
efbc9a3561 introducting a new getConfig method which parses comma-separated llists
from setting fields; refactoring for all places where such lists are
parsed
2015-01-29 01:53:36 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
Michael Peter Christen
ac19690d30 refactoring with CommonPattern.COMMA 2015-01-29 01:35:28 +01:00
Michael Peter Christen
cf9b22ca5c do not reindex based on vocabulary fields (there are meanwhile many of
them) and some default settings
2015-01-29 01:22:28 +01:00
Michael Peter Christen
5a060c9f26 refactoring of reindexSolr (just replaced constant string) 2015-01-29 00:33:07 +01:00
Michael Peter Christen
b5a55c8b3d fix for wkhtmltopdf (custom header does not work) 2015-01-28 17:45:25 +01:00
Michael Peter Christen
3d717b749a fix for urlmaskfilter 2015-01-28 13:40:41 +01:00
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
(strange but this is the result of a test:
ArrayBlockingQueue: 39461 lines / second;
LinkedBlockingQueue: 60774 lines / second)
2015-01-27 16:53:09 +01:00
Michael Peter Christen
6390454652 fix for vocabulary on/off setting 2015-01-27 16:24:27 +01:00
Michael Peter Christen
a3c5995bde Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-26 14:13:17 +01:00
reger
5ca0762179 fix: eom on parsing ico file by genericImageParser
trace: java.lang.OutOfMemoryError: Java heap space
	at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
	at java.awt.image.Raster.createPackedRaster(Raster.java:467)
	at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
	at java.awt.image.BufferedImage.<init>(BufferedImage.java:331)
	at net.yacy.document.parser.images.bmpParser$IMAGEMAP.<init>(bmpParser.java:149)
	at net.yacy.document.parser.images.bmpParser.parse(bmpParser.java:69)
	at net.yacy.document.parser.images.genericImageParser.parse(genericImageParser.java:116)
2015-01-24 23:17:07 +01:00
Michael Peter Christen
4cd2d68e03 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-24 07:10:47 +01:00
Michael Peter Christen
dc5700148f update to latest code changes from json.org 2015-01-24 07:10:14 +01:00
reger
42b0672be3 Let auto-disabled crawls recover if low resource condition vanished.
Analog to autodisabled DHT switch autodisabled crawls back on upon mem ok
by remembering the autodisable by conf parameter.
2015-01-24 01:53:58 +01:00
Michael Peter Christen
287c528f46 replaced old JavaApplicationStub for Mac Application framework with new
script. Adopted the YaCyApp environment and fixed a problem in the
startYACY.sh application wrapper which caused wrong usage of logging
option -l which caused that files had been written to the YaCy
application folder.
As a result of this fix, it is not necessary any more to change path
settings in Info.plist if libraries are changed.
2015-01-23 11:30:13 +01:00
Michael Peter Christen
4c9d2a7c64 reverted 'do not show all options' strategy. This is actually confusing
new users. Will be activated maybe again if there is an optional
tutorial mode which can be switched on for this special purpose of
running a tutorial.
2015-01-20 18:18:12 +01:00
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionallity, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager enforces now a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses a additional config file for fieldname translation.

default heuristicopensearch.conf: 
- openbdb.com removed - seems not longer to deliver results
- config via solrconnector to  datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
Michael Peter Christen
3b51636ecb fix for mediawiki import 2015-01-12 00:35:47 +01:00
Michael Peter Christen
b07afbc115 a test with http://validator.w3.org/feed/#validate_by_input shows that
the time format was wrong; we must use RFC-822
2015-01-09 16:45:43 +01:00
Michael Peter Christen
8cafdb989a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-09 11:00:02 +01:00
reger
66839f73fa remove debug limit from commit before 2015-01-09 02:52:18 +01:00
reger
4214f250d0 Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon.
Intended for searches/research projects with not sufficient results from local and DHT selected remote target peers.

Function: the process checks newly created bookmarks for description starting with "query=..." and takes this to ask every peer for 20 search results and adds it to the local index in a background job.
link to start/stop the process added to /Bookmarks.html
2015-01-09 02:06:30 +01:00
reger
8e751d754a - add javadoc to busythread with hint about the init parameter useage
- remove obsolete 10_httpd config parameter
2015-01-09 01:31:57 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00