7854 Commits

Author SHA1 Message Date
Michael Peter Christen
1de9b21c65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-07 12:40:43 +02:00
reger
5f4cd8d6f5 replace deprecated getIP with getIPs in AbstractRemoteHandler 2015-04-07 00:10:42 +02:00
Michael Peter Christen
fa7edc9f7a refactoring of filter queries (several queries instead only one) 2015-04-02 13:27:47 +02:00
Michael Peter Christen
40389987ec Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-01 18:18:05 +02:00
Michael Peter Christen
f9ba50379d added an expansion option to search facets on result page:
- if 8 or fewer facet options are present, they are shown by
default
- if more facet options are present, they are hidden
To view or hide all facets, just click on the facet header bar
2015-04-01 18:17:52 +02:00
reger
1f0f77bb77 make location facet return results
The location nav facet on field coordinate_p does not return results, so coordinate_p_0_coordinate is now used as an alternative to get facet counts. As the actual facet value is not used, this should not harm any analysis (even if the facet is an incomplete location).
If the facet value is used in the future, a *_geohash field could likely be introduced (for facet and other ... as transport value)
2015-04-01 01:57:56 +02:00
reger
b1ec0644e5 fix NPE in location search on missing/empty PubDate in underlying rss data 2015-03-31 02:20:13 +02:00
reger
c1dcc8c456 fix display and limit of max server connections after startup
(on restart value returned to default=50)
This has no effect on Jetty but the limit is still respected.
2015-03-29 07:12:23 +02:00
reger
839b962c20 correct percent encoding for '%' char 2015-03-28 03:05:21 +01:00
Michael Peter Christen
9bf0d7ecb9 added a new collection type 'dht' to all documents from the peer-to-peer
interface to distinguish rich and poor document data.
This also reverts some changes from commit
796770e070 because the firstSeen database
is the wrong method to distinguish these types of data
2015-03-24 12:32:39 +01:00
reger
796770e070 prevent overwrite of crawled or received full documents by (newer) metadata
To protect rich index data (full resource) from overwriting by metadata gathered during remote search,
the newly introduced "firstSeen" index is used to differentiate between full-resource-doc and metadata,
as a "firstSeen" entry is only added on stores of full-resource-docs (during crawl or remote search).
2015-03-23 03:57:47 +01:00
Michael Peter Christen
ee2490ab98 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-19 10:42:57 +01:00
reger
431311df42 fix get fresh_date_dt to allow returned value to be date in future 2015-03-18 22:04:03 +01:00
otter
74c7e8b686 Fixes hanging FlushThread (see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5447)
by replacing put() method by the more robust add() to
add a merge job to the queue.
2015-03-18 21:57:41 +01:00
reger
f63fff9008 fix snippet containing number with comma as decimal point http://mantis.tokeek.de/view.php?id=344
to keep it as one word (by altering the split regex)
- added snippet test case with number
- regex for word split to match multiple split chars
2015-03-16 02:03:40 +01:00
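The word-split change described in f63fff9008 can be illustrated with a minimal sketch. It is written in Python rather than YaCy's Java, and both the pattern and the name `split_words` are assumptions made for illustration, not the regex from the commit. The idea shown: a run of several split characters is one delimiter, and a comma or dot between two digits is not a delimiter at all.

```python
import re

# Hypothetical pattern: a run of one or more split characters.
# A comma or dot only counts as a split character when it is NOT
# surrounded by digits on both sides, so "1,5" stays one word.
SPLIT = re.compile(r"(?:(?<!\d)[.,]|[.,](?!\d)|[\s;:!?])+")

def split_words(text):
    """Split text into words, dropping empty strings at the edges."""
    return [word for word in SPLIT.split(text) if word]

print(split_words("costs 1,5 euro, ok"))
```

The comma after "euro" still splits because it has no digit on its left, while the comma inside "1,5" is kept.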
reger
b241264632 fix error on *abc query input
http://mantis.tokeek.de/view.php?id=486
2015-03-15 22:31:47 +01:00
reger
2ef8ffdb60 apply UTF-8 encoding
copied from escape()
2015-03-15 06:02:45 +01:00
reger
7120ea42f1 fix for path with char code > 255
(causing index out of bound exception)
+ test case for it
2015-03-15 03:37:32 +01:00
reger
1d81bd0687 fix url encoding for path see http://mantis.tokeek.de/view.php?id=559
So far we used same escape procedure for all parts of the url (which includes x-www-form-urlencoded for all url components)
Added capability to use different encoding rules for the different url components (through specific bitset for each component).
(this is inspired by org.apache.http.client and java.net.uri implementation).
- Added test case for  http://mantis.tokeek.de/view.php?id=559
2015-03-15 00:46:07 +01:00
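The point of 1d81bd0687, different escaping rules for different url components, can be sketched as follows. This is a Python illustration using the standard library's `urllib.parse`, not YaCy's bitset-based Java code; the function names `encode_path` and `encode_query_value` are hypothetical.

```python
from urllib.parse import quote, quote_plus

def encode_path(path):
    # Path rule: keep "/" as the separator, encode a space as %20.
    return quote(path, safe="/")

def encode_query_value(value):
    # x-www-form-urlencoded rule: a space becomes "+", "&" is escaped.
    return quote_plus(value)

print(encode_path("/my docs/file.txt"))
print(encode_query_value("my docs&x"))
```

Applying the form-encoding rule to a path would turn the space into "+", which a server reads as a literal plus sign in the path; that mismatch is the kind of problem per-component rules avoid.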
reger
62087fb8b2 fix MultiProtocolURL mailto protocol detection 2015-03-13 02:02:53 +01:00
reger
2e8c24e02a fix link to DeReWo download file 2015-03-11 20:02:23 +01:00
reger
706f75ddc2 try to fix hang on index blob merge on shutdown
http://mantis.tokeek.de/view.php?id=505
It happens, but I was not able to reproduce it. This change makes sure the terminate signal is caught at the end of currently running merge jobs
2015-03-11 19:36:23 +01:00
reger
f94e34058c fix url (path) %-decoding http://mantis.tokeek.de/view.php?id=519
- add test case for this
2015-03-11 01:05:14 +01:00
reger
7e09bff4a1 exclude default search fields from text copy to text_t
for metadata index documents (reduce text redundancy)
2015-03-08 21:49:23 +01:00
reger
86073a5ba3 For remote crawlReceipt add document abstract/description
enhance the metadata returned to the originator with description_txt to improve fulltext search result hits.
2015-03-08 02:34:48 +01:00
reger
8af70950d9 harmonize snippet computation
to always consider description_txt (solr hl & internal).
For now just added desc to text list for computation, could be further equalized with hl computation.
2015-03-05 02:22:05 +01:00
Michael Peter Christen
fd4e2c809a Show dates in the content of a document in the search result:
- if an eventDate is given in the search result, replace the document
date with the event date and prefix it with the string "on ".
- the document date is omitted if a date from the content is shown

Added also the date as fields in the json and rss result sets.
2015-03-02 18:00:20 +01:00
Michael Peter Christen
893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
search for: "Berlin on:tomorrow" to find events happening tomorrow in
Berlin
2015-03-02 13:10:05 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
d9d3111d10 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-02 04:31:05 +01:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
have been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
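The click cycle described in 535f1ebe3b behaves like a small state machine. The following Python sketch is only an illustration of that cycle: the function names, the dictionary representation of the modifiers, and the yyyy/mm/dd date format are assumptions, and it only covers states that the clicks themselves produce.

```python
def click(mods, date):
    """Return the date modifiers after clicking the histogram bar for `date`."""
    if "from" not in mods:
        # 1st click, or 4th click after an on:<date> was set
        return {"from": date}
    if "to" not in mods:
        # 2nd click closes the interval
        return {"from": mods["from"], "to": date}
    # 3rd click: drop from/to and select the single day
    return {"on": date}

def query(term, mods):
    """Append the modifiers to the search term."""
    return " ".join([term] + [f"{key}:{value}" for key, value in mods.items()])

mods = click({}, "2015/03/01")
mods = click(mods, "2015/03/05")
print(query("Berlin", mods))
```

A third click on any bar would replace both modifiers with a single on:<date>, and a fourth click starts over with from:<date>.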
reger
d7259419f3 postpone raw snippet html encoding upon use
instead of during init of snippet
addressing http://mantis.tokeek.de/view.php?id=551
2015-02-28 19:02:18 +01:00
reger
de56d934b2 apply query parameter getQueryFields() to GSA servlet 2015-02-27 00:53:20 +01:00
reger
2d2299f484 fix mimetype of rss items in rss parser
- remove self reference as anchor for items
2015-02-25 01:58:42 +01:00
Michael Peter Christen
b432049d59 enhanced date parsing time 2015-02-25 01:05:46 +01:00
reger
9b0de2de64 introduce getQueryFields to return default query fields (query parameter QF)
calculated from boostfields config, making sure title, description, keywords and content are always searched.
- apply change to solrServlet makes sure every remote query uses at least all locally defined boost fields for search
- apply to local solr search
- simplify select query by using QF defaults
2015-02-23 23:12:07 +01:00
reger
a0f04db9ea add extracted description/subject to pptParser 2015-02-22 05:31:56 +01:00
reger
8ec1db76ee url unescape add check for inconsistent utf8 multibyte parsing
If the url contains special chars (like umlauts äöü) it's interpreted as a multibyte char and actually not converted at all (removed).
Added a check: if the multibyte conversion is not complete, just add the char as is.

This fixes http://mantis.tokeek.de/view.php?id=200
2015-02-20 02:21:04 +01:00
reger
4b97ddb9ec stop sending crawl receipts if receiver got offline 2015-02-17 03:16:10 +01:00
reger
7e35518787 add extracted description/subject to docParser 2015-02-16 00:50:16 +01:00
reger
f0a5188e11 replace deprecated HTTPClient setStaleConnectionCheckEnabled with setValidateAfterInactivity() 2015-02-15 23:09:01 +01:00
reger
7b569d2dbe replace deprecated HTTPClient ALLOW_ALL_HOSTNAME_VERIFIER with NoopHostnameVerifier() 2015-02-15 21:34:01 +01:00
reger
fba34e12ef fix formatting issue if snippet contains html code
replacement for reverted commit
61f42a7928
2015-02-15 20:39:20 +01:00
reger
e48720a58c fix NPE in snippet computation 2015-02-15 05:30:14 +01:00
reger
eda0aeaf26 allow/recognize host in file: protocol crawl target
This is useful in intranet indexing while crawling an intranet file server accessed via hostname while e.g. under Windows mapped to different drive letters on individual clients.
Here you can crawl e.g. file://fileserver/documents having a valid uri in that intranet environment (while e.g. P:/documents might be client dependent).
2015-02-11 23:26:39 +01:00
reger
df83fcc4fc disable optimistic GC assumption in StandardMemoryStrategy
Several tests showed that out-of-memory errors (eom) are not prevented. The major reason found in testing was the assumption that a future GC will free the average of the last 5 GCs.
Disabling this check reduced eom exceptions.

Added simplest testcase used for verification
2015-02-11 01:42:01 +01:00
Michael Peter Christen
8ff76f8682 the cleanup process experienced a 100% CPU load situation and the loop
did not terminate:

Occurrences: 100
at java.util.HashMap$KeyIterator.next(HashMap.java:956)
at net.yacy.cora.protocol.ConnectionInfo.cleanup(ConnectionInfo.java:300)
at net.yacy.cora.protocol.ConnectionInfo.cleanUp(ConnectionInfo.java:293)
at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2212)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:105)
at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:215)

This tries to fix the problem; the problem should be monitored
2015-02-10 08:43:45 +01:00
Michael Peter Christen
1f5b5c0111 npe fix for latest scraper feature 2015-02-10 08:33:30 +01:00
Michael Peter Christen
ee97302a23 hack to make date detection faster (while it becomes a bit incomplete
regarding language alternatives)
2015-02-09 18:46:06 +01:00
Michael Peter Christen
6578ff3ddb enhanced suggest function 2015-02-09 18:45:07 +01:00
reger
fe6f5a395d fix Umlaut handling in blekko heuristic search term
http://mantis.tokeek.de/view.php?id=169
observation: blekko seems to block xxxbot agents (=0 results)
2015-02-08 23:40:33 +01:00
reger
23924348e2 url with semicolon or comma handling in proxy request
apply patch supplied with bugreport http://mantis.tokeek.de/view.php?id=540
2015-02-07 22:01:54 +01:00
reger
9025fe3518 upd error message for proxy
fix http://mantis.tokeek.de/view.php?id=539
2015-02-07 00:37:43 +01:00
Michael Peter Christen
97ba5ddbb7 configuration option for maxload limit for remote search 2015-02-04 01:12:25 +01:00
reger
c454ef69c6 add shortMemory check to heuristic search
and skip operation on shortMemory (no request to remote opensearch systems)
2015-02-03 03:08:34 +01:00
reger
9e1ec5fec4 refactor: just some more usages of constant for term ":[* TO *]" 2015-02-01 04:26:33 +01:00
reger
8c491f51a5 remove hardcoded initialization of language nav if not used 2015-02-01 00:29:28 +01:00
Michael Peter Christen
b5ac29c9a5 added an html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
of the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
1cb290170e refactoring of autotagging code (combined same code pieces) 2015-01-29 11:39:47 +01:00
Michael Peter Christen
c3b55455fc enhanced initialization speed of vocabularies by using better
normalization and by removal of unused data structures
2015-01-29 02:45:32 +01:00
Michael Peter Christen
68c605d637 replace with CommonPattern.SPACE for split 2015-01-29 02:28:03 +01:00
Michael Peter Christen
de3e373913 using precompiled CommonPattern.TAB for split 2015-01-29 02:22:28 +01:00
Michael Peter Christen
1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits 2015-01-29 02:19:41 +01:00
Michael Peter Christen
a8a2b7a803 persistency for vocabulary facet switch 2015-01-29 02:16:42 +01:00
Michael Peter Christen
efbc9a3561 introducing a new getConfig method which parses comma-separated lists
from setting fields; refactoring for all places where such lists are
parsed
2015-01-29 01:53:36 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
Michael Peter Christen
ac19690d30 refactoring with CommonPattern.COMMA 2015-01-29 01:35:28 +01:00
Michael Peter Christen
cf9b22ca5c do not reindex based on vocabulary fields (there are meanwhile many of
them) and some default settings
2015-01-29 01:22:28 +01:00
Michael Peter Christen
5a060c9f26 refactoring of reindexSolr (just replaced constant string) 2015-01-29 00:33:07 +01:00
Michael Peter Christen
b5a55c8b3d fix for wkhtmltopdf (custom header does not work) 2015-01-28 17:45:25 +01:00
Michael Peter Christen
3d717b749a fix for urlmaskfilter 2015-01-28 13:40:41 +01:00
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
(strange but this is the result of a test:
ArrayBlockingQueue: 39461 lines / second;
LinkedBlockingQueue: 60774 lines / second)
2015-01-27 16:53:09 +01:00
Michael Peter Christen
6390454652 fix for vocabulary on/off setting 2015-01-27 16:24:27 +01:00
Michael Peter Christen
a3c5995bde Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-26 14:13:17 +01:00
reger
5ca0762179 fix: eom on parsing ico file by genericImageParser
trace: java.lang.OutOfMemoryError: Java heap space
	at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
	at java.awt.image.Raster.createPackedRaster(Raster.java:467)
	at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
	at java.awt.image.BufferedImage.<init>(BufferedImage.java:331)
	at net.yacy.document.parser.images.bmpParser$IMAGEMAP.<init>(bmpParser.java:149)
	at net.yacy.document.parser.images.bmpParser.parse(bmpParser.java:69)
	at net.yacy.document.parser.images.genericImageParser.parse(genericImageParser.java:116)
2015-01-24 23:17:07 +01:00
Michael Peter Christen
4cd2d68e03 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-24 07:10:47 +01:00
Michael Peter Christen
dc5700148f update to latest code changes from json.org 2015-01-24 07:10:14 +01:00
reger
42b0672be3 Let auto-disabled crawls recover if low resource condition vanished.
Analogous to the auto-disabled DHT, switch auto-disabled crawls back on once memory is OK
by remembering the auto-disable in a conf parameter.
2015-01-24 01:53:58 +01:00
Michael Peter Christen
287c528f46 replaced old JavaApplicationStub for Mac Application framework with new
script. Adopted the YaCyApp environment and fixed a problem in the
startYACY.sh application wrapper which caused wrong usage of the logging
option -l, so that files were written to the YaCy
application folder.
As a result of this fix, it is not necessary any more to change path
settings in Info.plist if libraries are changed.
2015-01-23 11:30:13 +01:00
Michael Peter Christen
4c9d2a7c64 reverted 'do not show all options' strategy. This is actually confusing
new users. Will be activated maybe again if there is an optional
tutorial mode which can be switched on for this special purpose of
running a tutorial.
2015-01-20 18:18:12 +01:00
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager now enforces a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation.

default heuristicopensearch.conf:
- openbdb.com removed - no longer seems to deliver results
- config via solrconnector to datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
Michael Peter Christen
3b51636ecb fix for mediawiki import 2015-01-12 00:35:47 +01:00
Michael Peter Christen
b07afbc115 a test with http://validator.w3.org/feed/#validate_by_input shows that
the time format was wrong; we must use RFC-822
2015-01-09 16:45:43 +01:00
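For reference, the date layout that feed validators expect for RSS can be produced as in the sketch below. It uses Python's standard `email.utils.format_datetime`, which emits the RFC 2822 form (the RFC-822 layout with a four-digit year); it is an illustration only and not the Java code changed in b07afbc115. The timestamp is the commit date of that change.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Fixed +01:00 offset, matching the commit timestamp above.
cet = timezone(timedelta(hours=1))
stamp = datetime(2015, 1, 9, 16, 45, 43, tzinfo=cet)

pub_date = format_datetime(stamp)
print(pub_date)  # Fri, 09 Jan 2015 16:45:43 +0100
```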
Michael Peter Christen
8cafdb989a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-09 11:00:02 +01:00
reger
66839f73fa remove debug limit from commit before 2015-01-09 02:52:18 +01:00
reger
4214f250d0 Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon.
Intended for searches/research projects with not sufficient results from local and DHT selected remote target peers.

Function: the process checks newly created bookmarks for description starting with "query=..." and takes this to ask every peer for 20 search results and adds it to the local index in a background job.
link to start/stop the process added to /Bookmarks.html
2015-01-09 02:06:30 +01:00
reger
8e751d754a - add javadoc to busythread with hint about the init parameter useage
- remove obsolete 10_httpd config parameter
2015-01-09 01:31:57 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00
Michael Peter Christen
35c24608cc fix for division by zero (rare cases) 2015-01-06 14:21:20 +01:00
Michael Peter Christen
4144c7cc52 do not write frame links to webgraph 2015-01-06 14:14:25 +01:00
reger
4eb89d7f15 revert clickservlet
(default was indeed a mistake)
2015-01-05 09:10:20 +01:00
Michael Peter Christen
c9e2128260 please commit new files under your own name, this file was not created
by me.
2015-01-05 08:18:19 +01:00
reger
d44d8996d0 Added a “don't store remote search results” option
This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result.
The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules).
Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index.

To be able to improve the local index a Click-Servlet option was added additionally.
If switched on, all search result links point to this servlet, which forwards the users browser (by html header) to the desired page and feeds the page to the fulltext-index.
The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks)

The option check-boxes are placed in ConfigPortal.html
2015-01-04 11:10:45 +01:00
reger
c156548efe add info text to metadata page (htmlresponsewriter) on no documents found 2015-01-04 02:59:21 +01:00
reger
3ac1d14a21 improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list.
This prevents unusual mapping of supported fileextension -> mimetype
(like htm=application/x-tex)
2015-01-02 04:20:02 +01:00
Michael Peter Christen
d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
for crawling
2015-01-02 02:44:03 +01:00
Michael Peter Christen
3cd7deb3b8 do not flush non-errors to stdout because this is a concurrency issue.
the flush-call appeared very often in thread dumps with high load, so
this hopefully gives some performance gain
2014-12-28 15:48:37 +01:00
Michael Peter Christen
4e3e2acc69 Merge branch 'master' of gitorious.org:yacy/rc1-fixed_percent-encoding 2014-12-28 15:01:40 +01:00
Michael Peter Christen
ecb6a59e9e do not translate gif images into png images for thumbnails. Instead,
stream the original to the search result thumb viewer. This has two
reasons:
- animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a
known bug which is obviously not yet fixed
- animated gifs now appear in the search result also as animation
2014-12-28 14:53:55 +01:00
arucard21
3e9871291f Applied URL-decoding prior to HTML-encoding.
This removes percent-encoding from text shown in HTML
2014-12-27 09:52:34 +01:00
reger
6a04563578 Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml
so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top.
By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations
and individual addition/changes are still respected.
2014-12-27 00:10:14 +01:00
reger
51ec9c1f44 fix "null" title in response writer for documents with multivalued title 2014-12-26 18:23:26 +01:00
reger
73ba5d8ef7 adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema
- the field is not used (delete candidate)
2014-12-26 18:21:35 +01:00
reger
1f9389396a fix NPE related 500 (Bad Request) response of UrlProxy on blacklisted urls,
by adding parameter HTTPDeamon and removing unused hostAddress lookup code in sendRespondError
2014-12-25 02:21:45 +01:00
reger
f856edecb6 fix proxy redirect (http status 302) response
fixes http://mantis.tokeek.de/view.php?id=517

The url given in the bug report uses a gzip input stream which causes HTTPClient.writeto() to throw an IOException due to an incomplete input stream. This in turn prevents the 302 response to the client browser.
Limiting the serving of target content to httpstatus=200 proxies the header response, so the client browser's redirect settings can be honored.
2014-12-23 02:01:03 +01:00
Michael Peter Christen
cc090bcb01 enhanced initialization of autotagging 2014-12-23 00:37:51 +01:00
Michael Peter Christen
a0576ec737 fix for pdf sub-page result preparation 2014-12-22 14:32:09 +01:00
Michael Peter Christen
6ad43c4a8b removed debug code 2014-12-22 14:24:09 +01:00
Michael Peter Christen
407cfff010 fix to wkhtmltopdf usage 2014-12-22 02:01:55 +01:00
Michael Peter Christen
5d321d3dc5 fixes to wkhtmltopdf call 2014-12-21 20:11:39 +01:00
Michael Peter Christen
eb78388a98 changed prefer strategy for http unique in such a way that http is
preferred over https. While this is a bad idea from the standpoint of
security it is more commonly applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
2014-12-21 19:17:06 +01:00
Michael Peter Christen
9e588944fa prevent NPE during initialization of very large vocabularies 2014-12-21 19:02:36 +01:00
Michael Peter Christen
aaf7d4775a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-21 18:10:25 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute to the url get/post properties.
This will distinguish the different page entries. The search result list
will then replace the post parameter with a url anchor # mark which
causes that the original url is presented in the search result. These
URLs can be opened directly on the correct page using pdf.js which is
now built-in into firefox. That means: if you find a search hit on page
5 and click on the search result, firefox will open the pdf viewer and
show page 5.
2014-12-21 18:10:15 +01:00
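The url handling described in 8c3e5b7b6d can be sketched in two steps: build a per-page url by appending a page property, then turn that property back into an anchor for display. The Python below is a hedged illustration; the function names are hypothetical, and the exact anchor text (`#page=<n>`, the form pdf.js understands) is an assumption, since the commit only says the parameter is replaced by a "#" anchor.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def page_url(source_url, page):
    """Construct the index url for one page of a split pdf."""
    parts = urlsplit(source_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

def display_url(indexed_url):
    """Replace the page property with an anchor for the search result."""
    parts = urlsplit(indexed_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    pages = [value for key, value in query if key == "page"]
    rest = [(key, value) for key, value in query if key != "page"]
    fragment = f"page={pages[-1]}" if pages else parts.fragment
    return urlunsplit(parts._replace(query=urlencode(rest), fragment=fragment))

indexed = page_url("http://example.org/doc.pdf", 5)
print(indexed)
print(display_url(indexed))
```

Because the page number lives in the query part of the indexed url, each page gets its own index entry, while the displayed url still points at the original document.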
Michael Peter Christen
d14114697c the miss cache does not seem to work, it sometimes contains urlhashes
from documents which actually are inside the index. This can be
reproduced using the crawl result table at 
http://localhost:8090/CrawlResults.html?process=5
The cache is temporarily disabled to remove the bad behaviour, however a
later reactivation of that feature may be possible.
2014-12-21 17:31:51 +01:00
reger
deb75a1dbe fix refactored size() -> filesize() in YMarkMetadata 2014-12-21 14:02:06 +01:00
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and to not get confused with Collection.size())
2014-12-21 06:05:35 +01:00
reger
c6f634a4f2 remove redundant caching of urlhash in URIMetadataNode
(is already cached in underlying DigestURL .url)

upd pom keyword for maven-antrun-plugin
2014-12-21 03:45:54 +01:00
Michael Peter Christen
5516819354 preventing the use of no-cache and expires in case that images are
generated dynamically which will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once, but then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the servlet ViewImage again.
2014-12-19 17:41:38 +01:00
Michael Peter Christen
d3e71ed070 fixes for searches when initialization of large autotagging libraries
have not been finished
2014-12-19 17:38:58 +01:00
Michael Peter Christen
28683530cd fixes to usage of no-cache: use and recognize also the no-store
directive
2014-12-19 17:37:58 +01:00
Michael Peter Christen
c9c700b510 reduction of http requests to YaCy using the correct cache-control,
expires and last-modified headers in http response.
2014-12-19 11:51:14 +01:00
reger
13cca2b114 fix missing AppPath
upd Maven plugin versionid
2014-12-19 01:58:37 +01:00
Michael Peter Christen
65125439fe added query modifier 'on'. This makes it possible to search for date
occurrences within the (web) page documents (not the document
last-modified!). This works only if the solr field dates_in_content_sxt
is enabled. A search request may then have the form "term on:<date>",
like
gift on:24.12.2014
gift on:2014/12/24
* on:2014/12/31
For the date format you may use any kind of human-readable date
representation(!yes!) - the on:<date> parser tries to identify language
and also knows event names, like:
bunny on:eastern
.. as long as the date term has no spaces inside (use a dot). Further
enhancement will be made to accept also strings encapsulated with
quotes.
2014-12-16 13:53:12 +01:00
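As a rough illustration of the "term on:<date>" form from 65125439fe, the sketch below separates the modifier from the search terms and parses the two numeric formats shown in the examples. It is a Python sketch under stated assumptions: `parse_on` is a hypothetical name, only those two formats are handled, and event names such as the one in the last example are not recognized (an unparseable on: token yields no date).

```python
from datetime import date, datetime

# Only the two numeric formats from the examples above.
FORMATS = ("%d.%m.%Y", "%Y/%m/%d")

def parse_on(query):
    """Return (remaining search terms, date or None)."""
    terms, found = [], None
    for token in query.split():
        if token.startswith("on:"):
            for fmt in FORMATS:
                try:
                    found = datetime.strptime(token[3:], fmt).date()
                    break
                except ValueError:
                    continue
        else:
            terms.append(token)
    return " ".join(terms), found

print(parse_on("gift on:24.12.2014"))
print(parse_on("* on:2014/12/31"))
```

Splitting on whitespace also shows why the date term must not contain spaces: a date with a space would arrive as two tokens.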
Michael Peter Christen
1cfddea578 added (very experimental) Solr response writer for snapshot image
results
2014-12-16 13:18:49 +01:00
Michael Peter Christen
7287dd764e added url, date, time and page number on pdf snapshot footer 2014-12-16 12:39:10 +01:00
Michael Peter Christen
8b5d074715 fix for image parser (there is a class missing!) 2014-12-16 12:10:15 +01:00
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
3354cd63be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-15 23:32:57 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods have
been added to check the success of transactions.

The transactions for snapshots have two main components: a rss search
API to get information about latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by usage of
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed to INVENTORY, committed snapshots move to ARCHIVE,
rollback snapshots move to INVENTORY again.

Normal Workflow:
Besides all the options below, usually it is sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.
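
The normal workflow above could be scripted roughly as follows. This is a Python sketch under stated assumptions: the helper names are hypothetical, `fetch(url)` and `handle(urlhash)` are supplied by the caller, and the feed is assumed to carry the urlhash as plain text in un-namespaced <guid> elements. Only the two urls from the workflow are used.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://localhost:8090/api"  # host and port as used in this message

def feed_url(state="INVENTORY", order="LATESTFIRST"):
    return f"{BASE}/snapshot.rss?" + urlencode({"state": state, "order": order})

def commit_url(urlhash):
    return f"{BASE}/snapshot.json?" + urlencode({"command": "commit", "urlhash": urlhash})

def urlhashes(rss_text):
    """Collect the <guid> values, which carry the urlhash of each item."""
    return [g.text.strip() for g in ET.fromstring(rss_text).iter("guid") if g.text]

def process_inventory(fetch, handle):
    """fetch(url) returns the response text; handle(urlhash) does the real work."""
    committed = []
    for urlhash in urlhashes(fetch(feed_url())):
        handle(urlhash)
        fetch(commit_url(urlhash))  # moves the entry from INVENTORY to ARCHIVE
        committed.append(urlhash)
    return committed
```

A caller could pass a small wrapper around `urllib.request.urlopen` as `fetch`; checking the "result" property of each commit response is left out of this sketch.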

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY

The feed will return a <urlhash> in the <guid> - field of the rss. This
must be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE 
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number
of entries for each host. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to a specific depth. The list
then contains the same host names, but the count values change because
only documents at that specific crawl depth are counted
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated
list of the number of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host to the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands,
default is ALL which means that from both snapshot directories the host
information is collected and combined. You can use the state option for
all the commands as listed above
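
The list variants above differ only in their optional parameters, which a
small helper can assemble. This is a sketch; list_url is a hypothetical
name, and the parameters are simply left out of the URL when not given.

```python
from urllib.parse import urlencode

def list_url(host=None, depth=None, state=None,
             base="http://localhost:8090/api/snapshot.json"):
    """Build a command=list URL, adding only the options that are set."""
    params = {"command": "list"}
    if host is not None:
        params["host"] = host
    if depth is not None:
        params["depth"] = depth
    if state is not None:
        params["state"] = state
    return base + "?" + urlencode(params)

all_hosts = list_url()
one_host_at_depth = list_url(host="yacy.net.80", depth=0)
archive_only = list_url(state="ARCHIVE")
```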

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also
be restricted with state=INVENTORY or state=ARCHIVE to test whether the
document is in that snapshot directory. If a urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
containing the name of the location where the document is, either
INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside of INVENTORY, then it
could be better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
2014-12-15 23:32:46 +01:00
reger
63846ddb89 add final SolrQueryRequest.close to SolrServlet 2014-12-15 22:54:49 +01:00
reger
9edc7308aa update to metadata-extractor-2.7.0.jar
add 2 simple JUnit test cases for jpeg and tif parsing
2014-12-15 20:45:05 +01:00
Michael Peter Christen
578ae29f1e added a note that the servlet is linked using web.xml 2014-12-15 05:56:12 +01:00
reger
6c3f36def1 - fix path to default heuristic.cfg
- deprecate unused ProxyServlet
2014-12-14 21:27:45 +01:00
Michael Peter Christen
bbf0ac40c3 add the actual DateDetection class... (missed in latest commit) 2014-12-14 13:43:30 +01:00
Michael Peter Christen
66b5a56976 Added and integrated new date detection class which can identify date
notions within the fulltext of a document. This class also attempts to
identify dates that are abbreviated, lack a year, or are described
by names for special days, like 'Halloween'. If a date has
no year given, the current year and following years are considered.

This process is therefore able to identify a large set of dates for a
document, either because there are several dates given in the document
or the date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, which may also lie in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect the date is still too CPU intensive. Future
enhancements may allow this to be switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
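
How the four fields described above relate to each other can be sketched as
follows. This is not the DateDetection code: date_fields is a hypothetical
helper, the detection step itself is skipped, and plain Python date objects
stand in for whatever format the Solr fields actually store.

```python
from datetime import date

def date_fields(dates_in_order):
    """Derive the four field values from dates listed in order of appearance."""
    fields = {
        "dates_in_content_sxt": list(dates_in_order),
        "dates_in_content_count_i": len(dates_in_order),
    }
    # min/max are only set if at least one date was found
    if dates_in_order:
        fields["date_in_content_min_dt"] = min(dates_in_order)
        fields["date_in_content_max_dt"] = max(dates_in_order)
    return fields

# Hypothetical detection result for one document.
found = [date(2014, 12, 24), date(2014, 10, 31), date(2015, 10, 31)]
fields = date_fields(found)
```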
2014-12-14 13:40:45 +01:00
Michael Peter Christen
c3c2b6999b fixes on wkhtmltopdf 2014-12-14 04:03:20 +01:00
Michael Peter Christen
114f0afc1e enable sku as anchor in html response writer 2014-12-14 04:02:13 +01:00
Michael Peter Christen
aa80cb1159 enhanced tagging preparation speed which reduces initialization time for
very large vocabularies
2014-12-13 09:54:41 +01:00
Michael Peter Christen
6a1865f507 refactoring date -> lastModified 2014-12-11 23:37:41 +01:00
Michael Peter Christen
ab6cc3c88c added concurrent generation of snapshot pdfs 2014-12-10 14:10:05 +01:00
Michael Peter Christen
413eeefed4 added character set detection library from
http://www-archive.mozilla.org/projects/intl/chardet.html
2014-12-10 13:08:29 +01:00
Michael Peter Christen
7bfc5b80cb added new options to vocabulary editor:
- new switch 'isFacet' which enables or disables the usage of the
vocabulary for search facets. This should be used for large
vocabularies since searches in solr are extremely slow if facets for a
large set of alternative terms are generated
- new option to disable auto-enrichment from synonyms
- new option to add synonyms from another column when importing from csv
- automatically recognize double-occurrences in synonyms and bundling
terms for such synonyms
2014-12-10 12:20:27 +01:00
Michael Peter Christen
87b53b3572 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-09 16:20:44 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
reger
5d67e165d9 remove redundant null check in ResponseHeader.lastModified
added a JUnit testcase for ResponseHeader dates (using age()),
adjusted age() to pass all tests
2014-12-09 00:58:08 +01:00
reger
5f0bb1214f modified FieldReIndex to reindex queries with low number of documents first
by internally using a score map with number of documents as score
and working through the list from low to high.
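
The ordering idea can be illustrated with a plain dictionary as the score
map. This is a sketch only and not the FieldReIndex code; the function name
and the field names in the example are invented.

```python
def reindex_order(doc_counts):
    """Return the queries sorted from lowest to highest document count."""
    return sorted(doc_counts, key=doc_counts.get)

# Hypothetical score map: query -> number of matching documents.
counts = {"field_a:[* TO *]": 500, "field_b:[* TO *]": 3, "field_c:[* TO *]": 40}
order = reindex_order(counts)
```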
2014-12-07 04:31:09 +01:00
reger
e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory 2014-12-06 02:25:24 +01:00
reger
e5236aa7ca Merge origin/master 2014-12-06 01:44:03 +01:00
reger
70cf7060a4 coding fixes suggested in
http://mantis.tokeek.de/view.php?id=509
http://mantis.tokeek.de/view.php?id=510
2014-12-06 01:42:24 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omitted. The
meanings of the fields are:
host: select only urls from this host or all, if not given
depth: select only urls at that crawl depth or all, if not given
maxcount: select at most the given number of urls or 10, if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the first entries or ANY to select any

The rss feed needs administration rights to work, a call to this servlet
with rss extension must attach login credentials.
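
The selection semantics described above can be modelled in a few lines.
This is a sketch of the documented behaviour only, not the servlet code:
Entry and select are hypothetical names, and the entries are made-up data.

```python
from datetime import datetime
from typing import NamedTuple

class Entry(NamedTuple):
    host: str
    depth: int
    date: datetime

def select(entries, host=None, depth=None, maxcount=10, order="ANY"):
    """Filter by host and depth if given, order by date, cut to maxcount."""
    hits = [e for e in entries
            if (host is None or e.host == host)
            and (depth is None or e.depth == depth)]
    if order == "LATESTFIRST":
        hits.sort(key=lambda e: e.date, reverse=True)
    elif order == "OLDESTFIRST":
        hits.sort(key=lambda e: e.date)
    # order == "ANY" keeps whatever order the entries arrived in
    return hits[:maxcount]

entries = [
    Entry("yacy.net", 2, datetime(2014, 12, 1)),
    Entry("yacy.net", 2, datetime(2014, 12, 5)),
    Entry("example.org", 0, datetime(2014, 12, 3)),
]
latest = select(entries, host="yacy.net", depth=2, order="LATESTFIRST", maxcount=1)
```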
2014-12-06 00:25:05 +01:00
Michael Peter Christen
8b522687e0 added toString() methods to feed classes which makes it possible to
export full rss feed files out of the RSSFeed class
2014-12-06 00:18:14 +01:00
reger
568c991405 remove the unused Request variable
(fix of  prev. commit)
2014-12-05 03:03:28 +01:00
reger
d6539ba597 Merge origin/master 2014-12-05 01:15:41 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to achieve it, use request with profile to allow indexing (defaultglobaltext) and update index
   (the resource is loaded, parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameters
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
a304058840 added Image Events as another option to generate images with a mac if no
Ghostscript is available or does not work...
2014-12-04 01:21:24 +01:00
Michael Peter Christen
d83de9ecf5 added another path for the convert command because on older Macs
ImageMagick has a different installation location
2014-12-03 18:07:05 +01:00
Michael Peter Christen
226aea5914 added a servlet which can create preview images, preview thumbnails and
preview pdfs from web pages, i.e.:
http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/

This supports also an on-the-fly generation of the preview documents if
the user is an administrator. Otherwise, the servlet fails.
To enable this, you must add wkhtmltopdf, imagemagick and (on headless
servers) xvfb to your operating system.

for detailed instructions, see
97f6089a41
2014-12-03 11:45:48 +01:00
reger
28456dfc09 skip creation of unused Bluelist contenttransformer 2014-12-02 21:03:00 +01:00
Michael Peter Christen
321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
thread pools will flush their cached (dead) threads after 60 seconds.
As a result, YaCy now runs constantly with about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, so the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the cached number of threads. This caused VM supervisors
to terminate whole VM sessions if a limit was reached. Many VM providers
have limits of numproc=96 which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
2014-12-02 16:26:07 +01:00
Michael Peter Christen
7bfab5eb9d set Busy- and Blocking-Threads to daemon mode (they will now not prevent
YaCy from terminating if still running)
2014-12-02 16:05:00 +01:00
Michael Peter Christen
e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
cache using the user agent string given in the crawl profile
2014-12-02 13:35:19 +01:00
Michael Peter Christen
d5bac64421 recognize more html file types for snapshots 2014-12-02 12:52:36 +01:00
Michael Peter Christen
a1ee101079 recognize more html file extensions 2014-12-02 12:10:44 +01:00
Michael Peter Christen
8480641f2d fix to xvfb-run usage (quotes did not parse in xvfb-run, default values
are appropriate)
2014-12-02 11:51:12 +01:00
Michael Peter Christen
68b040e31e added fail-over missing http proxy service (i.e. overload) and quiet
mode
2014-12-01 18:21:52 +01:00
Michael Peter Christen
25a64c51b3 moved snapshot generation out of the html handler so that it is
not skipped when existing cache entries prevent the handler from being executed
2014-12-01 17:37:25 +01:00
Michael Peter Christen
c35170a305 more logging 2014-12-01 16:50:37 +01:00
Michael Peter Christen
e8be07ec78 grr 2014-12-01 16:38:07 +01:00
Michael Peter Christen
6f81bb756c wrap wkhtmltopdf with xvfb if necessary 2014-12-01 16:26:28 +01:00
Michael Peter Christen
0119f8665d more logging when failing to create pdf snapshot 2014-12-01 16:00:45 +01:00
Michael Peter Christen
416fe886e3 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-01 15:20:24 +01:00
Michael Peter Christen
60f27bdf49 added the property timeoutrequests to configuration to disable
TimeoutRequests. The purpose is to test if YaCy runs better on VMs where
there is a limitation of concurrent processes;  see
/proc/user_beancounters in row numproc; this value is limited and should
be low. Try to set timeoutrequests to keep this low. (works only after
restart)
2014-12-01 15:20:10 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
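
The path pattern above can be assembled as in the following sketch. The
helper name is hypothetical, and because the message does not say how the
shard and date components are formed, both are passed in as ready-made
strings; the values used in the example are placeholders.

```python
from pathlib import PurePosixPath

def snapshot_pdf_path(host, port, depth, shard, urlhash, date):
    """Join the components in the order given by the path pattern."""
    return (PurePosixPath("DATA/HTCACHE/SNAPSHOTS")
            / f"{host}.{port}"
            / str(depth)
            / shard
            / f"{urlhash}.{date}.pdf")

# "00" and "20141201" are placeholder values, not the real shard/date format.
path = snapshot_pdf_path("yacy.net", 80, 2, "00", "upiFJ7Fh1hyQ", "20141201")
```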
2014-12-01 15:03:09 +01:00
reger
ff80700aff replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal 2014-12-01 00:21:30 +01:00
Michael Peter Christen
9ea120dbe5 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-30 22:02:25 +01:00
reger
0c97cc2440 skip unused call parameter for hashSentence() 2014-11-30 19:42:33 +01:00
reger
5790c7242e skip tokenizing punctuation as word in WordTokenizer
remove unused variables in condenser related to Tokenizer
2014-11-29 17:16:05 +01:00
reger
f07392ff17 add. use host port parameter in YaCyApp 2014-11-29 15:27:16 +01:00
Michael Peter Christen
09d2867050 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-29 12:05:19 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
5f5c7d69d1 added image screenshot generator 2014-11-28 01:25:52 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
Michael Peter Christen
ff728b4aa5 ignore url errors during search 2014-11-27 20:50:55 +01:00
Michael Peter Christen
8317914ce3 changed vocabulary navigator object type to TreeMap to get a specific
order into the vocabularies. This is now lexicographic, which is less
random than a hashed order
2014-11-27 07:44:41 +01:00
Michael Peter Christen
d5c1b07768 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-26 18:07:17 +01:00
Michael Peter Christen
c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins 2014-11-26 18:01:35 +01:00
Michael Peter Christen
10794e8efd trying facet.method fc instead of fcs to handle large facets 2014-11-25 23:11:42 +01:00
Michael Peter Christen
041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-25 09:48:48 +01:00
Michael Peter Christen
f1f74e8626 toString fix 2014-11-24 20:53:40 +01:00
Michael Peter Christen
30276a2b48 prevent that a local Solr search and a local RWI search are running
concurrently. When a RWI search result is flushed into the result set,
it does Solr Queries (which replaced the old-style Metadata Queries) and
they are possibly running concurrently to a previously started Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search may give
better results using the more advanced and configurable Ranking methods,
this result is preferred over the RWI search result. However, remote RWI
search results are still fed concurrently into the search result as
well.
2014-11-24 20:53:19 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as if the cache is never stale. If
set to 'Always Fresh' the cache is always used if the entry in the cache
exists. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
reger
1e7ee72240 fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
2014-11-23 23:29:20 +01:00
reger
7d863d6254 fix empty text facet entry
(noticed on Author facet)
2014-11-23 23:12:01 +01:00
Michael Peter Christen
a39419f2ef more stacks shall be considered for on-demand loading, not only
deep-depth stacks to prevent "too many open files" problem
2014-11-23 20:11:23 +01:00
Michael Peter Christen
5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
during crawling
2014-11-23 20:09:32 +01:00
Michael Peter Christen
4920ab7b76 optimize usage of size() cache 2014-11-23 20:07:32 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded
   (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
reger
de56266bcb remove redundant toLower for topwords 2014-11-22 22:49:23 +01:00
Michael Peter Christen
a34f837592 better delete all files in path when removing host crawl stack 2014-11-22 12:09:07 +01:00
Michael Peter Christen
10b1db430a if we have many hosts, use on-demand earlier 2014-11-22 12:04:04 +01:00
Michael Peter Christen
1324927e66 prevent division by zero 2014-11-22 12:01:00 +01:00
Michael Peter Christen
2beb6abeb6 disabled crazy sleep loop 2014-11-21 14:38:54 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
Michael Peter Christen
a0b84e4def use a LinkedHashMap for facets to maintain facet order as given by solr 2014-11-20 18:44:29 +01:00
reger
ef5dc68313 include domtype to searcheventcache id
to differentiate between local / global events for reuse of cached events
fix for http://mantis.tokeek.de/view.php?id=493
2014-11-20 02:04:43 +01:00
Michael Peter Christen
0dc6e0a5f2 added option to enrich vocabularies with synonyms from synonym database 2014-11-19 18:12:43 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
c67c5c0709 added new solr schema fields which record the occurrences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
2014-11-18 15:02:34 +01:00
Michael Peter Christen
a67a465415 fix field counter for multi-fields in html writer for the solr servlet 2014-11-18 12:11:18 +01:00
Michael Peter Christen
ec9d021568 added option in vocabulary editor to import CSV files with different
encodings (preselected windows-type character encoding which is typical
for CSV files). Fixed also other problems with character encoding in
dictionary files. Automatically generated vocabularies are now also
noted in the API steering.
2014-11-17 14:22:40 +01:00
reger
3c818fc912 add a check of java version string >=1.7 to startup class
stopping start with error msg on version < 1.7
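
The comparison behind such a check can be sketched as below. This is an
illustration in Python and not the code of the startup class; the function
name is hypothetical, and only the first two numbers of the version string
are compared.

```python
import re

def java_version_ok(version, minimum=(1, 7)):
    """Compare the first two numeric parts of a version string with minimum."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:2]]
    while len(numbers) < 2:
        numbers.append(0)  # treat a missing minor part as 0
    return tuple(numbers) >= minimum

too_old = java_version_ok("1.6.0_45")
accepted = java_version_ok("1.7.0_71")
```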
2014-11-16 01:26:07 +01:00
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This allows also to
make the execution in dependency of available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
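
A load-dependent execution check could look like the following sketch. It
is not the scheduler code: may_run is a hypothetical name, and treating a
load exactly at the limit as acceptable is an assumption. Only the default
of 4.0 is taken from the message.

```python
def may_run(load, max_load=4.0):
    """Allow execution while the load does not exceed the limit (assumed <=)."""
    return load <= max_load

idle_enough = may_run(1.5)
too_busy = may_run(6.0)
```

On systems that support it, os.getloadavg()[0] would be one way to obtain
the current load value, polled once a minute.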
2014-11-14 10:02:50 +01:00
Michael Peter Christen
8aee7f940e added missing class for latest changes 2014-11-13 01:30:12 +01:00
Michael Peter Christen
97039049e4 fix in key enumeration methods for cases where the enumeration is done
in reverse order.
2014-11-13 01:15:31 +01:00
Michael Peter Christen
7e1b0b6712 fix for wildcard patch in search queries 2014-11-13 00:59:30 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which attach always the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality and the usage of YaCy as search agent for latest news has
a better quality.
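
The date rule described above reduces to taking the earlier of the two
dates. The sketch below shows that rule only; effective_date is a
hypothetical name, and the firstSeen table lookup is not modelled.

```python
from datetime import datetime

def effective_date(first_seen, last_modified):
    """Use the first-seen date if it is earlier than the document date."""
    return min(first_seen, last_modified)

# A CMS reports "now" on every crawl, but the url was first seen a year ago.
first_seen = datetime(2013, 11, 13)
reported = datetime(2014, 11, 13)
result = effective_date(first_seen, reported)
```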
2014-11-13 00:58:58 +01:00
Michael Peter Christen
421ee64f33 another fix to ordering of table indexes; fixes also network stats
graphics
2014-11-11 13:57:04 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00
reger
e4316e2d74 skip creation of local var in proxyhandler.storetocache 2014-11-09 04:17:14 +01:00
sixcooler
9c6e3a6b1c fix assertion failure in version-string for Solr-4.10.2 by changing
the assert - hope that is ok
+ add forgotten NB project changes
2014-11-07 22:43:50 +01:00
sixcooler
725b206fb4 update to solr-/lucene-4.10.2 2014-11-07 18:51:31 +01:00
Michael Peter Christen
5c97ecb30f fix of bad query generation for search facets 2014-11-07 18:11:49 +01:00
Michael Peter Christen
95d87f00b3 fix for bad query generation in doublecheck in postprocessing 2014-11-07 18:11:23 +01:00
orbiter
72c2bc5189 fix for search in case where local peer has no local seed address in
portal mode
2014-11-02 21:16:51 +01:00
orbiter
5be352da99 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-02 20:35:08 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00
Michael Peter Christen
fe8b1d137d emergency bugfix for 100% CPU in image drawing 2014-11-02 13:28:10 +01:00
Michael Peter Christen
92007e5d2d more enhancements to postprocessing speed 2014-11-02 12:52:23 +01:00
Michael Peter Christen
9a7fe9e0d1 fix for bad timing computation in postprocessing 2014-10-31 23:17:56 +01:00
Michael Peter Christen
bd16119a00 another fix for postprocessing (the query for "" on numeric field did
not work in external solr)
2014-10-31 17:44:45 +01:00
Michael Peter Christen
327e83bfe7 more fixes in postprocessing: partitioning of the complete queue to
enable smaller queries
2014-10-31 17:30:24 +01:00
orbiter
2bc6199408 more concurrency for postprocessing 2014-10-30 21:52:52 +01:00
orbiter
a83cf26c38 more fixes and enhancements to postprocessing 2014-10-30 20:53:57 +01:00
orbiter
71758f0d62 enhanced postprocessing by usage of a field-list generation to prevent
lazy initialization of the documents. This is useful because the
documents must be read completely anyway.
2014-10-30 18:05:48 +01:00
orbiter
7856fbdbe8 fix for npe (in rare cases) 2014-10-30 15:20:35 +01:00
orbiter
8a2b569d7c fix for literal computation 2014-10-30 15:01:27 +01:00
orbiter
856da2712b Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-29 16:53:18 +01:00
orbiter
ca9cd7b58a more IPv6 fixes 2014-10-29 16:52:58 +01:00
Michael Peter Christen
b4585e9546 added new index size history image in /Status.html page 2014-10-29 13:37:44 +01:00
Michael Peter Christen
167c5a51f0 IPv6 fix 2014-10-28 15:36:13 +01:00
Michael Peter Christen
fe537679de fix for exact_signature_unique_b, exact_signature_copycount_i,
fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same
criteria for 'valid document' as for title and description uniqueness
test.
2014-10-24 15:04:40 +02:00
sixcooler
eb9d2705d2 fix for ConnectionInfo.cleanup of server-connections 2014-10-22 11:25:07 +02:00
Michael Peter Christen
2e5214eb21 added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
2014-10-17 14:17:49 +02:00
Michael Peter Christen
11074d8d24 fix for an ssl bug that appears only in java 7.
The bug was reported in
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5407&p=30956#p30956
a solution was described in
http://teknosrc.com/javax-net-ssl-sslprotocolexception-handshake-alert-unrecognized_name-solved/
which worked for this example given in the yacy forum
2014-10-17 13:25:17 +02:00
Michael Peter Christen
e96490e3a1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-17 12:51:35 +02:00