yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	0710648c31	enable api calls with very long urls	2015-05-11 14:42:21 +02:00
reger	1481a8ab56	add opensearch rss results to dht collection (due to text = snippet) which is used to differentiate meta from full data - make sure check for dht is not dependent on number of collection entries	2015-05-10 18:52:33 +02:00
Michael Peter Christen	fbf85a1561	added temporary debug output in http client	2015-05-08 15:31:01 +02:00
Michael Peter Christen	ff29b0e503	added option to re-index exported xml snapshot dumps to HTCACHE/snapshots by just placing them in the SURROGATES/in path	2015-05-08 15:30:26 +02:00
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, are detected. The time zone of the search request is detected automatically using the browser's time zone offset which is delivered to the search request automatically and invisibly to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes have been made which correct the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	2015-04-15 13:17:23 +02:00
Michael Peter Christen	b060ba900d	added parsing of contentprop attribute in html tags for content='startDate' and content='endDate'. The values of these fields are now written to new solr fields startDates_dts and endDates_dts.	2015-04-13 16:20:00 +02:00
Michael Peter Christen	ae02c92fd0	logging fix	2015-04-09 14:21:23 +02:00
Michael Peter Christen	5651713134	better debugging of fq	2015-04-07 17:02:02 +02:00
reger	b1ec0644e5	fix NPE in location search on missing/empty PubDate in underlying rss data	2015-03-31 02:20:13 +02:00
reger	839b962c20	correct percent encoding for '%' char	2015-03-28 03:05:21 +01:00
reger	2ef8ffdb60	apply UTF-8 encoding copied from escape()	2015-03-15 06:02:45 +01:00
reger	7120ea42f1	fix for path with char code > 255 (causing index out of bounds exception) + test case for it	2015-03-15 03:37:32 +01:00
reger	1d81bd0687	fix url encoding for path see http://mantis.tokeek.de/view.php?id=559 So far we used same escape procedure for all parts of the url (which includes x-www-form-urlencoded for all url components) Added capability to use different encoding rules for the different url components (through specific bitset for each component). (this is inspired by org.apache.http.client and java.net.uri implementation). - Added test case for http://mantis.tokeek.de/view.php?id=559	2015-03-15 00:46:07 +01:00
reger	62087fb8b2	fix MultiProtocolURL mailto protocol detection	2015-03-13 02:02:53 +01:00
reger	f94e34058c	fix url (path) %-decoding http://mantis.tokeek.de/view.php?id=519 - add test case for this	2015-03-11 01:05:14 +01:00
Michael Peter Christen	710a0efa1b	generalized time period computations	2015-03-02 12:55:31 +01:00
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search result, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers have been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (Saturday and Sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set an on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	2015-03-02 04:30:10 +01:00
reger	9b0de2de64	introduce getQueryFields to return default query fields (query parameter QF) calculated from boostfields config, making sure title, description, keywords and content are always searched. - apply change to solrServlet, making sure every remote query uses at least all locally defined boost fields for search - apply to local solr search - simplify select query by using QF defaults	2015-02-23 23:12:07 +01:00
reger	8ec1db76ee	url unescape add check for inconsistent utf8 multibyte parsing If the url contains special chars (like umlauts äöü) it's interpreted as multibyte char and actually not converted at all (removed). Added a check if the multibyte conversion is not complete, just add the char as is. This fixes http://mantis.tokeek.de/view.php?id=200	2015-02-20 02:21:04 +01:00
reger	f0a5188e11	replace deprecated HTTPClient setStaleConnectionCheckEnabled with setValidateAfterInactivity()	2015-02-15 23:09:01 +01:00
reger	7b569d2dbe	replace deprecated HTTPClient ALLOW_ALL_HOSTNAME_VERIFIER with NoopHostnameVerifier()	2015-02-15 21:34:01 +01:00
reger	eda0aeaf26	allow/recognize host in file: protocol crawl target This is useful in intranet indexing while crawling an intranet file server accessed via hostname while e.g. under Windows mapped to different drive letters on individual clients. Here you can crawl e.g. file://fileserver/documents having a valid uri in that intranet environment (while e.g. P:/documents might be client dependent).	2015-02-11 23:26:39 +01:00
Michael Peter Christen	8ff76f8682	the cleanup process experienced a 100% CPU load situation and the loop did not terminate: Occurrences: 100 at java.util.HashMap$KeyIterator.next(HashMap.java:956) at net.yacy.cora.protocol.ConnectionInfo.cleanup(ConnectionInfo.java:300) at net.yacy.cora.protocol.ConnectionInfo.cleanUp(ConnectionInfo.java:293) at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2212) at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:105) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:215) This tries to fix the problem; the problem should be monitored	2015-02-10 08:43:45 +01:00
Michael Peter Christen	6578ff3ddb	enhanced suggest function	2015-02-09 18:45:07 +01:00
reger	fe6f5a395d	fix Umlaut handling in blekko heuristic search term http://mantis.tokeek.de/view.php?id=169 observation: blekko seems to block xxxbot agents (=0 results)	2015-02-08 23:40:33 +01:00
reger	c454ef69c6	add shortMemory check to heuristic search and skip operation on shortMemory (no request to remote opensearch systems)	2015-02-03 03:08:34 +01:00
reger	9e1ec5fec4	refactor: just some more usages of constant for term ":[* TO *]"	2015-02-01 04:26:33 +01:00
Michael Peter Christen	b5ac29c9a5	added an html field scraper which reads text from html entities of a given css class and extends a given vocabulary with a term consisting of the text content of the html class tag. Additionally, the term is included into the semantic facet of the document. This allows the creation of faceted search for documents without the pre-creation of vocabularies; instead, the vocabulary is created on-the-fly, possibly for use in other crawls. If any of the term scraping for a specific vocabulary is successful on a document, this vocabulary is excluded for auto-annotation on the page. To use this feature, do the following: - create a vocabulary on /Vocabulary_p.html (if not existent) - in /CrawlStartExpert.html you will now see the vocabularies as columns in a table. The second column provides text fields where you can name the class of html entities where the literal of the corresponding vocabulary shall be scraped out - when doing a search, you will see the content of the scraped fields in a navigation facet for the given vocabulary	2015-01-30 13:20:56 +01:00
Michael Peter Christen	1cb290170e	refactoring of autotagging code (combined same code pieces)	2015-01-29 11:39:47 +01:00
Michael Peter Christen	c3b55455fc	enhanced initialization speed of vocabularies by using better normalization and by removal of unused data structures	2015-01-29 02:45:32 +01:00
Michael Peter Christen	de3e373913	using precompiled CommonPattern.TAB for split	2015-01-29 02:22:28 +01:00
Michael Peter Christen	a8a2b7a803	persistency for vocabulary facet switch	2015-01-29 02:16:42 +01:00
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	2015-01-29 01:46:22 +01:00
Michael Peter Christen	ac19690d30	refactoring with CommonPattern.COMMA	2015-01-29 01:35:28 +01:00
Michael Peter Christen	b5a55c8b3d	fix for wkhtmltopdf (custom header does not work)	2015-01-28 17:45:25 +01:00
Michael Peter Christen	bee5ee7cce	removed some warnings	2015-01-27 17:00:20 +01:00
Michael Peter Christen	783cf6fbc7	the LinkedBlockingQueue is much faster than the ArrayBlockingQueue (strange but this is the result of a test: ArrayBlockingQueue: 39461 lines / second; LinkedBlockingQueue: 60774 lines / second)	2015-01-27 16:53:09 +01:00
Michael Peter Christen	6390454652	fix for vocabulary on/off setting	2015-01-27 16:24:27 +01:00
Michael Peter Christen	dc5700148f	update to latest code changes from json.org	2015-01-24 07:10:14 +01:00
Michael Peter Christen	7db2888336	fixed font size and print page generation in pdf snapshots	2015-01-20 17:14:14 +01:00
reger	24f68a4eb7	refactor opensearch heuristic introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors, which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector. The manager now enforces a min 15s delay between calls to external systems. Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation. default heuristicopensearch.conf: - openbdb.com removed - no longer seems to deliver results - config via solrconnector to datacite.org added (large technical library archive)	2015-01-19 03:30:35 +01:00
Michael Peter Christen	b07afbc115	a test with http://validator.w3.org/feed/#validate_by_input shows that the time format was wrong; we must use RFC-822	2015-01-09 16:45:43 +01:00
reger	c156548efe	add info text to metadata page (htmlresponsewriter) on no documents found	2015-01-04 02:59:21 +01:00
reger	51ec9c1f44	fix "null" title in response writer for documents with multivalued title	2014-12-26 18:23:26 +01:00
Michael Peter Christen	cc090bcb01	enhanced initialization of autotagging	2014-12-23 00:37:51 +01:00
Michael Peter Christen	a0576ec737	fix for pdf sub-page result preparation	2014-12-22 14:32:09 +01:00
Michael Peter Christen	407cfff010	fix to wkhtmltopdf usage	2014-12-22 02:01:55 +01:00
Michael Peter Christen	5d321d3dc5	fixes to wkhtmltopdf call	2014-12-21 20:11:39 +01:00
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible.	2014-12-21 17:31:51 +01:00
Michael Peter Christen	1cfddea578	added (very experimental) Solr response writer for snapshot image results	2014-12-16 13:18:49 +01:00
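
Commit 1d81bd0687 in the table above describes replacing a single escape procedure with per-component encoding rules, each expressed as a bitset of characters that may stay unescaped. The following is a minimal sketch of that idea only, not the YaCy implementation: the class name `ComponentEscaper`, the constants, and the exact character sets (loosely following RFC 3986) are assumptions made for illustration. Unlike a production encoder, it does not try to preserve sequences that are already percent-encoded, so a literal '%' always becomes "%25".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration; none of these names come from the YaCy source.
final class ComponentEscaper {
    // Characters that never need escaping in any component.
    static final BitSet UNRESERVED = new BitSet(128);
    // Characters allowed to stay literal inside a path.
    static final BitSet PATH_SAFE = new BitSet(128);
    // Characters allowed to stay literal inside a query value ('&', '=', '+' are escaped).
    static final BitSet QUERY_SAFE = new BitSet(128);

    static {
        for (char c = 'a'; c <= 'z'; c++) UNRESERVED.set(c);
        for (char c = 'A'; c <= 'Z'; c++) UNRESERVED.set(c);
        for (char c = '0'; c <= '9'; c++) UNRESERVED.set(c);
        for (char c : "-._~".toCharArray()) UNRESERVED.set(c);

        PATH_SAFE.or(UNRESERVED);
        for (char c : "/:@!$&'()*+,;=".toCharArray()) PATH_SAFE.set(c);

        QUERY_SAFE.or(UNRESERVED);
        for (char c : "/?:@!$'()*,;".toCharArray()) QUERY_SAFE.set(c);
    }

    // Percent-encodes every UTF-8 byte of s that is not marked safe for the component.
    static String escape(String s, BitSet safe) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // bytes above 127 are never in the safe set
            if (safe.get(v)) {
                sb.append((char) v);
            } else {
                sb.append('%');
                sb.append(Character.toUpperCase(Character.forDigit(v >> 4, 16)));
                sb.append(Character.toUpperCase(Character.forDigit(v & 0xF, 16)));
            }
        }
        return sb.toString();
    }
}
```

With these assumed sets, `ComponentEscaper.escape("a&b=c", ComponentEscaper.PATH_SAFE)` leaves the string unchanged, while the same input with `QUERY_SAFE` escapes both delimiters. Working on UTF-8 bytes rather than chars also avoids indexing a lookup table with a char code above 255, the failure mode mentioned in commit 7120ea42f1.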


1040 Commits
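
Commit b07afbc115 in the table above notes that feed validation requires RFC-822 timestamps. As a rough sketch of what producing such a timestamp can look like in Java (the class name `Rfc822Dates` and the exact pattern are assumptions, not taken from the YaCy source), the date can be formatted with an English locale, a four-digit year, and a numeric zone offset:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not part of the YaCy code base.
final class Rfc822Dates {
    // Formats a date as e.g. "Thu, 01 Jan 1970 00:00:00 +0000".
    static String format(Date date) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
        // Locale.US keeps day and month names in English regardless of the host locale.
        SimpleDateFormat f = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(date);
    }
}
```

Formatting in UTC matches the convention stated in commit fed26f33a8 that all dates are handled as UTC/GMT.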