yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	28683530cd	fixes to usage of no-cache: use and recognize also the no-store directive	2014-12-19 17:37:58 +01:00
reger	13cca2b114	fix missing AppPath upd Maven plugin versionid	2014-12-19 01:58:37 +01:00
Michael Peter Christen	65125439fe	added query modifier 'on'. This makes it possible to search for date occurrences within the (web) page documents (not the document last-modified!). This works only if the solr field dates_in_content_sxt is enabled. A search request may then have the form "term on:<date>", like: "gift on:24.12.2014", "gift on:2014/12/24", "* on:2014/12/31". For the date format you may use any kind of human-readable date representation(!yes!) - the on:<date> parser tries to identify language and also knows event names, like: "bunny on:eastern" .. as long as the date term has no spaces inside (use a dot). Further enhancement will be made to accept also strings encapsulated with quotes.	2014-12-16 13:53:12 +01:00
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	2014-12-16 12:09:57 +01:00
Michael Peter Christen	66b5a56976	Added and integrated new date detection class which can identify date notions within the fulltext of a document. This class attempts to identify also dates given abbreviated or with missing year or described with names for special days, like 'Halloween'. In case that a date has no year given, the current year and following years are considered. This process is therefore able to identify a large set of dates for a document, either because there are several dates given in the document or the date is ambiguous. Four new Solr fields are used to store the parsing result: - dates_in_content_sxt: if date expressions can be found in the content, these dates are listed here in order of the appearances - dates_in_content_count_i: the number of entries in dates_in_content_sxt - date_in_content_min_dt: if dates_in_content_sxt is filled, this contains the oldest date from the list of available dates - date_in_content_max_dt: if dates_in_content_sxt is filled, this contains the youngest date from the list of available dates, which may also be in the future. These fields are deactivated by default because the evaluation of regular expressions to detect the date is yet too CPU intensive. Maybe future enhancements will cause that this is switched on by default. The purpose of these fields is the creation of calendar-like search facets, to be implemented next.	2014-12-14 13:40:45 +01:00
Michael Peter Christen	6a1865f507	refactoring date -> lastModified	2014-12-11 23:37:41 +01:00
Michael Peter Christen	7bfc5b80cb	added new options to vocabulary editor: - new switch 'isFacet' which enables or disables the usage of the vocabulary for search facets. This shall be used for large vocabularies since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundling terms for such synonyms	2014-12-10 12:20:27 +01:00
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	2014-12-09 16:20:34 +01:00
reger	5f0bb1214f	modified FieldReIndex to reindex queries with low number of documents first by internally using a score map with number of documents as score and working through the list from low to high.	2014-12-07 04:31:09 +01:00
reger	e52370728a	fix startup stop on missing HTCACHE/SNAPSHOT directory	2014-12-06 02:25:24 +01:00
reger	70cf7060a4	coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510	2014-12-06 01:42:24 +01:00
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameters - number of anchors of the parent - forkfactor sum of anchors of all ancestors	2014-12-05 01:13:37 +01:00
Michael Peter Christen	60f27bdf49	added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart)	2014-12-01 15:20:10 +01:00
Michael Peter Christen	97f6089a41	YaCy can now create web page snapshots as pdf documents which can later be transcoded into jpg for image previews. To create such pdfs you must do: Add wkhtmltopdf and imagemagick to your OS, which you can do: On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from http://wkhtmltopdf.org/downloads.html and download http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip In Debian do "apt-get install wkhtmltopdf imagemagick" Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and "Always Fresh" - this is used by wkhtmltopdf to fetch web pages using the YaCy proxy. Using "Always Fresh" it is possible to get all pages from the proxy cache. Finally, you will see a new option when starting an expert web crawl. You can set a maximum depth for crawling which should cause a pdf generation. The resulting pdfs are then available in DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf	2014-12-01 15:03:09 +01:00
reger	0c97cc2440	skip unused call parameter for hashSentence()	2014-11-30 19:42:33 +01:00
Michael Peter Christen	ad0da5f246	added new web page snapshot infrastructure which will lead to the ability to have web page previews in the search results. (This is a stub, no function available with this yet...)	2014-11-29 11:56:32 +01:00
Michael Peter Christen	1d45d9405a	security bugfix	2014-11-28 01:19:01 +01:00
Michael Peter Christen	ff728b4aa5	ignore url errors during search	2014-11-27 20:50:55 +01:00
Michael Peter Christen	8317914ce3	changed vocabulary navigator object type to TreeMap to get a specific order into the vocabularies. The order is now lexicographic, which is less random than a hashed order	2014-11-27 07:44:41 +01:00
Michael Peter Christen	041b605cfe	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-11-25 09:48:48 +01:00
Michael Peter Christen	30276a2b48	prevent that a local Solr search and a local RWI search are running concurrently. When a RWI search result is flushed into the result set, it does Solr Queries (which replaced the old-style Metadata Queries) and they are possibly running concurrently to a previously started Solr search. Both methods may block each other with IO. To enhance the speed, they are now serialized. Because the Solr search may produce better results using the more advanced and configurable Ranking methods, this result is preferred over the RWI search result. However, remote RWI search results are still fed concurrently into the search result as well.	2014-11-24 20:53:19 +01:00
reger	1e7ee72240	fix path lookup to ./defaults/yacy.badwords (fix of commit `ee277b9b3e`)	2014-11-23 23:29:20 +01:00
reger	ee277b9b3e	allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/) if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default) move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory	2014-11-23 05:22:23 +01:00
reger	de56266bcb	remove redundant toLower for topwords	2014-11-22 22:49:23 +01:00
Michael Peter Christen	70f03f7c8e	do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails.	2014-11-20 18:45:27 +01:00
reger	ef5dc68313	include domtype to searcheventcache id to differentiate between local / global events for reuse of cached events fix for http://mantis.tokeek.de/view.php?id=493	2014-11-20 02:04:43 +01:00
Michael Peter Christen	6a2a669db4	added loading of the synonyms file from addon/synonyms into the knowledge loader	2014-11-19 17:36:56 +01:00
Michael Peter Christen	c67c5c0709	added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, i.e. if a document contains words from a specific vocabulary, boost it.	2014-11-18 15:02:34 +01:00
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	2014-11-14 16:34:55 +01:00
Michael Peter Christen	68e8039fd1	added high-precision scheduler for API processes. This allows also to make the execution in dependency of available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute.	2014-11-14 10:02:50 +01:00
Michael Peter Christen	7e1b0b6712	fix for wildcard patch in search queries	2014-11-13 00:59:30 +01:00
Michael Peter Christen	0a879c98e7	added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when a url was first seen. This is then used to overwrite the modification date for urls upon recrawl in case that the first-seen date is before the latest document date. This behaviour is necessary due to the common behaviour of content management systems which attach always the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date in case that the crawler starts frequently for the same domain. As a result the search results ordered by date have a much better quality and the usage of YaCy as search agent for latest news has a better quality.	2014-11-13 00:58:58 +01:00
sixcooler	9c6e3a6b1c	fix assertion failure in version-string for Solr-4.10.2 by changing the assert - hope that is ok + add forgotten NB project changes	2014-11-07 22:43:50 +01:00
sixcooler	725b206fb4	update to solr-/lucene-4.10.2	2014-11-07 18:51:31 +01:00
Michael Peter Christen	5c97ecb30f	fix of bad query generation for search facets	2014-11-07 18:11:49 +01:00
Michael Peter Christen	95d87f00b3	fix for bad query generation in doublecheck in postprocessing	2014-11-07 18:11:23 +01:00
orbiter	5be352da99	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-11-02 20:35:08 +01:00
orbiter	0fcd8097a3	removed unused options from BusyThreads	2014-11-02 20:08:49 +01:00
Michael Peter Christen	92007e5d2d	more enhancements to postprocessing speed	2014-11-02 12:52:23 +01:00
Michael Peter Christen	9a7fe9e0d1	fix for bad timing computation in postprocessing	2014-10-31 23:17:56 +01:00
Michael Peter Christen	bd16119a00	another fix for postprocessing (the query for "" on numeric field did not work in external solr)	2014-10-31 17:44:45 +01:00
Michael Peter Christen	327e83bfe7	more fixes in postprocessing: partitioning of the complete queue to enable smaller queries	2014-10-31 17:30:24 +01:00
orbiter	71758f0d62	enhanced postprocessing by usage of a field-list generation to prevent lazy initialization of the documents. This is useful because the documents must be read completely anyway.	2014-10-30 18:05:48 +01:00
Michael Peter Christen	fe537679de	fix for exact_signature_unique_b, exact_signature_copycount_i, fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same criteria for 'valid document' as for title and description uniqueness test.	2014-10-24 15:04:40 +02:00
Michael Peter Christen	2e5214eb21	added field postprocessing.partialUpdate to settings which can be used to switch on or off partial updates. Both options should cause the same result. Default is on.	2014-10-17 14:17:49 +02:00
Michael Peter Christen	77662e08e1	concurrently initialize the error cache; extended also the cache by factor 10 up to 1000 entries. This error cache is only used to catch up paused crawls between shutdown+startup	2014-10-17 12:45:26 +02:00
Michael Peter Christen	07c5b57953	removed warnings	2014-10-15 11:19:25 +02:00
Michael Peter Christen	2e09da9832	npe fix	2014-10-14 12:48:15 +02:00
Michael Peter Christen	d80418f1b1	added partial updates to solr during postprocessing: during postprocessing the solr documents are now not completely retrieved. Instead, only the fields needed for the postprocessing are extracted. When Solr documents are written, this is done using partial updates. This increases postprocessing speed by about 50% for embedded Solr configurations. For external Solr configurations the enhancement should be much higher because the postprocessing with remote Solr is very slow. When doing partial updates to a remote Solr, this method should perform much better than before, it is expected that this is even much higher than the increase with local Solr.	2014-10-14 12:19:59 +02:00
Michael Peter Christen	b1cfbc4a04	added new solr field url_paths_count_i which can be used to enhance the index browser and maybe also for ranking; possibly also for SEO-with-YaCy applications.	2014-10-13 23:51:19 +02:00
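
The first-seen rule described in commit 0a879c98e7 above (use the first-seen date in place of the reported modification date when it is earlier) can be illustrated with a minimal sketch. This is not the YaCy implementation: the class `FirstSeenSketch`, the method `effectiveDate` and the in-memory map are hypothetical stand-ins for the 'firstSeen' database table.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class FirstSeenSketch {
    // Hypothetical in-memory stand-in for the 'firstSeen' table: URL hash -> time first seen.
    private final Map<String, Instant> firstSeen = new HashMap<>();

    /**
     * Remembers crawlTime the first time a URL is seen and returns the date to index:
     * the first-seen date if it lies before the reported modification date,
     * otherwise the reported modification date.
     */
    public Instant effectiveDate(String urlHash, Instant reportedLastModified, Instant crawlTime) {
        Instant first = firstSeen.putIfAbsent(urlHash, crawlTime);
        if (first == null) {
            first = crawlTime; // URL was not known before this crawl
        }
        return first.isBefore(reportedLastModified) ? first : reportedLastModified;
    }

    public static void main(String[] args) {
        FirstSeenSketch table = new FirstSeenSketch();
        Instant firstCrawl = Instant.parse("2014-11-01T00:00:00Z");
        Instant recrawl = Instant.parse("2014-11-13T00:00:00Z");
        // First crawl: the document reports an older date, which is kept.
        System.out.println(table.effectiveDate("Q3dQopFh1hyQ", Instant.parse("2014-10-20T00:00:00Z"), firstCrawl));
        // Recrawl: the CMS stamps the current date, so the first-seen date wins.
        System.out.println(table.effectiveDate("Q3dQopFh1hyQ", recrawl, recrawl));
    }
}
```

In this sketch a content management system that always attaches the current date cannot push a document's date later than the moment the URL was first crawled.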

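The low-to-high ordering mentioned in commit 5f0bb1214f above (reindex queries with few documents first) might look roughly like the following sketch. It assumes the document counts are already known; `LowCountFirstSketch`, `orderByDocCount` and the field names in `main` are hypothetical and are not taken from FieldReIndex.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LowCountFirstSketch {
    /** Returns the keys of docCounts ordered by ascending document count. */
    public static List<String> orderByDocCount(Map<String, Integer> docCounts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docCounts.entrySet());
        entries.sort(Map.Entry.comparingByValue()); // lowest score (document count) first
        List<String> ordered = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            ordered.add(e.getKey());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical field names and counts, for illustration only.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("field_a", 500);
        counts.put("field_b", 3);
        counts.put("field_c", 40);
        System.out.println(orderByDocCount(counts));
    }
}
```

Working through the cheapest queries first means small reindex jobs finish early instead of waiting behind a large one.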

1048 Commits