yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	acc1f8a749	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-11-07 12:01:37 +01:00
Michael Peter Christen	81d9e23532	fixed another memory leak in the PDF parser: the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space which cannot be cleaned if PDFont.clearResources is called. The attempt to clean the class cache therefore causes the class to be loaded and its cache to be initialized with some rubbish. I tried to avoid instantiating this class by using a hacked findLoadedClass call to the SystemClassLoader (which is protected ...). Now, without using the PDF parser at all, 8MB of RAM space is not occupied; however, when the first PDF arrives this space will be taken and never given back to GC. WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!	2013-11-07 11:57:01 +01:00
Michael Peter Christen	c152d996e6	reduced footprint of BookmarksDB which can take quite a lot of memory if the number of bookmarks is high (i.e. > 2000 URLs)	2013-11-07 10:55:02 +01:00
Michael Peter Christen	81bb50118e	found and fixed a huge memory leak in solr caching (inside Solr). The not-flushed Solr cache is now handled in this way: - it is smaller by default - a Solr-internal process is started to flush the cache periodically (this does NOT clean the cache, just removes old objects) - a Solr-external process (the standard YaCy cleanup-process) now has direct access to the Solr-internal cache and flushes it completely. The time frame for such a flush is defined by the cleanup-process frequency, by default 10 minutes.	2013-11-07 10:01:44 +01:00
reger	7b17cdf6dd	add content_type:image/* to image search - see numerous idx entries with content_type image without url_file_ext_s (for various reasons) which should be included in result - try it yourself with following sample query /solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type - also addresses possible urls with no extension or a deviating one.	2013-11-07 03:11:03 +01:00
sixcooler	987f410011	URL-export:add query and fix for cast-class-exception	2013-11-06 19:22:26 +01:00
Michael Peter Christen	a8253ca49c	added missing unicode transformation in href link contents during parsing	2013-11-06 18:05:02 +01:00
Michael Peter Christen	0cf9e9580b	added clickdepth and CR computation debug code to verify that the process is complete	2013-11-06 15:01:40 +01:00
Michael Peter Christen	234a974955	load images only if their parser flag is activated	2013-11-04 11:59:28 +01:00
Michael Peter Christen	b2c329929f	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-11-04 10:18:52 +01:00
Michael Peter Christen	60187a4ec2	fix in html parser	2013-11-04 10:16:20 +01:00
Michael Peter Christen	e1c1e57877	less overhead calling exist() with only one hash	2013-11-04 09:37:31 +01:00
reger	3d5d366f1c	fix html header in Solr HTMLResponseWriter - move 1st body content after </head> tag - add closing <span> tag	2013-11-04 03:12:02 +01:00
Michael Peter Christen	5a02d650ee	avoid cloning	2013-11-03 18:31:50 +01:00
Michael Peter Christen	cc39667399	Speed enhancements and less CPU usage during Solr searches when using the embedded Solr (the default). This was obtained by circumventing the solrj search encapsulation and implementing direct index access methods to Solr. The effect will not only be seen during search; it also has a strong effect on suggestions (much more) and reduces CPU usage during index distribution (which needs many search requests)	2013-11-01 17:24:36 +01:00
Michael Peter Christen	434e13b46d	in host browser also show the properties of failed documents including referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)	2013-11-01 13:30:53 +01:00
reger	69599566f9	catch one more malformed url in proxy url rewrite	2013-10-27 04:42:33 +01:00
reger	605530fec5	catch proxy url rewrite exception malformed url (" http:\/\/" ) may cause error response testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test	2013-10-27 04:06:11 +01:00
Michael Peter Christen	9bb7eab389	hacks to prevent storage of data longer than necessary during search and some speed enhancements. This should reduce the memory usage during heavy-load search a bit.	2013-10-25 15:05:30 +02:00
orbiter	3c3cb78555	- removed a lot of garbage and bloated code from GuiHandler. - transformed log lines to String before they are stored because the storage space is about 1:250 (45kb for one line before transformation, 180 bytes afterwards) - this saves up to 10MB RAM so we can increase the number of lines to 1000 again.	2013-10-24 20:42:34 +02:00
Michael Peter Christen	5afa6e3aee	Automatically flush the log cache if a short memory status is reached. For the default of 200 lines this can flush about 10MB.	2013-10-24 17:39:50 +02:00
Michael Peter Christen	030d0776ff	Enhanced crawl start for very, very large crawl lists (i.e. > 5000) which had a problem because of badly used concurrency. This fix also caused a redesign of the whole host deletion process. This should fix bug http://bugs.yacy.net/view.php?id=250	2013-10-24 16:20:20 +02:00
Michael Peter Christen	6aabc4e5c8	reduced logging line memory, 10000 lines had filled up 450MB! grrr. (thank you, a bomb from the past)	2013-10-24 16:17:53 +02:00
Michael Peter Christen	1a8783147b	enhanced computation of number of solr documents.	2013-10-24 15:48:05 +02:00
Michael Peter Christen	4948c39e48	added concurrency for mass crawl check	2013-10-23 11:27:19 +02:00
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together now make it possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	2013-10-23 00:16:54 +02:00
Michael Peter Christen	82621bead0	When bootstrapping, always accept one seedlist file without checking the date of the file. This should help to start the peer in case the user has a completely wrong date setting.	2013-10-22 15:34:51 +02:00
Michael Peter Christen	691d7e70fa	added hint to development/commit rss feed	2013-10-21 15:16:29 +02:00
orbiter	20bbde8665	fix for mustmatch regex computation: result had correct semantic, but may have contained multiple same expressions within the disjunction of domain-restrictions. This fix removes the redundant restrictions and makes the regex shorter.	2013-10-18 13:55:37 +02:00
Michael Peter Christen	c833d02cf5	fixed webgraph postprocessing (did nothing and repeated to do this...)	2013-10-16 11:49:04 +02:00
Michael Peter Christen	74d0256e93	enhanced postprocessing: fixed bugs, enable proper postprocessing also without the harvestingkey, remove crawl profiles after postprocessing, speed-up for clickdepth computation.	2013-10-16 11:27:06 +02:00
Michael Peter Christen	7b69c438f7	more methods for the table class	2013-10-15 16:46:59 +02:00
Michael Peter Christen	820b896146	Replaced the iframe loading from yacy.net for donations with the loading of this iframe from the local host. To make this more flexible, this iframe is loaded once after startup from yacy.net.	2013-10-15 16:46:06 +02:00
reger	0d4efabaa8	fix YaCy version string in proxy headers (config parameter vString no longer used)	2013-10-13 17:56:53 +02:00
sixcooler	d9a02ed277	NPE fix for my last commit	2013-10-11 00:44:04 +02:00
sixcooler	61f627eb85	fix for ssl-connections from proxy-usage staying in close-wait-state + some extra 'close' in HttpClient	2013-10-10 20:57:37 +02:00
Michael Peter Christen	d328cc4a83	fix for didyoumean, added also more asian alphabets	2013-10-09 16:17:50 +02:00
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	2013-10-09 15:10:03 +02:00
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	2013-10-08 23:48:13 +02:00
orbiter	6efa7532d2	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-10-08 19:04:57 +02:00
orbiter	5f5a97bafc	added the anchor text within web pages to the searchable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	2013-10-08 18:41:07 +02:00
orbiter	705b3338ee	list more fields available for search and for ranking boosts	2013-10-08 18:15:35 +02:00
sixcooler	d536092fe4	fix for falsely filling the NAME_CACHE_MISS DNS cache in case of a timeout, e.g. caused by massive requests when crawling from a file	2013-10-08 18:02:42 +02:00
Michael Peter Christen	78e7aadb26	removed unused initialization method	2013-10-07 23:51:28 +02:00
Michael Peter Christen	4fbc4740df	removed warnings	2013-10-07 23:41:50 +02:00
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	2013-10-07 17:09:40 +02:00
Michael Peter Christen	ef31d0f279	fix for rss reader, see http://bugs.yacy.net/view.php?id=294	2013-10-07 12:59:54 +02:00
Michael Peter Christen	101a6e6e14	Patch the citation index for links with canonical tags. This shall fulfill the following requirement: If a document A links to B and B contains a 'canonical C', then the citation rank computation shall consider that A links to C and B does not link to C. To do so, we first must collect all canonical links, find all references to them, get the anchor list of the documents and patch the citation reference of these links.	2013-10-07 11:15:58 +02:00
reger	fd119deb00	fix NPE on modified since check ( Response.requestHeader allowed to be null)	2013-09-30 02:50:53 +02:00
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	2013-09-27 16:57:05 +02:00
Michael Peter Christen	a52f3a597e	fix for canonical-from-http-header feature	2013-09-27 15:09:04 +02:00
Michael Peter Christen	2dd7c5be44	added parsing of http-canonical tags (untested, could not find an example page)	2013-09-27 13:17:50 +02:00
Michael Peter Christen	4476dea5ba	do not fail if a wrong boost key is used; instead, print only a warning See also: http://bugs.yacy.net/view.php?id=293	2013-09-27 12:28:09 +02:00
Michael Peter Christen	3bf0104199	fix for crawl domain counter limitation (limit was reached too early)	2013-09-26 13:41:52 +02:00
Michael Peter Christen	82bfd9e00a	- crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue.	2013-09-26 10:22:31 +02:00
Michael Peter Christen	1b3d26dd23	hack to remove most of the warning: deprecated messages (but not all, one is left)	2013-09-25 21:14:52 +02:00
Michael Peter Christen	a496313248	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-25 20:41:02 +02:00
sixcooler	3c48fc65fd	reverted RemoteInstance to deprecated methods of httpClient-4.2 this should work with current remote-Solr-Instances	2013-09-25 18:45:16 +02:00
Michael Peter Christen	91a875dff5	self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time.	2013-09-25 18:27:54 +02:00
Michael Peter Christen	095053a9b4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-25 17:32:52 +02:00
sixcooler	0cae420d8e	some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out)	2013-09-25 15:01:28 +02:00
sixcooler	15b1bb2513	bump to httpClient-4.3	2013-09-25 14:48:37 +02:00
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporarily filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	2013-09-25 14:38:24 +02:00
orbiter	14442efa6d	when profiles are cleaned, there shall be first a callback showing which profiles are cleaned. This shall enable a profile-termination-driven postprocessing. To do this, index writings must carry the profile key which will be implemented in another (next) step.	2013-09-25 11:04:12 +02:00
orbiter	0013d0d0bb	removed superfluous class	2013-09-24 21:18:37 +02:00
orbiter	f90d5296cb	Added new data structure to be used by the balancer (not used yet). These data structures will enable the balancer to store the crawl queue into individual queues, one each for a single host.	2013-09-24 21:08:40 +02:00
orbiter	0e8d752462	refactoring	2013-09-24 19:55:59 +02:00
orbiter	8ac2e8c8c9	added location navigator which makes the image for the map search visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	2013-09-24 11:26:51 +02:00
orbiter	d86d2be5c3	automatically removed Places autotagging if no location library is wanted	2013-09-24 11:23:45 +02:00
orbiter	214a087cdf	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-23 20:59:03 +02:00
Michael Peter Christen	96ed0c980e	- added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents	2013-09-23 18:09:42 +02:00
Michael Peter Christen	179ad281f9	close include byte buffer after usage	2013-09-23 12:19:51 +02:00
reger	6b9a624808	remove double declaration of TLD_any_zone_filter	2013-09-23 03:01:08 +02:00
orbiter	d2effd21db	fix for npe during location search	2013-09-21 21:03:58 +02:00
orbiter	828603e4f1	fix for 100%CPU problem in error cache cleaning process	2013-09-21 10:20:13 +02:00
orbiter	c64b51134e	hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead.	2013-09-21 08:57:43 +02:00
orbiter	6e8377b8ad	do not check all words with synonym library if the library is empty	2013-09-21 08:56:24 +02:00
orbiter	70ba74b23a	disabled ipv4 preference to enable ipv6-only networks like freifunk	2013-09-20 16:52:37 +02:00
orbiter	f3be1930cb	CPU problem when pushing to the error cache; wrong class, ConcurrentHashMap needed for concurrency	2013-09-20 16:51:50 +02:00
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	2013-09-17 15:52:57 +02:00
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	2013-09-17 15:27:02 +02:00
Michael Peter Christen	31920385f7	set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow"	2013-09-16 16:14:56 +02:00
Michael Peter Christen	57e00baf26	fix for parsing of image links inside of anchor links (image-links)	2013-09-15 23:54:46 +02:00
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this meant that large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	2013-09-15 23:27:04 +02:00
Michael Peter Christen	3ea9bb4427	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-15 00:30:41 +02:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, whose entries carry the property attributes attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
reger	603368fc3e	remove redundant declaration of USER_AGENT	2013-09-14 18:29:44 +02:00
Michael Peter Christen	1a8c64117f	decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash.	2013-09-11 13:03:58 +02:00
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	2013-09-10 10:31:57 +02:00
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	2013-09-09 12:58:26 +02:00
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	2013-09-05 13:22:16 +02:00
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	2013-09-05 09:59:41 +02:00
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-04 23:12:04 +02:00
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	2013-09-04 23:11:53 +02:00
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	2013-09-04 16:00:47 +02:00
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	2013-09-04 10:47:18 +02:00
Michael Peter Christen	85b1922244	activated image type navigation for image search	2013-09-03 13:34:01 +02:00
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-03 12:22:57 +02:00
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	2013-09-03 12:22:29 +02:00
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	2013-09-03 11:14:23 +02:00
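
The canonical rule stated in commit 101a6e6e14 above (if a document A links to B and B contains a 'canonical C', the citation computation shall consider that A links to C and that B does not link to C) can be sketched on a small in-memory link graph. This is only an illustration under stated assumptions, not code from the YaCy sources: the class name CanonicalCitationSketch, the method patch and the plain string maps are hypothetical, and the sketch does not model YaCy's actual citation index.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of YaCy.
final class CanonicalCitationSketch {

    // links:     source URL -> URLs it links to
    // canonical: document URL -> canonical URL declared by that document
    // Returns a new graph in which links are redirected to canonical targets.
    static Map<String, Set<String>> patch(Map<String, Set<String>> links,
                                          Map<String, String> canonical) {
        Map<String, Set<String>> patched = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : links.entrySet()) {
            String source = entry.getKey();
            String sourceCanonical = canonical.get(source); // may be null
            Set<String> targets = new LinkedHashSet<>();
            for (String target : entry.getValue()) {
                // "A links to B, B has canonical C" becomes "A links to C".
                String resolved = canonical.getOrDefault(target, target);
                // "B does not link to C": skip links to the source's own canonical.
                // Assumption of this sketch: the check is applied after resolving.
                if (resolved.equals(sourceCanonical)) continue;
                targets.add(resolved);
            }
            patched.put(source, targets);
        }
        return patched;
    }
}
```

With links A -> B and B -> {C, D}, and B declaring canonical C, the patched graph has A -> C and B -> D only.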


6630 Commits