Commit Graph

293 Commits

Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute in the url get/post properties.
This distinguishes the different page entries. The search result list
then replaces the post parameter with a url anchor # mark, which causes
the original url to be presented in the search result. These URLs can
be opened directly on the correct page using pdf.js, which is now built
into Firefox. That means: if you find a search hit on page 5 and click
on the search result, Firefox will open the pdf viewer and show page 5.
2014-12-21 18:10:15 +01:00
Michael Peter Christen
28683530cd fixes to usage of no-cache: also recognize and use the no-store
directive
2014-12-19 17:37:58 +01:00
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods
have been added to check the success of transactions.

The transactions for snapshots have two main components: an rss search
API to get information about the latest/oldest entries and a
commit/rollback API to move entries away from the rss results. This is
done by using two storage locations for the snapshots, INVENTORY and
ARCHIVE. New snapshots are placed in INVENTORY, committed snapshots
move to ARCHIVE, and rolled-back snapshots move back to INVENTORY.

Normal Workflow:
Besides all the options below, it is usually sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are
omitted from the next set of items.
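A minimal client sketch of this workflow (assuming an unauthenticated peer on localhost; the regex-based <guid> extraction and the json success check are naive illustrations, not production parsing):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnapshotWorkflow {
    static final HttpClient HTTP = HttpClient.newHttpClient();

    static String get(String url) throws Exception {
        return HTTP.send(HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // 1) fetch the latest INVENTORY entries
        String rss = get("http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST");
        // 2) each <guid> carries the urlhash needed for commit/rollback
        Matcher m = Pattern.compile("<guid[^>]*>([^<]+)</guid>").matcher(rss);
        while (m.find()) {
            String urlhash = m.group(1);
            // ... process the snapshot document here ...
            // 3) commit: the entry moves from INVENTORY to ARCHIVE
            String json = get("http://localhost:8090/api/snapshot.json?command=commit&urlhash=" + urlhash);
            System.out.println(urlhash + " committed: " + json.contains("success"));
        }
        // 4) the next rss call omits the committed urls
    }
}
```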

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY

The feed returns a <urlhash> in the <guid> field of the rss. This must
be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json returns a property list containing the property "result" with
the possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE.
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number of
entries for each host. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to snapshots with a specific crawl depth.
The list then contains the same host names, but the count values change
because only documents at that specific crawl depth are counted
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host instead of only an
accumulated count of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host to the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or the ARCHIVE for all list commands;
the default is ALL, which means that the host information is collected
and combined from both snapshot directories. The state option can be
used with all the commands listed above

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. The command
can also be restricted with state=INVENTORY or state=ARCHIVE to test
whether the document is in one of these snapshot directories. If a
urlhash is not found, an empty result is returned. If an entry was
found and the state was not restricted, then the result contains a
state property containing the name of the location where the document
is stored, either INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside INVENTORY, it can be
better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that call is very efficient.
2014-12-15 23:32:46 +01:00
Michael Peter Christen
66b5a56976 Added and integrated a new date detection class which can identify date
notions within the fulltext of a document. This class also attempts to
identify dates given abbreviated, with a missing year, or described
with names of special days, like 'Halloween'. In case a date has no
year given, the current year and following years are considered.

This process is therefore able to assign a large set of dates to a
document, either because several dates are given in the document or
because a date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, which may possibly also be in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect dates is still too CPU-intensive. Maybe
future enhancements will allow switching this on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
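As a sketch of the kind of calendar facet these fields could enable (standard Solr range-facet syntax; that the embedded Solr is reachable under /solr/select on the peer port is an assumption):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DateFacetQuery {
    public static void main(String[] args) throws Exception {
        // one facet bucket per day over the detected minimum content date;
        // "%2B" is an url-encoded "+" as required inside Solr date math
        String url = "http://localhost:8090/solr/select?q=*:*&rows=0&facet=true"
                + "&facet.range=date_in_content_min_dt"
                + "&facet.range.start=NOW/DAY-30DAYS"
                + "&facet.range.end=NOW/DAY%2B30DAYS"
                + "&facet.range.gap=%2B1DAY";
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```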
2014-12-14 13:40:45 +01:00
Michael Peter Christen
ab6cc3c88c added concurrent generation of snapshot pdfs 2014-12-10 14:10:05 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory
along with the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished; we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only peers running on
a server with wkhtmltopdf installed. The expert crawl start provides
the snapshot option to everyone. PDF snapshots are now optional and the
option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides a request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such an xml file is identical to a solr search result
with only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omitted. The
meaning of the fields is:
host: select only urls from this host, or from all hosts if not given
depth: select only urls at that crawl depth, or at all depths if not given
maxcount: select at most the given number of urls, or 10 if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST
to select the oldest entries, or ANY to select any

The rss feed needs administration rights to work; a call to this
servlet with the rss extension must attach login credentials.
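A sketch of such an authenticated call (HTTP Basic authentication via java.net.Authenticator is an assumption here; depending on the peer configuration, Digest authentication may be required instead):

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AuthenticatedFeed {
    public static void main(String[] args) throws Exception {
        // attach admin login credentials to the servlet call
        HttpClient client = HttpClient.newBuilder()
                .authenticator(new Authenticator() {
                    @Override protected PasswordAuthentication getPasswordAuthentication() {
                        return new PasswordAuthentication("admin", "password".toCharArray());
                    }
                }).build();
        String url = "http://localhost:8090/api/snapshot.rss"
                + "?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100";
        HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode());
    }
}
```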
2014-12-06 00:25:05 +01:00
reger
568c991405 remove the unused Request variable
(fix of  prev. commit)
2014-12-05 03:03:28 +01:00
reger
ff18129def ViewFile servlet: update the index if newer,
so the viewed text and the (stored) metadata info are similar
- to archive it, use a request with a profile that allows indexing
(defaultglobaltext) and update the index
   (the resource is loaded and parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameters
- number of anchors of the parent
- forkfactor: sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
226aea5914 added a servlet which can create preview images, preview thumbnails
and preview pdfs from web pages, i.e.:
http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/

This also supports on-the-fly generation of the preview documents if
the user is an administrator. Otherwise, the servlet fails.
To enable this, you must add wkhtmltopdf, imagemagick and (on headless
servers) xvfb to your operating system.

for detailed instructions, see
97f6089a41
2014-12-03 11:45:48 +01:00
Michael Peter Christen
e586e423aa in case loading from the cache fails, load from wkhtmltopdf without
cache using the user agent string given in the crawl profile
2014-12-02 13:35:19 +01:00
Michael Peter Christen
25a64c51b3 moved snapshot generation out of the html handler to prevent existing
cache entries from causing the handler not to be executed
2014-12-01 17:37:25 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do like this:
On a Mac, download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
On Debian, do "apt-get install wkhtmltopdf imagemagick"

Then, in /Settings_p.html?page=ProxyAccess, check "Transparent Proxy"
and "Always Fresh" - this is used by wkhtmltopdf to fetch web pages
through the YaCy proxy. With "Always Fresh" it is possible to get all
pages from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
84763126e0 added an option to make the YaCy proxy act as if the cache is never
stale. If set to 'Always Fresh', the cache is always used if the entry
exists in the cache. This is a good way to archive web content and
access it without going online again, in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
Michael Peter Christen
a39419f2ef more stacks shall be considered for on-demand loading, not only
deep-depth stacks, to prevent the "too many open files" problem
2014-11-23 20:11:23 +01:00
Michael Peter Christen
5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
during crawling
2014-11-23 20:09:32 +01:00
Michael Peter Christen
a34f837592 better delete all files in path when removing host crawl stack 2014-11-22 12:09:07 +01:00
Michael Peter Christen
10b1db430a if we have many hosts, use on-demand earlier 2014-11-22 12:04:04 +01:00
Michael Peter Christen
6983dff334 explain crawl denial when not switched to intranet mode 2014-10-11 09:02:12 +02:00
Michael Peter Christen
d8beafba3a fix for values in CrawlProfileEditor table and xml; now the full profile
is available in the xml.
2014-10-09 13:27:20 +02:00
Michael Peter Christen
ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
status.
2014-10-08 18:48:57 +02:00
Michael Peter Christen
9b1958e8ca more ipv6 bugfixes 2014-10-08 15:21:49 +02:00
Michael Peter Christen
e1bc768f9d more IPv6 bugfixes 2014-10-06 17:44:27 +02:00
reger
fb1fcc2b03 handle noarchive tag, skip writing page to cache
http://mantis.tokeek.de/view.php?id=44
2014-10-01 04:35:34 +02:00
Michael Peter Christen
6491270b3a large IPv6 redesign of the peer ping methods!
removed preferred IPv4 in start options and added a new field IP6 in
peer seeds which will contain one or more IPv6 addresses. Now every
peer has one or more IP addresses assigned; even several IPv6 addresses
are possible. The peer-ping process must check all given and possible
IP addresses for a backping and return the one IP which was successful
when pinging the peer. The pinging peer must be able to recognize which
of the given IPs are available for outside access of the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they
had been used for a single IP only. There is still a large construction
site left in YaCy where all these deprecated methods must be replaced
with new method calls. The 'extra'-IPs used by cluster assignment have
been removed since they can be replaced with IPv6 usage in p2p
clusters. All clusters must now use IPv6 if they want intranet routing.
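A minimal sketch of the candidate-selection idea (illustrative only: a real backping is an http request to the remote peer, not just a TCP connect):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.Optional;

public class BackpingSelector {
    /** Try all candidate IPs (IPv4 and IPv6) of a peer and return the
     *  first one that accepts a connection on the peer port. */
    static Optional<String> firstReachable(List<String> candidateIps, int port) {
        for (String ip : candidateIps) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(ip, port), 3000); // 3s timeout
                return Optional.of(ip); // store this one in the seed as reachable
            } catch (IOException e) {
                // not reachable from here, try the next candidate
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstReachable(List.of("2001:db8::1", "192.0.2.1"), 8090));
    }
}
```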
2014-09-30 14:53:52 +02:00
Michael Peter Christen
67cd4c37bd activated the new apk parser which was already ready but not included
in the parser initialization. To make the apk parser usable, the
handling of application-type links had to be modified. Now all
documents which do not have a parser attached are placed in the
noload-queue, while all other documents are parsed using the associated
parser class. This may have side effects on other parsers and the
display of different file classes (images, apps, videos).
2014-09-24 13:32:58 +02:00
Michael Peter Christen
025516f682 fix for failing crawl limit on the number of pages 2014-09-20 13:06:46 +02:00
orbiter
3ac31614a3 added option to reverse-sort YaCy tables (internal API change only) 2014-09-18 11:11:09 +02:00
Michael Peter Christen
bf18a39d0e replaced warning with info 2014-09-16 14:41:04 +02:00
Michael Peter Christen
ebd0be2cea fixes and speed updates for search process 2014-09-10 14:24:03 +02:00
Michael Peter Christen
a7dd89c4de changed the method to write the citation index: do not collect
references during document parsing; instead use the same references
that would also be written into the webgraph. That should cause the
webgraph and the citation index to express the exact same semantics.
orbiter
4ae7aead28 addon to latest fix 2014-08-27 00:03:49 +02:00
Michael Peter Christen
eca9380e3d bugfix for the crawler double-check: if a url is redirected, the
redirect target was not double-checked. This is now done by placing the
redirect-URL on the crawl queue again (where it is double-checked)
2014-08-06 12:35:12 +02:00
Michael Peter Christen
9ac0c93f17 fix for subpath crawl filter 2014-08-06 01:33:24 +02:00
Michael Peter Christen
66106bdaf0 fix for crawler attribute maxdompages 2014-08-05 21:32:25 +02:00
Michael Peter Christen
49d91b94c3 npe fix in crawler 2014-08-05 21:31:59 +02:00
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
3c23b89823 less logging 2014-08-04 13:37:34 +02:00
Michael Peter Christen
1609763be5 toString fix 2014-08-04 12:58:39 +02:00
Michael Peter Christen
001e05bb80 do not store a failure to load robots.txt into the index as a fail
document
2014-08-01 12:15:14 +02:00
Michael Peter Christen
05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-01 12:04:25 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
orbiter
e9163e7e10 fix for malformed hostpath names in crawl balancer 2014-07-29 11:18:45 +02:00
Michael Peter Christen
6e1dc444c3 added a snippet test function in ViewFile: you can now search for a
specific word in the document; the servlet returns the snippet in the
same way as it would be shown in a search result.
2014-07-24 14:59:37 +02:00
orbiter
4b06adb751 fix for file urls 2014-07-23 17:54:31 +02:00
Michael Peter Christen
542c20a597 changed handling of the crawl profile field crawlingIfOlder: this
should be filled with the date when the url is recognized as outdated.
That field was partly misinterpreted and the time interval was filled
in instead. In case all the urls in the index shall be treated as
outdated, the field is now filled with Long.MAX_VALUE because then all
crawl dates are before that date and therefore outdated.
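A small sketch of the intended semantics (the names are illustrative, not the actual YaCy code):

```java
public class RecrawlCheck {
    /** crawlingIfOlder is an absolute timestamp in ms: documents last
     *  loaded before that date count as outdated and may be re-crawled. */
    static boolean isOutdated(long lastLoadTimeMillis, long crawlingIfOlder) {
        return lastLoadTimeMillis < crawlingIfOlder;
    }

    public static void main(String[] args) {
        long loadedYesterday = System.currentTimeMillis() - 86_400_000L;
        // Long.MAX_VALUE marks every existing document as outdated:
        System.out.println(isOutdated(loadedYesterday, Long.MAX_VALUE));  // true
        // a cutoff one week in the past leaves yesterday's document fresh:
        System.out.println(isOutdated(loadedYesterday,
                System.currentTimeMillis() - 7L * 86_400_000L));          // false
    }
}
```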
2014-07-22 00:23:17 +02:00
Michael Peter Christen
4eec1a7452 refactoring (changed the Metadata name of the load-time data structure
to avoid confusion with Node data, which is also called metadata)
2014-07-21 23:54:23 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with a rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL object type,
which now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls; just use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
b5fc2b63ea removed exist() retrieval functions from the error cache and replaced
them with metadata retrieval from the connectors directly. This should
cause better usage of the cache. The metadata cache is automatically
increased if more memory is available.
2014-07-11 19:52:25 +02:00
Michael Peter Christen
62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
double-calling of solr
2014-07-11 18:36:04 +02:00
Michael Peter Christen
b5d78ba156 reduced number of solr queries during crawling 2014-07-11 18:05:11 +02:00
Michael Peter Christen
06ab72d1af enhanced crawler host round-robin strategy 2014-07-11 16:01:42 +02:00
Michael Peter Christen
49886fab08 enhanced debugging 2014-06-26 12:57:01 +02:00
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
Michael Peter Christen
74c249288a added a push api which makes it possible to upload files directly to
the YaCy indexer without crawling. Files are uploaded using POST
multipart requests; multiple file uploads are possible as well. Each
file has the file date and mime type attached, which is used to get the
right parser for the submitted data. Also a url is submitted which is
assigned to the document.
The CrawlSwitchboard has a new option for default Crawl Profiles which
are assigned dynamically from the new push interface.
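A hedged sketch of such a multipart upload (the form field names "url" and "data" are assumptions for illustration, not the documented parameters of api/push_p.html):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PushUpload {
    public static void main(String[] args) throws Exception {
        String boundary = "----yacy-push-" + System.nanoTime();
        String html = "<html><body>hello</body></html>";
        // hand-built multipart body: one field for the assigned url,
        // one file part carrying the document and its mime type
        String body = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"url\"\r\n\r\n"
                + "http://example.org/hello.html\r\n"
                + "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"data\"; filename=\"hello.html\"\r\n"
                + "Content-Type: text/html\r\n\r\n"
                + html + "\r\n"
                + "--" + boundary + "--\r\n";
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://localhost:8090/api/push_p.html"))
                        .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                        .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode());
    }
}
```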
2014-06-12 18:10:07 +02:00
Michael Peter Christen
ba6ffddefc refactoring 2014-06-12 05:23:26 +02:00
reger
92d1604a31 Crawler hostbalancer does not delete finished queue files;
use an alternative delete to fight the symptom (and fix deletion of host dirs on startup)
Root cause (which class holds a lock on .stack) not found.
http://mantis.tokeek.de/view.php?id=404
2014-06-05 02:13:08 +02:00
orbiter
d7d38f9135 made the number of open files in the crawler configurable and
increased the default maximum number of open files from 100 to 1000.
This number can be changed with the attribute crawler.onDemandLimit
2014-05-31 09:29:55 +02:00
reger
ca5437dd50 fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
local files can be crawled (intranet mode); url parsing fixed according to RFC 1738 (for unix and windows)
for windows like file:///c:/tmp or file://localhost/c:/tmp
for linux like file:///tmp or file://localhost/tmp
The host is ignored and the path must be absolute
2014-05-28 03:01:34 +02:00
orbiter
97983ba89f fixed generics warnings for generic array instantiation that appeared
after migration to Java 7
2014-05-20 21:50:16 +02:00
reger
1600414450 fix NPE on continuing crawls after YaCy restart
(Agent is then null)
2014-05-02 19:32:09 +02:00
Michael Peter Christen
c1c1be8f02 fix for slow crawling and better logging in balancer 2014-04-29 19:50:33 +02:00
Michael Peter Christen
3acf416335 npe fix 2014-04-29 19:24:05 +02:00
orbiter
2f63bd0261 enhanced Host Balancer strategy: fair round robin 2014-04-23 23:11:37 +02:00
Michael Peter Christen
8b32dd5f9e special strategy for balancer: do not remove targets with zero wait time
from the queue
2014-04-18 06:50:07 +02:00
Michael Peter Christen
9c6228d948 fix for deadlocks in crawler 2014-04-17 16:58:17 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
06afb568e2 new Strategies in Balancer:
- the doublecheck cache now records the crawl depth as well
- the doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with the lowest depth first; instead all
hosts which have only singleton entries are preferred to reduce the
number of files.
2014-04-17 12:52:54 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts, which is fair to all hosts in the queue.
This process will create a very large number of files for wide crawls
in the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10,000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such an on-demand file reader shall prevent that the number of file
pointers exceeds the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
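A minimal sketch of the on-demand idea (an LRU-bounded handle cache; this illustrates the concept, it is not the OnDemandOpenFileIndex implementation):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Keeps at most maxOpen file handles; least-recently-used handles are
 *  closed so wide crawls stay below the OS open-file limit. */
public class OnDemandFiles {
    private final int maxOpen;
    private final LinkedHashMap<Path, RandomAccessFile> open;

    public OnDemandFiles(int maxOpen) {
        this.maxOpen = maxOpen;
        this.open = new LinkedHashMap<>(16, 0.75f, true) { // access order = LRU
            @Override protected boolean removeEldestEntry(Map.Entry<Path, RandomAccessFile> e) {
                if (size() > OnDemandFiles.this.maxOpen) {
                    try { e.getValue().close(); } catch (IOException ignore) {}
                    return true;
                }
                return false;
            }
        };
    }

    /** The file is only opened when an access is actually wanted. */
    public synchronized RandomAccessFile get(Path p) throws IOException {
        RandomAccessFile raf = open.get(p);
        if (raf == null) {
            raf = new RandomAccessFile(p.toFile(), "rw");
            open.put(p, raf);
        }
        return raf;
    }
}
```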
2014-04-16 21:34:28 +02:00
Michael Peter Christen
075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
2014-04-14 13:32:35 +02:00
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages 2014-04-10 09:08:59 +02:00
Michael Peter Christen
9e503b3376 also delete the robots.txt file from the cache when a new crawl is
started
2014-04-09 21:59:54 +02:00
Michael Peter Christen
1c21b3256d fix for robots.txt handling: delete old entry before starting a new
crawl.
2014-04-09 18:33:48 +02:00
Michael Peter Christen
926d28dd3f fixed a bug which prevented crawl starts after a network switch 2014-04-04 14:43:35 +02:00
Michael Peter Christen
d4b5c457e4 NPE fix 2014-04-04 12:34:34 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
85a427ec54 support for multiple sitemaps in robots.txt 2014-03-14 13:33:23 +01:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
reger
dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
optimize .yacyh check in AbstractRemoteHandler
2014-03-06 03:08:04 +01:00
Michael Peter Christen
e485fbd0ce - let crawl loader jobs die after 10 seconds without new jobs
- corrected shutdown order to prevent a deadlock during shutdown
2014-03-04 00:33:13 +01:00
Michael Peter Christen
bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
processes instead of throwaway processes. The control mechanism now
reports a 'queue full' message to the busy loop less often, so it does
not perform a long busy wait; instead all requests are queued and new
loader processes are started if necessary, up to a given limit (as set
before)
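A rough sketch of the pattern (a fixed-size pool whose idle threads expire, fed by an unbounded request queue; illustrative names, not the YaCy classes):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoaderPool {
    public static void main(String[] args) {
        // at most 8 loader threads; requests queue up instead of being
        // rejected, so callers rarely see a "queue full" condition
        ThreadPoolExecutor loaders = new ThreadPoolExecutor(
                8, 8, 10, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        loaders.allowCoreThreadTimeOut(true); // idle loaders die after 10s
        for (int i = 0; i < 100; i++) {
            final int job = i;
            loaders.submit(() -> System.out.println("loading job " + job));
        }
        loaders.shutdown();
    }
}
```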
2014-03-03 22:13:40 +01:00
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and reduced memory footprint (by using
HashMaps instead of TreeMaps)
- reduced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of memory used for index sets if
they are not changed afterwards any more
2014-02-28 14:01:09 +01:00
orbiter
da5d4128bf prevent npe 2014-02-25 03:26:20 +01:00
orbiter
a878c7982c prevent npe 2014-02-25 03:19:41 +01:00
orbiter
ced1a96f9c fixed error cache 2014-02-25 02:16:22 +01:00
Michael Peter Christen
69391e5d9e changed the strategy to test the existence of documents in Solr: now
using the update time. The reason is better caching for the crawler
double-check, which needs the update time for crawler steering.
2014-02-19 04:03:45 +01:00
Michael Peter Christen
8b14e92ba4 added button in host browser to re-load 404/failed documents 2014-01-23 15:56:36 +01:00
Michael Peter Christen
6ada0daae9 making latency_factor and the maximum number of same hosts in the
loader queue available as settings in the Crawler_p.html servlet for
steering.
2014-01-21 19:28:00 +01:00
Michael Peter Christen
0168f80c28 new crawling factors can now be changed during runtime 2014-01-21 17:52:16 +01:00
Michael Peter Christen
77531850b5 reverted crawling strategy from latest commit. 2014-01-21 16:05:55 +01:00
Michael Peter Christen
c0da966dfa enhanced crawler speed 2014-01-20 21:46:40 +01:00
Michael Peter Christen
0d235a565b cleanup crawl loader jobs 2014-01-20 18:36:00 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as the standard for
word hash length and word hash ordering and moved that to the class
'Word', because the class URIMetadataRow defined the old metadata data
structure and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
Michael Peter Christen
022c6d3ce1 do YaCy p2p connections using a timeout-request which wraps the http
request in a separate thread and ignores the further result of the
request if it does not answer within the requested time-out. This is an
attempt to solve a problem with the peer-ping, which hangs whenever a
peer appears to be dead or blocked.
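A compact sketch of the timeout-wrapper pattern (illustrative, not the actual YaCy networking code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutRequest {
    static final ExecutorService POOL = Executors.newCachedThreadPool();

    /** Run the request in a separate thread; if it does not answer within
     *  timeoutMillis, give up and ignore whatever result arrives later. */
    static <T> T sendWithTimeout(Callable<T> request, long timeoutMillis, T fallback) {
        Future<T> f = POOL.submit(request);
        try {
            return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // peer seems dead or blocked: abandon the request
            return fallback;
        } catch (Exception e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        String answer = sendWithTimeout(() -> {
            Thread.sleep(5000); // simulate a hanging peer ping
            return "pong";
        }, 1000, "no answer");
        System.out.println(answer); // prints "no answer" after about 1s
        POOL.shutdownNow();
    }
}
```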
2014-01-19 15:21:23 +01:00