712 Commits

Author SHA1 Message Date
Michael Peter Christen
c87cdfca2e do not set a load prerequisite that prevents the start of one-time-jobs 2014-01-22 17:18:53 +01:00
Michael Peter Christen
6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
settings available in Crawler_p.html servlet for steering.
2014-01-21 19:28:00 +01:00
sixcooler
40a4030b55 configurable max-load values for YaCy-Threads:
try lower values on small systems like a Pi
2014-01-21 17:04:22 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
because the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
Michael Peter Christen
25a6c05008 experimental removal of synchronization. This should work for all cases
where the size() and isEmpty() methods are used only for statistics, which
happens at many locations in YaCy. If these methods are used for
structural reasons (like accessing the last element in an array) then it
may fail or cause other problems. As far as visible, this is not the
case.
2014-01-19 14:47:11 +01:00
Michael Peter Christen
5695280edd removed superfluous synchronization 2014-01-19 14:44:58 +01:00
Michael Peter Christen
a1977b7a75 removed debug code 2014-01-19 14:42:26 +01:00
Michael Peter Christen
ec10ed45bd better logging in logger 2014-01-16 13:08:39 +01:00
Michael Peter Christen
c3dcbdc8d5 try to recover from an OOM during citation index reading and fail-over
to second solr core in case of unrecoverable OOM.
2013-11-28 01:10:25 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix reworks the
access methods for the search word sets so that words cannot be deleted
from the sets from outside the QueryGoal class.
2013-11-26 02:24:47 +01:00
Michael Peter Christen
191fd3d7e7 added an optimization option to HandleSet mass data storage structure 2013-11-15 15:38:00 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
orbiter
3c3cb78555 - removed a lot of garbage and bloated code from GuiHandler.
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
2013-10-24 20:42:34 +02:00
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
6aabc4e5c8 reduced logging line memory, 10000 lines had filled up 450MB! grrr.
(thank you, a bomb from the past)
2013-10-24 16:17:53 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together now make it possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
7b69c438f7 more methods for the table class 2013-10-15 16:46:59 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different type of link
collection, one which carries the property attributes that are attached
to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI, have been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
e8e558a9b7 fix for content domain classification in URIMetadataNode 2013-09-03 10:49:09 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results,
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
2013-09-02 18:55:38 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
Michael Peter Christen
0f3d8890db removed an assert which causes a shortcut call circuit 2013-08-22 10:12:25 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
Roland Haeder
13433d41a1 Log this exception better
Conflicts:
	source/net/yacy/kelondro/blob/Tables.java
2013-07-27 09:54:51 +02:00
orbiter
056b42f5aa - added information about segment count to status_p.xml
- also moved this information from the old index structure, which is
still in use for the RWI/DHT index to that front-end
2013-07-23 18:03:33 +02:00
Michael Peter Christen
336f86394c replaced StringBuffer with StringBuilder 2013-07-23 12:21:27 +02:00
Michael Peter Christen
aeac2fb763 replaced more containsKey() -> get() usages by a simple get(), followed
by a test for NULL. This should increase the application speed and
reduce the lookup time for the affected methods by 50%
2013-07-23 12:16:51 +02:00
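The lookup change described in aeac2fb763 can be illustrated with a minimal sketch. The class and method names below are hypothetical and not taken from the YaCy source. The single-lookup form is only equivalent when the map never stores null values, which is assumed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not code from the YaCy repository.
final class LookupSketch {

    // Old pattern: containsKey() followed by get() hashes the key twice.
    static String lookupTwice(Map<String, String> map, String key) {
        if (map.containsKey(key)) {
            return map.get(key);
        }
        return "";
    }

    // New pattern: a single get(), followed by a test for null.
    // Assumes the map never stores null values.
    static String lookupOnce(Map<String, String> map, String key) {
        String value = map.get(key);
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("host", "yacy.net");
        System.out.println(lookupOnce(map, "host"));
        System.out.println(lookupOnce(map, "missing").isEmpty());
    }
}
```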
Michael Peter Christen
735a66eff3 enhancements to crawler 2013-07-18 12:29:04 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
5c6946dd5f replaced usage of log4j by ConcurrentLog where possible 2013-07-09 14:42:39 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a major performance issue. To overcome
this problem, this is an add-on to jdk logging which puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
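The queue-based approach described in 5878c1d599 can be sketched roughly as follows. This is not the ConcurrentLog implementation: the class name, the in-memory sink (standing in for the real jdk logger) and the shutdown sentinel are all assumptions made to keep the example self-contained. Callers only enqueue a message and return; a single writer thread drains the queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a queue-backed logger; not the YaCy ConcurrentLog class.
final class QueueLogSketch {
    // Unique sentinel object, compared by identity, that tells the writer to stop.
    private static final String POISON = new String("POISON");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Stand-in for the real log target (assumption: a list instead of a jdk Logger).
    private final List<String> sink = Collections.synchronizedList(new ArrayList<>());
    private final Thread writer;

    QueueLogSketch() {
        writer = new Thread(() -> {
            try {
                while (true) {
                    String message = queue.take(); // blocks only the writer thread
                    if (message == POISON) {
                        return;
                    }
                    sink.add(message); // messages are written one by one
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called from any thread; never waits for the log target.
    void log(String message) {
        queue.offer(message);
    }

    // Flushes everything logged so far, stops the writer and returns what was written.
    List<String> close() {
        queue.offer(POISON);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(sink);
    }
}
```

Because the queue is unbounded in this sketch, producers never block, at the cost of memory if the writer falls behind.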
reger
a6bf44212e bugfix: location (lat/lon) meta data retrieval (Double.NaN check) 2013-06-30 03:50:07 +02:00
Michael Peter Christen
14186e815e npe fix 2013-06-13 22:42:21 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, also a
backlink structure can be discovered and written to the index as well.
The host browser has been extended to show such backlinks for each
presented link. The host browser can therefore now show where a
document is linked from. The new citation reference is computed as the
likelihood of a random click path, with recursive usage of the previously
computed likelihood. This process is repeated until the likelihood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
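The iteration described in f7e77a21bf (recompute the click-path likelihood from the previous values until it converges, then normalize to a value between 0 and 1) can be sketched as below. This is a generic PageRank-style power iteration, swapped in for illustration. The damping factor, the handling of pages without outgoing links, the convergence threshold and the normalization by the maximum are all assumptions and are not confirmed by the commit message.

```java
import java.util.Arrays;

// Hypothetical sketch; not the citation rank code from the YaCy repository.
final class CitationRankSketch {

    // links[i] holds the indexes of the pages (within one domain) that page i links to.
    static double[] rank(int[][] links, double damping, double epsilon, int maxIterations) {
        int n = links.length;
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n); // start with a uniform likelihood

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n); // random jump share (assumption)
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Page without outgoing links: spread its likelihood evenly.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * p[i] / n;
                    }
                } else {
                    for (int target : links[i]) {
                        next[target] += damping * p[i] / links[i].length;
                    }
                }
            }
            double delta = 0.0;
            for (int i = 0; i < n; i++) {
                delta += Math.abs(next[i] - p[i]);
            }
            p = next;
            if (delta < epsilon) {
                break; // the likelihood has converged
            }
        }

        // Normalize so that 0 <= CRn <= 1, with the best page at 1.
        double max = 0.0;
        for (double value : p) {
            max = Math.max(max, value);
        }
        double[] cr = new double[n];
        for (int i = 0; i < n; i++) {
            cr[i] = max == 0.0 ? 0.0 : p[i] / max;
        }
        return cr;
    }
}
```

For example, with three pages where pages 0 and 1 both link to page 2 and page 2 links back to page 0, page 2 receives the highest value and page 1, which nobody links to, the lowest.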
Michael Peter Christen
e20450e798 patch in HTCache and CitationIndex loading in case that a file is
broken: do not crash; instead ignore the file and delete it.
2013-06-07 12:52:03 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
Michael Peter Christen
a1644ca0fd new workflow processor in Segment to enqueue indexing documents to solr 2013-05-30 12:34:53 +02:00
Michael Peter Christen
5344a1c5f7 getting the trash out 2013-05-29 16:09:05 +02:00
orbiter
888a985dc6 set a higher limit for table copy usage 2013-05-27 15:23:12 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is much better
performance when testing the existence of many urls. This should
cause much less IO during index transmission, on both the sender and
the receiver side.
2013-05-17 13:59:37 +02:00
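The idea behind 8dbc80da70, one request for a whole collection of ids instead of one request per id, can be sketched with a fake index that counts round trips. The `FakeIndex` class and its methods are hypothetical and do not reflect the actual YaCy or Solr API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; FakeIndex stands in for a remote index such as Solr.
final class BatchExistsSketch {

    static final class FakeIndex {
        private final Set<String> stored;
        int calls = 0; // number of simulated round trips to the index

        FakeIndex(Collection<String> stored) {
            this.stored = new HashSet<>(stored);
        }

        // Old style: one round trip per id.
        boolean exists(String id) {
            calls++;
            return stored.contains(id);
        }

        // New style: one round trip for the whole collection;
        // returns the subset of ids that are present in the index.
        Set<String> exists(Collection<String> ids) {
            calls++;
            Set<String> found = new HashSet<>(ids);
            found.retainAll(stored);
            return found;
        }
    }

    public static void main(String[] args) {
        FakeIndex index = new FakeIndex(Arrays.asList("a", "b"));
        Set<String> found = index.exists(Arrays.asList("a", "c", "b"));
        System.out.println(found.contains("c"));
        System.out.println(index.calls);
    }
}
```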
Michael Peter Christen
44e363f37f refactoring of WorkflowProcessor, added process counter, update of
process counter if a blocking thread dies. Also added a new column in
PerformanceConcurrency_p servlet to show the actual number of concurrent
processes.
2013-05-13 13:28:07 +02:00
orbiter
aeff31cd44 fix for workflow processor (cause: latest redesign for less threads) 2013-05-12 21:36:20 +02:00
orbiter
a1c989002b fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652
generate dht data even if dht receive and dht transmission are switched
off
2013-05-08 16:48:45 +02:00
orbiter
7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
- check geolocation coordinates and accept only those which are
well-formed
- the solr push process no longer stops crawling if Solr still does not
accept the record after 20 requests. Instead, a severe log entry asks
the user to create a bug report
2013-05-03 00:24:39 +02:00
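A well-formedness check for coordinates, as mentioned in 7de5b9cfa0, might look like the sketch below. The exact conditions YaCy applies are not stated in the commit message; the ranges used here (latitude within ±90, longitude within ±180, no NaN) are an assumption based on the usual geographic conventions.

```java
// Hypothetical sketch; not the validation code from the YaCy repository.
final class GeoCheckSketch {

    // Accepts only coordinates that are real numbers inside the usual ranges.
    static boolean isWellFormed(double lat, double lon) {
        if (Double.isNaN(lat) || Double.isNaN(lon)) {
            return false; // NaN compares false with everything, so reject it explicitly
        }
        return lat >= -90.0 && lat <= 90.0 && lon >= -180.0 && lon <= 180.0;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(50.11, 8.68));
        System.out.println(isWellFormed(Double.NaN, 8.68));
    }
}
```

Infinite values are rejected by the range comparison, so they need no separate test.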
Michael Peter Christen
bb4bf3d8fd infinity timeout bug protection patch 2013-04-30 11:06:48 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy better
suited to smaller and 1-core devices.
- the workflow processor initially starts no process at all; processes
are started as soon as the parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
c1a2175fbc added transparency to gif image animation and the integration to the
YaCy httpd for on-the-fly generated gifs (including animated gifs)
2013-04-21 12:29:05 +02:00
Michael Peter Christen
ada3f27de7 added three new fields for a better ranking: references_internal_i,
references_external_i and references_exthosts_i. These can be used to
count and evaluate the number of external links to every web page. An
experimental ranking function could be, for example:
div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))
2013-04-12 16:17:14 +02:00
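The arithmetic of the experimental ranking function in ada3f27de7 can be checked with a small sketch. The method below evaluates the same expression in plain Java. The class and parameter names are hypothetical and simply mirror the field names from the commit message.

```java
// Hypothetical sketch that mirrors the function query from the commit message.
final class RankingSketch {

    // div(add(internal, product(external, exthosts)), add(clickdepth, 1))
    static double score(int referencesInternal, int referencesExternal,
                        int referencesExthosts, int clickdepth) {
        double numerator = referencesInternal + (double) referencesExternal * referencesExthosts;
        double denominator = clickdepth + 1; // +1 avoids division by zero at click depth 0
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // 4 internal links, 3 external links from 2 hosts, click depth 1: (4 + 3*2) / (1 + 1)
        System.out.println(score(4, 3, 2, 1));
    }
}
```

With these inputs the result is 5.0. Pages deeper in the click path are divided by a larger number and therefore score lower for the same link counts.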