Commit Graph

736 Commits

Author SHA1 Message Date
reger
9313447de2 reimplement tighter lat/lon calc in URIMetadataNode
from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272
2014-04-27 18:20:33 +02:00
orbiter
12ba890205 removed warnings 2014-04-22 19:35:15 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
- URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
Michael Peter Christen
1aea01fe5b fix for Table in case that requested file does not exist and paths also
do not exist
2014-04-17 12:44:05 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only when access is requested (OnDemandOpenFileIndex). Using such
on-demand file readers should keep the number of open file
pointers below the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is combined with the port of the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
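The queue layout described in commit da86f150ab above can be pictured with a minimal Java sketch. It is not YaCy's HostBalancer: the class name `DepthFirstHostQueues`, the host key format, and the in-memory maps are assumptions made for illustration, whereas the commit keeps one file per host and crawl depth on disk. The sketch only shows the primary rule that the next URL always comes from the smallest available crawl depth; fair balancing between hosts at the same depth is not modelled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not YaCy code: one queue per host and crawl depth.
class DepthFirstHostQueues {
    // host key (e.g. "example.org:80") -> crawl depth -> URLs waiting at that depth
    private final Map<String, TreeMap<Integer, Deque<String>>> queues = new HashMap<>();

    void push(String hostKey, int depth, String url) {
        queues.computeIfAbsent(hostKey, k -> new TreeMap<>())
              .computeIfAbsent(depth, d -> new ArrayDeque<>())
              .addLast(url);
    }

    // Returns a URL with the smallest crawl depth over all hosts, or null if nothing is queued.
    String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : queues.entrySet()) {
            int depth = e.getValue().firstKey(); // maps are never left empty, see below
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) return null;
        TreeMap<Integer, Deque<String>> byDepth = queues.get(bestHost);
        Deque<String> queue = byDepth.get(bestDepth);
        String url = queue.pollFirst();
        // remove exhausted queues so that firstKey() always refers to a non-empty queue
        if (queue.isEmpty()) byDepth.remove(bestDepth);
        if (byDepth.isEmpty()) queues.remove(bestHost);
        return url;
    }
}
```

Pushing URLs at depths 2, 0 and 1 and then calling `pop()` three times returns them in depth order 0, 1, 2, regardless of which host they belong to.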
Michael Peter Christen
17e0956312 refactoring of SystemLoad calls (only one backend tool) 2014-04-11 09:25:18 +02:00
reger
227c42bc96 eliminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
2014-04-03 00:35:15 +02:00
Michael Peter Christen
62a36fa584 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-02 03:27:08 +02:00
reger
c9f92abddc fix: application link count
(URIMetadataNode)
2014-04-02 03:21:51 +02:00
Michael Peter Christen
5b83887da8 npe fix 2014-04-02 02:34:55 +02:00
Michael Peter Christen
56710ecb26 prevent opening of new files as that could be a cause for the latest
too-many-open-files exception. The old file is just truncated if the
table is cleaned.
2014-03-28 14:31:43 +01:00
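The idea in commit 56710ecb26 above, truncating the existing file instead of opening a new one, can be illustrated with a small Java sketch. The class `TruncateInPlace` and its methods are hypothetical and are not taken from YaCy's table code; the sketch only assumes that the table file is held open through a `java.io.RandomAccessFile`.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical sketch, not YaCy code: empty a file without opening another handle.
class TruncateInPlace {
    // Drops all content through the already open handle.
    static void clear(RandomAccessFile raf) throws IOException {
        raf.setLength(0); // the file pointer is moved back to the new end of file (0)
    }

    // Writes, clears and rewrites through a single handle; returns the observed lengths.
    static long[] demo() {
        try {
            File f = File.createTempFile("table", ".tmp");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.write(new byte[] {1, 2, 3, 4});
                long before = raf.length();
                clear(raf);
                long cleared = raf.length();
                raf.write(new byte[] {9, 9}); // the same handle stays usable
                return new long[] {before, cleared, raf.length()};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the handle is reused, cleaning the table does not add to the number of open file descriptors.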
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
1a764135be one more Thread Dump fix for new bootstrap css style 2014-03-27 23:01:28 +01:00
Michael Peter Christen
bb21d825f9 fix for thread dump line spacing 2014-03-27 22:13:37 +01:00
Michael Peter Christen
5f4a6892c1 enhanced RowSet re-sort limit for small sets 2014-03-05 23:28:19 +01:00
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
2014-02-28 14:01:09 +01:00
Michael Peter Christen
9eb668e951 enhanced the resource observer
The resource observer is now able to recognize free disk space AND
available space for YaCy. The amount of space which is assigned to YaCy
is defined in new settings in the configuration file.
Furthermore, there is now a cleanup process which deletes files in case
that an autodelete is activated. The autodelete is now BY DEFAULT ON if
the disk space is low, which means that YaCy starts to delete documents
when the disk is full!
2014-02-12 01:00:44 +01:00
Michael Peter Christen
fbee98c06f fixed shortcut self-reference bug 2014-02-11 22:14:46 +01:00
Michael Peter Christen
acc8d7faa7 fixed setting of shortMemoryStatus in MemoryControl 2014-02-09 12:25:55 +01:00
Michael Peter Christen
94245ce0a8 fixed "Size in KBytes" calculation in PerformanceQueues_p.html,
see http://bugs.yacy.net/view.php?id=362
2014-02-07 17:19:08 +01:00
Michael Peter Christen
ebfaf753b7 - faster initialization of index files
- removal of not used space if index files shrink (rare, but possible)
2014-01-28 12:39:58 +01:00
reger
a3e2cca8e9 improve isOlder check to not overwrite node index with metadata on equal load date 2014-01-26 01:00:52 +01:00
orbiter
c351e47a84 fix for bad-formatted lonlat 2014-01-22 21:33:11 +01:00
Michael Peter Christen
c87cdfca2e do not set a load prerequisite that prevents the start of one-time-jobs 2014-01-22 17:18:53 +01:00
Michael Peter Christen
6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
settings available in Crawler_p.html servlet for steering.
2014-01-21 19:28:00 +01:00
sixcooler
40a4030b55 configurable max-load values for YaCy-Threads:
try lower values on small systems like a Pi
2014-01-21 17:04:22 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
because the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
Michael Peter Christen
25a6c05008 experimental removal of synchronization. This should work for all cases
where the size() and isEmpty() methods are used only for statistics, which
happens at many locations in YaCy. If these methods are used for
structural reasons (like accessing the last element in an array) then it
may fail or cause other problems. As far as visible, this is not the
case.
2014-01-19 14:47:11 +01:00
Michael Peter Christen
5695280edd removed superfluous synchronization 2014-01-19 14:44:58 +01:00
Michael Peter Christen
a1977b7a75 removed debug code 2014-01-19 14:42:26 +01:00
Michael Peter Christen
ec10ed45bd better logging in logger 2014-01-16 13:08:39 +01:00
Michael Peter Christen
c3dcbdc8d5 try to recover from an OOM during citation index reading and fail-over
to second solr core in case of unrecoverable OOM.
2013-11-28 01:10:25 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix restructures the
search word set access methods so that words cannot be deleted
from the sets from outside the QueryGoal class.
2013-11-26 02:24:47 +01:00
Michael Peter Christen
191fd3d7e7 added an optimization option to HandleSet mass data storage structure 2013-11-15 15:38:00 +01:00
Michael Peter Christen
1a4a69c226 set more loggers to 'final static' 2013-11-13 06:18:48 +01:00
orbiter
3c3cb78555 - removed a lot of garbage and bloated code from GuiHandler.
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
2013-10-24 20:42:34 +02:00
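The change in commit 3c3cb78555 above, storing log lines as strings rather than as full log objects, might look roughly like the following Java sketch. `StringLogBuffer` is a hypothetical name and the line format is an assumption; the real GuiHandler is not reproduced here. The point is only that the buffer keeps the rendered text and lets the `LogRecord` be garbage collected.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.logging.LogRecord;

// Hypothetical sketch, not YaCy code: a bounded buffer of already rendered log lines.
class StringLogBuffer {
    private final Deque<String> lines = new ArrayDeque<>();
    private final int maxLines;

    StringLogBuffer(int maxLines) {
        if (maxLines <= 0) throw new IllegalArgumentException("maxLines must be positive");
        this.maxLines = maxLines;
    }

    synchronized void publish(LogRecord record) {
        // render to text immediately; no reference to the LogRecord is kept
        String line = record.getLevel() + " " + record.getMessage();
        if (lines.size() >= maxLines) lines.pollFirst(); // drop the oldest line
        lines.addLast(line);
    }

    // Copy of the buffered lines, oldest first.
    synchronized List<String> snapshot() {
        return new ArrayList<>(lines);
    }
}
```

With a capacity of 2, publishing three records leaves only the two most recent lines in the buffer.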
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
6aabc4e5c8 reduced logging line memory, 10000 lines had filled up 450MB! grrr.
(thank you, a bomb from the past)
2013-10-24 16:17:53 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together now make it possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
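The first fix in commit 1b4fa2947d above prefers an existing mime type over the file extension when deciding the content domain. A minimal Java sketch of that order of checks follows; the class `ContentDomainGuess`, the `Domain` enum, and the short lists of mime prefixes and extensions are assumptions for illustration and do not mirror YaCy's actual classification tables.

```java
import java.util.Locale;

// Hypothetical sketch, not YaCy code: mime type first, file extension only as fallback.
class ContentDomainGuess {
    enum Domain { TEXT, IMAGE, AUDIO, VIDEO, UNKNOWN }

    static Domain classify(String mimeType, String path) {
        Domain byMime = fromMime(mimeType);
        if (byMime != Domain.UNKNOWN) return byMime; // a known mime type wins
        return fromExtension(path);
    }

    static Domain fromMime(String mimeType) {
        if (mimeType == null) return Domain.UNKNOWN;
        String m = mimeType.toLowerCase(Locale.ROOT).trim();
        if (m.startsWith("image/")) return Domain.IMAGE;
        if (m.startsWith("audio/")) return Domain.AUDIO;
        if (m.startsWith("video/")) return Domain.VIDEO;
        if (m.startsWith("text/") || m.startsWith("application/xhtml+xml")) return Domain.TEXT;
        return Domain.UNKNOWN;
    }

    static Domain fromExtension(String path) {
        if (path == null) return Domain.UNKNOWN;
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith(".jpg") || p.endsWith(".jpeg") || p.endsWith(".png") || p.endsWith(".gif")) {
            return Domain.IMAGE;
        }
        if (p.endsWith(".html") || p.endsWith(".htm") || p.endsWith(".txt")) return Domain.TEXT;
        return Domain.UNKNOWN;
    }
}
```

A page whose name ends with .jpg but which is served as `text/html` is classified as text, which is the situation the commit describes for commons.wikimedia.org.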
Michael Peter Christen
7b69c438f7 more methods for the table class 2013-10-15 16:46:59 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
only the unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different
type of link collection, which carries the property attributes that are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI, were renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
e8e558a9b7 fix for content domain classification in URIMetadataNode 2013-09-03 10:49:09 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results,
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
2013-09-02 18:55:38 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
Michael Peter Christen
0f3d8890db removed an assert which causes a shortcut call circuit 2013-08-22 10:12:25 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
Roland Haeder
13433d41a1 Log this exception better
Conflicts:
	source/net/yacy/kelondro/blob/Tables.java
2013-07-27 09:54:51 +02:00