Commit Graph

772 Commits

Author SHA1 Message Date
reger
c6f634a4f2 remove redundant caching of urlhash in URIMetadataNode
(is already cached in underlying DigestURL .url)

upd pom keyword for maven-antrun-plugin
2014-12-21 03:45:54 +01:00
Michael Peter Christen
413eeefed4 added character set detection library from
http://www-archive.mozilla.org/projects/intl/chardet.html
2014-12-10 13:08:29 +01:00
Michael Peter Christen
a304058840 added Image Events as another option to generate images with a mac if no
Ghostscript is available or does not work...
2014-12-04 01:21:24 +01:00
Michael Peter Christen
321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
thread pools will flush their cached (dead) threads after 60 seconds.
As a result, YaCy now runs constantly with about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, so the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the cached number of threads. VM supervisors then
terminated whole VM sessions if a limit was reached. Many VM providers
have limits of numproc=96 which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
2014-12-02 16:26:07 +01:00
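A minimal sketch of the idea in the commit above, not the actual YaCy code: a pool with no core threads whose idle workers are dropped after a keep-alive period, built directly with java.util.concurrent.ThreadPoolExecutor. The class and method names (CachedPoolSketch, newPool, workersAfterIdle) are hypothetical, and the demo uses a short keep-alive instead of the 60 seconds named in the commit so that the effect is visible quickly.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolSketch {

    // Hypothetical helper: a "cached" pool shape with zero core threads,
    // so every worker that stays idle longer than keepAlive is terminated.
    static ThreadPoolExecutor newPool(long keepAlive, TimeUnit unit) {
        return new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                keepAlive, unit,
                new SynchronousQueue<Runnable>());
    }

    // Submits some trivial tasks, lets the pool sit idle, and reports
    // how many worker threads are still alive afterwards.
    static int workersAfterIdle(int tasks, long keepAliveMillis, long idleMillis)
            throws InterruptedException {
        ThreadPoolExecutor pool = newPool(keepAliveMillis, TimeUnit.MILLISECONDS);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> { /* short-lived job */ });
            }
            Thread.sleep(idleMillis); // idle period longer than the keep-alive
            return pool.getPoolSize();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("workers after idle: " + workersAfterIdle(8, 200, 1000));
    }
}
```

A fixed pool would instead keep its workers parked indefinitely, which is the behaviour the commit describes as driving up the numproc counter.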
Michael Peter Christen
7bfab5eb9d set Busy- and Blocking-Threads to daemon mode (they will no longer prevent
YaCy from terminating if still running)
2014-12-02 16:05:00 +01:00
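For illustration only, and assuming nothing about how YaCy's Busy- and Blocking-Threads are implemented: in Java, a thread marked as a daemon before it is started does not keep the JVM alive. The names DaemonThreadSketch and newDaemonWorker are hypothetical.

```java
public class DaemonThreadSketch {

    // Hypothetical helper: wraps a job in a thread that is flagged as daemon.
    // setDaemon(true) has to be called before start().
    static Thread newDaemonWorker(String name, Runnable job) {
        Thread t = new Thread(job, name);
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        Thread busy = newDaemonWorker("busy-thread-sketch", () -> {
            // Endless background loop, standing in for a long-running worker
            while (true) {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        busy.start();
        System.out.println("daemon: " + busy.isDaemon());
        // main returns here; the JVM can exit although the loop never finishes
    }
}
```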
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
4920ab7b76 optimize usage of size() cache 2014-11-23 20:07:32 +01:00
Michael Peter Christen
2beb6abeb6 disabled crazy sleep loop 2014-11-21 14:38:54 +01:00
Michael Peter Christen
8aee7f940e added missing class for latest changes 2014-11-13 01:30:12 +01:00
Michael Peter Christen
97039049e4 fix in key enumeration methods for cases where the enumeration is done
in reverse order.
2014-11-13 01:15:31 +01:00
Michael Peter Christen
421ee64f33 another fix to ordering of table indexes; fixes also network stats
graphics
2014-11-11 13:57:04 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00
sixcooler
72561926aa do not overwrite yacy.conf in case of an exception
may be a fix for http://mantis.tokeek.de/view.php?id=180
2014-10-15 18:13:54 +02:00
Michael Peter Christen
bc275dca07 added network history graph image /NetworkHistory.png which can show
many different statistics about the history of the peer.
2014-10-10 14:06:47 +02:00
Michael Peter Christen
ee27be3399 misc bugfixes (concurrency, memory protection) 2014-10-08 15:22:29 +02:00
Michael Peter Christen
7817fc50c9 added a high cpu cycle monitor to PerformanceQueues 2014-10-08 15:20:43 +02:00
orbiter
3ac31614a3 added option to reverse-sort YaCy tables (internal API change only) 2014-09-18 11:11:09 +02:00
Michael Peter Christen
ec6082c872 very bad language detection hack fix hack 2014-09-05 23:29:09 +02:00
Michael Peter Christen
a7dd89c4de changed method to write the citation index: do not catch up references
during document parsing; instead use the same references that would also
be written into the webgraph. As a result, the webgraph and
the citation index should express exactly the same semantics.
2014-09-02 13:22:12 +02:00
reger
ea6c9e9b07 reduce mem buffer overhead for gap files during r/w
(they are typically small compared to idx allowing to use smaller buffersize -> set to 16k records)
2014-08-18 00:03:24 +02:00
orbiter
487021fb0a snippet computation update 2014-08-15 01:17:11 +02:00
Michael Peter Christen
0ceeceb35e more logic on Solr queries; usage of the query terms in postprocessing,
saving one query for double document detection now per document
2014-08-04 02:35:38 +02:00
reger
2b8cc5832c fix seek error for 0 file size records file
by adding an extra check for file size = 0 in cleanlast()
- (http://mantis.tokeek.de/view.php?id=411)
2014-07-06 20:49:01 +02:00
reger
2ba394333f fix Crawler HostQueue release of stackfile
- close stackfile inputstream at end of ChunkIterator
This should solve the startup delay while unfinished crawl jobs exist (and maybe also the too-many-open-files situation)
2014-07-06 16:04:30 +02:00
Michael Peter Christen
501d55cd35 removed superfluous assert 2014-06-19 12:10:12 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
6634b5b737 debug code for index distribution testing 2014-05-21 18:20:16 +02:00
orbiter
97983ba89f fixed generics warnings for generic array instantiation that appeared
after migration to Java 7
2014-05-20 21:50:16 +02:00
orbiter
88f4af90da removed warnings 2014-05-13 22:27:31 +02:00
orbiter
89f76da24b Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-05-06 05:38:38 +02:00
sixcooler
b8cee9b7d8 remove tables from tabletracker on close to avoid lots of dead entries in
/PerformanceMemory_p.html
2014-05-02 22:55:47 +02:00
orbiter
f15c832587 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-30 07:42:52 +02:00
reger
ffc5b75c73 optimize and fix lat / lon assignment 2014-04-27 20:52:06 +02:00
reger
9313447de2 reimplement tighter lat/lon calc in URIMetadataNode
from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272
2014-04-27 18:20:33 +02:00
orbiter
a3542f29b4 npe fix 2014-04-25 09:26:20 +02:00
orbiter
c48d2a2a02 npe fix 2014-04-25 09:23:10 +02:00
orbiter
12ba890205 removed warnings 2014-04-22 19:35:15 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
Michael Peter Christen
1aea01fe5b fix for Table in case that requested file does not exist and paths also
do not exist
2014-04-17 12:44:05 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such an on-demand file reader shall prevent the number of file
pointers from exceeding the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
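The on-demand idea from the commit above can be sketched roughly as follows. This is not the real OnDemandOpenFileIndex; it is a hypothetical, much simplified fixed-record store (OnDemandFileSketch, with assumed methods append, get, size and roundTrip) that holds only a path between calls and opens a file handle just for the duration of each access.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OnDemandFileSketch {
    private final File file;       // only the path is kept between accesses
    private final int recordSize;  // fixed record length in bytes (assumption)

    OnDemandFileSketch(File file, int recordSize) {
        this.file = file;
        this.recordSize = recordSize;
    }

    // Opens the file, appends one record, and closes the handle again.
    synchronized void append(byte[] record) throws IOException {
        if (record.length != recordSize) {
            throw new IllegalArgumentException("wrong record size");
        }
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek(raf.length());
            raf.write(record);
        }
    }

    // Opens the file read-only, reads record number `index`, and closes it.
    synchronized byte[] get(long index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(index * recordSize);
            byte[] buffer = new byte[recordSize];
            raf.readFully(buffer);
            return buffer;
        }
    }

    // Number of records, derived from the file length without opening it.
    long size() {
        return file.length() / recordSize;
    }

    // Small demo: stores two entries in a temp file and reads the second back.
    static String roundTrip(String first, String second) {
        try {
            File tmp = File.createTempFile("queue-sketch", ".stack");
            tmp.deleteOnExit();
            OnDemandFileSketch index = new OnDemandFileSketch(tmp, 16);
            index.append(pad(first, 16));
            index.append(pad(second, 16));
            String read = new String(index.get(1), StandardCharsets.US_ASCII).trim();
            return read + "/" + index.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pads (or cuts) a string to exactly `len` bytes using spaces.
    private static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        Arrays.fill(out, (byte) ' ');
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, len));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("example.org", "example.net"));
    }
}
```

The trade-off in such a design is an open/close per access in exchange for a near-zero count of permanently held file descriptors, which matters when there is one file per host and crawl depth.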
Michael Peter Christen
17e0956312 refactoring of SystemLoad calls (only one backend tool) 2014-04-11 09:25:18 +02:00
reger
227c42bc96 eliminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
2014-04-03 00:35:15 +02:00
Michael Peter Christen
62a36fa584 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-02 03:27:08 +02:00
reger
c9f92abddc fix: application link count
(URIMetadataNode)
2014-04-02 03:21:51 +02:00
Michael Peter Christen
5b83887da8 npe fix 2014-04-02 02:34:55 +02:00
Michael Peter Christen
56710ecb26 prevent opening of new files as that could be a cause for the latest
too-many-open-files exception. The old file is just truncated if the
table is cleaned.
2014-03-28 14:31:43 +01:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
1a764135be one more Thread Dump fix for new bootstrap css style 2014-03-27 23:01:28 +01:00
Michael Peter Christen
bb21d825f9 fix for thread dump line spacing 2014-03-27 22:13:37 +01:00