Commit Graph

776 Commits

Author SHA1 Message Date
orbiter
1b8d346b4c fixes in connection with transition to byte[] hashes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5843 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-20 21:54:00 +00:00
orbiter
996572de95 quickfix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5841 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-20 16:11:35 +00:00
orbiter
380ed2dac0 performance and debugging additions
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5840 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-20 15:01:43 +00:00
f1ori
76af84d732 * add custom comparator to ScoreCluster for byte[]
* fixes http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2010


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5836 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-19 20:01:46 +00:00
f1ori
2f860a2564 * convert byte[] hashes to string for log output
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5830 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-18 14:35:18 +00:00
orbiter
fbcbcc5bdb export of yacy document objects as dublin core record in xml
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5826 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-17 14:20:12 +00:00
orbiter
d7cbf4cdd4 more performance hacks: less overhead in word hash computation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5825 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-17 13:47:06 +00:00
orbiter
29e96c1a60 bugfixes and performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5824 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-17 13:04:56 +00:00
f1ori
44daec7936 * introduce signatures to autoupdate
  as long as no public keys are set for the update locations,
  no signatures are checked
* wiki-article follows...


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5822 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-17 09:58:06 +00:00
orbiter
538e375901 replaced old caching method for computed word hashes with a better method. The word hash computation is a new performance bottleneck (after the IO bottleneck was removed with the IndexCell data structure) and a better caching for word hashes was necessary.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5821 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-17 09:26:16 +00:00
orbiter
e16c25ddf7 (peak-) performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-16 22:45:39 +00:00
orbiter
63cd152969 fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5818 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-16 22:18:35 +00:00
orbiter
c8624903c6 full redesign of index access data model:
terms (words) are no longer retrieved by their word hash string, but by a byte[] containing the word hash.
This has strong advantages when RWIs are sorted in the ReferenceContainer Cache and compared with the sun.java TreeMap method, which previously needed getBytes() and new String() transformations.
Many thousands of such conversions are now omitted every second, which increases the indexing speed by a factor of two.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-16 15:29:00 +00:00
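
The idea in commit c8624903c6 above (keying a sorted map by byte[] rather than by String) can be sketched as follows. This is only an illustration under assumptions: the class name ByteArrayKeyDemo, the comparator BYTE_ORDER and the plain unsigned byte ordering are hypothetical and are not taken from the YaCy sources, which may order hashes differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical demo class, not part of YaCy.
final class ByteArrayKeyDemo {

    // Assumed ordering: unsigned lexicographic comparison of the raw bytes.
    // byte[] has no natural ordering, so a TreeMap needs an explicit comparator.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xff) - (b[i] & 0xff);
            if (c != 0) return c;
        }
        return a.length - b.length; // shorter array sorts first on a common prefix
    };

    // Creates a sorted map keyed directly by hash bytes; no String conversion
    // is needed for insertion, lookup or comparison.
    static <V> TreeMap<byte[], V> newIndex() {
        return new TreeMap<>(BYTE_ORDER);
    }

    public static void main(String[] args) {
        TreeMap<byte[], Integer> index = newIndex();
        index.put("bbbb".getBytes(StandardCharsets.US_ASCII), 2);
        index.put("aaaa".getBytes(StandardCharsets.US_ASCII), 1);
        // Lookup works with a different array instance holding the same bytes,
        // because the comparator compares content, not identity.
        System.out.println(index.get("aaaa".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(new String(index.firstKey(), StandardCharsets.US_ASCII));
    }
}
```

Because the comparator works on array content, keys never have to pass through getBytes() or new String() while the map is sorted or searched; a conversion is only needed for display, for example in log output.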
orbiter
bc80dc913a added new surrogate reader (surrogates are parsed documents in batches)
this will open a new way to insert indexes into YaCy (instead of crawling)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5808 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 15:30:25 +00:00
orbiter
8a24350036 - fix for join method with new generalized RWI data structure (caused by latest commit)
- added more functions to mediawiki parser


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5806 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 10:26:24 +00:00
orbiter
89ec3acb3e - full abstraction of index content type: the kelondro full text index may now also contain indexes about other content than text, i.e. navigation indexes or reverse linking indexes.
- during index joins all word positions are maintained: better ranking for word distance possible; exact phrase match can be implemented soundly


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5804 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 06:34:27 +00:00
orbiter
8862a2fed0 ups
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5799 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-12 10:22:21 +00:00
orbiter
de68948bc5 better handling of free memory computation and emergency cache flush for index cell
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5798 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-12 09:24:32 +00:00
orbiter
b81c7467d8 protection against too many files in RICELL in case of massive emergency dumps caused by low memory
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5791 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 23:55:47 +00:00
orbiter
14361f1ca4 added log message for index generation in HeapReader
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5787 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 10:34:22 +00:00
orbiter
44e01afa5b - refactoring
- a little bit more abstraction
- new interfaces for index abstraction

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5783 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-07 09:34:41 +00:00
orbiter
82fb60a720 increased memory limit for emergency cache flush
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5782 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-06 15:54:19 +00:00
lotus
596e6215dc fix in case of white space in path name
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5779 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 16:07:24 +00:00
orbiter
b887f4a116 keep more free mem
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5778 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 14:27:04 +00:00
orbiter
c2359f20dd refactoring: better abstraction of reference and metadata prototypes.
This prepares the introduction of index tables other than the reverse text indexes, which are the only ones in use so far. The next application of the reverse index is a citation index.
Moved to version 0.74

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5777 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 13:23:45 +00:00
orbiter
ab656687d7 more strict BLOB initialization .. may also help to save some ram
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5776 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 12:42:24 +00:00
orbiter
f6691411b5 - migration of files from SplitTable (which are used for the URL-DB) to a different file name format.
- the file generation logic is slightly different: files are now limited to a maximum size of one gigabyte and a maximum age of one month.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5773 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 22:15:33 +00:00
orbiter
f21a8c9e9c a different naming scheme for BLOBArray files. This may be necessary if blobs are written more than once per second.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5771 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 15:08:56 +00:00
orbiter
7ba078daa1 - added fast site-operator
- refactoring merge into BLOBArray

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5770 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 13:26:47 +00:00
orbiter
b4126432bc hardening of index dump write process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5769 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 12:24:15 +00:00
orbiter
9bfb2641db - removed deprecated threads
- added automatic http client reset. this was necessary because excessive intranet crawling caused deadlocks. this hack solved the problem.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5768 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 20:13:57 +00:00
orbiter
0139988c04 - added writing of temporary file names and renaming to final file name when index dump/merge are done. Interrupted merges can be cleaned up.
- added clean-up of unfinished merges and unused idx/gap files
- enhanced merge file selection method

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5764 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 12:39:11 +00:00
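
The write-then-rename pattern described in commit 0139988c04 above can be sketched roughly as below. All names here (SafeDump, TMP_SUFFIX, writeDump, cleanUp) are hypothetical and chosen for illustration; the actual YaCy file naming and clean-up rules are not shown in the commit message, and the sketch assumes the target file does not exist yet.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of YaCy.
final class SafeDump {

    // Assumed marker for unfinished files.
    static final String TMP_SUFFIX = ".tmp";

    // Writes the data under a temporary name and renames it to the final
    // name only after the write has completed.
    static Path writeDump(Path dir, String finalName, byte[] data) throws IOException {
        Path tmp = dir.resolve(finalName + TMP_SUFFIX);
        Path target = dir.resolve(finalName);
        Files.write(tmp, data);
        // ATOMIC_MOVE asks the file system for an all-or-nothing rename;
        // it throws AtomicMoveNotSupportedException where that is impossible.
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Deletes files left behind by interrupted writes and returns their count.
    static int cleanUp(Path dir) throws IOException {
        List<Path> leftovers = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + TMP_SUFFIX)) {
            for (Path p : stream) leftovers.add(p);
        }
        for (Path p : leftovers) Files.delete(p);
        return leftovers.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dumpdemo");
        Path written = writeDump(dir, "index.blob", new byte[] {1, 2, 3});
        System.out.println("wrote " + written.getFileName());
        System.out.println("removed leftovers: " + cleanUp(dir));
    }
}
```

A reader of the directory then sees either no file or a complete file under the final name, and anything still carrying the temporary suffix at start-up can be treated as an interrupted dump or merge.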
orbiter
3621aa96ab - added a memory protection for the IndexCell migration
- fix for bad cell file selection

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5763 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 19:17:45 +00:00
orbiter
568e8f1741 fix in unmountBLOB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5762 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 17:03:13 +00:00
orbiter
9da69d6b68 - better selection of files to be merged
- fix for getChannel().close(), which works on windows but not on macs and linux

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5761 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 16:49:02 +00:00
orbiter
d39a5b42ca more care about open file handles. Now files also close on windows and can be deleted afterwards.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5760 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 12:42:12 +00:00
orbiter
029495e64d fixed bug introduced in SVN 5756 in EcoTable.put()
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5759 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 07:51:32 +00:00
orbiter
587838bd09
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5758 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 21:13:53 +00:00
orbiter
d2e2420a68 - added another file selection method for index cell merge
- more hacks to check that files are closed properly and file handles do not exist after files are closed.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5757 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 19:05:08 +00:00
orbiter
96eaecda3e - added migration class to go from index collections to the index cell data structure.
- added better control over file deletion, because this sometimes fails, especially on windows

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 15:31:25 +00:00
orbiter
0f0b4aec75 better index cell merge logic
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5754 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 06:22:27 +00:00
orbiter
fa07234d4e fix for clear method: now deletes files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5752 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-29 21:28:14 +00:00
borg-0300
c450e3746b svn attributes added
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5736 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 15:44:59 +00:00
orbiter
37f892b988 added new concurrent merger class for IndexCell RWI data
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5735 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 14:54:37 +00:00
orbiter
67aaffc0a2 - added Latency control to the crawler:
because of the strongly enhanced indexing speed when using the new IndexCell RWI data structures (> 2000 PPM on my notebook), it is now necessary to control the crawling speed depending on the response time of the target server (which is also YaCy in some intranet indexing use cases).
The latency factor in crawl delay times is derived from the time that a target host takes to answer HTTP requests. For internet domains, the crawl delay is at least twice the response time; in intranet cases the delay time is now half of the response time.

- added API to monitor the latency times of the crawler:
a new api at /api/latency_p.xml returns the current response times of domains, the time when the domain was accessed by the crawler the last time and many more attributes.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 10:21:23 +00:00
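
The latency rule stated in commit 67aaffc0a2 above (twice the response time for internet hosts, half of it for intranet hosts) can be written down as a small calculation. This is a sketch only: the class LatencyDelay, its method names and the configured base delay parameter are assumptions for illustration, not the crawler's real API.

```java
// Hypothetical helper, not part of YaCy.
final class LatencyDelay {

    // Latency-derived part of the delay: 2x the response time for internet
    // hosts, half the response time for intranet hosts.
    static long latencyDelayMillis(long responseTimeMillis, boolean intranet) {
        return intranet ? responseTimeMillis / 2 : responseTimeMillis * 2;
    }

    // Assumption: the crawler also has a configured base delay, and the
    // latency-derived value acts as a lower bound on the effective delay.
    static long crawlDelayMillis(long baseDelayMillis, long responseTimeMillis, boolean intranet) {
        return Math.max(baseDelayMillis, latencyDelayMillis(responseTimeMillis, intranet));
    }

    public static void main(String[] args) {
        // A slow internet host pushes the delay above the base value.
        System.out.println(crawlDelayMillis(500, 300, false));
        // A fast intranet host leaves the base value in effect.
        System.out.println(crawlDelayMillis(500, 300, true));
    }
}
```

With a response time of 300 ms, the latency-derived delay is 600 ms for an internet host and 150 ms for an intranet host, so a slow target automatically receives fewer requests per minute.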
orbiter
0926310461 another performance hack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5731 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 22:33:36 +00:00
orbiter
ebe5d69d14 performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5730 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 22:19:08 +00:00
orbiter
b3f75e48fa - enhanced balancer: auto-solving of waiting-deadlocks
- removed deprecated cache-init size value
- more debug lines for IndexCell cache dump merge

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5728 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 20:21:19 +00:00
orbiter
9a90ea05e0 added a merge operation for IndexCell data structures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5727 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 16:14:31 +00:00
orbiter
d99ff745aa fix for http://forum.yacy-websuche.de/viewtopic.php?p=13378#p13378
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5726 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 10:29:13 +00:00