Commit Graph

3494 Commits

Author SHA1 Message Date
orbiter
89ec3acb3e - full abstraction of index content type: the kelondro full text index may now also contain indexes of content other than text, e.g. navigation indexes or reverse linking indexes.
- during index joins all word positions are maintained: better ranking for word distance possible; exact phrase match can be implemented soundly


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5804 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 06:34:27 +00:00
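The point of keeping word positions in commit 89ec3acb3e can be illustrated with a minimal sketch. It assumes each word's positions within one document are available as a sorted `int[]`; the class and method names are hypothetical and are not taken from the YaCy sources.

```java
public final class WordPositionSketch {

    // Smallest absolute distance between any position of word A and any
    // position of word B. Both arrays must be sorted ascending.
    // Returns Integer.MAX_VALUE if either array is empty.
    static int minDistance(int[] a, int[] b) {
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            // Advance the pointer that sits at the smaller position.
            if (a[i] < b[j]) i++; else j++;
        }
        return best;
    }

    // True if some occurrence of the second word directly follows an
    // occurrence of the first word, i.e. an exact two-word phrase match.
    static boolean isPhrase(int[] first, int[] second) {
        int i = 0, j = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;
            if (second[j] == wanted) return true;
            if (second[j] < wanted) j++; else i++;
        }
        return false;
    }
}
```

A ranking function could, for example, prefer documents with a smaller `minDistance`; how YaCy actually weights word distance is not stated in the commit message.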
borg-0300
7a48090fcf - fix for "uk" language
- svn attributes added

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5803 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-14 11:40:44 +00:00
orbiter
dc2af61bc9 allow up to 50 results from remote peers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5802 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-13 21:47:57 +00:00
orbiter
c0e8ed5461 fixed problem with not http client
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-13 21:21:47 +00:00
orbiter
8862a2fed0 ups
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5799 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-12 10:22:21 +00:00
orbiter
de68948bc5 better handling of free memory computation and emergency cache flush for index cell
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5798 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-12 09:24:32 +00:00
f1ori
fcb77c3140 * added .im (Isle of Man) to TLD-list
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5794 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-10 23:06:48 +00:00
orbiter
b81c7467d8 protection against too many files in RICELL in case of massive emergency dumps caused by low memory
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5791 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 23:55:47 +00:00
orbiter
d4d87d90c4 - extended experimental wikipedia dump parser
- removed historic, possibly unused code from wiki parser that was in conflict with actual wikipedia wiki code

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5790 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 14:55:20 +00:00
orbiter
c3aff2521e fix for NPE
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5789 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 13:32:56 +00:00
orbiter
57c00dd8c9 fix for bad filtering of common http error
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5788 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 13:20:09 +00:00
orbiter
14361f1ca4 added log message for index generation in HeapReader
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5787 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-09 10:34:22 +00:00
orbiter
c08f9b36a4 refactoring of wiki parser.
This was done to prepare the wiki parser as a parser for wikipedia dumps, which will be used for performance tests (to omit crawling)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5785 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-08 15:28:45 +00:00
orbiter
44e01afa5b - refactoring
- a little bit more abstraction
- new interfaces for index abstraction

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5783 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-07 09:34:41 +00:00
orbiter
82fb60a720 increased memory limit for emergency cache flush
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5782 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-06 15:54:19 +00:00
low012
9180617dd9 *) Classes to handle import of lists (especially blacklists) from XML files, not used yet, but will be used soon.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5780 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-05 13:36:44 +00:00
lotus
596e6215dc fix in case of white space in path name
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5779 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 16:07:24 +00:00
orbiter
b887f4a116 keep more free mem
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5778 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 14:27:04 +00:00
orbiter
c2359f20dd refactoring: better abstraction of reference and metadata prototypes.
This prepares the introduction of index tables other than the reverse text indexes, which are the only kind used so far. The next application of the reverse index is a citation index.
Moved to version 0.74

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5777 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 13:23:45 +00:00
orbiter
ab656687d7 more strict BLOB initialization .. may also help to save some ram
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5776 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 12:42:24 +00:00
orbiter
5b138ada16 fixes to web structure reference collection and url construction
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5775 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 08:29:40 +00:00
orbiter
a29a11e526 added evaluation of incoming links in webstructure api
the api hash changed, new XML schema.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5774 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 07:59:49 +00:00
orbiter
f6691411b5 - migration of files from SplitTable (which are used for the URL-DB) to a different file name format.
- the file generation logic is slightly different: files may now have only a maximum size of one gigabyte and a maximum age of one month.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5773 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 22:15:33 +00:00
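The size and age limits named in commit f6691411b5 amount to a simple rollover check. The sketch below is an illustration only: the class name, the method name, and the reading of "one month" as 30 days are assumptions, not details from the YaCy sources.

```java
public final class SplitTableRolloverSketch {

    // One gigabyte, as stated in the commit message.
    static final long MAX_SIZE_BYTES = 1024L * 1024L * 1024L;

    // Assumption: "one month" is approximated as 30 days.
    static final long MAX_AGE_MILLIS = 30L * 24L * 60L * 60L * 1000L;

    // Decide whether a new table file should be started instead of
    // appending to the current one.
    static boolean needsNewFile(long sizeBytes, long createdMillis, long nowMillis) {
        boolean tooLarge = sizeBytes >= MAX_SIZE_BYTES;
        boolean tooOld = nowMillis - createdMillis >= MAX_AGE_MILLIS;
        return tooLarge || tooOld;
    }
}
```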
shostakovich
1f37cc6107 Robots.txt is now reused after one day. See forum-topic:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1669&p=13565#p13565

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5772 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 15:29:36 +00:00
orbiter
f21a8c9e9c a different naming scheme for BLOBArray files. This may be necessary if blobs are written more often than once in a second.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5771 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 15:08:56 +00:00
orbiter
7ba078daa1 - added fast site-operator
- refactoring merge into BLOBArray

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5770 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 13:26:47 +00:00
orbiter
b4126432bc hardening of index dump write process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5769 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 12:24:15 +00:00
orbiter
9bfb2641db - removed deprecated threads
- added automatic http client reset. this was necessary because excessive intranet crawling caused deadlocks. this hack solved the problem.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5768 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 20:13:57 +00:00
orbiter
293290c317 fix for bad assert in last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5767 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 15:17:14 +00:00
orbiter
bd409fb7ba added web structure analysis for a specific domain that can be requested from the api.
Example:
http://localhost:8080/api/webstructure.xml?about=www.yacy.net
returns an XML document with the following content:

<?xml version="1.0"?>
<webstructure>
<domains reference="reverse" count="1" maxref="300">
<domain host="www.yacy.net" id="FXg39Q" date="20090401">
  <citation host="java.sun.com" id="o-R3yY" count="1" />
  <citation host="yacy-suche.de" id="-KCLaB" count="1" />
  <citation host="suma-ev.de" id="VRAHIA" count="1" />
  <citation host="www.kit.edu" id="EMaLDQ" count="1" />
  <citation host="yacy.net" id="Fh1hyQ" count="1" />
  <citation host="www.fzk.de" id="V2Kl-A" count="1" />
  <citation host="en.wikipedia.org" id="rwtdfR" count="3" />
  <citation host="vimeo.com" id="MmdQDY" count="3" />
  <citation host="liebel.fzk.de" id="sX4ozA" count="6" />
</domain>
</domains>
</webstructure>


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5766 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 14:53:23 +00:00
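A client could read the response shown in commit bd409fb7ba with the JDK's built-in DOM parser. The following is a minimal sketch that relies only on the element and attribute names visible in the example above (`citation`, `host`, `count`); the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public final class WebStructureSketch {

    // Map each citing host to its citation count, in document order.
    static Map<String, Integer> citationCounts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList citations = doc.getElementsByTagName("citation");
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (int i = 0; i < citations.getLength(); i++) {
                Element citation = (Element) citations.item(i);
                counts.put(citation.getAttribute("host"),
                        Integer.parseInt(citation.getAttribute("count")));
            }
            return counts;
        } catch (Exception e) {
            // Malformed XML or a non-numeric count attribute.
            throw new IllegalArgumentException("cannot parse webstructure XML", e);
        }
    }
}
```

In practice the XML string would come from an HTTP request to the `webstructure.xml` URL given in the commit message; fetching is left out to keep the sketch short.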
orbiter
b6c2167143 - patch for bad web structure dumps
- added automatic slow down of accesses to specific domains when access to a web page fails

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 13:21:47 +00:00
orbiter
0139988c04 - added writing of temporary file names and renaming to final file name when index dump/merge are done. Interrupted merges can be cleaned up.
- added clean-up of unfinished merges and unused idx/gap files
- enhanced merge file selection method

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5764 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 12:39:11 +00:00
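The write-to-temporary-name-then-rename pattern from commit 0139988c04 can be sketched with `java.nio.file`. This is not the YaCy implementation: the `.tmp` suffix, the class name, and the method names are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class TempFileWriteSketch {

    // Write the data under a temporary name first, then rename it to the
    // final name. A reader never sees a half-written final file.
    static void writeThenRename(Path finalFile, byte[] data) {
        Path tmp = finalFile.resolveSibling(finalFile.getFileName() + ".tmp");
        try {
            Files.write(tmp, data);
            Files.move(tmp, finalFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Delete leftover temporary files from interrupted writes and return
    // how many were removed.
    static int cleanUp(Path dir) {
        int removed = 0;
        try (DirectoryStream<Path> leftovers = Files.newDirectoryStream(dir, "*.tmp")) {
            for (Path leftover : leftovers) {
                Files.delete(leftover);
                removed++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return removed;
    }
}
```

Calling `cleanUp` once at startup, before any new dump or merge begins, is one way to discard the results of interrupted runs.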
orbiter
3621aa96ab - added a memory protection for the IndexCell migration
- fix for bad cell file selection

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5763 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 19:17:45 +00:00
orbiter
568e8f1741 fix in unmountBLOB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5762 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 17:03:13 +00:00
orbiter
9da69d6b68 - better selection of files to be merged
- fix for getChannel().close(), which works on windows but not on macs and linux

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5761 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 16:49:02 +00:00
orbiter
d39a5b42ca more care about open file handles. Now files also close on windows and can be deleted afterwards.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5760 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 12:42:12 +00:00
orbiter
029495e64d fixed bug introduced in SVN 5756 in EcoTable.put()
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5759 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 07:51:32 +00:00
orbiter
587838bd09 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5758 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 21:13:53 +00:00
orbiter
d2e2420a68 - added another file selection method for index cell merge
- more hacks to check that files are closed properly and file handles do not exist after files are closed.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5757 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 19:05:08 +00:00
orbiter
96eaecda3e - added migration class to go from index collections to the index cell data structure.
- added better control over file deletion, because this sometimes fails, especially on windows

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 15:31:25 +00:00
orbiter
0f0b4aec75 better index cell merge logic
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5754 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 06:22:27 +00:00
orbiter
832fef670f migration of urls-files into subdirectory METADATA
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5753 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 04:41:06 +00:00
orbiter
fa07234d4e fix for clear method: now deletes files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5752 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-29 21:28:14 +00:00
lulabad
df87e4dbf6 missing count of sent Index and URLs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5747 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-28 20:49:58 +00:00
borg-0300
c450e3746b svn attributes added
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5736 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 15:44:59 +00:00
orbiter
37f892b988 added new concurrent merger class for IndexCell RWI data
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5735 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 14:54:37 +00:00
borg-0300
8c494afcfe svn attributes added
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5734 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 11:21:32 +00:00
orbiter
67aaffc0a2 - added Latency control to the crawler:
because of the strongly enhanced indexing speed when using the new IndexCell RWI data structures (> 2000PPM on my notebook), it is now necessary to control the crawling speed depending on the response time of the target server (which is also YaCy in case of some intranet indexing use cases).
The latency factor in crawl delay times is derived from the time that a target host takes to answer HTTP requests. For internet domains, the crawl delay is at least twice the response time; in intranet cases the delay time is now half of the response time.

- added API to monitor the latency times of the crawler:
a new api at /api/latency_p.xml returns the current response times of domains, the time when the crawler last accessed each domain, and many more attributes.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 10:21:23 +00:00
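The latency rule described in commit 67aaffc0a2 can be written down as a short sketch. The class name, method name, and the `configuredMinDelayMillis` parameter are assumptions for illustration; the commit message states only the factors (twice the response time as a minimum for internet hosts, half the response time for intranet hosts).

```java
public final class CrawlDelaySketch {

    // Compute the waiting time before the next request to the same host.
    // responseTimeMillis: how long the host took to answer the last request.
    // intranet: true for intranet hosts, false for internet domains.
    // configuredMinDelayMillis: hypothetical base delay from the crawl settings.
    static long crawlDelayMillis(long responseTimeMillis, boolean intranet,
                                 long configuredMinDelayMillis) {
        if (intranet) {
            // Intranet: half of the response time.
            return responseTimeMillis / 2;
        }
        // Internet: never less than twice the response time.
        return Math.max(configuredMinDelayMillis, 2 * responseTimeMillis);
    }
}
```

With a base delay of 500 ms, a host that answers in 300 ms would be revisited no sooner than 600 ms later, while the same response time on an intranet host gives 150 ms.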
orbiter
0926310461 another performance hack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5731 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 22:33:36 +00:00
orbiter
ebe5d69d14 performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5730 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 22:19:08 +00:00