Commit Graph

82 Commits

Author SHA1 Message Date
orbiter
694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
- changed menu structure slightly

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7583 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-10 23:25:07 +00:00
orbiter
30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7580 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-10 12:35:32 +00:00
low012
3b40b98256 *) set SVN properties
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7567 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-08 01:51:51 +00:00
orbiter
cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 20:36:40 +00:00
orbiter
1110d16af9 performance hack: replaced generic row.getColBytes() call with row.getPrimaryKeyBytes() where the column is 0
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7529 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-25 12:41:27 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to get the LibraryProvider more clean
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00
orbiter
10ae8d961b - cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring)
- cleaned up (removed special code and documentation for 27c3)
- added remote search functions to be used within cora

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7420 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-03 20:52:54 +00:00
f1ori
e4aabaa1c3 * fix negative filelength for files >2G
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7408 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-28 17:25:39 +00:00
f1ori
ee3cef91e8 * fix filesize in ftp crawls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7402 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-28 02:15:22 +00:00
orbiter
e88c428008 fix to ftp loader
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7387 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-18 10:22:54 +00:00
orbiter
9b25a33fd9 - fixed numerous bugs
- better document names
- fixed problem with ftp crawling
- added automatic removal of search results from services that are not online according to the latest network scan: this does not delete the index but just does not show them. after the next network scan when the server is available again, the results are again showed.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7385 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-17 17:30:09 +00:00
orbiter
7bdb13bf7f more fixes to smb crawling: better file names
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7384 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-17 00:52:24 +00:00
orbiter
94c48500cc several fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7383 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-17 00:11:42 +00:00
f1ori
9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7375 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-15 19:20:00 +00:00
orbiter
56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
- integrated new parser into loader processes: enrich document parser
- fixed a concurrent modification exception in kelondro iterator
- hand-over of document size from crawler to indexer

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7374 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-15 00:03:19 +00:00
orbiter
a563b05b60 enhanced crawler:
- added a new queue 'noload' which can be filled with urls where it is already known that the content cannot be loaded. This may be because there is no parser available or the file is too big
- the noload queue is emptied with the parser process which indexes the file names only
- the 'start from file' functionality now also reads from ftp crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7368 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-11 00:31:57 +00:00
orbiter
4e2c14efbb fixed bugs in parser and ftp client
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7360 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-02 11:05:04 +00:00
orbiter
b769cce433 - added a catch-all parser for all documents that cannot be parsed: they will contributed with their document url for the search index only
- enhanced the pdf and torrent parser: better documents titles
- enhanced the ftp client: more time-out time
- fixed bugs in json for search results
- enhanced yacyinteractive.html: added a file type navigator and a download-script generator for search result files

Please have a look at yacyinteractive.html: this will become the hacker-download tool for 27c3!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7355 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-30 16:13:55 +00:00
f1ori
741a87a3e9 * make .yacy-domains crawlable (.yacy-domains are local domains, so only in custom networks/peers)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7334 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-22 19:12:51 +00:00
f1ori
dca9e16f51 * don't index pages, which redirect, twice
* there fore auto-redirection of HTTPClient for crawling is disabled and the old code is reactivated

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7332 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-21 22:46:12 +00:00
f1ori
7d8de34778 * add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-26 16:10:20 +00:00
orbiter
6a166c2040 patches for bad proxy behaviour
- accept ipv6 localhost clients
- index media files (url only)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7238 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-11 11:38:36 +00:00
orbiter
091dd3f6ec - enhanced intranet search speed
- enhanced intranet portscan speed (better time-out)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7227 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-08 10:54:13 +00:00
orbiter
2c549ae341 fixed a number of small bugs:
- better crawl star for files paths and smb paths
- added time-out wrapper for dns resolving and reverse resolving to prevent blockings
- fixed intranet scanner result list check boxes
- prevented htcache usage in case of file and smb crawling (not necessary, documents are locally available)
- fixed rss feed loader
- fixes sitemap loader which had not been restricted to single files (crawl-depth must be zero)
- clearing of crawl result lists when a network switch was done
- higher maximum file size for crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 23:57:58 +00:00
orbiter
d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
- migrated the 'yacy' user agent to 'yacybot' in many client methods since the 'yacy' user agent is only used for the proxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7199 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-27 14:54:32 +00:00
orbiter
48c0d508ac fixes for crawling of smb links (file length not always available)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7190 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-25 22:32:26 +00:00
orbiter
5870b13f3a - code cleanup / added debug line for further investigation in HTTPDemon.parseMultipart
- changed data structure for sorting in search which performs better in that specific case (too many updates)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7150 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-14 21:03:50 +00:00
sixcooler
17eebd4ef8 counting crawler traffic again:
fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2808

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7138 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-11 15:58:15 +00:00
orbiter
65eaf30f77 redesign of crawl profiles data structure. target will be:
- permanent storage of auto-dom statistics in profile
- storage of profiles in WorkTable data structure
not finished yet. No functional change yet.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-31 15:47:47 +00:00
orbiter
3197ca42ed preparations to move the HTCache into cora:
- move the header framework classes to cora
- move the ARC caching classes to cora
- refactoring of code to call these classes from cora

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7068 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 12:32:02 +00:00
orbiter
844f158686 - removed dependencies in header framework:
moved http date methods from DateFormatter to HeaderFramework
  changed logging to log4j
- added ftp load access to MultiProtocolURI
- ensured termination of RSS feed iteration

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7067 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 11:41:12 +00:00
orbiter
5e7081cd19 refactoring towards a unified loading mechanism for MultiProtocolURIs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7065 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 01:08:56 +00:00
orbiter
caece04f26 removed System.err and System.out usage from FTPClient; changed logging to log4j (preferred in yacy.cora)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7064 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-22 22:51:31 +00:00
orbiter
90531f78ff refactoring of the cora package to get subpackages for http and ftp (smb to come)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7063 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-22 22:32:39 +00:00
orbiter
a82a93f2fc - better url double check in crawler
- more logging for error urls

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7032 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-11 09:54:18 +00:00
sixcooler
a6ed6e8cb9 ... migrating to HttpComponents-Client-4.x ...
make the occurrence of multiple header-keys possible

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7031 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-10 21:22:30 +00:00
sixcooler
15e8c13526 ... migrating to HttpComponents-Client-4.x ...
(gzip decompression, httploader, robots, ...)

+ enable proxy-crawling while log is fine

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7001 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-27 01:16:26 +00:00
orbiter
b6fb239e74 redesign of parser interface:
some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it is not possible to parse document containers that contain individual files. An example is a rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure.
This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface has now only one entry point and returns always a set of parsed documents. In case of single documents the parser method returns a set of one documents.
To be compliant with the new interface, the zip and tar parser had been also completely redesigned. All parsers are now much more simple and cleaner in its structure. The switchboard operations had been extended to operate with sets of parsed files, not single parsed files.
additionally, parsing of jar manifest files had been added.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-29 19:20:45 +00:00
orbiter
150cf42a1b migrated all my LGPL 3 -licensed files to the LGPL 2.1 because LGPL 3 is not compatible to the GPL 2
see http://www.gnu.org/licenses/license-list.html for explanation
Since (as far as I know) nobody else has ever contributed to these files I may be allowed to just apply an older license.
You may consider this as a dual-licensing and may use and optionally replicate the older files under GPL 3.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6952 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-28 16:25:14 +00:00
orbiter
777195e8d1 more abstraction for access of LoaderDispatcher and cache
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6937 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-22 12:28:53 +00:00
orbiter
87087f12fe - scanned remote search process and enhanced some data structure and synchronizations here and there
- removed concurrency overhead for small number of index normalizations as it happens during remote search
- removed 'load only parseable' constraint for snippet fetch because some resources may not have any url file extension and these had therefore not been parseable and searcheable since they may become parseable after loading when their mime type is known
- this partly fixes some problems with http://forum.yacy-websuche.de/viewtopic.php?p=20300#p20300 but more changes are necessary to get all expected search results

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6926 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-17 11:59:40 +00:00
orbiter
3f93a0cc8f redesign of remote proxy settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6903 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-26 00:01:16 +00:00
orbiter
11639aef35 - added new protocol loader for 'file'-type URLs
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-25 12:54:57 +00:00
orbiter
6950d8a33d fixes to SMB crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6900 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-23 01:17:44 +00:00
orbiter
2126c03a62 - removed download-limit that can be given for the crawler for non-crawler download tasks. This was necessary because the same procedure was used for other downloads like for the download of dictionary files where a limit is not useful. The limit still stays for the indexer
- migrated the opengeodb downloader to a new version of the opengeodb-dump


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6873 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-14 18:30:11 +00:00
orbiter
cf43bdc87e This is a large bugfix and enhancement commit to support a better location detection for data
- fixes to http file server session handling
- fixes and enhancements to metadata date/time handling
- added dc:publisher metadata field and updated all document parser
- fixed bug in metdata read procedure
- enhanced dublin core and rss parser to understand more fields more properly
- enhanced url selection in case that multiple urls are given in surrogates
- fix for condenser; failure when last word does not end with termination symbol

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6863 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 11:14:05 +00:00
orbiter
c45117f81f fixed dates in metadata
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6860 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-08 22:09:36 +00:00
orbiter
25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-08 00:11:32 +00:00
orbiter
3300930fc5 - (almost) fixed FTP crawler
- integrated/fixed SMB crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6742 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-11 15:43:06 +00:00
orbiter
9623d9e6d2 added a smb loader component for the YaCy crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6737 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-10 08:55:29 +00:00