Commit Graph

184 Commits

Author SHA1 Message Date
orbiter
528da7c9ea removed unused class and added license header for new class
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7680 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 13:14:30 +00:00
orbiter
f6077b3cc0 added more attributes for html parser and enhanced data structures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7679 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 13:09:01 +00:00
sixcooler
4eb9c1e7c3 not setting userAgent from Constructor as default for following calls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7676 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 17:39:16 +00:00
orbiter
d8e934c085 better abstraction of http client identification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 13:35:29 +00:00
sixcooler
a3e707283d not using HTTPConnector anymore
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7674 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 11:46:31 +00:00
orbiter
9f1f47ec67 added some comments to explain the isLocal patch
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7673 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 21:59:56 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognized much more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as beeing local.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
orbiter
3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
- added deletion of solr index in IndexControlRWIs
- added asynchronous adding of large url lists (happens when crawls are startet with file)
- fixed npe in Image display
- replaced language warning with fine logging
- added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups)
- added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list
- added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm)
- fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 16:11:16 +00:00
orbiter
958ff4778e enhanced location search:
search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found.
This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed.

Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search.
Fixed many smaller bugs in connection with location search.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 15:54:19 +00:00
sixcooler
8d63f3b70f just cosmetics - keeping my baby clean :-)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7656 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 00:48:39 +00:00
orbiter
e402622584 removed httpclient-3.1 (this was added with last commit which was a mistake)
the httpclient is required by solrj but no class from solrj is used which references to httpclient-3.1
Instead the YaCy http client library based on the apache http client 4.1 is used using a wrapper class
which is in net.yacy.cora.services.federated.solr.SolrHTTPClient

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7655 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:12:14 +00:00
orbiter
19fd13d3bc Added federated index storage to solr.
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.

The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.

federated solr storage is switched off by default.

To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check whats in solr after indexing, open http://localhost:8983/solr/admin/

Until now it is not possible to use the solr index to search with YaCy in that solr index.
This functionality is now available for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
   that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
   want to use solr instead of YaCy.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:05:04 +00:00
orbiter
c17d102bd8 enhanced speed for OrderedScoreMap inc method and size comparisment in concurrent environments
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7653 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-13 22:04:23 +00:00
orbiter
b788182954 some enhancements to scoring speed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7652 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-13 15:17:00 +00:00
orbiter
4c013d9088 more UTF8 getBytes() performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7649 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-12 05:02:36 +00:00
cominch
9ac02caf00 different initialization of empty variables in alternative constructor. This leads to wrong interpretation of user credentials, resulting in unnecessary "@" in front of host, and different urlhash values.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7646 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-06 10:59:31 +00:00
orbiter
57ce1fb491 reverted synchronization from SVN 7641
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7643 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-04 20:31:02 +00:00
orbiter
7c8e764201 removed synchronization again...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7641 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-04 10:13:30 +00:00
orbiter
96c32e87b0 fixes to crawler and new user-agent crawl-delay handling
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7640 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-04 09:47:18 +00:00
orbiter
cb6f709a16 - enhancements in surrogate reading
- better display of map in location search

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7636 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-02 00:11:37 +00:00
orbiter
564184909a enhanced the surrogate parser: better reading of UTF-8 characters
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7634 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-01 11:05:42 +00:00
orbiter
156cf02703 - added an index constraint 'has location' to the condenser
- added evaluation of the 'has location' constraint to search using the /location operator


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7633 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-31 09:41:30 +00:00
orbiter
41b8d7f655 fix for url normalization (no backpath resolving in post parameters)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7632 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-31 09:40:01 +00:00
orbiter
9b25d07295 - added geo information parsing to html parser
- extended metadata information in index with geolocalisation
- added display of location in yacydoc and ViewFile

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7629 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-30 00:49:47 +00:00
orbiter
f3baaca920 - enhancements to DNS IP caching and crawler speed
- bugfixes (NPEs)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 09:34:10 +00:00
orbiter
78d4c45d09 enhancement during search process: fast fail of search in case that all index feeder have terminated.
This change should affect filtering and navigators and should cause that search navigation gets faster

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7614 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 13:05:51 +00:00
orbiter
2b5f8585bf performance hack for Balancer and ip address parsing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7608 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-17 21:09:18 +00:00
orbiter
a6935e7dc8 fix for active dns resolving: do not resolve in case that the dns server is not available (offline mode)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7604 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-16 07:05:10 +00:00
orbiter
61acf55da4 avoided using a synchronized(this) for the hash computation to prevent that the lock on the object is (accidently) stolen by another thread and replaced this synchronization using the protocol object. Made also the protocol object final.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7602 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-15 09:52:39 +00:00
orbiter
1989ebc24b removed more warnings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7598 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-14 22:52:30 +00:00
orbiter
b62b79675b removed type cast warnings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7594 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-14 21:08:18 +00:00
orbiter
8f11d3a5bb redesigned the ScoreMap classes:
- new concurrent score map using atom operation from java concurrency classes
- redesigned difference beween StaticScore and Dynamic Score into ScoreMap and ReversibleScoreMap allowed that many classes can now use simple ScoreMap Objects which can be used better in concurrent environments using the ConcurrentScoreMap
- switched from DynamicScore to ConcurrentScoreMap usage wherever possible

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7586 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-13 01:41:44 +00:00
orbiter
a564230c48 more enhancements against blocked threads occurred in seed age evaluation (blocks httpd in some cases)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7585 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-12 22:54:41 +00:00
orbiter
694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
- changed menu structure slightly

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7583 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-10 23:25:07 +00:00
orbiter
30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7580 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-10 12:35:32 +00:00
orbiter
7962d35425 - removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons:
1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file
2) browsers do not submit the full path of the selected file even if this path is shown in the input field because of security reasons. There is no work-around or hack to make the submission of the full path possible

- fixed deletion of crawl start point urls in crawl stack and balancer double-check
- fixed a problem with steering self-call (no resolving of localhost)
- added more logging for the crawler to supervise why crawl urls are not taken by the loader
- added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url
- fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-09 12:50:39 +00:00
orbiter
e1b6916423 always try to guess the size of a StringBuilder to prevent too many memory re-allocations
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7572 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-09 09:29:05 +00:00
low012
3b40b98256 *) set SVN properties
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7567 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-08 01:51:51 +00:00
orbiter
cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 20:36:40 +00:00
orbiter
7138f4036b less synchronization, better thread dump tool
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7556 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 15:29:45 +00:00
orbiter
8d14916c74 more patches for a better out-of-memory management
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7555 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 01:45:11 +00:00
sixcooler
65bcc60808 stupid me: revert placement of closing connection which caused unclosed connections
+ reuse sockets

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7543 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-03 04:16:33 +00:00
sixcooler
e3d75d6cd5 Not storing external header in an Header-Array and reduce a loop for its conversion.
Ensure connection close if a OOM is thrown.
Ensure setting resolved host is set at the request.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7542 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-01 13:50:11 +00:00
orbiter
42d90664f3 - fixed a memory leak in the httpc.post method (no finish)
- patched some more memory-saving relevant code
- some more minor bug fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7541 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-01 09:03:33 +00:00
orbiter
38dce547c0 better concurrency (less locking on date formatting) more logging and minor bug fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7540 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-28 06:28:29 +00:00
orbiter
89d337841c more logging for OOMs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7534 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-26 09:27:15 +00:00
orbiter
5e186e0122 continuing the fight against deadlocks during time formatting: better caching.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7531 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-25 21:11:53 +00:00
orbiter
dec24244cf added convenience class to generate UTF StringBody objects with a default UTF8 charset.
Reason: if this is not used in StringBody-Class initialization, a default charset name is parsed.
This is a synchronized process and all classes using default charsets synchronize at that point
Synchronization is omitted if this class is used
 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7530 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-25 13:26:09 +00:00
orbiter
19b2a50578 - enhanced date formatter cache
- added more instances of formatter objects to different classes to make them independent in case of lockings that may applay during synchronization of the date formatter object (date formatting is not thread-safe and must be synchronized therefore)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7528 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-25 12:23:00 +00:00
orbiter
a92d80a545 performance enhancements using an alternative to a insensitive collator (a complex string compare):
- less synchronizations
- better speed
..at most important and commonly used classes: http headers, url parsing and html parsing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7526 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-25 11:23:57 +00:00