71 Commits

Author SHA1 Message Date
reger
bfb0d4c69b - add language detection from <html lang="xx"> tag
- add jaudiotagger jar to Netbeans-IDE project classpath
2012-10-09 20:02:58 +02:00
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
e54ac38095 - some corrections in usage of getFile() and getFileName()
- added more attributes in json response writer according to yacy
servlet
2012-09-11 23:28:21 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
orbiter
0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
- replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be
done automatically
- implemented some isEmpty() methods
2012-07-10 22:59:03 +02:00
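Illustration for commit 0cbda0b2b8: a minimal sketch of the kind of rewrite the message describes, where a length() or size() comparison is replaced by isEmpty(). The class and method names are hypothetical and are not taken from the YaCy sources.

```java
import java.util.List;

// Hypothetical illustration; none of these names come from the YaCy code base.
final class EmptyCheckSketch {

    // Before: emptiness expressed as a comparison against zero.
    static boolean allEmptyOld(String s, List<String> items) {
        return s.length() == 0 && items.size() == 0;
    }

    // After: the same check written with isEmpty().
    static boolean allEmptyNew(String s, List<String> items) {
        return s.isEmpty() && items.isEmpty();
    }

    // Before: "has content" expressed as length() > 0.
    static boolean hasContentOld(String s) {
        return s.length() > 0;
    }

    // After: the negated isEmpty() form.
    static boolean hasContentNew(String s) {
        return !s.isEmpty();
    }
}
```

Both variants return the same result for non-null arguments; the isEmpty() form only states the intent more directly.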
Michael Peter Christen
b1e7c11fba fix for pattern matcher in html parser 2012-07-05 14:24:03 +02:00
orbiter
7f851d62a7 replaced HashARC with SizeLimited Objects which are less costly 2012-07-04 21:56:25 +02:00
orbiter
78fc3cf8f8 refactoring and new usage of SentenceReader: this class turned out to
be one of the major CPU users during snippet verification. The class was
inefficient for two reasons:
- it used an overly complex input stream, generated from sources and UTF8
byte conversions. The BufferedReader added significant overhead.
- to feed data into the SentenceReader, multiple toString/getBytes calls
were applied until a buffered Reader could be created from an input stream.
These superfluous conversions have been removed.
- the best source for the SentenceReader is a String. Therefore the
production of Strings is now enforced inside the Document class.
2012-07-04 21:15:10 +02:00
Michael Peter Christen
ad09b786bf clean up parser data 2012-07-03 17:20:41 +02:00
Michael Peter Christen
276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
A limit was necessary because some web pages have such huge numbers of
links that the number of links alone can easily cause an OOM.
The question of whether a limit of 1000 links is sufficient or too low
must be answered by testing this feature.
2012-07-03 17:06:20 +02:00
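Illustration for commit 276a66a793: a minimal sketch of a link store that stops accepting entries once a cap is reached, so that a page with an extreme number of links cannot grow the list without bound. BoundedLinkCollector and its methods are hypothetical names; only the figure of 1000 comes from the commit message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the parser class used in YaCy.
final class BoundedLinkCollector {
    private final int maxLinks;
    private final List<String> links = new ArrayList<>();
    private int dropped = 0;

    BoundedLinkCollector(int maxLinks) {
        if (maxLinks < 0) throw new IllegalArgumentException("maxLinks must be >= 0");
        this.maxLinks = maxLinks;
    }

    // Stores the link if the cap is not reached yet; otherwise only counts it.
    boolean add(String url) {
        if (links.size() >= maxLinks) {
            dropped++;
            return false;
        }
        links.add(url);
        return true;
    }

    // Read-only view of the links that were kept.
    List<String> links() {
        return Collections.unmodifiableList(links);
    }

    // Number of links that were rejected because of the cap.
    int dropped() {
        return dropped;
    }
}
```

Counting the dropped links is an addition of this sketch; it would make it easier to judge during testing whether the cap is too low.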
Michael Peter Christen
de903a53a0 parser refactoring & hacks 2012-07-03 06:06:38 +02:00
Michael Peter Christen
508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
the meta-refresh html header field.
2012-06-28 13:27:45 +02:00
Michael Peter Christen
f3167def64 do not fill the keywords with title content if keywords do not exist. 2012-06-27 13:07:02 +02:00
Michael Peter Christen
77f795756c fixing redirects and status codes: the status code is now stored in
ResponseHeader to make it available for later evaluations, such as storage
in solr.
2012-06-25 18:17:31 +02:00
Michael Peter Christen
be928815fc fixed wrong parsing of style and script 2012-06-17 17:18:19 +02:00
Michael Peter Christen
0284a4d88f more fixes for double precision of coordinates 2012-06-04 23:37:41 +02:00
Michael Peter Christen
9b4c699526 enhanced location search:
- search requests are now made using a map boundary
- search results are only computed for the map boundary
- the number of results is adapted to the results in the visible range
- added a double-buffering for the search result markers
- added a search query option for the search results:
/radius/<lat>/<lon>/<radius>
2012-05-31 22:39:53 +02:00
Michael Peter Christen
c15fcde1c8 add-on to latest commit 2012-05-21 17:52:30 +02:00
Michael Peter Christen
ba6aaabc51 refactoring + parser bugfixes 2012-05-04 17:28:27 +02:00
Michael Peter Christen
453010bd68 - solved problems with backpath normalization
- redesigned in/outbound link handover
- removed iframe links from inbound/outbound in solr scheme
2012-04-27 16:48:51 +02:00
Michael Peter Christen
8d63a5887c bugfixes 2012-02-02 23:38:23 +01:00
Michael Peter Christen
9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This also affects the balancer, since it
no longer needs to prepare the pre-selected crawl list for monitoring. As
a result:
- it is no longer possible to see the correct order of the next
to-be-crawled links, since that depends on the actual state of the
balancer stack the next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
2012-02-02 21:33:42 +01:00
Michael Peter Christen
7e4e3fe5b6 free some memory after parsing html 2012-02-02 09:55:27 +01:00
Michael Peter Christen
4540174fe0 memory hacks 2012-02-02 07:37:00 +01:00
Michael Peter Christen
b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
content
2012-01-10 03:02:17 +01:00
Michael Christen
c04bfaa51b refactoring 2011-12-16 23:59:29 +01:00
Michael Christen
1f4afb4dc0 performance hacks 2011-12-15 15:15:53 +01:00
Al Sutton
8993cac4d8 Initial performance improvements 2011-11-30 11:15:54 +00:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
orbiter
1c007188ad bugfixes in html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-31 16:02:06 +00:00
orbiter
5dd2efc9a2 - bugfixes in html parser
- new fields in solr
- extended file viewer to debug parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 15:52:25 +00:00
orbiter
51cf697acd refactoring: moved all score-related classes to new ranking package
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7889 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 22:37:53 +00:00
orbiter
299af4943c added another memory protection hack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7849 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-17 17:55:08 +00:00
orbiter
bda3eec0ff added parsing of canonical link element to html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-01 16:38:01 +00:00
orbiter
9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7787 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-20 22:29:45 +00:00
orbiter
0c1b29f3c9 - applied many small performance hacks
- added a memory limitation in the zip parser and the pdf parser
- added search throttling: if too many search queries are still waiting to be computed, new requests are not accepted for some time. If after one second there is still no capacity to perform another search, the search terminates with no results. This should only happen in DoS-like situations or under heavy load on a peer, for example if it is integrated in metager.
- added a search cache deletion process that removes search requests in case that throttling happens

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7766 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-01 19:31:56 +00:00
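Illustration for the search throttling in commit 0c1b29f3c9: a minimal sketch, assuming that "space to perform another search" can be modelled as a fixed number of slots. A request waits a bounded time for a slot and returns an empty result if none becomes free. SearchThrottle, its constructor parameters and the use of a Semaphore are assumptions of this sketch, not the mechanism used in YaCy.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch; names and design are not taken from the YaCy sources.
final class SearchThrottle {
    private final Semaphore slots;
    private final long waitMillis;

    SearchThrottle(int maxConcurrentSearches, long waitMillis) {
        this.slots = new Semaphore(maxConcurrentSearches);
        this.waitMillis = waitMillis;
    }

    // Runs the query if a slot becomes free within waitMillis,
    // otherwise terminates with no results.
    List<String> search(Supplier<List<String>> query) {
        boolean acquired;
        try {
            acquired = slots.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.emptyList();
        }
        if (!acquired) {
            return Collections.emptyList();
        }
        try {
            return query.get();
        } finally {
            slots.release();
        }
    }
}
```

With new SearchThrottle(n, 1000) the wait matches the one second mentioned in the commit message; the slot count n is a free parameter of the sketch.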
orbiter
3ed4a09368 small features, some bug fixes and performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-23 21:08:04 +00:00
orbiter
021840e5ba removed (almost) deadlocks and unnecessary CPU load
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7726 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-17 00:00:01 +00:00
orbiter
4e8fa03514 added more attributes to html evaluation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7688 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 15:36:44 +00:00
orbiter
f6077b3cc0 added more attributes for html parser and enhanced data structures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7679 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 13:09:01 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognized much more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as being local.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
orbiter
3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
- added deletion of solr index in IndexControlRWIs
- added asynchronous adding of large url lists (happens when crawls are started with file)
- fixed npe in Image display
- replaced language warning with fine logging
- added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups)
- added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list
- added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm)
- fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 16:11:16 +00:00
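Illustration for the chunked transfers in commit 3d5104d357: a minimal sketch of buffering documents and handing them over in chunks, with one commit per chunk. It does not use the Solr client API; the consumer passed to the constructor stands in for "send this chunk and commit". ChunkedFeeder and its methods are hypothetical names; only the chunk size of 50 comes from the commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; not the Solr connector used in YaCy.
final class ChunkedFeeder<T> {
    private final int chunkSize;
    private final Consumer<List<T>> sendAndCommit;
    private final List<T> buffer = new ArrayList<>();

    ChunkedFeeder(int chunkSize, Consumer<List<T>> sendAndCommit) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be >= 1");
        this.chunkSize = chunkSize;
        this.sendAndCommit = sendAndCommit;
    }

    // Buffers one document and flushes as soon as a full chunk is collected.
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Hands over whatever is buffered; call once at the end for the remainder.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sendAndCommit.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Feeding 120 documents through new ChunkedFeeder<>(50, ...) followed by flush() results in three hand-overs of 50, 50 and 20 documents.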
orbiter
958ff4778e enhanced location search:
search is now done using verify=false (instead of verify=cacheonly), which allows many more targets to be found.
This revealed a bug where no location information (and no other metadata information) was used from the metadata if cache=false is requested. The bug was fixed.

Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search.
Fixed many smaller bugs in connection with location search.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 15:54:19 +00:00
orbiter
0430a94eaa the location search now shows only locations that are attached as metadata to web pages, not re-evaluated locations
- added parser for in-text appearing geo-locations
- added geo-locations to rss search result
- added evaluation of metadata-attached geo-locations in yacysearch_location to show search results within a map


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7631 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-30 23:26:36 +00:00
orbiter
9b25d07295 - added geo information parsing to html parser
- extended metadata information in index with geolocalisation
- added display of location in yacydoc and ViewFile

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7629 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-30 00:49:47 +00:00
orbiter
78d4c45d09 enhancement during search process: fast fail of the search if all index feeders have terminated.
This change should affect filtering and navigators and should make search navigation faster

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7614 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 13:05:51 +00:00
orbiter
30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7580 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-10 12:35:32 +00:00
orbiter
cb1f49d0f2 replaced all 'new String' calls that used the default encoding (charset argument missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache lookup for the Charset object that hashes the String 'UTF-8'.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 20:36:40 +00:00
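Illustration for the two encoding commits above (30aed9824a and cb1f49d0f2): a minimal sketch of String and byte conversions that go through a Charset constant instead of the charset name "UTF-8" or the platform default. Utf8Sketch is a hypothetical class, not YaCy's own UTF8 helper, and it uses java.nio.charset.StandardCharsets from the Java standard library for the constant.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the UTF8 class from the YaCy sources.
final class Utf8Sketch {
    // Pre-defined constant: no lookup by charset name on each call.
    static final Charset UTF8 = StandardCharsets.UTF_8;

    // Replaces new String(bytes) and new String(bytes, "UTF-8").
    static String string(byte[] bytes) {
        return new String(bytes, UTF8);
    }

    // Replaces s.getBytes() and s.getBytes("UTF-8").
    static byte[] bytes(String s) {
        return s.getBytes(UTF8);
    }
}
```

Besides skipping the name lookup, the Charset overloads do not declare the checked UnsupportedEncodingException that the String-name overloads require callers to handle.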
orbiter
e717bf74ba more logging, more care about OOMs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7503 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-21 09:51:05 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to make the LibraryProvider cleaner
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00