Commit Graph

178 Commits

Author SHA1 Message Date
Michael Peter Christen
9b4c699526 ehanced location search:
- search request are now made using a map boundary
- search results are only computed for the map boundary
- the number of results is adopted to the results in the visible range
- added a double-buffering for the search result markers
- added a search query option for the search results:
/radius/<lat>/<lon>/<radius>
2012-05-31 22:39:53 +02:00
Michael Peter Christen
4d3cc02168 replaced old bzip2 library against better documented commons-compress
package from http://commons.apache.org/compress/
2012-05-28 23:53:48 +02:00
Michael Peter Christen
c15fcde1c8 add-on to latest commit 2012-05-21 17:52:30 +02:00
Michael Peter Christen
81737dcb18 removed stack trace from swf parser since we cant do anything there 2012-05-21 02:27:06 +02:00
Michael Peter Christen
acf8d521a2 fix for http://bugs.yacy.net/view.php?id=126 2012-05-19 00:21:03 +02:00
Michael Peter Christen
89142d1e8d removed (not all) warnings 2012-05-16 13:42:32 +02:00
Roland 'Quix0r' Haeder
a093ccf5eb Now used synchronization in all close() methods to make sure all objects
are 'closed' in an ordered way

Conflicts:
	source/de/anomic/http/server/ChunkedInputStream.java
	source/de/anomic/http/server/ChunkedOutputStream.java
	source/de/anomic/http/server/ContentLengthInputStream.java
	source/net/yacy/cora/protocol/Domains.java
	source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java
	source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java
	source/net/yacy/document/content/dao/PhpBB3Dao.java
	source/net/yacy/document/parser/html/AbstractTransformer.java
	source/net/yacy/kelondro/blob/BEncodedHeap.java
	source/net/yacy/kelondro/blob/HeapReader.java
	source/net/yacy/kelondro/index/RAMIndexCluster.java
	source/net/yacy/kelondro/io/ByteCountInputStream.java
	source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java
	source/net/yacy/kelondro/table/SQLTable.java
2012-05-14 07:41:55 +02:00
Michael Peter Christen
ba6aaabc51 refactoring + parser bugfixes 2012-05-04 17:28:27 +02:00
Michael Peter Christen
09484955dc added new entry class for embed tags 2012-04-27 17:48:51 +02:00
Michael Peter Christen
453010bd68 - solved problems with backpath normalization
- redesigned in/outbound link handover
- removed iframe links from inbound/outbound in solr scheme
2012-04-27 16:48:51 +02:00
Michael Peter Christen
659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
2012-04-24 16:07:03 +02:00
Michael Peter Christen
4d5da75814 fix for parser problem if a <a>-tag is 'within' html tags with unclosed
tags. That prevented the <a> tags from beeing recognized. This is a fix
for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516
2012-04-18 10:30:04 +02:00
Michael Peter Christen
046f3a7e8d check if httpc has decompressed the release file and rename the file
from .tar.gz to .tar if that happened
2012-04-16 09:50:55 +02:00
Michael Peter Christen
8d63a5887c bugfixes 2012-02-02 23:38:23 +01:00
Michael Peter Christen
9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This affects also the balancer since that
does not need to prepare the pre-selected crawl list for monitoring. As
a effect:
- it is no more possible to see the correct order of next to-be-crawled
links, since that depends on the actual state of the balancer stack the
next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
2012-02-02 21:33:42 +01:00
Michael Peter Christen
7e4e3fe5b6 free some memory after parsing html 2012-02-02 09:55:27 +01:00
Michael Peter Christen
4540174fe0 memory hacks 2012-02-02 07:37:00 +01:00
Michael Peter Christen
1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/document/parser/pdfParser.java
2012-01-24 20:42:30 +01:00
reger
32104360ce PDFParser - return at least first 3 pages of PDF
fix for pdf parsing without returning parsed text due to interruption by
time out.
2012-01-23 20:58:36 +01:00
Michael Peter Christen
eadb58dd87 small enhancements in pdf parser 2012-01-23 00:46:02 +01:00
reger
b616de5973 PDFParser - return at least first 3 pages of PDF
fix for pdf parsing without returning parsed text due to interruption by time out.
2012-01-21 03:15:12 +01:00
Michael Peter Christen
b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
content
2012-01-10 03:02:17 +01:00
Michael Christen
c04bfaa51b refactoring 2011-12-16 23:59:29 +01:00
Michael Christen
1f4afb4dc0 performance hacks 2011-12-15 15:15:53 +01:00
Michael Christen
9cd469e6d6 added pull request from als plus an NPE fix 2011-12-04 12:15:03 +01:00
Al Sutton
39898cb94a Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer 2011-12-01 11:30:14 +00:00
Al Sutton
4c67a964a1 Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer 2011-12-01 11:28:52 +00:00
Al Sutton
3f9b9f953f Added close() to ensure buffer close actions are invoked 2011-12-01 11:25:59 +00:00
Al Sutton
d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown 2011-12-01 11:20:13 +00:00
Al Sutton
8993cac4d8 Initial performance improvements 2011-11-30 11:15:54 +00:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
orbiter
804e48888b smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter
should also be a little bit faster

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8057 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-18 13:09:07 +00:00
low012
277b454a62 *) added comments
*) minor refactoring

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7971 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 13:16:52 +00:00
orbiter
8a428d3e77 ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7958 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-15 11:17:38 +00:00
orbiter
0819e1d397 protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7942 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-09 23:00:45 +00:00
orbiter
49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 10:08:57 +00:00
orbiter
610b01e1c3 - added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index.
- some refactoring for mime type discovery

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7919 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-01 16:05:00 +00:00
orbiter
1c007188ad bugfixes in html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-31 16:02:06 +00:00
orbiter
231074bf0a fixed a parsing bug by reverting SVN 7766
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7910 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-28 22:59:19 +00:00
orbiter
5dd2efc9a2 - bugfixes in html parser
- new fields in solr
- extended file viewer to debug parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 15:52:25 +00:00
orbiter
51cf697acd refactoring: moved all score-related classes to new ranking package
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7889 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 22:37:53 +00:00
sixcooler
9170a434ed throwing an exception again in FileUtils.copy(reader, writer)
OOMs could occour here and should not be ignored

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7858 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-01 23:32:58 +00:00
orbiter
299af4943c added another memory protection hack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7849 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-17 17:55:08 +00:00
orbiter
b06faab9d3 do not allocate a StringBuilder object in case that there is not enough memory for that
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7846 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-16 23:17:19 +00:00
orbiter
bda3eec0ff added parsing of canonical link element to html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-01 16:38:01 +00:00
orbiter
9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7787 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-20 22:29:45 +00:00
orbiter
77fe69395d added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7774 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-05 20:04:41 +00:00
orbiter
0c1b29f3c9 - applied many small performance hacks
- added a memory limitation in the zip parser and the pdf parser
- added a search throttling: if there are too many search queries are still to be computed, then new requests are not accepted for some time. if after a one second still no space is there to perform another search, the search terminates with no results. this case should only happen in case of DoS-like situations and in case of strong load on a peer like if it is integrated in metager.
- added a search cache deletion process that removes search requests in case that throttling happens

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7766 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-01 19:31:56 +00:00
orbiter
4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes).
The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 08:24:54 +00:00
orbiter
e28bd0d038 fix for some possible causes of memory leaks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7741 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 14:35:32 +00:00