Commit Graph

271 Commits

Author SHA1 Message Date
Michael Peter Christen
f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
only links where the content can be parsed. All non-parseable links are
placed into the noload queue. The search process must therefore be able
to filter out non-text search results.
- This fixes the problem that image search results appeared in the text
search.
- The interactive search can now retrieve ALL types of links
- The p2p interface is now extended to retrieve only certain types of
links (text, image, video, apps)
- The search process has an extension to filter the right document type
according to the search query
2012-04-22 02:05:17 +02:00
Michael Peter Christen
a1a5b015d8 refactoring: moved document Classification to cora package 2012-04-21 21:31:13 +02:00
Michael Peter Christen
4d5da75814 fix for parser problem if an <a>-tag is 'within' html tags with unclosed
tags. That prevented the <a> tags from being recognized. This is a fix
for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516
2012-04-18 10:30:04 +02:00
Michael Peter Christen
046f3a7e8d check if httpc has decompressed the release file and rename the file
from .tar.gz to .tar if that happened
2012-04-16 09:50:55 +02:00
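
The check described in 046f3a7e8d might look roughly like the sketch below. It assumes the decision is made from the first two bytes of the downloaded file (the gzip magic number 0x1f 0x8b). The class and method names are hypothetical and are not taken from the YaCy sources.

```java
// Hypothetical sketch, not YaCy's actual code.
final class ReleaseFileSketch {

    // A gzip stream always starts with the magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    // If the file is named .tar.gz but its content is no longer gzipped
    // (for example because the HTTP client already decompressed it),
    // return the name with the trailing ".gz" removed.
    static String correctedName(String fileName, byte[] head) {
        if (fileName.endsWith(".tar.gz") && !looksGzipped(head)) {
            return fileName.substring(0, fileName.length() - ".gz".length());
        }
        return fileName;
    }
}
```

A caller would read the first bytes of the downloaded file, ask for `correctedName(...)`, and rename the file only if the returned name differs.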
Michael Peter Christen
e101c2e0e2 added changes from copperdust (submitted by email):
1. Improved and fixed language detection:
	1.1 Identificator.java - recognition fix (improved)
	1.2 DCEntry.java - fix (changed detection order because detection from
the tld is incorrect in many cases)
	1.3 MultiProtocolURI.java - fixed and enhanced language from tld
detection (all currently used top-level domains; ccTLD added but not
tested).
2. Ukrainian language update.
3. Main Slavic languages langstats (tested and works fine).
2012-02-22 12:21:27 +01:00
Michael Peter Christen
8d63a5887c bugfixes 2012-02-02 23:38:23 +01:00
Michael Peter Christen
9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This affects also the balancer since that
does not need to prepare the pre-selected crawl list for monitoring. As
an effect:
- it is no longer possible to see the correct order of next to-be-crawled
links, since that depends on the actual state of the balancer stack the
next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
2012-02-02 21:33:42 +01:00
Michael Peter Christen
7e4e3fe5b6 free some memory after parsing html 2012-02-02 09:55:27 +01:00
Michael Peter Christen
4540174fe0 memory hacks 2012-02-02 07:37:00 +01:00
Michael Peter Christen
2e5cd6a1b2 fixed parser extension deny list generation and usage 2012-02-01 00:15:59 +01:00
Michael Peter Christen
8bee1472c9 there is no noindex, only nofollow in links 2012-01-31 23:46:35 +01:00
Michael Peter Christen
c560a582ac fix for single-word vocabulary lines 2012-01-26 16:44:30 +01:00
Michael Peter Christen
ef78f22ee1 performance hack 2012-01-25 12:48:48 +01:00
Michael Peter Christen
1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/document/parser/pdfParser.java
2012-01-24 20:42:30 +01:00
reger
32104360ce PDFParser - return at least first 3 pages of PDF
fix for pdf parsing without returning parsed text due to interruption by
time out.
2012-01-23 20:58:36 +01:00
Michael Peter Christen
eadb58dd87 small enhancements in pdf parser 2012-01-23 00:46:02 +01:00
reger
b616de5973 PDFParser - return at least first 3 pages of PDF
fix for pdf parsing without returning parsed text due to interruption by time out.
2012-01-21 03:15:12 +01:00
Michael Peter Christen
7f9b6b7a0c added switches to ConfigParser to accept/deny documents by their
extension
2012-01-17 16:43:34 +01:00
Michael Peter Christen
4901cee3cc suppress auto-tagged subject entries when sending out or receiving
metadata from other peers
2012-01-17 02:10:05 +01:00
Michael Peter Christen
83009d86f7 added the vocabulary navigator. It can be very simply tested by
switching on the locale dictionaries.
2012-01-17 01:53:08 +01:00
Michael Peter Christen
a58dc4a91f added autotagging to document condenser:
- tags that are automatically generated now enrich the dc:subject
- auto-generated tags have a '$' at the beginning of the tag
- auto-generated tags prefix the tag name with a vocabulary name
each tag has the form
$<vocabulary-name>:<tag-printname-space-replaced-by-'_'>
2012-01-15 22:17:57 +01:00
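
The tag form given in a58dc4a91f can be illustrated with a minimal sketch. Only the format `$<vocabulary-name>:<tag-printname-space-replaced-by-'_'>` comes from the commit message. The class name, the method names and the example vocabulary "Geo" are assumptions made for illustration.

```java
// Hypothetical helper, not YaCy's actual code.
final class AutoTagSketch {

    // Builds "$<vocabulary-name>:<print name with spaces replaced by '_'>".
    static String format(String vocabularyName, String printName) {
        return "$" + vocabularyName + ":" + printName.replace(' ', '_');
    }

    // Recognizes auto-generated tags by the leading '$' followed by a
    // non-empty vocabulary name and a ':' separator.
    static boolean isAutoTag(String tag) {
        return tag.startsWith("$") && tag.indexOf(':') > 1;
    }
}
```

For example, `AutoTagSketch.format("Geo", "New York")` yields `$Geo:New_York`. A check like `isAutoTag` is one way the auto-generated entries could be told apart from manually assigned dc:subject values.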
Michael Peter Christen
254adea51c small fixes 2012-01-13 11:24:08 +01:00
Michael Peter Christen
b7bb84c0bb set a limit to CharBuffer object size to fight against bad/too large
content
2012-01-10 03:02:17 +01:00
Michael Christen
e6d51363ee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-01-09 02:00:09 +01:00
Marek Otahal
72adbeae90 !Important: move from Hashtable to HashMap
Hashtable is an obsolete v1 collection; the v2 collections offer HashMap with the same or better
functionality. Please review; almost all code was already moved, so there are only a few changes. That is not the issue,
but I found notes that some (ugly, big) helper classes had to be created in the past
to compensate for Hashtable's missing functionality. I'd like input on whether we can remove some of them.
Look for //FIX: in these commits

Signed-off-by: Marek Otahal <markotahal@gmail.com>
2012-01-09 01:29:18 +01:00
Michael Christen
fa8da7f89d vocabularies are now also used as source for a did-you-mean computation 2012-01-08 02:13:52 +01:00
Michael Christen
eaec14ecc4 Dictionaries from words caches can now be used as autotagging vocabulary 2012-01-08 02:07:10 +01:00
Michael Peter Christen
91940fdf56 redesign of WordCache to be prepared to hold multiple
independent dictionaries. Such dictionaries can then be also used as
simplified vocabularies.
2012-01-08 00:47:32 +01:00
Michael Christen
bd40a10230 added autotagging stub .. only reading and parsing of vocabularies at
this time
2012-01-07 17:34:38 +01:00
Michael Christen
c04bfaa51b refactoring 2011-12-16 23:59:29 +01:00
Michael Christen
1f4afb4dc0 performance hacks 2011-12-15 15:15:53 +01:00
Michael Christen
762e0ecfb6 fixed localization dictionaries, see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=3418&view=next
2011-12-06 02:21:40 +01:00
Michael Christen
9cd469e6d6 added pull request from als plus an NPE fix 2011-12-04 12:15:03 +01:00
Al Sutton
39898cb94a Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer 2011-12-01 11:30:14 +00:00
Al Sutton
4c67a964a1 Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer 2011-12-01 11:28:52 +00:00
Al Sutton
3f9b9f953f Added close() to ensure buffer close actions are invoked 2011-12-01 11:25:59 +00:00
Al Sutton
d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown 2011-12-01 11:20:13 +00:00
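
The idea behind d73c84f9a0 and the neighbouring try/finally commits can be sketched as follows. This is an illustration under assumptions, not the TransformWriter or htmlParser code: it uses the standard `CharArrayWriter` as a stand-in for the growable buffer, and the class name `BufferSizeSketch` is hypothetical.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not YaCy's actual code.
final class BufferSizeSketch {

    static String readAll(InputStream in) throws IOException {
        // For a ByteArrayInputStream, available() reports the remaining bytes,
        // which is a good upper estimate for the number of characters.
        int sizeGuess = Math.max(16, in.available());
        CharArrayWriter buffer = new CharArrayWriter(sizeGuess);
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        try {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } finally {
            // Close the reader (and the wrapped stream) even if reading fails.
            reader.close();
        }
        return buffer.toString();
    }
}
```

Sizing the buffer up front avoids repeated reallocation while it grows. For other stream types `available()` may return a much smaller value, so the estimate is only dependable for in-memory streams.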
Al Sutton
f02ea27b31 Added missing closure of ByteArrayInputStream 2011-12-01 11:11:13 +00:00
Al Sutton
8993cac4d8 Initial performance improvements 2011-11-30 11:15:54 +00:00
orbiter
ebd840ebf6 - enhanced description on search front page
- fixed language and heuristic modifier
- added hint to crawl start that we can also do ftp and smb crawls
- added a protocol extension to remote crawls to transport all search modifiers to remote peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8108 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-26 13:40:33 +00:00
orbiter
e22f8497c9 - tested the ARC methods
- removed strict authentication (if password is empty; this was buggy and not useful; can be switched on if necessary globally and not for each interface method)
- increased speed of CrawlResults page (no dns lookup any more)
- increased speed of favicon display (removed dns lookup)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8104 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 14:09:25 +00:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
apfelmaennchen
564374d1fe - included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand.
- reworked bookmark creation on crawlstart
- many smaller adjustments to ymarks


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-22 23:50:49 +00:00
orbiter
804e48888b smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter
should also be a little bit faster

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8057 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-18 13:09:07 +00:00
orbiter
85d6bf4ac4 fixed urls to media content during indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8021 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-09 15:40:14 +00:00
orbiter
0d858d48ec replaced String with StringBuilder in suggestion process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8020 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-09 14:42:55 +00:00
orbiter
d2ea250d99 refactoring:
- moved many classes from de.anomic to net.yacy
- made more sub-packages for search classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 16:59:06 +00:00
low012
277b454a62 *) added comments
*) minor refactoring

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7971 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 13:16:52 +00:00
orbiter
6b22865dbc - removed some warnings
- removed a dead update location

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-24 01:58:54 +00:00
orbiter
8a428d3e77 ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7958 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-15 11:17:38 +00:00
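
One common way to guarantee that a parser call terminates, as 8a428d3e77 aims to do, is to run it in a separate thread and give up after a time limit. The sketch below shows that general technique with `java.util.concurrent`. It is an assumption about the approach, not the pdf parser's actual implementation, and all names in it are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not YaCy's actual code.
final class TimedParseSketch {

    // Runs the parser in its own thread and returns the fallback text
    // if the parser fails or does not finish within the time limit.
    static String parseWithTimeout(Callable<String> parser, long timeoutMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parser);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the parser thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parser itself threw an exception
        } finally {
            executor.shutdownNow(); // never leave the worker thread running
        }
    }
}
```

The caller is released after the time limit in every case. The worker thread itself only stops if the parser reacts to interruption, so a parser that ignores interrupts would still need its own termination check.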