Commit Graph

473 Commits

Author SHA1 Message Date
orbiter
97983ba89f fixed generics warnings for generic array instantiation that appeared
after migration to Java 7
2014-05-20 21:50:16 +02:00
orbiter
88f4af90da removed warnings 2014-05-13 22:27:31 +02:00
reger
8a7c68e4c7 content of surrogates/out never accessed (remove)
After import the content is never accessed but may take up a lot of disk space,
also the getLoadedOAIServer (which lists the files in surrogate out) is not used.
Making the surrogate.out obsolete. Removed keeping of xmls after import.
2014-05-04 09:29:07 +02:00
reger
2eb7682772 add html5 audio/video <source> tag to html content scraper
- <source src=.. type=..> tag content is added to embed collection
2014-04-29 00:41:29 +02:00
reger
0b6db04e40 fix contentscraper img height/width parsing
prevent NumberFormatException on the common "100px" property value

- include in test case
2014-04-28 04:59:47 +02:00
reger
121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
this allows loading to continue with the next resumptionToken even if an import file caused a SAX parser error
fix http://mantis.tokeek.de/view.php?id=63
2014-04-25 01:05:28 +02:00
reger
86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
- some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags,
remove all tags for text property (inline img tags are still parsed)
- added test case for above (to htmlParserTest)
- fix solr test case
2014-04-23 00:55:16 +02:00
Michael Peter Christen
5746aae3db add canonical links to the same crawldepth, not the next crawldepth 2014-04-18 06:51:46 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file readers shall prevent the number of open file
pointers from exceeding the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
Michael Peter Christen
ce1d1b2fa0 fix for maximum tag length in parser 2014-04-11 09:56:44 +02:00
Michael Peter Christen
67beef657f strong redesign of html parser: object recursion is now made using a
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produces better parsed image objects with exact anchor text
references.
2014-04-10 18:58:03 +02:00
reger
af6ad20728 fix: remove obsolete ref to yacy.home
(use Switchboard instead)
2014-04-04 02:45:04 +02:00
Michael Peter Christen
cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value which may
be smaller if the crawler did not take the shortest path to
the document.
2014-04-02 23:37:01 +02:00
reger
49e76a1c55 make use of detected charset in htmlParser if none is given. 2014-04-01 04:02:34 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
reger
651d057e93 surrogate import translate dc:language 3-char codes
OAI records often use 3-char language codes; start converting some 3-char language codes to the internal ISO 639-1 2-char code
2014-03-23 00:40:36 +01:00
Michael Peter Christen
453bfd0f17 removed unused variables and warnings 2014-03-19 09:29:01 +01:00
reger
1d01672bd3 fix DCEntry.getIdentifier
on successful url parameter
2014-03-12 23:35:57 +01:00
reger
6306d28a6a OAI import get multivalued keywords (dc:subject) 2014-03-09 03:15:35 +01:00
reger
5c9dcc269d improve OAI-PMH import identifier recognition
- find best fitting identifier (url) by checking all given dc:identifier in record (many entries provide several identifiers)
  as identifier is currently a multivalued field use "getParams" in preference of splitting the 1st string by ";"
- add resolving of DOI:... identifiers via http://dx.doi.org/
2014-03-04 03:08:37 +01:00
Michael Peter Christen
6e59ca4ebf removed jena library and all code that depended on jena. When jena was
introduced, it was also used for search facets. The generic search
facets are now deduced from generic solr fields which makes jena as a tool
for facet semantics superfluous.
2014-02-07 01:20:06 +01:00
reger
bd1685c94a fix not needed getFileExtension().toLower (double)
add missing .getFileExtension
2014-02-05 03:45:02 +01:00
Michael Peter Christen
022c6d3ce1 do YaCy p2p connections using a timeout-request which wraps the http
request in a separate thread and ignores the further result of a
request if that does not answer within the requested time-out. This is an
attempt to solve a problem with the peer-ping, which hangs whenever a peer
appears to be dead or blocked.
2014-01-19 15:21:23 +01:00
reger
6932aa4d7a use configured admin-username for api calls
- the admin user name can be configured, in apiExec calls the default "admin" username is used. 

TODO: the bin/apicall.sh script should likely take that into account.
2014-01-07 21:26:50 +01:00
orbiter
3cb6c7861f fixed shutdown authentication problem 2014-01-06 01:48:54 +01:00
Michael Peter Christen
77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives 2013-12-26 23:26:21 +01:00
Michael Peter Christen
7603e879dc Merge branch 'master' into HEAD
Conflicts:
	.classpath
	source/net/yacy/cora/federate/solr/SolrServlet.java
2013-12-20 01:19:06 +01:00
orbiter
937273d4e3 added parsing of metadata to surrogate reading:
a dublin core record inside of surrogate input files may now contain
tokens within the namespace 'md' (short for: metadata). The token names
must be valid within the namespace of the solr field names. All
md-tokens inside of surrogate files then overwrite values within solr
documents before they are written to the solr index. This makes it
possible to assign collection names to each surrogate entry and also
ranking information can be added. Please see the example file.
2013-12-17 14:02:27 +01:00
reger
effea4bca0 Merge origin/master into jetty
Conflicts:
	source/net/yacy/cora/federate/solr/SolrServlet.java
2013-11-29 22:39:52 +01:00
orbiter
61409788eb less word hash computations (removing some overhead because of MD5
calcs) using the clear word in a normalized form.
2013-11-25 15:20:54 +01:00
reger
f111f30ace Merge origin/master into jetty 2013-11-17 00:18:25 +01:00
orbiter
19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
orbiter
4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-10 18:50:43 +01:00
orbiter
909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
completed
2013-11-10 18:50:34 +01:00
reger
1437c45383 merge rc1/master 2013-11-07 21:30:17 +01:00
Michael Peter Christen
81d9e23532 fixed another memory leak in the PDF parser:
the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space
which cannot be cleaned if PDFont.clearResources is called.
The attempt to clean the class cache therefore causes that the class is
loaded and this cache is initialized with some rubbish. I tried to
prevent to instantiate this class by usage of a hacked findLoadedClass
call to the SystemClassLoader (which is protected ...).
Now, without using the PDF parser at all, 8MB of RAM space is not
occupied, however, when the first PDF arrives this space will be taken
and never given back to GC.
WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!
2013-11-07 11:57:01 +01:00
Michael Peter Christen
a8253ca49c added missing unicode transformation in href link contents during
parsing
2013-11-06 18:05:02 +01:00
Michael Peter Christen
60187a4ec2 fix in html parser 2013-11-04 10:16:20 +01:00
reger
f017066197 Merge origin/master into jetty 2013-10-27 15:09:24 +01:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together make it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
reger
5c4ba9b5db merge rc1 master 2013-09-22 02:21:24 +02:00
reger
70c51775ae Merge remote-tracking branch 'origin/master' into jetty 2013-09-22 02:09:02 +02:00
orbiter
6e8377b8ad do not check all words with synonym library if the library is empty 2013-09-21 08:56:24 +02:00
Michael Peter Christen
31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
2013-09-16 16:14:56 +02:00
Michael Peter Christen
57e00baf26 fix for parsing of image links inside of anchor links (image-links) 2013-09-15 23:54:46 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- as a result, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
only the unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different
type of link collection whose entries carry the property attributes
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI, have been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
reger
f7f86d8a5d update to Jetty 9 jars
- include javax.servlet 3.0
2013-09-14 20:49:05 +02:00