yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
reger	cb2c17d236	extract author and keywords in .doc and .ppt parser	2014-06-29 02:54:09 +02:00
orbiter	fec673c9d1	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-06-27 10:15:37 +02:00
orbiter	4a66af716d	added apkParser stub (work in progress)	2014-06-27 10:15:01 +02:00
reger	2d67f29244	adjust mergeDocument after parsing to - preserve charset and languages - fix merge of author	2014-06-26 22:16:15 +02:00
Michael Peter Christen	0d29b972cc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-06-26 13:02:56 +02:00
reger	7847a93558	fix AbstractParser.singleList not adding null strings - prevents null titles in oo... parser (as detected by ParserTest) - correct ParserTest dc_description check (dc_description allowed to return 0 length array)	2014-06-26 02:56:45 +02:00
Michael Peter Christen	8acae852a0	write <em>-tagged texts also into the bold_txt field	2014-06-25 11:51:11 +02:00
reger	3b559e7846	optimize pdfParser skip starting reader thread if all content already read	2014-06-10 04:25:20 +02:00
reger	09f73b790f	fix pdfParser not closed warning from pdfbox for encrypted pdf on exit due to missing permission to extract	2014-06-08 08:20:30 +02:00
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	2014-06-01 06:43:50 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
orbiter	88f4af90da	removed warnings	2014-05-13 22:27:31 +02:00
reger	8a7c68e4c7	content of surrogates/out never accessed (remove) After import the content is never accessed but may take up a lot of disk space, also the getLoadedOAIServer (which lists the files in surrogate out) is not used. Making the surrogate.out obsolete. Removed keeping of xmls after import.	2014-05-04 09:29:07 +02:00
reger	2eb7682772	add html5 audio/video <source> tag to html content scraper - <source src=.. type=..> tag content is added to embed collection	2014-04-29 00:41:29 +02:00
reger	0b6db04e40	fix contentscraper img height/width parsing prevent numberformat exception on common "100px" property - include in test case	2014-04-28 04:59:47 +02:00
reger	121d25be38	recover sax fatal error on OAI-PMH import of xml with entity error this allows to continue loading next resumptionToken even if import file caused sax parser error fix http://mantis.tokeek.de/view.php?id=63	2014-04-25 01:05:28 +02:00
reger	86f6975edc	exclude html tags in in/outboundlinks_anchortext_txt parsed text - some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags, remove all tags for text property (inline img tags are still parsed) - added test case for above (to htmlParserTest) - fix solr test case	2014-04-23 00:55:16 +02:00
Michael Peter Christen	5746aae3db	add canonical links to the same crawldepth, not the next crawldepth	2014-04-18 06:51:46 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	ce1d1b2fa0	fix for maximum tag length in parser	2014-04-11 09:56:44 +02:00
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	2014-04-10 18:58:03 +02:00
reger	af6ad20728	fix: remove obsolete ref to yacy.home (use Switchboard instead)	2014-04-04 02:45:04 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
reger	49e76a1c55	make use of detected charset in htmlParser if none is given.	2014-04-01 04:02:34 +02:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
reger	651d057e93	surrogate import translate dc:language 3-char codes OAI records often use 3-char language codes, start converting some 3-char lang's to the internal ISO639-1 2-char code	2014-03-23 00:40:36 +01:00
Michael Peter Christen	453bfd0f17	removed unused variables and warnings	2014-03-19 09:29:01 +01:00
reger	1d01672bd3	fix DCEntry.getIdentifier on successful url parameter	2014-03-12 23:35:57 +01:00
reger	6306d28a6a	OAI import get multivalued keywords (dc:subject)	2014-03-09 03:15:35 +01:00
reger	5c9dcc269d	improve OAI-PMH import identifier recognition - find best fitting identifier (url) by checking all given dc:identifier in record (many entries provide several identifiers) as identifier is currently a multivalued field use "getParams" in preference of splitting the 1st string by ";" - add resolve DOI:... identifier via http://dx.doi.org/	2014-03-04 03:08:37 +01:00
Michael Peter Christen	6e59ca4ebf	removed jena library and all code that depended on jena. When jena was introduced, it was also used for search facets. The generic search facets are now deduced from generic solr fields which makes jena as tool for facet semantics superfluous.	2014-02-07 01:20:06 +01:00
reger	bd1685c94a	fix not needed getFileExtension().toLower (double) add missing .getFileExtension	2014-02-05 03:45:02 +01:00
Michael Peter Christen	022c6d3ce1	do YaCy p2p connections using a timeout-request which covers the http request into a separate thread and ignores the further result of a request if that does not answer within the requested time-out. This is a try to solve a problem with the peer-ping, which hangs whenever a peer appears to be dead or blocked.	2014-01-19 15:21:23 +01:00
reger	6932aa4d7a	use configured admin-username for api calls - the admin user name can be configured, in apiExec calls the default "admin" username is used. TODO: the bin/apicall.sh script should likely take that into account.	2014-01-07 21:26:50 +01:00
orbiter	3cb6c7861f	fixed shutdown authentication problem	2014-01-06 01:48:54 +01:00
Michael Peter Christen	77aeb288a2	suppress deprecation warning (for now); TODO: find alternatives	2013-12-26 23:26:21 +01:00
Michael Peter Christen	7603e879dc	Merge branch 'master' into HEAD Conflicts: .classpath source/net/yacy/cora/federate/solr/SolrServlet.java	2013-12-20 01:19:06 +01:00
orbiter	937273d4e3	added parsing of metadata to surrogate reading: a dublin core record inside of surrogate input files may now contain tokens within the namespace 'md' (short for: metadata). The token names must be valid within the namespace of the solr field names. All md-tokens inside of surrogate files then overwrite values within solr documents before they are written to the solr index. This makes it possible to assign collection names to each surrogate entry and also ranking information can be added. Please see the example file.	2013-12-17 14:02:27 +01:00
reger	effea4bca0	Merge origin/master into jetty Conflicts: source/net/yacy/cora/federate/solr/SolrServlet.java	2013-11-29 22:39:52 +01:00
orbiter	61409788eb	less word hash computations (removing some overhead because of MD5 calcs) using the clear word in a normalized form.	2013-11-25 15:20:54 +01:00
reger	f111f30ace	Merge origin/master into jetty	2013-11-17 00:18:25 +01:00
orbiter	19a051bec8	more monitoring for postprocessing and enhanced layout in Crawler monitor page	2013-11-16 18:23:14 +01:00
Michael Peter Christen	1a4a69c226	set more logger to 'final static'	2013-11-13 06:18:48 +01:00
orbiter	4234b0ed6c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-11-10 18:50:43 +01:00
orbiter	909bbb49d8	added (partly commented) test code for url rewrite methods .. to be completed	2013-11-10 18:50:34 +01:00
reger	1437c45383	merge rc1/master	2013-11-07 21:30:17 +01:00
Michael Peter Christen	81d9e23532	fixed another memory leak in the PDF parser: the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space which cannot be cleaned if PDFont.clearResources is called. The attempt to clean the class cache therefore causes that the class is loaded and this cache is initialized with some rubbish. I tried to prevent to instantiate this class by usage of a hacked findLoadedClass call to the SystemClassLoader (which is protected ...). Now, without using the PDF parser at all, 8MB of RAM space is not occupied, however, when the first PDF arrives this space will be taken and never given back to GC. WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!	2013-11-07 11:57:01 +01:00
Michael Peter Christen	a8253ca49c	added missing unicode transformation in href link contents during parsing	2013-11-06 18:05:02 +01:00
Michael Peter Christen	60187a4ec2	fix in html parser	2013-11-04 10:16:20 +01:00
reger	f017066197	Merge origin/master into jetty	2013-10-27 15:09:24 +01:00
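
The queueing rule described in commit da86f150ab (one queue per host and crawl depth, always taking a URL of minimal crawl depth, balanced fairly across hosts) can be illustrated with a small in-memory sketch. This is not YaCy's Java implementation: the class name HostQueuesSketch, its methods, the "host:port" key strings, and the tie-breaking rotation are all assumptions made for illustration, and the on-disk files and OnDemandOpenFileIndex are not modelled.

```python
from collections import defaultdict, deque


class HostQueuesSketch:
    """Hypothetical model: one FIFO queue per (host key, crawl depth)."""

    def __init__(self):
        # host key (assumed "host:port") -> crawl depth -> queue of URLs
        self._queues = defaultdict(lambda: defaultdict(deque))
        self._last_host = None  # host served by the previous pop()

    def push(self, host, depth, url):
        """Append a URL to the queue for this host and crawl depth."""
        self._queues[host][depth].append(url)

    def pop(self):
        """Return (host, depth, url) with minimal crawl depth, or None."""
        # Smallest non-empty depth for every host that still has URLs.
        candidates = {}
        for host, depths in self._queues.items():
            nonempty = [d for d, q in depths.items() if q]
            if nonempty:
                candidates[host] = min(nonempty)
        if not candidates:
            return None

        # Primary rule: the crawl depth of the returned URL is minimal.
        best = min(candidates.values())
        hosts = sorted(h for h, d in candidates.items() if d == best)

        # Assumed fairness rule: rotate through tied hosts in sorted order,
        # continuing after the host that was served last.
        later = [h for h in hosts
                 if self._last_host is None or h > self._last_host]
        host = later[0] if later else hosts[0]

        self._last_host = host
        return host, best, self._queues[host][best].popleft()


if __name__ == "__main__":
    q = HostQueuesSketch()
    q.push("a.example:80", 1, "http://a.example/deep")
    q.push("b.example:80", 0, "http://b.example/")
    q.push("a.example:80", 0, "http://a.example/")
    item = q.pop()
    while item is not None:
        print(item)
        item = q.pop()
```

In this sketch both depth-0 URLs are returned before the depth-1 URL, regardless of insertion order, which is the property the commit message ties to crawl depth matching click depth.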

483 Commits