yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	57e00baf26	fix for parsing of image links inside of anchor links (image-links)	2013-09-15 23:54:46 +02:00
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - as a result, large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	2013-09-15 23:27:04 +02:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, which carries property attributes that are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
reger	f7f86d8a5d	update to Jetty 9 jars - include javax.servlet 3.0	2013-09-14 20:49:05 +02:00
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	2013-09-10 10:31:57 +02:00
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	2013-08-22 14:23:47 +02:00
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	2013-08-20 15:46:04 +02:00
reger	b4016ff324	- remove possible double initialization of rdfa parser - use ordered list to use preferred parser for mime/extension first (relates to html, rdfa, argument parser) - harmonize xhtml extension config for the 3 html base parsers	2013-08-14 21:12:10 +02:00
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-30 12:49:14 +02:00
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	2013-07-30 12:48:57 +02:00
reger	92d3f71b16	htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used), note: stream.close is done by caller (Textparser.parseSource) - removed unnecessary reset in AugmentParser - added stream.mark in tdfatripleimpl. to make stream.reset work here	2013-07-28 03:41:09 +02:00
reger	aa1a1f1d2c	- small adjustment to make sure genericParser is tried last -- for some documents genericParser grabs document instead of specific available parser due to unordered pick of 1st to try parser (like .ps .rdf files and other) - remove redundant file extension registration	2013-07-23 20:24:13 +02:00
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging that puts log entries on a concurrent message queue and logs the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
Michael Peter Christen	e6f361f474	adding the canonical tag to crawl queues	2013-07-01 13:09:41 +02:00
reger	83763ee4a4	jpeg parser: extract GPS location from meta data	2013-06-29 00:35:43 +02:00
Michael Peter Christen	c4538d8d91	added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib	2013-06-26 09:26:34 +02:00
reger	3760e2616b	bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments	2013-06-25 23:24:02 +02:00
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	2013-06-25 16:27:20 +02:00
reger	8d1c4c423d	make imageparser fileextension detection case insensitive (extensions are often upper case)	2013-06-23 00:39:15 +02:00
Michael Peter Christen	3e1e358fdc	calling pdf cache flush on class initialization because calling of the methods during runtime can conflict with dynamic solr class loader and cause a deadlock (seriously!)	2013-06-12 00:17:44 +02:00
Michael Peter Christen	5344a1c5f7	getting the trash out	2013-05-29 16:09:05 +02:00
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	2013-05-20 22:05:28 +02:00
reger	97ab5b90e8	- odt & ooxml (office document) parser correction to add content to fulltext index - adjust Junit yacyVersionTest & ParserTest - update yacyVersion.combined2prettyVersion to the default 4-digit minor ver.	2013-05-20 01:50:09 +02:00
orbiter	7de5b9cfa0	fix for http://bugs.yacy.net/view.php?id=233 - check geolocation coordinates and accept only those, which are well-formed - the solr push process does not stop crawling any more if after 20 requests to Solr Solr does not accept the record. Instead, a severe log entry asks the user to create a bug request	2013-05-03 00:24:39 +02:00
Michael Peter Christen	25499eead5	- added a new field for the regular expression in crawl start - added the field in crawl profile - adapted logging and error management - adapted duplicate document detection - added a new rule to the indexing process to reject non-matching content - full redesign of the expert crawl start servlet The new filter field can now be seen in /CrawlStartExpert_p.html at Section "Document Filter", subsection item "Filter on Content of Document"	2013-04-26 10:49:55 +02:00
Michael Peter Christen	50421171c3	added new schema fields: hreflang_url_sxt and hreflang_cc_sxt for http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077 navigation_url_sxt and navigation_type_sxt for http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html publisher_url_s for http://support.google.com/plus/answer/1713826?hl=de all fields are disabled by default and not written to the index.	2013-04-18 17:21:17 +02:00
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	2013-04-16 01:35:15 +02:00
orbiter	17ae51e741	increased number of links limitation from 1000 to 10000 for rss feeds and html documents	2013-03-17 22:13:56 +01:00
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	2013-03-13 14:47:00 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as the default index - the complete index size will increase and may be about double the current size As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will make it possible to remove some old parts of YaCy, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	6a4878940b	fix in html parser and bookmark generation	2013-02-11 13:28:08 +01:00
reger	3897bb4409	added (manual) urldb migration (link on: Index Administration -> Federated Solr Index) - migrates all entries in old urldb Metadata coordinate (lat / lon) NumberFormatException still occurs relatively often (see excerpt below), - added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format) - removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format) current log excerpt for NumberFormatException: W 2013/01/14 00:10:07 StackTrace For input string: "-" java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152) ... Caused by: java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152)	2013-01-14 03:06:24 +01:00
reger	168b1d130d	Adding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support	2012-12-29 08:24:48 +01:00
Michael Peter Christen	95712fdc8b	update to pdf parser	2012-12-27 04:16:31 +01:00
Michael Peter Christen	34f8786508	removed dependency of vocabulary navigation from Jena and its triplestore; the vocabulary search is now done using generic solr fields which are created on-the-fly during runtime.	2012-12-18 02:29:03 +01:00
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	2012-12-02 16:54:29 +01:00
Michael Peter Christen	b5ee88c6af	added more logging to get info which url causes performance problems	2012-12-02 16:52:12 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally, each time a signature is written, it is checked whether the signature already exists in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because this is the first time a long number is used as a value in the Solr store, the value naming in the YaCySchema also had to be adapted and normalized. As a result, many files had to be changed.	2012-11-21 18:46:49 +01:00
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	2012-11-19 17:24:34 +01:00
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	2012-11-18 01:22:41 +01:00
Michael Peter Christen	d88eb657fd	Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1	2012-11-04 09:21:21 +01:00
Michael Peter Christen	6905182d41	- fix for number of words log message - adding meta:refresh also to crawler stack	2012-10-29 21:42:31 +01:00
Michael Peter Christen	a33e2742cb	- removed unnecessary synchronized and deadlock in crawler - removed problem with monitoring object on Balancer.wait - added missing user agent settings	2012-10-28 19:56:02 +01:00
reger	722a447b0d	- optimize code of augmented parsing to enhance document tags - commented out augmentedparser.analyse (function not implemented yet) - adjust init of document title list to always use same list type	2012-10-26 18:50:45 +02:00
orbiter	276dd6452b	removed warnings	2012-10-23 19:08:44 +02:00
Michael Peter Christen	b991685782	Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1	2012-10-23 18:14:58 +02:00
Michael Peter Christen	b7ac1da6a3	gsa results shall have only one title in metadata and that should be the visible title in the <title>-tag	2012-10-23 18:03:12 +02:00
reger	87aab9aa7c	- fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url - fix Document.addsubdocuments: sections might be initialized with Arrays.asList, which does not provide the used .addAll method, see e.g. http://kamleshkr.wordpress.com/2010/02/17/inside-java-arrays-aslistt-a/	2012-10-22 22:48:35 +02:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document the core of YaCy documents, which would allow many conversions to be removed. On the way to this target the equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
orbiter	68d0f8de03	Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1	2012-10-09 20:36:32 +02:00
reger	bfb0d4c69b	- add language detection from <html lang="xx"> tag - add jaudiotagger jar to Netbeans-IDE project classpath	2012-10-09 20:02:58 +02:00
Michael Peter Christen	7e3e45fd04	added Open Graph Metadata default fields, see http://ogp.me/ns#	2012-10-09 17:28:48 +02:00
Michael Peter Christen	c3e5f667a7	added schema.org breadcrumb counter to parser and solr schema	2012-10-09 13:02:43 +02:00
Michael Peter Christen	4b5e0c1500	added an url rewriter which can be used to remove session ids from urls	2012-10-09 11:24:48 +02:00
Michael Peter Christen	584663ae8c	- redesign of solr query construction - fix for solr boosts and location search - fix for number of search results in local search	2012-10-07 07:46:55 +02:00
Michael Peter Christen	6ab64746d7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-10-06 03:35:32 +02:00
sof	5cb244b79b	Merge remote branch 'origin/master'	2012-10-05 18:54:39 +02:00
apfelmaennchen	88b062210c	Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based on the jaudiotagger library. The parser is disabled by default as it needs to store temporary files for non file:// protocols, which might be disliked. For your local MP3-collection it loads nicely Artist, Title, Album etc. from the audio files meta data.	2012-10-05 18:54:26 +02:00
Michael Peter Christen	31485a963d	refactoring	2012-10-02 21:57:50 +02:00
Michael Peter Christen	3d33a5bdf6	turned the synonyms_t Text field into a multi-valued String field synonyms_sxt	2012-10-02 11:13:06 +02:00
Michael Peter Christen	3b959ee002	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-10-02 10:14:09 +02:00
orbiter	3190347814	added a synonyms_t field to solr and a process to read synonym files. This can be used to add another stemming to solr using stemming files that are expressed as synonyms for grammatical alternatives. The synonym/stemming files must have the following form: - each line is a comma-separated list of synonyms - the list of synonyms may be enclosed with {} (like the GSA synonyms file) - the file may contain comments which are lines starting with a '#' The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and are activated by default whenever a synonym file is in place. Then, for each word that is found in a document all synonyms are added to a long text field which is stored into synonyms_t. Processes using the synonyms must query with that field as optional matcher.	2012-10-02 00:02:50 +02:00
Michael Peter Christen	411d0e839b	added an underline text field to solr to record all underlined texts	2012-10-01 14:16:49 +02:00
Michael Peter Christen	24d2ee3c52	- better date ranking - more protection against NPE and time travel effects	2012-09-26 18:36:32 +02:00
sixcooler	6c50d016ed	pdf- and zipParser should not use forced Memory-Limits	2012-09-26 14:03:51 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
orbiter	63762d8f89	removed kelondro dependencies from cora	2012-09-20 19:38:22 +02:00
Michael Peter Christen	e54ac38095	- some corrections in usage of getFile() and getFileName() - added more attributes in json response writer according to yacy servlet	2012-09-11 23:28:21 +02:00
Michael Peter Christen	528d6763fa	- added new solr fields: title_count_i, title_chars_val, title_words_val description_count_i, description_chars_val, description_words_val - added many asserts to ensure data type correctness from YaCy to Solr and vice versa - made many fixes according to new findings from these asserts (!)	2012-08-31 10:30:43 +02:00
Michael Peter Christen	e8acd542b5	- added faceted drill-down for host and geolocation to solr queries - added a new geolocation field to index schema, the old values are migrated if possible	2012-08-27 14:41:33 +02:00
orbiter	67f2866cd0	small fixes	2012-08-24 21:44:22 +02:00
orbiter	67edfd991c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-08-05 15:49:48 +02:00
orbiter	d9173ba7ed	added more solr fields to integrate values from URIMetadataRow. All writings to the Metadata-DB are now also done to solr. This includes metadata transfer during search and rwi transfer. The new/added solr fields are: ## time when resource was loaded load_date_dt ## date until resource shall be considered as fresh fresh_date_dt ## id of the host, a 6-byte hash that is part of the document id host_id_s ## ids of referrer to this document referrer_id_ss ## the md5 of the raw source md5_s ## the name of the publisher of the document publisher_t ## the language used in the document; starts with primary language language_ss ## an external ranking value ranking_i ## the size of the raw source size_i ## number of links to audio resources audiolinkscount_i ## number of links to video resources videolinkscount_i ## number of links to application resources applinkscount_i	2012-08-05 15:49:27 +02:00
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	2012-07-30 10:38:23 +02:00
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	2012-07-27 12:13:53 +02:00
orbiter	482afed07c	reduced logging overhead (a bit)	2012-07-12 19:23:40 +02:00
orbiter	bbfa497a3c	replaced more size() > 0 by !isEmpty()	2012-07-12 11:12:21 +02:00
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Michael Peter Christen	801972fe6f	fix for url camel case parser and sentence reader	2012-07-08 16:48:09 +02:00
Michael Peter Christen	fbc1a2030d	fix for sitemap importer: can now also import very large sitemaps within small memory configurations	2012-07-08 16:11:50 +02:00
Michael Peter Christen	92731e5287	fix for sevenzip parser	2012-07-08 16:11:19 +02:00
Michael Peter Christen	8efc1c1078	- fixed a memory leak (or bad usage) during parsing/snippet fetch - more logging for errors	2012-07-06 09:05:41 +02:00
Michael Peter Christen	b1e7c11fba	fix for pattern matcher in html parser	2012-07-05 14:24:03 +02:00
Michael Peter Christen	b0c408788b	made class methods static where possible	2012-07-05 12:38:41 +02:00
Michael Peter Christen	7c1ba99755	removed more unused method parameters	2012-07-05 10:44:30 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	2012-07-05 09:14:04 +02:00
Michael Peter Christen	ea10766bfd	cleaned unnecessary nested code	2012-07-05 08:44:39 +02:00
orbiter	fc0f9543fe	More SentenceReader cleanup	2012-07-05 00:20:58 +02:00
orbiter	586bb0eb6a	Simplified SentenceReader (no more Reader inside..)	2012-07-04 22:06:20 +02:00
orbiter	7f851d62a7	replaced HashARC with SizeLimited Objects which are less costly	2012-07-04 21:56:25 +02:00
orbiter	78fc3cf8f8	refactoring and new usage of SentenceReader: this class appeared as one of the major CPU users during snippet verification. The class was not efficient for two reasons: - it used a too complex input stream; generated from sources and UTF8 byte-conversions. The BufferedReader applied a strong overhead. - to feed data into the SentenceReader, multiple toString/getBytes had been applied until a buffered Reader from an input stream was possible. These superfluous conversions had been removed. - the best source for the Sentence Reader is a String. Therefore the production of Strings had been forced inside the Document class.	2012-07-04 21:15:10 +02:00
orbiter	bb8dcb4911	automatically adopt size of word cache to available memory	2012-07-03 18:22:25 +02:00
Michael Peter Christen	ad09b786bf	clean up parser data	2012-07-03 17:20:41 +02:00
Michael Peter Christen	276a66a793	Adding a limit of 1000 links that a parser shall store during indexing. A limit was necessary because some web pages have such huge numbers of links that it can easily cause an OOM just by the number of links. The question of whether a limit of 1000 links is sufficient or too low must be answered by testing this feature.	2012-07-03 17:06:20 +02:00
Michael Peter Christen	de903a53a0	parser refactoring & hacks	2012-07-03 06:06:38 +02:00
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	2012-07-02 13:57:29 +02:00
Michael Peter Christen	ce8d4b87d9	fixes for new eclipse 'Juno' warning 'Resource leak'.	2012-07-02 10:27:46 +02:00
Michael Peter Christen	0c345d1559	giving threads names so it's easier to see what's happening during debugging and within a thread dump	2012-07-02 09:51:43 +02:00
Michael Peter Christen	508a81b86c	added solr field 'refresh_s' which stores the refresh url contained in the meta-refresh html header field.	2012-06-28 13:27:45 +02:00
Michael Peter Christen	f3167def64	do not fill the keywords with title content if keywords do not exist.	2012-06-27 13:07:02 +02:00
Michael Peter Christen	77f795756c	fixing redirects and status codes: storing of status code in ResponseHeader to make it available for late evaluations, like storage in solr.	2012-06-25 18:17:31 +02:00
Michael Peter Christen	dbdd697f4d	moved RDFaParser.xsl configuration file to defaults	2012-06-21 16:09:12 +02:00
Michael Peter Christen	786be7d175	better integration of RDFaParser	2012-06-20 16:39:04 +02:00
Michael Peter Christen	de3ef8ad73	removed unimportant warnings	2012-06-19 08:45:34 +02:00
Michael Peter Christen	24bbe359ca	integrate also geonames library files for less cities. these are more useful for tagging since less normal words are false-identified as location	2012-06-18 15:19:57 +02:00
Michael Peter Christen	223a5440ab	preventing that an empty pnd is inserted into the vocabularies	2012-06-18 01:22:39 +02:00
Michael Peter Christen	963f92ed9a	- merged files - changed behaviour of delete button in vocabulary edit - fixed size number in vocabulary listing	2012-06-17 23:48:33 +02:00
Michael Peter Christen	dd88d0ace2	more logging	2012-06-17 19:03:53 +02:00
Michael Peter Christen	94d54e2d91	added recognition of multi-word terms in vocabulary matching this makes the PND usable: it is now possible to recognize persons and navigate with a 'Persons' facet.	2012-06-16 19:40:27 +02:00
Michael Peter Christen	64c0268b2b	show triplestore metadata in yacydoc and viewfile	2012-06-16 17:40:15 +02:00
Michael Peter Christen	c2f0d16d2c	fixed vocabulary initialization	2012-06-16 13:12:02 +02:00
Michael Peter Christen	df3531f8d5	added the generation of virtual vocabularies using the pnd	2012-06-16 12:36:15 +02:00
Michael Peter Christen	a0f1decd82	- added loading of the dbpedia pnd triplestore in the dictionary loader - renamed the dictionary loader to knowledge loader - some refactoring in the library provider method names	2012-06-15 19:19:18 +02:00
Michael Peter Christen	16d8f33795	added objectlink generation to vocabulary generation and editor	2012-06-14 18:50:35 +02:00
Michael Peter Christen	d45718251e	refactoring (Localization -> Location)	2012-06-14 09:45:57 +02:00
Michael Peter Christen	b8b3c87ba7	- renamed localization to location (that was confusing) - renamed 'Locale' navigator to 'Location' - produce Location navigation only if geolocation libraries are loaded	2012-06-14 09:44:14 +02:00
Michael Peter Christen	e89747bb67	- added automated generation of vocabularies from url stubs - added clear of all terms for vocabularies - added deletion of vocabularies	2012-06-13 15:53:18 +02:00
Michael Peter Christen	79464189a4	The 'Locale' vocabulary, which is generated by geo data, has now the objectspace "http://dbpedia.org/resource/"	2012-06-13 13:05:41 +02:00
Michael Peter Christen	61bb52d55c	- using http://purl.org/dc/terms/references to refer from an auto-annotated document to a 'pseudo-linked' document which has an url created with an object-prefix as defined in the vocabulary file	2012-06-12 14:23:51 +02:00
Michael Peter Christen	50c576599b	allow multiple parser options instead of printing an error	2012-06-12 01:42:58 +02:00
Michael Peter Christen	8b53771db2	changed behavior of navigation processing: - vocabulary annotation is not done any more into the metadata of urldb - vocabularies are written into the jena triplestore using a rdf vocabulary - vocabularies for rdf triples must be updated; refactoring done - with the new navigation tags in the triplestore a faster pre-urldb-lookup is possible: navigation is processed now within the RWI during pre-ranking retrieval - added also an Owl vocabulary stub to add the plain-text url to the triplestore using the owl:sameas predicate	2012-06-11 23:49:30 +02:00
Michael Peter Christen	5fc6524ca8	- moved triple store to net.yacy.cora.lod (should be generalized there later) - added abstract add, delete, get methods in the triplestore - added generation of triples after auto-annotation - migrated all MultiProtocolURI objects to DigestURI in the parser since the url hash is needed as subject value in the triples in the triple store	2012-06-11 16:48:53 +02:00
cominch	bbfc53b663	bugfix	2012-06-10 13:12:12 +02:00
cominch	65c5826d93	bugfix Conflicts: source/net/yacy/document/parser/augment/AugmentParser.java	2012-06-10 13:11:54 +02:00
cominch	5f8ba7f4f2	small changes Conflicts: source/net/yacy/document/parser/augment/AugmentParser.java source/net/yacy/interaction/Interaction.java	2012-06-10 13:02:00 +02:00
cominch	90512640bf	Added config switches for custom parser Conflicts: source/net/yacy/document/TextParser.java	2012-06-10 12:49:36 +02:00
cominch	bcbd8eee33	Add several parsers, for RDFa and rdf files. Conflicts: source/net/yacy/document/TextParser.java	2012-06-10 10:42:33 +02:00
cominch	9cbfc1a1c0	augmentedProxy, which forwards every proxy request to a rewrite engine to customize existing webpages. originally implemented by Florian Richter. Conflicts: source/de/anomic/http/server/HTTPDProxyHandler.java	2012-06-10 10:15:34 +02:00
Michael Peter Christen	cde20911bb	saved a bit more ram using UTF8 String compression for OpenGeoDB and Geonames data files.	2012-06-09 10:07:11 +02:00
Michael Peter Christen	225ee42879	made the GeoLocation into an interface with the current integer implementation as accuracy implementation of 1.863cm	2012-06-09 09:46:27 +02:00
Michael Peter Christen	96e9d77270	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/cora/sorting/WeakPriorityBlockingQueue.java	2012-06-06 20:13:28 +02:00
Michael Peter Christen	96c8119b50	added GeoLocation / GeoPoint classes which uses less memory than Location/Coordinates and has initializers with correct order of lat,lon coordinates	2012-06-06 12:57:42 +02:00
Michael Peter Christen	461a0ce052	removed warnings	2012-06-05 20:03:43 +02:00
Michael Peter Christen	2fe207f813	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-06-04 23:44:38 +02:00
Michael Peter Christen	514700291a	moved Vocabulary to cora package (added in git `964406ad17`)	2012-06-04 23:41:36 +02:00
Michael Peter Christen	0284a4d88f	more fixes for double precision of coordinates	2012-06-04 23:37:41 +02:00
Michael Peter Christen	964406ad17	added concurrency enhancement to xml parser	2012-06-04 23:35:56 +02:00
Michael Peter Christen	e0d8643226	- performance hacks - added log warnings in case that search processes run into time-out situations - better concurrency for Integer formatter (used a non-synchronized formatter before) - bugfix for search termination (a poison pill was missing) - added timeout parameters for search (again) -> target is, that they are never reached.	2012-06-04 15:37:39 +02:00
Michael Peter Christen	6e83b02b83	- bugfix for surrogate file reader - bugfix for location search: suppress empty search	2012-06-01 00:08:31 +02:00
Michael Peter Christen	9b4c699526	enhanced location search: - search requests are now made using a map boundary - search results are only computed for the map boundary - the number of results is adapted to the results in the visible range - added a double-buffering for the search result markers - added a search query option for the search results: /radius/<lat>/<lon>/<radius>	2012-05-31 22:39:53 +02:00
Michael Peter Christen	4d3cc02168	replaced old bzip2 library against better documented commons-compress package from http://commons.apache.org/compress/	2012-05-28 23:53:48 +02:00
Michael Peter Christen	c15fcde1c8	add-on to latest commit	2012-05-21 17:52:30 +02:00
Michael Peter Christen	81737dcb18	removed stack trace from swf parser since we can't do anything there	2012-05-21 02:27:06 +02:00
Michael Peter Christen	acf8d521a2	fix for http://bugs.yacy.net/view.php?id=126	2012-05-19 00:21:03 +02:00
Michael Peter Christen	89142d1e8d	removed (not all) warnings	2012-05-16 13:42:32 +02:00
Roland 'Quix0r' Haeder	a093ccf5eb	Now used synchronization in all close() methods to make sure all objects are 'closed' in an ordered way Conflicts: source/de/anomic/http/server/ChunkedInputStream.java source/de/anomic/http/server/ChunkedOutputStream.java source/de/anomic/http/server/ContentLengthInputStream.java source/net/yacy/cora/protocol/Domains.java source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java source/net/yacy/document/content/dao/PhpBB3Dao.java source/net/yacy/document/parser/html/AbstractTransformer.java source/net/yacy/kelondro/blob/BEncodedHeap.java source/net/yacy/kelondro/blob/HeapReader.java source/net/yacy/kelondro/index/RAMIndexCluster.java source/net/yacy/kelondro/io/ByteCountInputStream.java source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java source/net/yacy/kelondro/table/SQLTable.java	2012-05-14 07:41:55 +02:00
Michael Peter Christen	ba6aaabc51	refactoring + parser bugfixes	2012-05-04 17:28:27 +02:00
Michael Peter Christen	09484955dc	added new entry class for embed tags	2012-04-27 17:48:51 +02:00
Michael Peter Christen	453010bd68	- solved problems with backpath normalization - redesigned in/outbound link handover - removed iframe links from inbound/outbound in solr scheme	2012-04-27 16:48:51 +02:00
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	2012-04-24 16:07:03 +02:00
Michael Peter Christen	f8cd57c92f	new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can retrieve now ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query	2012-04-22 02:05:17 +02:00
Michael Peter Christen	a1a5b015d8	refactoring: moved document Classification to cora package	2012-04-21 21:31:13 +02:00
Michael Peter Christen	4d5da75814	fix for parser problem if a <a>-tag is 'within' html tags with unclosed tags. That prevented the <a> tags from being recognized. This is a fix for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516	2012-04-18 10:30:04 +02:00
Michael Peter Christen	046f3a7e8d	check if httpc has decompressed the release file and rename the file from .tar.gz to .tar if that happened	2012-04-16 09:50:55 +02:00
Michael Peter Christen	e101c2e0e2	added changes from copperdust (submitted by email): 1. Improved and fixed language detection: 1.1 Identificator.java - recognition fix (improved) 1.2 DCEntry.java - fix (changed detection order due to detection from tld in many cases is incorrect) 1.3 MultiProtocolURI.java - fixed and enhanced language from tld detection (all currently used top-level domains; ccTLD added but not tested). 2. Ukrainian language update. 3. Main Slavic languages langstats (tested and works fine).	2012-02-22 12:21:27 +01:00
Michael Peter Christen	8d63a5887c	bugfixes	2012-02-02 23:38:23 +01:00
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As an effect: - it is no longer possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	2012-02-02 21:33:42 +01:00
Michael Peter Christen	7e4e3fe5b6	free some memory after parsing html	2012-02-02 09:55:27 +01:00
Michael Peter Christen	4540174fe0	memory hacks	2012-02-02 07:37:00 +01:00
Michael Peter Christen	2e5cd6a1b2	fixed parser extension deny list generation and usage	2012-02-01 00:15:59 +01:00
Michael Peter Christen	8bee1472c9	there is no noindex, only nofollow in links	2012-01-31 23:46:35 +01:00
Michael Peter Christen	c560a582ac	fix for single-word vocabulary lines	2012-01-26 16:44:30 +01:00
Michael Peter Christen	ef78f22ee1	performance hack	2012-01-25 12:48:48 +01:00
Michael Peter Christen	1f4f60654a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/document/parser/pdfParser.java	2012-01-24 20:42:30 +01:00
reger	32104360ce	PDFParser - return at least first 3 pages of PDF fix for pdf parsing without returning parsed text due to interruption by time out.	2012-01-23 20:58:36 +01:00
Michael Peter Christen	eadb58dd87	small enhancements in pdf parser	2012-01-23 00:46:02 +01:00
reger	b616de5973	PDFParser - return at least first 3 pages of PDF fix for pdf parsing without returning parsed text due to interruption by time out.	2012-01-21 03:15:12 +01:00
Michael Peter Christen	7f9b6b7a0c	added switches to ConfigParser to accept/deny documents by their extension	2012-01-17 16:43:34 +01:00
Michael Peter Christen	4901cee3cc	suppress auto-tagged subject entries when sending out or receiving metadata from other peers	2012-01-17 02:10:05 +01:00
Michael Peter Christen	83009d86f7	added the vocabulary navigator. It can be very simply tested by switching on the locale dictionaries.	2012-01-17 01:53:08 +01:00
Michael Peter Christen	a58dc4a91f	added autotagging to document condenser: - tags that are automatically generated now enrich the dc:subject - auto-generated tags have a '$' at the beginning of the tag - auto-generated tags lead the tag name with a vocabulary name each tag has the form $<vocabulary-name>:<tag-printname-space-replaced-by-'_'>	2012-01-15 22:17:57 +01:00
Michael Peter Christen	254adea51c	small fixes	2012-01-13 11:24:08 +01:00
Michael Peter Christen	b7bb84c0bb	set a limit to CharBuffer object size to fight against bad/too large content	2012-01-10 03:02:17 +01:00
Michael Christen	e6d51363ee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-01-09 02:00:09 +01:00
Marek Otahal	72adbeae90	!Important: move from Hashtable to HashMap Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue, but I found notices that some (ugly big) helper classes had to be created in past to compensate missing Hashtable's functionality. I'd like input if we can remove some of them. look for //FIX: if these commits Signed-off-by: Marek Otahal <markotahal@gmail.com>	2012-01-09 01:29:18 +01:00
Michael Christen	fa8da7f89d	vocabularies are now also used as source for a did-you-mean computation	2012-01-08 02:13:52 +01:00
Michael Christen	eaec14ecc4	Dictionaries from words caches can now be used as autotagging vocabulary	2012-01-08 02:07:10 +01:00
Michael Peter Christen	91940fdf56	redesign of WordCache to be prepared to hold multiple independent dictionaries. Such dictionaries can then be also used as simplified vocabularies.	2012-01-08 00:47:32 +01:00
Michael Christen	bd40a10230	added autotagging stub .. only reading and parsing of vocabularies at this time	2012-01-07 17:34:38 +01:00
Michael Christen	c04bfaa51b	refactoring	2011-12-16 23:59:29 +01:00
Michael Christen	1f4afb4dc0	performance hacks	2011-12-15 15:15:53 +01:00
Michael Christen	762e0ecfb6	fixed localization dictionaries, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=3418&view=next	2011-12-06 02:21:40 +01:00
Michael Christen	9cd469e6d6	added pull request from als plus an NPE fix	2011-12-04 12:15:03 +01:00
Al Sutton	39898cb94a	Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer	2011-12-01 11:30:14 +00:00
Al Sutton	4c67a964a1	Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer	2011-12-01 11:28:52 +00:00
Al Sutton	3f9b9f953f	Added close() to ensure buffer close actions are invoked	2011-12-01 11:25:59 +00:00
Al Sutton	d73c84f9a0	Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown	2011-12-01 11:20:13 +00:00
Al Sutton	f02ea27b31	Added missing closure of ByteArrayInputStream	2011-12-01 11:11:13 +00:00
Al Sutton	8993cac4d8	Initial performance improvements	2011-11-30 11:15:54 +00:00
orbiter	ebd840ebf6	- enhanced description on search front page - fixed language and heuristic modifier - added hint to crawl start that we can do also ftp and smb crawls - added a protocol extension to remote crawls to transport all search modifiers to remote peers git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8108 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-26 13:40:33 +00:00
orbiter	e22f8497c9	- tested the ARC methods - removed strict authentication (if password is empty; this was buggy and not useful; can be switched on if necessary globally and not for each interface method) - increased speed of CrawlResults page (no dns lookup any more) - increased speed of favicon display (removed dns lookup) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8104 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 14:09:25 +00:00
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 11:23:52 +00:00
apfelmaennchen	564374d1fe	- included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand. - reworked bookmark creation on crawlstart - many smaller adjustments to ymarks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 23:50:49 +00:00
orbiter	804e48888b	smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter should also be a little bit faster git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8057 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-18 13:09:07 +00:00
orbiter	85d6bf4ac4	fixed urls to media content during indexing git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8021 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-09 15:40:14 +00:00
orbiter	0d858d48ec	replaced String with StringBuilder in suggestion process git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8020 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-09 14:42:55 +00:00
orbiter	d2ea250d99	refactoring: - moved many classes from de.anomic to net.yacy - made more sub-packages for search classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-25 16:59:06 +00:00
low012	277b454a62	- added comments - minor refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7971 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-25 13:16:52 +00:00
orbiter	6b22865dbc	- removed some warnings - removed a dead update location git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-24 01:58:54 +00:00
orbiter	8a428d3e77	ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7958 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-15 11:17:38 +00:00
orbiter	85a5487d6d	YaCy can now use the solr index to compute text snippets. This makes search result preparation MUCH faster because no document fetching and parsing is necessary any more. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7943 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-13 14:39:41 +00:00
orbiter	0819e1d397	protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7942 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-09 23:00:45 +00:00
orbiter	49e5ca579f	added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), then all embedded image, audio and video links from all parsed documents are added to the search index as individual documents. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-07 10:08:57 +00:00
orbiter	610b01e1c3	- added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index. - some refactoring for mime type discovery git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7919 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-01 16:05:00 +00:00
orbiter	b5252ef91f	added new word recommendation library in DictionaryLoader_p.html git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7913 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-01 10:14:17 +00:00
orbiter	1c007188ad	bugfixes in html parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-31 16:02:06 +00:00
orbiter	231074bf0a	fixed a parsing bug by reverting SVN 7766 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7910 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-28 22:59:19 +00:00
low012	24e76a7b69	- Replaced occurrences of "Wikimedia" with "MediaWiki" where applicable. (Thanks to the folks of 0x20.be for pointing this out.) - Added description of where to place MediaWiki dump for import. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7905 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-28 00:16:36 +00:00
orbiter	5dd2efc9a2	- bugfixes in html parser - new fields in solr - extended file viewer to debug parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-25 15:52:25 +00:00
orbiter	51cf697acd	refactoring: moved all score-related classes to new ranking package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7889 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-22 22:37:53 +00:00
sixcooler	eb14111200	encapsulate potential expensive objects in TextSnippet to allow GC them asap this reduces chance of OOMs at massive search & snippet-fetching git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7865 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-11 21:07:52 +00:00
sixcooler	a311596881	finishing up my commits (7855-7858) which could be helpful for not declaring inside loops (helps GC of some VMs) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7859 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-01 23:35:24 +00:00
sixcooler	9170a434ed	throwing an exception again in FileUtils.copy(reader, writer) OOMs could occur here and should not be ignored git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7858 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-01 23:32:58 +00:00
sixcooler	ce248cc8dd	less byte-arrays of response-content, less byte-array <-> stream conversion git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7856 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-01 23:31:08 +00:00
sixcooler	59b767eebd	stop loading via http at defined maximum of bytes - even size is unknown before loading using max-file-size of type int for parsing documents (since content is used as byte-arrays, 'integer' should be maximum) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7855 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-08-01 23:28:23 +00:00
orbiter	299af4943c	added another memory protection hack git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7849 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-07-17 17:55:08 +00:00
orbiter	b06faab9d3	do not allocate a StringBuilder object in case that there is not enough memory for that git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7846 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-07-16 23:17:19 +00:00
orbiter	2d4bb139d3	- added counting of links with noindex tag for solr index - bugfixes for solr index git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7820 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-07-03 06:40:05 +00:00
orbiter	bda3eec0ff	added parsing of canonical link element to html parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-07-01 16:38:01 +00:00
orbiter	9706fc55aa	enhanced content scraper (should discover urls much faster in case of very large plain texts) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7787 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-20 22:29:45 +00:00
orbiter	f667b9c289	enhanced identificator: using AtomicInteger for counter git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7785 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-19 13:31:10 +00:00
orbiter	115abc8917	- more attributes for search progress bar - moved cache strategy to cora package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-13 21:44:03 +00:00
orbiter	77fe69395d	added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7774 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-05 20:04:41 +00:00
orbiter	0c1b29f3c9	- applied many small performance hacks - added a memory limitation in the zip parser and the pdf parser - added a search throttling: if too many search queries are still to be computed, then new requests are not accepted for some time. if after one second there is still no space to perform another search, the search terminates with no results. this case should only happen in case of DoS-like situations and in case of strong load on a peer like if it is integrated in metager. - added a search cache deletion process that removes search requests in case that throttling happens git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7766 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-01 19:31:56 +00:00
orbiter	4bea3f9714	hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources: used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes). The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-27 08:24:54 +00:00
orbiter	e28bd0d038	fix for some possible causes of memory leaks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7741 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-26 14:35:32 +00:00
orbiter	10e2f588f8	- enhanced ybr ranking computation - many speed/performance hacks - added solr sharding and new sharding web interface - added option to switch off the yacy index when using solr - added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr - refactoring/renaming of some method names to distinguish host/url hashes better - a large number of bug/npe fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-26 10:57:02 +00:00
orbiter	3ed4a09368	small features, some bug fixes and performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-23 21:08:04 +00:00
orbiter	205cc75157	abstraction of surrogate main element (xmlns:geo was missing for wiki extracts) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7727 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-17 08:57:49 +00:00
orbiter	021840e5ba	removed (almost) deadlocks and unnecessary CPU load git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7726 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-17 00:00:01 +00:00
orbiter	9248a4eef4	reduce the effect of 'Bildersuche findet generierte HTML-Seiten als Bilder' (image search finds generated HTML pages as images) see http://bugs.yacy.net/view.php?id=9 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7705 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-07 07:37:46 +00:00
orbiter	76f2817e00	a fix for the snippet computation and hopefully better snippets git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7701 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-05 23:05:38 +00:00
orbiter	deda54d684	- relaxed matching of string-search (this is now case-insensitive) - added transport of string-search pattern to remote search protocol - fixed a problem parsing snippets with a '-' inside git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7700 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-05 22:37:06 +00:00
orbiter	15e3a57b4e	removed unused functions in condenser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7698 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-05 09:23:10 +00:00
orbiter	e3d19d0a90	fix in Document inboundlinks/outboundlinks sorting git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7690 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-01 15:49:04 +00:00
orbiter	4e8fa03514	added more attributes to html evaluation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7688 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-29 15:36:44 +00:00
orbiter	528da7c9ea	removed unused class and added license header for new class git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7680 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-28 13:14:30 +00:00
orbiter	f6077b3cc0	added more attributes for html parser and enhanced data structures git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7679 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-28 13:09:01 +00:00
orbiter	d8e934c085	better abstraction of http client identification git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-26 13:35:29 +00:00
orbiter	b77b8cac0c	- enhanced html parser: recognized much more details in the content - added more properties to solr index - refactoring - more constants in switchboard - fix for some NPEs - recognition of more images - removed synchronization in HandleMap (obviously not necessary?) - added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as being local. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-21 13:58:49 +00:00
orbiter	3d5104d357	- fixed a bug in crawl start with file name (npe in new url) - added deletion of solr index in IndexControlRWIs - added asynchronous adding of large url lists (happens when crawls are started with file) - fixed npe in Image display - replaced language warning with fine logging - added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups) - added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list - added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm) - fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-18 16:11:16 +00:00
orbiter	958ff4778e	enhanced location search: search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found. This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed. Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search. Fixed many smaller bugs in connection with location search. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-15 15:54:19 +00:00
orbiter	c17d102bd8	enhanced speed for OrderedScoreMap inc method and size comparison in concurrent environments git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7653 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-13 22:04:23 +00:00
orbiter	b788182954	some enhancements to scoring speed git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7652 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-13 15:17:00 +00:00

627 Commits