yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	2013-03-13 14:47:00 +01:00
Michael Peter Christen	2b6c79d347	in method exists() also use the new caching-stacks for documents/metadata	2013-03-04 01:13:17 +01:00
Michael Peter Christen	3b1d9dc884	made index storage of DHT search results concurrent. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	2013-03-02 10:25:52 +01:00
orbiter	d74472f562	corrected result counter	2013-02-27 22:40:23 +01:00
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	2013-02-26 17:16:31 +01:00
Michael Peter Christen	35fa718b77	testing to use solr for portalsearch caused some bugfixing but no full success: try to comment out the solr search request in yacy-portalsearch.js	2013-02-25 14:31:50 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adapted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index now has the following properties: - webgraph size will have about 40 times as many entries as the default index - the complete index size will increase and may be about double the current amount. As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will allow some old parts of YaCy to be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get full access to the new index, the API to solr now has two access points: one with attribute core=collection1 for the default search index and core=webgraph for the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented in the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of the new instance handling - migrated the search apis to now handle the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this also solves some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Marc Nause	75f9568472	*) only install files from the RELEASE directory *) minor changes	2013-02-05 21:02:32 +01:00
Marc Nause	3bc5ee6e3d	*) added protection against CSRF in update download page (http://localhost:8090/ConfigUpdate_p.html?releaseinstall=../../test.txt&deleteRelease=Delete+Release does not work anymore)	2013-02-04 19:57:28 +01:00
reger	3897bb4409	added (manual) urldb migration (link on: Index Administration -> Federated Solr Index) - migrates all entries in old urldb Metadata coordinate (lat / lon) NumberFormatException still occurs relatively often (see excerpt below), - added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format) - removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format) current log excerpt for NumberFormatException: W 2013/01/14 00:10:07 StackTrace For input string: "-" java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152) ... Caused by: java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152)	2013-01-14 03:06:24 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index and that should cause that less clickdepths must be post-processed.	2013-01-04 16:39:34 +01:00
Michael Peter Christen	0f5b6f38c1	enhanced root-url detection	2013-01-03 19:21:21 +01:00
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (demand by customer). The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculating the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	2013-01-02 20:55:43 +01:00
reger	276e63401e	small sanitary fixes - exclude unix shell scripts in NSIS windows install archive - replace link to env/grafics/yacy.gif to yacy.png (build.nsi) - remove unused code lines (Blacklist_p, Response, WordReferenceVars) - type & xhtml (RankingSolr_p.html)	2013-01-02 01:59:47 +01:00
Michael Peter Christen	24c9bb35f7	extended the Scheduler: introduced scheduled events - an event type (once, regular) can be selected - for this event type, a fixed time can be selected. This may be either directly after startup or at one of the full hours of a day (==25 options) The main point about this feature is the opportunity to start an action directly after startup. That makes it possible to create YaCy distributions which, when started for the first time, start to index parts of the intranet/internet by themselves.	2012-12-22 16:27:14 +01:00
reger	ad71747525	fix: set default language to "en"	2012-12-16 20:53:45 +01:00
orbiter	712cc37c40	if maxFileSize < 0 then the file size limit is without limit.	2012-12-10 21:17:45 +01:00
Michael Peter Christen	8fc3679c66	using more pre-compile pattern for split methods	2012-11-26 13:11:55 +01:00
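The pre-compiled split pattern named in commit 8fc3679c66 can be sketched as follows. This is a minimal illustration of the general technique, not the YaCy code; the class `SplitExample`, the constant `COMMA` and the method `splitFields` are hypothetical names.

```java
import java.util.regex.Pattern;

public class SplitExample {
    // Compiled once and reused for every call, instead of handing
    // the regex string to String.split() each time.
    private static final Pattern COMMA = Pattern.compile(",");

    /** Splits a line at commas using the shared pre-compiled pattern. */
    public static String[] splitFields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        for (String field : splitFields("a,b,c")) {
            System.out.println(field);
        }
    }
}
```

The benefit is largest for patterns that are real regular expressions and are applied in hot loops, where repeated compilation would otherwise add up.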
Michael Peter Christen	5e182a566f	- added another enumeration method in kelondro data structure to get a more random access to data for the balancer - added random access inside the balancer	2012-11-23 13:58:39 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally, each time a signature is written, it is checked whether the signature already exists in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because this is the first time a long number is used as value in the Solr store, the value naming in the YaCySchema also had to be adapted and normalized. As a consequence, many files had to be changed.	2012-11-21 18:46:49 +01:00
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	2012-11-19 17:24:34 +01:00
Michael Peter Christen	832eead998	Merge remote-tracking branch 'regerdev/master'	2012-11-18 22:04:11 +01:00
Michael Peter Christen	570e42c4e3	fix for filetype navigator	2012-11-07 13:53:29 +01:00
reger	633fbe9188	Fix Metadata handling - language default on missing lang property to "uk" (fix set to nothing) - language set to TLD (added call to existing language calculation from TLD) - coordinate number exception on possible lat/lon content of "NaN,NaN" adjust Netbeans IDE classpath (for Solr/Lucene 4.0.0 jars)	2012-11-04 02:07:59 +01:00
Michael Peter Christen	c5f67a5d6d	fixed a problem with local search from solr results: now all results from solr are shown (again)	2012-11-01 10:22:22 +01:00
Michael Peter Christen	f8f05ecba7	- added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrent	2012-10-31 17:44:45 +01:00
Michael Peter Christen	a33e2742cb	- removed unnecessary synchronized and deadlock in crawler - removed problem with monitoring object on Balancer.wait - added missing user agent settings	2012-10-28 19:56:02 +01:00
orbiter	354f0d9acd	moved static method from ClusteredScoreMap to MapDataMining because it was not used in the ClusteredScoreMap class but only in MapDataMining	2012-10-28 11:29:53 +01:00
Michael Peter Christen	1baf498d59	- show more lines in online log - reverse order is default now	2012-10-25 18:38:39 +02:00
Michael Peter Christen	f2d0418218	because the new PngEncoder had a problem with the PixelGrabber which is caused by a JRE bug, the PixelGrabber had to be circumvented using an own frame buffer which can be read without a PixelGrabber. This resulted in ultra-fast and much less memory-consuming transformation. YaCy images are now generated really fast!	2012-10-25 17:59:20 +02:00
orbiter	276dd6452b	removed warnings	2012-10-23 19:08:44 +02:00
Michael Peter Christen	ce0e5b1e17	- more refactoring / private methods - fix for usage of custom solr field names	2012-10-18 15:09:04 +02:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document the core of YaCy documents, which would allow many conversions to be removed. On the way to this target the equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should lower memory usage during indexing.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	b400fc7b4d	fix for file parser problem	2012-10-17 18:06:44 +02:00
Michael Peter Christen	e5b3c172ff	removed hack which translated Solr documents to virtual RWI entries which were then mixed with remote RWIs. Now these Solr documents are fed into the result set as they appear during local and remote search. That makes the search much faster.	2012-10-17 17:45:41 +02:00
Michael Peter Christen	6017691522	added an exception catch	2012-10-17 13:56:11 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	613cf7da7f	enhancement to post argument parsing - possible fix to zero-filled parameter values	2012-10-11 10:46:06 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	a06930662c	replaced some more .getBytes() with UTF8/ASCII.getBytes()	2012-10-09 12:14:28 +02:00
Michael Peter Christen	2f536cb54d	code cleanup: removed unused methods and made more methods and objects private	2012-10-08 10:50:24 +02:00
Michael Peter Christen	584663ae8c	- redesign of solr query construction - fix for solr boosts and location search - fix for number of search results in local search	2012-10-07 07:46:55 +02:00
Michael Peter Christen	a8167e6e5b	clean-up: removed unused methods in kelondro	2012-10-06 03:34:52 +02:00
Michael Peter Christen	24d2ee3c52	- better date ranking - more protection against NPE and time travel effects	2012-09-26 18:36:32 +02:00
Michael Peter Christen	ca313e404f	- if a "/date" modifier is used, the solr remote query applies an ordering by date (ascending) - added also some 'anti-timetravel' protection (check if date is in the future within any metadata date field)	2012-09-26 16:56:33 +02:00
Michael Peter Christen	24f4ca4d85	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-26 12:01:34 +02:00
apfelmaennchen	116f429e35	fix for java.lang.RuntimeException: TableColumnIndex not available...	2012-09-26 09:56:16 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	872f83ebe0	refactoring	2012-09-25 21:04:58 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
orbiter	563d584420	removed more dependencies in cora from kelondro	2012-09-21 11:02:36 +02:00
Michael Peter Christen	e072632a54	no complaints about memory if the database is empty	2012-09-11 22:28:10 +02:00
Michael Peter Christen	e65cecc419	- updated lucene libraries to 3.6.1 - added lucene-grouping which enables faceted search; try this: http://localhost:8090/solr/select?q=*:*&start=0&rows=3&facet=true&facet.field=host_s	2012-09-10 10:12:38 +02:00
Michael Peter Christen	4d29f59a27	removed warnings	2012-09-10 07:15:52 +02:00
Michael Peter Christen	8c099d2106	Merge remote-tracking branch 'origin/master' Conflicts: htroot/api/ymarks/import_ymark.java source/de/anomic/data/ymark/YMarkEntry.java source/de/anomic/data/ymark/YMarkTables.java	2012-09-10 07:05:20 +02:00
apfelmaennchen	d31a632951	- added dmoz RDF dump importer - added indexing to Tables columns to support larger bookmark collections - added RDF output (HTTP) for public bookmarks at /YMarks.rdf - YMarkRDF also provides a Jena RDF Model as "internal" API - various other changes/fixes for YMarks (mainly backend)	2012-09-09 09:53:58 +02:00
Michael Peter Christen	d8425e6809	added collections to crawl monitor	2012-09-04 14:47:53 +02:00
Michael Peter Christen	528d6763fa	- added new solr fields: title_count_i, title_chars_val, title_words_val description_count_i, description_chars_val, description_words_val - added many asserts to ensure data type correctness from YaCy to Solr and vice versa - made many fixes according to new findings from these asserts (!)	2012-08-31 10:30:43 +02:00
Michael Peter Christen	316b5fe116	- added a solr type definition verifier - fixed type definition found by the verifier - added multivalue-string fields for solr with extension 'sxt' - added multivalue-integer fields for solr with extension 'val' - renamed some solr attributes from txt to sxt - changed solr query line to an explicit AND/OR structure - added a country code second level domain list to Domains class; with parser - added a host string parser to get domain class name, country-code second-level domain and subdomain out of it - removed old coordinate attributes	2012-08-28 16:58:06 +02:00
Michael Peter Christen	e8acd542b5	- added faceted drill-down for host and geolocation to solr queries - added a new geolocation field to index schema, the old values are migrated if possible	2012-08-27 14:41:33 +02:00
orbiter	2094df2e4e	- correct length computation for BStringObject (bugfix suggested by apfelmaennchen) - using ASCII for string conversion for Strings generated from Integer	2012-08-26 17:46:40 +02:00
Michael Peter Christen	4716546ef5	- reduced memory usage in index transmission using a transformation of Node to Row objects - removed peerDeparture in solr remote search in case that peer does not answer (this may be normal because it is allowed to switch this off)	2012-08-22 16:30:33 +02:00
Michael Peter Christen	06b0081fdc	fix for NPE during host navigation computation	2012-08-22 01:55:39 +02:00
orbiter	acb9f04e80	removed unused classes	2012-08-21 18:18:30 +02:00
Michael Peter Christen	755f5e76cf	removed strange assert statements and simplified code in metadata transformation	2012-08-19 08:44:39 +02:00
orbiter	ee01c12e56	fixes for putDocument and putMetadata	2012-08-18 13:05:27 +02:00
Michael Peter Christen	f9fc5cfaba	better check for bad urls in url transmission	2012-08-17 17:17:00 +02:00
Michael Peter Christen	40c0856489	refactoring	2012-08-17 15:33:02 +02:00
Michael Peter Christen	9bece5ac5f	enhanced snippet fetch - removed a bug that caused documents to be parsed even if a solr text was available	2012-08-17 14:22:07 +02:00
Michael Peter Christen	395b78a0d8	using the solr search index to concurrently search within solr and the rwis during local search requests.	2012-08-17 01:21:56 +02:00
Michael Peter Christen	e5ef840f40	- renamed DoubleSolrConnector to MirrorSolrConnector and added a hit/miss/document cache to the MirrorSolrConnector. - more abstraction to SolrDocument in Connector interface - bugfixes in Solr field reader	2012-08-13 13:32:32 +02:00
Michael Peter Christen	94a334f128	another fix to the Solr metadata reading process and to the shutdown process	2012-08-13 11:13:53 +02:00
Michael Peter Christen	b51df6c7e8	- added coordinate storage in solr schema - fixed shutdown process - fixed some solr-to-metadata reading - added a large number of metadata attributes in ViewFile.html	2012-08-13 10:40:04 +02:00
Michael Peter Christen	f9c0e6e950	- Implemented and integrated the URIMetadataNode object which is a metadata representation from the solr index. This shall replace metadata from the built-in database in the future. - added the Solr-driven metadata into the search index of YaCy which makes it now possible to run YaCy without the old metadata index. This is a major step forward to a full migration to Solr.	2012-08-10 13:26:51 +02:00
Michael Peter Christen	dcc72799c4	better abstraction for result writers using controlled vocabularies and URIRefs	2012-08-10 07:45:43 +02:00
Michael Peter Christen	a12f693ec9	added two response writers for the embedded solr interface: a rss/opensearch writer and an enhanced solr xml writer. The enhanced solr writer has less configuration overhead than the original writer and should be slightly faster. The rss/opensearch writer is at this time slightly incomplete compared with the already existing rss search result from YaCy and also snippets are missing at this time. To test the new interface, open for example: http://localhost:8090/solr/select?wt=rss&q=olympia The wt codes for the new result writers are: wt=rss for opensearch, wt=exml for the enhanced solr xml writer. Additionally, the SRU search parameters have been added to the solr interface which can now also be used for a normal solr/xml search.	2012-08-09 18:06:48 +02:00
sixcooler	f32aa9a49c	prevent merge of blobs that can't be handled in memory	2012-07-31 23:23:16 +02:00
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	2012-07-27 12:13:53 +02:00
Michael Peter Christen	e432bb9cd9	better calculation of possible saving in HeapReader index data structure	2012-07-26 10:05:06 +02:00
Michael Peter Christen	9549984c65	documentation/comments	2012-07-25 21:34:23 +02:00
Michael Peter Christen	826967513b	changed options in IndexFederated_p to switch on/off parts of the index individually. The settings are experimental and the values of the settings will be overwritten when an index migration from urldb to solr starts.	2012-07-23 16:28:39 +02:00
orbiter	69e743d9e3	- more abstraction for the RWI index as preparation for solr integration - added options in search index to switch parts of the index on or off	2012-07-22 13:18:45 +02:00
Michael Peter Christen	f0a079ac9f	allow larger log entries	2012-07-14 16:28:14 +02:00
Michael Peter Christen	784a4abb18	enhancement in internal data organization which should generate less synchronizations in database access	2012-07-14 13:09:44 +02:00
Michael Peter Christen	f78ce93a80	collection of speed and memory saving hacks	2012-07-13 21:15:38 +02:00
orbiter	a196f24f60	prevent enqueueing of non-loggeable logging entries	2012-07-12 19:42:42 +02:00
orbiter	482afed07c	reduced logging overhead (a bit)	2012-07-12 19:23:40 +02:00
orbiter	e76159040b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-07-12 11:14:04 +02:00
orbiter	bbfa497a3c	replaced more size() > 0 by !isEmpty()	2012-07-12 11:12:21 +02:00
Michael Peter Christen	83da68c4c1	fixed a memory leak inside the logger which appeared if the log was written faster than the logger is able to print it to its output stream. A very large collection of unwritten log outputs had been seen during strong crawling. The new ArrayBlockingQueue is limited to prevent this case.	2012-07-12 01:23:04 +02:00
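A bounded log queue of the kind commit 83da68c4c1 describes might look like the sketch below. The commit only states that an `ArrayBlockingQueue` with a limit is used; the class `BoundedLogQueue`, its method names, and the choice to drop messages when the queue is full (rather than block the caller) are assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedLogQueue {
    // Fixed capacity: pending messages can no longer grow without limit.
    private final BlockingQueue<String> pending;

    public BoundedLogQueue(int capacity) {
        this.pending = new ArrayBlockingQueue<String>(capacity);
    }

    /**
     * Enqueues a message without blocking.
     * Returns false if the queue is full and the message was dropped
     * (assumed overflow policy, not confirmed by the commit).
     */
    public boolean enqueue(String message) {
        return pending.offer(message);
    }

    /** Removes and returns the oldest pending message, or null if there is none. */
    public String next() {
        return pending.poll();
    }

    /** Number of messages currently waiting to be written. */
    public int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedLogQueue queue = new BoundedLogQueue(2);
        System.out.println(queue.enqueue("first"));
        System.out.println(queue.enqueue("second"));
        System.out.println(queue.enqueue("third")); // queue is full here
    }
}
```

Using `put()` instead of `offer()` would make producers wait for free space; which trade-off fits depends on whether losing log lines or slowing the crawler is the lesser problem.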
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Michael Peter Christen	1addbc792c	use less memory for md5 cache	2012-07-08 22:05:04 +02:00
Michael Peter Christen	f32de94723	more logging	2012-07-08 22:04:36 +02:00
Michael Peter Christen	8efc1c1078	- fixed a memory leak (or bad usage) during parsing/snippet fetch - more logging for errors	2012-07-06 09:05:41 +02:00
Michael Peter Christen	b0c408788b	made class methods static where possible	2012-07-05 12:38:41 +02:00
Michael Peter Christen	5bd3c90907	- removed unnecessary semicolons - added default case for switch	2012-07-05 11:18:31 +02:00
Michael Peter Christen	132afaf687	removed unaccessible code	2012-07-05 11:09:44 +02:00
Michael Peter Christen	7c1ba99755	removed more unused method parameters	2012-07-05 10:44:30 +02:00
Michael Peter Christen	83701a1b4c	removed unused ImageReference package	2012-07-05 10:24:52 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	2012-07-05 09:14:04 +02:00
Michael Peter Christen	ea10766bfd	cleaned unnecessary nested code	2012-07-05 08:44:39 +02:00
Michael Peter Christen	1481037820	replaced non-generic array with collection	2012-07-05 01:02:51 +02:00
Michael Peter Christen	613b45f604	- better data structures in secondary search - fixed a big memory leak in secondary search	2012-07-03 07:12:20 +02:00
Michael Peter Christen	8a82609360	- smaller caches to save memory - close cloneable iterators to free memory	2012-07-02 15:40:40 +02:00
Michael Peter Christen	ce8d4b87d9	fixes for new eclipse 'Juno' warning 'Resource leak'.	2012-07-02 10:27:46 +02:00
Michael Peter Christen	0c345d1559	giving threads names so it's easier to see what's happening during debugging and within a thread dump	2012-07-02 09:51:43 +02:00
Michael Peter Christen	b9d42fd9c8	using com.google.common.io.Files instead of homebrew methods	2012-06-22 11:39:17 +02:00
Michael Peter Christen	de3ef8ad73	removed unimportant warnings	2012-06-19 08:45:34 +02:00
Michael Peter Christen	9264d8b4af	removed old navigation practice using subject tags in favor of triplestore-tags	2012-06-17 00:33:40 +02:00
Michael Peter Christen	61bb52d55c	- using http://purl.org/dc/terms/references to refer from an auto-annotated document to a 'pseudo-linked' document which has an url created with an object-prefix as defined in the vocabulary file	2012-06-12 14:23:51 +02:00
Michael Peter Christen	8b53771db2	changed behavior of navigation processing: - vocabulary annotation is not done any more into the metadata of urldb - vocabularies are written into the jena triplestore using a rdf vocabulary - vocabularies for rdf triples must be updated; refactoring done - with the new navigation tags in the triplestore a faster pre-urldb-lookup is possible: navigation is processed now within the RWI during pre-ranking retrieval - added also an Owl vocabulary stub to add the plain-text url to the triplestore using the owl:sameas predicate	2012-06-11 23:49:30 +02:00
Michael Peter Christen	bef823c247	close the reader if finished	2012-06-11 01:20:54 +02:00
cominch	9cbfc1a1c0	augmentedProxy, which forwards every proxy request to a rewrite engine to customize existing webpages. originally implemented by Florian Richter. Conflicts: source/de/anomic/http/server/HTTPDProxyHandler.java	2012-06-10 10:15:34 +02:00
Michael Peter Christen	3b992e6b00	using utf8 String compression in Webstructure database	2012-06-09 11:00:33 +02:00
Michael Peter Christen	2280a7b276	- changed initialization order to prefer allocation of memory for table files first - bugfixes in memory amount calculation	2012-06-09 09:05:47 +02:00
Michael Peter Christen	0746308bc2	only the metadata tables shall be able to use the tail cache	2012-06-08 18:36:11 +02:00
Michael Peter Christen	7ec9bef0c3	fix for OOM	2012-06-08 17:14:09 +02:00
Michael Peter Christen	41c02cb10e	- less restrictions for usage of Table RAM copy - new limit to use the table copy (instead of flag): 400MB available. If less is available, then a copy is never used. If more is available, then it can be used if there is a remaining space of at least 200MB - flush caches more often: flush the Digest cache	2012-06-08 12:48:25 +02:00
Michael Peter Christen	b8f56a9803	npe bugfix	2012-06-08 10:20:43 +02:00
Michael Peter Christen	ba10caf89a	lazy initialization of database tables	2012-06-08 09:30:51 +02:00
Michael Peter Christen	701b9a28a0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: htroot/PerformanceMemory_p.java	2012-06-08 09:16:16 +02:00
Michael Peter Christen	10c9c17d51	fixed handlemap spread factor and null iterator handling	2012-06-08 09:13:41 +02:00
Michael Peter Christen	b0095c8d3c	flush the compressor cache when a cleanup is done	2012-06-07 19:42:33 +02:00
Michael Peter Christen	96e9d77270	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/cora/sorting/WeakPriorityBlockingQueue.java	2012-06-06 20:13:28 +02:00
Michael Peter Christen	00f2df1120	a variety of possible memory leak fixes	2012-06-06 18:23:18 +02:00
Michael Peter Christen	3dd8376825	added automatic cleaning of cache if metadata and file database size are not equal. It might happen that this data differs because one of these caches is cleaned after a while or when it is too big. The metadata is then not cleaned, but now wiped after a checkup process at every application start. This should cause a bit less memory usage.	2012-06-06 14:15:24 +02:00
Michael Peter Christen	6bb07afcc3	accept also files with other file prefix; used to read 'foreign' cache files	2012-06-06 13:36:10 +02:00
Michael Peter Christen	461a0ce052	removed warnings	2012-06-05 20:03:43 +02:00
Michael Peter Christen	407fdf6968	more bug fixes and performance hacks for search process	2012-06-05 15:04:23 +02:00
Michael Peter Christen	a1fe65b115	performance hacks	2012-06-05 12:06:26 +02:00
Michael Peter Christen	e0d8643226	- performance hacks - added log warnings in case that search processes run into time-out situations - better concurrency for Integer formatter (used a non-synchronized formatter before) - bugfix for search termination (a poison pill was missing) - added timeout parameters for search (again) -> target is, that they are never reached.	2012-06-04 15:37:39 +02:00
Michael Peter Christen	9b4c699526	enhanced location search: - search requests are now made using a map boundary - search results are only computed for the map boundary - the number of results is adapted to the results in the visible range - added double-buffering for the search result markers - added a search query option for the search results: /radius/<lat>/<lon>/<radius>	2012-05-31 22:39:53 +02:00
Michael Peter Christen	1f48d1528b	performance hacks	2012-05-31 00:46:30 +02:00
Michael Peter Christen	10da7335ea	performance hack: use a hash cache for all hashes that are computed by a byte array. If this hash is used in a HashMap (which is very often the case) then this hack eliminates a lot of re-computations of the same hash.	2012-05-30 16:59:13 +02:00
Michael Peter Christen	7c1feefb28	introduced a default 10 second time-out in rwi normalization time during search process to prevent endless deadlocks after a very long running search	2012-05-30 16:26:05 +02:00
Michael Peter Christen	8d997d55b6	better logging	2012-05-30 15:47:35 +02:00
Michael Peter Christen	43c2c6e588	better logging	2012-05-30 15:27:45 +02:00
Michael Peter Christen	c15fcde1c8	add-on to latest commit	2012-05-21 17:52:30 +02:00
Michael Peter Christen	cf47d94888	performance hack to parse numbers inside of substrings without actually generating a substring. This avoids the allocation of a String object each time a substring is parsed. Should affect CPU load during RWI transmission.	2012-05-21 13:40:46 +02:00
Michael Peter Christen	7e0ddbd275	added a "fromCache" flag in Response object to omit one cache.has() check during snippet generation. This should cause less blockings	2012-05-21 03:03:47 +02:00
Michael Peter Christen	c6a09eab0b	synchronization needed	2012-05-21 00:58:29 +02:00
reger	6696cb1313	bugfix: lookup of peernames no result for active peer in page IndexControlRWIs_p.html -> Transfer RWI to other Peer SeedDB.lookupByName searches for lowercase peerNames, while MapColumnIndex.getIndex uses peername as is in the keyset. Changed the index init to insert lowercase peer names as key	2012-05-20 05:25:16 +02:00
Michael Peter Christen	f294f2e295	bugfix to http://bugs.yacy.net/view.php?id=181 tried to make a bit less 'noise' to dns server also included: less processes in snippet fetch to reduce load during search on small computers	2012-05-19 01:06:33 +02:00
Michael Peter Christen	acf8d521a2	fix for http://bugs.yacy.net/view.php?id=126	2012-05-19 00:21:03 +02:00
Michael Peter Christen	fa735f4f04	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-05-17 23:40:08 +02:00
Michael Peter Christen	3e1bc9477f	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-05-17 13:58:09 +02:00
Michael Peter Christen	6f8a2fef1f	small speed enhancement using a column factory	2012-05-17 11:08:48 +02:00
Roland 'Quix0r' Haeder	d10627d591	More sync in close() methods Conflicts: source/net/yacy/kelondro/logging/GuiHandler.java source/net/yacy/kelondro/workflow/InstantBusyThread.java	2012-05-17 06:03:18 +02:00
Roland 'Quix0r' Haeder	fbb946f913	Made a method static (Eclipse suggested it), removed unused import, pk=null check does now output a warning in logfile	2012-05-17 05:55:44 +02:00
Michael Peter Christen	89142d1e8d	removed (not all) warnings	2012-05-16 13:42:32 +02:00
Michael Peter Christen	15db703808	added missing serialization to remove all warnings	2012-05-15 13:13:07 +02:00
Michael Peter Christen	1795a7325b	made HandleSet serializable	2012-05-15 12:55:15 +02:00
Roland 'Quix0r' Haeder	a093ccf5eb	Now used synchronization in all close() methods to make sure all objects are 'closed' in an ordered way Conflicts: source/de/anomic/http/server/ChunkedInputStream.java source/de/anomic/http/server/ChunkedOutputStream.java source/de/anomic/http/server/ContentLengthInputStream.java source/net/yacy/cora/protocol/Domains.java source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java source/net/yacy/document/content/dao/PhpBB3Dao.java source/net/yacy/document/parser/html/AbstractTransformer.java source/net/yacy/kelondro/blob/BEncodedHeap.java source/net/yacy/kelondro/blob/HeapReader.java source/net/yacy/kelondro/index/RAMIndexCluster.java source/net/yacy/kelondro/io/ByteCountInputStream.java source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java source/net/yacy/kelondro/table/SQLTable.java	2012-05-14 07:41:55 +02:00
Michael Peter Christen	0cf3d36eae	more tolerance in case of corrupted file	2012-05-11 20:46:50 +02:00
Michael Peter Christen	34f4225d7e	less 'wellformed' calls without asserts	2012-05-08 23:24:39 +02:00
Michael Peter Christen	ba6aaabc51	refactoring + parser bugfixes	2012-05-04 17:28:27 +02:00
Michael Christen	e32055aa15	added stub classes for - a new database for url reference data ('seen links') - a new database extending the references to the full url metadata attributes set which shall replace the old metadata database if it is finished - migration help classes stub to use old and new metadata databases simultaneously	2012-04-13 07:09:15 +02:00
Michael Peter Christen	2fc8ecee36	ConcurrentLinkedQueue has a VERY long return time on the .size() method. See http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html and the following test program: public class QueueLengthTimeTest { public static long countTest(Queue<Integer> q, int c) { long t = System.currentTimeMillis(); for (int i = 0; i < c; i++) { q.add(q.size()); } return System.currentTimeMillis() - t; } public static void main(String[] args) { int c = 1; for (int i = 0; i < 100; i++) { Runtime.getRuntime().gc(); long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c); Runtime.getRuntime().gc(); long t2 = countTest(new LinkedBlockingQueue<Integer>(), c); Runtime.getRuntime().gc(); long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c); System.out.println("count = " + c + ": ArrayBlockingQueue = " + t1 + ", LinkedBlockingQueue = " + t2 + ", ConcurrentLinkedQueue = " + t3); c = c * 2; } } }	2012-02-27 00:42:32 +01:00
Michael Peter Christen	213c8d97f2	use less processes in process pool	2012-02-25 14:07:20 +01:00
Michael Peter Christen	c639248c23	protection against strange answers from remote peers during search	2012-02-25 14:07:02 +01:00
Michael Peter Christen	1cd711d005	added classes for citation references (for new citation ranking)	2012-02-24 01:07:15 +01:00
Michael Peter Christen	e0f1e7d904	added new citation reference data structure that shall be used for a citation ranking	2012-02-23 01:22:29 +01:00
Michael Peter Christen	e18a4f6b74	more tolerant merge iterator	2012-02-23 01:21:24 +01:00
Michael Peter Christen	7e4e3fe5b6	free some memory after parsing html	2012-02-02 09:55:27 +01:00
Michael Peter Christen	4540174fe0	memory hacks	2012-02-02 07:37:00 +01:00
Michael Peter Christen	b4409cc803	small redesign of blob column index and usage	2012-02-02 06:43:57 +01:00
Michael Peter Christen	d5c1f2746e	performance hack	2012-02-02 06:43:15 +01:00
Michael Peter Christen	803963aebd	performance hack: better space grow in CharBuffer (speeds up html parser)	2012-02-01 23:27:59 +01:00
Michael Peter Christen	e2f8f263e8	changed storage of search words: keep order	2012-02-01 18:13:31 +01:00
Michael Peter Christen	0b67a0a5d8	added a column index for tables in blob files. This is heavily used during receiving of DHT submissions and when answering remote search requests. Both events together may have caused IO-deadlocking and this commit shall fix that.	2012-02-01 15:11:21 +01:00
Michael Peter Christen	e3bb73c3d6	serialized some database access methods	2012-01-31 21:13:49 +01:00
Michael Peter Christen	2ea585d616	fix for host navigator	2012-01-26 18:10:34 +01:00
Michael Peter Christen	ef78f22ee1	performance hack	2012-01-25 12:48:48 +01:00
Michael Peter Christen	a02fdf8625	better error messages	2012-01-23 00:47:25 +01:00
Michael Peter Christen	c6ba44468e	timeout = 5000 instead 3000	2012-01-23 00:45:32 +01:00
low012	8776b84c10	*) small fix to make password change function of reconfigureYACY.sh work again	2012-01-17 20:43:19 +01:00
Michael Peter Christen	4901cee3cc	suppress auto-tagged subject entries when sending out or receiving metadata from other peers	2012-01-17 02:10:05 +01:00
sixcooler	985b78cf89	correct 'available()' to use max of young / eden	2012-01-16 16:59:58 +01:00
sixcooler	4da8746275	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-01-16 01:48:36 +01:00
sixcooler	c9aaa9e00a	respect non-reserved Memory in GenerationMemoryStrategy and enable it again	2012-01-16 01:46:12 +01:00
Michael Peter Christen	37f2d1b3e9	replaced Thread initialization with ExecutorService pool for delete method. This is much faster and produces less blocking when using the Compressor class which is used by the HTCache. I.e. picture search is much faster now.	2012-01-16 01:05:30 +01:00
Michael Peter Christen	0d6176804b	emergency disabling of GenerationMemoryStrategy because of non-working available-method	2012-01-15 21:58:18 +01:00
Michael Peter Christen	87f0210480	enriched log output to find NPE in HeapReader	2012-01-15 12:08:46 +01:00
Michael Peter Christen	254adea51c	small fixes	2012-01-13 11:24:08 +01:00
Michael Peter Christen	49be60a7c8	WorkflowProcess is forced to make small pauses if shortMemoryStatus is reached.	2012-01-10 03:03:12 +01:00
Michael Peter Christen	b7bb84c0bb	set a limit to CharBuffer object size to fight against bad/too large content	2012-01-10 03:02:17 +01:00
Marek Otahal	72adbeae90	!Important: move from Hashtable to HashMap Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue, but I found notices that some (ugly big) helper classes had to be created in past to compensate missing Hashtable's functionality. I'd like input if we can remove some of them. look for //FIX: if these commits Signed-off-by: Marek Otahal <markotahal@gmail.com>	2012-01-09 01:29:18 +01:00
Marek Otahal	f75b5e40e0	little fix in copy() Signed-off-by: Marek Otahal <markotahal@gmail.com>	2012-01-09 01:16:46 +01:00
Michael Christen	216a287a85	Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	2012-01-04 20:16:37 +01:00
Michael Christen	20962a4ed7	added metadata node stub for metadata from blobs	2012-01-03 14:38:03 +01:00
Michael Christen	575dbbaa93	enhancements in Blob retrieval: try to use less CPU resources by testing a blob first that most certainly has wanted entries.	2012-01-02 02:14:05 +01:00
Roland 'Quix0r' Haeder	6d4e08ed06	Rewrote filesize() to (hopefully) avoid a NPE, rewrote Blacklist class to concurrent classes to avoid a CME	2011-12-29 03:42:38 +01:00
Roland 'Quix0r' Haeder	fa08ed5ae5	Fixed a lot CHMOD rights (no need for execute flag on .java/.html) and introduced local/remote crawl size ratio based check	2011-12-29 00:33:16 +01:00
Michael Christen	9e5894c784	Removed handling of components objects for URIMetadataRows. This is a preparation to replace this rows with nodes from the node store.	2011-12-17 01:27:08 +01:00
Michael Christen	c04bfaa51b	refactoring	2011-12-16 23:59:29 +01:00

810 Commits