Commit Graph

156 Commits

Author SHA1 Message Date
Michael Peter Christen
c95a84103a complete redesign of search process:
- removed 'worker' processes
- no internal time-out behaviour: methods either are successful or
return null
- waiting is only done on top-level
- removed snippet-production; this is replaced by solr snippets
- removed statistics based on solr size queries (they had been VERY
long); the statistics (like suggestions or tag cloud) are now again
based on the old but very fast RWI index. In portal or intranet mode the
RWI index is usually switched off; if you like to have statistics again
then you must switch on the rwis again in this mode.
- fixed many bugs regarding correct page counter
2013-02-26 17:16:31 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
The default schema uses only some of them and the resting search index
has now the following properties:
- webgraph size will have about 40 times as much entries as default
index
- the complete index size will increase and may be about the double size
of current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will cause that some old parts in YaCy can be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also avaiable for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
0f5b6f38c1 enhanced root-url detection 2013-01-03 19:21:21 +01:00
Michael Peter Christen
cb5cbec14d distinguishing modified query string and original query string 2012-12-15 00:05:46 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
8fb370d9f8 renovated the way how search results are count. should be correct now... 2012-11-05 03:19:28 +01:00
Michael Peter Christen
6629e37685 tried to clean up the search process mess 2012-11-01 17:16:43 +01:00
Michael Peter Christen
c5f67a5d6d fixed a problem with local search from solr results: now all results
from solr are shown (again)
2012-11-01 10:22:22 +01:00
sixcooler
57ddd63888 not hold a expensive cache of references for DHT-out,but but load them
on demand
see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4530
2012-10-21 20:00:36 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unneccessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
5d16c23a1f specified more URIMetadata as URIMetadataNode 2012-10-16 18:26:21 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of & parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
941873fba4 moved the index deletion functions from IndexControlRWIs to
IndexControlURLs where it appears more naturally. Because the RWI
administration is less important in the presence of Solr, the
IndexControlURL is now the default servlet when the Index Administration
button on the main menu is selected.
2012-10-10 00:09:27 +02:00
Michael Peter Christen
a4214694df We assert that no other metadata storage than solr is used now.
Therefore a property like solrConnected() must be true all the time.
Removal of this method causes removal of all write operations to the old
metadata index.
2012-09-26 16:05:11 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
Michael Peter Christen
31d4d38804 - extended the solr interface by a references-by-word-count method
- reduced danger that a non-existing RWI database causes NPEs
- added Solr queries to did-you-mean: this makes it possible that our
did-you-mean algorithm works together with only Solr and without RWIs
2012-08-31 13:03:00 +02:00
orbiter
67f2866cd0 small fixes 2012-08-24 21:44:22 +02:00
Michael Peter Christen
4716546ef5 - reduced memory usage in index transmission using a transformation of
Node to Row objects
- removed peerDeparture in solr remote search in case that peer does not
answer (this may be normal because it is allowed to switch this off)
2012-08-22 16:30:33 +02:00
Michael Peter Christen
0cab06c47c refactoring 2012-08-17 15:52:33 +02:00
Michael Peter Christen
06a78eecb7 code simplification 2012-08-17 14:43:32 +02:00
Michael Peter Christen
18f989dfb1 - refactoring (load -> getMetadata)
- added getDocument to retrieve Solr documents which shall replace
getMetadata
2012-08-17 01:34:38 +02:00
orbiter
e816b88b55 changed behaviour of metadata storage: in case that any solr is
attached, the metadata is not written to the metadata-db, even if it is
enabled but instead to solr. This prevents that metadata is written in
two store systems at the same time. It is also the next step to migrate
the current metadata-db to solr.
2012-08-10 15:39:10 +02:00
Michael Peter Christen
1687737771 Abstraction of HandleMap and HandleSet 2012-07-27 12:13:53 +02:00
Michael Peter Christen
76202f068e extended abstraction of local and remote solr index using one front-end
for index administration and querying.
2012-07-24 17:23:29 +02:00
orbiter
69e743d9e3 - more abstraction for the RWI index as preparation for solr integration
- added options in search index to switch parts of the index on or off
2012-07-22 13:18:45 +02:00
orbiter
05a3ffd03a patches to ensure that solr connectors are active ony if they have a
solr object assigned and vice versa
2012-07-20 11:47:50 +02:00
orbiter
0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
- replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be
done automatically
- implemented some isEmpty() methods
2012-07-10 22:59:03 +02:00
orbiter
c7afa8bc48 using SwitchboardConstants for solr attributes 2012-07-10 12:01:20 +02:00
sixcooler
2c5b68d932 more abstraction of error message 2012-07-05 14:50:37 +02:00
Michael Peter Christen
9758c521ab abstraction of error message 2012-07-05 14:27:28 +02:00
sixcooler
9b6e4e46ca fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4430 2012-07-05 14:06:00 +02:00
Michael Peter Christen
d3964253ae - added @SuppressWarnings to unused servlet method parameters
- removed unnecessary casts
- removed unnecessary throw statements
2012-07-05 09:14:04 +02:00
Michael Peter Christen
03280fb161 removed segments-concept and the Segments class:
the segments had been there to create a tenant-infrastructure but were
never be used since that was all much too complex. There will be a
replacement using a solr navigation using a segment field in the search
index.
2012-06-28 14:27:29 +02:00
Michael Peter Christen
b9dfca4b0a - fixed IndexFederated Servlet / a embedded Solr can now be selected
- added code stub for an embedded Solr but generation of Solr store is
still commented out (it works but is not yet ready for usage)
2012-06-25 11:34:38 +02:00
Michael Peter Christen
2bbb6c52cf added option to clean the triplestore when deleting the index 2012-06-12 01:54:36 +02:00
Roland 'Quix0r' Haeder
edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
#143

Conflicts:
	htroot/Supporter.java
	htroot/yacy/crawlReceipt.java
	htroot/yacy/transferRWI.java
	htroot/yacy/transferURL.java
	source/de/anomic/crawler/CrawlStacker.java
	source/de/anomic/data/ListManager.java
	source/net/yacy/peers/Protocol.java
	source/net/yacy/repository/Blacklist.java
	source/net/yacy/repository/LoaderDispatcher.java
	source/net/yacy/search/Switchboard.java
	source/net/yacy/search/index/MetadataRepository.java
	source/net/yacy/search/index/Segment.java
	source/net/yacy/search/query/RWIProcess.java
	source/net/yacy/search/snippet/MediaSnippet.java
2012-06-11 00:17:30 +02:00
Michael Peter Christen
ab7107b34b fixed RWIProcess queue limits: now discovering hidden results for mass
result retrieval
2012-06-08 09:14:54 +02:00
Michael Peter Christen
a1fe65b115 performance hacks 2012-06-05 12:06:26 +02:00
reger
6696cb1313 bugfix: lookup of peernames no result for active peer in page IndexControlRWIs_p.html -> Transfer RWI to other Peer
SeedDB.lookupByName searche for lowercase peerNames, while MapColumnIndex.getIndex uses peername as is in the keyset.
Changed the index init to insert lowercase peer names as key
2012-05-20 05:25:16 +02:00
Michael Peter Christen
4298f00d2d fixed bad usage of given words 2012-05-20 01:35:49 +02:00
Michael Peter Christen
33d1062c79 refactoring: the cache belongs to the crawler 2012-04-21 13:34:07 +02:00
Michael Christen
eff966f396 fix for search process (it was aborted too early during remote search) 2012-01-09 03:02:35 +01:00
Michael Christen
85bd4cc8bc better lookup for peer names 2011-12-25 10:14:15 +01:00
Michael Christen
9e5894c784 Removed handling of components objects for URIMetadataRows.
This is a preparation to replace this rows with nodes from the node
store.
2011-12-17 01:27:08 +01:00
Michael Christen
e9dc99fe15 added rules to set specific RWIs as private RWIs which are not
transmitted to remote peers. This will be used for private index copies
and phonetic indexes.
2011-12-14 22:15:51 +01:00
Michael Christen
f14faf503b better ranking because we wait a very little time during the search
process more to get better remote sear results into the ranking priority
stack
2011-12-06 02:24:51 +01:00
orbiter
a7df70221e refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7987 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-04 09:06:24 +00:00