Commit Graph

209 Commits

Author SHA1 Message Date
reger
c9f92abddc fix: application link count
(URIMetadataNode)
2014-04-02 03:21:51 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
2014-02-28 14:01:09 +01:00
reger
a3e2cca8e9 improve isOlder check to not overwrite node index with metadata on equal load date 2014-01-26 01:00:52 +01:00
orbiter
c351e47a84 fix for bad-formatted lonlat 2014-01-22 21:33:11 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
becuase the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
Michael Peter Christen
1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
e8e558a9b7 fix for content domain classification in URIMetadataNode 2013-09-03 10:49:09 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
Michael Peter Christen
0f3d8890db removed an assert which causes a shortcut call circuit 2013-08-22 10:12:25 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
reger
a6bf44212e bugfix: location (lat/lon) meta data retrival (Double.NaN check) 2013-06-30 03:50:07 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, also a
backlink-structure can be discovered and written to the index as well.
The host browser has been extended to show such backlinks to each
presented links. The host browser therefore can now show an information
where an document is linked. The new citation reference is computed as
likelyhood for a random click path with recursive usage of previously
computed likelyhood. This process is repeated until the likelyhood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
Michael Peter Christen
a1644ca0fd new workflow processor in Segment to enqueue indexing documents to solr 2013-05-30 12:34:53 +02:00
Michael Peter Christen
5344a1c5f7 getting the trash out 2013-05-29 16:09:05 +02:00
orbiter
7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
- check geolocation coordinates and accept only those, which are
well-formed
- the solr push process does not stop crawling any more if after 20
requests to Solr Solr does not accept the record. Instead, a severe log
entry asks the user to create a bug request
2013-05-03 00:24:39 +02:00
Michael Peter Christen
bb4bf3d8fd infinity timeout bug protection patch 2013-04-30 11:06:48 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy more
adjusted to smaller and 1-core devices.
- the workflow processor now starts no process at all. these are started
as soon as parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
ada3f27de7 added three new field for a better ranking: references_internal_i,
references_external_i and references_exthosts_i. These can be used to
count and evaluate the number of external links to every web page. An
experimental ranking function can be i.e.:
div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))
2013-04-12 16:17:14 +02:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configruation
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00
Michael Peter Christen
2b6c79d347 in method exists() also use the new caching-stacks for
documents/metadata
2013-03-04 01:13:17 +01:00
Michael Peter Christen
3b1d9dc884 made index storage from DHT search result concurrently. This prevents
blocking by high CPU usage during search. Also: removed query from Solr
for DHT search results; results are taken from the pending queue.
2013-03-02 10:25:52 +01:00
orbiter
d74472f562 corrected result counter 2013-02-27 22:40:23 +01:00
Michael Peter Christen
c95a84103a complete redesign of search process:
- removed 'worker' processes
- no internal time-out behaviour: methods either are successful or
return null
- waiting is only done on top-level
- removed snippet-production; this is replaced by solr snippets
- removed statistics based on solr size queries (they had been VERY
long); the statistics (like suggestions or tag cloud) are now again
based on the old but very fast RWI index. In portal or intranet mode the
RWI index is usually switched off; if you like to have statistics again
then you must switch on the rwis again in this mode.
- fixed many bugs regarding correct page counter
2013-02-26 17:16:31 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
The default schema uses only some of them and the resting search index
has now the following properties:
- webgraph size will have about 40 times as much entries as default
index
- the complete index size will increase and may be about the double size
of current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will cause that some old parts in YaCy can be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also avaiable for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
structure, but is not filled yet. To have the opportunity of a second
core, multi-core functionality had to be implemented to the
deep-embedded solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as addition to solr connections: all solr
connectors are now instances of an solr 'instance' object; this required
a complete re-design of the solr embedding
- migrated also caching and sharding ontop of new instance handling
- migrated the search apis to handle now the access to a specific core,
the default core named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; the must
retrieve the actuall connection object every time they want to write to
it (this solves also some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
2013-02-21 13:23:55 +01:00
reger
3897bb4409 added (manual) urldb migration (link on: Index Administraton -> Federated Solr Index)
- migrates all entries in old urldb

Metadata coordinate (lat / lon) NumberFormatException still relative often (see excerpt below), 
- added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format)
- removed possible typ conversion for lat() / lon() comparison with 0.0f, changed to 0.0  (leaving it to the compiler/optimizer to choose number format)

current log excerpt for NumberFormatException:
W 2013/01/14 00:10:07 StackTrace For input string: "-"
java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
...
Caused by: java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
2013-01-14 03:06:24 +01:00
Michael Peter Christen
0f5b6f38c1 enhanced root-url detection 2013-01-03 19:21:21 +01:00
Michael Peter Christen
5c0c56cfe1 Preparations to produce a click depth attribute in the search index.
This attribute can be used for ranking and for other purpose (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; i.e. calculation the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
2013-01-02 20:55:43 +01:00
reger
276e63401e small sanitary fixes
- exclude unix shell scripts in NSIS windows install archive
- replace link to env/grafics/yacy.gif to yacy.png (build.nsi)
- remove unused code lines (Blacklist_p, Response, WordReferenceVars)
- type & xhtml (RankingSolr_p.html)
2013-01-02 01:59:47 +01:00
Michael Peter Christen
24c9bb35f7 extended the Scheduler: introduced scheduled events
- an event type (once, regular) can be selected
- for this event type, a fixed time can be selected. This may be either
directly after startup or at one of the full hours at a day (==25
options)
The main point about this feature is the opportunity to start an action
directly after startup. That makes it possible to create YaCy
distributions which, after started at the first time, start to index
parts of the intranet/internet by itself.
2012-12-22 16:27:14 +01:00
reger
ad71747525 fix: set defaul language to "en" 2012-12-16 20:53:45 +01:00
Michael Peter Christen
8fc3679c66 using more pre-compile pattern for split methods 2012-11-26 13:11:55 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignatue.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the singature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adopted and
normalized. This caused that many files had to be changed.
2012-11-21 18:46:49 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during aquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
832eead998 Merge remote-tracking branch 'regerdev/master' 2012-11-18 22:04:11 +01:00
Michael Peter Christen
570e42c4e3 fix for filetype naviagtor 2012-11-07 13:53:29 +01:00
reger
633fbe9188 Fix Metadata handling
- language default on missing lang property to "uk" (fix set to nothing)
-  language set to TLD (added call to existing language calculation from TLD)
-  coordinate number exception on possible lat/lon content of "NaN,NaN"

adjust Netbeans IDE classpath (for Solr/Lucene 4.0.0 jars)
2012-11-04 02:07:59 +01:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unneccessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which had been then mixed with remote RWIs. Now these Solr documents are
feeded into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
2f536cb54d code cleanup: removed unised methods and made more methods and objects
private
2012-10-08 10:50:24 +02:00