Commit Graph

403 Commits

Author SHA1 Message Date
orbiter
7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
- check geolocation coordinates and accept only those, which are
well-formed
- the solr push process does not stop crawling any more if after 20
requests to Solr Solr does not accept the record. Instead, a severe log
entry asks the user to create a bug request
2013-05-03 00:24:39 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adopted logging end error management
- adopted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
Michael Peter Christen
50421171c3 added new schema fields:
hreflang_url_sxt and hreflang_cc_sxt
for
http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077

navigation_url_sxt and navigation_type_sxt
for
http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html

publisher_url_s
for http://support.google.com/plus/answer/1713826?hl=de

all fields are disabled by default and not written to the index.
2013-04-18 17:21:17 +02:00
Michael Peter Christen
7ab5093321 added new solr title_exact_signature_l and
description_exact_signature_l to be able to identify unique title and
unique description fields.
2013-04-16 01:35:15 +02:00
orbiter
17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
and html documents
2013-03-17 22:13:56 +01:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configruation
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
The default schema uses only some of them and the resting search index
has now the following properties:
- webgraph size will have about 40 times as much entries as default
index
- the complete index size will increase and may be about the double size
of current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will cause that some old parts in YaCy can be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also avaiable for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
6a4878940b fix in html parser and bookmark generation 2013-02-11 13:28:08 +01:00
reger
3897bb4409 added (manual) urldb migration (link on: Index Administraton -> Federated Solr Index)
- migrates all entries in old urldb

Metadata coordinate (lat / lon) NumberFormatException still relative often (see excerpt below), 
- added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format)
- removed possible typ conversion for lat() / lon() comparison with 0.0f, changed to 0.0  (leaving it to the compiler/optimizer to choose number format)

current log excerpt for NumberFormatException:
W 2013/01/14 00:10:07 StackTrace For input string: "-"
java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
...
Caused by: java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
2013-01-14 03:06:24 +01:00
reger
168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
- any system supporting opensearch specification can be configured
- search query is only forwarded to remote system if not enough results available on local peer
- discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config
     - sample config file with some general search engines with opensearch support
2012-12-29 08:24:48 +01:00
Michael Peter Christen
95712fdc8b update to pdf parser 2012-12-27 04:16:31 +01:00
Michael Peter Christen
34f8786508 removed dependency of vocabulary navigation from Jena and it's
triplestore; the vocabulary search is now done using generic solr fields
which are created on-the-fly during runtime.
2012-12-18 02:29:03 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
b5ee88c6af added more logging to get info which url causes performance problems 2012-12-02 16:52:12 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignatue.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the singature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adopted and
normalized. This caused that many files had to be changed.
2012-11-21 18:46:49 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during aquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
Michael Peter Christen
d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-11-04 09:21:21 +01:00
Michael Peter Christen
6905182d41 - fix for number of words log message
- adding meta:refresh also to crawler stack
2012-10-29 21:42:31 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
reger
722a447b0d - optimize code of augmented parsing to enhence document tags
- commented out augmentedparser.analyse (not function implemented yet)
- adjust init of document title list to always use same list type
2012-10-26 18:50:45 +02:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
Michael Peter Christen
b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-23 18:14:58 +02:00
Michael Peter Christen
b7ac1da6a3 gsa results shall have only one title in metadata and that should be the
visible title in the <title>-tag
2012-10-23 18:03:12 +02:00
reger
87aab9aa7c - fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url
- fix Document.addsubdocuments: sections might be initialized as Arrays.toList which does not provide the used .addAll methode
   see e.g. http://kamleshkr.wordpress.com/2010/02/17/inside-java-arrays-aslistt-a/
2012-10-22 22:48:35 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unneccessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
orbiter
68d0f8de03 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-09 20:36:32 +02:00
reger
bfb0d4c69b - add language detection from <html lang="xx"> tag
- add jaudiotagger jar to Netbeans-IDE project classpath
2012-10-09 20:02:58 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
4b5e0c1500 added an url rewriter which can be used to remove session ids from urls 2012-10-09 11:24:48 +02:00
Michael Peter Christen
584663ae8c - redesign of solr query construction
- fix for solr boosts and location search
- fix for number of search results in local search
2012-10-07 07:46:55 +02:00
Michael Peter Christen
6ab64746d7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-06 03:35:32 +02:00
sof
5cb244b79b Merge remote branch 'origin/master' 2012-10-05 18:54:39 +02:00
apfelmaennchen
88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
on the jaudiotagger library. The parser is disabled by default as it
needs to store temporary files for non file:// protocols, which might be
disliked. For your local MP3-collection it loads nicely Artist,
Title, Album etc. from the audio files meta data.
2012-10-05 18:54:26 +02:00
Michael Peter Christen
31485a963d refactoring 2012-10-02 21:57:50 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
24d2ee3c52 - better date ranking
- more protection against NPE and time travel effects
2012-09-26 18:36:32 +02:00
sixcooler
6c50d016ed pdf- and zipParser should not use forced Memory-Limits 2012-09-26 14:03:51 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
orbiter
63762d8f89 removed kelondro dependencies from cora 2012-09-20 19:38:22 +02:00
Michael Peter Christen
e54ac38095 - some corrections in usage of getFile() and getFileName()
- added more attributes in json response writer according to yacy
servlet
2012-09-11 23:28:21 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00