Commit Graph

607 Commits

Author SHA1 Message Date
orbiter
deadeb406e image alt tag strings should be tokenized 2013-09-01 13:48:10 +02:00
Michael Peter Christen
1a3e42eca4 index migration to lucene 4.4 2013-08-26 12:49:39 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
sixcooler
1bc6003057 raise autoCommit maxTime to 3 minutes to reduce IO
lower mergeFactor again (5) for less segments
2013-08-06 03:58:53 +02:00
orbiter
944ae5686c added donation plea to the about box as default (you can replace this in
your peer!)
2013-08-01 12:11:56 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
orbiter
e7fcb81cea we should not do too much greedylearning at this time as we don't have
enough experience with it. set greedylearning.limit.doccount to a much
lower limit.
2013-07-27 11:22:40 +02:00
orbiter
bf0ad04e1b apply load limitation also to dht-in 2013-07-27 10:42:38 +02:00
orbiter
f50b596e0b do not run dht distribution if system load is over 2.5 2013-07-23 19:32:32 +02:00
orbiter
e24016e30a added the property federated.service.solr.indexing.timeout to yacy.init
to provide a configurable time-out for solr; see also:
http://bugs.yacy.net/view.php?id=254
2013-07-22 17:45:12 +02:00
Roland Haeder
98e10f95e2 Added some cora package loggers 2013-07-17 18:28:10 +02:00
orbiter
1b43e02b86 Merge branch 'master' of git://gitorious.org/~quix0r/yacy/quix0rs-yacy-rc1 2013-07-13 18:54:18 +02:00
orbiter
a548354c71 replaced type of solr schema object sku of text_en_splitting_tight by
string
2013-07-13 18:54:09 +02:00
Roland Haeder
ebbb3bc5c1 Fixed CHMOD on many files + added missing loggers (e.g. jena) and made some noisy loggers quiet 2013-07-13 13:12:36 +02:00
orbiter
e609ec388a metager whitelist update 2013-07-10 15:13:04 +02:00
Michael Peter Christen
2716dfc46c increase crawler speed by reduction of the busysleep time 2013-07-08 23:40:31 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
5a5d411ec0 new robots_i attribute fields 2013-07-02 14:29:13 +02:00
orbiter
7c6ccc426c set crawlingQ to true by default because most webpages are dynamic and
crawlingQ should only be switched off in case of crawler traps
2013-06-29 20:28:14 +02:00
Michael Peter Christen
16d1d744fa added url_file_name_s in default collection schema for the file name
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which no longer has the file name as the
last part of the path list.

The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
2013-06-25 16:27:20 +02:00
orbiter
8792e6c6e9 stub for better image indexing 2013-06-18 13:28:30 +02:00
Michael Peter Christen
570511f3c8 removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrer in the host browser is possible
without them. Therefore the host browser now shows not only internal
but also external referrers for each link.
2013-06-13 13:01:28 +02:00
Michael Peter Christen
fd1776a3b0 added a new 'Citations' function: each search result item can now be
explored for citations within other documents. A click on the
'Citations' link shows an analysis with all text lines in the document
each with a complete list of documents which contain the same line. A
second section shows the linking documents in ascending order of number
of citations from the original document. Because documents from
different hosts are most interesting here, they are listed at the top of
the page as possible 'copypasta' source.
2013-06-12 15:02:49 +02:00
Michael Peter Christen
7754a1263b switching back to the merge factor 10; the solr default. 2013-06-12 11:29:35 +02:00
Michael Peter Christen
1762911f57 added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods causes deadlocks inside
solr.
2013-06-12 02:13:18 +02:00
Michael Peter Christen
959ccc4675 increased the solr merge factor because 4 was too much IO load for
frequent index receiving and re-indexing after clickdepth/cr
calculation.
2013-06-11 16:51:40 +02:00
Michael Peter Christen
20fab1feb6 allip net has greedy learning disabled 2013-06-11 14:52:46 +02:00
Michael Peter Christen
6115bef335 added a 'greedy learning' mechanism which causes a 'fresh'
yacy to load linked web pages from search results until the total
number of web pages reaches 15000. This gives fresh peers a 'boost'
to get a personalized search index faster.
2013-06-11 14:42:30 +02:00
Michael Peter Christen
856e5c42ae the line "Web Search by the People, for the People" is more generic for
P2P and portal search as default search string. Otherwise, if people
switch to Portal mode, the "P2P Web Search" does not make sense.
2013-06-10 18:36:06 +02:00
Michael Peter Christen
713a6199ef activated citation ranking by default 2013-06-07 14:26:14 +02:00
Michael Peter Christen
f7a4377812 usage of the new normalized link popularity CRn as default ranking
function. This replaces the previous formula, which was bad. Before you
update to this version, please check if you changed the ranking function
yourself before, since it will be overwritten.
2013-06-07 13:22:22 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, a
backlink structure can also be discovered and written to the index.
The host browser has been extended to show such backlinks for each
presented link. The host browser can therefore now show where a
document is linked from. The new citation reference is computed as the
likelihood of a random click path, with recursive usage of previously
computed likelihoods. This process is repeated until the likelihood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
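The iteration described in this commit message can be illustrated with a minimal Python sketch. This is not YaCy's implementation: it is a PageRank-style approximation, and the function name `citation_rank`, the damping factor, the convergence tolerance, and the normalization by the maximum value are all assumptions made for illustration.

```python
def citation_rank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Hypothetical sketch of a converging click-likelihood ranking.

    links maps each page to the list of pages it links to (same domain).
    Returns a dict page -> CRn with 0 <= CRn <= 1.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    if n == 0:
        return {}
    # Start with a uniform likelihood for every page.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Assumed damping term: a random jump to any page.
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            # Pages without outgoing links spread their likelihood evenly.
            targets = links.get(src) or pages
            share = damping * rank[src] / len(targets)
            for t in targets:
                new[t] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:  # likelihoods have converged
            break
    # Normalize so the most popular page gets CRn = 1 (assumed scheme).
    top = max(rank.values())
    return {p: r / top for p, r in rank.items()}


if __name__ == "__main__":
    site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"]}
    for page, crn in sorted(citation_rank(site).items()):
        print(page, round(crn, 3))
```

In this toy site the start page receives links from both subpages, so it ends up with the highest normalized value.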
reger
8a7fcb391d enable use of solrcore.properties for property substitution of solrconfig.xml
- move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties
- add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties
 
reason: on 32bit MMapDirectoryFactory may fail with.....
Caused by: java.io.IOException: Map failed
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
	at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
2013-06-01 05:43:08 +02:00
Michael Peter Christen
eb9d0ba5b1 ranking and boost function update, small bugfixes, better default search
field for solr
2013-05-30 16:30:35 +02:00
Michael Peter Christen
a8dc4346e8 default configuration of MMapDirectoryFactory for solr, increased lock
timeout, less documents from remote searches (too many results had
easily blocked a peer)
2013-05-30 12:31:28 +02:00
Michael Peter Christen
0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of
soft commits, reduced caching size of search events, ensured that solr
results are processed before connection is closed so that they do not
stay in RAM too long
2013-05-29 18:27:27 +02:00
Michael Peter Christen
536fd1450e added new keys for update locations 2013-05-29 13:10:32 +02:00
orbiter
a83c2fe833 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-10 12:02:40 +02:00
orbiter
4baa0d4a97 Added a default keystore for ssl encryption of the YaCy web interface.
This will enable https-access to YaCy, but this feature is disabled by
default using the new server.https=false attribute. This has two
purposes:
- make it easier for everyone to use https (just set server.https=true)
- provide the basis for secure yacy-to-yacy communication in the future
2013-05-10 12:02:31 +02:00
reger
da191c839d reduce SolrConnectorLogging setting (from default ALL to INFO) 2013-05-10 05:54:07 +02:00
Michael Peter Christen
9bd2aee180 migrated to solr 4.3.0 2013-05-09 02:17:53 +02:00
Michael Peter Christen
cca19d94d4 re-declared some fields to be of type string rather than text which
makes them more efficient and less large
2013-05-06 16:45:54 +02:00
Michael Peter Christen
cc90f82dbb increased default proxy client timeout to one minute 2013-05-06 14:58:18 +02:00
Michael Peter Christen
50421171c3 added new schema fields:
hreflang_url_sxt and hreflang_cc_sxt
for
http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077

navigation_url_sxt and navigation_type_sxt
for
http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html

publisher_url_s
for http://support.google.com/plus/answer/1713826?hl=de

all fields are disabled by default and not written to the index.
2013-04-18 17:21:17 +02:00
Michael Peter Christen
d05dc07cff setting of new default values for ranking 2013-04-16 15:02:00 +02:00
Michael Peter Christen
97775fbebc fixed ranking for add-function queries: this did not work. The option
was removed. All function queries are now boosts (multiplies the score
according to a function). This is also the recommended way to boost
rankings based on functions as explained in
http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
2013-04-16 14:45:14 +02:00
Michael Peter Christen
7ab5093321 added new solr title_exact_signature_l and
description_exact_signature_l to be able to identify unique title and
unique description fields.
2013-04-16 01:35:15 +02:00
Michael Peter Christen
27d6222880 added new field host_extent_i which, after a crawl and postprocessing,
holds the number of documents for the host where the document is hosted.
This is necessary for ranking and the norming of references per local
host in the ranking computation.
2013-04-14 20:52:40 +02:00
Michael Peter Christen
ada3f27de7 added three new fields for a better ranking: references_internal_i,
references_external_i and references_exthosts_i. These can be used to
count and evaluate the number of external links to every web page. An
experimental ranking function can be, e.g.:
div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))
2013-04-12 16:17:14 +02:00
reger
e89491271f - fix opensearch discover err msg - webgraph not enabled - if no opensearchdescription link found in index
- remove search2.net from sample config (is down)
2013-04-04 00:40:59 +02:00
orbiter
17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
and html documents
2013-03-17 22:13:56 +01:00
Michael Peter Christen
2d36a7eaf5 - do not create a new query for all remote peers
- no document search this time
- adjusted banner and network to not show 'WORDS' but DHT Chunks. This
is to avoid confusion for robinson peers which do not create Word
Entries
2013-03-15 00:14:28 +01:00
Michael Peter Christen
4af0839be2 use appropriate ranking for each search situation:
- when using the /date modifier, a date ranking profile is used
- when using a site: modifier, a ranking profile supporting longer urls
is used
2013-03-14 21:13:12 +01:00
Michael Peter Christen
2080fc7406 removed unused tag fields 2013-03-14 10:35:21 +01:00
orbiter
6b13dd0d3d added clickdepth field writing for webgraph core (unfinished) 2013-03-14 01:35:38 +01:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configuration
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00
Michael Peter Christen
25300913fa fixes to search debugging after testing with the different search
debugging options
2013-03-05 21:28:22 +01:00
orbiter
b1140e3d82 added debug switches for detailed search testing 2013-03-05 12:19:32 +01:00
Michael Peter Christen
0d7b4bc891 better protection against OOM during search flush and fixed missing
result push
2013-03-03 23:45:47 +01:00
Michael Peter Christen
3b1d9dc884 made index storage from DHT search results concurrent. This prevents
blocking by high CPU usage during search. Also: removed query from Solr
for DHT search results; results are taken from the pending queue.
2013-03-02 10:25:52 +01:00
orbiter
0f7ea7ad9f - enhanced solr.add procedure for mass adds
- removed unused solr access classes
- made snippet generation for documents from YaCy RWI/DHT concurrent (as
it was before the search process removal)
- reduced the number of remote results in settings file because the
processing of such mass documents add is too CPU-intensive (in Solr)
2013-03-01 15:27:17 +01:00
Michael Peter Christen
089dee1770 - generalized SchemaConfiguration into super-class Configuration and
adapted other classes which used the configuration-only access for that
class
- removed many warnings
- adjusted logging
2013-02-25 00:09:41 +01:00
Michael Peter Christen
56d5946a59 - added flags in IndexFederated_p.html to switch on or off the webgraph
index (new solr core webgraph) .. this is now off by default
- completely redesigned this servlet
- added description how to attach a remote solr
- adjusted naming of servlet and menus
- moved 'lazy initialization' attribute from IndexSchema to
IndexFederated (this is a general option) back again.
2013-02-24 18:09:34 +01:00
Michael Peter Christen
461d46101d - Removed log4j from libraries. This can be removed because the package
log4j-over-slf4j is there. From slf4j all loggings are routed to the jdk
logger. Now all loggings are consistently done to the jdk logger.
- added some lines to the logging properties to suppress many solr
logging statements. The number of the logging entries had already become
a performance issue, therefore removing these from the log should
increase performance.
2013-02-23 16:45:05 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
The default schema uses only some of them and the resulting search index
has now the following properties:
- webgraph size will have about 40 times as many entries as default
index
- the complete index size will increase and may be about double the
current size
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will allow some old parts of YaCy to be removed,
e.g. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also available for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
structure, but is not filled yet. To have the opportunity of a second
core, multi-core functionality had to be implemented to the
deep-embedded solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as addition to solr connections: all solr
connectors are now instances of a solr 'instance' object; this required
a complete re-design of the solr embedding
- migrated also caching and sharding on top of new instance handling
- migrated the search apis to handle now the access to a specific core,
the default core named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; they must
retrieve the actual connection object every time they want to write to
it (this solves also some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
2013-02-21 13:23:55 +01:00
Michael Peter Christen
4111606654 removed the commitWithin attribute because that is not the right way
to update the index for us. It may also be superfluous with
the solr 4.0 softcommit.
2013-02-13 02:29:47 +01:00
Michael Peter Christen
d70d99fab5 added more metadata fields and facets to OpensearchResponseWriter.
This should make it possible to replace the original and enriched yacy
opensearch result with a solr output in opensearch format.
2013-02-11 22:10:14 +01:00
Michael Peter Christen
8651ec35fe turned author_s into the multi-valued field author_sxt 2013-01-24 18:24:31 +01:00
Michael Peter Christen
4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy has now an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore the commit strategy when doing a search from
the web interface was changed (it's done every time before a search is
done).

The softcommit feature was implemented because it was needed for the
following changes (customer demands), which are also included in this
git commit:

- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.

To support the new softcommit strategy, the commitWithinMs option was
set to -1 to disable automatic commit based on document insert times. If
documents are inserted permanently then also a commit would happen
permanently whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
2013-01-23 14:40:58 +01:00
Michael Peter Christen
db024a4e19 added new solr fields (unused yet; implementation will follow) 2013-01-21 18:02:29 +01:00
Michael Peter Christen
9b5bdae1b4 Reverted setting of MMapDirectoryFactory from solrconfig; see
http://forum.yacy-websuche.de/viewtopic.php?p=27509#p27509
Instead, the start script checks if the host is a 64-bit host and
sets -Dsolr.directoryFactory=solr.MMapDirectoryFactory as java option

Reverted the ramBufferSizeMB setting (this was not enabled anyway)
because that may be too much memory for small peers and embedded
systems.

Activated the mergeFactor 4; this was commented out by mistake
2013-01-21 17:55:28 +01:00
orbiter
eb68a30947 solr performance settings
the target of these performance settings is the reduction of IO in
general and during search in particular.
- reduced mergeFactor to 4. This will increase the IO during indexing,
but will reduce IO during search. It will also greatly reduce the number
of open files which should make it possible to have overall larger
indexes until the number of open files in an OS is reached.
- increased ramBufferSizeMB to 256mb. This will reduce the number of
commits. This change may compensate the reduction of the mergeFactor.
- disabled updateLog. This is a real-time search feature which is
available in YaCy anyway because a commit is forced if index.html is
called. The updateLog feature causes a lot of IO during indexing and
search and produced a lot of files in SEGMENTS/solr_40/data/tlog
2013-01-19 11:21:33 +01:00
Michael Peter Christen
f53703df62 using MMapDirectoryFactory as solution for ClosedChannelException given
in https://issues.apache.org/jira/browse/SOLR-2247
2013-01-16 14:35:37 +01:00
Michael Peter Christen
22c694f906 activated the clickdepth_i attribute for solr again because the
calculation of that value is not as extensive as expected and
furthermore the value is very useful for ranking
2013-01-05 01:00:18 +01:00
Michael Peter Christen
5a0eb1b268 clickpath should not be active by default because it needs extensive
computation - partly to be implemented
2013-01-03 01:30:05 +01:00
Michael Peter Christen
5c0c56cfe1 Preparations to produce a click depth attribute in the search index.
This attribute can be used for ranking and for other purpose (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; e.g. calculating the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
2013-01-02 20:55:43 +01:00
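The second step mentioned in this commit message, finding the shortest click depth after a crawl, can be sketched as a breadth-first search over the link graph. This is an illustration only, not YaCy's code: the function names and the in-memory `links` dictionary are hypothetical, and -1 for pages not reachable from the root follows the convention mentioned in the clickdepth_i commit.

```python
from collections import deque


def click_depths(links, root):
    """Return page -> shortest number of clicks from root (hypothetical helper).

    links maps each page to the list of pages it links to.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first visit in breadth-first order is the shortest path.
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def click_depth(links, root, page):
    """Shortest click depth of one page, or -1 if it is unreachable."""
    return click_depths(links, root).get(page, -1)


if __name__ == "__main__":
    site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": []}
    print(click_depth(site, "/", "/c"))
    print(click_depth(site, "/", "/orphan"))
```

Backtracking from a page during indexing may find a longer path first, which is why a separate pass over the complete graph is needed for the shortest depth.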
Michael Peter Christen
295884fd54 - Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8'
- fixed conflict in htroot/yacysearch.java
- removed nedres check because that causes that the remote server is not
called at all in most cases (local index has already results but we want
more)
- fixed a regex bug (a '=' too much)
2013-01-02 15:08:07 +01:00
reger
168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
- any system supporting opensearch specification can be configured
- search query is only forwarded to remote system if not enough results available on local peer
- discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config
- sample config file with some general search engines with opensearch support
2012-12-29 08:24:48 +01:00
reger
7761b60325 fix: Broken Link on Crawler_p.html - issue 218
http://bugs.yacy.net/view.php?id=218
- reduced Solr logging (/select)
2012-12-29 04:53:20 +01:00
reger
e9e0d63897 Add config option to show HostBrowser link in search result
- ConfigPortal: added checkbox Host Browser
- yacy.init: added search.result.show.hostbrowser as default = on (true)
- fix HostBrowser: broken link to protected WebStructurePicture for public user
2012-12-27 10:01:10 +01:00
Michael Peter Christen
98819ec3d9 use solr boost configuration to select search fields. At this time it is
possible to enter a negative boost value to switch that value off. This
might be different in the future with a better input interface.
2012-12-27 03:17:45 +01:00
Michael Peter Christen
01200f06cc using the author field as solr-native facet. this makes it necessary to
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
2012-12-19 01:56:33 +01:00
Michael Peter Christen
eac9650b31 added another solr field clickdepth_i which reflects the number of
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all other with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
2012-12-18 17:20:42 +01:00
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more important than others. But
this is not true for typical link pages like menus. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. This field can contain a formula which computes the
boost according to given field values. After some experiments the
following formula is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that formula.
2012-12-18 14:42:35 +01:00
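To see how the default formula in this commit message behaves, it can be evaluated outside Solr. The sketch below assumes that the trailing ^0.4 is a multiplicative weight on the function value, as in Solr's dismax-style bf syntax, and not an exponent. The function name `default_boost` is hypothetical.

```python
import math


def default_boost(references, inboundlinks, weight=0.4):
    """Evaluate div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6)).

    The result is scaled by `weight`, assuming ^0.4 is a multiplicative
    boost weight in the bf parameter (assumption, see lead-in).
    """
    return weight * (1 + references) / math.pow(1 + inboundlinks, 1.6)


if __name__ == "__main__":
    # Many incoming references, few links on the page: higher boost.
    print(default_boost(references=50, inboundlinks=5))
    # Same references on a menu-like page full of links: lower boost.
    print(default_boost(references=50, inboundlinks=200))
```

The exponent 1.6 on the link count penalizes link-heavy pages faster than references reward them, which matches the stated intent of discounting menu-like pages.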
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
ea033f8f8e added number of characters in url to default index to be able to use
this field for ranking
2012-12-02 16:53:02 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally, each time a signature is written, it is
checked if the signature already exists in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because this is the first time a long number is used as a value in the Solr
store, the value naming in the YaCySchema also had to be adapted and
normalized. As a result, many files had to be changed.
2012-11-21 18:46:49 +01:00
reger
328ce0b297 fix: remove fixed individual testing IP (85.25.151.30 = server4you.de) from default/yacy.network.freeworld.unit 2012-11-11 21:19:18 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
sixcooler
2d972f289a raise commitWithinMs to default-value from SwitchBoard
(results in lower hd-io)

no dots in memory-graph (there are too many of them)
2012-10-26 02:12:45 +02:00
Michael Peter Christen
1baf498d59 - show more lines in online log
- reverse order is default now
2012-10-25 18:38:39 +02:00
sixcooler
206e7bcf94 whitelist yacyportalsearch aka search.yacy.net 2012-10-23 03:49:27 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
42e525ca9a enhanced the host browser 2012-10-08 14:00:14 +02:00
sof
5cb244b79b Merge remote branch 'origin/master' 2012-10-05 18:54:39 +02:00
apfelmaennchen
88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
on the jaudiotagger library. The parser is disabled by default as it
needs to store temporary files for non file:// protocols, which might be
disliked. For your local MP3 collection it nicely loads Artist,
Title, Album etc. from the audio files' metadata.
2012-10-05 18:54:26 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
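The file format listed in this commit message is simple enough to illustrate with a short parser. The sketch below is not YaCy's reader. The function names are hypothetical, and skipping blank lines and stripping whitespace around each word are assumptions that the message does not state.

```python
def parse_synonyms(text):
    """Parse synonym file content into a list of synonym groups."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        # Comment lines start with '#'; blank lines are skipped (assumption).
        if not line or line.startswith("#"):
            continue
        # The list may be enclosed with {} like in a GSA synonyms file.
        if line.startswith("{") and line.endswith("}"):
            line = line[1:-1]
        words = [w.strip() for w in line.split(",") if w.strip()]
        if words:
            groups.append(words)
    return groups


def build_lookup(groups):
    """Map every word to the set of its synonyms from all groups."""
    lookup = {}
    for group in groups:
        for word in group:
            others = (w for w in group if w != word)
            lookup.setdefault(word, set()).update(others)
    return lookup


if __name__ == "__main__":
    sample = "# stemming alternatives\nrun, running, ran\n{car, automobile}\n"
    lookup = build_lookup(parse_synonyms(sample))
    print(sorted(lookup["run"]))
```

With such a lookup, every word found in a document can be expanded to its synonyms before they are written to a field like synonyms_t.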
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
f45f7fc12e added new Host Browser to main menu:
this new search interface is something completely new for search, but
completely common on desktops: browse a web space like one would browse
a file system in a file browser. The file listing is created using the
search index and a faceted restriction to specific domains.
2012-09-28 22:45:16 +02:00
Michael Peter Christen
80edd8ecd7 some more after-refactoring fixes 2012-09-28 10:24:57 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create a document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is that an index dump can cause data loss if an indexing is running
during index dump
- caught more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
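The rows=0 remark can be illustrated with a small sketch that builds such a count-only query URL. This is Python for illustration, not YaCy code; `count_query_url` is a hypothetical name, and the base URL in the usage line is an assumption borrowed from the example URL in another commit message of this log.

```python
from urllib.parse import urlencode


def count_query_url(base_url, query="*:*"):
    """Build a Solr select URL that requests no documents (hypothetical helper).

    With rows=0 Solr returns no result rows, but the response still
    carries the total hit count (numFound).
    """
    return base_url + "?" + urlencode({"q": query, "rows": 0})


# Assumed local endpoint, for illustration only.
url = count_query_url("http://localhost:8090/solr/select")
```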
Michael Peter Christen
0504b01bdc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-14 00:48:17 +02:00
orbiter
9413f77b65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-13 23:54:26 +02:00
orbiter
a55e77a115 added twitter search heuristic 2012-09-13 23:53:53 +02:00
Michael Peter Christen
62add1d564 added the protocol and the file name extension to the solr fields since
these fields are probably facets in file search
2012-09-11 22:46:39 +02:00
Michael Peter Christen
9db032664e activate two solr fields which will be used by administration interface
(later)
2012-09-11 20:15:54 +02:00
Michael Peter Christen
10b911eed4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-07 22:07:02 +02:00
Michael Peter Christen
be67c70a47 added Solr fields:
inboundlinks_text_chars_val
inboundlinks_text_words_val
inboundlinks_alttag_txt
outboundlinks_text_chars_val
outboundlinks_text_words_val
outboundlinks_alttag_txt
2012-09-07 22:06:51 +02:00
orbiter
d73fff0e0e added solr field images_withalt_i 2012-09-07 21:33:45 +02:00
Michael Peter Christen
ee23fc7a32 added h1..h6 counter fields 2012-09-04 14:11:11 +02:00
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated, all of
them are assigned to the documents when they are indexed as result of
such a crawl start
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and also in the gsa
interface using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field all the time)
2012-09-03 15:26:08 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
Michael Peter Christen
2ddc33646a added new field for solr:
url_paths_sxt
url_parameter_i
url_parameter_key_sxt
url_parameter_value_sxt
url_chars_i
2012-08-29 16:11:23 +02:00
Michael Peter Christen
75d5e3475d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-29 10:13:51 +02:00
cominch
dc468dad01 add content control features for custom filter lists 2012-08-29 09:04:28 +02:00
Michael Peter Christen
316b5fe116 - added a solr type definition verifier
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
2012-08-28 16:58:06 +02:00
Michael Peter Christen
4c79ddb91e switched off some solr logging 2012-08-27 14:41:47 +02:00
Michael Peter Christen
e8acd542b5 - added faceted drill-down for host and geolocation to solr queries
- added a new geolocation field to index schema, the old values are
migrated if possible
2012-08-27 14:41:33 +02:00
Michael Peter Christen
af764c106c re-activated audio and video search because they obviously work (!) 2012-08-22 01:56:13 +02:00
orbiter
716ea0cfe2 sorted the solr schema into mandatory and optional fields; reduced
number of used field to reduce solr index size
2012-08-21 23:52:56 +02:00
orbiter
db6863db77 reduced solr cache sizes to check if that solves memory problems a bit 2012-08-18 13:45:37 +02:00
Michael Peter Christen
23226676c6 FOR THE BRAVE.. this is a forced migration to solr which is now ready
for production as a replacement of the metadata-db.
This intermediate release 1.041 will switch on the previously optional
solr index and the old metadata-db will still work as it did before.
Solr+metadata are accessed in mixed mode, no migration is done yet.
If this does not cause a catastrophe by the end of the weekend, we will
do a YaCy 1.1 main release containing this as default.
2012-08-16 18:17:47 +02:00
Michael Peter Christen
a1b2c9a67d doctype2mime fix, influences metadata conversion between old metadata
and solr
2012-08-16 17:49:35 +02:00
Michael Peter Christen
703f427303 fixed some peer-ping connection details
- larger time-out
- removed too old seedlist
- fixed a bug in connection test
2012-08-16 17:11:54 +02:00
Michael Peter Christen
ea49a8aa8c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-14 12:40:44 +02:00
Michael Peter Christen
aab0b680c3 - added xslt support for solr result formats.
try e.g.
http://localhost:8090/solr/select?q=*:*&start=0&rows=10&wt=xslt&tr=json.xsl
- added servlet-side mime-type configuration for streamed servlets. this
is used for the result formatters in solr result formats
2012-08-14 11:12:50 +02:00
cominch
e2119f4e76 augmented browsing: replace htmlparser by jsoup, which is more stable
and reliable
2012-08-14 10:06:12 +02:00
Michael Peter Christen
b51df6c7e8 - added coordinate storage in solr schema
- fixed shutdown process
- fixed some solr-to-metadata reading
- added a large number of metadata attributes in ViewFile.html
2012-08-13 10:40:04 +02:00
Michael Peter Christen
f9c0e6e950 - Implemented and integrated the URIMetadataNode object which is a
metadata representation from the solr index. This shall replace metadata
from the built-in database in the future.
- added the Solr-driven metadata into the search index of YaCy which
makes it now possible to run YaCy without the old metadata index. This
is a major step forward to a full migration to Solr.
2012-08-10 13:26:51 +02:00
Michael Peter Christen
bca4a16603 replaced the multivalue generic string field name suffix _ss by _txt
because _ss is not part of the standard solr example schema.
2012-08-06 17:58:09 +02:00
orbiter
67edfd991c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-05 15:49:48 +02:00
orbiter
d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
writings to the Metadata-DB are now also done to solr. This includes
metadata transfer during search and rwi transfer.

The new/added solr fields are:

## time when resource was loaded
load_date_dt

## date until resource shall be considered as fresh
fresh_date_dt

## id of the host, a 6-byte hash that is part of the document id
host_id_s

## ids of referrer to this document
referrer_id_ss

## the md5 of the raw source
md5_s

## the name of the publisher of the document
publisher_t

## the language used in the document; starts with primary language
language_ss

## an external ranking value
ranking_i

## the size of the raw source
size_i

## number of links to audio resources
audiolinkscount_i

## number of links to video resources
videolinkscount_i

## number of links to application resources
applinkscount_i
2012-08-05 15:49:27 +02:00
Michael Peter Christen
3ce04cecf3 bad hack to prevent a bug appearing in solr 2012-07-31 23:49:07 +02:00
Michael Peter Christen
826967513b changed options in IndexFederated_p to switch on/off parts of the index
individually. The settings are experimental and the values of the
settings will be overwritten when an index migration from urldb to solr
starts.
2012-07-23 16:28:39 +02:00
Michael Peter Christen
1517a3b7b9 added webm mime-type 2012-07-08 17:59:20 +02:00
Michael Peter Christen
0301aba1e9 removed unused method parameters 2012-07-05 10:23:07 +02:00
Michael Peter Christen
4de50fe808 adding more principal peers for bootstrapping 2012-07-05 00:43:41 +02:00
reger
067728bccc add search result heuristic. adding a crawl job with depth-1 for every displayed search result (crawling every external linked page of displayed search result pages) 2012-07-01 00:12:20 +02:00
Michael Peter Christen
508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
the meta-refresh html header field.
2012-06-28 13:27:45 +02:00
Michael Peter Christen
9116013c64 - allow lazy initialization of solr value (if using 'lazy', then no
0-values and no empty strings are written). This may save a lot of
memory (in ram and on disc) if excessive 0-values or empty strings
appear.
- do not allow default boolean values for checkboxes because that does
not make sense: browsers may omit the checkbox attribute name if the box
is not checked. A default value 'true' would not comply with the
semantics of the browser's response.
- add a checkbox in IndexFederated_p for the lazy initialization of solr
fields.
2012-06-27 12:17:58 +02:00
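The lazy initialization described above amounts to leaving out fields whose value is 0 or an empty string. A minimal Python sketch under that reading; `lazy_fields` and `is_empty` are hypothetical names, and keeping boolean values is an assumption not stated in the commit message.

```python
def is_empty(value):
    """True for numeric zeros and empty strings."""
    # Assumption: booleans are kept even though False == 0 in Python.
    if isinstance(value, bool):
        return False
    return value == 0 or value == ""


def lazy_fields(doc):
    """Return only the fields that would be written in 'lazy' mode."""
    return {name: value for name, value in doc.items() if not is_empty(value)}
```

A document with size_i = 0 and an empty publisher_t would thus be stored without those two fields.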
Michael Peter Christen
c03d306afa shorter autocommit time (now: 1 second) to prevent that user cannot see
results in solr the first time they try it out. The value can now be
easily set to a higher number using the IndexFederated_p interface.
2012-06-26 14:53:45 +02:00
Michael Peter Christen
3fd4a01286 added option to record urls that are forwarded to the solr index 2012-06-26 13:54:48 +02:00
Michael Peter Christen
8dd469b9dd added option to configure the autocommit delay time of solr on-the-fly 2012-06-25 14:59:46 +02:00
Michael Peter Christen
b9dfca4b0a - fixed IndexFederated Servlet / an embedded Solr can now be selected
- added code stub for an embedded Solr but generation of Solr store is
still commented out (it works but is not yet ready for usage)
2012-06-25 11:34:38 +02:00
Michael Peter Christen
1be0025a9c - added test for EmbeddedSolrConnector
- added needed libraries for this test
this includes most (all) files needed for an embedded solr
2012-06-22 00:36:49 +02:00
Michael Peter Christen
dbdd697f4d moved RDFaParser.xsl configuration file to defaults 2012-06-21 16:09:12 +02:00
Michael Peter Christen
8738336408 set Xms lower than Xmx 2012-06-19 08:45:49 +02:00
Michael Peter Christen
96f6a5869f more robust OAI-PMH client (large time-out, three re-tries). OAI-PMH
servers appear to be very slow sometimes
2012-06-16 22:30:31 +02:00
Michael Peter Christen
6d17686258 made triplestore persistent by default
added a size display in triplestore servlet
2012-06-15 19:13:07 +02:00
cominch
3c255c025b Show tags in search results (if activated in ConfigPortal_p.html) 2012-06-15 10:43:05 +02:00
Michael Peter Christen
a5cdfb91de - fixed Cache link (below snippet)
- added 'Augmented Proxy' link below snippet
- added configuration options for augmented proxy
2012-06-14 19:55:34 +02:00
Roland 'Quix0r' Haeder
af5a597e47 Scroogle is not coming back, remove dead code
Conflicts:
	source/net/yacy/search/Switchboard.java
2012-06-10 23:38:41 +02:00
cominch
90512640bf Added config switches for custom parser
Conflicts:
	source/net/yacy/document/TextParser.java
2012-06-10 12:49:36 +02:00
cominch
5d20cd324a Add Triplestore and RDF query interface
Conflicts:
	build.xml
	defaults/yacy.init
	source/net/yacy/interaction/AugmentHtmlStream.java
2012-06-10 10:35:59 +02:00
cominch
a32943b382 add json mimetype 2012-06-10 09:29:09 +02:00
Michael Peter Christen
41c02cb10e - less restrictions for usage of Table RAM copy
- new limit to use the table copy (instead of flag): 400MB available. If
less is available, then a copy is never used. If more is available, then
it can be used if there is a remaining space of at least 200MB
- flush caches more often: flush the Digest cache
2012-06-08 12:48:25 +02:00
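One possible reading of the new RAM copy rule, sketched in Python. The function and constant names are hypothetical, and treating exactly 400MB as sufficient is an assumption; the actual check in YaCy may differ.

```python
MIN_AVAILABLE_MB = 400  # below this, a RAM copy is never used
MIN_REMAINING_MB = 200  # must still be free after the copy is loaded


def may_use_ram_copy(available_mb, table_size_mb):
    """Decide whether a table may be copied into RAM (illustrative only)."""
    if available_mb < MIN_AVAILABLE_MB:
        return False
    return available_mb - table_size_mb >= MIN_REMAINING_MB
```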
Michael Peter Christen
8002fd2578 use less cache space since a large cache would cause more memory usage
in index files.
2012-06-06 14:17:42 +02:00
Michael Peter Christen
5aee19daa4 added show from cache in search results (not yet finished) 2012-06-04 23:44:26 +02:00
Michael Peter Christen
0d32a766ed relax verify attribute for search widget to make it faster:
set to "cacheonly"
2012-05-20 00:50:54 +02:00
Michael Peter Christen
7eece0256f moved yacy.logging to defaults according to request in
http://bugs.yacy.net/view.php?id=55
2012-05-17 04:26:03 +02:00
Michael Peter Christen
db9d81cb7a ups 2012-05-16 01:04:08 +02:00
Michael Peter Christen
e7e381d110 added configuration to switch off redirection following in crawler 2012-05-15 12:25:46 +02:00
Michael Peter Christen
2be327b5ab update location update 2012-04-19 11:49:43 +02:00
Michael Peter Christen
99c74699de removed scroogle (scroogle is dead) 2012-02-25 12:57:59 +01:00
Michael Peter Christen
8bee1472c9 there is no noindex, only nofollow in links 2012-01-31 23:46:35 +01:00
Michael Peter Christen
4c5edab1ec added option to have exception search result windows 2012-01-26 15:32:30 +01:00
Michael Peter Christen
696ee5fc16 removed pdf from default parser deny list 2012-01-23 17:27:58 +01:00
Lotus
c73af39e54 refactoring of tray icon class,
now uses Java 6 methods natively
2012-01-18 20:47:09 +01:00
Michael Peter Christen
987b412491 updated solr scheme: generic declaration of solr schemes 2012-01-13 11:25:15 +01:00
Michael Peter Christen
0bcef2d156 added feature as requested in
http://forum.yacy-websuche.de/viewtopic.php?f=18&t=3461
The search can now be configured with a non-display host list.
the search will always exclude the given list of hosts unless they are
requested directly using the host navigation
2011-12-13 00:16:05 +01:00
Michael Christen
17f962fceb translator updates:
- config string for chinese
- do not copy the language file to DATA/LOCALE any more (and do not use
them there, this is really confusing for new translators)
2011-12-08 10:25:26 +01:00
Michael Christen
c715d19c09 fixes for dependency on svn 2011-12-06 22:05:22 +01:00
Michael Christen
f62e6fb438 less frequent DHT distribution to reduce the load a bit on every peer 2011-12-05 15:45:33 +01:00
Michael Christen
9dbc93613e now that the whole world knows that we actually do p2p and not
metasearch we can support a default look-up to scroogle to gain more
attention to people who say that your search results are incomplete
2011-12-05 11:52:24 +01:00
orbiter
f9216e388c - faster ping to clean up old peers faster
- clean up more news

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8125 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-30 21:21:16 +00:00
orbiter
ac5bda205f - removed lower page navigation (it never looks nice)
- added visibility of metadata and parser in search results since that shows what YaCy can do in a nice way

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8091 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 13:30:42 +00:00
orbiter
c659310e89 - removed option to search for audio, video and applications. These things are still experimental and should not be shown to new users since this would cause them to argue that YaCy does not work. The functions are still available, because:
- added a configuration option in ConfigPortal to switch the search media types on or off

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8090 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 13:07:03 +00:00
orbiter
6cd27473f5 - better default values for caching and cache usage
- set new caching and verification behavior according to use case automatically

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8087 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 10:22:02 +00:00
orbiter
5866c73a09 fix for compare search: use scroogle instead of bing and get a default search if configured search engine is not available
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-23 15:17:46 +00:00
orbiter
e4a82ddd8b produce a bookmark entry from every crawl start. these bookmarks are always private.
these bookmarks will be used to get a source reference for the search in case of intranet or portal searches.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-21 23:10:29 +00:00
orbiter
f183d3822c added a default accept header in http requests since some http fraud detection functions check that this header field exists
see also: http://bad-behavior.ioerror.us/ in source file browser.inc.php

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8048 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:27:43 +00:00
orbiter
78ce3b13be typo
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8027 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-10 11:57:26 +00:00
suessthomas
887f088dad The IP address of the YaCy-Demo portal added to Whitelist.
This is only a temporary workaround.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8013 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-03 23:44:49 +00:00
orbiter
1b45e33f04 added robots tag parser to solr scheme
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7986 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 13:39:01 +00:00
orbiter
cf4fd525ee added directDocByURL attribute in crawl profile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 12:38:28 +00:00
orbiter
5ad7f9612b added crawl settings for three new filters for each crawl:
must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue)
must-not-match for IPs
must-match against a list of country codes (allows only loading from hosts that are hosted in given countries)

note: the settings and input environment is there with that commit, but the values are not yet evaluated

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-27 21:58:18 +00:00
orbiter
2c3161b4ac refactoring:
RankingProcess -> RWIProcess
ResultFetcher -> SnippetProcess


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7974 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-26 21:42:28 +00:00
orbiter
6b22865dbc - removed some warnings
- removed a dead update location

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-24 01:58:54 +00:00
orbiter
e48ce5d80e - style change for search box: larger font, selected by default
- style change for search results: by default no parser, size, image info

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7949 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-14 09:05:06 +00:00
sixcooler
ecb4986b38 refactored stuff from last commit to ReferenceContainer
see: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=3353&p=23163#p23163
the limiting of references is disabled per default
to enable this set yacy.conf - index.maxReferences to a value of e.g. 100000

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7935 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 18:55:16 +00:00
orbiter
49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), then all embedded image, audio and video links from all parsed documents are added to the search index as individual documents. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 10:08:57 +00:00
orbiter
9a8937f8b6 be more liberal when evaluating search results. This may cause that it is possible to fraud content on fresh peers, but that is better than looong waiting times for the evaluation of every link which causes that everybody rejects YaCy as 'too slow'. But this is only because of the high standards that YaCy sets to itself. If we are able to gain more users by lowering the standard, then that is useful. The option to set that flag to verify each link is still there.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7918 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-01 16:02:15 +00:00
orbiter
1c007188ad bugfixes in html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-31 16:02:06 +00:00
orbiter
5dd2efc9a2 - bugfixes in html parser
- new fields in solr
- extended file viewer to debug parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 15:52:25 +00:00
sixcooler
4fec99115b Implementation of strategies for controlling memory resources.
You can toggle between previous (standard) and new (generation) strategy at PerformanceMemory_p.html.
The generation memory strategy is implemented with the objective of running more robust
but with the cost of early stopping some tasks (eg. dht) while running low on memory.
This new strategy does respect the generational way a heap is organized on most used jvms.
These changes run fine on my 3 peers for weeks now, but as I'm human, I may fail.
Please be careful using the generation memory strategy and report errors by naming
OS, jvm and java_args.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7886 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 17:50:03 +00:00
orbiter
77a9af99f1 same values for Xmx and Xms: memory extension may be difficult if the OS has not the remaining memory available and may kill the jvm. If the memory is reserved at the start but never used the OS may handle that as well and leave non-used space in swap area (and never swap)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7867 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-11 21:54:27 +00:00
orbiter
768c59740c - replaced solrj 3.1 with solrj 3.3
- updated also slf4j
- added authentication for solrj


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7829 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-04 16:35:30 +00:00
orbiter
e7c7598923 docfix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7828 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-04 10:48:01 +00:00
orbiter
b84089ff04 fix for solr scheme list definition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7826 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 22:59:43 +00:00
orbiter
2d4bb139d3 - added counting of links with noindex tag for solr index
- bugfixes for solr index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7820 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 06:40:05 +00:00
lotus
fa6f2c2b44 use proxy accounts by default for more security
http://bugs.yacy.net/view.php?id=45

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-02 17:16:00 +00:00
orbiter
bda3eec0ff added parsing of canonical link element to html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-01 16:38:01 +00:00
orbiter
b6f09a475d - added an index profile editor in the /indexFederated_p.html servlet for solr indexes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-30 15:49:21 +00:00
orbiter
6deef60bc0 added keyword list for solr index attributes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7807 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 15:33:27 +00:00
f1ori
fdc84d8319 small pi link on index page to administration pages
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7804 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 09:32:00 +00:00
orbiter
84c9658644 added a file type navigator
added a protocol navigator

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7795 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-23 15:39:52 +00:00
suessthomas
66c477129e Creates a new network definition, yacy.networks.metager.unit.
The YaCy freeworld network is used in this network definition; minor enhancements for the feed of MetaGer were integrated.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7771 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-03 22:34:42 +00:00
f1ori
900dacbf97 * improve link rewriting in proxy-url
* only rewrites links, which are in current search domain

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-01 13:27:04 +00:00
orbiter
cc239b18cd fix for IPv6 localhost proxy client
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7744 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 16:24:11 +00:00
orbiter
10e2f588f8 - enhanced ybr ranking computation
- many speed/performance hacks
- added solr sharding and new sharding web interface
- added option to switch off the yacy index when using solr
- added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr
- refactoring/renaming of some method names to distinguish host/url hashes better
- a large number of bug/npe fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 10:57:02 +00:00
orbiter
3ed4a09368 small features, some bug fixes and performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-23 21:08:04 +00:00
orbiter
d8e934c085 better abstraction of http client identification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 13:35:29 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognized much more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as being local.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
orbiter
19fd13d3bc Added federated index storage to solr.
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.

The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.

federated solr storage is switched off by default.

To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check what's in solr after indexing, open http://localhost:8983/solr/admin/

So far it is not possible to use YaCy to search in that solr index.
This functionality is now available for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
   that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
   want to use solr instead of YaCy.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:05:04 +00:00
orbiter
b1a8d0c020 enhancements to web cache and less strict caching rules
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7620 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 10:35:26 +00:00
orbiter
ba03ca8620 added more configuration options for search:
- removed configuration button for 'search only for admin' from index.html and added this to ConfigPortal
- added configuration of link verification options (iffresh, cacheonly, nocache, ifexist) to ConfigPortal
- added configuration of navigation options to ConfigPortal
- added an option to switch off automatic index cleaning in case that a link verification method fails


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7613 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 07:50:34 +00:00
orbiter
bed79402be introduction of a new remote search load control: the remote search has taken 10 results per peer with a time-out of 3 seconds so far. The attributes of number of results per peer and time-out time can now be configured.
This has two aspects: the user who searches may want to increase these values to get more results and more load on the remote side and the user of the server which is accessed for this search may want to restrict the load. Both sides can now be configured. The server-site maximum load parameters are defined by a network definition and the client-side search request load can be defined by each user individually but when the remote search is done the requested service is limited to the network definition.

You can find now in the network definition file:
network.unit.remotesearch.maxcount and network.unit.remotesearch.maxtime
and in the yacy.conf file:
remotesearch.maxcount and remotesearch.maxtime

There is currently no web interface to define the client-side remote search attributes, please set them manually
    

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7548 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-04 13:44:00 +00:00
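The interplay of the two configuration levels can be sketched as a simple clamp: the client may ask for less than the network allows, but never gets more. This Python sketch is illustrative only; the function name is hypothetical and the exact evaluation inside YaCy is not shown in the commit message.

```python
def effective_remote_search_limits(client_maxcount, client_maxtime,
                                   network_maxcount, network_maxtime):
    """Limit the client-side request to what the network definition allows.

    client_*  correspond to remotesearch.maxcount / remotesearch.maxtime,
    network_* to network.unit.remotesearch.maxcount / .maxtime.
    """
    count = min(client_maxcount, network_maxcount)
    time = min(client_maxtime, network_maxtime)
    return count, time
```

For example, a client asking for 50 results per peer in a network that allows 10 would be served at most 10.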
f1ori
59dea3a284 * implement url proxy, a proxy via the url http://peer:port/proxy.html?url=http://domain.tld/path
* enable with proxyURL = true
* could be useful to browse specific pages with proxy or use own improvements in proxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7538 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-27 21:39:38 +00:00
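A small Python sketch that assembles a URL of the form given above. `proxy_url` is a hypothetical helper; percent-encoding of characters other than ':' and '/' in the target is an assumption made here so that query strings in the target do not mix with the proxy's own parameters, and the port in the usage line is only an example.

```python
from urllib.parse import quote


def proxy_url(peer, port, target):
    """Build http://peer:port/proxy.html?url=<target> (illustrative only)."""
    # Keep ':' and '/' readable, encode everything else that is unsafe.
    encoded = quote(target, safe=":/")
    return "http://%s:%d/proxy.html?url=%s" % (peer, port, encoded)


example = proxy_url("localhost", 8090, "http://domain.tld/path")
```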
orbiter
e3ef4e3021 - increased default peer ping time from 2 minutes to 1 minute
- filtering out too old peers when reading seed lists (limit is now 240 minutes)
- added concurrent host names resolving in front of the http client because the http client uses the java built-in DNS resolve which is not multithreading-safe (i have seen deadlocks in thread dumps showing that this bug in jdk is still there)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7515 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-23 09:42:01 +00:00
orbiter
d28f8040e0 removed unnecessary recording function that caused also a performance problem after serving too many files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7512 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-22 13:33:28 +00:00
orbiter
addbd5b482 moved the main update url - because of the many languages we support now on yacy.net
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7487 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-17 01:22:22 +00:00
orbiter
6c52e31993 new methods to open a browser
- if YaCy is started with the option -gui, it is not in headless mode. Then the java 1.6 browse method is used if all other methods fail
- in linux, the path /etc/alternatives/www-browser is used if no firefox is installed

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7480 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-14 16:15:14 +00:00
orbiter
5892fff51f introduction of dht-burst modes: this can expand the number of target peers in some cases where a better heuristic is needed. The problematic cases are either when a multi-word search is made (still a hard case for our term-oriented DHT) or when a network operator wants all robinson peers to be asked. We therefore introduced two new network steering values that switch on more peers during the peer selection. Because the number of peers can now be very large, the maximum number of httpc connections was also increased.
Please see the new comments in yacy.network.freeworld.unit for details of the new DHT selection methods.
The maximum number of peers is no longer fixed to a specific number but may increase with
- the partition exponent
- the number of redundant peers
- the robinson burst percentage
- the multiword burst percentage
The maximum can then be the number of senior peers (all visible peers).

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7479 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-13 17:37:28 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to make the LibraryProvider cleaner
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00
low012
64f32e8f00 *) replaced all IPs in IP filters for proxy with the proper regular expression
*) some cleanup

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7477 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-11 23:37:13 +00:00
orbiter
fe93caac5a added flags and administration options to show advanced search and to show search result attributes (for each search result)
Administration can be done at ConfigPortal.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7466 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-02 15:54:13 +00:00
orbiter
88773e4daa changed the default port from 8080 to 8090
see also: http://forum.yacy-websuche.de/viewtopic.php?p=21683#p21683

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7454 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 10:54:13 +00:00
orbiter
6c35b68f17 - removed 'peerName' property from the yacy settings file because this information is stored in the yacy seed file
- the own seed file gets the lead for storage of the peer name
- exchanged default peer name generation method with one that does not use the local ip
- default peer names are now strings starting with '_anon'
- added another switch to suppress forwarding to ConfigBasic if the name was already changed
- replaced all usages of the yacy.conf peerName with access to the local seed
- changes to the peer name are now applied directly and not after the next peer ping


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7453 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 10:12:17 +00:00
orbiter
786166041a - added recording of all accessed and submitted servlets
- this recording is then used to redirect from the Status.html page to BasicConfig in case that servlet was never submitted
- this acts as an addition to the new default pop-up page 'index.html' which offers an administration link to Status.html. For a first-time user this then redirects directly to the former start page BasicConfig.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7451 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-27 11:17:11 +00:00
orbiter
3fe03f153d - search page becomes default start page (new users are not forced to do configuration since this is not necessary)
- adjusted top menu on search page (shows less stuff and now also the network graphics)
- adjusted the network page (looks better when no other navigation is shown on top)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7448 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-26 14:58:28 +00:00
orbiter
59d9fe1bd7 added more php mime types
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7443 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:52:36 +00:00
orbiter
3ae8f40fc8 removed yacy.network.group - this feature was never used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7442 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:50:36 +00:00
orbiter
efb4ca8fa8 modified auto-delete of search failure-words:
- words are now not deleted from the search index automatically if index receive is switched off
- a flag in the network definition defines if this feature is switched on at all
- the search filter for not-found word references is switched off for server-side remote searches

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7441 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:46:00 +00:00
f1ori
4e29e9712a * create cleanupjob for cached failed urls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7437 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-17 15:04:00 +00:00
lotus
b1484299b2 same units for memory observer configuration (MiB)
old setting for DHT (RAM) will be lost after update
can be set on /Performance_p.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7418 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-02 20:38:01 +00:00
low012
11ea966f9e *) added SID file (Commodore 64) sound file parser
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7403 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-28 12:06:04 +00:00
low012
936e976c23 *) added FreeMind (http://freemind.sourceforge.net/) mindmap parser
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7397 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-27 20:13:31 +00:00
orbiter
4565b2f2c0 removed the display option from index.html, yacysearch.html and yacyinteractive.html
Instead, a setting at ConfigPortal.html defines whether the top menu is shown on these pages or whether there is no navigation at all.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7366 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-08 10:50:23 +00:00
orbiter
fc2e41e691 added a forwarder for the default page. The forwarder forwards a browser to a different page if the root file index.html is accessed. This can be done by entering the name of the forwarder page in the field
"Default index.html Page (by forwarder)" in /ConfigPortal.html
The purpose is to forward to /yacyinteractive.html for the 27C3 FTP search platform

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7365 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-07 15:46:04 +00:00
orbiter
cc6499bf8d - added http://blekko.com as search heuristic (like scroogle). This was easy since they also deliver their search results as an rss feed
- renamed YaCy's search result modifier keywords RECENT, NEAR and language: to the blekko slashtag naming scheme. YaCy now supports the following blekko-like built-in slashtags:
/date
 - for search results ordered by date (most recent first)
/near
 - for search results where search words appear near to each other (closest first)
/language/<lang>
 - for sorting by language, where the wanted language is ranked first. Example: /language/de
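
A minimal sketch of how the three slashtags listed above could be separated from the plain search words. The function `parse_slashtags` and its return shape are hypothetical and are not taken from YaCy's code; splitting the query on whitespace is an assumption.

```python
def parse_slashtags(query):
    """Split a query into plain search words and slashtag modifiers.

    Assumption: slashtags are whitespace-separated tokens; any token
    that is not a recognized slashtag is treated as a search word.
    """
    words = []
    modifiers = {"date": False, "near": False, "language": None}
    prefix = "/language/"
    for token in query.split():
        if token == "/date":
            modifiers["date"] = True          # order by date
        elif token == "/near":
            modifiers["near"] = True          # prefer words close together
        elif token.startswith(prefix) and len(token) > len(prefix):
            modifiers["language"] = token[len(prefix):]  # e.g. "de"
        else:
            words.append(token)
    return words, modifiers


print(parse_slashtags("yacy p2p /date /language/de"))
```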
  

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7350 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-29 18:08:20 +00:00
orbiter
a9f754c45f removed unused CR accumulation and distribution process
This was never used or extended in recent years. The resulting YBR ranking criterion
is still a good idea and will be used in the future. Possible generation methods for YBR
ranking are:
- "trust-rank" using the link structure as can be discovered in a single crawl (idea from FSCONS)
- "block-rank" calculated from the local link structure
- a distributed "block-rank" using the xml API to the link structure from other peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7349 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-29 11:07:42 +00:00
f1ori
442bebca2b * %0 does not belong to the IPv6-Address -> entry does not work on some systems
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7310 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-06 15:09:28 +00:00
f1ori
6ac4f8142e * allow proxy requests from localhost via ipv6
(%0 does not belong to the address)
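
A minimal sketch of the idea behind this fix, assuming the client address arrives as a string that may carry a zone suffix such as `%0`. The helper names `strip_zone` and `is_local_request` are hypothetical; the actual change may instead adjust the configured filter entries.

```python
import ipaddress


def strip_zone(address):
    """Drop an IPv6 zone suffix ("%0", "%eth0", ...) if one is present."""
    return address.split("%", 1)[0]


def is_local_request(address):
    """Return True if the address, without its zone suffix, is a loopback address."""
    return ipaddress.ip_address(strip_zone(address)).is_loopback


print(is_local_request("0:0:0:0:0:0:0:1%0"))
```

Comparing the address only after removing the suffix means the same check applies whether or not the system appends a zone to the loopback address.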

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7303 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-04 10:52:54 +00:00