538 Commits

Author SHA1 Message Date
Michael Peter Christen
8651ec35fe turned author_s into the multi-valued field author_sxt 2013-01-24 18:24:31 +01:00
Michael Peter Christen
4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy has now an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore the commit strategy when doing a search from
the web interface was changed (it's done every time before a search is
done).

The softcommit feature was implemented because it was needed for the
following changes (customer demands), which are also included in this
git commit:

- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.

To support the new softcommit strategy, the commitWithinMs option was
set to -1 to disable automatic commit based on document insert times. If
documents are inserted permanently then also a commit would happen
permanently whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
2013-01-23 14:40:58 +01:00
Michael Peter Christen
db024a4e19 added new solr fields (unused yet; implementation will follow) 2013-01-21 18:02:29 +01:00
Michael Peter Christen
9b5bdae1b4 Reverted setting of MMapDirectoryFactory from solrconfig; see
http://forum.yacy-websuche.de/viewtopic.php?p=27509#p27509
Instead, the start script checks whether the host is a 64-bit host and,
if so, sets -Dsolr.directoryFactory=solr.MMapDirectoryFactory as a java option

Reverted the ramBufferSizeMB setting (this was not enabled anyway)
because that may be too much memory for small peers and embedded
systems.

Activated the mergeFactor 4; this was commented out by mistake
2013-01-21 17:55:28 +01:00
orbiter
eb68a30947 solr performance settings
the target of these performance settings is the reduction of IO in
general and during search in particular.
- reduced mergeFactor to 4. This will increase the IO during indexing,
but will reduce IO during search. It will also greatly reduce the number
of open files which should make it possible to have overall larger
indexes until the number of open files in an OS is reached.
- increased ramBufferSizeMB to 256mb. This will reduce the number of
commits. This change may compensate the reduction of the mergeFactor.
- disabled updateLog. This is a real-time search feature which is
available in YaCy anyway because a commit is forced if index.html is
called. The updateLog feature causes a lot of IO during indexing and
search and produced a lot of files in SEGMENTS/solr_40/data/tlog
2013-01-19 11:21:33 +01:00
Michael Peter Christen
f53703df62 using MMapDirectoryFactory as solution for ClosedChannelException given
in https://issues.apache.org/jira/browse/SOLR-2247
2013-01-16 14:35:37 +01:00
Michael Peter Christen
22c694f906 activated the clickdepth_i attribute for solr again because the
calculation of that value is not as extensive as expected and
furthermore the value is very useful for ranking
2013-01-05 01:00:18 +01:00
Michael Peter Christen
5a0eb1b268 clickpath should not be active by default because it needs extensive
computation - partly to be implemented
2013-01-03 01:30:05 +01:00
Michael Peter Christen
5c0c56cfe1 Preparations to produce a click depth attribute in the search index.
This attribute can be used for ranking and for other purpose (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; e.g. calculating the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
2013-01-02 20:55:43 +01:00
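The second pass described above (finding the shortest click depth after a crawl) can be sketched as a breadth-first search from the root page. This is only an illustration, not the YaCy implementation: the `links` dictionary is a hypothetical in-memory stand-in for the link index, and the function name is made up.

```python
from collections import deque

def shortest_click_depths(links, root):
    """Breadth-first search from the root page over outgoing links.

    links: dict mapping a URL to the list of URLs it links to (hypothetical
    stand-in for the link index). Returns {url: clickdepth}; pages that
    cannot be reached from root are absent from the result.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:  # the first visit is the shortest path
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Made-up example site
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": ["/d"],
}
print(shortest_click_depths(site, "/"))
```

A single backtrack at indexing time can follow a longer path (for example / -> /a -> /c -> /d), which is why a separate pass over the complete link structure is needed to get the minimum.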
Michael Peter Christen
295884fd54 - Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8'
- fixed conflict in htroot/yacysearch.java
- removed nedres check because that causes that the remote server is not
called at all in most cases (local index has already results but we want
more)
- fixed a regex bug (a '=' too much)
2013-01-02 15:08:07 +01:00
reger
168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
- any system supporting opensearch specification can be configured
- search query is only forwarded to remote system if not enough results available on local peer
- discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config
- sample config file with some general search engines with opensearch support
2012-12-29 08:24:48 +01:00
reger
7761b60325 fix: Broken Link on Crawler_p.html - issue 218
http://bugs.yacy.net/view.php?id=218
- reduced Solr logging (/select)
2012-12-29 04:53:20 +01:00
reger
e9e0d63897 Add config option to show HostBrowser link in search result
- ConfigPortal: added checkbox Host Browser
- yacy.init: added search.result.show.hostbrowser as default = on (true)
- fix HostBrowser: broken link to protected WebStructurePicture for public user
2012-12-27 10:01:10 +01:00
Michael Peter Christen
98819ec3d9 use solr boost configuration to select search fields. At this time it is
possible to enter a negative boost value to switch that value off. This
might be different in the future with a better input interface.
2012-12-27 03:17:45 +01:00
Michael Peter Christen
01200f06cc using the author field as solr-native facet. this makes it necessary to
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
2012-12-19 01:56:33 +01:00
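Why the string copy of the author field matters for facets can be shown with a small sketch. It only imitates the counting, under the assumption that a tokenized text field yields one facet entry per word while a string field yields one entry per full value; the data is taken from author names in this log.

```python
from collections import Counter

authors = ["Michael Peter Christen", "Michael Peter Christen", "Michael Christen"]

# Faceting on a tokenized text field: each document counts once per word.
word_facet = Counter(w for a in authors for w in set(a.lower().split()))

# Faceting on a string copy of the field: each document counts once
# for the complete author string.
string_facet = Counter(authors)

print(word_facet)
print(string_facet)
```

With the tokenized field the facet lists "michael", "peter" and "christen" separately; with the string copy it lists the two distinct author names.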
Michael Peter Christen
eac9650b31 added another solr field clickdepth_i which reflects the number of
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all other with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
2012-12-18 17:20:42 +01:00
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more important than others. But
this is not true for typical link pages like menus. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. this field can contain a formula which computes the
boost according to given field values. After some experiments the
following formula is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that formula.
2012-12-18 14:42:35 +01:00
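To get a feeling for the default boost function, the nested div/add/pow expression can be evaluated by hand. The sketch below assumes the usual Solr `bf` reading in which the trailing ^0.4 is a weight applied to the function value, not an exponent; the function name and the sample numbers are made up.

```python
import math

BF_WEIGHT = 0.4  # assumed meaning of the trailing ^0.4 in the bf parameter

def boost_function(references, inboundlinks):
    # div(add(1, references_i), pow(add(1, inboundlinkscount_i), 1.6))
    return (1 + references) / math.pow(1 + inboundlinks, 1.6)

# A page with many references and few links of its own scores high,
# a menu-like page with many links is damped by the 1.6 exponent.
for refs, links in [(0, 0), (9, 0), (9, 9), (99, 99)]:
    print(refs, links, round(boost_function(refs, links), 4))
```

Because the exponent in the denominator is larger than 1, a page needs disproportionately more references than links to keep the same value.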
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
ea033f8f8e added number of characters in url to default index to be able to use
this field for ranking
2012-12-02 16:53:02 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the signature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adapted and
normalized. This caused that many files had to be changed.
2012-11-21 18:46:49 +01:00
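The signature-and-unique-flag idea can be sketched in a few lines. This is a deliberately simplified stand-in and not the TextProfileSignature algorithm: the signature here is just an MD5 over the sorted set of lower-cased tokens, and the `seen` set takes the place of the cached 'exists' lookups against the index. All names are hypothetical.

```python
import hashlib
import re

def text_signature(text, min_token_len=2):
    # Simplified stand-in: hash the sorted set of lower-cased tokens.
    tokens = sorted({t for t in re.findall(r"\w+", text.lower())
                     if len(t) >= min_token_len})
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()

def mark_unique(documents):
    seen = set()  # stand-in for 'exists' checks against the index
    flags = []
    for text in documents:
        sig = text_signature(text)
        flags.append(sig not in seen)  # unique if the signature is new
        seen.add(sig)
    return flags

docs = ["Hello YaCy world", "hello, yacy WORLD!", "Something else"]
flags = mark_unique(docs)

# Stable sort: unique documents first, duplicates moved to the end.
ordered = [d for d, unique in sorted(zip(docs, flags), key=lambda p: not p[1])]
print(flags)
print(ordered)
```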
reger
328ce0b297 fix: remove fixed individual testing IP (85.25.151.30 = server4you.de) from default/yacy.network.freeworld.unit 2012-11-11 21:19:18 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
sixcooler
2d972f289a raise commitWithinMs to default-value from SwitchBoard
(result in lower hd-io)

no dots in memory-graph (there are too many of them)
2012-10-26 02:12:45 +02:00
Michael Peter Christen
1baf498d59 - show more lines in online log
- reverse order is default now
2012-10-25 18:38:39 +02:00
sixcooler
206e7bcf94 whitelist yacyportalsearch aka search.yacy.net 2012-10-23 03:49:27 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
42e525ca9a enhanced the host browser 2012-10-08 14:00:14 +02:00
sof
5cb244b79b Merge remote branch 'origin/master' 2012-10-05 18:54:39 +02:00
apfelmaennchen
88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
on the jaudiotagger library. The parser is disabled by default as it
needs to store temporary files for non file:// protocols, which might be
disliked. For your local MP3-collection it loads nicely Artist,
Title, Album etc. from the audio files meta data.
2012-10-05 18:54:26 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
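The file format described above is simple enough to sketch a reader for it. The following is an illustration only: function names are hypothetical, and lower-casing the words and skipping blank lines are assumptions that the commit message does not state.

```python
def parse_synonym_lines(lines):
    """Parse synonym lines into a word -> set-of-synonyms mapping."""
    groups = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # blank line or comment
            continue
        if line.startswith("{") and line.endswith("}"):
            line = line[1:-1]                 # optional {} enclosure
        words = [w.strip().lower() for w in line.split(",") if w.strip()]
        for word in words:
            groups.setdefault(word, set()).update(w for w in words if w != word)
    return groups

def synonyms_for_document(words, groups):
    # Collect the synonyms of every word found in a document; the result
    # is what would be written into the synonyms field.
    found = set()
    for word in words:
        found |= groups.get(word.lower(), set())
    return sorted(found)

sample = ["# grammatical alternatives", "go, goes, went", "{car, cars}"]
groups = parse_synonym_lines(sample)
print(synonyms_for_document(["He", "went", "by", "car"], groups))
```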
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
f45f7fc12e added new Host Browser to main menu:
this new search interface is something completely new for search, but
completely common on desktops: browse a web space like one would browse
a file system in a file browser. The file listing is created using the
search index and a faceted restriction to specific domains.
2012-09-28 22:45:16 +02:00
Michael Peter Christen
80edd8ecd7 some more after-refactoring fixes 2012-09-28 10:24:57 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create a document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is, that an index dump can cause data loss if an indexing is running
during index dump
- caught more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
Michael Peter Christen
0504b01bdc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-14 00:48:17 +02:00
orbiter
9413f77b65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-13 23:54:26 +02:00
orbiter
a55e77a115 added twitter search heuristic 2012-09-13 23:53:53 +02:00
Michael Peter Christen
62add1d564 added the protocol and the file name extension to the solr fields since
these fields are probably facets in file search
2012-09-11 22:46:39 +02:00
Michael Peter Christen
9db032664e activate two solr fields which will be used by administration interface
(later)
2012-09-11 20:15:54 +02:00
Michael Peter Christen
10b911eed4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-07 22:07:02 +02:00
Michael Peter Christen
be67c70a47 added Solr fields:
inboundlinks_text_chars_val
inboundlinks_text_words_val
inboundlinks_alttag_txt
outboundlinks_text_chars_val
outboundlinks_text_words_val
outboundlinks_alttag_txt
2012-09-07 22:06:51 +02:00
orbiter
d73fff0e0e added solr field images_withalt_i 2012-09-07 21:33:45 +02:00
Michael Peter Christen
ee23fc7a32 added h1..h6 counter fields 2012-09-04 14:11:11 +02:00
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated, all of
them are assigned to the documents when they are indexed as result of
such a crawl start
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and also in the gsa
interface using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field all the time)
2012-09-03 15:26:08 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
Michael Peter Christen
2ddc33646a added new field for solr:
url_paths_sxt
url_parameter_i
url_parameter_key_sxt
url_parameter_value_sxt
url_chars_i
2012-08-29 16:11:23 +02:00
Michael Peter Christen
75d5e3475d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-29 10:13:51 +02:00
cominch
dc468dad01 add content control features for custom filter lists 2012-08-29 09:04:28 +02:00
Michael Peter Christen
316b5fe116 - added a solr type definition verifier
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
2012-08-28 16:58:06 +02:00
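A host string parser of the kind mentioned above could look roughly like this. It is a guess at the intent, not the parser from the Domains class: the three-entry `CC_SLDS` set is a tiny illustrative list, and the returned triple (subdomain, domain name, suffix) is an assumed output shape.

```python
CC_SLDS = {"co.uk", "com.au", "co.jp"}  # illustrative only, not YaCy's list

def split_host(host):
    """Split a host name into (subdomain, domain name, suffix)."""
    labels = host.lower().strip(".").split(".")
    # The suffix spans two labels if they form a known country-code
    # second-level domain, otherwise only the top-level domain.
    two = ".".join(labels[-2:])
    suffix_len = 2 if len(labels) > 2 and two in CC_SLDS else 1
    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]
    name = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return subdomain, name, suffix

print(split_host("www.example.co.uk"))
print(split_host("search.yacy.net"))
```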
Michael Peter Christen
4c79ddb91e switched off some solr logging 2012-08-27 14:41:47 +02:00
Michael Peter Christen
e8acd542b5 - added faceted drill-down for host and geolocation to solr queries
- added a new geolocation field to index schema, the old values are
migrated if possible
2012-08-27 14:41:33 +02:00
Michael Peter Christen
af764c106c re-activated audio and video search because they obviously work (!) 2012-08-22 01:56:13 +02:00
orbiter
716ea0cfe2 sorted the solr schema into mandatory and optional fields; reduced
number of used field to reduce solr index size
2012-08-21 23:52:56 +02:00
orbiter
db6863db77 reduced solr cache sizes to check if that solves memory problems a bit 2012-08-18 13:45:37 +02:00
Michael Peter Christen
23226676c6 FOR THE BRAVE.. this is a forced migration to solr which is now ready
for production as a replacement of the metadata-db.
This intermediate release 1.041 will switch on the previously optional
solr index and the old metadata-db will still work as it did before.
Solr+metadata are accessed in mixed mode, no migration is done yet.
If this does not cause a catastrophe by the end of the weekend, we will
do a YaCy 1.1 main release containing this as default.
2012-08-16 18:17:47 +02:00
Michael Peter Christen
a1b2c9a67d doctype2mime fix, influences metadata conversion between old metadata
and solr
2012-08-16 17:49:35 +02:00
Michael Peter Christen
703f427303 fixed some peer-ping connection details
- larger time-out
- removed too old seedlist
- fixed a bug in connection test
2012-08-16 17:11:54 +02:00
Michael Peter Christen
ea49a8aa8c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-14 12:40:44 +02:00
Michael Peter Christen
aab0b680c3 - added xslt support for solr result formats.
try i.e.
http://localhost:8090/solr/select?q=*:*&start=0&rows=10&wt=xslt&tr=json.xsl
- added servlet-side mime-type configuration for streamed servlets. this
is used for the result formatters in solr result formats
2012-08-14 11:12:50 +02:00
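The example URL above can be assembled programmatically. The sketch below only builds the query string with the Python standard library; the base address is the one from the commit message, and the function name and defaults are hypothetical.

```python
from urllib.parse import urlencode

def solr_select_url(query="*:*", start=0, rows=10, stylesheet="json.xsl",
                    base="http://localhost:8090/solr/select"):
    # wt=xslt selects the XSLT response writer, tr names the stylesheet.
    params = [("q", query), ("start", start), ("rows", rows),
              ("wt", "xslt"), ("tr", stylesheet)]
    # Keep '*' and ':' unescaped so the URL reads like the one in the log.
    return base + "?" + urlencode(params, safe="*:")

print(solr_select_url())
```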
cominch
e2119f4e76 augmented browsing: replace htmlparser by jsoup, which is more stable
and reliable
2012-08-14 10:06:12 +02:00
Michael Peter Christen
b51df6c7e8 - added coordinate storage in solr schema
- fixed shutdown process
- fixed some solr-to-metadata reading
- added a large number of metadata attributes in ViewFile.html
2012-08-13 10:40:04 +02:00
Michael Peter Christen
f9c0e6e950 - Implemented and integrated the URIMetadataNode object which is a
metadata representation from the solr index. This shall replace metadata
from the built-in database in the future.
- added the Solr-driven metadata into the search index of YaCy which
makes it now possible to run YaCy without the old metadata index. This
is a major step forward to a full migration to Solr.
2012-08-10 13:26:51 +02:00
Michael Peter Christen
bca4a16603 replaced the multivalue generic string field name suffix _ss by _txt
because _ss is not part of the standard solr example schema.
2012-08-06 17:58:09 +02:00
orbiter
67edfd991c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-05 15:49:48 +02:00
orbiter
d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
writings to the Metadata-DB are now also done to solr. This includes
metadata transfer during search and rwi transfer.

The new/added solr fields are:

## time when resource was loaded
load_date_dt

## date until resource shall be considered as fresh
fresh_date_dt

## id of the host, a 6-byte hash that is part of the document id
host_id_s

## ids of referrer to this document
referrer_id_ss

## the md5 of the raw source
md5_s

## the name of the publisher of the document
publisher_t

## the language used in the document; starts with primary language
language_ss

## an external ranking value
ranking_i

## the size of the raw source
size_i

## number of links to audio resources
audiolinkscount_i

## number of links to video resources
videolinkscount_i

## number of links to application resources
applinkscount_i
2012-08-05 15:49:27 +02:00
Michael Peter Christen
3ce04cecf3 bad hack to prevent a bug appearing in solr 2012-07-31 23:49:07 +02:00
Michael Peter Christen
826967513b changed options in IndexFederated_p to switch on/off parts of the index
individually. The settings are experimental and the values of the
settings will be overwritten when an index migration from urldb to solr
starts.
2012-07-23 16:28:39 +02:00
Michael Peter Christen
1517a3b7b9 added webm mime-type 2012-07-08 17:59:20 +02:00
Michael Peter Christen
0301aba1e9 removed unused method parameters 2012-07-05 10:23:07 +02:00
Michael Peter Christen
4de50fe808 adding more principal peers for bootstrapping 2012-07-05 00:43:41 +02:00
reger
067728bccc add search result heuristic. adding a crawl job with depth-1 for every displayed search result (crawling every external linked page of displayed search result pages) 2012-07-01 00:12:20 +02:00
Michael Peter Christen
508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
the meta-refresh html header field.
2012-06-28 13:27:45 +02:00
Michael Peter Christen
9116013c64 - allow lazy initialization of solr value (if using 'lazy', then no
0-values and no empty strings are written). This may save a lot of
memory (in ram and on disc) if excessive 0-values or empty strings
appear)
- do not allow default boolean values for checkboxes because that does
not make sense: browsers may omit the checkbox attribute name if the box
is not checked. A default value 'true' would not comply with the
semantic of the browsers response.
- add a checkbox in IndexFederated_p for the lazy initialization of solr
fields.
2012-06-27 12:17:58 +02:00
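The 'lazy' behaviour from the first item (no 0-values and no empty strings are written) amounts to filtering a document before it is stored. A minimal sketch, assuming a document is a plain dictionary of field values; the function name and the `flag_b` field are hypothetical, the other field names appear elsewhere in this log.

```python
def lazy_fields(doc):
    """Return a copy of doc without 0-values and empty strings."""
    def is_empty(value):
        if isinstance(value, bool):   # False is a real value, keep it
            return False
        if isinstance(value, (int, float)):
            return value == 0
        return value == ""
    return {name: value for name, value in doc.items() if not is_empty(value)}

doc = {"md5_s": "", "size_i": 0, "audiolinkscount_i": 3,
       "publisher_t": "example", "flag_b": False}
print(lazy_fields(doc))
```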
Michael Peter Christen
c03d306afa shorter autocommit time (now: 1 second) to prevent that user cannot see
results in solr the first time they try it out. The value can now be
easily set to a higher number using the IndexFederated_p interface.
2012-06-26 14:53:45 +02:00
Michael Peter Christen
3fd4a01286 added option to record urls that are forwarded to the solr index 2012-06-26 13:54:48 +02:00
Michael Peter Christen
8dd469b9dd added option to configure the autocommit delay time of solr on-the-fly 2012-06-25 14:59:46 +02:00
Michael Peter Christen
b9dfca4b0a - fixed IndexFederated Servlet / a embedded Solr can now be selected
- added code stub for an embedded Solr but generation of Solr store is
still commented out (it works but is not yet ready for usage)
2012-06-25 11:34:38 +02:00
Michael Peter Christen
1be0025a9c - added test for EmbeddedSolrConnector
- added needed libraries for this test
this includes most (all) files needed for an embedded solr
2012-06-22 00:36:49 +02:00
Michael Peter Christen
dbdd697f4d moved RDFaParser.xsl configuration file to defaults 2012-06-21 16:09:12 +02:00
Michael Peter Christen
8738336408 set Xms lower than Xmx 2012-06-19 08:45:49 +02:00
Michael Peter Christen
96f6a5869f more robust OAI-PMH client (large time-out, three re-tries). OAI-PMH
servers appear to be very slow sometimes
2012-06-16 22:30:31 +02:00
Michael Peter Christen
6d17686258 made triplestore persistent by default
added a size display in triplestore servlet
2012-06-15 19:13:07 +02:00
cominch
3c255c025b Show tags in search results (if activated in ConfigPortal_p.html) 2012-06-15 10:43:05 +02:00
Michael Peter Christen
a5cdfb91de - fixed Cache link (below snippet)
- added 'Augmented Proxy' link below snippet
- added configuration options for augmented proxy
2012-06-14 19:55:34 +02:00
Roland 'Quix0r' Haeder
af5a597e47 Scroogle is not coming back, remove dead code
Conflicts:
	source/net/yacy/search/Switchboard.java
2012-06-10 23:38:41 +02:00
cominch
90512640bf Added config switches for custom parser
Conflicts:
	source/net/yacy/document/TextParser.java
2012-06-10 12:49:36 +02:00
cominch
5d20cd324a Add Triplestore and RDF query interface
Conflicts:
	build.xml
	defaults/yacy.init
	source/net/yacy/interaction/AugmentHtmlStream.java
2012-06-10 10:35:59 +02:00
cominch
a32943b382 add json mimetype 2012-06-10 09:29:09 +02:00
Michael Peter Christen
41c02cb10e - less restrictions for usage of Table RAM copy
- new limit to use the table copy (instead of flag): 400MB available. If
less is available, then a copy is never used. If more is available, then
it can be used if there is a remaining space of at least 200MB
- flush caches more often: flush the Digest cache
2012-06-08 12:48:25 +02:00
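One possible reading of the new RAM copy rule is sketched below. It assumes that "remaining space" means the memory left over after the table copy has been loaded; the commit message does not spell this out, and the function and constant names are made up.

```python
MB = 1024 * 1024
MIN_AVAILABLE = 400 * MB   # below this a RAM copy is never used
MIN_REMAINING = 200 * MB   # must stay free after the copy (assumed reading)

def may_use_ram_copy(available_bytes, table_bytes):
    if available_bytes < MIN_AVAILABLE:
        return False
    return available_bytes - table_bytes >= MIN_REMAINING

print(may_use_ram_copy(500 * MB, 250 * MB))
print(may_use_ram_copy(500 * MB, 350 * MB))
```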
Michael Peter Christen
8002fd2578 use less cache space since a large cache would cause more memory usage
in index files.
2012-06-06 14:17:42 +02:00
Michael Peter Christen
5aee19daa4 added show from cache in search results (not yet finished) 2012-06-04 23:44:26 +02:00
Michael Peter Christen
0d32a766ed relax verify attribute for search widget to make it faster:
set to "cacheonly"
2012-05-20 00:50:54 +02:00
Michael Peter Christen
7eece0256f moved yacy.logging to defaults according to request in
http://bugs.yacy.net/view.php?id=55
2012-05-17 04:26:03 +02:00
Michael Peter Christen
db9d81cb7a ups 2012-05-16 01:04:08 +02:00
Michael Peter Christen
e7e381d110 added configuration to switch off redirection following in crawler 2012-05-15 12:25:46 +02:00
Michael Peter Christen
2be327b5ab update location update 2012-04-19 11:49:43 +02:00
Michael Peter Christen
99c74699de removed scroogle (scroogle is dead) 2012-02-25 12:57:59 +01:00
Michael Peter Christen
8bee1472c9 there is no noindex, only nofollow in links 2012-01-31 23:46:35 +01:00
Michael Peter Christen
4c5edab1ec added option to have exception search result windows 2012-01-26 15:32:30 +01:00
Michael Peter Christen
696ee5fc16 removed pdf from default parser deny list 2012-01-23 17:27:58 +01:00
Lotus
c73af39e54 refactoring of tray icon class,
now uses Java 6 methods natively
2012-01-18 20:47:09 +01:00
Michael Peter Christen
987b412491 updated solr scheme: generic declaration of solr schemes 2012-01-13 11:25:15 +01:00
Michael Peter Christen
0bcef2d156 added feature as requested in
http://forum.yacy-websuche.de/viewtopic.php?f=18&t=3461
The search can now be configured with a non-display host list.
the search will always exclude the given list of hosts unless they are
requested directly using the host navigation
2011-12-13 00:16:05 +01:00
Michael Christen
17f962fceb translator updates:
- config string for chinese
- do not copy the language file to DATA/LOCALE any more (and do not use
them there, this is really confusing for new translators)
2011-12-08 10:25:26 +01:00
Michael Christen
c715d19c09 fixes for dependency on svn 2011-12-06 22:05:22 +01:00
Michael Christen
f62e6fb438 less frequent DHT distribution to reduce the load a bit on every peer 2011-12-05 15:45:33 +01:00
Michael Christen
9dbc93613e now that the whole world knows that we actually do p2p and not
metasearch we can support a default look-up to scroogle to gain more
attention to people who say that your search results are incomplete
2011-12-05 11:52:24 +01:00
orbiter
f9216e388c - faster ping to clean up old peers faster
- clean up more news

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8125 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-30 21:21:16 +00:00
orbiter
ac5bda205f - removed lower page navigation (it never looks nice)
- added visibility of metadata and parser in search results since that shows what YaCy can do in a nice way

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8091 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 13:30:42 +00:00
orbiter
c659310e89 - removed option to search for audio, video and applications. These things are still experimental and should not be shown to new users since this would cause them to argue that YaCy does not work. The functions are still available, because:
- added a configuration option in ConfigPortal to switch the search media types on or off

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8090 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 13:07:03 +00:00
orbiter
6cd27473f5 - better default values for caching and cache usage
- set new caching and verification behavior according to use case automatically

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8087 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 10:22:02 +00:00
orbiter
5866c73a09 fix for compare search: use scroogle instead of bing and get a default search if configured search engine is not available
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-23 15:17:46 +00:00
orbiter
e4a82ddd8b produce a bookmark entry from every crawl start. these bookmarks are always private.
these bookmarks will be used to get a source reference for the search in case of intranet or portal searches.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-21 23:10:29 +00:00
orbiter
f183d3822c added a default accept header in http requests since some http fraud detection functions check that this header field exists
see also: http://bad-behavior.ioerror.us/ in source file browser.inc.php

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8048 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:27:43 +00:00
orbiter
78ce3b13be typo
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8027 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-10 11:57:26 +00:00
suessthomas
887f088dad The IP address of the YaCy-Demo portal added to Whitelist.
This is only a temporary workaround.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8013 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-03 23:44:49 +00:00
orbiter
1b45e33f04 added robots tag parser to solr scheme
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7986 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 13:39:01 +00:00
orbiter
cf4fd525ee added directDocByURL attribute in crawl profile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 12:38:28 +00:00
orbiter
5ad7f9612b added crawl settings for three new filters for each crawl:
must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue)
must-not-match for IPs
must-match against a list of country codes (allows only loading from hosts that are hosted in given countries)

note: the settings and input environment is there with that commit, but the values are not yet evaluated

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-27 21:58:18 +00:00
orbiter
2c3161b4ac refactoring:
RankingProcess -> RWIProcess
ResultFetcher -> SnippetProcess


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7974 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-26 21:42:28 +00:00
orbiter
6b22865dbc - removed some warnings
- removed a dead update location

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-24 01:58:54 +00:00
orbiter
e48ce5d80e - style change for search box: larger font, selected by default
- style change for search results: by default no parser, size, image info

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7949 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-14 09:05:06 +00:00
sixcooler
ecb4986b38 refactored stuff from last commit to ReferenceContainer
see: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=3353&p=23163#p23163
the limiting of references is disabled per default
to enable this set yacy.conf - index.maxReferences to a value of e.g. 100000

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7935 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 18:55:16 +00:00
orbiter
49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), then all embedded image, audio and video links from all parsed documents are added to the search index as individual documents. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 10:08:57 +00:00
orbiter
9a8937f8b6 be more liberal when evaluating search results. This may cause that it is possible to fraud content on fresh peers, but that is better than looong waiting times for the evaluation of every link which causes that everybody rejects YaCy as 'too slow'. But this is only because of the high standards that YaCy sets to itself. If we are able to gain more users by lowering the standard, then that is useful. The option to set that flag to verify each link is still there.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7918 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-01 16:02:15 +00:00
orbiter
1c007188ad bugfixes in html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-31 16:02:06 +00:00
orbiter
5dd2efc9a2 - bugfixes in html parser
- new fields in solr
- extended file viewer to debug parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 15:52:25 +00:00
sixcooler
4fec99115b Implementation of strategies for controlling memory resources.
You can toggle between previous (standard) and new (generation) strategy at PerformanceMemory_p.html.
The generation memory strategy is implemented with the objective of running more robust
but with the cost of early stopping some tasks (eg. dht) while running low on memory.
This new strategy does respect the generational way a heap is organized on most used jvms.
These changes run fine on my 3 peers for weeks now, but as I'm human, I may fail.
Please be careful using generation memory strategy and report errors by naming
OS, jvm and java_args.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7886 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 17:50:03 +00:00
orbiter
77a9af99f1 same values for Xmx and Xms: memory extension may be difficult if the OS has not the remaining memory available and may kill the jvm. If the memory is reserved at the start but never used the OS may handle that as well and leave non-used space in swap area (and never swap)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7867 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-11 21:54:27 +00:00
orbiter
768c59740c - replaced solrj 3.1 with solrj 3.3
- updated also slf4j
- added authentication for solrj


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7829 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-04 16:35:30 +00:00
orbiter
e7c7598923 docfix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7828 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-04 10:48:01 +00:00
orbiter
b84089ff04 fix for solr scheme list definition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7826 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 22:59:43 +00:00
orbiter
2d4bb139d3 - added counting of links with noindex tag for solr index
- bugfixes for solr index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7820 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 06:40:05 +00:00
lotus
fa6f2c2b44 use proxy accounts by default for more security
http://bugs.yacy.net/view.php?id=45

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-02 17:16:00 +00:00
orbiter
bda3eec0ff added parsing of canonical link element to html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-01 16:38:01 +00:00
orbiter
b6f09a475d - added an index profile editor in the /indexFederated_p.html servlet for solr indexes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-30 15:49:21 +00:00
orbiter
6deef60bc0 added keyword list for solr index attributes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7807 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 15:33:27 +00:00
f1ori
fdc84d8319 small pi link on index page to administration pages
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7804 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 09:32:00 +00:00
orbiter
84c9658644 added a file type navigator
added a protocol navigator

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7795 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-23 15:39:52 +00:00
suessthomas
66c477129e Creates a new network definition, yacy.networks.metager.unit.
The YaCy freeworld network is used in this network definition; minor enhancements for the feed of MetaGer were integrated.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7771 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-03 22:34:42 +00:00
f1ori
900dacbf97 * improve link rewriting in proxy-url
* only rewrites links, which are in current search domain

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-01 13:27:04 +00:00
orbiter
cc239b18cd fix for IPv6 localhost proxy client
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7744 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 16:24:11 +00:00
orbiter
10e2f588f8 - enhanced ybr ranking computation
- many speed/performance hacks
- added solr sharding and new sharding web interface
- added option to switch off the yacy index when using solr
- added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr
- refactoring/renaming of some method names to distinguish host/url hashes better
- a large number of bug/npe fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 10:57:02 +00:00
orbiter
3ed4a09368 small features, some bug fixes and performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-23 21:08:04 +00:00
orbiter
d8e934c085 better abstraction of http client identification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 13:35:29 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognized much more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as being local.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
orbiter
19fd13d3bc Added federated index storage to solr.
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.

The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.

federated solr storage is switched off by default.

To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check what's in solr after indexing, open http://localhost:8983/solr/admin/

Until now it is not possible to search that solr index with YaCy.
The storage functionality is available now for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
   that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
   want to use solr instead of YaCy.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:05:04 +00:00
orbiter
b1a8d0c020 enhancements to web cache and less strict caching rules
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7620 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 10:35:26 +00:00
orbiter
ba03ca8620 added more configuration options for search:
- removed configuration button for 'search only for admin' from index.html and added this to ConfigPortal
- added configuration of link verification options (iffresh, cacheonly, nocache, ifexist) to ConfigPortal
- added configuration of navigation options to ConfigPortal
- added an option to switch off automatic index cleaning in case that a link verification method fails


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7613 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 07:50:34 +00:00
orbiter
bed79402be introduction of a new remote search load control: the remote search has taken 10 results per peer with a time-out of 3 seconds so far. The attributes of number of results per peer and time-out time can now be configured.
This has two aspects: the user who searches may want to increase these values to get more results and more load on the remote side and the user of the server which is accessed for this search may want to restrict the load. Both sides can now be configured. The server-site maximum load parameters are defined by a network definition and the client-side search request load can be defined by each user individually but when the remote search is done the requested service is limited to the network definition.

You can find now in the network definition file:
network.unit.remotesearch.maxcount and network.unit.remotesearch.maxtime
and in the yacy.conf file:
remotesearch.maxcount and remotesearch.maxtime

There is currently no web interface to define the client-side remote search attributes, please set them manually
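
The interaction of the two limits described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not YaCy's actual code: the helper name `effective_remote_search_limits`, the dictionary layout and the example values are hypothetical; only the four property names come from this commit message.

```python
def effective_remote_search_limits(client, network):
    """Clamp the client's requested values to the network definition's maxima.

    `client` holds remotesearch.maxcount / remotesearch.maxtime (yacy.conf),
    `network` holds network.unit.remotesearch.maxcount / .maxtime.
    Hypothetical helper; the time unit is whatever both files agree on.
    """
    return {
        "maxcount": min(client["remotesearch.maxcount"],
                        network["network.unit.remotesearch.maxcount"]),
        "maxtime": min(client["remotesearch.maxtime"],
                       network["network.unit.remotesearch.maxtime"]),
    }
```

A client asking for more than the network allows is served only up to the network's limit; a client asking for less keeps its own smaller values.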
    

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7548 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-04 13:44:00 +00:00
f1ori
59dea3a284 * implement url proxy, a proxy via the url http://peer:port/proxy.html?url=http://domain.tld/path
* enable with proxyURL = true
* could be useful to browse specific pages with proxy or use own improvements in proxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7538 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-27 21:39:38 +00:00
orbiter
e3ef4e3021 - increased default peer ping time from 2 minutes to 1 minute
- filtering out too old peers when reading seed lists (limit is now 240 minutes)
- added concurrent host names resolving in front of the http client because the http client uses the java built-in DNS resolve which is not multithreading-safe (i have seen deadlocks in thread dumps showing that this bug in jdk is still there)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7515 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-23 09:42:01 +00:00
orbiter
d28f8040e0 removed unnecessary recording function that also caused a performance problem after serving too many files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7512 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-22 13:33:28 +00:00
orbiter
addbd5b482 moved the main update url - because of the many languages we support now on yacy.net
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7487 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-17 01:22:22 +00:00
orbiter
6c52e31993 new methods to open a browser
- if YaCy is started with the option -gui, it is not in headless mode. Then the java 1.6 browse method is used if all other methods fail
- in linux, the path /etc/alternatives/www-browser is used if no firefox is installed

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7480 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-14 16:15:14 +00:00
orbiter
5892fff51f introduction of dht-burst modes: this can expand the number of target peers in some cases where a better heuristic is needed. The problematic cases are either when a multi-word search is made (still a hard case for our term-oriented DHT) or when a network operator wants that all robinson peers are asked. We therefore introduced two new network steering values that switch on more peers during the peer selection. Because the number of peers can now be very large, the number of maximum httpc connections was also increased.
Please see new comments in yacy.network.freeworld.unit for details of the new DHT selection methods.
The number of maximum peers is now not fixed to a specific number but may increase with
- the partition exponent
- the number of redundant peers
- the robinson burst percentage
- the multiword burst percentage
The maximum can then be the number of senior peers (all visible peers).

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7479 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-13 17:37:28 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to get the LibraryProvider more clean
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00
low012
64f32e8f00 *) replaced all IPs in IP filters for proxy with the proper regular expression
*) some cleanup

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7477 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-11 23:37:13 +00:00
orbiter
fe93caac5a added flags and administration options to show advanced search and to show search result attributes (for each search result)
Administration can be done at ConfigPortal.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7466 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-02 15:54:13 +00:00
orbiter
88773e4daa changed the default port from 8080 to 8090
see also: http://forum.yacy-websuche.de/viewtopic.php?p=21683#p21683

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7454 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 10:54:13 +00:00
orbiter
6c35b68f17 - removed 'peerName' property from the yacy settings file because this information is stored in the yacy seed file
- the own seed file gets the lead for storage of the peer name
- exchanged default peer name generation method with one that does not use the local ip
- default peer names are now strings starting with '_anon'
- added another switch to suppress forwarding to ConfigBasic if the name was already changed
- replaced all usages of the yacy.conf peerName with access to the local seed
- changes to the peer name are now applied directly and not after the next peer ping


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7453 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 10:12:17 +00:00
orbiter
786166041a - added recording of all accessed and submitted servlets
- this recording is then used to redirect from the Status.html page to BasicConfig in case that servlet was never submitted
- this acts as an addition to the new default pop-up page 'index.html' which offers an administration link to Status.html. For a first-time user this then redirects directly to the former start page BasicConfig.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7451 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-27 11:17:11 +00:00
orbiter
3fe03f153d - search page becomes default start page (new users are not forced to do configuration since this is not necessary)
- adjusted top menu on search page (shows less stuff and now also the network graphics)
- adjusted the network page (looks better when showing no other navigation on top)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7448 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-26 14:58:28 +00:00
orbiter
59d9fe1bd7 added more php mime types
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7443 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:52:36 +00:00
orbiter
3ae8f40fc8 removed yacy.network.group - this feature was never used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7442 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:50:36 +00:00
orbiter
efb4ca8fa8 modified auto-delete of search failure-words:
- words are now not deleted from the search index automatically if index receive is switched off
- a flag in the network definition defines if this feature is switched on at all
- the search filter for not-found word references is switched off for server-side remote searches

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7441 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:46:00 +00:00
f1ori
4e29e9712a * create cleanupjob for cached failed urls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7437 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-17 15:04:00 +00:00
lotus
b1484299b2 same units for memory observer configuration (MiB)
old setting for DHT (RAM) will be lost after update
can be set on /Performance_p.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7418 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-02 20:38:01 +00:00
low012
11ea966f9e *) added SID file (Commodore 64) sound file parser
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7403 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-28 12:06:04 +00:00
low012
936e976c23 *) added FreeMind (http://freemind.sourceforge.net/) mindmap parser
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7397 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-27 20:13:31 +00:00
orbiter
4565b2f2c0 removed the display option from index.html, yacysearch.html and yacyinteractive.html
instead, a setting at ConfigPortal.html can be made to define if the topmenu shall be shown at these pages or if there is no navigation at all.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7366 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-08 10:50:23 +00:00
orbiter
fc2e41e691 added a forwarder for the default page. The forwarder forwards a browser to a different page if the root file index.html is accessed. This can be done by setting the name of the forwarder page to the field
"Default index.html Page (by forwarder)" in /ConfigPortal.html
The purpose is to forward to /yacyinteractive.html for the 27C3 FTP search platform

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7365 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-07 15:46:04 +00:00
orbiter
cc6499bf8d - added http://blekko.com as search heuristic (like scroogle). This was easy since they deliver their search results also as rss feed
- renamed YaCy's search result modification keywords for RECENT, NEAR and language: to the blekko slashtag naming scheme. YaCy now supports the following blekko-like built-in slashtags:
/date
 - for search results ordered by date (most recent up)
/near
 - for search results where search words appear near to each other (closest up)
/language/<lang>
 - for a sorting by language where the wanted language comes up first. Example: /language/de
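
How such slashtags could be separated from the search words is sketched below. This is a minimal Python illustration, not YaCy's implementation; the function name `split_slashtags` and the returned dictionary layout are hypothetical, and only the three slashtags named above are handled.

```python
def split_slashtags(query):
    """Separate search words from the modifiers /date, /near and /language/<lang>."""
    prefix = "/language/"
    words = []
    modifiers = {"date": False, "near": False, "language": None}
    for token in query.split():
        if token == "/date":
            modifiers["date"] = True
        elif token == "/near":
            modifiers["near"] = True
        elif token.startswith(prefix) and len(token) > len(prefix):
            # everything after the prefix is taken as the language code
            modifiers["language"] = token[len(prefix):]
        else:
            words.append(token)
    return words, modifiers
```

Tokens that are not one of the three known slashtags are kept as ordinary search words.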
  

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7350 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-29 18:08:20 +00:00
orbiter
a9f754c45f removed unused CR accumulation and distribution process
this was neither used nor extended in the last years. The resulting YBR ranking criterion
is still a good idea and will be used in the future. Possible generation methods for YBR
ranking are:
- "trust-rank" using the link structure as can be discovered in a single crawl (idea from FSCONS)
- "block-rank" calculated from the local link structure
- a distributed "block-rank" using the xml API to the link structure from other peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7349 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-29 11:07:42 +00:00
f1ori
442bebca2b * %0 does not belong to the IPv6-Address -> entry does not work on some systems
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7310 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-06 15:09:28 +00:00
f1ori
6ac4f8142e * allow proxy requests from localhost via ipv6
(%0 does not belong to the address)
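
The idea of dropping the zone suffix before comparing addresses can be sketched as follows. This is a minimal Python illustration, not the code changed in this commit; the helper names `strip_zone_id` and `is_localhost` are hypothetical.

```python
import ipaddress

def strip_zone_id(address):
    """Drop a trailing zone identifier such as '%0' from an IP address literal."""
    return address.split("%", 1)[0]

def is_localhost(address):
    """True if the address, without its zone identifier, is a loopback address."""
    return ipaddress.ip_address(strip_zone_id(address)).is_loopback
```

With the suffix removed, `0:0:0:0:0:0:0:1%0` and `0:0:0:0:0:0:0:1` are treated as the same loopback client.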

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7303 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-04 10:52:54 +00:00
orbiter
917d715374 lulabad found his signature
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7287 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-01 23:32:20 +00:00
f1ori
def4253555 * add option to network definition to provide a domainlist (syntax like in blacklists)
* crawler and search allow only urls matching one in domainlist (if list is provided)
* this may be useful to prevent dedicated networks from being "polluted"
* FilterEngine is an improved Blacklist object, Blacklist may inherit from FilterEngine in the future

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7285 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-30 14:44:33 +00:00
orbiter
482127e777 removed release key from location 2 because the signature of that source can not be verified. But the source is ok.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7283 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-29 09:22:56 +00:00
orbiter
facfd204e9 added a parent configuration option.
see /ConfigPortal.html
requested here:
http://forum.yacy-websuche.de/viewtopic.php?p=21099#p21099

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7271 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-25 22:16:07 +00:00
orbiter
6a166c2040 patches for bad proxy behaviour
- accept ipv6 localhost clients
- index media files (url only)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7238 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-11 11:38:36 +00:00
orbiter
45b1ab3d07 custom + generic skins:
- added a generic skin which is filled with actual color assignment using a servlet
- enabled css servlets
- added a generic color scheme in configuration file
- added configuration input in Customization/Appearance servlet
- added a jquery color picker widget
- attached the color picker widget to the input fields of the generic colour definition

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7235 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-11 00:00:10 +00:00
orbiter
2c549ae341 fixed a number of small bugs:
- better crawl start for file paths and smb paths
- added time-out wrapper for dns resolving and reverse resolving to prevent blockings
- fixed intranet scanner result list check boxes
- prevented htcache usage in case of file and smb crawling (not necessary, documents are locally available)
- fixed rss feed loader
- fixed sitemap loader which had not been restricted to single files (crawl-depth must be zero)
- clearing of crawl result lists when a network switch was done
- higher maximum file size for crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 23:57:58 +00:00
orbiter
37baa8bae3 - fixes for concurrency exceptions and failed database integrity verification
- added link to yacystats peer when peer is more than one day old

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7164 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-17 10:20:04 +00:00
orbiter
461a2a6ec7 enhanced remote crawling:
- 300 ppm is default now (but this is switched off by default; if you switch it on you may want more traffic?)
- better timing for busy queue
- better amount of remote url retrieval
- better time-out values
- better tracking of availability of remote crawl urls
- more logging for result of receipt sending

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7159 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-16 09:34:17 +00:00
orbiter
670ba4d52b - removed the remote crawl option from the network configuration submenu and
- added a remote crawl menu item to the index create menu. This menu also shows a list of peers that provide remote crawl urls
- set remote crawl option by default to off. This option may be important but it also confuses first-time users


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7158 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-16 00:39:05 +00:00
orbiter
4c21d8dc9d - changed default values for online caution (the pausing may not be necessary any more)
- fixed bug in WeakPriorityBlockingQueue
- show favicon faster using pre-loading (same technique as used for fast image search)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7130 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-09 23:25:19 +00:00
orbiter
0ab6a462ee - added a missing entry in YaCy interface robots.txt for bookmarks
- changed default robots.txt deny list to include some more interface pages because the loading of such pages is a peer load issue for YaCy when crawlers come by and the information on these pages is not useful for public search.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7112 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-06 09:58:54 +00:00
orbiter
3f1d5a061f by default store crawled pages to HTCache to support verify=false snippet generation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7087 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-31 09:28:01 +00:00
lotus
23ba107834 UPnP port forwarding default on now. This also displays a message on the entry settings page if not successful, so the user gets an extra hint to open his ports.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7077 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-27 08:45:00 +00:00
orbiter
70dd26ec95 added the new crawl scheduling function to the crawl start menu:
- the scheduler extends the option for re-crawl timing. Many people misunderstood the re-crawl timing feature because that was just a criterion for the url double-check and not a scheduler. Now the scheduler setting is combined with the re-crawl setting and people will have the choice between no re-crawl, re-crawl as was possible so far and a scheduled re-crawl. The 'classic' re-crawl time is set automatically when the scheduling function is selected
- removed the bookmark-based scheduler. This scheduler was not able to transport all attributes of a crawl start and did therefore not support special crawling starts i.e. for forums and wikis
- since the old scheduler was not able to crawl special forums and wikis, the must-not-match filter was statically fixed to all bad pages for these special use cases. Since the new scheduler can handle these filters, it is possible to remove the default settings for the filters
- removed the busy thread that was used to trigger the bookmark-based scheduler
- removed the crontab for the bookmark-based scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7051 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-19 23:52:38 +00:00
orbiter
59c035c40b changed explanation of Xmx and Xms
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7038 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-13 20:03:10 +00:00
orbiter
171f2bd84e - removed unused network oanet
- added new network definition 'allip' which can be used in networks where intranet and internet-addresses shall be indexed
- added an auto-switch-off for global search if there are no global peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7030 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-09 23:41:17 +00:00
low012
8e88fa4a62 *) fixed indentation (tab vs. spaces)
*) added Android packages MIME type

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6956 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-29 21:31:22 +00:00
orbiter
b6fb239e74 redesign of parser interface:
some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it was not possible to parse document containers that contain individual files. An example is an rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure.
This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface now has only one entry point and always returns a set of parsed documents. In case of single documents the parser method returns a set of one document.
To be compliant with the new interface, the zip and tar parsers have also been completely redesigned. All parsers are now much simpler and cleaner in their structure. The switchboard operations have been extended to operate with sets of parsed files, not single parsed files.
additionally, parsing of jar manifest files has been added.
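
The shape of such an interface can be sketched as follows. This is a minimal Python illustration of the idea only (YaCy itself is written in Java); the class names `Parser`, `PlainTextParser` and `FeedParser` and the dictionary-based document representation are hypothetical and do not reflect the real classes.

```python
from abc import ABC, abstractmethod

class Parser(ABC):
    """Single entry point: every parser returns a list of parsed documents."""

    @abstractmethod
    def parse(self, url, content):
        """Return a list of documents, each a dict with 'url' and 'text'."""

class PlainTextParser(Parser):
    """A single document yields a list with exactly one entry."""

    def parse(self, url, content):
        return [{"url": url, "text": content}]

class FeedParser(Parser):
    """A container: each (link, message) pair becomes its own document."""

    def parse(self, url, content):
        return [{"url": link, "text": message} for link, message in content]
```

Callers can then iterate over the result in the same way for single files and for containers.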

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-29 19:20:45 +00:00
orbiter
11b7853940 added a configuration page for search heuristics. currently you can switch on there:
- a site-operation heuristic that loads all direct links from a portal page if the site-operator is used
- a direct crawl for search results from scroogle for the given search terms
The configuration page can be found directly beside the network configuration page


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6951 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-27 21:38:16 +00:00
orbiter
353a924760 - changed default memory to 500m
- now xms is lower than xmx (lets try what happens)
- removed default path for intranet crawl starts to avoid confusion as seen on linuxtag
- added time-out to upnp request (i have a new router which may need that)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6916 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-14 21:36:40 +00:00
orbiter
11639aef35 - added new protocol loader for 'file'-type URLs
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-25 12:54:57 +00:00
orbiter
90fa8fd4d4 - support gpx file extension
- non-blocking location search (time-out handling was wrong)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6871 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-12 08:49:20 +00:00
orbiter
cf43bdc87e This is a large bugfix and enhancement commit to support a better location detection for data
- fixes to http file server session handling
- fixes and enhancements to metadata date/time handling
- added dc:publisher metadata field and updated all document parsers
- fixed bug in metadata read procedure
- enhanced dublin core and rss parsers to understand more fields correctly
- enhanced url selection in case that multiple urls are given in surrogates
- fix for condenser; failure when last word does not end with termination symbol

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6863 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 11:14:05 +00:00
orbiter
a7d038bb7a The oai ListFriends source list becomes configurable: just write them into defaults/oaiListFriendsSource.xml
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6857 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-06 10:01:37 +00:00
orbiter
cf13c65bdd added another network definition file for the open access (decentral OAI) search network
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6856 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-05 22:47:03 +00:00
orbiter
5efc0dce0b fix for domain options in search box
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6848 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-30 21:53:20 +00:00
orbiter
f83b1b91b9 increased dht busy sleep time to 10 seconds to reduce TCP/IP traffic for default settings. A sleep time of 2 seconds caused too much traffic for home-use routers.
Please try to set your dht busy sleep time in existing installations also to 10 seconds.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6779 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-27 23:04:00 +00:00
orbiter
9623d9e6d2 added a smb loader component for the YaCy crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6737 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-10 08:55:29 +00:00
orbiter
72f00dee59 removed never-used server access account function
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6731 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-08 22:30:45 +00:00
orbiter
30c8185139 fix for sid check
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6673 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 23:31:32 +00:00
orbiter
ef62d017e5 integrated session id filtering for crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 23:15:17 +00:00
orbiter
d8d9984913 added framework for session id filtering (not ready yet)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 22:30:41 +00:00
lotus
945e0ba5a5 allow global search if res. observer disabled index transmission
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6658 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-09 17:14:16 +00:00
lotus
5cbef63c37 fixed bad ip pattern
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6624 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-26 14:25:15 +00:00
orbiter
8df1694288 - added options to switch on/off search domains (text, image, audio, video, app)
- more memory by default


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6605 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-21 22:03:02 +00:00
orbiter
dff4f95c78 some patches to get the torrent parser working
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6551 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-07 00:42:12 +00:00
lotus
12dd8ece3e enabled memory protection from 6459 with 50000kb (disables dht-in)
this should only apply if there is really little memory available because it is checked by threads explicitly requesting memory

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6479 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-13 16:26:45 +00:00
orbiter
2bab0679e0 lost my key :-(
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6466 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-06 23:46:29 +00:00
lotus
6edc168cfe option to disable dht by memory limit:
memory.acceptDHT in kbytes
not yet pre-enabled, will clear on every startup
please review since this could break dht in freeworld

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6459 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-06 19:13:30 +00:00
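The memory limit described in the commit above could be checked roughly as follows. This is only a sketch: the key `memory.acceptDHT` and its unit (kbytes) come from the commit message, while the function name, the dictionary-based config, and the reading of "not yet pre-enabled" as a default of 0 (no limit) are assumptions made for illustration.

```python
def accept_dht(available_kb: int, config: dict) -> bool:
    """Return True if incoming DHT transfers should be accepted.

    'memory.acceptDHT' is the key named in the commit message (in kbytes).
    Treating a missing key as 0, i.e. no limit, is an assumption.
    """
    limit_kb = int(config.get("memory.acceptDHT", 0))
    return available_kb >= limit_kb


# With the 50000 kb value mentioned in a later commit (r6479):
print(accept_dht(40_000, {"memory.acceptDHT": "50000"}))
print(accept_dht(80_000, {"memory.acceptDHT": "50000"}))
```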
lotus
79251e6f60 configurable disk space hardlimit for dht
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6441 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-31 19:12:53 +00:00
orbiter
8a1046feaa less maximum file size, too many problems with larger size
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6439 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-30 20:21:45 +00:00
orbiter
3d5eeb842a new default skin 'pdblue'
The old default skin named 'default' is renamed to 'classic-blue'.
All users will keep their current default skin named default, but YaCy will copy the classic-blue also to the skin folder.
For all new peers, the new skin pdblue is used.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6416 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-15 12:59:44 +00:00
orbiter
c864901087 - moved httpd.mime to defaults path
- some documentation fixes
- adopted a default setting for the search window: moves css setting to base.css
- some enhancements for the DocumentIndex class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6410 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-14 13:29:09 +00:00
orbiter
735e2737e3 * added index segments
This is a major change in the organization of indexes.
Please consider a back-up of your data before you run this update.
All existing index files will be moved and renamed to a new position.
With this change, it will be possible to maintain different indexes for different purposes and it will be possible to have a distinction between DHT-in and DHT-out specific indexes. Tenants may also have their own index, and it may be possible to have histories and back-ups of indexes. This is just the beginning; many servlets must be adapted after this change, but all functions that had been there should still work.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6389 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 14:44:20 +00:00
orbiter
6e0dc39a7d - some fixes to prevent blocking situations
- better logging for the crawler
- better default values for the crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6377 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-06 21:52:55 +00:00
orbiter
23ab6fbca4 - navigation appears at the correct position when opengeodb-results are also presented after a search
- show an about box if about.headline and about.body are set

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6332 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-20 22:10:45 +00:00
orbiter
721b88efbd - fixed a problem loading blacklists with new yacycore.jar
- fixed badwords and stopwords initialization

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6315 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-15 11:46:02 +00:00
orbiter
573d03c7d7 added configuration to enable ram table copy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6304 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 20:30:57 +00:00
orbiter
d656a94f55 fix for bad paths in dictionary processing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6285 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 18:24:41 +00:00
orbiter
39ae96450b draw more peers in network picture
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6248 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-07 08:36:15 +00:00
orbiter
c6c97f23ad - added cache usage properties to crawl start
- added special rule to balancer to omit forced delays if cache is used exclusively
- extended the htCache size by default to 32GB

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6241 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-24 11:54:04 +00:00
orbiter
5bb8074150 removed the indexing queue. This queue has been superfluous since the introduction of the blocking queues last year, where documents are parsed, analysed and stored in the index concurrently.
- The indexing queue was a historic data structure that was introduced at the very beginning of the project as a part of the switchboard organisation object structure. Without the indexing queue the switchboard queue also becomes superfluous. It has been removed as well.
- Removing the switchboard queue requires that all servlets are called without an opaque generic ('<?>'). As a result, all servlets had to be modified.
- Many servlets displayed the indexing queue or the size of that queue. In the past months the indexer was so fast that the indexing queue mostly appeared empty, so there was no use for it any more. Because the queue has been removed, the display in the servlets also had to be removed.
- The surrogate work task had been a part of the indexing queue control structure. Without the indexing queue the surrogates needed their own task management. That has been integrated here.
- Because the indexing queue had a special queue entry object and properties attached to this object, the properties had to be moved to the queue entry object which is part of the new indexing queue within the blocking queue, the Response Object. That object now also has the properties of the removed indexing queue entry object.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6225 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-17 13:59:21 +00:00
orbiter
b2263bc720 enhanced document type recognition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6209 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 11:01:05 +00:00
orbiter
57a88d435b redesign of parser mime type detection and parser steering
There is now a mime-blacklist instead of a mime-whitelist

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6190 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-10 14:22:17 +00:00
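The practical difference between the two approaches in the commit above is how an unlisted mime type is treated: a whitelist rejects it, a blacklist lets it through to parser selection. A minimal sketch of that difference, with hypothetical function names and example mime sets that are not taken from YaCy:

```python
def allowed_by_whitelist(mime: str, whitelist: set) -> bool:
    """Old behaviour (as described): only explicitly listed types pass."""
    return mime.lower() in whitelist


def allowed_by_blacklist(mime: str, blacklist: set) -> bool:
    """New behaviour (as described): everything passes unless listed."""
    return mime.lower() not in blacklist


# Hypothetical example sets, for illustration only.
whitelist = {"text/html", "application/pdf"}
blacklist = {"application/octet-stream"}

# A type that appears in neither list is rejected by the whitelist
# but accepted by the blacklist.
print(allowed_by_whitelist("application/gpx+xml", whitelist))
print(allowed_by_blacklist("application/gpx+xml", blacklist))
```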
orbiter
8ca1f5d400 - some work to integrate the html parser the same way as the other parsers are integrated (not finished)
- added migration of code of settings pages (hmm.. does not work correctly yet, sorry)
- more refactoring
- removed more unused code

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6187 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-09 20:56:30 +00:00
orbiter
801aa08162 added f1oris update location
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6174 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-04 20:56:25 +00:00
lotus
ec2970cc76 higher dht distribution speed by default
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6168 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-03 13:28:19 +00:00
orbiter
995da28c73 all stack/heap files that had been stored in DATA/PLASMA are now stored in the network-specific QUEUES path
There is no migration. All crawls must be restarted.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6167 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-02 17:01:23 +00:00
low012
457b6c0d6d *) updated Apache POI library to be able to parse Visio files
*) updated PPT and XLS parsers to use new Apache POI library
*) added new Visio (VSD) parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6145 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-27 09:33:09 +00:00
lotus
db70badcf0 possibility to set remote host on upnp device
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6097 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-19 15:32:59 +00:00
lotus
aec3e7995a autoconfig.pac can be used to browse .yacy-domains only
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6077 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-15 19:48:11 +00:00
orbiter
f348190566 tried to add a database dump import method to the phpBB3 import function. Reason: imports of large database dumps cannot be handled with phpMyAdmin, and this should be an easy way to get the database dumps into a MySQL database, from where they can be exported again with the phpBB3 content integration adapter. Completion or removal of this function stub will follow before the next main release.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6065 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-13 23:03:40 +00:00
orbiter
d50be59088 - added an automatic re-construction of the domain stack after 10 minutes. This then adds urls to the domain stack that were left over because of stack size limitations when the domain stack was last created
- changed the busy sleep time for the crawl thread to 30 milliseconds. This is sufficient to crawl with 2000 PPM.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6028 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 09:34:44 +00:00
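As a rough check of the 2000 PPM figure in the commit above: a busy sleep of 30 milliseconds allows at most 60000 / 30 = 2000 loop cycles per minute. The sketch below assumes one page per cycle and no other per-cycle cost; both are simplifying assumptions, not statements from the commit, and the function name is hypothetical.

```python
def max_pages_per_minute(busy_sleep_ms: int) -> float:
    """Upper bound on pages per minute, assuming each crawl cycle loads
    one page and the busy sleep is the only delay per cycle."""
    if busy_sleep_ms <= 0:
        raise ValueError("busy_sleep_ms must be positive")
    return 60_000 / busy_sleep_ms


print(max_pages_per_minute(30))  # 2000.0
```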
orbiter
5fdba0fa51 - fixed a non-working selection rule in balancer
- more safety around crawl-delay, be more fail-safe
- better logging in case of long forced crawl-delays

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6027 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 08:46:59 +00:00
orbiter
4522c13ee7 added option for a table prefix when importing phpbb3
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5996 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-29 14:29:02 +00:00
orbiter
4b4bddca00 added new submenu to crawler menu: import of phpbb3 forum postings from mysql
- yacy can import phpbb3 posts without crawling
- all data is written as surrogate
- indexed surrogate files can be re-used

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5985 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-27 14:53:23 +00:00
orbiter
26a46b5521 increased default maximum file size for database files to 2GB
Other file sizes can now be configured with the attributes
filesize.max.win and filesize.max.other
the default maximum file size for non-windows OS is now 32GB

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5974 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-25 06:59:21 +00:00
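A lookup for the two attributes named in the commit above might look like the following sketch. The keys `filesize.max.win` and `filesize.max.other` are from the commit message; the function name, the dictionary-based config, the use of bytes as the unit, and the reading that the 2GB default applies to Windows are all assumptions.

```python
GIB = 1024 ** 3

# Assumed defaults: 32GB for non-Windows is stated in the commit;
# applying the 2GB default to Windows is an interpretation.
ASSUMED_DEFAULTS = {
    "filesize.max.win": 2 * GIB,
    "filesize.max.other": 32 * GIB,
}


def max_file_size(config: dict, is_windows: bool) -> int:
    """Pick the configured maximum database file size for the current OS,
    falling back to the assumed defaults when the key is not set."""
    key = "filesize.max.win" if is_windows else "filesize.max.other"
    return int(config.get(key, ASSUMED_DEFAULTS[key]))


print(max_file_size({}, is_windows=False) // GIB)
```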
orbiter
27eb8d62cb - new development cycle
- removed temporary configuration with safe setting for indexer threads (=1) and replaced it with best value computed during performance tests (1/2 of number of processors)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5963 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-18 21:20:06 +00:00
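The "1/2 of number of processors" rule from the commit above can be sketched as below. The rounding direction and the lower bound of one thread are assumptions, since the commit does not say how odd or single-processor counts are handled; the function name is hypothetical.

```python
import os
from typing import Optional


def default_indexer_threads(processors: Optional[int] = None) -> int:
    """Half the number of processors, rounded down, but at least 1.
    Rounding down and the minimum of 1 are assumptions."""
    if processors is None:
        processors = os.cpu_count() or 1
    return max(1, processors // 2)


print(default_indexer_threads(8))  # 4
```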