Commit Graph

500 Commits

Author SHA1 Message Date
luccioman
aa9ddf3c23 Added control over Robots.txt active threads maximum number.
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most ot the
connections timed out.

To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueus_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
2016-11-23 18:13:05 +01:00
reger
bad8f87998 remove old/obsolete clear text "adminAccount" credential entry from init
and setConfig (.,empty) from servlets/code
2016-11-20 00:20:47 +01:00
reger
59448461d3 make use of userInRole for quick login verification 2016-11-16 01:42:53 +01:00
reger
2a4d826d9e adjust servlet RequestHeader.getLocale
init jvm defaultLocale matching UI language
2016-11-15 02:37:03 +01:00
reger
af39a76bf6 Reduce number of default max. search navigator lines (from 10000)
to 100 + make it configurable
2016-10-29 04:19:46 +02:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
7d5ba2afa4 Added some JavaDoc and moved crawlStacker close at the right place. 2016-09-22 08:21:14 +02:00
Michael Peter Christen
efeb592661 don't do solr optimization, this create high IO load. We should leave
this task to solr to do that on it's own instead of forcing it.
2016-08-19 15:30:53 +02:00
reger
2910fe35c1 add missing scheduler calc of next exec_date (call of calculateAPIScheduler)
- after last_exec_date is altered, next_exec_date should be recalculated
- makes the recalculation of next_exec in advance (without api call surely made) in Switchbard.schedulerJob() obsolete
Slightly modify next_exec calc. on missed event to now+schedule_time (from fix 10min)
2016-08-09 03:03:04 +02:00
reger
70d47ae38a keep scheduler selection by repeat entry from 07311020d4
to allow exec schedule on actual exec event.
Iterate on exec date (of advantage after interruption/shutdown) to schedule
older or missed events first.
2016-08-08 02:19:48 +02:00
reger
7c3f932e5d revert due to conflict with double count recording by schedulter / servlet by the commit under normal operation (no shutdown) 2016-08-08 01:57:31 +02:00
reger
07311020d4 postpone apicall exec date init until actual call
fix for http://mantis.tokeek.de/view.php?id=677
The difference is on scheduling a large number of rss feeds and loading 
is not finished before shutdown of YaCy. The change makes sure not already
loaded RSS will be loaded by the scheduler on next startup.
2016-08-07 05:08:55 +02:00
reger
fcad2d0744 add uses of config constant INDEX_RECEIVE_ALLOW 2016-07-27 02:16:20 +02:00
Michael Peter Christen
7466d390b2 small refactoring + do not accept too old peers during bootstrap 2016-07-04 11:02:15 +02:00
reger
8d58a48029 remove wrong log line in CrawlSwitchboard
+ don't allow CrawlSwitchboard to exit application
making network param unused
2016-07-02 20:33:23 +02:00
reger
b119ff65be clean out not used Switchboard variables
counter indexedPages, const xstackCrawlSlots
2016-06-14 01:50:32 +02:00
reger
bd8f7c11f5 Use transparent addToCrawler in AutoSearch instead of addToIndex
This would likely also be of advantage for RSS import/schedule as
following bug-reports suggest
http://mantis.tokeek.de/view.php?id=569
http://mantis.tokeek.de/view.php?id=655
2016-06-01 01:14:22 +02:00
JeremyRand
433217b33e Properly support multiple Boost Queries. (Previous code was broken because it concatenated multiple Boost Queries together rather than passing Solr an array.) 2016-05-20 20:17:51 -05:00
reger
d0a571bed2 del cytag trail for own index.html (save resource not used by default) 2016-05-19 01:59:00 +02:00
reger
7097dcbdbd cleanup hack for partial Solr update on multivalued datefields
has been fixed in Solr http://issues.apache.org/jira/browse/SOLR-8050
2016-05-06 02:47:04 +02:00
reger
ef24593347 delete obsolete SEARCHRESULT busythread constants
not used since 29.05.2013 18:27:27
0c1a018bbd
2016-05-04 01:30:10 +02:00
reger
d9adc2c255 load handler for Transparent Proxy on startup only if feature is activated
to save the resources and keep handler chain small if the feature is not used.
+add a warning message on settingsack_p page to restart on first activation
2016-03-25 05:26:48 +01:00
Michael Peter Christen
849ab671a9 0n: modified the p2p bootstraping process - rules had been too tight and
did not support the re-start of a network with just one principal peer.
2016-03-11 08:54:42 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
- Above brought up that parser start url parameter, declared as AnchorURL uses only methodes of parent object DigestURL (changed parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
reger
a6617ad887 expand initRemoteCrawler() to terminate worker threads if called to deactivate
remote crawl.
On startup we save the resources for remote crawler if disabled. Once started
threads are running idle after disable remote crawl. Now threads are terminated
to save the resources also while disabeling during runtime.
+ remove empty class Channels
2016-01-28 23:14:09 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
reger
1af0e9ef74 remove workaround for Solr bug regarding multivalued date fields
fixed in 5.4.0
http://issues.apache.org/jira/browse/SOLR-8050
2016-01-03 01:11:27 +01:00
reger
6d54eb3d36 skip loading document on crawl start for YMark bookmarks
by adding a constructor giving the already loaded document as parameter.
2015-12-26 01:15:07 +01:00
luc
8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
failing. Looks like it was broken since Commit
b43811d38c
2015-12-08 03:34:03 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove not neede String[] getter
2015-11-13 01:48:28 +01:00
reger
a60b1fb6c2 differentiate api call getLocalPort() from getConfigInt() 2015-10-31 23:09:03 +01:00
Michael Peter Christen
3d7dd9d3aa follow-up to latest commit: also flush the search cache if all crawls
had been terminated.
2015-10-01 13:21:28 +02:00
reger
7889fc2389 Hack to prevent Solr issue on partial update on a document containing multivalued date field
(regardless if these fields part of update).
Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
2015-09-13 20:23:15 +02:00
reger
e37a4f0b3d prevent metadata records in index w/o valid url
by throwing MalformedURL exception on URIMetadataNode creation
2015-09-06 22:19:05 +02:00
Michael Peter Christen
df3314ac1a added a new facet type based on a probabilistic classifier using
bayesian filters. This can be used to classify documents during
indexing-time using a pre-definied bayesian filter.

New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".

To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files with one
document each per line for every categroy. One of these categories MUST
have the name 'negative.txt'.

Then, each new document is classified to match within one of the given
categories for each context.
2015-08-10 14:27:44 +02:00
reger
cb67eb7baf use more absolute path for config file opening
as suggested in pull request 5 (https://github.com/yacy/yacy_search_server/pull/5)
2015-08-01 23:54:26 +02:00
Michael Peter Christen
90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
during surrogate reading: those attributes from the dump are removed
during the import process and replaced by new detected attributes
according to the setting of the YaCy peer.
This may cause that all such attributes are removed if the importing
peer has no synonyms and/or no vocabularies defined.
2015-07-02 00:23:50 +02:00
Michael Peter Christen
593de05922 enhanced surrogate import process speed (dramatically!) 2015-06-29 12:28:34 +02:00
Michael Peter Christen
694b22f165 migration to Solr 5.2: huge benefits - this is a lot faster!
This is a very complex migration: many classes had been renamed or
removed, dependencies changed and the solr index type is now aligned to
be a solr cloud repository.
Together with the Solr 5.2 library update, one other dependent library
had been updated as well: httpclient 4.4->4.4.1

Older indexes are migrated from 4_10 to 5_2. However, the new index
structure is more efficient and we recommend to re-index everything.
Please use the index export before you do the update to a large
surrogate xml file. After the update, start with an empty index and then
initialize this with your dump.
2015-06-24 01:55:51 +02:00
reger
121972752c implement deleteOldDownloads in RexourceObserver on low diskspace
- direct assign sb.observer (skip redundant InitThread)
2015-06-08 02:52:13 +02:00
reger
49b79987c9 remove obsolete searchfl work table
was used to register urls with not complete words in snippet but is never accessed
2015-06-04 22:44:01 +02:00
Michael Peter Christen
d0aff91f23 fix for index import 2015-06-01 01:56:09 +02:00
Michael Peter Christen
34de1e8cbc gzip compression will perform more efficient and with better compression
level
2015-06-01 01:24:33 +02:00
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
3e742d1e34 Init remote crawler on demand
If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and ideling thread.
Deploy of the remoteCrawlJob deferred on activation of the option.
2015-05-23 02:06:39 +02:00
reger
2bc9cb5828 fix early return in addToCrawler
check / handle all supplied urls after error url
2015-05-13 21:58:43 +02:00
reger
752eec6697 fix NPE in addToIndex when used outside searchEvent 2015-05-10 05:18:23 +02:00
Michael Peter Christen
ff29b0e503 added option to re-index exported xml snapshot dumps to
HTCACHE/snapshots by just placing them in the SURROGATES/in path
2015-05-08 15:30:26 +02:00
Michael Peter Christen
6f4fe4b175 revert of 8a7c68e4c7
keeping surrogates after processing is essential for some users. If the
space they are taking is too high, please set up an automatic deletion
process (like a cronjob).
2015-05-08 14:01:30 +02:00
reger
8a5b8f8789 on bookmaring of search result, remember orig. query in separate bookmark property
(instead of using the description field)
- adjust display and autosearch
- don't overwrite existing bookmark but combine info
2015-05-03 02:31:50 +02:00
reger
296e97c78e put https port in peers dna
as we flag if a peer is accesible via https, we need to know the port if we want to use is (e.g. for interYaCy communication)
start to provide / tansport the port by recording it in peers dna.
- add https link on the Network.html lock symbol
2015-04-16 02:36:12 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone managament for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required the change of the parser java api. A lot of other changes
had been made which corrects the wrong handling of dates in YaCy which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denot weekend
days (saturday and sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove from and date modifier and set a on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
reger
4b97ddb9ec stop sending crawl receipts if receiver got offline 2015-02-17 03:16:10 +01:00
Michael Peter Christen
b5ac29c9a5 added a html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
with the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
a8a2b7a803 persistency for vocabulary facet switch 2015-01-29 02:16:42 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionallity, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager enforces now a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses a additional config file for fieldname translation.

default heuristicopensearch.conf: 
- openbdb.com removed - seems not longer to deliver results
- config via solrconnector to  datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
Michael Peter Christen
3b51636ecb fix for mediawiki import 2015-01-12 00:35:47 +01:00
Michael Peter Christen
eb78388a98 changed prefer strategy for http unique in such a way that http is
preferred over https. While this is a bad idea from the standpoint of
security it is more common applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
2014-12-21 19:17:06 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute to the url get/post properties.
This will distinguish the different page entries. The search result list
will then replace the post parameter with a url anchor # mark which
causes that the original url is presented in the search result. These
URLs can be opened directly on the correct page using pdf.js which is
now built-in into firefox. That means: if you find a search hit on page
5 and click on the search result, firefox will open the pdf viewer and
shows page 5.
2014-12-21 18:10:15 +01:00
Michael Peter Christen
28683530cd fixes to usage of no-cache: use and recognize also the no-store
directive
2014-12-19 17:37:58 +01:00
reger
13cca2b114 fix missing AppPath
upd Maven plugin versionid
2014-12-19 01:58:37 +01:00
Michael Peter Christen
66b5a56976 Added and integrated new date detection class which can identify date
notions within the fulltext of a document. This class attempts to
identify also dates given abbreviated or with missing year or described
with names for special days, like 'Halloween'. In case that a date has
no year given, the current year and following years are considered.

This process is therefore able to identify a large set of dates to a
document, either because there are several dates given in the document
or the date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

#date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, that may also be possibly in the future

These fields are deactiviated by default because the evaluation of
regular expressions to detect the date is yet too CPU intensive. Maybe
future enhancements will cause that this is switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above of the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only such peers
running on a server with xkhtml2pdf installed. The expert crawl starts
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if xkhtml2pdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
reger
e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory 2014-12-06 02:25:24 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to archive it, use request with profile to allow indexing (defaultglobaltext) and update index 
   (the resource is loaded, parsed anyway, so it's not a expensive operation)

Request: remove 2 unused init parameter 
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
60f27bdf49 added the property timeoutrequests to configuration to disable
TimeoutRequests. The purpose is to test if YaCy runs better on VMs where
there is a limitation of concurrent processes;  see
/proc/user_beancounters in row numproc; this value is limited and should
be low. Try to set timeoutrequests to keep this low. (works only after
restart)
2014-12-01 15:20:10 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and downloadh
ttp://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
reger
1e7ee72240 fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
2014-11-23 23:29:20 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded
   (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This allows also to
make the execution in dependency of available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which attach always the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality and the usage of YaCy as search agent for latest news has
a better quality.
2014-11-13 00:58:58 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00
Michael Peter Christen
2e5214eb21 added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
2014-10-17 14:17:49 +02:00
Michael Peter Christen
07c5b57953 removed warnings 2014-10-15 11:19:25 +02:00
Michael Peter Christen
5082feb103 less volume for effect sounds 2014-10-08 15:04:35 +02:00
Michael Peter Christen
0843b12ef3 ipv6 fix: avoid that shrinked own ip set is overwritten with (non-valid)
set of local IPs
2014-10-07 22:36:01 +02:00
Michael Peter Christen
74957f3760 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-07 17:51:18 +02:00
Michael Peter Christen
2a052f446a Added an experimental audio feedback system.
This is the first element of a new 'decoration' component which may hold
switches for different external appearance parameters.
The first switch in that context is decoration.audio (as usual in
yacy.init). This value is set to false by default, that means the audio
feedback element is switched off by default. To switch it on, set
decoration.audio = true (using /ConfigProperties_p.html). You will then
hear sounds for the following events:
- remote searches
- incoming dht transmissions
- new documents from the crawler
Sound clips are stored in htroot/env/soundclips/ which is done so
because a future implementation will read these files using the http
client and with configurable urls which will make it very easy for the
user to replace the given sounds with own sounds.
2014-10-07 17:51:07 +02:00
Marc Nause
1e6e69bc40 Finished implementation of UPNP:
*) will try other ports if YaCy standard ports are not available
*) distinguish between internal and external port (not sure if this
works 100%)

Still to add: propery in config to enter own external port (in case of
manually configured NAT)
2014-10-07 13:10:06 +02:00
Michael Peter Christen
6491270b3a large IPv6 redesign of peer ping methods!
removed preferred IPv4 in start options and added a new field IP6 in
peer seeds which will contain one or more IPv6 addresses. Now every peer
has one or more IP addresses assigned, even several IPv6 addresses are
possible. The peer-ping process must check all given and possible IP
addresses for a backping and return the one IP which was successful when
pinging the peer. The ping-ing peer must be able to recognize which of
the given IPs are available for outside access of the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they had
been used for a single IP only. There is still a large construction site
left in YaCy now where all these deprecated methods must be replaced
with new method calls. The 'extra'-IPs, used by cluster assignment had
been removed since that can be replaced with IPv6 usage in p2p clusters.
All clusters must now use IPv6 if they want an intranet-routing.
2014-09-30 14:53:52 +02:00
Michael Peter Christen
ad35d9294f added a 'stats' table which records some peer statistics twice every
hour. The table can be shown with
http://localhost:8090/Tables_p.html?table=stats

The entries have the following meaning: 
aM: activeLastMonth
aW: activeLastWeek
aD: activeLastDay
aH: activeLastHour
cC: countConnected (Active Senior)
cD: countDisconnected (Passive Senior)
cP: countPotential (Junior)
cR: count of the RWI entries
cI: size of the index (number of documents)

The entry keys are abbreviated to reduce the space in the table as the
name is written again for every row.

This is the beginning of a 'yacystats' micro-alternative als built-in
function in YaCy. Graphics may follow after some time if enough test
data is available.
2014-09-17 12:54:50 +02:00
reger
8284ea751a catch TimeoutException during ping and do not delete yacy.conf during prereadconfigfile
found a situation after crash (reboot) with existing running semaphore but YaCy not running.
Ping generated exception which finally deleted the conf file (during pre-read procedure)
- change to ping (catch exception solved it)
- additionally removed delete yacy.conf file (if needed we need to make a backup)
2014-09-16 23:14:13 +02:00
Michael Peter Christen
524bedc00a fixed text in startup tray icon and added shutdown icon during shutdown 2014-09-10 13:19:08 +02:00
orbiter
f318d7c285 enhanced date-ordered ranking 2014-09-01 13:01:30 +02:00
orbiter
a65df4ce7e do not push noindex errors into log if in intranet mode. noindex
attributes are attached to artificial constructed index.html files which
list directories. Such files are naturally rejected by the crawler and
should not appear in the error log because these files are part of the
construction of file crawlers and confuse users if they see them in the
error log.
2014-08-27 00:10:51 +02:00
Marc Nause
2af56fa37d Improved UPnP. (still not perfect)
*) set HTTPS port if enabled
*) improved data structures (may not be final)
*) moved UPnP to own package
2014-08-26 22:47:13 +02:00
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
12fb9d7cd1 log postprocessing constraints in case that postprocessing is not
performed
2014-08-04 14:19:37 +02:00
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
Michael Peter Christen
542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
filled with the date, when the url is recognized as to be outdated. That
field was partly misinterpreted and the time interval was filled in. In
case that all the urls which are in the index shall be treated as
outdated, the field is filled now with Long.MAX_VALUE because then all
crawl dates are before that date and therefore outdated.
2014-07-22 00:23:17 +02:00
Michael Peter Christen
4eec1a7452 refactoring (change Metadata name of load time data structure to avoid
confusion with Node data which is also called metadata)
2014-07-21 23:54:23 +02:00
reger
a2cb366b25 Combine /heuristic search modifier with opensearch configured targets
- with search modifier /heuristic a request is send to all configured opensearch target systems (old /heuristic/blekko modifier not longer valid)
- this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends a osd request on all global searches
- the index.html searchoption text adjusted to be displayed only if option configured
- add Archive-It to predefined systems
2014-07-20 00:00:43 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method that the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just just toNormalform(false) instead.
2014-07-18 12:43:01 +02:00