1217 Commits

Author SHA1 Message Date
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
reger
42b0672be3 Let auto-disabled crawls recover if low resource condition vanished.
Analogous to the auto-disabled DHT, auto-disabled crawls are switched back on once memory is ok again;
the auto-disable is remembered in a conf parameter.
2015-01-24 01:53:58 +01:00
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager, which hands search heuristics to external systems via specific FederateSearchConnectors.
These provide the query() functionality, the translation to the YaCy schema via .toYaCySchema(), and the search() routine that delivers results to search events, which is generally implemented in the abstract connector.
The manager now enforces a minimum delay of 15s between calls to external systems.
Besides the OpensearchConnector, a SolrFederateSearchConnector is available. It uses an additional config file for field name translation.

default heuristicopensearch.conf: 
- openbdb.com removed - no longer seems to deliver results
- config via solrconnector to datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
Michael Peter Christen
3b51636ecb fix for mediawiki import 2015-01-12 00:35:47 +01:00
Michael Peter Christen
8cafdb989a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-09 11:00:02 +01:00
reger
66839f73fa remove debug limit from commit before 2015-01-09 02:52:18 +01:00
reger
4214f250d0 Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon.
Intended for searches/research projects with insufficient results from local and DHT-selected remote target peers.

Function: the process checks newly created bookmarks for description starting with "query=..." and takes this to ask every peer for 20 search results and adds it to the local index in a background job.
link to start/stop the process added to /Bookmarks.html
2015-01-09 02:06:30 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00
reger
4eb89d7f15 revert clickservlet
(the default was indeed a mistake)
2015-01-05 09:10:20 +01:00
reger
d44d8996d0 Added a “don't store remote search results” option
This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. 
The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store-to-index switch still receives and holds the metadata according to DHT rules).
The downside for the local peer is that search speed will not improve if search terms are only available remotely or through quick hits in the local index.

To be able to improve the local index a Click-Servlet option was added additionally.
If switched on, all search result links point to this servlet, which forwards the users browser (by html header) to the desired page and feeds the page to the fulltext-index.
The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks)

The option check-boxes are placed in ConfigPortal.html
2015-01-04 11:10:45 +01:00
Michael Peter Christen
d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
for crawling
2015-01-02 02:44:03 +01:00
Michael Peter Christen
ecb6a59e9e do not translate gif images into png images for thumbnails. Instead,
stream the original to the search result thumb viewer. This has two
reasons:
- animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a
known bug which is obviously not yet fixed
- animated gifs now appear in the search result also as animation
2014-12-28 14:53:55 +01:00
reger
73ba5d8ef7 adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema
- the field is not used (delete candidate)
2014-12-26 18:21:35 +01:00
Michael Peter Christen
eb78388a98 changed prefer strategy for http unique in such a way that http is
preferred over https. While this is a bad idea from the standpoint of
security it is more commonly applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
2014-12-21 19:17:06 +01:00
Michael Peter Christen
aaf7d4775a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-21 18:10:25 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute to the url get/post properties.
This will distinguish the different page entries. The search result list
will then replace the post parameter with a url anchor # mark which
causes that the original url is presented in the search result. These
URLs can be opened directly on the correct page using pdf.js which is
now built-in into firefox. That means: if you find a search hit on page
5 and click on the search result, firefox will open the pdf viewer and
show page 5.
2014-12-21 18:10:15 +01:00
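
The URL handling described in commit 8c3e5b7b6d can be sketched roughly as follows. This is only an illustration, not the YaCy implementation: the class and method names are hypothetical, and the exact anchor form `#page=<n>` is an assumption based on how pdf.js addresses pages.

```java
// Hypothetical helper, not part of YaCy: illustrates the per-page URL idea.
final class PdfPageUrls {

    /** Appends a page=<n> query parameter to the source URL (index-side URL). */
    static String pageUrl(String sourceUrl, int page) {
        String separator = sourceUrl.contains("?") ? "&" : "?";
        return sourceUrl + separator + "page=" + page;
    }

    /**
     * Rewrites a trailing page=<n> parameter into an anchor (assumed form: #page=<n>)
     * so that the original document URL is what the search result presents.
     */
    static String displayUrl(String pageUrl) {
        return pageUrl.replaceFirst("[?&]page=(\\d+)$", "#page=$1");
    }
}
```

With these assumptions, page 5 of `http://example.org/doc.pdf` would be indexed as `http://example.org/doc.pdf?page=5` and shown as `http://example.org/doc.pdf#page=5`.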
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and to not get confused with Collection.size())
2014-12-21 06:05:35 +01:00
Michael Peter Christen
d3e71ed070 fixes for searches when initialization of large autotagging libraries
have not been finished
2014-12-19 17:38:58 +01:00
Michael Peter Christen
28683530cd fixes to usage of no-cache: use and recognize also the no-store
directive
2014-12-19 17:37:58 +01:00
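
As a minimal sketch of what commit 28683530cd describes, a Cache-Control check that treats `no-store` like `no-cache` might look like this. The class and method names are hypothetical and the actual YaCy code may be structured differently.

```java
import java.util.Locale;

// Hypothetical helper, not part of YaCy.
final class CacheControlCheck {

    /** Returns true if the Cache-Control value contains a no-cache or no-store directive. */
    static boolean forbidsCaching(String cacheControl) {
        if (cacheControl == null) return false;
        for (String directive : cacheControl.split(",")) {
            String name = directive.trim().toLowerCase(Locale.ROOT);
            // drop an optional argument, e.g. no-cache="set-cookie"
            int eq = name.indexOf('=');
            if (eq >= 0) name = name.substring(0, eq).trim();
            if (name.equals("no-cache") || name.equals("no-store")) return true;
        }
        return false;
    }
}
```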
reger
13cca2b114 fix missing AppPath
upd Maven plugin versionid
2014-12-19 01:58:37 +01:00
Michael Peter Christen
65125439fe added query modifier 'on'. This makes it possible to search for date
occurrences within the (web) page documents (not the document
last-modified!). This works only if the solr field dates_in_content_sxt
is enabled. A search request may then have the form "term on:<date>",
like
gift on:24.12.2014
gift on:2014/12/24
* on:2014/12/31
For the date format you may use any kind of human-readable date
representation(!yes!) - the on:<date> parser tries to identify language
and also knows event names, like:
bunny on:eastern
.. as long as the date term has no spaces inside (use a dot). Further
enhancement will be made to accept also strings encapsulated with
quotes.
2014-12-16 13:53:12 +01:00
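
The query form "term on:<date>" from commit 65125439fe could be split as in the sketch below. It only extracts the raw date term and relies on the stated rule that the term contains no spaces; the class name is hypothetical and the date interpretation itself (languages, event names) is not shown.

```java
// Hypothetical helper, not part of YaCy: extracts the raw term after "on:".
final class OnModifier {

    /** Returns the date term following "on:", or null if the query has no such modifier. */
    static String dateTerm(String query) {
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith("on:") && token.length() > 3) {
                return token.substring(3);
            }
        }
        return null;
    }
}
```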
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
66b5a56976 Added and integrated new date detection class which can identify date
notions within the fulltext of a document. This class attempts to
identify also dates given abbreviated or with missing year or described
with names for special days, like 'Halloween'. In case that a date has
no year given, the current year and following years are considered.

This process is therefore able to identify a large set of dates to a
document, either because there are several dates given in the document
or the date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, that may also be possibly in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect the date is yet too CPU intensive. Maybe
future enhancements will cause that this is switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
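
The relation between the four fields from commit 66b5a56976 can be illustrated with a small sketch. It assumes the dates have already been detected and are given in order of appearance; the class is hypothetical and only mirrors the field descriptions above.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical value class, not part of YaCy: derives the count/min/max fields.
final class DatesInContent {
    final List<LocalDate> dates; // dates_in_content_sxt (order of appearance)
    final int count;             // dates_in_content_count_i
    final LocalDate min;         // date_in_content_min_dt (oldest), null if no dates
    final LocalDate max;         // date_in_content_max_dt (youngest), null if no dates

    DatesInContent(List<LocalDate> found) {
        this.dates = Collections.unmodifiableList(new ArrayList<>(found));
        this.count = this.dates.size();
        this.min = this.dates.isEmpty() ? null : Collections.min(this.dates);
        this.max = this.dates.isEmpty() ? null : Collections.max(this.dates);
    }
}
```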
Michael Peter Christen
6a1865f507 refactoring date -> lastModified 2014-12-11 23:37:41 +01:00
Michael Peter Christen
7bfc5b80cb added new options to vocabulary editor:
- new switch 'isFacet' which enables or disables the usage of the vocabulary for
search facets. This should be used for large
vocabularies since searches in solr are extremely slow if facets for a
large set of alternative terms are generated
- new option to disable auto-enrichment from synonyms
- new option to add synonyms from another column when importing from csv
- automatically recognize double-occurrences in synonyms and bundling
terms for such synonyms
2014-12-10 12:20:27 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
reger
5f0bb1214f modified FieldReIndex to reindex queries with low number of documents first
by internally using a score map with the number of documents as score
and working through the list from low to high.
2014-12-07 04:31:09 +01:00
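
The low-to-high ordering from commit 5f0bb1214f can be sketched with plain collections. This does not use YaCy's score map class; the names are hypothetical and the map simply stands in for "query -> number of matching documents".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of YaCy: orders reindex queries by document count.
final class ReindexOrder {

    /** Returns the queries sorted by their document count, lowest count first. */
    static List<String> lowToHigh(Map<String, Long> docCountByQuery) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(docCountByQuery.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<String> order = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries) {
            order.add(entry.getKey());
        }
        return order;
    }
}
```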
reger
e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory 2014-12-06 02:25:24 +01:00
reger
70cf7060a4 coding fixes suggested in
http://mantis.tokeek.de/view.php?id=509
http://mantis.tokeek.de/view.php?id=510
2014-12-06 01:42:24 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and (stored) metadata info are similar
- to achieve this, use a request with a profile that allows indexing (defaultglobaltext) and update the index
   (the resource is loaded and parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameter 
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
60f27bdf49 added the property timeoutrequests to configuration to disable
TimeoutRequests. The purpose is to test if YaCy runs better on VMs where
there is a limitation of concurrent processes;  see
/proc/user_beancounters in row numproc; this value is limited and should
be low. Try to set timeoutrequests to keep this low. (works only after
restart)
2014-12-01 15:20:10 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
reger
0c97cc2440 skip unused call parameter for hashSentence() 2014-11-30 19:42:33 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
Michael Peter Christen
ff728b4aa5 ignore url errors during search 2014-11-27 20:50:55 +01:00
Michael Peter Christen
8317914ce3 changed vocabulary navigator object type to TreeMap to get a specific
order into the vocabularies. This is now lexicographic, which is less
random than a hashed order
2014-11-27 07:44:41 +01:00
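
The effect described in commit 8317914ce3 follows from standard `java.util` behaviour: a `TreeMap` iterates its keys in natural (lexicographic) order, while a `HashMap` gives no ordering guarantee. A minimal sketch, with a hypothetical class name and an `Integer` value type standing in for the navigator objects:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of YaCy.
final class VocabularyOrder {

    /** Returns the vocabulary names in lexicographic order, regardless of the input map type. */
    static List<String> orderedNames(Map<String, Integer> navigators) {
        return new ArrayList<>(new TreeMap<String, Integer>(navigators).keySet());
    }
}
```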
Michael Peter Christen
041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-25 09:48:48 +01:00
Michael Peter Christen
30276a2b48 prevent that a local Solr search and a local RWI search are running
concurrently. When a RWI search result is flushed into the result set,
it does Solr Queries (which replaced the old-style Metadata Queries) and
they are possibly running concurrently to a previously started Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search results may result in
better results using the more advanced and configurable Ranking methods,
this result is preferred over the RWI search result. However, remote RWI
search results are still fed concurrently into the search result as
well.
2014-11-24 20:53:19 +01:00
reger
1e7ee72240 fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
2014-11-23 23:29:20 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded
   (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
reger
de56266bcb remove redundant toLower for topwords 2014-11-22 22:49:23 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
reger
ef5dc68313 include domtype to searcheventcache id
to differentiate between local / global events for reuse of cached events
fix for http://mantis.tokeek.de/view.php?id=493
2014-11-20 02:04:43 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
c67c5c0709 added new solr schema fields which record the occurrences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
2014-11-18 15:02:34 +01:00
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This allows also to
make the execution in dependency of available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
Michael Peter Christen
7e1b0b6712 fix for wildcard patch in search queries 2014-11-13 00:59:30 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which attach always the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality and the usage of YaCy as search agent for latest news has
a better quality.
2014-11-13 00:58:58 +01:00
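
The first-seen rule from commit 0a879c98e7 can be sketched as follows. An in-memory map stands in for the 'firstSeen' database table, and all names are hypothetical; the sketch only shows the date selection, not the persistence.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of YaCy: an in-memory stand-in for the firstSeen table.
final class FirstSeen {
    private final Map<String, Long> firstSeenByUrl = new HashMap<>();

    /**
     * Records when the URL was first seen (only once) and returns the date to index:
     * the first-seen time if it lies before the document's reported date, else the reported date.
     */
    long effectiveDate(String url, long documentDateMillis, long nowMillis) {
        Long firstSeen = firstSeenByUrl.putIfAbsent(url, nowMillis);
        if (firstSeen == null) firstSeen = nowMillis; // first crawl of this URL
        return Math.min(firstSeen, documentDateMillis);
    }
}
```

A CMS that stamps every page with the current date would report a newer date on each recrawl, and under this rule the indexed date stays at the first-seen time.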
sixcooler
9c6e3a6b1c fix assertion failure in version string for Solr-4.10.2 by changing
the assert - hope that is ok
+ add forgotten NB project changes
2014-11-07 22:43:50 +01:00
sixcooler
725b206fb4 update to solr-/lucene-4.10.2 2014-11-07 18:51:31 +01:00
Michael Peter Christen
5c97ecb30f fix of bad query generation for search facets 2014-11-07 18:11:49 +01:00
Michael Peter Christen
95d87f00b3 fix for bad query generation in doublecheck in postprocessing 2014-11-07 18:11:23 +01:00
orbiter
5be352da99 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-02 20:35:08 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00
Michael Peter Christen
92007e5d2d more enhancements to postprocessing speed 2014-11-02 12:52:23 +01:00
Michael Peter Christen
9a7fe9e0d1 fix for bad timing computation in postprocessing 2014-10-31 23:17:56 +01:00
Michael Peter Christen
bd16119a00 another fix for postprocessing (the query for "" on numeric field did
not work in external solr)
2014-10-31 17:44:45 +01:00
Michael Peter Christen
327e83bfe7 more fixes in postprocessing: partitioning of the complete queue to
enable smaller queries
2014-10-31 17:30:24 +01:00
orbiter
71758f0d62 enhanced postprocessing by usage of a field-list generation to prevent
lazy initialization of the documents. This is useful because the
documents must be read completely anyway.
2014-10-30 18:05:48 +01:00
Michael Peter Christen
fe537679de fix for exact_signature_unique_b, exact_signature_copycount_i,
fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same
criteria for 'valid document' as for title and description uniqueness
test.
2014-10-24 15:04:40 +02:00
Michael Peter Christen
2e5214eb21 added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
2014-10-17 14:17:49 +02:00
Michael Peter Christen
77662e08e1 concurrently initialize the error cache; extended also the cache by
factor 10 up to 1000 entries. This error cache is only used to catch up
paused crawls between shutdown+startup
2014-10-17 12:45:26 +02:00
Michael Peter Christen
07c5b57953 removed warnings 2014-10-15 11:19:25 +02:00
Michael Peter Christen
2e09da9832 npe fix 2014-10-14 12:48:15 +02:00
Michael Peter Christen
d80418f1b1 added partial updates to solr during postprocessing: during
postprocessing the solr documents are now not completely retrieved.
Instead, only the fields needed for the postprocessing are extracted. When
Solr documents are written, this is done using partial updates.

This increases postprocessing speed by about 50% for embedded Solr
configurations. For external Solr configurations the enhancement should
be much higher because the postprocessing with remote Solr is very slow.
When doing partial updates to a remote Solr, this method should perform
much better than before, it is expected that this is even much higher
than the increase with local Solr.
2014-10-14 12:19:59 +02:00
Michael Peter Christen
b1cfbc4a04 added new solr field url_paths_count_i which can be used to enhance the
index browser and maybe also for ranking; possibly also for
SEO-with-YaCy applications.
2014-10-13 23:51:19 +02:00
Michael Peter Christen
30d4402cd1 fixed location search 2014-10-13 14:28:11 +02:00
Michael Peter Christen
8c1a89cb34 added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
2014-10-08 17:12:35 +02:00
Michael Peter Christen
5082feb103 less volume for effect sounds 2014-10-08 15:04:35 +02:00
Michael Peter Christen
0bfc69b29b more ipv6 bugfixes 2014-10-08 12:38:56 +02:00
Michael Peter Christen
883622306e Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/peers/Protocol.java
2014-10-07 23:33:28 +02:00
Michael Peter Christen
0843b12ef3 ipv6 fix: avoid that the shrunk own ip set is overwritten with (non-valid)
set of local IPs
2014-10-07 22:36:01 +02:00
orbiter
cddf884bc4 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-07 19:27:14 +02:00
Michael Peter Christen
74957f3760 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-07 17:51:18 +02:00
Michael Peter Christen
2a052f446a Added an experimental audio feedback system.
This is the first element of a new 'decoration' component which may hold
switches for different external appearance parameters.
The first switch in that context is decoration.audio (as usual in
yacy.init). This value is set to false by default, that means the audio
feedback element is switched off by default. To switch it on, set
decoration.audio = true (using /ConfigProperties_p.html). You will then
hear sounds for the following events:
- remote searches
- incoming dht transmissions
- new documents from the crawler
Sound clips are stored in htroot/env/soundclips/ which is done so
because a future implementation will read these files using the http
client and with configurable urls which will make it very easy for the
user to replace the given sounds with own sounds.
2014-10-07 17:51:07 +02:00
Marc Nause
1e6e69bc40 Finished implementation of UPNP:
*) will try other ports if YaCy standard ports are not available
*) distinguish between internal and external port (not sure if this
works 100%)

Still to add: property in config to enter own external port (in case of
manually configured NAT)
2014-10-07 13:10:06 +02:00
orbiter
f3a12801f0 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-05 14:50:35 +02:00
orbiter
d93325a578 lazy handling of process_sxt field (part of postprocessing) 2014-10-05 14:50:22 +02:00
reger
b5ca20de15 preserve content_type (mime) if supplied, in preference to constructing it from the file type.
(this may eventually benefit image search by using mime only)

reduce redundant field assignment for Solrdocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)
2014-10-03 22:08:07 +02:00
reger
fb1fcc2b03 handle noarchive tag, skip writing page to cache
http://mantis.tokeek.de/view.php?id=44
2014-10-01 04:35:34 +02:00
Michael Peter Christen
3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-09-30 14:54:06 +02:00
Michael Peter Christen
6491270b3a large IPv6 redesign of peer ping methods!
removed preferred IPv4 in start options and added a new field IP6 in
peer seeds which will contain one or more IPv6 addresses. Now every peer
has one or more IP addresses assigned, even several IPv6 addresses are
possible. The peer-ping process must check all given and possible IP
addresses for a backping and return the one IP which was successful when
pinging the peer. The ping-ing peer must be able to recognize which of
the given IPs are available for outside access of the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they had
been used for a single IP only. There is still a large construction site
left in YaCy now where all these deprecated methods must be replaced
with new method calls. The 'extra'-IPs, used by cluster assignment had
been removed since that can be replaced with IPv6 usage in p2p clusters.
All clusters must now use IPv6 if they want an intranet-routing.
2014-09-30 14:53:52 +02:00
reger
8b1ce49ee6 remove unused variable timeout 2014-09-29 02:24:29 +02:00
orbiter
a922b122a3 added a hack to forward solr search results from an external attached
solr to the YaCy built-in solr search servlet. It's not complete and not
fully correct (there is still a utf8 encoding problem) but it is a way
to get easily requests forwarded through YaCy to an external Solr.
2014-09-22 15:28:54 +02:00
Michael Peter Christen
2645dc816a added warning for not well-formed postprocessing queries 2014-09-18 14:36:57 +02:00
Michael Peter Christen
6d3d4c4ea6 changed the concurrent enumeration of query results in such a way that
it is now possible to get the results in two steps:
- first retrieve all IDs as given for a query
- then retrieve each document individually

This was necessary for very large result sets where a query may run for
hours and is possibly terminated by a solr-internal timeout. This occurs
regularly during postprocessing and therefore this commit may fix
unwanted postprocessing terminations.
2014-09-17 13:58:55 +02:00
Michael Peter Christen
ad35d9294f added a 'stats' table which records some peer statistics twice every
hour. The table can be shown with
http://localhost:8090/Tables_p.html?table=stats

The entries have the following meaning: 
aM: activeLastMonth
aW: activeLastWeek
aD: activeLastDay
aH: activeLastHour
cC: countConnected (Active Senior)
cD: countDisconnected (Passive Senior)
cP: countPotential (Junior)
cR: count of the RWI entries
cI: size of the index (number of documents)

The entry keys are abbreviated to reduce the space in the table as the
name is written again for every row.

This is the beginning of a 'yacystats' micro-alternative as a built-in
function in YaCy. Graphics may follow after some time if enough test
data is available.
2014-09-17 12:54:50 +02:00
reger
8284ea751a catch TimeoutException during ping and do not delete yacy.conf during prereadconfigfile
found a situation after crash (reboot) with existing running semaphore but YaCy not running.
Ping generated exception which finally deleted the conf file (during pre-read procedure)
- change to ping (catch exception solved it)
- additionally removed delete yacy.conf file (if needed we need to make a backup)
2014-09-16 23:14:13 +02:00
reger
ffa7c7116f better fix for NPE in image search
replace 8931e14514
2014-09-16 16:43:17 +02:00
Michael Peter Christen
f1032fb8fe more enhancements to image search in case that a restriction to a single
domain is done
2014-09-16 13:41:01 +02:00
Michael Peter Christen
475125f9d7 hack to get more results when doing a remote site search 2014-09-16 00:13:26 +02:00
Michael Peter Christen
81f9b34da7 increased ability to search for all images on a single server within
the p2p remote search
2014-09-15 20:33:22 +02:00
reger
b5e0f70197 - remove repositoryPath post from ConfigBasic (obsolete)
- remove static snippetComputationTime from ResultEntry (not used)
2014-09-13 03:21:52 +02:00
reger
8931e14514 fix NPE in image search 2014-09-13 00:27:39 +02:00
Michael Peter Christen
1735dbc9d9 enhanced image search: bugfixes and performance enhancements 2014-09-12 16:37:01 +02:00
Michael Peter Christen
ebd0be2cea fixes and speed updates for search process 2014-09-10 14:24:03 +02:00
Michael Peter Christen
7611bf79bd Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1
Conflicts:
	locales/ru.lng
2014-09-10 13:24:49 +02:00
Michael Peter Christen
524bedc00a fixed text in startup tray icon and added shutdown icon during shutdown 2014-09-10 13:19:08 +02:00
Michael Peter Christen
e87dc08c0d set the correct fail time in error docs 2014-09-05 14:46:11 +02:00
Michael Peter Christen
a7dd89c4de changed method to write the citation index: do not catch up references
during document parsing; instead use the same references that would also
be written into the webgraph. That should cause that the webgraph and
the citation index express the exact same semantic.
2014-09-02 13:22:12 +02:00
orbiter
f318d7c285 enhanced date-ordered ranking 2014-09-01 13:01:30 +02:00
reger
a6891ff7f8 fix Querygoal.parse exception on +/-null-term
covers http://mantis.tokeek.de/view.php?id=452
2014-09-01 00:16:26 +02:00
orbiter
a65df4ce7e do not push noindex errors into log if in intranet mode. noindex
attributes are attached to artificial constructed index.html files which
list directories. Such files are naturally rejected by the crawler and
should not appear in the error log because these files are part of the
construction of file crawlers and confuse users if they see them in the
error log.
2014-08-27 00:10:51 +02:00
Marc Nause
2af56fa37d Improved UPnP. (still not perfect)
*) set HTTPS port if enabled
*) improved data structures (may not be final)
*) moved UPnP to own package
2014-08-26 22:47:13 +02:00
orbiter
d68438c3d9 make sure that the postprocessing background thread never dies by any
exception
2014-08-23 10:35:38 +02:00
reger
e88537522d allow single quote " ' " in query
see http://mantis.tokeek.de/view.php?id=379
-add QueryGoal test case for this
2014-08-16 14:29:52 +02:00
orbiter
487021fb0a snippet computation update 2014-08-15 01:17:11 +02:00
orbiter
927aaa95a6 concurrency bugfix 2014-08-13 00:59:11 +02:00
reger
7584352e7b use more predefined Solr query parameter constants
- use CommonParams and DisMaxParams constants
- fix typo in get sort parameter
- getDocumentCountByParams redundant implementation and risk of not optimized call (row parameter unspecified) -> as only used from getCountByQuery removed from interface
2014-08-10 22:33:10 +02:00
reger
f9db5dd6c5 reduce doublecontent check document (prevent out of memory)
see http://mantis.tokeek.de/view.php?id=437

test result (concurrency=7)
2000 docs = eom always
1000 docs = eom always
100 docs = eom never

chosen -> 200 docs (eom not encountered during test with 1GB mem setting)
2014-08-10 03:18:15 +02:00
reger
a8508417d1 catch NPE during crawl (OAI import)
- condenseDocument mime=null (allowed)
- collectionconfiguration responseheader = null (allowed)
2014-08-08 00:02:59 +02:00
Michael Peter Christen
6344718f8b reducing the concurrent query stack size and reduced concurrency of
postprocessing to avoid OOM situations
2014-08-06 12:36:59 +02:00
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
191ec8c82a added concurrency to postprocess rewrite process 2014-08-04 15:28:58 +02:00
Michael Peter Christen
a1e8bdd5e9 log ppm instead of docs/second 2014-08-04 14:44:42 +02:00
Michael Peter Christen
cc0ded7abd set process type of web graph according to fields as defined in the
schema
2014-08-04 14:44:20 +02:00
Michael Peter Christen
12fb9d7cd1 log postprocessing constraints in case that postprocessing is not
performed
2014-08-04 14:19:37 +02:00
Michael Peter Christen
338f574bdc no sorting if http/www unique fields are not demanded (makes query
faster) and some code restructuring
2014-08-04 12:59:38 +02:00
Michael Peter Christen
0ceeceb35e more logic on Solr queries; usage of the query terms in postprocessing,
saving one query for double document detection now per document
2014-08-04 02:35:38 +02:00
orbiter
4099296b45 added new classes which shall reduce call overhead to Solr (stub) 2014-08-03 22:44:22 +02:00
orbiter
3491ab4c38 removed unused images from webgraph edge computation 2014-08-01 13:21:16 +02:00
orbiter
2371d6b8db target linktexts must be string to enable search facets on these fields 2014-08-01 13:20:25 +02:00
Michael Peter Christen
001e05bb80 do not store failure of loading of robots.txt into the index as a fail
document
2014-08-01 12:15:14 +02:00
Michael Peter Christen
05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-01 12:04:25 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
orbiter
738989aab7 reverted commit f94c91315b because the
webgraph has not enough performance for that
2014-07-29 18:49:42 +02:00
Michael Peter Christen
c115f3869c enhanced snippet computation and test method in ViewFile 2014-07-28 15:42:57 +02:00
orbiter
1027f3d04a fix for the usage of ready-prepared solr queries, some queries are
formulated as edismax query but this was not set as query attribute. The
defType=edismax property needs a qf-field, so this was added as well. Do
not remove that field again! This fixes also a problem with title-unique
computation.
2014-07-25 18:53:13 +02:00
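A minimal sketch of the parameter set this commit message describes, assuming the query is assembled as a plain name/value map. The class name, method name and the field names passed as qf are hypothetical; only the Solr parameter names q, defType and qf are real. The rule that qf must be present is taken from the commit message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not YaCy code: it only shows which parameters travel together.
final class EdismaxParamsSketch {

    // Builds the parameters for an edismax query and refuses to build them without qf,
    // mirroring the "do not remove that field again" warning in the commit message.
    static Map<String, String> params(String query, String queryFields) {
        if (queryFields == null || queryFields.trim().isEmpty()) {
            throw new IllegalArgumentException("defType=edismax is sent together with a qf value");
        }
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", query);
        p.put("defType", "edismax");
        p.put("qf", queryFields);
        return p;
    }

    public static void main(String[] args) {
        // "title" and "text_t" are assumed field names, used for illustration only.
        System.out.println(params("yacy", "title text_t"));
    }
}
```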
Michael Peter Christen
f94c91315b if the webgraph is used, then use it also for reference computation to
avoid contradictions with references_i in the collection index.
2014-07-24 15:35:53 +02:00
Michael Peter Christen
6e1dc444c3 added a snippet test function in ViewFile: you can now search for a
specific word on the document; the servlet returns the snippet in the
same way as it would be shown in a search result.
2014-07-24 14:59:37 +02:00
Michael Peter Christen
b44626e55b fixed target_alt_t in webgraph 2014-07-22 18:24:10 +02:00
Michael Peter Christen
504327b15c fix for condition for writing the webgraph 2014-07-22 00:59:08 +02:00
Michael Peter Christen
542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
filled with the date, when the url is recognized as to be outdated. That
field was partly misinterpreted and the time interval was filled in. In
case that all the urls which are in the index shall be treated as
outdated, the field is filled now with Long.MAX_VALUE because then all
crawl dates are before that date and therefore outdated.
2014-07-22 00:23:17 +02:00
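The changed meaning of crawlingIfOlder can be sketched as follows. This is an illustration under stated assumptions, not the YaCy implementation: the class and method names are hypothetical, and dates are assumed to be epoch milliseconds. The field holds a cut-off date rather than a time interval, and Long.MAX_VALUE makes every load date fall before the cut-off.

```java
// Hypothetical sketch of the cut-off semantics described in the commit message.
final class RecrawlSketch {

    // Returns the cut-off date: documents loaded before it count as outdated.
    // With treatAllAsOutdated the cut-off is Long.MAX_VALUE, so every load date is before it.
    static long outdatedBefore(long nowMillis, long maxAgeMillis, boolean treatAllAsOutdated) {
        if (treatAllAsOutdated) return Long.MAX_VALUE;
        return nowMillis - maxAgeMillis; // a date, not the interval itself
    }

    // A document is outdated if it was loaded before the cut-off date.
    static boolean isOutdated(long loadDateMillis, long cutoffMillis) {
        return loadDateMillis < cutoffMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;
        long cutoff = outdatedBefore(now, 7 * day, false);
        System.out.println(isOutdated(now - 10 * day, cutoff)); // loaded ten days ago
        System.out.println(isOutdated(now - day, cutoff));      // loaded yesterday
        System.out.println(isOutdated(now, outdatedBefore(now, 0, true)));
    }
}
```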
Michael Peter Christen
4eec1a7452 refactoring (change Metadata name of load time data structure to avoid
confusion with Node data which is also called metadata)
2014-07-21 23:54:23 +02:00
reger
f96cfdc84d prevent array out of bound exception on getRankingProfile(x)
on faulty &profileNr=  query parameter
2014-07-21 00:04:54 +02:00
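A guard of the kind this fix implies might look like the sketch below. The names are hypothetical, and falling back to index 0 for a missing, non-numeric or out-of-range profileNr is an assumption; the commit message only says that the exception is prevented.

```java
// Hypothetical sketch: validate a profileNr request parameter before using it as an array index.
final class ProfileIndexSketch {

    // Returns a usable index in [0, profileCount); any faulty input maps to 0 (assumed default).
    static int safeProfileIndex(String param, int profileCount) {
        if (param == null || profileCount <= 0) return 0;
        try {
            int n = Integer.parseInt(param.trim());
            return (n >= 0 && n < profileCount) ? n : 0;
        } catch (NumberFormatException e) {
            return 0; // e.g. an empty "&profileNr=" value
        }
    }

    public static void main(String[] args) {
        System.out.println(safeProfileIndex("2", 3));
        System.out.println(safeProfileIndex("", 3));
        System.out.println(safeProfileIndex("7", 3));
    }
}
```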
reger
a2cb366b25 Combine /heuristic search modifier with opensearch configured targets
- with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid)
- this allows to use opensearch heuristic on individual search requests (in contrast to configuration HEURISTIC_OPENSEARCH=true, which sends an osd request on all global searches)
- the index.html searchoption text adjusted to be displayed only if option configured
- add Archive-It to predefined systems
2014-07-20 00:00:43 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
bf1b6b93e7 do not write CR values to webgraph if no CR values are computed 2014-07-16 18:13:29 +02:00
Michael Peter Christen
d07cdd8c3b added SolrCloud access mode and configuration 2014-07-16 14:57:51 +02:00
Michael Peter Christen
8514bffc22 enhanced postprocessing status report 2014-07-16 14:57:25 +02:00
Michael Peter Christen
b5fc2b63ea removed exist() retrieval functions from error cache and replaced it
with metadata retrieval from connectors directly. This should cause
better usage of the cache. Automatically increase the metadata cache if
more memory is available.
2014-07-11 19:52:25 +02:00
Michael Peter Christen
62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
double-calling of solr
2014-07-11 18:36:04 +02:00
Michael Peter Christen
b5d78ba156 reduced number of solr queries during crawling 2014-07-11 18:05:11 +02:00
Michael Peter Christen
fd87fa1613 removed more unnecessary exist-checks in ErrorCache 2014-07-11 16:48:08 +02:00
Michael Peter Christen
f2b476e08b don't do a double check to solr for failed documents if they are not
written to solr
2014-07-11 16:26:52 +02:00
orbiter
dab9a0786a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-11 04:04:34 +02:00
orbiter
51bf5c85b0 Renamed the transmission cloud to buffer in dispatcher since the name
'cloud' was a bad idea. Changed also the accumulation process for peer
targets so that every dht chunk is not assigned the set of redundant
targets but they are assigned to redundant targets individually. This
enhances the granularity of the target accumulation and should enhance
the efficiency of the process. Finally the dht protocol client was
enriched with the ability to remove the 'accept remote index' flag from
peers or remove peers completely if they do not answer at all.
2014-07-11 04:04:09 +02:00
Michael Peter Christen
fb3dd56b02 fix for processing of noindex flag in http header 2014-07-10 17:13:35 +02:00
Michael Peter Christen
b0d941626f fixed bugs in canonical, robots and title/description unique calculation 2014-07-10 15:40:38 +02:00
reger
d9472d043a cleanup older unused classes 2014-07-10 02:20:01 +02:00
reger
665e12f88e move startup time from old serverCore to switchboard (most used here)
to make servercore eventually obsolete.
2014-07-10 02:17:56 +02:00
reger
336425912a remove unused localSearchThread from SearchEvent 2014-07-10 02:14:03 +02:00
Michael Peter Christen
1092e798a5 fixed double content postprocessing 2014-07-07 19:15:11 +02:00
orbiter
59160984cc timeline performance update 2014-07-03 13:06:29 +02:00
orbiter
2073e69034 fix for long periods in timeline 2014-07-02 11:29:50 +02:00
Michael Peter Christen
09dcdb9b19 update to solr 4.9.0 2014-07-01 16:39:00 +02:00
Michael Peter Christen
1cd4b2e8be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-01 16:06:12 +02:00
Michael Peter Christen
8c52f0651b refactoring of AccessTracker events & timeline fix 2014-07-01 16:06:01 +02:00
reger
431a5f9c4e added test case for TextSnippet,
removed obsolete/unused parameter and reference to MediaSnippet
2014-06-30 05:36:48 +02:00
Michael Peter Christen
5b94a257ce no timeout for large reference collections 2014-06-29 22:26:22 +02:00
Michael Peter Christen
f5b817bac4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-29 22:25:08 +02:00
reger
a5707cd2eb enable proper Author navigator
- author facet is based on omitted author_sxt field
- adjust to make author nav available if the author field exists but keep using author_sxt to construct the facet (why!?)
- add check for querymodifier author in searchevent
2014-06-27 23:05:06 +02:00
Michael Peter Christen
74206a10c7 refactoring 2014-06-27 14:40:36 +02:00
orbiter
fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-27 10:15:37 +02:00
orbiter
c59da9fe7a added access tracker log reader stub 2014-06-27 10:14:36 +02:00
Michael Peter Christen
36e623d8bf enhanced metadata enrichment for media file type search:
- Web servers may now deliver YaCy-specific http header field with a
title and keywords. The new http header fields are:
X-YaCy-Media-Title - to be used for media (image, audio, video) titles
X-YaCy-Media-Keywords - to be used for media (image, audio, video)
keywords
- both fields are written to document fields title and keywords and are
searched also during image search.
- to make the usage of arbitrary http header fields (including these new
fields) possible in the /api/push_p.json servlet, a new POST argument is
also introduced to push http header fields. The new POST attribute is
named "responseHeader-X" (where X is the counter). It is allowed to use
this attribute as multi-attribute several times, each can be filled with
a http header line.
- see /api/push_p.html for examples
2014-06-26 13:02:35 +02:00
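The multi-attribute described above can be sketched as a list of form fields. This is only an illustration: the class and method names are hypothetical, the "Name: value" layout of a header line and the value of the counter are assumptions, and the remaining fields of the push request are not shown here (the commit points to /api/push_p.html for examples).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the same POST attribute name is repeated, one http header line per entry.
final class PushHeaderSketch {

    // Builds the "responseHeader-X" fields for the document with the given counter X.
    static List<Map.Entry<String, String>> headerFields(int counter, String title, String keywords) {
        String name = "responseHeader-" + counter;
        List<Map.Entry<String, String>> fields = new ArrayList<>();
        fields.add(new SimpleEntry<>(name, "X-YaCy-Media-Title: " + title));
        fields.add(new SimpleEntry<>(name, "X-YaCy-Media-Keywords: " + keywords));
        return fields;
    }

    public static void main(String[] args) {
        // Title and keywords are made-up sample values.
        for (Map.Entry<String, String> f : headerFields(0, "Sunset at the lake", "sunset lake")) {
            System.out.println(f.getKey() + " = " + f.getValue());
        }
    }
}
```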
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
orbiter
0bbb5040b8 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-15 12:38:52 +02:00
orbiter
9d5d86cd03 Added filter query options to the ranking servlet /RankingSolr_p.html.
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as actual filter and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true" which uses the
recently introduced fields http_unique_b and www_unique_b which are true
only for one of the alternatives with/without http(s) and with/without
prefix 'www.' in host names.
2014-06-15 12:38:30 +02:00
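How a statically configured filter could be attached to each request is sketched below, assuming it is passed as Solr's standard fq parameter next to the user query in q. The class and method names are hypothetical, and whether YaCy builds the request this way is not stated in the commit message. The sketch uses URLEncoder.encode(String, Charset), which requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: append a fixed filter query to the parameters of every search request.
final class FilterQuerySketch {

    // Returns a URL-encoded parameter string; the filter is left out if none is configured.
    static String solrParams(String userQuery, String filterQuery) {
        StringBuilder sb = new StringBuilder("q=").append(enc(userQuery));
        if (filterQuery != null && !filterQuery.isEmpty()) {
            sb.append("&fq=").append(enc(filterQuery));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(solrParams("yacy", "http_unique_b:true AND www_unique_b:true"));
    }
}
```

Unlike a boost query, a filter of this kind removes non-matching documents from the result instead of moving them to the end of the list.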
Michael Peter Christen
d2151857f1 Added collection navigation:
The collection field (can be filled e.g. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently from switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
furthermore collection names may use disjunctions using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
2014-06-15 12:11:23 +02:00
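The modifier syntax from the example request can be split as in the following sketch. It is not the YaCy query parser: the class and method names are hypothetical, and tokenizing on whitespace is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: separate "collection:<name>|<name>" modifiers from the search terms.
final class CollectionModifierSketch {

    private static final String PREFIX = "collection:";

    // Returns the collection names of all collection modifiers; '|' separates alternatives.
    static List<String> collections(String query) {
        List<String> names = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.startsWith(PREFIX)) continue;
            for (String name : token.substring(PREFIX.length()).split("\\|")) {
                if (!name.isEmpty()) names.add(name);
            }
        }
        return names;
    }

    // Returns the query without its collection modifiers.
    static String searchTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.isEmpty() && !token.startsWith(PREFIX)) terms.add(token);
        }
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        String request = "www collection:user|proxy";
        System.out.println(searchTerms(request)); // prints "www"
        System.out.println(collections(request)); // prints "[user, proxy]"
    }
}
```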
Michael Peter Christen
74c249288a added a push api to make it possible to upload files directly without
crawling to the YaCy indexer. Files are uploaded using POST multipart
requests; multiple file uploads are possible as well. Each file has
attached the file date and mime type which is used to get the right
parser for the submitted data. Also an url is submitted which is
assigned to the document.
The CrawlSwitchboard has a new option for default Crawl Profiles which
are assigned dynamically from the new push interface.
2014-06-12 18:10:07 +02:00
Michael Peter Christen
ba6ffddefc refactoring 2014-06-12 05:23:26 +02:00
Michael Peter Christen
0c324d735c NPE fix for postprocessing without term index 2014-06-04 12:28:28 +02:00
Michael Peter Christen
922979aae1 added option to prefer http over https in unique-protocol ranking 2014-06-02 17:40:56 +02:00
Michael Peter Christen
b3b174e2b8 fixed webgraph postprocessing and status display in Crawler_p servlet 2014-06-02 15:06:38 +02:00
Michael Peter Christen
f23c4142e0 added option to configure a custom user agent within allip networks 2014-06-01 01:02:03 +02:00
Michael Peter Christen
8ad41a882c fixed several problems with postprocessing:
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict on same protocol
- inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-out
2014-05-29 13:24:24 +02:00
Michael Peter Christen
ff5b3ac84d added new fields http_unique_b and www_unique_b which can be used for
ranking to prefer urls containing a www subdomain or using the https
protocol
2014-05-27 15:28:28 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
53948da7d0 tried to make last_modified recognition smarter 2014-05-22 00:28:51 +02:00
Michael Peter Christen
6634b5b737 debug code for index distribution testing 2014-05-21 18:20:16 +02:00
orbiter
97983ba89f fixed generics warnings for generic array instantiation that appeared
after migration to Java 7
2014-05-20 21:50:16 +02:00
sixcooler
830057d788 lower Segment-size (hope to get Segments of 10GB)
see:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034
2014-05-19 17:55:03 +02:00
orbiter
c028ae9b09 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-05-18 21:21:17 +02:00
reger
e31493e139 "Use remote proxy for yacy" has no function, remove option and related config item
see/fix bug http://mantis.tokeek.de/view.php?id=23
http://mantis.tokeek.de/view.php?id=189
2014-05-17 23:36:59 +02:00
orbiter
0d8072aa99 removed warnings 2014-05-13 22:29:05 +02:00
Michael Peter Christen
a1ac4c3b76 automatically clear graphics cache 2014-05-12 15:45:25 +02:00
reger
1432a817dd respect "index media" switched off in CrawlStartExpert.html
fix http://mantis.tokeek.de/view.php?id=64
2014-05-08 22:21:24 +02:00
Michael Peter Christen
4e734815e8 enhanced snippets: remove lines which are identical to the title and
choose longer versions if possible. Prefer the description part.
2014-05-06 16:48:50 +02:00
Michael Peter Christen
e84e07399a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-05-06 14:51:57 +02:00
reger
8a7c68e4c7 content of surrogates/out never accessed (remove)
After import the content is never accessed but may take up a lot of disk space,
also the getLoadedOAIServer (which lists the files in surrogate out) is not used.
Making the surrogate.out obsolete. Removed keeping of xmls after import.
2014-05-04 09:29:07 +02:00
Michael Peter Christen
229f2248b8 added configuration option for maximum load and minimum ram for
postprocessing
2014-04-30 13:26:32 +02:00
orbiter
8e5ce7cd51 fixed a situation where finished crawls had not been detected. 2014-04-23 23:13:07 +02:00
orbiter
ccb1864d55 catch IllegalArgumentException for wrong process types (that is needed
for migrations when new process types are introduced or disappear)
2014-04-22 23:14:05 +02:00
orbiter
4ee4ba1576 fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of
lazy value instantiation of 0-value in crawldepth_i
2014-04-22 19:48:49 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00