Commit Graph

1085 Commits

Author SHA1 Message Date
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and downloadh
ttp://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
reger
0c97cc2440 skip unused call parameter for hashSentence() 2014-11-30 19:42:33 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
Michael Peter Christen
ff728b4aa5 ignore url errors during search 2014-11-27 20:50:55 +01:00
Michael Peter Christen
8317914ce3 changed vocabulary navigator object type to TreeMap to get a specific
order into the vocabularies. This is now lexicographic which is not so
much random as a hashed order
2014-11-27 07:44:41 +01:00
Michael Peter Christen
041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-25 09:48:48 +01:00
Michael Peter Christen
30276a2b48 prevent that a local Solr search and a local RWI search are running
concurrently. When a RWI search result is flushed into the result set,
id does Solr Queries (which replaced the old-style Metadata Queries) and
they are possibly running concurrently to a previously startet Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search results may result in
better results using the more advanced and configurable Ranking methods,
this result is preverred over the RWI search result. However, remote RWI
search results are still feeded concurrently into the search result as
well.
2014-11-24 20:53:19 +01:00
reger
1e7ee72240 fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
2014-11-23 23:29:20 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded
   (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
reger
de56266bcb remove redundant toLower for topwords 2014-11-22 22:49:23 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
reger
ef5dc68313 include domtype to searcheventcache id
to differenciate between local / global events for reuse of cached events 
fix for http://mantis.tokeek.de/view.php?id=493
2014-11-20 02:04:43 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
c67c5c0709 added new solr schema fields which record the occurences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
2014-11-18 15:02:34 +01:00
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This allows also to
make the execution in dependency of available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
Michael Peter Christen
7e1b0b6712 fix for wildcard patch in search queries 2014-11-13 00:59:30 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which attach always the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality and the usage of YaCy as search agent for latest news has
a better quality.
2014-11-13 00:58:58 +01:00
sixcooler
9c6e3a6b1c fix assertation-failure in version-string for Solr-4.10.2 by changing
the assert - hope that is ok
+ add forgotten NB-Projekt-changes
2014-11-07 22:43:50 +01:00
sixcooler
725b206fb4 update to solr-/lucene-4.10.2 2014-11-07 18:51:31 +01:00
Michael Peter Christen
5c97ecb30f fix of bad query generation for search facets 2014-11-07 18:11:49 +01:00
Michael Peter Christen
95d87f00b3 fix for bad query generation in doublecheck in postprocessing 2014-11-07 18:11:23 +01:00
orbiter
5be352da99 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-02 20:35:08 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00
Michael Peter Christen
92007e5d2d more enhancements to posprocessing speed 2014-11-02 12:52:23 +01:00
Michael Peter Christen
9a7fe9e0d1 fix for bad timing computation in postprocessing 2014-10-31 23:17:56 +01:00
Michael Peter Christen
bd16119a00 another fix for postprocessing (the query for "" on numeric field did
not work in external solr)
2014-10-31 17:44:45 +01:00
Michael Peter Christen
327e83bfe7 more fixes in postprocessing: partitioning of the complete queue to
enable smaller queries
2014-10-31 17:30:24 +01:00
orbiter
71758f0d62 enhanced postprocessing by usage of a field-list generation to prevent
lazy initialization of the documents. This is useful because the
documents must be read completely anyway.
2014-10-30 18:05:48 +01:00
Michael Peter Christen
fe537679de fix for exact_signature_unique_b, exact_signature_copycount_i,
fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same
criteria for 'valid document' as for title and description uniqueness
test.
2014-10-24 15:04:40 +02:00
Michael Peter Christen
2e5214eb21 added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
2014-10-17 14:17:49 +02:00
Michael Peter Christen
77662e08e1 concurrently initialize the error cache; extended also the cache by
factor 10 up to 1000 entries. This error cache is only used to catch up
paused crawls between shutdown+startup
2014-10-17 12:45:26 +02:00
Michael Peter Christen
07c5b57953 removed warnings 2014-10-15 11:19:25 +02:00
Michael Peter Christen
2e09da9832 npe fix 2014-10-14 12:48:15 +02:00
Michael Peter Christen
d80418f1b1 added partial updates to solr during postprocessing: during
postprocessing the solr documents are now not completely retrieved.
instead, only fiels, needed for the postprocessing are extracted. When
Solr document are written, this is done using partial updates.

This increases postprocessing speed by about 50% for embedded Solr
configurations. For external Solr configurations the enhancement should
be much higher because the postprocessing with remote Solr is very slow.
When doing partial updates to a remote Solr, this method should perform
much better than before, it is expected that this is even much higher
than the increase with local Solr.
2014-10-14 12:19:59 +02:00
Michael Peter Christen
b1cfbc4a04 added new solr field url_paths_count_i which can be used to enhance the
index browser and maybe also for ranking; possibly also for
SEO-with-YaCy applications.
2014-10-13 23:51:19 +02:00
Michael Peter Christen
30d4402cd1 fixed location search 2014-10-13 14:28:11 +02:00
Michael Peter Christen
8c1a89cb34 added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
2014-10-08 17:12:35 +02:00
Michael Peter Christen
5082feb103 less volume for effect sounds 2014-10-08 15:04:35 +02:00
Michael Peter Christen
0bfc69b29b more ipv6 bugfixes 2014-10-08 12:38:56 +02:00
Michael Peter Christen
883622306e Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/peers/Protocol.java
2014-10-07 23:33:28 +02:00
Michael Peter Christen
0843b12ef3 ipv6 fix: avoid that shrinked own ip set is overwritten with (non-valid)
set of local IPs
2014-10-07 22:36:01 +02:00
orbiter
cddf884bc4 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-07 19:27:14 +02:00
Michael Peter Christen
74957f3760 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-07 17:51:18 +02:00
Michael Peter Christen
2a052f446a Added an experimental audio feedback system.
This is the first element of a new 'decoration' component which may hold
switches for different external appearance parameters.
The first switch in that context is decoration.audio (as usual in
yacy.init). This value is set to false by default, that means the audio
feedback element is switched off by default. To switch it on, set
decoration.audio = true (using /ConfigProperties_p.html). You will then
hear sounds for the following events:
- remote searches
- incoming dht transmissions
- new documents from the crawler
Sound clips are stored in htroot/env/soundclips/ which is done so
because a future implementation will read these files using the http
client and with configurable urls which will make it very easy for the
user to replace the given sounds with own sounds.
2014-10-07 17:51:07 +02:00
Marc Nause
1e6e69bc40 Finished implementation of UPNP:
*) will try other ports if YaCy standard ports are not available
*) distinguish between internal and external port (not sure if this
works 100%)

Still to add: propery in config to enter own external port (in case of
manually configured NAT)
2014-10-07 13:10:06 +02:00
orbiter
f3a12801f0 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-05 14:50:35 +02:00
orbiter
d93325a578 lazy handling of process_sxt field (part of postprocessing) 2014-10-05 14:50:22 +02:00
reger
b5ca20de15 preserve content_type (mime) if supplied in preference of construct in from file type.
(this eventually can benefit image search by using mime only)

reduce redundant field assignment for Solrdocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)
2014-10-03 22:08:07 +02:00
reger
fb1fcc2b03 handle noarchive tag, skip writing page to cache
http://mantis.tokeek.de/view.php?id=44
2014-10-01 04:35:34 +02:00
Michael Peter Christen
3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-09-30 14:54:06 +02:00
Michael Peter Christen
6491270b3a large IPv6 redesign of peer ping methods!
removed preferred IPv4 in start options and added a new field IP6 in
peer seeds which will contain one or more IPv6 addresses. Now every peer
has one or more IP addresses assigned, even several IPv6 addresses are
possible. The peer-ping process must check all given and possible IP
addresses for a backping and return the one IP which was successful when
pinging the peer. The ping-ing peer must be able to recognize which of
the given IPs are available for outside access of the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they had
been used for a single IP only. There is still a large construction site
left in YaCy now where all these deprecated methods must be replaced
with new method calls. The 'extra'-IPs, used by cluster assignment had
been removed since that can be replaced with IPv6 usage in p2p clusters.
All clusters must now use IPv6 if they want an intranet-routing.
2014-09-30 14:53:52 +02:00
reger
8b1ce49ee6 remove unused variable timeout 2014-09-29 02:24:29 +02:00
orbiter
a922b122a3 added a hack to forward solr search results from an external attached
solr to the YaCy built-in solr search servlet. Its not complete and not
fully correct (there is still a utf8 encoding problem) but it is a way
to get easily requests forwarded through YaCy to an external Solr.
2014-09-22 15:28:54 +02:00
Michael Peter Christen
2645dc816a added warning for not well-formed postprocessing queries 2014-09-18 14:36:57 +02:00
Michael Peter Christen
6d3d4c4ea6 changed the concurrent enumeration of query results in such a way that
it is now possible to get the results in two steps:
- first retrieve all IDs as given for a query
- then retieve each document individually

This was necessary for very large result sets where a query may run for
hours and is possibly terminated by a solr-internal timeout. This occurs
regulary during postprocessing and therefore this commit may fix
unwanted postprocessing terminations.
2014-09-17 13:58:55 +02:00
Michael Peter Christen
ad35d9294f added a 'stats' table which records some peer statistics twice every
hour. The table can be shown with
http://localhost:8090/Tables_p.html?table=stats

The entries have the following meaning: 
aM: activeLastMonth
aW: activeLastWeek
aD: activeLastDay
aH: activeLastHour
cC: countConnected (Active Senior)
cD: countDisconnected (Passive Senior)
cP: countPotential (Junior)
cR: count of the RWI entries
cI: size of the index (number of documents)

The entry keys are abbreviated to reduce the space in the table as the
name is written again for every row.

This is the beginning of a 'yacystats' micro-alternative als built-in
function in YaCy. Graphics may follow after some time if enough test
data is available.
2014-09-17 12:54:50 +02:00
reger
8284ea751a catch TimeoutException during ping and do not delete yacy.conf during prereadconfigfile
found a situation after crash (reboot) with existing running semaphore but YaCy not running.
Ping generated exception which finally deleted the conf file (during pre-read procedure)
- change to ping (catch exception solved it)
- additionally removed delete yacy.conf file (if needed we need to make a backup)
2014-09-16 23:14:13 +02:00
reger
ffa7c7116f better fix for NPE in image search
replace 8931e14514
2014-09-16 16:43:17 +02:00
Michael Peter Christen
f1032fb8fe more enhancements to image search in case that a restriction to a single
domain is done
2014-09-16 13:41:01 +02:00
Michael Peter Christen
475125f9d7 hack to get more results when doing a remote site search 2014-09-16 00:13:26 +02:00
Michael Peter Christen
81f9b34da7 increaesed ability ot search for all images on a single server within
the p2p remote search
2014-09-15 20:33:22 +02:00
reger
b5e0f70197 - remove repositoryPath post from ConfigBasic (obsolete)
- remove static snippetComputationTime from ResultEntry (not used)
2014-09-13 03:21:52 +02:00
reger
8931e14514 fix NPE in image search 2014-09-13 00:27:39 +02:00
Michael Peter Christen
1735dbc9d9 enhanced image search: bugfixes and performance enhancements 2014-09-12 16:37:01 +02:00
Michael Peter Christen
ebd0be2cea fixes and speed updates for search process 2014-09-10 14:24:03 +02:00
Michael Peter Christen
7611bf79bd Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1
Conflicts:
	locales/ru.lng
2014-09-10 13:24:49 +02:00
Michael Peter Christen
524bedc00a fixed text in startup tray icon and added shutdown icon during shutdown 2014-09-10 13:19:08 +02:00
Michael Peter Christen
e87dc08c0d set the correct fail time in error docs 2014-09-05 14:46:11 +02:00
Michael Peter Christen
a7dd89c4de changed method to write the citation index: do not catch up references
during document parsing; instead use the same references that would also
be written into the webgraph. That should cause that the webgraph and
the citation index express the exact same semantic.
2014-09-02 13:22:12 +02:00
orbiter
f318d7c285 enhanced date-ordered ranking 2014-09-01 13:01:30 +02:00
reger
a6891ff7f8 fix Querygoal.parse exception on +/-null-term
covers http://mantis.tokeek.de/view.php?id=452
2014-09-01 00:16:26 +02:00
orbiter
a65df4ce7e do not push noindex errors into log if in intranet mode. noindex
attributes are attached to artificial constructed index.html files which
list directories. Such files are naturally rejected by the crawler and
should not appear in the error log because these files are part of the
construction of file crawlers and confuse users if they see them in the
error log.
2014-08-27 00:10:51 +02:00
Marc Nause
2af56fa37d Improved UPnP. (still not perfect)
*) set HTTPS port if enabled
*) improved data structures (may not be final)
*) moved UPnP to own package
2014-08-26 22:47:13 +02:00
orbiter
d68438c3d9 make sure that the postprocessing background thread never dies by any
exception
2014-08-23 10:35:38 +02:00
reger
e88537522d allow single quote " ' " in query
see http://mantis.tokeek.de/view.php?id=379
-add QueryGoal test case for this
2014-08-16 14:29:52 +02:00
orbiter
487021fb0a snippet computation update 2014-08-15 01:17:11 +02:00
orbiter
927aaa95a6 concurrency bugfix 2014-08-13 00:59:11 +02:00
reger
7584352e7b use more predefined Solr query parameter constants
- use CommonParams and DisMaxParams constants
- fix typo in get sort parameter
- getDocumentCountByParams redundant implementation and risk of not optimized call (row parameter unspecified) -> as only used from getCountByQuery removed from interface
2014-08-10 22:33:10 +02:00
reger
f9db5dd6c5 reduce doublecontent check document (prevent out of memory)
see http://mantis.tokeek.de/view.php?id=437

test result (concurrency=7)
2000 docs = eom always
1000 docs = eom always
100 docs = eom never

chosen -> 200 docs (eom not encountered during test with 1GB mem setting)
2014-08-10 03:18:15 +02:00
reger
a8508417d1 catch NPE during crawl (OAI import)
- condenseDocument mime=null (allowed)
- collectionconfiguration responseheader = null (allowed)
2014-08-08 00:02:59 +02:00
Michael Peter Christen
6344718f8b reducing the concurrent query stack size and reduced concurrency of
postprocessing to avoid OOM situations
2014-08-06 12:36:59 +02:00
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
191ec8c82a added concurrency to postprocess rewrite process 2014-08-04 15:28:58 +02:00
Michael Peter Christen
a1e8bdd5e9 log ppm instead of docs/second 2014-08-04 14:44:42 +02:00
Michael Peter Christen
cc0ded7abd set process type of web graph according to fields as defined in the
schema
2014-08-04 14:44:20 +02:00
Michael Peter Christen
12fb9d7cd1 log postprocessing constraints in case that postprocessing is not
performed
2014-08-04 14:19:37 +02:00
Michael Peter Christen
338f574bdc no sorting if http/www unique fields are not demanded (makes query
faster) and some code restrucuring
2014-08-04 12:59:38 +02:00
Michael Peter Christen
0ceeceb35e more logic on Solr queries; usage of the query terms in posprocessing,
saving one query for double document detection now per document
2014-08-04 02:35:38 +02:00
orbiter
4099296b45 added new classes which shall reduce call overhead to Solr (stub) 2014-08-03 22:44:22 +02:00
orbiter
3491ab4c38 removed unused images from webgraph edge computation 2014-08-01 13:21:16 +02:00
orbiter
2371d6b8db target linktexts must be string to enable search facets on these fields 2014-08-01 13:20:25 +02:00
Michael Peter Christen
001e05bb80 do not store failure of loading of robots.txt into the index as a fail
document
2014-08-01 12:15:14 +02:00
Michael Peter Christen
05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-01 12:04:25 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
orbiter
738989aab7 reverted commit f94c91315b because the
webgraph has not enough performance for that
2014-07-29 18:49:42 +02:00
Michael Peter Christen
c115f3869c enhanced snippet computation and test method in ViewFile 2014-07-28 15:42:57 +02:00
orbiter
1027f3d04a fix for the usage of ready-prepared solr queries, some queries are
formulated as edismax query but this was not set as query attribut. The
defType=edismax property needs a qf-field, so this was added as well. Do
not remove that field again! This fixes also a problem with title-unique
computation.
2014-07-25 18:53:13 +02:00