Commit Graph

3099 Commits

Author SHA1 Message Date
Michael Peter Christen
ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
status.
2014-10-08 18:48:57 +02:00
Michael Peter Christen
8c1a89cb34 added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
2014-10-08 17:12:35 +02:00
Michael Peter Christen
ee27be3399 misc bugfixes (concurrency, memory protection) 2014-10-08 15:22:29 +02:00
Michael Peter Christen
9b1958e8ca more ipv6 bugfixes 2014-10-08 15:21:49 +02:00
Michael Peter Christen
7817fc50c9 added a high cpu cycle monitor to PerformanceQueues 2014-10-08 15:20:43 +02:00
Michael Peter Christen
5082feb103 less volume for effect sounds 2014-10-08 15:04:35 +02:00
Michael Peter Christen
e8392e2ff2 fix for local search 2014-10-08 13:44:03 +02:00
Michael Peter Christen
0bfc69b29b more ipv6 bugfixes 2014-10-08 12:38:56 +02:00
Michael Peter Christen
a27563e5c3 removed the atmo sound clips because they had been too large 2014-10-07 23:42:41 +02:00
Michael Peter Christen
883622306e Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/peers/Protocol.java
2014-10-07 23:33:28 +02:00
Michael Peter Christen
97995a1dd9 fix for remote search process 2014-10-07 23:30:32 +02:00
Michael Peter Christen
0843b12ef3 ipv6 fix: avoid overwriting the shrunken own IP set with the (non-valid)
set of local IPs
2014-10-07 22:36:01 +02:00
Michael Peter Christen
92c5d97486 fix for bad node flag setting with IPv6 2014-10-07 22:16:18 +02:00
orbiter
c27bad9326 more ipv6 fixes 2014-10-07 20:09:48 +02:00
orbiter
cddf884bc4 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-07 19:27:14 +02:00
Michael Peter Christen
460858fb22 more ipv6 fixes 2014-10-07 18:53:23 +02:00
Michael Peter Christen
5cef88a315 argh.. adding missing java class for latest audio feature 2014-10-07 18:32:39 +02:00
Michael Peter Christen
74957f3760 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-07 17:51:18 +02:00
Michael Peter Christen
2a052f446a Added an experimental audio feedback system.
This is the first element of a new 'decoration' component which may hold
switches for different external appearance parameters.
The first switch in that context is decoration.audio (as usual in
yacy.init). This value is set to false by default, that means the audio
feedback element is switched off by default. To switch it on, set
decoration.audio = true (using /ConfigProperties_p.html). You will then
hear sounds for the following events:
- remote searches
- incoming dht transmissions
- new documents from the crawler
Sound clips are stored in htroot/env/soundclips/ which is done so
because a future implementation will read these files using the http
client and with configurable urls which will make it very easy for the
user to replace the given sounds with own sounds.
2014-10-07 17:51:07 +02:00
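The switch described in commit 2a052f446a can be read as follows. This is a minimal sketch, not YaCy code: the class DecorationAudioSwitch and its methods are hypothetical, and only the key name decoration.audio and its default of false come from the commit message.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper, not part of YaCy; only the key name and default are from the commit. */
class DecorationAudioSwitch {

    /** Returns true only if decoration.audio is explicitly set to true; a missing key means off. */
    static boolean audioEnabled(Properties config) {
        return Boolean.parseBoolean(config.getProperty("decoration.audio", "false").trim());
    }

    /** Parses "key = value" text in java.util.Properties syntax. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(audioEnabled(parse("")));
        System.out.println(audioEnabled(parse("decoration.audio = true")));
    }
}
```

Defaulting to false when the key is absent mirrors the stated behaviour that audio feedback is switched off unless the user enables it.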
Marc Nause
1e6e69bc40 Finished implementation of UPNP:
*) will try other ports if YaCy standard ports are not available
*) distinguish between internal and external port (not sure if this
works 100%)

Still to add: property in config to enter own external port (in case of
manually configured NAT)
2014-10-07 13:10:06 +02:00
Michael Peter Christen
d0358e568b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-06 17:44:38 +02:00
Michael Peter Christen
e1bc768f9d more IPv6 bugfixes 2014-10-06 17:44:27 +02:00
reger
59c6532a65 add link extraction to pdfParser
this extracts clickable links in pdf and adds it to the list of links

include a test case for this function

this is the corrected comment for commit:
aa2e15d846
2014-10-06 04:51:31 +02:00
reger
aa2e15d846 allow url parameter in worktable apicall
allow url=wwwl?param=a&param=b (with ?, & encoded)
fix:  http://mantis.tokeek.de/view.php?id=100

fix double adding of  '&' in MultiProtocolURL.escape()
2014-10-05 20:05:03 +02:00
orbiter
f3a12801f0 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-05 14:50:35 +02:00
orbiter
d93325a578 lazy handling of process_sxt field (part of postprocessing) 2014-10-05 14:50:22 +02:00
Michael Peter Christen
b31db00010 toString fixes 2014-10-05 11:03:57 +02:00
Michael Peter Christen
961f06c0b6 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-05 01:25:12 +02:00
reger
209e0f2fe8 allow url parameter in worktable apicall
allow url=wwwl?param=a&param=b (with ?, & encoded)
fix:  http://mantis.tokeek.de/view.php?id=100

fix double adding of  '&' in MultiProtocolURL.escape()
2014-10-04 04:11:48 +02:00
reger
b5ca20de15 preserve content_type (mime) if supplied, in preference to constructing it from the file type.
(this eventually can benefit image search by using mime only)

reduce redundant field assignment for Solrdocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)
2014-10-03 22:08:07 +02:00
reger
fe9f1c594e fix char encoding parameter in UrlProxy 2014-10-03 08:51:23 +02:00
reger
b0c87d8240 fix image search expand box, cut-off of 2nd capture line height
tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height)

+fix charset parameter in metadataImageParser
+update start errMsgTxt to "java 1.7"
2014-10-03 01:43:05 +02:00
Michael Peter Christen
2c2ed8bf4e typo in javadoc 2014-10-02 09:38:06 +02:00
Michael Peter Christen
528f583d72 ipv6 fixes 2014-10-01 15:32:10 +02:00
Michael Peter Christen
6ee5b4352d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-01 10:21:13 +02:00
Michael Peter Christen
247e626083 IPv6 host parsing bugfixes 2014-10-01 10:21:03 +02:00
reger
fb1fcc2b03 handle noarchive tag, skip writing page to cache
http://mantis.tokeek.de/view.php?id=44
2014-10-01 04:35:34 +02:00
Michael Peter Christen
fe917deb2d when pinging other peers, be able to select the right IP option 2014-10-01 03:47:57 +02:00
Michael Peter Christen
65e6ae52fb IPv6-enhanced Network monitoring page 2014-10-01 03:10:39 +02:00
Michael Peter Christen
3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-09-30 14:54:06 +02:00
Michael Peter Christen
6491270b3a large IPv6 redesign of peer ping methods!
removed preferred IPv4 in start options and added a new field IP6 in
peer seeds which will contain one or more IPv6 addresses. Now every peer
has one or more IP addresses assigned, even several IPv6 addresses are
possible. The peer-ping process must check all given and possible IP
addresses for a backping and return the one IP which was successful when
pinging the peer. The ping-ing peer must be able to recognize which of
the given IPs are available for outside access of the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they had
been used for a single IP only. There is still a large construction site
left in YaCy now where all these deprecated methods must be replaced
with new method calls. The 'extra'-IPs, used by cluster assignment had
been removed since that can be replaced with IPv6 usage in p2p clusters.
All clusters must now use IPv6 if they want an intranet-routing.
2014-09-30 14:53:52 +02:00
reger
eaccce3467 added metadataImageParser for tif and psd (Photoshop) images.
This is a modified genericImageParser adding tif (and psd) support even if java ImageIO plugin for tif is not installed in JDK.
Adds just tif and psd to the available parsers.
Uses the same library to extract metadata, so could eventually be merged with genericImageParser.
All detected metadata are added to the parsed document (potentially some more as with genericImageParser)
2014-09-30 05:04:47 +02:00
reger
a69f5358ff use javax ImageIO getReader to add supported image extension/mime
genericImageParser uses javax ImageIO, supported images depend on available plugins for ImageIO package (this is JDK installation specific). Jpeg, png and gif are available by default. Tif and others only if a plugin is available (in classpath).
Add supported image type dynamically on startup.
2014-09-29 07:42:51 +02:00
reger
8b1ce49ee6 remove unused variable timeout 2014-09-29 02:24:29 +02:00
reger
48aed15c48 skip loader wait cycle on concurrent access in nocache configuration.
In nocache config resource is loaded online, leaving no benefit to wait for a faster cache hit.
2014-09-26 23:49:10 +02:00
Michael Peter Christen
67cd4c37bd activated the new apk parser which was already ready but not included in
the parser initialization. To make the apk parser usable, the handling
of application type links had to be modified. Now all documents which
have not a parser attached are placed to the noload-queue while all
other documents are parsed using the associated parser class. This may
have side effects on other parsers and the display of different file
classes (images, apps, videos).
2014-09-24 13:32:58 +02:00
orbiter
a922b122a3 added a hack to forward solr search results from an external attached
solr to the YaCy built-in solr search servlet. It's not complete and not
fully correct (there is still a utf8 encoding problem) but it is a way
to get easily requests forwarded through YaCy to an external Solr.
2014-09-22 15:28:54 +02:00
Michael Peter Christen
025516f682 fix for crawl limit for number of pages fail 2014-09-20 13:06:46 +02:00
Michael Peter Christen
2645dc816a added warning for not well-formed postprocessing queries 2014-09-18 14:36:57 +02:00
Michael Peter Christen
437ce3b8a0 added internal api for partial updates to Solr 2014-09-18 14:26:45 +02:00
orbiter
3ac31614a3 added option to reverse-sort YaCy tables (internal API change only) 2014-09-18 11:11:09 +02:00
Michael Peter Christen
6d3d4c4ea6 changed the concurrent enumeration of query results in such a way that
it is now possible to get the results in two steps:
- first retrieve all IDs as given for a query
- then retrieve each document individually

This was necessary for very large result sets where a query may run for
hours and is possibly terminated by a solr-internal timeout. This occurs
regularly during postprocessing and therefore this commit may fix
unwanted postprocessing terminations.
2014-09-17 13:58:55 +02:00
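The two-step enumeration from commit 6d3d4c4ea6 can be sketched roughly as below. This is an illustration under stated assumptions, not YaCy's connector API: TwoStepRetrieval, collectIds and loadEach are hypothetical names, and an in-memory map stands in for the Solr index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration; an in-memory map stands in for the Solr index. */
class TwoStepRetrieval {

    /** Step 1: collect only the IDs of documents matching the query term. */
    static List<String> collectIds(Map<String, String> index, String term) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getValue().contains(term)) ids.add(e.getKey());
        }
        return ids;
    }

    /** Step 2: load each document individually by its ID. */
    static List<String> loadEach(List<String> ids, Function<String, String> loadById) {
        List<String> docs = new ArrayList<>();
        for (String id : ids) {
            String doc = loadById.apply(id);
            if (doc != null) docs.add(doc); // a document may have been deleted in the meantime
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        index.put("a", "yacy search");
        index.put("b", "other text");
        index.put("c", "yacy crawler");
        List<String> ids = collectIds(index, "yacy");
        System.out.println(loadEach(ids, index::get));
    }
}
```

Because the first step returns only IDs, the long-running part of the work is split into many short single-document lookups, which is the property the commit relies on to avoid a Solr-internal timeout.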
Michael Peter Christen
ad35d9294f added a 'stats' table which records some peer statistics twice every
hour. The table can be shown with
http://localhost:8090/Tables_p.html?table=stats

The entries have the following meaning: 
aM: activeLastMonth
aW: activeLastWeek
aD: activeLastDay
aH: activeLastHour
cC: countConnected (Active Senior)
cD: countDisconnected (Passive Senior)
cP: countPotential (Junior)
cR: count of the RWI entries
cI: size of the index (number of documents)

The entry keys are abbreviated to reduce the space in the table as the
name is written again for every row.

This is the beginning of a 'yacystats' micro-alternative as a built-in
function in YaCy. Graphics may follow after some time if enough test
data is available.
2014-09-17 12:54:50 +02:00
reger
8284ea751a catch TimeoutException during ping and do not delete yacy.conf during prereadconfigfile
found a situation after crash (reboot) with existing running semaphore but YaCy not running.
Ping generated exception which finally deleted the conf file (during pre-read procedure)
- change to ping (catch exception solved it)
- additionally removed delete yacy.conf file (if needed we need to make a backup)
2014-09-16 23:14:13 +02:00
reger
ffa7c7116f better fix for NPE in image search
replace 8931e14514
2014-09-16 16:43:17 +02:00
Michael Peter Christen
759e7d9538 fix for http://forum.yacy-websuche.de/viewtopic.php?p=30720#p30720 2014-09-16 14:53:30 +02:00
Michael Peter Christen
bf18a39d0e replaced warning with info 2014-09-16 14:41:04 +02:00
Michael Peter Christen
f1032fb8fe more enhancements to image search in case that a restriction to a single
domain is done
2014-09-16 13:41:01 +02:00
Michael Peter Christen
475125f9d7 hack to get more results when doing a remote site search 2014-09-16 00:13:26 +02:00
Michael Peter Christen
81f9b34da7 increased ability to search for all images on a single server within
the p2p remote search
2014-09-15 20:33:22 +02:00
Michael Peter Christen
2c26013c50 better contentdom abstraction 2014-09-15 14:00:41 +02:00
Michael Peter Christen
6a8fb8190b changed default value for maximum number of connections to 50 2014-09-15 13:50:40 +02:00
Michael Peter Christen
ca8b2bf099 removed www and welcome servlet, these had been demo servlets and are
not needed any more
2014-09-15 12:48:58 +02:00
reger
03a7a29db3 limit OAI import urn resolver try for Deutsche National Library
The resolver service of National Library uses name space nbn, limit use of nbn-resolving.de accordingly to urn:nbn:
- add resolver for rfc's
2014-09-14 01:38:27 +02:00
Michael Peter Christen
0838326a76 changed error message, see http://mantis.tokeek.de/view.php?id=439 2014-09-13 17:02:26 +02:00
reger
b5e0f70197 - remove repositoryPath post from ConfigBasic (obsolete)
- remove static snippetComputationTime from ResultEntry (not used)
2014-09-13 03:21:52 +02:00
reger
8931e14514 fix NPE in image search 2014-09-13 00:27:39 +02:00
Michael Peter Christen
1735dbc9d9 enhanced image search: bugfixes and performance enhancements 2014-09-12 16:37:01 +02:00
Michael Peter Christen
ebd0be2cea fixes and speed updates for search process 2014-09-10 14:24:03 +02:00
Michael Peter Christen
7611bf79bd Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1
Conflicts:
	locales/ru.lng
2014-09-10 13:24:49 +02:00
Michael Peter Christen
524bedc00a fixed text in startup tray icon and added shutdown icon during shutdown 2014-09-10 13:19:08 +02:00
Michael Peter Christen
4709d8417c npe fix for non-tray users 2014-09-08 10:26:28 +02:00
orbiter
5b5635e187 replaced font for boot tray icon with image and added some more images
for further tray icon displays
2014-09-08 00:21:29 +02:00
orbiter
aa6cdc4ab5 speed-up of start process if remote DNS waits for timeout 2014-09-07 12:28:19 +02:00
orbiter
40b3977c21 added an animation of the tray icon during the boot phase of YaCy.
Additionally, there is a tooltip and a new headline at the tray menu
which states the current booting status.
2014-09-07 12:04:35 +02:00
Michael Peter Christen
ec6082c872 very bad language detection hack fix hack 2014-09-05 23:29:09 +02:00
Michael Peter Christen
39615de3f9 adding the buffer size is not wrong but may cause confusing information
when the buffer is cleaned after a buffer flush which is not then
available in Solr since that is waiting for a commit. In such cases the
counter would run backwards which is prevented by ignoring the buffer
size.
2014-09-05 14:57:40 +02:00
Michael Peter Christen
395edec6f1 changed strategy to count the number of documents: get the max of
solr+buffer and the hit cache. This shall help during first crawls to
see a running document counter even if there was no commit meanwhile to
solr. To support that strategy, the hit cache must be written earlier.
2014-09-05 14:50:22 +02:00
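The counting strategy of commit 395edec6f1 amounts to a single maximum. The sketch below is an assumption-laden illustration: DocumentCountEstimate and its parameter names are hypothetical and only mirror the wording "max of solr+buffer and the hit cache".

```java
/** Hypothetical illustration of the counting rule; names are not taken from YaCy. */
class DocumentCountEstimate {

    /**
     * Returns the larger of (committed Solr documents + documents still buffered)
     * and the size of the hit cache, so the counter moves even before a commit.
     */
    static long estimate(long solrCommitted, long pendingBuffer, long hitCacheSize) {
        return Math.max(solrCommitted + pendingBuffer, hitCacheSize);
    }

    public static void main(String[] args) {
        // During a first crawl nothing is committed yet, but the hit cache already has entries.
        System.out.println(estimate(0, 0, 25));
        System.out.println(estimate(100, 10, 50));
    }
}
```

Note that the follow-up commit 39615de3f9 drops the buffer term again so that the counter cannot run backwards after a buffer flush.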
Michael Peter Christen
e87dc08c0d set the correct fail time in error docs 2014-09-05 14:46:11 +02:00
Michael Peter Christen
cfb20bc0ce removing the [] for ipv6 addresses may be a bad idea.. 2014-09-04 18:17:38 +02:00
orbiter
b6d57f06eb enhanced the apk parser (up to being production-ready).
The parser is not yet activated and will be after the next release step.
2014-09-04 09:41:42 +02:00
Michael Peter Christen
a7dd89c4de changed method to write the citation index: do not catch up references
during document parsing; instead use the same references that would also
be written into the webgraph. That should cause that the webgraph and
the citation index express the exact same semantic.
2014-09-02 13:22:12 +02:00
Michael Peter Christen
57ce7eeff3 fixed localhost authorization and replaced the adminRealm with an info
string which is visible in the browser. That makes it possible that the
browser instructs the user how to change a forgotten admin password
(during runtime).
2014-09-02 13:15:19 +02:00
orbiter
f318d7c285 enhanced date-ordered ranking 2014-09-01 13:01:30 +02:00
reger
a6891ff7f8 fix Querygoal.parse exception on +/-null-term
covers http://mantis.tokeek.de/view.php?id=452
2014-09-01 00:16:26 +02:00
reger
c7335318eb remove unused legacy procedure from httpserver
(deleted  generateSocketAddress(port) )
2014-08-31 00:33:05 +02:00
Michael Peter Christen
eab0d3e1a9 bugfix for wrong lock display, see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5321&p=30484#p30484
2014-08-28 12:50:45 +02:00
orbiter
49d4f95faf bugfix to latest commit 2014-08-27 00:16:50 +02:00
orbiter
68211f8244 enable Crawler_p servlet if a rss feed or a wiki dump import was
submitted.
2014-08-27 00:15:31 +02:00
orbiter
a65df4ce7e do not push noindex errors into log if in intranet mode. noindex
attributes are attached to artificial constructed index.html files which
list directories. Such files are naturally rejected by the crawler and
should not appear in the error log because these files are part of the
construction of file crawlers and confuse users if they see them in the
error log.
2014-08-27 00:10:51 +02:00
orbiter
688c6d8954 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-27 00:04:36 +02:00
orbiter
4ae7aead28 addon to latest fix 2014-08-27 00:03:49 +02:00
Marc Nause
2af56fa37d Improved UPnP. (still not perfect)
*) set HTTPS port if enabled
*) improved data structures (may not be final)
*) moved UPnP to own package
2014-08-26 22:47:13 +02:00
orbiter
b3ebd38079 removed the HTDOCS repository concept because the concept to host files
on the YaCy http server is obsolete; YaCy can index file:// and smb://
paths
2014-08-26 19:02:53 +02:00
reger
1fdcc2d67b change seedfile upload ip check to allow intranet ip in intranet mode
- this allows to setup a principal peer in intranet environment
2014-08-25 01:25:22 +02:00
reger
e31b0e6d67 - update javadoc Seed.getIP
- default mySeed.ip to hostip in SeedDB.initMySeed() if Intranetmode
this allows to become senior status in intranet hosted search network with few peers,
otherwise peer would stay junior because of default init with loopback ip as public (dna) ip.
2014-08-24 21:13:36 +02:00
reger
350c6b8250 in IntranetMode allow intranet hosted seedlist with Network_Domain "any"
- so far intranet seedlist hosts are always denied but need to be allowed in intranet mode
2014-08-24 05:20:06 +02:00
orbiter
d68438c3d9 make sure that the postprocessing background thread never dies by any
exception
2014-08-23 10:35:38 +02:00
orbiter
b4f2a1db6e added an unlock icon for all protected pages that are unlocked because
the administrator is logged in.
2014-08-19 19:58:31 +02:00
reger
ea6c9e9b07 reduce mem buffer overhead for gap files during r/w
(they are typically small compared to idx allowing to use smaller buffersize -> set to 16k records)
2014-08-18 00:03:24 +02:00
reger
e88537522d allow single quote " ' " in query
see http://mantis.tokeek.de/view.php?id=379
-add QueryGoal test case for this
2014-08-16 14:29:52 +02:00
orbiter
487021fb0a snippet computation update 2014-08-15 01:17:11 +02:00
orbiter
1c2f1f233a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-14 20:58:05 +02:00
reger
5a4995ded3 fill solr rss writer dc:subject tag with keyword content 2014-08-14 03:06:41 +02:00
orbiter
927aaa95a6 concurrency bugfix 2014-08-13 00:59:11 +02:00
orbiter
c9e593cf78 removed warnings 2014-08-11 23:53:12 +02:00
reger
7584352e7b use more predefined Solr query parameter constants
- use CommonParams and DisMaxParams constants
- fix typo in get sort parameter
- getDocumentCountByParams redundant implementation and risk of not optimized call (row parameter unspecified) -> as only used from getCountByQuery removed from interface
2014-08-10 22:33:10 +02:00
reger
f9db5dd6c5 reduce doublecontent check document (prevent out of memory)
see http://mantis.tokeek.de/view.php?id=437

test result (concurrency=7)
2000 docs = eom always
1000 docs = eom always
100 docs = eom never

chosen -> 200 docs (eom not encountered during test with 1GB mem setting)
2014-08-10 03:18:15 +02:00
reger
e9eae45b55 simplify rssreader and improve atom feed link extraction
- type detection (rss/atom) 
    - init type parameter overwritten during parse, parameter obsolete
    - detection by endtag changed to simpler first-tag evaluation
- channel image not used, removed related extra parser handling
    - remove unused code (set/getImage) in rssfeed
- atom link extraction to account for possible multiple link tags
   - spec limits link to one with rel="alternate" or one without rel attribute
     not accounting for the following type & hreflang exception yet:

   o  atom:entry elements MUST NOT contain more than one atom:link
      element with a rel attribute value of "alternate" that has the
      same combination of type and hreflang attribute values.
2014-08-10 01:29:16 +02:00
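The link selection rule named in commit e9eae45b55 (take the link with rel="alternate" or the one without a rel attribute) might look like the sketch below. AtomLinkChooser and Link are hypothetical names, not classes from the YaCy rssreader, and the type and hreflang exception quoted in the commit is deliberately not handled here either.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the YaCy rssreader implementation. */
class AtomLinkChooser {

    /** Minimal stand-in for an atom:link element; rel is null if the attribute is missing. */
    static final class Link {
        final String rel;
        final String href;

        Link(String rel, String href) {
            this.rel = rel;
            this.href = href;
        }
    }

    /** Returns the href of the first link that has no rel or rel="alternate", else null. */
    static String chooseHref(List<Link> links) {
        for (Link link : links) {
            if (link.rel == null || "alternate".equals(link.rel)) return link.href;
        }
        return null;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("self", "http://example.org/feed"),
                new Link("alternate", "http://example.org/post"));
        System.out.println(chooseHref(links));
    }
}
```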
reger
a8508417d1 catch NPE during crawl (OAI import)
- condenseDocument mime=null (allowed)
- collectionconfiguration responseheader = null (allowed)
2014-08-08 00:02:59 +02:00
reger
3dde94422f center searchevent lines on network graph
(PerformanceSearch_p.html)
2014-08-06 23:04:42 +02:00
Michael Peter Christen
3860711aef fix for possible interruption of concurrent queries 2014-08-06 12:55:18 +02:00
Michael Peter Christen
6344718f8b reducing the concurrent query stack size and reduced concurrency of
postprocessing to avoid OOM situations
2014-08-06 12:36:59 +02:00
Michael Peter Christen
eca9380e3d bugfix for crawler double-check: if an url is redirected, the
redirect-target was not double-checked. This is now done by replacing
the redirect-URL on the crawl queue again (where it is double-checked)
2014-08-06 12:35:12 +02:00
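The fix in commit eca9380e3d routes a redirect target through the same entry point as any other URL, so it is double-checked like the rest. A minimal sketch of that idea follows; RedirectRequeue and its methods are hypothetical and a plain in-memory set stands in for the crawler's real double-check.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration; a HashSet stands in for the crawler's double-check. */
class RedirectRequeue {
    private final Set<String> known = new HashSet<>();
    private final Deque<String> queue = new ArrayDeque<>();

    /** Adds the URL to the crawl queue unless it was already seen (the double-check). */
    boolean enqueue(String url) {
        if (!known.add(url)) return false;
        queue.add(url);
        return true;
    }

    /** A redirect target is not loaded directly; it is placed on the queue again. */
    boolean onRedirect(String targetUrl) {
        return enqueue(targetUrl);
    }

    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        RedirectRequeue crawler = new RedirectRequeue();
        crawler.enqueue("http://example.org/a");
        crawler.enqueue("http://example.org/b");
        // /a redirects to /b, which is already known, so it is rejected.
        System.out.println(crawler.onRedirect("http://example.org/b"));
    }
}
```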
Michael Peter Christen
9ac0c93f17 fix for subpath crawl filter 2014-08-06 01:33:24 +02:00
Michael Peter Christen
66106bdaf0 fix for crawler attribute maxdompages 2014-08-05 21:32:25 +02:00
Michael Peter Christen
49d91b94c3 npe fix in crawler 2014-08-05 21:31:59 +02:00
Michael Peter Christen
b7183a7321 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-05 09:54:18 +02:00
reger
ea2e627662 fix ConfigAccounts del user with uppercase letter in name
(usernames are case sensitive, userdb.delete used toLower)
2014-08-05 01:27:27 +02:00
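The bug behind commit ea2e627662 can be reproduced in a few lines. The sketch is hypothetical (CaseSensitiveUserDb is not the YaCy userdb class); it only illustrates why lowercasing the key on delete misses a user stored under a case-sensitive name.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration; not the YaCy user database. */
class CaseSensitiveUserDb {
    private final Map<String, String> users = new HashMap<>();

    void add(String name, String role) {
        users.put(name, role); // names are stored exactly as entered
    }

    /** Buggy variant: lowercases the name, so "Alice" is never found. */
    boolean deleteLowercased(String name) {
        return users.remove(name.toLowerCase(Locale.ROOT)) != null;
    }

    /** Fixed variant: uses the name unchanged. */
    boolean delete(String name) {
        return users.remove(name) != null;
    }

    boolean exists(String name) {
        return users.containsKey(name);
    }

    public static void main(String[] args) {
        CaseSensitiveUserDb db = new CaseSensitiveUserDb();
        db.add("Alice", "admin");
        System.out.println(db.deleteLowercased("Alice"));
        System.out.println(db.delete("Alice"));
    }
}
```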
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
191ec8c82a added concurrency to postprocess rewrite process 2014-08-04 15:28:58 +02:00
Michael Peter Christen
a1e8bdd5e9 log ppm instead of docs/second 2014-08-04 14:44:42 +02:00
Michael Peter Christen
cc0ded7abd set process type of web graph according to fields as defined in the
schema
2014-08-04 14:44:20 +02:00
Michael Peter Christen
12fb9d7cd1 log postprocessing constraints in case that postprocessing is not
performed
2014-08-04 14:19:37 +02:00
Michael Peter Christen
3c23b89823 less logging 2014-08-04 13:37:34 +02:00
Michael Peter Christen
a0c53174c5 better solr query logging to detect unnecessary sort requests for more
performance profiling
2014-08-04 13:00:45 +02:00
Michael Peter Christen
338f574bdc no sorting if http/www unique fields are not demanded (makes query
faster) and some code restructuring
2014-08-04 12:59:38 +02:00
Michael Peter Christen
1609763be5 toString fix 2014-08-04 12:58:39 +02:00
Michael Peter Christen
b983e68254 more retries, less sleep 2014-08-04 08:29:35 +02:00
Michael Peter Christen
1503ba7794 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-04 08:24:31 +02:00
reger
8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
(e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)
2014-08-04 02:38:58 +02:00
Michael Peter Christen
0ceeceb35e more logic on Solr queries; usage of the query terms in postprocessing,
saving one query for double document detection now per document
2014-08-04 02:35:38 +02:00
orbiter
38864ae004 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-03 22:44:49 +02:00
orbiter
4099296b45 added new classes which shall reduce call overhead to Solr (stub) 2014-08-03 22:44:22 +02:00
reger
d0c02e1de7 adjust rss lat/lon to double
(common format across other classes)
2014-08-03 20:09:23 +02:00
orbiter
3491ab4c38 removed unused images from webgraph edge computation 2014-08-01 13:21:16 +02:00
orbiter
2371d6b8db target linktexts must be string to enable search facets on these fields 2014-08-01 13:20:25 +02:00
Michael Peter Christen
001e05bb80 do not store failure of loading of robots.txt into the index as a fail
document
2014-08-01 12:15:14 +02:00
Michael Peter Christen
05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-01 12:04:25 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
Marc Nause
9df14fc126 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-29 21:26:43 +02:00
Marc Nause
477be17c51 Replaced old UPNP library with Weupnp. UPNP should
work now, at least it does on my network. UPNP code in YaCy can still
be improved though (see TODO comment: make port on gateway configurable
or find free one).

*) removed old code
*) added new lib
*) changed code to work with new lib
2014-07-29 21:26:27 +02:00
orbiter
738989aab7 reverted commit f94c91315b because the
webgraph has not enough performance for that
2014-07-29 18:49:42 +02:00
orbiter
e9163e7e10 fix for malformed hostpath names in crawl balancer 2014-07-29 11:18:45 +02:00
Michael Peter Christen
c115f3869c enhanced snippet computation and test method in ViewFile 2014-07-28 15:42:57 +02:00
reger
6c10b59f3e move bootstrap peers test systems to its test class
var assignment not needed  elsewhere.
2014-07-27 04:13:07 +02:00
orbiter
1027f3d04a fix for the usage of ready-prepared solr queries, some queries are
formulated as edismax query but this was not set as query attribute. The
defType=edismax property needs a qf-field, so this was added as well. Do
not remove that field again! This fixes also a problem with title-unique
computation.
2014-07-25 18:53:13 +02:00
Michael Peter Christen
f94c91315b if the webgraph is used, then use it also for reference computation to
avoid contradictions with references_i in the collection index.
2014-07-24 15:35:53 +02:00
Michael Peter Christen
6e1dc444c3 added a snippet test function in ViewFile: you can now search for a
specific word on the document; the servlet returns the snippet in the
same way as it would be shown in a search result.
2014-07-24 14:59:37 +02:00
orbiter
4b06adb751 fix for file urls 2014-07-23 17:54:31 +02:00
orbiter
08409ec680 no idea why the words max was an ordered one. This change increases speed
during document processing a bit
2014-07-23 17:54:16 +02:00
reger
e5854a5cdb fix localhost link to opensearchdescription.xml 2014-07-22 21:57:38 +02:00
Michael Peter Christen
b44626e55b fixed target_alt_t in webgraph 2014-07-22 18:24:10 +02:00
Michael Peter Christen
504327b15c fix for condition for writing the webgraph 2014-07-22 00:59:08 +02:00
Michael Peter Christen
542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
filled with the date, when the url is recognized as to be outdated. That
field was partly misinterpreted and the time interval was filled in. In
case that all the urls which are in the index shall be treated as
outdated, the field is filled now with Long.MAX_VALUE because then all
crawl dates are before that date and therefore outdated.
2014-07-22 00:23:17 +02:00
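The corrected meaning of crawlingIfOlder in commit 542c20a597 (a date, not an interval) can be sketched as below. RecrawlThreshold and its method signatures are hypothetical; only the field name and the use of Long.MAX_VALUE come from the commit message.

```java
/** Hypothetical illustration of the crawlingIfOlder semantics; not the YaCy CrawlProfile class. */
class RecrawlThreshold {

    /**
     * Computes the date (epoch millis) before which a crawl date counts as outdated.
     * If every indexed URL shall be treated as outdated, Long.MAX_VALUE is used.
     */
    static long crawlingIfOlder(long nowMillis, long maxAgeMillis, boolean treatAllAsOutdated) {
        if (treatAllAsOutdated) return Long.MAX_VALUE;
        return nowMillis - maxAgeMillis;
    }

    /** A URL is outdated if it was last crawled before the threshold date. */
    static boolean isOutdated(long lastCrawlMillis, long crawlingIfOlder) {
        return lastCrawlMillis < crawlingIfOlder;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long threshold = crawlingIfOlder(now, 10_000L, false);
        System.out.println(isOutdated(980_000L, threshold));
        System.out.println(isOutdated(995_000L, threshold));
    }
}
```

Storing the interval itself in the field, as the commit says happened before, would make almost every crawl date look newer than the threshold, so nothing would be re-crawled.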
Michael Peter Christen
4eec1a7452 refactoring (change Metadata name of load time data structure to avoid
confusion with Node data which is also called metadata)
2014-07-21 23:54:23 +02:00
reger
c95ba52cf0 improve logexception info
- log a message or class name instead of msgtxt "null"
2014-07-21 22:13:34 +02:00
orbiter
e441831a24 reverted toString() change in AnchorURL to prevent mistakenly used
toString(). This fixes also the update link bug.
2014-07-21 15:58:29 +02:00
reger
47f201a6b8 Add Solr default query fields (&qf) to select servlet
according to the ranking profiles boost fields defined by the peer (if df/qf is not specified in query).
This allows for pretty simple queries ( q=word) without the need to know about the specific index configuration.
Making sure all relevant fields (as determined by the index owner) are searched, still maintaining the option to query specific fields
and does not rely on the duplication of text to text_t.
- add author to reset-default boost fields (support results for author nav)
2014-07-21 00:47:14 +02:00
reger
f96cfdc84d prevent array out of bound exception on getRankingProfile(x)
on faulty &profileNr=  query parameter
2014-07-21 00:04:54 +02:00
reger
5f5fb4ecdc remove unused static (RSS)search from protocol 2014-07-20 02:49:49 +02:00
reger
7c1706d83a use CRLF in generated bat command scripts for windows
- for easier viewing with standard viewers
2014-07-20 00:06:22 +02:00
reger
a2cb366b25 Combine /heuristic search modifier with opensearch configured targets
- with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid)
- this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches)
- the index.html searchoption text adjusted to be displayed only if option configured
- add Archive-It to predefined systems
2014-07-20 00:00:43 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method that the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just just toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
bf1b6b93e7 do not write CR values to webgraph if no CR values are computed 2014-07-16 18:13:29 +02:00
Michael Peter Christen
e039e78210 small bugfixes 2014-07-16 16:04:38 +02:00
Michael Peter Christen
32a2ff925c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-16 14:58:27 +02:00
Michael Peter Christen
d07cdd8c3b added SolrCloud access mode and configuration 2014-07-16 14:57:51 +02:00
Michael Peter Christen
8514bffc22 enhanced postprocessing status report 2014-07-16 14:57:25 +02:00
reger
b24572f304 fix GSA filter query assignment
- use more parameter constants
2014-07-13 00:11:17 +02:00
Michael Peter Christen
b5fc2b63ea removed exist() retrieval functions from error cache and replaced it
with metadata retrieval from connectors directly. This should cause
better usage of the cache. Automatically increase the metadata cache if
more memory is available.
2014-07-11 19:52:25 +02:00
Michael Peter Christen
62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
double-calling of solr
2014-07-11 18:36:04 +02:00
Michael Peter Christen
dd5cdfe212 reverted filter query hack, it did not work 2014-07-11 18:15:35 +02:00
Michael Peter Christen
b5d78ba156 reduced number of solr queries during crawling 2014-07-11 18:05:11 +02:00
Michael Peter Christen
5326970d6c enhanced solr queries for single document extraction 2014-07-11 18:04:55 +02:00
Michael Peter Christen
525575bd97 added debugging of filter queries in thread dump thread names 2014-07-11 17:34:41 +02:00
Michael Peter Christen
f319ef268f testing filter queries instead of queries to retrieve documents by id 2014-07-11 17:09:46 +02:00
Michael Peter Christen
fd87fa1613 removed more unnecessary exist-checks in ErrorCache 2014-07-11 16:48:08 +02:00
Michael Peter Christen
f2b476e08b don't do a double check to solr for failed documents if they are not
written to solr
2014-07-11 16:26:52 +02:00
Michael Peter Christen
06ab72d1af enhanced crawler host round-robin strategy 2014-07-11 16:01:42 +02:00
orbiter
dab9a0786a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-11 04:04:34 +02:00
orbiter
51bf5c85b0 Renamed the transmission cloud to buffer in dispatcher since the name
'cloud' was a bad idea. Changed also the accumulation process for peer
targets so that every dht chunk is not assigned the set of redundant
targets but they are assigned to redundant targets individually. This
enhances the granularity of the target accumulation and should enhance
the efficiency of the process. Finally the dht protocol client was
enriched with the ability to remove the 'accept remote index' flag from
peers or remove peers completely if they do not answer at all.
2014-07-11 04:04:09 +02:00
Michael Peter Christen
a694b6a8fc another fix for unique field computation 2014-07-10 17:25:33 +02:00
Michael Peter Christen
fb3dd56b02 fix for processing of noindex flag in http header 2014-07-10 17:13:35 +02:00
Michael Peter Christen
b0d941626f fixed bugs in canonical, robots and title/description unique calculation 2014-07-10 15:40:38 +02:00
reger
d9472d043a cleanup older unused classes 2014-07-10 02:20:01 +02:00
reger
665e12f88e move startup time from old serverCore to switchboard (most used here)
to make servercore eventually obsolete.
2014-07-10 02:17:56 +02:00
reger
336425912a remove unused localSearchThread from SearchEvent 2014-07-10 02:14:03 +02:00
reger
32bd2a61c1 add local ip to AbstractRemoteHandler local hostname cache 2014-07-10 02:09:26 +02:00
Michael Peter Christen
f3a6b6e21e fix for bad URL decoding 2014-07-10 01:59:29 +02:00
Michael Peter Christen
1092e798a5 fixed double content postprocessing 2014-07-07 19:15:11 +02:00
Michael Peter Christen
aee5b108e5 added linkScraperParser, a parser which ignores the text like the
generic parser but extracts links like the htmlParser. This should be
used for ASCII documents without known text format annotation like
source code files or json documents. Probably also good for xml files
without known schema.
2014-07-07 13:37:17 +02:00
reger
2b8cc5832c fix seek error for 0 file size records file
by add extra check for file size = 0 in cleanlast()
- (http://mantis.tokeek.de/view.php?id=411)
2014-07-06 20:49:01 +02:00
reger
2ba394333f fix Crawler HostQueue release of stackfile
- close stackfile inputstream at end of ChunkIterator
This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)
2014-07-06 16:04:30 +02:00
reger
40133ba2d0 fix NPE in Condenser,
discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"
2014-07-06 13:24:36 +02:00
orbiter
59160984cc timeline performance update 2014-07-03 13:06:29 +02:00
orbiter
54bea96e67 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-02 23:23:34 +02:00
Michael Peter Christen
841cc77391 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-02 14:35:02 +02:00
Michael Peter Christen
e09218129c remove check for local solr. This check was made during a time when Solr
was optional and another alternative metadata store was available. Since
that store is now removed, Solr is always available (internally or
externally)
2014-07-02 14:34:48 +02:00