Commit Graph

4658 Commits

Author SHA1 Message Date
Michael Peter Christen
3e22d05290 added option for daterange properties in GSA interface to use a left-
or right-open date range;
i.e. using daterange=..2013-09-09 or daterange=2013-09-02.. in addition
to daterange=2013-09-02..2013-09-09
2013-09-11 12:52:18 +02:00
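Illustration of the daterange syntax described in the commit above: a minimal sketch, assuming ISO dates (yyyy-MM-dd) and an inclusive to-date. The class DateRange and its methods are hypothetical names chosen for this example and are not taken from the YaCy sources.

```java
import java.time.LocalDate;

/** Hypothetical holder for a parsed daterange value; not YaCy's actual class. */
final class DateRange {
    final LocalDate from; // null means the range is open on the left
    final LocalDate to;   // null means the range is open on the right

    DateRange(LocalDate from, LocalDate to) {
        this.from = from;
        this.to = to;
    }

    /** Parses "a..b", "..b" or "a.." into a range. */
    static DateRange parse(String s) {
        int p = s.indexOf("..");
        if (p < 0) throw new IllegalArgumentException("missing '..' in " + s);
        String left = s.substring(0, p).trim();
        String right = s.substring(p + 2).trim();
        return new DateRange(
            left.isEmpty() ? null : LocalDate.parse(left),
            right.isEmpty() ? null : LocalDate.parse(right));
    }

    /** Both bounds are treated as inclusive (an assumption of this sketch). */
    boolean contains(LocalDate d) {
        return (from == null || !d.isBefore(from))
            && (to == null || !d.isAfter(to));
    }
}
```

With this sketch, DateRange.parse("..2013-09-09") accepts every date up to and including 2013-09-09, and DateRange.parse("2013-09-02..") accepts every date from 2013-09-02 onwards.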
reger
36b7159282 - remove double initialization of jetty
- refactor some var assignments
2013-09-11 02:24:47 +02:00
reger
63ed04260a Merge remote-tracking branch 'origin/master' into jetty 2013-09-10 20:42:38 +02:00
Michael Peter Christen
35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date from most CMS.
2013-09-10 10:31:57 +02:00
reger
aafef72a8a merged current rc1/master into jetty branch to allow further development with latest version
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI -  needs further work
- TODO: YaCy servlet return values/parameters are not handled
2013-09-09 02:36:06 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (in preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
Michael Peter Christen
9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-03 12:22:57 +02:00
Michael Peter Christen
049c3b3f2e added an option to exclude image search results from text search. This
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
5d71a4c8bc fix for dc:description field 2013-09-03 07:54:49 +02:00
reger
392174de8c remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
2013-09-02 23:09:43 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results,
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
2013-09-02 18:55:38 +02:00
Michael Peter Christen
6184fd9d9a fix for solr/gsa result logging 2013-09-02 08:05:42 +02:00
reger
29967102a2 optimized QueryGoal (reducing mem and computation by removing all_hashes)
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
2013-09-02 04:19:53 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
orbiter
5b14bdfffd npe fix 2013-09-01 13:28:37 +02:00
orbiter
1ca4b9612c added special handling of the BinaryResponseWriter in the solr interface
which makes it possible to use solrj with the javabin format which is
much better (compressed, no xml overhead, java object streams) and
faster. Furthermore, this enables the 'shards' option in the solr
interface which connects one solr (YaCy) to another solr (YaCy) ad-hoc.
2013-09-01 13:11:40 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on the url: the collection attribute for a crawl start
may now be either a token or a list of tokens, separated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated from the pattern by a ':' and the string is assigned to the
document as collection only if the pattern matches the url.
2013-08-25 00:13:48 +02:00
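Illustration of the collection attribute syntax described in the commit above: a minimal sketch, assuming that the first ':' in a token separates name and pattern, that the pattern itself contains no ',', and that the pattern must match the whole url. CollectionRules and collectionsFor are hypothetical names, not YaCy's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; evaluates a collection attribute against one url. */
final class CollectionRules {

    /** Returns the collection names that apply to the given url. */
    static List<String> collectionsFor(String attribute, String url) {
        List<String> result = new ArrayList<>();
        for (String token : attribute.split(",")) {
            token = token.trim();
            if (token.isEmpty()) continue;
            int p = token.indexOf(':');
            if (p < 0) {
                // plain string token: always assigned
                result.add(token);
                continue;
            }
            // pair name:pattern, assigned only if the pattern matches the url
            String name = token.substring(0, p);
            String pattern = token.substring(p + 1);
            if (Pattern.matches(pattern, url)) result.add(name);
        }
        return result;
    }
}
```

For the attribute "user,news:.*/news/.*" every document would get the collection "user", and only urls containing "/news/" would also get "news".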
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
e6b423c4d9 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-08-19 22:02:41 +02:00
reger
94bec24d14 add back menu to Surftips page (currently no menu is displayed) 2013-08-19 17:53:37 +02:00
Michael Peter Christen
1f299b0d42 removed link.gif as link button because this image is now shown
automatically for external links
2013-08-19 10:54:23 +02:00
Michael Peter Christen
48ddd50a6c html fix 2013-08-17 09:32:24 +02:00
reger
96ae332427 revert del _blank (last commit) in template 2013-08-15 00:15:01 +02:00
reger
43348a98a9 add some href target=_blank to ext. links with external icon 2013-08-15 00:05:32 +02:00
reger
82d81a57bd info msg if no embedded Solr http://bugs.yacy.net/view.php?id=279 2013-08-14 20:59:46 +02:00
reger
02fe8b43ba Field Re-Indexing: display list of fields in reindex queue
change servlet to display statistic on 1st click (instead of after refresh)
2013-08-11 04:51:29 +02:00
sixcooler
7f501b7c38 clear some caches before reporting low Memory
do not break lines in Network-table-rows
2013-08-08 14:38:26 +02:00
reger
070bf85b33 css fix for IE10 showing border on all img within <a /> tag since introduction of external link icon (commit 112836dcc9) 2013-08-04 05:37:20 +02:00
sixcooler
8a96140f92 fix / workaround for
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4750
+ Seed.hash should be final
2013-08-01 16:40:58 +02:00
Michael Peter Christen
2674d28ef4 protection against self-ping (may be caused by fraud attempts) 2013-08-01 12:35:44 +02:00
orbiter
f3d001c7ab more space in the about section 2013-08-01 11:49:07 +02:00
Michael Peter Christen
e879b97b0a added line to enhance debugging 2013-07-31 13:33:05 +02:00
Michael Peter Christen
76afcccaaf fix for default boolean post values: the default value MUST NOT be TRUE,
because it's normal that a boolean value is missing in the post argument
if a checkbox is not selected.
Added also some style enhancements to IndexFederated, removed the Solr
attachment manual and replaced it with a link to the wiki which explains
this in more detail.
2013-07-31 10:49:26 +02:00
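Illustration of the rule stated in the commit above: browsers omit unchecked checkboxes from the post data, so a missing boolean argument has to be read as false. This is a minimal sketch with a plain Map standing in for the post arguments; PostArgs and getBoolean are hypothetical names, and the accepted "true" spellings are an assumption.

```java
import java.util.Map;

/** Hypothetical reader for boolean post values. */
final class PostArgs {

    /** A missing key means "checkbox not selected" and therefore false. */
    static boolean getBoolean(Map<String, String> post, String key) {
        String v = post.get(key);
        if (v == null) return false; // never default to true here
        return v.equals("true") || v.equals("on") || v.equals("1");
    }
}
```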
orbiter
252c525709 fixed feed api servlet and enhanced RSSReader class 2013-07-31 06:18:30 +02:00
Marc Nause
112836dcc9 Improved external links.
*) image links will not be marked (if they have class "yacylogo" or
"forceNoExternalIcon")
*) external links in menu on left (and "fork me"-banner) will open in
new tab/window now
2013-07-30 21:40:37 +02:00
Marc Nause
d64a094f0e External links in HTML interface are marked as external with small icon.
*) added new icon
*) added CSS rules to mark all external links except search results
(target="_self")
2013-07-30 20:46:51 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
sixcooler
7d53ac86a3 fix for Blacklist (-Administration) 2013-07-29 19:09:28 +02:00
orbiter
f425b2c61c re-try to fetch url after a soft commit 2013-07-27 10:56:02 +02:00
orbiter
bf0ad04e1b apply load limitation also to dht-in 2013-07-27 10:42:38 +02:00
Roland Haeder
b58ca8622d Some cleanups:
- added SKINS_PATH_DEFAULT as same as LISTS_PATH_DEFAULT was added
- Added 'final' keyword to a string
2013-07-27 10:13:57 +02:00
Roland Haeder
e2ee412160 Use SwitchboardConstants.LISTS_PATH_DEFAULT instead of 'DATA/LISTS'
Conflicts:
	htroot/api/blacklists_p.java
2013-07-27 10:12:58 +02:00
Roland Haeder
ae19401af0 Removed another duplicate occurrence of Blacklist.BLACKLIST_FILENAME_FILTER 2013-07-27 09:59:09 +02:00
Roland Haeder
59225487ea Fix for blacklist export, also applied the filename filter here 2013-07-27 09:58:56 +02:00
Roland Haeder
952fc0e7bd Removed superfluous check for files ending '.black' as the previous commit already excluded all other files (e.g. .ser dumps), added logging in catch-all block 2013-07-27 09:58:38 +02:00
Roland Haeder
060fec1577 Reuse Blacklist.BLACKLIST_FILENAME_FILTER 2013-07-27 09:57:50 +02:00
Roland Haeder
29049c71f5 Possible fix for ticket http://bugs.yacy.net/view.php?id=270, the filter for only including *.black must be applied 2013-07-27 09:57:07 +02:00
Michael Peter Christen
4c242f9af9 always use a default value for boolean options to have transparency for
the outcome if the attribute is missing in servlets
2013-07-25 12:17:29 +02:00
orbiter
9c681cc00d added segment sizes, postprocessing status and cpu load to crawler
monitor
2013-07-23 19:10:11 +02:00
orbiter
86b514cf46 added load info to status_p.xml 2013-07-23 18:20:07 +02:00
orbiter
056b42f5aa - added information about segment count to status_p.xml
- also moved this information from the old index structure, which is
still in use for the RWI/DHT index to that front-end
2013-07-23 18:03:33 +02:00
orbiter
6fb2811e68 fixes for problems with remote solr and non-activated webgraph index 2013-07-23 16:46:44 +02:00
orbiter
e24016e30a added the property federated.service.solr.indexing.timeout to yacy.init
to provide a configurable time-out for solr; see also:
http://bugs.yacy.net/view.php?id=254
2013-07-22 17:45:12 +02:00
orbiter
232100301c removed double-occurring value assignments 2013-07-17 19:09:25 +02:00
Roland Haeder
aaedc0405d Fixes and avoidance of catching bad exceptions (some):
- Rewrote usage of HashMap/Map to concurrent versions (to avoid a
CME=ConcurrentModificationException)
- Rewrote ConnectionInfo (as an example) to use a synchronized iterator
instead of synchronizing an already synced HashSet (see Collections call)
- This avoids catching CMEs again
- Commented out noisy ConcurrentLog.logException() call

Conflicts:
	source/net/yacy/repository/LoaderDispatcher.java
2013-07-17 18:37:34 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Felix Ableitner
376f9cd9d0 Merge branch 'master' of git://gitorious.org/yacy/rc1 into blacklist_structure 2013-07-17 15:58:09 +02:00
Michael Peter Christen
89c0aa0e74 added collection_sxt to error documents 2013-07-17 15:20:56 +02:00
Michael Peter Christen
0df5195cb0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-17 12:42:06 +02:00
Michael Peter Christen
1fd006cc56 fixes using the embedded connector 2013-07-17 12:41:54 +02:00
orbiter
aba7cc5de7 added cpu load information to status page 2013-07-17 12:38:12 +02:00
Roland Haeder
59b4fdd5ad Merge remote-tracking branch 'upstream/master' 2013-07-13 15:12:51 +02:00
orbiter
5493389576 stealth mode shall only be available for authorized users, because
unauthorized users can otherwise be monitored by authorized users
2013-07-13 14:49:36 +02:00
Roland Haeder
ebbb3bc5c1 Fixed CHMOD on many files + added missing loggers (e.g. jena) and made some noisy loggers quiet 2013-07-13 13:12:36 +02:00
Michael Peter Christen
bcc623a843 refactoring of load_delay: this is a matter of client identification 2013-07-12 16:24:56 +02:00
orbiter
2be456e7fb added a postprocessing field into api/status_p.xml to show if the
postprocessing task is running at that time (status: busy) or not
(status:idle)
2013-07-12 14:29:22 +02:00
orbiter
575f913154 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-12 14:17:13 +02:00
orbiter
c4efb612e2 added list of crawls to status_p.xml 2013-07-12 14:16:51 +02:00
Lotus
bb6caa346c Do not allow automatic update in case YaCy is installed to the Program
Files folder on Windows. There are no permissions to write that folder
and update would fail.
2013-07-11 21:50:06 +02:00
orbiter
dac88561ae minimum access time has a tight connection to ClientIdentification,
therefore it is defined there.
2013-07-11 17:04:24 +02:00
Felix Ableitner
a020697d64 Fixed problems with blacklist entry insertion. 2013-07-11 13:10:23 +02:00
sixcooler
bff8c753c6 re-insert this file - was deleted by mistake
+ correct another case-typo
2013-07-10 18:32:12 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
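Illustration of the queue-based logging idea described in the commit above: callers only enqueue a message and a single worker thread writes the entries one by one. This is a minimal sketch, not YaCy's ConcurrentLog; QueuedLog is a hypothetical class and an in-memory list stands in for the real jdk log handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical logger that decouples callers from the slow log sink. */
final class QueuedLog {
    // unique sentinel object, compared by identity to stop the worker
    private static final String POISON = new String("POISON");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> sink = new ArrayList<>(); // stands in for the real handler
    private final Thread worker;

    QueuedLog() {
        worker = new Thread(() -> {
            try {
                // take entries one by one until the sentinel arrives
                for (String m = queue.take(); m != POISON; m = queue.take()) {
                    synchronized (sink) { sink.add(m); }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "QueuedLog");
        worker.setDaemon(true);
        worker.start();
    }

    /** Returns immediately; the caller never waits for the sink. */
    void info(String message) {
        queue.add(message);
    }

    /** Drains the queue, stops the worker and returns what was written. */
    List<String> close() {
        queue.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (sink) { return new ArrayList<>(sink); }
    }
}
```

The unbounded queue keeps callers from blocking at the cost of memory if the sink falls behind; a bounded queue would trade that the other way.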
orbiter
c79f687110 enhanced the network scanner: find more hosts automatically by removal
of common subdomains before application of protocol-specific prefix
2013-07-09 11:42:13 +02:00
orbiter
b4677d1cad fix for bug #252
the naming of the servlet was wrong, the bug may not be present on
systems where upper/lowercase matching is lazy (windows)
2013-07-09 10:50:47 +02:00
Michael Peter Christen
07261fe274 Merge remote-tracking branch 'nutomics/blacklist_structure' 2013-07-08 23:32:15 +02:00
Michael Peter Christen
dea71851d2 - better concurrency for network scanner
- network scanner can now start from the list of all hosts in the search
index
2013-07-08 16:29:30 +02:00
orbiter
9f0cc9b401 enhanced network scanner
- textarea input field can now be used to paste in a large list of hosts
- /31 subnet is possible (only one host)
- auto-detect subdomains for ftp and www subdomains
2013-07-08 13:17:09 +02:00
orbiter
f8c28efd66 fix for rssTerminal coloring 2013-07-04 21:46:46 +02:00
Felix Ableitner
44f8fcf62e Changed class structure of Blacklist. 2013-07-04 18:37:57 +02:00
Michael Peter Christen
3054a6d4b9 added a patch from Sebastian M.B., submitted by email for coloring of
rss terminal
2013-07-04 17:12:19 +02:00
Michael Peter Christen
78af998f8f Merge commit 'fd90fcc4e08f80acbfd1c9a7ec62ce04cd309594' 2013-07-04 16:56:54 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Felix Ableitner
fd90fcc4e0 Fixes #196. 2013-07-02 20:45:41 +02:00
Michael Peter Christen
f1c5338210 preparation for greedy crawl profiles and refactoring 2013-07-01 13:10:09 +02:00
Michael Peter Christen
e6f361f474 adding the canonical tag to crawl queues 2013-07-01 13:09:41 +02:00
Michael Peter Christen
203921006a redesign of citation index storage 2013-06-30 02:11:46 +02:00
Michael Peter Christen
e92b9275ce Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-06-28 15:33:29 +02:00
Michael Peter Christen
56cdcfa2fa fixed greedy learning mode - global is not a search attribute in
searchitems
2013-06-28 15:33:19 +02:00
Michael Peter Christen
32aa1d4569 removed unused option for queries 2013-06-28 15:32:36 +02:00
Michael Peter Christen
0c5bed7e2c added configuration option for greedy learning function to ConfigPortal
servlet
2013-06-28 15:31:36 +02:00
sixcooler
5d1f619f07 possible helpful closing of solr-requests 2013-06-28 15:19:50 +02:00
Michael Peter Christen
9d291764d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-06-28 15:03:25 +02:00
sixcooler
e5abccdfe4 added optimize-option 2013-06-28 14:51:37 +02:00
Michael Peter Christen
8ea6ddf636 removed attributes from ConfigPortal.html which are redundant to
ConfigSearchPage_p.html
2013-06-28 14:17:14 +02:00
Michael Peter Christen
64140f35cd fix for solr requests if no query part is given (prevent npe) 2013-06-28 13:16:25 +02:00
Michael Peter Christen
23fb458963 - fix to gsa searchresult answer in case that no query part is given
- fix to gsa default number of results (is 'num')
2013-06-28 12:22:33 +02:00
Michael Peter Christen
660a196989 refactoring 2013-06-26 09:27:22 +02:00
Michael Peter Christen
54024958ac added url_file_name_s in query for live-search of urls 2013-06-25 16:36:05 +02:00
Michael Peter Christen
16d1d744fa added url_file_name_s in default collection schema for the file name
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which no longer has the file name as the
last part of the path list.

The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
2013-06-25 16:27:20 +02:00
Michael Peter Christen
f542cf7d9c fix for daterange: the to-date is inclusive 2013-06-21 15:47:12 +02:00
Michael Peter Christen
c36720d45f added daterange option to gsa api 2013-06-18 16:25:00 +02:00
Michael Peter Christen
4e3007f4a0 typo 2013-06-13 22:40:46 +02:00
Michael Peter Christen
2cb6b6bc21 added target="_blank" to shutdown links 2013-06-13 22:31:39 +02:00
orbiter
c8e94ad7c7 fix for citation search in case that the citation is very fresh 2013-06-13 18:27:57 +02:00
orbiter
57dcf68665 added a feed-back message inside the shutdown page 2013-06-13 14:44:47 +02:00
Michael Peter Christen
0600d510e1 show the citation report also in ViewFile 2013-06-13 13:22:43 +02:00
Michael Peter Christen
1a92b61d69 fixed usage of ViewFile which needs a commit before showing latest crawl
result pages.
2013-06-13 13:08:24 +02:00
Michael Peter Christen
570511f3c8 removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrer in the host browser is possible
without them. Therefore now the host browser does not only show
internal, but also external referrer to each link.
2013-06-13 13:01:28 +02:00
Michael Peter Christen
fd1776a3b0 added a new 'Citations' function: each search result item can now be
explored for citations within other documents. A click on the
'Citations' link shows an analysis with all text lines in the document
each with a complete list of documents which contain the same line. A
second section shows the linking documents in ascending order of number
of citations from the original document. Because documents from
different hosts are most interesting here, they are listed at the top of
the page as possible 'copypasta' source.
2013-06-12 15:02:49 +02:00
Michael Peter Christen
1762911f57 added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods caused deadlocks inside
solr.
2013-06-12 02:13:18 +02:00
Michael Peter Christen
2fd7bbb450 reduced load on solr; no seed update in Status and no exists-check in
HTTPLoader in case of redirects, that can be done using the htcache.
2013-06-12 00:14:55 +02:00
Michael Peter Christen
7ee71c2354 changed administration page headline to 'admnistration' 2013-06-12 00:12:04 +02:00
Michael Peter Christen
efd973d29d changed p2p/stealth mode text and links a bit 2013-06-11 16:50:34 +02:00
Michael Peter Christen
6115bef335 added a 'greedy learning' mechanism which causes a 'fresh'
yacy to load linked web pages from search results until the total
number of web pages reaches 15000. This shall give fresh peers a 'boost'
to get a personalized search index faster.
2013-06-11 14:42:30 +02:00
Michael Peter Christen
a5e328d7c5 new icons 2013-06-11 13:16:46 +02:00
Michael Peter Christen
b85db72a73 added another response writer which can present search result with
texts, separated by sentences. Then, these sentences can be used to
search again in the index for the same sentence. This can be used to
provide a tool for plagiarism-search. (not finished yet).
Try the following:
http://localhost:8090/solr/select?q=text_t:flut&grep=wasser&defType=edismax&start=0&rows=3&core=collection1&wt=grephtml
.. to search for 'flut' and show only sentences in the result documents
which contain the word 'wasser'.
Consider this like using a grep-tool on documents: you select the
documents by a search query and you grep sentences inside the found
documents with the 'grep' attribute.
2013-06-10 18:41:00 +02:00
Michael Peter Christen
5132bf719c added new buttons to search result page in p2p mode which show the
switch between p2p search and the 'stealth mode' which is simply a
non-p2p search within the p2p network. The functionality was there all
the time, but the switch to this was not very visible.
2013-06-10 16:22:00 +02:00
orbiter
2b320313d9 replaced yacydoc servlet usage by a solr result output using an html
output writer. This made the creation of a html result writer necessary
which is included in this commit. The yacydoc servlet was used to
present all metadata to a document, but the solr interface can serve for
this purpose in a much better way. All usages (except one) of yacydoc
were replaced by a solr call. This affects also the 'metadata' link
attached to search results.
2013-06-09 12:12:34 +02:00
orbiter
200769d0c6 show the cache link in search results only if there is actually a cache
entry stored in HTCACHE
2013-06-09 08:15:23 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, also a
backlink-structure can be discovered and written to the index as well.
The host browser has been extended to show such backlinks for each
presented link. The host browser can therefore now show where a
document is linked from. The new citation reference is computed as the
likelihood of a random click path with recursive usage of previously
computed likelihoods. This process is repeated until the likelihood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
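Illustration of the iteration described in the commit above. The commit does not give the formula, so this sketch swaps in a PageRank-style random-click model, which may differ from YaCy's actual citation reference computation. The damping factor, the handling of documents without outgoing links, and the normalization by the maximum value are all assumptions; CitationRankSketch and rank are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical PageRank-style stand-in for the citation rank CRn. */
final class CitationRankSketch {

    /**
     * links[i] lists the documents that document i links to.
     * Returns one value per document, normalized so that 0 <= CRn <= 1.
     */
    static double[] rank(int[][] links, double damping, double epsilon) {
        int n = links.length;
        double[] cr = new double[n];
        Arrays.fill(cr, 1.0 / n);
        double delta;
        do {
            // each round reuses the likelihoods computed in the previous round
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // no outgoing links: spread the likelihood evenly
                    for (int j = 0; j < n; j++) next[j] += damping * cr[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * cr[i] / links[i].length;
                }
            }
            delta = 0.0;
            for (int i = 0; i < n; i++) delta += Math.abs(next[i] - cr[i]);
            cr = next;
        } while (delta > epsilon); // repeat until the values converge

        // normalize to the range [0, 1] by dividing by the largest value
        double max = 0.0;
        for (double v : cr) max = Math.max(max, v);
        for (int i = 0; i < n; i++) cr[i] /= max;
        return cr;
    }
}
```

For three documents where 0 and 1 both link to 2 and 2 links back to 0, document 2 ends up with the highest value and document 1, which nobody links to, with the lowest.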
Michael Peter Christen
fdcd4e6a6f fixes to index deletion: quoting of host name (a '-' may be part of the
url) and disabling the engage button when changing the url field at
'Delete by URL matching'
2013-06-07 08:52:07 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
orbiter
5c7ddc67fe in GSA api enable usage of solr fq-attribute together with GSA
site-attribute
2013-06-06 13:36:58 +02:00
Michael Peter Christen
eb9d0ba5b1 ranking and boost function update, small bugfixes, better default search
field for solr
2013-05-30 16:30:35 +02:00
Michael Peter Christen
5f92c68f1f removed block rank ranking and all YBR files in /ranking 2013-05-30 13:01:22 +02:00
Michael Peter Christen
164603b946 cleanup 2013-05-30 12:47:22 +02:00
Michael Peter Christen
0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of
soft commits, reduced caching size of search events, ensured that solr
results are processed before connection is closed to keep that stuff not
too long in RAM
2013-05-29 18:27:27 +02:00
Michael Peter Christen
709e9b8ce7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-29 13:49:42 +02:00
Michael Peter Christen
9e07447d47 added new link for SMW 2013-05-29 13:45:22 +02:00
Michael Peter Christen
3c04dd11de removed dead link 2013-05-29 13:42:38 +02:00
Michael Peter Christen
281959a2d7 added option to re-boot the embedded solr during run-time. Added also
API recording for this method so it can be repeated automatically. The
index dump generation is now also available for API recording. Added
some synchronization in backend which was necessary for this.
2013-05-29 13:09:34 +02:00
Michael Peter Christen
80a7989e8c fixed ClassCastException: [Ljava.lang.Object; cannot be cast to
[Ljava.util.List; in robots.txt servlet
2013-05-29 12:02:19 +02:00
orbiter
da621e827e prevent NPE in case RWI is disabled 2013-05-28 16:26:38 +02:00
Michael Peter Christen
7300d81f40 include API Table deletion requests to the API recorder 2013-05-28 11:35:56 +02:00
Michael Peter Christen
d2ade87b49 fixed missing thisaddress in yacysearch.html which caused that the
opensearch link was not working
2013-05-28 10:33:41 +02:00
Michael Peter Christen
179d032181 added a (badly formatted) delete button for process scheduler entries 2013-05-27 16:15:58 +02:00
reger
c03f75ebc3 fix DHT url receive see http://bugs.yacy.net/view.php?id=242 2013-05-26 03:24:32 +02:00
Marc Nause
8fb1b1e290 *) simplified banner creation code 2013-05-25 12:56:43 +02:00
Marc Nause
cd0b5f31b4 *) updated links to description of regex 2013-05-25 11:08:06 +02:00
Michael Peter Christen
8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
reduced time-out of robots.txt load limit
2013-05-20 22:05:28 +02:00
Michael Peter Christen
f93501e6e0 nice crawl name if crawl is started with file:// (was: null) 2013-05-20 11:25:26 +02:00
Michael Peter Christen
b4f0cac102 added the reindexing job servlet to the submenu structure 2013-05-20 11:02:21 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is a much better
performance when testing the existence of many urls. This should
cause much less IO during index transmission, both on sender and
receiver side.
2013-05-17 13:59:37 +02:00
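Illustration of the batching idea described in the commit above: instead of one query per id, all ids go into a single Solr query string. This is a minimal sketch that only builds the query text; the field name "id", the class ExistsQuery and the escaping rules shown are assumptions, and the actual request in YaCy may be built differently.

```java
import java.util.Collection;
import java.util.StringJoiner;

/** Hypothetical builder for a single exists-query over many ids. */
final class ExistsQuery {

    /** Builds e.g. id:("a" OR "b") for the ids a and b. */
    static String build(Collection<String> ids) {
        StringJoiner joiner = new StringJoiner(" OR ", "id:(", ")");
        for (String id : ids) {
            // quote each id and escape backslashes and quotes inside it
            String escaped = id.replace("\\", "\\\\").replace("\"", "\\\"");
            joiner.add('"' + escaped + '"');
        }
        return joiner.toString();
    }
}
```

The ids found in the single response can then be compared with the requested set; callers should skip the query entirely for an empty collection.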
Michael Peter Christen
c91c67c3cd reject bad solr requests 2013-05-15 22:42:05 +02:00
Michael Peter Christen
44e363f37f refactoring of WorkflowProcessor, added process counter, update of
process counter if a blocking thread dies. Added also a new column in
PerformanceConcurrency_p servlet to show the actual number of concurrent
processes.
2013-05-13 13:28:07 +02:00
reger
79401cb938 added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html)
this allows to remove obsolete fields from the index (according to current schema config)
by selecting all documents containing disabled fields.
2013-05-13 04:06:57 +02:00
Michael Peter Christen
b24d1d18e4 removed synchronization and concurrency in Fulltext class, concurrent
deletions are now handled in ConcurrentUpdateSolrConnector
2013-05-11 10:53:12 +02:00
Michael Peter Christen
f965d04496 added new peer icons for Mentor peers and Mentee peers (not used yet) 2013-05-10 17:33:02 +02:00
Michael Peter Christen
b9b446bca6 - added ssl configuration sign (a lock) to network statistic/table
- fixed a bug in bitfield
2013-05-10 17:32:21 +02:00
Michael Peter Christen
7095446ad3 added checkbox (near port) to switch on ssl support (https access) to
the admin interface.
2013-05-10 13:49:46 +02:00
Michael Peter Christen
e6c8b545c2 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-10 12:16:55 +02:00
orbiter
4baa0d4a97 Added a default keystore for ssl encryption of the YaCy web interface.
This will enable https-access to YaCy, but this feature is disabled by
default using the new server.https=false attribute. This has two
purposes:
- make it easier for everyone to use https (just set server.https=true)
- provide the basis for secure yacy-to-yacy communication in the future
2013-05-10 12:02:31 +02:00
Michael Peter Christen
038f956821 fix for sitemap detection: the sitemap url was not visible if it
appeared after the declaration of robots allow/deny for the crawler
because the sitemap parser terminated after the allow/deny rules had
been found. Now the parser reads the robots.txt until the end to
discover also sitemap rules at the end of the file.
2013-05-10 04:56:58 +02:00
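Illustration of the fix described in the commit above: the sitemap lines are collected by scanning the whole robots.txt instead of stopping after the allow/deny rules. This is a minimal sketch, not YaCy's robots.txt parser; RobotsSitemaps and sitemaps are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical scanner that collects all Sitemap entries of a robots.txt. */
final class RobotsSitemaps {

    static List<String> sitemaps(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        // read every line until the end, regardless of earlier allow/deny rules
        for (String line : robotsTxt.split("\\r?\\n")) {
            String l = line.trim();
            // the "Sitemap:" key is matched case-insensitively
            if (l.regionMatches(true, 0, "sitemap:", 0, 8)) {
                String url = l.substring(8).trim();
                if (!url.isEmpty()) urls.add(url);
            }
        }
        return urls;
    }
}
```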
Michael Peter Christen
e26bdd4a52 fixes to deletion methods (removed unnecessary concurrency and added
removal of crawl queue entries)
2013-05-08 13:26:25 +02:00
Michael Peter Christen
f7f3e28c5e prevent that the size of the index is computed too many times.
Because the index size is now provided by solr, and the only way to do
that is a match for [* TO *], a size computation is quite complex and
time-consuming. Therefore this patch prevents that the method is called
at all and if necessary puts a DOS-preventing barrier in front of it.
2013-05-08 11:50:46 +02:00
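Illustration of the barrier described in the commit above: an expensive size computation is only repeated after a minimum time has passed, and callers in between get the last known value. This is a minimal sketch under that assumption; CachedSize is a hypothetical class and the real patch may limit the calls in a different way.

```java
import java.util.function.LongSupplier;

/** Hypothetical time-based cache in front of an expensive size computation. */
final class CachedSize {
    private final LongSupplier expensive; // e.g. a count over a [* TO *] match
    private final long maxAgeMillis;
    private long value;
    private long computedAt;
    private boolean valid = false;

    CachedSize(LongSupplier expensive, long maxAgeMillis) {
        this.expensive = expensive;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Recomputes only if there is no value yet or the cached one is too old. */
    synchronized long get() {
        long now = System.currentTimeMillis();
        if (!valid || now - computedAt > maxAgeMillis) {
            value = expensive.getAsLong();
            computedAt = now;
            valid = true;
        }
        return value;
    }
}
```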
Michael Peter Christen
cca19d94d4 re-declared some fields to be of type string rather than text which
makes them more efficient and less large
2013-05-06 16:45:54 +02:00
Michael Peter Christen
ed1d5bace6 draw the names of other peers which receive/send dht into the network
graphic
2013-05-06 14:27:39 +02:00
Michael Peter Christen
b528448332 enlarge network graph circle according to image height and reduce the
image height in the Network servlet. Overall, the image is now larger
but takes less space on the web page.
2013-05-05 23:39:46 +02:00
Michael Peter Christen
f1bb54943e typo 2013-05-04 09:34:06 +02:00
Michael Peter Christen
d7fd346917 - added regular-expression based deletions
- on-demand collection-list generation for collection-based deletions
instead of a default collection-list presentation (this makes calling
the interface much faster since the computation of collections lists for
large indexes may take some seconds)
2013-05-04 01:14:10 +02:00
Michael Peter Christen
3841854c97 abstraction of catchall term 2013-05-04 00:14:22 +02:00
sixcooler
e145afb8d6 fix for PerformanceMemory showing UNRESOLVED_PATTERN by removing
solr-cache-stuff, which is not available anymore
2013-05-02 15:47:21 +02:00
Michael Peter Christen
1b102d98d8 - added index deletion to index administration submenu
- added index deletion processes to the process scheduler/recorder
2013-04-30 02:11:28 +02:00
Michael Peter Christen
0e2ee00fea added an index deletion servlet and some style changes for the
'dangerous' engage-button
2013-04-29 19:30:53 +02:00
Michael Peter Christen
e4f7e5bcfe fixed bad css change 2013-04-28 20:09:45 +02:00
Michael Peter Christen
3502b4c697 refactoring (renaming) of yacy-solr api 2013-04-27 01:32:18 +02:00
Michael Peter Christen
3a0fcfbeda Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-26 10:50:08 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adapted logging and error management
- adapted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
reger
0a9b0992f3 RankingSolr_p: include warning if boost field not in local index 2013-04-26 02:26:38 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy more
adjusted to smaller and 1-core devices.
- the workflow processor now starts no process at all. these are started
as soon as parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
c091000165 added collection attribute also to the rss feed reader 2013-04-24 01:14:35 +02:00
orbiter
f7571386a3 added a 'collection' property attribute in yacysearch.html which can be
used to select between different collections as defined during a crawl
start with the 'collection' attribute. This actually implements the
ability to prepare search tenants which restrict their search results to
a specific collection. The main use for this is to provide tenants to
the yaml4 interface (at this time).
2013-04-23 20:42:54 +02:00
orbiter
3e79bd4b1f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-23 12:15:46 +02:00
orbiter
d571e739b6 increased row limitation for authorized users from 10000 to 100000000 in
solr interface
2013-04-23 12:15:33 +02:00
Michael Peter Christen
a1fffe8e86 fixed default ranking values 2013-04-21 12:27:27 +02:00
Michael Peter Christen
1d30082446 added hindi translation configuration 2013-04-17 12:57:27 +02:00
Michael Peter Christen
97775fbebc fixed ranking for add-function queries: this did not work. The option
was removed. All function queries are now boosts (multiplies the score
according to a function). This is also the recommended way to boost
rankings based on functions as explained in
http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
2013-04-16 14:45:14 +02:00
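A minimal sketch of why a multiplicative boost behaves more predictably than an additive one. The scores, boost values and function names are hypothetical illustrations, not taken from YaCy or Solr. The point is that an additive boost depends on the scale of the relevance score, while a multiplicative boost does not.

```python
def additive(score, boost):
    """Old-style 'add' function query: the boost is added to the score."""
    return score + boost


def multiplicative(score, boost):
    """Boost-style function query: the score is multiplied by the boost."""
    return score * boost


def rank(docs, combine, scale=1.0):
    """Return document names ordered by combined score, best first.

    `scale` simulates relevance scores living on a different numeric scale,
    as happens between different queries or index sizes.
    """
    scored = [(combine(score * scale, boost), name) for name, score, boost in docs]
    return [name for _, name in sorted(scored, reverse=True)]


# (name, hypothetical relevance score, hypothetical boost value)
docs = [("A", 1.0, 0.5), ("B", 0.8, 1.0)]

for scale in (1.0, 10.0):
    print(scale, rank(docs, additive, scale), rank(docs, multiplicative, scale))
```

With these made-up numbers the additive ordering flips when all relevance scores are scaled by 10, whereas the multiplicative ordering stays the same.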
Michael Peter Christen
298bf2deb5 fix to ranking configuration servlet 2013-04-16 12:38:16 +02:00
Michael Peter Christen
2db058b551 added in RankingSolr_p.html a select box to switch between different
ranking situations. By default, four situations can be configured.
2013-04-16 11:38:51 +02:00
Michael Peter Christen
6fbca35215 fixed api table navigation 2013-04-16 01:39:30 +02:00
Michael Peter Christen
f24ac518e6 redesign of exists()-query (can now be called with query) and the
CachedSolrConnector which based its cache on the key value. This will be
used to correct the title_unique_b and description_unique_b field.
2013-04-15 14:08:30 +02:00
Michael Peter Christen
27d6222880 added new field host_extent_i which, after a crawl and postprocessing,
holds the number of documents for the host where the document is hosted.
This is necessary for ranking and the norming of references per local
host in the ranking computation.
2013-04-14 20:52:40 +02:00
Michael Peter Christen
579eb01a49 now showing the details of the references count in host browser:
external (ext), internal (int) and external hosts (hosts) for each
indexed document.
2013-04-14 11:30:57 +02:00
reger
0f4237d8e5 add admin option to delete load errors from index 2013-04-14 05:33:01 +02:00
Marc Nause
e99c8789ff *) fixed encoding of query in link to map (in case geolocalization is
enabled, "Show search results for "köln" on map")
*) applied suggestions of Checkstyle plugin
2013-04-13 21:50:48 +02:00
Michael Peter Christen
082e3274d6 - setting the same default ranking in the solr interface as for YaCy
search interfaces if no other ranking attributes are given
- using the YaCy ranking in the GSA interface only if no GSA-style sort
attribute was given
- to avoid confusion about correct ranking attributes, only the default
'0'-ranking profile is used and not a scenario-adapted one (site, date)
because that should be configurable in the web interface before it is
actually used for ranking.
2013-04-12 10:48:41 +02:00
Michael Peter Christen
edc0b33f6d - showing references count and clickdepth in host browser
- fixed generation and presentation of both values
2013-04-11 14:46:13 +02:00
orbiter
2c3b024196 if the crawl was paused (automatically), show the reason for pausing in
the Crawler_p servlet.
2013-04-09 18:55:26 +02:00
reger
566a3b0294 fix: Index Administration > Reverse Word Index (IndexControlRWIs_p) corrected use of word search to word-hash search
- removed duplicate QueryParams.hashes2Handles, redundant with .hashes2Set
2013-04-08 21:25:21 +02:00
reger
40b3f2c5fe comment out dead menu link 2013-04-06 02:34:56 +02:00
reger
bf1e1ddca1 fix typo in prev commit 2013-04-06 02:29:49 +02:00
reger
d4d93be779 uncomment "used time" calculation for remote search log 2013-04-06 02:08:01 +02:00
reger
36202f27b0 improve remote search log, set "Returned Results" to transmitcount (instead of no value) 2013-04-05 03:33:33 +02:00
reger
254074b11d Merge branch 'master' of git://gitorious.org/yacy/rc1.git 2013-03-22 03:46:26 +01:00
Michael Peter Christen
870aedf3c6 fixes for better search interface integration in yaml templates 2013-03-20 16:19:49 +01:00
Michael Peter Christen
735eb70525 better search timing; prevents '0 results' for very large local
indexes (far more than 10 million documents)
2013-03-19 11:23:18 +01:00
Michael Peter Christen
342ba1049b - callback fix
- memory allocation problem in RowCollection: if memory is too low, do
not try to increase by 1, because this leads to very long execution
time and in the end to the same OOM as if we allocate the memory at the
moment we need it, even if the resource observer states that this memory
is not there. To compensate for this, the increase size is reduced.
2013-03-19 10:32:01 +01:00
reger
31d16f20d7 fix invisible icon not found 2013-03-18 00:10:23 +01:00
orbiter
243b66ae6d Merge branch 'master' of git://gitorious.org/~frankensteen91/yacy/frankensteen91s-yacy 2013-03-17 13:39:31 +01:00
Frank
7763f2554f add the new PPMbar in Crawler_p for a better style and better use. 2013-03-17 11:43:12 +01:00
orbiter
e4d26d1cb4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-17 10:52:42 +01:00
orbiter
940c6849ee enhanced did-you-mean (a bit): can now remember previously searched
words (plus small enhancements)
2013-03-17 10:52:31 +01:00
reger
d57b221921 add: button in IndexSchema_p to reset Solr schema field selection to default 2013-03-17 03:46:29 +01:00
Michael Peter Christen
9406a2e438 fixed NPE during index abstract computation 2013-03-15 10:04:27 +01:00
Michael Peter Christen
d725782440 turned severe message to warning message about network failure events 2013-03-15 09:40:02 +01:00
Michael Peter Christen
2d36a7eaf5 - do not create a new query for all remote peers
- no document search this time
- adjusted banner and network to not show 'WORDS' but DHT Chunks. This
is to avoid confusion for robinson peers which do not create Word
Entries
2013-03-15 00:14:28 +01:00
Michael Peter Christen
2080fc7406 removed unused tag fields 2013-03-14 10:35:21 +01:00
reger
7804c12976 fix error msg in ConfigHeuristics_p 2013-03-14 03:30:25 +01:00
reger
230a12bfe2 adjust Opensearch discover function to new webgraph Solr schema 2013-03-14 03:10:54 +01:00
orbiter
47114910d5 fix for possible memory leaks 2013-03-13 17:55:37 +01:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configuration
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00
Michael Peter Christen
68e739a90b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-10 02:29:38 +01:00
Michael Peter Christen
3d9ce9cd04 - added more selection criteria for network seed list
- enhanced up script
2013-03-10 02:26:24 +01:00
orbiter
168e8d9b4d added/fixed missing DOCTYPE line (submitted by Thomas) 2013-03-08 14:40:09 +01:00
Michael Peter Christen
25300913fa fixes to search debugging after testing with the different search
debugging options
2013-03-05 21:28:22 +01:00
Michael Peter Christen
2d472a39f4 DHT-transferred metadata and crawl receipts now also use the delayed
search cache to prevent excessive IO load on the peer during
search.
2013-03-04 00:07:52 +01:00
Michael Peter Christen
221ed7d764 - enhanced concurrency during search without IO blocking
- introduced a second queue to flush remote search results (now: old
metadata structure from DHT peers)
- fixed result counters
2013-03-03 22:38:50 +01:00
Marc Nause
2714b59f38 *) For some reason this seems to fix a ClassCastException on my system
(OpenJDK).
2013-03-03 20:38:20 +01:00
orbiter
0f7ea7ad9f - enhanced solr.add procedure for mass adds
- removed unused solr access classes
- made snippet generation for documents from YaCy RWI/DHT concurrent (as
it was before the removal of the search processes)
- reduced the number of remote results in settings file because
processing such mass document adds is too CPU-intensive (in Solr)
2013-03-01 15:27:17 +01:00
orbiter
7ff10bdb1b fix of page navigation for formatted totalcount numbers 2013-03-01 00:48:28 +01:00
orbiter
a734fbc4a5 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-02-27 22:44:57 +01:00
orbiter
d74472f562 corrected result counter 2013-02-27 22:40:23 +01:00
orbiter
aa3c26c62e added recrawl/reload to CrawlStartSite for a timeout of 3 days 2013-02-27 11:43:36 +01:00
orbiter
c1b7e61882 added option to create empty vocabularies 2013-02-27 08:24:37 +01:00
bubu
e0edad689d fix link to IndexSchema_p.html 2013-02-26 21:12:44 +01:00
Michael Peter Christen
c95a84103a complete redesign of search process:
- removed 'worker' processes
- no internal time-out behaviour: methods either are successful or
return null
- waiting is only done on top-level
- removed snippet-production; this is replaced by solr snippets
- removed statistics based on solr size queries (they took VERY
long); the statistics (like suggestions or tag cloud) are now again
based on the old but very fast RWI index. In portal or intranet mode the
RWI index is usually switched off; if you want statistics again,
you must switch the RWI index back on in this mode.
- fixed many bugs regarding correct page counter
2013-02-26 17:16:31 +01:00
Michael Peter Christen
35fa718b77 testing to use solr for portalsearch caused some bugfixing but no full
success: try to comment out the solr search request in
yacy-portalsearch.js
2013-02-25 14:31:50 +01:00
Michael Peter Christen
008288719c fix for schema export to consider also automatically generated
coordinate fields
2013-02-25 01:13:03 +01:00
Michael Peter Christen
089dee1770 - generalized SchemaConfiguration into super-class Configuration and
adapted other classes which used the configuration-only access for that
class
- removed many warnings
- adjusted logging
2013-02-25 00:09:41 +01:00
Michael Peter Christen
56d5946a59 - added flags in IndexFederated_p.html to switch on or off the webgraph
index (new solr core webgraph) .. this is now off by default
- completely redesigned this servlet
- added description how to attach a remote solr
- adjusted naming of servlet and menus
- moved 'lazy initialization' attribute back from IndexSchema to
IndexFederated (this is a general option).
2013-02-24 18:09:34 +01:00
Michael Peter Christen
14cceb6b17 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	htroot/IndexFederated_p.html
	source/net/yacy/cora/federate/solr/YaCySchema.java
	source/net/yacy/peers/Protocol.java
	source/net/yacy/search/Switchboard.java
	source/net/yacy/search/index/Segment.java

also moved portalsearch-dev to yacy-portalsearch to be able to fix
problems with new attachment to solr of the search widget
2013-02-23 08:48:33 +01:00
Michael Peter Christen
58e1e6fa2b fixes to schema 2013-02-23 08:14:10 +01:00
reger
d31a109efe remove obsolete Solr "commit within" input field from IndexFederated
see 4111606654
2013-02-22 22:03:32 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
The default schema uses only some of them and the resulting search index
now has the following properties:
- webgraph size will have about 40 times as many entries as the default
index
- the complete index size will increase and may be about double its
current size
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index allows some old parts of YaCy to be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times the size of the document index!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get full access to the new index, the API to solr now has two
access points: one with attribute core=collection1 for the default
search index and core=webgraph for the new webgraph search index. This is
also available for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
89ede0fe84 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-02-21 13:24:10 +01:00
Michael Peter Christen
91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
structure, but is not filled yet. To make a second core possible,
multi-core functionality had to be implemented in the
deep-embedded solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as addition to solr connections: all solr
connectors are now instances of a solr 'instance' object; this required
a complete re-design of the solr embedding
- migrated also caching and sharding on top of new instance handling
- migrated the search apis to handle now the access to a specific core,
the default core named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; they must
retrieve the actual connection object every time they want to write to
it (this also solves some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
2013-02-21 13:23:55 +01:00
orbiter
594ed63f2a fixed interactive search which caused an error if pubDate is not present
in a search result
2013-02-16 20:33:27 +01:00
Michael Peter Christen
98a4a4aa97 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-02-15 01:38:23 +01:00
Michael Peter Christen
b6de1f42dc Full redesign of solr connection architecture. This was done to support
multiple solr cores instead of just one. Therefore it is now necessary
to distinguish between solr server connections (called an 'Instance')
and a connection to a single solr core. One Instance may now have
multiple connector classes assigned to it, each connecting to a single
core.
To support multiple cores it is also necessary to distinguish between
the connection configuration and the configuration of the index schema.
We will have multiple schema configurations in the future, one for
each solr core. As a result, the IndexFederated servlet had to be
split into two parts; the new servlet for the schema editor is now
the IndexSchema servlet.
2013-02-15 01:38:10 +01:00
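The separation between a server connection and per-core connectors can be sketched roughly as follows. This is a hypothetical illustration, not YaCy's actual classes: the names `Instance`, `CoreConnector` and `select_url` are made up, the core names are borrowed from other entries in this log, and the URL layout follows the usual Solr convention of `<base>/<core>/select`.

```python
class Instance:
    """Hypothetical stand-in for a connection to one solr server."""

    def __init__(self, base_url, core_names):
        self.base_url = base_url.rstrip("/")
        self.core_names = list(core_names)

    def connector(self, core_name):
        """Return a fresh connector for one core of this server."""
        if core_name not in self.core_names:
            raise KeyError(f"unknown core: {core_name}")
        return CoreConnector(self, core_name)


class CoreConnector:
    """Hypothetical connection to a single core of an Instance."""

    def __init__(self, instance, core_name):
        self.instance = instance
        self.core_name = core_name

    def select_url(self):
        # Assumed Solr-style URL layout: <base>/<core>/select
        return f"{self.instance.base_url}/{self.core_name}/select"


# One server connection, several cores, each with its own connector.
instance = Instance("http://localhost:8983/solr/", ["collection1", "webgraph"])
for name in instance.core_names:
    print(instance.connector(name).select_url())
```

Callers in this sketch ask the instance for a connector each time instead of keeping a long-lived copy, which keeps the connection configuration in one place while each core can have its own schema.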
Marc Nause
efb6cf7d21 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2013-02-13 19:31:12 +01:00
Marc Nause
ce5b7afab2 *) removed Skype online indicator (was not working anymore)
*) updated ICQ URLs
2013-02-13 19:29:40 +01:00
Michael Peter Christen
4111606654 removed the commitWithin attribute because that is not the right
way for us to update the index. It may also be superfluous with
the solr 4.0 softcommit.
2013-02-13 02:29:47 +01:00
Michael Peter Christen
c20fa3640d fix to unbalanced tag and license for null objects 2013-02-13 01:23:05 +01:00
Michael Peter Christen
3a6097966d added jsonp option to yjson result writer 2013-02-13 01:11:57 +01:00
Michael Peter Christen
de58043205 Added image license generation for solr image search results when
results are generated within yjson result writer. This makes it possible
to view images in yacyinteractive from solr.
2013-02-13 00:33:53 +01:00
Michael Peter Christen
d3508fa8ff fixed json search, quotes, auto-facets, urls etc. for
yacyinteractive.html
2013-02-13 00:01:38 +01:00