Commit Graph

4611 Commits

Author SHA1 Message Date
Michael Peter Christen
0db8e34625 enhanced webgraph processing 2013-12-04 01:54:45 +01:00
Michael Peter Christen
9d8b32c63a fixed a division by zero 2013-12-04 01:54:14 +01:00
Michael Peter Christen
957f6297fb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-30 01:46:03 +01:00
reger
effea4bca0 Merge origin/master into jetty
Conflicts:
	source/net/yacy/cora/federate/solr/SolrServlet.java
2013-11-29 22:39:52 +01:00
reger
b49e90d2e9 remove reference to solrServlet from YaCy servlet select
- reference is not used
- solrServlet is used in Jetty branch and adjustments there conflict with unused solrServlet here.
2013-11-29 22:10:14 +01:00
Michael Peter Christen
38e1e3a707 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-29 02:46:38 +01:00
sixcooler
2c2ebb0d92 tried some hardening in order not to leave any Solr searchers open 2013-11-29 02:40:12 +01:00
Michael Peter Christen
cca79d12ef setting of some default values to make it easy to start client development
using the description at
http://www.yacy-websuche.de/wiki/index.php/Dev:APIhello
2013-11-29 01:28:48 +01:00
Michael Peter Christen
3d4b5e66ce disallow remote robots to crawl the HostBrowser servlet 2013-11-26 07:06:25 +01:00
Michael Peter Christen
234ca720f5 only admins should be able to force a commit 2013-11-26 07:03:20 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix reworks the
access methods of the search word sets so that words cannot be deleted
from the sets from outside the QueryGoal class.
2013-11-26 02:24:47 +01:00
orbiter
61409788eb less word hash computations (removing some overhead because of MD5
calcs) using the clear word in a normalized form.
2013-11-25 15:20:54 +01:00
reger
5c4a3d1c01 Merge origin/master into jetty 2013-11-24 21:00:39 +01:00
Michael Peter Christen
caa20d63d9 fixed seedlist (hash was missing) 2013-11-22 14:15:52 +01:00
Michael Peter Christen
ccf2f4e43b refactoring of seed attributes (introduced more constants) 2013-11-22 14:15:31 +01:00
Michael Peter Christen
c927b428d3 fixed json 2013-11-22 10:07:08 +01:00
Michael Peter Christen
64048ff217 fix for XSS 2013-11-22 09:53:32 +01:00
orbiter
b7f1e5af51 added a new servlet which generates the same file that the principal peers
upload to a bootstrap position
 you can call it either with
 http://localhost:8090/yacy/seedlist.html
 or to generate json (or jsonp) with
 http://localhost:8090/yacy/seedlist.json
 http://localhost:8090/yacy/seedlist.json?callback=seedlist
2013-11-19 15:56:10 +01:00
orbiter
3e552550d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-18 22:48:00 +01:00
orbiter
c2d720cdaf purge a lucene cache - possible memory leak fix 2013-11-18 22:47:35 +01:00
reger
f111f30ace Merge origin/master into jetty 2013-11-17 00:18:25 +01:00
Michael Peter Christen
f4172cbb3d fix for another XSS bug 2013-11-17 00:17:25 +01:00
orbiter
ff86cb683f fixed some XSS bugs reported by Marius from http://ctf365.com/ 2013-11-16 20:34:31 +01:00
orbiter
19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
fceac8cffd more monitoring for postprocessing 2013-11-16 08:23:42 +01:00
Michael Peter Christen
9d5895f643 enhanced and fixed postprocessing 2013-11-15 15:41:12 +01:00
Michael Peter Christen
087df05e24 added option to Config_Network_p.html to enable remote search while
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226 set more loggers to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
69b8d61c47 fix for search requests in GSA interface which contain 'funny'
characters (like ':' etc.)
2013-11-12 15:54:54 +01:00
orbiter
4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-10 18:50:43 +01:00
orbiter
74c86a72a0 better default value for crawler user agent 2013-11-10 18:48:00 +01:00
reger
1437c45383 merge rc1/master 2013-11-07 21:30:17 +01:00
Michael Peter Christen
87a956e881 calculating and showing the number of files and the average size of a
file in the HTCACHE in ConfigHTCache_p.html
2013-11-07 12:13:12 +01:00
Michael Peter Christen
acc1f8a749 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-07 12:01:37 +01:00
Michael Peter Christen
81bb50118e found and fixed a huge memory leak in solr caching (inside Solr). The
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- a Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the Solr-internal cache and flushes it completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
sixcooler
987f410011 URL-export: add query and fix for ClassCastException 2013-11-06 19:22:26 +01:00
Michael Peter Christen
ffe8276063 replaced referrer link masking with 'pure' links to the referring page
(that was more useful during testing)
2013-11-06 18:05:46 +01:00
reger
b38de92a16 Merge origin/master into jetty 2013-11-02 00:48:42 +01:00
Michael Peter Christen
434e13b46d in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
orbiter
1ac504ae51 use html encoding for urls in metadata 2013-10-31 16:16:29 +01:00
reger
f017066197 Merge origin/master into jetty 2013-10-27 15:09:24 +01:00
Michael Peter Christen
25951cee14 - fixed opensearchdescription, which delivered a URL with a missing
'global' option
- added display=2 to compare_yacy to remove the superfluous border
2013-10-26 00:34:55 +02:00
Michael Peter Christen
f1bfe64361 integrated startpage to compare_yacy 2013-10-26 00:33:36 +02:00
Michael Peter Christen
2f57327f20 added boolean load property to CacheResource_p servlet which causes
the servlet to load the page from the web.
2013-10-26 00:15:25 +02:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
4948c39e48 added concurrency for mass crawl check 2013-10-23 11:27:19 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together make it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
16e3b357b3 replaced old tag cloud and adapted design a bit 2013-10-22 14:20:17 +02:00
Michael Peter Christen
dc38d35986 added matching in url field in Table_API_p search 2013-10-22 12:46:10 +02:00
Michael Peter Christen
691d7e70fa added hint to development/commit rss feed 2013-10-21 15:16:29 +02:00
Michael Peter Christen
b81859c751 Show an RSS icon in the top right corner of search results. This replaces
the 'API' icon which was the link for the opensearch result which is an
extension of RSS. Since it is more appropriate to visualize an RSS link
with an RSS icon, this API icon was changed here.
2013-10-21 15:10:58 +02:00
Michael Peter Christen
1a09771be8 fixed sitemap crawl start 2013-10-21 12:49:32 +02:00
orbiter
b743e6d79f - prevent crawl filters from having empty (never-match) content
- rewrite the description of the options "Restrict to start domain(s)"
and "Restrict to sub-path(s)" to an explanation, that the restriction
applies to all links in the link list of the option "From Link-List of
URL" if this option is selected
- allow "Restrict to sub-path(s)" if the "From Link-List of URL" is
selected. This is supported in the crawl start.
2013-10-18 14:14:13 +02:00
orbiter
f597fdb602 make it easier to filter properties (case insensitive) 2013-10-17 18:36:35 +02:00
reger
f46c723398 allow to choose used http server, YaCy-Anomic or Jetty
- defaults to Jetty (in this branch)
- add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking
2013-10-17 03:34:22 +02:00
reger
1adb4b8741 merge rc1/master 2013-10-16 03:02:21 +02:00
reger
37d24f3318 make use of declared static string ACTION_LOCATION 2013-10-16 02:25:39 +02:00
reger
eea504c117 update Info.plist
small DefaultServlet refactoring
2013-10-12 23:01:14 +02:00
reger
a44eede8b8 merge rc1/master 2013-10-11 01:50:25 +02:00
reger
54a0272338 searchpage javascript (latestinfo) causes reset of search statistic after moving to next page
- disabled call via setTimeout in yacysearch.html
2013-10-10 23:23:58 +02:00
Michael Peter Christen
91fa99e9bb added new icon/image for latest commit 2013-10-09 22:07:59 +02:00
Michael Peter Christen
9fac9249bc - replaced 'edit' link with a clone symbol in Table_API_p since that is
what it does: it clones the crawl, it does not change the crawl.
- moved the appearance of this clone link to the type column since this
makes it visible also if the URL column is not visible.
2013-10-09 22:07:32 +02:00
Michael Peter Christen
0f6db6ad5b Merge remote-tracking branch 'jensbees/crawlexpert-post' 2013-10-09 21:32:27 +02:00
Jens Bertram
3252c1ec39 Merge upstream/master into crawlexpert-post 2013-10-09 20:49:14 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
bhoerdzn
a3824dfbaa check URL on initial load, if set 2013-10-09 13:52:44 +02:00
bhoerdzn
52f49d475b add a hidden field for "crawlingstart" since jQuery omits the submit button value 2013-10-09 13:38:20 +02:00
bhoerdzn
b0c0ec2dec link recorded crawl starts back to "CrawlStartExpert_p" in "Process Scheduler" 2013-10-09 12:55:42 +02:00
bhoerdzn
d64d45361c use integer types for boolean values 2013-10-09 12:42:04 +02:00
bhoerdzn
eda123d6fd remove debugging code intercepting post requests 2013-10-09 11:51:07 +02:00
bhoerdzn
5057f27bbd fix typo in parsing "cachePolicy" parameter 2013-10-09 11:41:15 +02:00
bhoerdzn
98f5c9018d Fixed template vars for "deleteold". Fixed parsing "deleteold" parameter. Stop "setState" overwriting "deleteold" state on load. 2013-10-09 11:32:17 +02:00
bhoerdzn
a6a62986d4 correct state handling for country code restriction 2013-10-09 10:42:35 +02:00
bhoerdzn
4066b85155 correctly set initial state for load filters 2013-10-09 10:36:08 +02:00
bhoerdzn
8c91c3e7cd set form boolean values to 0 & 1 instead of false & true 2013-10-09 10:05:51 +02:00
bhoerdzn
c27fabc88e fixed wrong parameter check 2013-10-09 10:00:16 +02:00
bhoerdzn
2214bf5396 Remove some post parameters, if they are set to default values, as their values are already set by YaCy. Added some documentation. 2013-10-09 09:48:00 +02:00
reger
71d2655c02 downgrade to Jetty 8 to assure support of JRE 1.6
- introduce a YaCyHttp interface to modularize/separate http server
- adjust the Jetty version specific implementation part (in package net.yacy.http)
     - putting the version specific code in classes starting with Jetty8xxxx
     - moved existing Jetty9xxx implementation into a test class (to keep the code)
- adjust build to the changed jars
- make use of the introduced YaCyHttpServer interface in related htroot servlets

- adjust other test cases/classes
2013-10-09 00:40:48 +02:00
orbiter
705b3338ee list more fields available for search and for ranking boosts 2013-10-08 18:15:35 +02:00
bhoerdzn
405878182f Use list template for all other option lists. Fixed some template expressions. 2013-10-08 15:04:31 +02:00
bhoerdzn
8e74098cd4 Use list template for "reloadIfOlderNumber". 2013-10-08 13:26:09 +02:00
bhoerdzn
52bad7b908 Dynamic toggling of form fields, based on passed in and selected values. This will also cut down the post string by disabling not needed fields. 2013-10-08 13:24:27 +02:00
Michael Peter Christen
e56aa4fe93 fixed search navigation 2013-10-07 23:51:08 +02:00
Michael Peter Christen
4fbc4740df removed warnings 2013-10-07 23:41:50 +02:00
bhoerdzn
45cf553bc3 try to guess default crawling mode, if none set 2013-10-07 13:13:22 +02:00
bhoerdzn
b4f0c822f2 assign strings before checking contents 2013-10-07 13:01:39 +02:00
bhoerdzn
499abe8f91 set default values for string parameters 2013-10-07 12:32:23 +02:00
bhoerdzn
42ea56eaad made CrawlStartExpert_p aware of post variables; extended template where needed 2013-10-07 11:25:59 +02:00
reger
c7c706fd9f merge with rc1/master 2013-09-30 03:46:39 +02:00
Michael Peter Christen
82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00
orbiter
8ac2e8c8c9 added location navigator which makes the image to the map search
visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
2013-09-24 11:26:51 +02:00
orbiter
d86d2be5c3 automatically removed Places autotagging if no location library is
wanted
2013-09-24 11:23:45 +02:00
reger
5c4ba9b5db merge rc1 master 2013-09-22 02:21:24 +02:00
reger
70c51775ae Merge remote-tracking branch 'origin/master' into jetty 2013-09-22 02:09:02 +02:00
orbiter
d2effd21db fix for npe during location search 2013-09-21 21:03:58 +02:00
Michael Peter Christen
e40671ddb7 better and consistent deletions for error urls 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: the must-match pattern is now
applied to feed urls to filter out urls which are not in a
wanted domain
- delegatedURLs, which also used ZURLs, are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- as a result, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00