Commit Graph

10057 Commits

Author SHA1 Message Date
orbiter
1ac504ae51 use html encoding for urls in metadata 2013-10-31 16:16:29 +01:00
reger
69599566f9 catch one more malformed url in proxy url rewrite 2013-10-27 04:42:33 +01:00
reger
605530fec5 catch proxy url rewrite exception
malformed url (" http:\/\/" ) may cause error response
 testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test
2013-10-27 04:06:11 +01:00
orbiter
aaa945518d next intermediate release 1.64 2013-10-26 01:31:26 +02:00
Michael Peter Christen
25951cee14 - fixed opensearchdescription, which delivered a URL with a missing
'global' option
- added display=2 to compare_yacy to remove the superfluous border
2013-10-26 00:34:55 +02:00
Michael Peter Christen
f1bfe64361 integrated startpage to compare_yacy 2013-10-26 00:33:36 +02:00
Michael Peter Christen
2f57327f20 added boolean load property to CacheResource_p servlet which causes
the servlet to load the page from the web.
2013-10-26 00:15:25 +02:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
orbiter
3c3cb78555 - removed a lot of garbage and bloated code from GuiHandler.
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
2013-10-24 20:42:34 +02:00
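The idea in the commit above (format each log line to a String at once and keep only a bounded number of lines) can be sketched as follows. This is a minimal illustration built on java.util.logging, not the actual GuiHandler code; the class name BoundedLogBuffer and the method snapshot() are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;

// Hypothetical sketch, not YaCy's GuiHandler: keeps the last N formatted lines.
final class BoundedLogBuffer extends Handler {
    private final int maxLines;
    private final Deque<String> lines = new ArrayDeque<>();

    BoundedLogBuffer(int maxLines) {
        this.maxLines = maxLines;
        setFormatter(new SimpleFormatter());
    }

    @Override
    public synchronized void publish(LogRecord record) {
        if (!isLoggable(record)) return;
        // Format immediately so that only the short String is retained,
        // not the LogRecord and everything it references.
        lines.addLast(getFormatter().format(record));
        // Drop the oldest lines once the limit is exceeded.
        while (lines.size() > maxLines) lines.removeFirst();
    }

    @Override
    public void flush() {
        // Nothing is buffered outside of 'lines', so there is nothing to write out.
    }

    @Override
    public synchronized void close() {
        lines.clear();
    }

    // Returns a copy of the stored lines, oldest first.
    synchronized List<String> snapshot() {
        return new ArrayList<>(lines);
    }

    public static void main(String[] args) {
        BoundedLogBuffer buffer = new BoundedLogBuffer(3);
        for (int i = 0; i < 5; i++) {
            buffer.publish(new LogRecord(Level.INFO, "line " + i));
        }
        System.out.println(buffer.snapshot().size() + " lines kept");
    }
}
```

Storing Strings instead of records trades later re-formatting flexibility for a much smaller footprint per line, which matches the motivation given in the commit message.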
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
6aabc4e5c8 reduced logging line memory, 10000 lines had filled up 450MB! grrr.
(thank you, a bomb from the past)
2013-10-24 16:17:53 +02:00
Michael Peter Christen
1a8783147b enhanced computation of number of solr documents. 2013-10-24 15:48:05 +02:00
Michael Peter Christen
4948c39e48 added concurrency for mass crawl check 2013-10-23 11:27:19 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together now make it possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
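The first fix above (prefer an existing mime type over the file extension when deciding the content domain) can be sketched roughly as below. This is an assumption-laden illustration, not the code from the commit; the class ContentDomainSketch, the enum Domain and the method classify are hypothetical names.

```java
import java.util.Locale;

// Hypothetical sketch, not YaCy's classifier: mime type wins over file extension.
final class ContentDomainSketch {
    enum Domain { TEXT, IMAGE, AUDIO, VIDEO, UNKNOWN }

    // Uses the mime type if one is known, and falls back to the extension otherwise.
    static Domain classify(String path, String mimeType) {
        if (mimeType != null && !mimeType.isEmpty()) {
            Domain byMime = fromMime(mimeType.toLowerCase(Locale.ROOT));
            if (byMime != Domain.UNKNOWN) return byMime;
        }
        return fromExtension(path);
    }

    private static Domain fromMime(String mime) {
        if (mime.startsWith("text/") || mime.startsWith("application/xhtml+xml")) return Domain.TEXT;
        if (mime.startsWith("image/")) return Domain.IMAGE;
        if (mime.startsWith("audio/")) return Domain.AUDIO;
        if (mime.startsWith("video/")) return Domain.VIDEO;
        return Domain.UNKNOWN;
    }

    private static Domain fromExtension(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) return Domain.UNKNOWN;
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "html": case "htm": case "txt": return Domain.TEXT;
            case "jpg": case "jpeg": case "png": case "gif": return Domain.IMAGE;
            case "mp3": case "ogg": return Domain.AUDIO;
            case "mp4": case "avi": return Domain.VIDEO;
            default: return Domain.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // A path ending in .jpg that is served as HTML is treated as text.
        System.out.println(classify("/wiki/File:Example.jpg", "text/html; charset=UTF-8"));
        // Without a mime type the extension decides.
        System.out.println(classify("/images/photo.jpg", null));
    }
}
```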
Michael Peter Christen
82621bead0 When doing bootstrapping, always accept one seedlist-File without
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
2013-10-22 15:34:51 +02:00
Michael Peter Christen
16e3b357b3 replaced old tag cloud and adopted design a bit 2013-10-22 14:20:17 +02:00
Michael Peter Christen
dc38d35986 added matching in url field in Table_API_p search 2013-10-22 12:46:10 +02:00
Michael Peter Christen
691d7e70fa added hint to development/commit rss feed 2013-10-21 15:16:29 +02:00
Michael Peter Christen
b81859c751 Show an RSS icon in the top right corner of search results. This replaces
the 'API' icon which was the link for the opensearch result which is an
extension of RSS. Since it is more appropriate to visualize an RSS link
with an RSS icon, this API icon was changed here.
2013-10-21 15:10:58 +02:00
Michael Peter Christen
1a09771be8 fixed sitemap crawl start 2013-10-21 12:49:32 +02:00
orbiter
b743e6d79f - prevent crawl filters from having empty (never-match) content
- rewrite the description of the options "Restrict to start domain(s)"
and "Restrict to sub-path(s)" to an explanation, that the restriction
applies to all links in the link list of the option "From Link-List of
URL" if this option is selected
- allow "Restrict to sub-path(s)" if the "From Link-List of URL" is
selected. This is supported in the crawl start.
2013-10-18 14:14:13 +02:00
orbiter
20bbde8665 fix for mustmatch regex computation: result had correct semantics, but
may have contained multiple same expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
2013-10-18 13:55:37 +02:00
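The kind of fix described in the commit above (removing repeated alternatives from a disjunction of domain restrictions) can be sketched as follows. The regex shape, the class MustMatchSketch and the method mustMatchForHosts are assumptions made for illustration; they are not taken from the YaCy source.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not YaCy's code: builds a must-match regex without duplicates.
final class MustMatchSketch {

    // Returns a disjunction with one alternative per distinct host.
    static String mustMatchForHosts(Collection<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        // LinkedHashSet drops duplicates while keeping the first-seen order.
        Set<String> alternatives = new LinkedHashSet<>();
        for (String host : hosts) {
            String h = host.toLowerCase(Locale.ROOT);
            // Treat "www.example.org" and "example.org" as the same restriction.
            if (h.startsWith("www.")) h = h.substring(4);
            alternatives.add("https?://(www\\.)?" + Pattern.quote(h) + "/.*");
        }
        return String.join("|", alternatives);
    }

    public static void main(String[] args) {
        String regex = mustMatchForHosts(
                Arrays.asList("example.org", "www.example.org", "yacy.net"));
        // Three hosts were given, but only two alternatives remain.
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "http://www.example.org/page"));
    }
}
```

The matching behavior is the same with or without the duplicates; the deduplicated form is only shorter, which is what the commit message describes.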
orbiter
f597fdb602 make it easier to filter properties (case insensitive) 2013-10-17 18:36:35 +02:00
Michael Peter Christen
c833d02cf5 fixed webgraph postprocessing (did nothing and repeated to do this...) 2013-10-16 11:49:04 +02:00
Michael Peter Christen
74d0256e93 enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
Michael Peter Christen
299f51cb7f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-10-16 04:26:19 +02:00
reger
e7a596afda Merge branch 'master' of git://gitorious.org/yacy/rc1.git 2013-10-16 02:28:13 +02:00
reger
37d24f3318 make use of declared static string ACTION_LOCATION 2013-10-16 02:25:39 +02:00
Michael Peter Christen
7b69c438f7 more methods for the table class 2013-10-15 16:46:59 +02:00
Michael Peter Christen
820b896146 Replaced the iframe loading from yacy.net for donations with the
loading of this iframe from the local host. To make this more flexible,
this iframe is loaded once after startup from yacy.net.
2013-10-15 16:46:06 +02:00
sixcooler
dfb73c9519 bump to httpclient-4.3.1 - a bugfix release 2013-10-14 23:32:24 +02:00
reger
0d4efabaa8 fix YaCy version string in proxy headers
(config parameter vString no longer used)
2013-10-13 17:56:53 +02:00
sixcooler
d9a02ed277 NPE fix for my last commit 2013-10-11 00:44:04 +02:00
sixcooler
61f627eb85 fix for ssl-connections from proxy-usage staying in close-wait-state
+ some extra 'close' in HttpClient
2013-10-10 20:57:37 +02:00
Michael Peter Christen
91fa99e9bb added new icon/image for latest commit 2013-10-09 22:07:59 +02:00
Michael Peter Christen
9fac9249bc - replaced 'edit' link with a clone symbol in Table_API_p since that is
what it does: it clones the crawl, it does not change the crawl.
- moved the appearance of this clone link to the type column since this
makes it visible also if the URL column is not visible.
2013-10-09 22:07:32 +02:00
Michael Peter Christen
0f6db6ad5b Merge remote-tracking branch 'jensbees/crawlexpert-post' 2013-10-09 21:32:27 +02:00
bhoerdzn
3fcf7a94c5 rolling back wrong merge 2013-10-09 21:06:11 +02:00
Jens Bertram
3252c1ec39 Merge upstream/master into crawlexpert-post 2013-10-09 20:49:14 +02:00
Michael Peter Christen
d328cc4a83 fix for didyoumean, added also more asian alphabets 2013-10-09 16:17:50 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
Jens Bertram
9f6b98d374 Merge master into crawlexpert-post 2013-10-09 14:39:20 +02:00
bhoerdzn
6e33be4ce6 reverting local changes to project.xml 2013-10-09 14:23:06 +02:00
bhoerdzn
a3824dfbaa check URL on initial load, if set 2013-10-09 13:52:44 +02:00
bhoerdzn
52f49d475b add a hidden field for "crawlingstart" since jQuery omits the submit button value 2013-10-09 13:38:20 +02:00
bhoerdzn
b0c0ec2dec link recorded crawl starts back to "CrawlStartExpert_p" in "Process Scheduler" 2013-10-09 12:55:42 +02:00
bhoerdzn
d64d45361c use integer types for boolean values 2013-10-09 12:42:04 +02:00
bhoerdzn
eda123d6fd remove debugging code intercepting post requests 2013-10-09 11:51:07 +02:00
bhoerdzn
5057f27bbd fix typo in parsing "cachePolicy" parameter 2013-10-09 11:41:15 +02:00