Commit Graph

10767 Commits

Author SHA1 Message Date
Michael Peter Christen
71efc76170 new default skin pdbootstrap which keeps the design shapes but slightly
changes the colours to match with bootstrap colours
2014-04-29 16:23:42 +02:00
Michael Peter Christen
bbadccbd8d better buttons 2014-04-29 16:22:31 +02:00
Michael Peter Christen
a9963d5c95 bootstrap update 2014-04-28 11:52:13 +02:00
Michael Peter Christen
b2bbb9a0b5 Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 2014-04-28 09:17:21 +02:00
reger
0b6db04e40 fix contentscraper img height/width parsing
prevent numberformat exception on common "100px" property

- include in test case
2014-04-28 04:59:47 +02:00
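The fix described in this commit can be illustrated with a minimal sketch. It assumes the goal is to read the leading integer of an attribute value such as "100px" without throwing; the class name DimensionAttr, the method parseLeadingInt, and the -1 fallback are hypothetical and not taken from the YaCy source.

```java
final class DimensionAttr {

    /**
     * Returns the leading run of ASCII digits in an attribute value
     * (e.g. 100 for "100px"), or -1 if the value does not start with a digit.
     * Hypothetical helper; the real contentscraper code may differ.
     */
    static int parseLeadingInt(String value) {
        if (value == null) return -1;
        String s = value.trim();
        int end = 0;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') {
            end++;
        }
        if (end == 0) return -1; // no leading digits, e.g. "auto"
        try {
            return Integer.parseInt(s.substring(0, end));
        } catch (NumberFormatException e) {
            return -1; // digit run does not fit into an int
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLeadingInt("100px")); // prints 100
        System.out.println(parseLeadingInt("auto"));  // prints -1
    }
}
```

Note that this sketch ignores the unit, so "100%" also yields 100; a caller that cares about units would have to inspect the remainder of the string.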
malykhin.dmitry
37424b0c42 Update russian translation 2014-04-28 01:54:34 +04:00
reger
4e57000a40 remove redundant javascript & id in index.html
to set focus to query field in IE11
2014-04-27 22:22:00 +02:00
reger
ffc5b75c73 optimize and fix lat / lon assignment 2014-04-27 20:52:06 +02:00
reger
9313447de2 reimplement tighter lat/lon calc in URIMetadataNode
from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272
2014-04-27 18:20:33 +02:00
reger
d812f80784 add exit proxy link to UrlProxy
On proxied pages, a link to exit the proxy is added to the top of the page.
The link text can be configured in the web.xml init-parameter (see default/web.xml). If it is missing, no link is displayed.
2014-04-26 22:27:59 +02:00
reger
78d08998db throw MalformedURLException on unknown protocol
for protocols other than the supported http, https, ftp, file, smb, \\ and mailto
2014-04-26 01:30:51 +02:00
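A minimal sketch of such a protocol check, assuming the protocol is the text before the first colon. The class name ProtocolCheck and the method requireSupportedProtocol are hypothetical; the sketch covers only the colon-delimited protocols from the commit message and leaves the \\ form out.

```java
import java.net.MalformedURLException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ProtocolCheck {

    // Protocols named in the commit message (the "\\" form is not handled here).
    private static final Set<String> SUPPORTED = new HashSet<>(
            Arrays.asList("http", "https", "ftp", "file", "smb", "mailto"));

    /** Returns the lower-cased protocol, or throws if it is missing or unsupported. */
    static String requireSupportedProtocol(String url) throws MalformedURLException {
        int colon = url.indexOf(':');
        if (colon <= 0) {
            throw new MalformedURLException("no protocol: " + url);
        }
        String protocol = url.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(protocol)) {
            throw new MalformedURLException("unknown protocol: " + protocol);
        }
        return protocol;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(requireSupportedProtocol("HTTPS://yacy.net")); // prints https
    }
}
```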
reger
bb8181b2be fix: resolve url without path but searchpart
e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/"
fixes http://mantis.tokeek.de/view.php?id=47

added test case for getHost
2014-04-25 20:15:55 +02:00
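The behaviour described in this commit can be sketched as follows: the host ends at the first '/', '?' or '#' after the "://" separator, and a missing path defaults to "/". The class HostSplit and the method hostAndPath are hypothetical names; the sketch ignores ports and user info and is not the YaCy implementation.

```java
final class HostSplit {

    /** Returns {host, path} for a URL of the form scheme://host[/path][?query][#fragment]. */
    static String[] hostAndPath(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) throw new IllegalArgumentException("no scheme: " + url);
        int start = schemeEnd + 3;

        // The host ends at the first '/', '?' or '#', not only at '/'.
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') { end = i; break; }
        }
        String host = url.substring(start, end);

        // A URL without an explicit path gets the root path.
        String path = "/";
        if (end < url.length() && url.charAt(end) == '/') {
            int pathEnd = url.length();
            for (int i = end; i < url.length(); i++) {
                char c = url.charAt(i);
                if (c == '?' || c == '#') { pathEnd = i; break; }
            }
            path = url.substring(end, pathEnd);
        }
        return new String[] { host, path };
    }

    public static void main(String[] args) {
        String[] parts = hostAndPath("http://yacy.net?q=test");
        System.out.println(parts[0] + " " + parts[1]); // prints yacy.net /
    }
}
```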
reger
121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
this allows loading to continue with the next resumptionToken even if the import file caused a SAX parser error
fix http://mantis.tokeek.de/view.php?id=63
2014-04-25 01:05:28 +02:00
reger
81dc2aa536 add current css to HTMLResponseWriter to fix metadata view
(using css from metas.template except js links)
2014-04-23 23:41:10 +02:00
orbiter
2fd8a0ead6 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-23 23:13:23 +02:00
orbiter
8e5ce7cd51 fixed a situation where finished crawls had not been detected. 2014-04-23 23:13:07 +02:00
orbiter
c6f0bd05f8 better removal of stored urls when doing a crawl start 2014-04-23 23:12:08 +02:00
orbiter
2f63bd0261 enhanced Host Balancer strategy: fair round robin 2014-04-23 23:11:37 +02:00
orbiter
0c88a32c36 do not apply lazy value instantiation for numeric or boolean values
because that is misleading and confusing in case of 0- or false-values
and may cause NPEs in retrieval functions.
2014-04-23 08:41:36 +02:00
orbiter
8e04030596 in case of short memory, do not cut down robinson peers to 1, just
reduce by 50%
2014-04-23 08:37:19 +02:00
reger
86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
- some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags,
remove all tags for text property (inline img tags are still parsed)
- added test case for above (to htmlParserTest)
- fix solr test case
2014-04-23 00:55:16 +02:00
orbiter
469e0a62f1 added new button to terminate all crawls 2014-04-22 23:14:54 +02:00
orbiter
ccb1864d55 catch IllegalArgumentException for wrong process types (that is needed
for migrations when new process types are introduced or disappear)
2014-04-22 23:14:05 +02:00
orbiter
4ee4ba1576 fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of
lazy value instantiation of 0-value in crawldepth_i
2014-04-22 19:48:49 +02:00
orbiter
12ba890205 removed warnings 2014-04-22 19:35:15 +02:00
reger
d51f9cc863 add custom Jetty errorhandler
to provide custom error page footer line
- remove redundant mime check in UrlProxyServlet
2014-04-21 17:28:21 +02:00
reger
c193a02023 defer creation of new ArrayList after possible early return
(to skip not used object allocation)
2014-04-21 17:16:06 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
reger
79e7947442 - remove empty http0_9 status text array
and unused default_charset = ISO-8859-1
2014-04-18 22:03:16 +02:00
reger
2dabe2009d - remove unused manual http KeepAlive config
(reducing references to obsolete httpdemon)
- add port info to settings_http
2014-04-18 19:57:35 +02:00
Michael Peter Christen
5746aae3db add canonical links to the same crawldepth, not the next crawldepth 2014-04-18 06:51:46 +02:00
Michael Peter Christen
74ab5ef9fa increased runtime for postprocessing query job 2014-04-18 06:51:10 +02:00
Michael Peter Christen
8b32dd5f9e special strategy for balancer: do not remove targets with zero wait time
from the queue
2014-04-18 06:50:07 +02:00
Michael Peter Christen
9c6228d948 fix for deadlocks in crawler 2014-04-17 16:58:17 +02:00
Michael Peter Christen
7a2f3e2353 increased resource.disk.used.max.steadystate and
resource.disk.used.max.overshot by 4 times because first users reached
that limit and wondered why the crawler was paused automatically :)

The crawler will now stop at 2TB disk usage :)
2014-04-17 16:19:38 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
7fefebaeca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-17 12:55:38 +02:00
Michael Peter Christen
c2f62e783f - better subgraph handling, less overhead for crawls without the
webgraph
- usage of crawler crawldepth cache for the linkgraph target depth
computation
2014-04-17 12:54:18 +02:00
Michael Peter Christen
06afb568e2 new Strategies in Balancer:
- doublecheck cache now records the crawl depth as well
- doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with lowest depth first, instead all hosts
which have only singleton entries are preferred to reduce the number of
files.
2014-04-17 12:52:54 +02:00
Michael Peter Christen
1aea01fe5b fix for Table in case that requested file does not exist and paths also
do not exist
2014-04-17 12:44:05 +02:00
reger
710054bb37 implement gzip input handling directly in defaultservlet
(making reference to legacy httpdemon obsolete)
2014-04-17 03:20:29 +02:00
Michael Peter Christen
b4b0d14c04 fix for display bug 2014-04-16 22:24:04 +02:00
Michael Peter Christen
9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
2014-04-16 22:16:20 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). Using such
on-demand file readers should keep the number of open file pointers
below the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
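The queue layout described in this commit can be sketched in memory as one queue per host and crawl depth. This is a simplified illustration only: it keeps everything in maps instead of files, takes the lowest depth within the chosen host, and rotates over hosts round-robin. The class HostQueuesSketch and its methods are hypothetical and do not mirror the YaCy HostBalancer or HostQueues classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class HostQueuesSketch {

    // host:port -> (crawl depth -> urls waiting at that depth)
    private final Map<String, TreeMap<Integer, Deque<String>>> queues = new HashMap<>();
    // rotation order of hosts that still have urls
    private final Deque<String> hostOrder = new ArrayDeque<>();

    void push(String hostAndPort, int depth, String url) {
        TreeMap<Integer, Deque<String>> byDepth = queues.get(hostAndPort);
        if (byDepth == null) {
            byDepth = new TreeMap<>();
            queues.put(hostAndPort, byDepth);
            hostOrder.addLast(hostAndPort);
        }
        byDepth.computeIfAbsent(depth, d -> new ArrayDeque<>()).addLast(url);
    }

    /** Returns the next url, or null if all queues are empty. */
    String pop() {
        String host = hostOrder.pollFirst();
        if (host == null) return null;
        TreeMap<Integer, Deque<String>> byDepth = queues.get(host);
        // Within the host, always take from the lowest crawl depth first.
        Map.Entry<Integer, Deque<String>> lowest = byDepth.firstEntry();
        String url = lowest.getValue().pollFirst();
        if (lowest.getValue().isEmpty()) byDepth.remove(lowest.getKey());
        if (byDepth.isEmpty()) {
            queues.remove(host);     // host is done
        } else {
            hostOrder.addLast(host); // host goes to the back of the rotation
        }
        return url;
    }

    public static void main(String[] args) {
        HostQueuesSketch q = new HostQueuesSketch();
        q.push("a.example:80", 2, "http://a.example/deep");
        q.push("a.example:80", 0, "http://a.example/");
        q.push("b.example:80", 1, "http://b.example/page");
        for (String url = q.pop(); url != null; url = q.pop()) {
            System.out.println(url);
        }
    }
}
```

In this sketch, each host is visited once per rotation, so a host with many queued urls cannot starve the others.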
Michael Peter Christen
075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
2014-04-14 13:32:35 +02:00
Michael Peter Christen
8470dfe3f8 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-14 12:17:52 +02:00
reger
46016fa153 autoupdate fails to download latest release (1.71) due to default release blacklist
- removed the default version blacklist regex from init (for future versions)

!!! left existing update blacklist setting untouched !!!
(existing installations wanting autoupdate for 1.71 need to change the blacklist in ConfigUpdate_p.html)

- moved old blacklist patch to migration.java
2014-04-13 07:32:32 +02:00
Michael Peter Christen
8aeef73d49 fix for virtual root nodes 2014-04-11 15:12:34 +02:00
Michael Peter Christen
7c7fbb9818 find depth-matches also for edge targets 2014-04-11 12:27:21 +02:00
Michael Peter Christen
dd12dd392f introduction of a data structure for HyperlinkEdges which should use
less memory as it does no double-storage of source links for each edge
of the graph.
2014-04-11 12:09:33 +02:00