Commit Graph

7242 Commits

Author SHA1 Message Date
Michael Peter Christen
c1c1be8f02 fix for slow crawling and better logging in balancer 2014-04-29 19:50:33 +02:00
Michael Peter Christen
3acf416335 npe fix 2014-04-29 19:24:05 +02:00
reger
2eb7682772 add html5 audio/video <source> tag to html content scraper
- <source src=.. type=..> tag content is added to embed collection
2014-04-29 00:41:29 +02:00
reger
0b6db04e40 fix contentscraper img height/width parsing
prevent numberformat exception on common "100px" property

- include in test case
2014-04-28 04:59:47 +02:00
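Note on the commit above: a value such as "100px" makes Integer.parseInt throw a NumberFormatException. The following is a minimal sketch of one tolerant way to read such attributes, not the actual content scraper code; the class name, the method name and the -1 fallback are assumptions made for illustration.

```java
final class DimensionParser {
    /**
     * Reads the leading ASCII digits of an HTML width/height attribute value.
     * Returns -1 (a hypothetical "unknown" marker) if there are none.
     */
    static int parseDimension(String value) {
        if (value == null) return -1;
        String s = value.trim();
        int end = 0;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') end++;
        if (end == 0) return -1; // e.g. "auto" or an empty attribute
        try {
            return Integer.parseInt(s.substring(0, end));
        } catch (NumberFormatException e) {
            return -1; // digit run too long for an int
        }
    }
}
```

With this approach "100px" and "100" both yield 100, while "auto" falls back to the marker value instead of raising an exception.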
reger
ffc5b75c73 optimize and fix lat / lon assignment 2014-04-27 20:52:06 +02:00
reger
9313447de2 reimplement tighter lat/lon calc in URIMetadataNode
from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272
2014-04-27 18:20:33 +02:00
reger
d812f80784 add exit proxy link to UrlProxy
on proxied pages a link to exit proxy is added to top of page.
Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.
2014-04-26 22:27:59 +02:00
reger
78d08998db throw MalformedURLException on unknown protocol
on other than the supported   http https ftp file smb \\  mailto
2014-04-26 01:30:51 +02:00
reger
bb8181b2be fix: resolve url without path but searchpart
e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/"
fixes http://mantis.tokeek.de/view.php?id=47

added test case for getHost
2014-04-25 20:15:55 +02:00
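Note on the commit above: the described behaviour (host ends at the first "/" or "?", an empty path becomes "/") can be sketched as follows. This is an illustrative helper under stated assumptions, not the project's URL class; the class and method names are hypothetical, and ports and user info are deliberately left inside the host part.

```java
final class HostPathSplitter {
    /** Returns {host, path, query} for a URL string of the form scheme://... */
    static String[] split(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) throw new IllegalArgumentException("no scheme: " + url);
        String rest = url.substring(schemeEnd + 3);
        int hash = rest.indexOf('#');
        if (hash >= 0) rest = rest.substring(0, hash); // ignore the fragment
        // the host part ends at the first '/' or '?', whichever comes first
        int end = rest.length();
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c == '/' || c == '?') { end = i; break; }
        }
        String host = rest.substring(0, end);
        String tail = rest.substring(end);
        int q = tail.indexOf('?');
        String path = q < 0 ? tail : tail.substring(0, q);
        String query = q < 0 ? "" : tail.substring(q + 1);
        if (path.isEmpty()) path = "/"; // a URL without path resolves to the root
        return new String[] { host, path, query };
    }
}
```

For http://yacy.net?q=test this yields host "yacy.net", path "/" and query "q=test".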
orbiter
a3542f29b4 npe fix 2014-04-25 09:26:20 +02:00
orbiter
c48d2a2a02 npe fix 2014-04-25 09:23:10 +02:00
reger
121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
this allows to continue loading next resumptionToken even if import file caused sax parser error
fix http://mantis.tokeek.de/view.php?id=63
2014-04-25 01:05:28 +02:00
reger
81dc2aa536 add current css to HTMLResponseWriter to fix metadata view
(using css from metas.template except js links)
2014-04-23 23:41:10 +02:00
orbiter
2fd8a0ead6 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-23 23:13:23 +02:00
orbiter
8e5ce7cd51 fixed a situation where finished crawls had not been detected. 2014-04-23 23:13:07 +02:00
orbiter
2f63bd0261 enhanced Host Balancer strategy: fair round robin 2014-04-23 23:11:37 +02:00
orbiter
0c88a32c36 do not apply lazy value instantiation for numeric or boolean values
because that is misleading and confusing in case of 0- or false-values
and may cause NPEs in retrieval functions.
2014-04-23 08:41:36 +02:00
orbiter
8e04030596 in case of short memory, do not cut down robinson peers to 1, just
reduce by 50%
2014-04-23 08:37:19 +02:00
reger
86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
- some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags,
remove all tags for text property (inline img tags are still parsed)
- added test case for above (to htmlParserTest)
- fix solr test case
2014-04-23 00:55:16 +02:00
orbiter
ccb1864d55 catch IllegalArgumentException for wrong process types (that is needed
for migrations when new process types are introduced or disappear)
2014-04-22 23:14:05 +02:00
orbiter
4ee4ba1576 fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of
lazy value instantiation of 0-value in crawldepth_i
2014-04-22 19:48:49 +02:00
orbiter
12ba890205 removed warnings 2014-04-22 19:35:15 +02:00
reger
d51f9cc863 add custom Jetty errorhandler
to provide custom error page footer line
- remove redundant mime check in UrlProxyServlet
2014-04-21 17:28:21 +02:00
reger
c193a02023 defer creation of new ArrayList after possible early return
(to skip not used object allocation)
2014-04-21 17:16:06 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
reger
79e7947442 - remove empty http0_9 status text array
and unused default_charset = ISO-8859-1
2014-04-18 22:03:16 +02:00
reger
2dabe2009d - remove unused manual http KeepAlive config
(reducing references to obsolete httpdemon)
- add port info to settings_http
2014-04-18 19:57:35 +02:00
Michael Peter Christen
5746aae3db add canonical links to the same crawldepth, not the next crawldepth 2014-04-18 06:51:46 +02:00
Michael Peter Christen
74ab5ef9fa increased runtime for postprocessing query job 2014-04-18 06:51:10 +02:00
Michael Peter Christen
8b32dd5f9e special strategy for balancer: do not remove targets with zero wait time
from the queue
2014-04-18 06:50:07 +02:00
Michael Peter Christen
9c6228d948 fix for deadlocks in crawler 2014-04-17 16:58:17 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
7fefebaeca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-17 12:55:38 +02:00
Michael Peter Christen
c2f62e783f - better subgraph handling, less overhead for crawls without the
webgraph
- usage of crawler crawldepth cache for the linkgraph target depth
computation
2014-04-17 12:54:18 +02:00
Michael Peter Christen
06afb568e2 new Strategies in Balancer:
- doublecheck cache now records the crawl depth as well
- doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with lowest depth first, instead all hosts
which have only singleton entries are preferred to reduce the number of
files.
2014-04-17 12:52:54 +02:00
Michael Peter Christen
1aea01fe5b fix for Table in case that requested file does not exist and paths also
do not exist
2014-04-17 12:44:05 +02:00
reger
710054bb37 implement gzip input handling directly in defaultservlet
(making reference to legacy httpdemon obsolete)
2014-04-17 03:20:29 +02:00
Michael Peter Christen
9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
2014-04-16 22:16:20 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
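Note on the commit above: the queue layout it describes (one queue per host, split by crawl depth, lowest depth first, hosts served in turn) can be sketched in memory as follows. This is only one possible reading of the description, not the HostBalancer implementation; the class and method names are hypothetical, and the real code keeps its queues in files under the QUEUES folder rather than in maps.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class HostQueueSketch {
    // "host:port" -> (crawl depth -> pending urls); insertion order is the rotation order
    private final Map<String, TreeMap<Integer, Deque<String>>> hosts = new LinkedHashMap<>();

    void push(String hostPort, int depth, String url) {
        hosts.computeIfAbsent(hostPort, h -> new TreeMap<>())
             .computeIfAbsent(depth, d -> new ArrayDeque<>())
             .addLast(url);
    }

    /** Returns the next url to load, or null if nothing is queued. */
    String pop() {
        Iterator<Map.Entry<String, TreeMap<Integer, Deque<String>>>> it = hosts.entrySet().iterator();
        if (!it.hasNext()) return null;
        Map.Entry<String, TreeMap<Integer, Deque<String>>> entry = it.next();
        it.remove(); // take the host from the front of the rotation
        TreeMap<Integer, Deque<String>> depths = entry.getValue();
        Map.Entry<Integer, Deque<String>> lowest = depths.firstEntry(); // minimal crawl depth
        String url = lowest.getValue().pollFirst();
        if (lowest.getValue().isEmpty()) depths.remove(lowest.getKey());
        if (!depths.isEmpty()) hosts.put(entry.getKey(), depths); // re-append at the back
        return url;
    }
}
```

Each call serves a different host until the rotation wraps around, and within a host the shallowest url always goes first.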
Michael Peter Christen
075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
2014-04-14 13:32:35 +02:00
Michael Peter Christen
8470dfe3f8 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-14 12:17:52 +02:00
reger
46016fa153 autoupdate fails to download latest release (1.71) due to default release blacklist
- removed the default version blacklist regex from init (for future versions)

!!!  left existing update  blacklist setting untouched !!! 
(existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html)

- moved old blacklist patch to migration.java
2014-04-13 07:32:32 +02:00
Michael Peter Christen
8aeef73d49 fix for virtual root nodes 2014-04-11 15:12:34 +02:00
Michael Peter Christen
7c7fbb9818 find depth-matches also for edge targets 2014-04-11 12:27:21 +02:00
Michael Peter Christen
dd12dd392f introduction of a data structure for HyperlinkEdges which should use
less memory as it does no double-storage of source links for each edge
of the graph.
2014-04-11 12:09:33 +02:00
Michael Peter Christen
6ea8bb7348 using MultiProtocolURL for edge data which is faster (hash computation
is now much easier) and smaller in size
2014-04-11 10:58:37 +02:00
Michael Peter Christen
b21c208b4d enhanced hashcode computation for MultiProtocolURL 2014-04-11 10:23:48 +02:00
Michael Peter Christen
ce1d1b2fa0 fix for maximum tag length in parser 2014-04-11 09:56:44 +02:00
Michael Peter Christen
17e0956312 refactoring of SystemLoad calls (only one backend tool) 2014-04-11 09:25:18 +02:00
Michael Peter Christen
a37d067692 refactoring 2014-04-10 23:46:35 +02:00
orbiter
95780eed32 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-10 21:40:54 +02:00
Michael Peter Christen
67beef657f strong redesign of html parser: object recursion is now made using a
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produced better parsed image objects with exact anchor text
references.
2014-04-10 18:58:03 +02:00
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages 2014-04-10 09:08:59 +02:00
Michael Peter Christen
9e503b3376 also delete the robots.txt file from the cache when a new crawl is
started
2014-04-09 21:59:54 +02:00
orbiter
67501c9dda Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-09 19:58:54 +02:00
Michael Peter Christen
1c21b3256d fix for robots.txt handling: delete old entry before starting a new
crawl.
2014-04-09 18:33:48 +02:00
orbiter
c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis 2014-04-09 17:52:51 +02:00
Michael Peter Christen
8068e68474 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-09 12:45:15 +02:00
Michael Peter Christen
bd886054cb new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
2014-04-09 12:45:04 +02:00
reger
f326a67561 fix: typo in default charset in metadata2solr
update pom and NB build to Solr 4.7.1 libs
2014-04-06 22:31:22 +02:00
Michael Peter Christen
df138084c0 do solr optimization independently from memory and load constraints:
- not doing an optimization will likely cause a too many files exception
- without optimization performance will be even worse which would
prevent optimization in the future as well (prevent a deadlock
situation)
2014-04-06 11:04:23 +02:00
Michael Peter Christen
ebd44a7080 replaced solr 4.6.1 with solr 4.7.1 and added index migration to
lucene_47
2014-04-06 10:45:03 +02:00
Michael Peter Christen
734778c0c8 fixed a time-out problem in the default servlet which is also a logging
problem because the error log showed the wrong reason (file not found)
instead of the actual reason (time-out).
2014-04-04 15:27:29 +02:00
Michael Peter Christen
466d90ad42 fixed a problem with resource observer; probably coming from uncaught
exceptions within the apache library which appear only in concurrency
environments.
2014-04-04 15:26:39 +02:00
Michael Peter Christen
e8ddd415a8 enhanced the new link structure graph 2014-04-04 14:43:54 +02:00
Michael Peter Christen
926d28dd3f fixed a bug which prevented crawl starts after a network switch 2014-04-04 14:43:35 +02:00
Michael Peter Christen
3ce8eff21b another fix for inbound/outbound detection 2014-04-04 12:41:59 +02:00
Michael Peter Christen
d4b5c457e4 NPE fix 2014-04-04 12:34:34 +02:00
Michael Peter Christen
36a66b0704 fix for parsing of numeric value in case that boolean values are given 2014-04-04 11:59:51 +02:00
orbiter
41730c8048 better logging in template engine: shows filename of servlets where
errors in templates occur
2014-04-04 10:55:46 +02:00
orbiter
3c1274057d fixed thread dump in case of wrong seeds 2014-04-04 10:54:56 +02:00
orbiter
18f9c40302 moved Edge class out of linkstructure servlet as this does not work on
non-eclipse driven environments (all non-dev cases)
2014-04-04 10:54:11 +02:00
orbiter
de95e5e524 reduced search activity corona strength in network image 2014-04-04 10:08:44 +02:00
reger
da413af664 move baseurl after parsing orig source in urlproxyservlet
to calculate absolute href links for rewrite from unmodified source.
2014-04-04 03:11:16 +02:00
reger
af6ad20728 fix: remove obsolete ref to yacy.home
(use Switchboard instead)
2014-04-04 02:45:04 +02:00
Michael Peter Christen
74ab094587 fix for solr query size; too many documents had been retrieved in case
that less than _pagesize_ had been requested.
2014-04-03 13:42:10 +02:00
Michael Peter Christen
c64c10ef00 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-03 01:58:06 +02:00
Michael Peter Christen
48fbfa60c1 bugfix to inbound/outbound identification 2014-04-03 01:21:43 +02:00
reger
227c42bc96 eliminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
2014-04-03 00:35:15 +02:00
Michael Peter Christen
cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value which may
be shorter in case that the crawler did not take the shortest path to
the document.
2014-04-02 23:37:01 +02:00
orbiter
b1ba764d81 fix for first start options and added german translation for popup texts 2014-04-02 17:10:59 +02:00
orbiter
429a874222 - added COLS field in GSA response (non-gsa standard by customer
request)
- updated document link in GSA response writer
2014-04-02 16:05:44 +02:00
Michael Peter Christen
1b9ec9a1c5 - added popover to p2p/stealth mode button to explain the peer mode and
privacy issues.
- added popover to first-time use case to explain that specific servlets
are only visible after customization and/or crawl starts
2014-04-02 13:33:43 +02:00
Michael Peter Christen
62a36fa584 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-02 03:27:08 +02:00
reger
c9f92abddc fix: application link count
(URIMetadataNode)
2014-04-02 03:21:51 +02:00
Michael Peter Christen
a267c46e1a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-02 02:35:58 +02:00
Michael Peter Christen
5b83887da8 npe fix 2014-04-02 02:34:55 +02:00
Michael Peter Christen
63c9fcf3e0 free configuration of postprocessing clickdepth maximum depth and time 2014-04-02 02:34:39 +02:00
Michael Peter Christen
39b641d6cd added tutorial mode - some menu items will only appear if you 'qualify'
for them. Thus, the first-time user will only see four menu items. The
other items will unfold as the user interacts.
2014-04-02 02:33:17 +02:00
sixcooler
f06775850f fix receiving DHT / parse multipart
+ another close to fix possible resource leak warning
2014-04-02 01:24:15 +02:00
reger
49e76a1c55 make use of detected charset in htmlParser if none is given. 2014-04-01 04:02:34 +02:00
reger
e11504309f adding a hint to javascript browser short cut on Url-Proxy page (AugmentedBrowsing_p.html) 2014-03-30 05:11:42 +02:00
reger
b12200cafe alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules
- use JSoup parser for selective rewrite of html body <a href=  links only,
instead of regex which rewrites also header href/src links
- this improves display of pages which use header <base> tag
- tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer
Disadvantage: scripting links will drop out of proxy

Setting of the servlet through web.xml exclusively (in case one would like to quickly switch back to the YaCyProxyServlet,
leaving the existing code of YaCyProxyServlet untouched available)
2014-03-30 04:04:02 +02:00
reger
2953ebe701 fix: port in local target address
& button style
2014-03-29 00:34:01 +01:00
Michael Peter Christen
fda591695c fixed visibility of custom icon 2014-03-28 17:25:39 +01:00
Michael Peter Christen
a9b9950d7f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-03-28 14:48:32 +01:00
Michael Peter Christen
b488f33975 added close to fix possible resource leak warning 2014-03-28 14:34:49 +01:00
Michael Peter Christen
56710ecb26 prevent opening of new files as that could be a cause for the latest
too-many-open-files exception. The old file is just truncated if the
table is cleaned.
2014-03-28 14:31:43 +01:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
reger
d7055904a6 fix: proxyservlet path header setting 2014-03-28 02:05:58 +01:00
Michael Peter Christen
e515dd460d added linkscount_i and linksnofollowcount_i to the default solr schema 2014-03-27 23:36:08 +01:00
Michael Peter Christen
1a764135be one more Thread Dump fix for new bootstrap css style 2014-03-27 23:01:28 +01:00
Michael Peter Christen
bb21d825f9 fix for thread dump line spacing 2014-03-27 22:13:37 +01:00
Michael Peter Christen
cbdfef7ce1 changed protocol facet to show also all other counts if one facet is
selected
2014-03-27 13:29:14 +01:00
reger
b9056ef2db remove unused private header entries (HeaderFramework)
X_YACY_ORIGINAL_REQUEST_LINE
X_YACY_KEEP_ALIVE_REQUEST_COUNT
CONNECTION_PROP_REQUESTLINE
2014-03-26 23:28:19 +01:00
sixcooler
6d16fa993d make transparent proxy handle https-connections:
the implemented handle for connect did not work for me - so lets try the
connectHandler
2014-03-26 20:01:15 +01:00
Michael Peter Christen
61ad194065 fix for source and target clickdepth in webgraph index 2014-03-26 16:00:05 +01:00
Marc Nause
809b4e1fd9 Team added support for URLs with unicode characters in host part to
blacklist. Punycode is used to handle unicode characters.
2014-03-25 22:14:54 +01:00
reger
b126b9ba17 add some InputFileStream close at end of reads
to make sure file is released
2014-03-24 02:32:17 +01:00
reger
ca7444dbdf limit filetype nav to known extension also on image/media search
- on text search we limit filetype nav already to known extension, apply filter to image search
2014-03-23 23:10:29 +01:00
reger
651d057e93 surrogate import translate dc:language 3-char codes
OAI records often use 3-char language codes, start converting some 3-char lang's to the internal ISO639-1 2-char code
2014-03-23 00:40:36 +01:00
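Note on the commit above: converting 3-char language codes to 2-char ISO 639-1 codes amounts to a lookup table. The following is a minimal sketch, not the importer's code; the class name, the method name and the small selection of codes are assumptions, and unknown codes are passed through unchanged.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class LanguageCodeSketch {
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();
    static {
        THREE_TO_TWO.put("eng", "en");
        THREE_TO_TWO.put("ger", "de"); // ISO 639-2 bibliographic code
        THREE_TO_TWO.put("deu", "de"); // ISO 639-2 terminologic code
        THREE_TO_TWO.put("fre", "fr");
        THREE_TO_TWO.put("fra", "fr");
        THREE_TO_TWO.put("spa", "es");
        THREE_TO_TWO.put("ita", "it");
    }

    /** Maps a known 3-char code to its 2-char form; other values are only lower-cased. */
    static String normalize(String code) {
        if (code == null) return null;
        String c = code.trim().toLowerCase(Locale.ROOT);
        if (c.length() == 3) {
            String two = THREE_TO_TWO.get(c);
            return two != null ? two : c;
        }
        return c;
    }
}
```

Both the bibliographic and the terminologic variant need an entry where they differ, because OAI records use either.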
orbiter
22618e3ba2 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-03-21 20:28:50 +01:00
orbiter
01989f6af9 restrict write buffer size to a limit 2014-03-21 20:28:34 +01:00
Michael Peter Christen
d1091e79f8 - added stealth button to navigation menu
- more fixes to progress bar
2014-03-21 18:01:26 +01:00
reger
c297de5145 remove check for unused virtual path /currentyacypeer/
-  del jqueryheader.template (not used)
2014-03-21 03:02:19 +01:00
orbiter
3c8d6e1eee added adminAccount switch to ConfigAccounts_p servlet to switch on
protection of all pages; some refactoring as well
2014-03-20 22:11:49 +01:00
orbiter
7d24bcb98d added flag to require that all web pages, even such without a "_p"
extension require authorization. (default off)
2014-03-20 19:09:47 +01:00
Michael Peter Christen
7a6658abec removed synchronization in embedded solr connection (that was probably
a mistake?)
2014-03-19 16:21:03 +01:00
Michael Peter Christen
a7d4379ef9 fixed shutdown of solr cores in case that more than one local core is to
be closed (this happens if webgraph is enabled and the index is dumped
using /IndexControlURLs_p.html)
2014-03-19 12:23:40 +01:00
Michael Peter Christen
453bfd0f17 removed unused variables and warnings 2014-03-19 09:29:01 +01:00
Michael Peter Christen
05655d98df Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-03-17 11:57:01 +01:00
reger
9f02d2c47b fix: remove link to triplestore in Vocabulary_p (triplestore no longer exists)
- should be investigated in more detail to look for additional implications
Remove "yacyaction" from proxyservlet as it was only needed for removed interaction routines.
2014-03-16 22:11:19 +01:00
reger
81a846ec33 fix: set YaCy CONNECTION_PROP_HOST Header in ProxyServlet to host incl. port 2014-03-16 20:51:32 +01:00
reger
251be9ecfa remove unused ProxySettings ref. from loader
clean unused whois test code
2014-03-16 05:19:01 +01:00
reger
82dc815af9 cleanup: remove unrelated and unused code 2014-03-16 00:15:12 +01:00
Michael Peter Christen
85a427ec54 support for multiple sitemaps in robots.txt 2014-03-14 13:33:23 +01:00
reger
a373fb717d remove more unused from legacy server.http
- triggerOnlineAction not used
- useTemplateCache not used
2014-03-14 03:12:04 +01:00
reger
749d020aeb remove redundant url string manipulation in HTTPDProxyHandler
(still used by ProxyServlet)
2014-03-14 02:24:12 +01:00
reger
612294cf84 use servletPath in ProxyServlet instead of fixed name
to allow servlet-mapping via web.xml
2014-03-13 02:46:05 +01:00
reger
1d01672bd3 fix DCEntry.getIdentifier
on successful url parameter
2014-03-12 23:35:57 +01:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
reger
6306d28a6a OAI import get multivalued keywords (dc:subject) 2014-03-09 03:15:35 +01:00
reger
0a8c8102de allow YaCy to start w/o ssl if JKS init fails 2014-03-07 20:06:14 +01:00
sixcooler
0b2101c59c Speed up the ProxyHandler:
simplified cache-storing and make it concurrent in order to free the
clientconnection asap
let other processes wait on proxy-access like it was before
2014-03-07 17:47:09 +01:00
reger
516f8c2489 fix: to allow unix scripts (bin/*.sh) to always submit http admin apicalls
using auth via config hash (legacy requirement)
2014-03-07 00:16:57 +01:00
Michael Peter Christen
ea3aa30593 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-03-06 03:33:33 +01:00
reger
dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
optimize .yacyh check in AbstractRemoteHandler
2014-03-06 03:08:04 +01:00
Michael Peter Christen
51800007c4 - added concurrency to postprocessing of webgraph document
- bundled separate webgraph postprocessing steps into one
2014-03-06 01:43:48 +01:00
Michael Peter Christen
5f4a6892c1 enhanced RowSet re-sort limit for small sets 2014-03-05 23:28:19 +01:00
reger
351c2be68d fix: make sure adminAccount changes made via ConfigAccounts_p are effective immediately
force to remove current credentials from knownuser cache
2014-03-05 02:59:27 +01:00
reger
5c9dcc269d improve OAI-PMH import identifier recognition
- find best fitting identifier (url) by checking all given dc:identifier in record (many entries provide several identifiers)
  as identifier is currently a multivalued field use "getParams" in preference of splitting the 1st string by ";" 
- add resolve DOI:... identifier via http://dx.doi.org/
2014-03-04 03:08:37 +01:00
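Note on the commit above: choosing a usable url from several dc:identifier values, with a DOI fallback via http://dx.doi.org/, might look roughly like this. It is a sketch under assumptions, not the importer's logic; the class name, the method name and the rule "first http(s) identifier wins, otherwise the first DOI" are hypothetical.

```java
import java.util.List;
import java.util.Locale;

final class IdentifierSketch {
    /** Returns the first http(s) identifier, else a resolver url for the first DOI, else null. */
    static String bestUrl(List<String> identifiers) {
        String doiUrl = null;
        for (String id : identifiers) {
            String s = id.trim();
            String lower = s.toLowerCase(Locale.ROOT);
            if (lower.startsWith("http://") || lower.startsWith("https://")) return s;
            if (doiUrl == null && lower.startsWith("doi:")) {
                // "DOI:10.1000/182" -> "http://dx.doi.org/10.1000/182"
                doiUrl = "http://dx.doi.org/" + s.substring(4).trim();
            }
        }
        return doiUrl;
    }
}
```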
Michael Peter Christen
0e7d249a69 fixed another shutdown problem (only occurs if webgraph core is enabled) 2014-03-04 01:36:38 +01:00
Michael Peter Christen
e485fbd0ce - let crawl loader jobs die after 10 seconds without new jobs
- corrected shutdown order to prevent a deadlock during shutdown
2014-03-04 00:33:13 +01:00
Michael Peter Christen
bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
processes in favor of throwaway-processes. The control mechanism does
less often report a 'queue full' message to the busy loop which then
does not perform a long busy waiting; instead all requests are queued
and new loader processes are started if necessary up to a given limit
(as set before)
2014-03-03 22:13:40 +01:00
orbiter
051328271c bugfix-bugfix 2014-03-02 21:13:38 +01:00
orbiter
eedcbcd906 bugfix to proxy handler: recognize the own yacyh-host 2014-03-02 12:10:19 +01:00
orbiter
d68e5ad0c4 NPE fix for Thread name (just committed yesterday, sorry) 2014-03-02 11:20:48 +01:00
reger
6878c90f99 fix: IPv6 INTRANET_PATTERNS for local ip (see http://bugs.yacy.net/view.php?id=378)
requiring following ":" for fc and fd prefix and made pattern match case insensitive
- add some more ipv6 test cases to MultiProtocolURLTest.java
2014-03-02 06:13:21 +01:00
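Note on the commit above: the idea of requiring a ":" after the fc/fd prefix and matching case-insensitively can be illustrated with a regular expression. The pattern below is an assumption written for this note, not the actual INTRANET_PATTERNS entry. It relies on the fact that the first group of a unique local IPv6 address (fc00::/7) is always written with four hex digits.

```java
import java.util.regex.Pattern;

final class UniqueLocalIPv6Sketch {
    // "fc" or "fd", two more hex digits, then the ':' that ends the first group
    private static final Pattern ULA =
            Pattern.compile("^f[cd][0-9a-f]{2}:", Pattern.CASE_INSENSITIVE);

    /** True for textual IPv6 addresses in fc00::/7, given without brackets. */
    static boolean isUniqueLocal(String host) {
        return host != null && ULA.matcher(host).find();
    }
}
```

Without the trailing ":" a host name that merely starts with "fc" or "fd" would be classified as a local address.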
reger
a2e5ea2026 status panel link to set max mem
+url proxy same error text as in transparent
2014-03-01 00:56:45 +01:00
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
2014-02-28 14:01:09 +01:00
reger
ba49ff81ed little more verbose proxy 403 error message 2014-02-28 03:14:07 +01:00
Michael Peter Christen
d325cb8912 fixes and enhancements for postprocessing 2014-02-28 02:51:14 +01:00
Michael Peter Christen
7c1b968378 another fix for the shutdown exceptions 2014-02-28 01:53:32 +01:00
orbiter
133d41386c (again) full redesign of ConcurrentUpdateSolrConnector to remove
out-of-order transactions regarding add and delete operations. Now all
operations (add and delete) are executed concurrently in-order.
2014-02-28 00:19:30 +01:00
Michael Peter Christen
a632b0d2a4 added a forced commit to index deletion to enable synchronized index
updates
2014-02-27 12:50:40 +01:00
Michael Peter Christen
1d069c5861 make sure that postprocessed documents are overwritten 2014-02-27 12:27:15 +01:00
Michael Peter Christen
0d2342575e Merge branch 'master' of ssh://gitorious.org/yacy/rc1 2014-02-27 01:29:52 +01:00
Michael Peter Christen
3cc5c0ffdd a concurrency enhancement which was not used because tests showed worse
indexing speed. I leave the code there since it may be useful in
SolrCloud environments.
2014-02-27 01:27:06 +01:00
Michael Peter Christen
e644981697 added one more postprocessing low memory check 2014-02-27 00:34:13 +01:00
reger
5e645f4449 Merge origin/master 2014-02-27 00:24:30 +01:00
reger
3b89176b9f use config value htroot in Jetty init (was hardcoded)
- move htroot exist check from old httpdfilehandler to startup, remove from filehandler and legacy proxyhandler
- use SwitchboardConstant.htroot where appropriate
2014-02-27 00:23:34 +01:00
Michael Peter Christen
e1bf65c892 added short memory protection during postprocessing 2014-02-26 23:02:56 +01:00
Michael Peter Christen
90b47e83e6 fixed shutdown error when closing solr connectors 2014-02-26 22:47:16 +01:00
Michael Peter Christen
7640834b37 removed double concurrency to put Solr documents into the index. The
writings to the solr index are also buffered in
ConcurrentUpdateSolrConnector
2014-02-26 22:21:00 +01:00
Michael Peter Christen
0f6b72f24b do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
2014-02-26 14:30:48 +01:00
Michael Peter Christen
c57026e242 recover from OOM 2014-02-25 15:23:45 +01:00
Michael Peter Christen
907db8b7a6 fix for bad query shortcut hack 2014-02-25 15:19:04 +01:00
Michael Peter Christen
a2b66fe2eb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-02-25 14:37:39 +01:00
Michael Peter Christen
9f6be762a6 - better logging for postprocessing
- fixed collection bug in postprocessing
2014-02-25 14:37:30 +01:00
orbiter
da5d4128bf prevent npe 2014-02-25 03:26:20 +01:00
orbiter
a878c7982c prevent npe 2014-02-25 03:19:41 +01:00
orbiter
e4eb87d924 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-02-25 02:16:37 +01:00
orbiter
ced1a96f9c fixed error cache 2014-02-25 02:16:22 +01:00
reger
3ba81bd08a Merge origin/master 2014-02-25 00:24:10 +01:00
reger
4d896383db fix: use timeout = proxy.ClientTimeout in ProxyHandler
(was 10sec fix) see http://bugs.yacy.net/view.php?id=236
2014-02-25 00:23:06 +01:00
orbiter
cfb647db6e - introduced a miss cache in ConcurrentUpdateSolrConnector
- better usage of cache
- bugfix for postprocessing
2014-02-24 23:42:50 +01:00
orbiter
a87d8e4a8e changed caching of ConcurrentUpdateSolrConnector: it caches now also the
url along with the load date. While this takes much more memory, it
eliminates database lookups for getURL() requests, which happen equally
often. This speeds up remote solr configurations.
2014-02-24 22:59:58 +01:00
orbiter
f6e441dd77 refactoring 2014-02-24 21:01:56 +01:00
orbiter
76c53faeb2 removed unused code (HostStat) 2014-02-24 20:51:43 +01:00
orbiter
d3a88eaecb introducing ConcurrentUpdateSolrServer for remote solr servers.
Scaling of write buffers and update queue size is made according to
assigned memory.
2014-02-24 20:26:02 +01:00
reger
809e976578 remove unused java imports from yacy.java 2014-02-24 05:19:40 +01:00
reger
a9b06f8719 add a -config command line parameter e.g. -config "port=9090" "port.ssl=8043"
- useful for remote installation to set any config file property
- multiple parameters can be set at once, on Windows enclose parameter in doublequotes
- special handling   "adminAccount=adminuser:adminpwd"  sets adminusername and md5 encoded admin-pwd

- adjusted windows startbatch to allow command line parameter handling
- remove not needed classpath calculation from startYACY_debug.bat
2014-02-24 05:16:31 +01:00
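Note on the commit above: reading key=value pairs that follow a -config switch can be sketched as below. This is an illustration under assumptions, not the startup code of yacy.java; the class and method names are hypothetical, and the special adminAccount handling (user name plus MD5-encoded password) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConfigArgsSketch {
    /** Collects every "key=value" argument that appears after "-config". */
    static Map<String, String> parse(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        boolean inConfig = false;
        for (String arg : args) {
            if ("-config".equals(arg)) { inConfig = true; continue; }
            if (!inConfig) continue;
            int eq = arg.indexOf('=');
            if (eq <= 0) continue; // skip arguments without a key
            props.put(arg.substring(0, eq).trim(), arg.substring(eq + 1).trim());
        }
        return props;
    }
}
```

Called with -config "port=9090" "port.ssl=8043", the resulting map holds port=9090 and port.ssl=8043 in that order.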
reger
0923b09216 fix: allow 4 character admin user name
(was min 5 char)
2014-02-24 00:01:11 +01:00
Michael Peter Christen
254a7ac66c fixed cleaning of index 2014-02-22 01:35:01 +01:00
Michael Peter Christen
28a7b42e6b removed warning "sun.misc.BASE64Encoder is internal proprietary API and
may be removed in a future release"
2014-02-22 00:52:49 +01:00
Michael Peter Christen
046f5a03cb one more SolrIndexSearcher bugfix 2014-02-21 23:48:56 +01:00
sixcooler
78c01b3eff fix for 'AlreadyClosedException: this IndexReader is closed' 2014-02-21 17:28:32 +01:00
Michael Peter Christen
1b5e3d523a better control over close-state of remote solr connections 2014-02-20 00:39:19 +01:00
Michael Peter Christen
1a364572a5 fix for
"org.apache.solr.core.SolrCore Too many close [count:-1] on
org.apache.solr.core.SolrCore@51af7c57"
-error
2014-02-20 00:03:35 +01:00
Michael Peter Christen
69391e5d9e changed strategy to test existence of documents in Solr: using the
update time. The reason for that is a better caching for the crawler
double-check, which needs the update time for crawler steering.
2014-02-19 04:03:45 +01:00
Michael Peter Christen
790f103f32 delete fail-docs during postprocessing to prevent that they will appear
again and stay in postprocessing forever.
2014-02-18 01:38:56 +01:00
Michael Peter Christen
ff656ce860 explicit call to optimize to add an expungeDeleted flag 2014-02-12 01:01:23 +01:00
Michael Peter Christen
9eb668e951 enhanced the resource observer
The resource observer is now able to recognize free disk space AND
available space for YaCy. The amount of space which is assigned for YaCy
is defined in new settings in the configuration file.
Furthermore, there is now a cleanup process which deletes files in case
that an autodelete is activated. The autodelete is now BY DEFAULT ON if
the disk space is low, which means that YaCy starts to delete documents
when the disk is full!
2014-02-12 01:00:44 +01:00
Michael Peter Christen
fbee98c06f fixed shortcut self-reference bug 2014-02-11 22:14:46 +01:00
Michael Peter Christen
e7a29a2851 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-02-11 22:03:46 +01:00
Michael Peter Christen
bf97e38b83 removed clearURLIndex, which is a stub remaining from the old metadata
database and not needed any more
2014-02-11 22:01:25 +01:00
orbiter
14764632b5 clear solr caches in case that an exception occurs. The reason behind
this hack is the occurrence of Exceptions like:
W 2014/02/11 18:51:33 ConcurrentLog GC overhead limit exceeded
java.io.IOException: GC overhead limit exceeded
        at net.yacy.cora.federate.solr.connector.AbstractSolrConnector.getDocumentById(AbstractSolrConnector.java:334)
        at net.yacy.cora.federate.solr.connector.MirrorSolrConnector.getDocumentById(MirrorSolrConnector.java:173)
        at net.yacy.cora.federate.solr.connector.ConcurrentUpdateSolrConnector.getDocumentById(ConcurrentUpdateSolrConnector.java:415)
        at net.yacy.search.index.Fulltext.getMetadata(Fulltext.java:331)
        at net.yacy.search.index.Fulltext.getMetadata(Fulltext.java:317)
        at net.yacy.search.query.SearchEvent.pullOneRWI(SearchEvent.java:1024)
        at net.yacy.search.query.SearchEvent.pullOneFilteredFromRWI(SearchEvent.java:1047)
        at net.yacy.search.query.SearchEvent$3.run(SearchEvent.java:1263)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3077)
        at java.lang.StringCoding.decode(StringCoding.java:196)
        at java.lang.String.<init>(String.java:491)
        at java.lang.String.<init>(String.java:547)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.readField(CompressingStoredFieldsReader.java:187)
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:351)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:276)
        at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
        at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
        at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:657)
        at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.SolrQueryResponse2SolrDocumentList(EmbeddedSolrConnector.java:230)
        at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.getDocumentListByParams(EmbeddedSolrConnector.java:320)
        at net.yacy.cora.federate.solr.connector.AbstractSolrConnector.getDocumentById(AbstractSolrConnector.java:330)
        ... 7 more
        
This problem was analysed with the Eclipse Memory Analyser after a heap
dump, where the following problem was reported as the main Problem
Suspect:

One instance of "org.apache.solr.util.ConcurrentLRUCache" loaded by
"sun.misc.Launcher$AppClassLoader @ 0x42e940a0" occupies 902.898.256
(61,80%) bytes. The memory is accumulated in one instance of
"java.util.concurrent.ConcurrentHashMap$Segment[]" loaded by "<system
class loader>".

This memory is part of the result cache of Solr. Flushing this cache
appears the most appropriate solution to that problem.
2014-02-11 20:56:40 +01:00
Michael Peter Christen
bc28247089 Added methods in resource observer to calculate the available and the
occupied disc space. These values are also shown on the status page.
The disc space calculation shall be used for a disk-limitation of the
search index.
2014-02-11 03:20:03 +01:00
Michael Peter Christen
0dda979801 adapted network image drawing to increased number of peers 2014-02-11 00:53:10 +01:00