28 Commits

Author SHA1 Message Date
Michael Peter Christen
ba6ffddefc refactoring 2014-06-12 05:23:26 +02:00
reger
b12200cafe alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules
- use JSoup parser for selective rewrite of html body <a href=  links only,
instead of regex which rewrites also header href/src links
- this improves display of pages which use header <base> tag
- tags with src attribute are taken from the original location (like css), improving display, and are not routed through the indexer
Disadvantage: scripting links will drop out of proxy

Setting of the servlet through web.xml exclusively (in case one would like to quickly switch back to the YaCyProxyServlet,
leaving the existing code of YaCyProxyServlet untouched and available)
2014-03-30 04:04:02 +02:00
reger
b9056ef2db remove unused private header entries (HeaderFramework)
X_YACY_ORIGINAL_REQUEST_LINE
X_YACY_KEEP_ALIVE_REQUEST_COUNT
CONNECTION_PROP_REQUESTLINE
2014-03-26 23:28:19 +01:00
reger
c297de5145 remove check for unused virtual path /currentyacypeer/
-  del jqueryheader.template (not used)
2014-03-21 03:02:19 +01:00
Michael Peter Christen
453bfd0f17 removed unused variables and warnings 2014-03-19 09:29:01 +01:00
reger
a373fb717d remove more unused from legacy server.http
- triggerOnlineAction not used
- useTemplateCache not used
2014-03-14 03:12:04 +01:00
reger
749d020aeb remove redundant url string manipulation in HTTPDProxyHandler
(still used by ProxyServlet)
2014-03-14 02:24:12 +01:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
reger
dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
optimize .yacyh check in AbstractRemoteHandler
2014-03-06 03:08:04 +01:00
reger
3b89176b9f use config value htroot in Jetty init (was hardcoded)
- move htroot exist check from old httpdfilehandler to startup, remove from filehandler and legacy proxyhandler
- use SwitchboardConstant.htroot where appropriate
2014-02-27 00:23:34 +01:00
reger
ad4b213145 remove unused static var from HTTPDProxyHandler 2014-02-02 03:47:12 +01:00
Michael Peter Christen
022c6d3ce1 do YaCy p2p connections using a timeout-request which wraps the http
request in a separate thread and ignores any further result of a
request if it does not answer within the requested time-out. This is an
attempt to solve a problem with the peer-ping, which hangs whenever a peer
appears to be dead or blocked.
2014-01-19 15:21:23 +01:00
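
The timeout-request idea described in 022c6d3ce1 can be illustrated with a minimal sketch. This is not the code from that commit: the class name `TimeoutRequestSketch` and the method `callWithTimeout` are hypothetical, and the "peer" is simulated by a plain `Callable` instead of a real HTTP request. It only shows the general pattern of running a request in a separate thread and giving up on it once the time-out has passed.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; names are assumptions, not YaCy classes.
public class TimeoutRequestSketch {

    // Daemon threads, so an abandoned hanging request cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timeout-request");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the request in a separate thread and waits at most timeoutMillis.
     * Returns null if no answer arrived in time; a later result is ignored.
     */
    public static <T> T callWithTimeout(Callable<T> request, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(request);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker and discard its result
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // A responsive "peer" answers within the time-out.
        String fast = callWithTimeout(() -> "pong", 1000);
        // A "dead" peer that would block for ten seconds is abandoned after 100 ms.
        String slow = callWithTimeout(() -> { Thread.sleep(10_000); return "pong"; }, 100);
        System.out.println(fast + " / " + slow);
    }
}
```

The caller is blocked for at most the time-out, regardless of how long the underlying request hangs. Whether the worker thread itself terminates depends on the request reacting to interruption, which blocking socket I/O generally does not.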
reger
4c38bceafc handle http connect for proxy
refactor header cleanup (reuse existing code)
2014-01-04 13:09:34 +01:00
reger
0583f44306 reimplement proxy access log (to Jetty ProxyHandler)
- using existing HTTPDProxyHandler logger
- allow local loopback ip to access proxy
2014-01-02 03:37:33 +01:00
Michael Peter Christen
07cee6b99c removed more unused code 2013-12-23 01:51:48 +01:00
Michael Peter Christen
84167adb49 removed unused anomichttpd code after migration to jetty 2013-12-23 01:23:40 +01:00
reger
0d4efabaa8 fix YaCy version string in proxy headers
(config parameter vString no longer used)
2013-10-13 17:56:53 +02:00
sixcooler
61f627eb85 fix for ssl-connections from proxy-usage staying in close-wait-state
+ some extra 'close' in HttpClient
2013-10-10 20:57:37 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
only the unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different
type of link collection, which carries property attributes that are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to build the same search index as Google. Without
the ability to refrain from obeying robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is an add-on to jdk logging which puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
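
The queue-based logging described in 5878c1d599 can be sketched as follows. This is a minimal illustration and not the ConcurrentLog implementation: the class `QueuedLogSketch`, its `Entry` type and the poison-pill shutdown are assumptions made for this example, and the "separate process" is modeled here as a single writer thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; names are assumptions, not YaCy classes.
public class QueuedLogSketch {

    // One queued log entry.
    private static final class Entry {
        final Logger logger;
        final Level level;
        final String message;
        Entry(Logger logger, Level level, String message) {
            this.logger = logger;
            this.level = level;
            this.message = message;
        }
    }

    // Sentinel entry that tells the writer thread to stop.
    private static final Entry POISON = new Entry(null, null, null);

    private final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
    private final Thread writer;

    public QueuedLogSketch() {
        // The only thread that ever calls the jdk logger.
        writer = new Thread(() -> {
            try {
                Entry e;
                while ((e = queue.take()) != POISON) {
                    e.logger.log(e.level, e.message);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Callers only enqueue the entry and return immediately. */
    public void log(Logger logger, Level level, String message) {
        queue.add(new Entry(logger, level, message));
    }

    /** Writes out all entries queued so far, then stops the writer thread. */
    public void shutdown() throws InterruptedException {
        queue.add(POISON);
        writer.join();
    }

    public static void main(String[] args) throws InterruptedException {
        QueuedLogSketch log = new QueuedLogSketch();
        Logger logger = Logger.getLogger("sketch");
        log.log(logger, Level.INFO, "first message");
        log.log(logger, Level.INFO, "second message");
        log.shutdown();
    }
}
```

Because worker threads only append to the queue, they never contend inside the jdk logger itself; the trade-off is that entries are written with a small delay and can be lost if the JVM exits before the queue is drained.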
Michael Peter Christen
cc90f82dbb increased default proxy client timeout to one minute 2013-05-06 14:58:18 +02:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
The default schema uses only some of them and the resulting search index
now has the following properties:
- the webgraph will have about 40 times as many entries as the default
index
- the complete index size will increase and may be about double the
current size
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will allow some old parts of YaCy to be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times the size of the document index!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get full access to the new index, the API to solr now has two
access points: one with attribute core=collection1 for the default
search index and one with core=webgraph for the new webgraph search index. This is
also available for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
orbiter
1f33c30d7b re-integrating useForHost method (lost sometime?) to get the noProxy
pattern working again. Without using this method all remote urls
including the localhost had been accessed through the configured proxy
2012-12-10 20:44:29 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00