6930 Commits

Author SHA1 Message Date
Michael Peter Christen
b9d36e45e0 removed the &amp explicit encoding of ampersand character since this is
double-translated within the template replacement process.
2014-01-05 03:40:10 +01:00
reger
e2ccb6ce9d modified DefaultServlet parameter on invoke templates
call response with post=0 (if post empty) simulating previous behavior.

(template servlets typically test for post==null,
found one more, Crawler.p.java, where an empty post caused a problem,
= defaults not correctly set)
2014-01-04 20:49:26 +01:00
reger
4c38bceafc handle http connect for proxy
refactor header cleanup (reuse existing code)
2014-01-04 13:09:34 +01:00
reger
cfabe8f67a harmonize access restriction for urlproxy servlet
with proxy handler, what is currently
- use switched on in config
- access from a local IP / hostname

fix shutdown exception for crashprotection handler on interrupted connections.
2014-01-03 12:28:40 +01:00
reger
e6b9643fd6 extended the local peer request check to the IP resolved from the hostname
the current islocal() check did not detect a domain.com address as a request for the local peer.
2014-01-03 01:13:56 +01:00
reger
c797f108a1 add error response on denied proxy access
send http 403 response
2014-01-02 09:11:08 +01:00
reger
0583f44306 reimplement proxy access log (to Jetty ProxyHandler)
- using existing HTTPDProxyHandler logger
- allow local loopback ip to access proxy
2014-01-02 03:37:33 +01:00
reger
8cbc1c970a Security Hot-Fix: for transparent proxy. 2014-01-01 20:48:35 +01:00
reger
58ecf5e4dd add to blacklist button in CrawlResults
http://bugs.yacy.net/view.php?id=220
introduced Blacklist.add with sourcefile only parameter
2014-01-01 11:01:22 +01:00
reger
e9081c0f17 moved startup execAPIActions call after Jetty startup
execAPIActions require http to be up. The 10s sleep was sufficient to allow Jetty to start, 
but it's more robust to place the call after http is assigned to switchboard/serverSwitch.
2014-01-01 10:28:49 +01:00
reger
19c1a7a5ca change SolrServlet from Filter to Servlet
(as no multicore required)
this allows to simplify context/servlet initialization in Jetty init.
2014-01-01 10:20:32 +01:00
reger
14c977dd26 fix NPE GSAresponseWriter on query=null
java.lang.NullPointerException
	at net.yacy.cora.federate.solr.responsewriter.GSAResponseWriter.highlight(GSAResponseWriter.java:328)
	at net.yacy.cora.federate.solr.responsewriter.GSAResponseWriter.write(GSAResponseWriter.java:263)
	at net.yacy.http.servlets.SolrServlet.service(SolrServlet.java:235)
2013-12-31 23:01:41 +01:00
orbiter
c3dee2d6bd added security patch 2013-12-31 15:25:44 +01:00
orbiter
dcf46ce8f6 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-12-31 15:20:49 +01:00
orbiter
343d2ef49a new data type for access tracker (unfinished) 2013-12-31 15:20:34 +01:00
reger
dd8ea0cdd6 fix "add to blacklist" button style in IndexControlRWIs_p
- added default filename filter to select field (as only addition to *.black list is permanent)

- modified Blacklist_p header/legend to show all active blacklists 
  (to support understanding that all configured lists are active)
- removed obsolete code in Blacklist_p servlet
2013-12-30 20:03:59 +01:00
reger
abbf487023 fix QueryGoal Image query (missing space)
see query log example .. url_file_ext_s:(jpg OR png OR gif) ORcontent_type:(image/*)) ..
2013-12-29 20:14:10 +01:00
reger
26e9d7e066 fix NPE in IndexControlRWIs_p.html
- metatags may be null
Caused by: java.lang.NullPointerException
	at net.yacy.search.query.QueryParams.getFacets(QueryParams.java:445)
	at net.yacy.search.query.QueryParams.getBasicParams(QueryParams.java:400)
	at net.yacy.search.query.QueryParams.solrTextQuery(QueryParams.java:345)
	at net.yacy.search.query.QueryParams.solrQuery(QueryParams.java:334)
	at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:290)
	at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:176)
	at IndexControlRWIs_p.genSearchresult(IndexControlRWIs_p.java:641)
	at IndexControlRWIs_p.respond(IndexControlRWIs_p.java:141)
2013-12-29 08:05:37 +01:00
reger
7f9b9315fe Merge origin/master 2013-12-29 02:05:07 +01:00
reger
8eaabb9600 remove dependency from old serverCore.java
- remaining getPortNr not needed 
  (as current release allows only to set plain integer as port,
   see ConfigBasic)
2013-12-29 02:00:44 +01:00
orbiter
2018e55f8b switched index deletion back on (was accidentally off because the new
jetty framework never delivers null for post arguments .. there may be more
problems of that kind)
2013-12-29 01:39:30 +01:00
orbiter
3961b643a3 write solr searches to search log 2013-12-29 01:25:44 +01:00
orbiter
15882beb19 fix for strange NPE
java.lang.NullPointerException
        at
net.yacy.search.Switchboard.updateMySeed(Switchboard.java:3667)
        at net.yacy.peers.Network.peerPing(Network.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at
net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107)
        at
net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165)
2013-12-29 00:40:31 +01:00
orbiter
f3ac923a7e ftp client shall be able to open non-anonymous ftp servers if login
details are given
2013-12-28 22:42:02 +01:00
reger
3d913558ab display configured adminUserName in ConfigAccounts_p
- fix reading of default username in loginservice
2013-12-27 21:04:14 +01:00
reger
fbdd89e198 Merge origin/master 2013-12-27 06:53:14 +01:00
reger
65a2f3d5e7 tweak Jetty credentials to work with YaCy UserDB
- user entry in UserDB with admin right can login to access protected pages
- ditto for the admin user; the chosen username is stored in conf (adminAccountUserName=)
2013-12-27 06:45:22 +01:00
Michael Peter Christen
ffdfe5fb9b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-12-27 03:06:38 +01:00
reger
7d6b34a89f Merge origin/master 2013-12-27 03:04:14 +01:00
reger
45e8750ba5 nasty quick fix for admin login with a username other than admin
- userDB is not sync'ed with Jetty credentials; as of now only the std. admin account can log in

switched initial browser open with ssl active back to std. http port
2013-12-27 02:59:19 +01:00
Michael Peter Christen
ee17bd0b69 added option to attach remote solr servers in read-only mode 2013-12-27 02:55:21 +01:00
Michael Peter Christen
25f9c35033 add patch which shall prevent that naive search mistakes like usage of
regular expressions cause no results. Usage of '*' followed by a dot or
any expression will now cause that this expression is used as a filetype
search.
2013-12-27 00:34:55 +01:00
Michael Peter Christen
667a6adddb - use default files from yacy.init property "defaultFiles" if no
jetty-configuration is given for default files.
- fix a problem with default paths if no path is given (i.e.
http://localhost:8090 instead of http://localhost:8090/). Without this
patch the path was resolved automatically to http://localhost:8090//
2013-12-26 23:59:04 +01:00
Michael Peter Christen
77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives 2013-12-26 23:26:21 +01:00
reger
fca7f1d043 run SSL/HTTPS port (8443) ping test in migration only if SSL/HTTPS is on
- see last commit
2013-12-25 05:33:00 +01:00
reger
71cac1a278 added SSL/HTTPS connector to support SSL/https connection on port 8443
!!! attention !!! to make sure YaCy can start, https will be disabled if port 8443 is used
   - added ping test for above to migration 

- as of now port for https is hardcoded to default 8443
- if not urgently required I'd leave it this way (it's standard) to use different ports for http and https

- post https port on ConfigBasic.html (if active)
2013-12-25 05:20:13 +01:00
Michael Peter Christen
82c0525e71 wrong logger fix 2013-12-23 10:52:02 +01:00
Michael Peter Christen
e17624b6dd added html retrieval from alternative DATA/HTDOCS path 2013-12-23 02:06:33 +01:00
Michael Peter Christen
07cee6b99c removed more unused code 2013-12-23 01:51:48 +01:00
Michael Peter Christen
20b48f894f refactoring: moving all servlets to the same package (the solr servlet
is currently actually a filter which should be changed somehow)
2013-12-23 01:32:29 +01:00
Michael Peter Christen
84167adb49 removed unused anomichttpd code after migration to jetty 2013-12-23 01:23:40 +01:00
Michael Peter Christen
b461a27abb fixed the SolrServlet 2013-12-20 01:51:51 +01:00
Michael Peter Christen
7603e879dc Merge branch 'master' into HEAD
Conflicts:
	.classpath
	source/net/yacy/cora/federate/solr/SolrServlet.java
2013-12-20 01:19:06 +01:00
Michael Peter Christen
25250405f1 solr servlet preparation for join with jetty branch 2013-12-20 00:45:58 +01:00
Michael Peter Christen
2f16770681 migrated to solr 4.6.0 2013-12-19 21:51:05 +01:00
Michael Peter Christen
57f0f71ac6 added patch to allow binary response writer 2013-12-19 10:13:43 +01:00
orbiter
937273d4e3 added parsing of metadata to surrogate reading:
a dublin core record inside of surrogate input files may now contain
tokens within the namespace 'md' (short for: metadata). The token names
must be valid within the namespace of the solr field names. All
md-tokens inside of surrogate files then overwrite values within solr
documents before they are written to the solr index. This makes it
possible to assign collection names to each surrogate entry and also
ranking information can be added. Please see the example file.
2013-12-17 14:02:27 +01:00
reger
18497f6475 remove unused init parameter from DefaultServlet
- remove "RelativeResourceBase" parameter
2013-12-15 23:39:19 +01:00
orbiter
4de3fefdb5 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-12-15 19:13:00 +01:00
orbiter
7e346e1d79 using stringbuilder in query construction 2013-12-15 19:12:49 +01:00
reger
c84c313fe1 Merge origin/master into jetty 2013-12-14 20:02:24 +01:00
Michael Peter Christen
2702d9e56b - added a SolrQueryResponse2SolrDocumentList method which is able to
work around the unfolding process in Solr's BinaryResponseWriter.
This was a huge performance bottleneck in the embedded solr connector
and the problem is actually on Solr side, but we have now a workaround.
- This made it possible to abstract a high-performance index access
method which is implemented as method getDocumentListByParams. That
method is also implemented in the SolrServerConnector and provides a
very efficient access to a solr index if the index is embedded.
- a popular use of the document list retrieval is a result count which
can now also make use of the new method, via getDocumentCountByParams.
- enhanced the Error cache which now does not store error documents
within the ram cache if the document is also written to solr. When
documents are retrieved from the cache, they are partly read from the
ram cache and if not existent there, from the Solr index.
2013-12-13 15:56:29 +01:00
Michael Peter Christen
74466d731a use pre-compiled patterns in ymark 2013-12-12 11:50:48 +01:00
Michael Peter Christen
34633044b4 made pattern computation static 2013-12-12 10:55:36 +01:00
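The two pattern commits above follow a common Java idiom: compile a regular expression once into a `static final` field instead of recompiling it on every call. A minimal sketch of that idiom, assuming a whitespace tokenizer as the example (the class `PrecompiledPatternSketch` and its method are hypothetical and not taken from the ymark code):

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the "pre-compiled pattern" idiom; not YaCy code.
final class PrecompiledPatternSketch {

    // Compiled once when the class is loaded and shared by all callers.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Splits text on runs of whitespace using the shared pattern. */
    static String[] tokens(String text) {
        // String.split(String) would compile the regex again on each call for
        // non-trivial expressions; Pattern.split reuses the compiled form.
        return WHITESPACE.split(text.trim());
    }

    public static void main(String[] args) {
        for (String token : tokens("use  pre-compiled\tpatterns")) {
            System.out.println(token);
        }
    }
}
```

`Pattern` instances are immutable and safe to share between threads, which is why a static field is sufficient.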
Michael Peter Christen
ef7ddbc933 added date parser caches to prevent re-calculation of costly date
parsing
2013-12-12 10:55:12 +01:00
Michael Peter Christen
552ef9f18e fix for bad ErrorCache.exists test (bug from latest commit) 2013-12-12 10:38:32 +01:00
Michael Peter Christen
09412ea3a4 counting search requests in solr interface 2013-12-12 03:37:19 +01:00
Michael Peter Christen
303f5694ba avoid usage of existsByQuery. If a document can be loaded by the ID
before testing other fields from the existsByQuery request, then a
document cache fills and queries after that one can be avoided.
2013-12-12 03:36:30 +01:00
reger
b43bbd3cc4 join DefaultServlet and Jetty8 implementation
- removing Jetty 8 specific dependencies
2013-12-09 23:45:57 +01:00
reger
089c5007ee move conditionalHeader to DefaultServlet
- by removing Jetty specific implementation detail
2013-12-08 00:56:45 +01:00
Michael Peter Christen
79771c60c0 IPv6 fixes 2013-12-06 14:30:08 +01:00
reger
92d9c56f9f Merge origin/master into jetty 2013-12-05 22:53:29 +01:00
Michael Peter Christen
78eac85161 better calibration of caches and queue maximum sizes 2013-12-04 23:15:10 +01:00
Michael Peter Christen
c8af19bd37 removed unnecessary check which causes a NPE when searching with empty
search string
2013-12-04 17:58:36 +01:00
Michael Peter Christen
e3c2f09de9 - reduce computation in case that specific postprocessing fields are not
selected
- de-select citation rank computation
2013-12-04 17:48:12 +01:00
Michael Peter Christen
cfa08024c7 removed optimization before postprocessing because that may cause a
time-out which will cause postprocessing to fail.
2013-12-04 16:04:29 +01:00
Michael Peter Christen
6f3a923691 fixed urlmask which was not able to combine several constraints 2013-12-04 13:48:01 +01:00
Michael Peter Christen
9a27bf6e82 removed filter computation in Protocol class for remote searches because
that is already done in the QueryParams class
2013-12-04 13:09:15 +01:00
Michael Peter Christen
f1b5db2c45 - performance graph does not show peer ping in memory monitor any more
- after a forced GC, the PerformanceMemory view switches to automatic
update by default
2013-12-04 12:59:30 +01:00
Michael Peter Christen
a125904a1c fixed a NPE in surrogate processing 2013-12-04 01:56:38 +01:00
Michael Peter Christen
0db8e34625 enhanced webgraph processing 2013-12-04 01:54:45 +01:00
reger
ac067b5236 clean-up Jetty handler classes 2013-12-01 19:36:24 +01:00
reger
b75e92aac3 add read queryparameter in gsaservlet 2013-11-30 06:29:57 +01:00
reger
1e94719084 fix NPE on mime detection of unknown file extension 2013-11-29 23:23:47 +01:00
reger
effea4bca0 Merge origin/master into jetty
Conflicts:
	source/net/yacy/cora/federate/solr/SolrServlet.java
2013-11-29 22:39:52 +01:00
sixcooler
2c2ebb0d92 tried some hardening in order not letting any Solr-Searchers open 2013-11-29 02:40:12 +01:00
Michael Peter Christen
a16534cb0a tried to fix timeout and connection-lost problems when using an outside
solr.
2013-11-28 01:31:53 +01:00
Michael Peter Christen
c3dcbdc8d5 try to recover from an OOM during citation index reading and fail-over
to second solr core in case of unrecoverable OOM.
2013-11-28 01:10:25 +01:00
Michael Peter Christen
9932c441c8 fixed a problem with Date fields parsing Solr results if a remote Solr
is attached.
2013-11-28 00:54:53 +01:00
sixcooler
94db054aff memory-leak-fix: the DocListSearcher fires a query in its constructor
and it is highly recommended to close every SolrRequest.
Every request which is not closed leaves a Searcher with its caches, which
cannot be garbage-collected.
2013-11-27 19:07:36 +01:00
reger
26bb1e37b7 implement core selection in SolrServlet
- making initcore() obsolete
2013-11-27 02:51:02 +01:00
Michael Peter Christen
ae55d69ef6 include/exclude size NPE fix (recently added) 2013-11-26 11:47:04 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix was done using a
reconstruction of the search word set access method to protect that
words are deleted from the sets from the outside of the QueryGoal class.
2013-11-26 02:24:47 +01:00
Michael Peter Christen
5592ea57f0 hack to remove compiler warnings about deprecated classes. It would be
better to remove the deprecated usage but to do this the Solr core must
adopt the latest apache http core changes as well .. this is not our
fault.
2013-11-25 23:30:35 +01:00
orbiter
037cd0a57c using the BinaryResponseWriter which is supported within the YaCy solr
servlet since YaCy 1.63. This is much more performant for the client
than using the XMLResponseWriter because parsing of XML data is very CPU
intensive. Older YaCy peers are still requested using the
XMLResponseWriter but the majority of YaCy peers already respond with
the binary writer. This makes remote searches much faster and less CPU
intensive.
2013-11-25 21:31:40 +01:00
orbiter
61409788eb less word hash computations (removing some overhead because of MD5
calcs) using the clear word in a normalized form.
2013-11-25 15:20:54 +01:00
reger
f23471c471 add check to prevent index entries containing url_file_ext_s with ";jsession=xyz"
note: check could be implemented in MultiProtocolURL (but at this time I could not foresee the possible implications)
2013-11-25 00:14:53 +01:00
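The kind of check described in the commit above can be sketched as follows. This is an assumption about the general approach, not the committed code: the class `FileExtensionSketch` and the method `fileExtension` are hypothetical, and cutting the path at the first `;` is one plausible way to keep a session parameter out of the extension value.

```java
import java.util.Locale;

// Hypothetical sketch: names and approach are illustrative, not YaCy code.
final class FileExtensionSketch {

    /** Returns the lower-case extension of a URL path, ignoring any ";..." path parameter. */
    static String fileExtension(String path) {
        // Drop everything from the first ';' on (e.g. ";jsession=xyz").
        int semicolon = path.indexOf(';');
        String clean = semicolon >= 0 ? path.substring(0, semicolon) : path;

        // Look only at the last path segment so dots in directories are ignored.
        String name = clean.substring(clean.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return ""; // no extension present
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fileExtension("/shop/page.jsp;jsession=xyz"));
    }
}
```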
reger
5c4a3d1c01 Merge origin/master into jetty 2013-11-24 21:00:39 +01:00
reger
444a9ae674 remove unused options and attributes from DefaultServlet
cleanup obsolete class files
2013-11-24 20:11:39 +01:00
reger
8da75a4b0c fix contentType definition for Solr html response writer
from xml to html
(hint: value is currently not used, but is in SolrServlet)
2013-11-24 04:31:08 +01:00
Michael Peter Christen
ccf2f4e43b refactoring of seed attributes (introduced more constants) 2013-11-22 14:15:31 +01:00
Michael Peter Christen
1f0bfa8fec added test to Base64Order (runs successfully!) 2013-11-22 10:38:42 +01:00
orbiter
b7f1e5af51 added new servlet which generates the same file as the principal peers
upload to a bootstrap position
 you can call it either with
 http://localhost:8090/yacy/seedlist.html
 or to generate json (or jsonp) with
 http://localhost:8090/yacy/seedlist.json
 http://localhost:8090/yacy/seedlist.json?callback=seedlist
2013-11-19 15:56:10 +01:00
orbiter
3e552550d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-18 22:48:00 +01:00
orbiter
c2d720cdaf purge a lucene cache - possible memory leak fix 2013-11-18 22:47:35 +01:00
reger
e4f49fb175 for searchresults with empty title use filename as title
- to not store a title in index which isn't extracted from source 
  the title is empty check only added to ResultEntry class
2013-11-18 19:41:31 +01:00
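The display-time fallback described in the commit above might look roughly like this. It is a sketch under assumptions: `TitleFallbackSketch` and `displayTitle` are hypothetical names, and taking the last path segment as the "filename" is an assumed interpretation. As in the commit, the fallback is applied only when a result is shown, so nothing is written to the index.

```java
// Hypothetical sketch: names are illustrative, not the ResultEntry implementation.
final class TitleFallbackSketch {

    /** Returns the extracted title, or the file name from the URL path if the title is empty. */
    static String displayTitle(String title, String urlPath) {
        if (title != null && !title.trim().isEmpty()) {
            return title; // a title extracted from the source always wins
        }
        // Fall back to the last path segment; this is empty for paths ending in "/".
        return urlPath.substring(urlPath.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(displayTitle("", "/docs/report.pdf"));
        System.out.println(displayTitle("Annual Report", "/docs/report.pdf"));
    }
}
```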
reger
b1dc9a6f52 - disable Jetty servlet defaultUseCache (prevent double caching)
- include short memory status check for class cache in DefaultServlet
- remove obsolete Resource interface for Jetty8YaCyDefaultServlet
2013-11-18 03:15:45 +01:00
reger
f111f30ace Merge origin/master into jetty 2013-11-17 00:18:25 +01:00
reger
94293176a3 use writeOptionHeaders with ServletResponse parameter only 2013-11-17 00:02:08 +01:00
orbiter
ff86cb683f fixed some XSS bugs reported by Marius from http://ctf365.com/ 2013-11-16 20:34:31 +01:00
orbiter
da33ee0d77 also extended timeout for webgraph postprocessing 2013-11-16 18:30:06 +01:00
orbiter
74f9e40747 extended timeout during postprocessing of 30 minutes. 2013-11-16 18:29:08 +01:00
orbiter
19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
9cf9727685 fix for wrong counter 2013-11-16 11:33:35 +01:00
Michael Peter Christen
fceac8cffd more monitoring for postprocessing 2013-11-16 08:23:42 +01:00
Michael Peter Christen
6842783761 fixed and enhanced postprocessing 2013-11-16 08:23:21 +01:00
Michael Peter Christen
219d5934a4 fixed termination bug in Solr Connector 2013-11-16 08:22:29 +01:00
Michael Peter Christen
bf1bdd52a6 prevent requesting of 0-facets (which actually exist) 2013-11-15 15:41:41 +01:00
Michael Peter Christen
9d5895f643 enhanced and fixed postprocessing 2013-11-15 15:41:12 +01:00
Michael Peter Christen
f86fe90eda enhanced mass storage speed to remote solr servers 2013-11-15 15:40:07 +01:00
Michael Peter Christen
6ed9821209 fixed several problems in solr connectors 2013-11-15 15:39:35 +01:00
Michael Peter Christen
191fd3d7e7 added an optimization option to HandleSet mass data storage structure 2013-11-15 15:38:00 +01:00
Michael Peter Christen
94b565ea0d fixed keepalive min value 2013-11-15 15:37:01 +01:00
reger
b26787dc2d - DefaultServlet: remove static gzip option
YaCy doesn't use pre-gzip'ed static html pages 
- ProxyServlet: remove not needed procedure
- Server init: skip one overlapping servlet context
2013-11-14 01:37:51 +01:00
Michael Peter Christen
24a052ecb9 removed debug code for existsByIds 2013-11-13 13:41:18 +01:00
Michael Peter Christen
087df05e24 added option to Config_Network_p.html to enable remote search while
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
c60947360d logger should be static 2013-11-13 06:04:28 +01:00
Michael Peter Christen
69b8d61c47 fix for search requests in GSA interface which contain 'funny'
characters (like ':' etc.)
2013-11-12 15:54:54 +01:00
orbiter
b085cb522b replaced old existsByIds for embedded Solr with obviously much faster
new selection method (including still existing debug code to test that
this is in fact better)
2013-11-11 11:25:01 +01:00
reger
b29d262e70 implement Jetty8HttpServerImpl.generateSocketAddress
(code 1:1 copied from serverCore)
2013-11-10 18:59:18 +01:00
orbiter
4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-10 18:50:43 +01:00
orbiter
909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
completed
2013-11-10 18:50:34 +01:00
reger
066a1ecf0a add highlight queryparams to solrservlet if missing
- modify query params in Solr parameter map (instead of querystring)
2013-11-10 01:36:57 +01:00
Michael Peter Christen
899e7e92b0 added debug code 2013-11-09 02:37:12 +01:00
reger
4684330505 Merge origin/master into jetty
Conflicts:
	source/net/yacy/cora/federate/solr/responsewriter/HTMLResponseWriter.java
2013-11-07 21:44:14 +01:00
reger
1437c45383 merge rc1/master 2013-11-07 21:30:17 +01:00
Michael Peter Christen
87a956e881 calculating and showing the number of files and the average size of a
file in the HTCACHE in ConfigHTCache_p.html
2013-11-07 12:13:12 +01:00
Michael Peter Christen
acc1f8a749 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-07 12:01:37 +01:00
Michael Peter Christen
81d9e23532 fixed another memory leak in the PDF parser:
the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space
which cannot be cleaned if PDFont.clearResources is called.
The attempt to clean the class cache therefore causes that the class is
loaded and this cache is initialized with some rubbish. I tried to
prevent to instantiate this class by usage of a hacked findLoadedClass
call to the SystemClassLoader (which is protected ...).
Now, without using the PDF parser at all, 8MB of RAM space is not
occupied, however, when the first PDF arrives this space will be taken
and never given back to GC.
WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!
2013-11-07 11:57:01 +01:00
Michael Peter Christen
c152d996e6 reduced footprint of BookmarksDB which can take quite a lot of memory if
the number of bookmarks is high (i.e. > 2000 URLs)
2013-11-07 10:55:02 +01:00
Michael Peter Christen
81bb50118e found and fixed a huge memory leak in solr caching (inside Solr). The
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- an Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
reger
7b17cdf6dd add content_type:image/* to image search
- see numerous idx entries with content_type image without url_file_ext_s (for various reason) which should be included in result
- try it yourself with following sample query
   /solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type

also addresses possible urls without an extension or with a deviating one.
2013-11-07 03:11:03 +01:00
reger
082c9a98c1 move writeHeaders from Jetty8 servlet to YaCyDefaultServlet
- after removing Jetty server dependency (of Response using HttpServletResponse only)
2013-11-07 00:32:21 +01:00
sixcooler
987f410011 URL-export:add query and fix for cast-class-exception 2013-11-06 19:22:26 +01:00
Michael Peter Christen
a8253ca49c added missing unicode transformation in href link contents during
parsing
2013-11-06 18:05:02 +01:00
Michael Peter Christen
0cf9e9580b added clickdepth and CR computation debug code to verify that the
process is complete
2013-11-06 15:01:40 +01:00
reger
b85f702f22 add AccessTracker logging to SolrServlet 2013-11-05 22:57:55 +01:00
reger
de1f02420b implement HtmlResponseWriter to solrServlet (and rss / opensearch response writer) as in yacy select servlet.
- set contenttype of HTML/GrepHTML-ResponseWriter to "text/html"
- set a contenttype to GSAsearchServlet
2013-11-04 21:11:12 +01:00
Michael Peter Christen
234a974955 load image only if their parser flag is activated 2013-11-04 11:59:28 +01:00
Michael Peter Christen
b2c329929f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-04 10:18:52 +01:00
Michael Peter Christen
60187a4ec2 fix in html parser 2013-11-04 10:16:20 +01:00
Michael Peter Christen
e1c1e57877 less overhead calling exist() with only one hash 2013-11-04 09:37:31 +01:00
reger
3d5d366f1c fix html header in Solr HTMLResponseWriter
- move 1st body content after </head> tag
- add closing <span> tag
2013-11-04 03:12:02 +01:00
reger
bfdb404867 implement a Jetty reconnect to work with Configbasic_p.html port change
- instead of shutting down the server it should be sufficient to manipulate the Jetty http connector
2013-11-03 21:34:21 +01:00
Michael Peter Christen
5a02d650ee avoid cloning 2013-11-03 18:31:50 +01:00
reger
d6760df3e5 fix servlet class exist check to use default path only (in Jetty8YaCyDefaultServlet)
- del redundant doget code in yacydefaultservlet
   - small declaration code opts
- del obsolete libt/proxyservlet.java
2013-11-03 02:26:00 +01:00
reger
b38de92a16 Merge origin/master into jetty 2013-11-02 00:48:42 +01:00
Michael Peter Christen
cc39667399 Speed enhancements and less CPU usage during Solr searches when using
the embedded Solr (the default). This was obtained by circumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
2013-11-01 17:24:36 +01:00
Michael Peter Christen
434e13b46d in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
reger
6944225037 - add GSA search /gsa/search servlet for Jetty to Server init
- include SecurityHandler check for /gsa/ /solr/ 
- change one more YaCyDefaultServlet dependency from Jetty to std. javax.Servlet
2013-10-30 23:11:36 +01:00
reger
53cb30a221 reduce logging (by assigning logger to existing logger)
- small additional cleanups
2013-10-30 00:51:04 +01:00
reger
332c6d4fe1 reactivate Domain handler for .yacy / .yacyh handling 2013-10-27 19:15:20 +01:00
reger
b1ce70434e resolve merge conflict
- add missing import statement
2013-10-27 15:24:04 +01:00
reger
7869a4c070 Merge origin/master into jetty
- merge conflict resolve
2013-10-27 15:12:17 +01:00
reger
f017066197 Merge origin/master into jetty 2013-10-27 15:09:24 +01:00
reger
06da6f517c add YaCyProxyServlet to handle /proxy.html?url=proxyurl
- based on Jetty ProxyServlet
- at this time use existing HTTPD ProxyHandler  for url rewrite
- add jetty-client jar (dependency in Jetty ProxyServlet)

reuse ProxyHandler.convertHeaderFromJetty in YaCyDefaultServlet
2013-10-27 05:04:24 +01:00
reger
69599566f9 catch one more malformed url in proxy url rewrite 2013-10-27 04:42:33 +01:00
reger
605530fec5 catch proxy url rewrite exception
malformed url (" http:\/\/" ) may cause error response
 testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test
2013-10-27 04:06:11 +01:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
orbiter
3c3cb78555 - removed a lot of garbage and bloated code from GuiHandler.
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
2013-10-24 20:42:34 +02:00
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
6aabc4e5c8 reduced logging line memory, 10000 lines had filled up 450MB! grrr.
(thank you, a bomb from the past)
2013-10-24 16:17:53 +02:00
Michael Peter Christen
1a8783147b enhanced computation of number of solr documents. 2013-10-24 15:48:05 +02:00
Michael Peter Christen
4948c39e48 added concurrency for mass crawl check 2013-10-23 11:27:19 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
82621bead0 When doing bootstrapping, always accept one seedlist-File without
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
2013-10-22 15:34:51 +02:00
Michael Peter Christen
691d7e70fa added hint to development/commit rss feed 2013-10-21 15:16:29 +02:00
orbiter
20bbde8665 fix for mustmatch regex computation: result had correct semantic, but
may have contained multiple same expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
2013-10-18 13:55:37 +02:00
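The effect of the fix above, dropping repeated alternatives from a regex disjunction without changing what it matches, can be sketched as follows. This is an illustration only: `MustMatchSketch` and `disjunction` are hypothetical names, and using a `LinkedHashSet` is an assumed way of removing duplicates while keeping the original order.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not the mustmatch computation from the YaCy sources.
final class MustMatchSketch {

    /** Joins the expressions with '|', keeping only the first occurrence of each. */
    static String disjunction(List<String> expressions) {
        // LinkedHashSet removes duplicates and preserves insertion order.
        Set<String> unique = new LinkedHashSet<>(expressions);
        return String.join("|", unique);
    }

    public static void main(String[] args) {
        List<String> domains = Arrays.asList(
                "https?://example\\.org/.*",
                "https?://example\\.net/.*",
                "https?://example\\.org/.*"); // duplicate restriction
        System.out.println(disjunction(domains));
    }
}
```

Because `a|b|a` and `a|b` accept the same inputs, the shorter form keeps the semantics the commit message says were already correct.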
reger
cb2dbcb843 add graceful Jetty shutdown option
- as Jetty stop is not synced, yet
- include jetty jars and servlet-3.0 api jar  in Eclipse .classpath
2013-10-18 00:42:38 +02:00
reger
f46c723398 allow to choose used http server, YaCy-Anomic or Jetty
- defaults to Jetty (in this branch)
- add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking
2013-10-17 03:34:22 +02:00
reger
da4ff5aefa add YaCy HttpCommand "authenticate" check to DefaultServlet 2013-10-17 00:06:17 +02:00
Michael Peter Christen
c833d02cf5 fixed webgraph postprocessing (did nothing and repeated to do this...) 2013-10-16 11:49:04 +02:00
Michael Peter Christen
74d0256e93 enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
reger
1adb4b8741 merge rc1/master 2013-10-16 03:02:21 +02:00
reger
77a73c7475 add YaCy HttpCommand "location" check to DefaultServlet 2013-10-16 01:48:44 +02:00
Michael Peter Christen
7b69c438f7 more methods for the table class 2013-10-15 16:46:59 +02:00
Michael Peter Christen
820b896146 Replaced the inframe loading from yacy.net for donations with the
loading of this iframe from the local host. To make this more flexible,
this iframe is loaded once after startup from yacy.net.
2013-10-15 16:46:06 +02:00
reger
cc223b14a4 remove wrong content mod in SSI parser for virtual path /currentyacypeer/
(is handled on start of request handling)
2013-10-15 03:25:24 +02:00
reger
5606291574 fix last commit (not needed test of GZipInputStream) 2013-10-14 04:29:34 +02:00
reger
f9eed8cb44 add support for gzip encoded multipart forms (needed for transferRWI.html)
- quick and dirty reuse of existing HTTPDemon implementation
2013-10-14 04:18:52 +02:00
reger
cf32a92629 - add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart)
- reduce Jetty logging 
- give build.run a bit more memory (set to YaCy.default 600m from 512m)
2013-10-13 20:56:03 +02:00
reger
705f147820 - add localpeername.yacy to list of local address detection for AbstractRemoteHandler
- use proxy via header info as in legacy proxy handler
2013-10-13 18:06:42 +02:00
reger
0d4efabaa8 fix YaCy version string in proxy headers
(config parameter vString not longer used)
2013-10-13 17:56:53 +02:00
reger
2226189743 disable domainhandler due to error
- domainhandler causes closed response output stream in following handlers 
  on addresses resolved to local peer (like in hello protocol preventing peer to switch to senior peer)
2013-10-13 07:24:33 +02:00
reger
eea504c117 update Info.plist
small DefaultServlet refactoring
2013-10-12 23:01:14 +02:00
reger
a44eede8b8 merge rc1/master 2013-10-11 01:50:25 +02:00
sixcooler
d9a02ed277 NPE fix for my last commit 2013-10-11 00:44:04 +02:00
reger
54a0272338 searchpage javascript (latestinfo) causes reset of search statistic after moving to next page
- disabled call via setTimeout in yacysearch.html
2013-10-10 23:23:58 +02:00
sixcooler
61f627eb85 fix for ssl-connections from proxy-usage staying in close-wait-state
+ some extra 'close' in HttpClient
2013-10-10 20:57:37 +02:00
Michael Peter Christen
d328cc4a83 fix for didyoumean, added also more asian alphabets 2013-10-09 16:17:50 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
reger
e74f548551 make legacy http server (serverCore) implement YaCyHttpServer interface 2013-10-09 01:07:22 +02:00
reger
71d2655c02 downgrade to Jetty 8 to assure support of JRE 1.6
- introduce a YaCyHttp interface to modularize/separate http server
- adjust the Jetty version specific implementation part (in package net.yacy.http)
     - putting the version specific code in classes starting with Jetty8xxxx
     - moved existing Jetty9xxx implementation into a test class (to keep the code)
- adjust build to the changed jars
- make use of the introduced YaCyHttpServer interface in related htroot servlets

- adjust other test cases/classes
2013-10-09 00:40:48 +02:00
Michael Peter Christen
1b61bd40ed - Added new solr field url_file_name_tokens_t which stores the file name
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
2013-10-08 23:48:13 +02:00
orbiter
6efa7532d2 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-10-08 19:04:57 +02:00
orbiter
5f5a97bafc added the anchor text within web pages to the searchable entities of a
web page. This can be of benefit for the ranking if these fields are
used for boosts.
2013-10-08 18:41:07 +02:00
orbiter
705b3338ee list more fields available for search and for ranking boosts 2013-10-08 18:15:35 +02:00
sixcooler
d536092fe4 fix false filling of the NAME_CACHE_MISS DNS cache in case of a timeout,
e.g. caused by massive requests when crawling from a file
2013-10-08 18:02:42 +02:00