Commit Graph

6612 Commits

Author SHA1 Message Date
reger
f23471c471 add check to prevent index entries containing url_file_ext_s with ";jsession=xyz"
note: check could be implemented in MultiProtocolURL (but at this time didn't oversee possible implication)
2013-11-25 00:14:53 +01:00
reger
8da75a4b0c fix contentType definition for Solr html responswriter
from xml to html
(hint: value is currently not used, but is in SolrServlet)
2013-11-24 04:31:08 +01:00
Michael Peter Christen
ccf2f4e43b refactoring of seed attributes (introduced more constants) 2013-11-22 14:15:31 +01:00
Michael Peter Christen
1f0bfa8fec added test to Base64Order (runs successfully!) 2013-11-22 10:38:42 +01:00
orbiter
b7f1e5af51 added new servlet which generates the same file as the principal peers
upload to a bootstrap position
 you can call it either with
 http://localhost:8090/yacy/seedlist.html
 or to generate json (or jsonp) with
 http://localhost:8090/yacy/seedlist.json
 http://localhost:8090/yacy/seedlist.json?callback=seedlist
2013-11-19 15:56:10 +01:00
orbiter
3e552550d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-18 22:48:00 +01:00
orbiter
c2d720cdaf purge a lucene cache - possible memory leak fix 2013-11-18 22:47:35 +01:00
reger
e4f49fb175 for searchresults with empty title use filename as title
- to not store a title in index which isn't extracted from source 
  the title is empty check only added to ResultEntry class
2013-11-18 19:41:31 +01:00
orbiter
ff86cb683f fixed some XSS bugs reported by Marius from http://ctf365.com/ 2013-11-16 20:34:31 +01:00
orbiter
da33ee0d77 extended also timeout fr webgraph postprocessing 2013-11-16 18:30:06 +01:00
orbiter
74f9e40747 extended timeout during postprocessing of 30 minutes. 2013-11-16 18:29:08 +01:00
orbiter
19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
9cf9727685 fix for wrong counter 2013-11-16 11:33:35 +01:00
Michael Peter Christen
fceac8cffd more monitoring for postprocessing 2013-11-16 08:23:42 +01:00
Michael Peter Christen
6842783761 fixed and enhanced postprocessing 2013-11-16 08:23:21 +01:00
Michael Peter Christen
219d5934a4 fixed termination bug in Solr Connector 2013-11-16 08:22:29 +01:00
Michael Peter Christen
bf1bdd52a6 prevent requesting of 0-facets (which actually exist) 2013-11-15 15:41:41 +01:00
Michael Peter Christen
9d5895f643 enhanced and fixed postprocessing 2013-11-15 15:41:12 +01:00
Michael Peter Christen
f86fe90eda enhanced mass storage speed to remote solr servers 2013-11-15 15:40:07 +01:00
Michael Peter Christen
6ed9821209 fixed several problems in solr connectors 2013-11-15 15:39:35 +01:00
Michael Peter Christen
191fd3d7e7 added an optimization option to HandleSet mass data storage structure 2013-11-15 15:38:00 +01:00
Michael Peter Christen
94b565ea0d fixed keepalive min value 2013-11-15 15:37:01 +01:00
Michael Peter Christen
24a052ecb9 removed debug code for existsByIds 2013-11-13 13:41:18 +01:00
Michael Peter Christen
087df05e24 added option to Config_Network_p.html to enable remote search while
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
c60947360d logger should be static 2013-11-13 06:04:28 +01:00
Michael Peter Christen
69b8d61c47 fix for search requests in GSA interface which contain 'funny'
characters (like ':' etc.)
2013-11-12 15:54:54 +01:00
orbiter
b085cb522b replaced old existsByIds for embedded Solr with obviously much faster
new selection method (including stil existing debug code to test that
this is in fact better)
2013-11-11 11:25:01 +01:00
orbiter
4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-10 18:50:43 +01:00
orbiter
909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
completed
2013-11-10 18:50:34 +01:00
Michael Peter Christen
899e7e92b0 added debug code 2013-11-09 02:37:12 +01:00
Michael Peter Christen
87a956e881 calculating and showing the number of files and the average size of a
file in the HTCACHE in ConfigHTCache_p.html
2013-11-07 12:13:12 +01:00
Michael Peter Christen
acc1f8a749 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-07 12:01:37 +01:00
Michael Peter Christen
81d9e23532 fixed another memory leak in the PDF parser:
the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space
which cannot be cleaned if PDFont.clearResources is called.
The attempt to clean the class cache therefore causes that the class is
loaded and this cache is initialized with some rubbish. I tried to
prevent to instantiate this class by usage of a hacked findLoadedClass
call to the SystemClassLoader (which is protected ...).
Now, without using the PDF parser at all, 8MB of RAM space is not
occupied, however, when the first PDF arrives this space will be taked
and never given back to GC.
WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!
2013-11-07 11:57:01 +01:00
Michael Peter Christen
c152d996e6 reduced footprint of BookmarksDB which can take quite a lot of memory if
the number of bookmarks is high (i.e. > 2000 URLs)
2013-11-07 10:55:02 +01:00
Michael Peter Christen
81bb50118e found and fixed a huge memory leak in solr caching (inside Solr). The
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- an Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
reger
7b17cdf6dd add content_type:image/* to image search
- see numerous idx entries with content_type image without url_file_ext_s (for various reason) which should be included in result
- try it yourself with following sample query
   /solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type

adresses also possible url without or deviating extension.
2013-11-07 03:11:03 +01:00
sixcooler
987f410011 URL-export:add query and fix for cast-class-exception 2013-11-06 19:22:26 +01:00
Michael Peter Christen
a8253ca49c added missing unicode transformation in href link contents during
parsing
2013-11-06 18:05:02 +01:00
Michael Peter Christen
0cf9e9580b added clickdepth and CR computation debug code to verify that the
process is complete
2013-11-06 15:01:40 +01:00
Michael Peter Christen
234a974955 load image only if their parser flag is activated 2013-11-04 11:59:28 +01:00
Michael Peter Christen
b2c329929f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-04 10:18:52 +01:00
Michael Peter Christen
60187a4ec2 fix in html parser 2013-11-04 10:16:20 +01:00
Michael Peter Christen
e1c1e57877 less overhead calling exist() with only one hash 2013-11-04 09:37:31 +01:00
reger
3d5d366f1c fix html header in Solr HTMLResponseWriter
- move 1st body content after </head> tag
- add closing <span> tag
2013-11-04 03:12:02 +01:00
Michael Peter Christen
5a02d650ee avoid cloning 2013-11-03 18:31:50 +01:00
Michael Peter Christen
cc39667399 Speed enhancements and less CPU usage during Solr searches when using
the embedded Solr (the default). This was obtained by cirumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
2013-11-01 17:24:36 +01:00
Michael Peter Christen
434e13b46d in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
reger
69599566f9 catch one more malformed url in proxy url rewrite 2013-10-27 04:42:33 +01:00
reger
605530fec5 catch proxy url rewrite exception
malformed url (" http:\/\/" ) may cause error response
 testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test
2013-10-27 04:06:11 +01:00