reger
444a9ae674
remove unused options and attributes from DefaultServlet
...
cleanup obsolete class files
2013-11-24 20:11:39 +01:00
reger
8da75a4b0c
fix contentType definition for Solr html responsewriter
...
from xml to html
(hint: value is currently not used, but is in SolrServlet)
2013-11-24 04:31:08 +01:00
Michael Peter Christen
ccf2f4e43b
refactoring of seed attributes (introduced more constants)
2013-11-22 14:15:31 +01:00
Michael Peter Christen
1f0bfa8fec
added test to Base64Order (runs successfully!)
2013-11-22 10:38:42 +01:00
orbiter
b7f1e5af51
added new servlet which generates the same file as the principal peers
...
upload to a bootstrap position
you can call it either with
http://localhost:8090/yacy/seedlist.html
or to generate json (or jsonp) with
http://localhost:8090/yacy/seedlist.json
http://localhost:8090/yacy/seedlist.json?callback=seedlist
2013-11-19 15:56:10 +01:00
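The json/jsonp variants of the seedlist servlet mentioned in the entry above differ only in whether a `callback` parameter is given. A minimal sketch of that wrapping step follows; the class name `JsonpSketch`, the method `render` and the exact output shape are assumptions for illustration and are not taken from the YaCy source.

```java
final class JsonpSketch {

    /**
     * Returns plain JSON when no callback is requested, otherwise wraps the
     * JSON body in a call to the given callback function (JSONP).
     * Hypothetical helper, not the servlet's real code.
     */
    static String render(String json, String callback) {
        if (callback == null || callback.isEmpty()) {
            return json;
        }
        // Only allow identifier-like callback names so that request
        // parameters cannot inject arbitrary script into the response.
        if (!callback.matches("[A-Za-z_$][A-Za-z0-9_$.]*")) {
            throw new IllegalArgumentException("invalid callback name");
        }
        return callback + "(" + json + ");";
    }
}
```

With this sketch, a request such as `seedlist.json?callback=seedlist` would map to `render(json, "seedlist")`, while `seedlist.json` would map to `render(json, null)`.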
orbiter
3e552550d1
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-11-18 22:48:00 +01:00
orbiter
c2d720cdaf
purge a lucene cache - possible memory leak fix
2013-11-18 22:47:35 +01:00
reger
e4f49fb175
for search results with an empty title, use the filename as title
...
- to avoid storing a title in the index that isn't extracted from the source,
the empty-title check is only added to the ResultEntry class
2013-11-18 19:41:31 +01:00
reger
b1dc9a6f52
- disable Jetty servlet defaultUseCache (prevent double caching)
...
- include short memory status check for class cache in DefaultServlet
- remove obsolete Resource interface for Jetty8YaCyDefaultServlet
2013-11-18 03:15:45 +01:00
reger
f111f30ace
Merge origin/master into jetty
2013-11-17 00:18:25 +01:00
reger
94293176a3
use writeOptionHeaders with ServletResponse parameter only
2013-11-17 00:02:08 +01:00
orbiter
ff86cb683f
fixed some XSS bugs reported by Marius from http://ctf365.com/
2013-11-16 20:34:31 +01:00
orbiter
da33ee0d77
also extended timeout for webgraph postprocessing
2013-11-16 18:30:06 +01:00
orbiter
74f9e40747
extended timeout during postprocessing of 30 minutes.
2013-11-16 18:29:08 +01:00
orbiter
19a051bec8
more monitoring for postprocessing and enhanced layout in Crawler
...
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
9cf9727685
fix for wrong counter
2013-11-16 11:33:35 +01:00
Michael Peter Christen
fceac8cffd
more monitoring for postprocessing
2013-11-16 08:23:42 +01:00
Michael Peter Christen
6842783761
fixed and enhanced postprocessing
2013-11-16 08:23:21 +01:00
Michael Peter Christen
219d5934a4
fixed termination bug in Solr Connector
2013-11-16 08:22:29 +01:00
Michael Peter Christen
bf1bdd52a6
prevent requesting of 0-facets (which actually exist)
2013-11-15 15:41:41 +01:00
Michael Peter Christen
9d5895f643
enhanced and fixed postprocessing
2013-11-15 15:41:12 +01:00
Michael Peter Christen
f86fe90eda
enhanced mass storage speed to remote solr servers
2013-11-15 15:40:07 +01:00
Michael Peter Christen
6ed9821209
fixed several problems in solr connectors
2013-11-15 15:39:35 +01:00
Michael Peter Christen
191fd3d7e7
added an optimization option to HandleSet mass data storage structure
2013-11-15 15:38:00 +01:00
Michael Peter Christen
94b565ea0d
fixed keepalive min value
2013-11-15 15:37:01 +01:00
reger
b26787dc2d
- DefaultServlet: remove static gzip option
...
YaCy doesn't use pre-gzip'ed static html pages
- ProxyServlet: remove unneeded procedure
- Server init: skip one overlapping servlet context
2013-11-14 01:37:51 +01:00
Michael Peter Christen
24a052ecb9
removed debug code for existsByIds
2013-11-13 13:41:18 +01:00
Michael Peter Christen
087df05e24
added option to Config_Network_p.html to enable remote search while
...
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226
set more logger to 'final static'
2013-11-13 06:18:48 +01:00
Michael Peter Christen
c60947360d
logger should be static
2013-11-13 06:04:28 +01:00
Michael Peter Christen
69b8d61c47
fix for search requests in GSA interface which contain 'funny'
...
characters (like ':' etc.)
2013-11-12 15:54:54 +01:00
orbiter
b085cb522b
replaced old existsByIds for embedded Solr with obviously much faster
...
new selection method (including still existing debug code to test that
this is in fact better)
2013-11-11 11:25:01 +01:00
reger
b29d262e70
implement Jetty8HttpServerImpl.generateSocketAddress
...
(code 1:1 copied from serverCore)
2013-11-10 18:59:18 +01:00
orbiter
4234b0ed6c
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-11-10 18:50:43 +01:00
orbiter
909bbb49d8
added (partly commented) test code for url rewrite methods .. to be
...
completed
2013-11-10 18:50:34 +01:00
reger
066a1ecf0a
add highlight queryparams to solrservlet if missing
...
- modify query params in Solr parameter map (instead of querystring)
2013-11-10 01:36:57 +01:00
Michael Peter Christen
899e7e92b0
added debug code
2013-11-09 02:37:12 +01:00
reger
4684330505
Merge origin/master into jetty
...
Conflicts:
source/net/yacy/cora/federate/solr/responsewriter/HTMLResponseWriter.java
2013-11-07 21:44:14 +01:00
reger
1437c45383
merge rc1/master
2013-11-07 21:30:17 +01:00
Michael Peter Christen
87a956e881
calculating and showing the number of files and the average size of a
...
file in the HTCACHE in ConfigHTCache_p.html
2013-11-07 12:13:12 +01:00
Michael Peter Christen
acc1f8a749
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-11-07 12:01:37 +01:00
Michael Peter Christen
81d9e23532
fixed another memory leak in the PDF parser:
...
the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space
which cannot be cleaned if PDFont.clearResources is called.
The attempt to clean the class cache therefore causes that the class is
loaded and this cache is initialized with some rubbish. I tried to
prevent to instantiate this class by usage of a hacked findLoadedClass
call to the SystemClassLoader (which is protected ...).
Now, without using the PDF parser at all, 8MB of RAM space is not
occupied, however, when the first PDF arrives this space will be taken
and never given back to GC.
WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!
2013-11-07 11:57:01 +01:00
Michael Peter Christen
c152d996e6
reduced footprint of BookmarksDB which can take quite a lot of memory if
...
the number of bookmarks is high (i.e. > 2000 URLs)
2013-11-07 10:55:02 +01:00
Michael Peter Christen
81bb50118e
found and fixed a huge memory leak in solr caching (inside Solr). The
...
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- a Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
reger
7b17cdf6dd
add content_type:image/* to image search
...
- there are numerous index entries with content_type image but without url_file_ext_s (for various reasons) which should be included in the result
- try it yourself with following sample query
/solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type
also addresses possible urls without an extension or with a deviating one.
2013-11-07 03:11:03 +01:00
reger
082c9a98c1
move writeHeaders from Jetty8 servlet to YaCyDefaultServlet
...
- after removing Jetty server dependency (of Response using HttpServletResponse only)
2013-11-07 00:32:21 +01:00
sixcooler
987f410011
URL-export:add query and fix for cast-class-exception
2013-11-06 19:22:26 +01:00
Michael Peter Christen
a8253ca49c
added missing unicode transformation in href link contents during
...
parsing
2013-11-06 18:05:02 +01:00
Michael Peter Christen
0cf9e9580b
added clickdepth and CR computation debug code to verify that the
...
process is complete
2013-11-06 15:01:40 +01:00
reger
b85f702f22
add AccessTracker logging to SolrServlet
2013-11-05 22:57:55 +01:00
reger
de1f02420b
implement HtmlResponseWriter to solrServlet (and rss / opensearch responsewriter) as in yacy select servlet.
...
- set contenttype of HTML/GrepHTML-Responsewriter to "text/html"
- set a contenttype to GSAsearchServlet
2013-11-04 21:11:12 +01:00
Michael Peter Christen
234a974955
load images only if their parser flag is activated
2013-11-04 11:59:28 +01:00
Michael Peter Christen
b2c329929f
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-11-04 10:18:52 +01:00
Michael Peter Christen
60187a4ec2
fix in html parser
2013-11-04 10:16:20 +01:00
Michael Peter Christen
e1c1e57877
less overhead calling exist() with only one hash
2013-11-04 09:37:31 +01:00
reger
3d5d366f1c
fix html header in Solr HTMLResponseWriter
...
- move 1st body content after </head> tag
- add closing <span> tag
2013-11-04 03:12:02 +01:00
reger
bfdb404867
implement a Jetty reconnect to work with Configbasic_p.html port change
...
- instead of shutting down the server it should be sufficient to manipulate the Jetty http connector
2013-11-03 21:34:21 +01:00
Michael Peter Christen
5a02d650ee
avoid cloning
2013-11-03 18:31:50 +01:00
reger
d6760df3e5
fix servlet class exist check to use default path only (in Jetty8YaCyDefaultServlet)
...
- del redundant doget code in yacydefaultservlet
- small declaration code opts
- del obsolete libt/proxyservlet.java
2013-11-03 02:26:00 +01:00
reger
b38de92a16
Merge origin/master into jetty
2013-11-02 00:48:42 +01:00
Michael Peter Christen
cc39667399
Speed enhancements and less CPU usage during Solr searches when using
...
the embedded Solr (the default). This was obtained by circumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
2013-11-01 17:24:36 +01:00
Michael Peter Christen
434e13b46d
in host browser also show the properties of failed documents including
...
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
reger
6944225037
- add GSA search /gsa/search servlet for Jetty to Server init
...
- include SecurityHandler check for /gsa/ /solr/
- change one more YaCyDefaultServlet dependency from Jetty to std. javax.Servlet
2013-10-30 23:11:36 +01:00
reger
53cb30a221
reduce logging (by assigning logger to existing logger)
...
- small additional cleanups
2013-10-30 00:51:04 +01:00
reger
332c6d4fe1
reactivate Domain handler for .yacy / .yacyh handling
2013-10-27 19:15:20 +01:00
reger
b1ce70434e
resolve merge conflict
...
- add missing import statement
2013-10-27 15:24:04 +01:00
reger
7869a4c070
Merge origin/master into jetty
...
- merge conflict resolve
2013-10-27 15:12:17 +01:00
reger
f017066197
Merge origin/master into jetty
2013-10-27 15:09:24 +01:00
reger
06da6f517c
add YaCyProxyServlet to handle /proxy.html?url=proxyurl
...
- based on Jetty ProxyServlet
- at this time use existing HTTPD ProxyHandler for url rewrite
- add jetty-client jar (dependency in Jetty ProxyServlet)
reuse ProxyHandler.convertHeaderFromJetty in YaCyDefaultServlet
2013-10-27 05:04:24 +01:00
reger
69599566f9
catch one more malformed url in proxy url rewrite
2013-10-27 04:42:33 +01:00
reger
605530fec5
catch proxy url rewrite exception
...
malformed url (" http:\/\/" ) may cause error response
testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test
2013-10-27 04:06:11 +01:00
Michael Peter Christen
9bb7eab389
hacks to prevent storage of data longer than necessary during search and
...
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
orbiter
3c3cb78555
- removed a lot of garbage and bloated code from GuiHandler.
...
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
2013-10-24 20:42:34 +02:00
Michael Peter Christen
5afa6e3aee
Automatically flush the log cache if a short memory status is reached.
...
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff
Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
...
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
6aabc4e5c8
reduced logging line memory, 10000 lines had filled up 450MB! grrr.
...
(thank you, a bomb from the past)
2013-10-24 16:17:53 +02:00
Michael Peter Christen
1a8783147b
enhanced computation of number of solr documents.
2013-10-24 15:48:05 +02:00
Michael Peter Christen
4948c39e48
added concurrency for mass crawl check
2013-10-23 11:27:19 +02:00
Michael Peter Christen
1b4fa2947d
- fixed a problem which occurred when a document was not recognized with
...
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together now make it possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
82621bead0
When doing bootstrapping, always accept one seedlist file without
...
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
2013-10-22 15:34:51 +02:00
Michael Peter Christen
691d7e70fa
added hint to development/commit rss feed
2013-10-21 15:16:29 +02:00
orbiter
20bbde8665
fix for mustmatch regex computation: result had correct semantic, but
...
may have contained multiple same expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
2013-10-18 13:55:37 +02:00
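The idea behind the mustmatch fix in the entry above (same semantics, but no repeated alternatives in the disjunction) can be sketched as follows. This is only an illustration under assumptions: the class `MustMatchSketch`, the method `mustMatch` and the per-host pattern shape are hypothetical and do not reproduce YaCy's actual regex builder.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class MustMatchSketch {

    /**
     * Builds one regex that accepts urls on any of the given hosts.
     * A LinkedHashSet drops duplicate alternatives while keeping the
     * original order, so the disjunction stays as short as possible.
     */
    static String mustMatch(List<String> hosts) {
        Set<String> alternatives = new LinkedHashSet<>();
        for (String host : hosts) {
            // assumed pattern shape: optional www prefix, literal host, optional path
            String literal = Pattern.quote(host.toLowerCase(Locale.ROOT));
            alternatives.add("https?://(www\\.)?" + literal + "(/.*)?");
        }
        return "(" + String.join("|", alternatives) + ")";
    }
}
```

Passing the same host several times (also in different letter case) yields the same expression as passing it once, which is the property the fix describes.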
reger
cb2dbcb843
add graceful Jetty shutdown option
...
- as Jetty stop is not synced, yet
- include jetty jars and servlet-3.0 api jar in Eclipse .classpath
2013-10-18 00:42:38 +02:00
reger
f46c723398
allow to choose used http server, YaCy-Anomic or Jetty
...
- defaults to Jetty (in this branch)
- add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking
2013-10-17 03:34:22 +02:00
reger
da4ff5aefa
add YaCy HttpCommand "authenticate" check to DefaultServlet
2013-10-17 00:06:17 +02:00
Michael Peter Christen
c833d02cf5
fixed webgraph postprocessing (did nothing and repeated to do this...)
2013-10-16 11:49:04 +02:00
Michael Peter Christen
74d0256e93
enhanced postprocessing: fixed bugs, enable proper postprocessing also
...
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
reger
1adb4b8741
merge rc1/master
2013-10-16 03:02:21 +02:00
reger
77a73c7475
add YaCy HttpCommand "location" check to DefaultServlet
2013-10-16 01:48:44 +02:00
Michael Peter Christen
7b69c438f7
more methods for the table class
2013-10-15 16:46:59 +02:00
Michael Peter Christen
820b896146
Replaced the iframe loading from yacy.net for donations with the
...
loading of this iframe from the local host. To make this more flexible,
this iframe is loaded once after startup from yacy.net.
2013-10-15 16:46:06 +02:00
reger
cc223b14a4
remove wrong content mod in SSI parser for virtual path /currentyacypeer/
...
(is handled on start of request handling)
2013-10-15 03:25:24 +02:00
reger
5606291574
fix last commit (not needed test of GZipInputStream)
2013-10-14 04:29:34 +02:00
reger
f9eed8cb44
add support for gzip encoded multipart forms (needed for transferRWI.html)
...
- quick and dirty reuse of existing HTTPDemon implementation
2013-10-14 04:18:52 +02:00
reger
cf32a92629
- add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart)
...
- reduce Jetty logging
- give build.run a bit more memory (set to YaCy.default 600m from 512m)
2013-10-13 20:56:03 +02:00
reger
705f147820
- add localpeername.yacy to list of local address detection for AbstractRemoteHandler
...
- use proxy via header info as in legacy proxy handler
2013-10-13 18:06:42 +02:00
reger
0d4efabaa8
fix YaCy version string in proxy headers
...
(config parameter vString no longer used)
2013-10-13 17:56:53 +02:00
reger
2226189743
disable domainhandler due to error
...
- domainhandler causes closed response output stream in following handlers
on addresses resolved to the local peer (as in the hello protocol, preventing the peer from switching to senior peer)
2013-10-13 07:24:33 +02:00
reger
eea504c117
update Info.plist
...
small DefaultServlet refactoring
2013-10-12 23:01:14 +02:00
reger
a44eede8b8
merge rc1/master
2013-10-11 01:50:25 +02:00
sixcooler
d9a02ed277
NPE fix for my last commit
2013-10-11 00:44:04 +02:00
reger
54a0272338
searchpage javascript (latestinfo) causes reset of search statistic after moving to next page
...
- disabled call via setTimeout in yacysearch.html
2013-10-10 23:23:58 +02:00
sixcooler
61f627eb85
fix for ssl-connections from proxy-usage staying in close-wait-state
...
+ some extra 'close' in HttpClient
2013-10-10 20:57:37 +02:00
Michael Peter Christen
d328cc4a83
fix for didyoumean, added also more asian alphabets
2013-10-09 16:17:50 +02:00
Michael Peter Christen
90c8577840
enhanced ranking; patches to replace old ranking
2013-10-09 15:10:03 +02:00
reger
e74f548551
make legacy http server (serverCore) implement YaCyHttpServer interface
2013-10-09 01:07:22 +02:00
reger
71d2655c02
downgrade to Jetty 8 to assure support of JRE 1.6
...
- introduce a YaCyHttp interface to modularize/separate http server
- adjust the Jetty version specific implementation part (in package net.yacy.http)
- putting the version specific code in classes starting with Jetty8xxxx
- moved existing Jetty9xxx implementation into a test class (to keep the code)
- adjust build to the changed jars
- make use of the introduced YaCyHttpServer interface in related htroot servlets
- adjust other test cases/classes
2013-10-09 00:40:48 +02:00
Michael Peter Christen
1b61bd40ed
- Added new solr field url_file_name_tokens_t which stores the file name
...
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
2013-10-08 23:48:13 +02:00
orbiter
6efa7532d2
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-10-08 19:04:57 +02:00
orbiter
5f5a97bafc
added the anchor text within web pages to the searchable entities of a
...
web page. This can be of benefit for the ranking if these fields are
used for boosts.
2013-10-08 18:41:07 +02:00
orbiter
705b3338ee
list more fields available for search and for ranking boosts
2013-10-08 18:15:35 +02:00
sixcooler
d536092fe4
fix false filling of the NAME_CACHE_MISS DNS cache in case of a timeout
...
e.g. caused by massive requests when crawling from file
2013-10-08 18:02:42 +02:00
Michael Peter Christen
78e7aadb26
removed unused initialization method
2013-10-07 23:51:28 +02:00
Michael Peter Christen
4fbc4740df
removed warnings
2013-10-07 23:41:50 +02:00
Michael Peter Christen
21aa6a0321
migration to Solr 4.5.0
2013-10-07 17:09:40 +02:00
Michael Peter Christen
ef31d0f279
fix for rss reader, see http://bugs.yacy.net/view.php?id=294
2013-10-07 12:59:54 +02:00
Michael Peter Christen
101a6e6e14
Patch the citation index for links with canonical tags.
...
This shall fulfill the following requirement:
If a document A links to B and B contains a 'canonical C', then the
citation rank computation shall consider that A links to C and B does
not link to C.
To do so, we first must collect all canonical links, find all references
to them, get the anchor list of the documents and patch the citation
reference of these links.
2013-10-07 11:15:58 +02:00
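The requirement stated in the entry above (A links to B, B declares canonical C, so A is counted as linking to C and B is not counted as linking to C) can be illustrated with a small sketch. The class `CanonicalCitationPatch`, its `patch` method and the plain string maps are assumptions made for this example; the real implementation works on the Solr-backed citation index.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class CanonicalCitationPatch {

    /**
     * links:     source url -> target urls (citation edges)
     * canonical: url -> canonical url declared by that document
     * Returns a patched copy of the citation edges.
     */
    static Map<String, Set<String>> patch(Map<String, Set<String>> links,
                                          Map<String, String> canonical) {
        Map<String, Set<String>> patched = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : links.entrySet()) {
            String source = entry.getKey();
            Set<String> targets = new LinkedHashSet<>();
            for (String target : entry.getValue()) {
                String c = canonical.get(target);
                // A -> B, where B declares canonical C, is counted as A -> C
                String resolved = (c == null) ? target : c;
                // B itself shall not be counted as linking to its canonical C
                if (resolved.equals(canonical.get(source))) continue;
                targets.add(resolved);
            }
            patched.put(source, targets);
        }
        return patched;
    }
}
```

For the edges A -> B and B -> C with canonical(B) = C, the sketch returns A -> C and an empty target set for B.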
reger
daebeb93aa
add call to AccessTracker to jetty security handler
2013-10-04 01:16:17 +02:00
reger
172aefaeeb
adjust YaCySecurityHandler to Jetty 9 conventions
...
- mainly adjust prepareConstraintInfo to use the RoleInfo.setChecked as in Jetty Source distribution
- use constraint check behavior as in ConstraintSecurityHandler
see http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-security/src/main/java/org/eclipse/jetty/security/ConstraintSecurityHandler.java?id=jetty-9.0.5.v20130813
2013-10-03 19:38:03 +02:00
reger
6f9ed439d3
- expand localHostName check of AbstractRemoteHandler
...
to prevent the request from being handled as a proxy request
- make domain handler not rely on included path in resolved .yacy address
2013-10-01 03:04:32 +02:00
reger
561ea135af
fix : forgot adding security handler
2013-09-30 04:35:17 +02:00
reger
c7c706fd9f
merge with rc1/master
2013-09-30 03:46:39 +02:00
reger
272b196d05
update Jetty server init() to activate yacy-domain and transparent proxy handler
...
- adding domain & proxy handler to a context (as it was in initial design)
(context required for dispatcher)
- make handler context and servlet context parallel available
(to allow use of YaCyDefaultServlet to handle legacyServlets)
- set transparent proxy request handled after dispatch.forward to skip further handling for .yacy domain requests
2013-09-30 03:12:52 +02:00
reger
fd119deb00
fix NPE on modified since check ( Response.requestHeader allowed to be null)
2013-09-30 02:50:53 +02:00
reger
66145a0410
- add welcome file (index.html) support to YaCyDefaultServlet
...
- change SolrServlet default search field (&df) to text_t
2013-09-29 03:34:00 +02:00
Michael Peter Christen
b28d43decc
added two more fields source_cr_host_norm_i,target_cr_host_norm_i in
...
webgraph and an addition to postprocessing to copy all cr ranking
attributes to the link edges associated to the postprocessing documents
2013-09-27 16:57:05 +02:00
Michael Peter Christen
a52f3a597e
fix for canonical-from-http-header feature
2013-09-27 15:09:04 +02:00
Michael Peter Christen
2dd7c5be44
added parsing of http-canonical tags (untested, could not find an
...
example page)
2013-09-27 13:17:50 +02:00
Michael Peter Christen
4476dea5ba
do not fail if a wrong boost key is used; instead, print only a warning
...
See also: http://bugs.yacy.net/view.php?id=293
2013-09-27 12:28:09 +02:00
reger
ab9583d429
add default field (&df) to SolrServlet query if missing
2013-09-26 22:20:35 +02:00
Michael Peter Christen
3bf0104199
fix for crawl domain counter limitation (limit was reached too early)
2013-09-26 13:41:52 +02:00
Michael Peter Christen
82bfd9e00a
- crawl profiles shall be deleted from active and passive stacks if they
...
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00
Michael Peter Christen
1b3d26dd23
hack to remove most of the warning: deprecated messages (but not all,
...
one is left)
2013-09-25 21:14:52 +02:00
Michael Peter Christen
a496313248
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-09-25 20:41:02 +02:00
sixcooler
3c48fc65fd
reverted RemoteInstance to deprecated methods of httpClient-4.2
...
this should work with current remote-Solr-Instances
2013-09-25 18:45:16 +02:00
Michael Peter Christen
91a875dff5
self-healing of mistakenly deactivated crawl profiles. This fixes a bug
...
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
095053a9b4
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-09-25 17:32:52 +02:00
sixcooler
0cae420d8e
some dns-timing changes:
...
since httpclient uses the domain-cache it is useful not to clean the
domain cache until crawling is running (domains are filled into this
cache)
On huge crawl-starts (eg. from file) my DNS did not follow the high
rates - so I reduced the rate and give some more time(-out)
2013-09-25 15:01:28 +02:00
sixcooler
15b1bb2513
bump to httpClient-4.3
2013-09-25 14:48:37 +02:00
Michael Peter Christen
4f83d5f18c
added the new field harvestkey_s to the collection index and the
...
webgraph index which is temporary filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
2013-09-25 14:38:24 +02:00
orbiter
14442efa6d
when profiles are cleaned, there shall be first a callback showing which
...
profiles are cleaned. This shall enable a profile-termination-driven
postprocessing. To do this, index writings must carry the profile key
which will be implemented in another (next) step.
2013-09-25 11:04:12 +02:00
orbiter
0013d0d0bb
removed superfluous class
2013-09-24 21:18:37 +02:00
orbiter
f90d5296cb
Added new data structure to be used by the balancer (not used yet).
...
These data structures will enable the balancer to store the crawl queue
into individual queues, one each for a single host.
2013-09-24 21:08:40 +02:00
orbiter
0e8d752462
refactoring
2013-09-24 19:55:59 +02:00
orbiter
8ac2e8c8c9
added location navigator which causes that the image to the map search
...
is visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
2013-09-24 11:26:51 +02:00
orbiter
d86d2be5c3
automatically removed Places autotagging if no location library is
...
wanted
2013-09-24 11:23:45 +02:00
orbiter
214a087cdf
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-09-23 20:59:03 +02:00
Michael Peter Christen
96ed0c980e
- added hosthash to all documents (also fail documents which is needed
...
there for deletion), this fixes a problem for the deletion of old
documents for new crawl starts
- added clickdepth and citation computation for fail documents
2013-09-23 18:09:42 +02:00
Michael Peter Christen
179ad281f9
close include byte buffer after usage
2013-09-23 12:19:51 +02:00
reger
52dd491c04
fix not necessary use of DigestURL
2013-09-23 03:05:09 +02:00
reger
6b9a624808
remove double declaration of TLD_any_zone_filter
2013-09-23 03:01:08 +02:00
reger
5111841e5b
- reduce Jetty debug logging
...
- fix Context path initialization
2013-09-23 01:30:45 +02:00
reger
bc6ebb3c06
adjust to DigestURI changes from master to DigestURL
2013-09-22 20:57:50 +02:00
reger
561cbc7ee2
use more YaCy HeaderFramework constants (instead of Jetty's)
2013-09-22 04:23:42 +02:00
reger
5c4ba9b5db
merge rc1 master
2013-09-22 02:21:24 +02:00
reger
70c51775ae
Merge remote-tracking branch 'origin/master' into jetty
2013-09-22 02:09:02 +02:00
reger
4b77733e59
implement a YaCyDefaultServlet to handle YaCy-servlets within Jetty server
...
- the implementation is inspired by Jetty's DefaultServlet
- handles static html content and YaCy servlets
- translates between standard servlet request/response and YaCy request/response specification
With the implementation of YaCy-servlets as servlet instead via a jetty handler it's closer to servlet standard and carries less jetty specific dependencies.
2013-09-22 01:57:32 +02:00
orbiter
d2effd21db
fix for npe during location search
2013-09-21 21:03:58 +02:00
orbiter
828603e4f1
fix for 100%CPU problem in error cache cleaning process
2013-09-21 10:20:13 +02:00
orbiter
c64b51134e
hack to add all tokens from the url to text_t. This was working for the
...
RWI index (and still is working) but not for solr-only search indexes.
Maybe we should find a solution using a separate search field instead.
2013-09-21 08:57:43 +02:00
orbiter
6e8377b8ad
do not check all words with synonym library if the library is empty
2013-09-21 08:56:24 +02:00
orbiter
70ba74b23a
disabled ipv4 preference to enable ipv6-only networks like freifunk
2013-09-20 16:52:37 +02:00
orbiter
f3be1930cb
CPU problem when pushing to the error cache; wrong class,
...
ConcurrentHashMap needed for concurrency
2013-09-20 16:51:50 +02:00
Michael Peter Christen
e40671ddb7
better and consistent deletions for error urls
2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e
- removed ZURL data structure; removed also the ZURL data file
...
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
31920385f7
set anchor rel attribute of all links to "nofollow" if the html meta
...
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
2013-09-16 16:14:56 +02:00
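The rule from the entry above (a page-level nofollow, from either the html meta robots value or the X-Robots-Tag header, overrides the rel attribute of every anchor) can be sketched like this. The class `NoFollowSketch` and both method names are hypothetical, and the sketch assumes simple comma-separated directive lists without user-agent prefixes.

```java
import java.util.Locale;

final class NoFollowSketch {

    /** True if the meta robots value or the X-Robots-Tag header contains "nofollow". */
    static boolean isNoFollow(String metaRobots, String xRobotsTag) {
        return containsDirective(metaRobots, "nofollow")
            || containsDirective(xRobotsTag, "nofollow");
    }

    /** Case-insensitive lookup of one directive in a comma-separated list. */
    private static boolean containsDirective(String value, String directive) {
        if (value == null) return false;
        for (String part : value.toLowerCase(Locale.ROOT).split(",")) {
            if (part.trim().equals(directive)) return true;
        }
        return false;
    }

    /** The rel value to store for an anchor: forced to "nofollow" on nofollow pages. */
    static String effectiveRel(String anchorRel, boolean pageNoFollow) {
        if (pageNoFollow) return "nofollow";
        return anchorRel == null ? "" : anchorRel;
    }
}
```

Comparing whole directives instead of using a substring search keeps a value such as `index,follow` from being mistaken for `nofollow`.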
reger
9619b8743c
add Solr Servlet
2013-09-16 03:01:18 +02:00
Michael Peter Christen
57e00baf26
fix for parsing of image links inside of anchor links (image-links)
2013-09-15 23:54:46 +02:00
Michael Peter Christen
61c5e40687
- replaced the properties object in AnchorURL with distinct variables
...
for anchor attributes.
- as a consequence, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
3ea9bb4427
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-09-15 00:30:41 +02:00
Michael Peter Christen
5e31bad711
- the webgraph shall store all links which appear on a web page and not
...
all unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different type of link
collection, one that carries the property attributes attached to web
anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
reger
13fc86c960
Merge remote-tracking branch 'origin/master' into jetty
2013-09-14 21:10:24 +02:00
reger
f7f86d8a5d
update to Jetty 9 jars
...
- include javax.servlet 3.0
2013-09-14 20:49:05 +02:00
reger
603368fc3e
remove redundant declaration of USER_AGENT
2013-09-14 18:29:44 +02:00
reger
bd71b14d25
add mandatory p2p parameter to templatePattern
2013-09-12 22:49:09 +02:00
reger
b8da176c5d
adjust setHandled to request of call parameter
2013-09-12 22:04:10 +02:00
reger
127adbf5cf
remove references to 10_http thread (legacy http server)
...
and add needed get/set function to jetty http server wrapper
2013-09-12 22:02:11 +02:00
Michael Peter Christen
1a8c64117f
decreased the responseHeaderDB database which is now flushed more
...
frequently. This will preserve more documents in the cache in case of a
crash.
2013-09-11 13:03:58 +02:00
reger
36b7159282
- remove double initialization of jetty
...
- refactor some var assignments
2013-09-11 02:24:47 +02:00
reger
63ed04260a
Merge remote-tracking branch 'origin/master' into jetty
2013-09-10 20:42:38 +02:00
Michael Peter Christen
35ab2cef7b
added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
...
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date from most CMS.
2013-09-10 10:31:57 +02:00
reger
2ee68f76f6
added read parameter from multi-part form fields (to nasty quick-fix)
2013-09-10 01:42:08 +02:00
Michael Peter Christen
9cc8468b30
added tools to visualize image generation (i.e. during testing)
2013-09-09 12:58:26 +02:00
reger
105cf8f593
changes to adjust jetty to recent code changes
2013-09-09 02:37:29 +02:00
reger
aafef72a8a
merged current rc1/master into jetty branch to allow further development with latest version
...
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI - needs further work
- TODO: YaCy servlet return values/parameters are not handled
2013-09-09 02:36:06 +02:00
Michael Peter Christen
dbef8ccfcb
forced deletion of ZURL entries for a specific host for each host that
...
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171
refactoring (in preparation for new removeHost method)
2013-09-05 09:59:41 +02:00
Michael Peter Christen
7a5574cd51
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-09-04 23:12:04 +02:00
Michael Peter Christen
85456f46b2
added two new fields, exact_signature_copycount_i and
...
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assign this count to each document. Thus, each
document carries a number which shows how many copies of this
document exist.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
orbiter
26366596d9
fix for a problem which occurs when a site is crawled where the start
...
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a2511b5600
turned images_alt_txt back to images_alt_sxt because it is not necessary
...
to index the alt text. Indexed image Text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244
activated image type navigation for image search
2013-09-03 13:34:01 +02:00
Michael Peter Christen
9e12fdff23
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-09-03 12:22:57 +02:00
Michael Peter Christen
ab1201fdfd
fixed wrong facet count
2013-09-03 12:22:29 +02:00
Michael Peter Christen
049c3b3f2e
added an option to exclude image search results from text search. This
...
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
69f85265e1
added an option to put image links to the crawl queue and handle these
...
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
2013-09-03 11:13:45 +02:00
Michael Peter Christen
e8e558a9b7
fix for content domain classification in URIMetadataNode
2013-09-03 10:49:09 +02:00
Michael Peter Christen
a8c5bfcf58
avoid creating unnecessary objects
2013-09-03 09:48:05 +02:00
Michael Peter Christen
5a0de1b77d
moving image description text to image text field
2013-09-03 09:47:27 +02:00
Michael Peter Christen
dc179bd61f
fix for catchall query goal for image search
2013-09-03 07:55:21 +02:00
reger
392174de8c
remove all_words, all_strings lists from QueryGoal
...
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
2013-09-02 23:09:43 +02:00
Michael Peter Christen
169ef8963d
one more fix for image search
2013-09-02 20:02:26 +02:00
Michael Peter Christen
cb85b22725
redesign of the image search process (with much better results,
...
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
2013-09-02 18:55:38 +02:00
reger
29967102a2
optimized QueryGoal (reducing mem and computation by removing all_hashes)
...
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
2013-09-02 04:19:53 +02:00
orbiter
f106345eef
link strings should not be tokenized
2013-09-01 14:35:36 +02:00
orbiter
deadeb406e
image alt tag strings should be tokenized
2013-09-01 13:48:10 +02:00
reger
d0e78082d1
return field names in index instead of in schema for SolrServerConnector.getFields
2013-08-31 06:25:12 +02:00
Michael Peter Christen
1a3e42eca4
index migration to lucene 4.4
2013-08-26 12:49:39 +02:00
Michael Peter Christen
a88a62f7aa
added a feature to set a collection for a crawl result based on a
...
regular expression on the url: the collection attribute for a crawl start
may now be either a token or a list of tokens, separated by ',', where a
token is either a string or a pair <string,pattern> in which the string is
separated from the pattern by a ':' and the string is assigned to the
document as collection only if the pattern matches the url.
2013-08-25 00:13:48 +02:00
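A minimal sketch of how a collection attribute of this shape could be interpreted. The class and method names are hypothetical and not taken from the YaCy source. Two assumptions are made here: a pattern must match the whole url, and patterns do not contain a ','.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not YaCy's implementation.
public class CollectionAssignment {

    /** Returns the collection names from the attribute that apply to the given url. */
    public static List<String> collectionsFor(String attribute, String url) {
        List<String> result = new ArrayList<>();
        for (String token : attribute.split(",")) {
            token = token.trim();
            if (token.isEmpty()) continue;
            // the first ':' separates the name from the pattern, so the
            // pattern itself may contain further ':' characters
            int sep = token.indexOf(':');
            if (sep < 0) {
                // plain token: always assigned
                result.add(token);
            } else {
                String name = token.substring(0, sep);
                String pattern = token.substring(sep + 1);
                // pair: assigned only if the pattern matches the whole url (assumption)
                if (Pattern.matches(pattern, url)) result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String attribute = "user,docs:.*\\.pdf,wiki:.*wiki.*";
        // prints [user, docs]
        System.out.println(collectionsFor(attribute, "http://example.org/manual.pdf"));
    }
}
```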
Michael Peter Christen
3c5abedabf
NPE during shutdown fix
2013-08-24 23:36:50 +02:00
Michael Peter Christen
e4cbe9232d
fixed a crawler bug where a double-occurring url was not re-crawled
...
because the double-check error was written to the error-db and never
deleted. Now the error-db is cleared on every start and these
double-messages are not written to the error-db any more.
2013-08-22 15:56:09 +02:00
Michael Peter Christen
765943a4b7
Redesign of crawler identification and robots steering. A non-p2p user
...
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the ability to disregard robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on a per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
0f3d8890db
removed an assert which causes a shortcut call circuit
2013-08-22 10:12:25 +02:00
Michael Peter Christen
6d5fefe060
added missing files :(
2013-08-20 16:31:34 +02:00
Michael Peter Christen
554c0351dd
fix for http://bugs.yacy.net/view.php?id=286
2013-08-20 16:10:26 +02:00
Michael Peter Christen
47b1c81d08
- refactoring
...
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
1c62fa7698
fix for bad snippets in gsa api
2013-08-18 10:37:25 +02:00
Michael Peter Christen
697613170d
less logging for postprocessing (this was a debugging logging with high
...
CPU load)
2013-08-17 09:25:32 +02:00
reger
b4016ff324
- remove possible double initialization of rdfa parser
...
- use ordered list to use preferred parser for mime/extension first (relates to html, rdfa, argument parser)
- harmonize xhtml extension config for the 3 html base parsers
2013-08-14 21:12:10 +02:00
reger
f0575bd44b
FieldReIndex: omit active vocabulary fields from reindex detection
2013-08-14 00:00:30 +02:00
reger
a5019bc470
make Vocabulary Navigator tags a hard result entry filter
...
by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query)
TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.
2013-08-13 03:07:25 +02:00
reger
a67a4b7d86
improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org)
2013-08-12 21:20:23 +02:00
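A minimal sketch of the kind of mismatch described above, using hypothetical names. The loose pattern shown is only an illustration of how an unescaped dot can accept www.abcinet.org; it is not claimed to be the pattern YaCy used before or after this change.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not YaCy's implementation.
public class TldFilter {

    /** Builds a pattern that accepts only hosts ending in a literal ".<tld>". */
    public static Pattern forTld(String tld) {
        // the dot is escaped and the tld is quoted, so both are matched literally
        return Pattern.compile(".*\\." + Pattern.quote(tld), Pattern.CASE_INSENSITIVE);
    }

    /** True if the host name ends with the given top-level domain. */
    public static boolean accepts(String tld, String host) {
        return forTld(tld).matcher(host).matches();
    }

    public static void main(String[] args) {
        // illustrative loose pattern: the unescaped '.' matches the 'i' in "abcinet"
        Pattern loose = Pattern.compile(".*.net.*");
        System.out.println(loose.matcher("www.abcinet.org").matches()); // true
        System.out.println(accepts("net", "www.abcinet.org"));          // false
        System.out.println(accepts("net", "www.yacy.net"));             // true
    }
}
```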
reger
02fe8b43ba
Field Re-Indexing: display list of fields in reindex queue
...
change servlet to display statistic on 1st click (instead of after refresh)
2013-08-11 04:51:29 +02:00
sixcooler
7f501b7c38
clear some caches before reporting low Memory
...
do not break lines in Network-table-rows
2013-08-08 14:38:26 +02:00
reger
b355dd52c6
Index Administration - Field Re-Indexing: exclude internal Solr _version_ field from obsolete field check
2013-08-08 00:55:21 +02:00
sixcooler
8a96140f92
fix / workaround for
...
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4750
+ Seed.hash should be final
2013-08-01 16:40:58 +02:00
Michael Peter Christen
2857499467
fix to collection schema; bug appeared for _txt fields with empty String
...
as content
2013-07-31 13:32:05 +02:00
Michael Peter Christen
dbfa865700
added a stub of a class for crawler redesign
2013-07-31 13:16:32 +02:00
Michael Peter Christen
76afcccaaf
fix for default boolean post values: the default value MUST NOT be TRUE,
...
because it's normal that a boolean value is missing in the post argument
if a checkbox is not selected.
Added also some style enhancements to IndexFederated, removed the Solr
attachment manual and replaced it with a link to the wiki which explains
this in more detail.
2013-07-31 10:49:26 +02:00
orbiter
252c525709
fixed feed api servlet and enhanced RSSReader class
2013-07-31 06:18:30 +02:00
orbiter
d38c3c14d8
fix for CGI test
2013-07-31 05:43:58 +02:00
Michael Peter Christen
31902f54df
fix for NPE which happens within solr code at MultiMapSolrParams.java,
...
line 52 in case that the array arr.length == 0
2013-07-30 14:32:59 +02:00
Michael Peter Christen
f13df9dbb6
migration to solr 4.4.0
2013-07-30 14:01:16 +02:00
Michael Peter Christen
58fe986cca
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20
replaced the single-text description solr field with a multi-value
...
description_txt text field
2013-07-30 12:48:57 +02:00
sixcooler
7d53ac86a3
fix for Blacklist (-Administration)
2013-07-29 19:09:28 +02:00
reger
f2d99053ed
Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception)
...
(occurred during testing while working on q=store:[* TO *])
2013-07-29 01:32:02 +02:00
reger
92d3f71b16
htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
...
note: stream.close is done by caller (Textparser.parseSource)
- removed unnecessary reset in AugmentParser
- added stream.mark in tdfatripleimpl. to make stream.reset work here
2013-07-28 03:41:09 +02:00
orbiter
87cfeaa4f3
fix for npe
2013-07-27 15:20:09 +02:00
orbiter
268a36aaff
emergency fix for crawler: this will otherwise cause loss of complete
...
crawl queue if latency of remote system is too low
2013-07-27 11:59:07 +02:00
orbiter
d05e0c5368
wait a bit longer before doing the first peer ping
2013-07-27 11:00:35 +02:00
orbiter
b8f57f7703
don't be noisy when doing background tasks that may be allowed to fail
2013-07-27 10:51:58 +02:00
Roland Haeder
0343f0668c
Fix for NPE:
...
E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in serverInstantThread.job, thread 'net.yacy.search.Switchboard.cleanupJob': null; target exception: null
java.lang.NullPointerException
at net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116)
at net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897)
at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107)
at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165)
Conflicts:
source/net/yacy/search/schema/CollectionConfiguration.java
2013-07-27 10:19:46 +02:00
Roland Haeder
b58ca8622d
Some cleanups:
...
- added SKINS_PATH_DEFAULT in the same way as LISTS_PATH_DEFAULT was added
- Added 'final' keyword to a string
2013-07-27 10:13:57 +02:00
Roland Haeder
7263bb82fb
Fix for NPE on shutdown:
...
java.lang.NullPointerException
at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732)
at net.yacy.search.Switchboard.access00(Switchboard.java:207)
at net.yacy.search.Switchboard.run(Switchboard.java:3049)
2013-07-27 09:55:43 +02:00
Roland Haeder
13433d41a1
Log this exception better
...
Conflicts:
source/net/yacy/kelondro/blob/Tables.java
2013-07-27 09:54:51 +02:00
orbiter
080d80c9de
do not write an empty failreason in case that there is no fail. Because
...
of the lazy instantiation rule this value was not actually written, but
if lazy instantiation is switched on, then this causes that all crawl
starts delete all crawl-start-hosts completely because this looks for
filled error reasons.
2013-07-26 17:53:28 +02:00
Michael Peter Christen
4c242f9af9
always use a default value for boolean options to have transparency for
...
the outcome if the attribute is missing in servlets
2013-07-25 12:17:29 +02:00
Michael Peter Christen
61e015268b
fix in forced deletion: forced commit needed
2013-07-25 09:53:19 +02:00
Michael Peter Christen
83e2921b39
new test case for http://bugs.yacy.net/view.php?id=141
2013-07-25 09:31:48 +02:00
Michael Peter Christen
304aacb2cc
fix for http://bugs.yacy.net/view.php?id=267
2013-07-25 09:26:24 +02:00
Michael Peter Christen
c3b2301b2f
fix for http://bugs.yacy.net/view.php?id=268
2013-07-25 09:21:37 +02:00
reger
aa1a1f1d2c
- small adjustment to make sure genericParser is tried last
...
-- for some documents genericParser grabs document instead of specific available parser due to unordered pick of 1st to try parser
(like .ps .rdf files and other)
- remove redundant file extension registration
2013-07-23 20:24:13 +02:00
orbiter
3e901dcb06
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-07-23 19:33:07 +02:00
orbiter
f50b596e0b
do not run dht distribution if system load is over 2.5
2013-07-23 19:32:32 +02:00
orbiter
056b42f5aa
- added information about segment count to status_p.xml
...
- also moved this information from the old index structure, which is
still in use for the RWI/DHT index to that front-end
2013-07-23 18:03:33 +02:00
orbiter
6fb2811e68
fixes for problems with remote solr and non-activated webgraph index
2013-07-23 16:46:44 +02:00
sixcooler
af740f3058
changed optimization to a segment-size of index-size/5.000.000
...
+ one if not idle
+ one (and force) if postprocessing
2013-07-23 14:21:12 +02:00
Michael Peter Christen
336f86394c
replaced StringBuffer with StringBuilder
2013-07-23 12:21:27 +02:00
Michael Peter Christen
aeac2fb763
replaced more containsKey() -> get() usages by a simple get(), followed
...
by a test for NULL. This should increase the application speed and
reduce the lookup time for the affected methods by 50%
2013-07-23 12:16:51 +02:00
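A minimal sketch of the replacement described above, with hypothetical method names. The two variants return the same result only when the map never stores null values, which is assumed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example, not code from the YaCy source.
public class SingleLookup {

    /** Two hash lookups: one in containsKey(), one in get(). */
    static String lookupTwice(Map<String, String> map, String key) {
        if (map.containsKey(key)) return map.get(key);
        return "";
    }

    /** One hash lookup: get() followed by a null test. */
    static String lookupOnce(Map<String, String> map, String key) {
        String value = map.get(key);
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("host", "localhost");
        System.out.println(lookupTwice(map, "host") + " " + lookupOnce(map, "host"));
        System.out.println(lookupOnce(map, "missing").isEmpty());
    }
}
```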
orbiter
5364c4dcc9
delayed first peer-ping to send the first ping out after the http got
...
up; if the ping comes before the http is up, it cannot be recognized as
senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266
2013-07-22 18:21:37 +02:00
orbiter
e24016e30a
added the property federated.service.solr.indexing.timeout to yacy.init
...
to provide a configurable time-out for solr; see also:
http://bugs.yacy.net/view.php?id=254
2013-07-22 17:45:12 +02:00
orbiter
c124037f19
removed forced non-soft commits to prevent index fragmentation
2013-07-22 17:28:20 +02:00
Michael Peter Christen
31483c47e1
fixed problem with remote luke requests
2013-07-22 15:55:20 +02:00
Michael Peter Christen
c15aa758dc
removed failreason_t removal patch because that causes too much
...
confusion using an external solr. To clean up the index after a schema
change, use the index cleaner function from the online servlet
2013-07-22 14:17:38 +02:00
reger
2b7a38640a
extend content type detection on file extension for .tif .tiff .htm
2013-07-21 22:57:21 +02:00
Michael Peter Christen
ac1aad5064
added a getSegmentCount method and use it to disable optimize if wanted
...
current segment count is below optimization level
2013-07-18 14:31:42 +02:00
Michael Peter Christen
36035e0a0a
- used reger's LukeRequest to generalize the index info in
...
SolrServerConnector
- used the LukeRequest in SolrServerConnector to replace the index size
method by a getNumDocs request to a LukeRequest result
2013-07-18 13:26:07 +02:00
Michael Peter Christen
39fceb5ccf
fix for NPE & bug #264
2013-07-18 12:37:32 +02:00
Michael Peter Christen
735a66eff3
enhancements to crawler
2013-07-18 12:29:04 +02:00
Roland Haeder
be0ff6018f
Removed trailing spaces + some more final
2013-07-17 18:44:24 +02:00
Roland Haeder
aaedc0405d
Fixes and avoidance of catching bad exceptions (some):
...
- Rewrote usage of HashMap/Map to concurrent versions (to avoid a
CME=ConcurrentModificationException)
- Rewrote ConnectionInfo (as an example) to use a synchronized iterator
instead of synchronizing an
already synced HashSet (see Collections call)
- This avoids catching CMEs again
- Commented out noisy ConcurrentLog.logException() call
Conflicts:
source/net/yacy/repository/LoaderDispatcher.java
2013-07-17 18:37:34 +02:00
Roland Haeder
841a28ae76
Added 'final' for all exception blocks as this helps the Java compiler
...
to optimize memory usage
Conflicts:
source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Felix Ableitner
03044589dd
Fixed (?i) appearing in entries, fixed multiple equal lines in file.
2013-07-17 16:42:10 +02:00
Michael Peter Christen
89c0aa0e74
added collection_sxt to error documents
2013-07-17 15:20:56 +02:00
Michael Peter Christen
0df5195cb0
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-07-17 12:42:06 +02:00
Michael Peter Christen
1fd006cc56
fixes using the embedded connector
2013-07-17 12:41:54 +02:00
orbiter
d0dc86cf3d
logging of deadlocks (if any) during cleanup process
2013-07-17 12:38:58 +02:00
Michael Peter Christen
c6a6f159e8
fix for crawl stack domain counter
2013-07-16 18:18:55 +02:00
Michael Peter Christen
93d1bac140
do a more frequent optimization, reduces IO after optimization
2013-07-16 17:16:48 +02:00
orbiter
b71d13a014
added load and deadlock detector in Memory util
2013-07-16 10:49:20 +02:00
orbiter
290e24564b
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
2013-07-14 17:41:32 +02:00
orbiter
5533fc8e01
fix for bug 260
2013-07-14 17:40:28 +02:00
Michael Peter Christen
b79471ee67
grr
2013-07-14 10:15:47 +02:00
Michael Peter Christen
a79f288ac1
automatically running optimize on solr if user/search is idle for some
...
time
2013-07-14 10:02:08 +02:00
orbiter
a9c8046c87
do a light optimization at the end of a crawl postprocessing
2013-07-13 19:09:46 +02:00
orbiter
a548354c71
changed type of solr schema object sku from text_en_splitting_tight to
...
string
2013-07-13 18:54:09 +02:00
orbiter
2f1ec8d4a2
npe fix
2013-07-13 11:10:05 +02:00
Michael Peter Christen
bcc623a843
refactoring of load_delay: this is a matter of client identification
2013-07-12 16:24:56 +02:00
orbiter
0d0b3a30f5
activate api actions after postprocessing of crawls
2013-07-12 16:05:48 +02:00
orbiter
3978c5ca5d
fix for http://bugs.yacy.net/view.php?id=255
2013-07-12 14:38:30 +02:00
orbiter
2be456e7fb
added a postprocessing field into api/status_p.xml to show if the
...
postprocessing task is running at that time (status: busy) or not
(status:idle)
2013-07-12 14:29:22 +02:00
orbiter
dac88561ae
minimum access time has a tight connection to ClientIdentification,
...
therefore it is defined there.
2013-07-11 17:04:24 +02:00
Michael Peter Christen
9a29ab469e
another patch to prevent CLOSE_WAIT status on solr connections
2013-07-11 12:53:39 +02:00
Michael Peter Christen
5091d627bc
fixed parsing of peer flags
2013-07-11 12:53:16 +02:00
Michael Peter Christen
87e9052081
added Connection:close to all http requests in our http client to
...
prevent CLOSE_WAIT states (as seen in lsof)
2013-07-11 11:54:11 +02:00
Michael Peter Christen
5c6946dd5f
replaced usage of log4j by ConcurrentLog where possible
2013-07-09 14:42:39 +02:00
Michael Peter Christen
5878c1d599
- refactoring of log to ConcurrentLog:
...
jdk-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is an add-on to jdk logging that puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
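A minimal sketch of the queue-based logging idea described above. This is not the ConcurrentLog class itself: the class name, the sink parameter and the end marker are assumptions made for illustration. Callers only enqueue a message, and a single worker thread hands the messages to the real logger one by one.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;
import java.util.logging.Logger;

// Hypothetical sketch, not YaCy's ConcurrentLog.
public class QueuedLog {

    // end marker, compared by identity so no real message can collide with it
    private static final String POISON = new String("POISON");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    /** The sink receives every message on the worker thread, in enqueue order. */
    public QueuedLog(Consumer<String> sink) {
        worker = new Thread(() -> {
            try {
                String message;
                while ((message = queue.take()) != POISON) sink.accept(message);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** Enqueues the message and returns without waiting for the logger. */
    public void log(String message) {
        queue.add(message);
    }

    /** Writes out everything enqueued so far and stops the worker thread. */
    public void close() {
        queue.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        QueuedLog log = new QueuedLog(m -> logger.info(m));
        log.log("first message");
        log.log("second message");
        log.close();
    }
}
```

An unbounded queue keeps callers from ever blocking, at the cost of memory if messages arrive faster than they are written; a bounded queue would trade that for back-pressure.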
orbiter
f4f6551c66
better handling of time-out at solrj in case that a commit is done in a
...
fail-over case during add
2013-07-09 11:01:37 +02:00
Michael Peter Christen
07261fe274
Merge remote-tracking branch 'nutomics/blacklist_structure'
2013-07-08 23:32:15 +02:00