Commit Graph

230 Commits

Author SHA1 Message Date
orbiter
e24016e30a added the property federated.service.solr.indexing.timeout to yacy.init
to provide a configurable time-out for solr; see also:
http://bugs.yacy.net/view.php?id=254
2013-07-22 17:45:12 +02:00
orbiter
c124037f19 removed forced non-soft commits to prevent index fragmentation 2013-07-22 17:28:20 +02:00
Michael Peter Christen
c15aa758dc removed failreason_t removal patch because that causes too much
confusion using an external solr. to clean up the index after a schema
change, use the index cleaner function from the online servlet
2013-07-22 14:17:38 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
89c0aa0e74 added collection_sxt to error documents 2013-07-17 15:20:56 +02:00
orbiter
d0dc86cf3d logging of deadlocks (if any) during cleanup process 2013-07-17 12:38:58 +02:00
Michael Peter Christen
c6a6f159e8 fix for crawl stack domain counter 2013-07-16 18:18:55 +02:00
Michael Peter Christen
93d1bac140 do a more frequent optimization, reduces IO after optimization 2013-07-16 17:16:48 +02:00
Michael Peter Christen
b79471ee67 grr 2013-07-14 10:15:47 +02:00
Michael Peter Christen
a79f288ac1 automatically running optimize on solr if user/search is idle for some
time
2013-07-14 10:02:08 +02:00
orbiter
a9c8046c87 do a light optimization at the end of a crawl postprocessing 2013-07-13 19:09:46 +02:00
Michael Peter Christen
bcc623a843 refactoring of load_delay: this is a matter of client identification 2013-07-12 16:24:56 +02:00
orbiter
0d0b3a30f5 activate api actions after postprocessing of crawls 2013-07-12 16:05:48 +02:00
orbiter
2be456e7fb added a postprocessing field into api/status_p.xml to show if the
postprocessing task is running at that time (status: busy) or not
(status:idle)
2013-07-12 14:29:22 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
f1c5338210 prepartion for greedy crawl profiles and refactoring 2013-07-01 13:10:09 +02:00
Michael Peter Christen
f9d859f5dc now writing image alt texts and (camelcase-)parsed urls into a text
search field for a better image retrieval
2013-06-18 16:51:56 +02:00
Michael Peter Christen
bdf306e0a7 increased time-out for loading of seed-lists 2013-06-13 22:32:06 +02:00
Michael Peter Christen
1762911f57 added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods causes deadlocks inside
solr.
2013-06-12 02:13:18 +02:00
Michael Peter Christen
6115bef335 added a 'greedy learning' mechanismn which will cause that a 'fresh'
yacy will load linked web pages from search results until the total
number of web pages reaches 15000. This shall give fresh peers a 'boost'
to get faster a personalized search index.
2013-06-11 14:42:30 +02:00
Michael Peter Christen
8e965ffd16 fix for host compare in case that the host is null. This happens when
doing a search in the intranet for file resources (they don't have a
host).
2013-06-10 16:23:58 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
Michael Peter Christen
5f92c68f1f removed block rank ranking and all YBR files in /ranking 2013-05-30 13:01:22 +02:00
Michael Peter Christen
0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of
soft commits, reduced caching size of search events, ensured that solr
results are processed before connection is closed to keep that stuff not
too long in RAM
2013-05-29 18:27:27 +02:00
Michael Peter Christen
5344a1c5f7 getting the trash out 2013-05-29 16:09:05 +02:00
Michael Peter Christen
8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
reduced time-out of robots.txt load limit
2013-05-20 22:05:28 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is a much better
performace when testing the existence of many urls. The effect should
cause very much less IO during index transmission, both on sender and
receiver side.
2013-05-17 13:59:37 +02:00
Michael Peter Christen
44e363f37f refactoring of WorkflowProcessor, added process counter, update of
process counter if an blocking thread dies. Added also a new column in
PerformanceConcurrency_p servlet to show the actual number of concurrent
processes.
2013-05-13 13:28:07 +02:00
orbiter
cf36c1614f prevent that concurrent deletion process causes wrong double-check in
crawl start
2013-05-12 21:37:45 +02:00
Michael Peter Christen
b9b446bca6 - added ssl configuration sign (a lock) to network statistic/table
- fixed a bug in bitfield
2013-05-10 17:32:21 +02:00
reger
4fc6837690 - fix monitor url of crawl job in PerformanceQueues_p.html
- reduce logging of every index add  (switch embeddedsolr.add from info to debug)
2013-05-10 04:38:13 +02:00
orbiter
a1c989002b fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652
generate dht data even if dht receive and dht transmission is switched
off
2013-05-08 16:48:45 +02:00
Michael Peter Christen
e26bdd4a52 fixes to deletion methods (removed unnecessary concurrency and added
removal of crawl queue entries)
2013-05-08 13:26:25 +02:00
Michael Peter Christen
f7f3e28c5e prevent that the size of the index is computed too many times.
Because the index size is now provided by solr, and the only way to do
that is a match for [* TO *], a size computation is quite complex and
time-consuming. Therefore this patch prevents that the method is called
at all and if necessary puts a DOS-preventing barrier in front of it.
2013-05-08 11:50:46 +02:00
Michael Peter Christen
cca19d94d4 re-declared some fields to be of type string rather than text which
makes them more efficient and less large
2013-05-06 16:45:54 +02:00
reger
46fa800bc7 added httpstatus_i to automatically switched on fields (used in all search queries) 2013-04-27 03:11:44 +02:00
Michael Peter Christen
3a0fcfbeda Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-26 10:50:08 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adopted logging end error management
- adopted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy more
adjusted to smaller and 1-core devices.
- the workflow processor now starts no process at all. these are started
as soon as parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
c091000165 added collection attribute also to the rss feed reader 2013-04-24 01:14:35 +02:00
Michael Peter Christen
97775fbebc fixed ranking for add-function queries: this did not work. The option
was removed. All function queries are now boosts (multiplies the score
according to a function). This is also the recommended way to boost
rankings based on functions as explained in
http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
2013-04-16 14:45:14 +02:00
Michael Peter Christen
a20941c067 resume paused crawls on startup; user expects that restarts 'heal'
everything
2013-04-11 15:07:08 +02:00
orbiter
e4d26d1cb4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-17 10:52:42 +01:00
orbiter
940c6849ee enhanced did-you-mean (a bit): can now remember previously searched
words (plus small enhancements)
2013-03-17 10:52:31 +01:00
reger
d57b221921 add: reset Solr schema filed selection to default button in IndexSchema_p 2013-03-17 03:46:29 +01:00
Michael Peter Christen
b8ed66a55d added all clickdepth computations for source and target paths in
webstructure core
2013-03-14 17:54:33 +01:00
Michael Peter Christen
6300730d7f refactoring of clickdepth computation as preparation for clickdepth
computation of webgraph links
2013-03-14 12:13:02 +01:00
orbiter
47114910d5 fix for possible memory leaks 2013-03-13 17:55:37 +01:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configruation
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00