Commit Graph

6509 Commits

Author SHA1 Message Date
Michael Peter Christen
a1644ca0fd new workflow processor in Segment to enqueue indexing documents to solr 2013-05-30 12:34:53 +02:00
Michael Peter Christen
a8dc4346e8 default configuration of MMapDirectoryFactory for solr, increased lock
timeout, less documents from remote searches (too many results had
easily blocked a peer)
2013-05-30 12:31:28 +02:00
Michael Peter Christen
0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of
soft commits, reduced caching size of search events, ensured that solr
results are processed before connection is closed to keep that stuff not
too long in RAM
2013-05-29 18:27:27 +02:00
Michael Peter Christen
5344a1c5f7 getting the trash out 2013-05-29 16:09:05 +02:00
Michael Peter Christen
709e9b8ce7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-29 13:49:42 +02:00
Michael Peter Christen
1eb9626cca less logging 2013-05-29 13:30:32 +02:00
Michael Peter Christen
281959a2d7 added option to re-boot the embedded solr during run-time. Added also
API recording for this method so it can be repeated automatically. The
index dump generation is now also available for API recording. Added
some synchronization in backend which was necessary for this.
2013-05-29 13:09:34 +02:00
orbiter
da621e827e prevent NPE in case RWI is disabled 2013-05-28 16:26:38 +02:00
Michael Peter Christen
c2bcfd8afb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-28 11:39:10 +02:00
Michael Peter Christen
67757b425a use a retry handler with retryCount=0 because we usually expect requests
to fail if we access non-permanently available resources (peers, web
pages) and want to fail fast without repeating the same request which is
doomed to fail. The previous appearance of http client connection had a
1-2-4-8-second timeout scheme, which caused that connection attempts
lasted for 16 seconds.
2013-05-28 11:38:45 +02:00
Michael Peter Christen
c2b1075dcf activating pollImmediately in case that DHT receive is off. This will
cause a much faster search result when running in public robinson mode.
2013-05-28 10:36:49 +02:00
orbiter
888a985dc6 set a higher limit for table copy usage 2013-05-27 15:23:12 +02:00
Michael Peter Christen
2b563debbf javadoc of new multiple-exist test 2013-05-27 13:45:09 +02:00
Marc Nause
8fb1b1e290 *) simplified banner creation code 2013-05-25 12:56:43 +02:00
Michael Peter Christen
8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
reduced time-out of robots.txt load limit
2013-05-20 22:05:28 +02:00
reger
97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index
- adjust Junit yacyVersionTest & ParserTest 
- update yacyVersion.combined2prettyVersion to the default 4-digit minor ver.
2013-05-20 01:50:09 +02:00
Michael Peter Christen
b68fbe7d21 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/migration.java
2013-05-17 14:13:07 +02:00
Michael Peter Christen
06d3063dc9 - no downcase when using collection modifier
- removed warnings
2013-05-17 14:11:10 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is a much better
performance when testing the existence of many urls. The effect should
cause very much less IO during index transmission, both on sender and
receiver side.
2013-05-17 13:59:37 +02:00
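A minimal sketch of the idea in the commit above, not the actual YaCy code: instead of one request per id, all ids are combined into a single query string. The class name ExistQuery, the method byIds and the field name "id" are hypothetical; only the "one call instead of many" principle comes from the commit message.

```java
import java.util.Arrays;
import java.util.Collection;

final class ExistQuery {
    /** Builds one query matching any of the given ids, e.g. id:("a" OR "b"). */
    static String byIds(String field, Collection<String> ids) {
        if (ids.isEmpty()) throw new IllegalArgumentException("no ids given");
        StringBuilder q = new StringBuilder(field).append(":(");
        boolean first = true;
        for (String id : ids) {
            if (!first) q.append(" OR ");
            // escape backslashes and quotes so every id stays one quoted term
            q.append('"').append(id.replace("\\", "\\\\").replace("\"", "\\\"")).append('"');
            first = false;
        }
        return q.append(')').toString();
    }

    public static void main(String[] args) {
        // one query string for three ids instead of three separate requests
        System.out.println(byIds("id", Arrays.asList("abc", "def", "ghi")));
    }
}
```

The ids found in the single response can then be compared against the requested set to decide which urls already exist.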
reger
7f63d3747d more generic field selection for reindex option of documents with disabled fields
using Luke request to compare config with actual fields in index
2013-05-15 23:16:32 +02:00
Michael Peter Christen
c91c67c3cd reject bad solr requests 2013-05-15 22:42:05 +02:00
Michael Peter Christen
44e363f37f refactoring of WorkflowProcessor, added process counter, update of
process counter if a blocking thread dies. Added also a new column in
PerformanceConcurrency_p servlet to show the actual number of concurrent
processes.
2013-05-13 13:28:07 +02:00
Michael Peter Christen
4058369288 fixed query expressions for collection selection (added quotes) 2013-05-13 13:27:01 +02:00
Michael Peter Christen
f2e36fbd06 enhanced deletion process for very large number of documents 2013-05-13 13:26:24 +02:00
reger
79401cb938 added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html)
this allows to remove obsolete fields from the index (according to current schema config)
by selecting all documents containing disabled fields.
2013-05-13 04:06:57 +02:00
orbiter
cf36c1614f prevent that concurrent deletion process causes wrong double-check in
crawl start
2013-05-12 21:37:45 +02:00
orbiter
aeff31cd44 fix for workflow processor (cause: latest redesign for less threads) 2013-05-12 21:36:20 +02:00
Michael Peter Christen
77faeada4d small memory leak patch 2013-05-11 11:19:06 +02:00
Michael Peter Christen
b24d1d18e4 removed synchronization and concurrency in Fulltext class, concurrent
deletions are now handled in ConcurrentUpdateSolrConnector
2013-05-11 10:53:12 +02:00
Michael Peter Christen
b9b446bca6 - added ssl configuration sign (a lock) to network statistic/table
- fixed a bug in bitfield
2013-05-10 17:32:21 +02:00
Michael Peter Christen
e6c8b545c2 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-10 12:16:55 +02:00
orbiter
a83c2fe833 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-10 12:02:40 +02:00
orbiter
4baa0d4a97 Added a default keystore for ssl encryption of the YaCy web interface.
This will enable https-access to YaCy, but this feature is disabled by
default using the new server.https=false attribute. This has two
purposes:
- make it easier for everyone to use https (just set server.https=true)
- provide the basis for secure yacy-to-yacy communication in the future
2013-05-10 12:02:31 +02:00
Michael Peter Christen
aaddb4809c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-05-10 04:57:15 +02:00
Michael Peter Christen
038f956821 fix for sitemap detection: the sitemap url was not visible if it
appeared after the declaration of robots allow/deny for the crawler
because the sitemap parser terminated after the allow/deny rules had
been found. Now the parser reads the robots.txt until the end to
discover also sitemap rules at the end of the file.
2013-05-10 04:56:58 +02:00
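The fix described above can be sketched as follows, assuming a simplified line-based parser (the class RobotsSitemaps and its method are hypothetical and not the YaCy robots parser): the loop never stops at the allow/deny rules, so Sitemap lines at the end of the file are still found.

```java
import java.util.ArrayList;
import java.util.List;

final class RobotsSitemaps {
    /** Collects all Sitemap urls, reading the robots.txt content to the very end. */
    static List<String> sitemaps(String robotsTxt) {
        List<String> urls = new ArrayList<String>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) continue; // blank line or comment
            // case-insensitive match of the "Sitemap:" prefix, wherever the line appears
            if (t.regionMatches(true, 0, "sitemap:", 0, 8)) {
                urls.add(t.substring(8).trim());
            }
            // allow/deny rules are ignored here; no early termination after them
        }
        return urls;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nSitemap: http://example.org/sitemap.xml\n";
        System.out.println(sitemaps(robots));
    }
}
```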
reger
4fc6837690 - fix monitor url of crawl job in PerformanceQueues_p.html
- reduce logging of every index add  (switch embeddedsolr.add from info to debug)
2013-05-10 04:38:13 +02:00
Michael Peter Christen
442ed50be0 removed some unnecessary synchronizations 2013-05-09 03:06:48 +02:00
Michael Peter Christen
ad050ec88d - upgraded httpclient, httpcore and httpmime
- removed httpclient 3.1 which has been used by solrj < 4.x.x and is now
not used any more
- fixed some parts in YaCy which used methods from httpclient 3.1
2013-05-09 00:22:45 +02:00
orbiter
a1c989002b fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652
generate dht data even if dht receive and dht transmission is switched
off
2013-05-08 16:48:45 +02:00
Michael Peter Christen
e26bdd4a52 fixes to deletion methods (removed unnecessary concurrency and added
removal of crawl queue entries)
2013-05-08 13:26:25 +02:00
Michael Peter Christen
f2c9b0b5f2 better robustness of Concurrent Solr Connector against update/deletion
thread failure
2013-05-08 12:41:24 +02:00
Michael Peter Christen
f7f3e28c5e prevent that the size of the index is computed too many times.
Because the index size is now provided by solr, and the only way to do
that is a match for [* TO *], a size computation is quite complex and
time-consuming. Therefore this patch prevents that the method is called
at all and if necessary puts a DOS-preventing barrier in front of it.
2013-05-08 11:50:46 +02:00
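One possible form of such a barrier, given only as a sketch under assumptions: the expensive count is cached and recomputed at most once per interval. The class ThrottledSize, the interval parameter and the explicit clock argument are hypothetical and chosen for illustration; the commit does not say how the barrier is built.

```java
import java.util.function.LongSupplier;

final class ThrottledSize {
    private final LongSupplier expensiveCount; // stands in for the [* TO *] match
    private final long minIntervalMillis;
    private long cached = -1;
    private long lastComputed = 0;

    ThrottledSize(LongSupplier expensiveCount, long minIntervalMillis) {
        this.expensiveCount = expensiveCount;
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns the cached size unless it is older than the minimum interval. */
    synchronized long size(long nowMillis) {
        if (cached < 0 || nowMillis - lastComputed >= minIntervalMillis) {
            cached = expensiveCount.getAsLong();
            lastComputed = nowMillis;
        }
        return cached;
    }
}
```

A caller would pass System.currentTimeMillis() as nowMillis; the explicit argument only keeps the sketch deterministic.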
Michael Peter Christen
cca19d94d4 re-declared some fields to be of type string rather than text which
makes them more efficient and less large
2013-05-06 16:45:54 +02:00
Michael Peter Christen
cc90f82dbb increased default proxy client timeout to one minute 2013-05-06 14:58:18 +02:00
Michael Peter Christen
ed1d5bace6 draw the names of other peers which receive/send dht into the network
graphic
2013-05-06 14:27:39 +02:00
Michael Peter Christen
b528448332 enlarge network graph circle according to image height and reduce the
image height in the Network servlet. Overall, the image is now larger
but takes less space on the web page.
2013-05-05 23:39:46 +02:00
reger
24d2b4baee remove pre 1.0 migration statement which possibly overwrites user navigator setting 2013-05-05 05:00:42 +02:00
Michael Peter Christen
3841854c97 abstraction of catchall term 2013-05-04 00:14:22 +02:00
Michael Peter Christen
ea85674be2 added the date to error documents 2013-05-04 00:14:00 +02:00
Michael Peter Christen
6fafed2180 fix for solr cache when a delete buffer is filled and a document, which
is in the delete queue, is replaced with a new one.
2013-05-03 02:03:30 +02:00
Michael Peter Christen
20b767f35e preventing score computation in solr where applicable 2013-05-03 02:02:35 +02:00
orbiter
7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
- check geolocation coordinates and accept only those, which are
well-formed
- the solr push process does not stop crawling any more if after 20
requests to Solr, Solr does not accept the record. Instead, a severe log
entry asks the user to create a bug request
2013-05-03 00:24:39 +02:00
Michael Peter Christen
ee217dbdee remove sort order in all cases where not needed 2013-04-30 11:44:56 +02:00
Michael Peter Christen
70e981b333 prevent that long-running deletion tasks block a hard commit. 2013-04-30 11:09:21 +02:00
Michael Peter Christen
bb4bf3d8fd infinity timeout bug protection patch 2013-04-30 11:06:48 +02:00
Michael Peter Christen
1b102d98d8 - added index deletion to index administration submenu
- added index deletion processes to the process scheduler/recorder
2013-04-30 02:11:28 +02:00
Michael Peter Christen
d1be4127e7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-29 19:31:40 +02:00
Michael Peter Christen
1aac722cc6 added another solr connector, the ConcurrentUpdateSolrConnector which
does not block when long-running updates to solr are made. This is
realized using blocking queues which process all long-running tasks in
the background. Also some bugfixes to existing connectors.
2013-04-29 19:30:04 +02:00
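The blocking-queue pattern named in this commit can be sketched like this. It is a simplified illustration, not the ConcurrentUpdateSolrConnector itself: the class BackgroundUpdater is hypothetical, documents are plain strings, and adding to an in-memory list stands in for the long-running write to solr.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class BackgroundUpdater {
    private static final String POISON = new String("POISON"); // end marker, compared by identity
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final List<String> stored = Collections.synchronizedList(new ArrayList<String>());
    private final Thread worker;

    BackgroundUpdater() {
        worker = new Thread(new Runnable() {
            public void run() {
                try {
                    String doc;
                    // take() blocks the worker, never the caller of add()
                    while ((doc = queue.take()) != POISON) stored.add(doc);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.start();
    }

    /** Enqueues a document and returns immediately. */
    void add(String doc) { queue.add(doc); }

    /** Processes everything still queued, stops the worker and returns what was stored. */
    List<String> close() {
        queue.add(POISON);
        try { worker.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return new ArrayList<String>(stored);
    }
}
```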
Michael Peter Christen
0af7803367 added more features to ScoreMap (pretty toString) 2013-04-29 19:28:17 +02:00
Michael Peter Christen
f36a7da5f6 - re-introduced existById in solr connector.
- introduced raw-queries for the re-introduced byId-Queries (they are
hopefully faster than full edismax queries)
- removed the cached solr connector (testing this) to rely only on the
solr built-in search caches. That should save some RAM (also). We will
see if this is usable.
2013-04-28 21:20:14 +02:00
reger
46fa800bc7 added httpstatus_i to automatically switched on fields (used in all search queries) 2013-04-27 03:11:44 +02:00
Michael Peter Christen
3502b4c697 refactoring (renaming) of yacy-solr api 2013-04-27 01:32:18 +02:00
Michael Peter Christen
3a0fcfbeda Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-26 10:50:08 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adapted logging and error management
- adapted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy more
adjusted to smaller and 1-core devices.
- the workflow processor now starts no process at all. these are started
as soon as parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
c091000165 added collection attribute also to the rss feed reader 2013-04-24 01:14:35 +02:00
orbiter
f7571386a3 added a 'collection' property attribute in yacysearch.html which can be
used to select between different collections as defined during a crawl
start with the 'collection' attribute. This actually implements the
ability to prepare search tenants which restrict their search results to
a specific collection. The main use for this is to provide tenants to
the yaml4 interface (at this time).
2013-04-23 20:42:54 +02:00
orbiter
3e79bd4b1f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-23 12:15:46 +02:00
Michael Peter Christen
d937c55204 extended limitation of dom export size from 100000 to 100000000 2013-04-22 22:33:13 +02:00
Michael Peter Christen
fc2095ac67 some extensions to raster plotter to transform a RGB picture to an
indexed color scheme. This is needed for gif animations
2013-04-22 14:33:04 +02:00
Michael Peter Christen
c1a2175fbc added transparency to gif image animation and the integration to the
YaCy httpd for on-the-fly generated gifs (including animated gifs)
2013-04-21 12:29:05 +02:00
orbiter
5d442dad82 avoid NPE in regex checker 2013-04-20 10:53:49 +02:00
Michael Peter Christen
50421171c3 added new schema fields:
hreflang_url_sxt and hreflang_cc_sxt
for
http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077

navigation_url_sxt and navigation_type_sxt
for
http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html

publisher_url_s
for http://support.google.com/plus/answer/1713826?hl=de

all fields are disabled by default and not written to the index.
2013-04-18 17:21:17 +02:00
Michael Peter Christen
566d6c980c checking of document signature for a double-document check now refers
only to documents within the same domain
2013-04-17 16:15:27 +02:00
Michael Peter Christen
1d30082446 added hindi translation configuration 2013-04-17 12:57:27 +02:00
Michael Peter Christen
d05dc07cff setting of new default values for ranking 2013-04-16 15:02:00 +02:00
Michael Peter Christen
97775fbebc fixed ranking for add-function queries: this did not work. The option
was removed. All function queries are now boosts (multiplies the score
according to a function). This is also the recommended way to boost
rankings based on functions as explained in
http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
2013-04-16 14:45:14 +02:00
Michael Peter Christen
ac5fa9fe48 fix for result counter logging 2013-04-16 13:32:13 +02:00
Michael Peter Christen
7ab5093321 added new solr title_exact_signature_l and
description_exact_signature_l to be able to identify unique title and
unique description fields.
2013-04-16 01:35:15 +02:00
Michael Peter Christen
f24ac518e6 redesign of exists()-query (can now be called with query) and the
CachedSolrConnector which based its cache on the key value. This will be
used to correct the title_unique_b and description_unique_b field.
2013-04-15 14:08:30 +02:00
Michael Peter Christen
27d6222880 added new field host_extent_i which, after a crawl and postprocessing,
holds the number of documents for the host where the document is hosted.
This is necessary for ranking and the norming of references per local
host in the ranking computation.
2013-04-14 20:52:40 +02:00
reger
518b20147c skip postprocessing during document.store if no citation index connected (prevent null pointer exception) 2013-04-14 02:01:27 +02:00
Marc Nause
ac478384d3 *) did some long overdue refactoring 2013-04-13 23:04:44 +02:00
Michael Peter Christen
ada3f27de7 added three new fields for a better ranking: references_internal_i,
references_external_i and references_exthosts_i. These can be used to
count and evaluate the number of external links to every web page. An
experimental ranking function can be e.g.:
div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))
2013-04-12 16:17:14 +02:00
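For illustration, the function query quoted above corresponds to the following arithmetic. This is a plain Java restatement under the assumption that div, add and product have their usual meaning; the class and parameter names are hypothetical, and the real evaluation happens inside Solr.

```java
final class ExperimentalBoost {
    /** (internal + external * extHosts) / (clickdepth + 1), computed in floating point. */
    static double boost(int internal, int external, int extHosts, int clickdepth) {
        return (internal + (double) external * extHosts) / (clickdepth + 1);
    }

    public static void main(String[] args) {
        // 3 internal references, 4 external references from 2 hosts, clickdepth 1
        System.out.println(boost(3, 4, 2, 1));
    }
}
```

Adding 1 to clickdepth_i keeps the divisor positive for root pages with clickdepth 0.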
Michael Peter Christen
082e3274d6 - setting the same default ranking in the solr interface as for YaCy
search interfaces if no other ranking attributes are given
- using the YaCy ranking in the GSA interface only if there was not
given a GSA-style sort attribute
- to avoid confusion about correct ranking attributes, only the default
'0'-ranking profile is used and not scenario-adopted (site, date)
because that should be configurable in the web interface before it is
used actually for ranking.
2013-04-12 10:48:41 +02:00
Michael Peter Christen
a20941c067 resume paused crawls on startup; user expects that restarts 'heal'
everything
2013-04-11 15:07:08 +02:00
Michael Peter Christen
edc0b33f6d - showing references count and clickdepth in host browser
- fixed generation and presentation of both values
2013-04-11 14:46:13 +02:00
reger
566a3b0294 fix: Index Administration > Reverse Word Index (IndexControlRWIs_p) corrected use of word search to word-hash search
- removed duplicate QueryParams.hashes2Handles , redundant  with .hashes2Set
2013-04-08 21:25:21 +02:00
Michael Peter Christen
cf0acd2cb4 upgrade to solr 4.2.1 2013-04-06 16:11:24 +02:00
reger
e89491271f - fix opensearch discover err msg - webgraph not enabled - if no opensearchdescription link found in index
- remove search2.net from sample config (is down)
2013-04-04 00:40:59 +02:00
reger
6a9d0b60a3 make sure configured port is reported on recreated mySeed.txt 2013-04-01 03:51:57 +02:00
Michael Peter Christen
870aedf3c6 fixes for better search interface integration in yaml templates 2013-03-20 16:19:49 +01:00
Michael Peter Christen
5512be6673 fix in GSA result writer which evaluates result context fields as
String. After the migration to Solr 4.1.0 'some' of these fields
suddenly are stored as String[]; this patch compensates this confusion.
2013-03-19 10:33:35 +01:00
Michael Peter Christen
342ba1049b - callback fix
- memory allocation problem in RowCollection: if memory is too low, do
not try to increase by 1 because this leads to very long execution
time and at the end to the same OOM as if we allocate the memory at the
moment we need it even if the resource observer states that this memory
is not there. To compensate this, the increase size is reduced.
2013-03-19 10:32:01 +01:00
orbiter
65d73e5652 renamed callback function to 'callback' because that is a standard for
jsonp which is also used in backbone.js/jquery
2013-03-19 00:59:47 +01:00
orbiter
17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
and html documents
2013-03-17 22:13:56 +01:00
orbiter
e4d26d1cb4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-17 10:52:42 +01:00
orbiter
940c6849ee enhanced did-you-mean (a bit): can now remember previously searched
words (plus small enhancements)
2013-03-17 10:52:31 +01:00
reger
d57b221921 add: reset Solr schema field selection to default button in IndexSchema_p 2013-03-17 03:46:29 +01:00
Michael Peter Christen
9406a2e438 fixed NPE during index abstract computation 2013-03-15 10:04:27 +01:00
Michael Peter Christen
16e9d4d1dd added a restart hint 2013-03-15 10:00:06 +01:00
Michael Peter Christen
b3a54d5b1c fix for wrong class name in log 2013-03-15 09:35:57 +01:00
Michael Peter Christen
2d36a7eaf5 - do not create a new query for all remote peers
- no document search this time
- adjusted banner and network to not show 'WORDS' but DHT Chunks. This
is to avoid confusion for robinson peers which do not create Word
Entries
2013-03-15 00:14:28 +01:00
Michael Peter Christen
4af0839be2 use appropriate ranking for each search situation:
- when using the /date modifier, a date ranking profile is used
- when using a site: modifier, a ranking profile supporting longer urls
is used
2013-03-14 21:13:12 +01:00
Michael Peter Christen
b8ed66a55d added all clickdepth computations for source and target paths in
webstructure core
2013-03-14 17:54:33 +01:00
Michael Peter Christen
6300730d7f refactoring of clickdepth computation as preparation for clickdepth
computation of webgraph links
2013-03-14 12:13:02 +01:00
Michael Peter Christen
2080fc7406 removed unused tag fields 2013-03-14 10:35:21 +01:00
reger
230a12bfe2 adjust Opensearch discover function to new webgraph Solr schema 2013-03-14 03:10:54 +01:00
orbiter
6b13dd0d3d added clickdepth field writing for webgraph core (unfinished) 2013-03-14 01:35:38 +01:00
orbiter
47114910d5 fix for possible memory leaks 2013-03-13 17:55:37 +01:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configuration
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00
reger
38f46eb33d set RootNodeFlag only if EmbeddedSolr is connected (as RootNodes may receive direct Solr queries) 2013-03-12 03:13:14 +01:00
reger
2962f2b9e9 Merge branch 'master' of git://gitorious.org/yacy/rc1.git 2013-03-12 02:51:17 +01:00
orbiter
ab74d559fb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-11 18:23:43 +01:00
Michael Peter Christen
4490133909 removed target_tag_s (superfluous) 2013-03-11 10:46:29 +01:00
orbiter
cd197bb555 fix for NPE if surrogates do not exist 2013-03-10 19:46:06 +01:00
reger
6ae30f9d0f replace the terminateOldSessions - return immediate time from fixed 3 sec to requested minage parameter 2013-03-10 05:22:18 +01:00
Michael Peter Christen
252bb51f98 fix for wrong mime type in noload crawler 2013-03-07 15:31:00 +01:00
Michael Peter Christen
25300913fa fixes to search debugging after testing with the different search
debugging options
2013-03-05 21:28:22 +01:00
Michael Peter Christen
81380ae5c8 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-05 12:24:10 +01:00
Michael Peter Christen
c2fde018b5 concurrent snippet fetching from solr results which do not have snippets 2013-03-05 12:24:01 +01:00
orbiter
b1140e3d82 added debug switches for detailed search testing 2013-03-05 12:19:32 +01:00
orbiter
cdbfddf091 added filter queries for better image, audio and video results 2013-03-04 21:18:54 +01:00
Michael Peter Christen
587ef83eab added missing cleanup statements for short memory cases during search 2013-03-04 13:01:24 +01:00
orbiter
2562f052b9 do not put the fulltext field text_t into the search cache because it is
not used there and uses a lot of memory
2013-03-04 12:01:10 +01:00
Michael Peter Christen
2b6c79d347 in method exists() also use the new caching-stacks for
documents/metadata
2013-03-04 01:13:17 +01:00
Michael Peter Christen
ae734b3f8d enhanced the search result processing
- no waiting time at the end
- switched on 'classic' snippet production and verification (again)
2013-03-04 00:17:29 +01:00
Michael Peter Christen
0d7b4bc891 better protection against OOM during search flush and fixed missing
result push
2013-03-03 23:45:47 +01:00
Michael Peter Christen
221ed7d764 - enhanced concurrency during search without IO blocking
- introduced a second queue to flush remote search results (now: old
metadata structure from DHT peers)
- fixed result counters
2013-03-03 22:38:50 +01:00
Michael Peter Christen
3b1d9dc884 made index storage from DHT search result concurrent. This prevents
blocking by high CPU usage during search. Also: removed query from Solr
for DHT search results; results are taken from the pending queue.
2013-03-02 10:25:52 +01:00
orbiter
f13c0b2abd fix for search 2013-03-01 19:18:16 +01:00
orbiter
0f7ea7ad9f - enhanced solr.add procedure for mass adds
- removed unused solr access classes
- made snippet generation for documents from YaCy RWI/DHT concurrent (as
it was before the search process removal)
- reduced the number of remote results in settings file because the
processing of such mass documents add is too CPU-intensive (in Solr)
2013-03-01 15:27:17 +01:00
Michael Peter Christen
f327ffedb4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-02-28 15:55:13 +01:00
orbiter
9c09fd7d0b better/less requests to local solr; the request is made in chunks which
are exactly at only that size which is needed to present the current
search result page. This will also cause that next solr requests are made
automatically during switching to next pages.
2013-02-28 14:04:08 +01:00
Michael Peter Christen
840fa22135 disabled clickdepth computation during crawling since that is repeated
during clean-up phase.
2013-02-28 02:25:39 +01:00
orbiter
d74472f562 corrected result counter 2013-02-27 22:40:23 +01:00
orbiter
2555542f7a removed the dns prefetch because that was not so useful 2013-02-27 20:58:34 +01:00
Michael Peter Christen
d957739441 removed size request 2013-02-26 17:53:44 +01:00
Michael Peter Christen
c95a84103a complete redesign of search process:
- removed 'worker' processes
- no internal time-out behaviour: methods either are successful or
return null
- waiting is only done on top-level
- removed snippet-production; this is replaced by solr snippets
- removed statistics based on solr size queries (they had been VERY
long); the statistics (like suggestions or tag cloud) are now again
based on the old but very fast RWI index. In portal or intranet mode the
RWI index is usually switched off; if you like to have statistics again
then you must switch on the rwis again in this mode.
- fixed many bugs regarding correct page counter
2013-02-26 17:16:31 +01:00
Michael Peter Christen
35fa718b77 testing to use solr for portalsearch caused some bugfixing but no full
success: try to comment out the solr search request in
yacy-portalsearch.js
2013-02-25 14:31:50 +01:00
Michael Peter Christen
008288719c fix for schema export to consider also automatically generated
coordinate fields
2013-02-25 01:13:03 +01:00
Michael Peter Christen
089dee1770 - generalized SchemaConfiguration into super-class Configuration and
adopted other classes which used the configuration-only access for that
class
- removed many warnings
- adjusted logging
2013-02-25 00:09:41 +01:00
Michael Peter Christen
c16de49f64 fix for webgraph delete query 2013-02-24 18:17:58 +01:00
Michael Peter Christen
56d5946a59 - added flags in IndexFederated_p.html to switch on or off the webgraph
index (new solr core webgraph) .. this is now off by default
- completely redesigned this servlet
- added description how to attach a remote solr
- adjusted naming of servlet and menus
- moved 'lazy initialization' attribute from IndexSchema to
IndexFederated (this is a general option) back again.
2013-02-24 18:09:34 +01:00
Michael Peter Christen
14cceb6b17 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	htroot/IndexFederated_p.html
	source/net/yacy/cora/federate/solr/YaCySchema.java
	source/net/yacy/peers/Protocol.java
	source/net/yacy/search/Switchboard.java
	source/net/yacy/search/index/Segment.java

also moved portalsearch-dev to yacy-portalsearch to be able to fix
problems with new attachment to solr of the search widget
2013-02-23 08:48:33 +01:00
Michael Peter Christen
58e1e6fa2b fixes to schema 2013-02-23 08:14:10 +01:00
reger
f291d60c5f on remote Solr search take only locally enabled schema fields from remote solrdocument for the inputdocument added to local index 2013-02-22 22:17:45 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
The default schema uses only some of them and the resulting search index
has now the following properties:
- webgraph size will have about 40 times as many entries as default
index
- the complete index size will increase and may be about the double size
of current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will cause that some old parts in YaCy can be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also available for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
structure, but is not filled yet. To have the opportunity of a second
core, multi-core functionality had to be implemented to the
deep-embedded solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as addition to solr connections: all solr
connectors are now instances of a solr 'instance' object; this required
a complete re-design of the solr embedding
- migrated also caching and sharding on top of new instance handling
- migrated the search apis to handle now the access to a specific core,
the default core named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; they must
retrieve the actual connection object every time they want to write to
it (this solves also some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
2013-02-21 13:23:55 +01:00
Michael Peter Christen
33bc255e85 prevent that crawl starts with very large url lists cause a time-out in
the user front-end
2013-02-15 01:58:28 +01:00
Michael Peter Christen
b6de1f42dc Full redesign of solr connection architecture. This was done to support
multiple solr cores instead of just one. Therefore it is now necessary
to distinguish between solr server connections (called an 'Instance')
and a connection to a single solr core. One Instance may now have
multiple connector classes assigned to it, each connecting to a single
core.
To support multiple cores it is also necessary to distinguish between
the connection configuration and the configuration of the index schema.
We will have multiple schema configurations in the future, each for
every solr core. This caused that the IndexFederated servlet had to be
split into two parts, the new Servlet for the Schema editor is now in
the IndexSchema Servlet.
2013-02-15 01:38:10 +01:00
Michael Peter Christen
4111606654 removed the commitWithin attribute because that is not the right
way to update the index for us. It may also be superfluous with
the solr 4.0 softcommit.
2013-02-13 02:29:47 +01:00
Michael Peter Christen
c20fa3640d fix to unbalanced tag and license for null objects 2013-02-13 01:23:05 +01:00
Michael Peter Christen
3a6097966d added jsonp option to yjson result writer 2013-02-13 01:11:57 +01:00
Michael Peter Christen
de58043205 Added image license generation for solr image search results when
results are generated within yjson result writer. This makes it possible
to view images in yacyinteractive from solr.
2013-02-13 00:33:53 +01:00
Michael Peter Christen
d3508fa8ff fixed json search, quotes, auto-facets, urls etc. for
yacyinteractive.html
2013-02-13 00:01:38 +01:00
Michael Peter Christen
1db23e9eac Moved methods from SolrServerConnector to AbstractSolrConnector with the
result that most of these methods become superfluous in other classes.
This is a generalization step towards multi-indexes in Solr.
2013-02-12 22:03:10 +01:00
Michael Peter Christen
16d90859b7 reverted put-semantics back to as-usual in serverObjects and introduced
an add-method to put in several objects for the same key
2013-02-12 11:52:33 +01:00
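The difference between the restored put-semantics and the new add-method in 16d90859b7 can be sketched as follows. This is an illustrative model only, not the real serverObjects class: `MultiValueParamsSketch` and all of its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parameter container that can hold several values per key.
final class MultiValueParamsSketch {
    private final Map<String, List<String>> values = new HashMap<>();

    // put behaves "as usual": the new value replaces everything stored for the key.
    void put(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    // add appends a further value for the same key.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns the first value for the key, or null if there is none.
    String get(String key) {
        List<String> list = values.get(key);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    // Returns all values for the key as a read-only list.
    List<String> getAll(String key) {
        List<String> list = values.get(key);
        if (list == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(list);
    }

    public static void main(String[] args) {
        MultiValueParamsSketch params = new MultiValueParamsSketch();
        params.add("facet.field", "host_s");
        params.add("facet.field", "author_sxt");
        System.out.println(params.getAll("facet.field"));
    }
}
```

Keeping `put` as a plain replace means existing callers behave as before, while code that needs several facet requests for one key opts in through `add`.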
Michael Peter Christen
0d888ff69e Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-02-12 03:42:58 +01:00
Michael Peter Christen
c34af7fe94 extended JSON Response Writer and Opensearch Response Writer for the
Solr search interface in such way that it is possible to use this
interface for the yacyinteractive search. This search interface is now
much faster using the Solr search directly. For the Solr interface it
was necessary to create a translation from the YaCy search modifiers to
the Solr facet selection. This was added in such a way that it becomes
generic for the normal YaCy search and as a on-top evaluation for Solr
queries.
2013-02-12 03:42:46 +01:00
reger
c37d718f16 make sure yacy.running is deleted if not running (catch exception)
- to prevent following log if YaCy was previously not properly shutdown 

E ... STARTUP WARNING: the file C:\src\git\yacy-rc1\DATA\yacy.running exists, this usually means that a YaCy instance is still running
E ... STARTUP FATAL ERROR: java.util.concurrent.TimeoutException
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
	at net.yacy.cora.protocol.TimeoutRequest.call(TimeoutRequest.java:91)
	at net.yacy.cora.protocol.TimeoutRequest.ping(TimeoutRequest.java:112)
	at net.yacy.yacy.startup(yacy.java:200)
	at net.yacy.yacy.main(yacy.java:638)
Caused by: java.util.concurrent.TimeoutException

- adjust Netbeans path (to solr4.1.jars)
2013-02-11 22:53:19 +01:00
Michael Peter Christen
762b687e47 extended the serverObjects to be able to hold multiple values for a
single key. This is done using the solr class MultiMapSolrParams. That
class is needed in the OpensearchResultWriter to get multiple facet
requests.
2013-02-11 22:12:15 +01:00
Michael Peter Christen
d70d99fab5 added more metadata fields and facets to OpensearchResponseWriter.
This should make it possible to replace the original and enriched yacy
opensearch result with a solr output in opensearch format.
2013-02-11 22:10:14 +01:00
Michael Peter Christen
6a4878940b fix in html parser and bookmark generation 2013-02-11 13:28:08 +01:00
Michael Peter Christen
dee8b24d3c better error handling for bookmarks 2013-02-09 06:55:57 +01:00
Michael Peter Christen
e1da39245a when searching the network, do not search on robinson peers with the old
DHT search interface. Now use the solr interface.
2013-02-08 18:30:08 +01:00
Michael Peter Christen
6f6ddaf7e7 A robinson peer does not need to write RWI data if such peers are only
searched using the solr interface. Searching public robinsons will be
done with solr only in the future.
2013-02-08 17:58:54 +01:00
Michael Peter Christen
ab4f74c82c fix for xml blacklist import 2013-02-08 15:12:10 +01:00
Michael Peter Christen
7806680ab8 fixed a problem with re-feeding of already indexed documents with
coordinates attached.
2013-02-08 12:45:54 +01:00
Michael Peter Christen
cb38e860cf Windows users simply forget that they started YaCy: YaCy is
still running and the user expects that another double-click on the
YaCy icon simply opens the search window (again). I decided to add a
function that meets this expectation: simply open the browser pop-up
page again if the user starts YaCy while YaCy is still running.
2013-02-07 23:39:00 +01:00
Marc Nause
27894d2c1a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2013-02-05 21:09:41 +01:00
Marc Nause
75f9568472 *) only install files from the RELEASE directory
*) minor changes
2013-02-05 21:02:32 +01:00
Michael Peter Christen
eb80405a16 added a disable function in RemoteCrawl_p servlet which prevents setting
of remote crawl if peer is not a senior or principal peer
2013-02-05 12:47:20 +01:00
Michael Peter Christen
19c46e4acf catch more exceptions 2013-02-04 21:24:39 +01:00
Michael Peter Christen
7de502f43d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-02-04 20:02:35 +01:00
Marc Nause
3bc5ee6e3d *) added protection against CSRF in update download page
(http://localhost:8090/ConfigUpdate_p.html?releaseinstall=../../test.txt&deleteRelease=Delete+Release
does not work anymore)
2013-02-04 19:57:28 +01:00
Michael Peter Christen
4f270d89e2 another NPE 2013-02-04 18:04:52 +01:00
Michael Peter Christen
921091c3a6 use thread-safe http connection manager for authenticated remote solr
connections
2013-02-04 17:48:04 +01:00
Michael Peter Christen
e8f7b85b98 fixes to internal RWI usage if RWI is switched off (NPE etc) 2013-02-04 17:11:02 +01:00
Michael Peter Christen
3834829b37 bugfixes and more logging for solr connector 2013-02-04 16:42:10 +01:00
Michael Peter Christen
80fe3d7860 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/cora/federate/solr/connector/EmbeddedSolrConnector.java
2013-02-04 10:57:54 +01:00
Michael Peter Christen
4323621a76 update to Solr 4.1.0 2013-02-04 10:55:49 +01:00
reger
160ce568b3 move testing SolrServlet.main to test, making include of jetty*.jar in distribution and classpath obsolete
- move jetty*.jar to test library 
- move SolrServlet.main as is to test, add also a junit test simulating main 
  - add build.xml cleanup for EmbeddedSolrConnectorTest created test/DATA
- adjust some test compile errors
2013-02-03 22:32:38 +01:00
orbiter
07a20e8253 removed unused import 2013-02-02 10:52:39 +01:00
Michael Peter Christen
d1cb4cbc84 enhanced network scanner, is faster and more flexible now
- start more processes
- remove superfluous host name resolution
- better/more flexible subnet ip range calculation
- prefer ipv4 makes better usable ip pre-settings in servlet
- extended servlet by new subnet /20 - option
- redesign of scanner start process in servlet (generalization)
2013-02-02 09:51:43 +01:00
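The subnet range calculation mentioned in d1cb4cbc84 (including the new /20 option) can be illustrated with a small sketch. It is not the scanner's actual code; `SubnetRangeSketch` and its methods are hypothetical, and only IPv4 is handled.

```java
// Hypothetical helper that computes the first and last address of an IPv4 subnet.
final class SubnetRangeSketch {

    // Converts a dotted IPv4 address into its 32-bit value (held in a long).
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Converts a 32-bit value back into dotted notation.
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    // Returns {first address, last address} of the subnet containing ipv4.
    static String[] range(String ipv4, int prefixLength) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("bad prefix length: " + prefixLength);
        }
        long mask = prefixLength == 0 ? 0L : (0xFFFFFFFFL << (32 - prefixLength)) & 0xFFFFFFFFL;
        long first = toLong(ipv4) & mask;
        long last = first | (~mask & 0xFFFFFFFFL);
        return new String[] { toDotted(first), toDotted(last) };
    }

    public static void main(String[] args) {
        String[] r = range("192.168.17.5", 20);
        System.out.println(r[0] + " - " + r[1]);
    }
}
```

A /20 subnet spans 4096 addresses, which is why scanning it benefits from the additional processes mentioned in the commit.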
Michael Peter Christen
592adf7ccb fix for domain navigation 2013-02-02 07:21:18 +01:00
Michael Peter Christen
4ca1b76627 less search overhead when first result set is smaller than requested 2013-02-02 07:20:56 +01:00
Michael Peter Christen
f748b0aa7c NPE fix 2013-02-02 07:20:02 +01:00
Michael Peter Christen
7dfcc92b71 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-01-31 13:15:42 +01:00
Michael Peter Christen
0b6566a389 optimizations when starting large crawl requests with many start urls in
one request:
- allow larger match-fields in html interface
- delete all host hashes at once from zurl
- when deleting by host, do not count size of deleted entries since that
was the reason it took so long
2013-01-31 13:15:28 +01:00
orbiter
a2160054d7 ability to create vocabularies also without any objectspace: this
iterates over all urls in the index to create terms
2013-01-30 19:33:48 +01:00
orbiter
ecc10a752c fixes to index enumeration for vocabulary production 2013-01-29 18:14:14 +01:00
sixcooler
3a13906121 clear some more caches if running out of memory 2013-01-25 04:24:36 +01:00
Michael Peter Christen
8651ec35fe turned author_s into the multi-valued field author_sxt 2013-01-24 18:24:31 +01:00
Michael Peter Christen
4589afe056 fix NPE when solr does not deliver snippets 2013-01-24 14:12:31 +01:00
Michael Peter Christen
0fe7b6fd3b migrated the index export methods from the old metadata to solr. Now
exports are done using solr queries. removed superfluous methods and
servlets.
2013-01-24 12:39:19 +01:00
Michael Peter Christen
1768c82010 removed field selection because that created documents with that field
only which was not useful when re-writing the same document
2013-01-24 03:26:38 +01:00
Michael Peter Christen
31e854bef6 Merge remote-tracking branch 'copro/master' 2013-01-23 14:41:17 +01:00
Michael Peter Christen
4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy has now an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore the commit strategy when doing a search from
the web interface was changed (it's done every time before a search is
done).

The softcommit feature was implemented because it was needed for the
following changes (customer demands), which are also included in this
git commit:

- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.

To support the new softcommit strategy, the commitWithinMs option was
set to -1 to disable automatic commit based on document insert times. If
documents are inserted permanently then also a commit would happen
permanently whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
2013-01-23 14:40:58 +01:00
Copro
3ea8380959 Adding Vimeo tag to wiki commands to embed Vimeo video with id 2013-01-23 04:00:15 +01:00
Copro
ee9d7fd93d Added feature to embed Youtube videos to wiki commands for usage in
Wiki, Blog or other servlets
2013-01-23 02:43:58 +01:00
Michael Peter Christen
9ccdd21d76 Merge remote-tracking branch 'aleksejs/fixtrans'
Conflicts:
	locales/ru.lng
	
Tried to merge this but I had to do this 'blind'.
Sorry if I deleted something that was right.
2013-01-22 11:54:38 +01:00
Michael Peter Christen
db024a4e19 added new solr fields (unused yet; implementation will follow) 2013-01-21 18:02:29 +01:00
Michael Peter Christen
f5fd2aea18 removed archaic migration code 2013-01-21 17:59:42 +01:00
Michael Peter Christen
60f2a69331 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-01-17 21:53:19 +01:00
Michael Peter Christen
cba038f97b one more NPE fix 2013-01-17 21:52:56 +01:00
sixcooler
f3e705c4fe bump to httpclient / httpcore 4.2.3 (bugfix-release) 2013-01-17 20:10:49 +01:00
Michael Peter Christen
af465cdca5 fix for wrong robots.txt loading for https protocol
see also: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4579
2013-01-16 17:38:06 +01:00
Michael Peter Christen
c3d50d91f8 relaxing site operator for www prefix:
- when using a site operator search for a domain where the domain has a
www prefix, the domain without the www is also included
- when using a site operator search for a domain where the domain has no
www prefix, the domain with the www is also included
- in the host navigator, all domains with and without a www prefix are
accumulated. That means that the host navigator does never show a host
with a www prefix.
This should prevent usage mistakes of the site operator.
2013-01-16 14:54:35 +01:00
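The www relaxation described in c3d50d91f8 can be sketched as a pair of small helpers. This is an assumption about how such a rule could look, not the actual YaCy implementation; `SiteOperatorSketch`, `stripWww` and `siteVariants` are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helpers for a site operator that ignores a leading "www.".
final class SiteOperatorSketch {

    // Key used for accumulating hosts in a navigator: never carries a www prefix.
    static String stripWww(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    // Hosts that a site: query for the given host should match.
    static Set<String> siteVariants(String host) {
        String bare = stripWww(host);
        Set<String> variants = new LinkedHashSet<>();
        variants.add(bare);
        variants.add("www." + bare);
        return variants;
    }

    public static void main(String[] args) {
        System.out.println(siteVariants("www.example.org"));
        System.out.println(stripWww("example.org"));
    }
}
```

Both `site:example.org` and `site:www.example.org` then resolve to the same pair of hosts, which matches the intent of preventing usage mistakes.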
Michael Peter Christen
db49e91724 fixed a NPE which may appear for freeworld peers without any rwi index
data. The NPE looked like this:
Caused by: java.lang.NullPointerException
	at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:279)
	at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:155)
	at search.respond(search.java:314)
	... 12 more
2013-01-16 11:07:20 +01:00
Michael Peter Christen
4faa07c214 added a timeout for topic computation (solr is here much slower than the
old metadata-db)
2013-01-15 16:20:43 +01:00
Michael Peter Christen
d2d5be032d added a 'inlink' search option according to the suggestion in the YaCy
forum at 
http://forum.yacy-websuche.de/viewtopic.php?f=18&t=4572#p27410

The feature was not called 'haslink' but 'inlink' to have a naming
analogous to 'inurl'. You can now search for
words in links of the document, like:
* inlink:yacy
searches all documents which link to pages which have 'yacy' in the
url.
2013-01-14 12:50:21 +01:00
reger
3897bb4409 added (manual) urldb migration (link on: Index Administration -> Federated Solr Index)
- migrates all entries in old urldb

Metadata coordinate (lat / lon) NumberFormatException still occurs relatively often (see excerpt below), 
- added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format)
- removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0  (leaving it to the compiler/optimizer to choose number format)

current log excerpt for NumberFormatException:
W 2013/01/14 00:10:07 StackTrace For input string: "-"
java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
...
Caused by: java.lang.NumberFormatException: For input string: "-"
	at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
	at java.lang.Double.parseDouble(Unknown Source)
	at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
	at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
	at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
	at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
	at transferURL.respond(transferURL.java:152)
2013-01-14 03:06:24 +01:00
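The try/catch guard mentioned in 3897bb4409 can be illustrated with a minimal sketch. It is not the code from URIMetadataRow: `CoordinateParseSketch` and `parseCoordinate` are hypothetical, and treating 0.0 as "no coordinate" is an assumption based on the comparison with 0.0 mentioned in the message.

```java
// Hypothetical guard around coordinate parsing for values such as "-".
final class CoordinateParseSketch {

    // Returns the parsed coordinate, or 0.0 (assumed to mean "no coordinate")
    // when the stored string is missing or not a number.
    static double parseCoordinate(String raw) {
        if (raw == null) {
            return 0.0;
        }
        try {
            return Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            // Inputs like "-" end up here instead of aborting the metadata transfer.
            return 0.0;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCoordinate("-"));
        System.out.println(parseCoordinate("48.5"));
    }
}
```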
reger
3b6e08b49f prevent checking of urldb if empty
- disconnect urlIndexFile if empty
- add missing lock class in submenuSearchConfiguration
2013-01-12 15:20:23 +01:00
reger
f143804382 fix configuration for search page navigators
- added additional config page (ConfigSearchPage_p) for easy setup of search page layout (to not overload ConfigPortal page)
   - currently redundant setting with part of ConfigPortal page
- added missing config for filetype and protocol navigator
- adjusted init of SearchEvent to check navigation config setting
- renamed RankingProcess.getTopicNavigator to getTopics (to distinguish it from the added SearchEvent.getTopicNavigator)
2013-01-05 19:00:54 +01:00
Michael Peter Christen
becd52a984 added also a re-calculation of reference counts during the
post-processing of clickcount calculations. This is a really nice thing
to have because the reference count affects ranking.
2013-01-05 00:58:27 +01:00
Michael Peter Christen
38d3feae65 added separate delete commands for the local+remote solr index, the old
metadata and old rwi and for the citation index. The important
advancement is the separation of the citation index deletion because
that index is responsible for the linkdepth calculation. Now a search
index can be deleted without the citation index, so
fewer clickdepths must be post-processed.
2013-01-04 16:39:34 +01:00
Michael Peter Christen
6f0baaa309 added the clickdepth post-processing: some links may have 'shortcuts' to
already calculated click depths. These are then calculated if the crawl
buffer is empty and therefore no new 'shortcuts' can be discovered.
The status of the clickdepth stack (to-be-processed) can be seen using a
solr search command like this:
http://localhost:8090/solr/select?q=process_sxt:[*%20TO%20*]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt
2013-01-04 16:37:39 +01:00
Michael Peter Christen
0f5b6f38c1 enhanced root-url detection 2013-01-03 19:21:21 +01:00
Michael Peter Christen
5c0c56cfe1 Preparations to produce a click depth attribute in the search index.
This attribute can be used for ranking and for other purpose (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; i.e. calculating the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
2013-01-02 20:55:43 +01:00
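The shortest click depth that the second pass in 5c0c56cfe1 is meant to find can be pictured as a breadth-first search from the root page. The sketch below works on an in-memory map of outgoing links, which is an assumption made for illustration; the commit itself backtracks through the reverse link index. `ClickDepthSketch` and `shortestDepths` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shortest-click-depth computation over an in-memory link graph.
final class ClickDepthSketch {

    // Breadth-first search from the root: the first time a page is reached,
    // the number of clicks used is the shortest possible.
    static Map<String, Integer> shortestDepths(String root, Map<String, List<String>> outlinks) {
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            int next = depth.get(page) + 1;
            for (String target : outlinks.getOrDefault(page, Collections.<String>emptyList())) {
                if (!depth.containsKey(target)) {
                    depth.put(target, next);
                    queue.add(target);
                }
            }
        }
        return depth; // pages that are unreachable from the root are absent
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("/", Arrays.asList("/a", "/b"));
        links.put("/a", Arrays.asList("/c"));
        links.put("/c", Arrays.asList("/d"));
        links.put("/b", Arrays.asList("/d"));
        System.out.println(shortestDepths("/", links));
    }
}
```

In the example graph, `/d` is reachable over `/a` and `/c` in three clicks but over `/b` in two; a single backtrack during indexing may find the longer path, which is why a later re-check is needed.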
Michael Peter Christen
6861af87e2 removed warnings 2013-01-02 19:05:48 +01:00
Michael Peter Christen
295884fd54 - Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8'
- fixed conflict in	htroot/yacysearch.java
- removed nedres check because that causes that the remote server is not
called at all in most cases (local index has already results but we want
more)
- fixed a regex bug (a '=' too much)
2013-01-02 15:08:07 +01:00
reger
276e63401e small sanitary fixes
- exclude unix shell scripts in NSIS windows install archive
- replace link to env/grafics/yacy.gif to yacy.png (build.nsi)
- remove unused code lines (Blacklist_p, Response, WordReferenceVars)
- type & xhtml (RankingSolr_p.html)
2013-01-02 01:59:47 +01:00
reger
f301336adf fix: no results with configuration citation reference index switched off
- urlcitationindex != null check added to ResultEntry.referencesCount
- plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)
2012-12-30 02:13:48 +01:00
orbiter
fe50702eb0 added a filterscannerfail attribute to QueryParams which causes that a
check to the network scanner fail/success status can be used/suppressed
for search results. This is a feature that comes with the port scanner.
2012-12-29 17:47:34 +01:00
reger
168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
- any system supporting opensearch specification can be configured
- search query is only forwarded to remote system if not enough results available on local peer
- discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config
     - sample config file with some general search engines with opensearch support
2012-12-29 08:24:48 +01:00
Michael Peter Christen
eb90d38cd7 added missing extension 'mkv' for navigation 2012-12-27 13:56:13 +01:00
Michael Peter Christen
95712fdc8b update to pdf parser 2012-12-27 04:16:31 +01:00
Michael Peter Christen
4a9182ae16 use the search configuration to default the cacheStrategy to the value
as given in the search configuration
2012-12-27 03:19:21 +01:00
Michael Peter Christen
98819ec3d9 use solr boost configuration to select search fields. At this time it is
possible to enter a negative boost value to switch that value off. This
might be different in the future with a better input interface.
2012-12-27 03:17:45 +01:00
Michael Peter Christen
e1f89efd0d - made image search in interactive search using the ViewImage servlet -
that enables viewing of images for intranet SMB servers.
- added a filter search for protocol, tld and ext again; otherwise p2p
search produces a lot of rubbish
2012-12-26 21:25:27 +01:00
Michael Peter Christen
8f3bd0c387 fix for smb crawl situation (lost too many urls) 2012-12-26 19:15:11 +01:00
reger
d456f69381 SeedUpload url : check to reject localhost url included in saveSeedList (same check as in / copied from Seed.isProper() ), to prevent identity change on next startup (due to rejected seeduploadurl). 2012-12-24 23:29:02 +01:00
reger
4987caf1c9 - apply fix for localhost handling (from yacy2solr) also to metadata2solr 2012-12-23 01:30:52 +01:00
reger
0148f1bb8c fix: exception if default work files don't exist 2012-12-22 23:03:39 +01:00
Michael Peter Christen
9e4033f229 fix for event starter: delete start time when event is removed 2012-12-22 21:16:22 +01:00
Michael Peter Christen
99271ffd13 copy work tables from defaults/data/work if exist there and not in
DATA/WORK
This can be used to create start-up behavior work scripts in the
api.bheap table
2012-12-22 20:54:05 +01:00
Michael Peter Christen
24c9bb35f7 extended the Scheduler: introduced scheduled events
- an event type (once, regular) can be selected
- for this event type, a fixed time can be selected. This may be either
directly after startup or at one of the full hours at a day (==25
options)
The main point about this feature is the opportunity to start an action
directly after startup. That makes it possible to create YaCy
distributions which, when started for the first time, start to index
parts of the intranet/internet by themselves.
2012-12-22 16:27:14 +01:00
Michael Peter Christen
433143ba40 removed protocol, tld, ext from the urlmask and created specific
navigation field for these
2012-12-19 12:45:40 +01:00
Michael Peter Christen
84f82541e8 search process enhancements 2012-12-19 10:41:22 +01:00
Michael Peter Christen
02020b590b - removed all extension types from extension navigation which are not
proper/known
- automatically show the protocol navigation if there is more than http
and https
- automatically show the extension navigation if there is some media
content
2012-12-19 02:38:05 +01:00
Michael Peter Christen
01200f06cc using the author field as solr-native facet. this makes it necessary to
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
2012-12-19 01:56:33 +01:00
Michael Peter Christen
2a4c064c89 using the publisher information for the author field if no author is
given. This applies to cases where only the copyright field in the html
header is filled but not the author field
2012-12-19 01:54:35 +01:00
Michael Peter Christen
bab573361f - using a filter query for facet restriction
- calculating the whole search result in at most two sub-queries from
solr
2012-12-19 01:00:57 +01:00
Michael Peter Christen
eac9650b31 added another solr field clickdepth_i which reflects the number of
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all other with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
2012-12-18 17:20:42 +01:00
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more important than others. But
this is not true for typical link pages like menus. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. This field can contain a formula which computes the
boost according to given field values. After some experiments the
following formula is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that formula.
2012-12-18 14:42:35 +01:00
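The default boost formula from 1052263af3 can be evaluated by hand with a short sketch. It assumes that the trailing ^0.4 is a weight applied to the function value, as in Solr's dismax-style bf syntax, and not an exponent; that reading is an assumption, not something the commit message states. `BoostFunctionSketch` and its methods are hypothetical.

```java
// Hypothetical evaluation of
// div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
final class BoostFunctionSketch {

    // Value of the function part: (1 + references) / (1 + inboundLinks)^1.6
    static double functionValue(int references, int inboundLinks) {
        return (1.0 + references) / Math.pow(1.0 + inboundLinks, 1.6);
    }

    // Function value scaled by the assumed weight 0.4.
    static double weighted(int references, int inboundLinks) {
        return 0.4 * functionValue(references, inboundLinks);
    }

    public static void main(String[] args) {
        // A page with many references but few links of its own gets a larger boost
        // than a menu-like page with the same references and many links.
        System.out.println(weighted(9, 0));
        System.out.println(weighted(9, 99));
    }
}
```

The exponent 1.6 on the link count makes the denominator grow faster than the numerator, so link-heavy pages are damped even when they are referenced often.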
Michael Peter Christen
7c3de8b4cd - fix for localhost detection
- added IPv6 patterns for localhost detection
2012-12-18 12:52:20 +01:00
Michael Peter Christen
34f8786508 removed dependency of vocabulary navigation from Jena and its
triplestore; the vocabulary search is now done using generic solr fields
which are created on-the-fly during runtime.
2012-12-18 02:29:03 +01:00
reger
ad71747525 fix: set default language to "en" 2012-12-16 20:53:45 +01:00
Michael Peter Christen
9319b90d8a - fixes for host navigation
- fixes for filetype navigation
- removed unused code
2012-12-15 09:14:49 +01:00