Commit Graph

5255 Commits

Author SHA1 Message Date
luccioman
4dd9c0d5d9 Merge from main repository 2015-09-08 08:54:48 +02:00
Michael Peter Christen
0a37d8af89 in case that a site crawl is started for urls with a file:// path, the
host filter does not work because there is no host given in such urls.
In that case, patch the filter to be a sub-path filter.
2015-09-05 14:07:23 +02:00
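A minimal Java sketch of this fallback, assuming a regex-based must-match filter like the one YaCy's crawl profiles use; class and method names are illustrative, not YaCy's actual code:

```java
import java.net.URI;
import java.util.regex.Pattern;

public class CrawlFilterFallback {
    static String mustMatchFilter(URI startUrl) {
        if (startUrl.getHost() == null) {
            // file:///home/user/docs/ has no host -> match everything below that path
            return Pattern.quote(startUrl.getScheme() + "://" + startUrl.getPath()) + ".*";
        }
        // normal case: restrict the crawl to the start host
        return startUrl.getScheme() + "://" + Pattern.quote(startUrl.getHost()) + "/.*";
    }

    public static void main(String[] args) {
        System.out.println(mustMatchFilter(URI.create("file:///home/user/docs/")));
        System.out.println(mustMatchFilter(URI.create("http://example.org/start.html")));
    }
}
```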
luccioman
9df249296a Return to main repository version 2015-09-04 13:52:03 +02:00
luccioman
c1d937a90c Merge branch 'master' of ssh://git@github.com/yacy/yacy_search_server 2015-09-04 09:57:49 +02:00
reger
7c1da173e0 fix missing license in image search
see http://mantis.tokeek.de/view.php?id=522
2015-09-03 23:36:57 +02:00
luccioman
918ef72bbe Corrected br markup 2015-09-03 08:59:17 +02:00
luccioman
f88bb2277e Corrected bookmark link title 2015-09-03 08:58:14 +02:00
luccioman
802ea66d19 Merge branch 'master' of ssh://git@github.com/yacy/yacy_search_server 2015-09-03 08:04:38 +02:00
reger
5297e80cda fix missing onclick in ConfigPortal
to enable checkbox
2015-09-03 00:59:14 +02:00
luccioman
70e483ecc6 Merge branch 'master' of ssh://git@github.com/yacy/yacy_search_server 2015-09-01 08:57:32 +02:00
sixcooler
87e4abe393 fight the fieldcache by using DocValues: in Solr 5.x the fieldcache has
moved and is not cleared anymore. This results in a huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-include
https://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DocValues where possible.
For this I used the Api-Scheme as the new basis for the Solr schema.
This needs at least a complete optimization of the Solr index to get a
smaller fieldcache.
Everything that is indexed with these settings will not use the
fieldcache at all.
2015-08-31 20:24:41 +02:00
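A hedged SolrJ sketch of why the schema change helps: faceting and sorting on fields declared with docValues="true" read column-oriented DocValues from disk instead of populating the in-heap fieldcache. The URL, core name, and the assumption that host_s and last_modified carry DocValues are illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DocValuesFacetExample {
    public static void main(String[] args) throws Exception {
        // SolrJ 5.x style client; URL and core name are placeholders
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.addFacetField("host_s");                        // served from DocValues when enabled
        q.addSort("last_modified", SolrQuery.ORDER.desc); // sorting also avoids the fieldcache
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getFacetField("host_s"));
        solr.close();
    }
}
```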
luccioman
67799ce867 Updated translation of index.html, yacysearch.html and
simpleheader.template, corrected some special characters not written as
HTML entities.
2015-08-26 14:40:39 +02:00
Michael Peter Christen
df3314ac1a added a new facet type based on a probabilistic classifier using
bayesian filters. This can be used to classify documents during
indexing-time using a pre-defined bayesian filter.

New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".

To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files for every
category, with one document per line. One of these categories MUST
have the name 'negative.txt'.

Then, each new document is classified to match within one of the given
categories for each context.
2015-08-10 14:27:44 +02:00
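To make the required layout concrete, a hypothetical context named "topic" would look like this (all names besides negative.txt are invented examples):

```
DATA/CLASSIFICATION/
  topic/              one directory per context (= facet name)
    sports.txt        a custom category: one training document per line
    politics.txt      another custom category
    negative.txt      mandatory category holding counter-examples
```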
Michael Peter Christen
dbbad23e12 removed warnings 2015-08-03 05:37:34 +02:00
reger
9e4043731d add missing ; in base.css 2015-08-02 21:36:44 +02:00
Michael Peter Christen
de8cfbe1d7 added export option to export the fulltext of the search index text only 2015-07-30 03:21:40 +02:00
Michael Peter Christen
785781253e added jsonp to suggest servlet 2015-07-16 23:42:41 +02:00
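A small probe of the JSONP behavior, as a sketch: when a callback parameter is passed, the servlet is expected to wrap its JSON answer in that function. The endpoint path and parameter names here are assumptions, not verified against the servlet:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SuggestJsonpProbe {
    public static void main(String[] args) throws Exception {
        // assumed endpoint and parameters; adjust to the local peer
        URL url = new URL("http://localhost:8090/suggest.json?query=yacy&callback=handleSuggest");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // expected shape: handleSuggest([...])
            }
        }
    }
}
```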
reger
821262a179 add CommonPattern for multiple spaces
to eliminate empty split words on consecutive spaces
2015-07-04 22:49:01 +02:00
Michael Peter Christen
f901e7d3cf fix for non-authorized view of IndexBrowser: show only the number of
non-failure documents
2015-06-30 11:12:36 +02:00
Michael Peter Christen
3c4c69adea fix for
- bad regex computation for crawl start from file (limitation on domain
did not work)
- servlet error when starting crawl from a large list of urls
2015-06-29 02:02:01 +02:00
Michael Peter Christen
1fec7fb3c1 suppress access to solr when doing search suggestions in case that the
index has more than two million documents. This protects the index from
being flooded with search requests that cannot be resolved before the
real search query has to be computed.
2015-06-24 13:02:12 +02:00
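The protection reads like a simple size guard; a minimal sketch, with the two-million threshold taken from the message and all names invented:

```java
import java.util.Collections;
import java.util.List;

public class SuggestGuard {
    static final long SUGGEST_MAX_DOCS = 2_000_000L;

    static List<String> suggestions(String prefix, long indexedDocCount) {
        if (indexedDocCount > SUGGEST_MAX_DOCS) {
            // large index: skip the Solr round-trip entirely so suggestion
            // traffic cannot flood the index before the real query runs
            return Collections.emptyList();
        }
        return querySolr(prefix); // placeholder for the actual Solr lookup
    }

    private static List<String> querySolr(String prefix) {
        return Collections.singletonList(prefix + "-suggestion");
    }

    public static void main(String[] args) {
        System.out.println(suggestions("yacy", 1_000_000L)); // served
        System.out.println(suggestions("yacy", 3_000_000L)); // suppressed
    }
}
```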
Michael Peter Christen
886fca2260 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-06-24 01:59:46 +02:00
Michael Peter Christen
694b22f165 migration to Solr 5.2: huge benefits - this is a lot faster!
This is a very complex migration: many classes had been renamed or
removed, dependencies changed and the solr index type is now aligned to
be a solr cloud repository.
Together with the Solr 5.2 library update, one other dependent library
had been updated as well: httpclient 4.4->4.4.1

Older indexes are migrated from 4_10 to 5_2. However, the new index
structure is more efficient and we recommend to re-index everything.
Before you do the update, please export the index to a large surrogate
xml file. After the update, start with an empty index and then
initialize it with your dump.
2015-06-24 01:55:51 +02:00
Michael Peter Christen
6c2e6f1f37 remove redundant code 2015-06-23 23:41:43 +02:00
Michael Peter Christen
9c12555be5 added link to Snapshots in search results if the snapshot exists and
option is set in ConfigSearchPage_p
(this is a stub: we also need a visualization of pdf files!)
2015-06-07 20:37:37 +02:00
reger
72f6a0b0b2 enhance recrawl job
- allow to modify the query that selects documents to process (after the job has started)
- allow to include failed urls (httpstatus <> 200)
2015-06-06 18:45:39 +02:00
Michael Peter Christen
e0a23c56c7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-06-05 08:32:55 +02:00
Michael Peter Christen
fb9e1dd3f5 servlet for latest commit 2015-06-05 07:22:35 +02:00
reger
7478338a40 remove augmented parsing activation from frontend
the experimental implementation is not used and is based on the error-prone experimental rdfaparser
2015-06-05 00:51:00 +02:00
reger
11aa2edfe1 remove RDFa parser activation from frontend
reason: the experimental implementation of the RDFa parser is not executed (limited to special urls) but may cause errors on normal html parsing due to an inputstream.reset
2015-06-05 00:15:16 +02:00
Michael Peter Christen
ff11ac89f7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-06-04 23:04:04 +02:00
Michael Peter Christen
5e2d23b7a0 removed the new index export method from the IndexControlURLs_p.html
servlet and moved it to a new /IndexExport_p.html servlet. This servlet
is now more prominent linked in the main menu under Production -> Index
Export/Import
2015-06-04 23:03:46 +02:00
reger
49b79987c9 remove obsolete searchfl work table
it was used to register urls with incomplete words in the snippet but is never accessed
2015-06-04 22:44:01 +02:00
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
Michael Peter Christen
eec78e1b0c added intensity option to graphics 2015-05-30 06:31:08 +02:00
Michael Peter Christen
c7576d6028 added a full solr export to the IndexControlURLs_p.html servlet. The
export function is also now the default export option. The export file
format for a full solr export is very similar to a solr search result
xml, only the <lst name="responseHeader"> tag is missing.

The exported xml has a special line termination feature: all documents
will be exported into a single line without any CR in between. That
means that every document is completely inside a single line. While this
is not readable at all for humans, it is very useful for linux line
processing scripts, like grep. Using grep it will be easy to select
single documents which match for a given pattern.

Such dumps shall be importable with the DATA/SURROGATE/in import
function, but that import is not yet adapted to the new file format.
2015-05-29 15:05:52 +02:00
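Since every exported document sits on exactly one line, selecting documents by pattern is plain line filtering; a Java equivalent of the grep usage mentioned above (the dump file name is a placeholder):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DumpGrep {
    public static void main(String[] args) throws IOException {
        // one Solr document per line, so a line filter selects whole documents
        try (Stream<String> lines = Files.lines(Paths.get("yacy_export.xml"))) {
            lines.filter(line -> line.contains("example.org"))
                 .forEach(System.out::println);
        }
    }
}
```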
Michael Peter Christen
47682bf467 fix for unresolved pattern 2015-05-28 17:43:52 +02:00
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
1d8e1e4bac - Image search expand box, adjust javascript hs padtominsize parameter, to make sure expand box doesn't shrink on small images
- ensure ImageResult.imagetext has a value for the link text (use the filename if no alt text is given)
2015-05-27 02:31:13 +02:00
reger
000dde9511 Eliminate duplication of values for the search ResultEntry
by instantiating from URIMetadataNode, eliminating the differentiation of ResultEntry/URIMetadataNode.
- moved remaining ResultEntry functionality to URIMetadataNode
  - for 1:1 functionality added a function makeResultEntry()
- removed ResultEntry
- refactored related code

The main difference is that after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated.
2015-05-26 04:15:00 +02:00
reger
3d53da8236 refactor ResultEntry to be based on MetadataNode/SolrDocument
to share/reuse common access routines
2015-05-25 21:28:48 +02:00
reger
17e820cfd7 use doctype() in ViewFile to choose display routines
in preference to getfileExtension()
2015-05-25 00:08:38 +02:00
reger
aa83931765 Convert content charset for display via CacheResource_p
The cached resource's charset encoding might not fit the internal handling (which uses utf-8),
so the resource is converted to utf-8
see http://mantis.tokeek.de/view.php?id=576
2015-05-23 20:31:37 +02:00
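The conversion itself boils down to a decode/re-encode step; a minimal sketch, assuming the declared charset of the cached resource is known:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConvert {
    // decode the cached bytes with their declared charset, re-encode as utf-8
    static byte[] toUtf8(byte[] resource, String declaredCharset) {
        String text = new String(resource, Charset.forName(declaredCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = "Müller".getBytes(Charset.forName("ISO-8859-1"));
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```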
reger
3e742d1e34 Init remote crawler on demand
If the remote crawl option is not activated, skip the init of the remoteCrawlJob to save the resources of the queue and an idling thread.
Deployment of the remoteCrawlJob is deferred until the option is activated.
2015-05-23 02:06:39 +02:00
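The deferred deployment is essentially lazy initialization; a sketch with invented names, not YaCy's actual thread handling:

```java
public class RemoteCrawlSwitch {
    private Thread remoteCrawlJob; // stays null until the feature is activated

    synchronized void setRemoteCrawlEnabled(boolean enabled) {
        if (enabled && remoteCrawlJob == null) {
            // create queue-polling thread only on first activation
            remoteCrawlJob = new Thread(RemoteCrawlSwitch::pollRemoteCrawlQueue, "remoteCrawlJob");
            remoteCrawlJob.setDaemon(true);
            remoteCrawlJob.start();
        }
    }

    private static void pollRemoteCrawlQueue() {
        // placeholder for the loop that fetches remote crawl urls
    }

    public static void main(String[] args) {
        new RemoteCrawlSwitch().setRemoteCrawlEnabled(true);
    }
}
```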
Michael Peter Christen
dbf9e3503d Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-05-22 11:39:00 +02:00
Michael Peter Christen
8b1a30be50 removed a -UNRESOLVED_PATTERN- 2015-05-22 11:22:36 +02:00
Michael Peter Christen
9938c81378 fix for division by zero 2015-05-22 11:15:53 +02:00
reger
ace71a8877 Initial (experimental) implementation of index update/re-crawl job
added to IndexReIndexMonitor_p.html
Selects existing documents from the index and feeds them to the crawler.
Currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]).
Documents are added in small chunks (200) to the crawler, and only if no other crawl is running.
2015-05-16 01:23:08 +02:00
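The selection query quoted above, expressed with SolrJ; the chunk size of 200 also comes from the message:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class RecrawlSelection {
    public static void main(String[] args) {
        // documents whose freshness date is older than one day
        SolrQuery q = new SolrQuery("fresh_date_dt:[* TO NOW-1DAY]");
        q.setRows(200); // feed the crawler in small chunks of 200 documents
        System.out.println(q); // prints the url-encoded parameter string
    }
}
```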
Michael Peter Christen
f810915717 added crawl start from a clone with very, very large urls: they are now
encoded as a post submit form inside a javascript creation function.
2015-05-11 16:30:41 +02:00
reger
609c52e987 refactor getBookmark
to consistently check existence by != null (w/o throwing an exception on not found)
2015-05-11 00:37:04 +02:00
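The calling convention after such a refactor, sketched with a plain map standing in for the bookmarks database (all names invented):

```java
import java.util.HashMap;
import java.util.Map;

public class BookmarkLookup {
    static final Map<String, String> bookmarks = new HashMap<>();

    // after the refactor: absence is signalled by null, not by an exception
    static String getBookmark(String urlHash) {
        return bookmarks.get(urlHash); // null when not present
    }

    public static void main(String[] args) {
        String b = getBookmark("someUrlHash");
        if (b != null) {
            System.out.println("found: " + b);
        } else {
            System.out.println("not found, and no exception was thrown");
        }
    }
}
```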