5279 Commits

Author SHA1 Message Date
Michael Peter Christen
e0a23c56c7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-06-05 08:32:55 +02:00
Michael Peter Christen
fb9e1dd3f5 servlet for latest commit 2015-06-05 07:22:35 +02:00
reger
7478338a40 remove augmented parsing activation from frontend
experimental implementation not used and based on error-prone experimental rdfaparser
2015-06-05 00:51:00 +02:00
reger
11aa2edfe1 remove RDFa parser activation from frontend
reason: experimental implementation of RDFa parser not executed (limited to special urls) but may cause errors in normal html parsing due to an inputstream.reset
2015-06-05 00:15:16 +02:00
Michael Peter Christen
ff11ac89f7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-06-04 23:04:04 +02:00
Michael Peter Christen
5e2d23b7a0 removed the new index export method from the IndexControlURLs_p.html
servlet and moved it to a new /IndexExport_p.html servlet. This servlet
is now more prominently linked in the main menu under Production -> Index
Export/Import
2015-06-04 23:03:46 +02:00
reger
49b79987c9 remove obsolete searchfl work table
was used to register urls with incomplete words in snippet but is never accessed
2015-06-04 22:44:01 +02:00
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
Michael Peter Christen
eec78e1b0c added intensity option to graphics 2015-05-30 06:31:08 +02:00
Michael Peter Christen
c7576d6028 added a full solr export to the IndexControlURLs_p.html servlet. The
export function is also now the default export option. The export file
format for a full solr export is very similar to a solr search result
xml, only the <lst name="responseHeader"> tag is missing.

The exported xml has a special line termination feature: all documents
will be exported into a single line each, without any CR in between. That
means that every document is completely inside a single line. While this
is not readable at all for humans, it is very useful for Linux line
processing scripts, like grep. Using grep it will be easy to select
single documents which match a given pattern.

Such dumps shall be importable with the DATA/SURROGATES/in import
function, but that import is not yet adapted to the new file format.
2015-05-29 15:05:52 +02:00
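The one-document-per-line layout described in c7576d6028 means a dump can be filtered line by line without an XML parser, in the same way grep would. A minimal sketch in Python, assuming each document is a `<doc>...</doc>` element on its own line as in a Solr result xml; the function name `select_documents`, the field names, and the sample data are hypothetical and not taken from YaCy:

```python
import re
from typing import Iterable, Iterator


def select_documents(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield every dump line that holds a document matching the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        # Each document occupies exactly one line, so a line match is a document match.
        if "<doc>" in line and regex.search(line):
            yield line.rstrip("\n")


# Hypothetical sample shaped like a Solr result xml without the responseHeader tag.
sample_dump = [
    '<response>\n',
    '<result name="response">\n',
    '<doc><str name="url">http://example.org/a</str><str name="title">First page</str></doc>\n',
    '<doc><str name="url">http://example.net/b</str><str name="title">Other page</str></doc>\n',
    '</result>\n',
    '</response>\n',
]

matches = list(select_documents(sample_dump, r"example\.org"))
```

In practice the lines would come from an open file handle instead of `sample_dump`, so the dump never has to fit into memory at once.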
Michael Peter Christen
47682bf467 fix for unresolved pattern 2015-05-28 17:43:52 +02:00
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
1d8e1e4bac - Image search expand box, adjust javascript hs padtominsize parameter, to make sure expand box doesn't shrink on small images
- ensure ImageResult.imagetext has a value for the link text (use filename if no alt text given)
2015-05-27 02:31:13 +02:00
reger
000dde9511 Eliminate duplication of values for search ResultEntry
by instantiation from URIMetadataNode, by eliminating differentiation of ResultEntry/URIMetadataNode.
- moved remaining ResultEntry functionality to URIMetadataNode
   - for 1:1 functionality added a function makeResultEntry()
- removed ResultEntry 
- refactored related code

Main difference is after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated.


Main difference left is, that
2015-05-26 04:15:00 +02:00
reger
3d53da8236 refactor ResultEntry to be based on MetadataNode/SolrDocument
to share/reuse common access routines
2015-05-25 21:28:48 +02:00
reger
17e820cfd7 use doctype() in ViewFile to choose display routines
in preference to getfileExtension()
2015-05-25 00:08:38 +02:00
reger
aa83931765 Convert content charset for display via CacheResource_p
Cached resource charset encoding might not fit to internal handling (using utf-8),
convert resource to utf-8
see http://mantis.tokeek.de/view.php?id=576
2015-05-23 20:31:37 +02:00
reger
3e742d1e34 Init remote crawler on demand
If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and idling thread.
Deploy of the remoteCrawlJob deferred on activation of the option.
2015-05-23 02:06:39 +02:00
Michael Peter Christen
dbf9e3503d Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-05-22 11:39:00 +02:00
Michael Peter Christen
8b1a30be50 removed a -UNRESOLVED_PATTERN- 2015-05-22 11:22:36 +02:00
Michael Peter Christen
9938c81378 fix for division by zero 2015-05-22 11:15:53 +02:00
reger
ace71a8877 Initial (experimental) implementation of index update/re-crawl job
added to IndexReIndexMonitor_p.html
Selects existing documents from index and feeds them to the crawler.
currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY])
Documents are added in small chunks (200) to the crawler, only if no other crawl is running.
2015-05-16 01:23:08 +02:00
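The feeding rule described in ace71a8877 (chunks of 200, only while no other crawl is running) can be sketched as follows. This is an illustration under assumptions, not the YaCy implementation: `feed_next_chunk`, `crawler_busy` and `enqueue` are hypothetical names, and the list of pending urls stands in for the result of the index query.

```python
from typing import Callable, List

RECRAWL_QUERY = "fresh_date_dt:[* TO NOW-1DAY]"  # selection query quoted in the commit message
CHUNK_SIZE = 200  # chunk size quoted in the commit message


def feed_next_chunk(
    pending: List[str],
    crawler_busy: Callable[[], bool],
    enqueue: Callable[[List[str]], None],
    size: int = CHUNK_SIZE,
) -> int:
    """Hand at most one chunk to the crawler; do nothing while another crawl runs."""
    if crawler_busy() or not pending:
        return 0
    chunk = pending[:size]
    del pending[:size]  # the fed urls leave the pending list
    enqueue(chunk)
    return len(chunk)


# Hypothetical usage: 450 selected documents, an idle crawler, then a busy one.
crawl_queue: List[str] = []
pending_urls = [f"http://example.org/page{i}" for i in range(450)]
fed = feed_next_chunk(pending_urls, lambda: False, crawl_queue.extend)
skipped = feed_next_chunk(pending_urls, lambda: True, crawl_queue.extend)
```

Calling the function periodically drains the pending list in steps of at most 200 without competing with a crawl that was started by hand.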
Michael Peter Christen
f810915717 added crawl start from a clone with very, very large url: they are now
encoded as post submit form inside a javascript creation function.
2015-05-11 16:30:41 +02:00
reger
609c52e987 refactor getBookmark
to consistently check existence by != null (w/o throwing exception on not found)
2015-05-11 00:37:04 +02:00
reger
5f4d35437e add bookmark.query to edit form 2015-05-10 15:30:21 +02:00
reger
89124335c4 update bookmark autosearch description
- add German translation
2015-05-10 02:29:08 +02:00
Michael Peter Christen
213401a446 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-05-08 13:48:29 +02:00
Michael Peter Christen
97930a6aad added must-not-match filter to snapshot generation.
also: fixed some bugs
2015-05-08 13:46:27 +02:00
reger
b47267b79c precaution against NPE on createorgetBookmark on search result 2015-05-07 03:25:19 +02:00
Michael Peter Christen
75879e051b Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-05-03 03:03:45 +02:00
reger
8a5b8f8789 on bookmarking of search result, remember orig. query in separate bookmark property
(instead of using the description field)
- adjust display and autosearch
- don't overwrite existing bookmark but combine info
2015-05-03 02:31:50 +02:00
reger
cf1fc7f700 harmonize filesearch input box layout 2015-05-01 19:24:14 +02:00
Michael Peter Christen
e334a06370 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-04-29 10:51:09 +02:00
reger
579303a04e add additional links to crawl queue pages 2015-04-25 02:45:05 +02:00
Michael Peter Christen
99718dc09a don't record dump generation calls since that
- is not a change of the index
- happens very often within self-backup strategies from the outside
(i.e. cronjobs)
2015-04-23 18:17:28 +02:00
Michael Peter Christen
5b59477415 update to bootstrap.css 3.3.4 2015-04-23 06:36:57 +02:00
Michael Peter Christen
0d365e67a5 Merge pull request #2 from Scarfmonster/master
English Synonyms and small fixes
2015-04-20 10:24:34 +02:00
Eugene Kuligin
8ae3229306 add vertical margin to the search cloud block 2015-04-20 05:24:50 +03:00
Eugene Kuligin
f9408dfa48 fix RSS icon displaying 2015-04-20 05:19:37 +03:00
reger
f7b0148f6a fix NPE in Vocabulary_p servlet
called w/o parameter
2015-04-20 00:01:14 +02:00
Ryszard Goń
ca1a70aec8 fix for Accept '?' URLs column in Crawl Profile List 2015-04-19 15:55:49 +02:00
Ryszard Goń
b0cd0212fd SynonymLibrary status check fix for multiple files 2015-04-17 17:38:48 +02:00
Ryszard Goń
f3f1b2e899 added English synonyms 2015-04-17 17:32:54 +02:00
reger
296e97c78e put https port in peers dna
as we flag if a peer is accessible via https, we need to know the port if we want to use it (e.g. for interYaCy communication)
start to provide / transport the port by recording it in peers dna.
- add https link on the Network.html lock symbol
2015-04-16 02:36:12 +02:00
Michael Peter Christen
088853c1e8 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-04-15 13:17:37 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time of day is necessary. That
requires that the time zone of the document content and the time zone of
the user doing a search are detected. The time zone of the search
request is detected automatically using the browser's time zone offset, which
is delivered with the search request automatically and invisibly to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required a change of the parser java api. A lot of other changes
have been made which correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are in the UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
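The normalization described in fed26f33a8 amounts to interpreting a date found in page content with the crawl's offset in minutes and storing it in UTC. A minimal sketch in Python; the function name `to_utc` is hypothetical, and the sign convention (minutes east of UTC, so local = UTC + offset) is an assumption, since the commit message does not state it. JavaScript's `getTimezoneOffset()` uses the opposite sign, so a browser-supplied value would have to be negated under this convention.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive: datetime, offset_minutes: int) -> datetime:
    """Interpret a naive local timestamp with a fixed offset and return it in UTC.

    Assumption: offset_minutes counts minutes east of UTC (local = UTC + offset).
    """
    local_zone = timezone(timedelta(minutes=offset_minutes))
    return local_naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


# A date parsed from page content, crawled with an offset of +120 minutes.
parsed = datetime(2015, 4, 15, 13, 17, 23)
normalized = to_utc(parsed, 120)
```

No server-side correction appears anywhere in the sketch: the only inputs are the parsed date and the offset attached to the crawl start.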
reger
f6a55f9279 incoming connection count/text fix
improvement on http://mantis.tokeek.de/view.php?id=570
2015-04-15 02:16:53 +02:00
Michael Peter Christen
3e338e0987 Merge pull request #1 from Scarfmonster/master
Search navigation fix
2015-04-14 10:33:43 +02:00
reger
702c30e619 add info text icon next to Augmented Browsing check-box
with hint to config page
2015-04-14 03:19:27 +02:00
reger
4c907bec89 show "Augmented Browsing" link in search result only if urlproxy allowed and option switched on in layout
(AugmentedBrowsing_p.html, ConfigSearchPage_p.html)
as user only gets an error page if the option is not enabled
2015-04-14 02:07:02 +02:00
Ryszard Goń
6d78a6d06e Search navigation fix 2015-04-13 23:32:06 +02:00
Michael Peter Christen
a08a3c5f29 reverted json syntax for facet results to version from january 2015-04-13 16:18:15 +02:00
Michael Peter Christen
d8cc773d05 fix for not valid json in case that topics are switched off 2015-04-11 12:20:29 +02:00
Michael Peter Christen
1df6492019 enhanced suggestions 2015-04-10 15:59:18 +02:00
Michael Peter Christen
c7fdde3bd1 replaced "fork me" banner with github banner 2015-04-10 15:10:18 +02:00
Michael Peter Christen
876cdb083f Merge branch 'master' of github.com:yacy/yacy_search_server 2015-04-09 14:31:16 +02:00
Michael Peter Christen
2e88028c1a when selecting collections in navigation, do show the un-selected
collections in search result. When selecting one of them in another
search, switch off the previously selected collection. This actually
turns the collection navigation modifier into a radio-button like
behaviour
2015-04-07 13:13:58 +02:00
reger
2f592a8063 add SynonymLibrary status to DictionaryLoader_p servlet
http://mantis.tokeek.de/view.php?id=564
2015-04-04 00:24:16 +02:00
reger
c59ebde083 show location nav as selectable nav in search page layout
- switch automatically on upon load of geodata provider
- but allow switch on also without geodata file (and display the location nav if search result has lat/lon location)
2015-04-02 02:10:00 +02:00
Michael Peter Christen
5bc1e5cfbf use a cursor hand on facet headline to show that this is clickable 2015-04-01 18:37:45 +02:00
Michael Peter Christen
40389987ec Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-01 18:18:05 +02:00
Michael Peter Christen
f9ba50379d added an expansion option to search facets on result page:
- if 8 or fewer facet options are present, they are shown by
default
- if more facet options are present, they are hidden
To view or hide all facets, just click on the facet header bar
2015-04-01 18:17:52 +02:00
reger
b1ec0644e5 fix NPE in location search on missing/empty PubDate in underlying rss data 2015-03-31 02:20:13 +02:00
reger
2f84b04fa9 add err msg on failure during Load_rss 2015-03-29 05:48:54 +02:00
reger
96292cf3eb shorten exception logging on not available connection in Load_RSS_p servlet 2015-03-28 21:12:00 +01:00
reger
66d0b5046a fix NPE on viewfile of url not in index 2015-03-26 00:21:31 +01:00
Michael Peter Christen
5789c96292 fix: banner did not show link and qph for portal mode 2015-03-25 13:21:36 +01:00
Michael Peter Christen
9bf0d7ecb9 added a new collection type 'dht' to all documents from the peer-to-peer
interface to distinguish rich and poor document data.
This also reverts some changes from commit
796770e070 because the firstSeen database
is the wrong method to distinguish these types of data
2015-03-24 12:32:39 +01:00
reger
7fcf0d0b71 fix missing display of CrawlerMonitor -> robots.txt Monitor
revert delete of file api/table_p.html see 3ffe19b85c
(still used in this menu)
2015-03-24 00:13:05 +01:00
Marc Nause
efadb710a4 Updated Git links from Gitorious to Github. 2015-03-23 11:12:39 +01:00
reger
65f8371163 fix link to DeReWo project page 2015-03-11 21:28:57 +01:00
reger
a5d19e2982 update configheuristics_p.html text
to state current opensearch heuristic function
2015-03-09 00:09:36 +01:00
reger
74ed399180 remove unused statement 2015-03-05 02:09:27 +01:00
Michael Peter Christen
fd4e2c809a Show dates in the content of a document in the search result:
- if an eventDate is given in the search result, replace the document
date with the event date and prefix it with the string "on ".
- the document date is omitted if a date from the content is shown

Added also the date as fields in the json and rss result sets.
2015-03-02 18:00:20 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
dcfc384eee bugfix for fixed host/port 2015-03-02 04:43:42 +01:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
have been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'on:' modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
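The click cycle on histogram bars described in 535f1ebe3b behaves like a small state machine over the from:, to: and on: modifiers. A minimal sketch in Python, under assumptions: the function names and the ISO-style date strings are hypothetical, and dropping the on: modifier on the 4th click is a guess, since the commit message only says that this click sets a from modifier again.

```python
from typing import Dict


def click(modifiers: Dict[str, str], date: str) -> Dict[str, str]:
    """Return the date modifiers after clicking the histogram bar for `date`."""
    if "from" in modifiers and "to" in modifiers:
        # 3rd click: drop the interval and select a single day.
        return {"on": date}
    if "from" in modifiers:
        # 2nd click: close the interval; both ends are inclusive.
        return {"from": modifiers["from"], "to": date}
    # 1st click, or 4th click after an on: modifier: start a new open interval.
    return {"from": date}


def render(modifiers: Dict[str, str]) -> str:
    """Format the modifiers the way they would be appended to a query string."""
    return " ".join(f"{name}:{value}" for name, value in modifiers.items())


first = click({}, "2015-03-01")
second = click(first, "2015-03-07")
third = click(second, "2015-03-05")
fourth = click(third, "2015-03-02")
```

Keeping the state in the modifiers themselves means the cycle needs no separate click counter; the current query is enough to decide what the next click does.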
reger
ba276d3e64 add description_txt to default query fields,
Dublin Core Metadata field extracted by most parsers.
2015-02-22 05:42:04 +01:00
reger
ad1596f9ac upd lucene api doc link 2015-02-16 01:20:12 +01:00
reger
1196ff01c8 revert: formatting fix also eats up highlighting
need other solution for snippets with unwanted html code
2015-02-14 02:43:05 +01:00
reger
61f42a7928 fix formatting issue in search result display
if description contains html code
noticed e.g. for id=NmNdJ9uApLaQ  http://hswong3i.net/blog/hswong3i/virtualmin-drupal-7-x-ubuntu-12-04-howto
2015-02-13 00:20:33 +01:00
Michael Peter Christen
6578ff3ddb enhanced suggest function 2015-02-09 18:45:07 +01:00
reger
ab98f69592 fix: searchoption hint for heuristic 2015-02-08 00:15:30 +01:00
Michael Peter Christen
974d58b01f IPv6 Fix for push interface 2015-02-04 15:03:34 +01:00
Michael Peter Christen
fe50e5aef6 fix for failed selection of terms in faceted search with vocabularies 2015-02-04 11:55:27 +01:00
Michael Peter Christen
1309619a71 remove remote indexing option in crawl start if not in p2p mode 2015-02-04 11:37:07 +01:00
Michael Peter Christen
6324db1213 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-02-04 11:27:31 +01:00
reger
5cb05c3013 adjust table column width to not line wrap crawler traffic line 2015-02-04 03:51:34 +01:00
Michael Peter Christen
606d00c8f2 cloning a crawl now accepts the class name of vocabulary scrapers 2015-02-04 01:50:35 +01:00
reger
11b21308c0 fix: malformed filename in image search
fix for http://mantis.tokeek.de/view.php?id=533
2015-02-01 05:35:09 +01:00
reger
9e1ec5fec4 refactor: just some more usages of constant for term ":[* TO *]" 2015-02-01 04:26:33 +01:00
Michael Peter Christen
b5ac29c9a5 added a html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
of the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
68c605d637 replace with CommonPattern.SPACE for split 2015-01-29 02:28:03 +01:00
Michael Peter Christen
1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits 2015-01-29 02:19:41 +01:00
Michael Peter Christen
a8a2b7a803 persistency for vocabulary facet switch 2015-01-29 02:16:42 +01:00
Michael Peter Christen
efbc9a3561 introducing a new getConfig method which parses comma-separated lists
from setting fields; refactoring for all places where such lists are
parsed
2015-01-29 01:53:36 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
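The idea behind efbc9a3561 and 69eacdf4eb, one shared precompiled pattern and one helper that turns a comma-separated setting into a list, can be sketched in Python as follows. The names `COMMA` and `get_config_list` are hypothetical stand-ins for the Java CommonPattern.COMMA and the new getConfig method; trimming whitespace and skipping empty entries are assumptions, as the commit messages do not describe that behaviour.

```python
import re
from typing import Dict, List

COMMA = re.compile(",")  # compiled once and reused for every split


def get_config_list(config: Dict[str, str], key: str) -> List[str]:
    """Parse a comma-separated setting into a list of trimmed, non-empty entries."""
    raw = config.get(key, "")
    return [item.strip() for item in COMMA.split(raw) if item.strip()]


# Hypothetical settings map with one list-valued entry.
settings = {"search.navigation": "hosts, authors,,namespace "}
navigation = get_config_list(settings, "search.navigation")
missing = get_config_list(settings, "no.such.key")
```

Routing every caller through one helper keeps the splitting rules in a single place, which is what the refactoring in the commit aims at.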
Michael Peter Christen
5a060c9f26 refactoring of reindexSolr (just replaced constant string) 2015-01-29 00:33:07 +01:00
Michael Peter Christen
3d717b749a fix for urlmaskfilter 2015-01-28 13:40:41 +01:00
Michael Peter Christen
2636582435 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-28 10:32:17 +01:00