Commit Graph

3330 Commits

Author SHA1 Message Date
Michael Peter Christen
c99a665593 adding a 3-pixel font generator made some time ago.. 2015-05-30 06:01:52 +02:00
Michael Peter Christen
c7576d6028 added a full solr export to the IndexControlURLs_p.html servlet. The
export function is also now the default export option. The export file
format for a full solr export is very similar to a solr search result
xml, only the <lst name="responseHeader"> tag is missing.

The exported xml has a special line termination feature: all documents
will be exported into a single line without any CR in between. That
means that every document is completely inside a single line. While this
is not readable at all for humans, it is very useful for Linux line
processing scripts, like grep. Using grep it will be easy to select
single documents which match a given pattern.

Such dumps shall be importable with the DATA/SURROGATE/in import
function, but that import is not yet adapted to the new file format.
2015-05-29 15:05:52 +02:00
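Since the export described above puts each document on exactly one line, a dump can be filtered line by line. The following is a minimal Python sketch of that grep-style selection. The function name `select_documents` and the sample lines are hypothetical, and the real export format may differ in detail.

```python
import re
from typing import Iterable, Iterator


def select_documents(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield every line (one exported document each) that matches pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line.rstrip("\n")


# Hypothetical sample: two documents, each complete on a single line.
sample_dump = [
    '<doc><str name="sku">http://example.org/a.html</str></doc>\n',
    '<doc><str name="sku">http://example.net/b.pdf</str></doc>\n',
]
matches = list(select_documents(sample_dump, r"\.pdf<"))
```

With a real dump, the `lines` argument could simply be an open file object, which iterates line by line without loading the whole export into memory.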
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
1d8e1e4bac - Image search expand box, adjust javascript hs padtominsize parameter, to make sure expand box doesn't shrink on small images
- ensure ImageResult.imagetext has a value for the link text (use filename if no alt text given)
2015-05-27 02:31:13 +02:00
reger
8b35656007 remove hard throw exception in makeResultEntry
remove unused "share." peername.yacy url rewrite
2015-05-26 23:57:06 +02:00
reger
af57fbefad use available mime (instead of null) on imageresult from metadatanode 2015-05-26 23:54:04 +02:00
reger
dd7782bac0 revert deletion of BinSearch
(accident)
2015-05-26 04:26:26 +02:00
reger
000dde9511 Eliminate duplication of values for search ResultEntry
by instantiation from URIMetadataNode, by eliminating differentiation of ResultEntry/URIMetadataNode.
- moved remaining ResultEntry functionality to URIMetadataNode
   - for 1:1 functionality added a function makeResultEntry()
- removed ResultEntry 
- refactored related code

The main difference is that after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated.


Main difference left is, that
2015-05-26 04:15:00 +02:00
reger
29c4aa3991 fix compiler notification of missing serialID
from last commit
2015-05-25 21:51:32 +02:00
reger
3d53da8236 refactor ResultEntry to be based on MetadataNode/SolrDocument
to share/reuse common access routines
2015-05-25 21:28:48 +02:00
reger
d882991bc5 Implement sharing of ioDispatcher for term & citation index
as proposed in ioDispatcher description
2015-05-25 19:46:26 +02:00
reger
370ba9da71 On imageSearch prefer mime to sort out non-image documents
Generalize the hack to prevent urls with just an img extension being returned

improving http://mantis.tokeek.de/view.php?id=528
2015-05-24 21:48:58 +02:00
reger
cd31633369 improve MultiprotocolURL.getFileExtension()
prevent string OOB while querypart contains a dot (return just "")
see log snippet in http://mantis.tokeek.de/view.php?id=533
2015-05-24 19:38:04 +02:00
reger
c60ccdfbcf Increase IODispatcher dumpQueue size to 2 to reduce risk of concurrent emergency dump,
skip concurrent emergency merge
dealing with/see  http://mantis.tokeek.de/view.php?id=566
2015-05-24 18:03:27 +02:00
reger
8a9622c31c fix string OoB on getImagelinks with long alttext
in description calculation
2015-05-24 01:59:40 +02:00
reger
3e742d1e34 Init remote crawler on demand
If the remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of the queue and the idling thread.
Deployment of the remoteCrawlJob is deferred until the option is activated.
2015-05-23 02:06:39 +02:00
reger
13f013f64a Limit extra sleep of BusyThread on LowMemCycle 2015-05-17 06:21:12 +02:00
reger
cd7c0e0aae detail optimization of RecrawlThread 2015-05-17 00:13:00 +02:00
reger
ace71a8877 Initial (experimental) implementation of index update/re-crawl job
added to IndexReIndexMonitor_p.html
Selects existing documents from the index and feeds them to the crawler.
Currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]).
Documents are added in small chunks (200) to the crawler, only if no other crawl is running.
2015-05-16 01:23:08 +02:00
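The chunked feeding described in the recrawl commit above can be sketched as follows. This is a minimal Python illustration, not the YaCy implementation: the names `chunked`, `feed_recrawl`, `crawler_is_busy` and `add_to_crawler` are hypothetical, and stopping when a crawl is running (rather than waiting) is an assumption. Only the query string and the chunk size of 200 come from the commit message.

```python
from typing import Callable, Iterator, List, Sequence

# Selection query quoted in the commit message (Solr date math).
RECRAWL_QUERY = "fresh_date_dt:[* TO NOW-1DAY]"
CHUNK_SIZE = 200


def chunked(urls: Sequence[str], size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Split the selected urls into consecutive chunks of at most `size`."""
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def feed_recrawl(
    urls: Sequence[str],
    crawler_is_busy: Callable[[], bool],
    add_to_crawler: Callable[[List[str]], None],
) -> int:
    """Feed chunks to the crawler while no other crawl is running.

    Returns the number of urls handed over. Assumption: a busy crawler
    stops the feed; a real job might wait and retry instead.
    """
    fed = 0
    for chunk in chunked(urls):
        if crawler_is_busy():
            break
        add_to_crawler(chunk)
        fed += len(chunk)
    return fed


# Hypothetical usage with an idle crawler that just records the chunks.
selected = [f"http://example.org/page{i}.html" for i in range(450)]
received: List[List[str]] = []
fed_count = feed_recrawl(selected, lambda: False, received.append)
```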
reger
141cd80456 correct log msg text 2015-05-16 00:01:54 +02:00
reger
f3ce99bfb8 fix extract of inboundlinks_protocol_sxt
url counter may be > 999
2015-05-14 00:03:09 +02:00
reger
2bc9cb5828 fix early return in addToCrawler
check / handle all supplied urls after error url
2015-05-13 21:58:43 +02:00
Michael Peter Christen
f5f88272e4 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-05-12 12:06:42 +02:00
Michael Peter Christen
5c67c4d460 fix for latest commit, see
f810915717 (commitcomment-11145880)
2015-05-12 12:06:21 +02:00
reger
c37dda8849 fix NPE on MultiProtocolURL on url with parameter value and '='
in getAttribute
- added test case for it
2015-05-12 01:09:10 +02:00
Michael Peter Christen
f810915717 added crawl start from a clone with very, very large url: they are now
encoded as post submit form inside a javascript creation function.
2015-05-11 16:30:41 +02:00
Michael Peter Christen
51de86c992 disabled debug thread dumps 2015-05-11 14:46:09 +02:00
Michael Peter Christen
d524a9d77c Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-05-11 14:42:40 +02:00
Michael Peter Christen
0710648c31 enable api calls with very long urls 2015-05-11 14:42:21 +02:00
reger
31346e873b upd library reference of missing jsch-0.1.21 in seeduploadscp.xml
upd to jsch-0.1.52.jar
2015-05-11 01:35:12 +02:00
reger
609c52e987 refactor getBookmark
to consistently check existence by != null (w/o throwing exception on not found)
2015-05-11 00:37:04 +02:00
reger
1481a8ab56 add opensearch rss results to dht collection (due to text = snippet)
which is used to differentiate meta from full data
- make sure check for dht is not dependent on number of collection entries
2015-05-10 18:52:33 +02:00
reger
f134aa7f7f persist bookmark timestamp
on setTimeStamp()
2015-05-10 15:29:23 +02:00
reger
752eec6697 fix NPE in addToIndex when used outside searchEvent 2015-05-10 05:18:23 +02:00
Michael Peter Christen
fbf85a1561 added temporary debug output in http client 2015-05-08 15:31:01 +02:00
Michael Peter Christen
ff29b0e503 added option to re-index exported xml snapshot dumps to
HTCACHE/snapshots by just placing them in the SURROGATES/in path
2015-05-08 15:30:26 +02:00
Michael Peter Christen
6f4fe4b175 revert of 8a7c68e4c7
keeping surrogates after processing is essential for some users. If the
space they are taking is too high, please set up an automatic deletion
process (like a cronjob).
2015-05-08 14:01:30 +02:00
Michael Peter Christen
97930a6aad added must-not-match filter to snapshot generation.
also: fixed some bugs
2015-05-08 13:46:27 +02:00
Michael Peter Christen
9d8f426890 adding a try-catch to link graph processing to prevent that a single
malformed url interrupts the storage process
2015-05-08 10:38:33 +02:00
reger
8a5b8f8789 on bookmarking of search result, remember orig. query in separate bookmark property
(instead of using the description field)
- adjust display and autosearch
- don't overwrite existing bookmark but combine info
2015-05-03 02:31:50 +02:00
reger
7224209486 break out of NormalizeDistributor loop on timeout 2015-05-02 02:36:18 +02:00
reger
47e61f8325 fix typo in image filter query
(extra bracket)
2015-04-28 03:12:14 +02:00
reger
4b4ab6799f fix String out of range in Collection Nav
see http://mantis.tokeek.de/view.php?id=573
2015-04-27 22:38:40 +02:00
reger
572cfe8fd4 improve character encoding for urlproxy servlet
for non-UTF-8 pages
2015-04-26 17:42:39 +02:00
reger
6bc8a9b11e make Quality of Service Servlet available to prioritize requests from local host
This assigns priorities to incoming requests. Higher priority numbers are served before lower.
(disabled by default in defaults/web.xml, 
uncomment or copy entry to DATA/Settings/web.xml)
2015-04-26 04:29:32 +02:00
Ryszard Goń
ca1a70aec8 fix for Accept '?' URLs column in Crawl Profile List 2015-04-19 15:55:49 +02:00
reger
5408448a56 skip redundant add. of keywords to text
search uses keywords as default search field
2015-04-17 02:14:13 +02:00
reger
296e97c78e put https port in peers dna
as we flag if a peer is accessible via https, we need to know the port if we want to use it (e.g. for interYaCy communication)
start to provide / transport the port by recording it in peers dna.
- add https link on the Network.html lock symbol
2015-04-16 02:36:12 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time of day is necessary. That
requires that the time zone of the document content and the time zone of
the user doing a search are detected. The time zone of the search
request is detected automatically using the browser's time zone offset,
which is delivered to the search request automatically and invisibly to
the user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required a change of the parser Java API. A lot of other changes
have been made to correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are in the UTC/GMT time zone,
a normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
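The normalization to UTC described in the time zone commit above can be illustrated with a small Python sketch. The function name `to_utc` is hypothetical, and the sign convention is an assumption: here the offset is the number of minutes that local time is ahead of UTC (so +120 for +02:00). The commit message does not state which convention YaCy uses.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive: datetime, offset_minutes: int) -> datetime:
    """Interpret a naive local timestamp with the given offset and return UTC.

    Assumption: offset_minutes is local time minus UTC, in minutes.
    """
    local_zone = timezone(timedelta(minutes=offset_minutes))
    return local_naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


# A date parsed from page content that was crawled with a +02:00 offset.
event = to_utc(datetime(2015, 4, 15, 13, 17, 23), 120)
```

Storing only the normalized value means two peers in different time zones index the same instant for the same document, which is the point of the change.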
Michael Peter Christen
b060ba900d added parsing of contentprop attribute in html tags for
content='startDate' and content='endDate'. The values of these fields are
now written to the new solr fields startDates_dts and endDates_dts.
2015-04-13 16:20:00 +02:00
Michael Peter Christen
4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
written to special solr fields which are deactivated by default.
2015-04-12 22:02:45 +02:00
reger
1395f10e95 fix typecast for css links 2015-04-12 01:11:47 +02:00
Michael Peter Christen
3288489fd2 more logging during start-up 2015-04-11 13:00:32 +02:00
Michael Peter Christen
abaaaef5f1 fix for filter queries 2015-04-11 12:30:29 +02:00
Michael Peter Christen
4d00175157 <experimental> added parsing of <article> html element.
Whenever such an element occurs, the complete content of all article
elements replaces the parsed <content> part of documents.
2015-04-10 16:16:20 +02:00
Michael Peter Christen
1df6492019 enhanced suggestions 2015-04-10 15:59:18 +02:00
Michael Peter Christen
ae02c92fd0 logging fix 2015-04-09 14:21:23 +02:00
Michael Peter Christen
5651713134 better debugging of fq 2015-04-07 17:02:02 +02:00
Michael Peter Christen
f5a032f293 split query into filter query and text query to get better ranking
results and faster results
2015-04-07 16:10:13 +02:00
Michael Peter Christen
2e88028c1a when selecting collections in navigation, do show the un-selected
collections in search result. When selecting one of them in another
search, switch off the previously selected collection. This actually
turns the collection navigation modifier into a radio-button like
behaviour
2015-04-07 13:13:58 +02:00
Michael Peter Christen
1de9b21c65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-07 12:40:43 +02:00
reger
5f4cd8d6f5 replace deprecated getIP with getIPs in AbstractRemoteHandler 2015-04-07 00:10:42 +02:00
Michael Peter Christen
fa7edc9f7a refactoring of filter queries (several queries instead only one) 2015-04-02 13:27:47 +02:00
Michael Peter Christen
40389987ec Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-04-01 18:18:05 +02:00
Michael Peter Christen
f9ba50379d added an expansion option to search facets on result page:
- if 8 or fewer facet options are present, they are shown by
default
- if more facet options are present, they are hidden
To view or hide all facets, just click on the facet header bar
2015-04-01 18:17:52 +02:00
reger
1f0f77bb77 make location facet return results
The location nav facet on field coordinate_p does not return results; now using coordinate_p_0_coordinate as an alternative to get facet counts. As the actual facet value is not used, this should not harm any analysis (even if the facet is an incomplete location).
If the facet value is used in the future, a *_geohash field could likely be introduced (for facet and other ... as transport value)
2015-04-01 01:57:56 +02:00
reger
b1ec0644e5 fix NPE in location search on missing/empty PubDate in underlying rss data 2015-03-31 02:20:13 +02:00
reger
c1dcc8c456 fix display and limit of max server connections after startup
(on restart value returned to default=50)
This has no effect on Jetty but the limit is still respected.
2015-03-29 07:12:23 +02:00
reger
839b962c20 correct percent encoding for '%' char 2015-03-28 03:05:21 +01:00
Michael Peter Christen
9bf0d7ecb9 added a new collection type 'dht' to all documents from the peer-to-peer
interface to distinguish rich and poor document data.
This also reverts some changes from commit
796770e070 because the firstSeen database
is the wrong method to distinguish these types of data
2015-03-24 12:32:39 +01:00
reger
796770e070 prevent overwrite of crawled or received full documents by (newer) metadata
To protect rich index data (full resource) from overwriting by metadata gathered during remote search,
the newly introduced "firstSeen" index is used to differentiate between full-resource-doc and metadata,
as a "firstSeen" entry is only added on stores of full-resource-docs (during crawl or remote search).
2015-03-23 03:57:47 +01:00
Michael Peter Christen
ee2490ab98 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-19 10:42:57 +01:00
reger
431311df42 fix get fresh_date_dt to allow returned value to be date in future 2015-03-18 22:04:03 +01:00
otter
74c7e8b686 Fixes hanging FlushThread (see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5447)
by replacing put() method by the more robust add() to
add a merge job to the queue.
2015-03-18 21:57:41 +01:00
reger
f63fff9008 fix snippet containing number with comma as decimal point http://mantis.tokeek.de/view.php?id=344
to keep it as one word (by altering the split regex)
- added snippet test case with number
- regex for word split to match multiple split chars
2015-03-16 02:03:40 +01:00
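The idea in the snippet commit above, keeping a number such as "1,5" as one word while still splitting on runs of separators, can be sketched in Python. The regex below is an illustration written for this example, not the expression used in YaCy, and `split_words` is a hypothetical name.

```python
import re
from typing import List

# One or more separators in a row. A comma or dot counts as a separator
# only when it is NOT enclosed by digits on both sides.
WORD_SPLIT = re.compile(r"(?:(?<!\d)[.,]|[.,](?!\d)|[\s;:!?()])+")


def split_words(text: str) -> List[str]:
    """Split text into words, keeping numbers like '1,5' or '3.14' intact."""
    return [word for word in WORD_SPLIT.split(text) if word]


words = split_words("costs 1,5 euro, or more.")
```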
reger
b241264632 fix error on *abc query input
http://mantis.tokeek.de/view.php?id=486
2015-03-15 22:31:47 +01:00
reger
2ef8ffdb60 apply UTF-8 encoding
copied from escape()
2015-03-15 06:02:45 +01:00
reger
7120ea42f1 fix for path with char code > 255
(causing index out of bound exception)
+ test case for it
2015-03-15 03:37:32 +01:00
reger
1d81bd0687 fix url encoding for path see http://mantis.tokeek.de/view.php?id=559
So far we used same escape procedure for all parts of the url (which includes x-www-form-urlencoded for all url components)
Added capability to use different encoding rules for the different url components (through specific bitset for each component).
(this is inspired by org.apache.http.client and java.net.uri implementation).
- Added test case for  http://mantis.tokeek.de/view.php?id=559
2015-03-15 00:46:07 +01:00
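The difference between form encoding and component-specific encoding named in the commit above can be shown with Python's standard library. This is a sketch using `urllib.parse` safe-character sets in place of the per-component bitsets the commit mentions; the helper names `encode_path` and `encode_query_value` are hypothetical.

```python
from urllib.parse import quote, quote_plus


def encode_path(path: str) -> str:
    """Percent-encode a url path: '/' and '+' stay literal, space becomes %20."""
    return quote(path, safe="/+")


def encode_query_value(value: str) -> str:
    """x-www-form-urlencoded style: space becomes '+', a literal '+' becomes %2B."""
    return quote_plus(value)


path = encode_path("/a b+c/ü.html")
query = encode_query_value("a b+c")
```

Applying the form rules to a path would turn the space into "+", which a server then reads as a literal plus sign in the file name; keeping separate rules per component avoids that.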
reger
62087fb8b2 fix MultiProtocolURL mailto protocol detection 2015-03-13 02:02:53 +01:00
reger
2e8c24e02a fix link to DeReWo download file 2015-03-11 20:02:23 +01:00
reger
706f75ddc2 try to fix hang on index blob merge on shutdown
http://mantis.tokeek.de/view.php?id=505
It happens but could not be reproduced. This change makes sure the terminate signal is caught at the end of currently running merge jobs
2015-03-11 19:36:23 +01:00
reger
f94e34058c fix url (path) %-decoding http://mantis.tokeek.de/view.php?id=519
- add test case for this
2015-03-11 01:05:14 +01:00
reger
7e09bff4a1 exclude default search fields from text copy to text_t
for metadata index documents (reduce text redundancy)
2015-03-08 21:49:23 +01:00
reger
86073a5ba3 For remote crawlReceipt add document abstract/description
enhance the metadata returned to the originator by description_txt to improve fulltext search result hits.
2015-03-08 02:34:48 +01:00
reger
8af70950d9 harmonize snippet computation
to always consider description_txt (solr hl & internal).
For now just added desc to text list for computation, could be further equalized with hl computation.
2015-03-05 02:22:05 +01:00
Michael Peter Christen
fd4e2c809a Show dates in the content of a document in the search result:
- if an eventDate is given in the search result, replace the document
date with the event date and prefix it with the string "on ".
- the document date is omitted if a date from the content is shown

Added also the date as fields in the json and rss result sets.
2015-03-02 18:00:20 +01:00
Michael Peter Christen
893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
search for: "Berlin on:tomorrow" to find events happening tomorrow in
Berlin
2015-03-02 13:10:05 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
d9d3111d10 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-02 04:31:05 +01:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
have been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
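The click cycle on histogram bars described in the commit above behaves like a small state machine over the active date modifiers. A minimal Python sketch, assuming the modifiers are held in a dict keyed by modifier name and dates are ISO strings (both are assumptions made for this example; `click` is a hypothetical name):

```python
from typing import Dict


def click(modifiers: Dict[str, str], day: str) -> Dict[str, str]:
    """Return the date modifiers after clicking the histogram bar for `day`."""
    if not modifiers or "on" in modifiers:
        # 1st click (or 4th, after an on: modifier): start a range.
        return {"from": day}
    if "from" in modifiers and "to" not in modifiers:
        # 2nd click: close the range.
        return {"from": modifiers["from"], "to": day}
    # 3rd click: drop from/to and select the single day.
    return {"on": day}


first = click({}, "2015-03-01")
second = click(first, "2015-03-05")
third = click(second, "2015-03-03")
fourth = click(third, "2015-03-07")
```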
reger
d7259419f3 postpone raw snippet html encoding upon use
instead of during init of snippet
addressing http://mantis.tokeek.de/view.php?id=551
2015-02-28 19:02:18 +01:00
reger
de56d934b2 apply query parameter getQueryFields() to GSA servlet 2015-02-27 00:53:20 +01:00
reger
2d2299f484 fix mimetype of rss items in rss parser
- remove self reference as anchor for items
2015-02-25 01:58:42 +01:00
Michael Peter Christen
b432049d59 enhanced date parsing time 2015-02-25 01:05:46 +01:00
reger
9b0de2de64 introduce getQueryFields to return default query fields (query parameter QF)
calculated from boostfields config, making sure title, description, keywords and content is always searched.
- apply change to solrServlet makes sure every remote query uses at least all locally defined boost fields for search
- apply to local solr search
- simplify select query by using QF defaults
2015-02-23 23:12:07 +01:00
reger
a0f04db9ea add extracted description/subject to pptParser 2015-02-22 05:31:56 +01:00
reger
8ec1db76ee url unescape add check for inconsistent utf8 multibyte parsing
If the url contains special chars (like umlauts äöü) it's interpreted as a multibyte char and actually not converted at all (removed).
Added a check: if the multibyte conversion is not complete, just add the char as is.

This fixes http://mantis.tokeek.de/view.php?id=200
2015-02-20 02:21:04 +01:00
reger
4b97ddb9ec stop sending crawl receipts if receiver went offline 2015-02-17 03:16:10 +01:00
reger
7e35518787 add extracted description/subject to docParser 2015-02-16 00:50:16 +01:00