Commit Graph

844 Commits

Author SHA1 Message Date
reger
49eae79c01 fix Tables.hasIndex check for tablename = key
apply same functionality to hasHeap (to not create new table on call hasHeap)
2016-11-09 02:33:42 +01:00
reger
669f60223e upd Column.toString to output encoder "{bytes}"
used for String and binary Column types
2016-11-06 21:02:58 +01:00
reger
c9e81d2fa0 fix Column parsing from celldefinition string, without cellwidth def.
(outofbound exception)
2016-11-06 03:34:24 +01:00
reger
20a1b29ed3 add simple test case for ReferenceContainer helpful for debugging
calculated ranking parameter
2016-10-26 01:38:40 +02:00
reger
3c7220bc7b Refactor rwi reference word position and word distance calculation
used for rwi ranking.
Main changes:  
- introduce a posintext() to access the stored value. This also reduces mem alloc of the position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null)
- adjust assignments and the min() max() and distance() calculation accordingly
2016-10-23 19:40:02 +02:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
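A minimal sketch of the thread naming described in f0639d810c. The class NamedThreadFactory and the "prefix-n" naming scheme are hypothetical illustrations and are not taken from the commit; the sketch only shows how a thread can be given an explicit name instead of the default "Thread-n".

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of the commit): names every thread it creates. */
class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(0);

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // The second constructor argument replaces the default "Thread-n" name,
        // so the thread is recognizable in thread dumps and monitoring tools.
        return new Thread(task, prefix + "-" + counter.incrementAndGet());
    }
}
```

Such a factory can be passed to the Executors factory methods that accept a ThreadFactory, so pooled threads carry a readable name as well.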
reger
4c67ed3f8d catch rwi ranking div by zero exception
during rwi search result processing worddistance calculation is affected 
by concurrent update (normalization) of min/max ranking parameter for
wordpositions. On update of min/max the exception is raised in distance calc
and now caught. 
This concurrent update and change of ranking results is needed for speed
but should be further checked for optimization
2016-10-22 00:53:47 +02:00
luccioman
ee92082a3b Updated javadocs : warning about closing stream responsibility. 2016-10-21 12:48:36 +02:00
reger
68217465fe div by null in word distance calculation
(again, description in http://mantis.tokeek.de/view.php?id=698)
as root cause was not seen, added just workaround reducing in favour over a 
try catch (for easier followup).
2016-10-19 22:55:36 +02:00
reger
8b74a6bf57 fix min/max calculation of WordReferenceVars.distance()
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
2016-10-17 23:58:28 +02:00
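A minimal sketch of a word distance calculation that only reads the position list, in the spirit of 8b74a6bf57. It assumes that "distance" means the average gap between consecutive word positions; that definition, the class name WordDistanceSketch, and the method signature are assumptions for illustration and are not confirmed by the commit message.

```java
import java.util.List;

/** Hypothetical sketch, not the YaCy implementation. */
class WordDistanceSketch {

    /**
     * Average gap between consecutive positions (assumed definition).
     * Returns 0 when fewer than two positions are known, because a
     * distance needs at least two positions.
     */
    static int distance(List<Integer> positions) {
        if (positions == null || positions.size() < 2) {
            return 0;
        }
        int sum = 0;
        // The list is only read, never cleared or modified, so calling this
        // method does not change the result of later min/max calculations.
        for (int i = 1; i < positions.size(); i++) {
            sum += Math.abs(positions.get(i) - positions.get(i - 1));
        }
        return sum / (positions.size() - 1);
    }
}
```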
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
685d8e86bf Avoid frequent data type casting (float/long) for rwi score
refactor to using long in URIMetadataNode too (and related call parameters)
As remote rwi scores are not used (since v1.83) skip reading float-score,
but keep in toString() for communication with older versions.
2016-10-14 01:17:34 +02:00
reger
681a61dafb adjust rwi index result word position handling used for rwi ranking
- correct WordReferenceVars.toRowEntry posintext parameter
to set expected min posintext (the difference is on multi-word queries,
while positions are ordered by search word order).
- modified posofphrase/posinphrase join operation
 - to set min posofphrase
 - and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking)
+ fix compiler msg (missing type declaration)
2016-10-04 01:42:18 +02:00
reger
ff6589fc0f test case: simulating multi word query for local rwi index
Purpose of the test case is to be able to (controlled) analyse the rwi ranking for
multi word searches (with focus on posintext and word-distance ranking)
2016-09-18 00:59:27 +02:00
reger
3b694b3935 add some javadoc to rwi wordreference distance, position
to remember facts for http://mantis.tokeek.de/view.php?id=683
Init missing word position to 0 like in other non text body words
2016-09-14 00:36:19 +02:00
reger
96467c5467 remove not needed counter in Tokenizer (completing last changes)
including a small change, word posintext counting. 
We remember/store 1st posintext. Previously following words got a handle (posintext)
excluding found. Now it just counts and assigns true posintext as handle (posintext)
2016-09-10 18:23:09 +02:00
reger
7efb66ee10 adjust the WordReference.join wordsintext calc to take the max (instead of sum)
The reference is for the same url (add same for title and phrases).
+ del redundant join() procedure
2016-09-08 02:29:48 +02:00
reger
120bf7e6e2 implemented RWI WordReference to return the word position value (was always left empty)
This is needed and enables existing word position ranking for RWI.
The resulting concurrency issue in word position min/max calculation was eliminated
by an iterator.hasNext check before next() access.
2016-09-06 03:18:02 +02:00
Michael Peter Christen
103a8348b3 fix for NPE and small performance enhancement 2016-08-10 06:48:08 +02:00
luccioman
6e96c7341a Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/Load_MediawikiWiki.java
	htroot/Load_PHPBB3.java
	htroot/ViewImage.java
2016-07-03 18:59:00 +02:00
reger
5aaa057c65 ignore empty input lines in FileUtils.getListArray() to poka-yoke blacklist read.
equalizes behavior with getListString()
improves: case where blacklist file contained an undesired empty line, not 
fixed by blacklist-cleaner.
2016-06-28 23:44:28 +02:00
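A minimal sketch of reading a list file while skipping empty lines, as described in 5aaa057c65. The class ListReaderSketch and the method readNonEmptyLines are hypothetical and are not the actual FileUtils.getListArray() code. Treating whitespace-only lines as empty is an additional assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the YaCy FileUtils code. */
class ListReaderSketch {

    /** Reads all lines from the source and drops empty ones. */
    static List<String> readNonEmptyLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: lines containing only whitespace count as empty.
                if (line.trim().isEmpty()) {
                    continue;
                }
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```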
reger
4cc38e979d add InputStream close after reading input file (Vocabulary_p servlet) 2016-05-24 00:26:28 +02:00
Burkhard
9a18e2297b Merge pull request #51 from JeremyRand/multiple-boost-query
Fix multiple boost queries
2016-05-22 22:24:04 +02:00
reger
f0d7b93372 make use of and activate charset autodetection in Vocabulary input from file
+ revert mistake of empty cn.lng
2016-05-22 05:38:26 +02:00
JeremyRand
58824dfa6c Refactor escaping in config file read/write code. Now it uses Apache Commons StringUtils instead of RegEx. 2016-05-20 20:17:51 -05:00
luc
26f1ead57c Created ViewFavicon class specialized in favicon viewing.
Main image processing is now in ImageViewer, used by both ViewImage and
ViewFavicon.

Fixed URIMetadataNode.getFavicon to use non-standard icons with no size
as fallback.
2016-02-09 20:46:44 +01:00
luc
07222b3e1a Added favicon url transmission in RWI chunks. 2016-02-05 17:05:36 +01:00
luc
3cc5619d93 Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
2016-02-02 09:57:54 +01:00
luc
571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
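A minimal sketch of the refactoring named in 571bc55937: using the StandardCharsets constants instead of a charset name given as a string. The class CharsetSketch and its methods are hypothetical and only illustrate the pattern.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch illustrating the pattern, not code from the commit. */
class CharsetSketch {

    static byte[] encode(String text) {
        // With a hard-coded name, text.getBytes("UTF-8") declares the checked
        // UnsupportedEncodingException; the constant needs no exception handling.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Besides removing the exception handling, the constant also rules out a misspelled charset name at compile time.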
sixcooler
dce1cb65c4 Merge remote-tracking branch 'choose_remote_name/master' 2015-12-28 23:20:42 +01:00
reger
b4b6910d60 fix (todo): correct doc.id of remote search result if it does not match the newly
calculated doc hash.
Testing showed that in some cases delivered url doesn't match the local
calculated hash. In this case replace doc.id (and host_id_s) with calculation
from url.
2015-12-20 02:10:49 +01:00
reger
cb83e65f89 drop returning document language "en" if unknown (fix todo)
which also harmonizes handling of query.modifier for rwi and solr results
(the result must match a given language filter)
2015-12-19 01:42:35 +01:00
luc
70595d05d0 Modified MemoryControl.main() test to properly end for better results
displaying.
2015-12-14 23:49:28 +01:00
reger
cdb8f3b10d make current ranking score value avail. to search interface / api
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
2015-12-08 03:17:32 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
1160b13172 remove unused md5 from ViewFile servlet params 2015-11-28 23:09:15 +01:00
reger
b2c8bc0ae6 remove md5_s from default index fields
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
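A minimal sketch of the approach described in 7d0d19cb8e: temp files go into one dedicated subdirectory that is removed as a whole on shutdown, instead of registering each file with File.deleteOnExit(). All names here (TempDirSketch, createTempSubdir, newTempFile, deleteRecursively, installShutdownCleanup) are hypothetical. Unlike the commit, which sets the JVM tmpdir property to the subdirectory, this sketch passes the directory explicitly to File.createTempFile.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical sketch, not the YaCy implementation. */
class TempDirSketch {

    /** Creates (if needed) a subdirectory of java.io.tmpdir and returns it. */
    static Path createTempSubdir(String name) {
        Path sub = Paths.get(System.getProperty("java.io.tmpdir"), name);
        try {
            Files.createDirectories(sub);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sub;
    }

    /** Creates a temp file inside the given directory, without deleteOnExit(). */
    static File newTempFile(Path dir, String prefix, String suffix) {
        try {
            return File.createTempFile(prefix, suffix, dir.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the directory and everything below it, children first. */
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts files and subdirectories before their parents.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Registers one shutdown hook for the whole directory. */
    static void installShutdownCleanup(Path dir) {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> deleteRecursively(dir), "tmpdir-cleanup"));
    }
}
```

With this layout the JVM keeps a single shutdown hook regardless of how many temp files are created, so the per-file bookkeeping of deleteOnExit() does not grow during long uptimes.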
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
sixcooler
d3b9349b6f simplification / speedup of GenerationMemoryStrategy 2015-11-10 20:39:46 +01:00
luc
c38d6c1f37 Correction for mantis 535: inurl: parameter doesn't work on URLs with
upper-case letters
2015-09-23 21:01:51 +02:00
reger
3f2b8ab5e5 optionally include mime in p2p url exchange string
if doctype decodes to ambiguous mime and default conversion is not equal to original
2015-09-22 00:12:31 +02:00
reger
e37a4f0b3d prevent metadata records in index w/o valid url
by throwing MalformedURL exception on URIMetadataNode creation
2015-09-06 22:19:05 +02:00
Michael Peter Christen
c40c302748 when many crawl queues are generated, this NPE can occur; probably
caused by a concurrency issue:
W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
	at java.util.TreeMap.rotateRight(TreeMap.java:2239)
	at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271)
	at java.util.TreeMap.put(TreeMap.java:582)
	at net.yacy.kelondro.table.Table.<init>(Table.java:235)
	at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229)
	at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204)
	at net.yacy.crawler.HostQueue.push(HostQueue.java:397)
	at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237)
	at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184)
	at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355)
	at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101)
	at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2015-09-05 14:12:17 +02:00
luccioman
2f0f0180e2 Added a function to list files recursively. 2015-09-04 13:42:57 +02:00
reger
0e4ba0360b fix NPE on .yacyh result url of disconnected peer
(cleanup yacyshare remaining)
2015-08-25 23:26:17 +02:00
Michael Peter Christen
dbbad23e12 removed warnings 2015-08-03 05:37:34 +02:00
Michael Peter Christen
b94bd7f20a a collection of search query enhancements:
- fixed superfluous space in query field list
- fixed filter query logic
- removed look-ahead query which caused each new search page to
submit two solr queries
- fixed random solr result orders in case that the solr score was equal:
this was then re-ordered by YaCy using the document hash which came from
the solr object and that appeared to be random. Now the hash of the url
is used and the score is additionally modified by the url length to
prevent this particular case from appearing at all.
2015-08-02 14:52:41 +02:00
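A minimal sketch of a deterministic result order for equal scores, loosely following the last item of b94bd7f20a. The commit does not state the exact formula, so everything here is an assumption: the class TieBreakSketch, the Hit type, preferring shorter URLs on equal score, and using the URL text itself as a stand-in for the url hash.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch, not the YaCy ranking code. */
class TieBreakSketch {

    static final class Hit {
        final String url;
        final long score;

        Hit(String url, long score) {
            this.url = url;
            this.score = score;
        }
    }

    // Higher score first. On equal score the shorter URL wins (assumption),
    // and the URL text gives a stable final order (stand-in for the url hash).
    static final Comparator<Hit> ORDER =
            Comparator.comparingLong((Hit h) -> h.score).reversed()
                    .thenComparingInt((Hit h) -> h.url.length())
                    .thenComparing((Hit h) -> h.url);

    /** Returns a sorted copy; the input list is left unchanged. */
    static List<Hit> sorted(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(ORDER);
        return copy;
    }
}
```

Because every comparison key is derived from the score and the URL alone, the same result set is ordered the same way on every request.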