Commit Graph

882 Commits

Author SHA1 Message Date
sgaebel
df9ea0a42a removes some warnings: unused imports, params 2020-07-27 22:20:49 +02:00
luccioman
08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
Necessary to prevent blocking the indexing workflow when some
wkhtmltopdf renderings fail without terminating
2018-12-11 22:31:31 +01:00
luccioman
2bdd71de60 Added server side columns sorting on the Process Scheduler table
For easier usage of large tables in the Table_API_p.html page.
2018-07-04 10:28:32 +02:00
luccioman
bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-07-02 10:00:40 +02:00
luccioman
e97580dfc7 Fixed unsafe conccurent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing Thread names using the Thread constructor parameter is
faster as it already sets a thread name even if no customized one is
given, while an additional call to the Thread.setName() function
internally do synchronized access, eventually runs access check on the
security manager and performs a native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
luccioman
addd18c993 Removed some remaining uses of deprecated Seed.getIP() 2018-04-26 09:39:30 +02:00
luccioman
bcbd0ae1a4 Enabled partial parsing of audio resources. 2018-03-01 20:50:44 +01:00
luccioman
46c9da6428 Allow creation of vocabularies from remote CSV file URLs. 2018-02-21 08:41:13 +01:00
luccioman
6cd3847d0a Fixed NullPointerException case on Table init with relative file path.
Can occur for example when running dbtest with relative test table file
name (wihout explicit parent folder).
2018-01-29 14:00:43 +01:00
luccioman
9ddf92d143 Removed unncessary reflection usage for workflow tasks.
This improves code readability and maintainability (calls hierarchy are
easier to read) and eventually performance.
2018-01-15 10:05:49 +01:00
luccioman
6425963cee Fixed internal tables exact value match iterator 2018-01-10 18:38:42 +01:00
luccioman
36e9b1c5b3 Fixed SegmentTest test case time dependant occasional failures
As highlighted by latest automated Travis builds.
2018-01-02 10:21:07 +01:00
luccioman
938d8a9731 Added some JavaDoc 2017-11-14 09:24:13 +01:00
luccioman
6e497241f7 Properly close resources (even on error) on OS and ThreadDump classes.
Also updated some JavaDoc and main() function usage message on the same
ones.
2017-10-16 17:04:22 +02:00
luccioman
dd9cb06d25 Fixed RWI distance calculation on multi words search queries.
Distance was lost when storing/retrieving references to intermediate
result container.

Now all JUnit tests are again successfully passing!
2017-10-04 08:41:43 +02:00
luccioman
5d3ceb31b7 Improved search navigators counters accuracy and consistency.
- added some missing increments from RWI results
- decrement relevant navigator counts when solr or RWI results are
evicted because duplicates detection or constraints checked belatedly
- do not compute facets when unnecessary to avoid unwanted CPU load
- do not increment from facets when already done
- do not rely on facets on remote solr peers requests, as most of the
time only a limited part of their total results if fetched (thus also
preventing unnecessary load on remote peers)
- use a concurrency friendly score map for the dates navigators to
prevent unwanted ConcurrentModificationExceptions

This improves the situation for the most obvious inconsistencies in
search navigators counts, but more has to be done for a true accuracy
(notably when query modifiers constraints are applied belatedly - after
the solr or RWI retrieval request - such as the content domain
constraint)
2017-09-06 16:58:40 +02:00
luccioman
8e4f31bdc7 Updated internal ISO 639-1 language codes with latest standards.
Includes 54 language code additions, some name modifications, and
marking a few deprecated.
2017-09-02 09:53:38 +02:00
luccioman
4eba88f2ff Removed some unnecessary uses of java.lang.reflect api.
This improves code browsing and readability, making search by references
or call hierarchy IDE features more accurate.
2017-08-24 18:47:18 +02:00
luccioman
b23a563065 Prevent search result failure on incomplete images information.
Complements the recent modification related to images in commit 7f395ef.

Unfortunately many documents metadata fetched from the freeworld p2p
network have only partial information about embedded images. Without
proper error handling, this made many searches in p2p mode to fail
completely.
2017-08-15 10:11:05 +02:00
Michael Peter Christen
7f395ef937 added image link in search results
This should be a help to make a preview of search results.
The image is computed from the list of embedded images, it is
always the first image in that list.
In rss-type results the image is presented like
<media:content medium="image" url="https://abc.xyz/logo.png"/>
as defined in
http://www.rssboard.org/media-rss#media-content
2017-08-14 20:12:09 +02:00
luccioman
f369679d1c Fixed read/copy on input streams reading sometimes less than expected. 2017-07-11 09:00:27 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
Thus enable getpageinfo_p API to return something in a reasonable amount
of time on resources over MegaBytes size range.
Support added first with the generic XML parser, for other formats
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't becomes
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
luccioman
319231a458 Added a generic XML parser, able to parse elements text and URLs.
This parser adds support for any XML based format other than already
supported XML vocabularies such XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fail,
before falling back to the existing genericParser which extracts not
that much useful information except URL tokens.
2017-06-26 16:30:21 +02:00
luccioman
0487336ec3 Prevent integer overflow in table statistics and use strong typing 2017-06-19 17:02:11 +02:00
luccioman
5fdd5d16b1 Use volatile to ensure concurrent threads use up to date property value 2017-06-15 09:48:22 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
luccioman
8399275142 Properly close file output streams even on exceptions scenarios. 2017-06-08 07:19:16 +02:00
luccioman
a04feac064 Ensure file input streams proper closing in both success and failures
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
luccioman
c53c58fa85 Unsure closing ChunkIterator stream in every possible use case.
Also trace in logs the eventual close failures instead of failing
silently.
This should help prevent holding too many unreleased system file
handlers, as in the case reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02
)
2017-06-02 09:47:45 +02:00
reger
f9180fabc4 assure that RWI Index.Segment IODispatcher is not blocking on shudown
waiting on a semaphore permit.
see desc. http://mantis.tokeek.de/view.php?id=723
2017-01-24 01:51:28 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
reger
eedee6eabb fix exception on URIMetadataNote instantiation with corrected id hash on
host_id_s. Use Solr setField instead of addField to prevent
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247)
	at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966)
	at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242)
	at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)
2017-01-05 00:24:37 +01:00
reger
083df255e4 fix html tag attribute parsing containing attribute w/o value
e.g. itemscope or autofocus (in such case the next key was not properly
recognized).
2016-12-24 06:57:11 +01:00
luccioman
eec5779889 Added a name prefix to pooled threads for easier monitoring.
Using JVM monitoring tools, it is then easier to identify tasks running
inside thread pool with a custom prefix rather than the generic one :
"pool-".
2016-11-23 11:21:14 +01:00
luccioman
a0dfbaca6a FileUtils : added some JavaDocs and unit test cases 2016-11-16 15:12:21 +01:00
reger
49eae79c01 fix Tables.hasIndex check for tablename = key
apply same functionality to hasHeap (to not create new table on call hasHeap)
2016-11-09 02:33:42 +01:00
reger
669f60223e upd Column.toString to output encoder "{bytes}"
used for String and binary Column types
2016-11-06 21:02:58 +01:00
reger
c9e81d2fa0 fix Column parsing from celldefinition string, without cellwidth def.
(outofbound exception)
2016-11-06 03:34:24 +01:00
reger
20a1b29ed3 add simple test case for ReferenceContainer helpful for debugging
calculated ranking parameter
2016-10-26 01:38:40 +02:00
reger
3c7220bc7b Refacture rwi reference word position and word distance calculation
used for rwi ranking.
Main changes:  
- introduce a  posintext() to access the stored value. This reduces also mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null
- adjust assignments and the min() max() and distance() calculation accordingly
2016-10-23 19:40:02 +02:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
reger
4c67ed3f8d catch rwi ranking div by zero exception
during rwi search result processing worddistance calculation is effected 
by concurrent update (normalization) of min/max ranking parameter for
wordpositions. On update of min/max the exception is raised in distance calc
and now catched. 
This concurrent update and change of ranking results is needed for speed
but should be further checked for optimization
2016-10-22 00:53:47 +02:00
luccioman
ee92082a3b Updated javadocs : warning about closing stream responsibility. 2016-10-21 12:48:36 +02:00
reger
68217465fe div by null in word distance calculation
(again, description in http://mantis.tokeek.de/view.php?id=698)
as root cause was not seen, added just workaround reducing in favour over a 
try catch (for easier followup).
2016-10-19 22:55:36 +02:00
reger
8b74a6bf57 fix min/max calculation of WordReferenceVars.distance()
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
2016-10-17 23:58:28 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
685d8e86bf Avoid frequent data type casting (float/long) for rwi score
refactor to using long in URIMetadataNode too (and related call parameters)
As remote rwi score's are not used (since v1.83) skip reading float-score ,
but keep in toString() for communication with older versions.
2016-10-14 01:17:34 +02:00