Commit Graph

1752 Commits

Author SHA1 Message Date
Michael Peter Christen
8aa08261a7 update to Solr Boost handling 2012-12-05 12:26:42 +01:00
Michael Peter Christen
908ad2f174 Added a new servlet to configure the solr ranking using field boosts 2012-12-03 17:01:19 +01:00
Michael Peter Christen
a01e47b992 enhanced exists()-method for solr; should reduce a lot of IO during DHT
target selection
2012-12-02 17:29:37 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
b5ee88c6af added more logging to get info which url causes performance problems 2012-12-02 16:52:12 +01:00
reger
1faa045dc1 fix: prevent regex pattern compile error for blacklist import for path '*' (extend it to '.*') 2012-12-01 22:41:21 +01:00
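The failure behind this fix can be reproduced in a few lines. This is a minimal sketch using Python's `re` module rather than YaCy's Java code, and `normalize_path_pattern` and `compiles` are hypothetical helper names. It only illustrates that a bare `*` is not a valid regular expression, while `.*` is.

```python
import re

def normalize_path_pattern(path: str) -> str:
    """Extend a bare '*' wildcard to the regex '.*' (hypothetical helper)."""
    return ".*" if path == "*" else path

def compiles(pattern: str) -> bool:
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # A lone '*' has nothing to repeat, so compilation fails.
        return False
```

With these helpers, `compiles("*")` is false, while `compiles(normalize_path_pattern("*"))` is true.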
reger
6cf33f899c prevent Solr "version conflict" on update by set Solr "_version_" field to 0 (=no version check) 2012-11-28 00:09:53 +01:00
Michael Peter Christen
acd98bebb7 improvements in GSA result writer 2012-11-26 15:18:51 +01:00
Michael Peter Christen
3de784c8dd replaced more split and replaceAll calls that lacked pattern pre-compilation with
pre-compiled patterns
2012-11-26 13:40:53 +01:00
Michael Peter Christen
8fc3679c66 using more pre-compiled patterns for split methods 2012-11-26 13:11:55 +01:00
Michael Peter Christen
d48e9788d2 enhanced search result processing behavior
- query less at one time; query more often
- in between the small queries, evaluate results
- remove fields from search results which are not needed
2012-11-26 12:24:35 +01:00
Michael Peter Christen
bf512e6350 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-11-26 00:14:57 +01:00
reger
469efcdb9d fix: display and calculate authors and namespace search navigator if configured (otherwise skip overhead)
(leaves the hosts and topics navigators, and the filetype and protocol navigators not included in ConfigPortal, untouched)
2012-11-25 22:49:26 +01:00
Michael Peter Christen
eca68fa197 added debug code to crawler monitor 2012-11-25 15:43:42 +01:00
Michael Peter Christen
205f8b222b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-11-25 14:41:49 +01:00
orbiter
ee612e8b93 start the local search only if this peer is doing a remote search or
when it is doing a local search and the peer is old
2012-11-25 11:58:57 +01:00
Michael Peter Christen
d465773a37 - removed multi-add of documents (not used)
- inserted specialized code for size request
2012-11-25 01:34:39 +01:00
Michael Peter Christen
a1a4d9aa94 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/cora/federate/solr/connector/MirrorSolrConnector.java
2012-11-24 22:31:46 +01:00
Michael Peter Christen
b7004043ea - added a field cache for solr queries which call only for a single
value
- fixed a version conflict exception within a solr add request
2012-11-24 22:30:05 +01:00
orbiter
5aa5202adf fixes for filesystem indexing 2012-11-24 10:27:29 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
5e182a566f - added another enumeration method in kelondro data structure to get a
more random access to data for the balancer
- added random access inside the balancer
2012-11-23 13:58:39 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
a114bb23bb - using edismax in gsa interface
- generating less field data for gsa search results
- using a boost query in gsa interface to move double content to the end
of the result list
2012-11-22 13:03:33 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the signature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because this is the first time a long number is used as a value in the Solr
store, the value naming in the YaCySchema also had to be adapted and
normalized. As a result, many files had to be changed.
2012-11-21 18:46:49 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during acquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' 2012-11-18 22:11:04 +01:00
Michael Peter Christen
832eead998 Merge remote-tracking branch 'regerdev/master' 2012-11-18 22:04:11 +01:00
Michael Peter Christen
952e143580 FINALLY YaCy can now search for full strings using double- or
single-quoted strings in the search query line!!!
2012-11-18 16:03:34 +01:00
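As an illustration of what separating quoted full strings from single words can look like, here is a minimal sketch. It is not YaCy's QueryGoal parser, which this log does not show; `split_query` is a hypothetical name, and the regular expression is an assumption about one reasonable way to recognize double- and single-quoted phrases.

```python
import re

# Match a double-quoted phrase, a single-quoted phrase, or a bare word.
_TOKEN = re.compile(r'"([^"]*)"|\'([^\']*)\'|(\S+)')

def split_query(query: str):
    """Return (words, phrases) found in a search query line."""
    words, phrases = [], []
    for double_q, single_q, word in _TOKEN.findall(query):
        if word:
            words.append(word)
        else:
            # Exactly one of the two quoted groups is non-empty here,
            # unless the user typed an empty pair of quotes.
            phrase = double_q or single_q
            if phrase:
                phrases.append(phrase)
    return words, phrases
```

A word containing an apostrophe, such as `don't`, is kept as a plain word because the bare-word alternative matches it first when scanning starts at the `d`.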
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
cominch
2bb8f045cc content control: use up-to-date definitions 2012-11-13 17:32:19 +01:00
Michael Peter Christen
5fd3b93661 added deletion of hosts during crawl start if deleteold option was given 2012-11-13 16:54:28 +01:00
Michael Peter Christen
d64445c3cb because we have the inurl:<term> search modifier, we don't actually
need regular expressions as search attributes. They have now been removed
from the advanced search page, although they are still created internally.
The filter is then expressed against solr as a regular expression filter
query. If the expression selects a specific protocol, host or filetype,
it is translated into a faceted query.
2012-11-13 11:45:56 +01:00
cominch
a67ff1c8ac SMW Import: replaced JSON import routines with stable ones 2012-11-12 11:17:50 +01:00
cominch
d2a94cc55e refactor package 2012-11-09 16:22:24 +01:00
cominch
05742b4562 remove old SMW importer which was part of the ymarks package 2012-11-09 15:44:59 +01:00
cominch
21df1ad9e0 update and generalization of the SMW import and content control routines 2012-11-09 13:48:40 +01:00
Michael Peter Christen
842faf96a2 fixed media search 2012-11-07 17:27:13 +01:00
Michael Peter Christen
93001586a0 removed warnings, removed too-fast pausing of crawls 2012-11-07 15:37:14 +01:00
Michael Peter Christen
8041742e48 added matching of path to query pattern 2012-11-07 15:06:13 +01:00
Michael Peter Christen
8b1c9cba3d fixed a problem with non-terminating crawls 2012-11-07 15:05:44 +01:00
Michael Peter Christen
61a1d32356 fix to ftp client 2012-11-07 14:58:28 +01:00
Michael Peter Christen
5105256927 update to search result logging (this was a remaining issue from the
solr 4.0.0 migration)
2012-11-07 14:15:27 +01:00
Michael Peter Christen
570e42c4e3 fix for filetype navigator 2012-11-07 13:53:29 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler 2012-11-07 12:52:19 +01:00
Michael Peter Christen
12c0db20e5 fixed npe for surrogate import 2012-11-07 02:46:51 +01:00
Michael Peter Christen
52df6ee369 more logging 2012-11-07 02:04:08 +01:00
Michael Peter Christen
158732af37 automatically delete entries from the crawl profile list if crawl is
terminated.
2012-11-07 02:03:44 +01:00
Michael Peter Christen
15d1460b40 added information about the reason of pausing of crawls 2012-11-06 15:21:56 +01:00
Michael Peter Christen
2371ef031c added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
2012-11-06 14:32:08 +01:00
Michael Peter Christen
b30a7162fa added more thread-renaming for search processes 2012-11-06 12:31:23 +01:00
Michael Peter Christen
900445d8e9 set the thread name during solr queries to the solr query to get better
debugging options
2012-11-06 11:48:04 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
a15819fbec fix for some interface problems 2012-11-05 22:14:52 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
Michael Peter Christen
619bf7e875 fixed filetype modifier for media types in text search 2012-11-05 18:08:00 +01:00
Michael Peter Christen
97f82994a6 automatically pause the crawler if there is a problem with solr 2012-11-05 16:34:42 +01:00
Michael Peter Christen
8fb370d9f8 renovated the way search results are counted; should be correct now... 2012-11-05 03:19:28 +01:00
Michael Peter Christen
7bec253bb0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-11-04 09:21:58 +01:00
Michael Peter Christen
d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-11-04 09:21:21 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which deletes the documents
selected by a crawl filter (host or subpath)
- site crawl uses this option by default now
- added a concurrency option to deleteDomain()
2012-11-04 02:58:26 +01:00
reger
633fbe9188 Fix Metadata handling
- language default on missing lang property to "uk" (fix set to nothing)
- language set to TLD (added call to existing language calculation from TLD)
- coordinate number exception on possible lat/lon content of "NaN,NaN"

adjust Netbeans IDE classpath (for Solr/Lucene 4.0.0 jars)
2012-11-04 02:07:59 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
Michael Peter Christen
b764de424a code cleanup 2012-11-02 10:28:32 +01:00
Michael Peter Christen
9330ad4838 - fixed the delete option in host browser
- added a delete method which can be used to delete a full subpath in
solr.
2012-11-02 01:22:31 +01:00
Michael Peter Christen
a63179f3f9 added the MIME attribute for the R tag in GSA search result writer 2012-11-02 00:14:29 +01:00
Michael Peter Christen
1168d09de8 more refactoring - integrated the code of SnippetProcess into
SearchEvent
2012-11-01 17:40:06 +01:00
Michael Peter Christen
6629e37685 tried to clean up the search process mess 2012-11-01 17:16:43 +01:00
Michael Peter Christen
c5f67a5d6d fixed a problem with local search from solr results: now all results
from solr are shown (again)
2012-11-01 10:22:22 +01:00
Michael Peter Christen
f8f05ecba7 - added a delete button in host browser to delete a complete subpath
- removed storage of default collection name - default is now "user"
- made stacking of crawl start points concurrent
2012-10-31 17:44:45 +01:00
Michael Peter Christen
0716a24737 added more / all new crawl profile fields into crawl profile editor 2012-10-31 15:13:05 +01:00
Michael Peter Christen
4a14122ba7 if a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent
overly long names from making the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
0fe8be7981 enhanced data structures for balancer and latency computation which
should produce a somewhat better prognosis of forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
ac9540dfb6 removed options for stopwords which are not used 2012-10-30 12:36:36 +01:00
Michael Peter Christen
ce3fed8882 added the Google Search Appliance (GSA) api interface to the main menu.
See:
https://developers.google.com/search-appliance/documentation/68/xml_reference#request_overview
2012-10-30 12:27:22 +01:00
Michael Peter Christen
b2ffd49817 less latency 2012-10-30 12:26:32 +01:00
Michael Peter Christen
0833937c1c better balancing and duetime-computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c326aa8f67 disabled writing new entries to crawl stacks to prevent a domain
with many documents from blocking the refresh of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
6905182d41 - fix for number of words log message
- adding meta:refresh also to crawler stack
2012-10-29 21:42:31 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
a94c537afc fixed getSize() which can use the cache size while the crawl is running 2012-10-29 11:56:07 +01:00
Michael Peter Christen
96912c9471 enhancement to solr caching: consider that during a get() the document
is not in solr but the cache points out that a commit is needed to get
the document.
2012-10-29 11:35:24 +01:00
Michael Peter Christen
a87811bc38 more auto-commit calls when a search interface is opened, but not when a
search is done there to prevent blocking during search-time.
2012-10-29 11:27:13 +01:00
Michael Peter Christen
3d3d654e88 if a network configuration is chosen which does not allow DHT and
P2P communication (i.e. is in robinson mode), then some menu entries
which have no use in this mode are disabled.
2012-10-29 01:51:19 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader 2012-10-28 22:48:11 +01:00
Michael Peter Christen
799d71bc67 enhanced solr caching:
- increased cache size which is needed for longer solr commit time
- speed hacks on cache write code
2012-10-28 20:31:29 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
orbiter
8952153ecf update to Balancer algorithm:
- create a load list from the current list of known hosts
- do not create this list for each Balancer.pop access
- create the list from those hosts which have a zero-waiting time
- select 1/3 from that list which have the most urls waiting
- get hosts from the waiting list in random order
- fixes for some delta-time computations
- always load all urls from hosts which have never been loaded before
2012-10-28 13:24:49 +01:00
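The selection steps listed in this commit can be sketched as follows. This is an illustration only, not the Balancer code: `build_load_list`, the `(waiting_ms, queued_urls)` tuple layout, and the reading of "random order" as shuffling the selected third are all assumptions. The rule about hosts that have never been loaded before is left out.

```python
import random

def build_load_list(hosts, rng=None):
    """Pick the hosts to load next.

    hosts maps a host name to (waiting_ms, queued_urls); both fields
    are hypothetical stand-ins for the balancer's per-host state.
    """
    if rng is None:
        rng = random.Random()
    # Step 1: only hosts with zero waiting time are candidates.
    ready = [h for h, (waiting_ms, _) in hosts.items() if waiting_ms <= 0]
    # Step 2: keep the third of them with the most urls waiting
    # (at least one host if any candidate exists).
    ready.sort(key=lambda h: hosts[h][1], reverse=True)
    keep = max(1, len(ready) // 3) if ready else 0
    selected = ready[:keep]
    # Step 3: hand the selected hosts out in random order.
    rng.shuffle(selected)
    return selected
```

Building this list once and reusing it, rather than recomputing it on every pop, corresponds to the second bullet of the commit message.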
orbiter
354f0d9acd moved static method from ClusteredScoreMap to MapDataMining because it
was not used in the ClusteredScoreMap class but only in MapDataMining
2012-10-28 11:29:53 +01:00
reger
722a447b0d - optimize code of augmented parsing to enhance document tags
- commented out augmentedparser.analyse (function not implemented yet)
- adjust init of document title list to always use same list type
2012-10-26 18:50:45 +02:00
Michael Peter Christen
8e1248ffe3 force a commit in advance of a search for the administrator to get most
recent results even if commit time is high and an indexing is ongoing.
2012-10-26 15:35:42 +02:00
Michael Peter Christen
3b48c78190 added an option to force a commit to solr.
may be used by a search front-end in case that the commitWithinMs time
is too short to get recently indexed documents.
2012-10-26 07:39:07 +02:00
sixcooler
2d972f289a raise commitWithinMs to default-value from SwitchBoard
(result in lower hd-io)

no dots in memory-graph (there are too many of them)
2012-10-26 02:12:45 +02:00
orbiter
8fde1dd3b6 another performance and memory hack to graphics: this makes it possible
to produce a 100-Megapixel png network graphic image on my 6 year old
laptop in standard configuration in 10 seconds.
2012-10-25 21:40:27 +02:00
Michael Peter Christen
1baf498d59 - show more lines in online log
- reverse order is default now
2012-10-25 18:38:39 +02:00
Michael Peter Christen
55bdafbaf1 more image processing hacks 2012-10-25 18:20:05 +02:00
Michael Peter Christen
f2d0418218 because the new PngEncoder had a problem with the PixelGrabber which is
caused by a JRE bug, the PixelGrabber had to be circumvented using an
own frame buffer which can be read without a PixelGrabber. This resulted
in ultra-fast and much less memory-consuming transformation. YaCy images
are now generated really fast!
2012-10-25 17:59:20 +02:00
Michael Peter Christen
d5d64019e5 - added a method for the RasterPlotter to draw arrow endings to lines
- replaced the dot in the NetworkGraph with arrows
- enhanced the image drawing speed using pre-computed color values
- added more attention for OOM cases during very large image painting
2012-10-25 16:05:04 +02:00
Michael Peter Christen
85ca07b90e when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
2012-10-25 10:20:55 +02:00
Michael Peter Christen
906e51214a the web structure image shows the pivot dot in a different color 2012-10-25 10:18:28 +02:00
Michael Peter Christen
b3ffcde0c7 - prepared PngEncoder for concurrency: PixelGrabber.grabPixels is the
main time-consuming process. This shall be done in concurrency.
- added concurrent processes to call the PixelGrabber and framework to
do that (queues)

It is now possible to create 4k-Images (3840x2160) i.e. with the Network
Graphics servlet
2012-10-24 02:08:51 +02:00
Michael Peter Christen
e9c6f4ce2e - new order of data computation: first compute the size of
compressed deflater output, then assign an exact-sized byte[] which
makes resizing afterwards superfluous
- after all enhancements all class objects were removed; result is just
one short static method
- made objects final where possible
2012-10-24 00:41:09 +02:00
orbiter
c6a1b21399 added a 9-year old png encoder from David Eisenberg which I rewrote
quite a bit to remove all code that handles transparency. With this
highly specialized png writer it is possible to write png images much
faster than with the JRE built-in png writer.
In a second step it can be possible to add concurrency to increase
computation speed further.
2012-10-23 23:27:41 +02:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
Michael Peter Christen
b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-23 18:14:58 +02:00
Michael Peter Christen
ea11a1efea fix for highlighting in gsa search 2012-10-23 18:11:49 +02:00
Michael Peter Christen
9eaede50e7 enhanced web structure images 2012-10-23 18:11:19 +02:00
Michael Peter Christen
b7ac1da6a3 gsa results shall have only one title in metadata and that should be the
visible title in the <title>-tag
2012-10-23 18:03:12 +02:00
Michael Peter Christen
ae6feb5610 showing the web structure graph as animation in the crawl monitor 2012-10-23 02:50:26 +02:00
reger
87aab9aa7c - fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url
- fix Document.addsubdocuments: sections might be initialized via Arrays.asList, which does not provide the .addAll method that is used
   see e.g. http://kamleshkr.wordpress.com/2010/02/17/inside-java-arrays-aslistt-a/
2012-10-22 22:48:35 +02:00
Michael Peter Christen
39317a6c66 enhanced webstructure image: introduced
- multiple hosts can be listed (comma-separated) as host argument
- new 'bf' attribute (branch factor): the maximum number of edges per
node
- the bf-value is computed automatically
- ordering of nodes when the graphic is drawn: mostly the drawing ends
with a limitation, e.g. the number of nodes. When this happens, it should be
ensured that more 'interesting' nodes are painted first. This is
now done by sorting all nodes by the number of links they have in the
distant sub-graph.
2012-10-22 16:23:39 +02:00
sixcooler
47ae7e322e smaller dhtDispatcher.cloudSize
@Orbiter: we talked about this times ago - please revert if I'm wrong
2012-10-21 20:05:28 +02:00
sixcooler
57ddd63888 do not hold an expensive cache of references for DHT-out, but load them
on demand
see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4530
2012-10-21 20:00:36 +02:00
Michael Peter Christen
ea27d2e5f6 fixed more getSolrFieldName usages 2012-10-18 15:21:05 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unnecessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
b400fc7b4d fix for file parser problem 2012-10-17 18:06:44 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which were then mixed with remote RWIs. Now these Solr documents are
fed into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
6017691522 added an exception catch 2012-10-17 13:56:11 +02:00
Michael Peter Christen
5d16c23a1f specified more URIMetadata as URIMetadataNode 2012-10-16 18:26:21 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
cc98496ff3 enhanced the HostBrowser:
- showing also outbound links to other domains if there are any
- the outbound links browser shows also the link structure image
- showing even inbound links if the web structure graph has information
about that
- removed the left menu and made the HostBrowser a part of the top menu
for search
- moved the file search also to the top menu
- added hover information in the HostBrowser to explain what the click
means
- because the HostBrowser also links to the Metadata viewer ViewFile,
there should be a button to switch back to the HostBrowser: added that
also.
2012-10-16 17:13:18 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
4023d88b0b added date info in parser errors 2012-10-15 10:57:36 +02:00
Michael Peter Christen
1b02408936 use less cache 2012-10-11 14:32:37 +02:00
Michael Peter Christen
e45a3235e0 default cache size was much too high; decreased solr cache size 2012-10-11 12:03:48 +02:00
Michael Peter Christen
613cf7da7f enhancement to post argument parsing - possible fix to zero-filled
parameter values
2012-10-11 10:46:06 +02:00
Michael Peter Christen
36c13ed15b less solr prefetch 2012-10-11 10:17:05 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
53789555b9 fix for crawl start filter 2012-10-10 10:40:32 +02:00
orbiter
68d0f8de03 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-09 20:36:32 +02:00
reger
bfb0d4c69b - add language detection from <html lang="xx"> tag
- add jaudiotagger jar to Netbeans-IDE project classpath
2012-10-09 20:02:58 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes() 2012-10-09 12:14:28 +02:00
Michael Peter Christen
bd769de604 since the solr index is now used for all pages that are indexed locally,
there is no need for the RWI index if the index is not transferred to
another peer. Therefore the creation of RWI index data is now suppressed
if DHT is disabled. This applies for all intranet and portal mode
configurations, but not for public robinson modes. A robinson may switch
back to public mode and then transmit its data. That means if someone
never wants to switch to DHT mode, it would be more appropriate to
choose the portal mode.
2012-10-09 11:48:55 +02:00
Michael Peter Christen
4b5e0c1500 added an url rewriter which can be used to remove session ids from urls 2012-10-09 11:24:48 +02:00
Michael Peter Christen
877042a6b5 fix for portal mode 2012-10-08 14:54:06 +02:00
Michael Peter Christen
76d218fbef fixes to crawl profiles 2012-10-08 10:50:40 +02:00
Michael Peter Christen
2f536cb54d code cleanup: removed unused methods and made more methods and objects
private
2012-10-08 10:50:24 +02:00
Michael Peter Christen
584663ae8c - redesign of solr query construction
- fix for solr boosts and location search
- fix for number of search results in local search
2012-10-07 07:46:55 +02:00
Michael Peter Christen
6ab64746d7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-06 03:35:32 +02:00
Michael Peter Christen
a8167e6e5b clean-up: removed unused methods in kelondro 2012-10-06 03:34:52 +02:00
sof
5cb244b79b Merge remote branch 'origin/master' 2012-10-05 18:54:39 +02:00
apfelmaennchen
88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
on the jaudiotagger library. The parser is disabled by default as it
needs to store temporary files for non file:// protocols, which might be
disliked. For your local MP3-collection it loads nicely Artist,
Title, Album etc. from the audio files meta data.
2012-10-05 18:54:26 +02:00
Michael Peter Christen
28bd3e62b1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-05 00:04:09 +02:00
orbiter
4fed4a86d8 another fix to location search 2012-10-04 22:44:44 +02:00
orbiter
0f7a54452d fix for location search query encoding 2012-10-04 14:46:40 +02:00
Michael Peter Christen
31485a963d refactoring 2012-10-02 21:57:50 +02:00
Michael Peter Christen
f8a3ab2d82 added the usage of synonyms to the GSA search interface 2012-10-02 14:29:45 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
41ab2a2279 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:24:03 +02:00
orbiter
c8b1a693dc ups, added missing class for last commit 2012-10-02 10:23:10 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
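The file format described in this commit can be read with a short parser. The following is a minimal sketch under stated assumptions, not the YaCy implementation: `parse_synonym_lines` and `synonyms_for` are hypothetical names, and whitespace around entries is assumed to be insignificant.

```python
def parse_synonym_lines(lines):
    """Parse synonym file lines into lists of mutually synonymous words."""
    groups = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        # The list may be enclosed in {} like the GSA synonyms file.
        if line.startswith("{") and line.endswith("}"):
            line = line[1:-1]
        words = [w.strip() for w in line.split(",") if w.strip()]
        if words:
            groups.append(words)
    return groups

def synonyms_for(word, groups):
    """Collect every synonym of a word, preserving first-seen order."""
    found = []
    for group in groups:
        if word in group:
            for other in group:
                if other != word and other not in found:
                    found.append(other)
    return found
```

For each word found in a document, the result of `synonyms_for` is what would be appended to the text stored in the synonyms field.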
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
c4a3d8870f fixed computation of links in host browser which are not indexed but
known by the crawler. Such links are now displayed in grey color.
2012-09-29 02:13:11 +02:00
Michael Peter Christen
f45f7fc12e added new Host Browser to main menu:
this new search interface is something completely new for search, but
completely common on desktops: browse a web space like one would browse
a file system in a file browser. The file listing is created using the
search index and a faceted restriction to specific domains.
2012-09-28 22:45:16 +02:00
Michael Peter Christen
8556a3d521 extended solr connector with a method to retrieve a single facet. 2012-09-28 13:50:13 +02:00
Michael Peter Christen
816cb6ce93 another fix for the debian installer: the installer fails because some
classes had unresolved dependencies. This fix removes the dependencies.
2012-09-28 09:00:40 +02:00
Michael Peter Christen
280e36c90b allow Cross-Origin Resource Sharing for all stream servlets, that is the
solr and the gsa search interface. That means that all JavaScript in
browsers now can Cross-Origin access all YaCy search interfaces, which
opens the option of 'YaCy Client in Browser' and 'End-Point Fail-over'
concepts.
2012-09-27 12:02:24 +02:00
Michael Peter Christen
016ffa7434 increased strength of crawling waves in network image 2012-09-26 23:32:13 +02:00
Michael Peter Christen
23f68f2a69 force usage of default faceting mechanisms for search 2012-09-26 18:48:59 +02:00
Michael Peter Christen
24d2ee3c52 - better date ranking
- more protection against NPE and time travel effects
2012-09-26 18:36:32 +02:00
Michael Peter Christen
ca313e404f - if a "/date" modifier is used, the solr remote query applies an
ordering by date (ascending)
- added also some 'anti-timetravel' protection (check if date is in the
future within any metadata date field)
2012-09-26 16:56:33 +02:00
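The check described in the second bullet amounts to comparing every metadata date against the current time. A minimal sketch, assuming timezone-aware dates; `has_future_date` and the field names in the usage note are hypothetical, and what YaCy does with a document that fails the check is not stated in this log.

```python
from datetime import datetime, timezone

def has_future_date(date_fields, now=None):
    """Return True if any metadata date field lies in the future.

    date_fields maps a field name to a timezone-aware datetime or None.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return any(d is not None and d > now for d in date_fields.values())
```

For example, with `now` fixed to 2012-09-26, a document whose hypothetical `last_modified` field reads 2013-01-01 would be flagged.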
Michael Peter Christen
a4214694df We assert that no other metadata storage than solr is used now.
Therefore a property like solrConnected() must be true all the time.
Removal of this method causes removal of all write operations to the old
metadata index.
2012-09-26 16:05:11 +02:00
Michael Peter Christen
0cec7e761a enhanced snippet extractor to find snippets also inside of tokens of an
url
2012-09-26 15:33:37 +02:00
sixcooler
6c50d016ed pdf- and zipParser should not use forced Memory-Limits 2012-09-26 14:03:51 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create a document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is that an index dump can cause data loss if an indexing is running
during index dump
- caught more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
Michael Peter Christen
24f4ca4d85 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-26 12:01:34 +02:00
apfelmaennchen
116f429e35 fix for java.lang.RuntimeException: TableColumnIndex not available... 2012-09-26 09:56:16 +02:00
Michael Peter Christen
5ac61591f3 better abstraction for solr query params 2012-09-25 23:59:30 +02:00
Michael Peter Christen
c913b2ba77 - fix for NPEs during remote solr configuration
- fixed remote solr setting switch
- added more logging
2012-09-25 23:59:09 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
e49359cc95 removed tenant query attribute since it is not used any more and is
replaced by the site-operator in the GSA interface. This operator can
also be simulated in the Solr interface using the collections_sxt field.
2012-09-25 21:09:06 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
fb9460f0a8 using the search filter to drill down search to file types.
A search like "mp3 filetype:mp3" will now maybe surprise you.
2012-09-25 17:52:33 +02:00
Michael Peter Christen
15ea053c3a - added xml output in IndexControlURLs to get the storage page of index
dump commands
- adjusted the apicall.sh script to get the downloaded text as output to
stdout which is necessary to parse the content out of it
- added indexdump.sh script which creates a solr dump and prints out the
storage path for the index dump
- added synchronization to the Fulltext class to prevent data from being
stored to a non-existing solr index while this index is disabled during
the storage of the dump
2012-09-25 00:19:52 +02:00
Michael Peter Christen
1b474139dd used the new zip writer/reader to add a solr dump process: the whole
solr index can be written to a zip dump and also restored during runtime
2012-09-24 17:05:28 +02:00
Michael Peter Christen
4a3e684f8c added a directory-to-zip writer and zip-to-directory reader 2012-09-24 17:04:37 +02:00
Michael Peter Christen
d9ebf4a40f a bit more logging 2012-09-24 15:01:44 +02:00
Michael Peter Christen
5683162bd3 simplifications in DHT Distribution class and more documentation 2012-09-24 12:01:09 +02:00
Michael Peter Christen
e57bf2ca39 simplified DHT classes 2012-09-24 01:04:39 +02:00
orbiter
a053b356ee added new classes to renovate the YaCy protocol based on simple data
structures in cora:
- added the Peer object, which is a fresh version of Seed
- added the Peers object, which is a fresh version of Network
- added the Network api access class to retrieve a list of peers based
on the Network.xml servlet in all YaCy peers.
2012-09-22 11:10:11 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
f879a344e7 fix for no depth limit default value 2012-09-21 16:05:17 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
orbiter
563d584420 removed more dependencies in cora from kelondro 2012-09-21 11:02:36 +02:00
orbiter
aa65282259 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-21 10:27:30 +02:00
orbiter
63762d8f89 removed kelondro dependencies from cora 2012-09-20 19:38:22 +02:00
orbiter
6e0f4557f8 added ftp to getName 2012-09-20 18:29:04 +02:00
cominch
23204d2245 change parameter to support the smw extension for list import 2012-09-20 15:02:57 +02:00
Michael Peter Christen
c235d5c0f1 fixed size parsing in RSS message parser (for YaCy size parameter) 2012-09-19 06:36:07 +02:00
Michael Peter Christen
5bc8f34150 fix for success query counter 2012-09-18 11:06:36 +02:00
orbiter
60b1e23f05 added new crawl options:
- indexUrlMustMatch and indexUrlMustNotMatch which can be used to select
loaded pages for indexing. The default patterns are set so that all
loaded pages are also indexed (as before) but when doing an expert crawl
start, the user may select only specific urls to be indexed.
- crawlerNoDepthLimitMatch is a new pattern that can be used to remove
the crawl depth limitation. This filter is a never-match by default (so
the depth limit applies) but the user can select paths which will
be loaded completely even if the crawl depth is reached.
2012-09-16 21:27:55 +02:00
orbiter
4987921d3d fixed the size() method which also counted failed pages (which are
also stored inside the solr index)
2012-09-16 21:22:56 +02:00
Michael Peter Christen
6ec02deec6 added new crawl attributes in crawl profile (not active yet) 2012-09-14 16:49:29 +02:00
Michael Peter Christen
975bc95ddf added default facet fields for json response format (stub) 2012-09-14 12:09:20 +02:00
Michael Peter Christen
0504b01bdc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-14 00:48:17 +02:00