4263 Commits

Author SHA1 Message Date
Michael Peter Christen
205f8b222b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-11-25 14:41:49 +01:00
orbiter
c54cb85422 added link to
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
to the /RegexTest.html servlet
2012-11-25 12:20:41 +01:00
Michael Peter Christen
b7004043ea - added a field cache for solr queries which call only for a single
value
- fixed a version conflict exception within a solr add request
2012-11-24 22:30:05 +01:00
Michael Peter Christen
bf42179982 introduced more structure in HostBrowser, table view, better counting,
distinguishing of error cases (fail/excluded)
2012-11-23 14:09:48 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
a114bb23bb - using edismax in gsa interface
- generating less field data for gsa search results
- using a boost query in gsa interface to move double content to the end
of the result list
2012-11-22 13:03:33 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally, each time a signature is written, it is
checked if the signature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because this is the first time a long number is used as a value in the Solr
store, the value naming in the YaCySchema also had to be adapted and
normalized. As a result, many files had to be changed.
2012-11-21 18:46:49 +01:00
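The duplicate check described in d6b82840f8 can be illustrated with a minimal sketch. This is not the Nutch/Solr TextProfileSignature algorithm nor YaCy's enhanced version; `text_signature` and `SignatureIndex` are hypothetical names, the token profile is a simplification, and an in-memory set stands in for the cached 'exists' lookups against the index.

```python
import hashlib
import re
from collections import Counter


def text_signature(text):
    """Hypothetical fuzzy signature: hash of a normalized token/frequency profile."""
    tokens = re.findall(r"[a-z0-9]{2,}", text.lower())
    counts = Counter(tokens)
    profile = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    digest = hashlib.md5(profile.encode("utf-8")).digest()
    # Truncate to 8 bytes so the value fits a signed 64-bit "long" field.
    return int.from_bytes(digest[:8], "big", signed=True)


class SignatureIndex:
    """In-memory stand-in for the index; the set replaces the cached 'exists' check."""

    def __init__(self):
        self._seen = set()

    def add(self, doc_id, text):
        sig = text_signature(text)
        unique = sig not in self._seen  # the first document with this signature is unique
        self._seen.add(sig)
        return {"id": doc_id, "signature": sig, "unique": unique}


def duplicates_last(docs):
    """Stable sort that moves non-unique documents to the end of a result list."""
    return sorted(docs, key=lambda d: not d["unique"])
```

Because the sort is stable, unique documents keep their relative order and duplicates are only moved behind them.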
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during acquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' 2012-11-18 22:11:04 +01:00
Michael Peter Christen
952e143580 FINALLY YaCy can now search for full strings using double- or
singlequoted strings in the search query line!!!
2012-11-18 16:03:34 +01:00
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
Michael Peter Christen
5fd3b93661 added deletion of hosts during crawl start if deleteold option was given 2012-11-13 16:54:28 +01:00
Michael Peter Christen
d64445c3cb because we have the inurl:<term> - searchmodifier, we don't actually
need regular expressions as search attributes. They had now been removed
from the advanced search page while they are still created internally.
The filter is then expressed against solr as regular expression filter
query. If the expression points out a selection of a specific protocol,
host or filetype this is then translated into a faceted query.
2012-11-13 11:45:56 +01:00
orbiter
b55ea2197f - redesign of crawl start servlet
- for domain-limited crawls, the domain is deleted now by default before
the crawl is started
2012-11-13 10:54:21 +01:00
orbiter
1c66de4bd4 - removed scheduled crawling options in crawl start because it is
superfluous there; it can be changed in the scheduler servlet. It's also
confusing in the presence of the delete-option, which will be
implemented next.
- removed unused crawl start servlet
- some refactoring to make the time parser reusable
2012-11-12 11:19:39 +01:00
Michael Peter Christen
2e7219f9fd removed highlighting of search results within collections in GSA
interface
2012-11-09 16:25:24 +01:00
Michael Peter Christen
074dfd297b added icons and a selection for hosts with urls pending for crawler or
with errors
2012-11-09 16:24:56 +01:00
cominch
21df1ad9e0 update and generalization of the SMW import and content control routines 2012-11-09 13:48:40 +01:00
Michael Peter Christen
4c4e0eece2 added new submenu 'Target Analysis' with three servlets which are useful
to analyse the target servers: robots.txt table, mass target analysis
and a regex tester
2012-11-07 21:26:01 +01:00
Michael Peter Christen
61995d508e do the commit anyway before calling a search interface 2012-11-07 17:27:50 +01:00
Michael Peter Christen
86ec199126 using a better file name 2012-11-07 16:39:49 +01:00
Michael Peter Christen
5105256927 update to search result logging (this was a remaining issue from the
solr 4.0.0 migration)
2012-11-07 14:15:27 +01:00
Michael Peter Christen
570e42c4e3 fix for filetype navigator 2012-11-07 13:53:29 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler 2012-11-07 12:52:19 +01:00
Michael Peter Christen
29fbbb49dc better colors for host browser and corrected document count 2012-11-07 12:23:21 +01:00
Michael Peter Christen
6244b084cd fixed wrong order of result count values 2012-11-07 02:29:33 +01:00
Michael Peter Christen
631b08e7e2 update to HostBrowser 2012-11-07 02:17:24 +01:00
Michael Peter Christen
51f420e4f5 removed location search because it is only working in special cases 2012-11-07 02:04:41 +01:00
Michael Peter Christen
15d1460b40 added information about the reason of pausing of crawls 2012-11-06 15:21:56 +01:00
Michael Peter Christen
2371ef031c added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
2012-11-06 14:32:08 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
a15819fbec fix for some interface problems 2012-11-05 22:14:52 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
Michael Peter Christen
c6a6f4c4e6 added a hack which makes the HostBrowser more performant when the given
host has a lot of urls. If the number of urls is > 1000, then the list
of documents is restricted to such which have no subpath, if the root
path is selected. However, this can cause a problem if no documents on
the root path exist but only on paths below that root path.
2012-11-05 18:57:21 +01:00
Michael Peter Christen
64ac2b7b7d new submenu template 2012-11-05 15:36:42 +01:00
Michael Peter Christen
5e77801aac update to web interface structure 2012-11-05 15:23:03 +01:00
Michael Peter Christen
8fb370d9f8 renovated the way search results are counted. should be correct now... 2012-11-05 03:19:28 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which causes that documents are
deleted which are selected by a crawl filter (host or subpath)
- site crawl uses this option by default now
- made option to deleteDomain() concurrency
2012-11-04 02:58:26 +01:00
Michael Peter Christen
19d1f474ce host browser now shows also number of pending files per subdirectory +
bugfixes
2012-11-02 14:40:02 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
Michael Peter Christen
9330ad4838 - fixed the delete option in host browser
- added a delete method which can be used to delete a full subpath in
solr.
2012-11-02 01:22:31 +01:00
Michael Peter Christen
40df2fd193 added the host browser as link to search results. that means you can
select a browsing position after a search is done on the search results.
2012-11-01 21:38:05 +01:00
Michael Peter Christen
1168d09de8 more refactoring - integrated the code of SnippetProcess into
SearchEvent
2012-11-01 17:40:06 +01:00
Michael Peter Christen
6629e37685 tried to clean up the search process mess 2012-11-01 17:16:43 +01:00
Michael Peter Christen
c5f67a5d6d fixed a problem with local search from solr results: now all results
from solr are shown (again)
2012-11-01 10:22:22 +01:00
Michael Peter Christen
f8f05ecba7 - added a delete button in host browser to delete a complete subpath
- removed storage of default collection name - default is now "user"
- made stacking of crawl start points concurrent
2012-10-31 17:44:45 +01:00
Michael Peter Christen
0716a24737 added more / all new crawl profile fields into crawl profile editor 2012-10-31 15:13:05 +01:00
Michael Peter Christen
4a14122ba7 in case that a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent that
much too long names make the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
0fe8be7981 enhanced data structures for balancer and latency computation which
should produce a bit better prognosis about forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
ac9540dfb6 removed options for stopwords which are not used 2012-10-30 12:36:36 +01:00
Michael Peter Christen
ce3fed8882 added the Google Search Appliance (GSA) api interface to the main menu.
See:
https://developers.google.com/search-appliance/documentation/68/xml_reference#request_overview
2012-10-30 12:27:22 +01:00
Michael Peter Christen
0833937c1c better balancing and duetime-computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
a87811bc38 more auto-commit calls when a search interface is opened, but not when a
search is done there to prevent blocking during search-time.
2012-10-29 11:27:13 +01:00
Michael Peter Christen
3d3d654e88 if a network configuration is chosen which does not allow DHT and no
P2P communication (i.e. is in robinson mode) then some menu entries are
disabled which have no use in this mode.
2012-10-29 01:51:19 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader 2012-10-28 22:48:11 +01:00
Michael Peter Christen
799d71bc67 enhanced solr caching:
- increased cache size which is needed for longer solr commit time
- speed hacks on cache write code
2012-10-28 20:31:29 +01:00
orbiter
8952153ecf update to Balancer algorithm:
- create a load list from the current list of known hosts
- do not create this list for each Balancer.pop access
- create the list from those hosts which have a zero-waiting time
- select 1/3 from that list which have the most urls waiting
- get hosts from the waiting list in random order
- fixes for some delta-time computations
- always load all urls from hosts which have never been loaded before
2012-10-28 13:24:49 +01:00
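The host selection steps listed in 8952153ecf can be sketched as follows. This is an illustration only, not the Balancer code: `build_load_list` and the `(waiting_ms, pending_urls)` tuple layout are assumptions, and the special handling of hosts that have never been loaded is left out.

```python
import random


def build_load_list(hosts, rng=None):
    """Build a load list from known hosts.

    hosts maps a host name to (waiting_ms, pending_urls); this layout is hypothetical.
    """
    rng = rng or random.Random()
    # 1. keep only hosts with zero waiting time that still have urls queued
    ready = [h for h, (wait, pending) in hosts.items() if wait <= 0 and pending > 0]
    # 2. of those, keep the third with the most urls waiting (at least one host)
    ready.sort(key=lambda h: hosts[h][1], reverse=True)
    keep = max(1, len(ready) // 3) if ready else 0
    selected = ready[:keep]
    # 3. hand the selected hosts out in random order
    rng.shuffle(selected)
    return selected
```

The list would be built once and then consumed over several pop calls, instead of being recomputed on every access.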
Michael Peter Christen
8e1248ffe3 force a commit in advance of a search for the administrator to get most
recent results even if commit time is high and an indexing is ongoing.
2012-10-26 15:35:42 +02:00
Michael Peter Christen
1baf498d59 - show more lines in online log
- reverse order is default now
2012-10-25 18:38:39 +02:00
Michael Peter Christen
f2d0418218 because the new PngEncoder had a problem with the PixelGrabber which is
caused by a JRE bug, the PixelGrabber had to be circumvented using an
own frame buffer which can be read without a PixelGrabber. This resulted
in ultra-fast and much less memory-consuming transformation. YaCy images
are now generated really fast!
2012-10-25 17:59:20 +02:00
Michael Peter Christen
d5d64019e5 - added a method for the RasterPlotter to draw arrow endings to lines
- replaced the dot in the NetworkGraph with arrows
- enhanced the image drawing speed using pre-computed color values
- added more attention for OOM cases during very large image painting
2012-10-25 16:05:04 +02:00
Michael Peter Christen
342543a6c4 fix for host browser 2012-10-25 10:23:43 +02:00
Michael Peter Christen
85ca07b90e when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
2012-10-25 10:20:55 +02:00
Michael Peter Christen
906e51214a the web structure image shows the pivot dot in a different color 2012-10-25 10:18:28 +02:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
orbiter
59bf4677b6 added option to view the complete directory structure in host browser 2012-10-23 19:02:55 +02:00
Michael Peter Christen
b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-23 18:14:58 +02:00
Michael Peter Christen
9eaede50e7 enhanced web structure images 2012-10-23 18:11:19 +02:00
Michael Peter Christen
ae6feb5610 showing the web structure graph as animation in the crawl monitor 2012-10-23 02:50:26 +02:00
Michael Peter Christen
39317a6c66 enhanced webstructure image: introduced
- multiple hosts can be listed (comma-separated) as host argument
- new 'bf'-attribute (branch factor): the maximum number of edges per
node
- the bf-value is computed automatically
- ordering of nodes when the graphic is drawn: mostly the drawing ends
with a limitation, e.g. the number of nodes. When this happens, it should be
ensured that more 'interesting' nodes are painted in advance. This is
now done by sorting all nodes by the number of links they have in the
distant sub-graph.
2012-10-22 16:23:39 +02:00
sixcooler
57ddd63888 do not hold an expensive cache of references for DHT-out, but load them
on demand
see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4530
2012-10-21 20:00:36 +02:00
reger
1dc6482feb format crawler timeout output string in seconds (was days) 2012-10-21 03:00:05 +02:00
Michael Peter Christen
ef937af35d more custom field usage in gsa search result 2012-10-18 15:26:55 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unnecessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
7f71dfab03 added a HostBrowser.xml api file and changed a bit of attribute naming 2012-10-18 11:42:13 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which had been then mixed with remote RWIs. Now these Solr documents are
fed into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
5d16c23a1f specified more URIMetadata as URIMetadataNode 2012-10-16 18:26:21 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
cc98496ff3 enhanced the HostBrowser:
- showing also outbound links to other domains if there are any
- the outbound links browser shows also the link structure image
- showing even inbound links if the web structure graph has information
about that
- removed the left menu and made the HostBrowser a part of the top menu
for search
- moved the file search also to the top menu
- added hover information in the HostBrowser to explain what the click
means
- because the HostBrowser also links to the Metadata viewer ViewFile,
there should be a button to switch back to the HostBrowser: added that
also.
2012-10-16 17:13:18 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
4023d88b0b added date info in parser errors 2012-10-15 10:57:36 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
53789555b9 fix for crawl start filter 2012-10-10 10:40:32 +02:00
Michael Peter Christen
abebb3b124 added a crawl start checker which makes a simple analysis on the list of
all given urls: shows if the url can be loaded and if there is a robots
and/or a sitemap.
2012-10-10 02:02:17 +02:00
Michael Peter Christen
941873fba4 moved the index deletion functions from IndexControlRWIs to
IndexControlURLs where it appears more naturally. Because the RWI
administration is less important in the presence of Solr, the
IndexControlURL is now the default servlet when the Index Administration
button on the main menu is selected.
2012-10-10 00:09:27 +02:00
orbiter
ae246c30c3 fixed interpretation of directDocByURL attribute during crawl start 2012-10-09 23:11:31 +02:00
Michael Peter Christen
a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes() 2012-10-09 12:14:28 +02:00
Michael Peter Christen
bd769de604 since the solr index is now used for all pages that are indexed locally,
there is no need for the RWI index if the index is not transferred to
another peer. Therefore the creation of RWI index data is now suppressed
if DHT is disabled. This applies for all intranet and portal mode
configurations, but not for public robinson modes. A robinson may switch
back to public mode and then transmit its data. That means if someone
never wants to switch to DHT mode, it would be more appropriate to
choose the portal mode.
2012-10-09 11:48:55 +02:00
Michael Peter Christen
554db5608b fix for ViewFile 2012-10-09 11:25:05 +02:00
orbiter
9190599d21 use links in AccessTracker 2012-10-08 19:47:14 +02:00
Michael Peter Christen
42e525ca9a enhanced the host browser 2012-10-08 14:00:14 +02:00
Michael Peter Christen
76d218fbef fixes to crawl profiles 2012-10-08 10:50:40 +02:00
Michael Peter Christen
2f536cb54d code cleanup: removed unused methods and made more methods and objects
private
2012-10-08 10:50:24 +02:00
Michael Peter Christen
406e1f3e7e added an option to start indexing right from the host browser 2012-10-02 21:18:27 +02:00
Michael Peter Christen
f8a3ab2d82 added the usage of synonyms to the GSA search interface 2012-10-02 14:29:45 +02:00
orbiter
be4c96f3b1 The HostBrowser now offers to index files that are discovered because
they are linked in the web interface.
2012-09-30 13:23:06 +02:00
Michael Peter Christen
c4a3d8870f fixed computation of links in host browser which are not indexed but
known by the crawler. Such links are now displayed in grey color.
2012-09-29 02:13:11 +02:00
Michael Peter Christen
97a47319c8 added nice links to the host browser:
- click on the file icon to get the metadata of the file
- click on the link icon behind the link to open the original file in
the browser
2012-09-28 23:09:21 +02:00
Michael Peter Christen
f45f7fc12e added new Host Browser to main menu:
this new search interface is something completely new for search, but
completely common on desktops: browse a web space like one would browse
a file system in a file browser. The file listing is created using the
search index and a faceted restriction to specific domains.
2012-09-28 22:45:16 +02:00
Michael Peter Christen
280e36c90b allow Cross-Origin Resource Sharing for all stream servlets, that is the
solr and the gsa search interface. That means that all JavaScript in
browsers now can Cross-Origin access all YaCy search interfaces, which
opens the option of 'YaCy Client in Browser' and 'End-Point Fail-over'
concepts.
2012-09-27 12:02:24 +02:00
Michael Peter Christen
ccd65ecf8d fixed url search in IndexControlURLs_p.html / using now the solr
interface
2012-09-27 00:31:59 +02:00
Michael Peter Christen
24d2ee3c52 - better date ranking
- more protection against NPE and time travel effects
2012-09-26 18:36:32 +02:00
Michael Peter Christen
a4214694df We assert that no other metadata storage than solr is used now.
Therefore a property like solrConnected() must be true all the time.
Removal of this method causes removal of all write operations to the old
metadata index.
2012-09-26 16:05:11 +02:00
Michael Peter Christen
abab291162 made the index schema retrieval public and allow cross-domain retrieval 2012-09-26 15:44:50 +02:00
sixcooler
c65b576a6f added filename for missing crawlname when crawling from file 2012-09-26 14:05:33 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create a document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is that an index dump can cause data loss if an indexing is running
during index dump
- caught more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
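The rows=0 remark in 562183932b refers to a standard Solr behavior: a select request with `rows=0` returns no documents but still reports the total hit count in `response.numFound`. A minimal sketch, assuming a hypothetical base URL and the JSON response writer; the helper names are illustrative only:

```python
from urllib.parse import urlencode


def count_query_url(base_url, query="*:*"):
    """Build a select URL that asks only for the hit count (no documents)."""
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def extract_count(response_json):
    """Read the total hit count from a parsed Solr JSON response."""
    return response_json["response"]["numFound"]


# The base URL below is an assumption for illustration, not a documented endpoint.
url = count_query_url("http://localhost:8090/solr")
```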
Michael Peter Christen
24f4ca4d85 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-26 12:01:34 +02:00
apfelmaennchen
7efe9eb37b adding CORS access header for Network.xml to overcome cross domain
restriction (e.g. necessary to build a JavaScript YaCy
client).
2012-09-26 10:36:09 +02:00
Michael Peter Christen
c913b2ba77 - fix for NPEs during remote solr configuration
- fixed remote solr setting switch
- added more logging
2012-09-25 23:59:09 +02:00
Michael Peter Christen
882d54067a added dummy update servlet 2012-09-25 23:09:32 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
e49359cc95 removed tenant query attribute since it is not used any more and is
replaced by the site-operator in the GSA interface. This operator can
also be simulated in the Solr interface using the collections_sxt field.
2012-09-25 21:09:06 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
15ea053c3a - added xml output in IndexControlURLs to get the storage path of index
dump commands
- adjusted the apicall.sh script to get the downloaded text as output to
stdout which is necessary to parse the content out of it
- added indexdump.sh script which creates a solr dump and prints out the
storage path for the index dump
- added synchronization to the Fulltext class to prevent that data is
stored to a non-existing solr index while this index is disabled during
the storage of the dump
2012-09-25 00:19:52 +02:00
Michael Peter Christen
1b474139dd used the new zip writer/reader to add a solr dump process: the whole
solr index can be written to a zip dump and also restored during runtime
2012-09-24 17:05:28 +02:00
Michael Peter Christen
e57bf2ca39 simplified DHT classes 2012-09-24 01:04:39 +02:00
orbiter
14897d4bfc fixed mistake in wt-option which caused that the yacy json format
overlapped the solr built-in json format
2012-09-21 21:38:50 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
fa7f6f0be8 added HostBrowser servlet (stub) 2012-09-21 15:48:40 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
orbiter
563d584420 removed more dependencies in cora from kelondro 2012-09-21 11:02:36 +02:00
orbiter
63762d8f89 removed kelondro dependencies from cora 2012-09-20 19:38:22 +02:00
orbiter
089a03114e full memory usage for debian and when changing the size: debian seems to
dislike the big difference between xmx and xms (I have crashes here
which stop if both values are same)
2012-09-18 22:31:01 +02:00
orbiter
60b1e23f05 added new crawl options:
- indexUrlMustMatch and indexUrlMustNotMatch which can be used to select
loaded pages for indexing. Default patterns are in such a way that all
loaded pages are also indexed (as before) but when doing an expert crawl
start, then the user may select only specific urls to be indexed.
- crawlerNoDepthLimitMatch is a new pattern that can be used to remove
the crawl depth limitation. This filter is a never-match by default (which
causes that the depth is used) but the user can select paths which will
be loaded completely even if a crawl depth is reached.
2012-09-16 21:27:55 +02:00
Michael Peter Christen
6ec02deec6 added new crawl attributes in crawl profile (not active yet) 2012-09-14 16:49:29 +02:00
Michael Peter Christen
a13e5153ac - added the possibility to have not one but a list of crawl start urls
- the list of urls is entered in the expert crawl start in a textfield;
the one-line input field was replaced with a text box
- start urls can also be given in one single line where the urls are
separated by a '|'-character
- as an effect, the crawl profile cannot carry a single start url for
identification because it is possible to have more. Therefore the url was
removed from the crawl profile
- this affects all servlets which display a crawl profile: removed the
url field from all these servlets
- to work consistently with several start urls and the other crawl
starts which computed crawl start url lists from sitelists or sitemaps,
the crawl start servlet was restructured completely
- new rules for must-match patterns were created to make it possible
that site crawl starts also work with several crawl starts at once
2012-09-14 12:25:46 +02:00
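The multi-URL crawl start described in a13e5153ac can be sketched as follows. Both functions are hypothetical illustrations: `parse_start_urls` accepts urls separated by line breaks or by '|', and `site_must_match` shows one possible way to derive a single must-match pattern covering several start hosts; the pattern rules actually introduced by the commit are not given in the log.

```python
import re
from urllib.parse import urlsplit


def parse_start_urls(raw):
    """Split a text box value into start urls (newline- or '|'-separated), dropping repeats."""
    urls = []
    for part in re.split(r"[|\r\n]+", raw):
        url = part.strip()
        if url and url not in urls:
            urls.append(url)
    return urls


def site_must_match(urls):
    """Hypothetical must-match regex that accepts any url on one of the start hosts."""
    hosts = sorted({urlsplit(u).hostname for u in urls if urlsplit(u).hostname})
    alternatives = "|".join(re.escape(h) for h in hosts)
    return r"https?://(" + alternatives + r")(:\d+)?(/.*)?"
```

For example, `parse_start_urls("http://a.example/ | http://b.example/x")` yields two start urls, and the derived pattern then matches pages on either host.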
Michael Peter Christen
975bc95ddf added default facet fields for json response format (stub) 2012-09-14 12:09:20 +02:00
Michael Peter Christen
2f218df55d added missing license headers 2012-09-14 12:06:06 +02:00
Michael Peter Christen
a30653a864 added a regular expression test servlet which is linked within the
parser/crawler error page whenever a problem with regular expression
occurs.
This makes it easy to correct and enhance the must-match and
must-not-match patterns just by trying out which pattern could be
correct.
2012-09-14 12:04:54 +02:00
Michael Peter Christen
0504b01bdc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-14 00:48:17 +02:00
orbiter
a55e77a115 added twitter search heuristic 2012-09-13 23:53:53 +02:00
Michael Peter Christen
e54ac38095 - some corrections in usage of getFile() and getFileName()
- added more attributes in json response writer according to yacy
servlet
2012-09-11 23:28:21 +02:00
Michael Peter Christen
9644c186a4 added search functionality to ViewFile.html servlet 2012-09-11 02:03:14 +02:00
Michael Peter Christen
b69ed96f0b - added collections to yacydoc
- changed yacydoc.htm to yacydoc.json
- added query logging in solr and gsa search result
2012-09-10 15:20:55 +02:00
Michael Peter Christen
5df553c152 - added a json writer for solr (yes there was one using xslt but this
one writes the same way as yacysearch.json)
- using the new json solr result to change the ajax search in
IndexControlURLs to the new solr search
2012-09-10 14:30:44 +02:00
Michael Peter Christen
4d29f59a27 removed warnings 2012-09-10 07:15:52 +02:00
Michael Peter Christen
8c099d2106 Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/api/ymarks/import_ymark.java
	source/de/anomic/data/ymark/YMarkEntry.java
	source/de/anomic/data/ymark/YMarkTables.java
2012-09-10 07:05:20 +02:00
apfelmaennchen
59bd478ed1 Added more sophisticated RDF output for YMarks, including the folder
structure (b:Topic) and support for multiple tags (dc:subject) and
folders (b:hasTopic) via rdf:Bag container.
2012-09-09 22:56:24 +02:00
apfelmaennchen
d31a632951 - added dmoz RDF dump importer
- added indexing to Tables columns to support larger bookmark
collections
- added RDF output (HTTP) for public bookmarks at /YMarks.rdf
- YMarkRDF also provides a Jena RDF Model as "internal" API
- various other changes/fixes for YMarks (mainly backend)
2012-09-09 09:53:58 +02:00
orbiter
66ac4076c2 added disjunction '|' option to site parameter in GSA API 2012-09-06 22:35:55 +02:00
sixcooler
9ee2e09983 statistics for solr-cache 2012-09-06 22:02:29 +02:00
Michael Peter Christen
d8425e6809 added collections to crawl monitor 2012-09-04 14:47:53 +02:00
Michael Peter Christen
4b36a2c3b4 small style changes 2012-09-04 11:23:41 +02:00
Michael Peter Christen
8ca842b137 added new button design to more buttons 2012-09-03 16:04:57 +02:00
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated, all of
them are assigned to the documents when they are indexed as result of
such a crawl start
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and also in the gsa
interface using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field all the time)
2012-09-03 15:26:08 +02:00
Michael Peter Christen
174530a9e0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-03 00:46:17 +02:00
apfelmaennchen
43f3a932fd removed jquery.slider as it is already included as part of jquery-ui
package
2012-09-01 14:17:20 +02:00
apfelmaennchen
a01eb1b7fe removed unused jquery plugin slider as it is part of jquery-ui package 2012-09-01 10:25:22 +02:00
Michael Peter Christen
f75b3f8a47 added more patches to work without RWI data structure 2012-08-31 14:35:56 +02:00
Michael Peter Christen
a427a68bac removed many warnings 2012-08-31 14:07:33 +02:00
Michael Peter Christen
c72c435517 - moved the gsa search interface from /gsa/searchresult? to /gsa/search?
- fixed the NB field data
2012-08-31 14:00:53 +02:00
Michael Peter Christen
31d4d38804 - extended the solr interface by a references-by-word-count method
- reduced danger that a non-existing RWI database causes NPEs
- added Solr queries to did-you-mean: this makes it possible that our
did-you-mean algorithm works together with only Solr and without RWIs
2012-08-31 13:03:00 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
Michael Peter Christen
3142e675e8 fixed problems with GSA api:
- better FS attribute
- highlighting of searched words in title
2012-08-29 16:48:53 +02:00
Michael Peter Christen
3b19fe7b52 - fixed num parameter in GSA api
- changed FS attribute in GSA api
2012-08-29 16:28:32 +02:00
Michael Peter Christen
75d5e3475d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-29 10:13:51 +02:00
cominch
dc468dad01 add content control features for custom filter lists 2012-08-29 09:04:28 +02:00
Michael Peter Christen
316b5fe116 - added a solr type definition verifier
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
2012-08-28 16:58:06 +02:00
reger
2d2be546fe fix path to env/grafics to display api icon on meta data page 2012-08-26 04:36:52 +02:00
orbiter
7ac259477f added direct access to the solr search api to enhance the visibility of
the embedded solr
2012-08-24 23:04:19 +02:00
orbiter
67f2866cd0 small fixes 2012-08-24 21:44:22 +02:00
orbiter
479bfca571 refactoring 2012-08-23 09:30:11 +02:00
Michael Peter Christen
48a82bc705 log queries anonymously from gsa+solr requests 2012-08-22 23:50:40 +02:00
Michael Peter Christen
ab6ec4ec52 added snippet computation to solr/rss and gsa result writer 2012-08-22 17:37:34 +02:00
Michael Peter Christen
4716546ef5 - reduced memory usage in index transmission using a transformation of
Node to Row objects
- removed peerDeparture in solr remote search in case that peer does not
answer (this may be normal because it is allowed to switch this off)
2012-08-22 16:30:33 +02:00
Michael Peter Christen
0ad52ac4c3 gsa bugfix for date parser 2012-08-21 02:39:28 +02:00
Michael Peter Christen
3ce4c2f937 fixes for gsa result format 2012-08-21 01:57:46 +02:00
Michael Peter Christen
2d5fdfeb65 added authorization-based maximum results limitation to solr and gsa
search
2012-08-20 17:10:48 +02:00
Michael Peter Christen
6fc5400f91 added a tooltip for search navigation to mention that search pages can
be navigated using the TAB key
2012-08-20 13:02:29 +02:00
Michael Peter Christen
a06123aec6 more abstraction and less parameter overhead for remote search 2012-08-20 01:29:15 +02:00
Michael Peter Christen
f00733186b code simplifications 2012-08-19 13:17:03 +02:00
orbiter
780f8974e7 added remaining iteration methods for solr in fulltext class 2012-08-18 15:39:14 +02:00
orbiter
6f01542aaa explicit double-check in transferURL 2012-08-18 13:18:51 +02:00
Michael Peter Christen
d54b80327a refactoring 2012-08-17 17:28:27 +02:00
Michael Peter Christen
0cab06c47c refactoring 2012-08-17 15:52:33 +02:00
Michael Peter Christen
40c0856489 refactoring 2012-08-17 15:33:02 +02:00
Michael Peter Christen
e651d3e320 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-17 14:45:18 +02:00
Michael Peter Christen
06a78eecb7 code simplification 2012-08-17 14:43:32 +02:00
cominch
8a91f4fa42 local robots.txt: disallow external crawlers from following the URL proxy 2012-08-17 11:47:39 +02:00
Michael Peter Christen
18f989dfb1 - refactoring (load -> getMetadata)
- added getDocument to retrieve Solr documents which shall replace
getMetadata
2012-08-17 01:34:38 +02:00
Michael Peter Christen
6197caf698 added clear-text search words in query params 2012-08-16 23:05:37 +02:00
Michael Peter Christen
23226676c6 FOR THE BRAVE.. this is a forced migration to solr which is now ready
for production as a replacement of the metadata-db.
This intermediate release 1.041 will switch on the previously optional
solr index and the old metadata-db will still work as it did before.
Solr+metadata are accessed in mixed mode, no migration is done yet.
If this does not cause a catastrophe by the end of the weekend, we will
do a YaCy 1.1 main release containing this as default.
2012-08-16 18:17:47 +02:00
Michael Peter Christen
7c31be1c80 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-16 17:45:26 +02:00
cominch
6456a1656a changed local robots.txt to prevent external crawlers from submitting random
search queries
2012-08-16 17:38:10 +02:00
Michael Peter Christen
703f427303 fixed some peer-ping connection details
- larger time-out
- removed too old seedlist
- fixed a bug in connection test
2012-08-16 17:11:54 +02:00
Michael Peter Christen
597bb76e4f get the peer location more quickly 2012-08-16 16:28:57 +02:00
orbiter
156d457aec fix for Index out of bounds exception in Network servlet 2012-08-16 07:47:52 +02:00
Lotus
ae9cd7a118 fix xss bug #204 2012-08-15 14:23:21 +02:00
Michael Peter Christen
d988ba50cf added a very rudimentary, incomplete, non-verified GSA response writer
for solr. Try this:
http://localhost:8090/gsa/searchresult?q=pdf&site=col1&num=10
2012-08-14 12:40:26 +02:00
Michael Peter Christen
aab0b680c3 - added xslt support for solr result formats.
try e.g.
http://localhost:8090/solr/select?q=*:*&start=0&rows=10&wt=xslt&tr=json.xsl
- added servlet-side mime-type configuration for streamed servlets. this
is used for the result formatters in solr result formats
2012-08-14 11:12:50 +02:00
cominch
ad62609ec7 added a possibility to define a custom network definition URL for remote
management
2012-08-13 16:57:53 +02:00
cominch
fb0f430685 Merge remote-tracking branch 'original yacy/master' 2012-08-13 16:48:14 +02:00
Michael Peter Christen
b51df6c7e8 - added coordinate storage in solr schema
- fixed shutdown process
- fixed some solr-to-metadata reading
- added a large number of metadata attributes in ViewFile.html
2012-08-13 10:40:04 +02:00
orbiter
9b88433f45 patch from hint in
http://forum.yacy-websuche.de/viewtopic.php?p=26858#p26858
from gaston
2012-08-10 15:44:37 +02:00
orbiter
e816b88b55 changed behaviour of metadata storage: if any solr is
attached, the metadata is written to solr instead of the metadata-db,
even if the metadata-db is enabled. This prevents metadata from being
written to two storage systems at the same time. It is also the next
step in migrating the current metadata-db to solr.
2012-08-10 15:39:10 +02:00
Michael Peter Christen
f9c0e6e950 - Implemented and integrated the URIMetadataNode object which is a
metadata representation from the solr index. This shall replace metadata
from the built-in database in the future.
- added the Solr-driven metadata into the search index of YaCy which
makes it now possible to run YaCy without the old metadata index. This
is a major step forward to a full migration to Solr.
2012-08-10 13:26:51 +02:00
Michael Peter Christen
b2b480fff2 more abstraction of the YaCySchema -> Opensearch matching process 2012-08-10 09:48:15 +02:00