yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	eca68fa197	added debug code to crawler monitor	2012-11-25 15:43:42 +01:00
Michael Peter Christen	205f8b222b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-11-25 14:41:49 +01:00
orbiter	c54cb85422	added link to http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html to the /RegexTest.html servlet	2012-11-25 12:20:41 +01:00
Michael Peter Christen	b7004043ea	- added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request	2012-11-24 22:30:05 +01:00
Michael Peter Christen	bf42179982	introduced more structure in HostBrowser, table view, better counting, distinguishing of error cases (fail/excluded)	2012-11-23 14:09:48 +01:00
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	2012-11-23 01:35:28 +01:00
Michael Peter Christen	a114bb23bb	- using edismax in gsa interface - generating less field data for gsa search results - using a boost query in gsa interface to move double content to the end of the result list	2012-11-22 13:03:33 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally, each time a signature is written, it is checked whether the signature already exists in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because this is the first time a long number is used as a value in the Solr store, the value naming in the YaCySchema also had to be adapted and normalized. As a consequence, many files had to be changed.	2012-11-21 18:46:49 +01:00
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	2012-11-19 17:24:34 +01:00
Michael Peter Christen	46be4af5b9	Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890'	2012-11-18 22:11:04 +01:00
Michael Peter Christen	952e143580	FINALLY YaCy can now search for full strings using double- or singlequoted strings in the search query line!!!	2012-11-18 16:03:34 +01:00
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	2012-11-18 01:22:41 +01:00
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	2012-11-13 16:54:28 +01:00
Michael Peter Christen	d64445c3cb	because we have the inurl:<term> - searchmodifier, we don't actually need regular expressions as search attributes. They have now been removed from the advanced search page while they are still created internally. The filter is then expressed against solr as a regular expression filter query. If the expression selects a specific protocol, host or filetype, it is translated into a faceted query.	2012-11-13 11:45:56 +01:00
orbiter	b55ea2197f	- redesign of crawl start servlet - for domain-limited crawls, the domain is deleted now by default before the crawl is started	2012-11-13 10:54:21 +01:00
orbiter	1c66de4bd4	- removed scheduled crawling options in crawl start because it is superfluous there; it can be changed in the scheduler servlet. It's also confusing in the presence of the delete-option, which will be implemented next. - removed unused crawl start servlet - some refactoring to make the time parser reusable	2012-11-12 11:19:39 +01:00
Michael Peter Christen	2e7219f9fd	removed highlighting of search results within collections in GSA interface	2012-11-09 16:25:24 +01:00
Michael Peter Christen	074dfd297b	added icons and a selection for hosts with urls pending for crawler or with errors	2012-11-09 16:24:56 +01:00
cominch	21df1ad9e0	update and generalization of the SMW import and content control routines	2012-11-09 13:48:40 +01:00
Michael Peter Christen	4c4e0eece2	added new submenu 'Target Analysis' with three servlets which are useful to analyse the target servers: robots.txt table, mass target analysis and a regex tester	2012-11-07 21:26:01 +01:00
Michael Peter Christen	61995d508e	do the commit anyway before calling a search interface	2012-11-07 17:27:50 +01:00
Michael Peter Christen	86ec199126	using a better file name	2012-11-07 16:39:49 +01:00
Michael Peter Christen	5105256927	update to search result logging (this was a remaining issue from the solr 4.0.0 migration)	2012-11-07 14:15:27 +01:00
Michael Peter Christen	570e42c4e3	fix for filetype navigator	2012-11-07 13:53:29 +01:00
Michael Peter Christen	71ed8e5e07	bugfixes for crawler	2012-11-07 12:52:19 +01:00
Michael Peter Christen	29fbbb49dc	better colors for host browser and corrected document count	2012-11-07 12:23:21 +01:00
Michael Peter Christen	6244b084cd	fixed wrong order of result count values	2012-11-07 02:29:33 +01:00
Michael Peter Christen	631b08e7e2	update to HostBrowser	2012-11-07 02:17:24 +01:00
Michael Peter Christen	51f420e4f5	removed location search because it is only working in special cases	2012-11-07 02:04:41 +01:00
Michael Peter Christen	15d1460b40	added information about the reason of pausing of crawls	2012-11-06 15:21:56 +01:00
Michael Peter Christen	2371ef031c	- added solr faceted search support to YaCy search results - added solr highlighting / YaCy snippets to YaCy search results - facets are now much more complete - facets are computed and searched much faster - snippet computation is done by solr if solr knows the snippet	2012-11-06 14:32:08 +01:00
Michael Peter Christen	d481abd087	added the visualization of error-urls to host browser - only visible for admins - a faceted search generates a huge list for all hosts in the host list - the faceted search algorithms had to be modified for that - within the browsing of the directory path, the error cause is written to the url which is presented as error-url - the errors are also accumulated for directory sums	2012-11-06 00:29:37 +01:00
Michael Peter Christen	a15819fbec	fix for some interface problems	2012-11-05 22:14:52 +01:00
Michael Peter Christen	791e1dcfdf	when a new crawl is started, delete all entries about error-urls for crawl-start domains	2012-11-05 22:14:27 +01:00
Michael Peter Christen	c6a6f4c4e6	added a hack which makes the HostBrowser more performant when the given host has a lot of urls. If the number of urls is > 1000, then the list of documents is restricted to such which have no subpath, if the root path is selected. However, this can cause a problem if no documents on the root path exist but only on paths below that root path.	2012-11-05 18:57:21 +01:00
Michael Peter Christen	64ac2b7b7d	new submenu template	2012-11-05 15:36:42 +01:00
Michael Peter Christen	5e77801aac	update to web interface structure	2012-11-05 15:23:03 +01:00
Michael Peter Christen	8fb370d9f8	renovated the way search results are counted. should be correct now...	2012-11-05 03:19:28 +01:00
orbiter	354ef8000d	- added 'deleteold' option to crawler which causes documents selected by a crawl filter (host or subpath) to be deleted - site crawl uses this option by default now - added a concurrency option to deleteDomain()	2012-11-04 02:58:26 +01:00
Michael Peter Christen	19d1f474ce	host browser now shows also number of pending files per subdirectory + bugfixes	2012-11-02 14:40:02 +01:00
Michael Peter Christen	75dd706e1b	update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading'	2012-11-02 13:57:43 +01:00
Michael Peter Christen	e2c4c3c7d3	migration to solr 4.0.0	2012-11-02 12:29:48 +01:00
Michael Peter Christen	9330ad4838	- fixed the delete option in host browser - added a delete method which can be used to delete a full subpath in solr.	2012-11-02 01:22:31 +01:00
Michael Peter Christen	40df2fd193	added the host browser as link to search results. that means you can select a browsing position after a search is done on the search results.	2012-11-01 21:38:05 +01:00
Michael Peter Christen	1168d09de8	more refactoring - integrated the code of SnippetProcess into SearchEvent	2012-11-01 17:40:06 +01:00
Michael Peter Christen	6629e37685	tried to clean up the search process mess	2012-11-01 17:16:43 +01:00
Michael Peter Christen	c5f67a5d6d	fixed a problem with local search from solr results: now all results from solr are shown (again)	2012-11-01 10:22:22 +01:00
Michael Peter Christen	f8f05ecba7	- added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrently	2012-10-31 17:44:45 +01:00
Michael Peter Christen	0716a24737	added more / all new crawl profile fields into crawl profile editor	2012-10-31 15:13:05 +01:00
Michael Peter Christen	4a14122ba7	in case that a crawl profile has a collection assigned, use the collection to show a name in the web interface. This should prevent that much too long names make the interface unusable.	2012-10-31 14:08:33 +01:00
Michael Peter Christen	0fe8be7981	enhanced data structures for balancer and latency computation which should produce a somewhat better prognosis of forced waiting times.	2012-10-30 17:30:24 +01:00
Michael Peter Christen	ac9540dfb6	removed options for stopwords which are not used	2012-10-30 12:36:36 +01:00
Michael Peter Christen	ce3fed8882	added the Google Search Appliance (GSA) api interface to the main menu. See: https://developers.google.com/search-appliance/documentation/68/xml_reference#request_overview	2012-10-30 12:27:22 +01:00
Michael Peter Christen	0833937c1c	better balancing and duetime-computation also for no-delay intranet hosts	2012-10-30 11:28:49 +01:00
Michael Peter Christen	c25d7bcb80	- added concurrency for robots.txt loading - changed data model for domain counter	2012-10-29 21:08:45 +01:00
Michael Peter Christen	a87811bc38	more auto-commit calls when a search interface is opened, but not when a search is done there to prevent blocking during search-time.	2012-10-29 11:27:13 +01:00
Michael Peter Christen	3d3d654e88	if a network configuration is chosen which does not allow DHT and no P2P communication (is in robinson mode) then some menu entries are disabled which have no use in this mode.	2012-10-29 01:51:19 +01:00
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	2012-10-28 22:48:11 +01:00
Michael Peter Christen	799d71bc67	enhanced solr caching: - increased cache size which is needed for longer solr commit time - speed hacks on cache write code	2012-10-28 20:31:29 +01:00
orbiter	8952153ecf	update to Balancer algorithm: - create a load list from the current list of known hosts - do not create this list for each Balancer.pop access - create the list from those hosts which have a zero-waiting time - select 1/3 from that list which have the most urls waiting - get hosts from the waiting list in random order - fixes for some delta-time computations - always load all urls from hosts which have never been loaded before	2012-10-28 13:24:49 +01:00
Michael Peter Christen	8e1248ffe3	force a commit in advance of a search for the administrator to get most recent results even if commit time is high and an indexing is ongoing.	2012-10-26 15:35:42 +02:00
Michael Peter Christen	1baf498d59	- show more lines in online log - reverse order is default now	2012-10-25 18:38:39 +02:00
Michael Peter Christen	f2d0418218	because the new PngEncoder had a problem with the PixelGrabber which is caused by a JRE bug, the PixelGrabber had to be circumvented using an own frame buffer which can be read without a PixelGrabber. This resulted in ultra-fast and much less memory-consuming transformation. YaCy images are now generated really fast!	2012-10-25 17:59:20 +02:00
Michael Peter Christen	d5d64019e5	- added a method for the RasterPlotter to draw arrow endings to lines - replaced the dot in the NetworkGraph with arrows - enhanced the image drawing speed using pre-computed color values - added more attention for OOM cases during very large image painting	2012-10-25 16:05:04 +02:00
Michael Peter Christen	342543a6c4	fix for host browser	2012-10-25 10:23:43 +02:00
Michael Peter Christen	85ca07b90e	when a new crawl is started, an equal crawl, if still running, is terminated and the corresponding crawl profile is deleted (this also clears the crawl queue entries for that crawl profile)	2012-10-25 10:20:55 +02:00
Michael Peter Christen	906e51214a	the web structure image shows the pivot dot in a different color	2012-10-25 10:18:28 +02:00
orbiter	276dd6452b	removed warnings	2012-10-23 19:08:44 +02:00
orbiter	59bf4677b6	added option to view the complete directory structure in host browser	2012-10-23 19:02:55 +02:00
Michael Peter Christen	b991685782	Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1	2012-10-23 18:14:58 +02:00
Michael Peter Christen	9eaede50e7	enhanced web structure images	2012-10-23 18:11:19 +02:00
Michael Peter Christen	ae6feb5610	showing the web structure graph as animation in the crawl monitor	2012-10-23 02:50:26 +02:00
Michael Peter Christen	39317a6c66	enhanced webstructure image: introduced - multiple hosts can be listed (comma-separated) as host argument - new 'bf'-attribute (branch factor): the maximum number of edges per node - the bf-value is computed automatically - ordering of nodes when the graphic is drawn: mostly the drawing ends with a limitation, e.g. number of nodes. When this happens, it should be ensured that more 'interesting' nodes are painted in advance. This is now done by sorting all nodes by the number of links they have in the distant sub-graph.	2012-10-22 16:23:39 +02:00
sixcooler	57ddd63888	do not hold an expensive cache of references for DHT-out, but load them on demand; see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4530	2012-10-21 20:00:36 +02:00
reger	1dc6482feb	format crawler timeout output string in seconds (was days)	2012-10-21 03:00:05 +02:00
Michael Peter Christen	ef937af35d	more custom field usage in gsa search result	2012-10-18 15:26:55 +02:00
Michael Peter Christen	ce0e5b1e17	- more refactoring / private methods - fix for usage of custom solr field names	2012-10-18 15:09:04 +02:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document the core of YaCy documents, which would allow many conversions to be removed. On the way to this target the equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	7f71dfab03	added a HostBrowser.xml api file and changed a bit of attribute naming	2012-10-18 11:42:13 +02:00
Michael Peter Christen	e5b3c172ff	removed hack which translated Solr documents to virtual RWI entries which were then mixed with remote RWIs. Now these Solr documents are fed into the result set as they appear during local and remote search. That makes the search much faster.	2012-10-17 17:45:41 +02:00
Michael Peter Christen	5d16c23a1f	specified more URIMetadata as URIMetadataNode	2012-10-16 18:26:21 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	cc98496ff3	enhanced the HostBrowser: - showing also outbound links to other domains if there are any - the outbound links browser shows also the link structure image - showing even inbound links if the web structure graph has information about that - removed the left menu and made the HostBrowser a part of the top menu for search - moved the file search also to the top menu - added hover information in the HostBrowser to explain what the click means - because the HostBrowser also links to the Metadata viewer ViewFile, there should be a button to switch back to the HostBrowser: added that also.	2012-10-16 17:13:18 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	4023d88b0b	added date info in parser errors	2012-10-15 10:57:36 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	53789555b9	fix for crawl start filter	2012-10-10 10:40:32 +02:00
Michael Peter Christen	abebb3b124	added a crawl start checker which makes a simple analysis on the list of all given urls: shows if the url can be loaded and if there is a robots and/or a sitemap.	2012-10-10 02:02:17 +02:00
Michael Peter Christen	941873fba4	moved the index deletion functions from IndexControlRWIs to IndexControlURLs where it appears more naturally. Because the RWI administration is less important in the presence of Solr, the IndexControlURL is now the default servlet when the Index Administration button on the main menu is selected.	2012-10-10 00:09:27 +02:00
orbiter	ae246c30c3	fixed interpretation of directDocByURL attribute during crawl start	2012-10-09 23:11:31 +02:00
Michael Peter Christen	a06930662c	replaced some more .getBytes() with UTF8/ASCII.getBytes()	2012-10-09 12:14:28 +02:00
Michael Peter Christen	bd769de604	since the solr index is now used for all pages that are indexed locally, there is no need for the RWI index if the index is not transferred to another peer. Therefore the creation of RWI index data is now suppressed if DHT is disabled. This applies to all intranet and portal mode configurations, but not to public robinson modes. A robinson peer may switch back to public mode and then transmit its data. That means if someone never wants to switch to DHT mode, it would be more appropriate to choose the portal mode.	2012-10-09 11:48:55 +02:00
Michael Peter Christen	554db5608b	fix for ViewFile	2012-10-09 11:25:05 +02:00
orbiter	9190599d21	use links in AccessTracker	2012-10-08 19:47:14 +02:00
Michael Peter Christen	42e525ca9a	enhanced the host browser	2012-10-08 14:00:14 +02:00
Michael Peter Christen	76d218fbef	fixes to crawl profiles	2012-10-08 10:50:40 +02:00
Michael Peter Christen	2f536cb54d	code cleanup: removed unused methods and made more methods and objects private	2012-10-08 10:50:24 +02:00
Michael Peter Christen	406e1f3e7e	added an option to start indexing right from the host browser	2012-10-02 21:18:27 +02:00
Michael Peter Christen	f8a3ab2d82	added the usage of synonyms to the GSA search interface	2012-10-02 14:29:45 +02:00
orbiter	be4c96f3b1	The HostBrowser now offers to index files that are discovered because they are linked in the web interface.	2012-09-30 13:23:06 +02:00
Michael Peter Christen	c4a3d8870f	fixed computation of links in host browser which are not indexed but known by the crawler. Such links are now displayed in grey color.	2012-09-29 02:13:11 +02:00
Michael Peter Christen	97a47319c8	added nice links to the host browser: - click on the file icon to get the metadata of the file - click on the link icon behind the link to open the original file in the browser	2012-09-28 23:09:21 +02:00
Michael Peter Christen	f45f7fc12e	added new Host Browser to main menu: this new search interface is something completely new for search, but completely common on desktops: browse a web space like one would browse a file system in a file browser. The file listing is created using the search index and a faceted restriction to specific domains.	2012-09-28 22:45:16 +02:00
Michael Peter Christen	280e36c90b	allow Cross-Origin Resource Sharing for all stream servlets, that is the solr and the gsa search interface. That means that all JavaScript in browsers now can Cross-Origin access all YaCy search interfaces, which opens the option of 'YaCy Client in Browser' and 'End-Point Fail-over' concepts.	2012-09-27 12:02:24 +02:00
Michael Peter Christen	ccd65ecf8d	fixed url search in IndexControlURLs_p.html / using now the solr interface	2012-09-27 00:31:59 +02:00
Michael Peter Christen	24d2ee3c52	- better date ranking - more protection against NPE and time travel effects	2012-09-26 18:36:32 +02:00
Michael Peter Christen	a4214694df	We assert that no other metadata storage than solr is used now. Therefore a property like solrConnected() must be true all the time. Removal of this method causes removal of all write operations to the old metadata index.	2012-09-26 16:05:11 +02:00
Michael Peter Christen	abab291162	made the index schema retrieval public and allow cross-domain retrieval	2012-09-26 15:44:50 +02:00
sixcooler	c65b576a6f	added filename for missing crawlname when crawling from file	2012-09-26 14:05:33 +02:00
Michael Peter Christen	562183932b	- removed ip_s from default profile since that needs a DNS lookup to create a document entry. This makes remote search much slower. - removed synchronization of add method if ip_s is activated to prevent a user configuration from causing bad behavior. The disadvantage is that an index dump can cause data loss if indexing is running during the index dump - caught more exceptions and more NPE - better abstraction in MirrorSolrConnector - slight performance enhancement when only the index count is requested (rows=0 is sufficient to get a total count)	2012-09-26 13:38:04 +02:00
Michael Peter Christen	24f4ca4d85	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-26 12:01:34 +02:00
apfelmaennchen	7efe9eb37b	adding CORS access header for Network.xml to overcome cross domain restriction (e.g. necessary to build a JavaScript YaCy client).	2012-09-26 10:36:09 +02:00
Michael Peter Christen	c913b2ba77	- fix for NPEs during remote solr configuration - fixed remote solr setting switch - added more logging	2012-09-25 23:59:09 +02:00
Michael Peter Christen	882d54067a	added dummy update servlet	2012-09-25 23:09:32 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	e49359cc95	removed tenant query attribute since it is not used any more and is replaced by the site-operator in the GSA interface. This operator can also be simulated in the Solr interface using the collections_sxt field.	2012-09-25 21:09:06 +02:00
Michael Peter Christen	872f83ebe0	refactoring	2012-09-25 21:04:58 +02:00
Michael Peter Christen	15ea053c3a	- added xml output in IndexControlURLs to get the storage path of index dump commands - adjusted the apicall.sh script to get the downloaded text as output to stdout which is necessary to parse the content out of it - added indexdump.sh script which creates a solr dump and prints out the storage path for the index dump - added synchronization to the Fulltext class to prevent data from being stored to a non-existing solr index while this index is disabled during the storage of the dump	2012-09-25 00:19:52 +02:00
Michael Peter Christen	1b474139dd	used the new zip writer/reader to add a solr dump process: the whole solr index can be written to a zip dump and also restored during runtime	2012-09-24 17:05:28 +02:00
Michael Peter Christen	e57bf2ca39	simplified DHT classes	2012-09-24 01:04:39 +02:00
orbiter	14897d4bfc	fixed mistake in wt-option which caused that the yacy json format overlapped the solr built-in json format	2012-09-21 21:38:50 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
Michael Peter Christen	fa7f6f0be8	added HostBrowser servlet (stub)	2012-09-21 15:48:40 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
orbiter	563d584420	removed more dependencies in cora from kelondro	2012-09-21 11:02:36 +02:00
orbiter	63762d8f89	removed kelondro dependencies from cora	2012-09-20 19:38:22 +02:00
orbiter	089a03114e	full memory usage for debian and when changing the size: debian seems to dislike the big difference between xmx and xms (I have crashes here which stop if both values are same)	2012-09-18 22:31:01 +02:00
orbiter	60b1e23f05	added new crawl options: - indexUrlMustMatch and indexUrlMustNotMatch which can be used to select loaded pages for indexing. Default patterns are set so that all loaded pages are also indexed (as before) but when doing an expert crawl start, the user may select only specific urls to be indexed. - crawlerNoDepthLimitMatch is a new pattern that can be used to remove the crawl depth limitation. This filter is a never-match by default (which means that the depth limit is used) but the user can select paths which will be loaded completely even if the crawl depth is reached.	2012-09-16 21:27:55 +02:00
Michael Peter Christen	6ec02deec6	added new crawl attributes in crawl profile (not active yet)	2012-09-14 16:49:29 +02:00
Michael Peter Christen	a13e5153ac	- added the possibility to have not one but a list of crawl start urls - the list of urls is entered in the expert crawl start in a textfield; the one-line input field was replaced with a text box - start urls can also be given in one single line where the urls are separated by a '|'-character - as an effect, the crawl profile cannot carry a single start url for identification because it is possible to have more. Therefore the url was removed from the crawl profile - this affects all servlets which display a crawl profile: removed the url field from all these servlets - to work consistently with several start urls and the other crawl starts which computed crawl start url lists from sitelists or sitemaps, the crawl start servlet was restructured completely - new rules for must-match patterns were created to make it possible that site crawl starts also work with several crawl starts at once	2012-09-14 12:25:46 +02:00
Michael Peter Christen	975bc95ddf	added default facet fields for json response format (stub)	2012-09-14 12:09:20 +02:00
Michael Peter Christen	2f218df55d	added missing license headers	2012-09-14 12:06:06 +02:00
Michael Peter Christen	a30653a864	added a regular expression test servlet which is linked within the parser/crawler error page whenever a problem with regular expression occurs. This makes it easy to correct and enhance the must-match and must-not-match patterns just by trying out which pattern could be correct.	2012-09-14 12:04:54 +02:00
Michael Peter Christen	0504b01bdc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-14 00:48:17 +02:00
orbiter	a55e77a115	added twitter search heuristic	2012-09-13 23:53:53 +02:00
Michael Peter Christen	e54ac38095	- some corrections in usage of getFile() and getFileName() - added more attributes in json response writer according to yacy servlet	2012-09-11 23:28:21 +02:00
Michael Peter Christen	9644c186a4	added search functionality to ViewFile.html servlet	2012-09-11 02:03:14 +02:00
Michael Peter Christen	b69ed96f0b	- added collections to yacydoc - changed yacydoc.htm to yacydoc.json - added query logging in solr and gsa search result	2012-09-10 15:20:55 +02:00
Michael Peter Christen	5df553c152	- added a json writer for solr (yes there was one using xslt but this one writes the same way as yacysearch.json) - using the new json solr result to change the ajax search in IndexControlURLs to the new solr search	2012-09-10 14:30:44 +02:00
Michael Peter Christen	4d29f59a27	removed warnings	2012-09-10 07:15:52 +02:00
Michael Peter Christen	8c099d2106	Merge remote-tracking branch 'origin/master' Conflicts: htroot/api/ymarks/import_ymark.java source/de/anomic/data/ymark/YMarkEntry.java source/de/anomic/data/ymark/YMarkTables.java	2012-09-10 07:05:20 +02:00
apfelmaennchen	59bd478ed1	Added more sophisticated RDF output for YMarks, including the folder structure (b:Topic) and support for multiple tags (dc:subject) and folders (b:hasTopic) via rdf:Bag container.	2012-09-09 22:56:24 +02:00
apfelmaennchen	d31a632951	- added dmoz RDF dump importer - added indexing to Tables columns to support larger bookmark collections - added RDF output (HTTP) for public bookmarks at /YMarks.rdf - YMarkRDF also provides a Jena RDF Model as "internal" API - various other changes/fixes for YMarks (mainly backend)	2012-09-09 09:53:58 +02:00
orbiter	66ac4076c2	added disjunction '|' option to site parameter in GSA API	2012-09-06 22:35:55 +02:00
sixcooler	9ee2e09983	statistics for solr-cache	2012-09-06 22:02:29 +02:00
Michael Peter Christen	d8425e6809	added collections to crawl monitor	2012-09-04 14:47:53 +02:00
Michael Peter Christen	4b36a2c3b4	small style changes	2012-09-04 11:23:41 +02:00
Michael Peter Christen	8ca842b137	added new button design to more buttons	2012-09-03 16:04:57 +02:00
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	2012-09-03 15:26:08 +02:00
Michael Peter Christen	174530a9e0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-03 00:46:17 +02:00

4214 Commits