Commit Graph

207 Commits

Author SHA1 Message Date
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on th url: the collection attribut for a crawl start
may be now either a token or a list of tokens, seperated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated to the pattern with a ':' and the string is assigned to the
document as collection only if the pattern matches with the url.
2013-08-25 00:13:48 +02:00
Michael Peter Christen
e4cbe9232d fixed a crawler bug where a double-occurring url was not re-crawled
because the double-check error was written to the error-db and never
deleted. No the error-db is cleared on every start and these
double-messages are not written to the error-db any more.
2013-08-22 15:56:09 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
dbfa865700 added a stub of a class for crawler redesign 2013-07-31 13:16:32 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
orbiter
268a36aaff emergency fix for crawler: this will otherwise cause loss of complete
crawl queue if latency of remote system is too low
2013-07-27 11:59:07 +02:00
reger
2b7a38640a extend content type detection on file extension for .tif .tiff .htm 2013-07-21 22:57:21 +02:00
Michael Peter Christen
735a66eff3 enhancements to crawler 2013-07-18 12:29:04 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
89c0aa0e74 added collection_sxt to error documents 2013-07-17 15:20:56 +02:00
Michael Peter Christen
c6a6f159e8 fix for crawl stack domain counter 2013-07-16 18:18:55 +02:00
Michael Peter Christen
bcc623a843 refactoring of load_delay: this is a matter of client identification 2013-07-12 16:24:56 +02:00
orbiter
3978c5ca5d fix for http://bugs.yacy.net/view.php?id=255 2013-07-12 14:38:30 +02:00
orbiter
dac88561ae minimum access time has a tight connection to ClientIdentification,
therefore it is defined there.
2013-07-11 17:04:24 +02:00
Michael Peter Christen
5c6946dd5f replaced usage of log4j by ConcurrentLog where possible 2013-07-09 14:42:39 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
Michael Peter Christen
a34e137e27 fix for citation index generation in case that entry.referrerhash() is
null. This is especially the case if ftp sites are crawled
2013-07-08 16:26:11 +02:00
sixcooler
9551720d5c re-enable saved setting for proxy-crawl-profile 2013-07-04 19:10:57 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
fa08bd9d5a hack to prevent long waiting times in crawler 2013-07-01 13:24:52 +02:00
Michael Peter Christen
f1c5338210 prepartion for greedy crawl profiles and refactoring 2013-07-01 13:10:09 +02:00
Michael Peter Christen
203921006a redesign of citation index storage 2013-06-30 02:11:46 +02:00
Michael Peter Christen
16d1d744fa added url_file_name_s in default collection schema for the file name
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which has now not the file name as last
part of the path list.

The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
2013-06-25 16:27:20 +02:00
Michael Peter Christen
374d2e2a52 removed warning message during crawling 2013-06-13 13:03:56 +02:00
Michael Peter Christen
2fd7bbb450 reduced load on solr; no seed update in Status and no exists-check in
HTTPLoader in case of redirects, that can be done using the htcache.
2013-06-12 00:14:55 +02:00
Michael Peter Christen
2648b42b27 added fixed clear method as public method 2013-06-11 16:22:43 +02:00
Michael Peter Christen
e20450e798 patch in HTCache and CitationIndex loading in case that a file is
broken: do not crash; instead ignore the file and delete it.
2013-06-07 12:52:03 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
Michael Peter Christen
8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
reduced time-out of robots.txt load limit
2013-05-20 22:05:28 +02:00
Michael Peter Christen
06d3063dc9 - no downcase when using collection modifier
- removed warnings
2013-05-17 14:11:10 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is a much better
performace when testing the existence of many urls. The effect should
cause very much less IO during index transmission, both on sender and
receiver side.
2013-05-17 13:59:37 +02:00
Michael Peter Christen
44e363f37f refactoring of WorkflowProcessor, added process counter, update of
process counter if an blocking thread dies. Added also a new column in
PerformanceConcurrency_p servlet to show the actual number of concurrent
processes.
2013-05-13 13:28:07 +02:00
Michael Peter Christen
77faeada4d small memory leak patch 2013-05-11 11:19:06 +02:00
Michael Peter Christen
038f956821 fix for sitemap detection: the sitemap url was not visible if it
appeared after the declaration of robots allow/deny for the crawler
because the sitemap parser terminated after the allow/deny rules had
been found. Now the parser reads the robots.txt until the end to
discover also sitemap rules at the end of the file.
2013-05-10 04:56:58 +02:00
Michael Peter Christen
bb4bf3d8fd infinity timeout bug protection patch 2013-04-30 11:06:48 +02:00
Michael Peter Christen
3a0fcfbeda Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-26 10:50:08 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adopted logging end error management
- adopted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy more
adjusted to smaller and 1-core devices.
- the workflow processor now starts no process at all. these are started
as soon as parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
c091000165 added collection attribute also to the rss feed reader 2013-04-24 01:14:35 +02:00
Michael Peter Christen
252bb51f98 fix for wrong mime type in noload crawler 2013-03-07 15:31:00 +01:00
orbiter
2555542f7a removed the dns prefetch because that was not soo useful 2013-02-27 20:58:34 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
The default schema uses only some of them and the resting search index
has now the following properties:
- webgraph size will have about 40 times as much entries as default
index
- the complete index size will increase and may be about the double size
of current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will cause that some old parts in YaCy can be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also avaiable for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
structure, but is not filled yet. To have the opportunity of a second
core, multi-core functionality had to be implemented to the
deep-embedded solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as addition to solr connections: all solr
connectors are now instances of an solr 'instance' object; this required
a complete re-design of the solr embedding
- migrated also caching and sharding ontop of new instance handling
- migrated the search apis to handle now the access to a specific core,
the default core named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; the must
retrieve the actuall connection object every time they want to write to
it (this solves also some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
2013-02-21 13:23:55 +01:00
Michael Peter Christen
b6de1f42dc Full redesign of solr connection architecture. This was done to support
multiple solr cores instead of just one. Therefore it is now necessary
to distuingish between solr server connections (called an 'Instance')
and a connection to a single solr core. One Instance may now have
multiple connector classes assigned to it, each connecting to a single
core.
To support multiple cores it is also necessary to distinguish between
the connection configuration and the configuration of the index schema.
We will have multiple schema configurations in the future, each for
every solr core. This caused that the IndexFederated servlet had to be
split into two parts, the new Servlet for the Schema editor is now in
the IndexSchema Servlet.
2013-02-15 01:38:10 +01:00
Michael Peter Christen
0b6566a389 optimizations when starting large crawl requests with many start urls in
one request:
- allow larger match-fields in html interface
- delete all host hashes at once from zurl
- when deleting by host, do not count size of deleted entries since that
was the reason it took so long
2013-01-31 13:15:28 +01:00
Michael Peter Christen
0fe7b6fd3b migrated the index export methods from the old metadata to solr. Now
exports are done using solr queries. removed superfluous methods and
servlets.
2013-01-24 12:39:19 +01:00
Michael Peter Christen
9ccdd21d76 Merge remote-tracking branch 'aleksejs/fixtrans'
Conflicts:
	locales/ru.lng
	
Tried to merge this but I had to made this 'blind'.
Sorry if I deleted something that was right.
2013-01-22 11:54:38 +01:00
Michael Peter Christen
af465cdca5 fix for wrong robots.txt loading for https protocol
see also: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4579
2013-01-16 17:38:06 +01:00
Michael Peter Christen
38d3feae65 added separate delete commands for the local+remote solr index, the old
metadata and old rwi and for the citation index. The important
advancement is the separation of the citation index deletion because
that index is responsible for the linkdepth calculation. Now a search
index can be deleted without the citation index and that should cause
that less clickdepths must be post-processed.
2013-01-04 16:39:34 +01:00
reger
276e63401e small sanitary fixes
- exclude unix shell scripts in NSIS windows install archive
- replace link to env/grafics/yacy.gif to yacy.png (build.nsi)
- remove unused code lines (Blacklist_p, Response, WordReferenceVars)
- type & xhtml (RankingSolr_p.html)
2013-01-02 01:59:47 +01:00
Michael Peter Christen
8f3bd0c387 fix for smb crawl situation (lost too many urls) 2012-12-26 19:15:11 +01:00
Michael Peter Christen
7c3de8b4cd - fix for localhost detection
- added IPv6 patterns for localhost detection
2012-12-18 12:52:20 +01:00
orbiter
712cc37c40 if maxFileSize < 0 then the file size limit is without limit. 2012-12-10 21:17:45 +01:00
Michael Peter Christen
a3cd3852ab introduced a better place to update the lastacc time value in latency 2012-12-07 15:49:23 +01:00
Michael Peter Christen
864abcd33d removed Latency update after URL selection because that causes
a completely wrong behaviour when cache fresh cases appear. Makes
re-crawling MUCH faster!
2012-12-07 15:35:44 +01:00
Michael Peter Christen
dd241d03bb latency fix: only set last-visit time if access was actually by the
robot
2012-12-07 02:00:12 +01:00
Michael Peter Christen
1e002ab18e added another blacklist-cleaner into balancer 2012-12-07 01:27:24 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
756772fbd3 fix for waitingtime computation for intranet configuration 2012-12-06 17:40:52 +01:00
Michael Peter Christen
fa27e5820f - check blacklist (again) when taking urls from the crawl stack because
the blacklist may get extended during crawling
- removed debug output
2012-12-06 00:12:16 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
3de784c8dd replaced more split and replaceAll missing pattern pre-compilation with
pre-compiled pattern
2012-11-26 13:40:53 +01:00
Michael Peter Christen
d48e9788d2 enhanced search result processing behavior
- query less at one time; query more often
- in between the small queries, evaluate results
- remove fields from search results which are not needed
2012-11-26 12:24:35 +01:00
Michael Peter Christen
eca68fa197 added debug code to crawler monitor 2012-11-25 15:43:42 +01:00
orbiter
5aa5202adf fixes for filesystem indexing 2012-11-24 10:27:29 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
5e182a566f - added another enumeration method in kelondro data structure to get a
more random access to data for the balancer
- added random access inside the balancer
2012-11-23 13:58:39 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignatue.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the singature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adopted and
normalized. This caused that many files had to be changed.
2012-11-21 18:46:49 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during aquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
cominch
d2a94cc55e refactor package 2012-11-09 16:22:24 +01:00
Michael Peter Christen
8b1c9cba3d fixed a problem with non-terminating crawls 2012-11-07 15:05:44 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler 2012-11-07 12:52:19 +01:00
Michael Peter Christen
158732af37 automatically delete entries from the crawl profile list if crawl is
terminated.
2012-11-07 02:03:44 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which causes that documents are
deleted which are selected by a crawl filter (host or subpath)
- site crawl used this option be default now
- made option to deleteDomain() concurrency
2012-11-04 02:58:26 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
0716a24737 added more / all new crawl profile fields into crawl profile editor 2012-10-31 15:13:05 +01:00
Michael Peter Christen
4a14122ba7 in case that a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent that
much too long names make the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
0fe8be7981 enhaced data structures for balancer and latency computation which
should produce a bit better prognosis about forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
ac9540dfb6 removed options for stopwords which are not used 2012-10-30 12:36:36 +01:00
Michael Peter Christen
b2ffd49817 less latency 2012-10-30 12:26:32 +01:00
Michael Peter Christen
0833937c1c better balancing and duetime-cumputation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c326aa8f67 disabled writing new entries to crawl stacks to prevent that a domain
with many documents block refreshing of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
a87811bc38 more auto-commit calls when a search interface is opened, but not when a
search is done there to prevent blocking during search-time.
2012-10-29 11:27:13 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader 2012-10-28 22:48:11 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
orbiter
8952153ecf update to Balancer algorithm:
- create a load list from the current list of known hosts
- do not create this list for each Balancer.pop access
- create the list from those hosts which have a zero-waiting time
- select 1/3 from that list which have the most urls waiting
- get hosts from the wainting list in random order
- fixes for some delta-time computations
- always load all urls from hosts which have never been loaded before
2012-10-28 13:24:49 +01:00
Michael Peter Christen
85ca07b90e when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
2012-10-25 10:20:55 +02:00
Michael Peter Christen
ae6feb5610 showing the web structure graph as animation in the crawl monitor 2012-10-23 02:50:26 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unneccessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which had been then mixed with remote RWIs. Now these Solr documents are
feeded into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
53789555b9 fix for crawl start filter 2012-10-10 10:40:32 +02:00
Michael Peter Christen
a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes() 2012-10-09 12:14:28 +02:00
Michael Peter Christen
4b5e0c1500 added an url rewriter which can be used to remove session ids from urls 2012-10-09 11:24:48 +02:00
Michael Peter Christen
76d218fbef fixes to crawl profiles 2012-10-08 10:50:40 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
f879a344e7 fix for no depth limit default value 2012-09-21 16:05:17 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00