Commit Graph

36 Commits

Author SHA1 Message Date
orbiter
b55ea2197f - redesign of the crawl start servlet
- for domain-limited crawls, the domain is now deleted by default before
the crawl is started
2012-11-13 10:54:21 +01:00
orbiter
1c66de4bd4 - removed the scheduled crawling options from the crawl start because they
are superfluous there; they can be changed in the scheduler servlet. They are
also confusing in the presence of the delete-option, which will be
implemented next.
- removed an unused crawl start servlet
- some refactoring to make the time parser reusable
2012-11-12 11:19:39 +01:00
Michael Peter Christen
5e77801aac update to web interface structure 2012-11-05 15:23:03 +01:00
orbiter
354ef8000d - added a 'deleteold' option to the crawler which causes documents that
are selected by a crawl filter (host or subpath) to be deleted (see the
sketch below)
- site crawls use this option by default now
- made the deleteDomain() option concurrent
2012-11-04 02:58:26 +01:00
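A rough illustration of the 'deleteold' behaviour described above: documents matching the crawl filter are removed before the crawl begins, and the deletion runs in its own thread so the crawl start is not blocked. All names in this sketch are assumptions, not the actual YaCy code:

    // hypothetical interface; YaCy's real index API differs
    interface Index {
        void deleteByHostFilter(String host); // delete all documents of this host
    }

    static void startCrawlWithDeleteOld(String host, Index index) {
        // run the deletion concurrently so the crawl start does not block
        new Thread(() -> index.deleteByHostFilter(host), "deleteold-" + host).start();
        // ... the crawl for 'host' would be started here ...
    }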
Michael Peter Christen
ac9540dfb6 removed unused stopword options 2012-10-30 12:36:36 +01:00
orbiter
60b1e23f05 added new crawl options:
- indexUrlMustMatch and indexUrlMustNotMatch, which can be used to select
loaded pages for indexing. The default patterns are set such that all
loaded pages are also indexed (as before), but when doing an expert crawl
start the user may select only specific urls to be indexed.
- crawlerNoDepthLimitMatch, a new pattern that can be used to remove
the crawl depth limitation. This filter is a never-match by default (so
the depth limit applies), but the user can select paths which will be
loaded completely even if the crawl depth is reached. See the sketch of
both filters below.
2012-09-16 21:27:55 +02:00
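A minimal sketch of how the three patterns above interact, assuming '.*' as the match-all default and '(?!)' as a regex that never matches; the actual constants and field names in YaCy may differ:

    import java.util.regex.Pattern;

    public class CrawlFilterSketch {
        // default: index everything that was loaded (as before)
        static final Pattern indexUrlMustMatch = Pattern.compile(".*");
        // "(?!)" never matches, so nothing is excluded by default
        static final Pattern indexUrlMustNotMatch = Pattern.compile("(?!)");
        // never-match by default, so the crawl depth limit stays in force
        static final Pattern crawlerNoDepthLimitMatch = Pattern.compile("(?!)");

        static boolean shallBeIndexed(String url) {
            return indexUrlMustMatch.matcher(url).matches()
                && !indexUrlMustNotMatch.matcher(url).matches();
        }

        static boolean depthLimitApplies(String url) {
            return !crawlerNoDepthLimitMatch.matcher(url).matches();
        }
    }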
Michael Peter Christen
a13e5153ac - added the possibility to have not one but a list of crawl start urls
- the list of urls is entered in the expert crawl start in a text box;
the one-line input field was replaced with that text box
- start urls can also be given in one single line where the urls are
separated by a '|'-character (see the parsing sketch below)
- as an effect, the crawl profile cannot carry a single start url for
identification because it is possible to have more. Therefore the url was
removed from the crawl profile
- this affects all servlets which display a crawl profile: the url field
was removed from all these servlets
- to work consistently with several start urls and with the other crawl
starts which compute crawl start url lists from sitelists or sitemaps,
the crawl start servlet was restructured completely
- new rules for must-match patterns were created to make it possible
that site crawl starts also work with several crawl starts at once
2012-09-14 12:25:46 +02:00
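The start-url parsing described above (one url per text-box line, or '|'-separated urls on a single line) could look like this; the method name is hypothetical:

    import java.util.ArrayList;
    import java.util.List;

    static List<String> parseCrawlStartUrls(String input) {
        List<String> urls = new ArrayList<>();
        for (String line : input.split("\n")) {      // one url per text-box line
            for (String url : line.split("\\|")) {   // or '|'-separated entries
                url = url.trim();
                if (!url.isEmpty()) urls.add(url);
            }
        }
        return urls;
    }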
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated; all entries
are assigned to the documents when they are indexed as the result of
such a crawl start (see the sketch below)
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and in the gsa interface
using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field at all times)
2012-09-03 15:26:08 +02:00
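The comma-separated collection assignment could be done as in this SolrJ sketch; only the field name collection_sxt comes from the commit message, everything else is assumed:

    import org.apache.solr.common.SolrInputDocument;

    static void assignCollections(SolrInputDocument doc, String collectionField) {
        // split the comma-separated crawl tags and store each one in the
        // multi-valued solr field collection_sxt
        for (String tag : collectionField.split(",")) {
            tag = tag.trim();
            if (!tag.isEmpty()) doc.addField("collection_sxt", tag);
        }
    }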
Michael Peter Christen
d7eb18cdf2 also accept file names beginning with "file://" for a crawl start from a
file.
2012-06-06 14:27:18 +02:00
Michael Peter Christen
8bfc987374 enhanced the hint on how to enter file:// urls 2012-02-24 02:14:54 +01:00
orbiter
ebd840ebf6 - enhanced the description on the search front page
- fixed the language and heuristic modifiers
- added a hint to the crawl start that we can also do ftp and smb crawls
- added a protocol extension to remote crawls to transport all search modifiers to remote peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8108 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-26 13:40:33 +00:00
orbiter
e4a82ddd8b produce a bookmark entry from every crawl start. These bookmarks are always private.
They will be used to get a source reference for the search in the case of intranet or portal searches.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-21 23:10:29 +00:00
orbiter
ff32469272 added a link to /api/util/getpageinfo_p.xml as API to crawl start info and to ViewFile.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8035 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-14 20:19:41 +00:00
low012
1b8b989744 *) set the maxlength of the input field for the country code filter to a value > the default text length (the old value caused a warning in Opera)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8002 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-22 16:37:56 +00:00
orbiter
cf4fd525ee added directDocByURL attribute in crawl profile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 12:38:28 +00:00
orbiter
b250e6466d implemented crawl restrictions for IP pattern and country lists
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7980 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-29 15:17:39 +00:00
orbiter
5ad7f9612b added crawl settings for three new filters for each crawl:
must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue)
must-not-match for IPs
must-match against a list of country codes (allows loading only from hosts that are hosted in the given countries)

note: the settings and input environment are there with this commit, but the values are not yet evaluated (a sketch of the intended checks follows below)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-27 21:58:18 +00:00
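A sketch of how the three filters could be evaluated once the values are used; the DNS resolving follows the description above, while the country lookup is kept abstract because the commit does not say how the country of a host is determined:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.Set;
    import java.util.regex.Pattern;

    static boolean ipAllowed(String host, Pattern mustMatch, Pattern mustNotMatch)
            throws UnknownHostException {
        // the IP becomes known after DNS resolving of the queued URL's host
        String ip = InetAddress.getByName(host).getHostAddress();
        return mustMatch.matcher(ip).matches() && !mustNotMatch.matcher(ip).matches();
    }

    static boolean countryAllowed(String countryCode, Set<String> allowedCountries) {
        // an empty set means: no country restriction
        return allowedCountries.isEmpty()
            || allowedCountries.contains(countryCode.toUpperCase());
    }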
orbiter
af63aa1d0e added fresh links to java regular expression api-doc
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7763 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-31 13:33:04 +00:00
orbiter
7962d35425 - removed the file upload function in the crawl start and replaced it with an input field for a file path from which the crawl start file is loaded. This was necessary to support API steering for file crawl starts, for two reasons:
1) if the file is changed for a re-crawl, this is not reflected in the steering because it would take the previously uploaded crawl start file
2) browsers do not submit the full path of the selected file, even if this path is shown in the input field, for security reasons. There is no work-around or hack to make submission of the full path possible

- fixed the deletion of crawl start point urls in the crawl stack and balancer double-check
- fixed a problem with the steering self-call (no resolving of localhost)
- added more logging to the crawler to show why crawl urls are not taken by the loader
- added a javascript onload-function to select the domain restriction in all cases where a crawl is started from a file or from a url
- fixed the restrict-to-domain pattern computation, added a 'www.'-prefix, and added this functionality also to crawl starts from a file (see the pattern sketch below)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-09 12:50:39 +00:00
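The restrict-to-domain pattern computation with the 'www.'-prefix could be sketched as follows; the exact pattern YaCy generates is not shown in the commit, so this is only an assumption:

    import java.util.regex.Pattern;

    static String domainRestrictionPattern(String host) {
        // normalize so that example.org and www.example.org are treated alike
        if (host.startsWith("www.")) host = host.substring("www.".length());
        // must-match pattern that accepts the host with and without 'www.'
        return "https?://(www\\.)?" + Pattern.quote(host) + "(/.*)?";
    }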
orbiter
11bebe356b fixed crawl start: since SVN 7225 the name of the crawl start url was not given in the input field, and therefore all crawl starts contained the empty string as the crawl start url
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7229 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-08 22:02:24 +00:00
mikeworks
70576e88d2 de.lng: Added some more untranslated strings I found and uncommented old ones that had been removed
terminal_p.html: Put back the old ID, which was really easy to find
IndexCreate.js: Because XHTML 1.0 Strict does not allow name attributes for some elements, rewrote most element access functions to use getElementById
Table_API_p.html and all other html pages: Some XHTML 1.0 Strict fixes, changed the checkAll javascript, marked the first row with checkboxes as unsortable where applicable
Table_API_p.java and all other java pages: URL-encoded lines with possible ampersands (& -> &amp;) to validate as XHTML 1.0 Strict source code
--> All Index Create pages should validate now. Hope I did not break anything else (too much :-)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7225 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-06 00:00:23 +00:00
orbiter
f6eebb6f99 replaced the auto-dom filter with an easy-to-understand Site Link-List crawler option
- nobody understands the auto-dom filter without a lengthy introduction to the function of a crawler
- nobody ever used the auto-dom filter other than with a crawl depth of 1
- the auto-dom filter was buggy since it did not survive a restart, after which the search index contained waste
- the function of the auto-dom filter was in fact to just load a link list from the given start url and then start separate crawls for all these urls, restricted to their domains (see the extraction sketch below)
- the new Site Link-List option shows the target urls in real time during input of the start url (like the robots check) and gives transparent feedback on what it does before it is used
- the new option also fits into the easy site-crawl start menu

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7213 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 12:50:34 +00:00
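The mechanism described above (load a link list from the start url, then start one domain-restricted crawl per link) can be sketched with jsoup; YaCy uses its own parser, so the library choice here is purely illustrative:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;

    static List<String> extractLinkList(String startUrl) throws IOException {
        List<String> targets = new ArrayList<>();
        for (Element a : Jsoup.connect(startUrl).get().select("a[href]")) {
            String href = a.attr("abs:href"); // absolute link target
            if (!href.isEmpty()) targets.add(href);
        }
        return targets; // each entry would start its own domain-restricted crawl
    }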
mikeworks
b019426811 de.lng: Added German translations for the new Index Creation pages (RSS Feeds), adapted text in Tables_p.html and CrawlStartExpert_p.html to fix some typos, and changed one name tag to id to conform with XHTML 1.0 Strict
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7191 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-26 01:39:51 +00:00
orbiter
58b7417a59 - added a new 'easy' crawl start menu which can be used for the special case of loading a complete domain
- the previous crawl start servlet was renamed to CrawlStartExpert_p
- the easy crawl start is now the default

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7160 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-16 12:02:43 +00:00
orbiter
2f381b8d7a - fixed at least two causes of an NPE after a use case switch;
a large refactoring was necessary
- added another crawl start option: automatic restriction to a sub-path
- removed crawlStartSimple and renamed the expert crawl start
   to crawl start (without expert)
- some changes to the texts in the crawl start
- added some more deletions when a web index is deleted:
   also delete the queues and the robots cache


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4881 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-04 21:34:57 +00:00
lulabad
fc54d4519e fixed some more XHTML Strict errors
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4471 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-02-10 09:06:17 +00:00
daburna
3636526bd6 replaced re-crawl/min-age as suggested here: http://forum.yacy-websuche.de/viewtopic.php?f=9&t=198
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4466 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-02-09 15:15:58 +00:00
daburna
a047e7f830 replaced the irritating "re-crawl" wording
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4463 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-02-08 22:47:18 +00:00
orbiter
b183bf6f42 - fixed opensearch bugs
- added a 'full domain' button to the expert crawl start
- removed the non-working 'only one domain' button; its regex allowed crawling of other domains

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4125 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-02 21:43:05 +00:00
low012
51800539b2 *) changed the regex that is created for the crawling filter (see http://forum.yacy-websuche.de/viewtopic.php?t=83)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3945 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-01 17:09:57 +00:00
orbiter
5009695537 fix for double-entries of crawl tasks.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3920 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-25 14:16:32 +00:00
orbiter
c7a614830a several bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3899 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-15 17:45:49 +00:00
allo
b2a9080a14 fix for when the user hits cancel
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3820 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-07 19:56:59 +00:00
allo
b68fb8a0ba one \ more
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-07 19:48:15 +00:00
allo
e24b54301e RegEx, not Blacklist-style RegEx ;/
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3818 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-07 19:46:24 +00:00
orbiter
3f49cd516b split the index create page into two pages:
- one with fewer options but with information about other remote crawls
- one with all options but without any other information
on both pages the steering options have been removed; they are now on the monitoring page.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3813 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-06 22:27:03 +00:00