Commit Graph

4816 Commits

Author SHA1 Message Date
orbiter
aa322bc6d0 fix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8050 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:36:30 +00:00
orbiter
97d1347adb added also a default accept field to robots.txt downloads
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8049 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:33:55 +00:00
orbiter
f183d3822c added a default accept header in http requests since some http fraud detection functions check that this header field exist
see also: http://bad-behavior.ioerror.us/ in source file browser.inc.php

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8048 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:27:43 +00:00
orbiter
06352b8d6b more logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8047 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 14:09:50 +00:00
orbiter
a99934226e more logging for debugging of robots.txt
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8046 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 13:56:31 +00:00
orbiter
7a5841e061 fix for robot parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8045 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 13:12:46 +00:00
orbiter
458c20ff72 fix for robot parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8044 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 13:06:46 +00:00
orbiter
017a01714d - enhanced logging in robots.txt parser for remote debugging
- robots.txt is now more robust against database operations

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8043 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 01:03:49 +00:00
apfelmaennchen
a8dfe787ed - updated to jquery flexigrid 1.1
- YMarks.html automatically  recognizes if a bookmark is a crawl start


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8040 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-15 21:45:17 +00:00
orbiter
eb1c7c041d write info about robots.txt evaluation into getpageinfo_p.xml
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8038 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-15 00:33:54 +00:00
apfelmaennchen
abba31f02e - bugfix for correctly sorting ymarks
- some tuning for the autotagger (still not perfect)
- /api/ymarks/get_metadata.xml now provides info for crawlstarts
- removed unused code

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8036 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-14 22:00:44 +00:00
orbiter
775b44017e refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8033 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-14 15:11:57 +00:00
apfelmaennchen
5f7dbe1c42 - some refactoring (ymarks)
- improvement for autotagger (is now able to create/detect  multi word tags e.g. 'open source')



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8031 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-13 23:19:47 +00:00
orbiter
78ce3b13be typo
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8027 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-10 11:57:26 +00:00
orbiter
85d6bf4ac4 fixed urls to media content during indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8021 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-09 15:40:14 +00:00
orbiter
0d858d48ec replaced String with StringBuilder in suggestion process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8020 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-09 14:42:55 +00:00
orbiter
3a807e10cf - added a cache for active crawl profiles to the crawl switchboard
- moved the domain cache for domain counter from the crawl switchboard to the crawl profiles. the crawl domain counter is now therefore relative for each crawl start, not for the whole crawler.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8018 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-08 15:38:08 +00:00
orbiter
37e35f2741 normalization of url using urlencoding/decoding
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8017 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-08 12:02:22 +00:00
orbiter
1b86d06d1e fix for http://bugs.yacy.net/view.php?id=62
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8004 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-26 10:07:16 +00:00
orbiter
9e4875230f performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8001 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-20 23:06:49 +00:00
orbiter
a9838f8b99 fix for http://bugs.yacy.net/view.php?id=59
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7997 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-12 22:26:48 +00:00
orbiter
a7df70221e refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7987 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-04 09:06:24 +00:00
orbiter
cf4fd525ee added directDocByURL attribute in crawl profile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 12:38:28 +00:00
orbiter
c61e4cfd78 - fix for incomplete clear() in balancer
- renamed Parser Errors to Rejected URLs

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7984 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 10:27:14 +00:00
orbiter
813f297a95 another performance hack: re-use of known host addresses for isLocal property; avoids look-up in local hash
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7983 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 08:26:31 +00:00
orbiter
035ebfbf3b - performance hacks (should affect the crawl balancer and reduce CPU load during crawl stack re-fill)
- this may have also (good) performance side effects on other parts of YaCy


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7982 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 07:57:50 +00:00
orbiter
b250e6466d implemented crawl restrictions for IP pattern and country lists
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7980 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-29 15:17:39 +00:00
f1ori
e207c41c8e * fix urlproxy for urls containing dolar signs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7979 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-29 12:53:55 +00:00
orbiter
5ad7f9612b added crawl settings for three new filters for each crawl:
must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue)
must-not-match for IPs
must-match against a list of country codes (allows only loading from hosts that are hostet in given countries)

note: the settings and input environment is there with that commit, but the values are not yet evaluated

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-27 21:58:18 +00:00
orbiter
d2ea250d99 refactoring:
- moved many classes from de.anomic to net.yacy
- made more sub-packages for search classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 16:59:06 +00:00
low012
42b5f09f68 *) this should fix a bug in snippet creation (also cleaned up a little bit)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7972 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 16:07:22 +00:00
orbiter
6b22865dbc - removed some warinings
- removed a dead update location

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-24 01:58:54 +00:00
orbiter
0c6d95e57b - more tolerance against failure of table opening
- more connections for solrj

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7968 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-21 15:08:05 +00:00
orbiter
4f31869c5a enhanced search result timing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7966 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-21 10:43:08 +00:00
orbiter
6b02b696b0 - add number of search results to end of rss and json output to reflect latest status of retrieval
- distinguish search access with different verify state in access of search cache

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7965 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-20 19:41:44 +00:00
f1ori
87e6abd168 * fix urls containing a port number in urlproxy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7964 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-20 15:02:15 +00:00
f1ori
97045022fa * pass cookies to Server Side Includes
* User.html a bit more usable


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7963 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-20 14:54:14 +00:00
orbiter
ce2a76d603 performance hack for search process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7961 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-16 10:00:51 +00:00
orbiter
2c4a672fe2 bugfixes and performance hacks for tabe index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7957 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-15 11:17:02 +00:00
orbiter
dad5b586a4 added a concurrent warmin-up of Table data structures. that should speed-up the start-up process but may also cause stronger CPU load at that time.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7956 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-15 10:01:21 +00:00
orbiter
734059d33e performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7955 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-14 23:34:05 +00:00
orbiter
23e81b28b2 synchronization enhancements
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7954 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-14 21:19:02 +00:00
orbiter
dd4635e323 patches
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7953 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-14 20:11:27 +00:00
orbiter
bb0c045036 fix for problem with relocation of network
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7944 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-13 18:46:11 +00:00
orbiter
85a5487d6d YaCy can now use the solr index to compute text snippets. This makes search result preparation MUCH faster because no document fetching and parsing is necessary any more.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7943 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-13 14:39:41 +00:00
orbiter
52a2b3f110 try to fix bug http://bugs.yacy.net/view.php?id=26
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7941 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-08 19:13:19 +00:00
orbiter
2cba860693 - fix for wrong entries in NOLOAD indexing queue (that caused that urls had been only indexed based on their url and not loaded)
- patch for better urls to solr admin interface

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7938 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-08 12:23:55 +00:00
orbiter
cec3836e73 added reference limitation to IndexControlRWIs_p.html servlet
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7936 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 21:47:54 +00:00
orbiter
49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-07 10:08:57 +00:00
f1ori
41e146116a fixes size of document in case the server doesn't give the size in the header
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7929 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-05 12:21:25 +00:00
orbiter
e1a3d609aa moved merger object from Segment to IndexCell to enable a correct shutdown sequence. This solves a bug where yacy cannot be shut down during an index merge that appears during the shutdown phase.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7924 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-04 23:27:12 +00:00
sixcooler
2cf61a40ce fixed a bug from 7856, where Snippet returned an error by mistake when Metadata was found
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7921 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-02 16:50:05 +00:00
orbiter
610b01e1c3 - added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index.
- some refactoring for mime type discovery

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7919 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-01 16:05:00 +00:00
orbiter
3da21c4266 protection against starting of a (second) yacy peer while another one is already running on the same port
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7917 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-01 13:13:21 +00:00
orbiter
3e6767d66c limitation of reference evaluation (protection against crawler pits)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 21:12:31 +00:00
orbiter
2c595a6a47 added new methods to count the number of objects in RWIs. lots of refactoring was necessary to introduce new Rating class and to unify naming of methods
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7896 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 10:35:25 +00:00
orbiter
9f9f634de2 fix in search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7894 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-24 12:12:48 +00:00
sixcooler
5f8a5ca32d - not doing merge-jobs while short on Memory
- using configuration-values of crawling-max-filesize also for snippetfetching and loading files into Index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7893 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-24 12:07:53 +00:00
orbiter
22d69a6368 refactoring in cora: added sorting package
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7890 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-23 20:18:30 +00:00
orbiter
51cf697acd refactoring: moved all score-related classes to new ranking package
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7889 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 22:37:53 +00:00
sixcooler
169236c6d9 almost revert changes in this class of 7880 and 7882
since MemoryControl does handle negative value requests

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7887 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 17:58:23 +00:00
sixcooler
4fec99115b Implementation of strategies for controlling memory resources.
You can toggle between previous (standard) and new (generation) strategy at PerformanceMemory_p.html.
The generation memory strategy is implemented with the objective of running more robust
but with the cost of early stopping some tasks (eg. dht) while running low on memory.
This new strategy does respect the generational way a heap is organized on most used jvms.
These changes run fine on my 3 peers for weeks now, but as I'm human, I may fail.
Please be carefull using generation memory strategy and report errors by naming
OS, jvm and java_args.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7886 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-22 17:50:03 +00:00
orbiter
c64faf41e2 addon to svn 7880
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7882 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-15 11:07:03 +00:00
sixcooler
06408a9428 since many POST-requests come as gzip they report a contentlength of -1
request memory of -1 * 3 look useless to me
so I added some megs to it - even correct report of contentlength should not be harmed by this

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7880 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-13 01:04:37 +00:00
orbiter
594d8f546a #cccamp11 maintenance fix: anons may find up to 1000 items in interactive search (was: 100)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7866 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-11 21:37:35 +00:00
sixcooler
eb14111200 encapsulate potential expensive objects in TextSnippet to allow GC them asap
this reduces chance of OOMs at massive search & snippet-fetching

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7865 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-11 21:07:52 +00:00
orbiter
e3fc1efbef performance hack and ensuring termination in serverAccessTracker. cause:
"Session_:53600#0_POST /yacy/hello.html HTTP/1.1" prio=10 tid=0x2322b000 nid=0x3ba7 runnable [0x03d3e000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Long.valueOf(Long.java:557)
        at de.anomic.server.serverAccessTracker.clearTooOldAccess(serverAccessTracker.java:113)
        at de.anomic.server.serverAccessTracker.cleanupAccessTracker(serverAccessTracker.java:75)
        - locked <0x3bda2ae8> (a de.anomic.server.serverAccessTracker)
        at de.anomic.server.serverAccessTracker.track(serverAccessTracker.java:125)
        at de.anomic.server.serverSwitch.track(serverSwitch.java:542)
        at de.anomic.http.server.HTTPDemon.parseRequestLine(HTTPDemon.java:641)
        at de.anomic.http.server.HTTPDemon.POST(HTTPDemon.java:491)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at de.anomic.server.serverCore$Session.listen(serverCore.java:757)
        at de.anomic.server.serverCore$Session.run(serverCore.java:651)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7862 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-03 18:47:43 +00:00
orbiter
44d74f8f89 performance hacks for seed generation (because thread dumps showed multiple occurrences at these code points)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7861 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-03 18:32:11 +00:00
sixcooler
5cd07d7f84 early freeing resources on deleting index reference if search-verification fails (aka Switchboard.cleanupJob)
doing same thingy on other methods of touched files as well

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7860 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-02 15:52:33 +00:00
sixcooler
a311596881 finishing up my commits (7855-7858) which could be helpful for
not declaring inside loops (helps GC of some VMs)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7859 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-01 23:35:24 +00:00
sixcooler
c0caca57e3 stoping thread for fetching searchresults if running short on memory
- in most cases at least one thread stays alive for getting the results
- fewer threads should do the work with less resouces, but much slower then

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7857 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-01 23:32:29 +00:00
sixcooler
ce248cc8dd less byte-arrays of response-content, less byte-array <-> stream conversation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7856 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-01 23:31:08 +00:00
sixcooler
59b767eebd stop loading via http at defined maximum of bytes - even size is unknown before loading
using max-file-size of type int for parsing documents
(since content is used as byte-arrays, 'integer' should be maximum)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7855 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-01 23:28:23 +00:00
f1ori
3a5fa73008 * revert parts of previous commit, because it breaks the trickle-feature
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7851 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-19 12:04:40 +00:00
f1ori
6e79675ff3 * use gzip-encoding in more cases
* send Expire-Header for static content
* should improve webserver-performance for slow connections
* fixes #37

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7850 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-19 11:47:53 +00:00
sixcooler
aff875baef smaler ping-entry @ ProfilingGraph
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7841 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-15 09:14:21 +00:00
orbiter
1912d0cccc changed handling of RowSet element retrieval: until today all elements had been copied from the underlying byte[] arrays into a new Entry object that again had a copy of a portion of that byte[] in its own bye[]. There was an option to just refer to the underlying byte[] with a pointer but that was almost never used. This commit now changes an interface to the Row class where it is now necessary to tell if a copy is always required. Fortunately the copy is only needed in very rare cases. That means that this change should cause much less memory allocation; it is expected that this happens especially during search situations.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7840 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-15 08:38:10 +00:00
orbiter
be15874be1 added request line in http which can support better debugging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7838 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-14 11:00:38 +00:00
orbiter
11dc653de3 added a visualization of peer pings to the performance graphic
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7837 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-14 07:07:06 +00:00
orbiter
3a191cdf14 because newbies are scared about the memory consumption in the performance graph and arguments about high memory consumption according to bad knowledge about java garbage collection techniques, the memory display had been removed from the performance graph shown on the Status.html page. The memory graph can still be seen on the Performance page where the memory graph is just like it was.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7836 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-14 03:25:57 +00:00
cominch
09bb7a390c do not replace malformed or invalid URLs in urlproxy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7835 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-12 07:44:23 +00:00
orbiter
768c59740c - replaced solrj 3.1 with solrj 3.3
- updated also slf4j
- added authentication for solrj


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7829 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-04 16:35:30 +00:00
low012
c7b95e8c81 *) Invalid crawl profiles (containing invalid mustmatch/mustnotmatch filters) will be moved from active crawls to invalid crawls (new file: DATA/INDEX/freeworld/QUEUES/crawlProfilesInvalid.heap). This file can not be edited yet, but it shoudl be easy to extend the CrawlProfileEditor accordingly.
*) Corrupt crawlProfilesPassive.heap would cause crawlProfilesActive.heap to be deleted. Don't know if this ever happend, but will not happen anymore.
*) Cleaned up a little bit.
*) Added some comments.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7827 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 23:55:55 +00:00
orbiter
719777b2a7 replaced method to call getUsableSpace using reflection with direct call since we now use java 1.6
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7821 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 18:13:37 +00:00
orbiter
2d4bb139d3 - added counting of links with noindex tag for solr index
- bugfixes for solr index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7820 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 06:40:05 +00:00
orbiter
892caccdca added default configuration in ConfigurationSet in case of new values
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7814 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-02 00:09:49 +00:00
orbiter
bda3eec0ff added parsing of canonical link element to html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-01 16:38:01 +00:00
orbiter
b6f09a475d - added an index profile editor in the /indexFederated_p.html servlet for solr indexes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-30 15:49:21 +00:00
f1ori
a17351dcfe * navigation bar for filetype constraints
javascript interpreted backslashes from urlmask as escaping and didn't forward them to yacy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7806 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 15:30:24 +00:00
f1ori
96957375cc * fix url proxy for relative links and chromium
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7805 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 09:32:02 +00:00
orbiter
9ebc75db4b fix for channel authorization
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7803 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-26 23:14:02 +00:00
orbiter
6d9e5865ee faster appearance of search result page (but complete search time is the same)
this was inspired by http://bugs.yacy.net/view.php?id=37

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-26 21:17:02 +00:00
orbiter
f7ca84cfc0 enhanced template engine
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7800 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-26 21:15:13 +00:00
orbiter
84c9658644 added a file type navigator
added a protocol navigator

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7795 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-23 15:39:52 +00:00
orbiter
31283ecd07 - added a search option to filter only specific network protocols. i.e. get only results from ftp servers. Just add '/ftp' to your search.
for example search for "passwd /ftp". This can also be done with /http /https and /smb
- fixed some search throttling processes that should protect your peer against search DoS or strong search load

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7794 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-23 11:57:17 +00:00
orbiter
4b425ffdd2 fix for http://bugs.yacy.net/view.php?id=41
added another RSS channel "PROXY". the rss feed for peer news filters this channel if there is not an authorized access on that channel


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7792 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-22 10:19:32 +00:00
orbiter
7db208c992 performance hacks: more pre-allocated StringBuilder
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7790 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-21 23:10:50 +00:00
orbiter
87bd559c42 fixed warning
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7789 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-20 22:53:43 +00:00
orbiter
f30d36b101 enhanced template engine
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7783 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-19 13:02:06 +00:00
orbiter
115abc8917 - more attributes for search progress bar
- moved cache strategy to cora package

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-13 21:44:03 +00:00
sixcooler
7bfa6bb4b6 prevent getting a yacySeed from zero-length-hash-string by chance
(for eg.: proxy-crawls got displayed as initiated by some other peer)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7776 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-05 22:58:17 +00:00
orbiter
bce280a308 update on options for interface graphics
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7775 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-05 22:48:21 +00:00
orbiter
2683162ec5 - added more options to access grid picture, web structure picture and network graphics
- remove test class


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7770 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-02 23:27:26 +00:00
orbiter
0c1b29f3c9 - applied many small performance hacks
- added a memory limitation in the zip parser and the pdf parser
- added a search throttling: if there are too many search queries are still to be computed, then new requests are not accepted for some time. if after a one second still no space is there to perform another search, the search terminates with no results. this case should only happen in case of DoS-like situations and in case of strong load on a peer like if it is integrated in metager.
- added a search cache deletion process that removes search requests in case that throttling happens

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7766 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-01 19:31:56 +00:00
f1ori
900dacbf97 * improve link rewriting in proxy-url
* only rewrites links, which are in current search domain

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-01 13:27:04 +00:00
f1ori
dc855d881b * further improve proxyurl
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7762 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 21:25:20 +00:00
orbiter
a7a6b392f5 code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7760 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 10:16:43 +00:00
orbiter
fe0c08455b more concurrency (enhancement) hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7759 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 08:53:58 +00:00
orbiter
0e9a99cb05 another resource hack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7758 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 07:51:18 +00:00
orbiter
535b6b953c more hacks to omit superfluous string object allocation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7757 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 07:31:17 +00:00
orbiter
87082f407e less String object creation during search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 04:19:20 +00:00
orbiter
ab5a16b957 lesse memory occupation during ranking and faster host navigator
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7755 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-29 20:33:12 +00:00
orbiter
1489ebeedf one more hack to free ram for search events
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7753 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 14:26:37 +00:00
f1ori
ddcc333acc * fix negative result counts
results sorted out by add to RankingProcess were counted in
sortedout-counter, but were not added to remote_indexCount nor
local_indexCount

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7749 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 11:21:00 +00:00
orbiter
fa734bdf9f better memory protection in search logger
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7748 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 11:18:22 +00:00
orbiter
dbea40d536 - changed snippet fetch strategy logic: do not check if entry is in cache. This should reduce IO load on the HTCACHE which is a showstopper during large number of search requests
- forced a possible short memory status when a search is started to flush caches that may cause search-heaps with resource contention effects

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7747 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 09:32:03 +00:00
orbiter
4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes).
The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 08:24:54 +00:00
orbiter
746e3c3b06 Replaced a widely-used Property Object in the httpd with HashMap<String, Object> which is not synchronized like Properties
A synchronization is not needed here and applies an overhead to the httpd process which is now removed.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7745 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 16:34:35 +00:00
f1ori
14e1666b21 * fix replacing regexes in url proxy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7742 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 16:09:29 +00:00
orbiter
e28bd0d038 fix for some possible causes of memory leaks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7741 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 14:35:32 +00:00
orbiter
09ba6814c0 - non-blocking word hash computation with dynamic digest object generation (this was important!)
- (very) small performance enhancement in did-you-mean


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7740 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 12:58:11 +00:00
orbiter
10e2f588f8 - enhanced ybr ranking computation
- many speed/performance hacks
- added solr charding and new charding web interface
- added option to switch off the yacy index when using solr
- added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr
- refactoring/renaming of some method names to distinguish host/url hashes better
- a large number of bug/npe fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 10:57:02 +00:00
orbiter
bd55dcee50 - commented out experimental distributed ranking loading
- less threads for blocking threads
- disable all threads for DHT transmission for networks with zero peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7737 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-24 21:08:01 +00:00
orbiter
d1dbbd956a always use a template method cache even if the template cache flag is set to false. This flag is only used to make dynamic updates to the template files, to not dynamic updates to the rewrite methods (which is not possible without recompiling). low memory usage is guaranteed by the usage of soft references which are dropped before an OOM is thrown
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7735 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-24 09:31:07 +00:00
orbiter
0d040ff6bb fix for bug 0000036: no crawling of https pages
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7734 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-24 09:14:32 +00:00
orbiter
3ed4a09368 small features, some bug fixes and performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-23 21:08:04 +00:00
orbiter
e55c254f7b enhanced logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7732 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-22 20:12:13 +00:00
orbiter
b45701d20f this is a re-implementation of the YaCy Block Rank feature
This time it works like this:
- each peer provides its ranking information using the yacy/idx.json servlet
- peers with more than 1 GB ram will load this information from all other peers, combine that into one ranking table and store it locally. This happens during the start-up of the peer concurrently. The new generated file with the ranking information is at DATA/INDEX/<network>/QUEUES/hostIndex.blob
- this index is then computed to generate a new fresh ranking table. Peers which can calculate their own ranking table will do that every start-up to get latest feature updates until the feature is stable
- I computed new ranking tables as part of the distribition and commit it here also
- the YBR feature must be enabled manually by setting the YBR value in the ranking servlet to level 15. A default configuration for that is also in the commit but it does not affect your current installation only fresh peers
- a recursive block rank refinement is implemented but disabled at this point. it needs more testing

Please play around with the ranking settings and see if this helped to make search results better.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7729 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-18 14:26:28 +00:00
orbiter
021840e5ba removed (almost) deadlocks and unnecessary CPU load
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7726 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-17 00:00:01 +00:00
orbiter
123375bfba added a new yacy protocol servlet 'idx'. This returns an index to one of the data entities that is stored in YaCy.
This servlet currently only serves for indexes to the web structure hosts. It can be tested by calling
http://localhost:8090/yacy/idx.json?object=host
This yacy protocol servlet is the first one that returns JSON code and that also shows index entries in a readable format. This will make the development of API applications much easier. This is also an example implementation for possible json versions of the other existing YaCy protocol interfaces.

The main purpose of this new feature is to provide a distributed block rank collection feature. Creating a block rank is very difficult if the forward-link data is first collected and then one peer must create a backward-link index. This interface provides already a partial backward index and therefore a collection of all these indexes needs only to be joined which is very easy. The result should be the computation of new block rank tables that all peers can perform.

To reduce load from peers this servlet buffers all data and refreshes it only once in 12 hours. This very slow update cycle is needed because the interface will be called round-robin from all peers once after start-up.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7724 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-15 22:57:31 +00:00
orbiter
1d8b0f74f4 one more fix for SVN 7713
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7716 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 15:31:24 +00:00
orbiter
0960261769 fix for svn 7713
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7715 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 15:20:57 +00:00
orbiter
5b579e21a3 code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7713 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 06:21:40 +00:00
orbiter
039126cfaf better handling of on/off switched solr indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7709 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-08 22:47:20 +00:00
orbiter
9248a4eef4 reduce teh effect of 'Bildersuche findet generierte HTML-Seiten als Bilder'
see http://bugs.yacy.net/view.php?id=9

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7705 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-07 07:37:46 +00:00
orbiter
0621a15f89 fix for wrong search result counter: added a counter for all filtered out entities
see also http://bugs.yacy.net/view.php?id=5

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7704 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-06 23:04:27 +00:00
orbiter
9c33b2fb58 fix for String Matcher in case that no snippet is returned (NPE)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7702 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-05 23:11:03 +00:00
orbiter
76f2817e00 a fix for the snippet computation and hopefully better snippets
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7701 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-05 23:05:38 +00:00
orbiter
deda54d684 - relaxed matching of string-search (this is now case-insensitive)
- added transport of string-search pattern to remote search protocol
- fixed a problem parsing snippets with a '-' inside

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7700 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-05 22:37:06 +00:00
orbiter
6e42d4de88 - added full-String search function: find things that match exactly what is quoted in the query
- re-structuring authentification methods to fix a problem with API steering

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7697 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-05 00:25:14 +00:00
apfelmaennchen
8b8db2aaba YMarks: some small changes/fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7695 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-03 21:21:06 +00:00
apfelmaennchen
441035f1f4 YMarks: some improvements to flexigrid quick search on YMarks.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7694 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-02 20:11:58 +00:00
orbiter
6fa439c82b - refactoring of robots
- added option to crawler to send error-URLs to solr
- changed solr scheme slightly (no multi-value fields where no multi values are)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7693 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-02 14:05:51 +00:00
apfelmaennchen
e7c2ea193b YMark:
- general improvements on importers, especially on auto tagging
- added get_tags (needed for tag clouds etc.)
- improved flexigrid support
- added YMarks.html (not fully working) that will eventually replace Bookmarks.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7691 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-01 21:42:48 +00:00
orbiter
3b578a28ef some patches to prevent that empty or bad IP information is broadcasted
- on client-side: fix bad IP reports from remote Peers by replacing their reported IP with their server IP if the reported IP is bad, broken or disallowed
- on server-side: the same during a peer ping (here the ping'ed server acts also as client during the back-ping) and also when receiving a message or a search where the client sends also its seed. Here the IP is replaced by the client IP if the reported IP is broken or bad

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7687 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 10:58:12 +00:00
orbiter
361841df16 another patch according to http://bugs.yacy.net/view.php?id=26#c36
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7686 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 02:26:50 +00:00
orbiter
37fede9d30 better logic for proper seed ip recognition and better error messages
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7685 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 02:19:13 +00:00
orbiter
8b95a26866 better magic
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7684 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 02:00:37 +00:00
orbiter
2700a58e5a added a magic to the peer ping that will be used in case that the contacting peer requests that it's reported IP shall be used for a back-ping. The back-ping now also returns the same magic which will make it possible that the requested peer can verify that the back-pinged peer is actually the same peer.
This is also a protection against the foced-fake of a external IP: if such an IP was faked, then the next ping from the affected peer to another peer looks like a staticIP report. Such a bad staticIP-by-faked-response can now be discovered and fixed by the peer that gets the second ping after the first ping contained a faked response.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7683 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 01:52:20 +00:00
orbiter
8879cc1db2 removed System.out.println
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7682 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 14:08:02 +00:00
orbiter
f6077b3cc0 added more attributes for html parser and enhanced data structures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7679 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 13:09:01 +00:00
f1ori
0b02083e97 * function for simple crawl of one url
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7678 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 13:04:33 +00:00
f1ori
d671de8c17 add ranking weight to json-search-results
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7677 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 11:18:14 +00:00
orbiter
d8e934c085 better abstraction of http client identification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 13:35:29 +00:00
sixcooler
a3e707283d not using HTTPConnector anymore
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7674 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 11:46:31 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognized much more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as beeing local.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
low012
bc84d2bc9d *) fixed typo in stop script
*) added <u> </u> tags for underlined text in Wiki Code
*) minor code changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-20 22:54:29 +00:00
apfelmaennchen
b2281f0b7d YMark: intermediate work towards flexigrid support
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7670 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-20 22:33:01 +00:00
low012
06d50fd801 *) fixed stupid bug (introduced in r7663 by myself) which caused wrong parsing of Wiki pages
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7669 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-20 17:27:59 +00:00
apfelmaennchen
60412d2bb3 YMark:
- more refactoring >> YMarkEntry
- integration of SurrogateReader as bookmark importer
- various small bug fixes e.g. get_xbel.xml

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7668 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 21:42:14 +00:00
orbiter
3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
- added deletion of solr index in IndexControlRWIs
- added asynchronous adding of large url lists (happens when crawls are startet with file)
- fixed npe in Image display
- replaced language warning with fine logging
- added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups)
- added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list
- added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm)
- fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 16:11:16 +00:00
orbiter
fd3baa9025 fix for http://bugs.yacy.net/view.php?id=24
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7664 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-17 22:37:04 +00:00
low012
2e9694c9e9 *) removed recursion which hopefully prevents exception
*) fixed bug in creation of table of content which caused double entries if a page was previewed more than once

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7663 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-17 21:02:18 +00:00
apfelmaennchen
a2e86daae9 YMark: more bug fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7662 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-16 22:09:50 +00:00
apfelmaennchen
62855f9567 YMark: code clean up and some small fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7661 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-16 21:19:42 +00:00
apfelmaennchen
667e912b19 YMark:
- some improvements to firefox json bookmark importer
- test import with: /api/ymarks/test_import.html
- view ymarks with: /api/ymarks/test_treeview.html


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7660 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-16 09:09:33 +00:00
apfelmaennchen
a0e4960a4d YMark:
- first attempt for a firefox json bookmark importer
- added JSON library json-simple-1.1.jar

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7658 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 20:58:58 +00:00
orbiter
958ff4778e enhanced location search:
search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found.
This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed.

Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search.
Fixed many smaller bugs in connection with location search.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 15:54:19 +00:00
orbiter
19fd13d3bc Added federated index storage to solr.
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.

The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.

federated solr storage is switched off by default.

To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check whats in solr after indexing, open http://localhost:8983/solr/admin/

Until now it is not possible to use the solr index to search with YaCy in that solr index.
This functionality is now available for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
   that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
   want to use solr instead of YaCy.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:05:04 +00:00
orbiter
c17d102bd8 enhanced speed for OrderedScoreMap inc method and size comparisment in concurrent environments
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7653 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-13 22:04:23 +00:00
orbiter
01690eab86 fix for mediawiki importer and wikicode parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7651 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-13 13:22:27 +00:00
orbiter
c5352e6872 added new SearchResult class (to be used later)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-13 06:16:31 +00:00
orbiter
4c013d9088 more UTF8 getBytes() performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7649 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-12 05:02:36 +00:00
apfelmaennchen
78d6d6ca06 refactoring for ymarks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7648 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-08 21:15:10 +00:00
orbiter
a47bdc405b better logging for robinson selection according to peer tag
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7645 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-05 08:04:25 +00:00
orbiter
cafcb1f9ed removed the DNS resolving for web structure computation from the indexing queue and placed it in a concurrent computation queue that does not block the crawler. Makes crawling faster and less DNS-speed-dependent
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7644 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-04 22:01:07 +00:00
orbiter
17530ca7b5 fix for bug http://bugs.yacy.net/view.php?id=10
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7642 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-04 12:20:20 +00:00
orbiter
96c32e87b0 fixes to crawler and new user-agent crawl-delay handling
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7640 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-04 09:47:18 +00:00
orbiter
b2fe4b7b1a added a handling of appearances of yacy bot entries in robots.txt if this entry addresses the yacy peer
(directly or indirectly) and it grants a crawl-delay of 0. Then all forced pause mechanisms in YaCy are switched off and the domain is crawled at full speed.
crawl delay values can be assigned to either
- all yacy peers using the user-agent yacybot
- a specific peer with peer name <peer-name>.yacy or
- a specific peer with peer hash <peer-hash>.yacyh


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7639 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-03 23:39:45 +00:00
orbiter
cb6f709a16 - enhancements in surrogate reading
- better display of map in location search

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7636 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-02 00:11:37 +00:00
low012
1ff9947f91 *) added new user right: extended search right (allows to define users who can query more results than anonymous users)
*) cleaned up code a little bit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-01 23:32:40 +00:00
orbiter
156cf02703 - added an index constraint 'has location' to the condenser
- added evaluation of the 'has location' constraint to search using the /location operator


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7633 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-31 09:41:30 +00:00
orbiter
0430a94eaa the location search shows now not re-evaluated locations but only such locations that are attached as metadata to web pages
- added parser for in-text appearing geo-locations
- added geo-locations to rss search result
- added evaluation of metadata-attached geo-locations in yacysearch_location to show search results within a map


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7631 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-30 23:26:36 +00:00
orbiter
9b25d07295 - added geo information parsing to html parser
- extended metadata information in index with geolocalisation
- added display of location in yacydoc and ViewFile

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7629 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-30 00:49:47 +00:00
f1ori
efcf37a953 * show info in log, if robots.txt is rejected due to wrong mime-type
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7628 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-28 19:55:15 +00:00
low012
16cd919795 *) fixed Exceptions which caused 500 error when entering invalid URL mask or invalid prefer mask, invalid masks are ignored, error message is displayed on yacysearch.html (what about yacysearch.rss and yacysearch.json?)
*) fixed "more options" link on yacysearch.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7623 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-23 00:48:19 +00:00
low012
1a24917cea *) fixed NPE which occured when empty String was entered as search word
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7622 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-23 00:44:38 +00:00
orbiter
b1a8d0c020 enhancements to web cache and less strict caching rules
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7620 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 10:35:26 +00:00
orbiter
f3baaca920 - enhancements to DNS IP caching and crawler speed
- bugfixes (NPEs)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 09:34:10 +00:00
low012
e7860b1239 *) <mode="Homer">D'oh!</Homer>
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7618 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 22:23:20 +00:00
low012
82f1580a60 *) trying to fix ConcurrentModificationException
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7617 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 22:20:19 +00:00
low012
9f0286b380 *) fixed potential "java.lang.IllegalArgumentException: Illegal group reference" which occured if special characters which are also used as metacharacters in regular expression were used inside of <pre>...</pre> (see: http://veerasundar.com/blog/2010/01/java-lang-illegalargumentexception-illegal-group-reference-in-string-replaceall/)
The class still contains a potential ConcurrentModificationException which occurs when the List which contains the elements of the table of content is moified during a recursion of tagReplace(). Will try to fix this later today.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7615 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 18:02:09 +00:00
orbiter
78d4c45d09 enhancement during search process: fast fail of search in case that all index feeder have terminated.
This change should affect filtering and navigators and should cause that search navigation gets faster

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7614 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 13:05:51 +00:00
orbiter
ba03ca8620 added more configuration options for search:
- removed configuration button for 'search only for admin' from index.html and added this to ConfigPortal
- added configuration of link verification options (iffresh, cacheonly, nocache, ifexist) to ConfigPortal
- added configuration of navigation options to ConfigPortal
- added an option to switch off automatic index cleaning in case that a link verification method fails


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7613 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 07:50:34 +00:00
f1ori
e0c7d490f9 * fix bug #6
* exclude signature files from auto-deletion of unknown files in DATA/RELEASE


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7612 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-20 17:59:58 +00:00
orbiter
a50f28e6e7 - fixed missing save operation for peer name change
- fixed import of mediawiki dump files
- added script to add mediawiki dump files

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7609 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-19 23:52:09 +00:00
orbiter
2b5f8585bf performance hack for Balancer and ip address parsing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7608 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-17 21:09:18 +00:00
low012
2861d0888a *) simplified code\n*) fixed potential NumberFormatExceptions
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7600 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-15 01:03:35 +00:00
orbiter
1989ebc24b removed more warnings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7598 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-14 22:52:30 +00:00
orbiter
b62b79675b removed type cast warnings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7594 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-14 21:08:18 +00:00