9001 Commits

Author SHA1 Message Date
Michael Peter Christen
342543a6c4 fix for host browser 2012-10-25 10:23:43 +02:00
Michael Peter Christen
85ca07b90e when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
2012-10-25 10:20:55 +02:00
Michael Peter Christen
906e51214a the web structure image shows the pivot dot in a different color 2012-10-25 10:18:28 +02:00
Michael Peter Christen
b3ffcde0c7 - prepared PngEncoder for concurrency: PixelGrabber.grabPixels is the
main time-consuming process. This shall be done concurrently.
- added concurrent processes to call the PixelGrabber and a framework to
do that (queues)

It is now possible to create 4k images (3840x2160), e.g. with the Network
Graphics servlet
2012-10-24 02:08:51 +02:00
Michael Peter Christen
e9c6f4ce2e - new order of data computation: first compute the size of
compressed deflater output, then assign an exact-sized byte[] which
makes resizing afterwards superfluous
- after all enhancements all class objects were removed; result is just
one short static method
- made objects final where possible
2012-10-24 00:41:09 +02:00
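Note on e9c6f4ce2e: the "size first, then exact-sized byte[]" order described above can be sketched with java.util.zip.Deflater. This is only an illustration under assumptions, not the code from the commit: the class and method names are hypothetical, and this sketch obtains the size by running the deflater a second time, which the commit may well avoid.

```java
import java.util.zip.Deflater;

final class ExactSizeDeflate {

    // Hypothetical helper: runs the deflater once only to count the output bytes.
    private static int compressedSize(final byte[] input, final int level) {
        final Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        final byte[] scratch = new byte[4096]; // reused buffer, contents are discarded
        int size = 0;
        while (!deflater.finished()) {
            size += deflater.deflate(scratch);
        }
        deflater.end();
        return size;
    }

    // Compresses into an array that has exactly the final size, so no
    // resizing or copying is needed afterwards.
    static byte[] compress(final byte[] input, final int level) {
        final byte[] out = new byte[compressedSize(input, level)];
        final Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        int written = 0;
        while (!deflater.finished() && written < out.length) {
            written += deflater.deflate(out, written, out.length - written);
        }
        deflater.end();
        return out;
    }
}
```

The trade-off in this sketch is CPU time for memory: the data is compressed twice, but no oversized buffer is ever allocated.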
orbiter
c6a1b21399 added a 9-year-old png encoder from David Eisenberg which I rewrote
quite a bit to remove all code that handles transparency. With this
highly specialized png writer it is possible to write png images much
faster than with the JRE built-in png writer.
In a second step it may be possible to add concurrency to increase
computation speed further.
2012-10-23 23:27:41 +02:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
orbiter
59bf4677b6 added option to view the complete directory structure in host browser 2012-10-23 19:02:55 +02:00
Michael Peter Christen
b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-23 18:14:58 +02:00
Michael Peter Christen
7602fce0b9 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-23 18:12:48 +02:00
Michael Peter Christen
ea11a1efea fix for highlighting in gsa search 2012-10-23 18:11:49 +02:00
Michael Peter Christen
9eaede50e7 enhanced web structure images 2012-10-23 18:11:19 +02:00
Michael Peter Christen
b7ac1da6a3 gsa results shall have only one title in metadata and that should be the
visible title in the <title>-tag
2012-10-23 18:03:12 +02:00
sixcooler
206e7bcf94 whitelist yacyportalsearch aka search.yacy.net 2012-10-23 03:49:27 +02:00
Michael Peter Christen
ae6feb5610 showing the web structure graph as animation in the crawl monitor 2012-10-23 02:50:26 +02:00
reger
87aab9aa7c - fix: with augmented parsing = on, metadata (like title) was missing in the index because it was overwritten when multiple result docs with the same url were added from augmentparser
- fix Document.addsubdocuments: sections might be initialized via Arrays.asList, which does not support the .addAll method that is used
   see e.g. http://kamleshkr.wordpress.com/2010/02/17/inside-java-arrays-aslistt-a/
2012-10-22 22:48:35 +02:00
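Note on 87aab9aa7c: the list returned by Arrays.asList is a fixed-size view of the array, so addAll throws UnsupportedOperationException. A minimal demonstration follows; the class and method names are made up for this note and are not taken from YaCy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AsListAddAllDemo {

    // Returns true if the given list rejects addAll.
    static boolean rejectsAddAll(final List<String> target) {
        try {
            target.addAll(Arrays.asList("c", "d"));
            return false;
        } catch (final UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(final String[] args) {
        // Fixed-size view backed by the array: cannot grow.
        final List<String> fixedSize = Arrays.asList("a", "b");
        // Copy into an ArrayList: can grow.
        final List<String> growable = new ArrayList<>(Arrays.asList("a", "b"));
        System.out.println(rejectsAddAll(fixedSize)); // prints "true"
        System.out.println(rejectsAddAll(growable));  // prints "false"
    }
}
```

Copying into a new ArrayList, as in the second case, is the usual way to get a list that accepts addAll.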
Michael Peter Christen
39317a6c66 enhanced webstructure image: introduced
- multiple hosts can be listed (comma-separated) as host argument
- new 'bf' attribute (branch factor): the maximum number of edges per
node
- the bf-value is computed automatically
- ordering of nodes when the graphic is drawn: mostly the drawing ends
with a limitation, e.g. the number of nodes. When this happens, it should be
ensured that more 'interesting' nodes are painted in advance. This is
now done by sorting all nodes by the number of links they have in the
distant sub-graph.
2012-10-22 16:23:39 +02:00
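Note on 39317a6c66: the node ordering described above can be sketched as "sort by link count, descending, then stop at the node limit". The map of link counts, the class name and the method name are assumptions made for this illustration; the actual YaCy data structures are not shown in the commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class NodeOrdering {

    // linkCounts: hypothetical map from node (host) name to its number of links.
    // Returns at most maxNodes names, most-linked first.
    static List<String> nodesToDraw(final Map<String, Integer> linkCounts, final int maxNodes) {
        final List<Map.Entry<String, Integer>> entries = new ArrayList<>(linkCounts.entrySet());
        entries.sort((a, b) -> {
            final int byLinks = Integer.compare(b.getValue(), a.getValue()); // descending
            return byLinks != 0 ? byLinks : a.getKey().compareTo(b.getKey()); // stable tie-break
        });
        final List<String> selected = new ArrayList<>();
        for (final Map.Entry<String, Integer> e : entries) {
            if (selected.size() >= maxNodes) break; // drawing limit reached
            selected.add(e.getKey());
        }
        return selected;
    }
}
```

With this order, a limit on the number of nodes cuts off the least-linked nodes first.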
sixcooler
47ae7e322e smaller dhtDispatcher.cloudSize
@Orbiter: we talked about this some time ago - please revert if I'm wrong
2012-10-21 20:05:28 +02:00
sixcooler
57ddd63888 do not hold an expensive cache of references for DHT-out, but load them
on demand
see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4530
2012-10-21 20:00:36 +02:00
reger
1dc6482feb format crawler timeout output string in seconds (was days) 2012-10-21 03:00:05 +02:00
Michael Peter Christen
ef937af35d more custom field usage in gsa search result 2012-10-18 15:26:55 +02:00
Michael Peter Christen
ea27d2e5f6 fixed more getSolrFieldName usages 2012-10-18 15:21:05 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document the core of YaCy
documents, which would allow many conversions to be removed. On the
way to this target the equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unnecessary conversions and should
lower memory usage during indexing.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
7f71dfab03 added a HostBrowser.xml api file and changed a bit of attribute naming 2012-10-18 11:42:13 +02:00
Michael Peter Christen
b400fc7b4d fix for file parser problem 2012-10-17 18:06:44 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which were then mixed with remote RWIs. Now these Solr documents are
fed into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
6017691522 added an exception catch 2012-10-17 13:56:11 +02:00
Michael Peter Christen
68c7ed5ce9 added a shell script which can be used to delete the api action steering
table. This may be necessary if the api is called by remote command and
the recordings are not used. Then they can be deleted frequently by
calling this clear command using a cron job
2012-10-17 00:44:16 +02:00
Michael Peter Christen
ed803708ab added a shell script which can be used to add an rss feed to the index.
All pages linked in the rss feed are added. The process is not repeated
automatically. If you want to repeat this, add the command to a cron
job.
2012-10-17 00:31:59 +02:00
Michael Peter Christen
5d16c23a1f specified more URIMetadata as URIMetadataNode 2012-10-16 18:26:21 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
cc98496ff3 enhanced the HostBrowser:
- showing also outbound links to other domains if there are any
- the outbound links browser shows also the link structure image
- showing even inbound links if the web structure graph has information
about that
- removed the left menu and made the HostBrowser a part of the top menu
for search
- moved the file search also to the top menu
- added hover information in the HostBrowser to explain what the click
means
- because the HostBrowser also links to the Metadata viewer ViewFile,
there should be a button to switch back to the HostBrowser: added that
also.
2012-10-16 17:13:18 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
4023d88b0b added date info in parser errors 2012-10-15 10:57:36 +02:00
Michael Peter Christen
1b02408936 use less cache 2012-10-11 14:32:37 +02:00
Michael Peter Christen
e45a3235e0 default cache size was much too high; decreased solr cache size 2012-10-11 12:03:48 +02:00
Michael Peter Christen
613cf7da7f enhancement to post argument parsing - possible fix to zero-filled
parameter values
2012-10-11 10:46:06 +02:00
Michael Peter Christen
36c13ed15b less solr prefetch 2012-10-11 10:17:05 +02:00
Michael Peter Christen
f3fc8eac80 fixed clear scripts 2012-10-11 10:16:37 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
53789555b9 fix for crawl start filter 2012-10-10 10:40:32 +02:00
Michael Peter Christen
abebb3b124 added a crawl start checker which makes a simple analysis on the list of
all given urls: shows if the url can be loaded and if there is a robots
and/or a sitemap.
2012-10-10 02:02:17 +02:00
Michael Peter Christen
941873fba4 moved the index deletion functions from IndexControlRWIs to
IndexControlURLs where it appears more naturally. Because the RWI
administration is less important in the presence of Solr, the
IndexControlURL is now the default servlet when the Index Administration
button on the main menu is selected.
2012-10-10 00:09:27 +02:00
orbiter
ae246c30c3 fixed interpretation of directDocByURL attribute during crawl start 2012-10-09 23:11:31 +02:00
orbiter
68d0f8de03 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-10-09 20:36:32 +02:00
reger
bfb0d4c69b - add language detection from <html lang="xx"> tag
- add jaudiotagger jar to Netbeans-IDE project classpath
2012-10-09 20:02:58 +02:00
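Note on bfb0d4c69b: reading the language from the lang attribute of the html tag could look roughly like the following. This is a regex-based sketch for illustration only; the class name, method name and pattern are assumptions, and the YaCy parser may work quite differently (for example on parsed tags rather than raw text).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlLangSniffer {

    // Matches the primary language subtag of a lang attribute inside the <html ...> tag.
    // Note: this simple pattern also accepts xml:lang="..".
    private static final Pattern HTML_LANG = Pattern.compile(
            "<html\\b[^>]*?\\blang\\s*=\\s*[\"']?([A-Za-z]{2,3})",
            Pattern.CASE_INSENSITIVE);

    // Returns the lower-cased language code, or null if no lang attribute is found.
    static String detect(final String html) {
        final Matcher m = HTML_LANG.matcher(html);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }
}
```

For example, detect("<html lang=\"de\">") yields "de", and a region suffix such as in lang='en-US' is dropped, leaving "en".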
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes() 2012-10-09 12:14:28 +02:00
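Note on a06930662c: String.getBytes() without an argument uses the platform default charset, so the resulting bytes can differ between machines. The sketch below shows the general idea with the JDK's StandardCharsets (available since Java 7); YaCy's own UTF8/ASCII helper classes are not reproduced here, and the class and method names are made up for this note.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetBytes {

    // Always encodes as UTF-8, independent of the platform default charset.
    static byte[] utf8(final String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes as US-ASCII; characters outside ASCII are replaced by '?'.
    static byte[] ascii(final String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }
}
```

For instance, utf8("\u00e4") always returns two bytes, whereas the argument-less getBytes() could return one byte on a system whose default charset is ISO-8859-1.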