Commit Graph

7296 Commits

Author SHA1 Message Date
orbiter
49d4f95faf bugfix to latest commit 2014-08-27 00:16:50 +02:00
orbiter
68211f8244 enable Crawler_p servlet if a rss feed or a wiki dump import was
submitted.
2014-08-27 00:15:31 +02:00
orbiter
a65df4ce7e do not push noindex errors into log if in intranet mode. noindex
attributes are attached to artificially constructed index.html files which
list directories. Such files are naturally rejected by the crawler and
should not appear in the error log because these files are part of the
construction of file crawlers and confuse users if they see them in the
error log.
2014-08-27 00:10:51 +02:00
orbiter
688c6d8954 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-27 00:04:36 +02:00
orbiter
4ae7aead28 addon to latest fix 2014-08-27 00:03:49 +02:00
Marc Nause
2af56fa37d Improved UPnP. (still not perfect)
*) set HTTPS port if enabled
*) improved data structures (may not be final)
*) moved UPnP to own package
2014-08-26 22:47:13 +02:00
orbiter
b3ebd38079 removed the HTDOCS repository concept because the concept to host files
on the YaCy http server is obsolete; YaCy can index file:// and smb://
paths
2014-08-26 19:02:53 +02:00
reger
1fdcc2d67b change seedfile upload ip check to allow intranet ip in intranet mode
- this allows setting up a principal peer in an intranet environment
2014-08-25 01:25:22 +02:00
reger
e31b0e6d67 - update javadoc Seed.getIP
- default mySeed.ip to hostip in SeedDB.initMySeed() if Intranetmode
this allows a peer to reach senior status in an intranet-hosted search network with few peers;
otherwise the peer would stay junior because of the default init with the loopback ip as public (dna) ip.
2014-08-24 21:13:36 +02:00
reger
350c6b8250 in IntranetMode allow intranet hosted seedlist with Network_Domain "any"
- so far intranet seedlist hosts are always denied but need to be allowed in intranet mode
2014-08-24 05:20:06 +02:00
orbiter
d68438c3d9 make sure that the postprocessing background thread never dies by any
exception
2014-08-23 10:35:38 +02:00
orbiter
b4f2a1db6e added an unlock icon for all protected pages that are unlocked because
the administrator is logged in.
2014-08-19 19:58:31 +02:00
reger
ea6c9e9b07 reduce mem buffer overhead for gap files during r/w
(they are typically small compared to idx allowing to use smaller buffersize -> set to 16k records)
2014-08-18 00:03:24 +02:00
reger
e88537522d allow single quote " ' " in query
see http://mantis.tokeek.de/view.php?id=379
-add QueryGoal test case for this
2014-08-16 14:29:52 +02:00
orbiter
487021fb0a snippet computation update 2014-08-15 01:17:11 +02:00
orbiter
1c2f1f233a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-14 20:58:05 +02:00
reger
5a4995ded3 fill solr rss writer dc:subject tag with keyword content 2014-08-14 03:06:41 +02:00
orbiter
927aaa95a6 concurrency bugfix 2014-08-13 00:59:11 +02:00
orbiter
c9e593cf78 removed warnings 2014-08-11 23:53:12 +02:00
reger
7584352e7b use more predefined Solr query parameter constants
- use CommonParams and DisMaxParams constants
- fix typo in get sort parameter
- getDocumentCountByParams: redundant implementation with risk of a non-optimized call (row parameter unspecified); removed from interface because it was only used from getCountByQuery
2014-08-10 22:33:10 +02:00
reger
f9db5dd6c5 reduce doublecontent check document (prevent out of memory)
see http://mantis.tokeek.de/view.php?id=437

test result (concurrency=7)
2000 docs = eom always
1000 docs = eom always
100 docs = eom never

chosen -> 200 docs (eom not encountered during test with 1GB mem setting)
2014-08-10 03:18:15 +02:00
reger
e9eae45b55 simplify rssreader and improve atom feed link extraction
- type detection (rss/atom) 
    - init type parameter overwritten during parse, parameter obsolete
    - detection by endtag changed to simpler first-tag evaluation
- channel image not used, removed related extra parser handling
    - remove unused code (set/getImage) in rssfeed
- atom link extraction to account for possible multiple link tags
   - spec limits link to one with rel="alternate" or one without rel attribute
     not accounting for the following type & hreflang exception yet:

   o  atom:entry elements MUST NOT contain more than one atom:link
      element with a rel attribute value of "alternate" that has the
      same combination of type and hreflang attribute values.
2014-08-10 01:29:16 +02:00
reger
a8508417d1 catch NPE during crawl (OAI import)
- condenseDocument mime=null (allowed)
- collectionconfiguration responseheader = null (allowed)
2014-08-08 00:02:59 +02:00
reger
3dde94422f center searchevent lines on network graph
(PerformanceSearch_p.html)
2014-08-06 23:04:42 +02:00
Michael Peter Christen
3860711aef fix for possible interruption of concurrent queries 2014-08-06 12:55:18 +02:00
Michael Peter Christen
6344718f8b reducing the concurrent query stack size and reduced concurrency of
postprocessing to avoid OOM situations
2014-08-06 12:36:59 +02:00
Michael Peter Christen
eca9380e3d bugfix for crawler double-check: if a url is redirected, the
redirect-target was not double-checked. This is now done by replacing
the redirect-URL on the crawl queue again (where it is double-checked)
2014-08-06 12:35:12 +02:00
Michael Peter Christen
9ac0c93f17 fix for subpath crawl filter 2014-08-06 01:33:24 +02:00
Michael Peter Christen
66106bdaf0 fix for crawler attribute maxdompages 2014-08-05 21:32:25 +02:00
Michael Peter Christen
49d91b94c3 npe fix in crawler 2014-08-05 21:31:59 +02:00
Michael Peter Christen
b7183a7321 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-05 09:54:18 +02:00
reger
ea2e627662 fix ConfigAccounts del user with uppercase letter in name
(usernames are case sensitive, userdb.delete used toLower)
2014-08-05 01:27:27 +02:00
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
191ec8c82a added concurrency to postprocess rewrite process 2014-08-04 15:28:58 +02:00
Michael Peter Christen
a1e8bdd5e9 log ppm instead of docs/second 2014-08-04 14:44:42 +02:00
Michael Peter Christen
cc0ded7abd set process type of web graph according to fields as defined in the
schema
2014-08-04 14:44:20 +02:00
Michael Peter Christen
12fb9d7cd1 log postprocessing constraints in case that postprocessing is not
performed
2014-08-04 14:19:37 +02:00
Michael Peter Christen
3c23b89823 less logging 2014-08-04 13:37:34 +02:00
Michael Peter Christen
a0c53174c5 better solr query logging to detect unnecessary sort requests for more
performance profiling
2014-08-04 13:00:45 +02:00
Michael Peter Christen
338f574bdc no sorting if http/www unique fields are not demanded (makes query
faster) and some code restructuring
2014-08-04 12:59:38 +02:00
Michael Peter Christen
1609763be5 toString fix 2014-08-04 12:58:39 +02:00
Michael Peter Christen
b983e68254 more retries, less sleep 2014-08-04 08:29:35 +02:00
Michael Peter Christen
1503ba7794 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-04 08:24:31 +02:00
reger
8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
(e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)
2014-08-04 02:38:58 +02:00
Michael Peter Christen
0ceeceb35e more logic on Solr queries; usage of the query terms in postprocessing,
saving one query for double document detection now per document
2014-08-04 02:35:38 +02:00
orbiter
38864ae004 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-03 22:44:49 +02:00
orbiter
4099296b45 added new classes which shall reduce call overhead to Solr (stub) 2014-08-03 22:44:22 +02:00
reger
d0c02e1de7 adjust rss lat/lon to double
(common format across other classes)
2014-08-03 20:09:23 +02:00
orbiter
3491ab4c38 removed unused images from webgraph edge computation 2014-08-01 13:21:16 +02:00
orbiter
2371d6b8db target linktexts must be string to enable search facets on these fields 2014-08-01 13:20:25 +02:00