A topic-centric list of high-quality open datasets.

Awesome Public Datasets
=======================
.. image:: https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg
   :alt: Awesome
   :target: https://github.com/sindresorhus/awesome
.. image:: https://travis-ci.org/caesar0301/awesome-public-datasets.svg
   :target: https://travis-ci.org/caesar0301/awesome-public-datasets

`This list of public data sources <https://github.com/caesar0301/awesome-public-datasets>`_
is collected and tidied from blogs, answers, and user responses.
Most of the data sets listed below are free; however, some are not.
Other amazingly awesome lists can be found in the
`awesome-awesomeness <https://github.com/bayandin/awesome-awesomeness>`_ and
`sindresorhus's awesome <https://github.com/sindresorhus/awesome>`_ lists.

Contents
----------
.. contents::

Agriculture
------------
* `U.S. Department of Agriculture's PLANTS Database <http://www.plants.usda.gov/dl_all.html>`_


Biology
-------

* `1000 Genomes <http://www.1000genomes.org/data>`_
* `American Gut (Microbiome Project) <https://github.com/biocore/American-Gut>`_
* `Cell Image Library <http://www.cellimagelibrary.org>`_
* `Collaborative Research in Computational Neuroscience (CRCNS) <http://crcns.org/data-sets>`_
* `EBI ArrayExpress <http://www.ebi.ac.uk/arrayexpress/>`_
* `EBI Protein Data Bank in Europe <http://www.ebi.ac.uk/pdbe/emdb/index.html/>`_
* `ENCODE project <https://www.encodeproject.org>`_
* `Ensembl Genomes <http://ensemblgenomes.org/info/genomes>`_
* `Gene Expression Omnibus (GEO) <http://www.ncbi.nlm.nih.gov/geo/>`_
* `Gene Ontology (GO) <http://geneontology.org/page/download-annotations>`_
* `Global Biotic Interactions (GloBI) <https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data>`_
* `Harvard Medical School (HMS) LINCS Project <http://lincs.hms.harvard.edu>`_
* `Human Microbiome Project (HMP) <http://www.hmpdacc.org/reference_genomes/reference_genomes.php>`_
* `ICOS PSP Benchmark <http://ico2s.org/datasets/psp_benchmark.html>`_
* `Journal of Cell Biology DataViewer <http://jcb-dataviewer.rupress.org>`_
* `MIT Cancer Genomics Data <http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi>`_
* `NIH Microarray data <http://bit.do/VVW6>`_ or `FTP <ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/>`_
* `OpenSNP genotypes data <https://opensnp.org/>`_
* `Pathguide - Protein-Protein Interactions Catalog <http://www.pathguide.org/>`_
* `Protein Data Bank <http://www.rcsb.org/>`_
* `PubChem Project <https://pubchem.ncbi.nlm.nih.gov/>`_
* `PubGene (now Coremine Medical) <http://www.pubgene.org/>`_
* `Sequence Read Archive (SRA) <http://www.ncbi.nlm.nih.gov/Traces/sra/>`_
* `Stanford Microarray Data <http://smd.stanford.edu/>`_
* `Stowers Institute Original Data Repository <http://www.stowers.org/research/publications/odr>`_
* `Systems Science of Biological Dynamics (SSBD) Database <http://ssbd.qbic.riken.jp>`_
* `The Catalogue of Life <http://www.catalogueoflife.org/content/annual-checklist-archive>`_
* `The Personal Genome Project <http://www.personalgenomes.org/>`_ or `PGP <https://my.pgp-hms.org/public_genetic_data>`_
* `UCSC Public Data <http://hgdownload.soe.ucsc.edu/downloads.html>`_
* `UniGene <http://www.ncbi.nlm.nih.gov/unigene>`_


Climate/Weather
---------------

* `Australian Weather <http://www.bom.gov.au/climate/dwo/>`_
* `Brazilian Weather - Historical data (In Portuguese) <http://sinda.crn2.inpe.br/PCD/SITE/novo/site/>`_
* `Canadian Meteorological Centre <http://weather.gc.ca/grib/index_e.html>`_
* `Climate Data from UEA (updated monthly) <https://crudata.uea.ac.uk/cru/data/temperature/#datter>`_ and `NOAA FTP <ftp://ftp.cmdl.noaa.gov/>`_
* `European Climate Assessment & Dataset <http://eca.knmi.nl/>`_
* `Global Climate Data Since 1929 <http://en.tutiempo.net/climate>`_
* `NASA Global Imagery Browse Services <https://wiki.earthdata.nasa.gov/display/GIBS>`_
* `NOAA Bering Sea Climate <http://www.beringclimate.noaa.gov/>`_
* `NOAA Climate Datasets <http://www.ncdc.noaa.gov/data-access/quick-links>`_
* `NOAA Realtime Weather Models <http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction>`_
* `The World Bank Open Data Resources for Climate Change <http://data.worldbank.org/developers/climate-data-api>`_
* `UEA Climatic Research Unit <http://www.cru.uea.ac.uk/data>`_
* `WorldClim - Global Climate Data <http://www.worldclim.org>`_
* `WU Historical Weather Worldwide <https://www.wunderground.com/history/index.html>`_


Complex Networks
----------------

* `CrossRef DOI URLs <https://archive.org/details/doi-urls>`_
* `DBLP Citation dataset <https://kdl.cs.umass.edu/display/public/DBLP>`_
* `NBER Patent Citations <http://nber.org/patents/>`_
* `NIST complex networks data collection <http://math.nist.gov/~RPozo/complex_datasets.html>`_
* `Protein-protein interaction network <http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm>`_
* `PyPI and Maven Dependency Network <https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/>`_
* `Scopus Citation Database <https://www.elsevier.com/solutions/scopus>`_
* `Small Network Data <http://www-personal.umich.edu/~mejn/netdata/>`_
* `Stanford GraphBase (Steven Skiena) <http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml>`_
* `Stanford Large Network Dataset Collection <http://snap.stanford.edu/data/>`_
* `The Koblenz Network Collection <http://konect.uni-koblenz.de/>`_
* `The Laboratory for Web Algorithmics (UNIMI) <http://law.di.unimi.it/datasets.php>`_
* `The Nexus Network Repository <http://nexus.igraph.org/>`_
* `UCI Network Data Repository <https://networkdata.ics.uci.edu/resources.php>`_
* `UFL sparse matrix collection <http://www.cise.ufl.edu/research/sparse/matrices/>`_
* `WSU Graph Database <http://www.eecs.wsu.edu/mgd/gdb.html>`_
* `Stanford Longitudinal Network Data Sources <http://stanford.edu/group/sonia/dataSources/index.html>`_


Computer Networks
-----------------

* `3.5B Web Pages from CommonCrawl 2012 <http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us>`_
* `53.5B Web clicks of 100K users in Indiana Univ. <http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/>`_
* `CAIDA Internet Datasets <http://www.caida.org/data/overview/>`_
* `ClueWeb09 - 1B web pages <http://lemurproject.org/clueweb09/>`_
* `ClueWeb12 - 733M web pages <http://lemurproject.org/clueweb12/>`_
* `CommonCrawl Web Data over 7 years <http://commoncrawl.org/the-data/get-started/>`_
* `CRAWDAD Wireless datasets from Dartmouth Univ. <https://crawdad.cs.dartmouth.edu/>`_
* `Criteo click-through data <http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/>`_
* `Open Mobile Data by MobiPerf <https://console.developers.google.com/storage/openmobiledata_public/>`_
* `UCSD Network Telescope, IPv4 /8 net <http://www.caida.org/projects/network_telescope/>`_


Contextual Data
---------------

* `Context-aware data sets from five domains <http://students.depaul.edu/~yzheng8/DataSets.html#Data>`_ or `GitHub <https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets>`_


Data Challenges
---------------

* `Challenges in Machine Learning <http://www.chalearn.org/>`_
* `CrowdANALYTIX dataX <http://data.crowdanalytix.com>`_
* `D4D Challenge of Orange <http://www.d4d.orange.com/en/home>`_
* `DrivenData Competitions for Social Good <http://www.drivendata.org/>`_
* `ICWSM Data Challenge (since 2009) <http://icwsm.cs.umbc.edu/>`_
* `Kaggle Competition Data <https://www.kaggle.com/>`_
* `KDD Cup by Tencent 2012 <http://www.kddcup2012.org/>`_
* `Localytics Data Visualization Challenge <https://github.com/localytics/data-viz-challenge>`_
* `Netflix Prize <http://www.netflixprize.com/leaderboard>`_
* `Space Apps Challenge <https://2015.spaceappschallenge.org>`_
* `Telecom Italia Big Data Challenge <https://dandelion.eu/datamine/open-big-data/>`_
* `Yelp Dataset Challenge <http://www.yelp.com/dataset_challenge>`_


Economics
---------

* `American Economic Association (AEA) <https://www.aeaweb.org/RFE/toc.php?show=complete>`_
* `EconData from UMD <http://inforumweb.umd.edu/econdata/econdata.html>`_
* `Economic Freedom of the World Data <http://www.freetheworld.com/datasets_efw.html>`_
* `Historical MacroEconomic Statistics <http://www.historicalstatistics.org/>`_
* `International Trade Statistics <http://www.econostatistics.co.za/>`_
* `Internet Product Code Database <http://www.upcdatabase.com/>`_
* `Joint External Debt Data Hub <http://www.jedh.org/>`_
* `Jon Haveman International Trade Data Links <http://www.macalester.edu/research/economics/PAGE/HAVEMAN/Trade.Resources/TradeData.html>`_
* `OpenCorporates Database of Companies in the World <https://opencorporates.com/>`_
* `Our World in Data <http://ourworldindata.org/>`_
* `SciencesPo World Trade Gravity Datasets <http://econ.sciences-po.fr/thierry-mayer/data>`_
* `The Atlas of Economic Complexity <http://atlas.cid.harvard.edu>`_
* `The Center for International Data <http://cid.econ.ucdavis.edu>`_
* `The Observatory of Economic Complexity <http://atlas.media.mit.edu/en/>`_
* `UN Commodity Trade Statistics <http://comtrade.un.org/db/>`_
* `UN Human Development Reports <http://hdr.undp.org/en>`_


Energy
------

* `AMPds <http://ampds.org/>`_
* `BLUEd <http://nilm.cmubi.org/>`_
* `COMBED <http://combed.github.io/>`_
* `Dataport <https://dataport.pecanstreet.org/>`_
* `ECO <http://www.vs.inf.ethz.ch/res/show.html?what=eco-data>`_
* `EIA <http://www.eia.gov/electricity/data/eia923/>`_
* `HFED <http://hfed.github.io/>`_
* `iAWE <http://iawe.github.io/>`_
* `Plaid <http://plaidplug.com/>`_
* `REDD <http://redd.csail.mit.edu/>`_
* `UK-Dale <http://www.doc.ic.ac.uk/~dk3810/data/>`_


Finance
-------

* `CBOE Futures Exchange <http://cfe.cboe.com/Data/>`_
* `Google Finance <https://www.google.com/finance>`_
* `Google Trends <http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0>`_
* `NASDAQ <https://data.nasdaq.com/>`_
* `OANDA <http://www.oanda.com/>`_
* `OSU Financial data <http://fisher.osu.edu/fin/fdf/osudata.htm>`_
* `Quandl <https://www.quandl.com/>`_
* `St Louis Federal <https://research.stlouisfed.org/fred2/>`_
* `Yahoo Finance <http://finance.yahoo.com/>`_


Geology
-------

* `Earth Models <http://www.earthmodels.org/>`_
* `Smithsonian Institution Global Volcano and Eruption Database <http://volcano.si.edu/>`_
* `USGS Earthquake Archives <http://earthquake.usgs.gov/earthquakes/search/>`_


GeoSpace/GIS
------------

* `BODC - marine data of ~22K vars <http://www.bodc.ac.uk/data/where_to_find_data/>`_
* `Cambridge, MA, US, GIS data on GitHub <http://cambridgegis.github.io/gisdata.html>`_
* `EOSDIS - NASA's earth observing system data <http://sedac.ciesin.columbia.edu/data/sets/browse>`_ 
* `Factual Global Location Data <https://www.factual.com/>`_
* `Geo Spatial Data from ASU <http://geodacenter.asu.edu/datalist/>`_
* `Geo Wiki Project - Citizen-driven Environmental Monitoring <http://geo-wiki.org/>`_
* `GeoNames Worldwide <http://www.geonames.org/>`_
* `Global Administrative Areas Database (GADM) <http://www.gadm.org/>`_
* `International Institute for Applied Systems Analysis (IIASA) - GIS Datasets <http://www.iiasa.ac.at/web/home/research/modelsData/Models--Tools--Data.en.html>`_
* `Landsat 8 on AWS <https://aws.amazon.com/public-data-sets/landsat/>`_
* `List of all countries in all languages <https://github.com/umpirsky/country-list>`_
* `Natural Earth - vectors and rasters of the world <http://www.naturalearthdata.com/>`_
* `OpenAddresses <http://openaddresses.io/>`_
* `OpenStreetMap (OSM) <http://wiki.openstreetmap.org/wiki/Downloading_data>`_
* `Reverse Geocoder using OSM data <https://github.com/kno10/reversegeocode>`_ & `additional high-resolution data files <http://data.ub.uni-muenchen.de/61/>`_
* `TIGER/Line - U.S. boundaries and roads <http://www.census.gov/geo/maps-data/data/tiger-line.html>`_
* `TwoFishes - Foursquare's coarse geocoder <https://github.com/foursquare/twofishes>`_
* `TZ Timezones shapefiles <http://efele.net/maps/tz/world/>`_
* `UN Environmental Data <http://geodata.grid.unep.ch/>`_
* `World countries in multiple formats <https://github.com/mledoze/countries>`_


Government
----------

* `Alberta, Province of Canada <http://open.alberta.ca>`_
* `Antwerp, Belgium <http://opendata.antwerpen.be/datasets>`_
* `Argentina (non official) <http://datar.noip.me/>`_
* `Argentina <http://datos.argentina.gob.ar/>`_
* `Austin, TX, US <https://data.austintexas.gov/>`_
* `Australia (abs.gov.au) <http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument>`_
* `Australia (data.gov.au) <https://data.gov.au/>`_
* `Austria (data.gv.at) <https://www.data.gv.at/>`_
* `Baton Rouge, LA, US <https://data.brla.gov/>`_
* `Belgium <http://data.gov.be/>`_
* `Brazil <http://dados.gov.br/dataset>`_
* `Buenos Aires, Argentina <http://data.buenosaires.gob.ar/>`_
* `Calgary, AB, Canada <https://data.calgary.ca/OpenData/Pages/DatasetListingAlphabetical.aspx>`_
* `Cambridge, MA, US <https://data.cambridgema.gov/>`_
* `Canada <http://open.canada.ca/en?lang=En&n=5BCD274E-1>`_
* `Chicago <https://data.cityofchicago.org/>`_
* `Dallas Open Data <https://www.dallasopendata.com/>`_
* `DataBC - data from the Province of British Columbia <http://www.data.gov.bc.ca/>`_ 
* `Denver Open Data <http://data.denvergov.org//>`_
* `Durham, NC Open Data <https://opendurham.nc.gov/explore/>`_
* `Edmonton, AB, Canada <https://data.edmonton.ca/>`_
* `England LGInform <http://lginform.local.gov.uk/>`_
* `EuroStat <http://ec.europa.eu/eurostat/data/database>`_
* `FedStats <http://fedstats.sites.usa.gov/>`_
* `Finland <https://www.opendata.fi/en>`_
* `France <https://www.data.gouv.fr/en/datasets/>`_
* `Fredericton, NB, Canada <http://www.fredericton.ca/en/citygovernment/Catalogue.asp>`_
* `Gatineau, QC, Canada <http://www.gatineau.ca/donneesouvertes/default_fr.aspx>`_
* `Germany <https://www-genesis.destatis.de/genesis/online>`_
* `Ghent, Belgium <https://data.stad.gent/datasets>`_
* `Glasgow, Scotland, UK <https://data.glasgow.gov.uk/>`_
* `Guardian world governments <http://www.guardian.co.uk/world-government-data>`_
* `Halifax, NS, Canada <http://www.halifax.ca/opendata/index.php>`_
* `Helsinki Region, Finland <http://www.hri.fi/en/>`_
* `Houston Open Data <http://data.ohouston.org>`_
* `Indian Government Data <https://data.gov.in/>`_
* `Indonesian Data Portal <http://data.go.id/>`_
* `Laval, QC, Canada <http://www.laval.ca/Pages/Fr/Citoyens/donnees.aspx>`_
* `London Datastore, UK <http://data.london.gov.uk/dataset>`_
* `London, ON, Canada <http://www.london.ca/city-hall/open-data/Pages/default.aspx>`_
* `Los Angeles Open Data <https://data.lacity.org/>`_
* `MassGIS, Massachusetts, U.S. <http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/>`_
* `Mexico <http://catalogo.datos.gob.mx/dataset>`_
* `Mississauga, ON, Canada <http://www.mississauga.ca/portal/residents/publicationsopendatacatalogue>`_
* `Moncton, NB, Canada <http://www.moncton.ca/Government/Terms_of_use/Open_Data_Purpose/Data_Catalogue.htm>`_
* `Montreal, QC, Canada <http://donnees.ville.montreal.qc.ca/>`_
* `Netherlands <https://data.overheid.nl/>`_
* `New Zealand <http://www.stats.govt.nz/browse_for_stats.aspx>`_
* `NYC betanyc <http://betanyc.us/>`_
* `NYC Open Data <https://nycplatform.socrata.com/>`_
* `OECD <https://data.oecd.org/>`_
* `Oklahoma <https://data.ok.gov/>`_
* `Open Government Data (OGD) Platform India <https://data.gov.in/>`_
* `Oregon <https://data.oregon.gov/>`_
* `Ottawa, ON, Canada <http://data.ottawa.ca/en/>`_
* `Portland, Oregon <https://www.portlandoregon.gov/28130>`_
* `Puerto Rico Government <https://data.pr.gov//>`_
* `Quebec City, QC, Canada <http://donnees.ville.quebec.qc.ca/>`_
* `Quebec Province of Canada <http://donnees.gouv.qc.ca/>`_
* `Regina SK, Canada <http://open.regina.ca/>`_
* `Rio de Janeiro, Brazil <http://data.rio.rj.gov.br/>`_ 
* `Romania <http://data.gov.ro/>`_
* `Russia <http://data.gov.ru>`_
* `San Francisco Data sets <http://datasf.org/>`_
* `Saskatchewan, Province of Canada <http://opendatask.ca/data/>`_
* `Seattle <https://data.seattle.gov/>`_
* `Singapore Government Data <https://data.gov.sg/>`_
* `South Africa <http://beta2.statssa.gov.za/>`_
* `South Africa Trade Statistics <http://www.econostatistics.co.za/>`_
* `State of Utah, US <https://opendata.utah.gov/>`_
* `Switzerland <http://www.opendata.admin.ch/>`_
* `Texas Open Data <https://data.texas.gov/>`_
* `The World Bank <http://wdronline.worldbank.org/>`_
* `Toronto, ON, Canada <http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=1a66e03bb8d1e310VgnVCM10000071d60f89RCRD>`_
* `U.K. Government Data <http://data.gov.uk/data>`_
* `U.S. American Community Survey <http://www.census.gov/acs/www/data_documentation/data_release_info/>`_
* `U.S. CDC Public Health datasets <http://www.cdc.gov/nchs/data_access/ftp_data.htm>`_
* `U.S. Census Bureau <http://www.census.gov/data.html>`_
* `U.S. Department of Housing and Urban Development (HUD) <http://www.huduser.gov/portal/datasets/pdrdatas.html>`_
* `U.S. Federal Government Agencies <http://www.data.gov/metrics>`_
* `U.S. Federal Government Data Catalog <http://catalog.data.gov/dataset>`_
* `U.S. Food and Drug Administration (FDA) <https://open.fda.gov/index.html>`_
* `U.S. National Center for Education Statistics (NCES) <http://nces.ed.gov/>`_
* `U.S. Open Government <http://www.data.gov/open-gov/>`_
* `UK 2011 Census Open Atlas Project <http://www.alex-singleton.com/r/2013/02/05/2011-census-open-atlas-project/>`_
* `United Nations <http://data.un.org/>`_
* `Uruguay <https://catalogodatos.gub.uy/>`_
* `Vancouver, BC Open Data Catalog <http://data.vancouver.ca/datacatalogue/>`_
* `Victoria, BC, Canada <http://www.victoria.ca/EN/main/city/open-data-catalogue.html>`_


Healthcare
----------

* `EHDP Large Health Data Sets <http://www.ehdp.com/vitalnet/datasets.htm>`_
* `Gapminder World demographic databases <http://www.gapminder.org/data/>`_
* `Medicare Coverage Database (MCD), U.S. <https://www.cms.gov/medicare-coverage-database/>`_
* `Medicare Data Engine of medicare.gov Data <https://data.medicare.gov/>`_
* `Medicare Data File <http://go.cms.gov/19xxPN4>`_
* `MeSH, the vocabulary thesaurus used for indexing articles for PubMed <https://www.nlm.nih.gov/mesh/filelist.html>`_
* `Number of Ebola Cases and Deaths in Affected Countries (2014) <https://data.hdx.rwlabs.org/dataset/ebola-cases-2014>`_
* `Open-ODS (structure of the UK NHS) <http://www.openods.co.uk>`_
* `The Cancer Genome Atlas project (TCGA) <https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp>`_ and `BigQuery table <http://google-genomics.readthedocs.org/en/latest/use_cases/discover_public_data/isb_cgc_data.html>`_
* `World Health Organization Global Health Observatory <http://www.who.int/gho/en/>`_ 


Image Processing
----------------

* `10k US Adult Faces Database <http://wilmabainbridge.com/facememorability2.html>`_
* `2GB of Photos of Cats <http://137.189.35.203/WebUI/CatDatabase/catData.html>`_ or `Archive version <https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html>`_
* `Affective Image Classification <http://www.imageemotion.org/>`_
* `Animals with attributes <http://attributes.kyb.tuebingen.mpg.de/>`_
* `Face Recognition Benchmark <http://www.face-rec.org/databases/>`_
* `ImageNet (in WordNet hierarchy) <http://www.image-net.org/>`_
* `Indoor Scene Recognition <http://web.mit.edu/torralba/www/indoor.html>`_
* `International Affective Picture System, UFL <http://csea.phhp.ufl.edu/media/iapsmessage.html>`_
* `Massive Visual Memory Stimuli, MIT <http://cvcl.mit.edu/MM/stimuli.html>`_
* `Several Shape-from-Silhouette Datasets <http://kaiwolf.no-ip.org/3d-model-repository.html>`_
* `Stanford Dogs Dataset <http://vision.stanford.edu/aditya86/ImageNetDogs/>`_
* `SUN database, MIT <http://groups.csail.mit.edu/vision/SUN/hierarchy.html>`_
* `The Oxford-IIIT Pet Dataset <http://www.robots.ox.ac.uk/~vgg/data/pets/>`_
* `YouTube Faces Database <http://www.cs.tau.ac.il/~wolf/ytfaces/>`_


Machine Learning
----------------

* `Delve Datasets for classification and regression (Univ. of Toronto) <http://www.cs.toronto.edu/~delve/data/datasets.html>`_
* `Discogs Monthly Data <http://data.discogs.com/>`_
* `eBay Online Auctions (2012) <http://www.modelingonlineauctions.com/datasets>`_
* `IMDb Database <http://www.imdb.com/interfaces>`_
* `Keel Repository for classification, regression and time series <http://sci2s.ugr.es/keel/datasets.php>`_
* `Lending Club Loan Data <https://www.lendingclub.com/info/download-data.action>`_
* `Machine Learning Data Set Repository <http://mldata.org/>`_
* `Million Song Dataset <http://labrosa.ee.columbia.edu/millionsong/>`_
* `More Song Datasets <http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets>`_
* `MovieLens Data Sets <http://grouplens.org/datasets/movielens/>`_
* `RDataMining - "R and Data Mining" ebook data <http://www.rdatamining.com/data>`_
* `Registered Meteorites on Earth <http://healthintelligence.drupalgardens.com/content/registered-meteorites-has-impacted-earth-visualized>`_
* `Restaurants Health Score Data in San Francisco <http://missionlocal.org/san-francisco-restaurant-health-inspections/>`_
* `UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>`_
* `Yahoo! Ratings and Classification Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=r>`_
* `Labeled Faces in the Wild (LFW) <http://vis-www.cs.umass.edu/lfw/>`_


Museums
-------

* `Canada Science and Technology Museums Corporation's Open Data <http://techno-science.ca/en/data.php>`_
* `Cooper-Hewitt's Collection Database <https://github.com/cooperhewitt/collection>`_
* `Minneapolis Institute of Arts metadata <https://github.com/artsmia/collection>`_
* `Natural History Museum (London) Data Portal <http://data.nhm.ac.uk/>`_
* `Rijksmuseum Historical Art Collection <https://www.rijksmuseum.nl/en/api>`_
* `Tate Collection metadata <https://github.com/tategallery/collection>`_
* `The Getty vocabularies <http://vocab.getty.edu>`_


Natural Language
----------------

* `Blogger Corpus <http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm>`_
* `ClueWeb09 FACC <http://lemurproject.org/clueweb09/FACC1/>`_
* `ClueWeb12 FACC <http://lemurproject.org/clueweb12/FACC1/>`_
* `DBpedia - 4.58M things with 583M facts <http://wiki.dbpedia.org/Datasets>`_
* `Flickr Personal Taxonomies <http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html>`_
* `Freebase.com of people, places, and things <http://www.freebase.com/>`_
* `Google Books Ngrams (2.2TB) <https://aws.amazon.com/datasets/google-books-ngrams/>`_
* `Google Web 5gram (1TB, 2006) <https://catalog.ldc.upenn.edu/LDC2006T13>`_
* `Gutenberg eBooks List <http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs>`_
* `Hansards text chunks of Canadian Parliament <http://www.isi.edu/natural-language/download/hansard/>`_
* `Machine Comprehension Test (MCTest) of text from Microsoft Research <http://research.microsoft.com/en-us/um/redmond/projects/mctest/index.html>`_
* `Machine Translation of European languages <http://statmt.org/wmt11/translation-task.html#download>`_
* `SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles) <https://github.com/ParallelMazen/SaudiNewsNet>`_
* `SMS Spam Collection in English <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/>`_
* `USENET postings corpus of 2005-2011 <http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html>`_
* `Wikidata - Wikipedia databases <https://www.wikidata.org/wiki/Wikidata:Database_download>`_
* `Wikipedia Links data - 40 Million Entities in Context <https://code.google.com/p/wiki-links/downloads/list>`_
* `WordNet databases and tools <http://wordnet.princeton.edu/wordnet/download/>`_


Physics
-------

* `CERN Open Data Portal <http://opendata.cern.ch/>`_
* `NASA Exoplanet Archive <http://exoplanetarchive.ipac.caltech.edu/>`_
* `NSSDC (NASA) data of 550 spacecraft <http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html>`_
* `Sloan Digital Sky Survey (SDSS) - Mapping the Universe <http://www.sdss.org/>`_


Psychology/Cognition
--------------------

* `OSU Cognitive Modeling Repository Datasets <http://www.cmr.osu.edu/browse/datasets>`_


Public Domains
--------------

* `Amazon <http://aws.amazon.com/datasets/>`_
* `Archive.org Datasets <https://archive.org/details/datasets>`_
* `CMU JASA data archive <http://lib.stat.cmu.edu/jasadata/>`_
* `CMU StatLab collections <http://lib.stat.cmu.edu/datasets/>`_
* `Data360 <http://www.data360.org/index.aspx>`_
* `Datamob.org <http://datamob.org/datasets>`_
* `Google <http://www.google.com/publicdata/directory>`_
* `Infochimps <http://www.infochimps.com/>`_
* `KDNuggets Data Collections <http://www.kdnuggets.com/datasets/index.html>`_
* `Microsoft Azure Data Market Free DataSets <http://datamarket.azure.com/browse/data?price=free>`_
* `Numbrary <http://numbrary.com/>`_
* `Reddit Datasets <https://www.reddit.com/r/datasets>`_
* `RevolutionAnalytics Collection <http://packages.revolutionanalytics.com/datasets/>`_
* `Sample R data sets <http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html>`_
* `Stats4Stem R data sets <http://www.stats4stem.org/data-sets.html>`_
* `StatSci.org <http://www.statsci.org/datasets.html>`_
* `The Washington Post List <http://www.washingtonpost.com/wp-srv/metro/data/datapost.html>`_
* `UCLA SOCR data collection <http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data>`_
* `UFO Reports <http://www.nuforc.org/webreports.html>`_
* `Wikileaks 911 pager intercepts <https://911.wikileaks.org/files/index.html>`_
* `Yahoo Webscope <http://webscope.sandbox.yahoo.com/catalog.php>`_


Search Engines
--------------

* `Academic Torrents of data sharing from UMB <http://academictorrents.com/>`_
* `Archive-it from Internet Archive <https://www.archive-it.org/explore?show=Collections>`_
* `Datahub.io <https://datahub.io/dataset>`_
* `DataMarket (Qlik) <https://datamarket.com/data/list/?q=all>`_
* `Harvard Dataverse Network of scientific data <https://dataverse.harvard.edu/>`_
* `ICPSR (UMICH) <http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp>`_
* `Institute of Education Sciences <http://eric.ed.gov>`_
* `National Technical Reports Library <http://www.ntis.gov/products/ntrl/>`_
* `Open Data Certificates (beta) <https://certificates.theodi.org/en/datasets>`_
* `OpenDataNetwork - A search engine of all Socrata powered data portals <http://www.opendatanetwork.com/>`_
* `Statista.com - Statistics and Studies <http://www.statista.com/>`_
* `Zenodo - An open dependable home for the long-tail of science <https://zenodo.org/collection/datasets>`_


Social Networks
---------------

* `72 hours #gamergate Twitter Scrape <http://waxy.org/random/misc/gamergate_tweets.csv>`_
* `Ancestry.com Forum Dataset over 10 years <http://www.cs.cmu.edu/~jelsas/data/ancestry.com/>`_
* `Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape <https://archive.org/details/twitter_cikm_2010>`_
* `CMU Enron Email of 150 users <http://www.cs.cmu.edu/~enron/>`_
* `EDRM Enron Email of 151 users, hosted on S3 <https://aws.amazon.com/datasets/enron-email-data/>`_
* `Facebook Data Scrape (2005) <https://archive.org/details/oxford-2005-facebook-matrix>`_
* `Facebook Social Networks from LAW (since 2007) <http://law.di.unimi.it/datasets.php>`_
* `Foursquare from UMN/Sarwat (2013) <https://archive.org/details/201309_foursquare_dataset_umn>`_
* `GetGlue - users rating TV shows <http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz>`_
* `GitHub Collaboration Archive <https://www.githubarchive.org/>`_
* `Google Scholar citation relations <http://www3.cs.stonybrook.edu/~leman/data/gscholar.db>`_
* `Mobile Social Networks from UMASS <https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks>`_
* `Network Twitter Data <http://snap.stanford.edu/data/higgs-twitter.html>`_
* `Reddit Comments <https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/>`_
* `Skytrax' Air Travel Reviews Dataset <https://github.com/quankiquanki/skytrax-reviews-dataset>`_
* `Social Twitter Data <http://snap.stanford.edu/data/egonets-Twitter.html>`_
* `SourceForge.net Research Data <http://www3.nd.edu/~oss/Data/data.html>`_
* `Twitter Data for Sentiment Analysis <http://help.sentiment140.com/for-students/>`_
* `Twitter Graph of entire Twitter site <http://an.kaist.ac.kr/traces/WWW2010.html>`_
* `Twitter Scrape Calufa May 2011 <http://archive.org/details/2011-05-calufa-twitter-sql>`_
* `UNIMI/LAW Social Network Datasets <http://law.di.unimi.it/datasets.php>`_
* `Yahoo! Graph and Social Data <http://webscope.sandbox.yahoo.com/catalog.php?datatype=g>`_
* `YouTube Video Social Graph in 2007, 2008 <http://netsg.cs.sfu.ca/youtubedata/>`_


Social Sciences
---------------

* `Canadian Legal Information Institute <https://www.canlii.org/en/index.php>`_
* `Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc <http://www.systemicpeace.org/>`_
* `Correlates of War Project <http://www.correlatesofwar.org/>`_
* `Cryptome Conspiracy Theory Items <http://cryptome.org>`_
* `Datacards <http://datacards.org>`_
* `European Social Survey <http://www.europeansocialsurvey.org/data/>`_
* `FBI Hate Crime 2013 - aggregated data <https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013>`_
* `GDELT Global Events Database <http://gdeltproject.org/data.html>`_
* `General Social Survey (GSS) since 1972 <http://gss.norc.org>`_
* `German Social Survey <http://www.gesis.org/en/home/>`_
* `Global Religious Futures Project <http://www.globalreligiousfutures.org/>`_
* `Institute for Demographic Studies <http://www.ined.fr/en/>`_
* `International Networks Archive <http://www.princeton.edu/~ina/>`_
* `International Social Survey Program ISSP <http://www.issp.org>`_
* `International Studies Compendium Project <http://www.isacompendium.com/public/>`_
* `James McGuire Cross National Data <http://jmcguire.faculty.wesleyan.edu/welcome/cross-national-data/>`_
* `MIT Reality Mining Dataset <http://realitycommons.media.mit.edu/realitymining.html>`_
* `Paul Hensel General International Data Page <http://www.paulhensel.org/dataintl.html>`_
* `PewResearch Internet Survey Project <http://www.pewinternet.org/datasets/pages/2/>`_
* `PewResearch Society Data Collection <http://www.pewresearch.org/data/download-datasets/>`_
* `Political Polarity Data <http://www3.cs.stonybrook.edu/~leman/data/14-icwsm-political-polarity-data.zip>`_
* `StackExchange Data Explorer <http://data.stackexchange.com/help>`_
* `Terrorism Research and Analysis Consortium <http://www.trackingterrorism.org/>`_
* `Texas Inmates Executed Since 1984 <http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html>`_
* `The MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste <http://nsd.uib.no>`_
* `Titanic Survival Data Set <https://github.com/caesar0301/awesome-public-datasets/tree/master/Datasets>`_
* `UCB's Archive of Social Science Data (D-Lab) <http://ucdata.berkeley.edu/>`_
* `UCLA Social Sciences Data Archive <http://dataarchives.ss.ucla.edu/Home.DataPortals.htm>`_
* `UN Civil Society Database <http://esango.un.org/civilsociety/>`_
* `Universities Worldwide <http://univ.cc/>`_
* `UPJOHN for Labor Employment Research <http://www.upjohn.org/services/resources/employment-research-data-center>`_
* `WorldPop project - Worldwide human population distributions <http://www.worldpop.org.uk/data/get_data/>`_


Sports
------

* `Betfair Historical Exchange Data <http://data.betfair.com/>`_
* `Cricsheet Matches (cricket) <http://cricsheet.org/>`_
* `Ergast Formula 1, from 1950 up to date (API) <http://ergast.com/mrd/db>`_
* `Football/Soccer resources (data and APIs) <http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/>`_
* `Lahman's Baseball Database <http://www.seanlahman.com/baseball-archive/statistics/>`_
* `Pinhooker: Thoroughbred Bloodstock Sale Data <https://github.com/phillc73/pinhooker>`_
* `Retrosheet Baseball Statistics <http://www.retrosheet.org/game.htm>`_


Time Series
-----------

* `Databanks International Cross National Time Series Data Archive <http://www.cntsdata.com>`_
* `Hard Drive Failure Rates <https://www.backblaze.com/hard-drive-test-data.html>`_
* `Heart Rate Time Series from MIT <http://ecg.mit.edu/time-series/>`_
* `Time Series Data Library (TSDL) from MU <https://datamarket.com/data/list/?q=provider:tsdl>`_
* `UC Riverside Time Series Dataset <http://www.cs.ucr.edu/~eamonn/time_series_data/>`_


Transportation
--------------

* `Airlines OD Data 1987-2008 <http://stat-computing.org/dataexpo/2009/the-data.html>`_
* `Bay Area Bike Share Data <http://www.bayareabikeshare.com/open-data>`_
* `Bike Share Systems (BSS) collection <https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems>`_
* `GeoLife GPS Trajectory from Microsoft Research <http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/>`_
* `German train system by Deutsche Bahn <http://data.deutschebahn.com/datasets/>`_
* `Hubway Million Rides in MA <http://hubwaydatachallenge.org/trip-history-data/>`_
* `Marine Traffic - ship tracks, port calls and more <http://www.marinetraffic.com/de/ais-api-services>`_
* `Montreal BIXI Bike Share <https://montreal.bixi.com/donn%C3%A9es-libre-service>`_
* `NYC Taxi Trip Data 2009- <http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml>`_
* `NYC Taxi Trip Data 2013 (FOIA/FOILed) <https://archive.org/details/nycTaxiTripData2013>`_
* `NYC Uber trip data April 2014 to September 2014 <https://github.com/fivethirtyeight/uber-tlc-foil-response>`_
* `OpenFlights - airport, airline and route data <http://openflights.org/data.html>`_
* `Plane Crash Database, since 1920 <http://www.planecrashinfo.com/database.htm>`_
* `RITA Airline On-Time Performance data <http://www.transtats.bts.gov/Tables.asp?DB_ID=120>`_
* `RITA/BTS transport data collection (TranStat) <http://www.transtats.bts.gov/DataIndex.asp>`_
* `Toronto Bike Share Stations (XML file) <http://www.bikesharetoronto.com/data/stations/bikeStations.xml>`_
* `Transport for London (TFL) <https://tfl.gov.uk/info-for/open-data-users/our-feeds>`_
* `Travel Tracker Survey (TTS) for Chicago <http://www.cmap.illinois.gov/data/transportation/travel-tracker-survey>`_
* `U.S. Bureau of Transportation Statistics (BTS) <http://www.rita.dot.gov/bts/>`_
* `U.S. Domestic Flights 1990 to 2009 <http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a>`_
* `U.S. Freight Analysis Framework since 2007 <http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm>`_


Complementary Collections
-------------------------

* `Database of Scientific Code Contributions <https://mozillascience.org/collaborate>`_
* DataWrangling: `Some Datasets Available on the Web <http://www.datawrangling.com/some-datasets-available-on-the-web>`_
* Inside-r: `Finding Data on the Internet <http://www.inside-r.org/howto/finding-data-internet>`_
* OpenDataMonitor: `An overview of available open data resources in Europe <http://opendatamonitor.eu>`_
* Quora: `Where can I find large datasets open to the public? <http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public>`_
* RS.io: `100+ Interesting Data Sets for Statistics <http://rs.io/100-interesting-data-sets-for-statistics/>`_
* StaTrek: `Leveraging open data to understand urban lives <http://xiaming.me/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/>`_