awesome-public-datasets/README.rst
Spencer Alexander 11ad760707 Adding Weather Datasources to Climate Section
Cleaning up capitalization in the climate section, and adding:

1) NOAA Real-time Models: These are the models that power forecast
   systems all over the world. The models consist of both North America
   specific simulations (NAM, RUC) as well as global models (GFS).
2) Canadian Meteorological Centre: These models offer a nice complement
   to the NOAA real-time models, both in North America and throughout the
   world.
2014-12-11 17:59:19 -08:00

Awesome Public Datasets
=======================
This list of public data sources is collected and tidied from blogs, answers,
and user responses. Most of the data sets listed below are free; however, some
are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.

Climate/Weather
---------------
* Australian Weather: http://www.bom.gov.au/climate/dwo/
* Canadian Meteorological Centre: https://weather.gc.ca/grib/index_e.html
* Climate Data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
* Global Climate Data Since 1929: http://www.tutiempo.net/en/Climate
* NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
* NOAA Climate Datasets: http://ncdc.noaa.gov/data-access/quick-links
* NOAA Realtime Weather Models: http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction
* WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html

Economics
---------
* American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
* EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
* Internet Product Code Database: http://www.upcdatabase.com/
* World Bank: http://data.worldbank.org/indicator

Energy
------
* AMPds: http://ampds.org/
* BLUEd: http://nilm.cmubi.org/
* COMBED: http://combed.github.io/
* Dataport: https://dataport.pecanstreet.org/
* ECO: http://www.vs.inf.ethz.ch/res/show.html?what=eco-data
* EIA: http://www.eia.gov/electricity/data/eia923/
* iAWE: http://iawe.github.io/
* HFED: http://hfed.github.io/
* Plaid: http://plaidplug.com/
* REDD: http://redd.csail.mit.edu/
* UK-Dale: http://www.doc.ic.ac.uk/~dk3810/data/

Finance
-------
* CBOE Futures Exchange: http://cfe.cboe.com/Data/
* Google Finance: https://www.google.com/finance
* Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
* NASDAQ: https://data.nasdaq.com/
* OANDA: http://www.oanda.com/
* OSU Financial data: http://fisher.osu.edu/fin/osudata.htm or http://fisher.osu.edu/fin/fdf/osudata.htm
* Quandl: http://www.quandl.com/
* St. Louis Federal Reserve (FRED): http://research.stlouisfed.org/fred2/
* Yahoo Finance: http://finance.yahoo.com/

Biology
-------
* CRCNS: http://crcns.org/data-sets
* Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
* Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
* MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
* NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
* Protein structure: http://www.infobiotic.net/PSPbenchmarks/
* Protein Data Bank: http://pdb.org/
* Public Gene Data: http://www.pubgene.org/
* Stanford Microarray Data: http://smd.stanford.edu/
* UniGene: http://www.ncbi.nlm.nih.gov/unigene
* The Personal Genome Project: http://www.personalgenomes.org/ or https://my.pgp-hms.org/public_genetic_data
* 1000 Genomes: http://www.1000genomes.org/data
* UCSC Public Data: http://hgdownload.soe.ucsc.edu/downloads.html

Physics
-------
* NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

Healthcare
----------
* EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
* Gapminder: http://www.gapminder.org/data/
* Medicare Data File: http://go.cms.gov/19xxPN4

GeoSpace/GIS
------------
* EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
* Factual Global Location Data: http://www.factual.com/
* Geo Spatial Data: http://geodacenter.asu.edu/datalist/
* OpenStreetMap (a free map worldwide): http://wiki.openstreetmap.org/wiki/Downloading_data
* GeoNames (over eight million placenames): http://www.geonames.org/
* BODC (marine data of nearly 22,000 oceanographic vars): http://www.bodc.ac.uk/data/where_to_find_data/
* GADM (Global Administrative Areas database): http://www.gadm.org/
* twofishes (Foursquare's coarse geocoder): https://github.com/foursquare/twofishes
* Natural Earth (vectors and rasters of the world): http://www.naturalearthdata.com/
* tz_world (timezone polygons): http://efele.net/maps/tz/world/
* TIGER/Line (official United States boundaries and roads): http://www.census.gov/geo/maps-data/data/tiger-line.html

Transportation
--------------
* Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
* Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
* Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
* Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
* NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
* OpenFlights (airport, airline and route data): http://openflights.org/data.html
* RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
* RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
* Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
* U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm
* Marine Traffic - ship tracks, port calls and more: https://www.marinetraffic.com/de/p/api-services

Government
----------
* Archive-it: https://www.archive-it.org/explore?show=Collections
* Australia: https://data.gov.au/
* Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
* Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
* Chicago: https://data.cityofchicago.org/
* FDA: https://open.fda.gov/index.html
* Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
* Guardian world governments: http://www.guardian.co.uk/world-government-data
* HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
* London Datastore, U.K.: http://data.london.gov.uk/dataset
* Glasgow, Scotland, UK: http://data.glasgow.gov.uk/
* Netherlands: https://data.overheid.nl/
* New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
* NYC betanyc: http://betanyc.us/
* NYC Open Data: http://nycplatform.socrata.com/
* OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
* RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
* San Francisco Data sets: http://datasf.org/
* The World Bank: http://wdronline.worldbank.org/
* U.K. Government Data: http://data.gov.uk/data
* U.S. Census Bureau: http://www.census.gov/data.html
* U.S. American Community Survey: http://www.census.gov/acs/www/data_documentation/data_release_info/
* U.S. Federal Government Agencies: http://www.data.gov/metric
* U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
* U.S. Open Government: http://www.data.gov/open-gov/
* UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
* United Nations: http://data.un.org/
* US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
* Open Government Data (OGD) Platform India: http://www.data.gov.in/

Sport
-----
* Cricsheet (cricket): http://cricsheet.org/
* Betfair (betting exchange) Event Results: http://data.betfair.com/
* Lahman's Baseball Database: http://www.seanlahman.com/baseball-archive/statistics/
* Retrosheet (baseball): http://www.retrosheet.org/game.htm
* Ergast Formula 1 (API available): http://ergast.com/mrd/db

Data Challenges
---------------
* Challenges in Machine Learning: http://www.chalearn.org/
* DrivenData Competitions for Social Good: http://www.drivendata.org/
* ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
* Kaggle Competition Data: http://www.kaggle.com/
* KDD Cup by Tencent 2012: https://www.kddcup2012.org/
* Netflix Prize: http://www.netflixprize.com/leaderboard
* Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge

Machine Learning
----------------
* eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
* IMDb database: http://www.imdb.com/interfaces
* Keel Repository: http://sci2s.ugr.es/keel/datasets.php
* Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
* Machine Learning Data Set Repository: http://mldata.org/
* Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
* More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
* MovieLens Data Sets: http://datahub.io/dataset/movielens
* RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
* Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
* SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
* UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
* University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
* Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Natural Language
----------------
* 40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
* ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
* ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
* Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
* Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
* Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
* Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
* Hansards: http://www.isi.edu/natural-language/download/hansard/
* Machine Translation: http://statmt.org/wmt11/translation-task.html#download
* SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
* USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
* WordNet: http://wordnet.princeton.edu/wordnet/download/

Image Processing
----------------
* 2GB of photos of cats: http://137.189.35.203/WebUI/CatDatabase/catData.html
* Face Recognition Benchmark: http://www.face-rec.org/databases/
* ImageNet: http://www.image-net.org/

Time Series
-----------
* Time Series Data Library: https://datamarket.com/data/list/?q=provider:tsdl
* UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/

Social Sciences
---------------
* China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
* CMU Enron Email: http://www.cs.cmu.edu/~enron/
* Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
* Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
* Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
* Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
* General Social Survey (GSS): http://www3.norc.org/GSS+Website/
* GetGlue (users rating TV shows): http://bit.ly/1aL8XS0
* GitHub Archive: http://www.githubarchive.org/
* ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
* Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
* PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
* Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
* SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
* Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
* Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
* UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
* UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
* UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
* Universities Worldwide: http://univ.cc/
* UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
* Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
* YouTube Graph (2007, 2008): http://netsg.cs.sfu.ca/youtubedata/

Complex Networks
----------------
* CrossRef DOI URLs: https://archive.org/details/doi-urls
* DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
* NBER Patent Citations: http://nber.org/patents/
* NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
* Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
* PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
* Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
* Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
* Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
* The Koblenz Network Collection: http://konect.uni-koblenz.de/
* UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
* UFL sparse matrix collection: http://www.cise.ufl.edu/research/sparse/matrices/
* The Laboratory for Web Algorithmics (UNIMI): http://law.di.unimi.it/datasets.php
* WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html

Computer Networks
-----------------
* 3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
* 53.5B Web clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
* CAIDA Internet Datasets: http://www.caida.org/data/overview/
* ClueWeb09: http://lemurproject.org/clueweb09/
* ClueWeb12: http://lemurproject.org/clueweb12/
* CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
* Dartmouth CRAWDAD Wireless datasets: http://crawdad.cs.dartmouth.edu/
* OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
* UCSD Network Telescope: http://www.caida.org/projects/network_telescope/

Museums
-------
* Cooper-Hewitt's Collection Database: https://github.com/cooperhewitt/collection
* Tate Collection metadata: https://github.com/tategallery/collection
* Minneapolis Institute of Arts metadata: https://github.com/artsmia/collection
* The Getty vocabularies: http://vocab.getty.edu

Data Search Engines
-------------------
* Academic Torrents: http://academictorrents.com/
* Datahub.io: http://datahub.io/dataset
* DataMarket: https://datamarket.com/data/list/?q=all
* Harvard Dataverse: http://thedata.harvard.edu/dvn/
* Statista: http://www.statista.com/
* Freebase: http://www.freebase.com/

Public Domains
--------------
* Amazon: http://aws.amazon.com/datasets
* Archive.org Datasets: https://archive.org/details/datasets
* CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
* CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
* Data360: http://www.data360.org/index.aspx
* Datamob.org: http://datamob.org/datasets
* Google: http://www.google.com/publicdata/directory
* infochimps: http://www.infochimps.com/
* KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
* Numbrary: http://numbrary.com/
* RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
* Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
* Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
* StatSci.org: http://www.statsci.org/datasets.html
* The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
* UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
* UFO Reports: http://www.nuforc.org/webreports.html
* Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
* Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php

Complementary Collections
-------------------------
* DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
* Inside-r: http://www.inside-r.org/howto/finding-data-internet
* Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
* Reddit: http://www.reddit.com/r/datasets
* RS Collection 100+: http://rs.io/2014/05/29/list-of-data-sets.html
* StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/