

Where can I find large datasets open to the public?


Answer Wiki
Here are many of the links mentioned so far:
Cross-disciplinary data repositories, data collections and data search engines:
1. https://www.kaggle.com/datasets
2. http://usgovxml.com
3. http://aws.amazon.com/datasets
4. http://databib.org
5. http://datacite.org
6. http://gshare.com
7. http://linkeddata.org
8. http://reddit.com/r/datasets
9. http://thewebminer.com/
10. http://thedatahub.org (alias http://ckan.net)

11. http://quandl.com
12. Social Network Analysis Interactive Dataset Library (Social Network Datasets)

13. Datasets for Data Mining


14. http://enigma.io
15. http://www.undthem.com/
16. http://NetworkRepository.com - The First Interactive Network Data Repository
17. http://MLvis.com
18. Open Data Inception - A Comprehensive List of 2500+ Open Data Portals in the World
19. http://data.opendatasoft.com (OpenDataSoft catalog)

Single datasets and data repositories


1. http://archive.ics.uci.edu/ml/
2. http://crawdad.org/
3. http://data.austintexas.gov
4. http://data.cityofchicago.org
5. http://data.govloop.com
6. http://data.gov.uk/
7. data.gov.in
8. http://data.medicare.gov
9. http://data.seattle.gov
10. http://data.sfgov.org
11. http://data.sunlightlabs.com
12. https://datamarket.azure.com/
13. http://developer.yahoo.com/geo/g...
14. http://econ.worldbank.org/datasets
15. http://en.wikipedia.org/wiki/Wik...
16. http://factfinder.census.gov/ser...
17. http://ftp.ncbi.nih.gov/
18. http://gettingpastgo.socrata.com
19. http://googleresearch.blogspot.c...
20. http://books.google.com/ngrams/
21. http://medihal.archives-ouvertes.fr
22. http://public.resource.org/
23. http://rechercheisidore.fr
24. http://snap.stanford.edu/data/in...
25. http://timetric.com/public-data/
26. https://wist.echo.nasa.gov/~wist...
27. http://www2.jpl.nasa.gov/srtm

28. http://www.archives.gov/research...
29. http://www.bls.gov/
30. http://www.crunchbase.com/
31. http://www.dartmouthatlas.org/
32. http://www.data.gov/
33. http://www.datakc.org
34. http://dbpedia.org
35. http://www.delicious.com/jbaldwi...
36. http://www.faa.gov/data_research/
37. http://www.factual.com/
38. http://research.stlouisfed.org/f...
39. http://www.freebase.com/
40. http://www.google.com/publicdata...
41. http://www.guardian.co.uk/news/d...
42. http://www.infochimps.com
43. http://www.kaggle.com/
44. http://build.kiva.org/
45. http://www.nationalarchives.gov....
46. http://www.nyc.gov/html/datamine...
47. http://www.ordnancesurvey.co.uk/...
48. http://www.philwhln.com/how-to-g...
49. http://www.imdb.com/interfaces
50. http://imat-relpred.yandex.ru/en...
51. http://www.dados.gov.pt/pt/catal...
52. http://knoema.com
53. http://daten.berlin.de/
54. http://www.qunb.com
55. http://databib.org/
56. http://datacite.org/
57. http://data.reegle.info/
58. http://data.wien.gv.at/
59. http://data.gov.bc.ca
60. https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
61. http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
62. http://www.dati.gov.it
63. http://dati.trentino.it
64. http://www.databagg.com/
65. http://networkrepository.com - Network/ML data repository w/ visual interactive analytics
66. Home (United Nations Environment Programme GRID-Geneva; a lot of GIS datasets)

100+ Answers

Bret Taylor, CEO of Quip. Ex-CTO of Facebook, co-founder FriendFeed, co-creator Google Maps.
Written Apr 5, 2011

I did a blog post about open data a long time ago (http://bret.appspot.com/entry/we...),
and ReadWriteWeb did a nice roundup based on all the comments from the blog post:
http://www.readwriteweb.com/arch...
Since that post, there have been a lot more comments on the blog (105 and counting), so
you may want to comb the comments for any that the RWW post missed.


Alex K. Chen, ethereal gwernophile, aspires towards timeless, context-independence existence


Updated Apr 20, 2015 Upvoted by Mark Meloon, US Head of Data Science at Impetus and
Ankit Sharma, Data Scientist at DataRPM

A database of open databases? (also see most-upvoted questions on the Open Data
Stack Exchange at Highest Voted Questions)
http://www.reddit.com/r/datasets
https://d396qusza40orc.cloudfron... (large collection from Coursera's Data Analysis course)
Where is it possible to find raw climate data? (also NCAR - Climate Data Guide)
Ecological Data Wiki
PhysioNet - largest repository of free, open-access databases and open-source
computational tools devoted to complex signals informatics
Page on sdss.org - SDSS Astronomy datasets. For more on astronomy, see What are
some astronomy datasets open to the public?
http://berkeleyearth.org/dataset... - Berkeley Earth dataset
http://static.reddit.com/RedditS... - massive survey of Redditors and their preferences - see http://blog.reddit.com/2011/09/w... for some analysis
Welcome to the CRCNS data sharing website - for neuroscience
http://archiveteam.org/index.php... - Old archives of websites that no longer exist. Includes data on the affinities of 60,000+ Reddit users
http://www.r-bloggers.com/datase... - Datasets to practice your data mining, discussed at http://www.reddit.com/r/MachineL...
http://www.ers.usda.gov/Data/ - USDA Economic Research Service datasets
http://www.mortality.org/ - human mortality datasets
http://www.fda.gov/Food/FoodSafe... - FDA pesticide datasets
http://www.ams.usda.gov/AMSv1.0/pdp - USDA pesticide datasets

Climatology: What are some historical weather databases?


http://www.epa.gov/data/ - EPA data
http://data.giss.nasa.gov/ - NASA GISS data
http://jimwatsonsequence.cshl.edu/ - James Watson's DNA sequence
http://evidence.personalgenomes.... - public genomes of people enrolled in the personal genome project - includes genomes of Steven Pinker and Esther Dyson. See http://evidence.personalgenomes.... for their genomes
http://voteview.org/downloads.asp - Congressional Voting datasets (probably
contains *everything* about what any politician voted for)
http://www.norc.uchicago.edu/GSS... - General Social Survey. For a tutorial, see http://blogs.discovermagazine.co...
http://www.cfa.harvard.edu/hitran/ - high-resolution transmission molecular absorption database. HITRAN on the web: http://hitran.iao.ru/molecule
http://sarahsinbox.com/ - Sarah Palin emails - analyzed by Edwin Chen using Latent Dirichlet Allocation (LDA) - see http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/

Some others:
http://www.cdc.gov/nchs/nhanes/n... - National Health and Nutrition Examination Survey
http://www.nlsinfo.org/ordering/... - NLSY data (sociology) [1]
http://road.hmdc.harvard.edu/ - election datasets (only 1984-1990 though)

[1] The NLSY79 Geocode data can only be made available to users who have successfully
completed a geocode application and signed a confidentiality agreement with the U.S.
Bureau of Labor Statistics. If interested in gaining access to the NLSY79 Geocode data,
please review the information at http://stats.bls.gov/nls/nlsgeo7...

Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at Techammer
Updated Jan 15 Upvoted by Mark Meloon, US Head of Data Science at Impetus and Ankit
Sharma, Data Scientist at DataRPM

I'll try to restrict my answers to datasets greater than 1 GB in size, and order my answers
by the size of the dataset.
More than 1 TB
The 1000 Genomes project makes 260 TB of human genome data available [13]
The Internet Archive is making an 80 TB web crawl available for research [17]
The TREC conference made the ClueWeb09 [3] dataset available a few years back.
You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the
sneakernet data transfer. The data is about 5 TB compressed.
ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
CNetS at Indiana University makes a 2.5 TB click dataset available [19]
ICWSM made a large corpus of blog posts available for their 2011 conference [2].
You'll have to register (an actual form, not an online form), but it's free. It's about 2.1
TB compressed.
The Yahoo News Feed dataset is 1.5 TB compressed, 13.5 TB uncompressed
The Proteome Commons makes several large datasets available. The largest, the
Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in
size.
More than 1 GB
The Reference Energy Disaggregation Data Set [12] has data on home energy
use; it's about 500 GB compressed.
The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
The ImageNet dataset [18] is pretty big.
The MOBIO dataset [14] is about 135 GB of video and audio data
The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to
academic researchers, including an 83 GB data set of Flickr image features and the
dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The
dataset is about 10 GB compressed.
Yandex has recently made a very large web search click dataset available [1]. You'll
have to register online for the contest to download. It's about 5.6 GB compressed.
Freebase makes regular data dumps available [5]. The largest is their Quad dump
[4], which is about 3.6 GB compressed.
The Open American National Corpus [8] is about 4.8 GB uncompressed.
Wikipedia made a dataset containing information about edits available for a recent
Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
The Research and Innovative Technology Administration (RITA) has made
available a dataset about the on-time performance of domestic flights operated by
large carriers. The ASA compressed this dataset and makes it available for download
[16].
The wikilinks data made available by Google is about 1.75 GB total [20].
[1] http://imat-relpred.yandex.ru/en...
[2] http://www.icwsm.org/2011/data.php
[3] http://lemurproject.org/clueweb0...
[4] http://wiki.freebase.com/wiki/Da...
[5] http://download.freebase.com/dat...
[6] http://www.kaggle.com/c/wikichal...
[7] http://webscope.sandbox.yahoo.co...
[8] http://americannationalcorpus.or...
[9] http://kddcup.yahoo.com/datasets...
[10] http://horatio.cs.nyu.edu/mit/ti...
[11] https://proteomecommons.org/data...
[12] http://redd.csail.mit.edu/
[13] http://www.1000genomes.org/ftpse...
[14] https://www.idiap.ch/dataset/mobio
[15] http://www-nlp.stanford.edu/pubs...
[16] http://stat-computing.org/dataex...
[17] http://blog.archive.org/2012/10/...
[18] http://www.image-net.org/index
[19] http://cnets.indiana.edu/groups/...
[20] wiki-links - Wikipedia Links Data - Google Project Hosting
[21] The ClueWeb12 Dataset
[22] ClueWeb12 Related Data:

Felipe Hoffa, Google software engineer / Developer Advocate


Written Feb 19, 2015

Google BigQuery is an awesome place to share open datasets: Once data is loaded in
BigQuery, you can make it public - allowing others to instantly analyze it using just SQL.
See a list of some of the amazing datasets shared on BigQuery: http://www.reddit.com/r/bigquery...
Among those datasets I'd like to highlight GDELT: More than a quarter billion rows
(growing every day) of every event happening around the world. I made a video about it:


Shimonee Shah, Quorious, Eccentric, Free spirited


Updated Jul 7, 2014

Here is a useful link.


Finding Data on the Internet
By RevoJoe on October 6, 2011
The following list of data sources has been modified as of 8/19/13. Most of the data sets
listed below are free; however, some are not.
If an (R) appears after a source, this means that the data are already in R format or there
exist R commands for directly importing the data from R. (See examples :: intro for some
code.) Otherwise, I have limited the list to data sources for which there is a reasonably
simple process for importing csv files. What follows is a list of data sources organized into
categories that are not mutually exclusive but which reflect what's out there.
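As a rough illustration of what a "reasonably simple process for importing csv files" can look like outside R, here is a minimal sketch in Python using only the standard library. The column names (year, value) and the sample rows are made up for the example; a real download from any of the sources listed here will have its own header and may need extra cleaning.

```python
import csv
import io


def read_csv_rows(text_stream):
    """Return the rows of a CSV stream as a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream))


def numeric_column(rows, column):
    """Convert one column to floats, skipping blank or missing cells."""
    return [float(row[column]) for row in rows if (row[column] or "").strip()]


# Made-up sample standing in for a downloaded csv file (hypothetical columns).
sample = io.StringIO("year,value\n2009,1.5\n2010,\n2011,2.5\n")
rows = read_csv_rows(sample)
values = numeric_column(rows, "value")
print(len(rows), "rows;", len(values), "numeric values; mean", sum(values) / len(values))
```

For a file on disk, passing open(path, newline="", encoding="utf-8") in place of the StringIO object should work the same way.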
Economics
American Economic Ass. (AEA): AEAweb: RFE
UMD: Inforum - EconData
World bank: Indicators | Data
Finance
CBOE Futures Exchange: CFE | Market Data
Google Finance: Stock market quotes, news, currency conversions & more (R)
Google Trends: Google Trends - Web Search interest - Worldwide, 2004 - present
St Louis Fed: Federal Reserve Economic Data (R)

NASDAQ: NASDAQ - Datastore


OANDA: Forex Trading | Trade Currency Online | Forex Broker | OANDA (R)
Quandl: Find, Use and Share Numerical Data
Yahoo Finance: Yahoo Finance - Business Finance, Stock Market, Quotes, News
Government
Archived national government statistics: Web Archiving Services for Libraries and
Archives
Australia: 3301.0 - Births, Australia, 2009
Canada: Home | data.gc.ca
DataMarket: DataMarket - Find, Understand and Share Data - DataMarket
Fed Stats: FedStats: Subjects A to Z
Guardian world governments: Page on guardian.co.uk
London, U.K. data: Catalogue | London DataStore
New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by...
NYC data: NYC Open Data
OECD: Page on oecd.org
RITA: RITA | BTS | Title from h2
San Francisco Data sets: Data | San Francisco (R)

U.K. Government Data: Data Search | data.gov.uk


United Nations: UNdata
U.S. Federal Government Agencies: Federal Agency Participation - Data.gov
US CDC Public Health datasets: Public-Use Data Files and Documentation
The World Bank: World Development Report
UK 2011 Census Open Atlas Project: Page on alex-singleton.com
Health Care
Gapminder: Data
Machine Learning
Airlines Data (2009 ASA Challenge): The data. Data expo 09. ASA Statistics
Computing and Graphics
Airports and their locations: Airports and Their Locations
AppliedPredictiveModeling (R package): Page on bit.ly
Australian Weather: Daily Weather Observations
Causality Workbench: Data - Repository - Causality Workbench
Edge data for US domestic flights 1990 to 2009: US Domestic Flights From 1990 to
2009
GroupLens Research (movie ratings and more): Datasets
Kaggle competition data: Go from Big Data to Big Analytics
KDNuggets competition site: Datasets for Data Mining and Data Science
The Koblenz Network Collection: The Koblenz Network Collection
Machine Learning Data Set Repository: mldata :: Welcome
Medicare Data File: Page on cms.gov
Microsoft Research: Our research - Microsoft Research
Million songs: The Million Song Dataset: Giving Back to Music Research
RDataMining.com: R and Data Mining
R and Data Mining ebook data: Data - RDataMining.com: R and Data Mining
The Revolution Analytics Collection: Index of /datasets/


Social Networking: Ancestry.com Forum Dataset
UCI Machine Learning Repository: UCI Machine Learning Repository
53.5 billion clicks: Center for Complex Networks and Systems Research
Public Domain Collections
Data360: Data360 Homepage
Page on datamob.org : Page on datamob.org
Factual: Page on factual.com
Freebase: Freebase
Google: Google Public Data Explorer
infochimps: Big Data - Cloud Services
numbrary: Page on numbrary.com
Sample R data sets: The R Datasets Package (R)

SourceForge Research Data: Data


UFO Reports: National UFO Reporting Center Web Reports
Wikileaks 911 pager intercepts: 9/11 Pager data
Resources for AP Statistics, Intro to Statistics, and R | STATS4STEM.ORG : R data
sets: Statistical Data Sets, Statistics Data Sets, Data Sets For Statistics, R Datasets
(R)
The Washington Post List: Post Databases (washingtonpost.com)
Science
Agricultural Experiments: agridat {agridat} (R)
Climate data: Temperature data (HadCRUT4) and ftp://ftp.cmdl.noaa.gov/

Gene Expression Omnibus: Home - GEO - NCBI


Geo Spatial Data: Data | GeoDa Center
Human Microbiome Project: Microbial Reference Genomes
MIT Cancer Genomics Data: Page on broadinstitute.org
NASA: Obtaining Data From the NSSDC
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/D...
Protein structure: PSP benchmark
Public Gene Data: Browse literature or sequence neighbours
Stanford Microarray Data: Page on stanford.edu (R)
Social Sciences
General Social Survey: General Social Survey
ICPSR: Page on umich.edu
SNAP: Stanford Large Network Dataset Collection
UCLA Social Sciences Archive: Data Portals
UPJOHN INST: Employment Research Data Center
Time Series
Time Series data Library: Time Series Data Library
Universities
Carnegie Mellon University Enron email: Enron Email Dataset
Carnegie Mellon University StatLab: StatLib---Datasets Archive
Carnegie Mellon University JASA data archive: StatLib---JASA Data Archive
Ohio State University Financial data: Financial Data Finder
UC Berkeley: UC DATA :HOME
UCLA: SOCR Data - Socr
UC Riverside Time Series: Welcome to the UCR Time Series Classification/Clustering
Page
University of Toronto: Delve Datasets

Alex Kamil
Updated Sep 28, 2013

1000 Genomes project: http://www.1000genomes.org/data#...


Internet Movie Database (IMDb) data: http://www.imdb.com/interfaces
Twitter (product) feed scrapes (some are free): http://blog.infochimps.com/2008/...
(thanks to Joseph Misiti)
What are some free, public data sets?
What data APIs or sources should be in my O'Reilly guide?
http://news.ycombinator.com/item...
Are there any free large datasets in the format of an Apache access log?
30TB of web crawl data: http://www.commoncrawl.org/data/
Images database: http://sipi.usc.edu/database/dat...
http://warsteiner.db.cs.cmu.edu/...
Datasets released by Google

Nitin Madnani, Computer Scientist, NLPer & Dataviz Nerd


Written Oct 4, 2011

Here are some big corpora we use in NLP in addition to the ones already mentioned:
ukWaC: a 2 billion word corpus constructed from the Web limiting the crawl to the .uk
domain and using medium-frequency words from the BNC as seeds. The corpus was
POS-tagged and lemmatized with the TreeTagger. There's also a parsed version called
pukWaC. Get both at: http://wacky.sslmit.unibo.it/dok...
WaCkypedia: a 2009 dump of the English Wikipedia (about 800 million tokens),
including part of speech/lemma information, as well as a full syntactic parse. The texts
were extracted from the dump and cleaned using the Wikipedia extractor. Get it at the
same URL as ukWaC: http://wacky.sslmit.unibo.it/dok...
USENET corpus: A collection of public USENET postings. This corpus was collected
between Oct 2005 and Jan 2011, and covers 47860 English language, non-binary-file
news groups. Get it at: http://www.psych.ualberta.ca/~we... [CAVEAT: it's huge!]
The collection of data that comes with the Natural Language Toolkit (NLTK). It's
probably not as large as the others but it's a good set. See descriptions at:
http://nltk.googlecode.com/svn/t...
Europarl: Proceedings of the European Parliament in 13 languages. Cleaned and pre-processed
for machine translation research. Get it at: http://www.statmt.org/europarl
[FYI, NLTK has a built-in interface to access this corpus.]
The Google Books Ngram corpus: Pretty big. Get it at: http://books.google.com/ngrams/d...

Mukesh Chapagain, Programmer, Blogger, Engineer, Spiritual Seeker


Written Nov 16, 2015

Yelp provides data and reviews of the 250 closest businesses for 30 universities for
students and academics to explore and research. I downloaded Yelp's Academic
Dataset in early 2015, and it contained a total of 330,071 reviews provided by 130,873
users to 13,481 businesses.
The dataset is a single gzip-compressed file, composed of one json-object per line. Every
object contains a 'type' field, which tells you whether it is a business, a user, or a review.


Business objects contain basic information about local businesses.


{
'type': 'business',
'business_id': (a unique identifier for this business),
'name': (the full business name),
'neighborhoods': (a list of neighborhood names, might be empty),
'full_address': (localized address),
'city': (city),
'state': (state),
'latitude': (latitude),
'longitude': (longitude),
'stars': (star rating, rounded to half-stars),
'review_count': (review count),
'photo_url': (photo url),
'categories': [(localized category names)],
'open': (is the business still open for business?),
'schools': (nearby universities),
'url': (yelp url)
}
Review objects contain the review text, the star rating, and information on votes Yelp users
have cast on the review.
{
'type': 'review',
'business_id': (the identifier of the reviewed business),
'user_id': (the identifier of the authoring user),
'stars': (star rating, integer 1-5),
'text': (review text),
'date': (date, formatted like '2011-04-19'),
'votes': {
'useful': (count of useful votes),
'funny': (count of funny votes),
'cool': (count of cool votes)
}
}
User objects contain aggregate information about a single user across all of Yelp (including
businesses and reviews not in the dataset).
{
'type': 'user',
'user_id': (unique user identifier),
'name': (first name, last initial, like 'Matt J.'),
'review_count': (review count),
'average_stars': (floating point average, like 4.31),
'votes': {
'useful': (count of useful votes across all reviews),
'funny': (count of funny votes across all reviews),
'cool': (count of cool votes across all reviews)
}
}
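Given the layout described above (one gzip-compressed file, one JSON object per line, each with a 'type' field), a minimal Python sketch for reading the file could look like the following. The function names are my own, and the sketch assumes each line is valid JSON with the field names shown in the schemas above; it is not code from Yelp.

```python
import gzip
import json
from collections import defaultdict


def load_yelp_objects(path):
    """Group the objects of a gzip-compressed, one-JSON-object-per-line file by 'type'."""
    objects_by_type = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            objects_by_type[record["type"]].append(record)
    return objects_by_type


def mean_review_stars(objects_by_type):
    """Average the 'stars' field of review objects for each 'business_id'."""
    totals = defaultdict(lambda: [0, 0])  # business_id -> [sum of stars, review count]
    for review in objects_by_type["review"]:
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {business_id: total / count for business_id, (total, count) in totals.items()}
```

Calling load_yelp_objects on the downloaded file (whatever its name is on your machine) gives three lists under the keys 'business', 'user' and 'review'. Everything is held in memory, which should be fine for a few hundred thousand reviews but may need a streaming approach for larger dumps.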
Yelp also holds a Yelp Dataset Challenge where over $35,000 in cash prizes are
awarded.

For the dataset challenge, Yelp provides a larger dataset than the Academic Dataset
mentioned above. At present (when this answer is written), the Challenge Dataset
includes information about local businesses in 10 cities across 4 countries.
The Challenge Dataset contains:
1.6M reviews and 500K tips by 366K users for 61K businesses
481K business attributes, e.g., hours, parking availability, ambience.
Social network of 366K users for a total of 2.9M social edges.
Aggregated check-ins over time for each of the 61K businesses

Wim Van Leuven, co-organizer of BigData.be, co-founder at BigBoards.io


Written Dec 2, 2014 Upvoted by William Chen, Data Scientist at Quora and Jerrod Lowmaster,
LinkedIn Data Scientist

Recently I came across CERN's open data initiative. Having talked to a few guys that have
worked there, I'm pretty sure these guys currently gather one of the largest datasets in the
world! Have a look at CERN Open Data Portal
Hope this helps!
-w

Gregory Piatetsky, KDnuggets Editor. Analytics/Data Mining Consultant. KDD and SIGKDD co-founder...
Written Sep 9, 2013

Here is KDnuggets' large and comprehensive list of Government, State, City, Local, and Public datasets


Atakan Cetinsoy, SaaS Product Strategy | Data Science | Lean Startup Advisory |
Go-to-Market Plan
Written Sep 27, 2015

Since we get asked this question by our Machine Learning oriented users very frequently,
my company (BigML) has compiled a list with over 250 sources here:
List of Public Data Sources Fit for Machine Learning
You may also want to check out the related blog post for some more context:
Data, Data, Data: Thousands of Public Data Sources

Erik Hille, Economist-SMU Alpinist Actuary Biologist-Caltech Father Dreamer de la Mancha
Written Aug 14

Large data sets, mostly from finance and economics, that could also be applicable in related
fields studying the human condition:
World Bank Data. Lots of years. Lots of countries (Countries | Data). Lots of data
variables (Topics | Data - Indicators | Data - Catalog).
Your Window Into U.S. Federal Statistics
FRB: Data Releases
Federal Reserve Economic Data
Our government also likes to stay globally informed and is willing to share some of that
data: CIA - The World Factbook
Human Development Reports - United Nations Development Programme - Public Data Explorer
Consumer Price Index
Unveiling the beauty of statistics for a fact based world view. (http://www.gapminder.org/)
Data Plotter
Possibly also look at the Human Capital Report 2015, which has rankings on a human capital
index that covers various measures of education and productivity capabilities.
International Trade
International Historical Statistics (by Brian Mitchell)
Data: Aggregate trade (current value), bilateral trade with main trading partners
(current value), and major commodity exports by main exporting countries. No
data on trade as share of GDP is readily available.
Geographical coverage: Countries around the world
Time span: Long time series with annual observations from 19th century up to
today (2010)
Available at: The books are published in three volumes covering more than 5000
pages. At some universities you can access the online version of the books
where data tables can be downloaded as ePDFs and Excel files. The online access
is here.
Data from the 19th century onwards for countries around the world is available in
the International Historical Statistics (IHS). These statistics, originally published
under the editorial leadership of Brian Mitchell (since 1983), are a collection of
data sets taken from many primary sources, including both official national and
international abstracts.

Penn World Tables
Data: Real and PPP-adjusted GDP in US millions of dollars, national accounts
(household consumption, investment, government consumption, exports and
imports), exchange rates and population figures.
Geographical coverage: Countries around the world
Time span: 1950-2011 (version 8.1)
Available at: Online here
Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), The Next
Generation of the Penn World Table, forthcoming American Economic Review,
available for download at www.ggdc.net/pwt

Correlates of War Bilateral Trade
Data: Total national trade and bilateral trade flows between states. Total imports
and exports of each country in current US millions of dollars and bilateral flows in
current US millions of dollars
Geographical coverage: Single countries around the world
Time span: 1870-2009
Available at: Online at www.correlatesofwar.org
This data set is hosted by Katherine Barbieri, University of South Carolina, and
Omar Keshk, Ohio State University.

World Bank World Development Indicators
Data: Trade (% of GDP) and many more specific series: trade in merchandise,
trade in services, trade in high-technology, trade in ICT goods, trade in ICT services;
always exports and imports separately. Also export and import value index and
volume index.
Geographical coverage: Countries and world regions
Time span: Annual since 1960
Available at: Online at http://data.worldbank.org

UN Comtrade
Data: Bilateral trade flows by commodity
Geographical coverage: Countries around the world
Time span: 1962-2013
Available at: Online here

UNCTADstat
Data: Many different measures, including trade by volumes and value
Geographical coverage: Countries around the world
Time span: For some series, data is available since 1948; mostly annual,
sometimes quarterly.
Available at: Online here

Eurostat COMEXT
Data: Trade flows (also by commodity)
Geographical coverage: Europe (EU and EFTA)
Time span: Mostly since 1988
Available at: Online here
Also, the Eurostat website Statistics Explained publishes up-to-date statistical
information on international trade in goods and services.

World Trade Organization (WTO)
Data: Many series on tariffs and trade flows
Geographical coverage: Countries around the world
Time span: Since 1948 for some series
Available at: Online here

CEPII database on the World Economy
Data: Many different data sets related to international trade, including trade flows
by commodity, geographical variables, and variables to estimate gravity models
Geographical coverage: Countries around the world
Time span: Some series go back to the 1990s.
Available at: Online here

NBER-United Nations Trade Data, 1962-2000
Data: Export and import values and volumes by commodity
Geographical coverage: Single countries
Time span: 1962-2000
Available at: Online here
This data is also available from the Center for International Data

Smaller historical trade datasets
Data on UK bilateral trade for the time 1870-1913 was collected by David S.
Jacks. It is downloadable in Excel format here.
For the time 1870-1913, 21,000 bilateral trade observations can be found in
Mitchener and Weidenmier (2008), Trade and empire, available in the Economic
Journal here.
Data on UK, Germany, France, and US between mid-19th to 20th Century can
be found here.
Data on Developing Country Export in 1840, 1860, 1880 and 1900 by John
Hanson is available here.
Data on trade between England and Africa during the period 1699-1808 is
available on the Dutch Data Archiving and Networked Services. It was compiled
by Marion Johnson.
Applying these same sources to education quality in developing countries:
Education Index: multiple sheets of Excel data are available at Human
Development Reports, or you can use their tool to explore the data (Human
Development Reports). Google also offers a way to explore the data (Google Public Data
Explorer). Additional indexes in this HD report that you might be interested in are the
Human Development Index, the Adult Literacy Index and the Gross enrollment ratio.
The World Bank has literacy rates (Adult literacy rate, population 15+ years, both
sexes (%)) in addition to lots of other data: World Bank Data. Lots of years. Lots of
countries (Countries | Data). Lots of data variables (Topics | Data - Indicators |
Data - Catalog | The World Bank).
Our government also likes to stay informed and is willing to share some of that data:
CIA - The World Factbook
Possibly also look at the Human Capital Report 2015, which has rankings on a human
capital index that covers various measures of education and productivity capabilities.
Unveiling the beauty of statistics for a fact based world view. (http://www.gapminder.org/)
Data Plotter - has Average Test Scores
Penn World Tables. Data: Real and PPP-adjusted GDP in US millions of dollars,
national accounts (household consumption, investment, government consumption,
exports and imports), exchange rates and population figures. Feenstra, Robert C.,
Robert Inklaar and Marcel P. Timmer (2015), The Next Generation of the Penn
World Table, forthcoming American Economic Review, available for download at
www.ggdc.net/pwt

Alf Fyhrlund, http://swedeneurostat.blogspot.se/ http://stataccess.blogspot.se/ http://www....
Written Mar 22, 2013

Sweden Statistical database
What is the Statistical database?
Since January 1997, Statistics Sweden has had databases available on the Internet. The aim
is to provide increased access to statistics and allow users to easily download information
to their own computers.
Statistical database
Content and search
The Statistical database contains a large amount of official statistics that Statistics Sweden
is responsible for. Also included are official statistics from other statistical authorities. The
database contains a number of tables where selected information can be presented on the
screen, in print or transmitted to the user's computer for further processing.
The search process can be made in three ways:
- via the link NYA SIFFROR Välj från senast uppdaterade tabeller (only in the
Swedish version of the website). Nya siffror shows the latest updated tables in the
Statistical database.
- via the subject areas
- or via Search the Statistical database.
The Statistical database is available free-of-charge. When making minor retrievals of less
than 10000 table cells, registration is not necessary. For larger retrievals and some future
supplementary services, registration is done by completing the registration form.
Large statistical files (PC-Axis) (only in the Swedish version of the website)
The database capacity is limited when it comes to large retrievals. In order to best serve
users of very large retrievals, ready-made statistics files in PC-Axis format have been
created, mainly for regionally distributed material.

PC-Axis
PC-Axis is software that handles very large statistical tables. PC-Axis can be used for
processing ready-made statistics files or PC-Axis files from the database. The program can
also pass on the statistics to other programs such as spreadsheets, etc. PC-Axis can be
downloaded free-of-charge from this website.
Services in connection with the Statistical databases
Tailor-made database retrievals on CD-ROM or diskette
Tailor-made retrievals can be ordered for delivery on diskette or CD-ROM. The price
depends on the production cost.
Micro databases
Micro databases are available after a harm test of de-identified (anonymised) data is done
at Statistics Sweden. More information on registers is available in Documentation of
statistics (only in the Swedish version of the website).
Courses
Courses are held regularly (in Swedish) as an aid for those who want to use the Statistical
database. For more information on contents, times and prices of courses, check the
Swedish version of the website, Kurser.
For more information, please contact Statistics Sweden's Information services
Postal address: Box 24300, SE-10451 Stockholm, Sweden
Telefax: +46-8-506 948 99
Telephone: +46-8-506 948 01
9.1k Views View Upvotes

Robert Morton, Data Nerd at Tableau Software (Senior Software Engineer)


Written Oct 23, 2011

The Bureau of Transportation Statistics (bts.gov) has tremendous amounts of data on
airline on-time performance and delays, airfares, fuel costs, etc. Most are very wide, and
several data sets range from 100M to 300M rows. Here's an index of their best data sets:
http://www.transtats.bts.gov/dat...
-Robert
6.8k Views View Upvotes

Anton Tarasenko
Updated Dec 5, 2014

Custom Google Search

You can use the Custom Google Search for datasets:
Google Custom Search: Datasets
230 sources and meta-sources of datasets, including all mentioned in this question. Feel
free to exclude .gov and any other websites from results by adding "-.gov" or "-site.com" to the search line. Other Google Search Operators work.
Don't hesitate to contact me if you have ideas about which websites to add.
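
The exclusion trick described above can be sketched in code. This is a minimal sketch that only assembles the text typed into the search line; the function name `build_query` and the sample search terms are hypothetical, and it follows the answer's "-" prefix convention rather than any particular search API.

```python
from urllib.parse import quote_plus


def build_query(terms, excluded=()):
    """Join search terms and append each excluded pattern with a leading minus."""
    parts = list(terms) + [f"-{pattern}" for pattern in excluded]
    return " ".join(parts)


# Hypothetical example: look for airline delay data, skipping .gov sites.
query = build_query(["airline", "delays", "csv"], excluded=[".gov", "site.com"])
print(query)              # the text for the search line
print(quote_plus(query))  # the same text, escaped for use in a URL parameter
```

Keeping the exclusions in a list makes it easy to reuse the same filter across several searches.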

IOGDS
The following service organizes more than 1,000,000 public datasets:
IOGDS: International Open Government Dataset Search
12.7k Views View Upvotes

Alan Morrison, Researches the topic for publications


Written Feb 16

Reposting from Alan Morrison's answer to Where on the web can I find free samples of Big
Data to analyze?
This link list, available on Github, is quite long and thorough:
caesar0301/awesome-public-datasets. You will see many census data sources listed. Then
the challenge becomes how to get to what you really want and can use.
Note that this list also references a Quora answer that also includes a long list: Where can I
find large datasets open to the public?
For your convenience, I've copied the list of lists as it stood in January 2015 here, but won't
be updating it:
Awesome Public Datasets
This list of public data sources is collected and tidied from blogs, answers, and user
responses. Most of the data sets listed below are free; however, some are not. Other
amazingly awesome lists can be found in awesome-awesomeness and another
awesome list.
Agriculture
U.S. Department of Agriculture's PLANTS Database
Biology
1000 Genomes
Collaborative Research in Computational Neuroscience (CRCNS)
Gene Expression Omnibus (GEO)
Human Microbiome Project (HMP)
ICOS PSP Benchmark
MIT Cancer Genomics Data
NIH Microarray data (FTP)
Protein Data Bank
PubChem Project
PubGene (now Coremine Medical)
Stanford Microarray Data
The Personal Genome Project or PGP
UCSC Public Data
UniGene
Climate/Weather
Australian Weather
Canadian Meteorological Centre
Climate Data from UEA (updated monthly)
Global Climate Data Since 1929
NOAA Bering Sea Climate
NOAA Climate Datasets
NOAA Realtime Weather Models
WU Historical Weather Worldwide
Complex Networks
CrossRef DOI URLs
DBLP Citation dataset
NBER Patent Citations
NIST complex networks data collection
Protein-protein interaction network
PyPI and Maven Dependency Network
Scopus Citation Database
Stanford GraphBase (Steven Skiena)
Stanford Large Network Dataset Collection
The Koblenz Network Collection
The Laboratory for Web Algorithmics (UNIMI)
UCI Network Data Repository
UFL sparse matrix collection
WSU Graph Database
Computer Networks
3.5B Web Pages from CommonCrawl 2012
53.5B Web clicks of 100K users in Indiana Univ.
CAIDA Internet Datasets
ClueWeb09 - 1B web pages
ClueWeb12 - 733M web pages
CommonCrawl Web Data over 7 years
CRAWDAD Wireless datasets from Dartmouth Univ.
Open Mobile Data by MobiPerf
UCSD Network Telescope, IPv4 /8 net
Data Challenges
Challenges in Machine Learning
DrivenData Competitions for Social Good
ICWSM Data Challenge (since 2009)
Kaggle Competition Data
KDD Cup by Tencent 2012
Localytics Data Visualization Challenge
Netflix Prize
Yelp Dataset Challenge
Economics
American Economic Association (AEA)
EconData from UMD
Internet Product Code Database
Energy
AMPds
BLUEd
COMBED
Dataport
ECO
EIA
HFED
iAWE
Plaid
REDD
UK-Dale
Finance
CBOE Futures Exchange
Google Finance
Google Trends
NASDAQ
OANDA
OSU Financial data
Quandl
St Louis Federal
Yahoo Finance
GeoSpace/GIS
BODC - marine data of ~22K vars
EOSDIS - NASA's earth observing system data
Factual Global Location Data
Global Administrative Areas Database (GADM)
Geo Spatial Data from ASU
GeoNames Worldwide
Natural Earth - vectors and rasters of the world
Open Street Map (OSM)
TIGER/Line - U.S. boundaries and roads
TwoFishes - Foursquare's coarse geocoder
TZ Timezones shapefiles
Government
Australia (abs.gov.au)
Australia (data.gov.au)
Canada
Chicago
EuroStat
FedStats
Germany
Glasgow, Scotland, UK
Guardian world governments
London Datastore, UK
MassGIS, Massachusetts, U.S.
Netherlands
New Zealand
NYC betanyc
NYC Open Data
OECD
Open Government Data (OGD) Platform India
San Francisco Data sets
South Africa
The World Bank
U.K. Government Data
U.S. American Community Survey
U.S. CDC Public Health datasets
U.S. Census Bureau
U.S. Department of Housing and Urban Development (HUD)
U.S. Federal Government Agencies
U.S. Federal Government Data Catalog
U.S. Food and Drug Administration (FDA)
U.S. Open Government
UK 2011 Census Open Atlas Project
United Nations
Healthcare
EHDP Large Health Data Sets
Gapminder World, demographic databases
Medicare Coverage Database (MCD), U.S.
Medicare Data Engine of medicare.gov Data
Medicare Data File
Image Processing
2GB of Photos of Cats
Face Recognition Benchmark
ImageNet - an image database in WordNet hierarchy
Machine Learning
Delve Datasets for classification and regression (Univ. of Toronto)
Discogs Monthly Data
eBay Online Auctions (2012)
IMDb Database
Keel Repository for classification, regression and time series
Lending Club Loan Data
Machine Learning Data Set Repository
Million Song Dataset
More Song Datasets
MovieLens Data Sets
RDataMining - "R and Data Mining" ebook data
Registered Meteorites on Earth
Restaurants Health Score Data in San Francisco
UCI Machine Learning Repository
Yahoo! Ratings and Classification Data
Museums
Cooper-Hewitt's Collection Database
Minneapolis Institute of Arts metadata
Tate Collection metadata
The Getty vocabularies
Natural Language
ClueWeb09 FACC
ClueWeb12 FACC
DBpedia - 4.58M things with 583M facts
Flickr Personal Taxonomies
Google Books Ngrams (2.2TB)
Google Web 5gram (1TB, 2006)
Gutenberg eBooks List
Hansards text chunks of Canadian Parliament
Machine Translation of European languages
SMS Spam Collection in English
USENET postings corpus of 2005~2011
Wikidata - Wikipedia databases
Wikipedia Links data - 40 Million Entities in Context
WordNet databases and tools
Physics
CERN Open Data Portal
NSSDC (NASA) data of 550 spacecraft
Public Domains
Amazon
Archive.org Datasets
CMU JASA data archive
CMU StatLab collections
Data360
Datamob.org
Google
Infochimps
KDNuggets Data Collections
Numbrary
Reddit Datasets
RevolutionAnalytics Collection
Sample R data sets
Stats4Stem R data sets
StatSci.org
The Washington Post List
UCLA SOCR data collection
UFO Reports
Wikileaks 911 pager intercepts
Yahoo Webscope
Search Engines
Academic Torrents of data sharing from UMB
Archive-it from Internet Archive
Datahub.io
DataMarket (Qlik)
Freebase.com of people, places, and things
Harvard Dataverse Network of scientific data
ICPSR (UMICH)
Statista.com - statistics and Studies
Social Sciences
Ancestry.com Forum Dataset over 10 years
CMU Enron Email of 150 users
Facebook Data Scrape (2005)
Facebook Social Networks from LAW (since 2007)
Foursquare Social Network in 2010, 2011
Foursquare from UMN/Sarwat (2013)
General Social Survey (GSS) since 1972
GetGlue - users rating TV shows
GitHub Collaboration Archive
Mobile Social Networks from UMASS
PewResearch Internet Survey Project
SourceForge.net Research Data
StackExchange Data Explorer
Titanic Survival Data Set
Twitter Graph of entire Twitter site
UCB's Archive of Social Science Data (D-Lab)
UCLA Social Sciences Data Archive
UNIMI/LAW Social Network Datasets
Universities Worldwide
UPJOHN for Labor Employment Research
Yahoo! Graph and Social Data
Youtube Video Social Graph in 2007,2008
Sports
Betfair Historical Exchange Data
Cricsheet Matches (cricket)
Ergast Formula 1, from 1950 up to date (API)
Football/Soccer resources (data and APIs)
Lahman's Baseball Database
Retrosheet Baseball Statistics
Time Series
Time Series Data Library (TSDL) from MU
UC Riverside Time Series Dataset
Transportation
Airlines OD Data 1987-2008
Bike Share Systems (BSS) collection
Hubway Million Rides in MA
Marine Traffic - ship tracks, port calls and more
NYC Taxi Trip Data 2013 (FOIA/FOILed)
OpenFlights - airport, airline and route data
RITA Airline On-Time Performance data
RITA/BTS transport data collection (TranStat)
Transport for London (TFL)
Travel Tracker Survey (TTS) for Chicago
U.S. Bureau of Transportation Statistics (BTS)
U.S. Domestic Flights 1990 to 2009
U.S. Freight Analysis Framework since 2007
Complementary Collections
DataWrangling: Some Datasets Available on the Web
Inside-r: Finding Data on the Internet
Quora: Where can I find large datasets open to the public?
like being punched in the brain! : 100+ Interesting Data Sets for Statistics
StaTrek: Leveraging open data to understand urban lives
Source: Xiaming's Github caesar0301/awesome-public-datasets, January 2015. Please
go to Github for this and other updated lists.
2.2k Views View Upvotes

Attila Csordas, Cloudera Certified Hadoop Developer


Written Dec 13, 2014

In mass spectrometry proteomics, the ProteomeXchange consortium has been set up to
provide a coordinated submission of MS proteomics data to the main existing proteomics
repositories, and to encourage optimal data dissemination: Page on
proteomexchange.org. Partner repositories: PRIDE Archive at EMBL-EBI in Hinxton,
Cambridgeshire, UK; PASSEL at ISB in Seattle; MassIVE at UCSD, San Diego. PRIDE
accounts for ~90% of the data, currently ~1600 datasets in total, of which ~50% are
public, ~70 TB. Some individual datasets are in the TB range, mainly due to unprocessed,
binary machine raw files. See also ProteomeXchange Datasets
4.4k Views View Upvotes

Orin Hargraves, what do I wait for?


Written Jun 16, 2011

Two fully annotated corpora, put together for use by researchers and lexicographers, are:
The BNC (British National Corpus) http://www.natcorp.ox.ac.uk/
and
COCA (Corpus of Contemporary American English)
http://www.americancorpus.org/
The BNC is a little dated now. COCA is excellent, though its user interface is a little clunky
at times.
If you have legitimate, nonprofit research concerns, you may be able to get access to the
granddaddy of them all, the Oxford English Corpus. For commercial use there is fee-based access:
http://oxforddictionaries.com/pa...
4.3k Views View Upvotes

Shehroz Khan, I have to play with data to tame it


Written Feb 1

Always try this infallible technique. It always works.
Otherwise, you may like to see these:
IBM Knowledge Center
NASA
Datasets for education and for fun
Science Hack Day / Datasets
Science On a Sphere
1.8k Views View Upvotes

Abdelbarre Chak, Big Data


Written Jul 29

Here is a list of open datasets:
Data.gov (USA)
The World Bank DataBank
http://www.reddit.com/r/datasets
A Deep Catalog of Human Genetic Variation (Size: 396.7TB)
City of Chicago | Data Portal (Size: 9.5GB)
Google Ngram Viewer (Size: 863.4GB)
Open Government (Canada)
Education - Data.gov (Education)
School of Geographical Sciences & Urban Planning (Geo-data)
Hope it's helpful


535 Views View Upvotes

Thia Kai Xin, Data scientist at Lazada, Co-Founder of DataScience SG.


Written Apr 7

My favorites are:
Awesome Public Datasets
100+ Interesting Data Sets for Statistics
7 Datasets You've Likely Never Seen Before
Another collection of free and open-source datasets
1k Views View Upvotes

Ferris Jumah, Data and Products


Written Jan 23, 2014

The best source of structured data I've seen so far is the UCI Machine Learning Repository:
Data Sets
This question has extensive resources for data sets open to the public: Where can I find
large datasets open to the public?
5.5k Views View Upvotes

Chris Metcalf, Director of Platform / Developer Evangelist at Socrata


Written Jan 8, 2011

Socrata hosts open data websites for a number of governments, government agencies, and
non-profits including:
http://data.seattle.gov
http://data.cityofchicago.org
http://data.medicare.gov
http://data.sunlightlabs.com
http://www.datakc.org
http://gettingpastgo.socrata.com
http://data.govloop.com

There are also over 100K datasets available on our public data portal,
http://opendata.socrata.com
9.6k Views View Upvotes

Ian Johnson, Data Alchemist at Lever (http://lever.co)


Written Jan 23, 2014

I've been begging for this. Seriously, someone take my money!
One startup I'm excited about is Enigma.io. They are curating and nicely formatting open
and public data, and providing a slick interface for searching and exporting as well.
Another good source of serious and well-formatted data is the Bureau of Labor Statistics.
There was a startup called buzzdata that tried to be the GitHub of data for a while, but they
pivoted away :(
4.3k Views View Upvotes

Krishnan Srinivasarengan, .
Written Jan 21, 2013

For Non-Intrusive Appliance Load Monitoring (NILM) research, databases are emerging. While
REDD is one instance (already in another answer), there are a few more of them (not as
comprehensive):
BLUED: NILM@CMU
Tracebase: tracebase » Welcome
UMass Smart*: Smart - UMass Trace Repository
6k Views View Upvotes

Ben Hamner
Written Feb 6

Kaggle recently launched Kaggle Datasets. You can download high quality public
datasets here, run analytics on them through Kaggle Scripts, see others' analyses, and
discuss them in the forums.
Here's a blog post describing this in more depth: Introducing Kaggle Datasets
2.1k Views View Upvotes

Anonymous
Written Apr 15, 2014

Since I haven't seen it mentioned yet, and work at one of the main sources of its data:
SMOKA, the Subaru-Mitaka-Okayama-Kiso Archive, holds about 15 TB of astronomical
data from facilities run by the National Astronomical Observatory of Japan. All data
becomes publicly available after an embargo period of 12-24 months (to give the original
observers time to publish their papers).
With over a decade of data from some facilities and instruments, it has now become
possible for many researchers to make discoveries just by looking at archived data for
something other than what the original observers had in mind.
Astrophotographer Robert Gendler has also processed images from the SMOKA archive
to create several "NASA Astronomy Picture of the Day" winners.
10.9k Views View Upvotes

Mike Lambert, streetart.cats.gadgets.cameras


Written May 30, 2014

The Global Database of Events, Language, and Tone


"The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct
a catalog of human societal-scale behavior and beliefs across all countries of the world,
connecting every person, organization, location, count, theme, news source, and event
across the planet into a single massive network that captures what's happening around the
world, what its context is and who's involved, and how the world is feeling about it, every
single day. - See more at: The Global Database of Events, Language, and Tone"
"The entire GDELT dataset is available for free download, including over a quarter-billion
event records capturing over 300 categories of human society across every corner of the
globe georeferenced to the city level back to 1979 and updated each morning, and the
massive Global Knowledge Graph recording the underlying actors, themes, and
relationships that underlie global society. - See more at: The Global Database of Events,
Language, and Tone"
6.7k Views View Upvotes

Alket Cecaj, PhD in location data mining.


Updated Apr 19

If you are looking for mobility data, there is the Telecom Italia Big Data Challenge dataset.
You can find it here: Open Data Institute - node Trento
It's about 120 GB of data, and there are 7 different types of datasets from city life.
Another mobility dataset is Data 4 Development (D4D), released by Orange, a
French operator. In 2013 they released Call Detail Records (CDR) for Ivory Coast, and in
2014 CDR data for Senegal.
Info about the challenge can be found here: http://www.d4d.orange.com/en/home
A new challenge organized by the American Statistical Association can be found here: Support
the Data Challenge at JSM 2015
If you want some more datasets of any kind, from pollution data to social network data,
then check this post: Data sets of any type: some links. by Alket Cecaj on Algorithms
and DataFusion
The post is updated regularly as I find new data sets, such as the Panama Papers dataset.
3.3k Views View Upvotes

Christian Pietsch, computational linguist and digital library technologist


Written Sep 27, 2012

If you are interested in research datasets (large and small), these sites let you search for
them:
http://databib.org/ (a collaborative, annotated bibliography of primary research
data repositories)
http://datacite.org/ (support researchers by helping them to find, identify, and cite
research datasets with confidence)
3.6k Views View Upvotes

Mike Kruger
Written May 16, 2014

IRI has a large (130 gigabyte) set of consumer packaged goods marketing data available.
30 categories, 11 years. For information see Academic Data Set - IRI
3.9k Views View Upvotes

Gopi Krishnan Nambiar, Software Engineer at Salesforce

Written Jan 21, 2013

A dataset of 13 billion clicks was made available for research on Jan 20, 2013 here:
Center for Complex Networks and Systems Research
3.6k Views View Upvotes

Deepshikha Mehta, Program Management


Written Feb 24

This online course on applied machine learning provides a released dataset for a
datathon.
Aspiring Minds presents AMDataBootcamp2016, an online + offline bootcamp on
applying machine learning to real-world problems. Register to get this unique offering,
comprising a MOOC + a data release + a data competition + a one-day workshop. The last
date for submissions is 8th March 2016. To enroll now and dive deep into ML: Aspiring
Minds University | Boot Camp

920 Views View Upvotes

Yasmin Lucero, statistician/mathematical biologist/data scientist


Written Jan 23, 2014

R has a built-in package called datasets. It has several structured datasets that are useful
for testing and learning. Type library(help=datasets) to get a list. These are available in
your namespace at all times, but they are lazy-loaded. To use them, just call them by name,
e.g. str(iris).
2.3k Views View Upvotes

Yuval Feinstein, Algorithmic Software Engineer in NLP,IR and Machine Learning


Written Feb 16

Please see Bernard Marr's Big Data: 33 Brilliant And Free Data Sources For 2016
810 Views View Upvotes

Franck Dernoncourt, PhD student in AI @ MIT


Written Dec 26, 2013

See databases of open databases.


15.7k Views View Upvotes

Alex Copulsky
Written Feb 10, 2014

If you're interested in the social sciences, a great resource is the University of Michigan's
Inter-university Consortium for Political and Social Research, or ICPSR. It has
all sorts of public data sets, as well as the data sets used in many published papers.
5.5k Views View Upvotes

Kevin Edward Kline, data and database expert, I know a 'lil bit about Twitter and
social media
Written Mar 5, 2014

I wrote a blog post about this a while back. For large data sets to tinker with, I recommend
that you go to data.gov for large USA data sets or Data Search | data.gov.uk for large UK
data sets. In both cases, you'll find a wide variety of data to play with.
Also, don't forget TCP.ORG the Leading Tcp Site on the Net.
2.6k Views View Upvotes

Phillip Rhodes, Open Source hacker, founder of Fogbeam Labs


Written Jan 11, 2011

FWIW, there's a subreddit dedicated to cataloging available datasets. It may be of interest
to you:
http://reddit.com/r/datasets
And on a related note, there is also:
http://reddit.com/r/opendata
2.1k Views View Upvotes

Pete Warden
Written Jan 8, 2011 Upvoted by Bradley Voytek, Former Data Scientist, Uber Inc. and Leo
Polovets, Partner at a data-focused seed fund (Susa Ventures). Worked at Factual.
http://codingvc.com/

Here are the ones I've found most useful:
CrunchBase, US Census, Google Public Data, Infochimps, Timetric, Factual, Freebase,
Wikipedia, World Bank, Kaggle
I cover them in more detail in a free ebook here:
http://radar.oreilly.com/2011/01...
8.5k Views View Upvotes

Adam Nyhan, Attorney, former Congressional aide, Mainer, entrepreneur


Written Jan 21

In the legal world, the Enron dataset is often considered the best public-access dataset. My
recollection is that it was opened to the public by a federal regulatory agency in the course
of its Enron investigation. There is a massive industry of "litigation support technology"
and "electronic discovery" firms that develops software to mine and analyze enormous
data sets, and the Enron set is often trotted out in marketing demonstrations of these
software products to demonstrate their effectiveness. Thanks to Shimonee Shah for the
link to it:
Enron Email Dataset
1.2k Views View Upvotes

Gaurav Bhardwaj
Written Sep 21, 2014

I have been collecting this dataset provided by UIDAI:
Aadhaar (UIDAI), a wonderful dataset provided by the Indian government.
Things I like about this dataset:
- Great way for beginners like me to explore Data Science basics using the latest tools like
IPython, Pandas, Anaconda, etc.
- This dataset is used by Udacity courses (Introduction to Data Science); see
references for videos.
- It is real-time data; it updates every other day.
- You can use REST API calls to get the data for a particular day, a particular month, or
just the latest data.
- It's probably going to be a huge dataset, considering India's population. For more info
regarding download please see:
http://bhardwajgaurav.wordpress....
4.2k Views View Upvotes

Sourabh Daptardar
Written Sep 14, 2013

http://data.gov.in/
The Indian government offers about 4K datasets collected from about 50 departments for
analysis: http://data.gov.in/ and the list is growing. Not every dataset might be 'big
data' from a computer science perspective, but it is, nevertheless, a good source.
12.8k Views View Upvotes

Lorenzo Ruzzene, Web Content & Social Media Specialist


Written Feb 3, 2015

I'd suggest Ookla's Net Index source data (1.5 GB)
"Download the largest publicly available dataset of anonymous broadband speed and
quality test results, with data from every geographic region currently represented in
NetIndex going back to January 2008." Global Broadband
5.3k Views View Upvotes

Martin Linkov, UniGraph.rocks


Written Dec 12, 2014

Open Data collection for Greece
CrunchBase Data Exports (.xls)
Crunchbase people and organizations in .csv
http://static.crunchbase.com/exp...
2k Views View Upvotes

Dror Cohen, CEO & Co-Founder at CodersClan


Written May 11, 2014

The Stack Exchange network has its whole database open for queries, and you can even
download the whole dump yourself. It contains mostly data from Stack Overflow.
Stack Exchange Data Explorer
3.9k Views View Upvotes

Mark Posen, Chartered engineer in satellite and radio comms.


Updated Jun 17, 2011

The Shuttle Radar Topography Mission database is a near-global high-resolution digital
topographic database of Earth (i.e. it maps the whole Earth with terrain altitude data at
around a 90m resolution). The data may be accessed here:
http://www2.jpl.nasa.gov/srtm
A large number of different Earth science datasets are available from NASA WIST
(Warehouse Inventory Search Tool):
https://wist.echo.nasa.gov/~wist...
3k Views View Upvotes

Hilary Mason, I ♥ data and cheeseburgers.


Written Apr 5, 2011 Upvoted by Mark Meloon, US Head of Data Science at Impetus and Ankit
Sharma, Data Scientist at DataRPM

I've been collecting public research-quality datasets here: http://bit.ly/bundles/hmason/1
Feedback and additional datasets are welcome!
30.2k Views View Upvotes

Sandeep Vasani, Software Engineer


Updated Jul 6, 2012

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees
of the Enron Corporation. I have used the Enron Email Corpus for training and testing my
email classification algorithm.
https://www.cs.cmu.edu/~enron/
Download link [tgz] https://www.cs.cmu.edu/~enron/en...
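
As a rough illustration of preparing such emails for a classifier, the sketch below parses one message with Python's standard `email` module and pulls out a few fields. The sample message, the `extract_features` helper, and the choice of features are all made up for illustration; this is not the pipeline used in the answer above, and it assumes each message is a plain-text file with standard mail headers.

```python
import re
from email import message_from_string

# A made-up message in the usual "headers, blank line, body" layout.
SAMPLE = """\
Message-ID: <1234.example@hypothetical>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com
Subject: Quarterly forecast

Here is the forecast we discussed.
"""


def extract_features(raw_message):
    """Return sender, subject and lowercase body tokens for one raw message."""
    message = message_from_string(raw_message)
    body = message.get_payload()  # plain string for a non-multipart message
    tokens = re.findall(r"[a-z']+", body.lower())
    return {
        "sender": message["From"],
        "subject": message["Subject"],
        "tokens": tokens,
    }


features = extract_features(SAMPLE)
print(features["sender"], "|", features["subject"])
print(features["tokens"])
```

The resulting token lists could then be fed to whatever classifier you prefer; that step is left out here on purpose.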
4.4k Views View Upvotes

Ian C. Grieve
Written Apr 7, 2011

Another bioinformatics dataset repository worth mentioning here is EnsEMBL:
http://www.ensembl.org.
EnsEMBL contains genomics information, including annotated DNA sequence and protein
sequence, for a wide range of species.
An API, written in Perl, is provided with documentation
(http://www.ensembl.org/info/docs...). Additionally, the data can be downloaded from the EnsEMBL FTP site.
4.3k Views View Upvotes

Stefan de Konink, is chairman of Stichting OpenGeo, a Dutch non-profit targeting
the availabili...
Written May 12, 2014

Research :: NDOV Loket offers historical realtime vehicle information (Automatic
Vehicle Location, Fleet Tracking) for The Netherlands. You can see what it does in real time at
OVradar or Live Openbaar Vervoer.
About 1GB per day is collected in CSV format, which compresses to about 80MB with LZMA.
We now have over a year's worth of data.
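
Compressed CSV dumps of this kind can be read as a stream without unpacking them to disk first. The following is a minimal sketch using Python's standard `lzma` and `csv` modules; the column names and sample rows are invented for illustration and do not reflect the actual NDOV Loket file layout.

```python
import csv
import io
import lzma

# Build a tiny compressed CSV in memory so the example is self-contained.
# Column names and values are hypothetical.
SAMPLE_ROWS = [
    ("timestamp", "vehicle_id", "latitude", "longitude"),
    ("2014-05-12T08:00:00", "bus-1", "52.37", "4.89"),
    ("2014-05-12T08:00:05", "tram-7", "52.09", "5.12"),
]
_buffer = io.StringIO()
csv.writer(_buffer).writerows(SAMPLE_ROWS)
COMPRESSED = lzma.compress(_buffer.getvalue().encode("utf-8"))


def iter_rows(source):
    """Yield one dict per CSV row from a compressed file path or binary file object."""
    with lzma.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


for row in iter_rows(io.BytesIO(COMPRESSED)):
    print(row["vehicle_id"], row["latitude"], row["longitude"])
```

Because rows are yielded one at a time, memory use stays small even when a day's file expands to around a gigabyte.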
4k Views View Upvotes

Chris Thomson, Waterloo CS student + Shopify dev


Written Jan 12, 2011

The City of Toronto publishes a few interesting datasets. Their Dinesafe dataset is
particularly interesting, as it contains information about every restaurant's inspection
(infractions, etc) conducted by Toronto Public Health. You can find all of Toronto's open
datasets at http://toronto.ca/open .
3.6k Views View Upvotes

Ryan Compton, http://ryancompton.net/


Written May 11, 2014

Several TB of network connectivity data: M-Lab (easy to work with via Google's BigQuery)
Lots of social networks: Stanford Network Analysis Project


4k Views View Upvotes

Abhishek Gupta, have wanderlust, want to experience ambedo again & again, a
kleptomaniac for ...
Updated Feb 11, 2014

1. Academic Torrents
2. Links to free data sets for computer vision applications
3. Amsterdam Library of Object Images
4. The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images
dataset.
5. Traffic signs dataset
6. Machine Learning and Data Mining - Datasets
7. Quora Thread
8. DataMob
9. Some More shared on bitly
10. UCD Machine Learning Group
11. Some Links from Open Directory
12. A thread on dataWrangling
13. Kevin's Blog
14. Recommendation and Ratings Public Data Sets
15. Another Quora thread for Kinect Specific Data
16. /r/datasets
3.2k Views View Upvotes

Kunal Jain, loves probability, puns, pizza. (and alliteration)


Written Feb 3, 2015

I just thought I'll add Nation Master to the list, because I use it all the time.
For comparison of all kinds of statistics between countries:
International statistics: Compare countries on just about anything! NationMaster.com
2.2k Views View Upvotes

Drazen Zaric, Grad student interested in machine learning and data science
Written Dec 12, 2010

Stanford Large Network Dataset Collection has some pretty impressive datasets, like
complete Wikipedia edit history (till January 2008) or a collection of 467 million tweets
collected from June to December 2009.
http://snap.stanford.edu/data/in...
3.8k Views View Upvotes

Giuseppe Sollazzo, Senior Systems Analyst @ St. George's, University of London


Written Oct 19, 2011

Many countries are launching open-data portals. For data relating to Italy, these are the
main links:
- http://www.dati.gov.it/ (the main governmental website)
- http://dati.piemonte.it/ (data portal for Piemonte region, the first regional portal
developed)
- http://dati.emilia-romagna.it/ (data portal for Emilia Romagna region)
- http://data.enel.com/ (data portal for the ENEL company, an energy/gas supplier)
3.1k Views View Upvotes

Edmond Lau, Author of The Effective Engineer


Updated Mar 19, 2012 Upvoted by Mark Meloon, US Head of Data Science at Impetus and
William Chen, Data Scientist at Quora

Google Research released a large 24GB n-gram data set back in 2006 based on processing
10^12 words of text and published counts of all sequences up to 5 words in length:
http://googleresearch.blogspot.c...
You can also just search over a related data set via the Google Books Ngram Viewer:
http://books.google.com/ngrams/
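To make "counts of all sequences up to 5 words in length" concrete, here is a minimal sketch of n-gram counting on a toy sentence. The function name `ngram_counts` and the whitespace tokenization are my own assumptions for illustration; this is not how Google built the data set.

```python
from collections import Counter

def ngram_counts(text, max_n=5):
    """Count every word sequence of length 1..max_n in `text`."""
    words = text.lower().split()  # naive whitespace tokenization (an assumption)
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

counts = ngram_counts("to be or not to be")
print(counts[("to", "be")])  # how often the 2-gram "to be" occurs
print(len(counts))           # number of distinct n-grams found
```

On a corpus of 10^12 words you would need a distributed job rather than an in-memory Counter, but the counting idea is the same.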
47k Views View Upvotes

Jim Kenyon, Data science practitioner - all models are wrong. some models are
useful.
Written Jan 23, 2014

Data.gov is a great place for structured data.

1k Views View Upvotes

Brian Risk, Lover of data.


Written Oct 3, 2014

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The
great thing about this resource is that it gives a single way to access all of the data. The site
has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.
3.6k Views View Upvotes

Clinton Little, Coastal Program Specialist working to change the data climate in
Minnesota's ...
Updated Feb 4, 2011

Data.gov http://www.data.gov/
Multipurpose Marine Cadastre http://www.marinecadastre.gov


Digital Coast http://www.csc.noaa.gov/digitalc...
Geospatial One-Stop http://www.geodata.gov
nowCoast http://nowcoast.noaa.gov
Data.gov http://www.data.gov
Great Lakes Commission http://www.glc.org
Great Lakes Information Network http://www.greatlakes.net
The Lake Superior Binational Forum http://www.superiorforum.org
MN DNR Data Deli http://deli.dnr.state.mn.us
MnGeo http://www.mngeo.state.mn.us
NRRI Coastal GIS http://www.nrri.umn.edu/coastalGIS
Lake Superior Streams (Minnesota) http://www.lakesuperiorstreams.org
Minnesota Beach Health Warnings http://www.mnbeaches.org
North Shore GIS Collaborative http://ardcgis.org
2.7k Views View Upvotes

Eliot Jarrett, Digital Brain / Analog Mind, Voracious Reader, Data Synthesizer,
Strategist
Updated Mar 29, 2012

I've found Kaggle.com to be a fantastic resource, as the datasets relate to specific
business problems and are provided by the respective companies.
Kaggle holds contests for developing the best predictive models based on sourced datasets.
The current competitions are:
1. Improve credit scoring by predicting the probability someone will experience financial
distress within two years
2. Predict if a car purchased at auction is a "bad buy"
3. Identify patients who will be admitted to a hospital within the next year, using historical
claims data
Prizes are provided for the best predictive models, anywhere from $5,000 to $3 million
(for the health insurance competition).
You may use the datasets for free after signing up as a competitor, although there are legal
issues concerning ownership of predictive models that must be considered.
1.7k Views View Upvotes

Ossama Alami
Written Jul 4, 2014

Firebase provides a number of realtime datasets for free: Firebase Open Data Sets.
They're easy to use in web or mobile apps. Some of the data sets available:
Cryptocurrency/USD Exchange Rates (Bitcoin, Litecoin, Dogecoin)
Realtime Global Earthquake data
Public transit data & bus GPS positions for several US cities
Airport delay data
Realtime parking availability in SF
Weather data
2.9k Views View Upvotes

Disa Johnson, Marketing technology programmer. VP or director at digital agencies.
Written Aug 31, 2011

Although there are lots of answers here, many of which look very good, I'd add
http://www.wolframalpha.com , a search engine which spiders and houses most
open data that is findable on the Web. It also allows you to use your query syntax to
perform calculations, making it a true computation engine. I love it and use it for a variety
of purposes myself.
1.4k Views View Upvotes

Ertan Dogrultan, Software/Data guy, Entrepreneur


Written Feb 11, 2011

Taken from the syllabus of my data mining class:


National Bureau of Economic Research http://www.nber.org/data
(many interesting datasets: Macroeconomics, industry, trade, demographics, hospital,
patents, ...)
Federal Reserve Data Economic Research & Data http://www.federalreserve.gov/ec...
(including data about mortgage defaults, interest rates, exchange rates, industrial
production, ...)
Federal Statistics Data Access Tools: http://www.fedstats.gov/toolkit....
1.8k Views View Upvotes

Pardeep Kullar, SaaS, Email marketing, Social tools and pilgrims pizza
Written Nov 15, 2015

There are some companies where, on their free trial, you can get free data.
For example: FollowerWonk (Twitter analytics, follower segmentation, social graph
tracking, & more ) lets you download up to 50,000 followers of any Twitter account.
Datadrip (Free data into sales ) has a bunch of followerwonk files like 50,000 CEOs that
you can download from the home page.
1k Views View Upvotes

Udit Saini, Research Engineer Data Science @Ant farm


Written May 15, 2015

20 Newsgroups: classification task, mapping word occurrences to newsgroup ID (Home Page for 20 Newsgroups Data Set)
Reuters (RCV*) Corpora: text/topic prediction (Page on reuters.com)
Penn Treebank : used for next word prediction or next character prediction (Penn
Treebank Project )
Broadcast News: large text dataset, classically used for next word prediction (1996 English
Broadcast News Speech (HUB4) )
Wikipedia Dataset
Multidomain sentiment analysis dataset: Multi-Domain Sentiment Dataset

Recommendation Systems
MovieLens: Two datasets available from GroupLens . The first dataset has 100,000
ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second
dataset has about 1 million ratings for 3900 movies by 6040 users.
Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes
from 73,421 users.
Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it
consists of 100 million ratings, done by 480,000 users who have rated between 1 and all of
the 17,770 movies.
Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains
278,858 users providing 1,149,780 ratings about 271,379 books.
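One thing these figures let you work out is how sparse each user-item rating matrix is. The sketch below only does arithmetic on the numbers quoted above; the helper name `rating_density` is mine, and the second MovieLens entry uses the rounded "about 1 million" figure.

```python
def rating_density(n_ratings, n_users, n_items):
    """Fraction of the user x item matrix that actually holds a rating."""
    return n_ratings / (n_users * n_items)

# (ratings, users, items) as quoted in the list above
datasets = {
    "MovieLens 100K": (100_000, 943, 1682),
    "MovieLens 1M (approx.)": (1_000_000, 6040, 3900),
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
    "Book-Crossing": (1_149_780, 278_858, 271_379),
}

for name, (ratings, users, items) in datasets.items():
    print(f"{name}: {rating_density(ratings, users, items):.4%} of cells filled")
```

Jester comes out far denser than the others, which is worth knowing before you pick a data set to test a recommender on.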
1.7k Views View Upvotes

Rohan Somni, Amazon AWS, Georgia Institute of Technology


Written Apr 5

A2A. Depending on the type of datasets you're interested in, I'd suggest taking a look at
https://www.reddit.com/r/datasets , or maybe Data.gov (The U.S. government's
open data) or Disability and Health (CDC datasets).
Some other random sets I recall/have used before are:
Google Public Data Explorer
Webscope | Yahoo Labs
Overview | Yelp For Developers | Yelp (Yelp's academic dataset)
AWS Public Data Sets

Beer Data
This list is by no means exhaustive, and some Googling can get you a lot more - but it's
what I was able to come up with off the top of my head.
235 Views View Upvotes Answer requested by Joy Xu

John Goodwin
Written Sep 13, 2011

UK gov data: http://data.gov.uk , lots of interesting linked data at
http://beta.kasabi.com , Ordnance Survey linked data at http://data.ordnancesurvey.co.uk ,
and see http://www.ordnancesurvey.co.uk for general open data.
2.5k Views View Upvotes

Nikhil Anand Hegde, Curious Quoran.


Written Oct 10, 2014

Some good data sources for economic data:


https://www.quandl.com/ - Quandl
http://einstein.library.emory.ed...
What are the most useful sources of economics data?
1k Views View Upvotes

Shafqat Islam, CEO & Cofounder of NewsCred. We help web publishers delight
their users (and ...
Written Apr 6, 2011

We have a dataset of 20 million+ news articles from the last three years (headline,
description, plus metadata). The data can be accessed programmatically via an API at
http://developer.newscred.com .
People have done some really interesting things with it. We could potentially make it
available as a dump file if someone wants it for research purposes.
1.5k Views

Vikas Majjagi, Bored Analyst


Written Jan 16, 2015

You can get real-world data sets on Kaggle.
Many companies host challenges on Kaggle for real problems that they are trying to solve.
They upload their data to the site (although it might be altered). These may not be
considered big data, but you can get pretty huge data sets.
I once worked with around 17GB of data: close to 45 million records with around 50
features. That can be considered pretty huge.
Hope this helps.
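A file of that size usually will not fit in memory, so one common approach is to stream it row by row. Here is a minimal sketch using only the standard library; the column name `price` and the tiny in-memory sample are made up for illustration, and with a real download you would pass an open file handle instead.

```python
import csv
import io

def column_mean(lines, column):
    """Stream rows one at a time and return (row_count, mean of `column`)."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:
        # Only the current row is held in memory.
        total += float(row[column])
        count += 1
    mean = total / count if count else float("nan")
    return count, mean

# Hypothetical stand-in for a multi-gigabyte CSV file.
sample = io.StringIO("id,price\n1,10.0\n2,20.0\n3,60.0\n")
print(column_mean(sample, "price"))
```

For a file on disk, something like `column_mean(open(path, newline=""), "price")` would work the same way, where `path` is wherever you saved the data.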
1.1k Views View Upvotes

Vaibhav Mallya, Jobhunting? LMK-I will help you get what you are worth.
OfferLetter.io Founder.
Updated Sep 23, 2011

There are some text corpora here: Where can I find large datasets open to the public?
If you're looking for a vast source of public domain literature, Project Gutenberg is
wonderful: http://www.gutenberg.org/wiki/Ma...
The Presidential Speech Archive: http://millercenter.org/scripps/...
Hitler's Speeches: http://www.hitler.org/speeches/
The Vedas: http://www.sacred-texts.com/hin/
The Gita: http://www.gita4free.com/english...
The Bible: http://patriot.net/~bmcgin/kjvpa...
Take a look at the NYT archive: http://www.nytimes.com/ref/membe...
929 Views View Upvotes

Ramakanth Dorai, @ramakanth_d


Written Jan 15, 2012

Amazon has announced public datasets hosted on AWS at no charge for the community.
These datasets can be seamlessly integrated with your application running on AWS. Pay per
use.
https://aws.amazon.com/datasets?...
4.6k Views View Upvotes

Abhinav Upadhyay, Created https://man-k.org


Written Mar 11, 2014

Academic Torrents : Distributing large datasets using torrents, this project was started
very recently and has some of the most interesting datasets.
8.1k Views View Upvotes

Taylan Malak, Database Engineer


Written Sep 22, 2014

U.S. Bureau of Economic Analysis (BEA)


U.S. Bureau of Labor Statistics
EconPapers
WTDB- Select Location
Data | The World Bank
The World Top Incomes Database
Online Data - Robert Shiller
2.4k Views View Upvotes

Joseph Hopper, My alter ego is a penguin.


Written Apr 6

Raw data sets of what?


Here is a bag of words: Bag of Words Data Set
Here is some streamflow data: USGS Current Water Data for the Nation
I would suggest that you start by doing a search on whatever type of data you are interested
in and adding the word dataset. You might also want to try raw data, data set, raw data set,
etc.
164 Views

Thomas Marquart, astronomer, programmer, dog owner


Written Apr 5, 2011

The Sloan Digital Sky Survey seems to not have been mentioned yet:
http://www.sdss.org
350 million objects in the night sky, with many different measured parameters for each of
them.
1.1k Views View Upvotes

Marin Dimitrov, engineering manager @ Uber


Written Jan 10, 2011

check out the Linked Open Data - http://linkeddata.org/


currently it includes 220+ datasets with 24+ billion RDF triples
3.1k Views View Upvotes

Nidhi Kohli
Written Oct 4, 2013

It depends on what kind of data you need (business/economic/social etc).

My top 3 picks for useful, large business datasets


1. Kaggle
2. KDNuggets
3. Frequent Itemsets Data Repository (Frequent Itemset Mining Dataset Repository )
2.5k Views View Upvotes

Matthew Hurst
Written May 11, 2011

d8taplex.com (which I run) has >1MM time series in >50k data sets pulled from 122
sites. The data sets are derived automatically from resources like excel spreadsheets, html
tables and plain csv and tsv files.
1k Views

Geoffrey Anderson, Former Data Processing Product Manager for InfoGroup &
General DBA that likes...
Updated Apr 15, 2011

There are several free providers on Microsoft's Azure Data Mart for the time being,
including several of those mentioned above. However, the single platform for delivery and the Excel plug-in will make the data easier to consume than your typical API / SOAP endpoint.
https://datamarket.azure.com/
1.3k Views View Upvotes

Marcel Janus, IT-Professional and dad


Updated May 24, 2012

Here are some more links you may consider:


http://www.factual.com/
http://publicrecords.searchsyste...
http://opendata.socrata.com/
http://www.dados.gov.pt/pt/catal...
Especially for the German folks around:
http://daten.berlin.de/
1.5k Views View Upvotes

Mithun Kalan, Streaming analytics. Storm and AWS lambda

Written Jul 14

There is an open data source at Open Data | UNCDF with about 10 developing-country
data sets. There is a detailed zip file in the export with all 1000+ questions.
69 Views View Upvotes

Prathamesh Kulkarni, knowledge seeker


Written Jun 15

Google Public Data Explorer is one good dataset. It's not large, but has valuable data
regarding economics and other factors of human development. For example, this is
about income inequality in the United States.
Google Trends is another good one.

170 Views View Upvotes

Vladimir Bougay, Co-founder and CTO, Knoema


Written May 23, 2012

If you're looking for public data you should definitely take a look at Knoema
(http://knoema.com ). Knoema is a one-stop shop for your data needs.
Here you will find 600+ public datasets on almost any topic, like economics, healthcare,
demographics or energy. Knoema has accumulated public data from many credible
international sources in a single place and provides convenient search/browsing tools.
881 Views View Upvotes

Omar Alonso, Data gaucho at Microsoft


Written Aug 30, 2014

Depends on what you are looking for. Wikipedia is the best crowdsourced data set
available for generic use. Now, if you are looking for domain specific data sets (e.g., query
logs, annotations, entities, etc.), that's a different matter.
1.4k Views View Upvotes Answer requested by Martin Engwicht

Minat Kumar Verma, Love to play...anything !!


Written Jan 23, 2014

Try this link once :


Datalist.xlsx - Google Drive
Hope you find it useful.
Found it on Tableau Software Site, unable to get the original link though.
727 Views

Daniel Cave, Product Manager and Digital Marketing Manager


Written Jul 17, 2015

Where can you find them? Stop looking and start building them yourself.
The internet is One Big Data set waiting to be made, and it's laughably easy to combine
data from many websites into a large table of data these days.
Any of the modern web scrapers will let even a 'non-programmer' put together a data set
very quickly and easily.
I know this because I work at http://import.io and our platform is being used to create
datasets with billions of data points every single day.
I suppose the main reason I suggest this is that you can be free from needing other people
to build big data sets for you, and make your own, becoming more independent in the
process.
694 Views View Upvotes

Agastya Mishra
Written May 17, 2013

Books and Movies Data: Book-Crossing Dataset


Contains data about books and book ratings in CSV format. Also has SQL queries for CRUD
operations.
1.5k Views View Upvotes

Bob Calder, Internet and Society, Science and Society Fort Lauderdale, FL
Written Jan 23, 2011

I didn't see anyone mention the WHO Global Health Observatory:


http://apps.who.int/ghodata/
Observatories *should* be constructed with an ontology api in mind for the use of what are
increasingly being called "observatories" as in the virtual observatory the astronomers put
together a couple of years ago. Also look at the "science accelerator" the DOE funded and
of course Abe Lederman's federated search engines.
1.7k Views View Upvotes

Tom Greif
Written Jul 22, 2013

Stack Exchange Data Dump - Anonymized data dump of all creative commons questions
and answers from the Stack Exchange family of websites at http://stackexchange.com/sites .
XML format, 7zipped, released every 3 months.
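Once unzipped, large XML files like these are best read as a stream rather than loaded whole. Below is a minimal sketch with the standard library; the `<row>` element and the attribute names in the sample are assumptions about the dump's layout, so check them against the files you actually download.

```python
import io
import xml.etree.ElementTree as ET

def iter_rows(source):
    """Yield the attributes of each <row> element without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy before clearing
            elem.clear()             # release memory held by the processed element

# Tiny hypothetical stand-in for one of the dump's XML files.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Score="5" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b'</posts>'
)
for row in iter_rows(sample):
    print(row["Id"], row["Score"])
```

`iterparse` also accepts a filename, so the same function should work on an extracted file on disk.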

1.6k Views View Upvotes

Pavan Keerthi, I work with Data


Updated Oct 10, 2011

I was doing this research a few days ago and found these:
http://www.delicious.com/pskomor...
http://www.datawrangling.com/som...
http://www.day-trading-stocks.or...
http://www.kdnuggets.com/datasets
http://data.worldbank.org/
http://setiquest.org/ -(You need to sign up)
http://www.grouplens.org/node/73
1k Views View Upvotes

Marc Millstone, visioneer


Written Jun 16, 2011

Many people use the Bible, as it is available in many languages and many different
versions. Another option is to find the proceedings of the UN, which are also published in
many different languages.
1k Views View Upvotes

Edwin Khoo, Graduate student


Written Feb 1

One of the most comprehensive lists can be found at https://github.com/caesar0301/aw... .
661 Views View Upvotes

Themis Papavasileiou, {math,cs}U{intelligent machines}


Written Jan 22, 2015

I happened to stumble upon caesar0301/awesome-public-datasets on dataTau.
As the title suggests, the datasets are indeed awesome.


Hope it helps!
3k Views View Upvotes

Frank Scurlock
Written Dec 11, 2014

I did some research on low impact fuel sources vs. coal in power plants larger than 50
megawatts. I found these to be helpful.
Department of Energy (DOE) OpenNet documents - OSTI
https://www.osti.gov/opennet
Department of Energy (DOE) declassified documents, part of DOE openness initiative. ...
The OpenNet database provides easy, timely access to over 485,000 ...
DOE Global Energy Storage Database
www.energystorageexchange.org/
The DOE Global Energy Storage Database provides free, up-to-date information on grid-connected energy storage projects and relevant state and federal ...
Gasification Plant Databases
www.netl.doe.gov/research/coal/energy.../gasificationplantdatabases
Welcome to the U. S. Department of Energy, National Energy Technology Laboratory's
Gasification Plant Databases. Within these databases you will find current ...
854 Views View Upvotes

Hersh Reddy, Programmer and Lawyer


Written Jun 20, 2013

Google and the USPTO make bulk downloads of US patents and trademarks available in
zip archives:
USPTO Bulk Downloads
2.2k Views View Upvotes

Konstantinos Psychas
Written Apr 18, 2013

The following platform hosts open data to help in scientific analysis and computational
research.
Contribute to the Cure
Information about the platform, which is currently in beta, is here (Sage Bionetworks - Redefining. Challenging. Predicting)
1.6k Views View Upvotes

Shashank Kumar, Software Developer, Computer Science alumni IIT Roorkee


Written Apr 4, 2012

Time Series datasets maintained by Dr Eamonn Keogh


http://www.cs.ucr.edu/~eamonn/ti...

UCI (University of California, Irvine) Machine Learning Repository


http://archive.ics.uci.edu/ml/
4.5k Views View Upvotes

Aaron Anderson, I am a recent graduate from Kings College London with a degree in Intelligen...
Updated Apr 15, 2013

The Correlates of War (COW) project provides data sets on security.


http://www.correlatesofwar.org/
6.1k Views View Upvotes

Colin Baldwin, Software Engineer


Written Apr 5, 2011

There are some great datasets relating to Bioinformatics out there. These are usually
databases of molecules of biological interest.
BLAST: http://blast.ncbi.nlm.nih.gov/Bl...
SCOP: http://scop.mrc-lmb.cam.ac.uk/sc...
There are many others - a huge amount of information is available in this field.
1k Views View Upvotes

David A Springate, Biostatistician, Evolutionary genetics PhD. Python, R, Lisp


Written Oct 28, 2011

The PubMed Central Open Access Subset contains about 350,000 full-text academic
articles in the biosciences across more than 2,000 journals. You can download the lot as
compressed XML files via FTP: http://www.ncbi.nlm.nih.gov/pmc/...
1.7k Views View Upvotes

Alberto Escarlate, Collaborative Fund


Written Jan 10, 2011

NYC DataMine http://www.nyc.gov/html/datamine...


Public data produced by NYC agencies and other City organizations.
2k Views View Upvotes

John Flurry, Connector, Writer, Mobile App Evangelist, Communicator


Written Jan 24, 2011

I am the head of communications for http://databasin.org , a free community
conservation mapping tool. We have thousands of data sets available for both download
and use inside the tool itself. One aspect of data on the site is that full and useful metadata
is required to be uploaded to the site. If you have any questions you can contact me directly
at johnb at consbio dot org.
1k Views View Upvotes

Arya Asemanfar
Written Jan 10, 2011

Amazon has a repository of datasets as well. They currently have 42 datasets:


http://aws.amazon.com/datasets?_...
2.3k Views View Upvotes

Sunil Sangwan, 3rd Year UG at Mnnit Allahabad


Written Jul 17

caesar0301/awesome-public-datasets
Here you can find all types of public datasets. It's an awesome list of all types of dataset
resources.
55 Views View Upvotes

Joscelyn Upendran
Written Jan 10, 2011

Ordnance Survey mapping datasets available for Great Britain:


http://www.ordnancesurvey.co.uk/... licensed under the UK Government's Open
Government Licence (OGL) : http://www.nationalarchives.gov....
1.1k Views View Upvotes

Mark Braggins, A keen interest in technology, innovation and open data


Updated Dec 1, 2014

There's themed linked open data being published under the Open Government Licence
(OGL) on the Hampshire Hub at:
http://data.hampshirehub.net/def/concept/folders/themes
2.1k Views View Upvotes

Sebastian Schelter, Committer and PMC member at Apache Mahout and Apache
Giraph
Written Sep 13, 2011

Konect is a collection of network datasets:


http://konect.uni-koblenz.de/
5.3k Views View Upvotes

Shunsuke Mikami
Written Jun 6, 2011

The Internet Traffic Archive http://ita.ee.lbl.gov/ publishes some Web access logs.
For example, http://ita.ee.lbl.gov/html/contr... contains access logs from the 1998 World Cup
Web site between April 30, 1998 and July 26, 1998. During this period of time the site
received 1,352,804,107 requests.
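To get a feel for that volume, here is a quick back-of-the-envelope calculation of the average request rate. Counting both the first and last day as full days is my assumption; the archive's own documentation may define the period differently.

```python
from datetime import date

total_requests = 1_352_804_107
start, end = date(1998, 4, 30), date(1998, 7, 26)
days = (end - start).days + 1  # include both endpoints (an assumption)

def average_rate(total, n_days):
    """Return (requests per day, requests per second) averaged over n_days."""
    per_day = total / n_days
    per_second = per_day / 86_400  # seconds in a day
    return per_day, per_second

per_day, per_second = average_rate(total_requests, days)
print(f"{days} days, {per_day:,.0f} requests/day, {per_second:,.1f} requests/second")
```

Real traffic would be far burstier than this average, with peaks around match times.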
946 Views View Upvotes

Rob Jensen, always learning, doing more doing. interested in data science,
minimalism and...
Written Jan 10, 2011

If you haven't already, subscribe to the Guardian's DataBlog. They have great articles and always
link out to the data so you can play with it.
http://www.guardian.co.uk/news/d...
984 Views View Upvotes

Miles Woodroffe, Software Engineer, Tech Leadership, formerly Pro Sound Engineer
Written Jan 6, 2011

also some great sources for test data here: http://www.philwhln.com/how-to-g...


1.6k Views View Upvotes

Raymond Lam
Written May 5

Data
This has a large list of data collected from around the world and not limited to one
organisation. You have the ability to view the data sets, download the data as a .xlsx file or
visualise the data in browser.
159 Views View Upvotes

Enrique R Rivera, Owner of www.followthehashtag.com , a tool for twitter research
Written Apr 23

You can find some free Twitter datasets (about 200,000 tweets per dataset) in the Datasets
section (Datasets Archive - Followthehashtag // Free twitter search analytics and business
intelligence tool ) of Followthehashtag (http://www.followthehashtag.com).
This section is brand new (2016 / 04) and we are adding about 2 or 3 new datasets per
week. Hope you enjoy it.
If you need custom datasets (paid), at this URL you can see pricing for datasets from 2000
to 200,000 tweets (Followthehashtag // Twitter keyword search analytics, influence, geo
content analysis tool, and much more )
77 Views

Ian Mercer, Prolific Entrepreneur, Inventor, Guinness World Record Holder and
creator of ...
Written Jan 10, 2011

Yahoo Geoplanet for geographic information: http://developer.yahoo.com/geo/g...


1.5k Views View Upvotes

Anuj Prakash, Computer Science student


Written Dec 11, 2014

caesar0301/awesome-public-datasets
1.6k Views View Upvotes

Timothe Poisot, Evolutionary ecologist, geek, blogger (http://www.sce.fr/)


Written Sep 8, 2011

I'd like to point out the ROpenSci project : http://ropensci.org/


It's dedicated to building interfaces to several data repositories from within R.
6.2k Views View Upvotes

Salvatore D'Agostino, Identity, credentialing and access control infrastructure and services
Written Jan 10, 2011

Bureau of Labor Statistics, http://www.bls.gov/ ; International Monetary Fund,


National Archives http://www.archives.gov/research... as well as the above.
1.3k Views View Upvotes

Andrew Semenyak
Written Nov 11, 2013

Here are two sample datasets with company data available for free:
UK Companies Dataset contains information on random 10,000 UK companies
sampled from HitCompanies (all data in this DB extracted and updated automatically
from WWW using AI and Machine Learning): company name and aliases, company
description, industry tags, industry codes, registration numbers, addresses, phone
numbers, VAT numbers, website, number of about/contact/management/product
pages, incorporation date, team size, number of clients and partners, number of
emails, number of key changes (client/partner changes, contact changes, people
changes), and many more
Worldwide Companies Dataset contains information on random 10,000 worldwide
companies sampled from HitCompanies (all data in this DB extracted and updated
automatically from WWW using AI and Machine Learning): company name and
aliases, company description, industry tags, industry codes, registration numbers,
addresses, phone numbers, VAT numbers, website, number of
about/contact/management/product pages, incorporation date, team size, number of
clients and partners, number of emails, number of key changes (client/partner
changes, contact changes, people changes), and many more
1.4k Views View Upvotes

Jonas Mattias
Written Apr 5, 2011

For science data from Australia:


Australian National Data Service - http://services.ands.org.au/home...
Integrated Marine Observing System - http://imos.aodn.org.au/webportal/
Australian Ocean Data Network - http://portal.aodn.org.au/webpor...
AuScope (Geology) - http://portal.auscope.org/portal...
Terrestrial Ecosystem Research Network - http://portal.auscover.org.au/we...
Atlas of Living Australia - http://www.ala.org.au/
1.3k Views View Upvotes

James Thornton, Relentlessly pursuing "Why?"


Written Apr 5, 2011

Linked Data Sets


http://www.w3.org/wiki/TaskForce...
Web Services Directory
http://www.programmableweb.com/a...
1.6k Views View Upvotes

Martijn de Boer, Harvard Business CORe & Python enthusiast


Written Jul 5

I think this one could be nice :)


Using Microsoft R Server on a single machine for experiments with 600 million taxi
rides.
42 Views Answer requested by Ronnie Gladney

Bill Sobel
Written Jan 11, 2011

A good place to get started is http://www.data.gov/ . The purpose of Data.gov is to
increase public access to high value, machine readable datasets generated by the
Executive Branch of the Federal Government.
1k Views View Upvotes

Ganesh Raja
Written Feb 16, 2015

Amazon Web Services have public data sets that you can use freely for your big data
projects. You can also contribute to the list.
Please find more information here aws.amazon.com/public-data-sets/

346 Views

Anthony Gerdeman, Statistician, other


Written Aug 9, 2012

If you're looking for US economic data or time series, try FRED. It's free, comprehensive,
and regularly updated. Provided by the St. Louis Fed.
research.stlouisfed.org/fred2
1.9k Views View Upvotes

Moustafa Alzantot, CS Ph.D. Student at UCLA


Written Feb 10

Here is an awesome categorized list of publicly available datasets.


caesar0301/awesome-public-datasets
143 Views

Francisco Restivo, Husband, father, engineer, educator, dreamer


Written Jun 26, 2014

I keep a collection of datasets here Datasets - Francisco Restivo's recommended sites .


6.1k Views View Upvotes

Gianfranco Cecconi, Bringing people and data together


Written Nov 5, 2014

OpenStreetMap is another obvious example.

482 Views

Brian Chan, just another programmer


Written Nov 16, 2015

Just some suggestions to get you started. =)


https://www.quandl.com/
http://catalog.data.gov/
https://data.sfgov.org/
264 Views

Ankush Chopra
Written Mar 27, 2014

Datasets for Data Mining and Data Science has a laundry list of free dataset repositories.
I've used it in the past. Hope this helps.

1k Views View Upvotes

Annie Pettit, Self serve sample, surveys, polling plus charts and statistics. I am
the Chie...
Written Oct 9, 2014

DataFerrett (U.S. Census Bureau) is a great option for US census data. Lots of data you
can plug directly into any statistics program.
1.4k Views View Upvotes

Guilherme Defreitas
Written Jun 5, 2015

Stanford Large Network Dataset Collection


738 Views View Upvotes

Tim Gerla, CTO and Co-founder, Ansible, Inc.


Written Jan 14, 2011

DataSF from the City of San Francisco: http://datasf.org/


1k Views View Upvotes

Anonymous
Written Jul 16

Big data analytics is to help companies make more informed business decisions by
enabling DATA Scientist, predictive modelers and other analytics professionals to analyze
large volumes of transaction data, as well as other forms of data that may be untapped by
conventional business intelligence(BI) programs. That could include Web server logs and
Internet Click Stream data, social media content and social network activity reports, text
from customer emails and survey responses, mobile-phone call detail records and machine
data captured by sensors connected to the INTERNET Things Some people exclusively
associate big data with semi-structured and unstructured Data of that sort, but consulting
firms like Gartner Inc. and Forrester Research Inc. also consider transactions and other
structured data to be valid components of big data analytics applications. Big Data, Data
Science - Combo Course Training Classes Online | Big Data, Data Science - Combo Course
Courses Online
Big data can be analyzed with the software tools commonly used as part of Advance
Analytics disciplines such as Predictive Analysis Data Mining, Text Analytics and Statical
Method. Mainstream BI software and Visualization tools can also play a role in the analysis
process. But the semi-structured and unstructured data may not fit well in traditional Data
Warehouse based on Relational Database. Furthermore, data warehouses may not be able
to handle the processing demands posed by sets of big data that need to be updated
frequently or even continually -- for example, real-time data on the performance of mobile
applications or of oil and gas pipelines. As a result, many organizations looking to collect,
process and analyze big data have turned to a newer class of technologies that includes
Hadoop and related tools such as Yarn Spook, Spark, and Pig as well as No Sql databases.
Those technologies form the core of an open source software framework that supports the
processing of large and diverse data sets across clustered systems.
In some cases, Hadoop Cluster and No SQL systems are being used as landing pads and
staging areas for data before it gets loaded into a data warehouse for analysis, often in a
summarized form that is more conducive to relational structures. Increasingly though, big
data vendors are pushing the concept of a Hadoop Data Take that serves as the central
repository for an organization's incoming streams of Raw Data. In such architectures,
subsets of the data can then be filtered for analysis in data warehouses and Analytics
Databases, or it can be analyzed directly in Hadoop using batch query tools, stream
processing software and Sql AND Hdoop technologies that run interactive, ad hoc queries

written in Sql Potential pitfalls that can trip up organizations on big data analytics
initiatives include a lack of internal analytics skills and the high cost of hiring experienced
analytics professionals. The amount of information that's typically involved, and its
variety, can also cause data management headaches, including Data Quality and
consistency issues. In addition, integrating Hadoop systems and data warehouses can be a
challenge, although various vendors now offer software connectors between Hadoop and
relational databases, as well as other data integration tools with big data capabilities.
Businesses are using the power of insights provided by big data to instantaneously
establish who did what, when and where. The biggest value created by these timely,
meaningful insights from large data sets is often the effective enterprise decision-making
that the insights enable.
Extrapolating valuable insights from very large amounts of structured and unstructured
data from disparate sources in different formats requires the proper structure and the
proper tools. To obtain the maximum business impact, this process also requires a precise
combination of people
208 Views

Simon Tse, Trying to learn something new every day that I find refreshing
Written Mar 16

Try the UCI data repositories:
http://archive.ics.uci.edu/ml/
103 Views

Siddha Ganju, Grad Student, School of Computer Science, Carnegie Mellon University
Written Mar 6

For machine learning purposes, a lot of data sets are available on the UCI Machine
Learning Repository.
230 Views

Nikita Zhiltsov, Computer science researcher at Kazan University; Textocat, cofounder & CTO
Written Apr 5, 2011

http://getthedata.org is a Q&A site dedicated to such questions.

2.3k Views View Upvotes

Mike Xu, Insatiably curious scientist/engineer


Written Apr 5, 2011

http://datamarket.com/
opened at the O'Reilly Strata Conference.
2.6k Views View Upvotes

Mikko Heikkinen, Biologist and a web developer working in a natural history museum.
Written Nov 20, 2015

Global Biodiversity Information Facility has the largest biodiversity dataset, with 600M +
records currently: Free and Open Access to Biodiversity Data
272 Views

Philippe Beaudoin, I've written my share of C++, working on many projects in the
video game indu...
Written Apr 6, 2011

A free dataset of motion capture data:


http://mocap.cs.cmu.edu/
1.5k Views View Upvotes

Niall McCarthy
Written Jan 24, 2013

You can find a huge selection of free statistics, data and infographics at Statista.
880 Views View Upvotes

Harit Himanshu, Software Engineer at Yahoo!


Written Jun 9, 2011

Check this one out!


http://www.icwsm.org/data/
http://webscope.sandbox.yahoo.co...
875 Views View Upvotes

David James, Developer and Curator: National Data Catalog


Written Feb 4, 2011

The National Data Catalog (http://nationaldatacatalog.com) brings together data sets by
and about government at all levels of government. It is a project of the Sunlight
Foundation.
1.1k Views View Upvotes

Colin Kegler
Written May 11, 2013

The National Bureau of Economic Research has several datasets:
Data
1k Views View Upvotes

Michael Munsey
Written Mar 19, 2014

There is quite a bit of data available from the FAA.


http://www.faa.gov/data_research/
I particularly found the Airline On-Time Statistics & Delay Causes interesting.
1.1k Views View Upvotes

Iain Chalmers, Web Strategist. Motorcycle Rider. Music Lover. Coffee Tragic.
Written Apr 5, 2011

A collection from an admirable data-hoarder:


http://jacquesmattheij.com/Free%...
And discussion:
http://news.ycombinator.com/item...
1.8k Views View Upvotes

Evan Thomas, World traveler, surfer, internet marketer, UCSB alumnus from
Manhattan Beach, CA
Written Dec 5, 2011

Findthedata.org
11.8k Views View Upvotes

Anand V. Chhatpar, Tech entrepreneur


Written Oct 20, 2011

US Department of Energy has weather data available for free for over 2000 global
locations:
http://apps1.eere.energy.gov/bui...
1.8k Views View Upvotes

Evan Schuss
Written Jun 15, 2011

Junar.com is a great source for data and statistics pertaining to populations of people,
business, sports, geography and other types of data. The site is a collection of data
from around the web and is continually expanding its entries.
847 Views View Upvotes

Paul Jones, director of ibiblio.org, professor at University of North Carolina


Written Jan 11, 2011

Carl Malamud at http://public.resource.org has some of the best large public
information databases.

1k Views View Upvotes

Jordan Mendelson, Founder/CTO and Good Sharer


Written May 29, 2013

Common Crawl makes available for free ~250 TB of web page data from 2008-2012:
CommonCrawl
1.4k Views View Upvotes

Andrey Fedorov
Written May 2, 2013

http://www.cancerimagingarchive....
http://cancergenome.nih.gov/
934 Views View Upvotes

Robert Maguire, Tragic optimist.


Written Aug 28, 2011

How about the Center for Responsive Politics and its site Opensecrets.org?
1k Views

Mark Hahnel, PhD in Stem Cells at Imperial College London, Founder of figshare
Written Apr 8, 2011

The datasets at http://figshare.com are scientific research datasets licensed under CC0.

871 Views View Upvotes

Aziz Gilani, I am a VC investor in Big Data


Written Jan 17, 2011

I personally use Infochimps.org and DataMarketplace.com for all of my dataset needs
(I am also an investor in both).
6.1k Views View Upvotes

Anoop Vasant Kumar, Data Scientist


Written Mar 15

MovieLens
Ideal site for trying out movie recommendations
384 Views

Abhishek Shivkumar, Data Scientist


Written Jan 8, 2014

archive.ics.uci.edu/ml/ has lots of data sets. I'm sure it'll be very useful.

454 Views

Athlan Lathan
Written May 30, 2015

Opendatanetwork.com
Some large datasets, some small, all public.
782 Views View Upvotes

Phil Darnowsky, I know things and have opinions


Written Jan 11, 2011

UC Irvine maintains a collection of datasets for machine learning testing at


http://archive.ics.uci.edu/ml/
4.3k Views View Upvotes

Stephen Turner, Bioinformatics Core Director


Written Apr 11, 2012

Gene Expression Omnibus for gene expression data: http://www.ncbi.nlm.nih.gov/geo/


910 Views View Upvotes

Vinay Kumar, Quorious..


Written Dec 19, 2013

You can get large datasets from the sources mentioned in Where can I find large datasets
open to the public?
372 Views

Jitendra Harlalka, Data Mining enthusiast


Written Apr 5, 2011

The link contains a dataset of 1 million songs: http://www.infochimps.com/collec...


1.1k Views View Upvotes

Arun Patre, Enabler for social enterprises / startups in India


Written Apr 5, 2011

On the India Water Portal we have a 100 year dataset of the meteorological data for all the
districts of India:
http://www.indiawaterportal.org/...
842 Views View Upvotes

Abhishek Mishra, tinkerer


Written Feb 2, 2011

A collection of free public datasets - http://jacquesmattheij.com/Free,...


226 Views

Nathan Ketsdever, Fascinated by science, discovery, & innovation.


Written Mar 22, 2011

Pete Warden summarizes some of the options here that he covers in "Data Source
Handbook" from O'Reilly:
http://petewarden.typepad.com/se...
Here are 18 data-related links that Warden points to in addition to what's covered in the
book, for those wanting to learn more:
http://petewarden.typepad.com/se...
2.2k Views View Upvotes

Olya Romanova
Updated Sep 26, 2013

Check Knoema via http://knoema.com - the largest open and public data repository with
100 M+ time series and 3000+ datasets
1.4k Views View Upvotes

Milstein Munakami
Written Feb 10, 2015

Milstein/awesome-public-datasets
672 Views View Upvotes

Nazmul Hasan
Written Jul 26, 2013

There are some very cool datasets about Philadelphia here:


Connecting People With Data <====CLICK HERE
If you want data about Chicago this website is the best.
City of Chicago | Data Portal
It's got everything from pothole reports to gun deaths and you can do a lot of cool stuff
with it.
735 Views View Upvotes

Mitsuharu Hamba, Software Engineer


Written Jun 17, 2011

This is a great summary. How informative!
By the way, you can get more data from the link below as well:
http://rit.rakuten.co.jp/rdr/ind...
994 Views View Upvotes

Douglas Moore, Principal Consultant at Think Big, a Teradata Company


Written Oct 10, 2014

See: Where can I find large datasets open to the public?


514 Views View Upvotes

John Wong
Written Oct 5, 2014

Computer failure traces: Failure Trace Archive


618 Views View Upvotes

Craig Danton, Head of BD at Enigma


Updated Feb 12, 2014

Enigma.io is a product that aggregates thousands of publicly available data sets: over 80
billion rows of data in 100,000 tables. It is also available via an API.
828 Views View Upvotes

Dan Bair
Written Feb 2, 2011

Here is another link that lists some publicly available data sets.
Link: http://highscalability.com/blog/...
246 Views

Vincent van Haaff, software engineer, data viz expert, ux designer, hacker, vj,
musician, cyclis...
Written Apr 6, 2011

http://data.vancouver.ca/
640 Views

Julian Ranger, Serial entrepreneur & angel investor


Written Jan 10, 2011

For the UK Government, a large number of datasets are available at http://data.gov.uk/


634 Views View Upvotes

Raviteja Chirala, Data Scientist, Avid Programmer..


Written Nov 24, 2014

There is another question where you'll be overwhelmed with sources.


Where can I find large datasets open to the public?
676 Views View Upvotes

Yap Kai Lun Leon


Written Apr 22

I have developed a charting platform that also allows users to download data after
registering for a free membership.
https://chartist.deltaspace.com.sg
55 Views

Margaret Warren
Written Apr 5, 2011

Lots of Australian data at http://data.gov.au (Australian government data repository)
729 Views View Upvotes

Robert Loftin
Written Mar 26, 2013

http://USGovXML.com
4.5k Views View Upvotes

Jan Willem Tulp, freelance information visualizer


Written Apr 7, 2011

Here are 2 good resources with Open Data from the EU: http://publicdata.eu/
http://lod2.okfn.org/eu-data-cat...
684 Views

Brad Pauly, Rails application developer


Written Jan 16, 2011

Wikipedia has been mentioned but I didn't see a link. This is for current articles.
http://en.wikipedia.org/wiki/Wik...
722 Views View Upvotes

Vasundhar Boddapati, Another distinct you!


Written Feb 2, 2011

Some data and sources might repeat.


kdnuggets
http://www.kdnuggets.com/dataset...
datawrangling
http://www.datawrangling.com/som...
http://infochimps.com/
214 Views View Upvotes

Patrick Hochstenbach, from the heart of Henry Van de Velde's Booktower at
Ghent University: librari...
Written Jan 12, 2011

Try CKAN http://ckan.net/


2k Views View Upvotes

Shantanu Sharma, Post Doc in Computer Science at UC Irvine.


Written Aug 2

You can get TPC-H dataset from the following link:


TPC-H - Homepage
31 Views

Igor Kiselev
Written Jun 6, 2015

Stanford Large Network Dataset Collection


Datasets for Data Mining and Data Science
623 Views View Upvotes

Thieme Hennis, PhD in education & technology @ tudelft.nl


Written Sep 25, 2013

Check out http://www.engagedata.eu/


It is aimed at providing a community and hub for open datasets for Europe.
1.3k Views View Upvotes

Tristan Henderson, Lecturer in Computer Science, University of St Andrews


Written Jan 10, 2011

We host a number of wireless network datasets at http://crawdad.org


792 Views View Upvotes

Yunhong Gu, Software Engineer


Written Jan 11, 2011

Biologists have a huge amount of public data at NCBI: ftp://ftp.ncbi.nih.gov/ . The total
size may be close to 1PB.
984 Views View Upvotes

Richard Pauli
Written Apr 5, 2011

Tons of Climate Data http://www.easterbrook.ca/steve/...


And it needs your attention for building climate models.
http://www.google-melange.com/gs...
653 Views View Upvotes

Robert Prescott
Written May 17, 2013

Page on Sciencebase is a searchable data repository for the United States Geological
Survey
919 Views View Upvotes

Ben Toth

Written Jan 13, 2011

http://www.google.com/publicdata... and
http://www.ic.nhs.uk/statistics-...

970 Views View Upvotes

Teng Qiu
Written May 11, 2014

wikidata.org and freebase.com

900 Views View Upvotes

Lenny Kiyoshi Bogdonoff


Written Jun 10, 2013

The best that aggregates all OPEN government data, as an API, is:
http://www.pediacities.com
805 Views View Upvotes

Chrystall Kanyuck, Community journalist and US expat in the British Virgin
Islands. Web and data...
Written Jun 10, 2011

Another one for a long list: The Guardian lets you search for open government data from
around the world at
http://www.guardian.co.uk/world-...
587 Views View Upvotes

Philip Zavliaris, Democratizing Quantitative Investing. Quantitative Investment
Strategist
Written Feb 6

AssetMacro provides free access to historical data for 10,000+ macroeconomic indicators
and market data covering global stock markets, bonds, FX and commodities.
92 Views

Misha Denil
Written Feb 2, 2011

Peter Skomoroch has a delicious page with links to many data sets.
http://www.delicious.com/pskomor...
196 Views

Akshay Mall, May the odds be ever in your favor!


Written Apr 29, 2014

Please check this question: Where can I find large datasets open to the public?
269 Views View Upvotes

Jim Shi
Written Mar 30, 2013

National Center for Biotechnology Information


851 Views View Upvotes

Owen Stephens
Written Dec 9, 2011

For specific data you can try asking over at http://getthedata.org which is a Q&A site
dedicated to questions about finding data.

849 Views View Upvotes

Daniel McNamara
Written Jul 19, 2011

www.kaggle.com has datasets freely available and data analysis competitions with
prize money attached.
206 Views

Martin Kelly
Written Apr 10, 2011

http://www-958.ibm.com/software/... has a nice frontend to many open data sets.

625 Views View Upvotes

Enrique Cusba
Written May 21, 2011

You can find some interesting information here:


http://datos.fundacionctic.org/s...
720 Views

Anonymous
Updated Sep 19, 2011

See Where can I find large datasets open to the public?


187 Views
