Data and Software

From Introduction to Data Science
Revision as of 19:40, 13 November 2017 by Tzhu43 (talk | contribs) (Data)

The purpose of this section is to keep track of all the data used in the class. There are two tables below, titled Collections of datasets and Specific datasets. The Collections table is intended to be a resource for students to go and find interesting datasets for their projects, and the Specific Datasets table is intended to be a record of who found the dataset (the Link Contributor) and who is using it for their project (the User(s)).

It is the responsibility of everyone in the class to keep this page up to date. Once you have settled on the dataset(s) you will use for your project, make an entry in the Specific Datasets table and put your name in the User field. If you find other resources or data that you won't be using, make an entry in the appropriate table. (For entries in the Specific Datasets table, leave the User field empty.)

Collections of datasets
Collection and Links Substantive Field Link Contributor
Kaggle Various Datasets Fadi AlMahamid
@CoolDatasets Many Dan Lizotte
Data Mining and Machine Learning in Astronomy (background paper) Astronomy Dan Lizotte
National Biomedical Imaging Archive (relevant paper) Medical (Imaging) André Carrington
CIP Survey of Biomedical Imaging Archives Medical (Imaging) André Carrington
Biomedical Informatics Research Network (BIRN) Medicine André Carrington
Laboratory of Neuro Imaging (LONI) Image Database Medicine André Carrington
MIMIC Databases Medical Dan Lizotte
UCI datasets Various. However note this warning: Many are uninteresting; donations prior to 2007 are not allowed for projects. Clarification: It's okay if the data describe events before 2007 or were collected before 2007, but their donation date should be 2007 or later. Dan Lizotte
PSLC DataShop Education Jason Baek
The Stanford Dataset Collection Zak Blacher
Datasets on data mining and cybersecurity The 4th International Workshop on Data Mining and Cybersecurity Wendell Wang
TalkBank Language Jason Baek
Bitly research quality data sets Various Jason Baek
Substance Abuse & Mental Health Data Archive Mental Health Rhiannon Rose
Neuroscience Information Framework Medical Rhiannon Rose
Gapminder Various world data (economic, medical, environmental) Robert Suderman
Cancer Imaging Archive Medical Robert Suderman
Collaborative Research in Computational Neuroscience Neuroscience Xiang Ji
U.S. Criminal Justice Archive Criminology Oliver Trujillo
Elections John A. Doucette
cBioPortal Cancer Genomics Katherina Baranova
NCBI GEO Functional Genomics Data Katherina Baranova
Array Express Functional Genomics Data Katherina Baranova

Eye movement data Diana Varyvoda

Specific datasets
Dataset and Links Substantive Field Link Contributor User(s)
MIMIC II Waveform Medicine Dan Lizotte
MIMIC II Clinical Medicine Dan Lizotte
Lung Image Database Consortium (LIDC) Medicine André Carrington
Digital Database for Screening Mammography (DDSM) Medicine André Carrington
Mammographic Image Analysis Society (MIAS) Medicine André Carrington
Medical Image Reference Center (MEDIREC) Medicine André Carrington
Simulated Brain Database (SBD) Medicine André Carrington
MIT Reality Mining project Privacy Sarah Harvey
Microsoft Geolife GPS trajectories, download dataset Privacy Sarah Harvey
World Cup traces 1998 Cloud Computing Noha Elprince
Planet Lab traces 2008 Cloud Computing Noha Elprince
Real Computing Center Workload logs 2006 Cloud Computing Noha Elprince
Chicago Crimes 2001-Present Criminology Lloyd Rowat M.Alarbi
A collection of various datasets Miscellaneous Bahareh Sarrafzadeh
Sloan Digital Sky Survey Data Release 7 Astronomy Michael Cormier
Galaxy Zoo 1 Data Release Astronomy Michael Cormier
NBA box scores Sports analytics Andrew Arnold
NBA betting lines Sports gambling Andrew Arnold
1871 - 5% sample; 1881 - whole census public Canadian Census Laura Richards
Face images and movies Cognitive Science Jason Baek
Term-Preterm EHG Database Medical Rhiannon Rose
Drug Abuse Warning Network (DAWN) Mental Health Rhiannon Rose
Open Access Series of Imaging Studies (OASIS) Dementia Rhiannon Rose
Classifying Heart Sounds Challenge Medicine Valerie Sugarman (via
Presidential Polling Predictor Polling Abdelhamid El Bably
Bank Marketing Marketing Shiwei Li
Physical Activity Monitoring "Health" Fil Krynicki (via [1])
Treatment Episode Data Set - Discharges (TEDS-D) Mental Health Rhiannon Rose
Ranked ballots from two districts of the 2002 National Election in Ireland Election Ballots John A. Doucette
Ranked ballots from the Debian Project's leadership elections, 7 years Election Ballots John A. Doucette
Ranked ballots for the mayoral election in Burlington Vermont Election Ballots John A. Doucette
Ranked ballots from the Glasgow city council elections, 2007 Election Ballots John A. Doucette
Ranked ballots from 86 elections in various small organizations Election Ballots John A. Doucette
Match Statistics for different leagues in Europe Football League Statistics John Morcos
CrunchBase APIs Statistics about different companies Sandeep Chaudhary
VoxForge Speech Wei-shou Hsu
Speech Accent Archive Speech Wei-shou Hsu
Misc. Outdoor Images Images Zachary Frenette
Bike Sharing Demand forecasting Zhongyu Peng Zhongyu Peng
Enron Email Dataset Email Classification Yunjia Sun
Kidney Medical Dan Lizotte Vivian Tan
Internet Scanning Jordan Gould
Student-alcohol-consumption Sociology Fadi AlMahamid Elham Harirpoush
20 years of games Game Zeyu Wang Zeyu Wang


The language of instruction for this course is R, which is freely available, as is the associated development environment RStudio. However, students may write their own code in the language of their choice, and/or make use of other freely available software. You may not use WEKA unless you send an e-mail to Dan with a rationale for why you need WEKA and not something else. The main rationale for this rule is that many projects will have data too large for WEKA, which results in students painting themselves into a corner.

Regardless of the software used, the project report must demonstrate that the student understands exactly what the software is doing.