Introduction to Data Science - User contributions [en] (MediaWiki 1.29.0)
Lecture Materials (2017-11-24)<p>Dan Lizotte: Added classification performance evaluation materials</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.pdf pdf]]<br />
* Model Selection [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.pdf pdf]]<br />
* Classification [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.pdf pdf]]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.pdf pdf] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.pdf pdf] ]<br />
<br />
'''Materials with associated video lectures (see OWL)'''<br />
<br />
* Classification Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/10_Classification%20Performance%20Evaluation/classification_performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/10_Classification%20Performance%20Evaluation/classification_performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/10_Classification%20Performance%20Evaluation/classification_performance_evaluation.pdf pdf] ]<br />
<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf] ]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>
Introduction to Data Science I (2017-11-23)<p>Dan Lizotte: /* Timeline (Tentative) */</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on Word2Vec in Python. Word2Vec learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional embedding; these relationships are estimated from contextual frequency, i.e. how often a word appears in a given context of other words.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a very high level interface for deep learning libraries including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
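The select/filter/join/aggregate operations above can be sketched in a few lines. Here is a minimal illustration using Python's pandas (the course itself works in R with dplyr; the column names and values below are made up for illustration):<br />

```python
import pandas as pd

# Hypothetical toy tables; names and values are illustrative only.
visits = pd.DataFrame({
    "patient": ["a", "a", "b", "c"],
    "clinic": ["north", "south", "north", "south"],
    "cost": [100.0, 50.0, 75.0, 20.0],
})
clinics = pd.DataFrame({
    "clinic": ["north", "south"],
    "city": ["London", "Toronto"],
})

# Selecting columns and filtering rows
expensive = visits.loc[visits["cost"] > 60.0, ["patient", "cost"]]

# Joining the two tables on a shared key
joined = visits.merge(clinics, on="clinic", how="left")

# Aggregating: total cost per clinic
totals = joined.groupby("clinic")["cost"].sum()
```

The same pipeline in dplyr would use `filter`, `select`, `left_join`, and `group_by`/`summarise`.<br />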
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
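As a concrete sketch of the bootstrap listed above: resample the data with replacement many times and use the spread of the statistic across resamples. A minimal percentile confidence interval for a mean, using only the Python standard library (the sample values are made up):<br />

```python
import random
import statistics

random.seed(0)

# Hypothetical observed sample; in practice this would be real data.
sample = [2.1, 3.4, 2.9, 4.0, 3.1, 2.7, 3.8, 3.3]

# Bootstrap: draw resamples with replacement, recompute the mean each time.
boot_means = []
for _ in range(2000):
    resample = [random.choice(sample) for _ in sample]
    boot_means.append(statistics.mean(resample))

boot_means.sort()
# Percentile 95% confidence interval for the mean
lo = boot_means[int(0.025 * len(boot_means))]
hi = boot_means[int(0.975 * len(boot_means))]
```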
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught in the first half of the course.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
A document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi* <br />'''9637 Slots 3:30pm-4:30pm''': Mahtab Ahmed, *Nick DelBen*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - Brainstorming: *slot1*, Gurpreet Singh, Erica Yarmol-Matusiak<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Xiaoyu Yang & Sachi Elkerton, Felipe Urra, Tianzhi Zhu<br />
* 23 Nov - '''Project Draft Due 24 Nov at 5pm''' - Brainstorming: Nanditha Rao, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: Yancong Wang & Jiayi JI, Mohammad, Angela Zhao & Yanbing Zhu, Yu Zhu, Gagan Verma & Kerlin Lobo, Zeyu Wang<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, Roopa Bose, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: (Sanjay Ghanathey, Jenna Le, Tanvi Kumar), *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *Rifayat Samee*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span>'''<br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which speciﬁc DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their ﬁndings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024.) This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on Word2Vec in Python. Learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional word embedding. Semantic relationships are estimated using contextual frequency, i.e. how often a word appears given a context of other words.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a very high level interface for deep learning libraries including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
Support Services<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an<br />
off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi* <br />'''9637 Slots 3:30pm-4:30pm''': Mahtab Ahmed, *Nick DelBen*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - Brainstorming: *slot1*, Gurpreet Singh, Erica Yarmol-Matusiak<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Xiaoyu Yang & Sachi Elkerton, Felipe Urra, Tianzhi Zhu<br />
* 23 Nov - '''Project Draft Due 24 Nov at 5pm''' - Brainstorming: Nanditha Rao, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang & Jiayi JI*, Mohammad, Angela Zhao & Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *slot*, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: (Sanjay Ghanathey, Jenna Le, Tanvi Kumar), *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *Rifayat Samee*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=124Introduction to Data Science I2017-11-21T20:11:32Z<p>Dan Lizotte: /* Timeline (Tentative) */</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span>'''<br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which speciﬁc DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their ﬁndings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024.) This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* The ''ggplot2'' book by its creator, Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R''. New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on Word2Vec in Python. Word2Vec learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional embedding; relationships are estimated from contextual frequency, i.e. how often a word appears in the context of other words.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a high-level interface to deep learning libraries, including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
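To make the "contextual frequency" idea behind Word2Vec concrete, here is a toy sketch in plain Python with an invented two-sentence corpus. It only counts raw co-occurrences within a window; Word2Vec learns dense embeddings from this kind of signal rather than keeping the counts themselves, and the gensim tutorials linked above cover the real algorithm and API.<br />

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """Count how often each word appears near each other word.
    Word2Vec's training signal is built from co-occurrences like these;
    the real algorithm learns dense vectors instead of raw counts."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

# Invented toy corpus, pre-tokenized
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
counts = context_counts(corpus)
# "sat" co-occurs with "on" in both sentences
print(counts["sat"]["on"])
```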
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
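The four structured-data operations above map directly onto data-frame verbs. The course materials use dplyr in R; purely as an illustrative sketch (with made-up `orders`/`regions` tables), the same pipeline in Python/pandas looks like:<br />

```python
import pandas as pd

# Toy tables, invented for illustration
orders = pd.DataFrame({"customer": ["a", "a", "b", "c"],
                       "amount": [10.0, 20.0, 5.0, 8.0]})
regions = pd.DataFrame({"customer": ["a", "b", "c"],
                        "region": ["east", "west", "east"]})

selected = orders[["customer", "amount"]]          # selecting columns
filtered = selected[selected["amount"] > 6]        # filtering rows
joined = filtered.merge(regions, on="customer")    # joining tables
totals = joined.groupby("region")["amount"].sum()  # aggregating
print(totals)
```

In dplyr the same chain would use `select`, `filter`, `inner_join`, `group_by`, and `summarise`.<br />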
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
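Of the sampling-distribution tools listed above, the bootstrap is the most mechanical: approximate the sampling distribution of a statistic by recomputing it on resamples drawn with replacement. A minimal standard-library sketch (toy data, invented for illustration):<br />

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Approximate the sampling distribution of the mean by
    resampling the observed data with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    return means

data = [2.1, 3.4, 2.9, 4.0, 3.2, 2.5]  # invented sample
means = bootstrap_means(data)
# The spread of the resampled means approximates the standard error
se = statistics.stdev(means)
```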
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
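To sketch the variance-estimation tools above: k-fold cross-validation partitions the data so that every observation is held out exactly once. A bare-bones index split, standard library only (a sketch of the idea, not any library's API):<br />

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds; each fold
    serves once as the test set while the rest form the training set."""
    folds = []
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

for train, test in kfold_indices(10, 5):
    pass  # fit the model on `train` indices, evaluate on `test` indices
```

In practice one would shuffle the indices first so folds are not confounded with data ordering.<br />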
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
==== Support Services ====<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi* <br />'''9637 Slots 3:30pm-4:30pm''': Mahtab Ahmed, *Nick DelBen*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - Brainstorming: *slot1*, Gurpreet Singh, Erica Yarmol-Matusiak<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Xiaoyu Yang & Sachi Elkerton, Felipe Urra, Tianzhi Zhu<br />
* 23 Nov - '''Project Draft Due 24 Nov at 5pm''' - Brainstorming: Nanditha Rao, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': Roopa Bose, *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang & Jiayi JI*, Mohammad, Angela Zhao & Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *slot*, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: (Sanjay Ghanathey, Jenna Le, Tanvi Kumar), *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *Rifayat Samee*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Friday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=120Introduction to Data Science I2017-11-15T20:48:11Z<p>Dan Lizotte: /* Timeline (Tentative) */ Updated draft due date</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis – bdavis56 at uwo dot ca – Runs the Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM and Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, to indicate which dataset you are using, and to slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g. if you find some useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 6 Oct at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* The ''ggplot2'' book by its creator, Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R''. New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on Word2Vec in Python. Word2Vec learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional embedding; relationships are estimated from contextual frequency, i.e. how often a word appears in the context of other words.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a high-level interface to deep learning libraries, including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
==== Support Services ====<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi* <br />'''9637 Slots 3:30pm-4:30pm''': Mahtab Ahmed, *Nick DelBen*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - Brainstorming: *slot1*, Gurpreet Singh, Erica Yarmol-Matusiak<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang & Sachi Elkerton*, Nanditha Rao, Felipe Urra, *TianzhiZhu*<br />
* 23 Nov - '''Project Draft Due 24 Nov at 5pm''' - Brainstorming: *slot*, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': Roopa Bose, *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *Jiayi Ji*, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: *Sanjay Ghanathey*, *Jenna Le*, *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *Rifayat Samee*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Project_Guidelines&diff=118Project Guidelines2017-11-10T19:46:39Z<p>Dan Lizotte: /* Report Submission and Reviewing */</p>
<hr />
<div>== Goal ==<br />
<br />
The goal of this project is for the student to gain experience in understanding a substantive problem/question, acquiring data relevant to the problem/question, and applying appropriate data science techniques in an effort to address the problem/question. Here I'm using the word ''substantive'' in the way a statistician might: the ''substantive field'' refers to the field of science (not statistical science) containing the problem to be addressed. Example substantive fields include medicine, chemistry, astronomy, and computer networks. All projects must include a visualization component, which may be static or dynamic.<br />
<br />
== Structure and Regulations ==<br />
<br />
*The project will be submitted as three deliverables, a project [[#Proposal|proposal]] early in the term, a [[#Report Draft|draft]] partway through the term, and a final research [[#Final Report|report]] at the end of the term. '''All of these must be submitted as pdfs generated by Markdown, LaTeX, or Word; see instructions below.''' After this, each '''graduate''' student will [[#Review Guidelines|review]] a subset of projects; reviews are due one week after final project submission.<br />
*Projects are to be completed '''individually'''.<br />
*All projects ''must'' be based on a dataset that is '''sufficiently interesting''' for our purposes as judged by the instructor. Note that any [http://archive.ics.uci.edu/ml/ UCI] dataset that was donated prior to 2007 is considered '''un'''interesting and is therefore disallowed.<br />
*You are encouraged to contact Dan at any point to determine if your project topic is suitable.<br />
*'''No Spam Filters. Furthermore, the Enron-Spam datasets are explicitly forbidden.'''<br />
<br />
== Proposal ==<br />
<br />
For the proposal, each student will identify an applied problem (or a few related problems) that could be solved using data science methods, identify an appropriate dataset, and give a detailed plan for analyzing the data that includes what pre-processing will be required, what kind of feature development will be necessary, and what analysis and visualization methods might be applied. Don't forget to include details for how you will assess the performance of any models you build. The proposal should have '''three main headings''':<br />
<br />
* Description of Applied Problem<br />
* Description of Available Data<br />
* Plan for Analysis and Visualization<br />
<br />
The main body of the proposal document should be 2 pages long, single spaced. Page 3 and after may only contain references, tables, and figures. If you are using LaTeX, use the [http://www.csd.uwo.ca/~dlizotte/teaching/stylefiles/ CS4637/CS9637 style files], which are based on the ICML style files. There is no style file for Markdown, but keep in mind that if you use Markdown, you still need to have proper references. [http://www.chriskrycho.com/2015/academic-markdown-and-citations.html This resource] may help, as might a bit of Google/StackExchange searching, but in the end the onus is on you. If using Word, use 3/4" margins and a 12 point serif font.<br />
<br />
Include a brief abstract of a few sentences. '''At least two appropriate references''' must be listed for works (papers or books) that discuss and describe the applied problem, '''at least one reference''' that describes the available data (may be URL(s)) and '''at least two references''' that describe the methods you plan to explore in your analysis and visualization plan.<br />
<br />
'''Whether you are using LaTeX, Markdown, or Word, submit your proposal as a PDF file. Proposals must be submitted through OWL. Late submissions will not be accepted.'''<br />
<br />
== Report Draft ==<br />
<br />
A draft of the final report will be due approximately 2/3 of the way through the term. Use Word, Markdown, or LaTeX with the [http://www.csd.uwo.ca/~dlizotte/teaching/stylefiles/ style files], just as you must for the final report. To ensure you get useful feedback, the draft should have a complete abstract, background section, and analysis and visualization plan. The rest of the paper should at least be sketched in, perhaps in point form, to give a sense of the final shape of the document. '''The precise content of the draft is not specified, but the more you provide, the better feedback you will get.'''<br />
<br />
'''Report drafts must be submitted <!-- to EasyChair [https://www.easychair.org/conferences/?conf=amlf14 https://www.easychair.org/conferences/?conf=amlf14] --> through OWL by 5pm on the due date. *Do not e-mail the instructor your draft.*''' Late submissions will not be accepted. <!-- Later, to submit your final report, you will simply "Update" your draft submission with a new .pdf (and maybe title.) --><br />
<br />
== Final Report ==<br />
<br />
The report must be no more than 4 pages long, single spaced, not including references. '''If you wish''', you may also include an additional appendix with an unlimited number of pages that contain '''only figures, figure captions, and tables'''. Use Word, or use the [http://www.csd.uwo.ca/~dlizotte/teaching/stylefiles/ style files], which are based on the ICML style files, or use Markdown. Include a brief abstract. As mentioned above, all reports must include a visualization component.<br />
<br />
An outstanding report might resemble an application-focussed publication in a workshop at one of the top machine learning or AI conferences, such as ICML or [http://www.aaai.org/Library/IAAI/iaai-library.php IAAI]. (Note however that you are required to include a visualization component, which such papers may not have.) Here are some examples. Note that just because a paper is listed here does not mean it is perfect; you must always read with a fair but critical eye.<br />
<br />
*Philip A. Warrick, Emily F. Hamilton, Robert E. Kearney, Doina Precup. [http://www.aaai.org/ocs/index.php/IAAI/IAAI10/paper/view/1597 A Machine Learning Approach to the Detection of Fetal Hypoxia during Labor and Delivery.]<br />
*Weiss, Page, Peissig, Natarajan, and McCarty. [http://www.aaai.org/ocs/index.php/IAAI/IAAI-12/paper/view/4778/5451 Statistical Relational Learning to Predict Primary Myocardial Infarction from Electronic Health Records]<br />
*Chad Cumby, Rayid Ghani [http://www.aaai.org/ocs/index.php/IAAI/IAAI-11/paper/view/3528 A Machine Learning Based System for Semi-Automatically Redacting Documents.]<br />
*Mitja Luštrek, Hristijan Gjoreski, Simon Kozina, Božidara Cvetković, Violeta Mirchevska, Matjaž Gams [http://www.aaai.org/ocs/index.php/IAAI/IAAI-11/paper/view/2753 Detecting Falls with Location Sensors and Accelerometers]<br />
* Ben George Weber, Michael John, Michael Mateas, Arnav Jhala [http://www.aaai.org/ocs/index.php/IAAI/IAAI-11/paper/view/3526/4029 Modeling Player Retention in Madden NFL 11]<br />
<br />
=== Specific expectations for the report ===<br />
<br />
'''Reproducibility''': The report '''must''' contain enough detail about the methods used to allow a future researcher to reproduce the results if they had access to the appropriate data and access to all appropriate works cited. (Some projects may use proprietary data; that is fine.) Reports that do not contain sufficient method detail will not receive full marks.<br />
<br />
'''Integrity''': The report must adhere to the standards of [http://www.lib.uwaterloo.ca/gradait/content/documents/credit_your_sources.pdf academic honesty].<br />
<br />
'''Formality''': The report should be written in formal academic language appropriate for a technical report/workshop/conference/journal publication. The author should refer to him/herself in the first person plural, i.e. using "we." ("We present a novel analysis...")<br />
<br />
'''Writing Quality''': The writing must be of the quality level expected of a senior undergraduate or graduate student at a world-class university. The [http://www.sdc.uwo.ca/writing/ Writing Support Centre] at UWO can help you reach this level.<br />
<br />
== Report Submission and Reviewing ==<br />
<br />
'''Final report submissions will be done through OWL.'''<br />
<br />
Following report submission, each '''Computer Science graduate (9637)''' student will be randomly assigned two project reports to review over the week following the due date but before the end of the exam period.<br />
<br />
* The main purpose of reviewing is to provide feedback to authors that they can make use of in their future careers, which gives them a better return on the investment they have made in their course project.<br />
* The secondary purpose is to give students a view of the variety of work that has been done in the course.<br />
* '''Reviews from other students will not affect the grade of the author in any way.'''<br />
* Reviewing will be single-blind: Authors will not know who reviews their project.<br />
* Reviewers are expected to provide feedback that is '''constructive'''. Constructive feedback '''makes concrete suggestions on improving the work''' under review. Feedback that is both negative and non-constructive will not be tolerated.<br />
<br />
=== Review Guidelines ===<br />
'''Students must follow the review guidelines below. Include headings where appropriate.'''<br />
<br />
* '''Summary:''' Summarize the goal of the project. What are the authors trying to achieve? Then summarize the contributions of the project in a few sentences. Describe the substantive problem, the data used, and the analysis applied. Describe the results. Note that not every project will have "good results" and for this project that is not necessarily a fault; the meta-goal of this project is for each author to gain experience with DS methods. Keep that in mind when you summarize: did the authors sufficiently explore the space of appropriate methods?<br />
* After the summary, comment on the following aspects of the report:<br />
** '''Background''': Comment on whether the report clearly explains the problem to be tackled, and whether it clearly describes how the substantive problem will be formulated as a data science problem.<br />
** '''Data''': Comment on whether you were able to clearly understand what data were available and how they were used in the analysis.<br />
** '''Analysis and Visualization''': Comment on the appropriateness of the DS methods used, and '''comment on the reproducibility of the results''' as described above. Comment on the evaluation measures used.<br />
** '''Future work''': Make some suggestions on how the work could be extended in the future.<br />
<br />
Depending on the project, these sections of the review may be longer or shorter. Use your judgement. Be sure to have at least a few interesting sentences under each heading.<br />
<br />
== Brainstorming ==<br />
<br />
A brainstorming session will consist of a 10-minute presentation by a student, followed by a class discussion for a total of 15 minutes. The presenter may choose to take questions during the talk, or save them until the end. The presentation should detail an applied problem, dataset, and potential DS methods that could be useful, much like the project proposal. The Brainstorming Session '''''may or may not''''' be on the student's project topic, but of course it may be advantageous to use your brainstorming slot to get feedback and ideas.<br />
<br />
* Guidelines<br />
** Presentations should use projected slides<br />
** Presentations should cover more or less the same topics as a project proposal: Description of Applied Problem, Description of Available Data, Plan for Analysis and Visualization<br />
** Presenters will receive a 5-minute warning, but presentations *will* be terminated at the 15-minute mark.<br />
<br />
* Evaluation (by instructor) is based on <br />
** Effective explanation of the problem<br />
** Effective explanation of the available data. It is often a good idea to show a specific example of a single "data item" from the available data, whatever that might mean for the specific project.<br />
** Effective explanation of potential DS methods<br />
** Ability to answer questions about the data and the analysis and visualization plan<br />
** Working within the strict 10+5 minute timeslot<br />
<br />
In general, it is better to *show* your plan rather than tell it. Use actual examples from your dataset where possible. Show how feature vectors and any class labels/regression targets are constructed.</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=115Introduction to Data Science I2017-11-05T12:56:49Z<p>Dan Lizotte: /* Timeline (Tentative) */ Kerlin will co-present with Gagan</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on Word2Vec in Python. Word2Vec learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional word embedding. Semantic relationships are estimated using contextual frequency, i.e. how often a word appears given a context of other words.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a very high level interface for deep learning libraries including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
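The contextual-frequency idea behind Word2Vec, described above, can be sketched without any deep learning library: count how often words co-occur within a small window, then factor the count matrix so that words with similar contexts end up with nearby vectors. Note this is ''not'' Word2Vec itself (which trains a neural network with negative sampling); it is only a minimal NumPy illustration of embeddings derived from contextual frequency, and the toy corpus and the <code>most_similar</code> helper are invented for the example.<br />

```python
import numpy as np

# Toy corpus; real embedding training needs very large corpora.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each word pair co-occurs within a +/-2-word window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Factor the co-occurrence matrix; scaled rows of U are low-dimensional embeddings.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
dim = 3
embeddings = U[:, :dim] * S[:dim]

def most_similar(word):
    # Return the vocabulary word whose embedding has highest cosine similarity.
    v = embeddings[idx[word]]
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-12)
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][0]
```

In practice you would use gensim's trained models (see the tutorials above) rather than this sketch; the point is only that words appearing in similar contexts get similar vectors.<br />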
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
Support Services<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *slot*, *slot*, *slot*, *Nick DelBen*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, Gurpreet Singh, Erica Yarmol-Matusiak<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *TianzhiZhu*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': Roopa Bose, *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *Jiayi Ji*, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: *Sanjay Ghanathey*, *Jenna Le*, *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *Rifayat Samee*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=114Lecture Materials2017-11-02T20:20:57Z<p>Dan Lizotte: /* Lecture Materials */ Fixed link to nonlinear models pdf</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.pdf pdf]]<br />
* Model Selection [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.pdf pdf]]<br />
* Classification [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.pdf pdf]]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.pdf pdf] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.pdf pdf] ]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira. Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference 2003. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=111Lecture Materials2017-10-31T18:07:30Z<p>Dan Lizotte: Updated unsupervised learning lecture materials F2017</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.pdf pdf]]<br />
* Model Selection [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.pdf pdf]]<br />
* Classification [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.pdf pdf]]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.pdf pdf] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/9_Unsupervised%20Learning/unsupervised-learning.pdf pdf] ]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira. Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference 2003. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=103Introduction to Data Science I2017-10-17T16:03:07Z<p>Dan Lizotte: /* Materials */</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before the end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on Word2Vec in Python. Word2Vec learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional word embedding; the relationships are estimated from contextual frequency, i.e. how often a word appears in the context of other words.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a very high level interface for deep learning libraries including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
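As a rough illustration of the "contextual frequency" signal mentioned above, the sketch below simply counts how often each word co-occurs with others inside a small window. This is the raw statistic that Word2Vec-style models compress into dense embeddings; the function name, toy corpus, and window size here are made up for illustration and are not part of gensim or TensorFlow.<br />

```python
from collections import Counter

def context_counts(sentences, window=2):
    """Count, for each word, how often every other word appears
    within `window` positions of it (the raw co-occurrence signal
    that embedding models learn to compress)."""
    counts = {}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            ctx = counts.setdefault(w, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return counts

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
counts = context_counts(corpus, window=2)
print(counts["sat"].most_common(3))
```

Words that occur in similar contexts (here, "cat" and "dog") end up with similar count profiles, which is exactly the property the learned embeddings preserve.<br />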
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
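The selecting/filtering/joining/aggregating operations above are what dplyr (or SQL) provide on data frames; as a rough sketch of the same ideas on plain Python lists of dicts (all table and column names here are invented for illustration):<br />

```python
# Toy "tables" as lists of dicts (hypothetical data, for illustration).
patients = [
    {"id": 1, "city": "London", "age": 34},
    {"id": 2, "city": "Toronto", "age": 51},
    {"id": 3, "city": "London", "age": 47},
]
visits = [
    {"id": 1, "cost": 120.0},
    {"id": 3, "cost": 80.0},
    {"id": 3, "cost": 60.0},
]

# Filter: keep rows matching a predicate (dplyr's filter()).
adults_40 = [p for p in patients if p["age"] >= 40]

# Join: match rows across tables on a key (dplyr's inner_join()).
joined = [{**p, **v} for p in patients for v in visits if p["id"] == v["id"]]

# Aggregate: total visit cost per city (group_by() + summarise()).
totals = {}
for row in joined:
    totals[row["city"]] = totals.get(row["city"], 0.0) + row["cost"]
print(totals)
```

In practice you would use dplyr or pandas for this, but the underlying relational operations are the same.<br />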
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
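Of the sampling-distribution topics above, the bootstrap is the most mechanical: resample the observed data with replacement many times, recompute the statistic each time, and read off empirical quantiles. A minimal percentile-bootstrap sketch (the helper name and toy data are made up; standard library only):<br />

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any statistic:
    approximate the sampling distribution by resampling the data
    with replacement, then take empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [2.1, 2.4, 1.9, 2.8, 3.0, 2.2, 2.6, 2.0, 2.5, 2.3]
lo, hi = bootstrap_ci(sample)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The same routine works for medians, correlations, or any other statistic that lacks a convenient closed-form sampling distribution.<br />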
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
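For the variance-reduction side above, k-fold cross-validation averages k held-out evaluations instead of trusting one test split. A sketch with stand-in `fit`/`score` callables (these names and the toy model are invented for illustration, not a library API):<br />

```python
import random
import statistics

def k_fold_scores(xs, ys, fit, score, k=5, seed=0):
    """Shuffle once, split indices into k folds, and for each fold
    train on the other k-1 folds and score on the held-out one."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in fold],
                            [ys[i] for i in fold]))
    return scores

# Toy example: the "model" is just the training-set mean of y,
# and the score is negative mean squared error on the held-out fold.
xs = list(range(20))
ys = [2.0 * x + random.Random(x).gauss(0, 1) for x in xs]
fit = lambda X, Y: statistics.mean(Y)
score = lambda m, X, Y: -statistics.mean((y - m) ** 2 for y in Y)
scores = k_fold_scores(xs, ys, fit, score, k=5)
print(statistics.mean(scores))
```

Averaging the k scores gives a lower-variance performance estimate than a single train/test split of the same total size.<br />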
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options for obtaining help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *Roopa Bose*, *Jenna Le*, *Sanjay Ghanathey*, *slot7*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, Gurpreet Singh, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *TianzhiZhu*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': *Rifayat Samee*, *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *Jiayi Ji*, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *slot6*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=102Introduction to Data Science I2017-10-17T16:01:53Z<p>Dan Lizotte: /* Materials */ Added deep learning resources</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before the end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://web.stanford.edu/~hastie/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* '''HW''': Wickham, H. (2016). ''ggplot2: Elegant Graphics for Data Analysis.'' 2nd ed. New York: Springer. ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R''. New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, ''Pattern Recognition and Machine Learning'', Springer, 2006.<br />
:** R. S. Sutton and A. G. Barto, ''Reinforcement Learning: An Introduction'', MIT Press, 1998.<br />
:** Ethem Alpaydin, ''Introduction to Machine Learning'', MIT Press, 2004.<br />
:** David J. C. MacKay, ''Information Theory, Inference and Learning Algorithms'', Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, ''Pattern Classification'', 2nd ed., Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The [https://www.tensorflow.org/ TensorFlow] library (Python, C++)<br />
:* Deep Learning Resources (courtesy Ethan Jackson)<br />
:** Tutorials on word2vec in Python. Word2vec learns semantic relationships between words in very large corpora by mapping each word to a high-dimensional embedding. Semantic relationships are estimated from contextual frequency, i.e. how often a word appears given a context of other words. I can give you more details about the training algorithms if you like.<br />
:***https://radimrehurek.com/gensim/models/word2vec.html<br />
:***https://rare-technologies.com/word2vec-tutorial/<br />
:**Some ideas about using t-SNE for visualization<br />
:***https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/<br />
:**Digit classification on MNIST dataset using TensorFlow<br />
:***https://www.tensorflow.org/get_started/mnist/beginners<br />
:**Autoencoders for MNIST in Keras (a very high-level interface to deep learning libraries, including TensorFlow)<br />
:***https://blog.keras.io/building-autoencoders-in-keras.html<br />
:**Convolutional neural networks for image recognition on CIFAR-10 dataset in TensorFlow. Great starting point for image classification using deep learning.<br />
:*** https://www.tensorflow.org/tutorials/deep_cnn<br />
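The contextual-frequency idea behind word2vec can be illustrated in a few lines of plain Python (a toy sketch only, not gensim; the corpus and window size here are invented for illustration):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

window = 2  # number of context words counted on each side
cooccur = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # context = up to `window` words before and after position i
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        for c in context:
            cooccur[(w, c)] += 1

# "cat" and "dog" share contexts (e.g. "sat"), hinting they are related;
# word2vec turns these counts into dense vectors rather than using them raw.
print(cooccur[("cat", "sat")], cooccur[("dog", "sat")])
```

Word2vec itself trains a small neural network over such windows, but the raw signal it exploits is exactly this co-occurrence structure.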
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
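The four verbs above can be sketched in plain Python (a toy illustration with invented tables; in the course you would typically use dplyr or pandas instead):

```python
from statistics import mean

# Hypothetical toy tables (all names and values invented)
people = [
    {"name": "Ada",      "dept": "CS",  "age": 36},
    {"name": "Grace",    "dept": "CS",  "age": 40},
    {"name": "Rosalind", "dept": "Bio", "age": 37},
]
depts = [{"dept": "CS", "building": "MC"}, {"dept": "Bio", "building": "BGS"}]

# selecting: keep a subset of columns
names = [p["name"] for p in people]

# filtering: keep a subset of rows
cs_people = [p for p in people if p["dept"] == "CS"]

# joining: combine two tables on a shared key
building = {d["dept"]: d["building"] for d in depts}
joined = [{**p, "building": building[p["dept"]]} for p in people]

# aggregating: summarize rows by group
avg_age = {d: mean(p["age"] for p in people if p["dept"] == d)
           for d in {p["dept"] for p in people}}
```

Each dplyr verb (`select`, `filter`, `*_join`, `summarise`) has a direct analogue here.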
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
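Of these, the bootstrap is especially easy to sketch (a toy stdlib-Python illustration with an invented sample; the 2000 resamples and the percentile interval are arbitrary choices):

```python
import random
from statistics import mean

random.seed(0)  # reproducible resampling

# Hypothetical observed sample (values invented for illustration)
sample = [2.1, 3.5, 2.9, 4.0, 3.2, 2.7, 3.8, 3.1]

# Bootstrap: resample with replacement many times, recompute the statistic
boot_means = sorted(mean(random.choices(sample, k=len(sample)))
                    for _ in range(2000))

# Percentile-method 95% confidence interval for the mean
ci_low, ci_high = boot_means[49], boot_means[1949]
```

The spread of `boot_means` approximates the sampling distribution of the mean without any normality assumption.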
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
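The cross-validation idea amounts to index bookkeeping (a toy stdlib-Python sketch; `kfold_indices` is our own hypothetical helper, not a library function):

```python
# k-fold cross-validation: assign each example to one fold (round-robin);
# each fold serves once as the test set while the rest form the training set.
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(10, 5))  # 5 disjoint train/test splits of 10 items
```

Averaging a model's test-set error over the `k` splits gives a lower-variance performance estimate than a single held-out set.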
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: Midterm Review/Q&A<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *Zaid Albirawi*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *Roopa Bose*, *Jenna Le*, *Sanjay Ghanathey*, *slot7*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, Gurpreet Singh, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *TianzhiZhu*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, Sabyasachi Patjoshi<br />'''9637 Slots 3:30pm-4:30pm''': *Rifayat Samee*, *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *Jiayi Ji*, *Paul Bartlett*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *Kun Xie*, *Nasim Samei*, *Jacob Hunte*, *slot6*<br />
* 7 Dec - Brainstorming: *Nima khairdoodt*, *Sana Ahmadi*, *Mohsen shirpour*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *Andrew Bloch-Hansen*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=85Lecture Materials2017-10-05T18:00:22Z<p>Dan Lizotte: Adding nonlinear models content</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.pdf pdf]]<br />
* Model Selection [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.pdf pdf]]<br />
* Classification [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.pdf pdf]]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/8_Nonlinear%20Models/nonlinear_models.pdf pdf] ]<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=84Introduction to Data Science I2017-10-05T17:45:35Z<p>Dan Lizotte: /* Timeline (Tentative) */ Cancel 30 nov brainstorming</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that attracts students from various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to the problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday 2:30PM – 4:30PM and Thursday 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g., if you find useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2016). ['''Free''' through [https://alpha.lib.uwo.ca/record=b6962637~S20 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
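As a concrete taste of the bootstrap topic listed above, here is a minimal percentile-bootstrap sketch. It is written in Python rather than the R used in lectures, and the sample values and the helper name <code>bootstrap_ci</code> are illustrative, not part of the course materials.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a sample statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement n_boot times,
    # recomputing the statistic on each resample.
    boots = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    # Take the alpha/2 and 1 - alpha/2 empirical percentiles.
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.5]
lo, hi = bootstrap_ci(sample)
```

The same idea carries over directly to statistics other than the mean (median, correlation, etc.) by swapping the <code>stat</code> argument.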
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
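The cross-validation idea listed above can be sketched in a few lines. This is a Python illustration (the course itself uses R), and the helper name <code>kfold_indices</code> is made up for this sketch; in practice one would use a library routine such as those covered in lecture.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Round-robin split of the shuffled indices into k disjoint folds.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Train on every fold except the held-out one.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Each observation appears in exactly one test fold.
splits = list(kfold_indices(20, k=5))
```

A model would be fit on each <code>train</code> set and scored on the matching <code>test</code> set, and the k scores averaged to estimate generalization performance.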
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of its students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 519-661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *slot2*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *Roopa Bose*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, Xindi Wang<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *TianzhiZhu*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, *slot*<br />'''9637 Slots 3:30pm-4:30pm''': *Rifayat Samee*, *Hao Jiang*, *Abdelkareem Jaradat*, *Debanjan Guha Roy*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Yanbing Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': '''CANCELLED'''<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *Kun Xie*, *Nasim Samei*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Hengyu Yue*, *Zhongwen Zhang*, *Yifang Liu*, *slot7*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Friday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=70Introduction to Data Science I2017-09-28T15:32:04Z<p>Dan Lizotte: /* Timeline (Tentative) */</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
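The core structured-data operations above map directly onto SQL. As a minimal, dependency-free sketch (using Python's built-in sqlite3 module rather than the R tools used in lectures, with made-up toy tables), selecting, filtering, joining, and aggregating look like this:<br />

```python
# Toy tables (invented for illustration): one row per student, one per grade.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE students (sid INTEGER, program TEXT);
    CREATE TABLE grades   (sid INTEGER, grade REAL);
    INSERT INTO students VALUES (1, 'CS'), (2, 'Stats'), (3, 'CS');
    INSERT INTO grades   VALUES (1, 85), (2, 91), (3, 78);
""")

# Select and join on the shared key, filter to one program, then aggregate.
row = con.execute("""
    SELECT AVG(g.grade)
    FROM students AS s
    JOIN grades   AS g ON s.sid = g.sid
    WHERE s.program = 'CS'
""").fetchone()

print(row[0])  # mean grade over the two 'CS' rows: (85 + 78) / 2 = 81.5
```

The same verbs (select, filter, join, summarise) appear in the dplyr package listed among the course's software resources.<br />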
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
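To make the bootstrap idea above concrete: resample the observed data with replacement many times, and use the spread of the resampled statistic as an interval estimate. A small Python sketch, where the sample values and seed are arbitrary and chosen only for illustration:<br />

```python
# Percentile-bootstrap confidence interval for a mean, standard library only.
import random
import statistics

random.seed(0)
data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]  # invented sample

boot_means = []
for _ in range(2000):
    # Resample n points *with replacement* from the original sample.
    resample = [random.choice(data) for _ in data]
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lo, hi = boot_means[49], boot_means[1949]  # roughly the 2.5% and 97.5% quantiles
print(f"~95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```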
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
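The variance-oriented tools above all estimate out-of-sample error. As a toy sketch of k-fold cross-validation (pure Python, with an invented dataset and a deliberately trivial "model" that just predicts the training mean):<br />

```python
# 3-fold cross-validation of a trivial predictor, standard library only.
import statistics

data = [(1, 1.2), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 6.2)]  # (x, y) pairs
k = 3
folds = [data[i::k] for i in range(k)]  # round-robin split into k folds

errors = []
for i in range(k):
    test = folds[i]                                             # held-out fold
    train = [pt for j, f in enumerate(folds) if j != i for pt in f]
    pred = statistics.mean(y for _, y in train)                 # "model": train mean
    errors.extend((y - pred) ** 2 for _, y in test)

cv_mse = statistics.mean(errors)  # cross-validated mean squared error
print(cv_mse)
```

Each point is held out exactly once, so the averaged error is an honest estimate of how the "model" would do on unseen data.<br />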
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
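As a minimal illustration of the clustering topic above, here is a hypothetical one-dimensional k-means sketch in Python (the data points and starting centres are made up): alternate between assigning points to the nearest centre and moving each centre to the mean of its assigned points.<br />

```python
def kmeans_1d(points, centres, iters=10):
    """Plain k-means on 1-D points: assign to nearest centre, then update."""
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

# Two well-separated groups; centres converge near 1.0 and 9.0.
centres = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```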
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
The midterm will assess competencies in the fundamentals taught during the first half of the course.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer in-depth questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format, or if there are other arrangements that would make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: Performance Evaluation, Model Selection<br />
* 28 Sep - Lectures: Classification<br />
** 3 Oct - Lectures: Classification, Performance Evaluation for Classification<br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: Nonlinear Classification<br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *slot2*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *Roopa Bose*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, Patrick Carnahan, Nathan Phelps<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': Ruoxi Shi, Valeria Cesar, Mingda Sun, *slot7*<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *slot6*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Rifayat Samee*, *Hao Jiang*, *Abdelkareem Jaradat*, *slot7*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Vanessa Zhu, Yu Zhu, *Gagan Verma*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Andrew Bloch-Hansen*, *Nima Khairdoost*, *Sana Ahmadi*, *Mohsen Shirpour*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>
Dan Lizotte<br />
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=66<br />
Lecture Materials, 2017-09-26T01:44:21Z<br />
<p>Dan Lizotte: Updated lectures F17 up to Classification</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.pdf pdf]]<br />
* Model Selection [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/6_Model%20Selection/model_selection.pdf pdf]]<br />
* Classification [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/7_Classification/classification.pdf pdf]]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>
Dan Lizotte<br />
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=62<br />
Introduction to Data Science I, 2017-09-22T17:54:28Z<br />
<p>Dan Lizotte: /* Timeline (Tentative) */ Filling in topics</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that attracts students from various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesdays 2:30PM–4:30PM and Thursdays 2:30PM–3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, to indicate which dataset you are using, and to slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g., if you find useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline section at the bottom of this page by '''Friday, 6 Oct at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': Hastie, T., Tibshirani, R., & Friedman, J. ''The Elements of Statistical Learning.'' An expanded treatment of the material in JWHT. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* Wickham, H. (2009). ''ggplot2: Elegant Graphics with R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf Linear Algebra Review] - up to and including Section 3.7, The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The TensorFlow library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
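For orientation, the structured-data operations listed above correspond closely to the dplyr "verbs" (see the dplyr documentation under Resources). The following is a minimal illustrative sketch, not course material, using the built-in <code>mtcars</code> data frame:

```r
# Minimal dplyr sketch of the operations above (illustrative only).
library(dplyr)

mtcars %>%
  select(mpg, cyl, hp) %>%          # selecting columns
  filter(hp > 100) %>%              # filtering rows
  group_by(cyl) %>%                 # ...then aggregating by group
  summarise(mean_mpg = mean(mpg))

# Joining works analogously via inner_join(), left_join(), etc.
```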
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
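As a concrete taste of one of the topics above, the bootstrap can be sketched in a few lines of base R. This is an illustrative example (with simulated data), not course material:

```r
# Percentile-bootstrap 95% confidence interval for a sample mean
# (illustrative only; the data here are simulated).
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)                  # a made-up sample
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))              # approximate 95% CI
```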
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
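For a rough idea of what cross-validation involves, here is a minimal k-fold sketch in base R (illustrative only; see the Performance Evaluation lecture materials for the course's treatment):

```r
# 5-fold cross-validation estimate of test MSE for a linear model
# on the built-in mtcars data (illustrative only).
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_mse <- sapply(1:k, function(i) {
  train    <- mtcars[folds != i, ]
  held_out <- mtcars[folds == i, ]
  fit <- lm(mpg ~ hp + wt, data = train)           # fit on k-1 folds
  mean((held_out$mpg - predict(fit, held_out))^2)  # error on held-out fold
})
mean(cv_mse)                                       # cross-validation estimate
```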
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously, and students are directed to read the appropriate policy, specifically the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting with the second lecture, there will be a very short quiz at the beginning of class covering the previous lecture's material. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Missed quizzes will be excused only for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught during the first half of the course.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, along with some potential data science methods that could be applied to it. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving it using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem, to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
A document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due later in the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of its students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format, or if other arrangements would make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options for obtaining help.<br />
Additional student-run support services are offered by the USC: http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Welcome<br />
** 12 Sep - Lectures: Data Preparation, Introduction to Statistics<br />
* 14 Sep - Lectures: Introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning, Performance Evaluation<br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *slot2*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *Roopa Bose*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, Jonathan Tan, Duff Jones, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, Valeria Cesar, Mingda Sun, *slot7*<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *slot6*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Rifayat Samee*, *Hao Jiang*, *slot6*, *slot7*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, Mohammad, Vanessa Zhu, Yu Zhu, *slot5*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *Marios-Stavros Grigoriou*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Andrew Bloch-Hansen*, *slot5*, *slot6*, *slot7*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
<br />
* '''Project Report Due Friday, 8 December at 5pm'''<br />
* '''Peer Reviews (graduate students only) Due Friday, 15 December at 5pm'''</div>
Dan Lizotte
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=61
Lecture Materials (2017-09-21T21:23:13Z)
<p>Dan Lizotte: Added performance evaluation materials</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Performance Evaluation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/5_Performance%20Evaluation/performance_evaluation.pdf pdf]]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=54Introduction to Data Science I2017-09-19T17:59:39Z<p>Dan Lizotte: /* Materials */</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that attracts students from a variety of programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesdays 2:30PM – 4:30PM and Thursdays 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 6 Oct at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [https://onlinecourses.science.psu.edu/statprogram/calculus_review Calculus Review] from Penn State University. Includes basic mathematical notation.<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
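The structured-data verbs above (selecting, filtering, joining, aggregating) are the core of data munging. The course works in R with dplyr, but the same pipeline can be sketched in pandas; the data frames below are purely hypothetical, for illustration only.<br />

```python
import pandas as pd

# Hypothetical data, invented for illustration.
visits = pd.DataFrame({
    "city": ["London", "Toronto", "London", "Ottawa"],
    "year": [2016, 2016, 2017, 2017],
    "count": [120, 340, 150, 90],
})
regions = pd.DataFrame({
    "city": ["London", "Toronto", "Ottawa"],
    "region": ["Southwest", "Central", "East"],
})

# Filter rows, join on a key, then aggregate by group.
result = (
    visits[visits["year"] >= 2016]                      # filter
    .merge(regions, on="city", how="left")              # join
    .groupby("region", as_index=False)["count"].sum()   # aggregate
)
print(result)
```

The dplyr equivalents are <code>filter()</code>, <code>left_join()</code>, and <code>group_by()</code> with <code>summarise()</code>.<br />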
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
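One way to make the bootstrap concrete: resample the data with replacement many times and examine how the statistic of interest varies across resamples. A minimal sketch in Python (the course itself uses R), on made-up data:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# A small hypothetical sample, for illustration only.
sample = np.array([2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 4.0, 2.2, 3.6, 2.8])

# Bootstrap: draw resamples of the same size, with replacement,
# and record the statistic (here, the mean) for each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

# A 95% percentile confidence interval for the mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The same resampling idea gives standard errors and intervals for statistics with no convenient closed-form sampling distribution.<br />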
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
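As a concrete illustration of cross-validation, the sketch below (Python, hypothetical data; the course itself uses R) splits the data into k folds and averages the held-out error. The "model" is just the training-fold mean, to keep the fold logic front and centre:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical response values, invented for illustration.
y = rng.normal(loc=5.0, scale=2.0, size=100)

k = 5
# Shuffle the indices, then split them into k roughly equal folds.
folds = np.array_split(rng.permutation(len(y)), k)

errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    prediction = y[train_idx].mean()                     # "fit" on training folds
    mse = np.mean((y[test_idx] - prediction) ** 2)       # evaluate on held-out fold
    errors.append(mse)

print(f"cross-validated MSE = {np.mean(errors):.2f}")
```

Each observation is held out exactly once, so the averaged error estimates performance on unseen data rather than on the data used for fitting.<br />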
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
The midterm will assess competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if other arrangements would make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: Ethan Jackson, *slot2*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *Roopa Bose*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: Ashutosh Mishra, Brandon Glied-Goldstein, *slot3*, Duff Jones, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, Valeria Cesar, *slot6*, *slot7*<br />
** 21 Nov - Brainstorming: Cole Fisher, Angela Zhao, *Xiaoyu Yang*, Nanditha Rao, Felipe Urra, *slot6*<br />
* 23 Nov - Brainstorming: Mahtab Ahmed, Jumayel Islam, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Rifayat Samee*, *slot5*, *slot6*, *slot7*<br />
** 28 Nov - Brainstorming: *Yancong Wang*, *slot2*, Vanessa Zhu, *slot4*, *slot5*, *Zeyu Wang*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *Andrew Bloch-Hansen*, *slot5*, *slot6*, *slot7*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=53Lecture Materials2017-09-19T14:57:51Z<p>Dan Lizotte: fixed directory for supervised learning f17</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference 2003. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: k-NN, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=52Lecture Materials2017-09-19T14:43:18Z<p>Dan Lizotte: Revealed new supervised learning slides</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf] ]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=35Introduction to Data Science I2017-09-12T21:32:37Z<p>Dan Lizotte: Added address and term information.</p>
<hr />
<div><br />
<br />
== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
'''The University of Western Ontario<br />'''<br />
'''London, Ontario, Canada<br />'''<br />
'''Department of Computer Science<br />'''<br />
'''Course Outline - Fall (September - December) 2017<br />'''<br />
<br />
'''From Dan:''' This is a very high-demand course that attracts students from programs across campus. I think this is great, because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in recognizing which problems can be tackled by DS methods and in identifying which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable to their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesdays 2:30–4:30 PM and Thursdays 2:30–3:30 PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before the end of '''Friday, 6 Oct at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* Wickham, H. (2009). ''ggplot2: Elegant Graphics for Data Analysis.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** Bishop, C. M. (2006). ''Pattern Recognition and Machine Learning.'' Springer.<br />
:** Sutton, R. S., & Barto, A. G. (1998). ''Reinforcement Learning: An Introduction.'' MIT Press.<br />
:** Alpaydin, E. (2004). ''Introduction to Machine Learning.'' MIT Press.<br />
:** MacKay, D. J. C. (2003). ''Information Theory, Inference and Learning Algorithms.'' Cambridge University Press.<br />
:** Duda, R. O., Hart, P. E., & Stork, D. G. (2001). ''Pattern Classification'' (2nd ed.). Wiley & Sons.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
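As a minimal sketch of the structured-data verbs above, here is a base-R example (the two tables are invented for illustration; the dplyr equivalents are on the Data Wrangling cheat sheet):

```r
# Two toy tables, invented for illustration.
patients <- data.frame(id = 1:4,
                       site = c("A", "A", "B", "B"),
                       age  = c(34, 51, 29, 62))
visits <- data.frame(id   = c(1, 1, 2, 3, 4),
                     cost = c(100, 150, 200, 80, 120))

joined  <- merge(patients, visits, by = "id")                 # joining
adults  <- subset(joined, age >= 30, select = c(site, cost))  # filtering & selecting
by_site <- aggregate(cost ~ site, data = adults, FUN = sum)   # aggregating
by_site
#   site cost
# 1    A  450
# 2    B  120
```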
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
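As a rough illustration of the bootstrap mentioned above (written in Python rather than the course's R, and using made-up sample data), a percentile bootstrap confidence interval for the mean can be sketched as:

```python
import random

def bootstrap_ci(data, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take quantiles."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical sample; the interval brackets the plug-in sample mean.
sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.6, 3.0, 2.2, 2.7]
ci_lo, ci_hi = bootstrap_ci(sample, lambda xs: sum(xs) / len(xs))
```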
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
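The variance-estimation idea behind cross-validation listed above can be sketched with a small Python helper that partitions example indices into disjoint folds (a generic sketch, not code from the course):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Split indices 0..n-1 into k disjoint folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Each fold serves once as the test set; the remainder is the training set.
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(20, k=5)
```

A model would be fit on each training-index list and scored on the matching test-index list; the spread of the k scores gives a sense of the variance of the performance estimate.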
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting with the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Missed quizzes will be excused only for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
The midterm will assess competencies in the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to it. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches to solving it using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, Sachi Elkerton<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, Duff Jones, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: Kerlin Lobo, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 21 Nov - Brainstorming: *slot1*, Angela Zhao, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
<br />
* '''Project Report Due Friday, 8 December at 5pm'''<br />
* '''Reviews (graduate students only) Due Friday, 15 December at 5pm'''</div>Dan Lizotte
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=33 Lecture Materials 2017-09-12T18:08:51Z <p>Dan Lizotte: Update lecture materials 2 and 3</p>
<hr />
<div>= Lecture Materials =<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizotte
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=30 Introduction to Data Science I 2017-09-12T14:48:54Z <p>Dan Lizotte: /* Timeline (Tentative) */ Edited position of 9637 slot text</p>
<hr />
<div>== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable to their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R.'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop. ''Pattern Recognition and Machine Learning.'' Springer, 2006.<br />
:** R. S. Sutton and A. G. Barto. ''Reinforcement Learning: An Introduction.'' MIT Press, 1998.<br />
:** Ethem Alpaydin. ''Introduction to Machine Learning.'' MIT Press, 2004.<br />
:** David J. C. MacKay. ''Information Theory, Inference and Learning Algorithms.'' Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork. ''Pattern Classification'', 2nd ed. Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
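For a taste of the structured-data operations above (selecting, filtering, aggregating), here is an illustrative sketch, not course-required code, using dplyr on the built-in mtcars data; it assumes the dplyr package is installed:<br />

```r
library(dplyr)

# Mean horsepower by cylinder count, for cars with more than 100 hp:
# filter() selects rows, group_by() + summarise() aggregate,
# arrange() sorts the result.
mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp), n = n()) %>%
  arrange(desc(mean_hp))
```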
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
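As a taste of "The Bootstrap" listed above: resample the data with replacement many times and examine the spread of the recomputed statistic. An illustrative base-R sketch on toy data (not course material):<br />

```r
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)  # a toy sample

# Recompute the median on 2000 resamples drawn with replacement.
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))

sd(boot_medians)                         # bootstrap standard error of the median
quantile(boot_medians, c(0.025, 0.975))  # simple 95% percentile interval
```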
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
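To make the regression/classification distinction above concrete, an illustrative base-R sketch on the built-in mtcars data (not course-required code):<br />

```r
# Regression: predict a numeric outcome (fuel economy from weight).
fit_reg <- lm(mpg ~ wt, data = mtcars)
predict(fit_reg, newdata = data.frame(wt = 3))

# Classification: predict a binary outcome (transmission type from
# weight) via logistic regression.
fit_cls <- glm(am ~ wt, data = mtcars, family = binomial)
predict(fit_cls, newdata = data.frame(wt = 3), type = "response")
```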
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
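The test-set/cross-validation idea above, as an illustrative base-R sketch (5-fold cross-validation of a linear model on the built-in mtcars data; not course-required code):<br />

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

# For each fold: fit on the other folds, measure error on the held-out fold.
cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  mean((mtcars$mpg[folds == i] - pred)^2)
})
mean(cv_mse)  # cross-validated estimate of test error
```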
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
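For the clustering and dimensionality-reduction topics above, an illustrative base-R sketch on the built-in iris measurements (not course-required code):<br />

```r
X <- scale(iris[, 1:4])                    # standardize the four measurements

km <- kmeans(X, centers = 3, nstart = 20)  # k-means clustering
table(km$cluster, iris$Species)            # compare clusters to true species

pc <- prcomp(X)                            # principal component analysis
summary(pc)                                # variance explained per component
```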
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting with the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Missed quizzes will be excused only for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to it. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving it using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
A document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format, or if there are any other arrangements that would make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, Duff Jones, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />'''9637 Slots 3:30pm-4:30pm''': *slot4*, *slot5*, *slot6*, *slot7*<br />
<br />
* '''Project Report Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Friday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=28Introduction to Data Science I2017-09-12T14:43:49Z<p>Dan Lizotte: Added 20 slots for CS9637</p>
<hr />
<div>== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable to their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is aimed at fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location MC 320''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, to indicate which dataset you are using, and to slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g., if you find useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 6 Oct at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. An expanded treatment of the material in '''JWHT'''. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ''ggplot2: Elegant Graphics for Data Analysis'' by ggplot2 creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop. ''Pattern Recognition and Machine Learning''. Springer, 2006.<br />
:** R. S. Sutton and A. G. Barto. ''Reinforcement Learning: An Introduction''. MIT Press, 1998.<br />
:** Ethem Alpaydin. ''Introduction to Machine Learning''. MIT Press, 2004.<br />
:** David J. C. MacKay. ''Information Theory, Inference and Learning Algorithms''. Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork. ''Pattern Classification''. 2nd ed. Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The TensorFlow library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
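The structured-data operations listed above (selecting, filtering, joining, aggregating) are done in the course with R and dplyr; purely as an illustration of the four verbs, using made-up toy data, the same ideas can be sketched in plain Python:

```python
from collections import defaultdict

# Hypothetical toy tables, invented for illustration only.
students = [
    {"id": 1, "name": "Ada", "faculty": "Science"},
    {"id": 2, "name": "Grace", "faculty": "Engineering"},
    {"id": 3, "name": "Alan", "faculty": "Science"},
]
grades = [
    {"id": 1, "grade": 85},
    {"id": 2, "grade": 91},
    {"id": 3, "grade": 78},
]

# Select: keep only some columns.
names = [{"name": s["name"]} for s in students]

# Filter: keep only rows matching a predicate.
science = [s for s in students if s["faculty"] == "Science"]

# Join: combine the two tables on the shared key "id".
grade_by_id = {g["id"]: g["grade"] for g in grades}
joined = [{**s, "grade": grade_by_id[s["id"]]} for s in students]

# Aggregate: mean grade per faculty.
totals = defaultdict(list)
for row in joined:
    totals[row["faculty"]].append(row["grade"])
mean_by_faculty = {f: sum(gs) / len(gs) for f, gs in totals.items()}
```

In dplyr the same pipeline would use `select`, `filter`, `*_join`, and `group_by`/`summarise`; the cheat sheets linked under Materials cover those verbs directly.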
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
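To make one of the sampling-distribution topics concrete, here is a minimal sketch of the bootstrap; the data and function name are invented for illustration, and the course itself works in R:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
data = [2.1, 3.5, 2.8, 4.0, 3.3, 2.9, 3.7, 3.1]

def bootstrap_means(sample, n_resamples=1000):
    """Resample with replacement; record the mean of each resample."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(sample) for _ in sample]
        means.append(sum(resample) / len(resample))
    return means

means = sorted(bootstrap_means(data))
# Simple 95% percentile interval for the mean.
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
```

The spread of `means` approximates the sampling distribution of the sample mean without assuming normality, which is the point of the technique.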
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
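Of the variance-estimation tools listed above, k-fold cross-validation is the workhorse; a minimal index-splitting sketch (function name invented for illustration) looks like:

```python
def k_fold_splits(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each observation appears in exactly one test fold; the remaining
    indices form the corresponding training set.
    """
    indices = list(range(n))
    for i in range(k):
        # Rounded slice boundaries spread any remainder across folds.
        start, stop = i * n // k, (i + 1) * n // k
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

splits = list(k_fold_splits(10, 5))
```

A model would be fit on each `train` set and scored on the matching `test` set; averaging the k scores estimates out-of-sample performance.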
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting with the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will be excused only for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught during the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, along with some potential data science methods that could be applied to it. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving it using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format, or if any other arrangements could make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: '''Guest Lecture by Amanda Holden''' of SAS. Topic TBA.<br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, '''9637 Slots''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, Duff Jones, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*, '''9637 Slots''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, '''9637 Slots''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, '''9637 Slots''': *slot4*, *slot5*, *slot6*, *slot7*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, '''9637 Slots''': *slot4*, *slot5*, *slot6*, *slot7*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=26Introduction to Data Science I2017-09-11T17:46:25Z<p>Dan Lizotte: /* Timeline (Tentative) */ Holding slot for Amanda Holden</p>
<hr />
<div></div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=25Introduction to Data Science I2017-09-11T11:51:58Z<p>Dan Lizotte: /* Important Dates */ Fix due date for reviews</p><br />
<hr />
<div>== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is aimed at fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location TBA''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due Friday, 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g. if you find useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before the end of '''Friday, 6 Oct at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani, and Friedman. An expanded version of JWHT. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* Wickham, H. (2009). ''ggplot2: Elegant Graphics for Data Analysis.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop. ''Pattern Recognition and Machine Learning.'' Springer, 2006.<br />
:** R. S. Sutton and A. G. Barto. ''Reinforcement Learning: An Introduction.'' MIT Press, 1998.<br />
:** Ethem Alpaydin. ''Introduction to Machine Learning.'' MIT Press, 2004.<br />
:** David J. C. MacKay. ''Information Theory, Inference and Learning Algorithms.'' Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart, and David G. Stork. ''Pattern Classification.'' 2nd ed. Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The TensorFlow library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
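The structured-data operations above can be sketched in R with dplyr (linked under Materials). This snippet is illustrative only, not course material: it uses R's built-in <code>mtcars</code> data, and the <code>labels</code> lookup table and column choices are invented for the example.<br />

```r
# Illustrative sketch of the four structured-data operations, using dplyr.
library(dplyr)

cars <- mtcars %>%
  select(mpg, cyl, wt) %>%   # selecting columns
  filter(wt < 5)             # filtering rows

# A made-up lookup table, just to show a join on a key.
labels <- data.frame(cyl  = c(4, 6, 8),
                     size = c("small", "medium", "large"))

cars %>%
  inner_join(labels, by = "cyl") %>%  # joining on a shared key
  group_by(size) %>%
  summarise(mean_mpg = mean(mpg))     # aggregating within groups
```

The dplyr "vignettes" and the Data Wrangling cheat sheet listed under Materials cover the same verbs in more depth.<br />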
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
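As a taste of the bootstrap topic above, here is a minimal R sketch of a percentile bootstrap confidence interval for a mean. It is illustrative only; the simulated sample and the replication count are arbitrary choices, not course requirements.<br />

```r
# Nonparametric bootstrap sketch: resample with replacement, recompute the
# statistic, and take empirical quantiles of the resampled statistics.
set.seed(1)
x <- rnorm(50, mean = 10)  # a made-up sample

boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

quantile(boot_means, c(0.025, 0.975))  # percentile bootstrap 95% CI
```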
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
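The cross-validation idea above can be sketched in a few lines of base R. This is an illustrative example, not the course's prescribed procedure; the model (<code>mpg ~ wt</code> on <code>mtcars</code>) and the choice of k = 5 are arbitrary.<br />

```r
# k-fold cross-validation by hand: hold out each fold in turn, fit on the
# rest, and average the held-out mean squared error.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

errs <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt, data = mtcars[folds != i, ])
  held <- mtcars[folds == i, ]
  mean((held$mpg - predict(fit, held))^2)
})

mean(errs)  # cross-validated MSE estimate
```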
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting with the second lecture, there will be a very short quiz at the beginning of each class covering the previous lecture's material. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm – 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
A document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options for obtaining help.<br />
Additional student-run support services are offered by the USC: http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: <br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: '''Guest Lecture by Dr. Kemi Ola''' on Visualization<br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4 - Duff Jones*, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Report Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Friday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=22Lecture Materials2017-09-08T15:03:40Z<p>Dan Lizotte: Begun postings for F17</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4414_F17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf] ]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. NIPS 2003. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=21Introduction to Data Science I2017-09-07T20:50:29Z<p>Dan Lizotte: /* Course outline for COMPSCI 4414A/9637A */</p>
<hr />
<div>== Course outline for COMPSCI 4414A/9637A/9114A ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location TBA''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g., if you find some useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
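To give a flavour of what "munging" involves, here is a minimal, self-contained sketch of selecting, filtering, joining, and aggregating on toy tables. (Plain Python is used for brevity; in the course you would typically do this in R with dplyr. All table and column names below are made up.)

```python
from collections import defaultdict

# Toy "tables" as lists of records (rows).
orders = [
    {"order_id": 1, "customer_id": "a", "amount": 20.0},
    {"order_id": 2, "customer_id": "b", "amount": 35.0},
    {"order_id": 3, "customer_id": "a", "amount": 15.0},
]
customers = [
    {"customer_id": "a", "region": "east"},
    {"customer_id": "b", "region": "west"},
]

# Selecting and filtering: keep two columns of the orders worth at least 20.
big_orders = [{"customer_id": o["customer_id"], "amount": o["amount"]}
              for o in orders if o["amount"] >= 20.0]

# Joining: attach each customer's region to their orders.
region_of = {c["customer_id"]: c["region"] for c in customers}
joined = [{**o, "region": region_of[o["customer_id"]]} for o in orders]

# Aggregating: total order amount per region.
totals = defaultdict(float)
for row in joined:
    totals[row["region"]] += row["amount"]

print(dict(totals))  # {'east': 35.0, 'west': 35.0}
```

The same pipeline in dplyr would be a `filter`/`select`, a `left_join`, and a `group_by` plus `summarise`.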
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
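Of the sampling-distribution tools above, the bootstrap is the most mechanical: resample the observed data with replacement many times and recompute the statistic of interest on each resample. A minimal sketch of a percentile bootstrap confidence interval follows (plain Python; the function name and the data are illustrative only, and the course itself works in R):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a sample statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic, and sort the results.
    boots = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
low, high = bootstrap_ci(sample)
print(low, high)  # an approximate 95% CI for the sample mean
```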
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
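Cross-validation, mentioned above, estimates test performance by partitioning the data into k folds and holding each fold out in turn. A minimal index-splitting sketch (plain Python; `kfold_indices` is a made-up helper, not a library function — in practice you would use an existing implementation such as caret's in R):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Split indices 0..n-1 into k (train, test) pairs for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized test sets
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in kfold_indices(10, k=5):
    print(sorted(test))  # each index lands in exactly one test fold
```

Fitting on `train` and scoring on `test` for each pair, then averaging the k scores, gives the cross-validation estimate.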
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: <br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: <br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=20Introduction to Data Science I2017-09-07T11:41:30Z<p>Dan Lizotte: </p>
<hr />
<div>== Course outline for COMPSCI 4414A/9637A ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesdays from 2:30PM – 4:30PM and Thursdays from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location TBA''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, to indicate which dataset you are using, and to slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g. if you find some useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': Hastie, T., Tibshirani, R., & Friedman, J. ''The Elements of Statistical Learning.'' An expanded treatment of the material in JWHT. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* Wickham, H. (2009). ''ggplot2: Elegant Graphics for Data Analysis.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Spector, P. (2008). ''Data Manipulation with R.'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** Bishop, C. M. (2006). ''Pattern Recognition and Machine Learning.'' Springer.<br />
:** Sutton, R. S., & Barto, A. G. (1998). ''Reinforcement Learning: An Introduction.'' MIT Press.<br />
:** Alpaydin, E. (2004). ''Introduction to Machine Learning.'' MIT Press.<br />
:** MacKay, D. J. C. (2003). ''Information Theory, Inference and Learning Algorithms.'' Cambridge University Press.<br />
:** Duda, R. O., Hart, P. E., & Stork, D. G. (2001). ''Pattern Classification.'' 2nd ed. Wiley & Sons.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The TensorFlow library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
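The four structured-data operations above (selecting, filtering, joining, aggregating) can be sketched using dplyr, the package listed under Software below. This is a minimal, hypothetical illustration — the data frames and column names are made up and are not course material:

```r
# Hypothetical example: the core dplyr verbs on two small made-up tables.
# Assumes the dplyr package is installed.
library(dplyr)

grades <- data.frame(student = c("a", "b", "c", "a", "b"),
                     course  = c("cs", "cs", "cs", "stat", "stat"),
                     mark    = c(80, 65, 90, 70, 85))
programs <- data.frame(student = c("a", "b", "c"),
                       program = c("MDA", "CS", "Stats"))

result <- grades %>%
  select(student, mark) %>%                 # selecting columns
  filter(mark >= 70) %>%                    # filtering rows
  inner_join(programs, by = "student") %>%  # joining tables
  group_by(program) %>%                     # aggregating by group
  summarise(mean_mark = mean(mark))
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline style possible; the dplyr "vignettes" linked below cover the same verbs in depth.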
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
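As a preview of the variance-estimation topic above, k-fold cross-validation can be sketched in a few lines of base R. This is a hypothetical illustration on simulated data, not course material:

```r
# Hypothetical sketch: 5-fold cross-validation of a linear model's
# prediction error, using only base R on simulated data.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
d <- data.frame(x = x, y = y)

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold assignment
mse <- numeric(k)
for (i in 1:k) {
  train <- d[fold != i, ]                  # fit on k-1 folds...
  test  <- d[fold == i, ]                  # ...evaluate on the held-out fold
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  mse[i] <- mean((test$y - pred)^2)
}
cv_error <- mean(mse)                      # cross-validation estimate of test MSE
```

The point of holding out each fold is that every observation is predicted by a model that never saw it, which removes the optimistic bias of evaluating on the training data.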
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4414:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: <br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: <br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>
<hr />
<div>== Course outline for COMPSCI 4414A/9637A ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday 2:30PM – 4:30PM and Thursday 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location TBA''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. An expanded version of the required text '''JWHT'''. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* Wickham, H. (2009). ''ggplot2: Elegant Graphics for Data Analysis.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R.'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
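The four structured-data operations above map directly onto verbs in the dplyr package (linked under Materials). A minimal sketch; the data frames and column names here are made up purely for illustration:<br />
<br />
```r
# Illustrative only: selecting, filtering, joining, and aggregating with dplyr.
library(dplyr)

marks <- data.frame(student = c("a", "b", "a", "c"),
                    quiz    = c(1, 1, 2, 2),
                    score   = c(8, 7, 9, 6))
info  <- data.frame(student = c("a", "b", "c"),
                    program = c("CS", "Stats", "CS"))

marks %>%
  select(student, score) %>%            # selecting columns
  filter(score >= 7) %>%                # filtering rows
  inner_join(info, by = "student") %>%  # joining tables on a key
  group_by(program) %>%                 # aggregating within groups
  summarise(mean_score = mean(score))
```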
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
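As a taste of the bootstrap topic above: a percentile bootstrap confidence interval takes only a few lines of base R (the sample here is simulated for illustration):<br />
<br />
```r
# Illustrative only: percentile-bootstrap 95% CI for a mean, in base R.
set.seed(1)
x <- rexp(50, rate = 1)    # a skewed sample; its true mean is 1

# Resample with replacement many times, recomputing the statistic each time
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))   # approximate 95% CI for the mean
```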
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
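For the variance-related methods above, k-fold cross-validation can be hand-rolled in a few lines of base R; this is a sketch (packages automate this, and the model and data below are invented for illustration):<br />
<br />
```r
# Illustrative only: 5-fold cross-validation estimate of test MSE for lm().
set.seed(1)
n <- 100; k <- 5
df <- data.frame(x = runif(n)); df$y <- 2 * df$x + rnorm(n, sd = 0.3)

fold <- sample(rep(1:k, length.out = n))   # random fold assignment
mse  <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = df[fold != i, ])              # train on k-1 folds
  held <- df[fold == i, , drop = FALSE]                 # held-out fold
  mean((held$y - predict(fit, held))^2)                 # test MSE on that fold
})
mean(mse)   # cross-validation estimate of prediction error
```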
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught during the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
==== Support Services ====<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: <br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: <br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Report Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=16Lecture Materials2017-08-31T16:14:44Z<p>Dan Lizotte: </p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
= Previous Offerings =<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira. Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: k-NN, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=15Lecture Materials2017-08-31T16:14:20Z<p>Dan Lizotte: Moved old lectures to W17</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will be posted here. They will be updated as the term progresses.<br />
<br />
== From W17 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf]]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [[http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf]]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira. Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference. [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: k-NN, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=14Introduction to Data Science I2017-08-31T16:13:06Z<p>Dan Lizotte: Update to schedule, location, etc. for Fall 2017</p>
<hr />
<div>== Course outline ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Because of the volume of requests I receive, I am not able to manage a wait list. Students will have to monitor the registration website for available spots. However, all are welcome to sit in the room if there is space.</span><br />
<br />
<!-- <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.</span>''' --><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 2:30PM – 4:30PM, and on Thursday from 2:30PM – 3:30PM<br />
* '''Place''': Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC-105B''']<br />
* '''Question and Collaboration Hour:''' Tuesday from 4:30pm - 5:30pm '''Location TBA''' <!-- in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']--><br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 6 Oct at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 27 Oct at 5pm <!-- End of 7th Week --><br />
* Project Draft Due Friday, 17 Nov at 5pm <!-- End of 11th Week --><br />
* Project Report Due Friday, 8 Dec at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 15 Dec at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before the end of '''Friday, 6 Oct at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
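To make the structured-data operations above concrete, here is a minimal sketch. It is written in Python with pandas purely for illustration (the course itself works in R; the tiny data frames below are invented for demonstration):<br />

```python
# Illustrative only: selecting, filtering, joining, and aggregating
# with pandas. Data frames are made up for this example.
import pandas as pd

orders = pd.DataFrame({"customer": ["a", "b", "a"],
                       "amount": [10.0, 5.0, 7.5]})
customers = pd.DataFrame({"customer": ["a", "b"],
                          "city": ["London", "Toronto"]})

# Filtering: keep rows with amount > 6
big = orders[orders["amount"] > 6]

# Joining: attach each customer's city to their orders
joined = orders.merge(customers, on="customer", how="left")

# Aggregating: total amount per city
totals = joined.groupby("city")["amount"].sum()
print(totals)
```

The corresponding dplyr verbs in R are filter(), left_join(), and group_by() followed by summarise().<br />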
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
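As one concrete example from the list above, the bootstrap approximates the sampling distribution of a statistic by resampling the observed data with replacement. A minimal sketch in Python (for illustration only; the course uses R, and the data values below are invented):<br />

```python
# Percentile bootstrap confidence interval for a sample mean.
# Data values are invented for illustration.
import random
random.seed(0)  # fixed seed so the sketch is reproducible

data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]

def mean(xs):
    return sum(xs) / len(xs)

# Resample with replacement B times, recomputing the statistic each time
B = 1000
boot_means = []
for _ in range(B):
    resample = [random.choice(data) for _ in data]
    boot_means.append(mean(resample))

# The 2.5th and 97.5th percentiles give a 95% percentile interval
boot_means.sort()
lo, hi = boot_means[int(0.025 * B)], boot_means[int(0.975 * B)]
print(lo, hi)
```

In R, the boot package automates this resample-and-recompute loop.<br />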
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, the bootstrap<br />
** Bias: Confounding, causal inference<br />
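The variance-estimation idea can be illustrated with a hand-rolled k-fold cross-validation sketch in Python (no libraries; the "model" here is just a predict-the-training-mean baseline, and the data are invented):<br />

```python
# k-fold cross-validation of a mean predictor, for illustration only.
def k_fold_mse(ys, k=4):
    """Average held-out squared error of a mean predictor over k folds."""
    fold_size = len(ys) // k
    errors = []
    for i in range(k):
        # Hold out fold i; train on the rest
        test = ys[i * fold_size:(i + 1) * fold_size]
        train = ys[:i * fold_size] + ys[(i + 1) * fold_size:]
        pred = sum(train) / len(train)              # "fit" on training folds
        errors += [(y - pred) ** 2 for y in test]   # score on held-out fold
    return sum(errors) / len(errors)

print(k_fold_mse([1.0, 2.0, 1.5, 2.5, 1.8, 2.2, 1.6, 2.4]))
```

Because every point is held out exactly once, the estimate uses all of the data without ever scoring a model on points it was fit to.<br />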
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 31 Oct. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: <br />
** 12 Sep - Lectures: <br />
* 14 Sep - Lectures: <br />
** 19 Sep - Lectures: <br />
* 21 Sep - Lectures: <br />
** 26 Sep - Lectures: <br />
* 28 Sep - Lectures: <br />
** 3 Oct - Lectures: <br />
* 5 Oct - '''Pick Brainstorming Slot by 6 Oct 5pm''' - Lectures: <br />
** ''10 Oct - '''Fall Reading Week''' ''<br />
* ''12 Oct - '''Fall Reading Week''' ''<br />
** 17 Oct - Lectures: <br />
* 19 Oct - Lectures: <br />
** 24 Oct - Lectures: <br />
* 26 Oct - '''Project Proposal Due 27 Oct at 5pm''' - Lectures: <br />
** 31 Oct - Lectures: <br />
* 2 Nov - Lectures: <br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - '''Project Draft Due 17 Nov at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Document Due Friday 8 December 5pm'''<br />
* '''Reviews (graduate students only) Due Thursday 15 December 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=MediaWiki:Mainpage&diff=13MediaWiki:Mainpage2017-08-31T14:20:37Z<p>Dan Lizotte: Update to new main page; course name</p>
<hr />
<div>Introduction to Data Science I</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=11Introduction to Data Science I2017-08-31T14:18:27Z<p>Dan Lizotte: Dan Lizotte moved page Main Page to Introduction to Data Science I: Better name</p>
<hr />
<div>== Course outline ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.'''</span><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 11:30AM – 1:30PM, and on Thursday from 3:30PM – 4:30PM<br />
* '''Place''': Talbot College [http://accessibility.uwo.ca/doc/floorplan/bf-tc.pdf '''TC342''']<br />
* '''Question and Collaboration Hour:''' Thursday from 4:30pm - 5:30pm in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']<br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 3 February at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 17 Feb at 5pm <!-- End of 6th Week --><br />
* Project Draft Due Friday, 17 Mar at 5pm <!-- End of 10th Week --><br />
* Project Report Due Friday, 7 Apr at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 13 Apr at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 3 Feb at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 2 Mar. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Introduction to Data Science, Data Cleaning<br />
** 12 Sep - Lectures: Re-introduction to Statistics<br />
* 14 Sep - Lectures: Re-introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning<br />
** 26 Sep - Lectures: Supervised Learning<br />
* 28 Sep - Lectures: Cancelled<br />
** 3 Oct - Lectures: Cancelled<br />
* 5 Oct - '''Pick Brainstorming Slot''' - Lectures: Linear Models<br />
** 17 Oct - Lectures: Linear Models<br />
* 19 Oct - Lectures: Linear Models / Nonlinear Models<br />
** 24 Oct - Lectures: Nonlinear Models<br />
* 26 Oct - '''Project Proposal Due 17 Feb at 5pm''' - TBA<br />
** 31 Oct - TBA <br />
* 2 Nov - TBA<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - '''Project Draft Due 17 Mar at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Report Due Friday, 7 April at 5pm'''<br />
* '''Reviews Due Thursday 13 April 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Main_Page&diff=12Main Page2017-08-31T14:18:27Z<p>Dan Lizotte: Dan Lizotte moved page Main Page to Introduction to Data Science I: Better name</p>
<hr />
<div>#REDIRECT [[Introduction to Data Science I]]</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=MediaWiki:Sidebar&diff=10MediaWiki:Sidebar2017-08-01T20:22:23Z<p>Dan Lizotte: Updating sidebar</p>
<hr />
<div><br />
* navigation<br />
** mainpage|mainpage-description<br />
*** Project_Guidelines|Project Guidelines<br />
*** Data_and_Software|Data and Software<br />
*** Lecture_Materials|Lecture Materials<br />
** recentchanges-url|recentchanges<br />
** randompage-url|randompage<br />
** helppage|help<br />
* SEARCH<br />
* TOOLBOX<br />
* LANGUAGES</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=9Introduction to Data Science I2017-08-01T19:36:25Z<p>Dan Lizotte: </p>
<hr />
<div>== Course outline ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">'''Therefore, all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science, and who are not in the MDA programme, must e-mail me a half-page proposal sketch describing the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on available space and the quality of the proposal sketches.'''</span><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable to their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high-quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 11:30AM – 1:30PM, and on Thursday from 3:30PM – 4:30PM<br />
* '''Place''': Talbot College [http://accessibility.uwo.ca/doc/floorplan/bf-tc.pdf '''TC342''']<br />
* '''Question and Collaboration Hour:''' Thursday from 4:30pm - 5:30pm in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']<br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 3 February at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 17 Feb at 5pm <!-- End of 6th Week --><br />
* Project Draft Due Friday, 17 Mar at 5pm <!-- End of 10th Week --><br />
* Project Report Due Friday, 7 Apr at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 13 Apr at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, to indicate which dataset you are using, and to slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g., if you find useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page before end of '''Friday, 3 Feb at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, ''Pattern Recognition and Machine Learning'', Springer, 2006.<br />
:** R. S. Sutton and A. G. Barto, ''Reinforcement Learning: An Introduction'', MIT Press, 1998.<br />
:** Ethem Alpaydin, ''Introduction to Machine Learning'', MIT Press, 2004.<br />
:** David J. C. MacKay, ''Information Theory, Inference and Learning Algorithms'', Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, ''Pattern Classification'', 2nd ed., Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
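To make the select/filter/join/aggregate pattern concrete, here is a minimal sketch over hypothetical tables (the `patients` and `visits` data are invented for illustration) using only the Python standard library; in the course itself these operations are typically done in R with dplyr:<br />

```python
from collections import defaultdict

# Hypothetical tables (invented for illustration).
patients = [
    {"id": 1, "name": "Ana", "city": "London"},
    {"id": 2, "name": "Ben", "city": "Toronto"},
]
visits = [
    {"patient_id": 1, "cost": 100.0},
    {"patient_id": 1, "cost": 50.0},
    {"patient_id": 2, "cost": 75.0},
]

# Select/filter: keep only the London patients.
london = [p for p in patients if p["city"] == "London"]

# Join: attach each visit to its patient's record via the id key.
by_id = {p["id"]: p for p in patients}
joined = [{**by_id[v["patient_id"]], **v} for v in visits]

# Aggregate: total visit cost per patient name.
totals = defaultdict(float)
for row in joined:
    totals[row["name"]] += row["cost"]

print(dict(totals))  # {'Ana': 150.0, 'Ben': 75.0}
```

The same pipeline in dplyr would be a chain of `filter`, `inner_join`, `group_by`, and `summarise` calls.<br />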
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
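As a concrete illustration of the bootstrap idea from this unit, a minimal sketch with made-up data (Python standard library only; in R the same job is done with a `replicate`/`sample` loop or the `boot` package):<br />

```python
import random
import statistics

random.seed(0)  # reproducible resampling

# Made-up sample of n = 8 observations.
data = [2.1, 3.5, 2.9, 4.0, 3.3, 2.7, 3.8, 3.1]

# Bootstrap: resample with replacement, recompute the statistic
# (here the mean) on each resample, and study its distribution.
boot_means = sorted(
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(2000)
)

# Simple 95% percentile interval for the mean: the 2.5% and 97.5%
# quantiles of the bootstrap distribution.
lo, hi = boot_means[49], boot_means[1949]
print(round(lo, 2), round(hi, 2))
```

The interval should bracket the sample mean (about 3.18 here); the spread of `boot_means` estimates the sampling variability of the mean without any normality assumption.<br />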
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
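To ground the regression half of this unit, a toy supervised-learning example: fitting a line by ordinary least squares on made-up data (closed form for a single feature; in R this is just `lm(y ~ x)`):<br />

```python
# Ordinary least squares for one feature: fit y = a + b*x in closed form.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]  # made up, roughly y = 1 + 2x plus noise

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope = covariance of x and y over variance of x; intercept from the means.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
print(round(a, 2), round(b, 2))  # close to the generating a = 1, b = 2
```

Classification replaces the numeric target with a label and the squared-error criterion with one suited to labels, but the fit/predict structure is the same.<br />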
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
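The cross-validation idea above, sketched as a skeleton with a deliberately trivial "model" (the `fit`/`score` functions are hypothetical stand-ins; standard-library Python):<br />

```python
# k-fold cross-validation skeleton: hold out each fold in turn,
# fit on the rest, score on the held-out part, average the scores.
def cross_validate(data, k, fit, score):
    folds = [list(range(i, len(data), k)) for i in range(k)]
    scores = []
    for held_out in folds:
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in held_out]
        model = fit(train)
        scores.append(score(model, test))
    return sum(scores) / k

# Trivial "model": predict the training mean; score by mean squared
# error on the held-out fold (lower is better).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fit = lambda train: sum(train) / len(train)
score = lambda m, test: sum((y - m) ** 2 for y in test) / len(test)
print(cross_validate(data, 3, fit, score))
```

The key point is that every score is computed on data the model never saw during fitting, which is what makes the averaged score an estimate of out-of-sample performance.<br />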
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
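As a toy illustration of clustering, a minimal 1-D k-means loop on made-up points with two obvious clusters (real projects would use R's `kmeans` or scikit-learn rather than this sketch):<br />

```python
# Toy 1-D k-means: alternate assignment and mean-update steps.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # two obvious made-up clusters
centers = [0.0, 6.0]                     # crude initial guesses

for _ in range(10):
    # Assignment: each point joins the group of its nearest center.
    groups = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda j: abs(p - centers[j]))
        groups[nearest].append(p)
    # Update: move each center to the mean of its group.
    centers = [sum(g) / len(g) for g in groups]

print(centers)  # converges near [1.0, 5.07]
```

Note this is unsupervised: no labels are used, and the "right" number of clusters is an input, not an output.<br />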
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 2 Mar. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an<br />
off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Introduction to Data Science, Data Cleaning<br />
** 12 Sep - Lectures: Re-introduction to Statistics<br />
* 14 Sep - Lectures: Re-introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning<br />
** 26 Sep - Lectures: Supervised Learning<br />
* 28 Sep - Lectures: Cancelled<br />
** 3 Oct - Lectures: Cancelled<br />
* 5 Oct - '''Pick Brainstorming Slot''' - Lectures: Linear Models<br />
** 17 Oct - Lectures: Linear Models<br />
* 19 Oct - Lectures: Linear Models / Nonlinear Models<br />
** 24 Oct - Lectures: Nonlinear Models<br />
* 26 Oct - '''Project Proposal Due 17 Feb at 5pm''' - TBA<br />
** 31 Oct - TBA <br />
* 2 Nov - TBA<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - '''Project Draft Due 17 Mar at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Report Due Friday, 7 April at 5pm'''<br />
* '''Reviews Due Thursday 13 April 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Data_and_Software&diff=8Data and Software2017-08-01T19:35:10Z<p>Dan Lizotte: Initial copy-and-paste of old cs4437 wiki</p>
<hr />
<div>__FORCETOC__<br />
== Data ==<br />
<br />
The purpose of this section is to keep track of all the data used in the class. There are two tables below, titled '''Collections of datasets''' and '''Specific datasets'''. The '''Collections''' table is intended to be a resource for students to go and find interesting datasets for their projects, and the '''Specific Datasets''' table is intended to be a record of who found the dataset (the Link Contributor) and who is using it for their project (the User(s)).<br />
<br />
It is the responsibility of everyone in the class to keep this page up to date. '''Once you have settled on the data set(s) you will use for your project, make an entry in the Specific Datasets table and put your name in the User field.''' If you find other resources/data that you won't be using, make an entry in the appropriate table. (For entries in the Specific Datasets table, leave the User field empty.)<br />
<br />
{| class="wikitable"<br />
|+ Collections of datasets<br />
|-<br />
! Collection and Links<br />
! Substantive Field<br />
! Link Contributor<br />
|-<br />
| [https://www.kaggle.com/ Kaggle]<br />
| Various Datasets <br />
| Fadi AlMahamid<br />
|-<br />
| [http://twitter.com/CoolDatasets @CoolDatasets]<br />
| Many<br />
| Dan Lizotte<br />
|-<br />
| [http://mldata.org/ mldata.org]<br />
| Many<br />
| Dan Lizotte<br />
|-<br />
| [http://arxiv.org/abs/0906.2173 Data Mining and Machine Learning in Astronomy] (background paper)<br />
| Astronomy<br />
| Dan Lizotte<br />
|-<br />
| [http://ncia.nci.nih.gov/ncia/ National Biomedical Imaging Archive] [http://www.springerlink.com/content/f0t13w6664403546/ Relevant paper]<br />
| Medical (Imaging)<br />
| Andr&eacute; Carrington<br />
|-<br />
| [https://wiki.nci.nih.gov/display/CIP/CIP+Survey+of+Biomedical+Imaging+Archives CIP Survey of Biomedical Imaging Archives]<br />
| Medical (Imaging)<br />
| Andr&eacute; Carrington<br />
|-<br />
| [http://www.birncommunity.org/resources/data/ Biomedical Informatics Research Network (BIRN)]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|-<br />
| [http://www.loni.ucla.edu/Research/Databases/ Laboratory of Neuro Imaging (LONI) Image Database]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|-<br />
| [http://mimic.physionet.org/ MIMIC Databases]<br />
| Medical<br />
| Dan Lizotte<br />
|-<br />
| [http://archive.ics.uci.edu/ml/ UCI datasets]<br />
| Various. However note this '''warning:''' Many are uninteresting; donations prior to 2007 are not allowed for projects. Clarification: It's okay if the data describe events before 2007 or were collected before 2007, but their donation date should be 2007 or later.<br />
| Dan Lizotte<br />
|-<br />
| [https://pslcdatashop.web.cmu.edu PSLC DataShop]<br />
| Education<br />
| Jason Baek<br />
|-<br />
| [http://snap.stanford.edu/data/ http://snap.stanford.edu/data/]<br />
| The Stanford Dataset Collection<br />
| Zak Blacher<br />
|-<br />
| [http://csmining.org/index.php/data.html Datasets on data mining and cybersecurity]<br />
| The 4th International Workshop on Data Mining and Cybersecurity<br />
| Wendell Wang<br />
|-<br />
| [http://talkbank.org/ TalkBank]<br />
| Language<br />
| Jason Baek<br />
|-<br />
| [https://bitly.com/bundles/hmason/1 Bitly research quality data sets]<br />
| Various<br />
| Jason Baek<br />
|-<br />
| [http://www.icpsr.umich.edu/icpsrweb/SAMHDA/browse Substance Abuse & Mental Health Data Archive]<br />
| Mental Health<br />
| Rhiannon Rose<br />
|-<br />
| [http://neuinfo.org/ Neuroscience Information Framework]<br />
| Medical<br />
| Rhiannon Rose<br />
|-<br />
| [http://www.gapminder.org/data/ Gapminder]<br />
| Various world data (economic, medical, environmental)<br />
| Robert Suderman<br />
|-<br />
| [https://www.cancerimagingarchive.net/ Cancer Imaging Archive]<br />
| Medical<br />
| Robert Suderman<br />
|-<br />
|[http://crcns.org/data-sets Collaborative Research in Computational Neuroscience]<br />
| Neuroscience<br />
| Xiang Ji<br />
|-<br />
| [http://www.icpsr.umich.edu/icpsrweb/NACJD/ U.S. Criminal Justice Archive]<br />
| Criminology<br />
| Oliver Trujillo<br />
|-<br />
| [http://www.preflib.org/ Preflib.org]<br />
| Elections<br />
| John A. Doucette<br />
|-<br />
| [http://www.cbioportal.org/ cBioPortal]<br />
| Cancer Genomics<br />
| Katherina Baranova<br />
|-<br />
| [http://www.ncbi.nlm.nih.gov/geo/ NCBI GEO] <br />
| Functional Genomics Data<br />
| Katherina Baranova<br />
|-<br />
| [https://www.ebi.ac.uk/arrayexpress/ Array Express]<br />
| Functional Genomics Data<br />
| Katherina Baranova<br />
|-<br />
| [http://www.nature.com/articles/sdata2016126 Eye movement dataset]<br />
| Eye movement data<br />
| Diana Varyvoda<br />
|}<br />
<br />
{| class="wikitable"<br />
|+ Specific data sets<br />
|-<br />
! Dataset and Links<br />
! Substantive Field<br />
! Link Contributor<br />
! User(s)<br />
|-<br />
| [http://www.physionet.org/mimic2/mimic2_waveform_overview.shtml MIMIC II Waveform]<br />
[http://mimic.physionet.org/ MIMIC Home]<br />
| Medicine<br />
| Dan Lizotte<br />
| <br />
|-<br />
| [http://www.physionet.org/mimic2/mimic2_clinical_overview.shtml MIMIC II Clinical]<br />
[http://mimic.physionet.org/ MIMIC Home]<br />
| Medicine<br />
| Dan Lizotte<br />
| <br />
|-<br />
| [http://imaging.cancer.gov/programsandresources/informationsystems/lidc Lung Image Database Consortium (LIDC)]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|<br />
|-<br />
| [http://figment.csee.usf.edu/Mammography/Database.html Digital Database for Screening Mammography (DDSM)]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|<br />
|-<br />
| [http://www.mammoimage.org/databases/ Mammographic Image Analysis Society (MIAS)]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|<br />
|-<br />
| [https://www2.ncvc.go.jp/ Medical Image Reference Center (MEDIREC)]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|<br />
|-<br />
| [http://mouldy.bic.mni.mcgill.ca/brainweb/ Simulated Brain Database (SBD)]<br />
| Medicine<br />
| Andr&eacute; Carrington<br />
|<br />
|-<br />
| [http://reality.media.mit.edu/download.php MIT Reality Mining project]<br />
| Privacy<br />
| Sarah Harvey<br />
|<br />
|-<br />
| [http://research.microsoft.com/apps/pubs/?id=152176 Microsoft Geolife GPS trajectories], [http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/ download dataset]<br />
| Privacy<br />
| Sarah Harvey<br />
|<br />
|-<br />
| [http://ita.ee.lbl.gov/html/contrib/WorldCup.html World Cup traces 1998]<br />
| Cloud Computing<br />
| Noha Elprince<br />
| <br />
|-<br />
| [http://ridge.cs.umn.edu/pltraces.html Planet Lab traces 2008]<br />
| Cloud Computing<br />
| Noha Elprince<br />
| <br />
|-<br />
| [http://www.cs.huji.ac.il/labs/parallel/workload/logs.html Real Computing Center Workload logs 2006]<br />
| Cloud Computing<br />
| Noha Elprince<br />
| <br />
|-<br />
| [http://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 Chicago Crimes 2001-Present]<br />
| Criminology<br />
| Lloyd Rowat<br />
| M.Alarbi<br />
|-<br />
| [http://mobblog.cs.ucl.ac.uk/datasets/ A collection of various datasets]<br />
| ''Miscellaneous''<br />
| Bahareh Sarrafzadeh<br />
| <br />
|-<br />
| [http://www.sdss.org/dr7/ Sloan Digital Sky Survey Data Release 7]<br />
| ''Astronomy''<br />
| Michael Cormier<br />
| <br />
|-<br />
| [http://data.galaxyzoo.org/ Galaxy Zoo 1 Data Release]<br />
| ''Astronomy''<br />
| Michael Cormier<br />
| <br />
|-<br />
| [http://www.basketball-reference.com/boxscores NBA box scores]<br />
| ''Sports analytics''<br />
| Andrew Arnold<br />
| <br />
|-<br />
| [http://www.covers.com/odds/basketball/nba-spreads.aspx NBA betting lines]<br />
| ''Sports gambling''<br />
| Andrew Arnold<br />
| <br />
|-<br />
| [http://www.census1871.ca 1871] - 5% sample<br />
[http://www.prdh.umontreal.ca/census/en/main.aspx 1881] - whole census public<br />
| Canadian Census<br />
| Laura Richards<br />
| <br />
|-<br />
| [http://face-place.org Face images and movies]<br />
| ''Cognitive Science''<br />
| Jason Baek<br />
| <br />
|-<br />
| [http://physionet.org/physiobank/database/tpehgdb/ Term-Preterm EHG Database]<br />
| ''Medical''<br />
| Rhiannon Rose<br />
| <br />
|-<br />
| [http://www.icpsr.umich.edu/icpsrweb/SAMHDA/ssvd/series/97/studies?paging.startRow=1 Drug Abuse Warning Network (DAWN)]<br />
| ''Mental Health''<br />
| Rhiannon Rose<br />
| <br />
|-<br />
| [http://www.oasis-brains.org/ Open Access Series of Imaging Studies (OASIS)]<br />
| ''Dementia''<br />
| Rhiannon Rose<br />
| <br />
|-<br />
| [http://www.peterjbentley.com/heartchallenge/ Classifying Heart Sounds Challenge]<br />
| ''Medicine''<br />
| Valerie Sugarman (via [http://www.mldata.org mldata.org])<br />
| <br />
|-<br />
| [http://pollingreport.com/ Presidential Polling Predictor]<br />
| ''Polling''<br />
| Abdelhamid El Bably<br />
| <br />
|-<br />
| [http://archive.ics.uci.edu/ml/datasets/Bank+Marketing Bank Marketing]<br />
| ''Marketing''<br />
| Shiwei Li<br />
| <br />
|-<br />
|[http://archive.ics.uci.edu/ml/machine-learning-databases/00231/ Physical Activity Monitoring]<br />
| ''Health''<br />
| Fil Krynicki (via the [http://archive.ics.uci.edu/ml/datasets/ UCI repository])<br />
| <br />
|-<br />
| [http://www.icpsr.umich.edu/icpsrweb/SAMHDA/studies/30122 Treatment Episode Data Set - Discharges (TEDS-D)]<br />
| ''Mental Health''<br />
| Rhiannon Rose<br />
|<br />
|-<br />
| [http://www.preflib.org/election/irish.php Ranked ballots from two districts of the 2002 National Election in Ireland]<br />
| ''Election Ballots''<br />
| John A. Doucette<br />
|<br />
|-<br />
| [http://www.preflib.org/election/debian.php Ranked ballots from the Debian Project's leadership elections, 7 years]<br />
| ''Election Ballots''<br />
| John A. Doucette<br />
|<br />
|-<br />
| [http://www.preflib.org/election/burlington.php Ranked ballots for the mayoral election in Burlington Vermont]<br />
| ''Election Ballots''<br />
| John A. Doucette<br />
|<br />
|-<br />
| [http://www.preflib.org/election/glasgow.php Ranked ballots from the Glasgow city council elections, 2007]<br />
| ''Election Ballots''<br />
| John A. Doucette<br />
|<br />
|-<br />
| [http://www.preflib.org/election/ers.php Ranked ballots from 86 elections in various small organizations]<br />
| ''Election Ballots''<br />
| John A. Doucette<br />
|<br />
|-<br />
| [http://www.football-data.co.uk/englandm.php Match Statistics for different leagues in Europe]<br />
| ''Football League Statistics''<br />
| John Morcos<br />
|<br />
|-<br />
| [http://developer.crunchbase.com/ CrunchBase APIs]<br />
| ''Statistics about different companies''<br />
| Sandeep Chaudhary<br />
|<br />
|-<br />
| [http://www.voxforge.org/ VoxForge]<br />
| ''Speech''<br />
| Wei-shou Hsu<br />
|<br />
|-<br />
| [http://accent.gmu.edu/ Speech Accent Archive]<br />
| ''Speech''<br />
| Wei-shou Hsu<br />
|<br />
|-<br />
| [http://lear.inrialpes.fr/people/jegou/data.php Misc. Outdoor Images]<br />
| ''Images''<br />
| Zachary Frenette<br />
|<br />
|-<br />
| [http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset Bike Sharing]<br />
| ''Demand forecasting''<br />
| Zhongyu Peng<br />
| Zhongyu Peng<br />
|-<br />
| [https://www.cs.cmu.edu/~./enron/ Enron Email Dataset]<br />
| ''Email Classification''<br />
| Yunjia Sun<br />
| <br />
|-<br />
| [https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease Kidney]<br />
| ''Medical''<br />
| Dan Lizotte<br />
| Vivian Tan<br />
|-<br />
| [https://censys.io/ Censys.io]<br />
| ''Internet Scanning''<br />
| Jordan Gould<br />
|<br />
|-<br />
| [https://www.kaggle.com/uciml/student-alcohol-consumption/ Student-alcohol-consumption]<br />
| ''Sociology''<br />
| Fadi AlMahamid<br />
| Elham Harirpoush<br />
|}<br />
<br />
== Software ==<br />
<br />
The language of instruction for this course is [https://www.r-project.org R], which is freely available, as is the associated development environment [https://www.rstudio.com RStudio]. However, students may write their own code in the language of their choice, and/or make use of other freely available software. You '''may not use [http://sourceforge.net/projects/weka/ WEKA]''' unless you send an e-mail to Dan with a rationale for why you need WEKA and not something else. The main rationale for this rule is that many projects will have data too large for WEKA, which results in students [http://en.wiktionary.org/wiki/paint_oneself_into_a_corner painting themselves into a corner].<br />
<br />
'''Regardless of the software used, the project report must demonstrate that the student understands exactly what the software is doing.'''<br />
<br />
* [https://github.com/scikit-learn-contrib/imbalanced-learn imbalanced-learn] for scikit-learn<br />
* [http://www.tableau.com/ Tableau]<br />
* [http://openrefine.org/ Open Refine] (previously Google Refine)<br />
* [http://www.jjj.de/crs4/dmc.c dmc.c] Dynamic Markov Compression - useful for building compression classifiers.<br />
* [http://www.umiacs.umd.edu/~hal/megam/version0_3/ MEGAM] Very fast Maximum Entropy classifier.<br />
* [http://pandas.pydata.org/ pandas] High-performance, easy-to-use data structures and data analysis tools for the Python programming language.<br />
* [http://scikit-learn.org/stable/ scikit-learn] python module<br />
* [http://mlpy.sourceforge.net/ mlpy]<br />
* [http://blog.yhathq.com/posts/image-classification-in-Python.html Code] for PCA and image classification in Python (thanks to Rob Suderman for the link)<br />
* [http://jmlr.csail.mit.edu/mloss/ JMLR's machine learning open source software page]<br />
* [http://www.mloss.org mloss.org]<br />
* [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ libsvm], [http://svmlight.joachims.org/ svmlight]<br />
* [http://www.csie.ntu.edu.tw/~cjlin/liblinear/ liblinear] [http://hunch.net/~vw/ Vowpal Wabbit] for large-scale linear models<br />
* [http://www.cs.cmu.edu/~mccallum/bow/ Bow] for NLP tasks<br />
* Other potentially useful software possibly more specifically suited to NLP tasks: [http://code.google.com/p/factorie/ factorie], [http://www.nltk.org/ nltk], [http://mallet.cs.umass.edu/ mallet], [https://code.google.com/p/language-detection/ language detection library]<br />
* [http://graphlab.org/ GraphLab]<br />
* [http://mahout.apache.org/ mahout], a machine learning library for big data built on top of [http://hadoop.apache.org/ hadoop]<br />
* [http://clopinet.com/CLOP/ CLOP] package for matlab<br />
* [http://mldemos.epfl.ch/ MLDemos] for visualizing and understanding how algorithms work<br />
* [http://docs.opencv.org/trunk/doc/py_tutorials/py_tutorials.html OpenCV] Machine Learning Methods relating to Images and Video (Zachary Frenette)</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=7Introduction to Data Science I2017-08-01T19:34:37Z<p>Dan Lizotte: Added link to Data and Software page</p>
<hr />
<div>== Course outline ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science must e-mail me a half-page proposal sketch of the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and on the quality of the proposal sketches.</span>'''<br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which specific DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their findings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is aimed at fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 11:30AM – 1:30PM, and on Thursday from 3:30PM – 4:30PM<br />
* '''Place''': Talbot College [http://accessibility.uwo.ca/doc/floorplan/bf-tc.pdf '''TC342''']<br />
* '''Question and Collaboration Hour:''' Thursday from 4:30pm - 5:30pm in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']<br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 3 February at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 17 Feb at 5pm <!-- End of 6th Week --><br />
* Project Draft Due Friday, 17 Mar at 5pm <!-- End of 10th Week --><br />
* Project Report Due Friday, 7 Apr at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 13 Apr at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, to indicate which dataset you are using, and to slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki (e.g. if you find some useful software or other resources).<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 3 Feb at 5pm''' or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* The [[Data and Software]] Page<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
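The bootstrap in the list above is easy to demonstrate in a few lines. R is the language of instruction for the course; purely as a language-neutral illustration, here is a percentile bootstrap confidence interval sketched in plain Python, with an invented toy sample (the function name and data are for illustration only):<br />

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement n_boot times and
    # recompute the statistic on each resample.
    boots = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles.
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [2.1, 2.5, 2.2, 3.0, 2.8, 2.4, 2.9, 2.6]
low, high = bootstrap_ci(sample)
```

The same idea in R is a few lines built from `sample(x, replace = TRUE)` and `quantile()`.<br />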
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
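To make the cross-validation item above concrete: k-fold cross-validation partitions the data into k folds, trains on all but one fold, scores on the held-out fold, and averages the k held-out scores. A minimal sketch in plain Python (the course itself uses R; `fit` and `score` here are hypothetical stand-ins for any learner and any performance measure):<br />

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_score(fit, score, X, y, k=5):
    """Average held-out score over k folds.
    `fit(X, y)` returns a model; `score(model, X, y)` returns a number."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held = set(fold)
        Xtr = [x for i, x in enumerate(X) if i not in held]
        ytr = [v for i, v in enumerate(y) if i not in held]
        Xte = [X[i] for i in fold]
        yte = [y[i] for i in fold]
        scores.append(score(fit(Xtr, ytr), Xte, yte))
    return sum(scores) / k

# Toy example: the "model" is just the training mean of y,
# and the score is negative mean squared error on the held-out fold.
fit = lambda X, y: sum(y) / len(y)
score = lambda m, X, y: -sum((v - m) ** 2 for v in y) / len(y)
X = list(range(20))
y = [0.0] * 10 + [1.0] * 10
cv = cross_val_score(fit, score, X, y, k=5)
```

The key point the lecture materials develop is that the held-out score estimates generalization performance, whereas the training score does not.<br />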
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
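As a taste of the clustering topic above: k-means (Lloyd's algorithm) alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points. A minimal one-dimensional sketch in plain Python (the course uses R, where `kmeans()` is built in; the data here are invented for illustration):<br />

```python
import random

def kmeans_1d(xs, k=2, iters=20, seed=0):
    """Lloyd's algorithm on 1-D data: assign each point to its nearest
    centre, then move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(xs, k)  # initialize centres at k distinct points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: (x - centres[j]) ** 2)
            clusters[j].append(x)
        # Empty clusters keep their previous centre.
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sorted(centres)

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centres = kmeans_1d(data, k=2)
```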
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 2 Mar. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving it using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high-quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
The success of the course as a useful learning experience hinges on the active participation and effort of the students. '''Students are expected to attend all classes''' and to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Introduction to Data Science, Data Cleaning<br />
** 12 Sep - Lectures: Re-introduction to Statistics<br />
* 14 Sep - Lectures: Re-introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning<br />
** 26 Sep - Lectures: Supervised Learning<br />
* 28 Sep - Lectures: Cancelled<br />
** 3 Oct - Lectures: Cancelled<br />
* 5 Oct - '''Pick Brainstorming Slot''' - Lectures: Linear Models<br />
** 17 Oct - Lectures: Linear Models<br />
* 19 Oct - Lectures: Linear Models / Nonlinear Models<br />
** 24 Oct - Lectures: Nonlinear Models<br />
* 26 Oct - '''Project Proposal Due 17 Feb at 5pm''' - TBA<br />
** 31 Oct - TBA <br />
* 2 Nov - TBA<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - '''Project Draft Due 17 Mar at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Document Due Friday 7 April 5pm'''<br />
* '''Reviews Due Thursday 13 April 5pm'''</div>Dan Lizottehttps://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Lecture_Materials&diff=6Lecture Materials2017-08-01T19:31:58Z<p>Dan Lizotte: Initial copy-and-paste of old cs4437 wiki</p>
<hr />
<div>= Lecture Materials =<br />
Materials from the most recent run of the course will persist here. They will be updated as the term progresses.<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/2_Data%20Preparation/data_preparation.pdf pdf] ]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/3_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.pdf pdf]]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/4_Supervised%20Learning/supervised_learning.pdf pdf]]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/6_Linear%20Models/linear_models.pdf pdf] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/7_Nonlinear%20Models/nonlinear_models_continuous.html html] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/8_Unsupervised%20Learning/unsupervised-learning_continuous.html html] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.html slides] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W17/Lectures/D_Performance%20Measures/performance_measures_continuous.html html] ]<br />
<br />
* Information Visualisation<br />
:* [https://www.youtube.com/watch?v=oJNY5eUbSQI Lecture] on what I would call "Principles of Information Visualisation"<br />
:* [https://public.tableau.com/en-us/s/gallery Inspiration] from the Tableau public gallery. (Recall Tableau is free for students.)<br />
<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
<br />
== From W16 ==<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/1_Welcome/welcome.pdf Welcome]<br />
* Data Preparation [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/2_Data%20Preparation/data_preparation.Rmd Rmd] ]<br />
* Google Flu Trends [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/Google%20Flu%20Trends.pdf pdf] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/3_Google%20Flu%20Trends/google_flu_trends.Rmd Rmd] ]<br />
:* Flu trends papers: On [https://owl.uwo.ca/ OWL]<br />
* (Re)introduction to Statistics [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/4_(Re)introduction%20to%20Statistics/reintroduction_to_statistics.Rmd Rmd] ]<br />
* Supervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/5_Supervised%20Learning/supervised_learning.Rmd Rmd] ]<br />
* Linear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/6_Linear%20Models/linear_models.Rmd Rmd] ]<br />
* Nonlinear Models [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/7_Nonlinear%20Models/nonlinear_models.Rmd Rmd] ]<br />
* Unsupervised Learning [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/8_Unsupervised%20Learning/unsupervised-learning.Rmd Rmd] ]<br />
* Visual Analytics '''Guest Lecture''' by Arman Didandeh [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/A_Visual%20Analytics/InfoViz4DataScience.pdf pdf] ]<br />
* MapReduce '''Guest Lecture''' by Hanan Lutfiyya [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/B_MapReduce/mapReduce.pdf pdf] ]<br />
* Performance Measures and Class Imbalance [ [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.html html] | [http://www.csd.uwo.ca/~dlizotte/teaching/cs4437_W16/Lectures/D_Performance%20Measures/performance_measures.Rmd Rmd] ]<br />
* Feature Selection and Construction '''Video Lectures''' by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
<br />
= Tutorials and Summaries = <br />
<br />
* [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
* [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
<br />
= Other Resources =<br />
<br />
* [http://cs229.stanford.edu/materials.html Materials from Stanford's ML class] by Andrew Ng. Excellent notes.<br />
<br />
* [http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf Classic tutorial on HMMs by Rabiner]<br />
<br />
* <span id="colinbib">Bibliography</span>/suggested reading from Colin Cherry's lecture:<br />
**Structured Perceptron<br />
***Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002. [http://www.aclweb.org/anthology-new/W/W02/W02-1001.pdf]<br />
**Some applications:<br />
***Scott Miller; Jethran Guinness; Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. NAACL 2004. [http://www.aclweb.org/anthology/N/N04/N04-1043.pdf]<br />
***Robert C. Moore. A Discriminative Framework for Bilingual Word Alignment. EMNLP 2005. [http://www.aclweb.org/anthology-new/H/H05/H05-1011.pdf]<br />
**Passive Aggressive Algorithm and MIRA:<br />
***Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 2003. [http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03a.html]<br />
***Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 2006. [http://jmlr.csail.mit.edu/papers/v7/crammer06a.html]<br />
**Applications (of MIRA):<br />
***Ryan McDonald; Koby Crammer; Fernando Pereira Online Large-Margin Training of Dependency Parsers. ACL 2005. [http://www.aclweb.org/anthology/P/P05/P05-1012.pdf]<br />
***Sittichai Jiampojamarn; Colin Cherry; Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. ACL 2008. [http://www.aclweb.org/anthology/P/P08/P08-1103.pdf]<br />
**Pegasos<br />
***Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007. [http://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf]<br />
**Structured SVM:<br />
***I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. ICML 2004. [http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf]<br />
***B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. Neural Information Processing Systems Conference [http://www.seas.upenn.edu/~taskar/pubs/mmmn.pdf]<br />
<br />
== Previous Incarnations of This Course: CS886 at the University of Waterloo ==<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/02-1-logreg-nb-svm.pdf Lecture 3,4,5,6] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-knn.pdf Lecture 7] - k-NN and related methods<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/TUT-trees.pdf Lecture 8] - Decision Trees, Documents<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/f14/Docs-Images-Clustering-Dimred.pdf Lecture 9] - Documents, Images, Clustering, Dimensionality Reduction<br />
* Watch-On-Your-Own - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 10] - Introduction to HMMs - Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/doucette-guest-lecture.pdf Lecture 11] - Machine Learning Words of Wisdom - John Doucette<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/WaterlooTalk_Oct17_14_Online.pdf Lecture 12] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
<br />
=== S13 ===<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-1-intro.pdf Lecture 1] - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/01-2-intro.pdf Lecture 2] - Model Selection, Empirical Evaluation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-1-logreg-nb-svm.pdf Lecture 3,4,5] - Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/02-3-LearningTheory.pdf Lecture 6] - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/07-documents-and-images.pdf Lecture 7] - Documents and Images<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/08-clustering.pdf Lecture 8] - Clustering<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/09-timeseries-and-dimensionality-reduction.pdf Lecture 9] - Sound Features, Dimensionality Reduction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/WaterlooTalk_Jun06_13_Online.pdf Lecture 10] - Scaling Up with Online Learning - Dr. Colin Cherry<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/DataMiningCS886.pdf Lecture 11] - Data Mining - Luiza Antonie<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 12] - Introduction to HMMs - Michelle Karg<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-trees.pdf Short Lecture 1] - Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/s13/TUT-knn.pdf Short Lecture 2] - K-Nearest-Neighbours<br />
<br />
=== Earlier Terms ===<br />
<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-1-intro.pdf Lecture 1] - (F12) - Intro, Regression<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/01-2-intro.pdf Lecture 2] - (F12) - Overfitting, Performance Evaluation, Cross-Validation<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-1-logreg-nb-svm.pdf Lecture 3,4] - (F12) - More Classification: Logistic Regression, Naive Bayes, SVMs<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-2-knn-trees.pdf Lecture 5,6] - (F12) - Non-linear Classifiers: Knn, Decision Trees<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/02-3-LearningTheory.pdf Lecture 6] - (F12) - Learning Theory Light<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/04-image-features-and-clustering.pdf Lecture 7] - (F12) - Image Features, Clustering<br />
** [http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf Paper] on SIFTs + VQ (or Sparse Coding) for classification<br />
** [http://www.vlfeat.org/~vedaldi/code/sift.html Open-Source SIFT (and other) software]<br />
** [http://ufldl.stanford.edu/eccv10-tutorial/ ECCV Tutorial] on Feature Learning for Image Classification. Kai Yu and Andrew Ng<br />
* Lecture 8 - Lectures on feature selection and construction by Isabelle Guyon of [http://www.clopinet.com/ ClopiNet]<br />
** [http://videolectures.net/bootcamp07_guyon_ifs/ Isabelle Guyon] on Feature Selection ([http://videolectures.net/mmdss07_guyon_fsf/ longer version])<br />
** [http://videolectures.net/bootcamp07_guyon_fcon/ Isabelle Guyon] on Feature Construction (starts at 1:00:00)<br />
** [http://clopinet.com/isabelle/Projects/ETH/ Course] on feature selection/construction<br />
** [http://jmlr.csail.mit.edu/papers/special/feature03.html Special issue on features] in JMLR<br />
** [http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf paper] by Guyon et al. on feature selection/construction<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/05-timeseries-and-dimensionality-reduction.pdf Lecture 9] - (F12) - Audio Features, Dimensionality Reduction (PCA)<br />
**[http://videolectures.net/mcvc08_frank_fea/ Feature extraction from audio and their application in music organization and transient enhancement in recorded music]<br />
**[http://videolectures.net/mcvc08_kohler_acs/ Audio Content Search]<br />
**Related [http://ismir2003.ismir.net/papers/McKinney.PDF paper]: Martin F. McKinney and Jeroen Breebaart. Features for Audio and Music Classification.<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/wagstaff-demud.pptx Lecture 10] by Dr. Kiri Wagstaff<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/MKarg_hmm_slides.pdf Lecture 11] by Dr. Michelle Karg<br />
* [http://www.csd.uwo.ca/~dlizotte/teaching/cs886_slides/colin/WaterlooTalk_Oct18_12_Online.pdf Lecture 12] by Dr. [http://sites.google.com/site/colinacherry/ Colin Cherry] - (F12) - See also the [[#colinbib|bibliography]]</div>
Dan Lizotte
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=5
Introduction to Data Science I
2017-08-01T19:31:23Z
<p>Dan Lizotte: Link to lectures page</p>
<hr />
<div>== Course outline ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.'''</span><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which speciﬁc DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their ﬁndings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The [[Lecture Materials|lectures]] give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is directed toward fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday from 11:30AM – 1:30PM, and on Thursday from 3:30PM – 4:30PM<br />
* '''Place''': Talbot College [http://accessibility.uwo.ca/doc/floorplan/bf-tc.pdf '''TC342''']<br />
* '''Question and Collaboration Hour:''' Thursday from 4:30pm - 5:30pm in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']<br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
===Important Dates===<br />
* Pick Brainstorming Slot by Friday, 3 February at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 17 Feb at 5pm <!-- End of 6th Week --><br />
* Project Draft Due Friday, 17 Mar at 5pm <!-- End of 10th Week --><br />
* Project Report Due Friday, 7 Apr at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 13 Apr at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 3 Feb at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. Expanded version of required text. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* ggplot2 book by creator Hadley Wickham (2009). ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** C. M. Bishop, Pattern Recognition and Machine Learning (2006)<br />
:** R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (1998)<br />
:** Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.<br />
:** David J. C. MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003.<br />
:** Richard O. Duda, Peter E. Hart & David G. Stork, "Pattern Classification. Second Edition", Wiley & Sons, 2001.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
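In the course these operations are done with dplyr in R, but the ideas are language-agnostic. As a rough sketch (made-up records, shown in Python rather than R), selecting, filtering, and aggregating look like this:

```python
from collections import defaultdict

# A made-up "structured" table: one dict per row.
rows = [
    {"name": "a", "dept": "CS",    "score": 85},
    {"name": "b", "dept": "Stats", "score": 78},
    {"name": "c", "dept": "CS",    "score": 91},
]

# Selecting: keep only the columns you need.
names = [{"name": r["name"]} for r in rows]

# Filtering: keep only the rows satisfying a condition.
cs_rows = [r for r in rows if r["dept"] == "CS"]

# Aggregating: mean score per department.
by_dept = defaultdict(list)
for r in rows:
    by_dept[r["dept"]].append(r["score"])
mean_by_dept = {d: sum(s) / len(s) for d, s in by_dept.items()}
print(mean_by_dept)  # {'CS': 88.0, 'Stats': 78.0}
```

In dplyr these correspond roughly to select(), filter(), and group_by() followed by summarise(); joining combines two such tables on a shared key.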
<br />
* '''(Re)-introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
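As a preview of one of these ideas: the bootstrap estimates the sampling variability of a statistic by resampling the observed data with replacement and recomputing the statistic on each resample. A minimal sketch with made-up numbers (the lectures use R, but the algorithm is the same):

```python
import random
import statistics

random.seed(0)  # for reproducibility
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0]  # made-up sample

# Resample with replacement many times, recomputing the
# statistic of interest (here, the mean) each time.
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(statistics.mean(resample))

# The spread of the bootstrap means approximates the
# standard error of the original sample mean.
se_hat = statistics.stdev(boot_means)
```

The same resampling scheme also yields bootstrap confidence intervals, e.g. by taking percentiles of the sorted boot_means.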
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
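The splitting logic behind k-fold cross-validation fits in a few lines. An illustrative sketch in Python (the dataset size and k are invented; in R one would typically use a package such as caret):

```python
import random

random.seed(1)
n, k = 20, 5
indices = list(range(n))  # indices into a hypothetical dataset
random.shuffle(indices)

# Deal the shuffled indices round-robin into k disjoint folds.
folds = [indices[i::k] for i in range(k)]

for test_fold in folds:
    held_out = set(test_fold)
    train = [j for j in indices if j not in held_out]
    # Fit the model on `train`, score it on `test_fold`;
    # the cross-validation estimate is the average of the k scores.
```

Every observation is held out exactly once, so the averaged score uses all the data for evaluation while never scoring a model on data it was fit to.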
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
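To give a flavour of clustering, here is Lloyd's k-means algorithm on a toy one-dimensional dataset (the values and starting centres are invented for illustration; a real project would use R's kmeans() or similar):

```python
# Toy 1-D data with two visually obvious clusters.
points = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
centres = [0.0, 6.0]  # rough initial guesses (chosen so no cluster empties)

for _ in range(10):  # Lloyd's algorithm alternates two steps
    # Assignment step: each point joins its nearest centre.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda c: abs(p - centres[c]))
        clusters[nearest].append(p)
    # Update step: each centre moves to the mean of its cluster.
    centres = [sum(c) / len(c) for c in clusters]

print(centres)  # [1.0, 5.0]
```

In higher dimensions the same algorithm applies with Euclidean distance in place of abs(); dimensionality reduction (e.g. PCA) is often applied first so the clusters can also be visualized.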
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session, produce a proposal, draft, and report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting on the second lecture, there will be a very short quiz at the beginning of class covering the previous day's materials. The final quiz will be on 2 Mar. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
Assessing competencies from the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC, http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible. <br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Introduction to Data Science, Data Cleaning<br />
** 12 Sep - Lectures: Re-introduction to Statistics<br />
* 14 Sep - Lectures: Re-introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning<br />
** 26 Sep - Lectures: Supervised Learning<br />
* 28 Sep - Lectures: Cancelled<br />
** 3 Oct - Lectures: Cancelled<br />
* 5 Oct - '''Pick Brainstorming Slot''' - Lectures: Linear Models<br />
** 17 Oct - Lectures: Linear Models<br />
* 19 Oct - Lectures: Linear Models / Nonlinear Models<br />
** 24 Oct - Lectures: Nonlinear Models<br />
* 26 Oct - '''Project Proposal Due 17 Feb at 5pm''' - TBA<br />
** 31 Oct - TBA <br />
* 2 Nov - TBA<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - '''Project Draft Due 17 Mar at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Document Due Friday 7 April 5pm'''<br />
* '''Reviews Due Thursday 13 April 5pm'''</div>
Dan Lizotte
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Project_Guidelines&diff=4
Project Guidelines
2017-08-01T19:28:24Z
<p>Dan Lizotte: Initial copy-and-paste of old cs4437 wiki</p>
<hr />
<div>== Goal ==<br />
<br />
The goal of this project is for the student to gain experience in understanding a substantive problem/question, acquiring data relevant to the problem/question, and applying appropriate data science techniques in an effort to address the problem/question. Here I'm using the word ''substantive'' in the way a statistician might: the ''substantive field'' refers to the field of science (not statistical science) containing the problem to be addressed. Example substantive fields include medicine, chemistry, astronomy, and computer networks. All projects must include a visualization component, which may be static or dynamic.<br />
<br />
== Structure and Regulations ==<br />
<br />
*The project will be submitted as three deliverables, a project [[#Proposal|proposal]] early in the term, a [[#Report Draft|draft]] partway through the term, and a final research [[#Final Report|report]] at the end of the term. '''All of these must be submitted as pdfs generated by Markdown, LaTeX, or Word; see instructions below.''' After this, each '''graduate''' student will [[#Review Guidelines|review]] a subset of projects; reviews are due one week after final project submission.<br />
*Projects are to be completed '''individually'''.<br />
*All projects ''must'' be based on a dataset that is '''sufficiently interesting''' for our purposes as judged by the instructor. Note that any [http://archive.ics.uci.edu/ml/ UCI] dataset that was donated prior to 2007 is considered '''un'''interesting and is therefore disallowed.<br />
*You are encouraged to contact Dan at any point to determine if your project topic is suitable<br />
*'''No Spam Filters. Furthermore, the Enron-Spam datasets are explicitly forbidden'''<br />
<br />
== Proposal ==<br />
<br />
For the proposal, each student will identify an applied problem (or a few related problems) that could be solved using data science methods, identify an appropriate dataset, and give a detailed plan for analyzing the data that includes what pre-processing will be required, what kind of feature development will be necessary, and what analysis and visualization methods might be applied. Don't forget to include details for how you will assess the performance of any models you build. The proposal should have '''three main headings''':<br />
<br />
* Description of Applied Problem<br />
* Description of Available Data<br />
* Plan for Analysis and Visualization<br />
<br />
The main body of the proposal document should be 2 pages long, single spaced. Page 3 and after may only contain references, tables, and figures. If you are using LaTeX, use the [http://www.csd.uwo.ca/~dlizotte/teaching/stylefiles/ CS4637/CS9637 style files], which are based on the ICML style files. There is no style file for Markdown, but keep in mind that if you use Markdown, you still need to have proper references. [http://www.chriskrycho.com/2015/academic-markdown-and-citations.html This resource] may help, as might a bit of Google/StackExchange searching, but in the end the onus is on you. If using Word, use 3/4" margins and a 12 point serif font.<br />
<br />
Include a brief abstract of a few sentences. '''At least two appropriate references''' must be listed for works (papers or books) that discuss and describe the applied problem, '''at least one reference''' that describes the available data (may be URL(s)) and '''at least two references''' that describe the methods you plan to explore in your analysis and visualization plan.<br />
<br />
'''Whether you are using LaTeX, Markdown, or Word, submit your proposal as a PDF file. Proposals must be submitted through OWL. Late submissions will not be accepted.'''<br />
<br />
== Report Draft ==<br />
<br />
A draft of the final report will be due approximately 2/3 of the way through the term. Use Word, Markdown, or LaTeX with the [http://www.csd.uwo.ca/~dlizotte/teaching/stylefiles/ style files], just as you must for the final report. To ensure you get useful feedback, the draft should have a complete abstract, background section, and analysis and visualization plan. The rest of the paper should at least be sketched in, perhaps in point form, to give a sense of the final shape of the document. '''The precise content of the draft is not specified, but the more you provide, the better feedback you will get.'''<br />
<br />
'''Report drafts must be submitted <!-- to EasyChair [https://www.easychair.org/conferences/?conf=amlf14 https://www.easychair.org/conferences/?conf=amlf14] --> through OWL by 5pm on the due date. *Do not e-mail the instructor your draft.*''' Late submissions will not be accepted. <!-- Later, to submit your final report, you will simply "Update" your draft submission with a new .pdf (and maybe title.) --><br />
<br />
== Final Report ==<br />
<br />
The report must be no more than 4 pages long, single spaced, not including references. '''If you wish''', you may also include an additional appendix with an unlimited number of pages that contain '''only figures, figure captions, and tables'''. Use Word, or use the [http://www.csd.uwo.ca/~dlizotte/teaching/stylefiles/ style files], which are based on the ICML style files, or use Markdown. Include a brief abstract. As mentioned above, all reports must include a visualization component.<br />
<br />
An outstanding report might resemble an application-focussed publication in a workshop at one of the top machine learning or AI conferences, such as ICML or [http://www.aaai.org/Library/IAAI/iaai-library.php IAAI]. (Note however that you are required to include a visualization component, which such papers may not have.) Here are some examples. Note that just because a paper is listed here does not mean it is perfect; you must always read with a fair but critical eye.<br />
<br />
*Philip A. Warrick, Emily F. Hamilton, Robert E. Kearney, Doina Precup. [http://www.aaai.org/ocs/index.php/IAAI/IAAI10/paper/view/1597 A Machine Learning Approach to the Detection of Fetal Hypoxia during Labor and Delivery.]<br />
*Weiss, Page, Peissig, Natarajan, and McCarty. [http://www.aaai.org/ocs/index.php/IAAI/IAAI-12/paper/view/4778/5451 Statistical Relational Learning to Predict Primary Myocardial Infarction from Electronic Health Records]<br />
*Chad Cumby, Rayid Ghani [http://www.aaai.org/ocs/index.php/IAAI/IAAI-11/paper/view/3528 A Machine Learning Based System for Semi-Automatically Redacting Documents.]<br />
*Mitja Luštrek, Hristijan Gjoreski, Simon Kozina, Božidara Cvetković, Violeta Mirchevska, Matjaž Gams [http://www.aaai.org/ocs/index.php/IAAI/IAAI-11/paper/view/2753 Detecting Falls with Location Sensors and Accelerometers]<br />
* Ben George Weber, Michael John, Michael Mateas, Arnav Jhala [http://www.aaai.org/ocs/index.php/IAAI/IAAI-11/paper/view/3526/4029 Modeling Player Retention in Madden NFL 11]<br />
<br />
=== Specific expectations for the report ===<br />
<br />
'''Reproducibility''': The report '''must''' contain enough detail about the methods used to allow a future researcher to reproduce the results if they had access to the appropriate data and access to all appropriate works cited. (Some projects may use proprietary data; that is fine.) Reports that do not contain sufficient method detail will not receive full marks.<br />
<br />
'''Integrity''': The report must adhere to the standards of [http://www.lib.uwaterloo.ca/gradait/content/documents/credit_your_sources.pdf academic honesty].<br />
<br />
'''Formality''': The report should be written in formal academic language appropriate for a technical report/workshop/conference/journal publication. The author should refer to him/herself in the first person plural, i.e. using "we." ("We present a novel analysis...")<br />
<br />
'''Writing Quality''': The writing must be of the quality level expected of a senior undergraduate or graduate student at a world-class university. The [http://www.sdc.uwo.ca/writing/ Writing Support Centre] at UWO can help you reach this level.<br />
<br />
== Report Submission and Reviewing ==<br />
<br />
'''Final report submissions will be done through OWL.'''<br />
<br />
Following report submission, each '''graduate (9637)''' student will be randomly assigned two project reports to review over the week following the due date but before the end of the exam period.<br />
<br />
* The main purpose of reviewing is to provide feedback to authors that they can make use of in their future careers, which gives them a better return on the investment they have made in their course project.<br />
* The secondary purpose is to give students a view of the variety of work that has been done in the course.<br />
* '''Reviews from other students will not affect the grade of the author in any way.'''<br />
* Reviewing will be single-blind: Authors will not know who reviews their project.<br />
* Reviewers are expected to provide feedback that is '''constructive'''. Constructive feedback '''makes concrete suggestions on improving the work''' under review. Feedback that is both negative and non-constructive will not be tolerated.<br />
<br />
=== Review Guidelines ===<br />
'''Students must follow the review guidelines below. Include headings where appropriate'''<br />
<br />
* '''Summary:''' Summarize the goal of the project. What are the authors trying to achieve? Then summarize the contributions of the project in a few sentences. Describe the substantive problem, the data used, and the analysis applied. Describe the results. Note that not every project will have "good results" and for this project that is not necessarily a fault; the meta-goal of this project is for each author to gain experience with DS methods. Keep that in mind when you summarize: did the authors sufficiently explore the space of appropriate methods?<br />
* After the summary, comment on the following aspects of the report:<br />
** '''Background''': Comment on whether the report clearly explains the problem to be tackled, and whether it clearly describes how the substantive problem will be formulated as a data science problem.<br />
** '''Data''': Comment on whether you were able to clearly understand what data were available and how they were used in the analysis.<br />
** '''Analysis and Visualization''': Comment on the appropriateness of the DS methods used, and '''comment on the reproducibility of the results''' as described above. Comment on the evaluation measures used.<br />
** '''Future work''': Make some suggestions on how the work could be extended in the future.<br />
<br />
Depending on the project, these sections of the review may be longer or shorter. Use your judgement. Be sure to have at least a few interesting sentences under each heading.<br />
<br />
== Brainstorming ==<br />
<br />
A brainstorming session will consist of a 10-minute presentation by a student, followed by a class discussion for a total of 15 minutes. The presenter may choose to take questions during the talk, or save them until the end. The presentation should detail an applied problem, dataset, and potential DS methods that could be useful, much like the project proposal. The Brainstorming Session '''''may or may not''''' be on the student's project topic, but of course it may be advantageous to use your brainstorming slot to get feedback and ideas.<br />
<br />
* Guidelines<br />
** Presentations should use projected slides<br />
** Presentations should cover more or less the same topics as a project proposal: Description of Applied Problem, Description of Available Data, Plan for Analysis and Visualization<br />
** Presenters will receive a 5-minute warning, but presentations *will* be terminated at the 15-minute mark.<br />
<br />
* Evaluation (by instructor) is based on <br />
** Effective explanation of the problem<br />
** Effective explanation of the available data. It is often a good idea to show a specific example of a single "data item" from the available data, whatever that might mean for the specific project.<br />
** Effective explanation of potential DS methods<br />
** Ability to answer questions about the data and the analysis and visualization plan<br />
** Working within the strict 10+5 minute timeslot<br />
<br />
In general, it is better to *show* your plan rather than tell it. Use actual examples from your dataset where possible. Show how feature vectors and any class labels/regression targets are constructed.</div>
Dan Lizotte
https://www.csd.uwo.ca/~dlizotte/teaching/IDS/index.php?title=Introduction_to_Data_Science_I&diff=3
Introduction to Data Science I
2017-08-01T19:27:05Z
<p>Dan Lizotte: Fixed extra gunk in copy</p>
<hr />
<div>== Course outline ==<br />
<br />
'''From Dan:''' This is a very high-demand course that interests students in various programs across campus. I think this is great because the diversity of backgrounds assembled in the class makes for a better learning experience for all. (Myself included!) However, space is limited. <span style="color:#EE0000">Therefore, '''all ''graduate'' students who are ''not'' in the MSc or PhD programme within the Department of Computer Science must e-mail me a 1/2 page proposal sketch on the project they would like to pursue. (See the Proposal Guidelines for the general idea.) This must be submitted by 5pm on 15 December 2016 and does not guarantee enrolment. Enrolment will be decided based on space available and quality of the proposal sketches.'''</span><br />
<br />
=== Objective ===<br />
<br />
The objective of this course is to introduce students to data science (DS) techniques, with a focus on application to substantive (i.e. "applied") problems. Students will gain experience in identifying which problems can be tackled by DS methods, and learn to identify which speciﬁc DS methods are applicable to a problem at hand. During the course, students will gain an in-depth understanding of a particular (substantive problem, DS solution) pair, and present their ﬁndings to their peers in the class. '''Although this course does not assume prior machine learning or visualization knowledge, it does require students to show substantial initiative in investigating methods that are applicable for their project. The lectures give an overview of important methods, but the lecture content alone is not sufficient to produce a high quality course project.'''<br />
<br />
This course is designed for students who:<br />
<br />
* Like to '''read''' - have a desire to understand substantive problems<br />
* Like to '''think''' - make connections between methods and problems<br />
* Like to '''hack''' - be willing to [http://en.wikipedia.org/wiki/Data_munging munge] data into usability<br />
* Like to '''speak''' - teach us about what you found<br />
<br />
=== Prerequisites ===<br />
<br />
At least one undergraduate programming course (e.g. CS2035) and at least one statistics course (e.g. STAT1024). This course entails a significant amount of self-directed learning and is intended for fourth-year undergraduate and graduate students.<br />
<br />
=== Logistics ===<br />
* '''Instructor''': Dan Lizotte – dlizotte at uwo dot ca – Office MC363<br />
* '''Teaching Assistant''': Brent Davis - bdavis56 at uwo dot ca - Runs Q/C Hour (see below)<br />
* '''Time''': Tuesday, 11:30am–1:30pm, and Thursday, 3:30pm–4:30pm<br />
* '''Place''': Talbot College [http://accessibility.uwo.ca/doc/floorplan/bf-tc.pdf '''TC342''']<br />
* '''Question and Collaboration Hour:''' Thursday, 4:30pm–5:30pm in Middlesex College [http://accessibility.uwo.ca/doc/floorplan/bf-mc.pdf '''MC320''']<br />
* '''Communication''': We will be using [https://owl.uwo.ca OWL] for electronic communication.<br />
<br />
=== Important Dates ===<br />
* Pick Brainstorming Slot by Friday, 3 February at 5pm <!-- End of 4th Week --><br />
* Project Proposal Due Friday, 17 Feb at 5pm <!-- End of 6th Week --><br />
* Project Draft Due Friday, 17 Mar at 5pm <!-- End of 10th Week --><br />
* Project Report Due Friday, 7 Apr at 5pm <!-- Last Day of Class --><br />
* Paper Reviews Due '''Thursday''', 13 Apr at 5pm <!-- Week after Last Day of Class --><br />
<br />
Register for a wiki account. You will need to use the wiki to let us all know about data sources you find, indicate which dataset you are using, and slot yourself in for brainstorming. Also, everyone should feel free to make improvements to any part of the wiki. (E.g. if you find some useful software or other resources.)<br />
<br />
Slot yourself in for a brainstorming session in the Timeline portion at the bottom of this page by '''Friday, 3 Feb at 5pm''', or Dan will pick a slot for you.<br />
<br />
=== Materials ===<br />
* '''Required Texts'''<br />
:* '''JWHT''': James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ''An introduction to statistical learning with applications in R.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-7138-7 Western]]<br />
:* '''HTF''': ''The Elements of Statistical Learning'' by Hastie, Tibshirani and Friedman. An expanded treatment of the material in JWHT. ['''Free''' [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ online]]<br />
:* '''LW''': Leland Wilkinson's ''The Grammar of Graphics'' (2005). ['''Free''' from [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/book/10.1007/0-387-28695-0 Springer]]<br />
:* Wickham, H. (2009). ''ggplot2: Elegant Graphics for Data Analysis.'' New York: Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387981406 Western]]<br />
* '''Review''' if you need to catch up:<br />
:* [http://www.stat.cmu.edu/~larry/all-of-statistics/ Larry Wasserman's] ''All of Statistics.'' ['''Free''' from [http://link.springer.com/book/10.1007/978-0-387-21736-9 Springer]]<br />
:* Devore, J. L., & Berk, K. N. (2007). ''Modern mathematical statistics with applications.'' 2nd ed. Springer. ['''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://link.springer.com/978-1-4614-0391-3 Western]]<br />
:* [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/linalg-review.pdf linear algebra review] - up to and including Section 3.7 - The Inverse<br />
* '''Other Resources'''<br />
:* Cheat Sheets<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ggplot2] cheat sheet<br />
:** [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Data Wrangling] cheat sheet<br />
:* Texts<br />
:** Phil Spector. (2008). ''Data Manipulation with R'' New York: Springer. [ '''Free''' through [https://www.lib.uwo.ca/cgi-bin/ezpauthn.cgi?url=http://www.springer.com/us/book/9780387747309 Western] ]<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/Materials/prob-review.pdf probability review] from Stanford University by way of Doina Precup.<br />
:** [http://www.cs.mcgill.ca/~dprecup/courses/ML/resources.html List of resources] from COMP-652 at McGill (courtesy Doina Precup)<br />
:** Bishop, C. M. (2006). ''Pattern Recognition and Machine Learning.'' Springer.<br />
:** Sutton, R. S., & Barto, A. G. (1998). ''Reinforcement Learning: An Introduction.'' MIT Press.<br />
:** Alpaydin, E. (2004). ''Introduction to Machine Learning.'' MIT Press.<br />
:** MacKay, D. J. C. (2003). ''Information Theory, Inference and Learning Algorithms.'' Cambridge University Press.<br />
:** Duda, R. O., Hart, P. E., & Stork, D. G. (2001). ''Pattern Classification.'' 2nd ed. Wiley & Sons.<br />
:* Other Links<br />
:** [https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception Data Visualization for Human Perception]<br />
:** [http://datadrivenjournalism.net/news_and_analysis/is_data_journalism_for_everyone Data Journalism]<br />
:* Software<br />
:** The dplyr package [https://cran.r-project.org/web/packages/dplyr/ documentation]. The "vignettes" are particularly good.<br />
:** The Tensorflow Library (Python, C++) [https://www.tensorflow.org/]<br />
<br />
=== Topics (anticipated) ===<br />
* '''Introduction to Data Science'''<br />
** Definitions<br />
** Components<br />
** Relationships to Other Fields<br />
<br />
* '''Data Munging'''<br />
** Working with structured data: selecting, filtering, joining, aggregating<br />
** Web scraping<br />
** Simple visualizations<br />
** Sanity checking<br />
<br />
* '''(Re)introduction to Statistics'''<br />
** Data Summaries<br />
** Randomness, Sample Spaces and Events, Probability<br />
** Random Variables, CDF, PMF, PDF<br />
** Expectation<br />
** Estimation<br />
** Sampling Distributions: Law of Large Numbers, Central Limit Theorem, The Bootstrap<br />
** Inference: Hypothesis testing, P-values, Confidence Intervals<br />
** Multivariate Statistics: conditional probability, correlation, independence<br />
<br />
* '''Supervised Machine Learning, Predictive Models'''<br />
** Supervised Learning<br />
*** Regression<br />
*** Classification<br />
** Reinforcement Learning and Sequential Decision Making<br />
<br />
* '''Evaluation'''<br />
** Variance: Test set, cross-validation, bootstrap<br />
** Bias: Confounding, causal inference<br />
<br />
* '''Unsupervised Machine Learning, Representations, and Feature Construction'''<br />
** Clustering<br />
** Dimensionality reduction<br />
** Domain-specific Feature Development<br />
*** Images<br />
*** Sounds<br />
*** Text<br />
<br />
* '''Visualization'''<br />
** Topics to be determined<br />
<br />
=== Evaluation ===<br />
<br />
There will be a midterm test but no final exam. Each student will lead a brainstorming session and produce a proposal, a draft, and a final report for a course project. '''Graduate students (9637)''' will additionally submit peer reviews of other class projects. For detailed requirements, see [[Project Guidelines]].<br />
<br />
Scholastic offences are taken seriously and students are directed to read the appropriate policy, specifically, the definition of what constitutes a Scholastic Offence, at this website: [http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf].<br />
<br />
==== Daily Quizzes – 5% ====<br />
<br />
Starting with the second lecture, there will be a very short quiz at the beginning of class covering the previous lecture's material. The final quiz will be on 2 Mar. The lowest quiz mark will be dropped. '''Quiz marks will only be excused for medical reasons.'''<br />
<br />
==== Midterm - 35% ====<br />
<br />
The midterm assesses competencies in the fundamentals taught in the first half of the class.<br />
<br />
==== Brainstorming Session – 5% ====<br />
<br />
Each student will prepare a [[Project Guidelines#Brainstorming|presentation]] explaining an applied problem, as well as some potential data science methods that could be applied to the problem. The presentation should be '''no more than 10 minutes'''. We will then discuss the problem as a class, along with possible approaches for solving the problem using data science methods. '''The student is expected to be prepared to answer deep questions about the nature of their problem to ensure that they receive high quality feedback''' from the brainstorming session.<br />
<br />
==== Project Proposal – '''4437:''' 15% '''9637:''' 10% ====<br />
<br />
Document detailing the plan for the project. See [[Project Guidelines]] for detailed requirements.<br />
<br />
==== Report Draft – 5% ====<br />
<br />
A [[Project Guidelines#Report Draft|draft]] of the final report will be due approximately midway through the term. The purpose of the draft is to allow the instructor to provide feedback on the quality of the writing and the direction of the project.<br />
<br />
==== Project Report – 35% ====<br />
<br />
Each student will prepare a [[Project Guidelines|research paper]] detailing a substantive problem, the data available, the applicable data science methods, and empirical results obtained on the problem.<br />
<br />
==== Peer Review – '''9637 only:''' 5% ====<br />
<br />
Each '''graduate''' student will prepare two [[Project Guidelines#Report Submission and Reviewing|reviews]] of their classmates' work.<br />
<br />
==== Participation and Effort ====<br />
<br />
Success of the course as a useful learning experience hinges on active participation and effort of the students. '''Students are expected to attend all classes''' and are expected to '''actively participate in the brainstorming sessions'''.<br />
<br />
=== Accessibility and Support Available at Western ===<br />
Please contact the course instructor if you require lecture or printed material in an alternate format or if any other arrangements can make this course more accessible to you. You may also wish to contact Services for Students with Disabilities (SSD) at 661-2111 ext. 82147 if you have questions regarding accommodation.<br />
<br />
'''Support Services'''<br />
Learning-skills counsellors at the Student Development Centre (http://www.sdc.uwo.ca) are ready to help you improve your learning skills. They offer presentations on strategies for improving time management, multiple-choice exam preparation/writing, textbook reading, and more. Individual support is offered throughout the Fall/Winter terms in the drop-in Learning Help Centre, and year-round through individual counselling.<br />
Students who are in emotional/mental distress should refer to Mental Health@Western (http://www.health.uwo.ca/mental_health) for a complete list of options about how to obtain help.<br />
Additional student-run support services are offered by the USC: http://westernusc.ca/services.<br />
The website for Registrarial Services is http://www.registrar.uwo.ca.<br />
<br />
=== Missed Course Components ===<br />
If you are unable to meet a course requirement due to illness or other serious circumstances, you must provide valid medical or supporting documentation to the Academic Counselling Office of your home faculty as soon as possible.<br />
If you are a Science student, the Academic Counselling Office of the Faculty of Science is located in WSC 140, and can be contacted at 519-661-3040 or scibmsac@uwo.ca. Their website is http://www.uwo.ca/sci/undergrad/academic_counselling/index.html.<br />
A student requiring academic accommodation due to illness must use the Student Medical Certificate (https://studentservices.uwo.ca/secure/medical_document.pdf) when visiting an off-campus medical facility.<br />
For further information, please consult the university’s medical illness policy at http://www.uwo.ca/univsec/pdf/academic_policies/appeals/accommodation_medical.pdf.<br />
<br />
== Timeline (Tentative) ==<br />
<br />
* 7 Sep - Lectures: Introduction to Data Science, Data Cleaning<br />
** 12 Sep - Lectures: Re-introduction to Statistics<br />
* 14 Sep - Lectures: Re-introduction to Statistics<br />
** 19 Sep - Lectures: Supervised Learning<br />
* 21 Sep - Lectures: Supervised Learning<br />
** 26 Sep - Lectures: Supervised Learning<br />
* 28 Sep - Lectures: Cancelled<br />
** 3 Oct - Lectures: Cancelled<br />
* 5 Oct - '''Pick Brainstorming Slot''' - Lectures: Linear Models<br />
** 17 Oct - Lectures: Linear Models<br />
* 19 Oct - Lectures: Linear Models / Nonlinear Models<br />
** 24 Oct - Lectures: Nonlinear Models<br />
* 26 Oct - '''Project Proposal Due at 5pm''' - TBA<br />
** 31 Oct - TBA <br />
* 2 Nov - TBA<br />
** 7 Nov - '''Midterm'''<br />
* 9 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 14 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 16 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 21 Nov - '''Project Draft Due at 5pm''' - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 23 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 28 Nov - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 30 Nov - Brainstorming: *slot1*, *slot2*, *slot3*<br />
** 5 Dec - Brainstorming: *slot1*, *slot2*, *slot3*, *slot4*, *slot5*, *slot6*<br />
* 7 Dec - Brainstorming: *slot1*, *slot2*, *slot3*<br />
<br />
* '''Project Report Due Friday, 7 Apr at 5pm'''<br />
* '''Paper Reviews Due Thursday, 13 Apr at 5pm'''</div>Dan Lizotte