The goal of this project is for the student to gain experience in understanding a substantive problem/question, acquiring data relevant to the problem/question, and applying appropriate data science techniques in an effort to address the problem/question. Here I'm using the word substantive in the way a statistician might: the substantive field refers to the field of science (not statistical science) containing the problem to be addressed. Example substantive fields include medicine, chemistry, astronomy, and computer networks. All projects must include a visualization component, which may be static or dynamic.
Structure and Regulations
- The project will be submitted as three deliverables: a project proposal early in the term, a draft partway through the term, and a final research report at the end of the term. All of these must be submitted as PDFs generated by Markdown, LaTeX, or Word; see instructions below. After this, each graduate student will review a subset of projects; reviews are due one week after final project submission.
- Projects are to be completed individually.
- All projects must be based on a dataset that is sufficiently interesting for our purposes as judged by the instructor. Note that any UCI dataset that was donated prior to 2007 is considered uninteresting and is therefore disallowed.
- You are encouraged to contact Dan at any point to determine if your project topic is suitable.
- No Spam Filters. Furthermore, the Enron-Spam datasets are explicitly forbidden.
For the proposal, each student will identify an applied problem (or a few related problems) that could be solved using data science methods, identify an appropriate dataset, and give a detailed plan for analyzing the data that includes what pre-processing will be required, what kind of feature development will be necessary, and what analysis and visualization methods might be applied. Don't forget to include details for how you will assess the performance of any models you build. The proposal should have three main headings:
- Description of Applied Problem
- Description of Available Data
- Plan for Analysis and Visualization
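As one illustration of what "how you will assess the performance of any models you build" could look like in a plan, here is a minimal sketch of k-fold cross-validation applied to a trivial majority-class baseline. It assumes a classification problem; the function names and the example labels are hypothetical and are not part of the course materials. A proposed model would normally be expected to beat a baseline like this on held-out data.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices once and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_baseline_accuracy(labels, k=5, seed=0):
    """Estimate held-out accuracy of a 'predict the most common class' baseline."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        # "Fit" on the training folds only: find the most frequent label.
        majority = Counter(train_labels).most_common(1)[0][0]
        # Score on the held-out fold, which was never used for fitting.
        correct = sum(1 for i in held_out if labels[i] == majority)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Hypothetical labels, invented for illustration only.
example_labels = ["healthy"] * 70 + ["ill"] * 30
baseline = majority_baseline_accuracy(example_labels, k=5)
print(f"Baseline accuracy to beat: {baseline:.2f}")
```

The same fold structure can be reused for any model by replacing the "fit" and "score" steps; the key point for a proposal is that the data used for scoring are never used for fitting.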
The main body of the proposal document should be 2 pages long, single spaced. Page 3 and after may only contain references, tables, and figures. If you are using LaTeX, use the CS4637/CS9637 style files, which are based on the ICML style files. There is no style file for Markdown, but keep in mind that if you use Markdown, you still need to have proper references. This resource may help, as might a bit of Google/StackExchange searching, but in the end the onus is on you. If using Word, use 3/4" margins and a 12 point serif font.
Include a brief abstract of a few sentences. List at least two appropriate references for works (papers or books) that discuss and describe the applied problem, at least one reference that describes the available data (may be URL(s)), and at least two references that describe the methods you plan to explore in your analysis and visualization plan.
Whether you are using LaTeX, Markdown, or Word, submit your proposal as a PDF file. Proposals must be submitted through OWL. Late submissions will not be accepted.
A draft of the final report will be due approximately 2/3 of the way through the term. Use Word, Markdown, or LaTeX with the style files, just as you must for the final report. To ensure you get useful feedback, the draft should have a complete abstract, background section, and analysis and visualization plan. The rest of the paper should at least be sketched in, perhaps in point form, to give a sense of the final shape of the document. The precise content of the draft is not specified, but the more you provide, the better feedback you will get.
Report drafts must be submitted through OWL by 5pm on the due date. *Do not e-mail the instructor your draft.* Late submissions will not be accepted.
The report must be no more than 4 pages long, single spaced, not including references. If you wish, you may also include an additional appendix with an unlimited number of pages that contain only figures, figure captions, and tables. Use Word, or use the style files, which are based on the ICML style files, or use Markdown. Include a brief abstract. As mentioned above, all reports must include a visualization component.
An outstanding report might resemble an application-focussed publication in a workshop at one of the top machine learning or AI conferences, such as ICML or IAAI. (Note however that you are required to include a visualization component, which such papers may not have.) Here are some examples. Note that just because a paper is listed here does not mean it is perfect; you must always read with a fair but critical eye.
- Philip A. Warrick, Emily F. Hamilton, Robert E. Kearney, Doina Precup. A Machine Learning Approach to the Detection of Fetal Hypoxia during Labor and Delivery.
- Weiss, Page, Peissig, Natarajan, and McCarty. Statistical Relational Learning to Predict Primary Myocardial Infarction from Electronic Health Records.
- Chad Cumby, Rayid Ghani. A Machine Learning Based System for Semi-Automatically Redacting Documents.
- Mitja Luštrek, Hristijan Gjoreski, Simon Kozina, Božidara Cvetković, Violeta Mirchevska, Matjaž Gams. Detecting Falls with Location Sensors and Accelerometers.
- Ben George Weber, Michael John, Michael Mateas, Arnav Jhala. Modeling Player Retention in Madden NFL 11.
Specific expectations for the report
Reproducibility: The report must contain enough detail about the methods used to allow a future researcher to reproduce the results if they had access to the appropriate data and access to all appropriate works cited. (Some projects may use proprietary data; that is fine.) Reports that do not contain sufficient method detail will not receive full marks.
Integrity: The report must adhere to the standards of academic honesty.
Formality: The report should be written in formal academic language appropriate for a technical report/workshop/conference/journal publication. The author should refer to him/herself in the first person plural, i.e. using "we." ("We present a novel analysis...")
Writing Quality: The writing must be of the quality level expected of a senior undergraduate or graduate student at a world-class university. The Writing Support Centre at UWO can help you reach this level.
Report Submission and Reviewing
Final report submissions will be done through OWL.
Following report submission, each graduate (9637) student will be randomly assigned two project reports to review over the week following the due date but before the end of the exam period.
- The main purpose of reviewing is to provide feedback to authors that they can make use of in their future careers, which gives them a better return on the investment they have made in their course project.
- The secondary purpose is to give students a view of the variety of work that has been done in the course.
- Reviews from other students will not affect the grade of the author in any way.
- Reviewing will be single-blind: Authors will not know who reviews their project.
- Reviewers are expected to provide feedback that is constructive. Constructive feedback makes concrete suggestions on improving the work under review. Feedback that is both negative and non-constructive will not be tolerated.
Students must follow the review guidelines below. Include headings where appropriate.
- Summary: Summarize the goal of the project. What are the authors trying to achieve? Then summarize the contributions of the project in a few sentences. Describe the substantive problem, the data used, and the analysis applied. Describe the results. Note that not every project will have "good results" and for this project that is not necessarily a fault; the meta-goal of this project is for each author to gain experience with DS methods. Keep that in mind when you summarize: did the authors sufficiently explore the space of appropriate methods?
- After the summary, comment on the following aspects of the report:
- Background: Comment on whether the report clearly explains the problem to be tackled, and whether it clearly describes how the substantive problem will be formulated as a data science problem.
- Data: Comment on whether you were able to clearly understand what data were available and how they were used in the analysis.
- Analysis and Visualization: Comment on the appropriateness of the DS methods used, and comment on the reproducibility of the results as described above. Comment on the evaluation measures used.
- Future work: Make some suggestions on how the work could be extended in the future.
Depending on the project, these sections of the review may be longer or shorter. Use your judgement. Be sure to have at least a few interesting sentences under each heading.
A brainstorming session will consist of a 10-minute presentation by a student followed by a 5-minute class discussion, for a total of 15 minutes. The presenter may choose to take questions during the talk, or save them until the end. The presentation should detail an applied problem, dataset, and potential DS methods that could be useful, much like the project proposal. The brainstorming session may or may not be on the student's project topic, but of course it may be advantageous to use your brainstorming slot to get feedback and ideas.
- Presentations should use projected slides
- Presentations should cover more or less the same topics as a project proposal: Description of Applied Problem, Description of Available Data, Plan for Analysis and Visualization
- Presenters will receive a 5-minute warning, but presentations *will* be terminated at the 15-minute mark.
- Evaluation (by instructor) is based on
- Effective explanation of the problem
- Effective explanation of the available data. It is often a good idea to show a specific example of a single "data item" from the available data, whatever that might mean for the specific project.
- Effective explanation of potential DS methods
- Ability to answer questions about the data and the analysis and visualization plan
- Working within the strict 10+5 minute timeslot
In general, it is better to *show* your plan rather than tell it. Use actual examples from your dataset where possible. Show how feature vectors and any class labels/regression targets are constructed.
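To make "show how feature vectors and any class labels are constructed" concrete, the following is a minimal sketch of the kind of example that could go on a slide. The record, its field names, and the chosen features are all hypothetical and invented for illustration; a real project would substitute one actual item from its own dataset.

```python
# Hypothetical raw data item; field names and values are invented for illustration.
raw_item = {
    "patient_id": "P-0001",
    "age_years": 54,
    "smoker": "yes",
    "systolic_bp": [128, 135, 131],
    "diagnosis": "hypertension",
}

FEATURE_NAMES = ["age_years", "is_smoker", "mean_systolic_bp", "max_systolic_bp"]


def to_feature_vector(item):
    """Turn one raw record into a fixed-length list of numbers."""
    readings = item["systolic_bp"]
    return [
        float(item["age_years"]),
        1.0 if item["smoker"] == "yes" else 0.0,  # binary encoding of a category
        sum(readings) / len(readings),            # summary of a variable-length series
        float(max(readings)),
    ]


def to_class_label(item):
    """Binary target: 1 if the recorded diagnosis is hypertension, else 0."""
    return 1 if item["diagnosis"] == "hypertension" else 0


x = to_feature_vector(raw_item)
y = to_class_label(raw_item)
for name, value in zip(FEATURE_NAMES, x):
    print(f"{name:>18}: {value:.1f}")
print(f"{'class label':>18}: {y}")
```

Showing the raw item next to the resulting vector and label lets the audience see exactly which information is kept, summarized, or discarded (here the identifier is dropped and the list of readings is reduced to two summary numbers).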