Note

This term project is heavily based on material created by Jenny Smetzer, William Hopper, Beth Brown, and Albert Y. Kim, with some modifications.


Instructions

Everything in this course builds up to the term group project, where there is only one learning goal: Engage in the data/science research pipeline in as faithful a manner as possible while maintaining a level suitable for novices.

In order to break down the task and minimize end-of-semester stress, you’ll be working on the project in eight phases:

  1. Project groups: Form groups.
  2. Project data: Propose a data set for your project. This is the phase of the project that is the least straightforward. Thus, we recommend you start early and get help during office hours early and often.
  3. Project proposal: Ensure you can work with your data in R by performing an exploratory data analysis. This phase may require revisions to your original choice of data.
  4. Proposal peer review: You will read other groups’ proposals, evaluate them, and make suggestions.
  5. Project submission: Make an initial submission of your project. Depending on the course schedule, you may skip some of the sections and only complete them after we have covered inference for regression in class. After you submit your work, you will get instructor feedback.
  6. Project resubmission: Incorporate your instructor feedback from the project submission phase, complete the remaining sections, and resubmit your project. You will only be graded on your project resubmission.
  7. Project presentation: Present your results to the class in a 15-20 minute presentation. You should pretend the audience is full of business executives who do not know anything about data science.
  8. Peer Evaluations: Evaluate your group members.

Timeline

Unless otherwise noted, all deadlines are at the start of class on that day, i.e., 9:00 AM.

Step                    Deadline
Groups                  October 11
Data                    October 11
Proposal                October 27
Proposal Peer Review    November 3
Submission              November 17
Resubmission            December 1
Presentation            December 6
Peer Evaluations        December 11 (midnight)

1 Project Groups

  1. Form groups of 2 students.
    • All group members are expected to contribute, and you will all be held accountable for your contributions in peer evaluations.
  2. Choose a group name.
  3. One member of the group should email me the group name and the names of all group members.

2 Project Data

This is the phase of the project that is the least straightforward. Thus, we recommend you start early and get help during office hours early and often.

  • Get a sense for the requirements of this project phase by reading a possible example data proposal (Note that your data proposal will likely differ slightly from this one):
  • Download the following data.Rmd template R Markdown file and fill it in.

2.1 Find data

Specifications

Find a dataset that fits these specifications. Note your data may need a little wrangling from its original form.

  1. (If available) An identification variable that uniquely identifies each observation in each row.
  2. A numerical outcome variable \(y\). Note: binary outcome variables with 0/1 values are not technically numerical.
  3. Two explanatory variables:
    • A numerical explanatory variable \(x_1\). Note: this can be some notion of time.
    • A categorical explanatory variable \(x_2\) that has between 3 and 5 levels. Note: If your variable has more than 5 levels, they can be collapsed into 5 using data wrangling later.
  4. At least 50 observations/rows.
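If you want a quick way to check a candidate dataset against these specifications, the following is a minimal sketch in base R. The helper names `check_project_data()` and `collapse_levels()`, the column names, and the simulated data are all hypothetical and are not part of the course templates; the check for a binary outcome simply asks for more than two distinct values, which is one possible reading of the note above.

```r
# Hypothetical helper: checks a data frame against the specifications above.
# The column names passed in are placeholders for your own variables.
check_project_data <- function(data, outcome, numeric_var, categorical_var) {
  y <- data[[outcome]]
  x2 <- data[[categorical_var]]
  n_levels <- length(unique(x2[!is.na(x2)]))
  c(
    # numeric and not just a 0/1 style binary variable
    numeric_outcome      = is.numeric(y) && length(unique(y[!is.na(y)])) > 2,
    numeric_explanatory  = is.numeric(data[[numeric_var]]),
    three_to_five_levels = n_levels >= 3 && n_levels <= 5,
    at_least_50_rows     = nrow(data) >= 50
  )
}

# Hypothetical helper: keeps the most frequent levels and lumps the rest
# into "Other" so that at most `max_levels` levels remain.
# Missing values also end up in "Other" in this simple version.
collapse_levels <- function(x, max_levels = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= max_levels) return(factor(x))
  keep <- names(counts)[seq_len(max_levels - 1)]
  factor(ifelse(x %in% keep, as.character(x), "Other"))
}

# Simulated stand-in data, for illustration only.
set.seed(1)
example_data <- data.frame(
  id = 1:60,
  y  = rnorm(60, mean = 50, sd = 10),
  x1 = runif(60, min = 0, max = 100),
  x2 = rep(c("a", "b", "c", "d"), times = 15)
)
checks <- check_project_data(example_data, "y", "x1", "x2")
print(checks)

# A categorical variable with 7 levels, collapsed down to 5.
many_levels <- rep(c("a", "b", "c", "d", "e", "f", "g"),
                   times = c(10, 9, 8, 7, 6, 5, 4))
collapsed <- collapse_levels(many_levels, max_levels = 5)
print(table(collapsed))
```

Passing all four checks does not guarantee that a dataset will work for the whole project, but a failed check is a good reason to bring the data to office hours.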

Possible sources

Here are some possible data sources:

  • Best option: data from your own research or other courses! The more connected you feel with your data, the more motivated you will be for this project.
  • Next best options: Online data repositories such as (but not limited to):
  • You may not use the following data:
    • Any datasets used in this class, either in ModernDive or in any of the examples.
    • Any data from the data journalism website FiveThirtyEight.com

Note on data confidentiality

If your data is not confidential or sensitive in nature, then publish your data as a CSV file on Google Sheets. That way your group can all access a single copy of your data on the web. If your data is confidential or sensitive in nature, do not publish it on the web; instead, submit the Excel or CSV file along with your other project files.
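Once a single copy of the data exists, every group member can load it the same way. The sketch below is one possible approach in base R, not the method used in the course templates: `load_project_data()` is a hypothetical helper, and a temporary local file stands in for the published sheet so that the example runs without a network connection. `read.csv()` accepts either a file path or a link to a CSV file.

```r
# Hypothetical helper: reads the group's shared CSV into a data frame.
# `source` may be a local file path or the link to a published CSV.
load_project_data <- function(source) {
  read.csv(source, stringsAsFactors = FALSE)
}

# Stand-in for a published sheet: write a tiny CSV to a temporary file.
# With real data you would instead pass the link to your published CSV.
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, y = c(2.5, 3.1, 4.8)),
          demo_file, row.names = FALSE)

project_data <- load_project_data(demo_file)
str(project_data)
```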

You can publish your data as a CSV file on Google Sheets by following the 6 steps in this Twitter thread:

2.2 What to submit

Only one group member will make a single submission on behalf of the whole group. They will submit:

  1. The data.Rmd R Markdown file
  2. The data.html HTML report file
  3. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file. Otherwise you should publish your data as a CSV file on Google Sheets as described above.

2.3 Hints

  • Where is this heading?: For the next project phase (project proposal), you will be making a visualization like this one. If you can make a visualization like this one, then your data is set for the rest of the project.
  • Feel free to bring questions to class, in addition to office hours. Your questions might be useful to other groups, too!
  • Only minimal data wrangling using the dplyr package is expected at this time; you will be doing more for the “project proposal” phase coming up. That being said, feel free to experiment!
  • Disclaimer: Just because things seem good now doesn’t mean your project data is set for the semester. Unforeseen problems may crop up during the next phase on data wrangling, at which point you may need to revise your data. This is a reality of data collection in the real world!
  • Do not include any View() statements in your .Rmd files as this may cause an error.
  • Avoid “data dumps”, for example, showing the contents of all 1000 rows in a data frame. This will make your output document really large and unreadable.
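As a rough illustration of the last two hints, here is a minimal sketch of ways to look at a data frame inside an R Markdown file without calling View() and without printing every row. The data frame `big_data` is simulated for this example only.

```r
# Simulated stand-in for a 1000-row project dataset.
set.seed(42)
big_data <- data.frame(
  id    = 1:1000,
  y     = rnorm(1000),
  group = rep(c("a", "b", "c", "d"), times = 250)
)

# Instead of View(big_data) or printing big_data in full:
preview <- head(big_data, 5)       # first five rows only
print(preview)

str(big_data)                      # compact overview of columns and types

group_counts <- table(big_data$group)  # number of rows per category
print(group_counts)
```

If you are already using dplyr, `glimpse()` gives a similar compact overview to `str()`.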

3 Project Proposal

This phase may require revisions to your original choice of data.

  • Get a sense for the requirements of this project phase by reading a possible example project proposal (Note that your project proposal will likely differ slightly from this one):
  • Download the following proposal.Rmd template R Markdown file and start filling it in.

3.1 Work on your proposal

Your data may require some wrangling to get it into the appropriate format. In addition to Chapter 3 of ModernDive, you may want to look at Appendix C: Tips & Tricks. It’s based on the seven most common data wrangling questions the authors encountered from students while they were working on their term projects:

3.2 What to submit

Only one group member will make a single submission on behalf of the whole group. They will submit:

  1. The proposal.Rmd R Markdown file
  2. The proposal.html HTML report file
  3. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.

4 Proposal Peer Review

You will be sent proposals from other groups and an evaluation form for each. Fill out the evaluation form and turn it in at the start of class. Each group member should fill out separate evaluation forms.

5 Project Submission

  • Get a sense for the requirements of this project phase by reading only the following sections of this possible project resubmission (Note that your project submission will likely differ slightly from this one):
    • Section 1: Introduction
    • Section 2: Exploratory data analysis
    • Section 3 subsections 3.1, 3.2, and 3.3: Multiple linear regression: Methods, Model Results, Interpreting the regression table.
  • Download the following project_submission.Rmd template R Markdown file and start filling it in.

5.1 Complete your initial submission

  • Complete the following sections of project_submission.Rmd:
    • Section 1: Introduction
    • Section 2: Exploratory data analysis
    • Section 3 subsections 3.1, 3.2, and 3.3: Multiple linear regression: Methods, Model Results, Interpreting the regression table.
  • Do not complete the following sections (you’ll be doing this at the resubmission phase):
    • Section 3 subsections 3.4, 3.5: Inference for multiple regression
    • Section 4: Discussion. You will write this conclusion based on the results of sections 3.4 and 3.5.
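To give a feel for what the regression subsections involve, here is a minimal sketch of fitting a multiple linear regression with one numerical and one categorical explanatory variable. It uses simulated data and base R's `lm()`. The variable names, the simulated coefficients, and the choice of a model without an interaction term are assumptions made for illustration, so follow the template for the model you are actually asked to fit.

```r
# Simulated data with the same structure as the project specifications:
# numerical outcome y, numerical x1, categorical x2 with 3 levels.
set.seed(2024)
n <- 60
sim <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = factor(rep(c("a", "b", "c"), each = 20))
)
group_shift <- ifelse(sim$x2 == "b", 3, ifelse(sim$x2 == "c", -1, 0))
sim$y <- 5 + 2 * sim$x1 + group_shift + rnorm(n)

# Fit the model (no interaction term in this sketch).
model <- lm(y ~ x1 + x2, data = sim)

# Regression table: one row per term, with estimate and standard error.
reg_table <- summary(model)$coefficients
print(round(reg_table, 3))
```

In the regression table, level "a" of `x2` is the baseline, so the rows `x2b` and `x2c` are offsets relative to it, and the `x1` row is the slope shared by all three groups.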

5.2 What to submit

Only one group member will make a single submission on behalf of the whole group. They will submit:

  1. The project_submission.Rmd R Markdown file with sections 1, 2, 3.1, 3.2, and 3.3 completed
  2. The project_submission.html HTML report file.
  3. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.

6 Project Resubmission

Get a sense for the requirements of this project phase by re-reading all the sections of the possible project resubmission from the previous submission phase. In particular, read the following new sections:

  • Sections 3.4, 3.5: Inference for multiple regression
  • Section 4: Discussion.

6.1 Revise your initial submission

Using the same project_submission.Rmd file you submitted for the project submission phase:

  • Incorporate any feedback given to you from the project submission phase.
  • Complete Sections 3.4 and 3.5: Inference for multiple regression
  • Complete Section 4: Discussion. You will write this conclusion based on the results of sections 3.4 and 3.5.
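For the inference sections, the sketch below shows one way to obtain confidence intervals and p-values for the terms of a fitted regression model in base R. As before, the data are simulated and the model without an interaction term is an assumption for illustration; the template may ask for a different model or different output.

```r
# Simulated data: numerical outcome y, numerical x1, categorical x2.
set.seed(2024)
n <- 60
sim <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = factor(rep(c("a", "b", "c"), each = 20))
)
group_shift <- ifelse(sim$x2 == "b", 3, ifelse(sim$x2 == "c", -1, 0))
sim$y <- 5 + 2 * sim$x1 + group_shift + rnorm(n)

model <- lm(y ~ x1 + x2, data = sim)

# 95% confidence intervals, one row per term in the model.
ci <- confint(model, level = 0.95)
print(round(ci, 3))

# p-values for the test that each coefficient equals zero.
p_values <- summary(model)$coefficients[, "Pr(>|t|)"]
print(signif(p_values, 3))
```

A confidence interval that excludes zero and a small p-value for the same term tell a consistent story, which is the kind of evidence the Discussion section should draw on.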

6.2 What to submit

Only one group member will make a single submission on behalf of the whole group. They will submit:

  1. The updated project_submission.Rmd R Markdown file.
  2. The updated project_submission.html HTML report file.
  3. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.

7 Project Presentation

Prepare a 15-20 minute presentation of your dataset and results. Unless you talk to me beforehand, your presentation should include slides. You should pretend the audience is full of business executives who do not know anything about data science.

In addition, you will be given an evaluation form for the other groups’ presentations.

Presenting technical content to a non-technical audience is an important skill, and one that is difficult to master. Give yourself plenty of time to design a good presentation, and practice it at least 3 times before the real presentation! If you can, try recording a practice presentation so you can really see what it’s like.

8 Peer Evaluations

You will be given a peer evaluation form for each group member, including one for yourself.