Note
This term project is heavily based on material created by Jenny
Smetzer, William Hopper, Beth Brown, and Albert Y. Kim, with some
modifications.
Instructions
Everything in this course builds up to the term group project, where
there is only one learning goal: Engage in the data/science research
pipeline in as faithful a manner as possible while maintaining a level
suitable for novices.
In order to break down the task and minimize end-of-semester stress,
you’ll be working on the project in the following phases:
- Project groups: Form groups.
- Project data: Propose a data set for your project.
This is the phase of the project that is the least straightforward.
Thus, we recommend you start early and get help during office hours
early and often.
- Project proposal: Ensure you can work with your
data in R by performing an exploratory data analysis. This phase may
require revisions to your original choice of data.
- Proposal peer review: You will read other groups’
proposals, evaluate them, and make suggestions.
- Project submission: Make an initial submission of
your project. Depending on the course schedule, you may skip some of the
sections and only complete them after we have covered inference for
regression in class. After you submit your work, you will get instructor
feedback.
- Project resubmission: Incorporate your instructor
feedback from the project submission phase, complete the remaining
sections, and resubmit your project. You will only be graded on your
project resubmission.
- Project presentation: Present your results to the
class in a 15-20 minute presentation. You should pretend the audience is
full of business executives who do not know anything about data
science.
- Peer Evaluations: Evaluate your group members.
Timeline
Unless otherwise noted, all deadlines are at the start of class on
that day, i.e., 9:00 AM.
Phase | Deadline
Groups | October 11
Data | October 11
Proposal | October 27
Proposal Peer Review | November 3
Submission | November 17
Resubmission | December 1
Presentation | December 6
Peer Evaluations | December 11 (midnight)
Project Groups
- Form groups of 2 students.
- All group members are expected to contribute, and you will all be
held accountable for your contributions in peer evaluations.
- Choose a group name.
- One member of the group should email me their group name and the
names of all members of their group.
Project Data
This is the phase of the project that is the least
straightforward. Thus, we recommend you start early and get help during
office hours early and often.
- Get a sense for the requirements of this project phase by reading a
possible example data proposal (Note that your data proposal
will likely differ slightly from this one):
- Download the following data.Rmd template R Markdown file and fill it in.
Find data
Specifications
Find a dataset that fits these specifications. Note your data may
need a little wrangling from its original form.
- (If available) An identification variable that uniquely identifies
each observation in each row.
- A numerical outcome variable \(y\).
Note: binary outcome variables with 0/1 values are not technically
numerical.
- Two explanatory variables:
- A numerical explanatory variable \(x_1\). Note: this can be some notion of
time.
- A categorical explanatory variable \(x_2\) that has between 3 and 5 levels. Note: If
your variable has more than 5 levels, they can be collapsed into 5 using
data wrangling later.
- At least 50 observations/rows.
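As a rough illustration of the specifications above, the following base R sketch checks a data frame against them and collapses a categorical variable with too many levels. The helper names `check_project_data` and `collapse_levels`, the column names, and the simulated `toy` data are all hypothetical and not part of the course templates. The test for "more than two distinct values" is only a simple stand-in for ruling out binary outcomes.

```r
# Hypothetical helper: checks a data frame against the specifications above.
# y, x1, x2 are the names (as strings) of your own columns.
check_project_data <- function(data, y, x1, x2) {
  x2_levels <- unique(na.omit(data[[x2]]))
  c(
    enough_rows    = nrow(data) >= 50,
    # numeric with more than two distinct values (rules out 0/1 outcomes)
    y_numeric      = is.numeric(data[[y]]) &&
                       length(unique(na.omit(data[[y]]))) > 2,
    x1_numeric     = is.numeric(data[[x1]]),
    x2_categorical = is.character(data[[x2]]) || is.factor(data[[x2]]),
    x2_3_to_5      = length(x2_levels) >= 3 && length(x2_levels) <= 5
  )
}

# Hypothetical helper: keep the `keep` most frequent levels and relabel
# every other non-missing value as "Other".
collapse_levels <- function(x, keep = 4) {
  x <- as.character(x)
  top <- names(sort(table(x), decreasing = TRUE))[seq_len(keep)]
  ifelse(is.na(x) | x %in% top, x, "Other")
}

# Simulated stand-in data (not real): 60 rows, x2 has 6 levels.
set.seed(1)
toy <- data.frame(
  id = 1:60,
  y  = rnorm(60),
  x1 = runif(60),
  x2 = rep(c("a", "b", "c", "d", "e", "f"), times = c(20, 15, 10, 8, 4, 3))
)

before <- check_project_data(toy, "y", "x1", "x2")
print(before)   # x2_3_to_5 fails here because x2 has 6 levels

# Collapse to the 4 most frequent levels plus "Other" (5 levels in total).
toy$x2_small <- collapse_levels(toy$x2, keep = 4)
after <- check_project_data(toy, "y", "x1", "x2_small")
print(after)
```

If you prefer tidyverse tools, `forcats::fct_lump_n()` performs a similar collapse; the base R version is used here only so the sketch runs without extra packages.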
Possible sources
Here are some possible data sources:
- Best option: data from your own research or other courses! The more
connected you feel with your data, the more motivated you will be for
this project.
- Next best options: Online data repositories such as (but not limited
to):
- You may not use the following data:
- Any datasets used in this class, either in ModernDive or in any of
the examples.
- Any data from the data journalism website FiveThirtyEight.com
Note on data confidentiality
If your data is not confidential or sensitive in nature, then publish
your data as a CSV file on Google Sheets. That way your group can all
access a single copy of your data on the web. If your data is
confidential or sensitive in nature, do not publish it
on the web, but rather submit the Excel or CSV file with your other files.
You can publish your data as a CSV file on Google Sheets by following
the 6 steps in this Twitter thread:
What to submit
Only one group member will make a single submission on behalf of the
whole group. They will submit:
- The data.Rmd R Markdown file
- The data.html HTML report file
- Only if your data is confidential or sensitive in nature, submit
your Excel or CSV file. Otherwise you should publish your data as a CSV
file on Google Sheets as described above.
Hints
- Where is this heading?: For the next project phase
(project proposal), you will be making a visualization like this one. If you can make a visualization like this one,
then your data is set for the rest of the project.
- Feel free to bring questions to class, in addition to office hours.
Your questions might be useful to other groups, too!
- Only minimal data wrangling using the dplyr package is
expected at this time; you will be doing more for the “project proposal”
phase coming up. That being said, feel free to experiment!
- Disclaimer: Just because things seem good now doesn’t mean your
project data is set for the semester. Unforeseen problems may crop up during
the next phase on data wrangling, at which point you may need to revise
your data. This is a reality of data collection in the real world!
- Do not include any View() statements in your .Rmd
files as this may cause an error.
- Avoid “data dumps”, for example showing the contents of all 1000
rows in a data frame. This will make your output document really large
and unreadable.
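The last two hints can be sketched as follows: instead of View() or printing every row, show a compact preview. This is a minimal base R sketch on simulated data; the `project_data` data frame and its column names are hypothetical placeholders for your own data.

```r
# Simulated stand-in for project data with 1000 rows (not real data).
set.seed(42)
project_data <- data.frame(
  id = seq_len(1000),
  y  = rnorm(1000, mean = 50, sd = 10),
  x1 = runif(1000, min = 0, max = 100),
  x2 = sample(c("low", "medium", "high"), size = 1000, replace = TRUE)
)

# Instead of View(project_data) or printing all rows, show the first 5 rows.
preview <- head(project_data, n = 5)
print(preview)

# One line per column: its type and the first few values.
str(project_data)

# Summarize the categorical variable rather than listing every value.
level_counts <- table(project_data$x2)
print(level_counts)
```

With dplyr loaded, `glimpse()` and `count()` give comparable previews; either approach keeps the knitted HTML report short.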
Project Proposal
This phase may require revisions to your original choice of
data.
- Get a sense for the requirements of this project phase by reading a
possible example project proposal (Note that your
proposal will likely differ slightly from this one):
- Download the following proposal.Rmd template R
Markdown file and start filling it in.
Work on your proposal
Your data may require some wrangling to get it in the appropriate
format. In addition to Chapter 3 of ModernDive, you may want to look at
Appendix C: Tips & Tricks. It’s based on the
seven most common data wrangling questions the authors encountered from
students while they were working on their term projects:
What to submit
Only one group member will make a single submission on behalf of the
whole group. They will submit:
- The proposal.Rmd R Markdown file
- The proposal.html HTML report file
- Only if your data is confidential or sensitive in nature, submit
your Excel or CSV file as well. Otherwise you should publish your data
as a CSV file on Google Sheets as described above.
Proposal Peer Review
You will be sent proposals from other groups and an evaluation form
for each. Fill out the evaluation form and turn it in at the start of
class. Each group member should fill out separate evaluation forms.
Project Submission
- Get a sense for the requirements of this project phase by reading
only the following sections of this possible project resubmission (Note that your project
will likely differ slightly from this one):
- Section 1: Introduction
- Section 2: Exploratory data analysis
- Section 3 subsections 3.1, 3.2, and 3.3: Multiple linear regression:
Methods, Model Results, Interpreting the regression table.
- Download the following project_submission.Rmd template
R Markdown file and start filling it in.
Complete your initial submission
- Complete the following sections of project_submission.Rmd:
- Section 1: Introduction
- Section 2: Exploratory data analysis
- Section 3 subsections 3.1, 3.2, and 3.3: Multiple linear regression:
Methods, Model Results, Interpreting the regression table.
- Do not complete the following sections (you’ll be doing this at the
resubmission phase):
- Section 3 subsections 3.4, 3.5: Inference for multiple
regression
- Section 4: Discussion. You will write this conclusion based on the
results of sections 3.4 and 3.5.
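To give a sense of what sections 3.1 to 3.3 involve, here is a minimal base R sketch of fitting a multiple regression with one numerical and one categorical explanatory variable and printing its regression table. The simulated `sim` data and the parallel slopes formula `y ~ x1 + x2` are assumptions for illustration; your template may call for an interaction model (`y ~ x1 * x2`) instead.

```r
# Simulated stand-in data (not real): 90 rows, x2 has levels A, B, C.
set.seed(2024)
n <- 90
sim <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = factor(rep(c("A", "B", "C"), each = n / 3))
)
group_offset <- c(A = 0, B = 1, C = -1)
sim$y <- 2 + 0.5 * sim$x1 +
  unname(group_offset[as.character(sim$x2)]) +
  rnorm(n, sd = 1)

# Parallel slopes model: one common slope for x1, one intercept per level.
model <- lm(y ~ x1 + x2, data = sim)

# Regression table: estimate, standard error, t statistic, p-value.
reg_table <- coef(summary(model))
print(round(reg_table, 3))

# Reading the rows:
#   (Intercept): fitted y for the baseline level "A" when x1 = 0
#   x1:          change in fitted y per one-unit increase in x1
#   x2B, x2C:    difference in intercept relative to level "A"
```

The moderndive package's `get_regression_table()` presents similar information in a tidier layout, if you are following the ModernDive workflow.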
What to submit
Only one group member will make a single submission on behalf of the
whole group. They will submit:
- The project_submission.Rmd R Markdown file with
sections 1, 2, 3.1, 3.2, and 3.3 completed
- The project_submission.html HTML report file.
- Only if your data is confidential or sensitive in nature, submit
your Excel or CSV file as well. Otherwise you should publish your data
as a CSV file on Google Sheets as described above.
Project Resubmission
Get a sense for the requirements of this project phase by re-reading
all the sections of the possible project resubmission from the previous submission
phase. In particular, read the following new sections:
- Sections 3.4, 3.5: Inference for multiple regression
- Section 4: Discussion.
Revise your initial submission
Using the same project_submission.Rmd file you submitted
for the project submission phase:
- Incorporate any feedback given to you from the project submission
phase.
- Complete Sections 3.4 and 3.5: Inference for multiple
regression
- Complete Section 4: Discussion. You will write this conclusion based
on the results of sections 3.4 and 3.5.
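As a hedged sketch of the kind of output the inference sections draw on, the base R code below fits a regression to simulated data and collects 95% confidence intervals and p-values for each coefficient. The `sim` data, the formula `y ~ x1 + x2`, and the `inference` table are illustrative assumptions; the exact content required in sections 3.4 and 3.5 is defined by the template, not by this example.

```r
# Simulated stand-in data (not real): 90 rows, x2 has levels A, B, C.
set.seed(2024)
n <- 90
sim <- data.frame(
  x1 = runif(n, min = 0, max = 10),
  x2 = factor(rep(c("A", "B", "C"), each = n / 3))
)
group_offset <- c(A = 0, B = 1, C = -1)
sim$y <- 2 + 0.5 * sim$x1 +
  unname(group_offset[as.character(sim$x2)]) +
  rnorm(n, sd = 1)

model <- lm(y ~ x1 + x2, data = sim)

# 95% confidence intervals for each coefficient.
ci <- confint(model, level = 0.95)

# p-values for the test that each coefficient equals zero.
p_values <- coef(summary(model))[, "Pr(>|t|)"]

# Combine point estimates, intervals, and p-values into one table.
inference <- cbind(estimate = coef(model), ci, p_value = p_values)
print(round(inference, 4))
```

A confidence interval that excludes zero corresponds to a p-value below 0.05 for that coefficient, which is one way to connect the two columns when writing the discussion.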
What to submit
Only one group member will make a single submission on behalf of the
whole group. They will submit:
- The updated project_submission.Rmd R Markdown file.
- The updated project_submission.html HTML report file.
- Only if your data is confidential or sensitive in nature, submit
your Excel or CSV file as well. Otherwise you should publish your data
as a CSV file on Google Sheets as described above.
Project Presentation
Prepare a 15-20 minute presentation of your dataset and results.
Unless you talk to me beforehand, your presentation should include
slides. You should pretend the audience is full of business executives
who do not know anything about data science.
In addition, you will be given an evaluation form for the other
groups’ presentations.
Presenting technical content to a non-technical audience is an
important skill, and one that is difficult to master. Give yourself
plenty of time to design a good presentation, and practice it at least
3 times before the real presentation! If you can, try
recording a practice presentation so you can really see what it’s
like.
Peer Evaluation
You will be given a peer evaluation form for each group member,
including one for yourself.