For our third project you’ll be developing a web-based dashboard to visualize and analyze baseball data. This dashboard will serve as an interactive tool to explore various statistics and trends within the dataset provided. The backend of the application must be implemented in Python, while you have the freedom to choose the technologies used for the frontend and data visualization components.

Note: I realize not everyone is familiar with baseball. You do NOT need to become an expert in baseball to do this project. There is enough flexibility in the requirements that you can learn a few things about baseball and still ask/answer some very interesting questions about the data. Many data scientists start a new job knowing NOTHING about their dataset, so this is pretty typical, and is an opportunity to learn how to extract and present interesting insights from any dataset. This will prepare you for real-world scenarios.

As in the first two projects, we will continue to challenge your programming skills while building on your ability to work in a team, use version control effectively, and apply an Agile development methodology.

Learning Goals

  • Learn SQL for data retrieval and manipulation.
  • Learn to integrate Python with a SQL database (PostgreSQL) for data retrieval.
  • Practice researching and analyzing suitable technologies for a given project.
  • Gain experience wrangling and manipulating data in Python.
  • Explore data visualization techniques and tools to present data in an interactive and insightful manner.
  • Team-based software development and project management:
    • Agile methodology to plan, execute, and track progress, including the use of sprints, user stories, and retrospectives.
    • Git and GitHub for version control and collaboration
    • GitHub pull requests and comments for code reviews
    • Git feature branch workflow
  • Independent learning and problem-solving skills
  • Continue practicing presentation skills

The Dataset: Lahman’s Baseball Database

The project will utilize Lahman’s Baseball Database, a comprehensive dataset containing historical baseball data. This dataset includes player performance statistics, team data, batting and pitching statistics, and more, spanning from 1871 to 2022. You will have access to a PostgreSQL database containing this information.

Key Components of the Dataset

  • People: Player names, birth dates, and biographical info.
  • Batting: Batting statistics for players.
  • Pitching: Pitching statistics for players.
  • Teams: Yearly stats and information for each team.
  • Awards: Player awards (e.g., MVP, Cy Young).

Querying the Data

I will email you the access information necessary to access the database. You may NOT use other forms of the Lahman database — you MUST use SQL to retrieve the data from my database. You may, however, add other baseball datasets, in whatever format you choose, if you’re feeling ambitious.

You will need to learn SQL to query the data. There are many online resources available, including W3 Schools, Codecademy, and the free, interactive book Select Star SQL.

The baseball data is split up into tables. To ask and answer interesting questions, you will need to join some of these tables together. This will likely be the hardest aspect of SQL you’ll need to learn. See more about joins here and here.

Although SQL is the query language, you will need to use this from WITHIN Python. I highly recommend using the Python module psycopg2 as an “adapter” to the database, SQLAlchemy to run your queries, and Pandas for data manipulation.

Dashboard Technologies

You are required to build a web-based dashboard, as opposed to a native desktop or mobile app. Unlike last project’s website, this app requires both a “front” end (HTML/CSS/JavaScript) and a “back” end that retrieves the data and perhaps does some processing. Your backend MUST be in Python.

Other than that, you are free to choose what your dashboard will be built with. In fact, research different technologies is one of the learning goals for this project! There are some Python libraries made specifically for dashboards. There are many others made for websites, and you can build dashboards on top of them.

Here are a few libraries to help you start your search:

  • Plotly Dash is made for dashboards. Under the hood it uses Plotly to create visualizes and Flask as the web serving software.
  • Panel is also made for dashboards, and is built off of Bokeh.
  • Streamlit is a new library that is easy to build with. It seems less flexible than the other choices, but I don’t have any direct experience with it, so I may be wrong.
  • Bokeh is a plotting library and server, though it is a bit more “low-level” than something like Panel.
  • Flask is a general purpose web framework. You can build any kind of website with it. It is very flexible, though you’ll need to have some other visualization packages to create your visualizations.

Find out more about Python dashboards here.

Agile Development and GitHub Usage

As before, you are expected to adopt an Agile development methodology for this project. Plan your work in one-week sprints, create user stories, maintain a backlog of tasks, estimate the difficulty of tasks and keep track of your velocity.

User GitHub for version control, collaborating on code through feature branches and pull requests. I should see pull requests from everyone, as well as comments on those pull requests that serve as code reviews.

Weekly Checkpoints and Requirements

  • Week 1: You need to accomplish two separate tasks during this week:
    • Research technologies and build a basic prototype or toy dashboard. It’s fine to follow an online tutorial to do this. You might even build multiple toy dashboards to get a feel for which dashboard/visualization technology you like best.
    • Begin learning SQL and exploring the dataset. You should be able to retrieve part of the dataset from Python.
  • Week 2: Integrate week 1’s two pieces so that the dashboard shows some baseball information. Focus on simple things, like showing descriptive analytics.
  • Week 3: Expand the dashboard to include interactive, exploratory visualizations. The user should be able to filter or subset parts of the data. Refine the user interface and user experience based on feedback.
  • Week 4: Attempt to incorporate a descriptive model into the dashboard. Prepare for the final presentation.

Notice that there are no specific required visualizations or questions to answer — it’s up to you! Imagine that you work for a baseball executive in the 90s, before analytics was a big part of baseball. What might be interesting and useful to them? This is actually the hardest part of the job for any data analyst or data scientist: figuring out what will help your company’s decision makers, even though they don’t even know!

What is a “Descriptive Model”

For week 4 above I mentioned you need a descriptive model. A descriptive model aims to summarize or describe the main features of a dataset of the relationships between variables within it. Unlike predictive models, which forecast unobserved outcomes or future trends, descriptive models focus on presenting the current data in a manner that is easy to understand, highlighting patterns, trends, and relationships that are already present in the dataset.

In the context of this project, a descriptive model could involve:

  • Correlation Analysis: Probably the most expected form. Identifying and quantifying the relationships between variables. For instance, exploring how batting average correlates with runs batted in (RBIs) or home runs.
  • Trend Analysis: Looking at how certain metrics (e.g., player performance, team wins) change over time. For example, how has power hitting or base stealing changed since the 1980s?
  • Segmentation and Clustering: Grouping similar data points together to identify distinct categories within the data. For example, categorizing pitchers into different roles based on their performance statistics.

Normally, summary statistics would also count as a descriptive model, but that is too simple for your last week; I expect summary statistics to be implemented earlier.

You will probably need some additional Python libraries, such as statsmodels or scikit-learn to implement such a model.

Presentations

As before, you’ll have weekly checkpoint presentations. The requirements for those are unchanged; please see project 2’s description of them.

Your final presentation will be similar, but I want you to start the presentation as if you are presenting to a baseball executive. Demo your dashboard, explain why it’s useful and what questions it can answer, without using technical jargon. Then, for the rest of the presentation, target intro programming students and include the rest of the final presentation components mentioned in project 2.