In today's post Sam Bornstein breaks down the basics of getting started analyzing baseball data with R. If you're an aspiring baseball analyst, this post is a MUST READ! He will dive into where you can find publicly available MLB data, give you the exact code you need to get started, and walk you through several examples of ways that code can be applied.
The dplyr package is the backbone of data manipulation in R. It is a simple, yet critical library that contains several useful functions for working with data frame-like objects.
For this article we will be using pitch-by-pitch Statcast data from the 2020 MLB season. This data was acquired from Bill Petti’s baseballr package. I have provided a download link for this data set, as well as a GitHub R script with the code used to acquire the data. Additionally, there is an R script in the same folder with the exact code you will see in this post.
Inside the Statcast data set we will be focusing on this year’s National League Rookie of the Year and three Cy Young finalists: Devin Williams, Trevor Bauer, Jacob deGrom, and Yu Darvish, respectively. If you are a beginner-level programmer in R, this article will give you the foundation to manipulate Statcast data to create meaningful data frames.
Set Up Your Workspace
Before we jump into the exercise, let’s make sure your RStudio environment is ready to go. Once the dplyr package is installed (if it isn’t already), we will need to load the library, set our working directory, and import the Statcast data set. Here is the necessary code to do so.
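# Install dplyr (only needed once per machine)
install.packages("dplyr")
# Load the package for this session
library(dplyr)
# Point R at the folder that holds the CSV file
setwd("~/PATH TO FOLDER WITH CSV FILE")
# Read the 2020 Statcast pitch data into a data frame
statcast_data <- read.csv("mlb_2020_statcast_pitcher.csv")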
Everyone's working directory will be different, so this is where you would replace "PATH TO FOLDER WITH CSV FILE" with the path on your own computer. This is the folder where your Statcast .csv file is located. For example, my file path is "~/BASEBALL/Simple Sabermetrics" because within my Documents folder I have a folder titled "BASEBALL" which holds my "Simple Sabermetrics" folder. If you have any trouble doing this, check out the R script on GitHub for assistance, or please reach out to me with questions! Without further ado, let’s begin.
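One optional refinement to the setup code above, offered as a sketch rather than a required step: install dplyr only when it is missing, and confirm the CSV file is where R expects it before reading it. The variable name csv_path is my own placeholder; the file name is the one used in this post.

```r
# Install dplyr only if it is not already available,
# so re-running the script does not reinstall it every time
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)

# csv_path is a placeholder name; the file name comes from this post
csv_path <- "mlb_2020_statcast_pitcher.csv"

if (file.exists(csv_path)) {
  statcast_data <- read.csv(csv_path)
  print(dim(statcast_data))             # number of rows and columns
  print(head(names(statcast_data), 10)) # first ten column names
} else {
  # Most often this means the working directory is not the CSV folder
  message("Could not find ", csv_path, " in ", getwd())
}
```

If the message branch runs, double-check the folder you passed to setwd().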
Filtering, Selecting, and Arranging Data
First, we will want to filter our original dataset, 'statcast_data', to the desired players. In this exercise, we are going to create two separate data frames, 'NL_ROY' and 'NL_CY'.
NL_ROY <- statcast_data %>%
  filter(player_name == "Devin Williams")

NL_CY <- statcast_data %>%
  filter(player_name %in% c("Trevor Bauer", "Jacob deGrom", "Yu Darvish"))
In the 'NL_ROY' data frame, we have filtered for just Devin Williams by telling the filter() function that "player_name" must exactly equal, ==, his name. On the other hand, we want to filter for three different players in the 'NL_CY' data frame. To do this, we use the %in% operator, which keeps any row whose "player_name" matches one of several values, and we supply those pitchers as a vector built with c(). If you have written dplyr code previously, this setup should look familiar. If not, this is an organized way to keep your syntax consistent. Let me break this down for you.
new_data_frame <- old_data_frame %>%
  dplyr_function(...)  # placeholder for any dplyr function, such as filter()
Thinking of the first two data frames we created, 'statcast_data' is the original data we imported, and 'NL_ROY' and 'NL_CY' are the newly created data frames. The dplyr function(s) follow in the subsequent lines of code, chained together by the use of the pipe operator: %>%. As I like to call it, it is a comma on steroids.
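To make the pipe and the == versus %in% distinction concrete, here is a minimal sketch on a made-up data frame. The toy_pitches data and the "Pitcher A/B/C" names are hypothetical and are not real Statcast rows.

```r
library(dplyr)

# Hypothetical toy data, not real Statcast rows
toy_pitches <- data.frame(
  player_name = c("Pitcher A", "Pitcher B", "Pitcher C", "Pitcher A"),
  pitch_type  = c("FF", "CH", "FF", "CH"),
  stringsAsFactors = FALSE
)

# The pipe passes the left-hand side in as the first argument,
# so these two calls do the same thing
piped  <- toy_pitches %>% filter(player_name == "Pitcher A")
nested <- filter(toy_pitches, player_name == "Pitcher A")
nrow(piped) == nrow(nested)

# %in% checks each row against every value in the vector
two_pitchers <- toy_pitches %>%
  filter(player_name %in% c("Pitcher A", "Pitcher C"))
nrow(two_pitchers)

# Pitfall: == with a vector compares position by position (with recycling),
# so it silently matches fewer rows than intended
sum(toy_pitches$player_name == c("Pitcher A", "Pitcher C"))
```

The last line is the reason to reach for %in% whenever you are matching against more than one value.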
Now that we have the basics covered, let’s continue with creating more data frames. The 'NL_ROY' data frame that we created contains 90 variables of data for each pitch Devin Williams threw in the 2020 season. 90 variables is a lot, and we often only need to focus on a few of them for certain tasks. Let’s choose just a handful of columns to look at.
ROY_pitch_info <- NL_ROY %>%
  select(player_name, pitch_type, release_speed, release_spin_rate)
The data frame this chunk of code creates also includes every pitch Williams threw in 2020, but it only returns the four columns we selected. What if we wanted to select more columns but did not want to type out the name of each variable?
ROY_columns <- NL_ROY %>%
  select(player_name, pitch_type:release_speed)
By using a colon between variable names instead of commas, the select() function returns every column from the first name through the second, in the order they appear in the data frame. While this example only spans three columns, this can be applied to a range of any length across a data frame.
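Because the colon range depends on column order, it helps to see it on a small example. This is a sketch on a hypothetical data frame; the column order in toy_pitches is made up for illustration, so check names() on your own data before relying on a range.

```r
library(dplyr)

# Hypothetical toy data; the column order is made up for illustration
toy_pitches <- data.frame(
  pitch_type        = c("FF", "CH"),
  game_date         = c("2020-08-01", "2020-08-01"),
  release_speed     = c(96.1, 84.3),
  release_spin_rate = c(2400, 2800),
  player_name       = c("Pitcher A", "Pitcher A")
)

# The range picks up game_date because it sits between
# pitch_type and release_speed in the data frame
range_cols <- toy_pitches %>%
  select(player_name, pitch_type:release_speed)

names(range_cols)
```

Note that the output columns follow the order written inside select(), so player_name comes first even though it is the last column in the toy data.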
Taking a step back to the pitch info example, let’s add the arrange() function to sort the data to our liking. Let’s say we want to see the data ordered by pitch speed, with the fastest pitches at the top.
ROY_release_speed <- NL_ROY %>%
  select(player_name, pitch_type, release_speed, release_spin_rate) %>%
  arrange(desc(release_speed))
Now let's tie this section together through the use of all three dplyr functions we have learned about thus far: filter(), select(), and arrange().
ROY_FF <- NL_ROY %>%
  select(player_name, pitch_type, release_speed, release_spin_rate) %>%
  filter(pitch_type == "FF") %>%
  arrange(desc(release_speed))
This chunk of code displays all four-seam fastballs for Devin Williams, sorted by pitch speed, showing only his name, pitch type, pitch speed, and spin rate.
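One detail worth knowing about chains like this is that the order of select() and filter() only matters when the column you filter on gets dropped. The sketch below uses a hypothetical toy_pitches data frame to show both orders giving the same result when "pitch_type" is kept.

```r
library(dplyr)

# Hypothetical toy data, not real Statcast rows
toy_pitches <- data.frame(
  player_name   = "Pitcher A",
  pitch_type    = c("FF", "CH", "FF", "CH"),
  release_speed = c(95.2, 84.0, 97.1, 83.5),
  zone          = c(5, 8, 2, 9)
)

# Filter first, then select
filter_first <- toy_pitches %>%
  filter(pitch_type == "FF") %>%
  select(player_name, pitch_type, release_speed) %>%
  arrange(desc(release_speed))

# Select first, then filter (works because pitch_type is still present)
select_first <- toy_pitches %>%
  select(player_name, pitch_type, release_speed) %>%
  filter(pitch_type == "FF") %>%
  arrange(desc(release_speed))

filter_first$release_speed
select_first$release_speed
```

If you wanted to filter on "zone" instead, the filter() step would have to come before the select() step, since "zone" is not among the selected columns.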
Grouping, Summarizing, and Mutating Data
Up until this point we have only worked with the existing data points in the data set, filtering to the desired pitchers or pitch types, selecting the columns we chose to look at, and arranging the order of the rows to our liking. In this next section we are going to create new columns with the group_by(), summarize(), and mutate() functions.
To begin, let's tally the number of times Devin Williams threw each pitch type.
ROY_pitch_count <- NL_ROY %>%
  group_by(pitch_type) %>%
  summarize('pitch_count' = n())
Within the group_by() function we have chosen to group by the "pitch_type" variable. By using the pipe operator, we can then create a new variable, "pitch_count", within the summarize() function. Below is the proper syntax for creating new columns.
summarize('COLUMN NAME' = FUNCTION())
Some functions you can use here include mean(), median(), min(), max(), n(), etc. The n() function counts the number of rows in each group and is very commonly used. Let's run through a few more examples.
ROY_avg_velos <- NL_ROY %>%
  group_by(pitch_type) %>%
  summarize('pitch_count' = n(),
            'average_velocity' = mean(release_speed, na.rm = TRUE))

ROY_range_velos <- NL_ROY %>%
  group_by(pitch_type) %>%
  summarize('pitch_count' = n(),
            'min_velocity' = min(release_speed, na.rm = TRUE),
            'max_velocity' = max(release_speed, na.rm = TRUE))
Calculating the mean, minimum, and maximum of a grouped variable is quite common in data manipulation. These three functions barely scratch the surface of what can be used to create new columns.
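The na.rm = TRUE argument in the examples above is easy to overlook, so here is a small sketch of what it does. The toy_pitches data frame is hypothetical, with one missing velocity added on purpose.

```r
library(dplyr)

# Hypothetical toy data with one missing release_speed
toy_pitches <- data.frame(
  pitch_type    = c("FF", "FF", "CH", "CH"),
  release_speed = c(96, 98, 84, NA)
)

toy_summary <- toy_pitches %>%
  group_by(pitch_type) %>%
  summarize(
    pitch_count      = n(),                              # counts rows, NA or not
    velo_with_na     = mean(release_speed),              # NA if any value is missing
    average_velocity = mean(release_speed, na.rm = TRUE) # drops missing values first
  )

toy_summary
```

In this toy data the changeup group still has a pitch_count of 2, but its velo_with_na is NA, while average_velocity is computed from the one recorded value.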
This next chunk of code ties together multiple pieces we've touched on already, but the focus here is on the mutate() function. The mutate() function differs from summarize() in that it adds a new column (or changes an existing one) while keeping every row, whereas summarize() collapses each group into a single row. Let's use this function to calculate Bauer Units for each of the three pitchers in the 'NL_CY' data frame.
CY_FF_ranks <- NL_CY %>%
  filter(pitch_type == "FF") %>%
  group_by(player_name, pitch_type) %>%
  summarize('average_velocity' = mean(release_speed, na.rm = TRUE),
            'average_spin' = mean(release_spin_rate, na.rm = TRUE)) %>%
  mutate('bauer_units' = round(average_spin / average_velocity, 1))
Let's walk through this one. First, we filter the data frame to only include four-seam fastballs. Second, we form our group on the pitchers and pitch type, which at this point is a little redundant, but we will do so anyway. Next we re-create the "average_velocity" column and add in "average_spin" as well. Finally, using the mutate() function we can perform a calculation on two existing columns - "average_velocity" and "average_spin" - to create the "bauer_units" column.
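To see the difference between mutate() and summarize() side by side, here is a sketch on a hypothetical toy_ff data frame. The pitchers and numbers are made up so the arithmetic is easy to follow: 2400 rpm at 96 mph works out to 25 Bauer Units.

```r
library(dplyr)

# Hypothetical four-seam fastballs, not real Statcast rows
toy_ff <- data.frame(
  player_name       = c("Pitcher A", "Pitcher A", "Pitcher B", "Pitcher B"),
  release_speed     = c(95, 97, 93, 95),
  release_spin_rate = c(2400, 2400, 2256, 2256)
)

# mutate() alone keeps all four rows and adds one column per pitch
per_pitch <- toy_ff %>%
  mutate(bauer_units = round(release_spin_rate / release_speed, 1))
nrow(per_pitch)

# summarize() collapses to one row per pitcher,
# then mutate() adds a column to that smaller table
per_pitcher <- toy_ff %>%
  group_by(player_name) %>%
  summarize(
    average_velocity = mean(release_speed),
    average_spin     = mean(release_spin_rate)
  ) %>%
  mutate(bauer_units = round(average_spin / average_velocity, 1))

per_pitcher
```

The per-pitch version has four rows and the per-pitcher version has two, which is the practical difference between the two functions.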
Wrapping It Up
These six functions (filter(), select(), arrange(), group_by(), summarize(), and mutate()) are the foundation of the dplyr package, but only a fraction of what the library has to offer. Data manipulation is a necessary first step in the process of analyzing baseball data. Not often will you be presented with a data set that is tailored exactly to your needs. Pitch-by-pitch Statcast data has a lot to offer, but can be an intimidating data set at first glance. Manipulating this data with the dplyr package is the first step in the right direction towards analyzing baseball data.
In my next post I will be moving on to data visualization in the ggplot2 library. We will mostly work off the information from this post, but I would love to hear some feedback on what you would like me to cover. If you found this post helpful, want to see what other data manipulation is possible with Statcast data, or have requests for my next post, please leave a comment or reach out to me via Twitter!