Baseball Reference batting

This pulls standard batting statistics from baseball-reference.com. We want the “Player Standard Batting” tables, which sit partway down the league batting page for each season (2024, for example). There are actually two tables per page, one for the regular season and one for the playoffs, and we want them for multiple seasons.

Setup

library(tidyverse)
library(janitor)
library(rvest)

Demonstrate pulling a single year

br_year <- 2023

# Builds the url for the standard batting page
url <- paste0("https://www.baseball-reference.com/leagues/majors/", br_year, "-standard-batting.shtml")

url
[1] "https://www.baseball-reference.com/leagues/majors/2023-standard-batting.shtml"
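
Because paste0() is vectorized, the same pattern can build the URLs for several seasons at once. This is a minimal sketch; the helper name build_batting_url is hypothetical and only wraps the paste0() call shown above.

```r
# Hypothetical helper: wraps the paste0() call used above
build_batting_url <- function(year) {
  paste0(
    "https://www.baseball-reference.com/leagues/majors/",
    year,
    "-standard-batting.shtml"
  )
}

# paste0() recycles the fixed strings across the vector of years,
# so this returns one URL per season
urls <- build_batting_url(2021:2023)
urls
```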

Then we read in the HTML of the page. This is the object we will pull the tables out of.

# reads in the HTML
br_batting_raw <- read_html(url)

br_batting_raw
{html_document}
<html data-version="klecko-" data-root="/home/br/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="br">\n<div id="wrap">\n  \n  <div id="header" role="banner"> ...

That just shows we are getting the HTML of the page.

Plucking out tables

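We could get all the tables on the page with html_table() to see what they look like. We don’t use the all_tables object later, but it shows that you can build a list of tables.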

all_tables <- br_batting_raw |> html_table()

all_tables
[[1]]
# A tibble: 33 × 29
   Tm        `#Bat` BatAge `R/G` G     PA    AB    R     H     `2B`  `3B`  HR   
   <chr>     <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 Arizona … 54     27.4   4.60  162   6124  5436  746   1359  274   44    166  
 2 Atlanta … 53     27.9   5.85  162   6249  5597  947   1543  293   23    307  
 3 Baltimor… 50     27.2   4.98  162   6123  5495  807   1399  309   28    183  
 4 Boston R… 56     28.6   4.77  162   6174  5562  772   1437  339   19    182  
 5 Chicago … 48     28.4   5.06  162   6220  5504  819   1399  269   30    196  
 6 Chicago … 56     27.8   3.96  162   5980  5501  641   1308  264   13    171  
 7 Cincinna… 65     26.8   4.83  162   6195  5499  783   1371  268   37    198  
 8 Clevelan… 50     26.6   4.09  162   6096  5513  662   1379  294   29    124  
 9 Colorado… 57     28.1   4.45  162   6055  5496  721   1368  305   31    163  
10 Detroit … 51     27.4   4.08  162   6080  5478  661   1292  245   29    165  
# ℹ 23 more rows
# ℹ 17 more variables: RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>,
#   BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>,
#   GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>, LOB <chr>

[[2]]
# A tibble: 881 × 34
      Rk Player        Age Team  Lg      WAR     G    PA    AB     R     H  `2B`
   <int> <chr>       <int> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <int>
 1     1 Marcus Sem…    32 TEX   AL      7.4   162   753   670   122   185    40
 2     2 Ronald Acu…    25 ATL   NL      8.2   159   735   643   149   217    35
 3     3 Freddie Fr…    33 LAD   NL      6.5   161   730   637   131   211    59
 4     4 Alex Bregm…    29 HOU   AL      4.9   161   724   622   103   163    28
 5     5 Nathaniel …    27 TEX   AL      2.6   161   724   623    89   163    38
 6     6 Matt Olson*    29 ATL   NL      7.4   162   720   608   127   172    27
 7     7 Kyle Schwa…    30 PHI   NL      0.6   160   720   585   108   115    19
 8     8 Steven Kwa…    25 CLE   AL      3.6   158   718   638    93   171    36
 9     9 Austin Ril…    26 ATL   NL      5.9   159   715   636   117   179    32
10    10 Julio Rodr…    22 SEA   AL      5.3   155   714   654   102   180    37
# ℹ 871 more rows
# ℹ 22 more variables: `3B` <int>, HR <int>, RBI <int>, SB <int>, CS <int>,
#   BB <int>, SO <int>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>,
#   `OPS+` <int>, rOBA <dbl>, `Rbat+` <int>, TB <int>, GIDP <int>, HBP <int>,
#   SH <int>, SF <int>, IBB <int>, Pos <chr>, Awards <chr>

[[3]]
# A tibble: 293 × 30
      Rk Player        Age Team  Lg        G    PA    AB     R     H  `2B`  `3B`
   <int> <chr>       <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
 1     1 Corey Seag…    29 TEX   AL,WS    17    82    66    18    20     6     0
 2     2 Marcus Sem…    32 TEX   AL,WS    17    82    76    12    17     2     1
 3     3 Ketel Mart…    29 ARI   NL,WS    17    79    73     6    24     7     1
 4     4 Corbin Car…    22 ARI   NL,WS    17    78    66    11    18     1     1
 5     5 Christian …    32 ARI   NL,WS    17    75    60     7    13     5     0
 6     6 Evan Carte…    20 TEX   AL,WS    17    72    60     9    18     9     0
 7     7 Nathaniel …    27 TEX   AL,WS    17    72    66    10    14     2     0
 8     8 Jonah Heim#    28 TEX   AL,WS    17    71    66     5    14     0     0
 9     9 Lourdes Gu…    29 ARI   NL,WS    17    70    66     5    18     3     0
10    10 Josh Jung      25 TEX   AL,WS    17    70    65    13    20     4     1
# ℹ 283 more rows
# ℹ 18 more variables: HR <int>, RBI <int>, SB <int>, CS <int>, BB <int>,
#   SO <int>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>, TB <int>, GIDP <int>,
#   HBP <int>, SH <int>, SF <int>, IBB <int>, Pos <chr>, Awards <chr>

If I then wanted the second table, I could pull it out of the list by position:

all_tables[[2]]

Instead, I wanted to find the specific table more precisely by keying in on a specific element in the HTML code. I found that the batting tables I wanted had the id players_standard_batting and players_standard_batting_post. I can use the html_element function to find the specific table I want, then use html_table to convert it to a data frame.
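
To see the difference between picking a table by position and by id without hitting the site, here is a small sketch on a made-up page. The table ids and the data are invented for illustration; only minimal_html(), html_element() and html_table() come from rvest.

```r
library(rvest)

# A toy page with two tables; ids and values are made up
toy_page <- minimal_html('
  <table id="team_totals">
    <tr><th>Tm</th><th>HR</th></tr>
    <tr><td>Team A</td><td>200</td></tr>
  </table>
  <table id="player_totals">
    <tr><th>Player</th><th>HR</th></tr>
    <tr><td>Player A</td><td>30</td></tr>
    <tr><td>Player B</td><td>25</td></tr>
  </table>
')

# By position: depends on the order of tables on the page
by_position <- html_table(toy_page)[[2]]

# By id: keeps working even if another table is added above it
by_id <- toy_page |>
  html_element("#player_totals") |>
  html_table()

by_id
```

Selecting by id is less likely to break if the site adds or reorders tables, which is why I prefer it here.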

I do some other things to clean up the data, like cleaning the column names and adding season and season_type (regular vs. playoffs) columns.

# finds the regular season batting table
# cleans names, adds year, adds season type
br_batting_reg <- br_batting_raw |> 
  html_element("#players_standard_batting") |>
  html_table() |> 
  clean_names() |> 
  mutate(
    season = br_year,
    season_type = "Regular",
    .before = rk
  )

# finds the playoff batting table
# cleans names, adds year, adds season type
br_batting_post <- br_batting_raw |> 
  html_element("#players_standard_batting_post") |>
  html_table() |> 
  clean_names() |> 
  mutate(
    season = br_year,
    season_type = "Playoffs",
    .before = rk
  )

br_batting_reg
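
The `.before = rk` argument works only because clean_names() has already lowercased the `Rk` column. A small sketch on made-up data shows the effect; the toy tibble is an assumption and stands in for the scraped table.

```r
library(dplyr)
library(janitor)

# Toy stand-in for the scraped table; values are made up
toy <- tibble(
  Rk = 1:2,
  Player = c("Player A", "Player B"),
  `2B` = c(10L, 20L),
  `OPS+` = c(100L, 120L)
)

cleaned <- toy |>
  clean_names() |>          # e.g. Rk becomes rk; names become syntactic
  mutate(
    season = 2023,
    season_type = "Regular",
    .before = rk            # new columns are placed in front of rk
  )

names(cleaned)
```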

Export the data

Now that I have that data, I can export it to a file.

I’m using the paste0 function to build the file name based on the year I’m working with.

export_url_reg <- paste0("data-raw/batting/br_bat_reg_", br_year, ".rds")
export_url_post <- paste0("data-raw/batting/br_bat_post_", br_year, ".rds")

export_url_reg
[1] "data-raw/batting/br_bat_reg_2023.rds"
export_url_post
[1] "data-raw/batting/br_bat_post_2023.rds"

And then I export both tables as .rds files.

br_batting_reg |> write_rds(export_url_reg)
br_batting_post |> write_rds(export_url_post)
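
write_rds() does not create missing folders, so the export fails if data-raw/batting does not exist yet. A minimal sketch of creating the folder first and checking the round trip; it writes to a temporary directory and uses made-up data, both of which are assumptions for illustration.

```r
library(readr)

# Use a temp folder here so the sketch does not touch the project
out_dir <- file.path(tempdir(), "data-raw", "batting")

# recursive = TRUE builds nested folders; showWarnings = FALSE
# keeps it quiet if the folder is already there
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy stand-in for br_batting_reg
toy <- data.frame(season = 2023, player = "Player A")

out_path <- file.path(out_dir, paste0("br_bat_reg_", 2023, ".rds"))
write_rds(toy, out_path)

# Read it back to confirm the file holds what we wrote
round_trip <- read_rds(out_path)
```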

Create scraping function

Here we turn what we learned above into a function so we can loop through a range of years.

scrape_batting <- function(br_year) {

  # Builds the url for the standard batting page
  url <- paste0("https://www.baseball-reference.com/leagues/majors/", br_year, "-standard-batting.shtml")
  
  # reads in the HTML
  br_batting_raw <- read_html(url)
  
  # finds the regular season batting table
  # cleans names, adds year, adds season type
  br_batting_reg <- br_batting_raw |> 
    html_element("#players_standard_batting") |>
    html_table() |> 
    clean_names() |> 
    mutate(
      season = br_year,
      season_type = "Regular",
      .before = rk
    )
  
  # finds the playoff batting table
  # cleans names, adds year, adds season type
  br_batting_post <- br_batting_raw |> 
    html_element("#players_standard_batting_post") |>
    html_table() |> 
    clean_names() |> 
    mutate(
      season = br_year,
      season_type = "Playoffs",
      .before = rk
    )
  
  # builds the export path for each based on year
  export_url_reg <- paste0("data-raw/batting/br_bat_reg_", br_year, ".rds")
  export_url_post <- paste0("data-raw/batting/br_bat_post_", br_year, ".rds")
  
  # the actual export
  br_batting_reg |> write_rds(export_url_reg)
  br_batting_post |> write_rds(export_url_post)

}

Do the deed

Here I’m pulling just four years (2000 through 2003), but the range could be extended.

# Sets a range of years to collect
yrs <- 2000:2003

# Creates a loop to get those files
for (i in yrs) {
  scrape_batting(i)
}
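
For a longer range of years, it may help to pause between requests and keep going if one season fails. The sketch below is one way to do that, not part of the original workflow: scrape_years is a hypothetical wrapper, and the pause length is a guess rather than a documented limit, so check the site's terms of use before settling on a value.

```r
# Hypothetical wrapper: runs a scraping function over several years,
# pausing between requests and recording failures instead of stopping
scrape_years <- function(years, scrape_fn, pause = 5) {
  results <- character(length(years))

  for (idx in seq_along(years)) {
    results[idx] <- tryCatch(
      {
        scrape_fn(years[idx])
        "ok"
      },
      # Keep the error message so failed years can be retried later
      error = function(e) paste("failed:", conditionMessage(e))
    )

    # Wait between requests, but not after the last one
    if (idx < length(years)) Sys.sleep(pause)
  }

  setNames(results, years)
}

# With the function defined above, the call would look like this
# (commented out because it needs network access):
# scrape_years(2000:2003, scrape_batting)
```

The returned vector is named by year, so a quick look at it shows which seasons need a retry.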