library(tidyverse)
library(janitor)
library(rvest)
Baseball Reference batting
This pulls standard batting statistics from baseball-reference.com. We want the "Player Standard Batting" tables, which sit partway down the page for a given season. There are actually two of them, one for the regular season and one for the playoffs, and we want multiple seasons.
Setup
Demonstrate pulling a single year
br_year <- 2023
# Builds the url for the standard batting page
<- paste0("https://www.baseball-reference.com/leagues/majors/", br_year, "-standard-batting.shtml")
url
url
[1] "https://www.baseball-reference.com/leagues/majors/2023-standard-batting.shtml"
Then we'll get all the tables on the page to see what they look like. We don't actually use that object later, but it shows that you can build a list of tables. First, though, we read in the HTML.
# reads in the HTML
br_batting_raw <- read_html(url)

br_batting_raw
{html_document}
<html data-version="klecko-" data-root="/home/br/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="br">\n<div id="wrap">\n \n <div id="header" role="banner"> ...
That just shows we are getting the HTML of the page.
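As a quick sanity check (not part of the original workflow), I could confirm I grabbed the right page by pulling its title with html_element and html_text:

# pulls the page title as a quick check that we fetched the right page
br_batting_raw |> html_element("title") |> html_text()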
Plucking out tables
We could get all the tables with html_table()
all_tables <- br_batting_raw |> html_table()

all_tables
[[1]]
# A tibble: 33 × 29
Tm `#Bat` BatAge `R/G` G PA AB R H `2B` `3B` HR
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Arizona … 54 27.4 4.60 162 6124 5436 746 1359 274 44 166
2 Atlanta … 53 27.9 5.85 162 6249 5597 947 1543 293 23 307
3 Baltimor… 50 27.2 4.98 162 6123 5495 807 1399 309 28 183
4 Boston R… 56 28.6 4.77 162 6174 5562 772 1437 339 19 182
5 Chicago … 48 28.4 5.06 162 6220 5504 819 1399 269 30 196
6 Chicago … 56 27.8 3.96 162 5980 5501 641 1308 264 13 171
7 Cincinna… 65 26.8 4.83 162 6195 5499 783 1371 268 37 198
8 Clevelan… 50 26.6 4.09 162 6096 5513 662 1379 294 29 124
9 Colorado… 57 28.1 4.45 162 6055 5496 721 1368 305 31 163
10 Detroit … 51 27.4 4.08 162 6080 5478 661 1292 245 29 165
# ℹ 23 more rows
# ℹ 17 more variables: RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>,
# BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>,
# GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>, LOB <chr>
[[2]]
# A tibble: 881 × 34
Rk Player Age Team Lg WAR G PA AB R H `2B`
<int> <chr> <int> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <int>
1 1 Marcus Sem… 32 TEX AL 7.4 162 753 670 122 185 40
2 2 Ronald Acu… 25 ATL NL 8.2 159 735 643 149 217 35
3 3 Freddie Fr… 33 LAD NL 6.5 161 730 637 131 211 59
4 4 Alex Bregm… 29 HOU AL 4.9 161 724 622 103 163 28
5 5 Nathaniel … 27 TEX AL 2.6 161 724 623 89 163 38
6 6 Matt Olson* 29 ATL NL 7.4 162 720 608 127 172 27
7 7 Kyle Schwa… 30 PHI NL 0.6 160 720 585 108 115 19
8 8 Steven Kwa… 25 CLE AL 3.6 158 718 638 93 171 36
9 9 Austin Ril… 26 ATL NL 5.9 159 715 636 117 179 32
10 10 Julio Rodr… 22 SEA AL 5.3 155 714 654 102 180 37
# ℹ 871 more rows
# ℹ 22 more variables: `3B` <int>, HR <int>, RBI <int>, SB <int>, CS <int>,
# BB <int>, SO <int>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>,
# `OPS+` <int>, rOBA <dbl>, `Rbat+` <int>, TB <int>, GIDP <int>, HBP <int>,
# SH <int>, SF <int>, IBB <int>, Pos <chr>, Awards <chr>
[[3]]
# A tibble: 293 × 30
Rk Player Age Team Lg G PA AB R H `2B` `3B`
<int> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
1 1 Corey Seag… 29 TEX AL,WS 17 82 66 18 20 6 0
2 2 Marcus Sem… 32 TEX AL,WS 17 82 76 12 17 2 1
3 3 Ketel Mart… 29 ARI NL,WS 17 79 73 6 24 7 1
4 4 Corbin Car… 22 ARI NL,WS 17 78 66 11 18 1 1
5 5 Christian … 32 ARI NL,WS 17 75 60 7 13 5 0
6 6 Evan Carte… 20 TEX AL,WS 17 72 60 9 18 9 0
7 7 Nathaniel … 27 TEX AL,WS 17 72 66 10 14 2 0
8 8 Jonah Heim# 28 TEX AL,WS 17 71 66 5 14 0 0
9 9 Lourdes Gu… 29 ARI NL,WS 17 70 66 5 18 3 0
10 10 Josh Jung 25 TEX AL,WS 17 70 65 13 20 4 1
# ℹ 283 more rows
# ℹ 18 more variables: HR <int>, RBI <int>, SB <int>, CS <int>, BB <int>,
# SO <int>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>, TB <int>, GIDP <int>,
# HBP <int>, SH <int>, SF <int>, IBB <int>, Pos <chr>, Awards <chr>
If I then wanted the second table, I could use:

all_tables[[2]]
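If I weren't sure which list element was which, one way to peek (a quick sketch, assuming the tables keep the column headers shown above) is to map over the list and look at the first few column names of each:

# prints the first few column names of each table to tell them apart
all_tables |> map(\(tbl) names(tbl)[1:4])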
Instead, I wanted to find the specific tables more precisely by keying in on a specific element in the HTML code. I found that the batting tables I wanted had the ids players_standard_batting and players_standard_batting_post. I can use the html_element function to find the specific table I want, then use html_table to convert it to a data frame.
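If you don't already know those ids, rvest can list them. This is a small sketch, not part of the original workflow, that pulls the id attribute off every table element on the page (Baseball Reference sometimes tucks tables inside HTML comments, so it may not catch every one):

# lists the id of every table on the page so we can find the ones we want
br_batting_raw |> html_elements("table") |> html_attr("id")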
I do some other things to clean up the data, like cleaning the column names and adding season and season_type (regular vs. playoffs) columns.
# finds the regular season batting table
# cleans names, adds year, adds season type
br_batting_reg <- br_batting_raw |>
  html_element("#players_standard_batting") |>
  html_table() |>
  clean_names() |>
  mutate(
    season = br_year,
    season_type = "Regular",
    .before = rk
  )

# finds the playoff batting table
# cleans names, adds year, adds season type
br_batting_post <- br_batting_raw |>
  html_element("#players_standard_batting_post") |>
  html_table() |>
  clean_names() |>
  mutate(
    season = br_year,
    season_type = "Playoffs",
    .before = rk
  )

br_batting_reg
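Since both data frames now carry season and season_type columns, they could also be stacked into a single table later on. That isn't part of this workflow, but a minimal sketch would be (the br_batting_2023 name is just for illustration; bind_rows fills columns that exist in only one table with NA):

# stacks the regular season and playoff batting into one data frame
br_batting_2023 <- bind_rows(br_batting_reg, br_batting_post)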
Export the data
Now that I have that data, I can export it to a file. I'm using the paste0 function to build the file name based on the year I'm working with.
<- paste0("data-raw/batting/br_bat_reg_", br_year, ".rds")
export_url_reg <- paste0("data-raw/batting/br_bat_post_", br_year, ".rds")
export_url_post
export_url_reg
[1] "data-raw/batting/br_bat_reg_2023.rds"
export_url_post
[1] "data-raw/batting/br_bat_post_2023.rds"
And then I export:
br_batting_reg |> write_rds(export_url_reg)
br_batting_post |> write_rds(export_url_post)
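One note: write_rds won't create the data-raw/batting folder if it doesn't already exist. A base R one-liner handles that (adjust the path if your project is laid out differently):

# creates the export folder if it isn't already there
dir.create("data-raw/batting", recursive = TRUE, showWarnings = FALSE)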
Create scraping function
Here we turn what we learned above into a function so we can loop through a range of years.
scrape_batting <- function(br_year) {

  # Builds the url for the standard batting page
  url <- paste0("https://www.baseball-reference.com/leagues/majors/", br_year, "-standard-batting.shtml")

  # reads in the HTML
  br_batting_raw <- read_html(url)

  # finds the regular season batting table
  # cleans names, adds year, adds season type
  br_batting_reg <- br_batting_raw |>
    html_element("#players_standard_batting") |>
    html_table() |>
    clean_names() |>
    mutate(
      season = br_year,
      season_type = "Regular",
      .before = rk
    )

  # finds the playoff batting table
  # cleans names, adds year, adds season type
  br_batting_post <- br_batting_raw |>
    html_element("#players_standard_batting_post") |>
    html_table() |>
    clean_names() |>
    mutate(
      season = br_year,
      season_type = "Playoffs",
      .before = rk
    )

  # builds the export path for each based on year
  export_url_reg <- paste0("data-raw/batting/br_bat_reg_", br_year, ".rds")
  export_url_post <- paste0("data-raw/batting/br_bat_post_", br_year, ".rds")

  # the actual export
  br_batting_reg |> write_rds(export_url_reg)
  br_batting_post |> write_rds(export_url_post)
}
Do the deed
Here I’m pulling just four years, but the range could be extended.
# Sets a range of years to collect
# Sets a range of years to collect
yrs <- c(2000:2003)

# Creates a loop to get those files
for (i in yrs) {
  scrape_batting(i)
}
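Sites like baseball-reference.com can rate-limit rapid requests, so if I extend the range it's worth pausing between calls. A gentler version of the loop might look like this (the five-second pause is an arbitrary choice):

# a gentler loop that pauses between requests
for (i in yrs) {
  scrape_batting(i)
  Sys.sleep(5) # pause so we don't hammer the site
}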