Playbill
This is a good example of a table where you have to dig into specific cells to pull out the individual pieces. That said, this notebook does not work that part out yet.
Setup
library(tidyverse)
library(janitor)
library(rvest)
Basic scrape
Create a url based on the week
week <- "2024-10-13"
url <- paste0("https://playbill.com/grosses?week=", week)
url
[1] "https://playbill.com/grosses?week=2024-10-13"
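If more than one week is needed, the URL step can be wrapped in a small function. This is only a sketch: `playbill_url()` is a hypothetical helper name, and it assumes the site accepts other week-ending Sundays in the same `week=` format as the example above, which has not been checked here.

```r
# Hypothetical helper: build the grosses URL for one or more week dates
playbill_url <- function(week) {
  paste0("https://playbill.com/grosses?week=", week)
}

# Three consecutive Sundays ending at the week used above (assumed valid weeks)
weeks <- seq(as.Date("2024-09-29"), as.Date("2024-10-13"), by = "week")
urls <- playbill_url(format(weeks, "%Y-%m-%d"))
urls
```

Because `paste0()` is vectorized, the same helper handles a single week or a whole vector of them.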
Read in the url to get the page
raw <- read_html(url)
raw
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="ua-desktop bsp-site-header-slidingnav " style="" data-dl-act ...
Find the table and get its contents.
main_table <- raw |> html_element(".bsp-table") |> html_table()
main_table
This works, but the resulting table is a mess and needs a ton of cleaning. See the "Refactoring this code" section below.
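As a rough idea of what a first cleaning pass could look like, here is a sketch on a made-up stand-in table. The column names and values in `messy` are purely illustrative and are not taken from the real scrape; the point is only the pattern of `clean_names()` followed by `parse_number()` on currency text.

```r
library(tibble)
library(dplyr)
library(janitor)
library(readr)

# Made-up stand-in for the scraped table (names and values are illustrative only)
messy <- tibble(
  `Show` = "Example Show Example Theatre",
  `This Week Gross` = "$1,000,000.00"
)

tidy <- messy |>
  clean_names() |>                                         # snake_case column names
  mutate(this_week_gross = parse_number(this_week_gross))  # drop "$" and "," and convert to numeric
tidy
```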
Export
Export this awful table to a folder called playbill to keep it separate from other data.
export_path <- paste0("data-raw/playbill/playbill_", week, ".rds")
export_path
[1] "data-raw/playbill/playbill_2024-10-13.rds"
main_table |> write_rds(export_path)
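One thing to watch: `write_rds()` does not create missing folders, so the export fails if the playbill folder does not exist yet. A minimal sketch of guarding against that follows. It writes under `tempdir()` and uses a one-row stand-in tibble so it can be run anywhere; both are assumptions for illustration and not part of the original workflow.

```r
library(readr)
library(tibble)

week <- "2024-10-13"

# Same file name pattern as above, rooted in tempdir() for this sketch only
export_path <- file.path(
  tempdir(), "data-raw", "playbill",
  paste0("playbill_", week, ".rds")
)

# Create the folder (and any parents) if it is not there yet
dir.create(dirname(export_path), recursive = TRUE, showWarnings = FALSE)

# Stand-in for the scraped table
main_table <- tibble(show = "& Juliet")
main_table |> write_rds(export_path)
```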
Refactoring this code
This “works”, but given how the table is formatted there are probably better (if more complicated) ways to pull the data out cleanly, especially from columns like the first one, which hold the show name and the theater in the same td. Those cells carry data-label attributes and class names that could be targeted directly.
I just don’t know yet whether that would actually be better. It would take some work to find out.
<td data-label="Show" class="col-0">
<a href="https://playbill.com/production/gross?production=c3b6dace-a78e-439f-b3f2-bde3381bc6ff" data-cms-ai="0" rel="external">
<span class="data-value">& Juliet</span>
</a>
<span class="subtext">Stephen Sondheim Theatre</span>
</td>
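A sketch of what the cell-level approach could look like, using only the cell above. The snippet is wrapped in a minimal table via `minimal_html()` so it runs offline, and the `&` is written as `&amp;`, as it would be in raw HTML source. It assumes every show cell on the real page follows this same structure, which has not been verified.

```r
library(rvest)
library(tibble)

# The single cell from above, wrapped in a table so it parses as a <td>
page <- minimal_html('
<table class="bsp-table"><tbody><tr>
  <td data-label="Show" class="col-0">
    <a href="https://playbill.com/production/gross?production=c3b6dace-a78e-439f-b3f2-bde3381bc6ff" rel="external">
      <span class="data-value">&amp; Juliet</span>
    </a>
    <span class="subtext">Stephen Sondheim Theatre</span>
  </td>
</tr></tbody></table>')

# One node per show cell, selected by its data-label attribute
show_cells <- page |> html_elements("td[data-label='Show']")

# Pull each piece out of the cell into its own column
shows <- tibble(
  show    = show_cells |> html_element(".data-value") |> html_text2(),
  theater = show_cells |> html_element(".subtext") |> html_text2(),
  link    = show_cells |> html_element("a") |> html_attr("href")
)
shows
```

Using `html_element()` (singular) on the set of cells returns exactly one result per cell, with `NA` where a piece is missing, so the columns stay aligned row by row.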