library(tidyverse)
library(janitor)
library(httr2)
library(rvest)
Paginated tables
Figuring out how to scrape a table that uses pagination, based on a site a student wants to scrape.
Figure out how the page works
Even before we scrape the page, we need to learn how it works.
- Look at the page in the browser
- Use the Inspect tool on the pagination part of the page
- What type of HTML element displays this data?
- It is a <table> tag, which is good for us. It’s easy to scrape tables with rvest.
- How is the “next page” url formulated?
- If we click on the next page in our table, the browser url doesn’t change. But if you look at the HTML elements that make up the pagination navigation, you can see the url pattern: https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos?page=2 gets you the second page of the table. (A code sketch after this list shows one way to confirm that pattern from R.)
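If you want to double-check that pattern from R instead of the browser, one rough way is to list every link href on the page and keep the ones that contain page=. This is just a sketch, not something you need for the scrape itself, and the pagination links might warrant a more specific selector.

# sketch: pull every link url from the page and keep the ones with "page=" in them
read_html("https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos") |>
  html_elements("a") |>
  html_attr("href") |>
  str_subset("page=")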
Scrape a single page to work out the logic
Before we can scrape all the pages, we need to figure out how to scrape a single one.
Get the html
We use rvest functions to read the entire page into memory. We are saving the URL separately so we can test it with our “paginated” page urls.
<- "https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos?page"
url # url <- "https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos?page=2"
<- read_html(url) html
Find the content on the page
We saw from inspecting the page that our data is in a table. The rvest package has a function to pull all the tables from a page and put them into a list.
Our page only has one table, but the function still saves it into a list, so we have to select the first table from the list of tables.
# puts all the tables on the page into a list we call "tables"
tables <- html |> html_table()

# selects the first table from the list (the one we want)
tables |> _[[1]]
So now we know how to read the html of the page, get a list of all the tables, then pluck out the first table in that list.
Function to parse the page
Now that we know where our table is, we will build a function that, when fed the URL of a page, plucks out that first table based on what we learned above. One additional thing we do here vs. above is to use clean_names() on the resulting table to standardize the column names.
parse_page <- function(our_url) {
  our_url |>
    read_html() |>
    html_table() |>
    _[[1]] |>
    clean_names()
}
# We test this by feeding it the url variable we also used above
parse_page(url)
To make sure this works with one of the paginated pages, you can go back to the top of the script and modify the url variable to pull the page with ?page=2 tacked onto the end.
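You can also test the paginated version directly, without editing the url variable, by passing a full page url to the function:

# quick check: feed parse_page() the second page directly
parse_page("https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos?page=2")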
Get and combine paginated pages
We are lucky that we have a predictable URL pattern that includes sequential numbers. This allows us to create a list of URLs that we can run through our parse_page() function.
We have to feed this the correct number of pages to put together. You can get that by looking at how many pages are in the table’s pagination navigation.
# This range has to be valid. See how many pages are in the table
i <- 1:39

# This creates a list of urls based on that range
urls <- str_glue("https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos?page={i}")

# This takes that list of urls and then runs our parse_page() function on each one.
# The result is a list of tibbles, i.e., a table from each page
requests <- map(urls, parse_page)

# list_rbind() is a special function that binds a list of tibbles into a single one
combined_table <- requests |> list_rbind()

# here we just peek at the combined table
combined_table
Some summary notes
- Since there are a number of pages on this website that have data, it is possible to take this last part above and extrapolate it into a new function that takes two arguments: a) the base URL of the page, and b) the max number of pages in the table. (A sketch of that idea follows these notes.)
- In Hadley’s example, he used some httr2 features to request the pages in parallel, but I couldn’t figure out how to get that to work.
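To sketch the first note above: a wrapper function could glue together the page urls, map parse_page() over them, and bind the results. The function name and arguments below are made up for illustration, but the steps are the same ones used above.

# hypothetical wrapper around the steps above: build the paginated urls,
# parse each page, then bind the resulting tibbles together
scrape_paginated_table <- function(base_url, max_pages) {
  str_glue("{base_url}?page={1:max_pages}") |>
    map(parse_page) |>
    list_rbind()
}

# example call using the page and page count from above
scrape_paginated_table(
  "https://planestrategico.conl.mx/indicadores/detalle/ods/242/datos",
  max_pages = 39
)

For the second note, something like the following might work for the parallel approach. It is an untested sketch that assumes httr2’s request(), req_perform_parallel(), and resp_body_html() behave as documented, and it is not the example referenced above.

# untested sketch: build httr2 request objects, perform them in parallel,
# then run the same rvest/janitor parsing steps on each response body
resps <- map(urls, request) |> req_perform_parallel()

resps |>
  map(resp_body_html) |>
  map(\(page) page |> html_table() |> _[[1]] |> clean_names()) |>
  list_rbind()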