Appendix C — R Functions

An opinionated list of the most common data wrangling functions. It leans heavily into the Tidyverse.

Another good place to figure out which functions to use, and how to use them, is the Posit Recipes page.

C.1 Import/Export

  • read_csv() imports data from a CSV file. (It handles data types better than the base R read.csv().) Use write_csv() when you need to export as CSV. Example: read_csv("path/to/file.csv").
  • write_rds() saves a data frame as an .rds R data file, which preserves all the data types. Use read_rds() to import that file again. Example: read_rds("path/to/file.rds").
  • readxl is a package we didn’t use, but it has read_excel() that allows you to import from an Excel file, including specified sheets and cell ranges.
  • clean_names() from the janitor package standardizes column names.
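
The import and export functions above can be sketched as a small round trip. This is only an illustration: the songs data frame, its column names and the temporary file paths are made up for the example, and it assumes the readr and janitor packages are installed.

```r
library(readr)
library(janitor)

# Hypothetical data with messy column names and a factor column
songs <- data.frame(
  "Song Title" = c("Song A", "Song B"),
  "Peak Position" = c(1, 5),
  "Genre" = factor(c("pop", "rock")),
  check.names = FALSE
)

# clean_names() converts the names to snake_case
songs_clean <- clean_names(songs)
names(songs_clean)

# Temporary files stand in for real paths like "path/to/file.csv"
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export the same data both ways
write_csv(songs_clean, csv_path)
write_rds(songs_clean, rds_path)

# Import both files again
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

# The CSV brings genre back as plain text; the .rds file keeps it as a factor
class(from_csv$genre)
class(from_rds$genre)
```

The last two lines show why .rds is handy between notebooks: a CSV stores only text, so anything beyond basic types has to be rebuilt after import.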

C.2 Data manipulation

  • select() to select columns. Example: select(col01, col02) to keep those columns or select(!c(col01, col02)) to remove them.
  • rename() to rename a column. Example: rename(new_name = old_name).
  • filter() to filter rows of data. Example: filter(column_name == "value").
  • distinct() will filter rows down to the unique values of the columns given.
  • arrange() sorts data based on values in a column. Use desc() to sort in descending order. Example: arrange(col_name %>% desc()).
  • mutate() changes an existing column or creates a new one. Example: mutate(new_col = (col01 / col02)).
  • round_half_up() from the janitor package is what we’ll typically use for rounding in journalism, often within a mutate() function. (The base R round() function rounds halves to the nearest even number, so 12.5 and 11.5 are both rounded to 12.)
  • if_else(), case_match() and case_when() are all functions that can be used with mutate() to create new categorizations with your data.
  • pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. Example: pivot_longer(cols = 3:5, names_to = "new_key_col_name", values_to = "new_val_col_name") will take the third through the fifth columns and turn each value into a new row of data. It will put them into two columns: The first column will have the name you give it in names_to and contain the old column name that corresponds to each value pivoted. The second column will have the name of whatever you set in values_to and will contain all the values from each of the columns.
  • pivot_wider() is the opposite of pivot_longer(). Example: pivot_wider(names_from = col_of_key_values, values_from = col_with_values). See the link.
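
A minimal sketch of how several of these functions chain together, assuming the dplyr, tidyr and janitor packages are installed. The scores and wide tibbles, and every column name in them, are hypothetical and exist only for this example.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical data; the last row duplicates the first
scores <- tibble(
  name  = c("Ana", "Ben", "Cy", "Ana"),
  team  = c("red", "blue", "red", "red"),
  made  = c(5, 3, 1, 5),
  tried = c(8, 8, 8, 8)
)

result <- scores %>%
  distinct() %>%                                # drop the duplicate row
  filter(team == "red") %>%                     # keep only the red team
  mutate(
    pct = round_half_up(made / tried * 100),    # 62.5 becomes 63, 12.5 becomes 13
    level = if_else(pct >= 50, "high", "low")   # new categorization
  ) %>%
  rename(player = name) %>%
  select(player, pct, level) %>%
  arrange(desc(pct))

result

# Hypothetical wide data: one column per year
wide <- tibble(
  city   = c("Austin", "Dallas"),
  `2021` = c(10, 20),
  `2022` = c(11, 22)
)

# Lengthen: columns 2 and 3 become rows with a year and a total
long <- wide %>%
  pivot_longer(cols = 2:3, names_to = "year", values_to = "total")

# Widen again: the year values go back to being column names
back <- long %>%
  pivot_wider(names_from = year, values_from = total)
```

With base R round() the two percentages would come out as 62 and 12, which is the difference the round_half_up() bullet describes.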

C.3 Aggregation

  • group_by() and summarize() often come together. When you use group_by(), every function after it is broken down by that grouping. We often add arrange() to these, calling this our GSA functions. Example: group_by(song, artist) %>% summarize(weeks = n(), top_chart_position = min(peak_position)). To break or remove groupings, use ungroup().
  • count() is a shortcut for GSA that counts the number of rows based on the variable groups you feed it.
  • tabyl() is another shortcut for GSA but it also calculates percentages.
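
The three approaches above can be compared side by side. This is a sketch on made-up data: the charts tibble and its values are hypothetical, and it assumes dplyr and janitor are installed.

```r
library(dplyr)
library(janitor)

# Hypothetical chart data: one row per week a song appeared
charts <- tibble(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Y"),
  peak_position = c(3, 1, 2, 10, 7)
)

# Group, summarize, arrange (GSA), then remove the leftover grouping
gsa <- charts %>%
  group_by(song, artist) %>%
  summarize(
    weeks = n(),
    top_chart_position = min(peak_position)
  ) %>%
  ungroup() %>%
  arrange(desc(weeks))

gsa

# count() gets the same row counts in one step
counted <- charts %>%
  count(song, artist, name = "weeks")

# tabyl() counts the rows and adds a percent column
tab <- charts %>%
  tabyl(artist)

tab
```

count() and tabyl() only count rows, so the full group_by() and summarize() pattern is still needed for anything like the top chart position.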

C.4 Math

These are the functions often used within summarize():

  • n() to count the number of rows. n_distinct() counts the unique values.
  • sum() to add things together.
  • mean() to get an average.
  • median() to get the median.
  • min() to get the smallest value. max() for the largest.
  • +, -, *, / are math operators similar to a calculator.
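
As an illustration, the functions in this list can all be used in a single summarize() call. The sales tibble and its column names are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical sales data
sales <- tibble(
  store = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 5, 5, 50)
)

stats <- sales %>%
  summarize(
    rows = n(),                          # number of rows
    stores = n_distinct(store),          # number of unique stores
    total = sum(amount),
    average = mean(amount),
    middle = median(amount),
    smallest = min(amount),
    largest = max(amount),
    spread = max(amount) - min(amount)   # operators work inside summarize() too
  )

stats
```

Adding group_by(store) before summarize() would produce the same calculations once per store instead of once for the whole table.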