R Functions

An opinionated list of the most common data wrangling functions. It leans heavily into the Tidyverse.

Import/Export

read_csv() imports data from a CSV file. (It handles data types better than the base R read.csv().) Use write_csv() when you need to export as CSV. Example: read_csv("path/to/file.csv").
write_rds() saves a data frame as an .rds R data file, which preserves all the data types. Use read_rds() to import R data. Example: read_rds("path/to/file.rds").
readxl is a package we didn’t use, but it has read_excel() that allows you to import from an Excel file, including specified sheets and ranges.
clean_names() from the janitor package standardizes column names.
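
As a rough sketch of how these import/export functions fit together, the example below cleans column names, writes the data out as CSV and .rds, and reads both back in. The data frame, its column names, and the temporary file paths are all made up for illustration; substitute your own paths.

```r
library(readr)
library(janitor)

# Hypothetical data with untidy column names (spaces and capitals)
songs <- data.frame(
  `Song Title` = c("Song A", "Song B"),
  `Peak Position` = c(1, 5),
  check.names = FALSE
)

# clean_names() converts the names to snake_case
songs_clean <- clean_names(songs)

# Temporary files stand in for real paths like "path/to/file.csv"
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export the same data in both formats
write_csv(songs_clean, csv_path)
write_rds(songs_clean, rds_path)

# Import it again; show_col_types = FALSE only silences the column message
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)
```

The .rds round trip returns the data frame as it was saved, while the CSV round trip has to guess column types again on import.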

Data manipulation

select() to select columns. Example: select(col01, col02) or select(-excluded_col).
rename() to rename a column. Example: rename(new_name = old_name).
filter() to filter rows of data. Example: filter(column_name == "value").
- See Relational Operators like ==, >, >= etc.
- See Logical operators like &, | etc.
- See is.na(), which tests if a value is missing.
distinct() will filter rows down to the unique values of the columns given.
arrange() sorts data based on values in a column. Use desc() to reverse the order. Example: arrange(desc(col_name)), or equivalently arrange(col_name %>% desc()).
mutate() changes an existing column or creates a new one. Example: mutate(new_col = (col01 / col02)).
round() is a base R function that rounds a number to a set number of decimal places. Often used within a mutate() function.
recode(), if_else() and case_when() are all functions that can be used with mutate() to create new categorizations with your data.
pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. Example: pivot_longer(cols = 3:5, names_to = "new_key_col_name", values_to = "new_val_col_name") will take the third through the fifth columns and turn each value into a new row of data. It will put them into two columns: The first column will have the name you give it in names_to and contain the old column name that corresponds to each value pivoted. The second column will have the name of whatever you set in values_to and will contain all the values from each of the columns.
pivot_wider() is the opposite of pivot_longer(). Example: pivot_wider(names_from = col_of_key_values, values_from = col_with_values).
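
The functions above are usually chained together with the pipe. Here is a minimal sketch, assuming a made-up sales table (the store names, column names, and numbers are hypothetical), that first selects, filters, mutates, and arranges, and then reshapes the same data long and wide again.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per store, one column per year
sales <- tibble(
  store = c("North", "South", "East"),
  region = c("A", "A", "B"),
  y2021 = c(100, 250, 80),
  y2022 = c(150, 200, 120)
)

# Drop a column, keep some rows, add new columns, then sort
growth <- sales %>%
  select(-region) %>%
  filter(y2021 >= 100) %>%
  mutate(
    pct_change = round((y2022 - y2021) / y2021 * 100, 1),
    trend = if_else(pct_change > 0, "up", "down")
  ) %>%
  arrange(desc(pct_change))

# Lengthen: each year column becomes its own row
long <- sales %>%
  pivot_longer(cols = y2021:y2022, names_to = "year", values_to = "amount")

# Widen: undo the pivot, restoring one column per year
wide <- long %>%
  pivot_wider(names_from = year, values_from = amount)
```

In this sketch, long has two rows per store (one for each year column), and wide has the same columns as the original sales table.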

Aggregation

group_by() and summarize() often come together. When you use group_by(), every function after it is broken down by that grouping. We often add arrange() to these, calling this our GSA functions. Example: group_by(song, artist) %>% summarize(weeks = n(), top_chart_position = min(peak_position)). To break or remove groupings, use ungroup().
count() is a shortcut for GSA that counts the number of rows based on the variable groups you feed it.
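
To make the GSA pattern concrete, the sketch below runs group_by(), summarize(), and arrange() on a small made-up chart table (the songs, artists, and positions are hypothetical), then gets the same row counts with count(). It uses .groups = "drop" inside summarize() as one way to remove the grouping; calling ungroup() afterward would be an alternative.

```r
library(dplyr)

# Hypothetical chart data: one row per song per week on the chart
charts <- tibble(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Y"),
  peak_position = c(3, 1, 2, 10, 7)
)

# Group, summarize, arrange (GSA)
gsa <- charts %>%
  group_by(song, artist) %>%
  summarize(
    weeks = n(),                              # rows in each group
    top_chart_position = min(peak_position),  # best (lowest) position
    .groups = "drop"                          # return an ungrouped result
  ) %>%
  arrange(desc(weeks))

# count() collapses the group/count/sort steps into one call
counted <- charts %>%
  count(song, artist, name = "weeks", sort = TRUE)
```

Both results list Song A first with three weeks; count() only counts rows, so it cannot produce the top_chart_position column.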

Math

These are the functions often used within summarize():

n() to count the number of rows. n_distinct() counts the unique values.
sum() to add things together.
mean() to get an average.
median() to get the median.
min() to get the smallest value. max() for the largest.
+, -, *, / are math operators similar to a calculator.
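
As a quick illustration of the functions above, this sketch summarizes a small made-up table (the team labels and point values are hypothetical) using each one inside a single summarize() call.

```r
library(dplyr)

# Hypothetical scores
scores <- tibble(
  team = c("a", "a", "b", "b", "b"),
  points = c(10, 20, 5, 5, 20)
)

stats <- scores %>%
  summarize(
    rows = n(),                         # number of rows
    unique_teams = n_distinct(team),    # number of distinct teams
    total = sum(points),
    average = mean(points),
    middle = median(points),
    lowest = min(points),
    highest = max(points),
    spread = max(points) - min(points)  # math operators work here too
  )
```

Without a group_by() before it, summarize() collapses the whole table into a single row of results.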