Appendix C — R Functions
An opinionated list of the most common data wrangling functions. It leans heavily into the Tidyverse.
Another good place to figure out which functions to use, and how to use them, is the Posit Recipes page.
C.1 Import/Export
- read_csv() imports data from a CSV file. (It handles data types better than the base R read.csv().) Use write_csv() when you need to export as CSV. Example: read_csv("path/to/file.csv").
- write_rds() saves a data frame as an .rds R data file. This preserves all the data types. Use read_rds() to import R data. Example: read_rds("path/to/file.rds").
- readxl is a package we didn’t use, but it has read_excel(), which allows you to import from an Excel file, including specified sheets and cell ranges.
- clean_names() from the janitor package standardizes column names.
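As a minimal sketch of how these import and export functions fit together, the round trip below writes a small data frame to CSV, reads it back, cleans the column names, and then saves and restores it as .rds. The data frame, its column names, and the temporary file paths are all made up for illustration.

```r
library(readr)
library(janitor)

# Hypothetical example data with messy column names
messy <- data.frame(
  "First Name" = c("Ada", "Grace"),
  "Total Sales" = c(10.5, 20),
  check.names = FALSE
)

# Temporary files stand in for real paths like "path/to/file.csv"
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export to CSV, import it again, then standardize the column names
write_csv(messy, csv_path)
imported <- clean_names(read_csv(csv_path, show_col_types = FALSE))
names(imported)

# Save as .rds and read it back; column data types are preserved
write_rds(imported, rds_path)
restored <- read_rds(rds_path)
```

After clean_names(), the columns are named first_name and total_sales, which are easier to type in later code.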
C.2 Data manipulation
- select() to select columns. Example: select(col01, col02) to select columns, or select(!c(col1, col2)) to remove them.
- rename() to rename a column. Example: rename(new_name = old_name).
- filter() to filter rows of data. Example: filter(column_name == "value").
  - See Relational Operators like ==, >, >= etc.
  - See Logical operators like &, | etc.
  - See is.na(), which tests if a value is missing.
- distinct() will filter rows down to the unique values of the columns given.
- arrange() sorts data based on values in a column. Use desc() to reverse the order. Example: arrange(col_name %>% desc()).
- mutate() changes an existing column or creates a new one. Example: mutate(new_col = (col01 / col02)).
  - We’ll typically use round_half_up() in journalism. This is often used within a mutate() function. (The base R round() function rounds halves to the nearest even number, so 12.5 and 11.5 are both rounded to 12.)
- if_else(), case_match() and case_when() are all functions that can be used with mutate() to create new categorizations with your data.
- pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. Example: pivot_longer(cols = 3:5, names_to = "new_key_col_name", values_to = "new_val_col_name") will take the third through the fifth columns and turn each value into a new row of data. It will put them into two columns: The first column will have the name you give it in names_to and contain the old column name that corresponds to each value pivoted. The second column will have the name you set in values_to and will contain all the values from each of the columns.
- pivot_wider() is the opposite of pivot_longer(). Example: pivot_wider(names_from = col_of_key_values, values_from = col_with_values). See the pivot_wider() documentation for details.
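The sketch below strings several of these functions together on a small, made-up table of yearly counts by city. The data, the column names (city, state, y2020 and so on) and the "large"/"medium" cutoff are assumptions chosen only to illustrate the calls; round_half_up() is assumed to come from the janitor package.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical data: one row per city, one column per year
cities <- tibble(
  city = c("Austin", "Dallas", "Houston"),
  state = c("TX", "TX", "TX"),
  y2020 = c(10, 25, 40),
  y2021 = c(12, 20, 45),
  y2022 = c(15, 22, 50)
)

# Lengthen: columns 3 through 5 become rows, one row per city and year
long <- cities %>%
  pivot_longer(cols = 3:5, names_to = "year", values_to = "count")

# Drop a column, keep some rows, add a category, sort largest first
result <- long %>%
  select(!state) %>%
  filter(count >= 20) %>%
  mutate(size = if_else(count >= 40, "large", "medium")) %>%
  arrange(count %>% desc())

# Widen: the opposite operation restores the original shape
wide <- long %>%
  pivot_wider(names_from = year, values_from = count)

# Rounding: base R rounds halves to even, round_half_up() rounds them up
base_rounded <- round(c(11.5, 12.5))
half_up_rounded <- round_half_up(c(11.5, 12.5))
```

Here long has nine rows (three cities times three years), and wide has the same columns and values as cities. The two rounding results differ only for 12.5, which base R rounds down to the even number 12.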
C.3 Aggregation
- group_by() and summarize() often come together. When you use group_by(), every function after it is broken down by that grouping. We often add arrange() to these, calling this our GSA functions. Example: group_by(song, artist) %>% summarize(weeks = n(), top_chart_position = min(peak_position)). To break or remove groupings, use ungroup().
- count() is a shortcut for GSA that counts the number of rows based on the variable groups you feed it.
- tabyl() is another shortcut for GSA, but it also calculates percentages.
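To make the GSA pattern concrete, here is a small sketch that reuses the column names from the example above (song, artist, peak_position) on invented chart data, where each row stands for one week on the chart. The songs, artists and positions are hypothetical, and tabyl() is assumed to come from the janitor package.

```r
library(dplyr)
library(janitor)

# Hypothetical chart data: one row per song per week on the chart
charts <- tibble(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B", "Song C"),
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Y", "Artist X"),
  peak_position = c(5, 3, 1, 10, 8, 40)
)

# Group, summarize, arrange: weeks on chart and best position per song
gsa <- charts %>%
  group_by(song, artist) %>%
  summarize(weeks = n(), top_chart_position = min(peak_position)) %>%
  ungroup() %>%
  arrange(desc(weeks))

# count() does the group, count and sort steps in one call
counted <- charts %>%
  count(artist, sort = TRUE)

# tabyl() counts rows per artist and adds a percent column
tab <- tabyl(charts, artist)
```

The lowest peak_position is the best chart position, which is why the summary uses min() rather than max().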
C.4 Math
These are the functions often used within summarize():
- n() to count the number of rows.
- n_distinct() counts the unique values.
- sum() to add things together.
- mean() to get an average.
- median() to get the median.
- min() to get the smallest value.
- max() for the largest.
- +, -, *, / are math operators similar to a calculator.
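As a quick illustration, the sketch below calls each of these functions inside a single summarize() on a tiny invented data set. The team and points columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: points scored, labeled by team
scores <- tibble(
  team = c("a", "a", "b", "b", "b"),
  points = c(10, 20, 5, 5, 20)
)

stats <- scores %>%
  summarize(
    rows = n(),                         # number of rows
    teams = n_distinct(team),           # number of unique teams
    total = sum(points),                # all points added together
    average = mean(points),             # mean of points
    middle = median(points),            # median of points
    smallest = min(points),             # smallest value
    largest = max(points),              # largest value
    spread = max(points) - min(points)  # math operators work here too
  )
```

Adding group_by(team) before summarize() would produce the same set of columns once per team instead of once for the whole table.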