Appendix B — R Functions
An opinionated list of the most common data wrangling functions. It leans heavily into the Tidyverse.
B.1 Import/Export
- read_csv() imports data from a CSV file. (It handles data types better than the base R read.csv().) Use write_csv() when you need to export as CSV. Example: read_csv("path/to/file.csv").
- write_rds() saves a data frame as an .rds R data file, which preserves all the data types. Use read_rds() to import R data. Example: read_rds("path/to/file.rds").
- readxl is a package we didn't use, but it has read_excel(), which allows you to import from an Excel file, including specified sheets and ranges.
- clean_names() from the janitor package (loaded with library(janitor)) standardizes column names.
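The import and export functions above can be sketched as one round trip. This is a minimal illustration, not a recipe from the lessons: the songs data frame and its column names are made up, and the files go to temporary paths so nothing on disk is overwritten.

```r
library(readr)
library(janitor)

# Hypothetical data frame with messy column names (an assumption for this sketch).
songs <- data.frame(
  `Song Title` = c("Song A", "Song B"),
  `Peak Position` = c(1, 5),
  check.names = FALSE
)

# Temporary file paths so the example does not touch real files.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export as CSV, then import it again.
write_csv(songs, csv_path)
imported <- read_csv(csv_path, show_col_types = FALSE)

# Standardize the column names (snake_case by default).
cleaned <- clean_names(imported)
names(cleaned)

# Save as .rds to keep the data types, then read it back.
write_rds(cleaned, rds_path)
restored <- read_rds(rds_path)
```

The .rds step matters most once you have fixed data types, such as dates, that a CSV export would flatten back to text.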
B.2 Data manipulation
- select() to select columns. Example: select(col01, col02) or select(-excluded_col).
- rename() to rename a column. Example: rename(new_name = old_name).
- filter() to filter rows of data. Example: filter(column_name == "value").
  - See Relational Operators like ==, >, >=, etc.
  - See Logical operators like &, |, etc.
  - See is.na(), which tests if a value is missing.
- distinct() will filter rows down to the unique values of the columns given.
- arrange() sorts data based on values in a column. Use desc() to reverse the order. Example: arrange(col_name %>% desc())
- mutate() changes an existing column or creates a new one. Example: mutate(new_col = (col01 / col02)).
- round() is a base R function that rounds a number to a set number of decimal places. Often used within a mutate() function.
- recode(), if_else() and case_when() are all functions that can be used with mutate() to create new categorizations with your data.
- pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. Example: pivot_longer(cols = 3:5, names_to = "new_key_col_name", values_to = "new_val_col_name") will take the third through the fifth columns and turn each value into a new row of data. It will put them into two columns: The first column will have the name you give it in names_to and contain the old column name that corresponds to each value pivoted. The second column will have the name of whatever you set in values_to and will contain all the values from each of the columns.
- pivot_wider() is the opposite of pivot_longer(). Example: pivot_wider(names_from = col_of_key_values, values_from = col_with_values). See the link.
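Several of the data manipulation functions above can be chained in one pipeline. The sketch below assumes a made-up rainfall table (the city names, column names and values are hypothetical) and only shows how the pieces fit together.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per city, one column per year.
rainfall <- tibble(
  city = c("Austin", "Dallas", "Houston"),
  state = c("TX", "TX", "TX"),
  y2020 = c(30.5, 35.2, NA),
  y2021 = c(36.1, 40.3, 52.7),
  y2022 = c(25.4, 22.9, 48.6)
)

# Lengthen: columns 3 through 5 become rows with a year and a value.
rain_long <- rainfall %>%
  pivot_longer(cols = 3:5, names_to = "year", values_to = "inches")

# Select, rename, filter, mutate and arrange in one chain.
rain_clean <- rain_long %>%
  select(-state) %>%
  rename(rain_inches = inches) %>%
  filter(!is.na(rain_inches) & rain_inches > 25) %>%
  mutate(
    rain_cm = round(rain_inches * 2.54, 1),
    level = if_else(rain_inches >= 40, "wet", "moderate")
  ) %>%
  arrange(rain_inches %>% desc())

# Unique values of a single column.
cities <- rain_long %>% distinct(city)

# Widen: the reverse of the pivot_longer() step above.
rain_wide <- rain_long %>%
  pivot_wider(names_from = year, values_from = inches)
```

Here rain_wide has the same shape as the original rainfall table, which is a quick way to check that the two pivots really are opposites.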
B.3 Aggregation
- group_by() and summarize() often come together. When you use group_by(), every function after it is broken down by that grouping. We often add arrange() to these, calling this our GSA functions. Example: group_by(song, artist) %>% summarize(weeks = n(), top_chart_position = min(peak_position)). To break or remove groupings, use ungroup().
- count() is a shortcut for GSA that counts the number of rows based on the variable groups you feed it.
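As a sketch of the GSA pattern and its count() shortcut, the example below runs both on a small made-up chart table. The song and artist values are placeholders; only the column names follow the example in the text.

```r
library(dplyr)

# Hypothetical chart data: one row per song per week on the chart.
charts <- tibble(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Y"),
  peak_position = c(3, 2, 1, 10, 8)
)

# Group, summarize, arrange (GSA), then remove the leftover grouping.
gsa <- charts %>%
  group_by(song, artist) %>%
  summarize(weeks = n(), top_chart_position = min(peak_position)) %>%
  arrange(weeks %>% desc()) %>%
  ungroup()

# count() gives the same row counts in a column named n.
counted <- charts %>%
  count(song, artist)
```

count() only counts rows, so reach for the full GSA form when you also need another summary such as min().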
B.4 Math
These are the functions often used within summarize():
- n() to count the number of rows.
- n_distinct() counts the unique values.
- sum() to add things together.
- mean() to get an average.
- median() to get the median.
- min() to get the smallest value.
- max() for the largest.
- +, -, *, / are math operators similar to a calculator.
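The math functions above can all sit inside a single summarize() call. This is a small sketch on made-up numbers (the scores table is hypothetical), with one math operator used to combine two of the results.

```r
library(dplyr)

# Hypothetical scores table.
scores <- tibble(
  team = c("a", "a", "b", "b", "b"),
  points = c(10, 20, 5, 5, 20)
)

# One summary row for the whole table.
stats <- scores %>%
  summarize(
    rows = n(),
    unique_points = n_distinct(points),
    total = sum(points),
    average = mean(points),
    middle = median(points),
    smallest = min(points),
    largest = max(points),
    spread = max(points) - min(points)
  )
```

Adding group_by(team) before summarize() would produce the same columns once per team instead of once for the whole table.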