R Functions
An opinionated list of the most common data wrangling functions. It leans heavily into the Tidyverse.
Import/Export
- read_csv() imports data from a CSV file. (It handles data types better than the base R read.csv().) Use write_csv() when you need to export as CSV. Example: read_csv("path/to/file.csv").
- write_rds() saves a data frame as an .rds R data file. This preserves all the data types. Use read_rds() to import R data. Example: read_rds("path/to/file.rds").
- readxl is a package we didn’t use, but it has read_excel(), which allows you to import from an Excel file, including specified sheets and ranges.
- clean_names() from the janitor package (loaded with library(janitor)) standardizes column names.
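To see why the .rds format is described as preserving data types, here is a minimal sketch of a round trip through both formats. The data frame, its column names, and the temporary file paths are invented for illustration, and the sketch assumes readr 2.0 or later for the show_col_types argument.

```r
library(readr)
library(tibble)

# Hypothetical sample data; the column names are made up for this sketch.
songs <- tibble(
  title = c("Song A", "Song B"),
  peak_position = c(1L, 12L),  # stored as integer
  chart_date = as.Date(c("2020-01-04", "2020-01-11"))
)

# Temporary files stand in for real paths such as "path/to/file.csv".
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV is plain text, so column types are guessed again on import.
write_csv(songs, csv_path)
songs_csv <- read_csv(csv_path, show_col_types = FALSE)

# RDS stores the R object itself, so types come back as they were saved.
write_rds(songs, rds_path)
songs_rds <- read_rds(rds_path)

# The integer column survives the RDS trip but is re-guessed as double from CSV.
class(songs_rds$peak_position)
class(songs_csv$peak_position)
```

A CSV is the better choice when someone outside of R needs the file. An .rds file is handy for passing cleaned data between your own notebooks.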
Data manipulation
- select() to select columns. Example: select(col01, col02) or select(-excluded_col).
- rename() to rename a column. Example: rename(new_name = old_name).
- filter() to filter rows of data. Example: filter(column_name == "value").
  - See Relational Operators like ==, >, >= etc.
  - See Logical operators like &, | etc.
  - See is.na, which tests if a value is missing.
- distinct() will filter rows down to the unique values of the columns given.
- arrange() sorts data based on values in a column. Use desc() to reverse the order. Example: arrange(col_name %>% desc())
- mutate() changes an existing column or creates a new one. Example: mutate(new_col = (col01 / col02)).
- round() is a base R function that rounds a number to a set number of decimal places. Often used within a mutate() function.
- recode(), if_else() and case_when() are all functions that can be used with mutate() to create new categorizations with your data.
- pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. Example: pivot_longer(cols = 3:5, names_to = "new_key_col_name", values_to = "new_val_col_name") will take the third through the fifth columns and turn each value into a new row of data. It will put them into two columns: The first column will have the name you give it in names_to and contain the old column name that corresponds to each value pivoted. The second column will have the name you set in values_to and will contain all the values from each of the columns.
- pivot_wider() is the opposite of pivot_longer(). Example: pivot_wider(names_from = col_of_key_values, values_from = col_with_values). See the link.
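The pivot functions are easier to follow with data in hand, so here is a small sketch that lengthens a table, chains a few of the row-level verbs above, and then widens the result back out. The table, its column names, and the "large"/"small" cutoff are all made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per city, one column per year.
wide <- tibble(
  city = c("Austin", "Dallas"),
  state = c("TX", "TX"),
  `2019` = c(10, 20),
  `2020` = c(11, 21),
  `2021` = c(12, 22)
)

# Columns 3 through 5 become rows: 2 cities x 3 year columns = 6 rows.
# "year" holds the old column names and "permits" holds the values.
long <- wide %>%
  pivot_longer(cols = 3:5, names_to = "year", values_to = "permits")

# Row-level verbs chain with the pipe: keep rows, add a category, then sort.
recent <- long %>%
  filter(year != "2019") %>%
  mutate(size = if_else(permits >= 20, "large", "small")) %>%
  arrange(permits %>% desc())

# pivot_wider() reverses the lengthening and restores one column per year.
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = permits)
```

Long data like this is usually what group_by() and plotting functions expect, while the wide form is often easier to read as a table.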
Aggregation
- group_by() and summarize() often come together. When you use group_by(), every function after it is applied separately within each group. We often add arrange() to these, calling this our GSA functions. Example: group_by(song, artist) %>% summarize(weeks = n(), top_chart_position = min(peak_position)). To break or remove groupings, use ungroup().
- count() is a shortcut for GSA that counts the number of rows based on the variable groups you feed it.
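As a rough illustration of the GSA pattern and the count() shortcut, the sketch below runs both on the same data. The songs, artists, and chart positions are invented, and the name and sort arguments to count() are used here only so its output lines up with the longer version.

```r
library(dplyr)

# Hypothetical chart data: one row per song per week on the chart.
charts <- tibble(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Y"),
  peak_position = c(3, 1, 2, 5, 4)
)

# Group, Summarize, Arrange: one output row per song/artist pair.
gsa <- charts %>%
  group_by(song, artist) %>%
  summarize(
    weeks = n(),                              # rows in each group
    top_chart_position = min(peak_position)   # lowest number is the best rank
  ) %>%
  ungroup() %>%
  arrange(desc(weeks))

# count() collapses the group-and-count steps into a single call.
counted <- charts %>%
  count(song, artist, name = "weeks", sort = TRUE)
```

Reach for count() when all you need is the number of rows. The longer GSA form is still needed for any other summary, such as the min() used here.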
Math
These are the functions often used within summarize():
- n() to count the number of rows.
- n_distinct() counts the unique values.
- sum() to add things together.
- mean() to get an average.
- median() to get the median.
- min() to get the smallest value.
- max() for the largest.
- +, -, *, / are math operators similar to a calculator.
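The functions above can all sit inside one summarize() call. This is a minimal sketch on a made-up table, with team and points as hypothetical column names.

```r
library(dplyr)

# Hypothetical scores; the values are chosen so the results are easy to check by hand.
scores <- tibble(
  team = c("red", "red", "blue", "blue", "blue"),
  points = c(10, 20, 5, 5, 20)
)

stats <- scores %>%
  summarize(
    rows = n(),                          # 5 rows
    unique_teams = n_distinct(team),     # 2 distinct teams
    total = sum(points),                 # 60
    average = mean(points),              # 12
    middle = median(points),             # 10
    smallest = min(points),              # 5
    largest = max(points),               # 20
    spread = max(points) - min(points)   # math operators work here too: 15
  )
```

Without a group_by() first, summarize() returns a single row for the whole table.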