Section 2: Data Wrangling with dplyr
What You Will Need
Packages: - dplyr - Data manipulation - readr - Reading data files - haven - Reading Stata/SPSS/SAS files - pacman - Package management
Data: Download the auto.csv or auto.dta file from course materials.
Skills from Section 1: Loading packages, reading data, basic R operations.
Summary
In this section, we dive deep into data wrangling using the dplyr package. We’ll cover:
- The pipe operator
%>%for chaining operations -
Core dplyr verbs:
select(),filter(),mutate(),arrange(),summarize() -
Grouping operations with
group_by() - Joining datasets with various join functions
-
Handling missing data (
NAvalues) - Reshaping data between wide and long formats
Setup
The Pipe Operator %>%
The pipe operator %>% is one of the most powerful features in modern R. It allows you to chain operations together, making your code more readable and avoiding intermediate objects.
How Pipes Work
The pipe takes the output from one expression and feeds it into the first argument of the next expression:
When reading code with pipes, say “then” for each %>%:
“Take cars, then select price, mpg, and weight, then summarize.”
Saving Results
To save the final output, assign it as usual:
# Save the result
car_summary <- cars %>%
select(price, mpg, weight) %>%
summary()Using the Dot . Placeholder
Sometimes you need to specify where the piped data should go:
# The dot specifies where to place the input
cars %>%
select(price, mpg) %>%
lm(price ~ mpg, data = .)Core dplyr Verbs
The dplyr package is built around verbs that describe what you want to do with your data.
select() - Choose Columns
Select specific variables (columns) from your dataset:
# Select specific columns
cars %>%
select(make, price, mpg, weight)
# Select a range of columns
cars %>%
select(make:mpg)
# Select columns that start with "w"
cars %>%
select(starts_with("w"))
# Select columns that contain "price"
cars %>%
select(contains("price"))
# Exclude columns with minus sign
cars %>%
select(-foreign, -rep78)select()
-
starts_with("abc")- Columns starting with “abc” -
ends_with("xyz")- Columns ending with “xyz” -
contains("ijk")- Columns containing “ijk” -
matches("regex")- Columns matching a regular expression -
num_range("x", 1:3)- Columns x1, x2, x3
filter() - Choose Rows
Filter rows based on logical conditions:
# Cars with mpg greater than 25
cars %>%
filter(mpg > 25)
# Foreign cars only
cars %>%
filter(foreign == 1)
# Expensive AND fuel-efficient cars
cars %>%
filter(price > 10000, mpg > 20)
# Expensive OR fuel-efficient cars
cars %>%
filter(price > 10000 | mpg > 20)
# Cars that are NOT foreign
cars %>%
filter(foreign != 1)
# or
cars %>%
filter(!foreign)-
==equal to -
!=not equal to -
>greater than -
<less than -
>=greater than or equal to -
<=less than or equal to -
&and -
|or -
!not -
%in%in a set
Filtering with Multiple Conditions
mutate() - Create/Modify Columns
Create new variables or modify existing ones:
# Create new variable: price per mpg
cars %>%
mutate(price_per_mpg = price / mpg)
# Multiple new variables
cars %>%
mutate(
price_1000 = price / 1000,
weight_tons = weight / 2000,
efficiency = mpg * weight
)
# Modify existing variable
cars %>%
mutate(price = price / 1000) # Convert to thousands
# Use newly created variables
cars %>%
mutate(
price_1000 = price / 1000,
log_price = log(price_1000) # Use price_1000 immediately
)
arrange() - Sort Rows
Sort your data by one or more variables:
# Sort by price (ascending)
cars %>%
arrange(price)
# Sort by price (descending)
cars %>%
arrange(desc(price))
# Sort by multiple variables
cars %>%
arrange(foreign, desc(price)) # Foreign first, then price within
summarize() - Aggregate Data
Create summary statistics:
Grouped Operations
The real power of dplyr emerges when you combine verbs with group_by().
group_by() - Group Data
Group your data by one or more variables:
# Group by foreign status
cars %>%
group_by(foreign)
# Nothing appears to happen... but it's grouped!Summarize by Groups
# Compare domestic vs. foreign cars
cars %>%
group_by(foreign) %>%
summarize(
n_cars = n(),
mean_price = mean(price),
mean_mpg = mean(mpg),
mean_weight = mean(weight)
)
# Group by multiple variables
cars %>%
group_by(foreign, rep78) %>%
summarize(
count = n(),
avg_price = mean(price, na.rm = TRUE)
)Mutate by Groups
Create variables that depend on group-level calculations:
# Price relative to group mean
cars %>%
group_by(foreign) %>%
mutate(
mean_price = mean(price),
price_diff = price - mean_price
)
# Rank within groups
cars %>%
group_by(foreign) %>%
mutate(price_rank = min_rank(desc(price)))
# Calculate percentiles within groups
cars %>%
group_by(foreign) %>%
mutate(price_pct = percent_rank(price))
ungroup() - Remove Grouping
Always ungroup when you’re done with grouped operations:
cars_grouped <- cars %>%
group_by(foreign) %>%
mutate(mean_price = mean(price)) %>%
ungroup() # Important!Handling Missing Data
Missing values (NA) require special attention in R.
Understanding NA
Filtering NAs
Working with NAs in Summaries
Replacing NAs
Combining Operations
The power of dplyr comes from combining multiple operations:
Example 1: Data Pipeline
# Complex analysis in one pipeline
cars %>%
# Remove missing values
filter(!is.na(rep78)) %>%
# Keep only relevant variables
select(make, price, mpg, weight, foreign) %>%
# Create new variables
mutate(
price_1000 = price / 1000,
efficiency = mpg / weight * 1000
) %>%
# Group by foreign status
group_by(foreign) %>%
# Calculate summaries
summarize(
n = n(),
mean_price = mean(price_1000),
mean_efficiency = mean(efficiency)
) %>%
# Sort results
arrange(desc(mean_efficiency))Example 2: Top N by Group
# Find top 3 most expensive cars by foreign status
cars %>%
group_by(foreign) %>%
arrange(desc(price)) %>%
slice(1:3) %>%
select(make, price, mpg, foreign)Example 3: Conditional Summaries
Joining Datasets
Often you need to combine data from multiple sources.
Types of Joins
# Example datasets
df1 <- data.frame(
id = c(1, 2, 3),
x = c("a", "b", "c")
)
df2 <- data.frame(
id = c(1, 2, 4),
y = c("A", "B", "D")
)
# Inner join - keep only matching rows
inner_join(df1, df2, by = "id")
# Result: id 1, 2
# Left join - keep all rows from df1
left_join(df1, df2, by = "id")
# Result: id 1, 2, 3 (y is NA for id 3)
# Right join - keep all rows from df2
right_join(df1, df2, by = "id")
# Result: id 1, 2, 4 (x is NA for id 4)
# Full join - keep all rows from both
full_join(df1, df2, by = "id")
# Result: id 1, 2, 3, 4 (NAs where no match)-
inner_join()- Only keep matching observations -
left_join()- Keep all from left, match from right -
right_join()- Keep all from right, match from left -
full_join()- Keep everything -
semi_join()- Keep left observations that have a match -
anti_join()- Keep left observations that DON’T have a match
Practical Join Example
# Car prices dataset
prices <- data.frame(
make = c("Toyota", "Honda", "Ford"),
msrp = c(25000, 27000, 22000)
)
# Add MSRP to cars dataset
cars %>%
left_join(prices, by = "make") %>%
select(make, price, msrp, mpg)Reshaping Data
Sometimes you need to convert between wide and long formats.
Wide vs. Long Format
Wide format:
student math science history
Alice 90 85 88
Bob 75 92 80
Long format:
student subject score
Alice math 90
Alice science 85
Alice history 88
Bob math 75
Bob science 92
Bob history 80
pivot_longer() - Wide to Long
# Example: reshape test scores
scores_wide <- data.frame(
student = c("Alice", "Bob"),
math = c(90, 75),
science = c(85, 92),
history = c(88, 80)
)
# Convert to long format
scores_long <- scores_wide %>%
pivot_longer(
cols = math:history, # Columns to pivot
names_to = "subject", # Name for the new "name" column
values_to = "score" # Name for the new "value" column
)
pivot_wider() - Long to Wide
# Convert back to wide format
scores_long %>%
pivot_wider(
names_from = subject, # Column containing new column names
values_from = score # Column containing values
)Best Practices
- Use pipes to create readable chains of operations
- One operation per line when using pipes
- Comment your code to explain why, not just what
- Use descriptive variable names
- Group related operations together
-
Always
ungroup()after grouped operations - Handle
NAvalues explicitly - Test on small subsets before running on full data
Example of Good Style
# Analyze fuel efficiency by car origin
fuel_analysis <- cars %>%
# Remove incomplete observations
filter(!is.na(mpg), !is.na(weight)) %>%
# Create relevant variables
mutate(
origin = ifelse(foreign == 1, "Foreign", "Domestic"),
efficiency_ratio = mpg / (weight / 1000) # MPG per 1000 lbs
) %>%
# Group by origin
group_by(origin) %>%
# Calculate summary statistics
summarize(
n_cars = n(),
mean_mpg = mean(mpg),
mean_efficiency = mean(efficiency_ratio),
sd_efficiency = sd(efficiency_ratio)
) %>%
# Remove grouping
ungroup() %>%
# Sort by efficiency
arrange(desc(mean_efficiency))
# View results
fuel_analysisPractice Exercises
Key Takeaways
-
Pipes
%>%make code readable - use them! -
Six core verbs:
select(),filter(),mutate(),arrange(),summarize(),group_by() - Group operations unlock powerful analyses
-
NAvalues need special handling withna.rm = TRUEoris.na() - Joins combine datasets - choose the right type
- Reshaping converts between wide and long formats
- Chain operations to create readable data pipelines
Additional Resources
Next: Section 3 - Writing Functions and Loops