# Install the package named "dplyr"
install.packages("dplyr")
# Install the packages named "haven" and "readr"
install.packages(c("haven", "readr"))

Section 1: Getting Started with R
What You Will Need
- Working and up-to-date installations of R and RStudio
- Data files: auto.csv and auto.dta (download from course materials)
- An internet connection
- Optional: A healthy source of caffeine ☕
Summary
In this section we will dive into R. We start by installing and loading useful packages (dplyr, haven, readr, and pacman). We then load datasets and begin summarizing and manipulating them. Finally, we’ll make our first plots.
Packages in R
Open RStudio. Make sure you are running the most recent versions of R and RStudio.
- R version: 4.4.2 or newer
- RStudio version: 2024.12.0 or newer
While base R is helpful and powerful, R’s true potential comes from combining the core installation with the many packages generated through open-source collaboration (see CRAN’s list of packages).
Installing Packages
You can check which packages are currently installed using the function installed.packages(). Ready to run your first function? Type installed.packages() into the R console and hit return.
You can also check the Packages tab inside RStudio. The checked boxes denote the packages that are currently loaded.
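As a small sketch of what you can do with that output, the code below checks whether one specific package is installed. The helper name is_installed is made up for illustration, and "stats" is used because it ships with every R installation.

```r
# installed.packages() returns a matrix with one row per installed package;
# the row names are the package names.
pkg_names <- rownames(installed.packages())

# Hypothetical helper: is a given package installed?
is_installed <- function(pkg) {
  pkg %in% pkg_names
}

is_installed("stats")   # TRUE: 'stats' is part of every R installation
is_installed("dplyr")   # TRUE only if dplyr has already been installed
```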
Now, let’s install a few packages that will prove useful this semester. We use the install.packages() function, giving it the name(s) of the package(s) you want to install.
Important Notes:
- You need an internet connection.
- The function name install.packages() is plural regardless of the number of packages.
- Each package’s name is surrounded by quotes (e.g., "haven"). These quotation marks tell R this is a character/string, not an object. Use double quotes for consistency.
- We can create a vector of packages using c(). Example: c(1, 2, 3) is a three-element vector. Similarly, c("haven", "readr") is a two-element vector of package names. Vectors are a big deal in R.
- The hashtag (#) creates comments in R.
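To make the vector idea concrete, here is a short sketch you can paste into the console. It uses only base R, so nothing needs to be installed.

```r
# How many elements does each vector have?
length(c(1, 2, 3))
length(c("haven", "readr"))

# What type of values does each vector hold?
class(c(1, 2, 3))            # numbers
class(c("haven", "readr"))   # character strings (note the quotes)

# Many operations apply to every element at once
c(1, 2, 3) * 2
```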
Loading Packages
To check that installations were successful, we’ll load the packages using the library() function:
Notice we didn’t need quotation marks around package names when loading (though it would still work with quotes).
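A minimal sketch of the two calling styles. It uses the tools package as a stand-in because it ships with R itself; for this section the equivalent calls would be library(dplyr), library(haven), and library(readr), assuming those installations succeeded.

```r
# 'tools' ships with R, so this runs without installing anything.
library(tools)     # unquoted package name
library("tools")   # quoted name works too

# search() lists what is currently attached;
# loaded packages appear as "package:<name>".
"package:tools" %in% search()
```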
Package Management with pacman
Let’s install another package: the pacman package (yes, like the game! 🎮).
# Install the 'pacman' package
install.packages("pacman")
# Load it
library(pacman)

The pacman package is very meta: it’s a package to manage packages. You can use it to install, load, unload, and update your R packages.
Rather than typing library(...) for every package, use pacman’s p_load() function:
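A hedged sketch of what such a call can look like. To keep it runnable without downloading anything, it loads two packages that ship with R (tools and stats); that choice is an assumption made for the example. For this section you would list dplyr, haven, and readr instead.

```r
# Assumes pacman is installed; the check keeps this sketch from
# erroring if it is not.
if (requireNamespace("pacman", quietly = TRUE)) {
  # One call loads several packages; names can be written without quotes.
  pacman::p_load(tools, stats)
}
```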
Better yet: If you try to load a package that isn’t installed, p_load() will install it automatically! Let’s install and load ggplot2:
p_load(ggplot2)

Installed and loaded. That’s service! ✨
Loading Data
Loading a dataset in R requires three things:
- The path of the data file (where it exists on your computer)
- The name of the data file
- The proper function for the file type (e.g., different functions for .csv vs. .dta files)
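Paths and file names are just character strings, so you can build them in pieces. The sketch below shows two base R ways to do that. The object names dir_section and file_name are made up for illustration, and the folder name is only an example; replace it with wherever your files live. The <- operator used here stores a value under a name.

```r
# Example folder and file name; replace with your own.
dir_section <- "Section01/"
file_name   <- "auto.csv"

# paste0() glues strings together with no separator
path_a <- paste0(dir_section, file_name)

# file.path() inserts the "/" separator for you
path_b <- file.path("Section01", file_name)

path_a
path_b

# file.exists() reports whether the path points at a real file
file.exists(path_a)
```

Either string can then be handed to the function that reads the file.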
The dir() Function
The dir() function shows contents of a folder:
This helps when you forget file names!
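A small, self-contained sketch of dir() in action. It creates a scratch folder inside R's temporary directory (the folder name dir_demo is arbitrary) so that the listing is the same on any computer.

```r
# Create a scratch folder with two empty files
scratch <- file.path(tempdir(), "dir_demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("auto.csv", "auto.dta")))

# dir() with a path lists that folder's contents
dir(scratch)

# dir() with no arguments lists the current working directory
dir()

# The pattern argument filters file names with a regular expression
dir(scratch, pattern = "\\.csv$")
```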
Functions to Load Files
We’ll mostly use:
- readr package: for CSV, TSV, and delimited files
- haven package: for Stata (.dta), SPSS, and SAS files
- Base R: for R’s own formats (.rds, .rdata)
To learn more about a function:
- Press Tab after typing the function name (in RStudio)
- Type ?function_name in the console (e.g., ?read_dta)
Loading a Stata File
Let’s load auto.dta using read_dta() from the haven package:
<-
The <- operator is central to R. It assigns the value(s) on the right to the name on the left.
When reading aloud, people say “gets”: car_data gets the contents of auto.dta.
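Assignment works the same way for any kind of value. The sketch below uses plain numbers as a stand-in for a dataset so it runs without any files; the names x and y are arbitrary.

```r
# x gets 5
x <- 5

# y gets the result of a calculation
y <- x * 2

# Assigning prints nothing; type the name to see the value
y

# Reassigning replaces the old value
x <- 100
y   # still 10: y was computed from the old x and does not update
```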
To view the data, just type its name:
car_data

Loading a CSV File
For a CSV file, use read_csv() from the readr package:
If you’re already in the right directory, you don’t need the full path:
read_csv("Section01/auto.csv")

Playing with Data
You now know how to navigate your computer and load data. Let’s do something with those data!
Exploring the Data
Print the data to the console:
car_data

Notice several things:
- The dataset is a tibble (like a data frame with rules)
- Dimensions: 74 rows × 12 columns
- Column types: <chr> (character), <dbl> (double/numeric), <int> (integer)
- You get a snapshot of the first 10 rows
Getting Column Names
names(car_data)

Viewing First/Last Rows
Use View(car_data) or click the dataset in the Environment pane to open RStudio’s data viewer.
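For a quick look in the console instead of the viewer, base R's head() and tail() are the usual tools. This sketch assumes that is what you want here, and it uses the built-in mtcars dataset as a stand-in for car_data so it runs without any files.

```r
# mtcars ships with R; it stands in for car_data here
head(mtcars)          # first 6 rows by default
tail(mtcars, n = 3)   # last 3 rows

# Both return a smaller dataset, so you can count its rows
nrow(head(mtcars))
```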
Summarizing the Data
Get a quick summary of your dataset:
summary(car_data)

To summarize a single variable, use $ to grab it:
# Grab the price variable
car_data$price
# Summarize price
summary(car_data$price)

dataset$variable is how you grab a specific column. RStudio’s tab completion works here too!
Manipulating the Data
The dplyr package offers helpful tools for data manipulation using verbs as actions.
select() - Choose Variables
Keep only variables you care about:
# Select desired variables
car_sub <- select(car_data, price, mpg, weight, length)
# View result
car_sub

We now have 74 rows but only 4 columns!
Alternative: Exclude variables with a minus sign:
select(car_data, -price, -mpg, -weight, -length)
arrange() - Sort Data
Arrange data by one or more columns:
# Arrange by price, then mpg
arrange(car_sub, price, mpg)

If you view car_sub now, it’s not arranged. You must assign the result to change the object:
car_sub <- arrange(car_sub, price)

For descending order, use desc():
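A minimal sketch of desc(), assuming dplyr is installed. It uses the built-in mtcars dataset as a stand-in for car_sub so it runs without the course files; the object names cars_small and cars_desc are made up for the example.

```r
library(dplyr)

# Keep a few columns from the built-in mtcars data
cars_small <- select(mtcars, mpg, wt, hp)

# Sort from highest to lowest mpg and save the result
cars_desc <- arrange(cars_small, desc(mpg))

# The most fuel-efficient cars now come first
head(cars_desc, 3)
```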
summarize() - Create Summaries
Create specific summaries of your data:
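A hedged sketch of summarize(), again assuming dplyr is installed and using the built-in mtcars dataset in place of car_sub. The summary column names (mpg_mean, mpg_sd, n_cars) are arbitrary labels chosen for this example.

```r
library(dplyr)

# Each argument creates one summary column: name = calculation
mpg_summary <- summarize(
  mtcars,
  mpg_mean = mean(mpg),
  mpg_sd   = sd(mpg),
  n_cars   = n()
)

# The result is a one-row dataset
mpg_summary
```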
For simple summaries, you could also just type:
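For instance, base R's summary functions work directly on a column grabbed with $. This sketch uses the built-in mtcars dataset as a stand-in; with the course data you would write car_sub$price or similar.

```r
# Apply a base R function straight to one column
mean(mtcars$mpg)
median(mtcars$mpg)
sd(mtcars$mpg)

# range() returns the minimum and maximum together
range(mtcars$mpg)
```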
Plotting the Data
R’s default plot functions are simple but effective. We’ll cover ggplot2 later, but for now, let’s make quick plots.
Histogram
Create a histogram of miles per gallon:
# Plain histogram
hist(car_sub$mpg)

Make it prettier with labels and a median line:
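One way this could look, using base R graphics only. The sketch uses mtcars$mpg as a stand-in for car_sub$mpg, and the title, label, and colors are arbitrary choices for the example.

```r
# mtcars$mpg stands in for car_sub$mpg
mpg <- mtcars$mpg

# Histogram with a title and an x-axis label
hist(
  x    = mpg,
  main = "Distribution of Fuel Economy",
  xlab = "Fuel Economy (MPG)",
  col  = "grey80"
)

# abline(v = ...) adds a vertical line to the existing plot
abline(v = median(mpg), col = "red", lwd = 2)
```

Note that abline() draws on top of the most recent plot, so it must come after hist().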
Scatterplot
Plot price vs. mileage:
plot(
x = car_sub$mpg,
y = car_sub$price,
xlab = "Fuel Economy (MPG)",
ylab = "Price ($)"
)

I recommend clearly defining function arguments. It helps keep things straight, as order matters when you’re not naming arguments.
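A short sketch of why naming arguments helps, using base R's seq() function (its first two arguments are from and to). The object names are arbitrary.

```r
# Named arguments: the order you write them in does not matter
a <- seq(from = 2, to = 5)
b <- seq(to = 5, from = 2)
identical(a, b)   # TRUE

# Unnamed arguments are matched by position, so order changes the result
up   <- seq(2, 5)   # counts up:   2 3 4 5
down <- seq(5, 2)   # counts down: 5 4 3 2
up
down
```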
Indexing
Nearly everything in R is numerically indexed. You can access individual elements using square brackets [].
Vector Indexing
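A minimal sketch of indexing a vector with square brackets. The vector x and its values are made up for illustration.

```r
# A five-element vector
x <- c(10, 20, 30, 40, 50)

x[1]         # first element (R counts from 1, not 0)
x[2:4]       # elements 2 through 4
x[c(1, 5)]   # first and fifth elements
x[-1]        # everything except the first element
x[x > 25]    # elements that satisfy a condition
```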
Data Frame Indexing
Data frames have rows and columns: [row, column]
# Grab first row (all columns)
car_sub[1, ]
# Grab first column (all rows)
car_sub[, 1]
# Use column name as index
car_sub[, "price"]

Remember: Rows before columns. Leave blank to select all.
- [1, ] = first row, all columns
- [, 2] = all rows, second column
- [3, 4] = third row, fourth column
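You can also ask for several rows or columns at once by putting a vector in either position. A small sketch, using the built-in mtcars dataset in place of car_sub; the object name first_three is arbitrary.

```r
# Rows 1 to 3, and the columns named "mpg" and "wt"
first_three <- mtcars[1:3, c("mpg", "wt")]
first_three

# dim() reports the number of rows, then the number of columns
dim(first_three)

# A single cell: third row, fourth column
mtcars[3, 4]
```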
Fun Challenge
Practice Problems
Some classic R-meets-linear-algebra puzzles for your enjoyment:
Key Takeaways
- Packages extend R’s power - Use pacman::p_load() for easy management
- File paths are strings - Use paste0() to build paths programmatically
- Assignment is key - Use <- to save results: new_data <- old_data
- dplyr verbs - select(), arrange(), summarize() for data manipulation
- Indexing with [row, column] - Essential for accessing data
- Use $ for columns - dataset$variable grabs a specific variable
Additional Resources
- R for Data Science - Comprehensive guide
- RStudio Cheatsheets - Quick references
- Stack Overflow - Q&A community
- ?function_name - Built-in R help
Next: Section 2 - Data Wrangling with dplyr