Intro to Programming in R Unit 1 – Intro to R and RStudio
R and RStudio are essential tools for data analysis and statistical computing. R offers a wide range of functions for data manipulation, visualization, and modeling, while RStudio provides a user-friendly interface for writing and executing R code.
This introduction covers the basics of R and RStudio, including installation, syntax, data types, and common data structures. It also explores data import, manipulation, and visualization techniques, setting the foundation for more advanced statistical analysis and programming in R.
R is a programming language and environment for statistical computing and graphics
Provides a wide variety of statistical and graphical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering)
Highly extensible through functions and packages which extend its capabilities
R is an interpreted language, meaning that code can be run line by line as it is written, without a separate compilation step
R is open-source and freely available, making it accessible to a wide range of users
Widely used in academia and industry for data analysis, statistical modeling, and data visualization
Offers powerful tools for data manipulation, making it easy to clean, transform, and reshape data
Supports reproducible research through tools like R Markdown and, with an R kernel installed, Jupyter Notebooks
Getting Started with R and RStudio
RStudio is an integrated development environment (IDE) for R that provides a user-friendly interface
To start using R, first download and install R from the official CRAN (Comprehensive R Archive Network) website
Next, download and install RStudio Desktop from the official website of Posit (the company formerly named RStudio)
Launch RStudio and familiarize yourself with the interface, which includes:
Console: where you enter commands and see output
Script editor: where you write and save R code
Environment: shows objects currently in memory
Plots, Packages, Help, and Viewer panes
Set your working directory using setwd() to specify where R will look for files and save output
Install packages using install.packages() to extend R's functionality
Load packages using library() to make their functions available for use in your current session
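A typical start of a session might look like the sketch below. The temporary folder and the package name dplyr are placeholders chosen for illustration; the install line is commented out because it needs an internet connection and only has to run once per machine.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Point R at a different folder (a temporary directory, purely for illustration)
setwd(tempdir())
getwd()

# Install a package from CRAN (run once; requires internet access)
# install.packages("dplyr")

# Load an installed package for this session; stats ships with R
library(stats)

# Restore the original working directory
setwd(old_dir)
```

In a real project you would pass the path of your own project folder to setwd().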
R Basics: Syntax and Data Types
R is case-sensitive, so myVariable and myvariable are treated as different objects
Comments start with # and are used to explain code or disable lines of code
R has several basic data types, including:
Numeric: real numbers (e.g., 3.14)
Integer: whole numbers (e.g., 42L)
Character: text strings (e.g., "hello")
Logical: boolean values (TRUE or FALSE)
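The four basic types can be checked with class(), as in this short sketch. The variable names are made up for illustration.

```r
price    <- 3.14     # numeric (double-precision real number)
count    <- 42L      # integer; the L suffix requests integer storage
greeting <- "hello"  # character
is_ready <- TRUE     # logical

class(price)     # "numeric"
class(count)     # "integer"
class(greeting)  # "character"
class(is_ready)  # "logical"

# Without the L suffix, a whole number is still stored as numeric
class(42)        # "numeric"
```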
R uses the <- operator for assignment (e.g., x <- 42), although = can also be used
Mathematical operations follow the usual order of precedence (PEMDAS)
Comparison operators (<, >, <=, >=, ==, !=) are used to compare values and return logical values
Logical operators (&, |, !) are used to combine or negate logical values
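The assignment, arithmetic, comparison, and logical operators described above can be tried out with a few lines like these. The values are arbitrary.

```r
x <- 42   # assignment with <-
y = 8     # = also works for assignment at the top level

# Order of precedence: multiplication before addition, exponent before multiplication
2 + 3 * 4     # 14
(2 + 3) * 4   # 20
2 ^ 3 * 2     # 16

# Comparison operators return logical values
x > y     # TRUE
x == 42   # TRUE
x != y    # TRUE

# Logical operators work element by element on logical vectors
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b   # TRUE FALSE FALSE
a | b   # TRUE TRUE TRUE
!a      # FALSE TRUE FALSE
```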
Working with Variables and Functions
Variables are used to store values and are created using the assignment operator (<- or =)
Variable names should be descriptive and follow a consistent naming convention (e.g., snake_case or camelCase)
Functions are reusable pieces of code that perform a specific task
R has many built-in functions (e.g., mean(), sum(), plot()) and users can also define their own functions
Functions are called using the syntax function_name(argument1, argument2, ...)
Arguments are values passed to a function; an argument is optional when the function defines a default value for it and required otherwise
Functions can return a value explicitly by calling return(); otherwise, the value of the last evaluated expression is returned automatically
R uses lexical scoping, meaning that a function looks up variables it does not define itself in the environment where the function was created
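The points about arguments, return values, and scoping can be illustrated with a small sketch. All function and variable names here (rectangle_area, safe_divide, add_tax, tax_rate) are hypothetical.

```r
# width is required; height is optional because it has a default value
rectangle_area <- function(width, height = 1) {
  width * height   # the last evaluated expression is returned automatically
}

rectangle_area(3)              # 3
rectangle_area(3, height = 4)  # 12

# An explicit return() is useful for leaving a function early
safe_divide <- function(numerator, denominator) {
  if (denominator == 0) {
    return(NA_real_)
  }
  numerator / denominator
}

safe_divide(10, 4)  # 2.5
safe_divide(1, 0)   # NA

# Lexical scoping: add_tax() does not define tax_rate itself,
# so R looks it up in the environment where add_tax() was created
tax_rate <- 0.25
add_tax <- function(price) {
  price * (1 + tax_rate)
}

add_tax(100)  # 125
```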
Data Structures in R
R has several built-in data structures for storing collections of values:
Vectors: one-dimensional arrays that hold elements of the same data type
Lists: ordered collections whose elements can be of different data types (including other lists)
Matrices: two-dimensional arrays that hold elements of the same data type
Data frames: two-dimensional tables in which each column holds a single data type, but different columns can hold different types
Vectors are created using the c() function (e.g., my_vector <- c(1, 2, 3))
Elements in a vector are accessed using square brackets and an index, counting from 1 (e.g., my_vector[1] is the first element)
Vectors can be used in arithmetic operations, which are applied element-wise
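A brief sketch of creating, indexing, and doing arithmetic with a vector, reusing my_vector from the example above:

```r
my_vector <- c(1, 2, 3)

my_vector[1]        # 1 (indexing starts at 1)
my_vector[c(1, 3)]  # 1 3
length(my_vector)   # 3

# Arithmetic is applied element by element
my_vector * 2              # 2 4 6
my_vector + c(10, 20, 30)  # 11 22 33

# A vector holds a single type, so mixed input is coerced to a common one
c(1, "a", TRUE)  # "1" "a" "TRUE"
```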
Lists are created using the list() function (e.g., my_list <- list(1, "a", TRUE))
Elements in a list are accessed using double square brackets or, for named elements, $ (e.g., my_list[[1]] or my_list$element_name)
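The difference between the access styles is easiest to see side by side. In this sketch, the named list person and its element names are invented for illustration.

```r
my_list <- list(1, "a", TRUE)

my_list[[2]]  # "a" (the element itself)
my_list[2]    # a list of length 1 that contains "a"

# $ only works when the elements have names
person <- list(name = "Ada", age = 36, is_member = TRUE)

person$name      # "Ada"
person[["age"]]  # 36
names(person)    # "name" "age" "is_member"
```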
Matrices are created using the matrix() function (e.g., my_matrix <- matrix(1:6, nrow = 2, ncol = 3))
Elements in a matrix are accessed using square brackets and row/column indices (e.g., my_matrix[1, 2])
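The matrix from the example above can be inspected as follows. Note that matrix() fills column by column unless byrow = TRUE is given.

```r
my_matrix <- matrix(1:6, nrow = 2, ncol = 3)
my_matrix
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

my_matrix[1, 2]  # 3 (row 1, column 2)
my_matrix[2, ]   # 2 4 6 (all of row 2)
my_matrix[, 3]   # 5 6 (all of column 3)
dim(my_matrix)   # 2 3

# Filling row by row changes where each value lands
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row[1, 2]     # 2
```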
Data frames are created using the data.frame() function (e.g., my_df <- data.frame(x = 1:3, y = c("a", "b", "c")))
Elements in a data frame are accessed using $ or square brackets (e.g., my_df$x or my_df[, "x"])
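A short sketch of building and querying the data frame from the example above. The stringsAsFactors = FALSE argument is added here so that the y column stays character in older R versions too.

```r
my_df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE  # keep y as character, regardless of R version
)

my_df$x       # 1 2 3
my_df[, "x"]  # 1 2 3 (same column, bracket style)
my_df[2, ]    # the second row, returned as a one-row data frame

# Rows can be selected with a logical condition
my_df[my_df$x > 1, "y"]  # "b" "c"

nrow(my_df)  # 3
ncol(my_df)  # 2
str(my_df)   # compact overview of column names and types
```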
Importing and Manipulating Data
R can import data from various sources, including CSV files, Excel workbooks, and SQL databases
The read.csv() function is used to read CSV files (e.g., my_data <- read.csv("data.csv"))
The readxl package provides functions for reading Excel files (e.g., read_excel())
The DBI and RMySQL / RPostgreSQL packages allow for connecting to and querying SQL databases
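To try read.csv() without an existing file, the sketch below first writes a small CSV to a temporary location and then reads it back. The data and the names scores and csv_path are invented for illustration; reading Excel files and databases is not shown because it needs extra packages and external resources.

```r
# Create a small example file in a temporary location
csv_path <- tempfile(fileext = ".csv")
scores <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(90, 85, 77)
)
write.csv(scores, csv_path, row.names = FALSE)

# Read the file back into a data frame
my_data <- read.csv(csv_path)

head(my_data)  # first rows of the imported data
str(my_data)   # column names and types as R read them
```

With your own data, replace csv_path with the path to your file, relative to the working directory.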
The dplyr package provides a set of functions for data manipulation, including:
filter(): subset rows based on conditions
select(): subset columns by name
mutate(): create new columns or modify existing ones
group_by() and summarize(): aggregate data by groups and calculate summary statistics
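The dplyr verbs listed above can be sketched on a small made-up data frame. This assumes dplyr is already installed; the sales data and all variable names are hypothetical.

```r
library(dplyr)

sales <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  units  = c(10, 4, 7, 12, 3),
  price  = c(2.5, 2.5, 3, 3, 3)
)

# filter(): keep rows that meet a condition
big_orders <- filter(sales, units >= 5)

# select(): keep columns by name
units_only <- select(sales, region, units)

# mutate(): add a column computed from existing ones
with_revenue <- mutate(sales, revenue = units * price)

# group_by() + summarize(): one summary row per group
revenue_by_region <- with_revenue %>%
  group_by(region) %>%
  summarize(total_revenue = sum(revenue))

revenue_by_region
```

The %>% pipe passes the result of each step to the next, which keeps multi-step transformations readable.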
The tidyr package provides functions for reshaping data, such as pivot_longer() and pivot_wider() for converting between long and wide formats
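A round trip between wide and long formats might look like this. The sketch assumes tidyr is installed; the grades data is invented for illustration.

```r
library(tidyr)

# Wide format: one row per student, one column per subject
wide <- data.frame(
  student = c("Ana", "Ben"),
  math    = c(90, 75),
  art     = c(80, 95)
)

# Long format: one row per student-subject pair
long <- pivot_longer(
  wide,
  cols      = c(math, art),
  names_to  = "subject",
  values_to = "score"
)
long

# Back to wide format
wide_again <- pivot_wider(long, names_from = subject, values_from = score)
wide_again
```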
Visualizing Data with R
R provides powerful tools for creating a wide range of visualizations, from simple scatter plots to complex interactive dashboards
The base R plotting system includes functions like plot(), hist(), and boxplot() for creating basic graphs
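The three base functions can be tried on mtcars, a dataset that ships with R. The choice of columns and labels below is only for illustration.

```r
# Scatter plot of weight against fuel efficiency
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel efficiency by weight")

# Histogram; hist() also returns the bin information, stored here for inspection
mpg_hist <- hist(mtcars$mpg,
                 xlab = "Miles per gallon",
                 main = "Distribution of fuel efficiency")

# Box plots of mpg for each number of cylinders, using formula syntax
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders",
        ylab = "Miles per gallon")
```

In RStudio the graphs appear in the Plots pane; when the code runs as a script, R writes them to a file instead.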
The ggplot2 package provides a flexible and expressive framework for creating more advanced visualizations
Graphs are built up in layers, starting with the ggplot() function and adding components like geometric objects (geom_point(), geom_line(), etc.), scales, and facets
Aesthetic mappings, declared with aes(), link variables in the data to visual properties of the graph (e.g., color, size, shape)
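The layered approach can be sketched with the built-in mtcars dataset. This assumes ggplot2 is installed; the variables mapped and the labels are chosen only for illustration.

```r
library(ggplot2)

# Start with the data and aesthetic mappings, then add layers with +
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +   # geometric object: one point per car
  facet_wrap(~ gear) +     # facets: one panel per number of gears
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  )

# Printing the object draws the plot
print(p)
```

Because the plot is stored in an object, more layers can be added to p later with +.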
Other packages for specific types of visualizations include:
plotly for interactive web-based graphs
leaflet for interactive maps
networkD3 for network graphs
R Markdown and Shiny are tools for creating reproducible reports and interactive web applications that incorporate visualizations
Helpful Resources and Next Steps
The official R documentation and help files provide detailed information on functions and packages
Online resources like Stack Overflow, R-bloggers, and the Posit Community (formerly RStudio Community) are great places to find answers to questions and learn from other users
Books like "R for Data Science" by Hadley Wickham and Garrett Grolemund and "Advanced R" by Hadley Wickham provide in-depth coverage of R programming and best practices
Online courses on platforms like Coursera, DataCamp, and edX offer structured learning paths for R and data science
Participating in local R user groups or attending conferences like useR! and posit::conf (formerly rstudio::conf) is a great way to network and learn from the R community
As you continue learning R, focus on developing your skills in:
Data wrangling and manipulation with dplyr and tidyr
Data visualization with ggplot2 and other packages
Statistical modeling and machine learning with functions like lm() and glm() and packages like caret
Creating reproducible reports and applications with R Markdown and Shiny
Consider working on personal projects or contributing to open-source packages to apply your skills and build your portfolio