Welcome to Tutorial!

My name is David Veitch, and I am a first-year Stats PhD student. My website is https://daveveitch.github.io. I used to work in finance before returning to school. I like basketball, camping, board games, and Nintendo.

My goal is to make this tutorial worth your time. To this end I promise to:

Play Hype Video

AI Beats Doctors in Diagnosis Competition

https://www.youtube.com/watch?v=_kLPyDmUUwU

Install R Studio

Get R Studio Desktop here: https://www.rstudio.com/products/rstudio/#Desktop

Simplify Your Life By ‘Creating a Project’

This makes importing data much easier, since you do not have to worry about setting a working directory. For example, I have the dataset faithful.csv in the same folder as my R project, so I can just write:

data = read.csv('faithful.csv', header = TRUE)
head(data)
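
If the import fails, a quick check along these lines can help. This is only a sketch: it assumes the file is named faithful.csv as above, and it simply reports where R is looking and whether the file is visible from there.

```r
# Folder that relative paths such as 'faithful.csv' are resolved against.
# Inside an R project this is normally the project folder.
getwd()

# TRUE if faithful.csv sits in that folder, FALSE otherwise
file.exists("faithful.csv")
```

If the second line returns FALSE, the file is probably not in the project folder.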

Make an R Notebook

This is a nice format since you can output it to a clean HTML or PDF document (note that you will need a LaTeX distribution such as MiKTeX installed to knit to PDF). Just click ‘Knit’.

Add New Code via ‘chunks’

You must run the current chunk to execute the code inside it.

print('Hello World')
## [1] "Hello World"

You can also hit CTRL+Enter on a line to run that line, or highlight many lines and hit CTRL+Enter to run them all.

Get help on anything

help("data.frame")
## starting httpd help server ... done
?data.frame

You can also write math (i.e. LaTeX) directly in your notebook

\[ aX = Y^2 -c\]

Use the console

When you create a code chunk that creates a variable, the variable is stored in the ‘Global Environment’ which can be accessed from the console.

ex_variable = 3.1415

Quick aside on why R is good

Most cutting-edge statistical methods are available as R packages, which is often not the case in Python.

Quick aside on why R is bad

Industry mostly uses Python, because most people doing data science right now come from a software engineering background.

The following part of the tutorial largely follows ISLR Section 2.3 (Lab: Introduction to R).

x <- c(1,3,2,5)
x
## [1] 1 3 2 5
x = c(1,6,2)
x
## [1] 1 6 2
y = c(1,4,3)

length(x)
## [1] 3
length(y)
## [1] 3
x+y
## [1]  2 10  5
ls()
## [1] "data"        "ex_variable" "x"           "y"
rm(list=ls())

?matrix

x = matrix(data=c(1,2,3,4),
           nrow=2, ncol=2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
x^2
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
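
Note that matrix() filled the matrix column by column above. As a small sketch (the variable names here are made up for illustration), the byrow argument switches to filling row by row:

```r
# Default: values fill down the columns
m_by_col <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

# byrow = TRUE: values fill across the rows instead
m_by_row <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

m_by_col[1, 2]  # 3
m_by_row[1, 2]  # 2

# Filling by row gives the transpose of filling by column
identical(m_by_row, t(m_by_col))  # TRUE
```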
?rnorm
x = rnorm(50)
y = x+rnorm(50,mean=50,sd=.1)
cor(x,y)
## [1] 0.9957154
set.seed(1303)
rnorm(5)
## [1] -1.14397631  1.34212937  2.18539048  0.53639252  0.06319297
rnorm(5)
## [1]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
set.seed(1303)
rnorm(5)
## [1] -1.14397631  1.34212937  2.18539048  0.53639252  0.06319297
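
The output above shows that resetting the seed reproduces the same draws. A minimal way to check this in code, with hypothetical variable names, might be:

```r
# Draw five numbers after setting the seed
set.seed(1303)
first_draw <- rnorm(5)

# Reset the same seed and draw again
set.seed(1303)
second_draw <- rnorm(5)

# The two vectors match exactly
identical(first_draw, second_draw)  # TRUE
```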
set.seed(3)
y=rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768
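
As a sanity check on what var() and sd() compute, here is a sketch on a small made-up vector (not from ISLR). It assumes nothing beyond base R and avoids rnorm() so that the random number stream above is left untouched.

```r
# Small hand-made sample (hypothetical values)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance by hand: squared deviations divided by n - 1
manual_var <- sum((v - mean(v))^2) / (length(v) - 1)

# Both comparisons return TRUE
all.equal(manual_var, var(v))
all.equal(sqrt(manual_var), sd(v))
```

The n - 1 denominator is why var() differs slightly from the average squared deviation.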
x = rnorm(100)
y = rnorm(100)

plot(x,y,
     xlab='This is x',
     ylab='This is y',
     main='Plot of X v Y')

png('Figure.png',width=300,height=300)
plot(x,y,col='green')
dev.off()
## png 
##   2
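
The same pattern (open a device, plot, then dev.off()) should work for other file formats too. Below is a sketch using pdf(); the file name and the use of a temporary folder are my own choices for illustration, and note that pdf() takes its width and height in inches rather than pixels.

```r
# Hypothetical output location in R's temporary folder
plot_file <- file.path(tempdir(), "Figure.pdf")

# Open a PDF device, draw the plot into it, then close the device
pdf(plot_file, width = 4, height = 4)
plot(1:10, (1:10)^2, col = "green")
dev.off()

# The file only exists on disk once the device has been closed
file.exists(plot_file)  # TRUE
```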
x = seq(1,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x=1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x = seq(-pi,pi,length=50)
x
##  [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957
##  [6] -2.50045130 -2.37222302 -2.24399475 -2.11576648 -1.98753821
## [11] -1.85930994 -1.73108167 -1.60285339 -1.47462512 -1.34639685
## [16] -1.21816858 -1.08994031 -0.96171204 -0.83348377 -0.70525549
## [21] -0.57702722 -0.44879895 -0.32057068 -0.19234241 -0.06411414
## [26]  0.06411414  0.19234241  0.32057068  0.44879895  0.57702722
## [31]  0.70525549  0.83348377  0.96171204  1.08994031  1.21816858
## [36]  1.34639685  1.47462512  1.60285339  1.73108167  1.85930994
## [41]  1.98753821  2.11576648  2.24399475  2.37222302  2.50045130
## [46]  2.62867957  2.75690784  2.88513611  3.01336438  3.14159265
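
seq() can be told either how many points you want or how far apart they should be. A small sketch with made-up variable names (length.out is the full name of the argument abbreviated as length above):

```r
# Five equally spaced points from 0 to 1
grid_len <- seq(0, 1, length.out = 5)

# Points from 0 to 1 in steps of 0.25
grid_by <- seq(0, 1, by = 0.25)

# Both give 0.00 0.25 0.50 0.75 1.00
all.equal(grid_len, grid_by)  # TRUE
```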
y=x
f=outer(x,y,function(x,y)cos(y)/(1+x^2))
contour(x,y,f)

fa=(f-t(f))/2
contour(x,y,fa,nlevels=15)
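
To see what outer() and the (f-t(f))/2 step are doing, it can help to try them on a tiny grid. This is only a sketch with hypothetical variable names, using the same function as above on three points instead of fifty.

```r
small_x <- 1:3
small_y <- 1:3

# Entry [i, j] is the function evaluated at (small_x[i], small_y[j])
small_f <- outer(small_x, small_y, function(a, b) cos(b) / (1 + a^2))
all.equal(small_f[2, 3], cos(3) / (1 + 2^2))  # TRUE

# Subtracting the transpose and halving leaves the antisymmetric part:
# flipping rows and columns flips the sign
small_fa <- (small_f - t(small_f)) / 2
all.equal(t(small_fa), -small_fa)  # TRUE
```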

image(x,y,fa)

persp(x,y,fa)

persp(x,y,fa,theta=30)

A = matrix(1:16,4,4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
A[2,3]
## [1] 10
A[c(1,3),c(2,4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3,2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[,1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
A[1,]
## [1]  1  5  9 13
A[-c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
dim(A)
## [1] 4 4
dim(A)[1]
## [1] 4
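
One thing to watch for: selecting a single row, as in A[1,], returns a plain vector rather than a matrix. If you need to keep the matrix shape, the drop argument can help. A sketch with a hypothetical matrix name:

```r
idx_demo <- matrix(1:16, 4, 4)

# A single row comes back as a vector, so it has no dimensions
dim(idx_demo[1, ])                 # NULL

# drop = FALSE keeps it as a 1 x 4 matrix
dim(idx_demo[1, , drop = FALSE])   # 1 4
```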
Auto = read.table('Auto.data')
fix(Auto)

Auto = read.table('Auto.data',header=T,na.strings='?')

Auto=read.csv('Auto.csv',header=T,na.strings='?')
fix(Auto)
dim(Auto)
## [1] 397   9
Auto = na.omit(Auto)
dim(Auto)
## [1] 392   9
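
To see how na.strings and na.omit() work together, here is a sketch on a tiny made-up dataset (the values are invented, not taken from Auto). It reads the data from a string via the text argument so no file is needed.

```r
# Three rows; the second has '?' in place of a missing horsepower
csv_text <- "mpg,horsepower\n18,130\n25,?\n30,95"

# na.strings = '?' tells R to treat '?' as a missing value (NA)
toy <- read.csv(text = csv_text, header = TRUE, na.strings = "?")
is.na(toy$horsepower[2])  # TRUE
nrow(toy)                 # 3

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)
nrow(toy_clean)           # 2
```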
names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
# plot(cylinders,mpg) # throws an error: cylinders and mpg only exist as columns of Auto

plot(Auto$cylinders,Auto$mpg)

attach(Auto)
plot(cylinders,mpg)
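
If you would rather not attach a data frame, with() is one alternative that makes the columns visible for a single expression only. The sketch below uses R's built-in mtcars dataset (an assumption on my part, chosen so it runs without Auto.csv).

```r
# Inside with(), column names can be used directly
mean_with <- with(mtcars, mean(mpg))

# Same result as spelling out the data frame each time
mean_dollar <- mean(mtcars$mpg)

all.equal(mean_with, mean_dollar)  # TRUE
```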

cylinders = as.factor(cylinders)

plot(cylinders,mpg,col='red',varwidth=T,xlab='cylinders',ylab='mpg')
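
The plot above becomes a set of boxplots because plot() treats a factor on the x axis as categories rather than numbers. A small sketch of the conversion itself, again using the built-in mtcars dataset as a stand-in (variable names are hypothetical):

```r
# Number of cylinders stored as ordinary numbers
cyl_num <- mtcars$cyl
class(cyl_num)    # "numeric"

# The same values stored as categories
cyl_fac <- as.factor(cyl_num)
class(cyl_fac)    # "factor"
levels(cyl_fac)   # "4" "6" "8"
```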

pairs(Auto)

pairs(~ mpg + displacement + horsepower + weight + acceleration,Auto)

plot(horsepower,mpg)
identify(horsepower,mpg,name)

## integer(0)
summary(Auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365
head(Auto)
tail(Auto)

Chapter 2 Applied Question 8

Get the dataset from http://faculty.marshall.usc.edu/gareth-james/ISL/data.html

Try some of the questions