My name is David Veitch, and I am a first-year Statistics PhD student. My website is https://daveveitch.github.io. I worked in finance before returning to school. I like basketball, camping, board games, and Nintendo.
My goal is to make this tutorial worth your time. To this end I promise to:
AI Beats Doctors in Diagnosis Competition
Get R Studio Desktop here: https://www.rstudio.com/products/rstudio/#Desktop
This makes importing data much easier, since you do not have to worry about setting a working directory. For example, I have the dataset faithful.csv in the same folder as my R project, so I can simply run:
data = read.csv('faithful.csv', header = TRUE)
head(data)
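If you do not have faithful.csv on hand, the same workflow can be sketched with R's built-in `faithful` dataset. Here `tempdir()` stands in for the project folder; that substitution is an assumption made purely for illustration.

```r
# Assumption: tempdir() stands in for the R project folder.
# Write the built-in 'faithful' dataset out as a CSV file.
write.csv(faithful, file.path(tempdir(), "faithful.csv"), row.names = FALSE)

# Read it back exactly as above, then look at the first rows.
data <- read.csv(file.path(tempdir(), "faithful.csv"), header = TRUE)
head(data)
dim(data)   # 272 rows, 2 columns (eruptions, waiting)
```

Inside a real R project the `file.path(tempdir(), ...)` part is unnecessary, because relative paths are resolved from the project folder.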
This is a nice format since you can render it to a polished HTML or PDF document (note you will need to install MiKTeX to render math in a PDF). Just click ‘Knit’.
You must run the current chunk for its code to execute.
print('Hello World')
## [1] "Hello World"
You can also press CTRL+Enter on a line to run that line, or highlight many lines and press CTRL+Enter to run them all.
help("data.frame")
## starting httpd help server ... done
?data.frame
\[ aX = Y^2 -c\]
When a code chunk creates a variable, the variable is stored in the ‘Global Environment’, which can be accessed from the console.
ex_variable = 3.1415
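One way to confirm that the variable really is available outside the chunk is to query the environment directly. This is a minimal sketch using base R's `exists()` and `ls()`; it assumes the code is run at the top level (for example from the console).

```r
ex_variable <- 3.1415

# TRUE once the assignment has run.
exists("ex_variable")

# The variable appears in the listing of the current environment.
"ex_variable" %in% ls()
```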
Almost any cutting-edge statistical package is available in R, which is not the case in Python.
Industry mostly uses Python because most people doing data science right now come from a software engineering background.
x <- c(1,3,2,5)
x
## [1] 1 3 2 5
x = c(1,6,2)
x
## [1] 1 6 2
y = c(1,4,3)
length(x)
## [1] 3
length(y)
## [1] 3
x+y
## [1] 2 10 5
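The addition above works element by element because both vectors have the same length. As a brief illustration (the numbers in the second and third examples are made up for this sketch), R recycles the shorter vector when the lengths differ:

```r
# Same length: element-by-element addition.
c(1, 6, 2) + c(1, 4, 3)      # 2 10 5

# Different lengths: the shorter vector is recycled to match the longer one.
c(1, 6, 2, 4) + c(10, 20)    # 11 26 12 24

# A single number is recycled across every element.
c(1, 6, 2) * 2               # 2 12 4
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.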
ls()
## [1] "data" "ex_variable" "x" "y"
rm(list=ls())
?matrix
x = matrix(data=c(1,2,3,4),
nrow=2, ncol=2)
x
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
sqrt(x)
## [,1] [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
x^2
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16
x
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
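Two details are easy to miss here: `matrix()` fills column by column unless told otherwise, and `sqrt(x)` and `x^2` act on each entry separately. The sketch below contrasts these with row-wise filling and the matrix product; `x_byrow` is an illustrative name not used elsewhere in this tutorial.

```r
x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)

# byrow = TRUE fills the matrix row by row instead of column by column.
x_byrow <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
x_byrow

# x^2 squares each entry; %*% is the matrix product.
x %*% x   # rows: (7, 15) and (10, 22)
```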
?rnorm
x = rnorm(50)
y = x+rnorm(50,mean=50,sd=.1)
cor(x,y)
## [1] 0.9957154
set.seed(1303)
rnorm(5)
## [1] -1.14397631 1.34212937 2.18539048 0.53639252 0.06319297
rnorm(5)
## [1] 0.5022344825 -0.0004167247 0.5658198405 -0.5725226890 -1.1102250073
set.seed(1303)
rnorm(5)
## [1] -1.14397631 1.34212937 2.18539048 0.53639252 0.06319297
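The output above shows that resetting the seed reproduces the same draws. A short sketch that checks this programmatically (the variable names are illustrative):

```r
set.seed(1303)
first_draw <- rnorm(5)

# Resetting the seed restarts the random number stream.
set.seed(1303)
second_draw <- rnorm(5)
identical(first_draw, second_draw)   # TRUE

# Drawing again without resetting the seed continues the stream.
third_draw <- rnorm(5)
identical(first_draw, third_draw)    # FALSE
```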
set.seed(3)
y=rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768
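`var()` computes the sample variance with an n - 1 denominator, and `sd()` is its square root. The following sketch checks both by hand; `n` and `manual_var` are names introduced only for this example.

```r
set.seed(3)
y <- rnorm(100)
n <- length(y)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((y - mean(y))^2) / (n - 1)

all.equal(manual_var, var(y))        # TRUE
all.equal(sqrt(manual_var), sd(y))   # TRUE
```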
x = rnorm(100)
y = rnorm(100)
plot(x,y,
xlab='This is x',
ylab='This is y',
main='Plot of X v Y')
png('Figure.png',width=300,height=300)
plot(x,y,col='green')
dev.off()
## png
## 2
x = seq(1,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
x=1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x = seq(-pi,pi,length=50)
x
## [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957
## [6] -2.50045130 -2.37222302 -2.24399475 -2.11576648 -1.98753821
## [11] -1.85930994 -1.73108167 -1.60285339 -1.47462512 -1.34639685
## [16] -1.21816858 -1.08994031 -0.96171204 -0.83348377 -0.70525549
## [21] -0.57702722 -0.44879895 -0.32057068 -0.19234241 -0.06411414
## [26] 0.06411414 0.19234241 0.32057068 0.44879895 0.57702722
## [31] 0.70525549 0.83348377 0.96171204 1.08994031 1.21816858
## [36] 1.34639685 1.47462512 1.60285339 1.73108167 1.85930994
## [41] 1.98753821 2.11576648 2.24399475 2.37222302 2.50045130
## [46] 2.62867957 2.75690784 2.88513611 3.01336438 3.14159265
y=x
f=outer(x,y,function(x,y)cos(y)/(1+x^2))
contour(x,y,f)
fa=(f-t(f))/2
contour(x,y,fa,nlevels=15)
image(x,y,fa)
persp(x,y,fa)
persp(x,y,fa,theta=30)
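The matrix `fa = (f - t(f))/2` plotted above is the antisymmetric part of `f`, which is why its plots look mirrored with opposite sign across the diagonal. A small sketch that verifies this property; `fs` (the symmetric part) is a name added here for illustration.

```r
x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))

# Antisymmetric part: transposing flips the sign.
fa <- (f - t(f)) / 2
all.equal(fa, -t(fa))    # TRUE
all(diag(fa) == 0)       # TRUE, the diagonal is exactly zero

# Symmetric part; together the two parts recover f.
fs <- (f + t(f)) / 2
all.equal(f, fs + fa)    # TRUE
```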
A = matrix(1:16,4,4)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
A[2,3]
## [1] 10
A[c(1,3),c(2,4)]
## [,1] [,2]
## [1,] 5 13
## [2,] 7 15
A[1:3,2:4]
## [,1] [,2] [,3]
## [1,] 5 9 13
## [2,] 6 10 14
## [3,] 7 11 15
A[1:2,]
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
A[,1:2]
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
A[1,]
## [1] 1 5 9 13
A[-c(1,3),]
## [,1] [,2] [,3] [,4]
## [1,] 2 6 10 14
## [2,] 4 8 12 16
dim(A)
## [1] 4 4
dim(A)[1]
## [1] 4
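Note that `A[1,]` above printed as a plain vector rather than a one-row matrix: by default R drops dimensions of length one when indexing. A minimal sketch of how to keep the matrix shape with `drop = FALSE`:

```r
A <- matrix(1:16, 4, 4)

# Selecting a single row returns a vector, so dim() is NULL.
dim(A[1, ])

# drop = FALSE keeps the result as a 1 x 4 matrix.
dim(A[1, , drop = FALSE])

# nrow() and ncol() are shortcuts for dim(A)[1] and dim(A)[2].
nrow(A)
ncol(A)
```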
Auto = read.table('Auto.data')
fix(Auto)
Auto = read.table('Auto.data',header=T,na.strings='?')
Auto=read.csv('Auto.csv',header=T,na.strings='?')
fix(Auto)
dim(Auto)
## [1] 397 9
Auto = na.omit(Auto)
dim(Auto)
## [1] 392 9
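To see what `na.strings='?'` and `na.omit()` are doing without downloading the file, here is a small sketch on made-up data. The values in `csv_text` and the name `toy` are invented for illustration and are not taken from the Auto dataset.

```r
# Made-up data in which '?' marks a missing value.
csv_text <- "mpg,horsepower
18,130
25,?
30,65"

# na.strings = '?' converts the '?' entry to NA, so the column stays numeric.
toy <- read.csv(text = csv_text, header = TRUE, na.strings = "?")
toy
dim(toy)    # 3 rows, 2 columns

# na.omit() drops every row that contains at least one NA.
toy <- na.omit(toy)
dim(toy)    # 2 rows, 2 columns
```

Without `na.strings`, the `?` would force the whole horsepower column to be read as text rather than numbers.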
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
# plot(cylinders,mpg) # will throw an error
plot(Auto$cylinders,Auto$mpg)
attach(Auto)
plot(cylinders,mpg)
cylinders = as.factor(cylinders)
plot(cylinders,mpg,col='red',varwidth=T,xlab='cylinders',ylab='mpg')
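One caveat with `attach()`: the line `cylinders = as.factor(cylinders)` creates a new variable in the global environment and leaves the column inside `Auto` unchanged. A sketch of an alternative that avoids `attach()` is shown below. Since Auto has to be downloaded, the built-in `mtcars` dataset is used as a stand-in; that substitution and the name `auto_like` are assumptions made for illustration.

```r
# Assumption: mtcars stands in for Auto (cyl plays the role of cylinders).
auto_like <- mtcars

# Convert the column inside the data frame itself.
auto_like$cyl <- as.factor(auto_like$cyl)
levels(auto_like$cyl)

# The formula interface finds the columns via data =, so no attach() is needed.
# Because cyl is a factor, this draws boxplots.
plot(mpg ~ cyl, data = auto_like, col = "red", varwidth = TRUE,
     xlab = "cylinders", ylab = "mpg")

# with() evaluates an expression using the data frame's columns.
with(auto_like, tapply(mpg, cyl, mean))
```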
pairs(Auto)
pairs(~ mpg + displacement + horsepower + weight + acceleration,Auto)
plot(horsepower,mpg)
identify(horsepower,mpg,name)
## integer(0)
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
head(Auto)
tail(Auto)
Get the dataset from http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
Try some of the questions