Basics of Programming in R

Fundamentals of Data Science

Author

Jeremy Teitelbaum

Some basic characteristics of R

  • The assignment operator in R is <-
  • There is no built-in “dictionary” datatype.
  • The basic datatype in R is the vector, which contains objects of the same type.
  • Vectors are indexed from 1.
# mixing types coerces every element to a common type
x <- c("Hello", 1)
class(x)
[1] "character"

Notice that x is now all characters: the number 1 was coerced to the string "1". In fact, if you now compute 2*x[2] you will get an error.
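If the numeric value is still needed, one option is to convert the element back explicitly. The following is a minimal sketch using base R's as.numeric; the variable names y and z are only for illustration.

```r
x <- c("Hello", 1)

# x[2] is the string "1", so convert it before doing arithmetic
y <- 2 * as.numeric(x[2])
print(y)

# as.numeric returns NA (with a warning) for strings that are not numbers
z <- suppressWarnings(as.numeric(x[1]))
print(is.na(z))
```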

  • Ranges are inclusive
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
  • The logical constants are TRUE and FALSE (not True and False as in Python).

  • Indentation does not matter, and you can use ; to put multiple statements on one line.

x <- 1; y <- 1; z <- 1
  • length gives the length of a vector, nchar gives the number of characters of a string.
length("Hello")
[1] 1
length(c("Hello", "GoodBye"))
[1] 2
nchar("Hello")
[1] 5
nchar(c("Hello", "GoodBye"))
[1] 5 7
  • To extract substrings you need substr, not subscripts: a string is a character vector of length 1, so s[1] is the whole string.
s <- "Hello"
s[1]
[1] "Hello"
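For comparison, here is a short sketch of what substr returns on the same string. The names first and middle are just illustrative.

```r
s <- "Hello"

# substr(x, start, stop) extracts characters start..stop (1-indexed, inclusive)
first <- substr(s, 1, 1)
middle <- substr(s, 2, 4)
print(first)
print(middle)
```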
  • Convert a vector to a string
s <- paste(c("A", "B", "C"), collapse = "")
t <- paste(c("A", "B", "C"), c("D", "E", "F"), sep = ",", collapse = " ")
print(s)
[1] "ABC"
print(t)
[1] "A,D B,E C,F"
  • Extract every other letter with substr
s <- "This is a string of letters"
idx <- seq(1, nchar(s), 2)
t <- substr(rep(s, length(idx)), idx, idx)
paste(t, collapse = "")
[1] "Ti sasrn fltes"

Lists

A list can contain objects of different types.

lst <- list("a", 1.5)

In particular, a list can contain vectors and can have named entries.

lst <- list(first = c(1, 2, 3), second = c(4, 5, 6))
print(lst)
$first
[1] 1 2 3

$second
[1] 4 5 6

In printed output, [[ ]] indicates a list. Double brackets extract an element itself, while single brackets return a sub-list.

print(lst[[1]])
[1] 1 2 3
print(lst$first)
[1] 1 2 3
class(lst[1])
[1] "list"
class(lst[[1]])
[1] "numeric"
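Since R has no built-in dictionary type, a named list is often used as a rough substitute. The sketch below shows one way to do this; the list ages and its keys are made up for illustration.

```r
# a named list used as a simple key-value store (illustrative data)
ages <- list(alice = 30, bob = 25)

# look up a value by a key held in a variable
key <- "bob"
print(ages[[key]])

# add a new entry and list all keys
ages[["carol"]] <- 41
print(names(ages))

# for a list, a missing key returns NULL rather than raising an error
print(is.null(ages[["dave"]]))
```

Note that ages$bob only works when the key is written out literally; [[ ]] is needed when the key is stored in a variable.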
  • Split a string into a list (strsplit returns a list with one entry per input string)
a <- strsplit("This is a string", split = " ")
b <- strsplit("this is a string split into letters", split = "")
print(a)
[[1]]
[1] "This"   "is"     "a"      "string"
print(b)
[[1]]
 [1] "t" "h" "i" "s" " " "i" "s" " " "a" " " "s" "t" "r" "i" "n" "g" " " "s" "p"
[20] "l" "i" "t" " " "i" "n" "t" "o" " " "l" "e" "t" "t" "e" "r" "s"
  • Extract every other letter
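One possible approach, sketched here under the assumption that the goal is to keep the characters at odd positions, is to split the string into single letters and index the resulting vector. The names letters_vec, every_other and result are illustrative.

```r
s <- "This is a string of letters"

# strsplit returns a list; [[1]] pulls out the vector of single characters
letters_vec <- strsplit(s, split = "")[[1]]

# keep the entries at odd positions 1, 3, 5, ...
every_other <- letters_vec[seq(1, length(letters_vec), by = 2)]
result <- paste(every_other, collapse = "")
print(result)
```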

Functions

Functions are constructed like this:

f <- function(n) {
    n^2
}
f(5)
[1] 25

The value of the last evaluated expression is the value of the function, but it is better style to make this explicit by calling return().

f <- function(n) {
    return(n^2)
}
f(10)
[1] 100

Because arithmetic operators work elementwise on vectors, functions built from them are automatically “vectorized.”

f(1:10)
 [1]   1   4   9  16  25  36  49  64  81 100

R automatically “recycles” the shorter vector when the lengths fit. Here 1:3 is repeated twice to match 1:6.

1:3 + 1:6
[1] 2 4 6 5 7 9
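When the lengths do not fit, R still recycles but signals a warning. The sketch below captures that warning so it can be inspected; the names partial and msg are illustrative.

```r
# lengths 3 and 6 fit: the shorter vector is repeated twice, no warning
fits <- 1:3 + 1:6

# lengths 3 and 4 do not fit: R recycles 1:3 as 1, 2, 3, 1
# and signals a warning, which is captured here instead of printed
msg <- NULL
partial <- withCallingHandlers(
    1:3 + 1:4,
    warning = function(w) {
        msg <<- conditionMessage(w)
        invokeRestart("muffleWarning")
    }
)
print(partial)
print(msg)
```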

The principle of scope is essentially the same as discussed in the Python programming notes.
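As a small illustration of scope, the sketch below uses two made-up functions, g and h. It assumes nothing beyond base R.

```r
x <- 10

# assignment inside a function creates a local variable;
# the global x is left unchanged
g <- function() {
    x <- 99
    return(x)
}

# a function can read a global variable it has not assigned locally
h <- function() {
    return(x + 1)
}

print(g())
print(h())
print(x)
```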

Iteration in R

y <- 0
for (x in c(1, 2, 3, 10)) {
    print(x)
    y <- y + x
}
[1] 1
[1] 2
[1] 3
[1] 10
cat("y=", y)
y= 16
y <- 0
while (y < 10) {
    cat("y = ", y, " ", sep = "")
    y <- y + 1
}
y = 0 y = 1 y = 2 y = 3 y = 4 y = 5 y = 6 y = 7 y = 8 y = 9 

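Explicit iteration in R is often unnecessary, because vectorized functions do the looping for you. Suppose you want to compute the sum of the squares of the first n positive integers. The first version below uses a loop; the second uses vectorized arithmetic and sum.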

f <- function(n) {
    s <- 0
    for (i in seq(1, n)) {
        s <- s + i^2
    }
    return(s)
}
f(10)
[1] 385
f <- function(n) {
    return(sum(seq(1, n)^2))
}
f(10)
[1] 385
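One edge case is worth knowing about: seq(1, n) counts downward when n is 0, so both versions above give a surprising answer for n = 0. A possible alternative, sketched here with the illustrative names sum_sq_seq and sum_sq_len, is seq_len.

```r
# seq(1, 0) counts down and gives c(1, 0), so the sum for n = 0 is 1
sum_sq_seq <- function(n) {
    return(sum(seq(1, n)^2))
}

# seq_len(0) is an empty vector, so the sum for n = 0 is 0
sum_sq_len <- function(n) {
    return(sum(seq_len(n)^2))
}

print(sum_sq_seq(0))
print(sum_sq_len(0))
print(sum_sq_len(10))
```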

Logical statements

if(substr("Hesterday",1,1)=="H") {
    print("Yes")
} else {
    print("No")
}
[1] "Yes"
less_than_one <- function(x) {
    if (any(x < 1)) {
        print("Yes")
    } else {
        print("No")
    }
}

Again you can avoid iteration: a logical condition applied to a vector can be used directly as an index.

x <- rnorm(20)
x[x < 1]
 [1]  0.98825215  0.05024871  0.65684013 -2.15985123 -0.64628928 -0.59783264
 [7] -1.16356622 -0.10959459 -0.30605511 -1.16282074 -0.26461149 -0.67179247
[13] -0.58978884  0.60680612 -2.63858979  0.04518775 -0.53765815
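To see what is happening in the indexing step, the sketch below uses a small fixed vector (values chosen only for illustration) so that the results are reproducible.

```r
# a fixed vector (illustrative values) so the results are reproducible
x <- c(0.5, 2.0, -1.2, 1.0, 0.3)

# the comparison is itself a logical vector, one entry per element
mask <- x < 1
print(mask)

# count the matches, find their positions, and select them
print(sum(mask))
print(which(mask))
print(x[mask])
```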

Example

Take a string and make its first character upper case and the rest lower.

f <- function(s) {
    # upper-case the first character, lower-case the rest
    a <- paste(toupper(substr(s, 1, 1)), tolower(substr(s, 2, nchar(s))), sep = "")
    return(a)
}

You can also assign to substrings.

f <- function(s) {
    # lower-case everything, then overwrite the first character
    s <- tolower(s)
    substr(s, 1, 1) <- toupper(substr(s, 1, 1))
    return(s)
}
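A quick usage sketch follows. The name capitalize is hypothetical, chosen here to avoid overwriting f, and the function lower-cases the whole string before assigning to the first character.

```r
capitalize <- function(s) {
    # lower-case everything, then overwrite the first character
    s <- tolower(s)
    substr(s, 1, 1) <- toupper(substr(s, 1, 1))
    return(s)
}

print(capitalize("hELLO"))

# substr and toupper are vectorized, so a vector of strings also works
print(capitalize(c("woRLD", "r")))
```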

Problems

  1. Write a function which takes a string and standardizes it by:
    • removing all characters which are not letters, numbers, or spaces
    • making all the letters lower case
    • replacing all spaces with an underscore “_”

Hint: convert the string to a vector of letters

  2. The object penguins_raw is a “tibble”, which is a fancy type of tabular layout. It has named columns that you can extract with $.
library(palmerpenguins)
# view(penguins_raw)
colnames(penguins_raw)
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           

By assigning to colnames you can change the column names. (In other words, colnames(penguins_raw)<-c(...) replaces the column names with the entries of the given vector.) Use your function from part (1) to simplify the column names of this tibble.

  3. You can access a column of the tibble using $, so for example, once the columns have been renamed as in the previous problem, penguins_raw$species should give you the vector of species. Replace this column with just the first word of the species name (Gentoo, Adelie, Chinstrap).

  4. Let \(n\) be a positive real number and let \(x_0\) be 1. The iteration \[ x_{k+1} = x_{k}/2+n/(2x_k) \]

converges to the square root of \(n\). (This is Newton’s method applied to \(x^2 - n = 0\).) Write an R function which computes the square root using this iteration. You should continue to iterate until \(x_{k+1}\) is within \(10^{-6}\) of \(x_{k}\).
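As a check on the formula, here are the first few iterates worked by hand for the illustrative choice \(n = 2\), with values rounded to seven decimal places:

\[
\begin{aligned}
x_0 &= 1 \\
x_1 &= \tfrac{1}{2} + \tfrac{2}{2 \cdot 1} = 1.5 \\
x_2 &= \tfrac{1.5}{2} + \tfrac{2}{2 \cdot 1.5} \approx 1.4166667 \\
x_3 &= \tfrac{x_2}{2} + \tfrac{2}{2 x_2} \approx 1.4142157 \\
x_4 &= \tfrac{x_3}{2} + \tfrac{2}{2 x_3} \approx 1.4142136
\end{aligned}
\]

Here \(|x_4 - x_3|\) is about \(2 \times 10^{-6}\), still slightly above the tolerance, so one more step is needed before the stopping rule is satisfied.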

#f<-function(n) {

#}

Suppose you want to save the successive values you computed during the iteration for plotting purposes. How could you do that (and return them)?

Suppose you want the tolerance (here \(10^{-6}\)) to be a parameter. How would you change the function?

Suppose you want to set a maximum number of iterations, in case something goes wrong, to prevent an infinite loop?