Control flows in R

Whenever an operation has to be repeated, a loop comes in handy. Loops are good for:

  • Doing something for every element of an object;
  • Doing something until the processed data runs out;
  • Doing something for every file in a folder;
  • Doing something that can fail, until it succeeds;
  • Iterating a calculation until it reaches convergence.
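
The last item in the list, iterating until convergence, can be sketched with a for() loop that stops early through break. This is only an illustration: the starting guess, the tolerance and the cap of 50 iterations are assumptions chosen for the example, and Newton's method for a square root stands in for whatever calculation you need.

```r
# Approximate sqrt(2) with Newton's method, stopping once two
# successive estimates differ by less than a tolerance.
target <- 2
estimate <- 1        # hypothetical starting guess
tolerance <- 1e-8    # hypothetical convergence threshold

for (iteration in 1:50) {          # cap on the number of iterations
  new_estimate <- (estimate + target / estimate) / 2
  converged <- abs(new_estimate - estimate) < tolerance
  estimate <- new_estimate
  if (converged) {
    break                          # leave the loop early
  }
}

print(estimate)
print(iteration)
```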

In this series we look at:

  • for loops;
  • the apply() family of functions, which are interesting function-based alternatives to for() {} loops.

Let's load some data for later on:

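# read the data, drop the first column and take a random sample of 15 rows
library(dplyr) # provides select() and sample_n()

loan_data <- readr::read_csv("loan_data_cleaned.csv") |>
  select(-1) |>
  sample_n(size = 15) |>
  as.data.frame()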

for loops

A for() loop works in the following way:

for(i in sequence) {
expression
}

note

The letter i can be replaced with any variable name, sequence can be elements or the position of these elements, and expression can be anything. Try the examples below:

example 1

for(a in c("Hello", 
           "R/Posit", 
           "group members")) {
  print(a)
}
#> [1] "Hello"
#> [1] "R/Posit"
#> [1] "group members"

example 2

for(z in 1:4) {
  a <- rnorm(n = 1, 
             mean = 5 * z, 
             sd = 2)
  print(a)
}
#> [1] 5.095703
#> [1] 10.85665
#> [1] 11.36553
#> [1] 17.66163

example 3

In this next example, m takes each value between 1 and 6 in turn, until it reaches the last element of the sequence.

y <- 2
for(m in 1:6) {
  print(y*m)
}
#> [1] 2
#> [1] 4
#> [1] 6
#> [1] 8
#> [1] 10
#> [1] 12
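
As the note above says, the sequence can hold the elements themselves or their positions. Here is a minimal sketch of both styles. The fruits vector is a made-up example, and seq_along() is base R's way of generating the positions of a vector.

```r
# Hypothetical example vector, not part of the loan data
fruits <- c("apple", "banana", "cherry")

# Loop over the elements themselves
for (fruit in fruits) {
  print(fruit)
}

# Loop over the positions, then index into the vector
for (i in seq_along(fruits)) {
  print(paste(i, fruits[i]))
}
```

Looping over positions is useful when the index itself is needed, for example to write a result into the matching slot of another vector.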

for loops on different classes

As expected, you can use for() loops on different object types and classes, such as a list. Take the example below, where we create a list called elements.

(elements <- list(a = 1:3, 
                  b = 4:10, 
                  c = 7:-1))
#> $a
#> [1] 1 2 3
#> 
#> $b
#> [1]  4  5  6  7  8  9 10
#> 
#> $c
#> [1]  7  6  5  4  3  2  1  0 -1

Now, let us print the double of every element of the list:

for(element in elements) {
  print(element*2)
}
#> [1] 2 4 6
#> [1]  8 10 12 14 16 18 20
#> [1] 14 12 10  8  6  4  2  0 -2
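
Looping over a list directly gives the values but not their names. If the names are needed too, one possible approach is to loop over names(elements) and fetch each element with [[ ]]. The sum() call is only an illustrative operation.

```r
# Rebuild the list from the text so the block is self-contained
elements <- list(a = 1:3, b = 4:10, c = 7:-1)

# Loop over the names, then use [[ ]] to fetch each element
for (name in names(elements)) {
  cat(name, ":", sum(elements[[name]]), "\n")
}
```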

for and if together

x <- c(2, 5, 3, 9, 6,8)
count <- 0
for(val in x) {
  if(val %% 2 == 0) {
    count <- count + 1
  }
}
print(count)
#> [1] 3

## this counts the even numbers in x (the numbers divisible by 2)
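
For comparison, the same count can be obtained without an explicit loop, because %% and == work on whole vectors. This is a sketch of an alternative, not a replacement for the loop above. The name count_vectorised is made up for the example.

```r
x <- c(2, 5, 3, 9, 6, 8)

# x %% 2 == 0 is a logical vector; sum() counts the TRUE values
count_vectorised <- sum(x %% 2 == 0)
print(count_vectorised)
#> [1] 3
```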

for with a real dataset

for() loops are often used to loop over a dataset. We will use loops to perform operations on loan_data. To see the first 6 rows of the dataset, execute the following code:

head(loan_data)
#>   loan_status loan_amnt grade home_ownership annual_inc age emp_cat  ir_cat
#> 1           0      1000     A           RENT      51000  23    0-15     0-8
#> 2           0      3000     A       MORTGAGE      67000  25    0-15     0-8
#> 3           0     12000     C           RENT      53000  28    0-15   13.5+
#> 4           0      7000     C           RENT      97000  33    0-15 11-13.5
#> 5           0     20000     B       MORTGAGE      48000  29    0-15 Missing
#> 6           0      6000     B           RENT      30000  24    0-15 11-13.5

Now, to print each loan amount in turn, let us do this:

for(i in 1:length(loan_data[,1])) { # for each row in the loan_data dataset
  print(loan_data$loan_amnt[i]) # print the loan amount
}
#> [1] 1000
#> [1] 3000
#> [1] 12000
#> [1] 7000
#> [1] 20000
#> [1] 6000
#> [1] 5200
#> [1] 9000
#> [1] 15400
#> [1] 16000
#> [1] 13000
#> [1] 10000
#> [1] 4000
#> [1] 7500
#> [1] 18000

first five rows

for(i in 1:5) { 
  print(loan_data$loan_amnt[i]) 
}
#> [1] 1000
#> [1] 3000
#> [1] 12000
#> [1] 7000
#> [1] 20000

what about the last five rows

for (i in 11:15) { 
  print(loan_data$loan_amnt[i]) 
}
#> [1] 13000
#> [1] 10000
#> [1] 4000
#> [1] 7500
#> [1] 18000

Now, let us obtain the loan amount for defaulters only

for(i in 1:length(loan_data[,1])) { # for each row in the loan_data dataset
  if(loan_data$loan_status[i] == 1) { # if the loan status is 1 (default)
    print(loan_data$loan_amnt[i]) # print the loan_amount
  }
}
#> [1] 10000

To loop over the number of rows of a data frame, we can use the function nrow():

for(i in 1:nrow(loan_data)) {
  # for each row in
  # the loan_data dataset
  print(loan_data$loan_amnt[i])
  # print the loan_amount
}
#> [1] 1000
#> [1] 3000
#> [1] 12000
#> [1] 7000
#> [1] 20000
#> [1] 6000
#> [1] 5200
#> [1] 9000
#> [1] 15400
#> [1] 16000
#> [1] 13000
#> [1] 10000
#> [1] 4000
#> [1] 7500
#> [1] 18000
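
One caveat with 1:nrow() is worth a short sketch: when a data frame has zero rows, 1:nrow() counts down from 1 to 0 rather than producing an empty sequence. seq_len() avoids this. The empty_loans data frame below is a made-up example.

```r
# Hypothetical empty data frame, e.g. the result of a filter that matched nothing
empty_loans <- data.frame(loan_amnt = numeric(0))

# 1:nrow() counts down from 1 to 0, so a loop over it would run twice
print(1:nrow(empty_loans))
#> [1] 1 0

# seq_len() returns an empty sequence, so the loop body is skipped
for (i in seq_len(nrow(empty_loans))) {
  print(empty_loans$loan_amnt[i])
}
```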

To perform operations on the elements of one column, we can directly iterate over it.

for(p in loan_data$loan_amnt) {
  # for each element of
  # the column "loan_amnt" of
  # the loan_data df
  print(p)
  # print that element
}
#> [1] 1000
#> [1] 3000
#> [1] 12000
#> [1] 7000
#> [1] 20000
#> [1] 6000
#> [1] 5200
#> [1] 9000
#> [1] 15400
#> [1] 16000
#> [1] 13000
#> [1] 10000
#> [1] 4000
#> [1] 7500
#> [1] 18000

The expression within the loop can be almost anything and is usually a compound statement containing many commands.

for(i in c(2,5:6)) { # for columns 2, 5 and 6
  print(colnames(loan_data)[i])
  print(mean(loan_data[,i])) # print the mean of that column from the loan_data dataset
}
#> [1] "loan_amnt"
#> [1] 9806.667
#> [1] "annual_inc"
#> [1] 57206.67
#> [1] "age"
#> [1] 28.06667
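
Printing is fine for inspection, but results are often needed later. A common pattern, sketched here under assumptions, is to pre-allocate a vector and fill it inside the loop. The loans data frame and the name column_means are made up for the example and stand in for the numeric columns of loan_data.

```r
# Hypothetical stand-in for the numeric columns of loan_data
loans <- data.frame(
  loan_amnt  = c(1000, 3000, 12000),
  annual_inc = c(51000, 67000, 53000),
  age        = c(23, 25, 28)
)

# Pre-allocate a named numeric vector with one slot per column
column_means <- numeric(ncol(loans))
names(column_means) <- colnames(loans)

# Fill one slot per iteration
for (i in seq_along(loans)) {
  column_means[i] <- mean(loans[, i])
}

print(column_means)
```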

The apply() family

R provides the apply() function family, which consists of iterative functions that aim at minimizing your need to explicitly create loops.

apply()

Let us consider that we have a height matrix containing the height (in metres) that was taken from five individuals (in rows) at four different times (as columns).

(height <- matrix(runif(20, 1.5, 2),
                  nrow = 5,
                  ncol = 4))
#>          [,1]     [,2]     [,3]     [,4]
#> [1,] 1.763385 1.590613 1.863847 1.501643
#> [2,] 1.534095 1.622937 1.635874 1.962806
#> [3,] 1.877148 1.669160 1.586319 1.540396
#> [4,] 1.901861 1.691776 1.924880 1.568207
#> [5,] 1.789904 1.597716 1.837416 1.621756

We would like to obtain the average height at each time step.

One option is to use a for() {} loop to iterate from column 1 to 4, use the function mean() to calculate the average of the values, and sequentially store the output value in a vector.
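
The loop-based option just described could look like the sketch below. The seed is arbitrary and only there to make the block reproducible, so the numbers will differ from the matrix printed above. The name mean_height is made up for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
height <- matrix(runif(20, 1.5, 2), nrow = 5, ncol = 4)

# Pre-allocate the output, then fill it one column at a time
mean_height <- numeric(ncol(height))
for (j in seq_len(ncol(height))) {
  mean_height[j] <- mean(height[, j])
}
print(mean_height)
```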

Alternatively, we can use the apply() function to set it to apply the mean() function to every column of the height matrix. See the example below:

apply(X = height,
      MARGIN = 2,
      FUN = mean)
#> [1] 1.773279 1.634440 1.769667 1.638962

The apply() function takes three main arguments: X, which takes a matrix or a data frame; MARGIN, which takes 1 for row-wise computations or 2 for column-wise computations; and FUN, which can be any function and is applied along the chosen margin of X.
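
To illustrate the MARGIN argument, here is a minimal sketch of the row-wise case, which gives one average per individual instead of one per time step. The seed is arbitrary and the name individual_means is made up for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
height <- matrix(runif(20, 1.5, 2), nrow = 5, ncol = 4)

# MARGIN = 1: one mean per row, i.e. per individual
individual_means <- apply(X = height, MARGIN = 1, FUN = mean)
print(individual_means)

# Extra arguments after FUN are passed on to it, here na.rm for mean()
time_means <- apply(X = height, MARGIN = 2, FUN = mean, na.rm = TRUE)
print(time_means)
```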

lapply()

lapply() applies a function to every element of a list.

The output returned is also a list (explaining the “l” in lapply) and has the same number of elements as the object passed to it.

SimulatedData <- list(
  SimpleSequence = 1:4,
  Norm10 = rnorm(10),
  Norm20 = rnorm(20, 1),
  Norm100 = rnorm(100, 5)
)

# Apply mean to each element of the list

lapply(X = SimulatedData, 
       FUN = mean)
#> $SimpleSequence
#> [1] 2.5
#> 
#> $Norm10
#> [1] -0.1762098
#> 
#> $Norm20
#> [1] 1.039137
#> 
#> $Norm100
#> [1] 4.983601

If the object passed to lapply() is not a list, it is first coerced to one via base::as.list().
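
A small sketch of that coercion: a plain numeric vector is passed in, and the result still comes back as a list with one element per value. The squaring function is only an illustration.

```r
# A plain numeric vector is treated as a list of three single numbers
squares <- lapply(X = c(1, 2, 3), FUN = function(value) value^2)

# str() shows that the result is a list of length 3
str(squares)
```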

sapply()

sapply() is a ‘wrapper’ function for lapply() that simplifies the output where possible, here returning a named vector instead of a list.

SimulatedData <- list(SimpleSequence = 1:4,
                      Norm10 = rnorm(10),
                      Norm20 = rnorm(20, 1),
                      Norm100 = rnorm(100, 5))

# Apply mean to each element of the list
sapply(SimulatedData, mean)
#> SimpleSequence         Norm10         Norm20        Norm100 
#>     2.50000000    -0.03339704     0.95155936     5.00083959
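
The simplification depends on what the function returns. As a hedged sketch: when each call returns more than one value of the same length, sapply() produces a matrix rather than a vector. vapply(), also in base R, is a stricter relative where the expected shape of each result is declared up front. The values list is made up for the example.

```r
values <- list(first = 1:4, second = c(10, 20, 30))

# range() returns two values per element, so sapply() builds a 2-row matrix
ranges <- sapply(values, range)
print(ranges)

# vapply() checks that every result is a single number
means <- vapply(values, mean, FUN.VALUE = numeric(1))
print(means)
```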

mapply()

mapply() works as a multivariate version of sapply().

It will apply a given function to the first element of each argument first, followed by the second element, and so on. For example:

lilySeeds <- c(80, 65, 89, 23, 21)
poppySeeds <- c(20, 35, 11, 77, 79)

# Output
mapply(sum, lilySeeds, poppySeeds)
#> [1] 100 100 100 100 100
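
The function passed to mapply() can also be one you write yourself, taking one argument per input vector. The sketch below computes the share of lily seeds at each position. The anonymous function and the name lily_share are made up for the example.

```r
lilySeeds <- c(80, 65, 89, 23, 21)
poppySeeds <- c(20, 35, 11, 77, 79)

# Proportion of lily seeds at each position
lily_share <- mapply(
  function(lily, poppy) lily / (lily + poppy),
  lilySeeds,
  poppySeeds
)
print(lily_share)
#> [1] 0.80 0.65 0.89 0.23 0.21
```

For simple arithmetic like this, lilySeeds / (lilySeeds + poppySeeds) gives the same result directly. mapply() earns its place when the function is not already vectorised.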

tapply()

tapply() is used to apply a function over subsets of a vector.

It is primarily used when the dataset contains different groups (i.e. the levels of a factor) and we want to apply a function to each of these groups.

# get the mean loan_amnt by grade
tapply(loan_data$loan_amnt, loan_data$grade, FUN = mean)
#>         A         B         C 
#>  4333.333 11100.000 11280.000
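
Any summary function can take the place of mean. Below is a minimal sketch with max, run on a small made-up data frame called loans so that the block is self-contained. The values are illustrative and are not taken from loan_data.

```r
# Hypothetical stand-in for loan_data
loans <- data.frame(
  grade     = c("A", "B", "A", "C", "B"),
  loan_amnt = c(1000, 12000, 3000, 7000, 6000)
)

# Largest loan amount within each grade
max_by_grade <- tapply(loans$loan_amnt, loans$grade, FUN = max)
print(max_by_grade)
```
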
Bongani Ncube
Data Scientist/Statistics/Public Health

My research interests include Causal Inference, Public Health, Survival Analysis, Bayesian Statistics, Machine Learning and Longitudinal Data Analysis.