SQL and Databases - Joins (R version)

Fundamentals of Data Science

Author

Jeremy Teitelbaum

Now let’s take a look at how joins work in R and the tidyverse library. Before bringing the database into it, let’s work with some local tibbles.

As with our python discussion, we’ll make some tiny tibbles to experiment on.

library(tidyverse)

T1 <- tibble(K = c(1, 1, 2, 3, 4), D = c("A", "B", "C", "D", "E"))
T2 <- tibble(K = c(1, 2, 3, 6), D2 = c("U", "V", "W", "Y"))
print(T1)

# A tibble: 5 × 2
      K D    
  <dbl> <chr>
1     1 A    
2     1 B    
3     2 C    
4     3 D    
5     4 E

print(T2)

# A tibble: 4 × 2
      K D2   
  <dbl> <chr>
1     1 U    
2     2 V    
3     3 W    
4     6 Y

Inner Join: The inner_join function does an inner join. The “by=” keyword tells which key to join on.

inner_join(T1, T2, by = c("K"), suffix = c("_T1", "_T2"))

# A tibble: 4 × 3
      K D     D2   
  <dbl> <chr> <chr>
1     1 A     U    
2     1 B     U    
3     2 C     V    
4     3 D     W

If the key fields have different names in the two tables, you use a named vector. In other words, if the key in T1 were called K1 and in T2 were K2 you would say by=c("K1"="K2").

Suffixes are only attached in the column names are the same. For example:

T1 <- tibble(K1 = c(1, 1, 2, 3, 4), D = c("A", "B", "C", "D", "E"))
T2 <- tibble(K2 = c(1, 2, 3, 6), D = c("U", "V", "W", "Y"))
print(T1)

# A tibble: 5 × 2
     K1 D    
  <dbl> <chr>
1     1 A    
2     1 B    
3     2 C    
4     3 D    
5     4 E

print(T2)

# A tibble: 4 × 2
     K2 D    
  <dbl> <chr>
1     1 U    
2     2 V    
3     3 W    
4     6 Y

print(inner_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2")))

# A tibble: 4 × 3
     K1 D_T1  D_T2 
  <dbl> <chr> <chr>
1     1 A     U    
2     1 B     U    
3     2 C     V    
4     3 D     W

There are similar functions left_join, right_join, and full_join (meaning outer join).

leftT1_T2 <- left_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
rightT1_T2 <- right_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
fullT1_T2 <- full_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
print(leftT1_T2)

# A tibble: 5 × 3
     K1 D_T1  D_T2 
  <dbl> <chr> <chr>
1     1 A     U    
2     1 B     U    
3     2 C     V    
4     3 D     W    
5     4 E     <NA>

print(rightT1_T2)

# A tibble: 5 × 3
     K1 D_T1  D_T2 
  <dbl> <chr> <chr>
1     1 A     U    
2     1 B     U    
3     2 C     V    
4     3 D     W    
5     6 <NA>  Y

print(fullT1_T2)

# A tibble: 6 × 3
     K1 D_T1  D_T2 
  <dbl> <chr> <chr>
1     1 A     U    
2     1 B     U    
3     2 C     V    
4     3 D     W    
5     4 E     <NA> 
6     6 <NA>  Y

Now let’s see what happens when we bring the database into it.

library(dbplyr)
library(DBI) # this is the general DB interface library.
library(RMariaDB) # this is the library for MySQL databases.

con <- DBI::dbConnect(RMariaDB::MariaDB(),
    host = awsresource, # get this from the instructor
    user = username, # get this from the instructor
    password = password, # get this from the instructor
    dbname = "sakila"
)

Let’s open the actor, film, and film_actor tables.

actor <- tbl(con, "actor")
film <- tbl(con, "film")
film_actor <- tbl(con, "film_actor")

The syntax looks the same as in the usual dplyr case.

actor_to_film_actor <- left_join(actor, film_actor, by = c("actor_id"))
actor_to_film_actor

# Source:   SQL [?? x 6]
# Database: mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
   actor_id first_name last_name last_update.x       film_id last_update.y      
      <int> <chr>      <chr>     <dttm>                <int> <dttm>             
 1        1 PENELOPE   GUINESS   2006-02-15 04:34:33       1 2006-02-15 05:05:03
 2        1 PENELOPE   GUINESS   2006-02-15 04:34:33      23 2006-02-15 05:05:03
 3        1 PENELOPE   GUINESS   2006-02-15 04:34:33      25 2006-02-15 05:05:03
 4        1 PENELOPE   GUINESS   2006-02-15 04:34:33     106 2006-02-15 05:05:03
 5        1 PENELOPE   GUINESS   2006-02-15 04:34:33     140 2006-02-15 05:05:03
 6        1 PENELOPE   GUINESS   2006-02-15 04:34:33     166 2006-02-15 05:05:03
 7        1 PENELOPE   GUINESS   2006-02-15 04:34:33     277 2006-02-15 05:05:03
 8        1 PENELOPE   GUINESS   2006-02-15 04:34:33     361 2006-02-15 05:05:03
 9        1 PENELOPE   GUINESS   2006-02-15 04:34:33     438 2006-02-15 05:05:03
10        1 PENELOPE   GUINESS   2006-02-15 04:34:33     499 2006-02-15 05:05:03
# ℹ more rows

To put the film titles together, we can then join on film_id.

actors_and_titles <- left_join(actor_to_film_actor, film, by = c("film_id"))
actors_and_titles <- actors_and_titles |> select(`first_name`, `last_name`, `title`, `length`, `rating`)
actors_and_titles

# Source:   SQL [?? x 5]
# Database: mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
   first_name last_name title                 length rating
   <chr>      <chr>     <chr>                  <int> <chr> 
 1 PENELOPE   GUINESS   ACADEMY DINOSAUR          86 PG    
 2 PENELOPE   GUINESS   ANACONDA CONFESSIONS      92 R     
 3 PENELOPE   GUINESS   ANGELS LIFE               74 G     
 4 PENELOPE   GUINESS   BULWORTH COMMANDMENTS     61 G     
 5 PENELOPE   GUINESS   CHEAPER CLYDE             87 G     
 6 PENELOPE   GUINESS   COLOR PHILADELPHIA       149 G     
 7 PENELOPE   GUINESS   ELEPHANT TROJAN          126 PG-13 
 8 PENELOPE   GUINESS   GLEAMING JAWBREAKER       89 NC-17 
 9 PENELOPE   GUINESS   HUMAN GRAFFITI            68 NC-17 
10 PENELOPE   GUINESS   KING EVOLUTION           184 NC-17 
# ℹ more rows

The pipe notation also works.

actors_and_titles_2 <- actor_to_film_actor |>
    left_join(film, by = c("film_id")) |>
    select(`first_name`, `last_name`, `title`, `length`, `rating`)
actors_and_titles_2

# Source:   SQL [?? x 5]
# Database: mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
   first_name last_name title                 length rating
   <chr>      <chr>     <chr>                  <int> <chr> 
 1 PENELOPE   GUINESS   ACADEMY DINOSAUR          86 PG    
 2 PENELOPE   GUINESS   ANACONDA CONFESSIONS      92 R     
 3 PENELOPE   GUINESS   ANGELS LIFE               74 G     
 4 PENELOPE   GUINESS   BULWORTH COMMANDMENTS     61 G     
 5 PENELOPE   GUINESS   CHEAPER CLYDE             87 G     
 6 PENELOPE   GUINESS   COLOR PHILADELPHIA       149 G     
 7 PENELOPE   GUINESS   ELEPHANT TROJAN          126 PG-13 
 8 PENELOPE   GUINESS   GLEAMING JAWBREAKER       89 NC-17 
 9 PENELOPE   GUINESS   HUMAN GRAFFITI            68 NC-17 
10 PENELOPE   GUINESS   KING EVOLUTION           184 NC-17 
# ℹ more rows

What’s fun about doing this in R is that you can retrieve the actual SQL used to do these manipulations.

actors_and_titles |> show_query()

<SQL>
SELECT `first_name`, `last_name`, `title`, `length`, `rating`
FROM `actor`
LEFT JOIN `film_actor`
  ON (`actor`.`actor_id` = `film_actor`.`actor_id`)
LEFT JOIN `film`
  ON (`film_actor`.`film_id` = `film`.`film_id`)

Finally, you can “pull down” the full table using collect.

actors_and_titles_local <- actors_and_titles |> collect()
actors_and_titles_local

# A tibble: 5,462 × 5
   first_name last_name title                 length rating
   <chr>      <chr>     <chr>                  <int> <chr> 
 1 PENELOPE   GUINESS   ACADEMY DINOSAUR          86 PG    
 2 PENELOPE   GUINESS   ANACONDA CONFESSIONS      92 R     
 3 PENELOPE   GUINESS   ANGELS LIFE               74 G     
 4 PENELOPE   GUINESS   BULWORTH COMMANDMENTS     61 G     
 5 PENELOPE   GUINESS   CHEAPER CLYDE             87 G     
 6 PENELOPE   GUINESS   COLOR PHILADELPHIA       149 G     
 7 PENELOPE   GUINESS   ELEPHANT TROJAN          126 PG-13 
 8 PENELOPE   GUINESS   GLEAMING JAWBREAKER       89 NC-17 
 9 PENELOPE   GUINESS   HUMAN GRAFFITI            68 NC-17 
10 PENELOPE   GUINESS   KING EVOLUTION           184 NC-17 
# ℹ 5,452 more rows