library(tidyverse)SQL and Databases - Joins (R version)
Fundamentals of Data Science
Now let’s take a look at how joins work in R and the tidyverse library. Before bringing the database into it, let’s work with some local tibbles.
As with our python discussion, we’ll make some tiny tibbles to experiment on.
T1 <- tibble(K = c(1, 1, 2, 3, 4), D = c("A", "B", "C", "D", "E"))
T2 <- tibble(K = c(1, 2, 3, 6), D2 = c("U", "V", "W", "Y"))
print(T1)# A tibble: 5 × 2
K D
<dbl> <chr>
1 1 A
2 1 B
3 2 C
4 3 D
5 4 E
print(T2)# A tibble: 4 × 2
K D2
<dbl> <chr>
1 1 U
2 2 V
3 3 W
4 6 Y
Inner Join: The inner_join function does an inner join. The “by=” keyword tells which key to join on.
inner_join(T1, T2, by = c("K"), suffix = c("_T1", "_T2"))# A tibble: 4 × 3
K D D2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
If the key fields have different names in the two tables, you use a named vector. In other words, if the key in T1 were called K1 and in T2 were K2 you would say by=c("K1"="K2").
Suffixes are only attached in the column names are the same. For example:
T1 <- tibble(K1 = c(1, 1, 2, 3, 4), D = c("A", "B", "C", "D", "E"))
T2 <- tibble(K2 = c(1, 2, 3, 6), D = c("U", "V", "W", "Y"))
print(T1)# A tibble: 5 × 2
K1 D
<dbl> <chr>
1 1 A
2 1 B
3 2 C
4 3 D
5 4 E
print(T2)# A tibble: 4 × 2
K2 D
<dbl> <chr>
1 1 U
2 2 V
3 3 W
4 6 Y
print(inner_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2")))# A tibble: 4 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
There are similar functions left_join, right_join, and full_join (meaning outer join).
leftT1_T2 <- left_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
rightT1_T2 <- right_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
fullT1_T2 <- full_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
print(leftT1_T2)# A tibble: 5 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
5 4 E <NA>
print(rightT1_T2)# A tibble: 5 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
5 6 <NA> Y
print(fullT1_T2)# A tibble: 6 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
5 4 E <NA>
6 6 <NA> Y
Now let’s see what happens when we bring the database into it.
library(dbplyr)
library(DBI) # this is the general DB interface library.
library(RMariaDB) # this is the library for MySQL databases.con <- DBI::dbConnect(RMariaDB::MariaDB(),
host = awsresource, # get this from the instructor
user = username, # get this from the instructor
password = password, # get this from the instructor
dbname = "sakila"
)Let’s open the actor, film, and film_actor tables.
actor <- tbl(con, "actor")
film <- tbl(con, "film")
film_actor <- tbl(con, "film_actor")The syntax looks the same as in the usual dplyr case.
actor_to_film_actor <- left_join(actor, film_actor, by = c("actor_id"))
actor_to_film_actor# Source: SQL [?? x 6]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
actor_id first_name last_name last_update.x film_id last_update.y
<int> <chr> <chr> <dttm> <int> <dttm>
1 1 PENELOPE GUINESS 2006-02-15 04:34:33 1 2006-02-15 05:05:03
2 1 PENELOPE GUINESS 2006-02-15 04:34:33 23 2006-02-15 05:05:03
3 1 PENELOPE GUINESS 2006-02-15 04:34:33 25 2006-02-15 05:05:03
4 1 PENELOPE GUINESS 2006-02-15 04:34:33 106 2006-02-15 05:05:03
5 1 PENELOPE GUINESS 2006-02-15 04:34:33 140 2006-02-15 05:05:03
6 1 PENELOPE GUINESS 2006-02-15 04:34:33 166 2006-02-15 05:05:03
7 1 PENELOPE GUINESS 2006-02-15 04:34:33 277 2006-02-15 05:05:03
8 1 PENELOPE GUINESS 2006-02-15 04:34:33 361 2006-02-15 05:05:03
9 1 PENELOPE GUINESS 2006-02-15 04:34:33 438 2006-02-15 05:05:03
10 1 PENELOPE GUINESS 2006-02-15 04:34:33 499 2006-02-15 05:05:03
# ℹ more rows
To put the film titles together, we can then join on film_id.
actors_and_titles <- left_join(actor_to_film_actor, film, by = c("film_id"))
actors_and_titles <- actors_and_titles |> select(`first_name`, `last_name`, `title`, `length`, `rating`)
actors_and_titles# Source: SQL [?? x 5]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
first_name last_name title length rating
<chr> <chr> <chr> <int> <chr>
1 PENELOPE GUINESS ACADEMY DINOSAUR 86 PG
2 PENELOPE GUINESS ANACONDA CONFESSIONS 92 R
3 PENELOPE GUINESS ANGELS LIFE 74 G
4 PENELOPE GUINESS BULWORTH COMMANDMENTS 61 G
5 PENELOPE GUINESS CHEAPER CLYDE 87 G
6 PENELOPE GUINESS COLOR PHILADELPHIA 149 G
7 PENELOPE GUINESS ELEPHANT TROJAN 126 PG-13
8 PENELOPE GUINESS GLEAMING JAWBREAKER 89 NC-17
9 PENELOPE GUINESS HUMAN GRAFFITI 68 NC-17
10 PENELOPE GUINESS KING EVOLUTION 184 NC-17
# ℹ more rows
The pipe notation also works.
actors_and_titles_2 <- actor_to_film_actor |>
left_join(film, by = c("film_id")) |>
select(`first_name`, `last_name`, `title`, `length`, `rating`)
actors_and_titles_2# Source: SQL [?? x 5]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
first_name last_name title length rating
<chr> <chr> <chr> <int> <chr>
1 PENELOPE GUINESS ACADEMY DINOSAUR 86 PG
2 PENELOPE GUINESS ANACONDA CONFESSIONS 92 R
3 PENELOPE GUINESS ANGELS LIFE 74 G
4 PENELOPE GUINESS BULWORTH COMMANDMENTS 61 G
5 PENELOPE GUINESS CHEAPER CLYDE 87 G
6 PENELOPE GUINESS COLOR PHILADELPHIA 149 G
7 PENELOPE GUINESS ELEPHANT TROJAN 126 PG-13
8 PENELOPE GUINESS GLEAMING JAWBREAKER 89 NC-17
9 PENELOPE GUINESS HUMAN GRAFFITI 68 NC-17
10 PENELOPE GUINESS KING EVOLUTION 184 NC-17
# ℹ more rows
What’s fun about doing this in R is that you can retrieve the actual SQL used to do these manipulations.
actors_and_titles |> show_query()<SQL>
SELECT `first_name`, `last_name`, `title`, `length`, `rating`
FROM `actor`
LEFT JOIN `film_actor`
ON (`actor`.`actor_id` = `film_actor`.`actor_id`)
LEFT JOIN `film`
ON (`film_actor`.`film_id` = `film`.`film_id`)
Finally, you can “pull down” the full table using collect.
actors_and_titles_local <- actors_and_titles |> collect()
actors_and_titles_local# A tibble: 5,462 × 5
first_name last_name title length rating
<chr> <chr> <chr> <int> <chr>
1 PENELOPE GUINESS ACADEMY DINOSAUR 86 PG
2 PENELOPE GUINESS ANACONDA CONFESSIONS 92 R
3 PENELOPE GUINESS ANGELS LIFE 74 G
4 PENELOPE GUINESS BULWORTH COMMANDMENTS 61 G
5 PENELOPE GUINESS CHEAPER CLYDE 87 G
6 PENELOPE GUINESS COLOR PHILADELPHIA 149 G
7 PENELOPE GUINESS ELEPHANT TROJAN 126 PG-13
8 PENELOPE GUINESS GLEAMING JAWBREAKER 89 NC-17
9 PENELOPE GUINESS HUMAN GRAFFITI 68 NC-17
10 PENELOPE GUINESS KING EVOLUTION 184 NC-17
# ℹ 5,452 more rows