library(tidyverse)
SQL and Databases - Joins (R version)
Fundamentals of Data Science
Now let’s take a look at how joins work in R and the tidyverse
library. Before bringing the database into it, let’s work with some local tibbles.
As with our python discussion, we’ll make some tiny tibbles to experiment on.
<- tibble(K = c(1, 1, 2, 3, 4), D = c("A", "B", "C", "D", "E"))
T1 <- tibble(K = c(1, 2, 3, 6), D2 = c("U", "V", "W", "Y"))
T2 print(T1)
# A tibble: 5 × 2
K D
<dbl> <chr>
1 1 A
2 1 B
3 2 C
4 3 D
5 4 E
print(T2)
# A tibble: 4 × 2
K D2
<dbl> <chr>
1 1 U
2 2 V
3 3 W
4 6 Y
Inner Join: The inner_join
function does an inner join. The “by=” keyword tells which key to join on.
inner_join(T1, T2, by = c("K"), suffix = c("_T1", "_T2"))
# A tibble: 4 × 3
K D D2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
If the key fields have different names in the two tables, you use a named vector. In other words, if the key in T1
were called K1
and in T2
were K2
you would say by=c("K1"="K2")
.
Suffixes are only attached in the column names are the same. For example:
<- tibble(K1 = c(1, 1, 2, 3, 4), D = c("A", "B", "C", "D", "E"))
T1 <- tibble(K2 = c(1, 2, 3, 6), D = c("U", "V", "W", "Y"))
T2 print(T1)
# A tibble: 5 × 2
K1 D
<dbl> <chr>
1 1 A
2 1 B
3 2 C
4 3 D
5 4 E
print(T2)
# A tibble: 4 × 2
K2 D
<dbl> <chr>
1 1 U
2 2 V
3 3 W
4 6 Y
print(inner_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2")))
# A tibble: 4 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
There are similar functions left_join
, right_join
, and full_join
(meaning outer join).
<- left_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
leftT1_T2 <- right_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
rightT1_T2 <- full_join(T1, T2, by = c("K1" = "K2"), suffix = c("_T1", "_T2"))
fullT1_T2 print(leftT1_T2)
# A tibble: 5 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
5 4 E <NA>
print(rightT1_T2)
# A tibble: 5 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
5 6 <NA> Y
print(fullT1_T2)
# A tibble: 6 × 3
K1 D_T1 D_T2
<dbl> <chr> <chr>
1 1 A U
2 1 B U
3 2 C V
4 3 D W
5 4 E <NA>
6 6 <NA> Y
Now let’s see what happens when we bring the database into it.
library(dbplyr)
library(DBI) # this is the general DB interface library.
library(RMariaDB) # this is the library for MySQL databases.
<- DBI::dbConnect(RMariaDB::MariaDB(),
con host = awsresource, # get this from the instructor
user = username, # get this from the instructor
password = password, # get this from the instructor
dbname = "sakila"
)
Let’s open the actor
, film
, and film_actor
tables.
<- tbl(con, "actor")
actor <- tbl(con, "film")
film <- tbl(con, "film_actor") film_actor
The syntax looks the same as in the usual dplyr
case.
<- left_join(actor, film_actor, by = c("actor_id"))
actor_to_film_actor actor_to_film_actor
# Source: SQL [?? x 6]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
actor_id first_name last_name last_update.x film_id last_update.y
<int> <chr> <chr> <dttm> <int> <dttm>
1 1 PENELOPE GUINESS 2006-02-15 04:34:33 1 2006-02-15 05:05:03
2 1 PENELOPE GUINESS 2006-02-15 04:34:33 23 2006-02-15 05:05:03
3 1 PENELOPE GUINESS 2006-02-15 04:34:33 25 2006-02-15 05:05:03
4 1 PENELOPE GUINESS 2006-02-15 04:34:33 106 2006-02-15 05:05:03
5 1 PENELOPE GUINESS 2006-02-15 04:34:33 140 2006-02-15 05:05:03
6 1 PENELOPE GUINESS 2006-02-15 04:34:33 166 2006-02-15 05:05:03
7 1 PENELOPE GUINESS 2006-02-15 04:34:33 277 2006-02-15 05:05:03
8 1 PENELOPE GUINESS 2006-02-15 04:34:33 361 2006-02-15 05:05:03
9 1 PENELOPE GUINESS 2006-02-15 04:34:33 438 2006-02-15 05:05:03
10 1 PENELOPE GUINESS 2006-02-15 04:34:33 499 2006-02-15 05:05:03
# ℹ more rows
To put the film titles together, we can then join on film_id
.
<- left_join(actor_to_film_actor, film, by = c("film_id"))
actors_and_titles <- actors_and_titles |> select(`first_name`, `last_name`, `title`, `length`, `rating`)
actors_and_titles actors_and_titles
# Source: SQL [?? x 5]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
first_name last_name title length rating
<chr> <chr> <chr> <int> <chr>
1 PENELOPE GUINESS ACADEMY DINOSAUR 86 PG
2 PENELOPE GUINESS ANACONDA CONFESSIONS 92 R
3 PENELOPE GUINESS ANGELS LIFE 74 G
4 PENELOPE GUINESS BULWORTH COMMANDMENTS 61 G
5 PENELOPE GUINESS CHEAPER CLYDE 87 G
6 PENELOPE GUINESS COLOR PHILADELPHIA 149 G
7 PENELOPE GUINESS ELEPHANT TROJAN 126 PG-13
8 PENELOPE GUINESS GLEAMING JAWBREAKER 89 NC-17
9 PENELOPE GUINESS HUMAN GRAFFITI 68 NC-17
10 PENELOPE GUINESS KING EVOLUTION 184 NC-17
# ℹ more rows
The pipe notation also works.
<- actor_to_film_actor |>
actors_and_titles_2 left_join(film, by = c("film_id")) |>
select(`first_name`, `last_name`, `title`, `length`, `rating`)
actors_and_titles_2
# Source: SQL [?? x 5]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
first_name last_name title length rating
<chr> <chr> <chr> <int> <chr>
1 PENELOPE GUINESS ACADEMY DINOSAUR 86 PG
2 PENELOPE GUINESS ANACONDA CONFESSIONS 92 R
3 PENELOPE GUINESS ANGELS LIFE 74 G
4 PENELOPE GUINESS BULWORTH COMMANDMENTS 61 G
5 PENELOPE GUINESS CHEAPER CLYDE 87 G
6 PENELOPE GUINESS COLOR PHILADELPHIA 149 G
7 PENELOPE GUINESS ELEPHANT TROJAN 126 PG-13
8 PENELOPE GUINESS GLEAMING JAWBREAKER 89 NC-17
9 PENELOPE GUINESS HUMAN GRAFFITI 68 NC-17
10 PENELOPE GUINESS KING EVOLUTION 184 NC-17
# ℹ more rows
What’s fun about doing this in R
is that you can retrieve the actual SQL used to do these manipulations.
|> show_query() actors_and_titles
<SQL>
SELECT `first_name`, `last_name`, `title`, `length`, `rating`
FROM `actor`
LEFT JOIN `film_actor`
ON (`actor`.`actor_id` = `film_actor`.`actor_id`)
LEFT JOIN `film`
ON (`film_actor`.`film_id` = `film`.`film_id`)
Finally, you can “pull down” the full table using collect
.
<- actors_and_titles |> collect()
actors_and_titles_local actors_and_titles_local
# A tibble: 5,462 × 5
first_name last_name title length rating
<chr> <chr> <chr> <int> <chr>
1 PENELOPE GUINESS ACADEMY DINOSAUR 86 PG
2 PENELOPE GUINESS ANACONDA CONFESSIONS 92 R
3 PENELOPE GUINESS ANGELS LIFE 74 G
4 PENELOPE GUINESS BULWORTH COMMANDMENTS 61 G
5 PENELOPE GUINESS CHEAPER CLYDE 87 G
6 PENELOPE GUINESS COLOR PHILADELPHIA 149 G
7 PENELOPE GUINESS ELEPHANT TROJAN 126 PG-13
8 PENELOPE GUINESS GLEAMING JAWBREAKER 89 NC-17
9 PENELOPE GUINESS HUMAN GRAFFITI 68 NC-17
10 PENELOPE GUINESS KING EVOLUTION 184 NC-17
# ℹ 5,452 more rows