# you need these packages installed
library(tidyverse)
library(dbplyr)
library(DBI) # this is the general DB interface library.
library(RMariaDB) # this is the library for MySQL databases.
Databases and SQL - Selection - R
Fundamentals of Data Science
The R/SQL interface through the tidyverse
is based on the dbplyr
package. While the python tools we used basically download information from the database, the dbplyr
package creates a datastructure on your local computer that interacts with the database in the background as you manipulate it. As a result, you can (somewhat) avoid explicit SQL.
As with the python library, we create a connection to the database we are interested.
<- DBI::dbConnect(RMariaDB::MariaDB(),
con host = awsresource, # get this from the instructor
user = username, # get this from the instructor
password = password, # get this from the instructor
dbname = "sakila"
)
If you want to submit an SQL query directly, you use the DBI::dbGetQuery
command. So to list the tables in the sakila
database, you send a “show tables” command (the result
is a base R dataframe, so we make it a tibble):
<- "show tables"
query <- DBI::dbGetQuery(con, query)
result print(as_tibble(result))
# A tibble: 23 × 1
Tables_in_sakila
<chr>
1 actor
2 actor_info
3 address
4 category
5 city
6 country
7 customer
8 customer_list
9 film
10 film_actor
# ℹ 13 more rows
To see the fields in the actor
table, you send a “describe actor” command.
<- "describe actor"
query <- DBI::dbGetQuery(con, query)
result print(as_tibble(result))
# A tibble: 4 × 6
Field Type Null Key Default Extra
<chr> <chr> <chr> <chr> <chr> <chr>
1 actor_id smallint unsigned NO "PRI" <NA> "auto_increment"
2 first_name varchar(45) NO "" <NA> ""
3 last_name varchar(45) NO "MUL" <NA> ""
4 last_update timestamp NO "" CURRENT_TIMESTAMP "DEFAULT_GENERATE…
The dbplyr
interface mimics the dplyr
interface, and it’s similar to the python one. To access a table, we use the tbl
command.
<- tbl(con, "actor")
actor <- tbl(con, "film") film
The data structures actor
and film
“look like tibbles” but in fact they are a sort of window into the database.
|> head() actor
# Source: SQL [6 x 4]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
actor_id first_name last_name last_update
<int> <chr> <chr> <dttm>
1 1 PENELOPE GUINESS 2006-02-15 04:34:33
2 2 NICK WAHLBERG 2006-02-15 04:34:33
3 3 ED CHASE 2006-02-15 04:34:33
4 4 JENNIFER DAVIS 2006-02-15 04:34:33
5 5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
6 6 BETTE NICHOLSON 2006-02-15 04:34:33
Notice the description at the top says “Source: SQL” and includes a reference to the database. Still, you can use the usual tidyverse
style commands to work with this object.
For example, if you want to see how often the various first names occur in the actor table, you can use the group_by
, count
, and arrange
functions:
|>
actor group_by(`first_name`) |>
count() |>
arrange(desc(n))
# Source: SQL [?? x 2]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
# Groups: first_name
# Ordered by: desc(n)
first_name n
<chr> <int64>
1 PENELOPE 4
2 JULIA 4
3 KENNETH 4
4 NICK 3
5 ED 3
6 CAMERON 3
7 FAY 3
8 DAN 3
9 MATTHEW 3
10 CHRISTIAN 3
# ℹ more rows
What’s interesting here is that this computation is actually done on the database using SQL. In fact, if you use the show_query
command, you can see what command was used.
|>
actor group_by(`first_name`) |>
count() |>
arrange(desc(n)) |>
show_query()
<SQL>
SELECT `first_name`, COUNT(*) AS `n`
FROM `actor`
GROUP BY `first_name`
ORDER BY `n` DESC
To see another example, let’s look at one of the queries we used in the python example. Suppose we want the “G” rated films from the film table.
In tidyverse
syntax we’d write that like this.
<- film |> filter(`rating` == "G")
G_rated print(G_rated)
# Source: SQL [?? x 13]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
film_id title description release_year language_id original_language_id
<int> <chr> <chr> <int> <int> <int>
1 2 ACE GOLDFI… A Astoundi… 2006 1 NA
2 4 AFFAIR PRE… A Fanciful… 2006 1 NA
3 5 AFRICAN EGG A Fast-Pac… 2006 1 NA
4 11 ALAMO VIDE… A Boring E… 2006 1 NA
5 22 AMISTAD MI… A Emotiona… 2006 1 NA
6 25 ANGELS LIFE A Thoughtf… 2006 1 NA
7 26 ANNIE IDEN… A Amazing … 2006 1 NA
8 39 ARMAGEDDON… A Fast-Pac… 2006 1 NA
9 43 ATLANTIS C… A Thrillin… 2006 1 NA
10 46 AUTUMN CROW A Beautifu… 2006 1 NA
# ℹ more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
# replacement_cost <dbl>, rating <chr>, special_features <chr>,
# last_update <dttm>
Notice in the output here that the header does not tell us how many rows are present. it says “Source: [?? x 13]”. That’s because the code has not yet downloaded everything from the data base; just the first few rows to show you what’s going on. This is called “Lazy Evaluation.”
To see the query and the associated select...where
, we use show_query()
.
|> show_query() G_rated
<SQL>
SELECT *
FROM `film`
WHERE (`rating` = 'G')
The strategy here is that you develop your query until you are satisfied and then pull the results down all at once. Suppose we want to do another of our python exercises by computing the minimum, average, std deviation, and maximum length of films grouped by rating.
In tidyverse world, this is:
<- film |>
summary select(`rating`, `length`) |>
group_by(`rating`) |>
summarize(mean = mean(`length`), minlen = min(`length`), maxlen = max(`length`), stdlen = sd(`length`)) |>
arrange(desc(`mean`))
|> head() summary
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
# Source: SQL [5 x 5]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
# Ordered by: desc(mean)
rating mean minlen maxlen stdlen
<chr> <dbl> <int64> <int64> <dbl>
1 PG-13 120. 46 185 41.2
2 R 119. 49 185 38.5
3 NC-17 113. 46 184 40.8
4 PG 112. 46 185 39.3
5 G 111. 47 185 41.8
Now we can see what the SQL query looked like.
|> show_query() summary
<SQL>
SELECT
`rating`,
AVG(`length`) AS `mean`,
MIN(`length`) AS `minlen`,
MAX(`length`) AS `maxlen`,
STDDEV_SAMP(`length`) AS `stdlen`
FROM (
SELECT `rating`, `length`
FROM `film`
) `q01`
GROUP BY `rating`
ORDER BY `mean` DESC
When you are ready to work locally with the data, you use the collect
command. Notice that actor_local
is a tibble.
<- actor |> collect()
actor_local |> head() actor_local
# A tibble: 6 × 4
actor_id first_name last_name last_update
<int> <chr> <chr> <dttm>
1 1 PENELOPE GUINESS 2006-02-15 04:34:33
2 2 NICK WAHLBERG 2006-02-15 04:34:33
3 3 ED CHASE 2006-02-15 04:34:33
4 4 JENNIFER DAVIS 2006-02-15 04:34:33
5 5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
6 6 BETTE NICHOLSON 2006-02-15 04:34:33