# you need these packages installed
library(tidyverse)
library(dbplyr)
library(DBI) # this is the general DB interface library.
library(RMariaDB) # this is the library for MySQL databases.Databases and SQL - Selection - R
Fundamentals of Data Science
The R/SQL interface through the tidyverse is based on the dbplyr package. While the python tools we used basically download information from the database, the dbplyr package creates a datastructure on your local computer that interacts with the database in the background as you manipulate it. As a result, you can (somewhat) avoid explicit SQL.
As with the python library, we create a connection to the database we are interested.
con <- DBI::dbConnect(RMariaDB::MariaDB(),
host = awsresource, # get this from the instructor
user = username, # get this from the instructor
password = password, # get this from the instructor
dbname = "sakila"
)If you want to submit an SQL query directly, you use the DBI::dbGetQuery command. So to list the tables in the sakila database, you send a “show tables” command (the result is a base R dataframe, so we make it a tibble):
query <- "show tables"
result <- DBI::dbGetQuery(con, query)
print(as_tibble(result))# A tibble: 23 × 1
Tables_in_sakila
<chr>
1 actor
2 actor_info
3 address
4 category
5 city
6 country
7 customer
8 customer_list
9 film
10 film_actor
# ℹ 13 more rows
To see the fields in the actor table, you send a “describe actor” command.
query <- "describe actor"
result <- DBI::dbGetQuery(con, query)
print(as_tibble(result))# A tibble: 4 × 6
Field Type Null Key Default Extra
<chr> <chr> <chr> <chr> <chr> <chr>
1 actor_id smallint unsigned NO "PRI" <NA> "auto_increment"
2 first_name varchar(45) NO "" <NA> ""
3 last_name varchar(45) NO "MUL" <NA> ""
4 last_update timestamp NO "" CURRENT_TIMESTAMP "DEFAULT_GENERATE…
The dbplyr interface mimics the dplyr interface, and it’s similar to the python one. To access a table, we use the tbl command.
actor <- tbl(con, "actor")
film <- tbl(con, "film")The data structures actor and film “look like tibbles” but in fact they are a sort of window into the database.
actor |> head()# Source: SQL [6 x 4]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
actor_id first_name last_name last_update
<int> <chr> <chr> <dttm>
1 1 PENELOPE GUINESS 2006-02-15 04:34:33
2 2 NICK WAHLBERG 2006-02-15 04:34:33
3 3 ED CHASE 2006-02-15 04:34:33
4 4 JENNIFER DAVIS 2006-02-15 04:34:33
5 5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
6 6 BETTE NICHOLSON 2006-02-15 04:34:33
Notice the description at the top says “Source: SQL” and includes a reference to the database. Still, you can use the usual tidyverse style commands to work with this object.
For example, if you want to see how often the various first names occur in the actor table, you can use the group_by, count, and arrange functions:
actor |>
group_by(`first_name`) |>
count() |>
arrange(desc(n))# Source: SQL [?? x 2]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
# Groups: first_name
# Ordered by: desc(n)
first_name n
<chr> <int64>
1 PENELOPE 4
2 JULIA 4
3 KENNETH 4
4 NICK 3
5 ED 3
6 CAMERON 3
7 FAY 3
8 DAN 3
9 MATTHEW 3
10 CHRISTIAN 3
# ℹ more rows
What’s interesting here is that this computation is actually done on the database using SQL. In fact, if you use the show_query command, you can see what command was used.
actor |>
group_by(`first_name`) |>
count() |>
arrange(desc(n)) |>
show_query()<SQL>
SELECT `first_name`, COUNT(*) AS `n`
FROM `actor`
GROUP BY `first_name`
ORDER BY `n` DESC
To see another example, let’s look at one of the queries we used in the python example. Suppose we want the “G” rated films from the film table.
In tidyverse syntax we’d write that like this.
G_rated <- film |> filter(`rating` == "G")
print(G_rated)# Source: SQL [?? x 13]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
film_id title description release_year language_id original_language_id
<int> <chr> <chr> <int> <int> <int>
1 2 ACE GOLDFI… A Astoundi… 2006 1 NA
2 4 AFFAIR PRE… A Fanciful… 2006 1 NA
3 5 AFRICAN EGG A Fast-Pac… 2006 1 NA
4 11 ALAMO VIDE… A Boring E… 2006 1 NA
5 22 AMISTAD MI… A Emotiona… 2006 1 NA
6 25 ANGELS LIFE A Thoughtf… 2006 1 NA
7 26 ANNIE IDEN… A Amazing … 2006 1 NA
8 39 ARMAGEDDON… A Fast-Pac… 2006 1 NA
9 43 ATLANTIS C… A Thrillin… 2006 1 NA
10 46 AUTUMN CROW A Beautifu… 2006 1 NA
# ℹ more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
# replacement_cost <dbl>, rating <chr>, special_features <chr>,
# last_update <dttm>
Notice in the output here that the header does not tell us how many rows are present. it says “Source: [?? x 13]”. That’s because the code has not yet downloaded everything from the data base; just the first few rows to show you what’s going on. This is called “Lazy Evaluation.”
To see the query and the associated select...where, we use show_query().
G_rated |> show_query()<SQL>
SELECT *
FROM `film`
WHERE (`rating` = 'G')
The strategy here is that you develop your query until you are satisfied and then pull the results down all at once. Suppose we want to do another of our python exercises by computing the minimum, average, std deviation, and maximum length of films grouped by rating.
In tidyverse world, this is:
summary <- film |>
select(`rating`, `length`) |>
group_by(`rating`) |>
summarize(mean = mean(`length`), minlen = min(`length`), maxlen = max(`length`), stdlen = sd(`length`)) |>
arrange(desc(`mean`))
summary |> head()Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
# Source: SQL [5 x 5]
# Database: mysql [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
# Ordered by: desc(mean)
rating mean minlen maxlen stdlen
<chr> <dbl> <int64> <int64> <dbl>
1 PG-13 120. 46 185 41.2
2 R 119. 49 185 38.5
3 NC-17 113. 46 184 40.8
4 PG 112. 46 185 39.3
5 G 111. 47 185 41.8
Now we can see what the SQL query looked like.
summary |> show_query()<SQL>
SELECT
`rating`,
AVG(`length`) AS `mean`,
MIN(`length`) AS `minlen`,
MAX(`length`) AS `maxlen`,
STDDEV_SAMP(`length`) AS `stdlen`
FROM (
SELECT `rating`, `length`
FROM `film`
) `q01`
GROUP BY `rating`
ORDER BY `mean` DESC
When you are ready to work locally with the data, you use the collect command. Notice that actor_local is a tibble.
actor_local <- actor |> collect()
actor_local |> head()# A tibble: 6 × 4
actor_id first_name last_name last_update
<int> <chr> <chr> <dttm>
1 1 PENELOPE GUINESS 2006-02-15 04:34:33
2 2 NICK WAHLBERG 2006-02-15 04:34:33
3 3 ED CHASE 2006-02-15 04:34:33
4 4 JENNIFER DAVIS 2006-02-15 04:34:33
5 5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
6 6 BETTE NICHOLSON 2006-02-15 04:34:33