Databases and SQL - Selection - R

Fundamentals of Data Science

Author

Jeremy Teitelbaum

The R/SQL interface through the tidyverse is based on the dbplyr package. While the python tools we used basically download information from the database, the dbplyr package creates a datastructure on your local computer that interacts with the database in the background as you manipulate it. As a result, you can (somewhat) avoid explicit SQL.

# you need these packages installed
library(tidyverse)
library(dbplyr)
library(DBI) # this is the general DB interface library.
library(RMariaDB) # this is the library for MySQL databases.

As with the python library, we create a connection to the database we are interested.

con <- DBI::dbConnect(RMariaDB::MariaDB(),
    host = awsresource, # get this from the instructor
    user = username, # get this from the instructor
    password = password, # get this from the instructor
    dbname = "sakila"
)

If you want to submit an SQL query directly, you use the DBI::dbGetQuery command. So to list the tables in the sakila database, you send a “show tables” command (the result is a base R dataframe, so we make it a tibble):

query <- "show tables"
result <- DBI::dbGetQuery(con, query)
print(as_tibble(result))

# A tibble: 23 × 1
   Tables_in_sakila
   <chr>           
 1 actor           
 2 actor_info      
 3 address         
 4 category        
 5 city            
 6 country         
 7 customer        
 8 customer_list   
 9 film            
10 film_actor      
# ℹ 13 more rows

To see the fields in the actor table, you send a “describe actor” command.

query <- "describe actor"
result <- DBI::dbGetQuery(con, query)
print(as_tibble(result))

# A tibble: 4 × 6
  Field       Type              Null  Key   Default           Extra             
  <chr>       <chr>             <chr> <chr> <chr>             <chr>             
1 actor_id    smallint unsigned NO    "PRI" <NA>              "auto_increment"  
2 first_name  varchar(45)       NO    ""    <NA>              ""                
3 last_name   varchar(45)       NO    "MUL" <NA>              ""                
4 last_update timestamp         NO    ""    CURRENT_TIMESTAMP "DEFAULT_GENERATE…

The dbplyr interface mimics the dplyr interface, and it’s similar to the python one. To access a table, we use the tbl command.

actor <- tbl(con, "actor")
film <- tbl(con, "film")

The data structures actor and film “look like tibbles” but in fact they are a sort of window into the database.

actor |> head()

# Source:   SQL [6 x 4]
# Database: mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
  actor_id first_name last_name    last_update        
     <int> <chr>      <chr>        <dttm>             
1        1 PENELOPE   GUINESS      2006-02-15 04:34:33
2        2 NICK       WAHLBERG     2006-02-15 04:34:33
3        3 ED         CHASE        2006-02-15 04:34:33
4        4 JENNIFER   DAVIS        2006-02-15 04:34:33
5        5 JOHNNY     LOLLOBRIGIDA 2006-02-15 04:34:33
6        6 BETTE      NICHOLSON    2006-02-15 04:34:33

Notice the description at the top says “Source: SQL” and includes a reference to the database. Still, you can use the usual tidyverse style commands to work with this object.

For example, if you want to see how often the various first names occur in the actor table, you can use the group_by, count, and arrange functions:

actor |>
    group_by(`first_name`) |>
    count() |>
    arrange(desc(n))

# Source:     SQL [?? x 2]
# Database:   mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
# Groups:     first_name
# Ordered by: desc(n)
   first_name       n
   <chr>      <int64>
 1 PENELOPE         4
 2 JULIA            4
 3 KENNETH          4
 4 NICK             3
 5 ED               3
 6 CAMERON          3
 7 FAY              3
 8 DAN              3
 9 MATTHEW          3
10 CHRISTIAN        3
# ℹ more rows

What’s interesting here is that this computation is actually done on the database using SQL. In fact, if you use the show_query command, you can see what command was used.

actor |>
    group_by(`first_name`) |>
    count() |>
    arrange(desc(n)) |>
    show_query()

<SQL>
SELECT `first_name`, COUNT(*) AS `n`
FROM `actor`
GROUP BY `first_name`
ORDER BY `n` DESC

To see another example, let’s look at one of the queries we used in the python example. Suppose we want the “G” rated films from the film table.

In tidyverse syntax we’d write that like this.

G_rated <- film |> filter(`rating` == "G")
print(G_rated)

# Source:   SQL [?? x 13]
# Database: mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
   film_id title       description release_year language_id original_language_id
     <int> <chr>       <chr>              <int>       <int>                <int>
 1       2 ACE GOLDFI… A Astoundi…         2006           1                   NA
 2       4 AFFAIR PRE… A Fanciful…         2006           1                   NA
 3       5 AFRICAN EGG A Fast-Pac…         2006           1                   NA
 4      11 ALAMO VIDE… A Boring E…         2006           1                   NA
 5      22 AMISTAD MI… A Emotiona…         2006           1                   NA
 6      25 ANGELS LIFE A Thoughtf…         2006           1                   NA
 7      26 ANNIE IDEN… A Amazing …         2006           1                   NA
 8      39 ARMAGEDDON… A Fast-Pac…         2006           1                   NA
 9      43 ATLANTIS C… A Thrillin…         2006           1                   NA
10      46 AUTUMN CROW A Beautifu…         2006           1                   NA
# ℹ more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
#   replacement_cost <dbl>, rating <chr>, special_features <chr>,
#   last_update <dttm>

Notice in the output here that the header does not tell us how many rows are present. it says “Source: [?? x 13]”. That’s because the code has not yet downloaded everything from the data base; just the first few rows to show you what’s going on. This is called “Lazy Evaluation.”

To see the query and the associated select...where, we use show_query().

G_rated |> show_query()

<SQL>
SELECT *
FROM `film`
WHERE (`rating` = 'G')

The strategy here is that you develop your query until you are satisfied and then pull the results down all at once. Suppose we want to do another of our python exercises by computing the minimum, average, std deviation, and maximum length of films grouped by rating.

In tidyverse world, this is:

summary <- film |>
    select(`rating`, `length`) |>
    group_by(`rating`) |>
    summarize(mean = mean(`length`), minlen = min(`length`), maxlen = max(`length`), stdlen = sd(`length`)) |>
    arrange(desc(`mean`))
summary |> head()

Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.

# Source:     SQL [5 x 5]
# Database:   mysql  [grad5100user@database-1.cr5jdmpqgv4m.us-east-1.rds.amazonaws.com:NA/sakila]
# Ordered by: desc(mean)
  rating  mean  minlen  maxlen stdlen
  <chr>  <dbl> <int64> <int64>  <dbl>
1 PG-13   120.      46     185   41.2
2 R       119.      49     185   38.5
3 NC-17   113.      46     184   40.8
4 PG      112.      46     185   39.3
5 G       111.      47     185   41.8

Now we can see what the SQL query looked like.

summary |> show_query()

<SQL>
SELECT
  `rating`,
  AVG(`length`) AS `mean`,
  MIN(`length`) AS `minlen`,
  MAX(`length`) AS `maxlen`,
  STDDEV_SAMP(`length`) AS `stdlen`
FROM (
  SELECT `rating`, `length`
  FROM `film`
) `q01`
GROUP BY `rating`
ORDER BY `mean` DESC

When you are ready to work locally with the data, you use the collect command. Notice that actor_local is a tibble.

actor_local <- actor |> collect()
actor_local |> head()

# A tibble: 6 × 4
  actor_id first_name last_name    last_update        
     <int> <chr>      <chr>        <dttm>             
1        1 PENELOPE   GUINESS      2006-02-15 04:34:33
2        2 NICK       WAHLBERG     2006-02-15 04:34:33
3        3 ED         CHASE        2006-02-15 04:34:33
4        4 JENNIFER   DAVIS        2006-02-15 04:34:33
5        5 JOHNNY     LOLLOBRIGIDA 2006-02-15 04:34:33
6        6 BETTE      NICHOLSON    2006-02-15 04:34:33