Databases and SQL - A few more keywords

Fundamentals of Data Science

Author

Jeremy Teitelbaum

There are a few additional tricks to know about.

  1. the HAVING clause. This allows you to put conditions on summary variables. For example, suppose we want to find all actors in the sakila database that appear in only one film. Remember that the relationship between films and actors is captured in the film_actor table.

First, we can make a table showing, for each actor_id, how many films that actor appears in.

query = """
select actor_id, count(film_id) as num  from film_actor group by film_id order by num limit 5
"""
run_sql(query, engine)
[(31, 1), (77, 1), (70, 1), (198, 1), (65, 1)]

To pick out the ones where num is one, we use the having clause. (Notice that we include a limit clause to reduce the amount of output).

query = """
select actor_id, count(film_id) as num  from film_actor group by film_id   having count(film_id)=1 limit 5
"""
run_sql(query, engine)
[(70, 1), (65, 1), (77, 1), (198, 1), (31, 1)]

Exercise: How can you fill out the actor names and film names from this information?

  1. The WITH clause.

Suppose you want to find the actors that appear in the most films, it’s a bit trickier One way is to first find the largest number of films, and then join with the table of film counts to pick out those actors having that number of films.

The with clause allows you to nest queries like this.

query = """
with filmcounts as
    (select actor_id, count(film_id) as num from film_actor group by film_id),
    maxfilms as
    (select max(num) as maxnum from filmcounts)
    select actor_id, num from filmcounts join maxfilms on num=maxnum
"""
run_sql(query, engine)
[(28, 15)]

Excercise: Extend this to include the actor names for these most prolific actors. .

If you want the actors with the second largest number of films, you can try this.

query = """
with filmcounts as
    (select actor_id, count(film_id) as num from film_actor group by film_id),
    maxfilms as
    (select max(num) as maxnum from filmcounts where num < (select max(num) from filmcounts))
    select actor_id, num from filmcounts join maxfilms on num=maxnum
"""
run_sql(query, engine)
[(3, 13), (5, 13), (31, 13), (2, 13), (17, 13), (4, 13)]

If you want, say, the top 5, then you need other techniques.