{dm} is fantastic!
A quick overview of the {dm} package for working with relational data models in R.
Sometimes as an R user you may find yourself needing to work with data that is stored in a remote database, but want to continue to use your R-orientated workflow. The dm
package by Kirill Müller provides some great tools for doing exactly that, and in this post I wanted to share some of the functions I’ve found particularly useful.
Now, you may already be familiar with the dbplyr
package which allows you to use the majority of dplyr/tidyr functions on a remote table by auto-magically converting these commands into the necessary SQL and returning the result back to R when you call collect()
. dm
extends this concept and adds extra functionality for working with the whole database, making it a very valuable tool. For the examples in this post, I’ve set up a local SQL server, and copied over the dm::dm_nycflights13()
example database that is included in the dm
package.
# Setup for examples using SQL server
<- DBI::dbConnect(odbc::odbc(),
con Driver = "ODBC Driver 17 for SQL Server",
Server = "localhost",
UID = Sys.getenv("SQL_user"),
PWD = Sys.getenv("SQL_password"),
Port = 1433)
<- dm_nycflights13()
flights
::dbSendQuery(con, "CREATE DATABASE nycflights")
DBI
copy_dm_to(con, dm = flights, temporary = FALSE)
Next I rerun the connection, but this time specifying the newly created nycflights
database.
<- DBI::dbConnect(odbc::odbc(),
con Driver = "ODBC Driver 17 for SQL Server",
Server = "localhost",
Database = "nycflights",
UID = Sys.getenv("SQL_user"),
PWD = Sys.getenv("SQL_password"),
Port = 1433)
Now our data is in the database we can begin working with it. To connect to the database and learn the connections, we can use dm_from_src()
. Setting the learn_keys
argument to TRUE
means dm will attempt to discover the primary and foreign keys between the tables, however this currently only works with Postgres and SQL Server databases.
<- dm_from_src(con, learn_keys = TRUE)
flights
flights
#> ── Table source ───────────────────────────────────────────────────────────
#> src: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`, `mtcars`
#> Columns: 64
#> Primary keys: 3
#> Foreign keys: 2
A really great feature of dm
is the ability to plot the database to visualise the links between tables. We can do this by using the dm_draw()
function
dm_draw(flights)
Now, there is a small issue with the nycflights data in that some of the tailnum
values are not present in both the flights and planes table. However, we can manually create the link by adding an additional foreign key using dm_add_fk()
.
<- flights %>%
flights dm_add_fk(table = flights, columns = tailnum, ref_table = planes)
flights
#> ── Table source ───────────────────────────────────────────────────────────
#> src: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`, `mtcars`
#> Columns: 64
#> Primary keys: 3
#> Foreign keys: 3
And if we draw the model again we can see that planes is now connected to the flights table. I’ve also added some extra styling to show how easy it is to customise these plots with dm
.
You can use dm_get_available_colors()
to see all possible colours
%>%
flights dm_set_colors(
salmon4 = flights,
violetred1 = airlines,
royalblue4 = planes,
wheat = airports
%>%
) dm_draw(rankdir = "TB",
view_type = "title_only",
edge_attrs = "arrowhead = vee")
Working with the data
So, now we have a data model set up we can begin working with it. dm
contains a range of dm_*
functions that work in a similar way to their dpylr
equivalents, but they can affect the whole data model object. For example, we can select certain tables and use filters on the whole data model. Lets drop the weather
table as we aren’t too interested in weather for the time being.
<- flights %>%
flights dm_select_tbl(-weather)
flights
#> ── Table source ───────────────────────────────────────────────────────────
#> src: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `mtcars`
#> Columns: 49
#> Primary keys: 3
#> Foreign keys: 3
By printing the dm object, we can see the weather
table is no longer part of the dm object.
Lets say we are particular interested in finding out about Alaska Airlines Inc. We can use dm_filter()
to find all the linked information for this airline in our data model.
<- flights %>%
flights_filt dm_filter(flights, carrier == "AA")
flights_filt
#> ── Table source ───────────────────────────────────────────────────────────
#> src: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `mtcars`
#> Columns: 49
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> flights: carrier == "AA"
Now we can see we have some additional information regarding the filter when we print the data model. In order to apply this filter, we have a couple of possibilities. If we were interested in returning all the rows in the airports
table, we could apply this filter directly to that table using dm_apply_filters_to_tbl()
.
%>%
flights_filt dm_apply_filters_to_tbl(airports)
#> # Source: lazy query [?? x 8]
#> # Database: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York
#> 2 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York
#> 3 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York
Sometimes, we might want to return all tables with the filter applied. dm provides the dm_flattern_to_tbl()
function to do exactly that. First we need to apply the filter to the dm object, and then we can “flattern” to a single table, specifying the type of join we would like to use. dm
will create the joins based on the keys already defined in the dm object.
%>%
flights_filt dm_apply_filters() %>%
dm_flatten_to_tbl(start = flights, airports, planes, join = left_join)
#> Renamed columns:
#> * year -> flights.year, planes.year
#> # Source: lazy query [?? x 34]
#> # Database: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> flights.year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 531 540 -9 832
#> 2 2013 1 10 553 600 -7 837
#> 3 2013 1 10 555 600 -5 733
#> 4 2013 1 10 604 610 -6 851
#> 5 2013 1 10 604 610 -6 858
#> 6 2013 1 10 625 630 -5 753
#> 7 2013 1 10 633 630 3 1142
#> 8 2013 1 10 652 659 -7 942
#> 9 2013 1 10 659 700 -1 1013
#> 10 2013 1 10 700 700 0 837
#> # … with more rows, and 27 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>,
#> # alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, planes.year <int>,
#> # type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> # seats <int>, speed <int>, engine <chr>
The start
argument is the table that we want to join the others onto. You can optionally supply tables names that you want to join (in this case airports and planes), or specify nothing to join all tables. Finally you can also specify the type of join you would like to use (left_join
is the default).
Before flattening to a single table, it can sometimes be useful to see how many rows each tables has after you have applied the filter.
%>%
flights_filt dm_apply_filters() %>%
dm_nrow()
#> airlines airports flights planes mtcars
#> 1 3 1089 118 32
This can be good to double check the filter has done what you think it was going to do!
To see what’s going on under the hood here, we can use show_query()
.
%>%
flights_filt dm_apply_filters() %>%
dm_flatten_to_tbl(start = flights, airports, planes, join = left_join) %>%
show_query()
#> Renamed columns:
#> * year -> flights.year, planes.year
#> <SQL>
#> SELECT "flights.year", "month", "day", "dep_time", "sched_dep_time", "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "LHS"."tailnum" AS "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour", "name", "lat", "lon", "alt", "tz", "dst", "tzone", "planes.year", "type", "manufacturer", "model", "engines", "seats", "speed", "engine"
#> FROM (SELECT "flights.year", "month", "day", "dep_time", "sched_dep_time", "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour", "name", "lat", "lon", "alt", "tz", "dst", "tzone"
#> FROM (SELECT "year" AS "flights.year", "month", "day", "dep_time", "sched_dep_time", "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour"
#> FROM "dbo"."flights"
#> WHERE ("carrier" = 'AA')) "LHS"
#> LEFT JOIN (SELECT * FROM "dbo"."airports" AS "LHS"
#> WHERE EXISTS (
#> SELECT 1 FROM (SELECT *
#> FROM "dbo"."flights"
#> WHERE ("carrier" = 'AA')) "RHS"
#> WHERE ("LHS"."faa" = "RHS"."origin")
#> )) "RHS"
#> ON ("LHS"."origin" = "RHS"."faa")
#> ) "LHS"
#> LEFT JOIN (SELECT "tailnum", "year" AS "planes.year", "type", "manufacturer", "model", "engines", "seats", "speed", "engine"
#> FROM (SELECT * FROM "dbo"."planes" AS "LHS"
#> WHERE EXISTS (
#> SELECT 1 FROM (SELECT *
#> FROM "dbo"."flights"
#> WHERE ("carrier" = 'AA')) "RHS"
#> WHERE ("LHS"."tailnum" = "RHS"."tailnum")
#> )) "q01") "RHS"
#> ON ("LHS"."tailnum" = "RHS"."tailnum")
So with just a few line of code we were able to generate this fairly lengthy SQL statement. Pretty neat 😎.
Zooming with dm’s
Anther cool feature in dm is the ability to “zoom” into a table and use a dplyr style workflow on that table. This will be very familiar if you have ever used dbplyr
. Lets zoom into the planes
table and find the mean number of seats for each manufacturer.
<- flights_filt %>%
zoomed dm_zoom_to(planes) %>%
group_by(manufacturer) %>%
summarise(mean_seats = mean(seats, na.rm = TRUE)) %>%
arrange(desc(mean_seats))
zoomed
#> # Zoomed table: planes
#> # Source: lazy query [?? x 2]
#> # Database: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> # Ordered by: desc(mean_seats)
#> manufacturer mean_seats
#> <chr> <int>
#> 1 AIRBUS 221
#> 2 AIRBUS INDUSTRIE 187
#> 3 BOEING 175
#> 4 MCDONNELL DOUGLAS 162
#> 5 MCDONNELL DOUGLAS CORPORATION 142
#> 6 MCDONNELL DOUGLAS AIRCRAFT CO 142
#> 7 DOUGLAS 102
#> 8 BOMBARDIER INC 74
#> 9 CANADAIR 55
#> 10 EMBRAER 45
#> # … with more rows
After using dm_zoom_to()
to select the planes table the rest of the code should look fairly familiar. We now have a couple of options for what we want to do with our new zoomed in table. We might want to overwrite the existing planes table with our new summary table. If that was the case we could use the dm_update_zoomed()
to replace the original table with the one we have just created. An alternative (a potentially less destructive) approach is to create a new table containing the summary information.
<- zoomed %>%
flights_updated dm_insert_zoomed(new_tbl_name = "planes_summary")
flights_updated
#> ── Table source ───────────────────────────────────────────────────────────
#> src: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `mtcars`, `planes_summary`
#> Columns: 51
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> flights: carrier == "AA"
Now this particular summary table doesn’t need to link to any of the other tables, but the dm zoomed vingette provides lots of great examples for how one might link summary tables into the dm object.
dm_draw(flights_updated)
However, we may want to access this table and include it in our report. We can access a specific table easily by using the pull_tbl()
function.
%>%
flights_updated pull_tbl(planes_summary) %>%
collect() %>%
::reactable() reactable
Nice!
Conclusion
So that was a quick overview of some of the functions I’ve found particularly useful. dm
can do much more though so check out the site https://cynkra.github.io/dm/index.html which contains a great mix of tutorials and technical articles which make the package extremely accessible and fun to use.
Thanks for reading!