Hi. I’m Sharon Machlis at IDG, here with episode 38 of Do More With R: 3 ways to join data – base R, tidyverse, and data.table.
Here, by joining data, I mean combining two data frames by one or more common columns. Not just tacking on additional rows.
For this example I’m going to use one of my favorite data sets – flight delay times. If you want to follow along, head to the bit.ly link on screen (bit dot ly slash USFlightDelays) and download data with the columns Flight Date, Reporting_Airline, Origin, Destination, and DepartureDelayMinutes. Also get the lookup table for Reporting_Airline.
You can get all the code and links to data downloads at my associated InfoWorld article, at the URL on screen.
Let’s get to it. First let me read in both files, the base R way. I first unzip the flight delay file, and then import the delay and code lookup files with read.csv.
Let’s take a peek at both data frames.
I’d like to add a column to the flight delay data with the airline name. The base R way is with the merge() function.
The syntax for merge is first data frame, second data frame. It doesn’t matter the order, but whichever one is first is considered x and the second one is y. If the columns you want to join by don’t have the same name, you need to tell merge which columns you want to join by: by.x for the x data frame column name, and by.y for the y one.
You can also tell merge whether you want all the rows, including ones without a match, or just rows that match. In this case, I’d like all the rows from the delay data – if there’s no airline code in the lookup table, I still want the information. But I don’t need rows from the lookup table that aren’t in the delay data (there are some codes for old airlines that don’t fly anymore in there). So, all.x equals TRUE but all.y equals FALSE. Let me run that
You can see now that my joined data frame includes a column called “Description” with the name of the airline based on the carrier code. And my new data frame has the same number of rows as my initial delay data frame, so I didn’t miss anything.
Next, let’s look at tidyverse dplyr joins. Dplyr uses SQL database syntax for its join functions.
A “left join” means include everything on the left (what was the x data frame in merge() ) and all rows that match from the right, or y, data frame. If the join columns have the same name, all you need is left_join(x, y). If they don’t have the same name, you need a by argument. Note the syntax there: it’s a vector: left data frame column name equals right data frame column name, both names in quotation marks.
Now let me run that in R.
I’ll load the dplyr and readr packages, and then read in the two files with read_csv. With read_csv, I don’t need to unzip the file first.
Read_csv creates a tibble, which is a type of data frame with some extra features. I’ll join the two tibbles with left_join(). Take a look at the syntax. This time, the order matters. Reminder: left_join means include all rows on the left, or first, data set, but only rows that match from the second one. And, I need to join by two differently named columns, I have a by argument.
This joined data set now has a new column with the name of the airline. You probably noticed that this was way faster than base R.
Now, let’s look at a super-fast way to do this, with the data.table package.
Here I load data.table, then use data.table’s fread() function to import the zip file.
To read the zipped file, I use fread’s capability to call shell commands directly. That’s what the unzip -cq part of the argument is doing. fread() creates a data.table object -- a data frame with extra functionality.
There are at least two ways to do joins with data.table. One is to use the exact same merge() syntax as base R. It works the same, but a LOT faster.
If you want to use data.table syntax, you can first use the setkey() function to set which columns you want to join on. Then, the syntax is – as you see on the last line of code – your lookup table (or right data frame – the one where you don’t want all the rows, only the matching ones), open bracket, table with the data, close bracket
Finally, I’ll demo a package for people who like dplyr syntax, or are used to SQL database syntax, but want the speedy data.table back end. That’s possible with the new dtplyr 1.0 package.
Install and load the package the usual ways. To use dtplyr, you need to turn data frames or tibbles into special lazy datatable objects. You do that with dtplyr’s lazy_dt() function. I do that in the first group of code here, piping the result of read_csv to the lazy_dt() function.
Now I can join those two objects the usual dplyr way. What I get back is a special dtplyr step object . If I print it, you can see the data.table code that created the object. That can be handy! I also see the first few rows of data, and a message that I need to turn that object into a data frame, tibble, or data.table if I want to use the data in there.
After running some rather crude benchmarking, the two data.table methods were fastest; dtplyr was almost as fast; dplyr took about twice as long; and base R was 15 or 20 times slower. Major caution here that performance depends on the structure and size of your data sets and can vary wildly depending on your task. But it’s safe to say that base R isn’t a great choice for anything but small data frames.
To recap:, Here’s code to merge two data frames by a common column where you want all rows of the first (or left) data set and only matching rows from the other one. This is the syntax for base R or data.table.
This is the syntax for dplyr. This is syntax for data.table (remember you can also use merge() ). There’s another syntax for data.table that doesn’t need a separate setkey() statement. I don’t use it, but here it is if you like it better: Add an on argument in the brackets.
What if you only want rows where there’s a match? That’s an inner join.
merge() syntax for base R or data.table is all.x and all.y both FALSE. With dplyr, use an inner_join(). And data.table syntax can also be this in addition to merge() (I’ve left out the setkey() commands since I already set keys for this data.)
If you want to see rows in left (or x) data set where there wasn’t a match, that’s an ANTI JOIN. It can be very useful to run that after a left or inner join to see what didn’t match – sometimes you expect no matches, but other times you don’t.
I’d probably use either anti_join() from dplyr or dt bracket syntax from data.table for this.
If you want all rows from both data sets regardless of matching that’s a full join. For that, I’d usually use either merge() with all equals true, or a dplyr full_join.
Finally, one question I often see: What about joining on 2 columns? Let me go through how to specify that Reminder that you can download all the code and this PowerPoint at the associated article on InfoWorld.
For merge() on multiple columns, you create a conventional vector of x column names, each in quotation marks; and a conventional vector of y column names, also each in quotation marks. If the column names are the same in both data sets you don’t have to specify them, but I usually like to anyway, to prevent unexpected things happening. dplyr syntax is slightly different, and a bit more like SQL. It states which columns equal which other columns. Here, ZipCode should match up with zip, and so on. That’s from another example in the related article.
data.table has a couple of ways to set multiple keys in a data set. There’s setkey() to refer to column names unquoted; and setkeyv if you want the names quoted in a vector. There’s also data.table’s on argument. With all these syntaxes just oh so slightly different from each other, my advice would be to pick a way and stick to it . . . or store these all in an easy-to-access place.
Once again, you can download my PowerPoint slides and code for this video at the URL on this screen.
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at go dot infoworld dot com slash more with R, all lowercase except for the R.
You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel -- where you can subscribe so you never miss an episode. Hope to see you next time!