Closed Captioning Closed captioning available on our YouTube channel

R data.table in 5 minutes

InfoWorld | Oct 31, 2019

Get a quick look at the R data.table package in this 5-minute tutorial that gives a sample of the package’s speed and power.

Similar
Hi. I’m Sharon Machlis at IDG, here with episode 36 of Do More With R: data.table in 5 minutes.

Data.table is a package known for speed and power for data wrangling and analysis. Fans say its syntax is both concise and consistent. That syntax is also a bit different than either base R or the tidyverse.

Let’s dive in and see how it works.

A data table object is a type of data frame, but with special features. There are a couple of ways to create one. Here I’ll load the package and import a CSV file with about 645K rows and 20+ columns using data.table’s fread() function.


You can see fread() is super fast. Looking at the object’s class, mydt is both a data frame and a data table.

You can also turn an existing data frame into a data.table with the as.data.table() function. I’ll create a data frame with base R’s read.csv and then convert it. OK let me get rid of the copies.

Let’s take a look at the structure of mydt: Now I’ve got a data.table here, too.

We’ve got one row per flight, with info like flight date, origin info, destination info, and some time and delay info. What if I want to look at flight delays only from Boston to San Francisco by airline?

Here’s the basic syntax for a LOT of things you’ll want to do with data in data.table:

Mydt open bracket I, j, by close bracket

Which means: Start with mydt, subset or reorder using I, calculate using j, and order by by.

So. If I want to subset this data for the origin being Logan Airport, which is BOS, all I have to do is put ORIGIN == BOS in that I section.

What if I want to look at flights from Boston and to San Francisco? I just add a second condition

I’ll run that code and save it in a new variable called myresult.


To calculate the average (or mean) delay in minutes, I put the calculation in the j part. What I want is the mean of the ARR_DELAY_NEW column, which gives delay in minutes.

I need na.rm = TRUE to remove all the not availables. Let me run that.

You see now I have a single average for all the delays.

Next, I want the average delay by airline. For that, I use the by part.

Let me run that code.

Hmmm. Those Airline carrier names aren’t very intuitive. I don’t have much time to go into joining tables, but quickly so you can see how easy it is . . .


I’ve got a separate lookup table CSV with airline codes and the airline names that I’ll import. Now here’s one way to do the join. I’ll first set joining (and indexing) keys for each table – OP_UNIQUE_CARRIER for the original data and Code for the lookup table. Then it’s just my lookup table with my original data table in brackets. [

There are other ways to join data -- which I’ll explore in a future video since I’m out of time for this episode. Thanks for watching! For more R tips, head to the Do More With R page at go dot infoworld dot com slash more with R, all lowercase except for the R
You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel -- where you can subscribe so you never miss an episode. Hope to see you next time!
Popular
Featured videos from IDG.tv