
How to use .SD in the data.table package

InfoWorld | Jul 18, 2019

See how to use the special .SD symbol in the R data.table package.

Hi. I’m Sharon Machlis at IDG Communications, here with episode 31 of Do More With R: the data.table package’s dot-SD symbol.
For some data.table users, dot-SD is a bit of a mystery. But data.table creator Matt Dowle told me that it’s actually quite simple: Just think of it as a symbol representing “each group”. Let me show you a few examples.
I have a data set of daily cycling trips from the Boston area’s bicycle-share system. If you’d like to follow along, the associated article at InfoWorld has the data.
I’ll load data.table and import my CSV file using data.table’s fread() function. Here I’m saving the data into a data table called mydt.
Next, I’ll look at the first 6 lines to see what the data looks like. I’ve got columns for the date, the user type – subscriber or single-trip customer – number of trips, year, and month starting date so I can easily subtotal by month.
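As a rough sketch of that import step: the real CSV comes with the InfoWorld article, so the snippet below writes a tiny stand-in file first. The file contents and the column names (date, usertype, trips, year, monthstart) are assumptions made for illustration, not the actual data.

```r
library(data.table)

# Hypothetical stand-in for the bike-share CSV; values and column names are assumed.
csv_path <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    date       = c("2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"),
    usertype   = c("Customer", "Subscriber", "Customer", "Subscriber"),
    trips      = c(120L, 1500L, 95L, 2100L),
    year       = 2019L,
    monthstart = "2019-01-01"
  ),
  csv_path
)

# fread() reads the file and returns a data.table
mydt <- fread(csv_path)

# head() shows the first 6 rows by default (this toy file only has 4)
head(mydt)
```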
Here’s the first example Matt suggested: Print the first few rows of the data table grouped by user type. (We’re filtering for the first 12 rows just to make it easier to see the output.)
print() iterated over each group and printed two separate times, one for each usertype. The problem, though, is I don’t know which is the customer user group and which is the subscriber user group. The “by” column doesn’t print out. Matt showed me a little trick for that, though.
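Here is a minimal sketch of that grouped print, run on a small made-up table rather than the real trip data (the values and column names are assumptions, and it filters 6 rows instead of 12 because the toy table is short).

```r
library(data.table)

# Hypothetical toy data standing in for the bike-share trips
mydt <- data.table(
  date     = as.Date("2019-01-01") + rep(0:2, each = 2),
  usertype = rep(c("Customer", "Subscriber"), times = 3),
  trips    = c(120L, 1500L, 95L, 2100L, 130L, 1800L)
)

# j runs once per group; .SD is that group's subset of the data.
# The grouping column is not part of .SD, so it does not show up in the printout.
mydt[1:6, print(.SD), by = usertype]
```

Each printed chunk shows only the date and trips columns, which is why it is hard to tell the two groups apart.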
If you’re familiar with data.table syntax, there are three parts to the bracket notation after the data table name: i, j, and by. The i part is for filtering rows, j is for what you want to do, and by is how you want to group your data.
See the code on line 4? I’ve just put curly braces around the “j” part. That’s going to let me add multiple R expressions inside that “j” part. If I run it now, it’s the same as before. Still no user type names.
But now look at the R statement I added (well, Matt told me to add) on line 6: print(.BY). .BY is a special data.table symbol that holds the value of by – what column or columns I’m grouping by.
If I run this code I now have the name of each grouping variable along with the printout.
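A sketch of what that looks like, again on made-up data (the table and its column names are assumptions for illustration): the curly braces let j hold more than one expression, and print(.BY) labels each group before its rows are printed.

```r
library(data.table)

# Hypothetical toy data standing in for the bike-share trips
mydt <- data.table(
  date     = as.Date("2019-01-01") + rep(0:2, each = 2),
  usertype = rep(c("Customer", "Subscriber"), times = 3),
  trips    = c(120L, 1500L, 95L, 2100L, 130L, 1800L)
)

mydt[1:6, {
  print(.BY)   # the current value of the grouping column(s), as a named list
  print(.SD)   # the rows belonging to that group
}, by = usertype]
```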
So that’s a very basic example. I’m guessing you might want to do something a little more interesting with .SD than print, though. Next we’ll look at which day had the most trips each month.
That first line of code has it all. The first argument in the brackets, i, is filtering for any rows where the year is 2019. The j argument is the interesting part. Think of dot-SD as referring to each group of your data. Or as Matt said, “You do j by by. Like a for loop.” Let’s run that code.
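One way this kind of query could be written is sketched below. It subsets .SD with which.max() inside j, which is a common data.table idiom but not necessarily the exact code shown on screen, and the data and column names are made up for illustration.

```r
library(data.table)

# Hypothetical toy data; the 2018 row is there to show the i filter dropping it
mydt <- data.table(
  date = as.Date(c("2019-01-05", "2019-01-05", "2019-01-20", "2019-01-20",
                   "2019-02-03", "2019-02-03", "2019-02-14", "2019-02-14",
                   "2018-12-31")),
  usertype = c(rep(c("Customer", "Subscriber"), times = 4), "Subscriber"),
  trips = c(120L, 1500L, 95L, 2100L, 130L, 1800L, 210L, 1650L, 9999L),
  year = c(rep(2019L, 8), 2018L),
  monthstart = as.Date(c(rep("2019-01-01", 4), rep("2019-02-01", 4), "2018-12-01"))
)

# i: keep 2019 rows; by: one group per month;
# j: within each group's .SD, keep the row with the most trips
busiest <- mydt[year == 2019, .SD[which.max(trips)], by = monthstart]
busiest
```

In this toy data, as in the real output described next, the top row for every month is a subscriber row.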
Interesting. Every single maximum is subscribers. The next line of code lets me look at which day had the most trips each month by each user type. See I’ve got two columns for by, not just one.
There are a couple of ways to express grouping by more than one column in data.table. And, you can also do a conventional base R vector with quotation marks around each column name. That third line of code is the same as the second.
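To illustrate, here are two forms data.table accepts for multi-column grouping: the .() list shorthand and a quoted character vector. Which form appeared as the second line on screen is an assumption, and the data below is made up.

```r
library(data.table)

# Hypothetical toy data standing in for the bike-share trips
mydt <- data.table(
  date = as.Date(c("2019-01-05", "2019-01-05", "2019-01-20", "2019-01-20",
                   "2019-02-03", "2019-02-03", "2019-02-14", "2019-02-14")),
  usertype = rep(c("Customer", "Subscriber"), times = 4),
  trips = c(120L, 1500L, 95L, 2100L, 130L, 1800L, 210L, 1650L),
  year = 2019L,
  monthstart = as.Date(rep(c("2019-01-01", "2019-02-01"), each = 4))
)

# Busiest day per month AND user type, using the .() shorthand for list()
by_list <- mydt[year == 2019, .SD[which.max(trips)], by = .(monthstart, usertype)]

# Same grouping, written as a base R character vector of column names
by_chr <- mydt[year == 2019, .SD[which.max(trips)], by = c("monthstart", "usertype")]

by_list
```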
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at go dot infoworld dot com slash more with R, all lowercase except for the R.
You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel.
Hope to see you next episode!