Closed Captioning Closed captioning available on our YouTube channel

6 R data.table symbols and operators you should know

InfoWorld | Mar 4, 2020

Your R data.table code will be more efficient and elegant if you know some of the package’s special symbols and operators. Plus, learn about the new fcase() function.

Copyright © 2020 IDG Communications, Inc.

Similar
Hi. I’m Sharon Machlis at IDG Communications here with Episode 43 of Do More With R: 6 data.table symbols you want to know – and a new function in the latest dev version of the package.

Data table code becomes more efficient when you know about its special symbols and functions. Here are the ones I’ll be covering:

subsetting symbols
the ‘walrus’ assignment operator
a counting symbol
and 3 special data.table operators

Let’s get started.

For this demo, I’m going to use data from the 2019 stackoverflow developers survey, with about 90,000 responses. If you want to follow along, you can download the data at insights dot stackoverflow dot com slash survey.

First I’ll load the data dot table package. Next, I’ll load in the data file. I can read just the first few rows of this data with data.table’s fread() function and the nrows argument. I’ll read in 10 rows and take a look.

There’s all sorts of interesting data in there. (If you’re wondering what the columns mean, there’s a separate file in the download that explains them all).

Now let me read in all the data.

Next I’ll create a new data.table with just a few columns to make it easier to work with and see results. A reminder that data.table uses this basic syntax:

The data.table intro tells us to read this as “take dt, subset or reorder rows using I, calculate j, grouped by by.” Another way to think of it is: I and j are similar to base R’s bracket ordering: rows first, columns second. I is for operations you’d do on rows (choosing rows based on row numbers or conditions); j is what you’d do with columns (select columns or create new columns from calculations).

Since I’m selecting columns, that goes in the “j” spot, which means the brackets need a comma first to leave the “I” spot empty.

One of the things I like about data.table is that it’s easy to select columns either quoted or unquoted. Unquoted is often more convenient (that’s usually the tidyverse way). But quoted is useful if you’re using data.table inside your own functions, or you want to pass in a vector you created somewhere else in your code.

You can select data.table columns the typical base R way, with a conventional vector of quoted column names. If you want to use them unquoted, create a list instead of a vector and you can pass in the unquoted names.

And now we come to special symbol #1: Instead of typing out list(), you can just use a dot. That dot is a shortcut for list() inside data.table brackets.

What if you want to use an already-existing vector of column names? Putting the vector object name inside data.table brackets won’t work. Here I’ll create a vector with quoted column names and put that object in a data.table. Like I said, it doesn’t work. Instead, you need to put dot-dot in front of that vector object name. Why two dots? That seemed kind of random to me until I read the explanation: Think of it similar to two dots in a Unix command-line terminal. That moves you up one directory. Here, you’re moving up one namespace, from the environment inside data.table brackets up to the global environment. (That really does help me remember it!)

On to the next symbol. To count by group, you can use data.table’s .N symbol. .N stands for “number of rows”. It can be the total number of rows, or number of rows per group if you’re aggregating in the “by” section.

That first line of code is the total number of rows in the data.table. The second line is the number of rows grouped by one variable: whether people in the survey also code as a hobby. You can use the plain column name if there’s just one variable. If you want to group by two or more variables, use the . list symbol, like in the 3rd line of code. Here I group by hobbyist and whether they contribute to open source.

It might be nice to order that grouped first by hobbyist answer and then from most to least. Final line of code does that. Notice that you can chain brackets with data.table -- but without a special symbol like a pipe. Just one set of brackets after another. The .N symbol automatically generates a column named N (of course you can rename it if you want).

So here is the data.table code. As I learn data.table code, I find it helpful to read it step by step. So I’d read this as “For all rows in mydt (since there’s nothing in the “I” spot), count number of rows, grouping by Hobbyist and OpenSourcer. Then order first by Hobbyist and then number of rows descending.” That’s equivalent to this dplyr code.

Next let’s add some columns to the table, and learn a couple of very handy data.table features. I’d like add columns to see if each respondent uses R, if they use Python, and if they use both or neither. The LanguageWorkedWith column has the info of languages used, and it looks like this. Each answer is a single character string. Most have multiple languages separated by a semicolon.

As always, it’s easy to search for Python but a little more complicated to search for R, since you can’t just search for a capital R in the string. You can see that Ruby and Rust also contain a capital R. Let’s first create a column for Python use.

Here’s what data.table code looks like to create a TRUE/FALSE vector that checks if each string in LanguageWorkedWith contains “Python”. If you know SQL, you’ll recognize that “like” syntax. I, well, like %like%, it’s a nice streamlined way to check for pattern matching. The function documentation says it’s meant to be used inside data.table brackets, but actually you can use it in any of your code, not just with data.tables. I checked with data.table creator Matt Dowle, he said the advice to use it inside the brackets is because some extra performance optimization happens there.

Now here’s the code to add a column called PythonUser to the data.table. Notice that “colon equals” operator. Python has an operator like that, too, and ever since I heard it called the “walrus operator,” that’s what I call it. I think it’s officially “assignment by reference,” and let me show you why. Notice at the top right, dt1 has 6 columns. Let me run the code and update my list of environmental variables by clicking refresh. Now dt1 has 7 columns. The walrus operator changed the existing object without needing to save it to a new variable.
To search for R I’ll use a regular expression. This says find a pattern that starts with a word boundary – that’s the slash slash b – then a capital R, then ends with a word boundary. (I can’t just look for R semicolon because the last item in each string doesn’t have a semicolon).

If you wanted to add both columns at once and change the original object, you need to turn that walrus operator into a function by backquoting it, like this second code block.

There are a couple of other operators worth knowing %between% has this syntax. So if you wanted to filter for all responses where compensation was between 50,000 and 100,000 paid in US dollars here’s the code. The part in red is the between condition. This operator includes both the lower and upper values when it checks.

Another special operator is %chin%. It works like base R’s %in% but is optimized for speed, and is for character vectors only. So, if you wanted to filter for all rows where the OpenSourcer column was either “Never” or “Less than once per year” using c.h.in would look like this. This is pretty similar to base R except that base R needs to specify the data frame inside the bracket and also needs a comma after the filter expression.

For our final couple of symbols – and the new dev-version fcase() function – I’m going to start by creating a new data.table with just people who reported compensation in US dollars. That’s the first line of code.

Next, I’m going to create a new column called “Language” for whether someone uses just R, just Python, both, or neither. And I’ll use the new fcase() function. At the time I recorded this, fcase() was only available in data.table’s development version. Fcase() is similar to SQL’s CASE WHEN statement and dplyr’s case_when() function. If you look at the second code block, you can see the syntax: condition, comma, value, comma, next condition, comma, next value, comma, and so on. I’m setting values for 4 conditions – use R not Python, use Python, not R, use both, use neither. And I’m using the “walrus operator” to update my existing usd data.table. When I run that code I get a new column in my data.table. Unlike with a data frame, the data.table structure doesn’t automatically update in the top right RStudio pane; you do have to refresh to see changes.

In the next line of code, I calculate a new column, AvgSalary, with a regular equals sign so I’m not altering my existing table. And, I’m grouping by Language.

There are a few other symbols I won’t be covering in this video. You can find a list of them in the “special symbols” data.table help file. One of the most useful, .SD, has its own separate video at

That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at bit-dot-l-y slash do more with R, all lowercase except for the R. You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel -- where you can subscribe so you never miss an episode. Hope to see you next time!
Popular
Featured videos from IDG.tv