
A quick look at dplyr’s new across() function

InfoWorld | Apr 14, 2020

See how to use dplyr’s new across() to run functions across multiple columns at once. You can even run more than one function in the same line of code. Access the data here: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

Copyright © 2020 IDG Communications, Inc.

Hi. I’m Sharon Machlis at IDG Communications, here with Episode 45 of Do More With R: a quick look at dplyr’s new across() function.

Analyzing a data frame by column is one of R’s great strengths. But what if you’re a tidyverse user and you want to run a function across multiple columns?

As of dplyr 1.0, there’ll be a new function for this: across(). Let’s take a look.

On the day I recorded this video, dplyr 1.0 wasn’t yet available on CRAN. However, you can get access to all the new functions by downloading the development version with remotes::install_github("tidyverse/dplyr").

For this demo, I’ll use some data showing COVID-19 spread: USAFacts’ confirmed U.S. cases by day and county. If you want to follow along, you can find out more about the data at usafacts.org.

You can download it under a Creative Commons license, as long as you credit them in any published work. (As I have!)

I’ll load the dplyr and readr packages – and remember, I’ve got the development version of dplyr; this won’t work yet with the CRAN version – and read in the file I downloaded. If we view the data frame, you can see that each county is a row, and each date is a column. This is not a tidy data set – but it does work as a good example for using across().

Next, I’m going to subset the data for just New York State in March and April, to make results easier to see. So, I’m filtering for state equals NY, and column names starting with 3 or 4.
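The subsetting step might look something like the sketch below. It uses a tiny made-up data frame in place of the downloaded file, and the column names (`County Name`, `State`, and dates such as `3/31/20`) are assumptions about the file's layout rather than something confirmed in this transcript.

```r
library(dplyr)

# Toy stand-in for the downloaded file: one row per county, one column per date.
# Column names and values are invented for illustration.
covid_data <- tibble(
  `County Name` = c("Albany County", "Bronx County", "Bergen County"),
  State = c("NY", "NY", "NJ"),
  `2/29/20` = c(0, 0, 0),
  `3/31/20` = c(10, 20, 30),
  `4/1/20` = c(12, 25, 33),
  `4/2/20` = c(15, 31, 40)
)

# Keep New York rows, then the county name plus columns starting with 3 or 4
ny_data <- covid_data %>%
  filter(State == "NY") %>%
  select(`County Name`, starts_with("3"), starts_with("4"))
```

Because the date columns have non-standard names, they need backticks whenever they are typed out directly, but selection helpers such as starts_with() take them as plain strings.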

Typically, if I wanted to get the total for each day, I’d either reshape the data or use a package like janitor to add a total row. But if I wanted a separate summary data frame in this format, I can now use across().

So what is across()? I think of it as “perform a function on each column, one column at a time.” It lets you do this without having to name every column one by one or use a purrr mapping function.
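To make the "one column at a time" idea concrete, here is a small comparison, assuming across() is being used inside summarize(). The ny_data values are made up; the point is only that the two versions should produce the same totals.

```r
library(dplyr)

# Made-up data in the same wide shape: one column per date
ny_data <- tibble(
  `County Name` = c("Albany County", "Bronx County"),
  `4/1/20` = c(12, 25),
  `4/2/20` = c(15, 31)
)

# Naming every column by hand
by_hand <- ny_data %>%
  summarize(`4/1/20` = sum(`4/1/20`), `4/2/20` = sum(`4/2/20`))

# The same totals with across(): list the columns once, give the function once
with_across <- ny_data %>%
  summarize(across(c(`4/1/20`, `4/2/20`), sum))
```

With two columns the saving is small, but the by-hand version grows with every date added, while the across() call stays one line.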

Here’s what the code looks like for “give me the sum of every numeric column in this data frame.” The first argument of summarize() should be the data frame, but that’s taken care of with the pipe above. The first argument of across() is the columns to operate on, and it takes any dplyr::select() syntax. I could have also given a range, such as in the second code block. Or I could look at just April using the select() syntax starts_with("4").
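The three column selections described here could be sketched as follows, again on made-up data with assumed column names. One difference from the development version used in the video: released versions of dplyr expect a predicate such as is.numeric to be wrapped in where(), so that form is used below.

```r
library(dplyr)

# Made-up New York subset: one column per date
ny_data <- tibble(
  `County Name` = c("Albany County", "Bronx County"),
  `3/30/20` = c(8, 17),
  `3/31/20` = c(10, 20),
  `4/1/20` = c(12, 25),
  `4/2/20` = c(15, 31)
)

# Sum of every numeric column
ny_totals <- ny_data %>%
  summarize(across(where(is.numeric), sum))

# Sum of a range of columns, first:last
range_totals <- ny_data %>%
  summarize(across(`3/31/20`:`4/2/20`, sum))

# April only
april_totals <- ny_data %>%
  summarize(across(starts_with("4"), sum))
```

Each result is a one-row data frame that keeps the original date column names, so it has the same wide layout as the source data.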

across() allows for multiple functions to be run on each column in the same code. For example, what if I want to see the maximum and median for each day? If I create a list of named functions, I can apply that list all at once.

Hopefully you can see in the first block of code that I created a list called median_and_max with two functions: one, med, is the median (removing any missing values); the other, max, is for maximum. Now, I’ll create a new summarized data frame called April_median_and_max. Using across(), I’ll select all the April columns – that’s the starts_with("4") – and then add my named list of functions as the last argument.

Look what happens when I run that code. Each date column now has two columns! One with median and the other with max. Not the tidiest of data formats, but sometimes we live in a world where people want a human-readable format like this. Now, no data reshaping required.
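A minimal sketch of that pattern, using made-up numbers and assumed column names, might look like this. The list and data frame names follow the ones mentioned above; the exact function definitions are a guess at what was on screen.

```r
library(dplyr)

# Made-up New York subset: one column per date
ny_data <- tibble(
  `County Name` = c("Albany County", "Bronx County", "Kings County"),
  `3/31/20` = c(10, 20, 40),
  `4/1/20` = c(12, 25, 50),
  `4/2/20` = c(15, 31, 60)
)

# Named list of functions; the names become suffixes on the output columns
median_and_max <- list(
  med = ~ median(.x, na.rm = TRUE),  # median, ignoring missing values
  max = max
)

# Apply both functions to every April column
April_median_and_max <- ny_data %>%
  summarize(across(starts_with("4"), median_and_max))

names(April_median_and_max)
# "4/1/20_med" "4/1/20_max" "4/2/20_med" "4/2/20_max"
```

The column names come from across()'s default naming scheme, which joins the original column name and the function's name in the list with an underscore.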

That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at bit.ly/domorewithR, all lowercase except for the R. You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel, where you can subscribe so you never miss an episode. Hope to see you next time. Stay healthy and safe, everyone!