5 handy options in R data.table’s fread

Not just a speedy way to import data into R, fread has useful options for importing subsets of data, setting data types for columns, running system commands, and more

5 handy options in R data.table’s fread

Like all functions in the data.table R package, fread is fast. Very fast. But there’s more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful.

Note: If you’d like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.

Use fread’s nrows option

Is your file large? Would you like to examine its structure before importing the whole thing – without having to open it in a text editor or Excel? Use fread’s nrows option to import only a portion of a file for exploration.

The code below imports just the first 10 rows of the CSV.

mydt10 <- fread("us-counties.csv", nrows = 10)

If you just want to see column names without any data at all, you can use nrows = 0

Use fread’s select option

Once you know the file structure, you can choose which columns to import. fread’s select option lets you pick columns you want to keep. select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:

mydt <- fread("us-counties.csv", 
select = c("date", "county", "state", "cases"))

As always, numbers don’t need quotation marks:

mydt <- fread("us-counties.csv", select = c(1,2,3,5))

You can use an R object with a vector of column names inside fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases; then I use that vector inside fread.

my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)

The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as:

mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))

Like with select, drop takes a vector of column names or numerical positions. 

Use fread with grep

If you’re familiar with Unix, you can execute command-line tools right from inside fread. For example, if I just wanted California data, I could use grep to only import lines that contain the text “California.” Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.

ca <- fread("grep California us-counties.csv")

Unfortunately, grep doesn’t understand the original file’s column names, so you end up with default names.

           V1          V2         V3   V4 V5 V6
1: 2020-01-25      Orange California 6059  1  0
2: 2020-01-26 Los Angeles California 6037  1  0
3: 2020-01-26      Orange California 6059  1  0
4: 2020-01-27 Los Angeles California 6037  1  0
5: 2020-01-27      Orange California 6059  1  0
6: 2020-01-28 Los Angeles California 6037  1  0

However, fread lets us specify column names with the col.names option. I can set the names based on names from mydt10 that I created above.

ca <- fread("grep California us-counties.csv", 
             col.names = names(mydt10))
> head(ca)
         date      county      state fips cases deaths
1: 2020-01-25      Orange California 6059     1      0
2: 2020-01-26 Los Angeles California 6037     1      0
3: 2020-01-26      Orange California 6059     1      0
4: 2020-01-27 Los Angeles California 6037     1      0
5: 2020-01-27      Orange California 6059     1      0
6: 2020-01-28 Los Angeles California 6037     1      0

We can also use regular expressions, with grep’s -E option, letting us do more complex searches, such as looking for four states at once. 

states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv", 
col.names = names(mydt10))

Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check what states are included in the results with unique(states4$state), you’ll see Oklahoma and Missouri in the states column along with Texas, Arizona, Florida, and South Carolina. That’s because both Oklahoma and Missouri have counties named Texas.

So, grep during file import is a way to filter out a lot of data you don’t want from a very large data set; but it doesn’t guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn’t get anything unexpected.

Use fread’s colClasses option

You can set column classes during import – for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named date to the data type Date during import using the colClasses option. 

mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))

Now, dates are Dates.

> str(mydt)
Classes ‘data.table’ and 'data.frame':	322651 obs. of  6 variables:
 $ date  : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ...
 $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
 $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
 $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
 $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
 $ deaths: int  0 0 0 0 0 0 0 0 0 0 ...

Use fread on zipped files

You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz"). If you need to import a zip file, you can unzip it with the unzip system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip').

For more R tips, head to InfoWorld’s Do More With R page.

Copyright © 2020 IDG Communications, Inc.