Beginner's guide to R: Get your data into R

In part 2 of our hands-on guide to the hot data-analysis environment, we provide tips on how to import data in various formats, both local and on the Web

Page 2 of 3

Categories or values?

Because of R's roots as a statistical tool, when you import non-numerical data, R may assume that character strings are statistical factors -- things like "poor," "average" and "good" -- or "success" and "failure."

But your text columns may not be categories that you want to group and measure, just names of companies or employees. If you don't want your text data to be read in as factors, add stringsAsFactor=FALSE to read.table, like this:

mydata <- read.table("filename.txt", sep="\t", header=TRUE, stringsAsFactor=FALSE)

If you'd prefer, R allows you to use a series of menu clicks to load data instead of "reading" data from the command line as just described. To do this, go to the Workspace tab of RStudio's upper-right window, find the menu option to "Import Dataset," then choose a local text file or URL.

As data are imported via menu clicks, the R command that RStudio generated from your menu clicks will appear in your console. You may want to save that data-reading command into a script file if you're using this for significant analysis work, so that others -- or you -- can reproduce that work.

The 3-minute YouTube video below, recorded by UCLA statistics grad student Miles Chen, shows an RStudio point-and-click data import.

Copying data snippets

If you have just a small section of data already in a table -- a spreadsheet, say, or a Web HTML table -- you can Ctrl-C copy those data to your Windows clipboard and import them into R.

The command below handles clipboard data with a header row that's separated by tabs and stores the data in a data frame (x):

x <- read.table(file = "clipboard", sep="\t", header=TRUE)

You can read more about using the Windows clipboard in R at the R For Dummies website.

On a Mac, the pipe ("pbpaste") function will access data you've copied with Cmd-C, so this will do the equivalent of the previous Windows command:

x <- read.table(pipe("pbpaste"), sep="\t")

Other formats

There are R packages that will read files from Excel, SPSS, SAS, Stata and various relational databases. I don't bother with the Excel package; it requires both Java and Perl, and in general I'd rather export a spreadsheet to CSV in hopes of not running into Microsoft special-character problems. For more info on other formats, see UCLA's How to input data into R, which discusses the foreign add-on package for importing several other statistical software file types.

If you'd like to try to connect R with a database, there are several dedicated packages such as RPostgreSQL, RMySQL, RMongo, RSQLite and RODBC.

(You can see the entire list of available R packages at the CRAN website.)

Remote data

In R, read.csv() and read.table() work pretty much the same to access files from the Web as they do for local data.

Do you want Google Spreadsheets data in R? You don't have to download the spreadsheet to your local system as you do with a CSV. Instead, in your Google spreadsheet -- properly formatted with just one row for headers and then one row of data per line -- select File > Publish to the Web. (This will make the data public, although only to someone who has or stumbles upon the correct URL. Beware of this process, especially with sensitive data.)

Select the sheet with your data and click "Start publishing." You should see a box with the option to get a link to the published data. Change the format type from Web page to CSV and copy the link. Now you can read those data into R with a command such as:

mydata <- read.csv("")

The command structure is the same for any file on the Web. For example, Pew Research Center data about mobile shopping are available as a CSV file for download. You can store the data in a variable called pew_data like this:

pew_data <- read.csv("")

It's important to make sure the file you're downloading is in an R-friendly format first: in other words, that it has a maximum of one header row, with each subsequent row having the equivalent of one data record. Even well-formed government data might include lots of blank rows followed by footnotes -- that's not what you want in an R data table if you plan on running statistical analysis functions on the file.

| 1 2 3 Page 2
From CIO: 8 Free Online Courses to Grow Your Tech Skills
View Comments
Join the discussion
Be the first to comment on this article. Our Commenting Policies