Beginner's guide to R: Easy ways to do basic data analysis

Congrats, you've read your data into an R object. Now take the next step


If you're finding that your selection statement is starting to get unwieldy, you can put your row and column selections into variables first, such as:

mpg20 <- mtcars$mpg > 20

cols <- c("mpg", "hp")

Then you can select the rows and columns with those variables:

mtcars[mpg20, cols]

This makes for a more compact select statement, at the cost of more lines of code.
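To see why this works, it can help to look at what the two variables actually hold. The following is a minimal sketch using only the built-in mtcars data; the name efficient is just an illustrative choice, not something from the original example.

```r
# mtcars ships with base R (datasets package), so no setup is needed.
mpg20 <- mtcars$mpg > 20          # logical vector: one TRUE/FALSE per row
cols  <- c("mpg", "hp")           # character vector of column names to keep

length(mpg20) == nrow(mtcars)     # one entry for every row of the data frame
sum(mpg20)                        # TRUE counts as 1, so this counts matching rows

# 'efficient' is a hypothetical name for the filtered result.
efficient <- mtcars[mpg20, cols]  # keep rows where mpg20 is TRUE, and two columns
head(efficient)
```

Because mpg20 is an ordinary logical vector, you can reuse it in several statements or combine it with other conditions using & and |.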

Getting tired of including the name of the data set multiple times per command? If you're using only one data set and you are not making any changes to the data that need to be saved, you can attach and detach a copy of the data set temporarily.

The attach() function works like this:

attach(mtcars)

So, instead of having to type:

mpg20 <- mtcars$mpg > 20

You can leave out the data set reference and type this instead:

mpg20 <- mpg > 20

After using attach(), remember to use the detach() function when you're finished. Naming the data set makes sure you detach the right thing:

detach(mtcars)

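Some R users advise avoiding attach() because it is easy to forget to detach(). If you leave the copy attached, a bare name such as mpg could end up referencing the wrong data: an object of the same name in your workspace takes precedence over the attached column, and later changes to mtcars are not reflected in the attached copy.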
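If that risk worries you, one alternative that this guide does not cover is base R's with() function, which evaluates a single expression using a data frame's columns and leaves nothing attached afterward. This is a minimal sketch, not a recommendation from the original text; avg_hp is a hypothetical variable name.

```r
# with() looks up column names inside mtcars for this one expression only,
# so there is nothing to detach afterward.
mpg20 <- with(mtcars, mpg > 20)

# Several columns can be used in the same expression:
# average horsepower of the cars that get more than 20 MPG.
avg_hp <- with(mtcars, mean(hp[mpg > 20]))
avg_hp
```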

Alternative to bracket notation

Bracket syntax is pretty common in R code, but it's not your only option. If you dislike that format, you might prefer the subset() function instead, which works with vectors and matrices as well as data frames. The format:

subset(your data object, logical condition for the rows you want to return, select statement for the columns you want to return)

In the mtcars example, to find all rows where MPG is greater than 20 and return only those rows with their MPG and HP data, the subset() statement would look like:

subset(mtcars, mpg>20, c("mpg", "hp"))

What if you wanted to find the row with the highest MPG?

subset(mtcars, mpg==max(mpg))

If you just wanted to see the MPG information for the highest MPG:

subset(mtcars, mpg==max(mpg), mpg)

If you just want to use subset to extract some columns and display all rows, you can either leave the row conditional spot blank with a comma, similar to bracket notation:

subset(mtcars, , c("mpg", "hp"))

Or name the argument with select= so R knows it refers to columns rather than rows, like this:

subset(mtcars, select=c("mpg", "hp"))
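To check that subset() and bracket notation really are two routes to the same result, you can build both and compare them. This is a small sketch on mtcars; the variable names are illustrative only.

```r
# Same selection written two ways: rows with MPG over 20, two columns.
by_subset  <- subset(mtcars, mpg > 20, c("mpg", "hp"))
by_bracket <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]
identical(by_subset, by_bracket)

# The highest-MPG example, again both ways. drop = FALSE keeps the
# bracket result as a one-column data frame instead of a bare number.
top_subset  <- subset(mtcars, mpg == max(mpg), mpg)
top_bracket <- mtcars[mtcars$mpg == max(mtcars$mpg), "mpg", drop = FALSE]
identical(top_subset, top_bracket)
```

One difference to keep in mind: when the condition evaluates to NA for a row, subset() leaves that row out, while bracket notation returns a row of NAs. mtcars has no missing values, so the two agree here.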

Counting factors

To tally up counts by factor, try the table() function. For the diamonds data set (which ships with the ggplot2 package), to see how many diamonds of each category of cut are in the data, you can use:

table(diamonds$cut)

This will return how many diamonds of each level of the cut factor (Fair, Good, Very Good, Premium and Ideal) exist in the data. Want to see a cross-tab by cut and color?

table(diamonds$cut, diamonds$color)

R's table function returns a count of each factor in your data.
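The diamonds data needs the ggplot2 package, so here is the same idea sketched on the built-in mtcars data instead. The cyl and gear columns are stored as numbers rather than factors, but table() tallies each distinct value all the same. The variable names are illustrative.

```r
# One-way tally: how many cars have 4, 6 or 8 cylinders.
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Two-way cross-tab: cylinders (rows) by number of forward gears (columns).
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
cyl_by_gear

# Every car is counted exactly once in each table.
sum(cyl_counts) == nrow(mtcars)
sum(cyl_by_gear) == nrow(mtcars)
```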

If you are interested in learning more about statistical functions in R and how to slice and dice your data, there are a number of free academic downloads with many more details. One is "Learning statistics with R" by Daniel Navarro at the University of Adelaide in Australia (a 500+ page PDF, so the download may take a little while). And although not free, books such as "The R Cookbook" and "R in a Nutshell" have a lot of good examples and well-written explanations.

See the entire beginner's guide to R:

Part 1: Introduction to R

Part 2: Getting your data into R

Part 3: Easy ways to do basic data analysis with R

Part 4: Painless data visualization using R

Part 5: Syntax quirks you'll want to know about R

Part 6: Useful resources for R

This story, "Beginner's guide to R: Easy ways to do basic data analysis" was originally published by Computerworld.
