Beginner's guide to R: Easy ways to do basic data analysis

Congrats, you've read your data into an R object. Now take the next step

Page 3 of 5

Not all R functions need a robust data set to be useful for statistical work. For example, how many ways can you select a committee of 4 people from a group of 15? You can pull out your calculator and find 15! divided by the product of 4! and 11! ... or you can use the R choose() function:

choose(15, 4)

Or, perhaps you want to see all of the possible pair combinations of a group of 5 people, not simply count them. You can create a vector with the people's names and store it in a variable called mypeople:

mypeople <- c("Bob", "Joanne", "Sally", "Tim", "Neal")

In the example above, c() is the combine function.

Then run the combn() function, which takes two arguments -- your entire set first and then the number you want to have in each group:

combn(mypeople, 2)

The combn() function returns every possible combination from a group, arranged as a matrix in which each column is one combination.

Probably most experienced R users would combine these two steps into one like this:

combn(c("Bob", "Joanne", "Sally", "Tim", "Neal"), 2)

But separating the two can be more readable for beginners.
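As a quick sanity check, the count from choose() should agree with the factorial formula, and the number of groups that combn() lists should agree with choose() as well. The sketch below uses only base R; the variable names (committee_count, by_formula, people_pairs) are illustrative and not part of the original example.

```r
# Count committees of 4 from 15 people two ways
committee_count <- choose(15, 4)
by_formula <- factorial(15) / (factorial(4) * factorial(11))
all.equal(committee_count, by_formula)

# List every pair from the group of 5
mypeople <- c("Bob", "Joanne", "Sally", "Tim", "Neal")
people_pairs <- combn(mypeople, 2)

dim(people_pairs)    # rows = people per group, columns = number of groups
people_pairs[, 1]    # the first pair
ncol(people_pairs) == choose(length(mypeople), 2)
```

Because each column of the result is one group, indexing by column (as in people_pairs[, 1]) is the natural way to pull out a single combination.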

Get slices or subsets of your data

Maybe you don't need to analyze every column in your data frame and you just want to work with a couple of columns, not 15. Perhaps you want to see only data that meets a certain condition, such as values within 3 standard deviations of the mean. R lets you slice your data sets in various ways, depending on the data type.

To select just certain columns from a data frame, you can either refer to the columns by name or by their location (column 1, 2, 3).

For example, the mtcars sample data frame has these column names: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, and carb.
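Here is a minimal sketch of both approaches, by name and by position, using bracket notation on mtcars. The variable names by_name and by_position are made up for illustration.

```r
# mtcars ships with base R (datasets package), so nothing needs loading
by_name <- mtcars[, c("mpg", "hp")]   # columns picked by name
by_position <- mtcars[, 1:3]          # columns 1 through 3: mpg, cyl, disp

head(by_name)        # first few rows of the two-column subset
names(by_position)   # confirm which columns position 1:3 selected
```

Selecting by name tends to be safer when the column order might change; selecting by position is shorter when you want a contiguous run of columns.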

Can't remember the names of all the columns in your data frame? If you just want to see the column names and nothing else, instead of using functions such as str(mtcars) and head(mtcars), you can type:

names(mtcars)

That's handy if you want to store the names in a variable, perhaps called mtcars.colnames (or anything else you'd like to call it):

mtcars.colnames <- names(mtcars)

But back to the task at hand. To access only the data in the mpg column in mtcars, you can use R's dollar sign notation:

mtcars$mpg

More broadly, then, the format for accessing a column by name would be:

dataframename$columnname

For mtcars$mpg, that will give you a 1-dimensional vector of numbers like this:

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8
[12] 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5
[23] 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

The numbers in brackets are not part of your data, by the way. They indicate what item number each line is starting with. If you have only one line of data, you'll just see [1]. If there's more than one line of data and only the first 11 entries can fit on the first line, your second line will start with [12], and so on.
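To see how the bracketed prefixes relate to the data, you can index the vector directly. This is a small sketch; the variable names mpg and old_opts are illustrative, and the width of 40 is an arbitrary choice.

```r
mpg <- mtcars$mpg
length(mpg)   # total number of items in the vector
mpg[1]        # the item the [1] prefix points to
mpg[12]       # the item a [12] prefix would point to

# Narrowing the console width changes how many items fit per line,
# and therefore which index each printed line starts with
old_opts <- options(width = 40)
print(mpg)
options(old_opts)   # restore the previous width
```

The prefixes depend on how wide your console is, so your own output may break lines at different places than the listing above.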

Sometimes a vector of numbers is exactly what you want -- if, for example, you want to quickly plot mtcars$mpg and don't need item labels, or you're looking for statistical info such as variance and mean.
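For instance, the following sketch pulls the column out as a vector and passes it to a few base R functions. The variable names mpg, mpg_mean and mpg_var are illustrative.

```r
mpg <- mtcars$mpg
is.vector(mpg)        # TRUE: the column comes back as a plain vector

mpg_mean <- mean(mpg) # arithmetic mean of miles per gallon
mpg_var <- var(mpg)   # sample variance
mpg_mean
mpg_var

# Quick scatterplot of the values against their position in the vector
plot(mpg)
```

Because the vector carries no row labels, the plot's x-axis is simply the item index rather than the car names.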
