Beginner's guide to R: Easy ways to do basic data analysis

Congrats, you've read your data into an R object. Now take the next step


Pull basic stats from your data frame

Because R is a statistical programming platform, it has some pretty elegant ways to extract statistical summaries from data. To extract a few basic stats from a data frame, use the summary() function:

summary(mydata)

Results of the summary function on a data set called diamonds, which is included in the ggplot2 add-on package.

That returns some basic calculations for each column. If the column holds numbers, you'll see the minimum and maximum values along with the median, mean, 1st quartile, and 3rd quartile. If it holds a factor with levels such as "fair," "good," "very good," and "excellent," you'll get a count of how many times each level appears in the column.

The summary() function also returns stats for a one-dimensional vector.
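As a minimal sketch of both uses, the example below runs summary() on iris, a sample data frame that ships with R (chosen here only so the code runs without add-on packages; it is not the diamonds data shown in the screenshot), and then on a single numeric column of it.

```r
# iris ships with R, so no add-on package is needed here

# Data frame: numeric columns get quartiles and a mean;
# the factor column (Species) gets a count per level
summary(iris)

# One-dimensional vector: the same six statistics for one column
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# The names of the statistics that summary() returns for numbers
names(sepal_summary)
```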

If you'd like even more statistical summaries from a single command, install the psych package with this command:

install.packages("psych")

You need to run this install only once on a system. Then load it:

library(psych)

You need to run the library command each time you start a new R session if you want to use the psych package.

Now try the command:

describe(mydata)

You'll get several more statistics from the data, including standard deviation, mad (median absolute deviation), skew (a measure of how asymmetrical the data distribution is), and kurtosis (whether the data have a sharper or flatter peak near their mean).
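Two of those statistics are also available in base R without installing psych. The sketch below uses the built-in mtcars data set (an assumption made only so the example runs as-is). Note that base R's mad() computes the median absolute deviation and, by default, multiplies it by a constant of 1.4826.

```r
x <- mtcars$mpg   # a numeric column from a data set that ships with R

sd(x)             # standard deviation
mad(x)            # median absolute deviation, scaled by 1.4826 by default

# The unscaled version, written out by hand for comparison
raw_mad <- median(abs(x - median(x)))
raw_mad
mad(x, constant = 1)   # same value as raw_mad
```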

R has the statistical functions you'd expect, including mean(), median(), min(), max(), sd() [standard deviation], var() [variance] and range(), which you can run on a one-dimensional vector of numbers. (Several of these functions -- such as mean() and median() -- will not work on a two-dimensional data frame).
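Here is a short sketch of those functions on a vector, followed by one possible way to get the same statistic for every column of a data frame. The scores vector is made-up sample data, and mtcars is a data set that ships with R.

```r
scores <- c(12, 15, 9, 20, 15, 7)   # hypothetical sample values

mean(scores)     # arithmetic mean: 13
median(scores)   # middle value: 13.5
min(scores)      # smallest value: 7
max(scores)      # largest value: 20
sd(scores)       # standard deviation
var(scores)      # variance (the square of sd)
range(scores)    # minimum and maximum together: 7 20

# For a data frame, apply the function one column at a time
numeric_cols <- sapply(mtcars, is.numeric)        # TRUE for each numeric column
col_means <- sapply(mtcars[, numeric_cols], mean) # one mean per column
col_means
```

sapply() returns a named vector here, so col_means["mpg"] pulls out the mean of a single column.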

Results of the correlation function on the sample data set of U.S. arrests.

Oddly, the mode() function returns information about an object's data type instead of the statistical mode; there's an add-on package, modeest, that adds an mfv() function (most frequent value) to find the statistical mode.
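If you'd rather not install a package, one possible base R approach is to count the values with table() and pick the most frequent. In the sketch below, stat_mode() is a hypothetical helper name, not a built-in R function, and the x vector is made-up sample data.

```r
x <- c(3, 7, 7, 2, 7, 3)   # hypothetical sample values

mode(x)   # "numeric": the storage type, not the most frequent value

# stat_mode() is a hypothetical helper, not part of R or modeest
stat_mode <- function(v) {
  counts <- table(v)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  # table() stores the values as text labels, so convert numbers back
  if (is.numeric(v)) as.numeric(top) else top
}

stat_mode(x)   # 7
```

If two or more values tie for most frequent, this helper returns all of them.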

R also contains a load of more sophisticated functions that let you do analyses with one or two commands: probability distributions, correlations, significance tests, regressions, ANOVA (analysis of variance between groups) and more.

As just one example, run the correlation function cor() on a data frame:

cor(mydata)

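This will give you a matrix of correlations comparing each column of numerical data with every other column. Note that cor() expects every column to be numeric; if your data frame also contains factor or character columns, select only the numeric columns before calling it.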
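As a sketch, the example below runs cor() on USArrests, a sample data set that ships with R and has only numeric columns. It then shows one way to handle a data frame with mixed column types, using iris (also built in) as a stand-in for your own data.

```r
# USArrests ships with R; all four of its columns are numeric
arrest_cor <- cor(USArrests)
round(arrest_cor, 2)   # rounded only to make the matrix easier to read

# iris has a factor column (Species), so keep only the numeric columns first
numeric_only <- iris[, sapply(iris, is.numeric)]
iris_cor <- cor(numeric_only)
round(iris_cor, 2)
```

Each matrix is symmetric, with 1s on the diagonal where a column is compared with itself.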

Note: You can run into problems when running some functions on data with missing values. In some cases, R's default is to return NA even if just a single value is missing. For example, while the summary() function returns column statistics excluding missing values (and also tells you how many NAs are in the data), the mean() function will return NA if even one value is missing in a vector.

In most cases, adding the argument:

na.rm=TRUE

to NA-sensitive functions will tell that function to remove any NAs when performing calculations, such as:

mean(myvector, na.rm=TRUE)
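To see the difference, here is a small sketch with a made-up vector that contains one missing value:

```r
myvector <- c(4, 8, NA, 15, 23)   # hypothetical values with one missing entry

mean(myvector)                 # NA, because one value is missing
mean(myvector, na.rm = TRUE)   # 12.5, the mean of the four known values

sum(is.na(myvector))           # 1: how many values are missing
summary(myvector)              # stats exclude the NA and report it separately
```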

If you have data with some missing values, read a function's help file by typing a question mark followed by the name of the function, such as:

?median

The function description should say whether the na.rm argument is needed to exclude missing values.

Checking a function's help files -- even for simple functions -- can also uncover additional useful options, such as an optional trim argument for mean() that lets you exclude some outliers.
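As an illustration of trim, the sketch below uses a made-up vector with one extreme value. Setting trim = 0.1 tells mean() to drop the lowest 10% and highest 10% of values before averaging.

```r
# Hypothetical values; 900 is an obvious outlier
incomes <- c(31, 35, 38, 40, 42, 44, 47, 51, 55, 900)

mean(incomes)               # 128.3, pulled upward by the outlier
mean(incomes, trim = 0.1)   # 44, after dropping the lowest and highest value
median(incomes)             # 43, for comparison
```

With 10 values, a trim of 0.1 removes one value from each end, so the trimmed mean lands close to the median.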
