Pull basic stats from your data frame
Because R is a statistical programming platform, it has some pretty elegant ways to extract statistical summaries from data. To extract a few basic stats from a data frame, use the summary() function.
Figure: Results of the summary() function on a data set called diamonds, which is included in the ggplot2 add-on package.
That returns some basic calculations for each column. If the column has numbers, you'll see the minimum and maximum values along with median, mean, 1st quartile, and 3rd quartile. If it has factors such as "fair," "good," "very good," and "excellent," you'll get the number of each factor listed in the column.
The summary() function also returns stats for a one-dimensional vector.
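As a quick illustration, here is a minimal sketch that runs summary() on both a data frame and a single vector. It uses iris, a data set that ships with R, only so the example runs without installing ggplot2; it is not the diamonds data shown in the figure, and the variable names are placeholders.

```r
# iris is part of R's built-in datasets package (an assumption made for
# convenience; the figure above uses diamonds from ggplot2 instead).
df_summary <- summary(iris)   # numeric columns get quartiles, the factor column gets counts
print(df_summary)

# summary() also accepts a one-dimensional numeric vector
vec_summary <- summary(iris$Sepal.Length)
print(vec_summary)
```

The Species column is a factor, so its part of the output is a count per species rather than a minimum, median and maximum.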
If you'd like even more statistical summaries from a single command, install the psych package with the command install.packages("psych").
You need to run this install only once on a system. Then load it with library(psych).
You need to run the library command each time you start a new R session if you want to use the psych package.
Now try the package's describe() command on your data frame.
You'll get several more statistics from the data including standard deviation, mad (median absolute deviation), skew (measuring whether or not the data distribution is symmetrical), and kurtosis (whether the data have a sharper or flatter peak near the mean).
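To make those extra statistics a little more concrete, the sketch below computes them by hand for one column of the built-in mtcars data. The helpers skew_simple() and kurtosis_simple() are hypothetical names using the plain moment-based formulas, so their results may differ slightly from what psych reports if it applies a small-sample adjustment. The last lines call psych only if it happens to be installed.

```r
# Simple moment-based skew and kurtosis (illustrative helpers, not psych's code)
skew_simple <- function(v) {
  dev <- v - mean(v)
  mean(dev^3) / mean(dev^2)^1.5
}
kurtosis_simple <- function(v) {
  dev <- v - mean(v)
  mean(dev^4) / mean(dev^2)^2 - 3   # excess kurtosis: 0 for a normal distribution
}

x <- mtcars$mpg   # miles per gallon from the built-in mtcars data

extra_stats <- c(
  sd       = sd(x),            # standard deviation
  mad      = mad(x),           # median absolute deviation (scaled by 1.4826 by default)
  skew     = skew_simple(x),   # 0 means symmetrical; positive means a longer right tail
  kurtosis = kurtosis_simple(x)
)
print(round(extra_stats, 2))

# If psych is installed, compare with its one-command version
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(mtcars["mpg"]))
}
```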
R has the statistical functions you'd expect, including sd() [standard deviation], var() [variance] and range(), which you can run on a one-dimensional vector of numbers. (Several of these functions -- such as median() -- will not work on a two-dimensional data frame.)
Figure: Results of the correlation function on the sample data set of U.S. arrests.
Note that the mode() function returns information about data type instead of the statistical mode; there's an add-on package, modeest, that adds an mfv() function (most frequent value) to find the statistical mode.
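The sketch below tries these functions on a small made-up vector, applies median() column by column to a data frame, and shows what mode() actually returns. The most_frequent() helper is a hypothetical base-R stand-in written for this example; it is not the mfv() function from modeest.

```r
x <- c(2, 4, 4, 5, 7, 9)   # made-up numbers for illustration

sd(x)      # standard deviation
var(x)     # variance
range(x)   # smallest and largest value, returned as a two-element vector

# median() needs a vector, so apply it to each data frame column in turn
col_medians <- sapply(mtcars[, c("mpg", "hp")], median)
print(col_medians)

mode(x)    # "numeric": the storage type, not the most frequent value

# Hypothetical helper: return the value(s) that occur most often
most_frequent <- function(v) {
  ux <- unique(v)
  counts <- tabulate(match(v, ux))   # how often each unique value appears
  ux[counts == max(counts)]
}
most_frequent(x)
```

Because the helper returns every value tied for the highest count, it can return more than one element when a vector has several modes.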
R also contains a load of more sophisticated functions that let you do analyses with one or two commands: probability distributions, correlations, significance tests, regressions, ANOVA (analysis of variance between groups) and more.
As just one example, run the correlation function cor() on a data frame.
This will give you a matrix of correlations for each column of numerical data compared with every other column of numerical data.
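For instance, assuming the built-in USArrests data set as the example data, a minimal sketch looks like this. The second half shows one common way to keep only numeric columns first, since cor() stops with an error if any column is non-numeric.

```r
# USArrests ships with R: four numeric columns, one row per U.S. state
arrest_cor <- cor(USArrests)
print(round(arrest_cor, 2))

# iris has a factor column (Species), so drop non-numeric columns before calling cor()
numeric_iris <- iris[sapply(iris, is.numeric)]
iris_cor <- cor(numeric_iris)
print(round(iris_cor, 2))
```

Each result is a square, symmetric matrix with 1s on the diagonal, because every column correlates perfectly with itself.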
Note: Be aware that you can run into problems when trying to run some functions on data where there are missing values. In some cases, R's default is to return NA even if just a single value is missing. For example, while the summary() function returns column statistics excluding missing values (and also tells you how many NAs are in the data), the mean() function will return NA if even one value is missing in a vector.
In most cases, adding the argument na.rm=TRUE to NA-sensitive functions will tell that function to remove any NAs when performing calculations, such as mean(x, na.rm=TRUE) for a vector x.
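Here is a small sketch of that behavior, using a made-up vector named scores with one missing value:

```r
scores <- c(10, 20, NA, 40)   # made-up data with one missing value

plain_mean <- mean(scores)                  # NA, because one value is missing
clean_mean <- mean(scores, na.rm = TRUE)    # average of 10, 20 and 40 only
score_summary <- summary(scores)            # skips the NA and reports how many there were

print(plain_mean)
print(clean_mean)
print(score_summary)
```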
If you have data with some missing values, read a function's help file by typing a question mark followed by the name of the function, such as ?mean.
The function description should say whether the na.rm argument is needed to exclude missing values.
Checking a function's help files -- even for simple functions -- can also uncover additional useful options, such as an optional trim argument for mean() that lets you exclude some outliers.
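As a rough example of the trim argument, the made-up vector below has one extreme value. Setting trim = 0.2 drops the lowest 20% and the highest 20% of the sorted values before averaging.

```r
x <- c(1, 2, 3, 4, 100)   # made-up data; 100 is an outlier

full_mean <- mean(x)                  # 22, pulled upward by the outlier
trimmed_mean <- mean(x, trim = 0.2)   # drops 1 and 100, then averages 2, 3, 4 to get 3

print(full_mean)
print(trimmed_mean)
```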