Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.
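To make this relationship concrete, here is a minimal sketch (the column names and values are made up for illustration) showing that a DataFrame built from a dictionary of columns hands back a Series when you pull out a single column:

```python
import pandas as pd

# a small made-up DataFrame built from a dictionary of columns
df = pd.DataFrame({
    'name': ['Rosaline Franklin', 'William Gosset'],
    'born': [1920, 1876],
})

# pulling out one column gives a Series
name_col = df['name']
print(type(df))       # DataFrame
print(type(name_col)) # Series
```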
Why should you use a programming language like Python and a tool like Pandas to work with data? It boils down to automation and reproducibility. If a particular set of analyses need to be performed on multiple data sets, a programming language has the ability to automate the analysis on those data sets. Although many spreadsheet programs have their own macro programming languages, many users do not use them. Furthermore, not all spreadsheet programs are available on all operating systems. Performing data analysis using a programming language forces the user to maintain a running record of all steps performed on the data. I, like many people, have accidentally hit a key while viewing data in a spreadsheet program, only to find out that my results no longer make any sense due to bad data. This is not to say that spreadsheet programs are bad or that they do not have their place in the data workflow; they do. Rather, my point is that there are better and more reliable tools out there.
Loading your first data set
When given a data set, you first load it and begin looking at its structure and contents. The simplest way of looking at a data set is to examine and subset specific rows and columns. You can see which type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics.
Because Pandas is not part of the Python standard library, you have to first tell Python to load (import) the library:
import pandas
With the library loaded, you can use the read_csv function to load a CSV data file. To access the read_csv function from Pandas, use dot notation.
# by default the read_csv function will read a comma-separated file;
# our Gapminder data are separated by tabs
# we can use the sep parameter and indicate a tab with \t
df = pandas.read_csv('../data/gapminder.tsv', sep='\t')

# we use the head method so Python shows us only the first 5 rows
print(df.head())

|        country continent  year  lifeExp       pop   gdpPercap
| 0  Afghanistan      Asia  1952   28.801   8425333  779.445314
| 1  Afghanistan      Asia  1957   30.332   9240934  820.853030
| 2  Afghanistan      Asia  1962   31.997  10267083  853.100710
| 3  Afghanistan      Asia  1967   34.020  11537966  836.197138
| 4  Afghanistan      Asia  1972   36.088  13079460  739.981106
When working with Pandas functions, it is common practice to give pandas the alias pd. Thus the following code is equivalent to the preceding example:
import pandas as pd

df = pd.read_csv('../data/gapminder.tsv', sep='\t')
You can check whether you are working with a Pandas DataFrame by using the built-in type function (that is, a function that comes directly from Python, not from any package such as Pandas).
print(type(df))
| <class 'pandas.core.frame.DataFrame'>
The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working on.
The data set you loaded is saved as a Pandas DataFrame object and is relatively small. Every DataFrame object has a shape attribute that will give you the number of rows and columns of the DataFrame.
# get the number of rows and columns
print(df.shape)

| (1704, 6)
The shape attribute returns a tuple in which the first value is the number of rows and the second value is the number of columns. From the preceding results, you can see your Gapminder data set has 1,704 rows and 6 columns. (The Gapminder data set originally comes from www.gapminder.org. The version of the Gapminder data used in this book was prepared by Jennifer Bryan from the University of British Columbia. The repository can be found at www.github.com/jennybc/gapminder.)
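Because shape is a plain Python tuple, you can unpack it directly into two variables. A small illustrative snippet (using a made-up DataFrame rather than the Gapminder data):

```python
import pandas as pd

# a small made-up DataFrame; shape works the same on any DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# unpack the (rows, columns) tuple returned by the shape attribute
num_rows, num_cols = df.shape
print(num_rows)  # 3
print(num_cols)  # 2
```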
Because shape is an attribute of the dataframe, and not a function or method of the DataFrame, it does not have parentheses after the period. If you make the mistake of putting parentheses after the shape attribute, you'll get an error.
# shape is an attribute, not a method
# this will cause an error
print(df.shape())

| Traceback (most recent call last):
|   File "<ipython-input-1-e05f133c2628>", line 2, in <module>
|     print(df.shape())
| TypeError: 'tuple' object is not callable
Typically, when first looking at a data set, you want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, you look at the columns. The column names, like shape, are specified using the columns attribute of the dataframe object.
# get column names
print(df.columns)

| Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (such as Julia and R). Each column (Series) has to be the same type, whereas each row can contain mixed types. In the current example, you can expect the country column to be all strings and the year column to be all integers. However, it's best to make sure that is the case by using the dtypes attribute or the info method.
The table compares the types in Pandas to the types in native Python.
# get the dtype of each column
print(df.dtypes)

| country       object
| continent     object
| year           int64
| lifeExp      float64
| pop            int64
| gdpPercap    float64
| dtype: object
# get more information about our data
print(df.info())

| <class 'pandas.core.frame.DataFrame'>
| RangeIndex: 1704 entries, 0 to 1703
| Data columns (total 6 columns):
| country      1704 non-null object
| continent    1704 non-null object
| year         1704 non-null int64
| lifeExp      1704 non-null float64
| pop          1704 non-null int64
| gdpPercap    1704 non-null float64
| dtypes: float64(2), int64(2), object(2)
| memory usage: 80.0+ KB
| None
Looking at columns, rows, and cells
Now that you're able to load a simple data file, you want to be able to inspect its contents. You could print the contents of the dataframe, but with today's data, there are often too many cells to make sense of all the printed information. Instead, the best way to look at your data is to inspect it in parts by looking at various subsets of the data. You already saw that you can use the head method of a dataframe to look at the first five rows of your data. This is useful to see if your data loaded properly and to get a sense of each of the columns, its name, and its contents. Sometimes, however, you may want to see only particular rows, columns, or values from your data.
Before continuing, make sure you are familiar with Python containers.
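As a quick refresher, the two containers used most often in this chapter are lists and dictionaries. The values below are made up for illustration:

```python
# a list is an ordered container; Pandas uses lists of names
# to subset multiple columns at once
columns_wanted = ['country', 'continent', 'year']

# a dictionary maps keys to values, much like column names
# map to columns in a DataFrame
row = {'country': 'Afghanistan', 'year': 1952}

print(columns_wanted[0])  # country
print(row['year'])        # 1952
```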
Subsetting columns
If you want to examine multiple columns, you can specify them by names, positions, or ranges.
Subsetting columns by name. If you want only a specific column from your data, you can access the data using square brackets.
# just get the country column and save it to its own variable
country_df = df['country']

# show the first 5 observations
print(country_df.head())

| 0    Afghanistan
| 1    Afghanistan
| 2    Afghanistan
| 3    Afghanistan
| 4    Afghanistan
| Name: country, dtype: object
# show the last 5 observations
print(country_df.tail())

| 1699    Zimbabwe
| 1700    Zimbabwe
| 1701    Zimbabwe
| 1702    Zimbabwe
| 1703    Zimbabwe
| Name: country, dtype: object
To specify multiple columns by the column name, you need to pass in a Python list between the square brackets. This may look a bit strange because there will be two sets of square brackets.
# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]

print(subset.head())

|        country continent  year
| 0  Afghanistan      Asia  1952
| 1  Afghanistan      Asia  1957
| 2  Afghanistan      Asia  1962
| 3  Afghanistan      Asia  1967
| 4  Afghanistan      Asia  1972
print(subset.tail())

|        country continent  year
| 1699  Zimbabwe    Africa  1987
| 1700  Zimbabwe    Africa  1992
| 1701  Zimbabwe    Africa  1997
| 1702  Zimbabwe    Africa  2002
| 1703  Zimbabwe    Africa  2007
Again, you can opt to print the entire subset dataframe. We don't use this option in this book, because it would take up an unnecessary amount of space.
Subsetting columns by index position breaks in Pandas v0.20. At times, you may want to get a particular column by its position, rather than its name. For example, you may want to get the first column (“country”) and the third column (“year”), or just the last column (“gdpPercap”).
As of Pandas v0.20, you can no longer pass in a list of integers in the square brackets to subset columns. For example, df[[1]], df[[0, -1]], and df[list(range(5))] no longer work. There are other ways of subsetting columns, but they build on the technique used to subset rows.
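As a preview, one such technique is the iloc attribute, which subsets by integer position. A brief sketch, using a one-row stand-in for the Gapminder data rather than the real file:

```python
import pandas as pd

# a one-row stand-in for the Gapminder data
df = pd.DataFrame({
    'country': ['Afghanistan'],
    'continent': ['Asia'],
    'year': [1952],
    'lifeExp': [28.801],
    'pop': [8425333],
    'gdpPercap': [779.445314],
})

# iloc subsets by integer position; the slice before the comma
# keeps every row, and the list after it picks columns 0 and -1
subset = df.iloc[:, [0, -1]]
print(subset.columns.tolist())  # ['country', 'gdpPercap']
```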
Subsetting rows
Rows can be subset in multiple ways, by row name or row index. The table gives a quick overview of the various methods.
Subset rows by index label: loc. Let's take a look at part of the Gapminder data.
print(df.head())
|        country continent  year  lifeExp       pop   gdpPercap
| 0  Afghanistan      Asia  1952   28.801   8425333  779.445314
| 1  Afghanistan      Asia  1957   30.332   9240934  820.853030
| 2  Afghanistan      Asia  1962   31.997  10267083  853.100710
| 3  Afghanistan      Asia  1967   34.020  11537966  836.197138
| 4  Afghanistan      Asia  1972   36.088  13079460  739.981106
On the left side of the printed dataframe, you see what appear to be row numbers. This unnamed column of values is the index label of the dataframe. Think of the index label as being like a column name, but for rows instead of columns. By default, Pandas will fill in the index labels with the row numbers (note that it starts counting from 0). A common example where the row index labels are not the same as the row number is when you work with time series data. In that case, the index label will be a timestamp of sorts. For now, though, you will keep the default row number values.
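To see the difference between row numbers and index labels, here is a small sketch in which the labels are set to made-up strings rather than the default row numbers:

```python
import pandas as pd

# made-up string labels, to show that index labels
# need not be the row numbers
df = pd.DataFrame(
    {'lifeExp': [28.801, 30.332]},
    index=['afg_1952', 'afg_1957'],
)

# the rows still sit at positions 0 and 1,
# but their index labels are strings
print(df.index.tolist())  # ['afg_1952', 'afg_1957']
```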
You can use the loc attribute on the dataframe to subset rows based on the index label.
# get the first row
# Python counts from 0
print(df.loc[0])

| country      Afghanistan
| continent           Asia
| year                1952
| lifeExp           28.801
| pop              8425333
| gdpPercap        779.445
| Name: 0, dtype: object
# get the 100th row
# Python counts from 0
print(df.loc[99])

| country      Bangladesh
| continent          Asia
| year               1967
| lifeExp          43.453
| pop            62821884
| gdpPercap       721.186
| Name: 99, dtype: object
# get the last row
# this will cause an error
print(df.loc[-1])
| Traceback (most recent call last):
|   File "/home/dchen/anaconda3/envs/book36/lib/python3.6/site-packages/pandas/core/indexing.py", line 1434, in _has_valid_type
|     error()
| KeyError: 'the label [-1] is not in the [index]'
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
|   File "<ipython-input-1-5c89f7ac3971>", line 2, in <module>
|     print(df.loc[-1])
| KeyError: 'the label [-1] is not in the [index]'
Note that passing -1 as the loc will cause an error, because it is actually looking for the row index label (row number) -1, which does not exist in the example. Instead, you can use a bit of Python to calculate the number of rows and pass that value into loc.
# get the last row (correctly)
# use the first value given from shape to get the number of rows
number_of_rows = df.shape[0]

# subtract 1 from the value since you want the last index value
last_row_index = number_of_rows - 1

# now do the subset using the index of the last row
print(df.loc[last_row_index])

| country      Zimbabwe
| continent      Africa
| year             2007
| lifeExp        43.487
| pop          12311143
| gdpPercap     469.709
| Name: 1703, dtype: object
Alternatively, you can use the tail method to return the last 1 row, instead of the default 5.
# there are many ways of doing what you want
print(df.tail(n=1))

|        country continent  year  lifeExp       pop   gdpPercap
| 1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298
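Yet another option is iloc, which subsets by row position rather than index label, so it happily accepts -1 to mean "the last row". A small sketch on a two-row stand-in for the Gapminder data:

```python
import pandas as pd

# a two-row stand-in for the Gapminder data
df = pd.DataFrame({'country': ['Afghanistan', 'Zimbabwe'],
                   'year': [1952, 2007]})

# iloc works on positions, so -1 means "the last row";
# this is exactly the negative indexing that loc rejects
last_row = df.iloc[-1]
print(last_row['country'])  # Zimbabwe
```

Note that tail(n=1) returns a one-row DataFrame, whereas loc and iloc with a single label or position return a Series.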