Python Hands-on

Introducing Pandas DataFrame for Python data analysis

The open source library gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions

Introducing Pandas DataFrame for Python data analysis
Thinkstock

Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.

Why should you use a programming language like Python and a tool like Pandas to work with data? It boils down to automation and reproducibility. If a particular set of analyses need to be performed on multiple data sets, a programming language has the ability to automate the analysis on those data sets. Although many spreadsheet programs have their own macro programming languages, many users do not use them. Furthermore, not all spreadsheet programs are available on all operating systems. Performing data analysis using a programming language forces the user to maintain a running record of all steps performed on the data. I, like many people, have accidentally hit a key while viewing data in a spreadsheet program, only to find out that my results no longer make any sense due to bad data. This is not to say that spreadsheet programs are bad or that they do not have their place in the data workflow; they do. Rather, my point is that there are better and more reliable tools out there.

Loading your first data set

When given a data set, you first load it and begin looking at its structure and contents. The simplest way of looking at a data set is to examine and subset specific rows and columns. You can see which type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics.

Because Pandas is not part of the Python standard library, you have to first tell Python to load (import) the library:

import pandas

With the library loaded, you can use the read_csv function to load a CSV data file. To access the read_csv function from Pandas, use dot notation.

# by default the read_csv function will read a comma-separated file; 
# our Gapminder data are separated by tabs
# we can use the sep parameter and indicate a tab with \t
df = pandas.read_csv('../data/gapminder.tsv', sep='\t') 
# we use the head method so Python shows us only the first 5 rows
print(df.head())
|    country       continent  year  lifeExp  pop       gdpPercap
| 0  Afghanistan   Asia       1952  28.801    8425333  779.445314
| 1  Afghanistan   Asia       1957  30.332    9240934  820.853030
| 2  Afghanistan   Asia       1962  31.997   10267083  853.100710
| 3  Afghanistan   Asia       1967  34.020   11537966  836.197138
| 4  Afghanistan   Asia       1972  36.088   13079460  739.981106

When working with Pandas functions, it is common practice to give pandas the alias pd. Thus the following code is equivalent to the preceding example:

import pandas as pd
df = pd.read_csv('../data/gapminder.tsv', sep='\t')

You can check whether you are working with a Pandas DataFrame by using the built-in type function (that is, whether it comes directly from Python, not any package such as Pandas).

print(type(df))
| <class 'pandas.core.frame.DataFrame'>

The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working on.

The data set you loaded is saved as a Pandas DataFrame object and is relatively small. Every DataFrame object has a shape attribute that will give you the number of rows and columns of the DataFrame.

# get the number of rows and columns
print(df.shape) (1704, 6)

The shape attribute returns a tuple (in which the first value is the number of rows and the second number is the number of columns. From the preceding results, you can see your Gapminder data set has 1,704 rows and 6 columns. (The Gapminder data set originally comes from www.gapminder.org. The version of the Gapminder data used in this book was prepared by Jennifer Bryan from the University of British Columbia. The repository can be found at www.github.com/jennybc/gapminder.)

Because shape is an attribute of the dataframe, and not a function or method of the DataFrame, it does not have parentheses after the period. If you made the mistake of putting parentheses after the shape attribute, you’ll get an error.

# shape is an attribute, not a method 
# this will cause an error 
print(df.shape())
| Traceback (most recent call last):
|   File "<ipython-input-1-e05f133c2628>", line 2, in <module>
|     print(df.shape())
| TypeError: 'tuple' object is not callable

Typically, when first looking at a data set, you want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, you look at the columns. The column names, like shape, are specified using the column attribute of the dataframe object.

# get column names
print(df.columns) 
  Index(['country', 'continent', 'year', 'lifeExp', 'pop',
         'gdpPercap'],
        dtype='object')

The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (such as Julia and R) Each column (Series) has to be the same type, whereas each row can contain mixed types. In the current example, you can expect the country column to be all strings and the year to be integers. However, it’s best to make sure that is the case by using the dtypes attribute or the info method.

The table compares the types in Pandas to the types in native Python.

# get the dtype of each column
print(df.dtypes)
| country        object
| continent      object
| year            int64
| lifeExp       float64
| pop             int64
| gdpPercap     float64
| dtype: object
# get more information about our data
print(df.info())
| <class 'pandas.core.frame.DataFrame'>
| RangeIndex:  1704 entries, 0 to 1703
| Data columns (total 6 columns):
| country      1704 non-null object
| continent    1704 non-null object
| year         1704 non-null int64
| lifeExp      1704 non-null float64
| pop          1704 non-null int64
| gdpPercap    1704 non-null float64
| dtypes: float64(2), int64(2), object(2)
| memory usage: 80.0+ KB
| None 

Looking at columns, rows, and cells

Now that you’re able to load a simple data file, you want to be able to inspect its contents. You could print the contents of the dataframe, but with today’s data, there are often too many cells to make sense of all the printed information. Instead, the best way to look at your data is to inspect it in parts by looking at various subsets of the data. You already saw that you can use the head method of a dataframe to look at the first five rows of your data. This is useful to see if your data loaded properly and to get a sense of each of the columns, its name, and its contents. Sometimes, however, you may want to see only particular rows, columns, or values from your data.

Before continuing, make sure you are familiar with Python containers.

Subsetting columns

If you want to examine multiple columns, you can specify them by names, positions, or ranges.

Subsetting columns by row. If you want only a specific column from your data, you can access the data using square brackets.

# just get the country column and save it to its own variable
country_df = df['country']

# show the first 5 observations 
print(country_df.head())
| 0  Afghanistan
| 1  Afghanistan
| 2  Afghanistan
| 3  Afghanistan
| 4  Afghanistan
| Name: country, dtype: object
# show the last 5 observations
print(country_df.tail())
| 1699  Zimbabwe
| 1700  Zimbabwe
| 1701  Zimbabwe
| 1702  Zimbabwe
| 1703  Zimbabwe
| Name: country, dtype: object

To specify multiple columns by the column name, you need to pass in a Python list between the square brackets. This may look a bit strange because there will be two sets of square brackets.

# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]

print(subset.head()) country continent year
| 0  Afghanistan    Asia  1952
| 1  Afghanistan    Asia  1957
| 2  Afghanistan    Asia  1962
| 3  Afghanistan    Asia  1967
| 4  Afghanistan    Asia  1972
print(subset.tail()) country continent year
| 1699  Zimbabwe    Africa    1987
| 1700  Zimbabwe    Africa    1992
| 1701  Zimbabwe    Africa    1997
| 1702  Zimbabwe    Africa    2002
| 1703  Zimbabwe    Africa    2007

Again, you can opt to print the entire subset dataframe. We don’t use this option in this book, because it would take up an unnecessary amount of space.

Subsetting columns by index position break in Pandas v0.20. At times, you may want to get a particular column by its position, rather than its name. For example, you want to get the first (“country”) column and third column (“year”), or just the last column (“gdpPercap”).

As of Pandas v0.20, you can no longer pass in a list of integers in the square brackets to subset columns. For example, df[[1]], df[[0, -1], and df[list(range(5)] no longer work. There are other ways of subsetting column, but they build on the technique used to subset rows.

Subsetting rows

Rows can be subset in multiple ways, by row name or row index. The table gives a quick overview of the various methods.

Subset rows by index label: loc. Let’s take a look at part of the Gapminder data.

print(df.head())
|      country      continent  year  lifeExp  pop       gdpPercap
| 0    Afghanistan  Asia       1952  28.801    8425333  779.445314
| 1    Afghanistan  Asia       1957  30.332    9240934  820.853030
| 2    Afghanistan  Asia       1962  31.997   10267083  853.100710
| 3    Afghanistan  Asia       1967  34.020   11537966  836.197138
| 4    Afghanistan  Asia       1972  36.088   13079460  739.981106

On the left side of the printed dataframe, you see what appear to be row numbers. This column-less row of values is the index label of the dataframe. Think of the index label as being like a column name, but for rows instead of columns. By default, Pandas will fill in the index labels with the row numbers (note that it starts counting from 0). A common example where the row index labels are not the same as the row number is when you work with time series data. In that case, the index label will be a timestamp of sorts. For now, though, you will keep the default row number values.

You can use the loc attribute on the dataframe to subset rows based on the index label.

# get the first row
# Python counts from 0 
print(df.loc[0])
| country    Afghanistan
| continent  Asia
| year       1952
| lifeExp    28.801
| pop        8425333
| gdpPercap  779.445
| Name: 0, dtype: object
# get the 100th row
# Python counts from 0 
print(df.loc[99])
| country    Bangladesh
| continent  Asia
| year       1967
| lifeExp    43.453
| pop        62821884
| gdpPercap  721.186
| Name: 99, dtype: object
# get the last row
# this will cause an error 
print(df.loc[-1])
| Traceback (most recent call last):
|    File "/home/dchen/anaconda3/envs/book36/lib/python3.6/site-
|  packages/pandas/core/indexing.py", line 1434, in _has_valid_type
|      error()
| KeyError: 'the label [-1] is not in the [index]'

During handling of the above exception, another exception occurred:

| Traceback (most recent call last):
|    File "<ipython-input-1-5c89f7ac3971>", line 2, in <module>
|      print(df.loc[-1])
| KeyError: 'the label [-1] is not in the [index]'

Note that passing -1 as the loc will cause an error, because it is actually looking for the row index label (row number) -1, which does not exist in the example. Instead, you can use a bit of Python to calculate the number of rows and pass that value into loc.

# get the last row (correctly)
# use the first value given from shape to get the number of rows 
number_of_rows = df.shape[0] 

# subtract 1 from the value since you want the last index value
last_row_index = number_of_rows - 1

# now do the subset using the index of the last row
print(df.loc[last_row_index])
1 2 3 Page 1
Page 1 of 3