Time series analysis with KNIME and Spark

Train and evaluate a simple time series model using a random forest of regression trees and the NYC Yellow taxi data set


Demand prediction

I think we all agree that knowing what lies ahead makes life much easier. This is true for life events as well as for the prices of washing machines and refrigerators or the demand for electrical energy in a whole city. Knowing how many bottles of olive oil customers will want tomorrow or next week allows for better restocking plans in the retail store. Knowing the likely increase in the price of gas or diesel allows a trucking company to better plan its finances. Examples where such knowledge can be of help are countless.

Demand prediction is a big branch of data science. Its goal is to estimate future demand using historical data and possibly other external information. Demand prediction can refer to any kind of numbers: visitors to a restaurant, generated kWh, new school registrations, beer bottles required on the store shelves, appliance prices, and so on.

Predicting taxi demand in NYC

As an example of demand prediction, we will tackle the problem of predicting taxi demand in New York City. In megacities such as New York, more than 13,500 Yellow taxis roam the streets every day (per the 2018 Taxi and Limousine Commission Factbook). This makes understanding and anticipating taxi demand a crucial task for taxi companies or even city planners, to increase the efficiency of the taxi fleet and minimize waiting times between trips.

For this case study, we used the NYC taxi data set, which can be downloaded at the NYC Taxi and Limousine Commission (TLC) website. This data set spans 10 years of taxi trips in New York City with a wide range of information about each trip, such as pickup and drop-off date/times, locations, fares, tips, distances, and passenger counts. Since we are using this case study just for demonstration, we used only the Yellow taxi subset for the year 2017. For a more general application, it would be useful to include data from a few additional years in the data set, at least to be able to estimate the yearly seasonality.

Let’s set the goal of this tutorial to predict the number of taxi trips required in NYC for the next hour.

Time series analysis: The process

The demand prediction problem is a classic time series analysis problem. We have a time series of numerical values (prices, number of visitors, kWh, etc.) and we want to predict the next value given the past N values. In our case, we have a time series of numbers of taxi trips per hour (Figure 2a), and we want to predict the number of taxi requests in the next hour given the number of taxi trips in the last N hours.

For this case study, we implemented a time series analysis process through the following steps (Figure 1):

  • Data transformation: aggregations, time alignment, missing value imputation, and other required transformations usually depending upon the data domain and the business case
  • Time series visualization
  • Removal of non-stationarity/seasonality, if any
  • Data partitioning to build a training set (past) and test set (future)
  • Construction of vector of N past values
  • Training of a machine learning model (or models) allowing for numerical outputs
  • Calculation of prediction error
  • Model deployment, if prediction error is acceptable

Figure 1. Some classic steps for time series analysis.

Note that precise prediction of a single numerical value might be a complex task. In some cases, a precise numerical prediction is not even needed and the same problem can be satisfactorily and easily solved after transforming it into a classification problem. In order to transform a numerical prediction problem into a classification problem, you just need to create classes out of the target variable.

For example, predicting the price of a washing machine in two weeks might be difficult, but predicting whether this price will increase, decrease, or remain the same in two weeks is a much easier problem. In this case, we have transformed the numerical problem of price prediction into a classification problem with three classes (price increase, price decrease, price unchanged).
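As a minimal illustration of this transformation, here is a Python/pandas sketch; the price values, the column name, and the ±0.01 thresholds are our own illustrative assumptions, not part of the original use case:

```python
import pandas as pd

# Toy price series; the values and the "price" column name are illustrative.
df = pd.DataFrame({"price": [499.0, 504.0, 504.0, 489.0, 495.0]})

# Turn the numerical target (price change) into three classes.
delta = df["price"].diff()
df["price_class"] = pd.cut(
    delta,
    bins=[float("-inf"), -0.01, 0.01, float("inf")],
    labels=["decrease", "unchanged", "increase"],
)
print(df)
```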

Data cleaning and other transformations

The first step is to move from the original data rows, sparse and irregular in time (in this case taxi trips, but it could just as well be contracts with customers or Fast Fourier Transform amplitudes), to a time series of values uniformly sampled in time. This usually requires two things:

  • An aggregation operation on a predefined time scale: seconds, minutes, hours, days, weeks, or months depending on the data and the business problem. The granularity (time scale) used for the aggregation is important to visualize different seasonality effects or to catch different dynamics in the signal.
  • A realignment operation to make sure that time sampling is uniform in the considered time window. Often, time series are presented as a single sequence of the captured timestamps; if a time sample is missing, it can easily go unnoticed. A realignment procedure inserts missing values at the skipped sampling times (see the sketch after this list).
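A rough pandas sketch of such a realignment follows (toy values; the original workflow performs this with KNIME/Spark nodes). Skipped sampling times are reinserted as missing values:

```python
import pandas as pd

# Hourly counts with one hour (02:00) silently missing from the sequence.
counts = pd.Series(
    [310, 295, 280],
    index=pd.to_datetime(
        ["2017-06-01 00:00", "2017-06-01 01:00", "2017-06-01 03:00"]
    ),
)

# Rebuild the full, uniformly sampled hourly index and reinsert the
# skipped sampling time as a missing value (NaN), to be imputed later.
full_index = pd.date_range(counts.index.min(), counts.index.max(), freq="H")
aligned = counts.reindex(full_index)   # 02:00 now appears, with NaN
```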

Another classical preprocessing step consists of imputing missing values. Here a number of techniques dedicated to time series are available, such as using the previous value, the average of the previous and next values, or linear interpolation between the previous and next values.
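For example, the three strategies just mentioned could be sketched in pandas as follows (toy series; the actual workflow uses KNIME's missing value handling):

```python
import numpy as np
import pandas as pd

# Toy hourly series with one missing value, as left by the realignment step.
idx = pd.date_range("2017-06-01 00:00", periods=4, freq="H")
x = pd.Series([310.0, 295.0, np.nan, 280.0], index=idx)

prev_value = x.ffill()                        # use the previous value
neighbor_avg = (x.ffill() + x.bfill()) / 2    # average of previous and next value
linear = x.interpolate(method="time")         # linear interpolation in time
```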

The goal here is to predict the taxi demand (i.e., the number of taxi trips required) for the next hour. Thus, since we need an hourly time scale for the time series, the total number of taxi trips in New York City was calculated for each hour of every single day in the data set. This required grouping the data by hour and date (year, month, day of the month, hour) and then counting the number of rows (i.e., the number of taxi trips) in each group.
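In PySpark, this aggregation could look roughly like the sketch below. The file path is illustrative, and we assume the standard TLC column name tpep_pickup_datetime for the pickup timestamp; the original article performs the same step with KNIME's Spark nodes.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("taxi-demand").getOrCreate()

# Illustrative path; the 2017 Yellow taxi files are published by the TLC.
trips = spark.read.csv("yellow_tripdata_2017-*.csv", header=True)

# Truncate each pickup timestamp to its hour, then count rows (trips) per hour.
hourly = (
    trips
    .withColumn("pickup_ts", F.to_timestamp("tpep_pickup_datetime"))
    .withColumn("pickup_hour", F.date_trunc("hour", F.col("pickup_ts")))
    .groupBy("pickup_hour")
    .count()
    .orderBy("pickup_hour")
)
```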

Time series visualization

Before proceeding with the data preparation, model training, and model evaluation, it is always useful to get an idea of the problem we are dealing with via visual data exploration. In this case we’ll visualize the data on multiple time scales. Each visualization can offer a different insight into the time evolution of the data.

In the previous step, we already aggregated the number of taxi trips by the hour. This produces the time series x(t) (Figure 2a). After that, in order to observe the time series evolution on a different time scale, we also visualized it after aggregating by day (Figure 2b) and by month (Figure 2c).

From the plot of the hourly time series, you can clearly see a 24-hour pattern: high numbers of taxi trips during the day and lower numbers during the night.

If we switch to the daily scale, the weekly seasonality pattern becomes evident, with more trips during business days and fewer trips over the weekends. The non-stationarity of this time series can be easily spotted on this time scale, through the varying average value.

Finally, the plot of the monthly time series does not have enough data points to show any kind of seasonality pattern. It’s likely that extending the data set to include more years would produce more points in the plot and possibly a winter/summer seasonality pattern could be observed.


Figure 2a. Plot of the number of taxi trips in New York City by the hour, zoomed in on the first two weeks of June 2017, from the NYC Taxi data set. The 24-hour seasonality here is quite easy to see.


Figure 2b. Plot of the number of taxi trips in New York City by the day, zoomed in on the time window between May 2017 and September 2017, from the NYC Taxi data set. The weekly seasonality here is quite easy to spot. The three deep valleys correspond to Memorial Day, Fourth of July, and Labor Day.


Figure 2c. Plot of the number of taxi trips in New York City by the month for the entire year 2017, from the NYC Taxi data set. You can see the difference between winter (more taxi trips) and summer (fewer taxi trips).


Non-stationarity, seasonality, and autocorrelation function

A frequent requirement for many time series analysis techniques is that the data must be stationary.

A stationary process has the property that its mean, variance, and autocorrelation structure do not change over time. Stationarity can be defined in precise mathematical terms, but for our purposes we mean a flat-looking time series, without trend, with constant average and variance over time, and with a constant autocorrelation structure over time. For practical purposes, stationarity is usually determined from a run sequence plot or the linear autocorrelation function (ACF).

If the time series is non-stationary, we can often transform it to stationary by replacing it with its first order differences. That is, given the series x(t), we create the new series y(t) = x(t) - x(t-1). You can difference the data more than once, but the first order difference is usually sufficient.
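In pandas terms, first order differencing is essentially a one-liner (a sketch, assuming x is a pandas Series):

```python
import pandas as pd

# First order differencing: y(t) = x(t) - x(t-1).
x = pd.Series([10.0, 12.0, 15.0, 14.0])
y = x.diff(1).dropna()   # the differenced series is one sample shorter
```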

Seasonality violates stationarity, and seasonality is also often established from the linear autocorrelation coefficients of the time series. These are calculated as the Pearson correlation coefficients between the value of the time series x(t) at time t and its past values at times t-1, …, t-n. In general, values between -0.5 and 0.5 would be considered low correlation, while coefficients outside of this range (positive or negative) would indicate a high correlation.

In practice, we use the ACF plot to determine the index of the dominant seasonality or non-stationarity. The ACF plot reports on the y-axis the autocorrelation coefficients calculated for x(t) and its past x(t-i) values vs. the lags i on the x-axis. The first local maximum in the ACF plot defines the lag of the seasonality pattern (lag=S) or the need for a correction of non-stationarity (lag=1). In order not to consider irrelevant local maxima, a cut-off threshold is usually introduced, often from a predefined confidence interval (95%). Again, changing the time scale (i.e., the granularity of the aggregation) or extending the time window allows us to discover different seasonality patterns.
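A hedged pandas sketch of this ACF computation follows; the synthetic series with a built-in 24-hour pattern stands in for the real hourly counts:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly taxi counts: a 24-hour cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(24 * 60)
hourly_counts = pd.Series(
    1000 + 300 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 50, t.size)
)

# Pearson correlation between x(t) and x(t-i) for lags i = 1..50.
acf = pd.Series({i: hourly_counts.autocorr(lag=i) for i in range(1, 51)})

# Rough 95% white-noise cut-off; lags above it are candidate seasonality peaks.
cutoff = 1.96 / np.sqrt(len(hourly_counts))
print(acf[acf > cutoff].nlargest(3))
```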

If we found the seasonality lag to be S, then we could apply a number of different techniques to remove seasonality. We could remove the first S samples from all subsequent S-sample windows; we could calculate the average S-sample pattern on a portion of the data set and then remove that from all following S-sample windows; we could train a machine learning model to reproduce the seasonality pattern to be removed; or more simply, we could subtract the previous value x(t-S) from the current value x(t) and then deal with the residuals y(t) = x(t) - x(t-S). We chose this last technique for this tutorial, just to keep it simple.
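The last option, the one used in this tutorial, fits in a small helper (a sketch; with S=1 it reduces to the plain first order difference used against non-stationarity):

```python
import pandas as pd

def remove_seasonality(x: pd.Series, S: int = 24) -> pd.Series:
    """Seasonal differencing: y(t) = x(t) - x(t-S)."""
    return x.diff(S).dropna()
```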

Figure 3 shows the ACF plot for the time series of the hourly number of taxi trips. On the y-axis are the autocorrelation coefficients calculated between x(t) and its previous values at lagged hours 1, …, 50. On the x-axis are the lagged hours. This chart shows peaks at lag=1 and lag=24, i.e., a daily seasonality, as was to be expected in the taxi business. The highest positive correlation coefficients are between x(t) and x(t-1) (0.91), x(t) and x(t-24) (0.83), and x(t) and x(t-48) (0.68).

If we use the daily aggregation of the time series and calculate the autocorrelation coefficients on a lagged interval n > 7, we would also observe a peak at day 7, i.e., a weekly seasonality. On a larger scale, we might observe a winter-summer seasonality, with people taking taxis more often in winter than in summer. However, since we are considering the data over only one year, we will not inspect this kind of seasonality.


Figure 3. Autocorrelation plot (Pearson coefficients) over 50 hours. The strongest correlation of x(t) is with x(t-1), x(t-24), and x(t-48), indicating a 24-hr (daily) seasonality.

Data partitioning to build the training set and test set

At this point, the data set has to be partitioned into the training set (the past) and test set (the future). Notice that the split between the two sets has to be a split in time. Do not use a random partitioning but a sequential split in time! This is to avoid data leakage from the training set (the past) into the test set (the future).

We reserved the data from January 2017 until November 2017 for the training set and the data of December 2017 for the test set.
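Assuming the hourly counts live in a pandas object with a DatetimeIndex, the sequential split might be sketched as:

```python
import pandas as pd

def time_split(series: pd.Series, split_point: str = "2017-12-01"):
    """Sequential split in time: the past trains, the future tests."""
    train = series.loc[series.index < split_point]
    test = series.loc[series.index >= split_point]
    return train, test
```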

Lagging: Vector of past N values

The goal of this use case is to predict the taxi trip demand in New York City for the next hour. In order to run this prediction, we need the demands of taxi trips in the previous N hours. For each value x(t) of the time series, we want to build the vector x(t-N), …, x(t-2), x(t-1), x(t). We will use the past values x(t-N), …, x(t-2), x(t-1) as input to the model and the current value x(t) as the target column to train the model. For this example, we experimented with two values: N=24 and N=50.
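A pandas sketch of this lagging step (the column names are illustrative):

```python
import pandas as pd

def make_lagged(x: pd.Series, N: int = 24) -> pd.DataFrame:
    """Build rows [x(t-N), ..., x(t-1), x(t)]; x(t) is the training target."""
    cols = {f"x(t-{i})": x.shift(i) for i in range(N, 0, -1)}
    cols["x(t)"] = x
    return pd.DataFrame(cols).dropna()   # drop the first N incomplete rows
```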
