How to model time-series anomaly detection for IoT

Machines fail. By creating a time-series prediction model from historical sensor data, you can know when that failure is coming.

Anomaly detection covers a large number of data analytics use cases. However, here anomaly detection refers specifically to the detection of unexpected events, be it cardiac episodes, mechanical failures, hacker attacks, or fraudulent transactions.

The unexpected character of the event means that no such examples are available in the data set. Classification solutions generally require a set of examples for all involved classes. So, how do we proceed in a case where no examples are available? It requires a little change in perspective.

In this case, we can only train a machine learning model on nonfailure data; that is, on data that describes the system operating in normal conditions. Whether the input data is an anomaly or just regular operation can only be evaluated in deployment, after the prediction has been made. The idea is that a model trained on normal data can only predict the next normal sample. If the system is no longer working in normal conditions, the input data will not describe a correctly working system, and the model prediction will stray from reality. The error between the real sample and the predicted sample can then tell us something about the underlying system’s condition.

In IoT (Internet of things) data, signal time series are produced by sensors strategically located on or around a mechanical device or component. A time series is the sequence of values of a variable over time. In this case, the variable describes a mechanical property of the device, and it is measured via one or more sensors. Usually, the mechanical device is working correctly. As a consequence, we have tons of samples for the device working in normal conditions and close to zero examples of device failure. Especially if the device plays a critical role in a mechanical chain, it is usually retired before any failure happens and compromises the whole machinery.

Thus, we can only train a machine learning model on a number of time series describing a system working as expected. The model will be able to predict the next sample in the time series, when the system works properly, because this is how it was trained. We then calculate the distance between the predicted sample and the real sample, and from there, we draw the conclusion as to whether everything is working as expected or if there is any reason for concern.

We will build, train, and deploy the model using the open-source, GUI-driven Knime Analytics Platform. Knime allows you to build models by assembling processing pipelines (called workflows) graphically from processing elements (called nodes).

Figure 1. A time series is the signal produced by a working system. Usually the system is working in normal conditions and we have a normal signal. Sometimes, something new happens and we have an anomaly in our signal. Anomalies are rare, if they occur at all. Therefore, a machine learning model can only be trained on normal samples. A distance calculated between real and predicted signal is used to trigger an anomaly alarm.

The data set: 28 time series from 28 sensors

We used data from a 28-sensor matrix monitoring eight parts of a mechanical rotor (see the table below), recorded between January 1, 2007, and April 20, 2009. In total, we have 28 time series, one from each sensor.

The signals reach us after the application of the fast Fourier transform (FFT), spread across 28 files, in the form of:

[date, time, FFT frequency, FFT amplitude]

Spectral amplitudes have been aggregated across frequency bands, time, and sensor channel. This results in 313 time series, describing the system evolution in different locations and frequency bands.
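The aggregation step can be sketched in a few lines of Python. This is only an illustration, not the Knime workflow itself: the 100 Hz band width is inferred from band names such as [0, 100] and [500, 600], while the use of a daily mean as the aggregation function and the helper names band_label and aggregate_by_band are assumptions.

```python
from collections import defaultdict
from statistics import mean

def band_label(freq_hz, width_hz=100):
    """Return the band a frequency falls into, e.g. 530 Hz -> '[500, 600]'."""
    low = int(freq_hz // width_hz) * width_hz
    return f"[{low}, {low + width_hz}]"

def aggregate_by_band(rows, width_hz=100):
    """Average FFT amplitudes per (date, band).

    rows: iterable of (date, time, frequency_hz, amplitude) tuples,
    mirroring the [date, time, FFT frequency, FFT amplitude] layout.
    Returns {date: {band_label: mean_amplitude}}.
    The mean is an assumed aggregation; the source does not name one.
    """
    buckets = defaultdict(list)
    for date, _time, freq, amp in rows:
        buckets[(date, band_label(freq, width_hz))].append(amp)
    result = defaultdict(dict)
    for (date, band), amps in buckets.items():
        result[date][band] = mean(amps)
    return {date: dict(bands) for date, bands in result.items()}

# Made-up readings for a single sensor file.
rows = [
    ("2007-01-01", "00:00", 20.0, 0.10),
    ("2007-01-01", "00:00", 80.0, 0.30),
    ("2007-01-01", "06:00", 550.0, 0.02),
    ("2007-01-02", "00:00", 40.0, 0.20),
]
daily = aggregate_by_band(rows)
```

Running this once per sensor file and keeping one column per sensor and band would produce a table shaped like the one used in the rest of the article: one row per date, one column per time series.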

The whole data set shows only one breakdown episode on July 21, 2008. The breakdown is visible only from some sensors and primarily in some frequency bands. After the breakdown, the rotor was replaced, and much cleaner signals were recorded afterwards.

Figure 3. Evolution over time of time series A1-SV3[0, 100] and A1-SV3[500, 600]. The rotor breakdown episode on July 21, 2008, is more easily visible in the higher frequency band [500, 600] Hz than in the lower frequency band [0, 100] Hz. There are 313 such time series in the data set, referring to different frequency bands of the original 28 time series.

Training the machine learning models

Machine learning models are trained on the portion of the time series running from January to August 2007 when the rotor was still functioning correctly.

One autoregressive (AR) model is trained on 10 past samples for each time series, resulting in 313 AR models; that is, one model for each one of the time series.

To build one model for each time series, we need to loop over the time series; that is, over the columns of the data set. Thus, the workflow to train the AR models is centered on a loop. Within the loop, for each value in each column, we build a vector of the 10 previous samples, impute missing values, train a linear regression model on the past values to predict the current value, and finally calculate the error statistics (mean and standard deviation) between predicted and real values. The models and the error statistics are then saved for use in deployment.
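For readers who prefer code to workflow diagrams, the body of that loop might look roughly like the NumPy sketch below. It is not an export of the Knime workflow: the function names, the linear interpolation used for missing values, and the intercept term are all assumptions made for illustration.

```python
import numpy as np

def impute_missing(series):
    """Fill NaNs by linear interpolation (an assumed imputation strategy)."""
    s = np.asarray(series, dtype=float).copy()
    idx = np.arange(len(s))
    ok = ~np.isnan(s)
    s[~ok] = np.interp(idx[~ok], idx[ok], s[ok])
    return s

def lag_matrix(series, n_lags=10):
    """Rows of n_lags past values (X) and the value that follows each row (y)."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

def train_ar(series, n_lags=10):
    """Fit a linear regression on the past n_lags samples; keep error statistics."""
    s = impute_missing(series)
    X, y = lag_matrix(s, n_lags)
    A = np.column_stack([X, np.ones(len(X))])      # last column is the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    errors = y - A @ coef                          # real minus predicted
    return {"coef": coef, "err_mean": errors.mean(), "err_std": errors.std()}

def predict_next(model, past_values):
    """Predict the next sample from the latest n_lags values (oldest first)."""
    return float(np.append(past_values, 1.0) @ model["coef"])

# Synthetic "normal operation" signal with one missing value.
rng = np.random.default_rng(0)
series = np.sin(np.arange(200) / 5) + 0.05 * rng.normal(size=200)
series[50] = np.nan

# In the real setting this would run once per column, giving 313 models.
model = train_ar(series)
next_value = predict_next(model, impute_missing(series)[-10:])
```

The returned dictionary holds exactly what deployment needs: the regression coefficients plus the mean and standard deviation of the training error.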

Figure 4. Training workflow 02_Time_Series_AR_Training. This workflow trains an autoregressive model on 10 past samples to predict the current sample for each of the input time series. Notice that the training set consists only of time series from a correctly working system. The model will therefore predict only future samples of a correctly working system.

The model deployment workflow

The goal of the deployment workflow here is a bit more creative than a classic model application for predictions. Again, we’ll be creating this workflow using Knime’s graphical workflow designer.

Alarm level 1

For each time series, we predict the next value based on the past 10 values, measure the distance between the predicted value and the current real value, and compare this distance with the error statistics generated during training. If the error distance falls outside the interval defined by the training error mean ± 2 standard deviations, an alarm spike as large as the distance value is created; otherwise the alarm signal is set to 0. This alarm signal is Alarm Level 1.
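Expressed as code, the Alarm Level 1 rule is a simple band check. The sketch below is illustrative only; the function name is hypothetical, and reporting the spike as the absolute error is an assumption, since the text says only that the spike is "as large as the distance value."

```python
def alarm_level_1(real, predicted, err_mean, err_std, k=2.0):
    """Return an alarm spike if the prediction error leaves the normal band.

    The band is err_mean +/- k * err_std, with both statistics taken from
    the training errors. Inside the band the alarm is 0.
    """
    error = real - predicted
    upper = err_mean + k * err_std
    lower = err_mean - k * err_std
    if error > upper or error < lower:
        return abs(error)   # assumed: spike height is the absolute error
    return 0.0

# Training error statistics for one hypothetical time series.
err_mean, err_std = 0.0, 0.1

normal = alarm_level_1(real=1.0, predicted=0.95, err_mean=err_mean, err_std=err_std)
too_high = alarm_level_1(real=1.5, predicted=1.0, err_mean=err_mean, err_std=err_std)
too_low = alarm_level_1(real=0.5, predicted=1.0, err_mean=err_mean, err_std=err_std)
```

Here the first reading stays inside the band and yields 0, while the other two leave it on opposite sides and each yield a spike of 0.5.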

The whole prediction, error distance calculation, comparison with mean and standard deviation of training error, and final Alarm Level 1 calculation is performed inside the column loop within the deployment workflow. All told, 313 Alarm Level 1 time series are calculated, i.e., one for each time series. This happens in the Alarm Level 1 node in the workflow diagram (see Figure 5).

Alarm level 2

The Alarm Level 1 time series is a series of more or less high spikes. A single spike on its own does not mean much. It could be due to electricity fluctuation, a quick change in temperature, or some such temporary cause. On the other hand, a series of spikes could mean a more serious and permanent change in the underlying system. Thus, an Alarm Level 2 series is created as the moving average of the previous 21 samples of the Alarm Level 1 series, on all 313 columns. These comprise the Alarm Level 2 time series and are calculated in the Alarm Level 2 meta node (Figure 5). (A meta node in Knime is a node that contains a subworkflow.)

Because we need 21 daily values for this moving average, at the start of the deployment workflow we select the day to investigate and then take at least 21 past values for all of the time series. Because of the many missing values, we erred on the side of caution and used a two-month window of past values to ensure we have enough samples for the moving average.

At this point, all Alarm Level 2 values are summed up across columns for the same date. If this aggregated value exceeds a given threshold (0.01), the alarm is taken seriously, and a checkup procedure is triggered in the meta node Trigger Checkup if level 2 alarm = 1 (Figure 5).
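The two Alarm Level 2 steps, the moving average and the cross-column sum, can be sketched as follows. The function names and toy data are hypothetical, and treating the 21-sample window as ending at the current day is an assumption about how "previous 21 samples" is counted.

```python
def alarm_level_2(level_1, window=21):
    """Moving average over the latest `window` Alarm Level 1 values.

    Positions that do not yet have a full window are returned as None.
    """
    out = []
    for i in range(len(level_1)):
        if i < window - 1:
            out.append(None)
        else:
            out.append(sum(level_1[i - window + 1:i + 1]) / window)
    return out

def aggregate_alarm(level_2_by_series, day_index, threshold=0.01):
    """Sum Alarm Level 2 across all series for one day; compare to threshold."""
    total = sum(series[day_index]
                for series in level_2_by_series.values()
                if series[day_index] is not None)
    return total, total > threshold

# Toy Alarm Level 1 input: two made-up series over 30 days.
quiet = [0.0] * 30
spiky = [0.0] * 22 + [0.5] * 8     # sustained spikes over the last 8 days
level_2 = {"quiet": alarm_level_2(quiet), "spiky": alarm_level_2(spiky)}

total_early, fire_early = aggregate_alarm(level_2, day_index=20)
total_late, fire_late = aggregate_alarm(level_2, day_index=29)
```

On day 20 the window contains no spikes, so the aggregated value is 0 and nothing fires. By day 29 the eight consecutive spikes lift the moving average well above the 0.01 threshold, which is the kind of persistent change Alarm Level 2 is meant to catch.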

Figure 5. Deployment workflow 03a_Time_Series_AR_Deployment. Here we read the models trained and saved in the training workflow; we apply them to the data in a new time window (at least two months long); we calculate the distance between predicted samples and original samples; and we generate two levels of alarm. If Alarm Level 2 is active, a checkup procedure is triggered.

Alarm trigger

The trigger agent in that meta node is a CASE Switch Data (Start) node (see Figure 6). The second port is enabled only when Alarm Level 2 is active and starts an external workflow via the node Call Workflow (Table Based). This node is set in its configuration window to start the external Knime workflow Send_Email_to_start_checkup.

Figure 6. Content of the meta node Trigger checkup if level 2 Alarm =1. The trigger node, Call Workflow (Table Based), calls and executes an external workflow named Send_Email_to_start_checkup, and that is exactly what it does: It sends an email to trigger the checkup procedure.

The workflow Send_Email_to_start_checkup has just one central node, the Send Email node. The Send Email node, as the name states, sends an email using a specified account on an SMTP host and its credentials.

The other node in the workflow is a Container Input (Variable) node. While this node is functionally not important, it is required to pass the data from the calling to the called workflow. Indeed, the Call Workflow (Table Based) node passes all the flow variables available at its input port to the Container Input (Variable) node, if any, of the called workflow. In summary, the Container Input (Variable) node is a bridge that transports the flow variables available at the caller node into the called workflow.

Figure 7. The workflow Send_Email_to_start_checkup sends an email to trigger the checkup procedure. Notice the Container Input (Variable) node to pass flow variables from the caller workflow to the called workflow.
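Outside Knime, the same notification step could be approximated with Python's standard library. This is a rough sketch rather than a description of what the Send Email node does internally; the addresses, the subject line, the message text, and the use of STARTTLS are all placeholders and assumptions.

```python
import smtplib
from email.message import EmailMessage

def build_checkup_email(sender, recipient, alarm_date, alarm_value):
    """Compose the message that asks for a checkup (wording is hypothetical)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Checkup requested: Alarm Level 2 active on {alarm_date}"
    msg.set_content(
        f"Aggregated Alarm Level 2 value: {alarm_value:.4f}\n"
        "Please start the checkup procedure."
    )
    return msg

def send_email(msg, host, port, user, password):
    """Send the message through an SMTP host using account credentials.

    Assumes the server supports STARTTLS; adjust for your own mail setup.
    """
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)

# Building the message needs no network access; sending is left to the caller.
msg = build_checkup_email(
    sender="monitor@example.com",
    recipient="maintenance@example.com",
    alarm_date="2008-05-05",
    alarm_value=0.1905,
)
```

The date and alarm value play the role of the flow variables that the caller passes into the called workflow.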

Testing the results

With only slight modifications to the deployment workflow, we can test this strategy on a number of data points and observe the evolution over time of the Alarm Level 2 time series.

In the modified version, we read all data after the training set portion, i.e., from September 2007 through July 2008. The Alarm Level 2 time series is visualized for each frequency band of each sensor in a stacked area chart. As you can see in Figure 8 below, Alarm Level 2 values start rising as early as the beginning of March 2008 across all frequency bands and all sensors. However, the change in the system becomes evident at the beginning of May 2008, especially in some frequency bands of some sensors (see the [200-300] A7-SA1 time series).

Considering the rotor broke down on July 21, 2008, this data would have provided many weeks of advance warning.

More details are available in the Knime whitepaper, “Anomaly Detection in Predictive Maintenance.”

Figure 8. The deployment workflow has been slightly modified to run a test on the remaining time window until the breakdown episode (Workflow 03b_Time_Series_AR_Testing). The result is a stacked area chart piling up all Alarm Level 2 signals from January 2007 through July 2008. You can see the alarm signal rising in March 2008 and becoming clearly evident in May 2008.

Rosaria Silipo is principal data scientist at Knime. She is the author of more than 50 technical publications, including her most recent book, “Practicing Data Science: A Collection of Case Studies.” She holds a doctorate degree in bio-engineering and has spent 25 years working on data science projects for companies in a broad range of fields, including IoT, customer intelligence, the financial industry, and cybersecurity. Follow Rosaria on Twitter, LinkedIn, and the Knime blog.

Copyright © 2019 IDG Communications, Inc.