With all the hype around the Internet of things, you have a right to be skeptical that the reality could ever match the promise. I would feel that way myself -- if it weren’t for some recent firsthand experience.
I recently had the opportunity to work on a project that involved applying IoT technologies to medical devices and pharmaceuticals in a way that could have a profound impact on health care. Seeing the possibilities afforded by “predictive health care” opened my eyes to the value of IoT more than any other project I’ve been associated with.
Of course, the primary value in an IoT system is in the ability to perform analytics on the acquired data and extract useful insights, though make no mistake -- building a pipeline for performing scalable analytics with the volume and velocity of data associated with IoT systems is no walk in the park. To help you avoid some of the difficulties we encountered, allow me to share a few observations on how to develop an ideal IoT analytics stack.
Acquiring and storing your data
Myriad protocols enable the receipt of events from IoT devices, especially at the lower levels of the stack. For our purposes, it doesn’t matter whether your device connects to the network using Bluetooth, cellular, Wi-Fi, or a hardware connection, only that it can send a message to a broker of some sort using a defined protocol.
One of the most popular and widely supported protocols for IoT applications is MQTT (Message Queuing Telemetry Transport). Plenty of alternatives exist as well, including Constrained Application Protocol (CoAP), XMPP, and others.
Given its ubiquity and wide support, along with the availability of numerous open source client and broker applications, I tend to recommend starting with MQTT, unless you have compelling reasons to choose otherwise. Mosquitto is one of the best-known and most widely used open source MQTT brokers, and it's a solid choice for your applications. The fact that it's open source is especially valuable if you're building a proof of concept on a small budget and want to avoid the expense of proprietary systems.
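One detail worth understanding before committing to MQTT is how it routes messages: subscribers register topic filters against a hierarchical topic namespace, where `+` matches exactly one level and `#` matches everything below. Here's a minimal sketch of that matching logic in Python (a toy illustration of the protocol's rules, not a real client):

```python
def topic_matches(filter_str: str, topic: str) -> bool:
    """Check a published topic against an MQTT subscription filter.

    '+' matches exactly one topic level; '#' matches the rest of the
    topic, however many levels remain.
    """
    f_parts = filter_str.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True  # multi-level wildcard swallows the remainder
        if i >= len(t_parts):
            return False  # filter is deeper than the topic
        if f != "+" and f != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)

# Devices might publish readings under a per-device topic hierarchy:
print(topic_matches("sensors/+/temperature", "sensors/device42/temperature"))  # True
print(topic_matches("sensors/#", "sensors/device42/battery/voltage"))          # True
print(topic_matches("sensors/+/temperature", "sensors/device42/humidity"))     # False
```

The topic names here are hypothetical; the point is that a single subscription like `sensors/#` lets your ingest layer receive every device's events without enumerating them.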
Regardless of which protocol you choose, you will eventually have messages in hand, representing events or observations from your connected devices. Once a message is received by a broker such as Mosquitto, you can hand that message to the analytics system. A best practice is to store the original source data before performing any transformations or munging. This becomes very valuable when debugging issues in the transform step itself -- or if you need to replay a sequence of messages for end-to-end testing or historical analysis.
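One way to preserve the original source data is to wrap each incoming message in a small envelope that records receipt metadata while leaving the payload bytes untouched. The field names below are illustrative, not a standard:

```python
import json
import time
import uuid

def make_envelope(topic: str, payload: bytes) -> dict:
    """Wrap a raw broker message with metadata, leaving the payload as-is.

    Persisting this envelope before any transformation or munging lets you
    replay the exact original messages later for debugging the transform
    step, end-to-end testing, or historical analysis.
    """
    return {
        "message_id": str(uuid.uuid4()),   # stable id for de-duplication on replay
        "received_at": time.time(),        # broker-side receipt time, not device time
        "topic": topic,
        "raw_payload": payload.decode("utf-8", errors="replace"),
    }

envelope = make_envelope("sensors/device42/temperature", b'{"temp_c": 21.5}')
# The original message survives verbatim inside the envelope:
assert json.loads(envelope["raw_payload"]) == {"temp_c": 21.5}
```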
For storing IoT data, you have several options. In some projects I've used Hadoop and Hive, but lately I've been working with NoSQL document databases like Couchbase with great success. Couchbase offers a nice combination of high throughput and low latency. It's also a schema-less document database that supports high data volume along with the flexibility to add new event types easily. Writing data directly to HDFS is a viable option, too, particularly if you intend to use Hadoop and batch-oriented analysis as part of your analytics workflow.
For writing source data to a persistent store, you can either attach custom code directly to the message broker at the IoT protocol level (for example, the Mosquitto broker if using MQTT) or push messages to an intermediate messaging broker such as Apache Kafka -- and use different Kafka consumers for moving messages to different parts of your system. One proven pattern is to push messages to Kafka and set up two consumer groups on the topic: one group writes the raw data to your persistence store, while the other moves the data into a real-time stream processing engine like Apache Storm.
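The reason this pattern works is that Kafka tracks a separate read offset per consumer group, so each group independently sees the full stream -- fan-out rather than work-sharing. This toy model (a plain Python sketch, not the Kafka API) captures that semantic for a single partition:

```python
from collections import defaultdict

class MiniTopic:
    """Toy model of one Kafka topic partition: an append-only log with an
    independent read offset per consumer group, so every group receives
    every message (fan-out), unlike a shared work queue."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # consumer group name -> next offset to read

    def produce(self, msg):
        self.log.append(msg)

    def poll(self, group):
        """Return all messages this group has not yet consumed."""
        start = self.offsets[group]
        batch = self.log[start:]
        self.offsets[group] = len(self.log)
        return batch

topic = MiniTopic()
for reading in ({"device": "d1", "temp": 20}, {"device": "d2", "temp": 25}):
    topic.produce(reading)

# Both groups receive the full stream, independently of each other:
raw_batch = topic.poll("raw-persistence")  # this group writes raw data to the store
stream_batch = topic.poll("storm-ingest")  # this group feeds the Storm topology
assert raw_batch == stream_batch
```

The group names are hypothetical; in a real deployment each group would be a set of Kafka consumer processes sharing a `group.id`.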
If you aren't using Kafka but are using Storm, you can simply wire a bolt into your topology that does nothing but write messages out to the persistent store. If you are using MQTT and Mosquitto, a convenient way to tie things together is to have your messages delivered directly to an Apache Storm topology via the MQTT spout.
Preprocessing and transformations
Data from devices in its raw form is not necessarily suited for analytics. Values may be missing, requiring an enrichment step, or representations may need transformation (often true for date and timestamp fields).
This means you'll frequently need a preprocessing step to manage enrichments and transformations. Again, there are multiple ways to structure this, but another best practice I've observed is to store the transformed data alongside the raw source data.
Now, you might think: “Why do that when I can always just transform it again if I need to replay something?” As it turns out, transformations and enrichments can be expensive operations and may add significant latency to the overall workflow. It's best to avoid the need to rerun the transformations if you rerun a sequence of events.
Transformations can be handled several ways. If you are focused on batch mode analysis and are writing data to HDFS as your primary workflow, then Pig -- possibly using custom user-defined functions -- works well for this purpose. Be aware, however, that while Pig does the job, it’s not exactly designed to have low-latency characteristics. Running multiple Pig jobs in sequence will add a lot of latency to the workflow. A better option, even if you aren’t looking for “real-time analysis” per se, might be using Storm for only the preprocessing phase of the workflow.
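Whichever engine runs the preprocessing, the transform itself is usually mundane: normalize the timestamp representations different devices emit into one canonical form and tack on enrichment fields, without mutating the raw event. A minimal sketch, with illustrative field names:

```python
from datetime import datetime, timezone

def normalize_timestamp(value):
    """Coerce the timestamp formats different devices emit into one
    canonical representation: ISO-8601 in UTC."""
    if isinstance(value, (int, float)):  # epoch seconds
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:  # e.g. a device sending '2016-03-01 14:30:00'
        dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return dt.isoformat()

def transform(raw_event: dict) -> dict:
    """Return an enriched copy; the raw event is left unmodified so the
    original source data can still be stored alongside the result."""
    event = dict(raw_event)
    event["timestamp"] = normalize_timestamp(raw_event["timestamp"])
    event["schema_version"] = 1  # illustrative enrichment field
    return event

print(transform({"device": "d1", "timestamp": 1456843800}))
print(transform({"device": "d2", "timestamp": "2016-03-01 14:30:00"}))
```

Because `transform` is a pure function of the raw event, storing its output alongside the source data costs nothing in correctness -- replaying the raw stream through it reproduces the transformed stream exactly.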
Analytics for business insights
Once your data has been transformed into a suitable state and stored for future use, you can start dealing with analytics.
Apache Storm is explicitly designed for handling continuous streams of data in a scalable fashion, which is exactly what IoT systems tend to deliver. Storm excels at managing high-volume streams and performing operations over them, like event correlation, rolling metric calculations, and aggregate statistics. Of course, Storm also leaves the door open for you to implement any algorithm that may be required.
Our experience to date has been that Storm is an extremely good fit for working with streaming IoT data. Let’s look at how it can work as a key element of your analytics pipeline.
In Storm, by default “topologies” run forever, performing any calculation that you can code over a boundless stream of messages. Topologies can consist of any number of processing steps, aka bolts, which distribute over nodes in a cluster; Storm manages the message distribution for you. Bolts can maintain state as needed to perform “sliding window” calculations and other kinds of rolling metrics. A given bolt can also be stateless if it needs to look at only one event at a time (for example, a threshold trigger).
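The stateful/stateless distinction is easier to see in code. Below is a plain-Python sketch of the two kinds of bolt logic (the class shapes are illustrative; a real Storm bolt would implement Storm's Java or multilang bolt interface):

```python
from collections import deque

class SlidingWindowAverage:
    """Sketch of a stateful bolt: keeps the last `size` readings per device
    and emits a rolling mean each time a new event arrives."""

    def __init__(self, size=5):
        self.size = size
        self.windows = {}  # device id -> deque of recent values

    def execute(self, event):
        win = self.windows.setdefault(event["device"], deque(maxlen=self.size))
        win.append(event["temp_c"])
        return {"device": event["device"], "rolling_avg": sum(win) / len(win)}

class ThresholdTrigger:
    """Sketch of a stateless bolt: looks at only one event at a time."""

    def __init__(self, limit):
        self.limit = limit

    def execute(self, event):
        return event["temp_c"] > self.limit

bolt = SlidingWindowAverage(size=3)
for t in (20.0, 22.0, 24.0):
    out = bolt.execute({"device": "d1", "temp_c": t})
print(out)  # rolling average over the last three readings for d1

trigger = ThresholdTrigger(limit=30.0)
print(trigger.execute({"device": "d1", "temp_c": 35.0}))  # True
```

The stateful version is what forces the capacity questions discussed below: per-device windows accumulate in memory and must eventually be bounded or reaped.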
The calculated metrics in your Storm topology can then be used to suit your business requirements as you see fit. Some values may trigger a real-time notification using email or XMPP or update a real-time dashboard. Other values may be discarded immediately, while some may need to be stored in a persistent store. Depending on your application, you may actually find it makes sense to keep more than you throw away, even for “intermediate” values.
Why? Simply put, you have to “reap” data from any stateful bolts in a Storm topology eventually, unless you have infinite RAM and/or swap space available. You may eventually need to perform a “meta analysis” on those calculated, intermediate values. If you store them, you can achieve this without the need to replay the entire time window from the original source events.
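Reaping can be as simple as a periodic pass that evicts entries older than the window and hands the evicted counts off for persistence. A minimal sketch of that idea, with hypothetical names:

```python
class RollingCounter:
    """Per-device event counter over a time window, with periodic reaping
    so the bolt's state does not grow without bound."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = {}  # device id -> list of event timestamps

    def record(self, device, ts):
        self.events.setdefault(device, []).append(ts)

    def reap(self, now):
        """Drop timestamps that have aged out of the window. Return the
        evicted counts so these intermediate values can be persisted for
        later meta-analysis instead of being lost."""
        evicted = {}
        for device, stamps in self.events.items():
            keep = [t for t in stamps if now - t <= self.window]
            if len(keep) != len(stamps):
                evicted[device] = len(stamps) - len(keep)
            self.events[device] = keep
        return evicted

counter = RollingCounter(window_seconds=60.0)
counter.record("d1", 0.0)
counter.record("d1", 30.0)
counter.record("d1", 100.0)
print(counter.reap(now=120.0))  # {'d1': 2} -- two readings aged out of the window
```

Whatever `reap` returns is exactly the kind of calculated, intermediate value worth writing to the store rather than discarding.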
How should you store the results of Storm calculations? To start with, understand that you can do anything in your bolts, including writing to a database. Defining a Storm topology that writes calculated metrics to a persistent store is as simple as adding code to the various bolts that connects to your database and pushes the resulting values to the store. Actually, to follow the separation-of-concerns principle, it would be better to add a new bolt to the topology downstream of the bolt that performs the calculations and give it sole responsibility for managing persistence.
Storm topologies are extremely flexible, giving you the ability to have any bolt send its output to any number of subsequent bolts. If you want to store the source data coming into the topology, this is as easy as wiring a persistence bolt directly to the spout (or spouts) in question. Since spouts can send data to multiple bolts, you can both store the source events and forward them to any number of subsequent processing steps.
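This fan-out wiring is easier to see than to describe. The sketch below is a toy model of topology wiring in plain Python, not Storm's actual `TopologyBuilder` API; it just shows one source stream feeding both a persistence step and a processing step:

```python
class Topology:
    """Toy wiring model: each node's output fans out to every subscriber,
    mirroring Storm's ability to send a spout's stream to any number of
    downstream bolts."""

    def __init__(self):
        self.edges = {}  # source node name -> list of subscriber callables

    def connect(self, source, sink):
        self.edges.setdefault(source, []).append(sink)

    def emit(self, source, event):
        for sink in self.edges.get(source, []):
            sink(event)

stored, processed = [], []
topo = Topology()
topo.connect("spout", stored.append)     # persistence bolt: only writes events out
topo.connect("spout", processed.append)  # downstream calculation bolt
topo.emit("spout", {"device": "d1", "temp_c": 21.0})

# The same source event reached both branches:
assert stored == processed == [{"device": "d1", "temp_c": 21.0}]
```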
For storing these results, you can use any database, but as noted above, we've found that Couchbase works well in these applications. The key point to choosing a database: You want to complement Storm -- which has no native query/search facility and can store a limited amount of data in RAM -- with a system that provides strong query and retrieval capabilities. Whatever database you choose, once your calculated metrics are stored, it should be straightforward to use the native query facilities in the database for generating reports. From here, you want the ability to utilize Tableau, BIRT, Pentaho, JasperReports, or similar tools to create any required reports or visualizations.
Storing data in this way also opens up the possibility of performing additional analytics at a later time using the tool of your choice. If one of your bolts pushes data into HDFS, you open up the possibility of employing an entire swath of Hadoop-based tools for subsequent processing and analysis.
Building analytics solutions that can handle the scale of IoT systems isn't easy, but the right technology stack makes the challenge less daunting. Choose wisely and you'll be on your way to developing an analytics system that delivers valuable business insights from data generated by a swarm of IoT devices.