What is Apache NiFi? Initially, it was a government project, born in the intelligence community. Apache NiFi was made available through the NSA Technology Transfer Program in 2014. Before that, the founders of Onyara were the key contributors to the US government software project that evolved into Apache NiFi.
Apache NiFi provides "powerful and scalable directed graphs of data routing, transformation, and system mediation logic." In other words, it connects systems that produce data (sensors, geolocation devices, social feeds, clickstreams, and so on) with systems that process data (Apache Spark fits nicely here) and systems that store data (HDFS or NoSQL databases, for example). It does all of this securely, with data traceability, governance, and more.
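To make the "directed graph" idea concrete, here is a minimal sketch in plain Python. It is not NiFi's API: the `Flow` class, its method names, and the temperature example are all hypothetical, chosen only to show records being transformed at nodes, routed along conditional edges, and traced as they go. It also assumes the graph has no cycles.

```python
from collections import defaultdict

class Flow:
    """Toy directed graph (hypothetical, not NiFi): nodes transform, edges route."""

    def __init__(self):
        self.processors = {}             # name -> function(record) -> record
        self.edges = defaultdict(list)   # name -> [(predicate, downstream name)]
        self.provenance = []             # (record id, processor name) audit trail

    def add_processor(self, name, func):
        self.processors[name] = func

    def connect(self, source, target, when=lambda record: True):
        # Route output of `source` to `target` when the predicate holds.
        self.edges[source].append((when, target))

    def run(self, start, record):
        pending = [(start, record)]
        while pending:
            name, rec = pending.pop()
            out = self.processors[name](rec)
            self.provenance.append((out["id"], name))  # trace every hop
            for when, target in self.edges[name]:
                if when(out):
                    pending.append((target, out))

stored, alerts = [], []
flow = Flow()
flow.add_processor("ingest", lambda r: dict(r))
flow.add_processor(
    "to_celsius",
    lambda r: {**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)},
)
flow.add_processor("store", lambda r: stored.append(r) or r)
flow.add_processor("alert", lambda r: alerts.append(r) or r)
flow.connect("ingest", "to_celsius")
flow.connect("to_celsius", "store")                                  # everything is stored
flow.connect("to_celsius", "alert", when=lambda r: r["temp_c"] > 80) # only hot readings alert

flow.run("ingest", {"id": 1, "temp_f": 212})
flow.run("ingest", {"id": 2, "temp_f": 68})
```

Both readings end up in `stored`, only the first reaches `alerts`, and `flow.provenance` records which nodes each reading passed through, which is the traceability idea in miniature.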
What the NSA uses this technology for, one can only guess, although it's probably safe to assume that it has plenty of use cases that need secure and reliable collection of data from sensors and social feeds.
But this acquisition sheds light on some interesting aspects and hurdles of big data outside the realm of transactional systems.
When collecting data from the Internet of Things or from social media, reliability is often viewed as a secondary issue: after all, there are so many sensors or tweets that what matters is getting a representative sample, right?
In some cases it's a valid approach. Weather prediction and sentiment analysis rely on general patterns, so a few missing data points don't matter. But for some mission-critical applications, such as industrial plant monitoring or fault detection, the missing data point may very well be the one that should have triggered an alert or initiated a remedial process. And even without going to such extremes, failing to mark a specific parking spot as occupied, or a trash can as full, can disrupt the user experience and erode trust in the system.
The fact that it's big data does not make the reliability of the collection process any less important. Guaranteed delivery, and failover procedures in case of a broken communication link, must therefore be in place.
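As a rough illustration of what guaranteed delivery with failover can look like, here is a minimal store-and-forward sketch in Python. It is an assumption-laden toy, not how NiFi implements it: the `ReliableSender` class and the `primary` and `backup` link functions are hypothetical, and the buffer lives in memory, where a real collector would persist it to disk.

```python
from collections import deque

class ReliableSender:
    """Hypothetical sketch: keep a reading buffered until some link accepts it."""

    def __init__(self, links):
        self.links = links      # callables tried in order: primary first, then backups
        self.buffer = deque()   # readings not yet acknowledged (in memory only)

    def submit(self, reading):
        self.buffer.append(reading)
        self.flush()

    def flush(self):
        # Deliver in arrival order; stop at the first reading no link accepts.
        while self.buffer:
            if not self._deliver(self.buffer[0]):
                return False    # every link is down, so the reading stays buffered
            self.buffer.popleft()
        return True

    def _deliver(self, reading):
        for send in self.links:
            try:
                send(reading)   # returning without an exception counts as an ack
                return True
            except ConnectionError:
                continue        # fail over to the next link
        return False

received = []
state = {"backup_up": False}

def primary(reading):
    raise ConnectionError("primary link is down")

def backup(reading):
    if not state["backup_up"]:
        raise ConnectionError("backup link is down")
    received.append(reading)

sender = ReliableSender([primary, backup])
sender.submit("spot-17 occupied")   # both links down: nothing is lost, only delayed
state["backup_up"] = True
sender.submit("bin-4 full")         # backup link is up: the backlog drains in order
```

The parking-spot reading is not dropped while both links are down; it is delivered, in order, as soon as the backup link recovers. A design like this gives at-least-once delivery, so a receiver may occasionally see a duplicate and should be prepared to ignore it.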
Security is often overlooked in big data applications. In part, that's because big data platforms don't do security too well. But it's also because, as with reliability, there is often a perception that the security risks of sensors are minimal.
Some sensors are also actuators. In such cases, hacking a sensor is all that's needed to create a critical condition in an industrial process, which can lead to either a safety shutdown or even a catastrophic failure.
Other sensors can receive requests for more information, for example by changing their polling frequency. What better way to crash a network than to suddenly ask all sensors to poll a hundred times more often?
And finally, we have all seen these movies where the bad guys hijack security cameras to show a pre-recorded image while they break into the safe ... what if they could as easily break into the sensors that detect intrusions?
It would only seem natural that the NSA had figured out these issues. Apache NiFi's project page certainly discussed them. So it seems that with this acquisition, Hortonworks is getting its hands on some interesting technology (and expertise) to expand the scope of big data projects.
This article is published as part of the IDG Contributor Network.