Big data log analysis thrives on machine learning

Huge quantities of log data generated by all sorts of devices open immense potential for insight, but machine learning is needed to make sense of it

Machine-generated log data is the dark matter of the big data cosmos. It is generated at every layer, node, and component within distributed information technology ecosystems, including smartphones and Internet-of-things endpoints. It is collected, processed, analyzed, and used everywhere, but mostly behind the scenes.

Log data is fundamental to many of the least glamorous enterprise applications, such as troubleshooting, debugging, monitoring, security, antifraud, compliance, and e-discovery. However, it can also be a powerful tool for analyzing clickstream, geospatial, social media, and other logged behavioral data relevant to many customer-centric use cases.

Mortals can barely keep up with machine-logged data. Most of it is not designed or intended for direct human analysis. Unless filtered with brutal efficiency, the extreme volumes, velocities, and varieties of log data can quickly overwhelm human cognition. The authors of this recent Accenture article explain it succinctly:

[A]s the volume and variety of log files rises, it becomes increasingly difficult for log management solutions to parse log files, trace potential issues, and actually find errors -- particularly when cross-log correlations come into play. Even in the best-case scenarios, it requires an experienced operator to follow event chains, filter noise, and eventually diagnose the root cause to a complex problem.

Clearly, automation is key to finding insights within log data, especially as it all scales into big data territory. Automation can ensure that data collection, analytical processing, and rule- and event-driven responses to what the data reveals are executed as rapidly as the data flows. Key enablers for scalable log-analysis automation include machine-data integration middleware, business rules management systems, semantic analysis, stream computing platforms, and machine-learning algorithms.

Among these, machine learning is the key to automating and scaling the distillation of insights from log data. But machine learning is not a one-size-fits-all approach to log-data analysis. Different machine-learning techniques are suited to different types of log data and to different analytical challenges. When the correlations and other patterns sought through machine learning can be specified a priori, supervised learning is the way to proceed. However, supervised learning requires a human expert to prepare a reference "training data" set from the log in order to refine a machine-learning algorithm's ability to discern the most relevant patterns.
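To make the supervised case concrete, here is a minimal sketch in Python of a naive Bayes classifier trained on a handful of hand-labeled log lines. The log lines, the "error" and "normal" labels, and the function names are all invented for illustration; a real training set would be far larger and prepared by someone who knows the system.

```python
import math
from collections import Counter, defaultdict

def tokenize(line):
    # Lowercase and split on whitespace; real logs usually need a richer parser.
    return line.lower().split()

def train(labeled_lines):
    """Count tokens per label from (line, label) pairs prepared by a human expert."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for line, label in labeled_lines:
        label_counts[label] += 1
        token_counts[label].update(tokenize(line))
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, label_counts, vocab

def classify(line, model):
    """Return the label with the highest log-posterior under multinomial naive Bayes."""
    token_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior for this label
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(line):
            # Laplace smoothing so unseen tokens do not zero out the score.
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled training data, invented for this sketch.
TRAINING = [
    ("connection timeout to db host", "error"),
    ("disk write failed on volume", "error"),
    ("authentication failed for user", "error"),
    ("user login succeeded", "normal"),
    ("request served in 12 ms", "normal"),
    ("health check ok", "normal"),
]

model = train(TRAINING)
print(classify("db connection timeout", model))
print(classify("health check ok", model))
```

The point of the sketch is the dependency it exposes: the classifier can only recognize the categories that someone has already labeled, which is exactly the limitation described above.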

But when the log-data patterns cannot be precisely defined in advance, unsupervised and reinforcement learning may be more appropriate. Those are the machine-learning-powered, log-data-analysis scenarios most amenable to full automation, because they can pick out and prioritize the patterns most relevant to the task at hand without the need for human-supplied training-data sets. (For links to further details on these machine-learning approaches, see my recent post.)

Multilog correlation is a core log-data analysis use case for unsupervised and reinforcement learning. As diverse log-data sets are combined and grow more heterogeneous, complex, and inscrutable, the most interesting data variables and relationships are not at all clear in advance of the analysis. Consequently, the hidden patterns may remain invisible if we merely try to view them using simple queries, pre-existing reports and dashboards, and other standard analytic views. In these cases, machine learning can pull out the most noteworthy patterns for further exploration by using various quantitative approaches such as clustering, Markov models, self-organizing maps, and so forth.
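As a rough illustration of the clustering idea, the following Python sketch groups lines from several logs by the similarity of their token sets, with no labels supplied. It is a deliberately simple greedy scheme standing in for the more capable clustering methods mentioned above; the sample lines, the 0.5 threshold, and the function names are assumptions made for the example.

```python
import re

def template(line):
    # Mask digits so "took 12 ms" and "took 97 ms" share the same token set.
    return frozenset(re.sub(r"\d+", "<n>", line.lower()).split())

def jaccard(a, b):
    # Share of tokens the two sets have in common.
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(lines, threshold=0.5):
    """Greedy single pass: join the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative_token_set, member_lines)
    for line in lines:
        tokens = template(line)
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                members.append(line)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((tokens, [line]))
    return [members for _, members in clusters]

# Hypothetical lines merged from web, database, and authentication logs.
LINES = [
    "web: GET /cart took 12 ms",
    "web: GET /cart took 97 ms",
    "db: slow query on orders 450 ms",
    "db: slow query on orders 1200 ms",
    "auth: login failed for user 42",
]

groups = cluster(LINES)
for members in groups:
    print(len(members), members[0])
```

Nobody told the code which message types exist; the groupings fall out of the data, and the small or singleton clusters are often the ones worth a closer look.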

Another key use of unsupervised and reinforcement learning is to identify significant patterns that either have never occurred before or, if they have, have never been flagged by human analysts as anything other than "noise." The article's authors discuss a hypothetical security-log analysis application of machine learning that can "immediately spot an atypical access pattern for a user, even if that specific access pattern had never been seen before, and prevent particularly high-risk losses of private information."
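One way to picture how a never-before-seen access pattern could still be flagged is a first-order Markov model of a user's event sequences. The Python sketch below is not the application the article's authors describe; the event names, the session history, and the smoothing constant are all hypothetical, chosen only to show the mechanism.

```python
import math
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count first-order transitions between consecutive events in each session."""
    counts = defaultdict(Counter)
    for events in sessions:
        for prev, nxt in zip(events, events[1:]):
            counts[prev][nxt] += 1
    return counts

def surprise(events, counts, vocab_size, alpha=1.0):
    """Average negative log-probability per transition; higher means more atypical."""
    total = 0.0
    steps = 0
    for prev, nxt in zip(events, events[1:]):
        row = counts.get(prev, Counter())
        # Additive smoothing gives unseen transitions a small nonzero probability.
        p = (row[nxt] + alpha) / (sum(row.values()) + alpha * vocab_size)
        total -= math.log(p)
        steps += 1
    return total / steps if steps else 0.0

# Hypothetical past sessions for one user, invented for this sketch.
HISTORY = [
    ["login", "view_profile", "logout"],
    ["login", "view_profile", "edit_profile", "logout"],
    ["login", "view_profile", "logout"],
]
EVENTS = {"login", "view_profile", "edit_profile", "export_all_records", "logout"}

counts = fit_transitions(HISTORY)
typical = surprise(["login", "view_profile", "logout"], counts, len(EVENTS))
atypical = surprise(["login", "export_all_records", "logout"], counts, len(EVENTS))
print(round(typical, 3), round(atypical, 3))
```

The export sequence scores as more surprising even though no one ever labeled it as suspicious, because the model learned only what this user normally does. Where to set the alert threshold remains a judgment call that the sketch does not attempt to answer.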

Many of the most disruptive insights from massive log data will be of this nature: complex, buried, and unprecedented. Learning from the log data itself, rather than from any a priori knowledge, will be how many data scientists spend much of their time. They will increasingly tune their machine-learning algorithms to listen for "signals" in the log that even the most advanced human subject-matter experts had previously overlooked.

This story, "Big data log analysis thrives on machine learning," was originally published at InfoWorld.com.

Copyright © 2014 IDG Communications, Inc.