Compliance has turned us into pack rats. Even outside the heavily regulated health and financial industries, many of us now reflexively play it safe and save anything that seems important. But surprisingly, this "structured" information -- documents, e-mail, transaction records -- accounts for only about half of all enterprise data.
The other half is so-called dark matter. Generated by servers, routers, desktops, switches, and other systems, dark matter generally takes the form of log files that record errors, system access attempts, and countless other events. Dark matter in IT, like that mysterious stuff floating in deep space, is both widely distributed and hidden despite its enormous mass.
Typically, IT pays attention to dark matter only after something goes wrong. When there's a security breach, you go straight to the log files to see when and how the breach began and which systems may have been compromised. When a server goes down, log files usually reveal the cause of the failure. Otherwise, dark matter stays in the dark.
But what if you monitored those log files en masse as a matter of course? Could you drill into dark matter and detect security breaches in progress or sound the alarm based on a pattern of errors before a server falls over?
The answer to that question points to some of the most interesting enterprise technology around -- including SEM (security event management), cloud-based distributed computing, and advanced search technology expressly designed for dark matter.
To take a timely example, ArcSight -- one of the leading SEM vendors -- just announced FraudView, which mines security log data for statistically significant patterns of nefarious activity. According to Reed Henry, senior vice president of marketing for ArcSight, FraudView is already being used to detect wire fraud in wholesale banks and "pump and dump" stock schemes in retail brokerages.
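ArcSight hasn't published FraudView's internals, but the core idea -- flagging events that deviate sharply from an established baseline -- can be sketched in a few lines. Here is a minimal, hypothetical example that applies a median-based outlier test to wire-transfer amounts pulled from transaction logs; the account names, sample data, and threshold are illustrative only, not ArcSight's:

```python
from collections import defaultdict
from statistics import median

# Hypothetical parsed log events: (account, transfer amount).
events = [
    ("acct-001", 120.0), ("acct-001", 135.0), ("acct-001", 110.0),
    ("acct-001", 9800.0),  # the kind of spike a fraud monitor should catch
    ("acct-002", 50.0), ("acct-002", 55.0), ("acct-002", 48.0),
]

def robust_flags(events, threshold=3.5):
    """Flag transfers far outside an account's typical range, using the
    median absolute deviation (MAD), which -- unlike mean/stdev -- isn't
    itself skewed by the very outliers we're hunting."""
    by_account = defaultdict(list)
    for account, amount in events:
        by_account[account].append(amount)
    flagged = []
    for account, amounts in by_account.items():
        if len(amounts) < 3:
            continue  # too little history to establish a baseline
        med = median(amounts)
        mad = median(abs(a - med) for a in amounts)
        if mad == 0:
            continue
        for amount in amounts:
            # 0.6745 rescales MAD so the score is comparable to a z-score
            if 0.6745 * abs(amount - med) / mad > threshold:
                flagged.append((account, amount))
    return flagged
```

Run against the sample events, only the $9,800 transfer trips the threshold; a production system would add time windows, peer-group comparisons, and many more signals, but the statistical skeleton is the same.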
On the raw technology side, there's Apache Hadoop, a Java programming framework designed for data-intensive parallel processing. Hadoop turns out to be perfect for pulling together log files distributed across an organization for analysis. Amazon now provides turnkey Hadoop services, so customers can shovel huge quantities of log data onto Amazon servers and crunch on it mercilessly, teasing out patterns that may yield profound insights on, say, application or datacenter architecture.
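In practice, a Hadoop log-crunching job boils down to two small pieces: a mapper that turns raw log lines into key/value pairs, and a reducer that aggregates them. The sketch below shows the shape of it in the style of Hadoop Streaming, which lets you write both stages as ordinary scripts; the "timestamp level code message" log format is a made-up example, not any particular product's:

```python
from itertools import groupby

def mapper(lines):
    """Emit (error_code, 1) for every ERROR line in the logs.
    Assumes a hypothetical 'timestamp level code message' format."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "ERROR":
            yield parts[2], 1

def reducer(pairs):
    """Sum the counts for each error code. Like a real Hadoop reducer,
    this assumes its input arrives sorted by key."""
    for code, group in groupby(pairs, key=lambda kv: kv[0]):
        yield code, sum(count for _, count in group)
```

With Hadoop Streaming, each function would be a standalone script reading stdin and writing tab-separated pairs to stdout; Hadoop handles distributing the mappers across the cluster and the sort-and-shuffle between the two stages, which is what lets the same three lines of logic chew through terabytes of consolidated logs.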
At a more refined level, Splunk provides what it calls "Google for IT." Rather than attempt to consolidate log data, Splunk crawls and indexes it wherever it may reside on the network, using semantics to parse that data and make sense of it -- to the point of creating charts and graphs that look like business intelligence reports. Erik Swan, Splunk's CTO and co-founder (and a winner of InfoWorld's 2009 CTO 25 award), believes that the usefulness of this data extends beyond IT. Business management, he says, can use these graphical search results to put a very accurate price tag on downtime, for example, or to show the concrete results of investing in optimization. The recently released Splunk 4 ups the ante, enabling users to create custom dashboards and process as many as 500 million events per second.
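Stripped to its essence, the "Google for IT" approach means extracting fields from log lines at search time instead of forcing them into a schema up front. The toy sketch below -- an illustration of search-time field extraction, emphatically not Splunk's actual engine -- matches key=value pairs in raw lines, filters on them, and rolls the results up into the kind of summary a dashboard would render as a bar chart (the log lines and field names are invented):

```python
import re
from collections import Counter

# Hypothetical raw log lines from two web hosts.
LOGS = [
    "host=web01 status=500 ms=1203",
    "host=web01 status=200 ms=87",
    "host=web02 status=500 ms=1408",
    "host=web02 status=200 ms=65",
]

FIELD = re.compile(r"(\w+)=(\S+)")

def search(logs, **criteria):
    """Yield events whose fields, extracted at search time,
    match every requested criterion."""
    for line in logs:
        fields = dict(FIELD.findall(line))
        if all(fields.get(k) == v for k, v in criteria.items()):
            yield fields

def report(logs, group_by, criteria):
    """Chart-ready summary: count matching events per value of group_by."""
    return dict(Counter(e[group_by] for e in search(logs, **criteria)))
```

Asking for server errors grouped by host -- `report(LOGS, "host", {"status": "500"})` -- yields one count per machine, the raw material for exactly the sort of downtime-cost chart Swan describes.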
There was a time in IT when advanced thinking always pointed to some new, more logical organizational model, in which order would be imposed on chaos and new efficiencies would magically appear. Dark matter and the emerging solutions to manage it tell a different story. You can spend eternity centralizing, creating new standards, and regimenting reality, or you can accept a certain degree of random distribution and take your best shot at interpreting it, at a speed much closer to real time. Exploring the inner space of dark matter is a fascinating new frontier.