Compliance has turned us into pack rats. Even outside the heavily regulated health and financial industries, many of us now reflexively play it safe and save anything that seems important. But surprisingly, this "structured" information -- documents, e-mail, transaction records -- accounts for only about half of all enterprise data.
The other half is so-called dark matter. Generated by servers, routers, desktops, switches, and other systems, dark matter generally takes the form of log files that record errors, system access attempts, and countless other events. Dark matter in IT, like that mysterious stuff floating in deep space, is both widely distributed and hidden despite its enormous mass.
Typically, IT pays attention to dark matter only after something goes wrong. When there's a security breach, you go straight to the log files to see when and how the breach began and which systems may have been compromised. When a server goes down, log files usually reveal the cause of the failure. Otherwise, dark matter stays in the dark.
But what if you monitored those log files en masse as a matter of course? Could you drill into dark matter and detect security breaches in progress or sound the alarm based on a pattern of errors before a server falls over?
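What that kind of routine monitoring might look like, in miniature: a script that tallies errors per time window and raises a flag when a window's count crosses a threshold. The log format, field layout, and threshold below are all hypothetical -- this is a sketch of the idea, not a production monitor.

```python
import re
from collections import Counter

# Assumed syslog-style format: "2009-04-01 12:00:03 ERROR disk: I/O timeout"
# (timestamp, severity level, message). Real log formats vary widely.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (\w+) (.*)$")

def error_rate_alarm(lines, threshold=5):
    """Count ERROR events per minute; return the minutes whose count
    exceeds the threshold -- a crude early-warning signal."""
    per_minute = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that don't match the assumed format
        minute, level, _msg = m.groups()
        if level == "ERROR":
            per_minute[minute] += 1
    return [minute for minute, n in sorted(per_minute.items()) if n > threshold]
```

A real deployment would tail live files and page an operator rather than return a list, but the core pattern -- bucket events by time, watch for an abnormal rate -- is the same.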
The answer to that question points to some of the most interesting enterprise technology around -- including SEM (security event management), cloud-based distributed computing, and advanced search technology expressly designed for dark matter.
To take a timely example, ArcSight -- one of the leading SEM vendors -- just announced FraudView, which mines security log data for statistically significant patterns of nefarious activity. According to Reed Henry, senior vice president of marketing for ArcSight, FraudView is already being used to detect wire fraud in wholesale banks and "pump and dump" stock schemes in retail brokerages.
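ArcSight hasn't published FraudView's internals, so the following is only a generic illustration of what "statistically significant patterns" can mean in practice: flag any entity -- an account, a trading desk -- whose event count sits several standard deviations above the population mean. The function name, account IDs, and z-score cutoff are all invented for the example.

```python
import statistics

def flag_outliers(counts, z_cutoff=3.0):
    """Flag entities whose event count exceeds the mean by more than
    z_cutoff population standard deviations.

    counts: dict mapping an entity (e.g. account ID) to its event count.
    Illustrative only -- not ArcSight's actual algorithm.
    """
    values = list(counts.values())
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # no variation, nothing stands out
    return [k for k, v in counts.items() if (v - mean) / stdev > z_cutoff]
```

Production fraud detection layers many such signals -- velocity, time-of-day, peer-group comparisons -- but each reduces to the same question: is this behavior abnormal relative to a baseline?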
On the raw technology side, there's Apache Hadoop, a Java programming framework designed for data-intensive parallel processing. Hadoop turns out to be perfect for pulling together log files distributed across an organization for analysis. Amazon now provides turnkey Hadoop services, so customers can shovel huge quantities of log data onto Amazon servers and crunch on it mercilessly, teasing out patterns that may yield profound insights on, say, application or datacenter architecture.
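A sketch of what that Hadoop-style log crunching involves: a mapper that emits (error code, 1) pairs and a reducer that sums counts per code across every file in the cluster. In a real Hadoop Streaming job these would read stdin and write tab-separated stdout; they're written as plain generators here so the logic stays visible, and the log format is an assumption.

```python
def mapper(lines):
    """Map step: emit (error_code, 1) for each line containing an ERROR token.
    Assumes a hypothetical format like "... ERROR disk: I/O timeout"."""
    for line in lines:
        parts = line.split()
        if "ERROR" in parts[:-1]:  # ERROR must be followed by a code token
            code = parts[parts.index("ERROR") + 1].rstrip(":")
            yield code, 1

def reducer(pairs):
    """Reduce step: sum counts per key. Assumes pairs arrive sorted by key,
    which Hadoop's shuffle phase guarantees."""
    current, total = None, 0
    for key, n in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += n
    if current is not None:
        yield current, total
```

The payoff of the framework is that Hadoop handles the distribution: mappers run where the log data lives, the shuffle sorts intermediate pairs by key, and reducers see each key's values together -- so the same two small functions scale from one file to thousands.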