Rod Smith has one of the most enviable titles around: vice president of emerging Internet technologies. He earned it. My first encounter with him goes back to the early days of SaaS (software as a service) when he was IBM's point man on the topic. But he is probably best known for his key role in the development of IBM's WebSphere line of middleware, as well as for his early advocacy of XML, Web services, and J2EE.
Last week, the day after IBM's 100th anniversary celebration, I caught up with Smith at the Strata conference on "big data" -- that is, the huge globs of unstructured data generated by Web clickstreams, system and security logs, distributed sensors, truckloads of text, and just about anything else you can name.
Teasing value from data once considered too amorphous to exploit is Smith's current obsession -- not surprising, since this is one of the most exciting areas of emerging technology. Smith leads strategy and planning for IBM's Big Data practice, including IBM InfoSphere BigInsights, a collection of analytics and visualization technologies centering on Hadoop. I began our conversation by asking Smith about the origins of his involvement with Big Data.
Eric Knorr: When did you first encounter Big Data? My guess is that it was before it was called that.
Rod Smith: It was. When we went to customers and talked about just processing data, they kept saying, "Databases, we know what we know about them, but there's data out there that we think has value -- but we don't know. We think it has insights for us. But we don't want to pick it up and put it in a database with all the management costs that go with that, and then find it doesn't mean anything. So we need something we can use to discover insights quickly -- or not."
It's kind of like a cycle of exploration, but traditional handling of data doesn't do that. You go through the process of bringing it in and cleansing it and normalizing it. But they said, "That's not what we want. We don't know if data from Twitter is going to be valuable until we see something there that makes us go, 'Aha, now we know what we can do with it!'"
One of the first customers that asked for a proof of concept was the BBC. They had an effort called Digital Democracy, and they were looking at how they could help journalists be much more efficient writing in-depth articles. It takes a long time to really sift through information. So I said, "That's interesting." We didn't know what they wanted us to do yet. So they said, "We're not quite ready to get our information from our side of it, but could you go out and read in all the Parliament information and then tell us what Parliament members were interested in what bills, what bills were getting buzzed, who was working on them, how long they'd been working on them?" And they gave us a list of interesting questions. And so that's where we started, and that's Big Data. Not necessarily in the terabyte sense, but in the sense of cost-intensive people trying to work with it.
Knorr: And it's unstructured.
Smith: And it's unstructured, or semi-structured, as people call it. But we like the term "big data" because data folks have been forced to define different types of data, as opposed to the business person who just says, "I don't care if it's structured or unstructured or whatever, I just want to get this information from it. And you confuse me by telling me how it's done. I don't know the how. I don't care. I just want to get these insights from it." And that was really how we got started running these things and using Many Eyes, in the BBC case, to do the visualizations.