Today's large-scale data analysis may be a high-tech undertaking, but smart data scientists can improve their craft by observing how simple low-tech picture puzzles are solved, said an IBM scientist at the GigaOm conference.
Watching how people put together picture puzzles can reveal "a lot of profound effects that we could bring to big data" analysis, said Jeff Jonas, IBM's chief scientist for entity analytics, speaking Wednesday at one of the more whimsical presentations at the data structure conference in New York.
Data analysis is becoming an increasingly important component of many businesses. IDC estimates enterprises will spend more than $120 billion by 2015 on analysis systems. IBM estimates that it will reap $16 billion in business analytics revenue by 2015.
But getting useful results from such systems requires careful planning.
In a series of informal experiments, Jonas observed how small groups of friends and family work together to assemble picture puzzles made up of thousands of separate pieces.
"My girlfriend sees her son and three cousins, I see four parallel processor pipelines," he said. To make the challenge a bit harder, he removed some of the puzzle pieces, and, obtaining a second copy of some puzzles, added duplicate pieces.
Puzzles are about assembling small bits of discrete data into larger pictures. In many ways, this is the goal of data analysis as well, namely finding ways of assembling data such that it reveals a bigger pattern.
A lot of organizations make the mistake of practicing "pixel analytics," Jonas said, in which they try to gather too much information from a single data point. The problem is that if too much analysis is done too soon, "you don't have enough context" to make sense of the data, he said.
Context, Jonas explained, means looking at what is around the bit of data, in addition to the data itself. By doing too much stripping and filtering of seemingly useless data, one can lose valuable context. When you see the word "bat," you look at the surrounding data to see what kind of bat it is, be it a baseball bat, a bat of the eyelids, or a nocturnal creature, he said.
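Jonas' "bat" example can be illustrated with a minimal Python sketch. The sense inventory, the cue words, and the `disambiguate` function are all hypothetical and were not part of the talk; the sketch simply picks whichever sense shares the most words with the surrounding text, and gives up when the surrounding words have been stripped away.

```python
# Hypothetical sense inventory: the cue words are illustrative assumptions,
# not anything presented in the talk.
SENSES = {
    "baseball bat": {"baseball", "swing", "pitch", "hit", "ball"},
    "bat of the eyelids": {"eye", "eyelid", "eyelash", "wink"},
    "nocturnal creature": {"cave", "wings", "night", "fly", "nocturnal"},
}


def disambiguate(sentence, target="bat"):
    """Return the sense of `target` best supported by the surrounding words.

    Returns None if the target is absent or no cue word appears nearby.
    """
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if target not in words:
        return None
    context = set(words) - {target}
    # Score each sense by how many of its cue words appear in the context.
    scores = {sense: len(cues & context) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("The bat flew out of the cave at night"))
print(disambiguate("bat"))  # all context stripped away: no decision possible
```

With the surrounding words intact, the first call can settle on the nocturnal creature; the second call, where only the single data point remains, has nothing to go on.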
"Low-quality data can be your friend. You'll be glad you didn't over-clean it," Jonas said. Google, for instance, reaps the benefits of this approach. Sloppy typists will often get a "did you mean this?" suggestion after entering a misspelled word into the search engine. Google provides results for what it surmises is the correct word, which it guesses using a backlog of incorrectly typed queries.
With puzzles, users first concentrate on fitting one piece to another. Over time, they create small clumps of data, which they can then figure out how to connect to finish the puzzle. The edges and the corners are assembled fairly quickly. In effect, as the puzzle progresses, "you are making faster quality decisions than before," Jonas said. "The computational costs to figure out where a piece goes declines."
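One toy way to see the declining cost is the Python sketch below. It is an assumption-laden model, not Jonas' experiment: each piece is a number that belongs in the slot with the same number, "trying a fit" is a simple equality check, and the hypothetical `assemble` function counts how many open slots are tried before each piece lands.

```python
import random


def assemble(num_pieces, seed=0):
    """Place shuffled pieces one at a time; return the checks needed per piece.

    Toy model (an assumption): piece k belongs in slot k, and a piece is
    tried against each still-open slot in order until it fits.
    """
    rng = random.Random(seed)
    pieces = list(range(num_pieces))
    rng.shuffle(pieces)

    open_slots = list(range(num_pieces))
    costs = []
    for piece in pieces:
        checks = 0
        for slot in open_slots:
            checks += 1
            if slot == piece:  # the piece fits this slot
                break
        open_slots.remove(piece)  # a filled slot is never tried again
        costs.append(checks)
    return costs


costs = assemble(1000)
half = len(costs) // 2
print("average checks, first half: ", sum(costs[:half]) / half)
print("average checks, second half:", sum(costs[half:]) / (len(costs) - half))
```

Because every placed piece removes a slot from consideration, later pieces are tried against fewer candidates, so the average number of checks per piece falls as the puzzle fills in.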