In the future, when an administrator does a server build, he or she may be adding a MapReduce library to the usual stack of server, database, and middleware software.
This copy of MapReduce could be used to analyze log data directly on the server, minimizing the need for a separate cluster to analyze log data, and shortening the length of time until results are generated, noted University of California San Diego researcher Dionysios Logothetis at the Usenix Annual Technical Conference, being held this week in Portland, Oregon.
With this approach, "the analysis [moves] from the dedicated clusters to the log servers themselves, to avoid expensive data migration," said Logothetis, who, along with other researchers at UCSD and Salesforce.com, described this technique in a Usenix paper, "In-situ MapReduce for Log Processing."
First introduced by Google, MapReduce is increasingly being used to analyze large amounts of data that can be located across multiple servers, or nodes. It is most commonly used as part of the Hadoop data processing platform.
While most uses of MapReduce take place on dedicated clusters, the researchers argued that a version of the analytical software framework could also be an integral part of Web servers themselves, where they could be used to do early analysis of log data.
Today's commercial Internet sites log detailed information on user visits, information that can be used for targeting ads, security monitoring and debugging.
A single server working for a busy e-commerce site can generate 1 to 10 MB worth of log data every second. In the course of the day, this can result in tens of terabytes' worth of data. On average, 1,000 servers each producing 1MB per second can generate 86TB of data per day. As an example, Facebook generates about 100TB a day, Logothetis said.
Typically, large organizations like Facebook will collect data from all servers, load it into a Hadoop cluster, and analyze the results using MapReduce. MapReduce is a framework and associated library for splitting a job into multiple parts so that it can be run simultaneously across many servers, or nodes.
This "store-first, query-later" approach to log analysis has a number of disadvantages, Logothetis argued. Shipping all the data from the servers consumes a great deal of bandwidth. "This puts a lot of stress on the network," he said.
Logothetis noted Facebook discards about 80 percent of its log data before it is analyzed. With this new technology, that data would not need to be transferred.
The traditional approach also introduces a lag between when the data is produced and when it can be analyzed, which can be a critical shortcoming if the data is time-sensitive in nature.
As an alternative, the researchers proposed placing MapReduce functionality on each server itself, which could analyze the data and ship a much smaller result set back to the centralized data collection point. They dubbed this approach "in-situ MapReduce" (iMR).
"iMR is designed to complement, not replace traditional cluster-based architectures. It is meant for jobs that filter or transform log data either for immediate use or before loading it into a distributed storage system ... for follow-on analysis," the researchers' paper notes.