The lure of using big data for your business is a strong one, and there is no brighter lure these days than Apache Hadoop, the scalable data storage platform that lies at the heart of many big data solutions.
But as attractive as Hadoop is, there is still a steep learning curve involved in understanding what role Hadoop can play for an organization, and how best to deploy it.
By understanding what Hadoop can and can't do, you can get a clearer picture of how best to implement it in your own data center or cloud. From there, you can lay out best practices for a Hadoop deployment.
What Hadoop can't do
We're not going to spend a lot of time on what Hadoop is, since that's well covered in documentation and media sources. It is important, though, to know the two major components of Hadoop: the Hadoop Distributed File System (HDFS) for storage, and the MapReduce framework, which lets you perform batch analysis on whatever data you have stored within Hadoop. That data, notably, does not have to be structured, which makes Hadoop ideal for analyzing and working with data from sources like social media, documents, and graphs -- anything that can't easily fit within rows and columns.
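To make the MapReduce idea concrete, here is a minimal sketch of the programming model in plain Python. It does not use Hadoop or its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample posts are assumptions made up for illustration, and everything runs in memory on one machine. In a real cluster, the framework would run the map and reduce steps in parallel across many nodes, reading from and writing to HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one line of raw text."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical unstructured input: free-form text with no rows or columns.
posts = [
    "Hadoop stores data",
    "Hadoop analyzes data in batches",
]
counts = run_job(posts)
print(counts)
```

Note that the input is just free-form text. Nothing in the map or reduce step depends on a schema, which is the property that makes this model a natural fit for unstructured data.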
That's not to say you can't use Hadoop for structured data. In fact, many solutions take advantage of Hadoop's relatively low storage cost per terabyte to simply store structured data there instead of in an RDBMS (relational database management system). But if your storage needs are not all that great, then shifting data back and forth between Hadoop and an RDBMS would be overkill.
One area you would not want to use Hadoop for is transactional data. Transactional data, by its very nature, is highly complex, as a single transaction on an e-commerce site can generate many steps that all have to be executed quickly. That scenario is not at all ideal for a batch-oriented system like Hadoop.
Nor is Hadoop a good fit for structured data sets that require very low latency, such as when a website is served by a MySQL database in a typical LAMP stack. That's a speed requirement that Hadoop would serve poorly.
What Hadoop can do
Because it is built for batch processing, Hadoop is best deployed for jobs like index building, pattern recognition, creating recommendation engines, and sentiment analysis -- all situations where data is generated at a high volume, stored in Hadoop, and queried at length later using MapReduce functions.
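Index building, the first use case named above, fits the same batch pattern: collect documents first, then process the whole set in one pass. The sketch below builds a small inverted index (term to list of documents) in plain Python. It is an illustration only, not Hadoop code; `map_document`, `build_index`, and the sample documents are hypothetical names and data chosen for this example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, doc_id) once for each distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def build_index(documents):
    """Group document IDs by term, then reduce each group to a sorted list."""
    grouped = defaultdict(set)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].add(emitted_id)
    return {term: sorted(doc_ids) for term, doc_ids in grouped.items()}


# Hypothetical documents that accumulated in storage before the batch job ran.
documents = {
    "doc1": "Hadoop stores big data",
    "doc2": "MySQL serves web pages",
    "doc3": "Hadoop runs batch jobs",
}
index = build_index(documents)
print(index["hadoop"])
```

The job reads every document before it can answer a single lookup, which is acceptable for a nightly or hourly batch run but not for the low-latency requests described in the previous section.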
But this does not mean that Hadoop should replace existing elements within your data center. On the contrary, Hadoop should be integrated within your existing IT infrastructure in order to capitalize on the myriad pieces of data that flow into your organization.







