Already, "big data" has become one of those buzzphrases you say with an apologetic smirk. It sounds like marketecture, broad enough to apply to almost anything.
So let's clear up what big data is and isn't. Perhaps you've heard the canonical "three V's" definition: data high in volume, velocity, and variety. In other words, big data comes in multiterabyte quantities, accrues or changes fast, often resists normalized structure -- and tends to demand technologies beyond the tried-and-true RDBMS or data warehouse.
That cluster of new technologies around big data -- including Hadoop, a wild array of new NoSQL databases, massively parallel processing (MPP) analytic databases, and more -- represents the biggest leap forward in data management and analytics since the 1980s. That's really what big data is about. And these emerging technologies are already delivering business value: in deep insights about customer behavior, in faster app dev cycles, in the ability to use commodity hardware, and in reduced software licensing costs, because almost all these new technologies are open source.
Assuming your data volumes are exploding as fast as everyone else's, you're part of the big data trend whether you like it or not. So why not employ the tools purpose-built for the big data era? It's a better strategy than blindly buying more Oracle licenses or building another gold-plated data warehouse. Where you start, though, depends on the problems you want to solve.
Problem No. 1: I don't want to pay Oracle more money
This is not a big data problem per se, but software surrounding the big data trend may help solve it.
Many companies simply use Oracle (or DB2 or SQL Server) as their default data store for almost everything. After all, the RDBMS is probably the most successful technology in the history of software, and if you want a battle-tested, unassailable RDBMS with all the bells and whistles, you choose Oracle (or other ironclad commercially licensed software) and pay a lot for it. That's where data goes, period.
But now the RDBMS has all sorts of viable competition. As it turns out, there are many, many instances where database needs do not include relational capability, two-phase commits, complex transactions, and so on. In such cases, NoSQL solutions -- most of which are open source -- may perform and scale better at vastly reduced cost and with much lower maintenance overhead. For an overview of NoSQL database types, see "Which freaking database should I use?" by InfoWorld's Andrew Oliver.
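To make that concrete, here is a minimal sketch of the kind of workload that never touches a join or a multistep transaction: storing and fetching one self-contained record per key. The `TinyDocumentStore` class is purely hypothetical, an in-memory stand-in written for illustration and not the API of any real NoSQL product; the key name and the session fields are likewise made up.

```python
import json
from typing import Any, Dict, Optional


class TinyDocumentStore:
    """Hypothetical in-memory stand-in for a key-value/document store."""

    def __init__(self) -> None:
        # Maps a string key to one JSON-serialized document.
        self._docs: Dict[str, str] = {}

    def put(self, key: str, doc: Dict[str, Any]) -> None:
        # Each write replaces exactly one document under one key,
        # so nothing here needs a join or a multi-row transaction.
        self._docs[key] = json.dumps(doc)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        # Return the document stored under the key, or None if absent.
        raw = self._docs.get(key)
        return json.loads(raw) if raw is not None else None


store = TinyDocumentStore()
store.put("session:42", {"user": "alice", "cart": ["sku-1", "sku-2"]})
session = store.get("session:42")
```

Everything the application needs about the session lives in one document under one key. Workloads shaped like this are the ones where a full RDBMS feature set tends to go unused.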
Now, nobody would power down their Oracle servers and port all their existing customer and product data to, say, MongoDB. For one thing, the security isn't there yet -- and by their nature NoSQL databases tend to compromise ACID compliance. Also, when complex transactions are involved, even NoSQL vendors will tell you that an RDBMS remains your best solution. Finally, if you just want to save money, you're not going to waste a fortune rearchitecting an Oracle database and its applications for NoSQL (an open source alternative like PostgreSQL might be a better choice).
But for new projects, especially those involving Web applications that demand instant scalability -- or analytics systems intended to crunch gobs of semistructured data -- exciting alternatives beckon. Not only are they mostly open source, but they also run on low-cost server hardware.