Big data analytics is one of the major trends every company is told it must jump on for competitive advantage, even survival. As a result, a lot of mythology has grown up around big data. Those myths can lead you astray, wasting resources or putting you on dead-end paths. They can also cause you to miss opportunities where big data approaches would help.
Here are the nine biggest myths about big data and Hadoop that you should not believe.
Myth No. 1: You can get data scientists
Recently, a presales engineer at one of my company's partners mentioned how much trouble his firm had finding data scientists. I asked about the qualifications his company was seeking. Well, they need to have a doctorate in math, a background in computer science, and what amounts to an MBA, not to mention actual work experience in all of those fields. I asked, "How old is this person, 90?"
Here's what actually exists:
- Good mathematicians who write crap Python and often need the business stuff spoon-fed to them
- Good computer science people who understand some math
- Good computer science people who understand business after working enough problems
- Business types who understand math
- Subject matter experts
- Leaders who know how to get these people to work together
Because that company could not find this data-scientist unicorn, it had to create a working group with a cross-section of expertise. This is in fact what you have to do.
Myth No. 2: Everything is new
Technologists like to throw away the past, preferring tools that are new for what they claim is a totally new reality or problem set. That's rarely the case.
For example, the Kafka message broker is portrayed as a big-data-needs-a-new-tool product. But compared to other message brokers, it has a pretty poor feature set and is immature. What's actually new (meaning different): Kafka is architected for the Hadoop platform and with massive distribution in mind. That could be useful, if you can accept its flaws.
That said, sometimes you need more sophisticated routing and guarantees. Use ActiveMQ or a more robust option for those situations.
Myth No. 3: Machine learning is what you need
I estimate that about 85 percent of what people call machine learning is simple statistics. Most of your problems are probably simple math and analysis. Start there.
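To make that concrete, here's a minimal sketch (with made-up numbers) of the kind of task that often gets billed as machine learning but is really an ordinary least-squares fit from a statistics textbook:

```python
from statistics import mean

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x -- plain statistics, no ML library."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

# Hypothetical monthly ad spend vs. sales -- a typical "predict the trend" ask.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = linear_fit(spend, sales)
predicted = intercept + slope * 6.0  # forecast at next month's spend level
```

If this answers the business question, you never needed a model-training pipeline in the first place.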
Myth No. 4: You are special
As the great philosopher Tyler Durden once said, "You are not special. You are not a delicate and unique snowflake." Guess what? About half of the industry is busy writing the same ETL scripts for many of the same data sources and custom-creating the same analysis. Hell, in any sizable company, many departments probably are duplicating this work as well.
Needless to say, it’s a good time to be a big data consultant.
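For illustration, the duplicated work usually looks something like this hypothetical extract-transform-load pipeline; the pipe-delimited record format and field names are invented for the example:

```python
def extract(lines):
    """Parse raw pipe-delimited records (the format here is a made-up example)."""
    for line in lines:
        name, amount = line.strip().split("|")
        yield {"name": name, "amount": float(amount)}

def transform(records):
    """Normalize names and drop zero-amount rows -- the usual cleanup step."""
    for rec in records:
        if rec["amount"] > 0:
            yield {"name": rec["name"].strip().lower(), "amount": rec["amount"]}

def load(records):
    """Stand-in sink: collect into a list instead of writing to a warehouse."""
    return list(records)

raw = ["Alice |10.5\n", "BOB|0\n", "Carol|3.0\n"]
rows = load(transform(extract(raw)))
```

Multiply that by every department writing its own copy against the same upstream sources, and you see where the consulting hours go.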
Myth No. 5: Hive is fast
Hive is not fast, and it cannot be made fast. Yes, the new version is better, but it will still underwhelm you from a performance perspective. It scales well, but you may need multiple tools in your tool chest to hit Hadoop with SQL.
Myth No. 6: You can use clusters with fewer than 12 nodes
Hadoop 2+ barely fits on 12 nodes -- anything less and you will wait forever for it to even start. Plus, anything you run will complete in cricket time, if at all. (Well, you can run "hello world" on 12 nodes.) Hadoop 2 runs more processes, which means you need more nodes and more memory.
Spark will do better (minus the load time from HDFS), so long as the data set fits in memory.
Myth No. 7: Virtualization is a solution for your data nodes
Your vendor told you no. Your IT team balked. No, you cannot put data nodes on your SAN. Even if you put only your management nodes in VMs, you can hit a bottleneck when writing logs and journals if you suffer low IOPS or high latency to the data nodes.
That said, Amazon Web Services and others navigate these issues and still manage reasonable performance and scalability. You can, too, but you need to treat this setup differently from your internal file servers and your external corporate website, and manage the hardware and virtualized resources effectively.
Remember: Throughput and latency are orthogonal. HDFS cares about both in different places.
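A back-of-the-envelope model makes the distinction concrete. The numbers below are illustrative assumptions, not benchmarks: total transfer time is per-request latency plus bytes over bandwidth, so many small reads punish a high-latency link even when throughput is identical:

```python
def transfer_time_s(blocks, block_mb, latency_ms, throughput_mb_s):
    """Total time = per-request latency plus bytes divided by bandwidth."""
    return blocks * (latency_ms / 1000.0) + (blocks * block_mb) / throughput_mb_s

# Same 100 MB/s throughput, different latency: 10,000 small reads of 0.1 MB each.
fast = transfer_time_s(blocks=10_000, block_mb=0.1, latency_ms=0.5, throughput_mb_s=100)
slow = transfer_time_s(blocks=10_000, block_mb=0.1, latency_ms=5.0, throughput_mb_s=100)
```

In this toy model the high-latency link takes four times as long to move the same data, which is why "the pipe is big enough" is not the whole story for HDFS.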
Myth No. 8: Every problem is a big data problem
If you are matching a couple fields against a couple of conditions across a couple of terabytes, it isn't really a big data problem. Don't treat every analytics need as a big data effort.
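As a sketch, that kind of job -- match a couple of fields against a couple of conditions -- is just a streaming filter that runs fine on one machine; the column names here are hypothetical:

```python
import csv
import io

def filter_rows(lines, region, min_amount):
    """Stream a CSV line by line and keep rows matching two simple conditions."""
    reader = csv.DictReader(lines)
    for row in reader:
        if row["region"] == region and float(row["amount"]) >= min_amount:
            yield row

# Stand-in for a file handle over a large CSV; memory use stays flat either way.
data = io.StringIO("region,amount\neast,120.0\nwest,300.0\neast,80.0\n")
matches = list(filter_rows(data, region="east", min_amount=100.0))
```

Pointed at a real multi-terabyte file, the same generator chugs through it in constant memory; slow, maybe, but no cluster required.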
Myth No. 9: You don't have big data
Although big data is about, well, working on huge sets of data, big data approaches can be quite useful on small data sets, too. So don't dismiss those approaches out of hand when working with small data. You could have mere gigabytes of data and still benefit from Hadoop or other big data technologies, depending on the problem.
You could also have big data that you don't know about. There are a lot of data sets that companies are accustomed to discarding but that could be useful. Any company with 50 or more employees probably has a big data issue somewhere -- even a smaller company will if it manages enough assets (financial or otherwise).