With every cool new technology, people get overly infatuated and start using it for the wrong things. For example: Scanning a bazillion records to find the few million that match a set of criteria is a rather stupid use of MapReduce or your favorite DAG implementation (see: Spark).
For that and similar tasks, don’t forget the original big data technology: search. With great open source tools like Solr and Elasticsearch (and commercial platforms built on them, like Lucidworks), you have a powerful way to optimize your I/O and personalize your user experience. It's much better than holding fancy new tools from the wrong end.
A bad job for Spark
Not long ago a client asked me how to use Spark to search through a bunch of data they’d streamed into a NoSQL database. The trouble was that their pattern was a simple string search and a drill-down. It was beyond the capabilities of the database to do efficiently: They would have to pull all the data out of storage and parse through it in memory. Even with a DAG it was a little slow (not to mention expensive) on AWS.
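That access pattern — a plain string match followed by a drill-down on a field — is exactly what an inverted index answers in one lookup instead of a full scan. Here's a minimal pure-Python sketch of the idea; the records and field names are invented for illustration:

```python
from collections import defaultdict

# Toy records standing in for documents streamed into storage.
records = [
    {"id": 1, "text": "payment failed on checkout", "region": "us-east"},
    {"id": 2, "text": "payment succeeded", "region": "eu-west"},
    {"id": 3, "text": "checkout timeout", "region": "us-east"},
]

# Build the inverted index at write time: term -> set of record ids.
index = defaultdict(set)
by_id = {}
for rec in records:
    by_id[rec["id"]] = rec
    for term in rec["text"].split():
        index[term].add(rec["id"])

def search(term, drill_down=None):
    """String search via one index lookup, then an optional field drill-down."""
    hits = [by_id[i] for i in sorted(index.get(term, set()))]
    if drill_down:
        field, value = drill_down
        hits = [r for r in hits if r[field] == value]
    return hits

# One lookup touches only the matching records -- no scan of everything.
print([r["id"] for r in search("checkout", drill_down=("region", "us-east"))])
```

A real search engine adds tokenization, scoring, and distribution on top, but the core win is the same: the work happens once at index time, not on every query.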
Spark is great when you can put a defined data set in memory. Spark is not so great at sucking up the world, in part because in-memory analytics is only as good as your ability to transfer everything to memory and pay for that memory. We still need to think about storage and how to organize it in a way that gets us what we need quickly and cleanly.
For that particular client, the answer was to index the data as it came in and pull back a subset for more advanced machine learning -- but leave search to a search index.
Search versus machine learning
No clean line exists between search, machine learning, and certain related techniques. Clearly, information that's textual or linguistic tends to strongly indicate a search problem. Information that is numeric, binary, or simply not textual or linguistic in nature indicates a machine learning (or other) problem. There is overlap. There are even instances, such as anomaly detection, where either technique may be valid to use.
A key question is whether you can pick the right data when you retrieve it from storage as part of the criteria -- versus having to munge through the data. For textual or defined numeric data this may be simple. Again, the kind of rules one uses for anomaly detection might lend themselves to search as well.
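When the rule is expressible as a defined criterion — say, a numeric threshold for an anomaly — you can pick the matching records straight out of an index rather than munging through everything. A hypothetical sketch (field names and data invented), keeping a sorted secondary index on the one field the rule targets:

```python
import bisect

# Hypothetical stored events; in practice this is your data at rest.
events = [
    {"id": "a", "latency_ms": 40},
    {"id": "b", "latency_ms": 900},
    {"id": "c", "latency_ms": 120},
    {"id": "d", "latency_ms": 1500},
]

# Sorted secondary index on the rule's field: (value, id) pairs.
latency_index = sorted((e["latency_ms"], e["id"]) for e in events)

def pick(threshold):
    """Answer the rule 'latency > threshold' from the index alone,
    never touching records that can't match."""
    start = bisect.bisect_right(latency_index, (threshold, "\uffff"))
    return [eid for _, eid in latency_index[start:]]

print(pick(500))  # only the anomalous ids come back
```

The binary search lands on the first candidate and everything below the threshold is skipped entirely — the retrieval criterion did the work, not an in-memory pass over the data.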
This approach of course has its limits. If you don’t know what you’re looking for and can’t define the rules very easily, then clearly search isn’t the right tool.
Search plus big data
In many cases, using search with Spark or your favorite machine library may be the ticket. I’ve talked about methods for adding search to Hadoop, but there are also methods for adding Spark, Hadoop, or machine learning to search.
After the dust settled on Spark, anyone working with it realized that it wasn’t magic beans and that real issues come with working in memory. For data you can index, being able to quickly pull back your working set for analysis is far better than a big fat I/O pull into memory to find what you’re looking for.
Search and context
But search isn’t only how you solve your “find my working set” problem or your memory and I/O issues. One of the weaknesses of most big data projects is the lack of context. I’ve talked about this in terms of security, but what about your user experience? While you’re streaming every little bit of data you can find about the user, how are you working with that to personalize the user experience?
Using the things you know about users (aka signals), you can improve the information you put in front of them. This might mean streaming analytics on the front end of your user interaction and a faceted search on the back end when you show them results or a personalized webpage.
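As a toy illustration of that idea (the signal names and boost weights here are invented), you can re-rank results by boosting documents that match a user's known signals, and compute facet counts for the drill-down side of the experience:

```python
from collections import Counter

# Search results before personalization.
results = [
    {"id": 1, "title": "Trail running shoes", "category": "running"},
    {"id": 2, "title": "Road bike helmet", "category": "cycling"},
    {"id": 3, "title": "Running socks", "category": "running"},
]

# Signals: things the interaction stream has taught us about this user.
user_signals = {"running": 2.0}  # invented weight: user keeps clicking running gear

def personalize(hits, signals):
    """Boost each hit's base score by the weight of any matching signal."""
    scored = [(1.0 + signals.get(h["category"], 0.0), h) for h in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in order
    return [h for _, h in scored]

def facets(hits, field):
    """Facet counts for the drill-down UI."""
    return Counter(h[field] for h in hits)

print([h["id"] for h in personalize(results, user_signals)])
print(dict(facets(results, "category")))
```

In a real engine the boosting happens inside the query (e.g., as a scoring function over signal fields) rather than as a post-pass, but the shape is the same: signals in, re-ranked and faceted results out.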
The search solution
As a data architect, engineer, developer, or scientist, you need more than one or two options in your toolbelt. I get very annoyed at the approach of “let’s store a big blob and hope for the best while we pay to sort through it every single time we use it.” Some vendors actually seem to espouse that.
Using indexes and search technology, you can compose a better working set. You can also avoid implementing machine learning or analytics and simply “pick” the data out of storage via criteria -- and, via signals, even personalize data for users based on your data streams. Search is good. Use it.