Dec 14, 2017 3:00 AM

In the rush to big data, we forgot about search

In the cloud era, we need to look at search to be the glue that lets us find the data and analyze it together, no matter where it lives

Graeme Smith (CC BY-SA 2.0)

I read David Linthicum’s post “Data integration is the one thing the cloud makes worse” with great interest. This very problem is a huge reason that I decided my next job would be at a search company. (That’s why I now work for LucidWorks, which produces Solr- and Spark-based search tools.) While working with clients, I realized that with big data and the cloud, a tough problem, finding things, was becoming worse. I had seen the meltdown coming as the use of Hadoop formed yet another data silo and, as a result, produced few actual insights.

Part of the problem is that the technology industry is driven by trends rather than by the problems it needs to solve. A few years ago, it was all about client/server under the guise of distributed computing à la Enterprise JavaBeans, followed by web services and then big data. Now it is all about machine learning. Many of these steps were important, and machine learning is an important tool for solving problems.

We lost indexing and search as big data emerged

But sadly, the most important problem-solving trend got lost in the shuffle: indexing and search.

The modern web began with search. The web would be a lot smaller if Yahoo and the search portals of the late 1990s had triumphed. The dot-com bomb happened and yet Google was born from its ashes. Search also birthed big data and arguably the modern machine learning trend. Google, Facebook, and other companies needed more ways to handle their indexing jobs and their large amounts of data distributed at internet scale. Meanwhile, they needed better ways to find and organize data after they ran up against the limits of crowdsourcing and human intelligence.

Amazon.com blew away the retail market in part because it dared to invest in search technology. The main reason I go to Amazon and not other vendors is that I’ll almost definitely find what I’m looking for. In fact, Amazon may suggest what I want before I get around to searching for it. (Though I have to say that Amazon.com’s recommendations are now falling behind the curve.) Yet many retailers still use the built-in search in their commerce suite and then wonder why customer conversion and engagement are off. (Hint: Customers can’t find anything to buy.)

Meanwhile, many companies continue to keep old-style enterprise search products. Some of these products aren’t even maintained, belonging to dead or acquired companies. Most people still operate with bookmarks. So if you move some of your data to SaaS solutions, move some of your data to PaaS solutions, move some of your data to IaaS solutions and across multiple vendors’ cloud platforms while maintaining some of your data behind the firewall—yeah, no one is going to find anything!

How to redefine “integration”

To address what Linthicum raised in his post, we need to redefine “integration” for the distributed and cloud computing era.

Data integration used to mean just that: grabbing all the data and dumping it into a big, fat, single area. First this was with databases, then data warehouses, and then Hadoop. Ironically, we moved further away from indexed technology when doing this.

Now, integration must mean that we can index and find the data where it lives, deduplicate it, and derive a result. To find a single source of truth, we need to capture timestamps and source IDs.
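To make that idea concrete, here is a minimal sketch in Python of deduplicating records from several systems and indexing what survives. Everything in it is an assumption made for illustration: the `Record` fields, the source names such as "crm-saas", the sample data, and the rule that the most recently modified copy wins. A real deployment would rely on a search platform rather than an in-memory dictionary, and might choose the authoritative copy by some other rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str        # hypothetical identifier shared across systems
    source_id: str  # where the record lives (hypothetical source names)
    timestamp: int  # last-modified time, assumed comparable across sources
    text: str       # the searchable content


def deduplicate(records):
    """Keep the most recently modified copy of each key."""
    latest = {}
    for rec in records:
        current = latest.get(rec.key)
        if current is None or rec.timestamp > current.timestamp:
            latest[rec.key] = rec
    return latest


def build_index(latest):
    """Map each lowercase token to the keys of the records containing it."""
    index = defaultdict(set)
    for key, rec in latest.items():
        for token in rec.text.lower().split():
            index[token].add(key)
    return index


def search(index, latest, query):
    """Return records containing every query token, newest first."""
    tokens = query.lower().split()
    if not tokens:
        return []
    keys = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted((latest[k] for k in keys),
                  key=lambda r: r.timestamp, reverse=True)


# Made-up sample data: the same customer record lives in two places.
records = [
    Record("cust-42", "onprem-db", 1000, "Acme Corp renewal contract"),
    Record("cust-42", "crm-saas", 2000, "Acme Corp renewal contract signed"),
    Record("cust-7", "cloud-bucket", 1500, "Globex contract draft"),
]

latest = deduplicate(records)
index = build_index(latest)
results = search(index, latest, "contract")
for rec in results:
    print(rec.key, rec.source_id, rec.timestamp)
```

Because each result carries its source ID and timestamp, whoever runs the search can see which system supplied the answer and how fresh it is, without the data ever being copied into one central store.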

To integrate, we need a single search solution that can reach our on-premises data and our cloud data. The worst thing we can do is deploy a search tool that only searches one source of data, serves only one use case, or can’t be used behind our firewall.

In the cloud era, we need to look at search to be the glue that lets us find the data and analyze it together, no matter where it lives. We can’t just dump everything into one place; we need tools to let us get to exactly the right data where it lives.