Refining enterprise search

Enterprise search is reaping relevant results thanks to new platforms and technologies

Anyone who has been transfixed by a gymnast or a figure skater knows that the magic happens when they perform flawlessly and yet make it seem easy. That’s how a search should work: Enter a query, and the right results appear in simple, elegant fashion -- even if it took countless hours of preparation to make the magic possible.

Yet most enterprise users still stumble as they try to extract data from multiple repositories, each with its own search engine. Enterprises seem awash in a rising tide of structured and unstructured data. And even though users are often forced to tag documents manually across various content management systems in hopes that those documents will be easier to retrieve, searches still yield a surfeit of irrelevant, time-wasting results.

ESPs (enterprise search platforms) are on a mission to change all that. These new, comprehensive bundles of search and integration technologies unlock information tucked away in data stores across the enterprise. The goal of ESPs is deceptively simple: to take fairly simple queries and return the most relevant results possible, all in one place. But under the hood, ESPs aggregate a host of emerging technologies such as autocategorization, entity extraction, and NLP (natural language processing). With an ESP as a foundation, businesses can build customized search applications while automating the process of preparing documents for archiving and indexing.

“The building blocks are converging so that you don’t have to cobble together all the pieces yourself,” observes Susan Feldman, vice president of content technology research at IDC. These advanced search platforms establish sophisticated gateways to silos of information -- even those with their own search engines. ESPs also provide a common set of data and search logic that can be tuned on an application-by-application basis to improve the relevance of search results.

IBM last month came out swinging with its DB2 Information Integrator, code-named Masala, which contains an advanced search engine designed to complement the company’s other heavy hitters in the content management arena, DB2 Content Manager and WebFountain. With Masala, IBM joins the ranks of Autonomy, Convera, EasyAsk, Endeca, Fast Search & Transfer (FAST), iPhrase, and Verity, each of which offers search-application platforms with a different mix of features.

Breaking down the walls

ESPs are transforming the way the enterprise conducts a federated search, the process by which a single query is passed to multiple search engines and the user is presented with aggregated results. A federated search can augment searches of similar data stores but loses traction when it runs up against external databases that require specific syntax.
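To make the idea concrete, here is a minimal sketch of a federated search in Python. It is an illustration only, not any vendor's implementation: the engine functions, document names, and the score-normalization rule are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]            # (document id, engine-specific score)
Engine = Callable[[str], List[Hit]]


def normalize(hits: List[Hit]) -> List[Hit]:
    """Scale one engine's scores into 0..1 so engines can be compared."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top <= 0:
        return [(doc, 0.0) for doc, _ in hits]
    return [(doc, score / top) for doc, score in hits]


def federated_search(query: str, engines: Dict[str, Engine]) -> List[Tuple[str, str, float]]:
    """Pass one query to every engine and return a single ranked list."""
    merged = []
    for name, engine in engines.items():
        for doc, score in normalize(engine(query)):
            merged.append((name, doc, score))
    # Best normalized score first; ties broken by engine name, then document id.
    merged.sort(key=lambda hit: (-hit[2], hit[0], hit[1]))
    return merged


# Two hypothetical engines standing in for separate repositories.
# Each scores on its own scale, which is why normalization is needed.
def cms_engine(query: str) -> List[Hit]:
    return [("cms/policy.doc", 8.0), ("cms/memo.doc", 2.0)]


def mail_engine(query: str) -> List[Hit]:
    return [("mail/msg-17", 50.0), ("mail/msg-42", 25.0)]


results = federated_search("travel policy", {"cms": cms_engine, "mail": mail_engine})
for engine_name, doc_id, score in results:
    print(engine_name, doc_id, score)
```

Note what the sketch leaves out: every engine receives the same raw query string, which is exactly where basic federation struggles once a repository expects its own syntax.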

Basic federated search, which has been in existence for years, “doesn’t protect the user from another kind of infoglut -- getting irrelevant results from multiple search engines instead of just one,” observes Hadley Reynolds, vice president and director of research at Delphi Group. “Without some additional sense-making, it’s a blunt instrument.”

Compounding matters, enterprises typically have multiple search engines embedded in various applications -- for instance, one in a content management system, one in the Microsoft Office environment, and another in an e-mail program. The ESP transcends these search-engine silos and corresponding data repositories and imposes syntax translation and other linguistic manipulations, such as spell-check and phrase detection, on the query prior to crawling the data stores.
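The query-side steps described above (spell-check, phrase detection, and syntax translation) can be sketched roughly as follows. The vocabulary, the phrase list, and both target "dialects" are invented for illustration; a real platform would draw on its index and on each repository's actual query language.

```python
import difflib
from typing import List

# Hypothetical vocabulary and phrase list; a real ESP would derive these
# from its own index rather than hard-code them.
VOCABULARY = ["nuclear", "facility", "uranium", "enrichment", "reactor"]
PHRASES = {("nuclear", "facility"), ("uranium", "enrichment")}


def correct_spelling(term: str) -> str:
    """Swap a term for its closest vocabulary entry, if one is close enough."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else term


def detect_phrases(terms: List[str]) -> List[str]:
    """Join adjacent terms that form a known two-word phrase."""
    units: List[str] = []
    i = 0
    while i < len(terms):
        if i + 1 < len(terms) and (terms[i], terms[i + 1]) in PHRASES:
            units.append(terms[i] + " " + terms[i + 1])
            i += 2
        else:
            units.append(terms[i])
            i += 1
    return units


def translate(units: List[str], dialect: str) -> str:
    """Render query units in a target engine's syntax (both dialects are made up)."""
    quoted = ['"%s"' % u if " " in u else u for u in units]
    if dialect == "boolean":
        return " AND ".join(quoted)
    if dialect == "plus":
        return " ".join("+" + q for q in quoted)
    raise ValueError("unknown dialect: " + dialect)


def prepare(query: str, dialect: str) -> str:
    """Run the full query-side pipeline for one target engine."""
    terms = [correct_spelling(t) for t in query.lower().split()]
    return translate(detect_phrases(terms), dialect)


print(prepare("nuclar facility reactor", "boolean"))
print(prepare("nuclar facility reactor", "plus"))
```

The same user query thus leaves the platform in a different shape for each repository, with the typo repaired and the phrase kept intact.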

At the indexing layer, the ESP aids the user by returning lists of improved query choices based on the context of the original, sometimes vague, query. Take FAST’s ESP, which powers a public-facing science search site. If you type the word “nuclear” to retrieve published science-journal entries on that topic, the keyword returns more than 700,000 results. A refined keyword search selected from the list of suggestions on the right-hand side of the page -- “nuclear facility” -- whittles that to approximately 1,000. Click once more, on “uranium enrichment,” and you’re down to about 10.

Endeca offers a technology that combines search with what it calls Guided Navigation. Here, a keyword search generates a search directory on the fly, which users can employ to drill down to progressively refined results.
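A rough sketch of how a search directory might be generated on the fly: count the attribute values found in the current hit set, offer those counts as drill-down choices, and filter when the user picks one. This is a generic faceted-navigation illustration, not Endeca's actual method; the catalog, field names, and functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Doc = Dict[str, str]

# Hypothetical document records with two navigable attributes.
CATALOG: List[Doc] = [
    {"title": "Reactor safety review", "topic": "nuclear facility", "year": "2003"},
    {"title": "Enrichment plant audit", "topic": "uranium enrichment", "year": "2004"},
    {"title": "Reactor cooling study", "topic": "nuclear facility", "year": "2004"},
    {"title": "Bond market outlook", "topic": "finance", "year": "2004"},
]


def keyword_search(docs: List[Doc], keyword: str) -> List[Doc]:
    """Return documents whose title contains the keyword (case-insensitive)."""
    return [d for d in docs if keyword.lower() in d["title"].lower()]


def facets(hits: List[Doc], fields: List[str]) -> Dict[str, Counter]:
    """Build the directory on the fly: count each field value among the hits."""
    return {field: Counter(d[field] for d in hits) for field in fields}


def refine(hits: List[Doc], field: str, value: str) -> List[Doc]:
    """Drill down by keeping only hits with the chosen facet value."""
    return [d for d in hits if d[field] == value]


hits = keyword_search(CATALOG, "reactor")
directory = facets(hits, ["topic", "year"])
print(directory["year"])            # choices offered to the user, with counts

narrowed = refine(hits, "year", "2004")
print([d["title"] for d in narrowed])
```

Because the counts are computed from the hits rather than from a fixed taxonomy, the choices offered change with every query.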

Customized tune-ups

According to Delphi Group’s Reynolds, creating an effective search interface for the enterprise user involves “knowledge-driven search applications” tailored to the business domain of the staffer.

“In order to achieve real accuracy, the search software has to be tuned to understand the context in which I’m working,” Reynolds says. “It’s a business-process-centered development strategy, so that you’re looking at a platform from the perspective of its ability to be tailored to specific users.”

Reynolds adds that Autonomy and FAST already prepackage offerings in the compliance, call center, market intelligence, and financial arenas. Verity offers multiple application templates as well. With this kind of tailored search interface, when financial brokers type “bonds” into a query, they never have to set eyes on a document related to glue.

FAST’s Marketrac layers an application on top of the ESP, which amounts to a search-powered interface that can access e-mail content, news feeds, competitors’ Web site content, and database content in a CRM. Moreover, the platform’s categorization facilities enable knowledge workers to explore content through patterns of meaning or subject matter.

Meanwhile, Google is taking a different approach with its enterprise offering, the Google Search Appliance (see our Test Center Review). It puts behind the firewall much of the successful technology that powers its public product, taking plug and play to new heights. In other words, the appliance is basically a search engine, not a comprehensive platform.

Dave Girouard, general manager for enterprise search at Google, cautions that ESPs “are putting a bigger burden on the user. As long as the results show up in the first page, [users] don’t care what’s behind it. … We have the right relevancy algorithms. So, in terms of [too much] content, we’re saying, ‘Bring it on.’ ”

The Google appliance may save the day for enterprises with broken search technology: Just open up the repositories and rev up the Google engine. But Delphi Group’s Reynolds thinks that “IT should stop investing in generic search tools and start concentrating on their professional domains. At the same time, the business side should be more involved, to ensure that IT commits the resources to develop business-oriented applications of search.”

Andrew McKay, vice president of direct sales at FAST, agrees but adds that vendors “aren’t necessarily fighting over a percentage of the pie. It’s about making the pie dramatically larger,” as information stores expand exponentially.

It’s all in the pipeline

For years, businesses have been fighting to get searches of unstructured data -- information that resides outside enterprise applications and databases -- to achieve the kind of accuracy and precision expected with structured data. According to Delphi Group’s Reynolds, with ESPs, the search-indexing process for unstructured information is evolving into a pipeline of different search algorithms and advanced technologies. These allow for dynamic categorizations or targeted text analytics to take place within the processes that parse documents when they come into the search platform, and within the processes that evaluate queries and return relevant information.

A relatively new addition to the pipeline is entity extraction, in which a search engine extracts terms from indexed content on the fly through grammatical analysis. The process includes identifying proper nouns and creating a list of people, places, and things from a document and then inserting a new level of metadata into that document.
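As a deliberately crude illustration of that process, the sketch below treats runs of capitalized words as candidate proper nouns and attaches them to the document as a new metadata field. Real platforms rely on grammatical analysis; the capitalization heuristic, the stopword list, and the field names here are assumptions made for the example.

```python
import re
from typing import Dict, List

# One or more capitalized words in a row, e.g. "Delphi Group".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-zA-Z0-9]*(?: [A-Z][a-zA-Z0-9]*)*")

# Function words that are capitalized only because they start a sentence.
STOPWORDS = {"The", "A", "An", "It", "In", "With"}


def extract_entities(text: str) -> List[str]:
    """Return candidate proper nouns in order of first appearance."""
    found: List[str] = []
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group().split(" ")
        # Drop leading function words such as a sentence-initial "With".
        while words and words[0] in STOPWORDS:
            words = words[1:]
        entity = " ".join(words)
        if entity and entity not in found:
            found.append(entity)
    return found


def annotate(doc: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the document with an added 'entities' metadata field."""
    enriched = dict(doc)
    enriched["entities"] = extract_entities(str(doc["body"]))
    return enriched


doc = {
    "id": "news-001",
    "body": "With Masala, IBM joins Autonomy and Verity. "
            "Susan Feldman of IDC follows the market.",
}
print(annotate(doc)["entities"])
```

The heuristic will misfire on any ordinary word that opens a sentence, which is one reason production systems lean on grammar rather than capitalization alone.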

Another is the use of NLP, which helps turn poor queries into good ones. The state of the art in search platforms involves a wide range of algorithms, rules, data enhancements, user- and context-profiling -- all of which work together to help zero in on what users need to answer their questions.

As for metadata, the old way of manually defining properties of a document is waning in favor of an intelligent search platform’s ability to autotag documents based on users’ “custom logic,” according to FAST’s McKay.

ESPs can discover patterns in the content and enhance the value of that content within the search platform infrastructure by automatically creating metadata elements. Thanks to the exponential spread of XML across search environments, this metadata can then be used for a wide range of application processing, query enhancements, and presentation options.

Enhanced classification and taxonomy come into play by enabling users to browse information by subject area rather than relying solely on the blank search field and their ability to construct an effective query. Dynamic classification capabilities can modify the presentation of subject areas based on the query’s context.

These new technologies “allow you to cross the structured and unstructured worlds,” says Pete Bell, co-founder of Endeca.

To make unstructured data more meaningful, Verity is taking several approaches. Its newly introduced Extractor automatically preprocesses documents, looking for concepts, patterns, and entities, and tags files accordingly. At the next level, its Collaborative Classifier enlists a broad range of subject-matter experts within the organization to manage topics. It’s highly intuitive and encourages user participation, which, in turn, significantly boosts categorization accuracy, according to company officials.

End to end with security in mind

Although the line between consumer search and enterprise search continually blurs, a key difference lies in enterprise security architecture.

“Security is a huge issue because you don’t want to show results that include documents to which the user has no right,” IDC’s Feldman says, asserting, however, that security at the platform layer is fairly straightforward. “If you’ve got document-level security and repository-level security, search engines can use them to index documents for access rights. They can also tie into an LDAP directory to look at the collection-level access rights.”

John McPherson, a distinguished engineer at IBM, explains that the search engine within DB2 Information Integrator is adept at integrating assigned permissions and maintaining the security of the data from the underlying repository.

“There are associated security tokens at the document level, and an interface allows the application to do a search on behalf of the user with specified security credentials, which guarantee we’re only returning content that the user is allowed to see,” McPherson says. “It’s integrated way down in the index so we’re also getting peak performance.”
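The general pattern McPherson describes, security tokens stored with each indexed document and checked against the searcher's credentials, can be sketched as follows. This is a simplified illustration, not IBM's implementation; the index entries, group names, and matching rule are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Set


class IndexEntry(NamedTuple):
    doc_id: str
    text: str
    acl: FrozenSet[str]   # groups allowed to see the document (hypothetical tokens)


# A toy index in which access rights are stored alongside each document.
INDEX: List[IndexEntry] = [
    IndexEntry("hr/salaries.xls", "salary bands 2004", frozenset({"hr"})),
    IndexEntry("wiki/travel.html", "travel policy and salary advances",
               frozenset({"hr", "staff"})),
    IndexEntry("legal/merger.doc", "merger salary terms", frozenset({"legal"})),
]


def secure_search(query: str, user_groups: Set[str]) -> List[str]:
    """Return matching document ids the user is entitled to see.

    The access check happens inside the matching loop, so a restricted
    document never appears in the results, not even as a title.
    """
    term = query.lower()
    return [
        entry.doc_id
        for entry in INDEX
        if term in entry.text and entry.acl & user_groups
    ]


print(secure_search("salary", {"staff"}))
print(secure_search("salary", {"hr"}))
print(secure_search("salary", set()))
```

In practice the user's group memberships would come from a directory such as LDAP rather than being passed in by hand.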

Delphi Group’s Reynolds echoes a prevailing sentiment: “The search environment has no business imposing a specific security scheme on the enterprise. You want it to be flexible and agnostic.”

Simplicity wrapped in complexity

For that matter, users are typically agnostic in that few workers sit around questioning how results are returned.

To be of value, search vendors must provide “a single user experience that hides the fact that there are different engines, different indexes, and different capabilities happening in the background,” notes Laura Ramos, vice president of research at Forrester Research.

But ESPs demand that users get acquainted with more intelligent search methods. According to IDC’s Feldman, the blank query field and three-word search are gradually going away as ESPs forge new interfaces. Search platforms “must be tied into the collaborative tools of the organization,” she says.

Mike Heck contributed to this article.

Copyright © 2004 IDG Communications, Inc.
