Exclusive: IBM enters enterprise search fray
WebSphere Information Integrator finds almost everything -- but slowly
With the proliferation of electronic documents and the archival pressures that various industry regulations have been exerting on companies, enterprise search has become an important IT requirement during the past few years. Many search solutions -- including search appliances and more-robust, federated search engines, such as those from IBM and Verity -- have come to market recently to meet the demand. Specialty products such as Vivísimo Velocity fill additional niches.
I spent a day at IBM's San Jose, Calif., office putting an enterprise implementation of IBM's federated enterprise search engine through its paces: the recently renamed WebSphere Information Integrator, OmniFind Edition, v. 8.2. The product first rolled out late in 2004 under the DB2 database brand name.
OmniFind is a true enterprise-scale search engine that IBM itself uses to find items in its databases, e-mail archives, and 10,000 Web sites. The product comes in one of two configurations: on a single server, or on a four-way system comprising a crawler, a parser-indexer, and two redundant run times that provide client interface services.
Clients most commonly interact with OmniFind through a browser, but they can also do so through a Java API. The latter enables an application to query the engine for specific items, with the results returned to the application as Java objects. The Java API is useful for embedding search in custom software, such as a knowledge-base search facility built into a help-desk application.
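The pattern is straightforward: the application submits a query string and consumes the hits as plain objects. The sketch below illustrates the idea with invented class names -- these are not the actual OmniFind API types, and the in-memory stub stands in for the real search service.

```java
import java.util.List;

// Hypothetical sketch of the client pattern a search Java API enables:
// submit a query, receive results as objects. Class names are
// illustrative, not the actual OmniFind API.
public class SearchClientSketch {

    // A search hit as the application would consume it.
    record SearchResult(String title, String uri, double score) {}

    // Stand-in for the service interface the API would expose.
    interface SearchService {
        List<SearchResult> search(String query, int maxResults);
    }

    // Trivial in-memory stub so the sketch runs end to end.
    static class StubSearchService implements SearchService {
        public List<SearchResult> search(String query, int maxResults) {
            List<SearchResult> hits = List.of(
                new SearchResult("Password reset procedure", "kb://helpdesk/1042", 0.91),
                new SearchResult("VPN troubleshooting", "kb://helpdesk/2210", 0.67));
            return hits.subList(0, Math.min(hits.size(), maxResults));
        }
    }

    public static void main(String[] args) {
        SearchService service = new StubSearchService();
        // e.g. a help-desk application searching its knowledge base
        for (SearchResult r : service.search("password reset", 10)) {
            System.out.println(r.score() + "  " + r.title() + "  " + r.uri());
        }
    }
}
```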
OmniFind uses a crawler to spider through a company's online assets. Results are parsed into individual words and links, which are then reassembled into an index. This index fields the search queries.
The crawler is a highly configurable piece of software. Its underlying technology uses two mechanisms. The first combs through databases and extracts searchable data; the second searches through unstructured data, including e-mail archives, content management systems, and a variety of document files.
There is also the pure intranet crawler, configured to adjust its spidering dynamically. The crawler tracks how often documents or data change and computes how frequently certain venues need to be reindexed. Exclusion lists and mechanisms such as robots.txt files, which specify what can and cannot be accessed on given sites, can keep the crawler out of specific files and Web sites. You also have the option of identifying sites or resources that are difficult or slow to access, which can keep the crawl from overwhelming low-speed connections.
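A robots.txt file sits at a site's root and tells well-behaved crawlers which paths to skip. A minimal example (the paths are invented; Crawl-delay is a nonstandard extension that some crawlers honor to throttle requests):

```
User-agent: *
Disallow: /drafts/
Disallow: /private/
Crawl-delay: 10
```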
Data from the crawl is fed into the indexing engine, which relies on a specially configured, embedded instance of DB2. After parsing, categorizing, and weighting the data from the crawl, the engine generates a large index that becomes the file from which queries are answered.
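The parse-and-index step described above amounts to building an inverted index: each word maps back to the documents that contain it, so queries are answered from the index rather than from the raw documents. OmniFind's real index, built on an embedded DB2 instance, is far more elaborate; this sketch only shows the core idea.

```java
import java.util.*;

// Illustrative inverted index: documents are split into words, and the
// index maps each word to the set of documents containing it.
public class InvertedIndexSketch {

    private final Map<String, Set<String>> index = new HashMap<>();

    // Parse one crawled document into words and post them to the index.
    void addDocument(String docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Queries are fielded by the index, not by rescanning documents.
    Set<String> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument("memo-1", "Quarterly revenue report");
        idx.addDocument("memo-2", "Revenue forecast for Q3");
        System.out.println(idx.lookup("revenue")); // both memos match
    }
}
```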
The weighting algorithms are as important in enterprise searches as they are in Web search engines -- perhaps even more so, because enterprise users will often know that a specific document does exist somewhere but won't know exactly where. As a result, OmniFind uses algorithms that are distinct from those used by Web search engines. The latter depend heavily on the number of links pointing to a specific page to judge relevancy. Intranet sites, however, rarely link extensively to other intranet sites; they are often silos unto themselves, and so link counts are much less useful.
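When link counts are sparse, an engine must lean on content-based signals instead. A classic example is tf-idf weighting, which boosts terms that appear often in a document but rarely across the corpus. This is the standard textbook formula, not IBM's actual algorithm, and it uses naive substring matching purely for brevity.

```java
import java.util.*;

// Toy content-based relevancy weighting (tf-idf), the kind of signal an
// intranet engine relies on when link counts are unreliable.
// Not IBM's algorithm -- just the textbook formula for illustration.
public class TfIdfSketch {

    // score = tf * ln(N / df), where tf = occurrences of term in doc,
    // N = corpus size, df = number of docs containing the term.
    static double score(String term, String doc, List<String> corpus) {
        long tf = Arrays.stream(doc.toLowerCase().split("\\W+"))
                        .filter(w -> w.equals(term)).count();
        long df = corpus.stream()  // naive substring match, for brevity
                        .filter(d -> d.toLowerCase().contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
            "expense policy for travel",
            "travel booking guide",
            "payroll calendar");
        // "payroll" is rarer than "travel", so it scores higher where it appears
        System.out.println(score("travel", corpus.get(0), corpus));
        System.out.println(score("payroll", corpus.get(2), corpus));
    }
}
```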