InfoWorld

Exclusive: IBM enters enterprise search fray

WebSphere Information Integrator finds almost everything -- but slowly

By Andrew Binstock
July 04, 2005
 

With the proliferation of electronic documents and the archival pressures that various industry regulations have been exerting on companies, enterprise search has become an important IT requirement during the past few years. Many search solutions -- including search appliances and more-robust, federated search engines, such as those from IBM and Verity -- have come to market recently to meet the demand. Specialty products such as Vivísimo Velocity fill additional niches.


IBM WebSphere Information Integrator, OmniFind Edition, v. 8.2.1

IBM, ibm.com

Good  7.8
Criteria       Score   Weight
Ease-of-use    8       25%
Integration    7       20%
Management     8       15%
Performance    7       15%
Scalability    9       15%
Value          8       10%
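
The overall score of 7.8 is consistent with a weighted average of the six criteria. The sketch below assumes that is how the total is derived (the scorecard itself does not say so), and the names `SCORECARD` and `weighted_score` are illustrative only.

```python
# Criteria scores and percentage weights copied from the scorecard above.
SCORECARD = {
    "Ease-of-use": (8, 25),
    "Integration": (7, 20),
    "Management": (8, 15),
    "Performance": (7, 15),
    "Scalability": (9, 15),
    "Value": (8, 10),
}


def weighted_score(scorecard):
    """Return the weight-averaged score; weights are whole percentages."""
    total_weight = sum(weight for _, weight in scorecard.values())
    if total_weight != 100:
        raise ValueError("weights must sum to 100%")
    # Assumption: the overall score is sum(score * weight) / 100.
    return sum(score * weight for score, weight in scorecard.values()) / 100


print(weighted_score(SCORECARD))
```

Under that assumption the function reproduces the published 7.8.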

Cost:
$75,000 per CPU

Platforms:
AIX, Red Hat Linux, Suse Linux, Windows Server

Bottom Line:
IBM WebSphere Information Integrator OmniFind Edition v. 8.2.1 is a true enterprise search engine that federates data from intranets, e-mail repositories, and databases. Scalability and configurability are impressive — but certain common file types are not supported, and security features are incomplete.

I spent a day at IBM's San Jose, Calif., office testing an enterprise implementation of IBM's federated enterprise search engine, the recently renamed WebSphere Information Integrator, OmniFind Edition, v. 8.2. The product first rolled out late in 2004 under the DB2 database brand name.

OmniFind is a true enterprise-scale search engine that IBM itself uses to find items in its databases, e-mail archives, and its 10,000 Web sites. The product comes in one of two configurations: on a single server or on a four-way system comprising a crawler, a parser-indexer, and two redundant run times that provide client interface services.

Clients most commonly interact with OmniFind through a browser, but they can also do so through a Java API. The latter enables a department to query search results for specific items, with the results returned to the application as Java objects. The Java API is useful for handling custom software, such as a knowledge-base search facility embedded in a help-desk application.

OmniFind uses a crawler to spider through a company's online assets. Results are parsed into individual words and links, which are then reassembled into an index. This index fields the search queries.
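
The parse-then-index step can be pictured as a small inverted index. This is a generic sketch of the technique, not OmniFind's implementation; the function names and the sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawled pages, keyed by path.
docs = {
    "hr/policy.html": "Vacation policy for all employees",
    "it/vpn.html": "VPN setup guide for remote employees",
}
index = build_index(docs)
print(search(index, "vpn employees"))
```

Queries are answered from the index alone, so the original documents never need to be rescanned at search time.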

The crawler is a highly configurable piece of software. Its underlying technology uses two mechanisms. The first combs through databases and extracts searchable data; the second searches through unstructured data, including e-mail archives, content management systems, and a variety of document files.

There is also the pure intranet crawler, configured to adjust its spidering dynamically. The crawler tracks how often documents or data change and computes how frequently certain venues need to be reindexed. Exclusion lists and tools such as robots.txt files, which specify what can and cannot be accessed on given sites, can keep the crawler out of specific files and Web sites. You also have the option of identifying sites or resources that are difficult or slow to access, which can keep the crawl from overwhelming low-speed connections.
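
As a rough illustration of these two controls, the sketch below checks a URL against robots.txt rules with Python's standard `urllib.robotparser` and adjusts a recrawl interval according to whether a page changed. The halve-or-double rule, the interval limits, the crawler name, and the URLs are all assumptions made for the example; the review does not describe IBM's actual scheduling formula.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Assumed policy: revisit sooner after a change, later otherwise."""
    proposed = current_hours / 2 if changed else current_hours * 2
    # Clamp so a page is never crawled too often or forgotten entirely.
    return max(min_hours, min(max_hours, proposed))


# Hypothetical robots.txt served by an intranet site.
ROBOTS = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(ROBOTS, "example-crawler",
                        "http://intranet.example.com/private/salaries.html"))
print(next_interval(24.0, changed=True))
```

A production crawler would estimate change rates over many visits rather than reacting to a single one, but the clamped interval conveys the idea of spending crawl effort where content actually changes.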

Data from the crawl is fed into the indexing engine, which relies on a specially configured, embedded instance of DB2. After parsing, categorizing, and weighting the data from the crawl, the engine generates a large index that becomes the file from which queries are answered.

The weighting algorithms are as important in enterprise searches as they are on Web engines -- perhaps even more so because enterprise users will often know that a specific document does exist somewhere but won't know exactly where. As a result, OmniFind uses algorithms that are distinct from those used by Web search engines. The latter depend heavily on the number of links pointing to a specific page to judge relevancy. Intranets, however, are rarely linked extensively to other intranet sites; they are often silos unto themselves, and so link counts are much less useful.

Instead, OmniFind weights its searches with data such as how often a keyword appears in the page, whether it appears in the title or subheads, and how often it appears in anchor text. OmniFind also uses a dynamic mechanism that tracks how often previous searches on a specific keyword have resulted in clicks to a particular page. So, as more searches are performed, the quality of the ranking improves significantly.
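
A toy scoring function shows how such signals might be combined. The four signals come from the description above, but the linear formula, the weights, and the sample documents are assumptions for illustration, not IBM's algorithm.

```python
# Assumed weights; the review does not give OmniFind's actual values.
DEFAULT_WEIGHTS = {"body": 1.0, "title": 5.0, "anchor": 3.0, "click": 2.0}


def score(doc, keyword, clicks, weights=DEFAULT_WEIGHTS):
    """Combine keyword frequency, title, anchor text, and click history."""
    kw = keyword.lower()
    body_hits = doc["body"].lower().split().count(kw)
    title_hit = 1 if kw in doc["title"].lower().split() else 0
    anchor_hits = sum(1 for a in doc["anchors"] if kw in a.lower().split())
    # clicks maps (keyword, document id) to the number of past clicks.
    click_hits = clicks.get((kw, doc["id"]), 0)
    return (weights["body"] * body_hits
            + weights["title"] * title_hit
            + weights["anchor"] * anchor_hits
            + weights["click"] * click_hits)


def rank(docs, keyword, clicks):
    """Return document ids ordered from highest to lowest score."""
    ordered = sorted(docs, key=lambda d: score(d, keyword, clicks), reverse=True)
    return [d["id"] for d in ordered]


# Hypothetical intranet pages.
docs = [
    {"id": "a", "title": "Travel policy",
     "body": "travel expenses and travel booking", "anchors": ["travel policy"]},
    {"id": "b", "title": "Expense form",
     "body": "submit travel expenses here", "anchors": []},
]

print(rank(docs, "travel", clicks={}))
print(rank(docs, "travel", clicks={("travel", "b"): 5}))
```

With no click history, page "a" ranks first on content signals alone; once users have repeatedly clicked page "b" for the same keyword, it moves ahead, which mirrors how accumulated searches can improve ranking.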

Users have limited access to the ranking mechanism: They can specify links that must show up first for a given keyword, but they can't do much more to tweak rankings. This could prove a limitation for companies that have considerable material for a given keyword and want to make specific documents more salient. Indexing can also be administered so that reindexing can be scheduled when systems will be least affected.

The results display offers a broad range of ways to select search items. A keyword search is the base level. A user, however, can also ask for specific records or data items using an SQL-like query language. If the results derive from a database, they are shown in complete field detail.

OmniFind's security currently is coarse-grained. The display mechanism checks authorization levels before displaying data to make sure an employee is entitled to see a given result. Unfortunately, OmniFind lacks document-level security. Moreover, no mechanism exists to support an LDAP directory to automate access to an employee's credentials, although this feature is forthcoming.
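
A display-time check of this kind might look like the sketch below. It assumes access is granted per collection of documents, which is one plausible reading of "coarse-grained"; the review does not specify the granularity, and the collection names and fields are hypothetical.

```python
def visible_results(results, allowed_collections):
    """Keep only results whose collection the user is entitled to see."""
    return [r for r in results if r["collection"] in allowed_collections]


# Hypothetical search hits, each tagged with the collection it came from.
results = [
    {"title": "Q3 sales forecast", "collection": "finance"},
    {"title": "VPN setup guide", "collection": "it-docs"},
    {"title": "Salary bands", "collection": "hr-restricted"},
]

# Assumed per-user entitlement, granted for whole collections at a time.
employee_access = {"it-docs", "finance"}

shown = visible_results(results, employee_access)
print([r["title"] for r in shown])
```

Because the filter works on whole collections, it cannot hide one sensitive document inside an otherwise readable collection, which is the gap that document-level security would close.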

OmniFind is an impressive tool in terms of the sheer volume of data that it can federate. It is clearly designed for enterprise use and scales to handle huge amounts of data.

I was surprised, however, by some limitations. For example, the crawler doesn't open .zip or .tar files. Help files, which frequently contain a wealth of searchable information, are also skipped. 

Performance was hard to assess. IBM claims a minimum of 30 dps (documents per second) for crawling and the same rate for indexing, with bursts of 100 dps. My experience was that these numbers were aggressive: Indexing is gated by disk I/O and, in the demo I saw, it wasn't near 30 dps. The test system was not set up to simulate a true crawl -- as all the documents were local -- so crawl performance was more difficult to ascertain.
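
To put those rates in perspective, a quick calculation converts documents per second into wall-clock time. The one-million-document corpus is a hypothetical figure chosen for the example; only the 30 dps and 100 dps rates come from IBM's claims.

```python
def hours_to_process(num_documents, docs_per_second):
    """Return the hours needed to process a corpus at a steady rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return num_documents / docs_per_second / 3600


corpus = 1_000_000  # hypothetical corpus size

for rate in (30, 100):
    print(f"{rate} dps: {hours_to_process(corpus, rate):.1f} hours")
```

At the claimed minimum rate, a corpus of that size would take a little over nine hours per pass, so any shortfall in sustained indexing speed translates directly into longer reindexing windows.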

OmniFind is, for the most part, an easy-to-run, configurable, scalable, and intelligent enterprise search engine. However, the lack of document-level security, the absence of LDAP support, and the ignored file types suggest OmniFind's first release needs some tweaks.





 


 
Andrew Binstock is the principal analyst at Pacific Data Works. He previously was in charge of global technology forecasts at PricewaterhouseCoopers. Earlier, he was the editor in chief of UNIX Review.
 
