InfoWorld
STRATEGIC DEVELOPER  

IBM's new search framework and the blogosphere

With a little help, UIMA could be a boon to unstructured data retrieval

By Jon Udell  
August 17, 2005
 

You can find Irving Wladawsky-Berger’s fingerprints on most of IBM’s key initiatives: on-demand, open source, Linux, autonomic, and grid computing. So when he launched his blog in May, I became a charter subscriber.

There are not many of us yet -- not nearly as many as his thoughtful and informative posts should attract. Case in point: IBM’s plan to open source its UIMA (Unstructured Information Management Architecture) SDK. Wladawsky-Berger’s blog posting linked to every relevant item, including the software-download page, the press release, the blogosphere’s analysis of the announcement, and -- most crucially for me -- an issue of the IBM Systems Journal devoted entirely to UIMA’s architecture and applications.

The details are gnarly, but the general plan will look familiar to anyone who’s thinking in terms of service-oriented architecture. The UIMA software provides a framework for coordinating many different text analyzers. Each runs as a service that consumes and produces data in common formats. Applications are composed by declaratively combining sets of analyzers.
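
A minimal sketch of that general plan, in Python rather than UIMA's own SDK. Every name here (`Document`, `Annotation`, `REGISTRY`, `run_pipeline`, and the two toy analyzers) is hypothetical and chosen for illustration; none of it is UIMA's actual API. It shows only the shape of the idea: analyzers share one common data structure, and an application is a declared list of analyzer names.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    kind: str   # e.g. "person" or "organization"
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    text: str   # the matched span


@dataclass
class Document:
    """The common format every analyzer consumes and produces."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)


Analyzer = Callable[[Document], None]


def find_people(doc: Document) -> None:
    # Toy heuristic (an assumption): two capitalized words in a row are a name.
    for m in re.finditer(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", doc.text):
        doc.annotations.append(Annotation("person", m.start(), m.end(), m.group()))


def find_organizations(doc: Document) -> None:
    # Toy heuristic (an assumption): a fixed list of known company names.
    for m in re.finditer(r"\b(IBM|Dell|Microsoft)\b", doc.text):
        doc.annotations.append(Annotation("organization", m.start(), m.end(), m.group()))


# Analyzers are registered by name so an application can refer to them declaratively.
REGISTRY: Dict[str, Analyzer] = {
    "person": find_people,
    "organization": find_organizations,
}


def run_pipeline(spec: List[str], text: str) -> Document:
    """Run the named analyzers in order over a single shared document."""
    doc = Document(text)
    for name in spec:
        REGISTRY[name](doc)
    return doc


doc = run_pipeline(["person", "organization"], "Jane Smith joined IBM last year.")
for a in doc.annotations:
    print(a.kind, a.text)
```

The application is just the list `["person", "organization"]`. Swapping or adding an analyzer changes the list, not the code that runs it.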

Using a potpourri of technologies, these analyzers pore through unstructured text looking for named entities (people, places, companies, or products, for example) and relationships among them. Then the analyzers tag these entities to enable structured search. Queries are XML fragments that can nest entities, such as “person” and “organization”, inside relationships, such as “president_of”.
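
To make the query idea concrete, here is a small sketch under stated assumptions. The XML shape below is invented for illustration and is not UIMA's real query syntax; the `RELATIONS` records, the company and person names, and the `search` function are likewise hypothetical. The point is only that once entities and relationships are tagged, a nested fragment can act as a structured pattern.

```python
import xml.etree.ElementTree as ET

# Pretend output of the analysis phase: relationships between tagged entities.
RELATIONS = [
    {"relation": "president_of", "person": "Jane Smith", "organization": "Example Corp"},
    {"relation": "president_of", "person": "Raj Patel", "organization": "Acme Widgets"},
    {"relation": "employee_of", "person": "Lee Wong", "organization": "Example Corp"},
]

# Hypothetical query: entities nested inside a relationship.
# An empty element means "any value"; element text means "must equal this".
QUERY = "<president_of><person/><organization>Example Corp</organization></president_of>"


def search(query_xml, relations):
    """Return the relation records that match the nested XML pattern."""
    root = ET.fromstring(query_xml)
    constraints = {child.tag: (child.text or "").strip() for child in root}
    hits = []
    for rel in relations:
        if rel["relation"] != root.tag:
            continue  # wrong kind of relationship
        if all(value == "" or rel.get(tag) == value
               for tag, value in constraints.items()):
            hits.append(rel)
    return hits


for hit in search(QUERY, RELATIONS):
    print(hit["person"], "is president of", hit["organization"])
```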

If such tagging were already present in the document, or linked to it by way of an external tagging service, you could skip the rocket-science analysis phase and proceed directly to the query endgame.

Yeah, sure, and if pigs had wings they could fly. UIMA quite reasonably assumes that people cannot and will not compose texts using semantic markup to denote entities and relations. It also assumes that the semantic clues we can find on the public Web -- thanks to linking and, more recently, social tagging -- won’t be as available in the enterprise, given its vastly smaller scale and complex security regime.

The more machine analysis we can do, the better. But we should also keep looking for ways to extract the semantic metadata that people carry around in their heads. As blogging begins to play a greater role in enterprise knowledge management, two strategies will present themselves.

First, there’s social tagging. It’s true that the Web dwarfs the enterprise, but people who use social tagging services form small communities around specific tags. Maybe such communities can flourish at enterprise scale.

The second strategy is microformats. The idea here is that your blogging tool should make it easy to post items that contain nuggets of structure. Examples on the public Web include calendar events and book reviews. In the enterprise, the nuggets would be things like meetings and status reports. People won’t know that these nuggets are embedded as XML fragments within their blog postings. They’ll just appreciate having an easy way to create styled elements, and an easy way to find them later.
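
As a rough sketch of what finding such a nugget later might involve, the Python below pulls a meeting out of a blog post. It assumes the post is well-formed XHTML and that the blogging tool marked up the event with the hCalendar class names `vevent`, `summary`, `dtstart`, and `location`; the sample post and the `extract_events` function are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up blog posting with one embedded event nugget.
POST = """<div class="post">
  <p>Notes from this week.</p>
  <div class="vevent">
    <span class="summary">Search team status meeting</span>
    <abbr class="dtstart" title="2005-08-24T10:00">Aug 24, 10am</abbr>
    <span class="location">Room 4B</span>
  </div>
</div>"""


def extract_events(xhtml):
    """Collect each element classed 'vevent' as a dict of its labeled parts."""
    root = ET.fromstring(xhtml)
    events = []
    for node in root.iter():
        if "vevent" not in node.get("class", "").split():
            continue
        event = {}
        for child in node.iter():
            classes = child.get("class", "").split()
            text = (child.text or "").strip()
            if "summary" in classes:
                event["summary"] = text
            elif "dtstart" in classes:
                # Prefer the machine-readable title attribute when present.
                event["dtstart"] = child.get("title") or text
            elif "location" in classes:
                event["location"] = text
        events.append(event)
    return events


print(extract_events(POST))
```

The writer sees only a styled meeting block. The structure rides along in the class names, which is what makes the later search cheap.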

Will Irving Wladawsky-Berger connect the dots between UIMA and the blogosphere? If he does, I’ll be one of the first to read about it.





 


 
Jon Udell is lead analyst and blogger in chief at the InfoWorld Test Center.
