August 17, 2005

IBM's new search framework and the blogosphere

With a little help, UIMA could be a boon to unstructured data retrieval

You can find Irving Wladawsky-Berger’s fingerprints on most of IBM’s key initiatives: on-demand, open source, Linux, autonomic, and grid computing. So when he launched his blog in May, I became a charter subscriber.

There are not many of us yet -- not nearly as many as his thoughtful and informative posts should attract. Case in point: IBM’s plan to open source its UIMA (Unstructured Information Management Architecture) SDK. Wladawsky-Berger’s blog posting linked to every relevant item, including the software-download page, the press release, the blogosphere’s analysis of the announcement, and -- most crucially for me -- an issue of the IBM Systems Journal devoted entirely to UIMA’s architecture and applications.

The details are gnarly, but the general plan will look familiar to anyone who’s thinking in terms of service-oriented architecture. The UIMA software provides a framework for coordinating many different text analyzers. Each runs as a service that consumes and produces data in common formats. Applications are composed by declaratively combining sets of analyzers.
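To make the shape of that plan concrete, here is a minimal Python sketch. It is not the UIMA SDK's API. The `Document` and `Annotation` types, the two toy regex analyzers, and the `PIPELINE` list are all hypothetical stand-ins. The sketch only illustrates the idea that analyzers share one data format and that an application is a declared list of them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # entity type, e.g. "person"
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    text: str   # the matched text itself


@dataclass
class Document:
    # The common format every analyzer consumes and produces (hypothetical).
    text: str
    annotations: list = field(default_factory=list)


def person_analyzer(doc):
    # Toy rule: a title followed by a capitalized surname.
    for m in re.finditer(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+", doc.text):
        doc.annotations.append(Annotation("person", m.start(), m.end(), m.group()))
    return doc


def organization_analyzer(doc):
    # Toy rule: a capitalized word followed by "Inc" or "Corp".
    for m in re.finditer(r"\b[A-Z][A-Za-z]+ (?:Inc|Corp)\b", doc.text):
        doc.annotations.append(Annotation("organization", m.start(), m.end(), m.group()))
    return doc


# Analyzers are looked up by name, so a pipeline is just data.
REGISTRY = {"person": person_analyzer, "organization": organization_analyzer}

# The "declarative" part: the application is a list of analyzer names.
PIPELINE = ["person", "organization"]


def run_pipeline(text, steps=PIPELINE):
    doc = Document(text)
    for name in steps:
        doc = REGISTRY[name](doc)
    return doc


doc = run_pipeline("Dr. Smith joined Acme Corp last year.")
for a in doc.annotations:
    print(a.kind, a.text)
```

Swapping in a better person detector means registering a new function under the same name. Nothing downstream has to change, which is presumably the appeal of the service-style design.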

Using a potpourri of technologies, these analyzers pore through unstructured text looking for named entities (people, places, companies, or products, for example) and relationships among them. Then the analyzers tag these entities to enable structured search. Queries are XML fragments that can nest entities, such as “person” and “organization”, inside relationships, such as “president_of”.
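What might such a query look like? The sketch below is only a guess at the general shape, written with Python's standard `xml.etree.ElementTree`. The exact UIMA query syntax is not shown here, and the tagged document, the names in it, and the `matches` and `search` helpers are invented for illustration. An empty element in the query acts as a wildcard, and an element with text must match exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical query: "which person is president_of the organization IBM?"
query = ET.fromstring(
    "<president_of><person/><organization>IBM</organization></president_of>"
)

# Hypothetical output of the analysis phase: text with entities and
# relationships tagged inline. The people named here are made up.
tagged = ET.fromstring(
    "<doc>"
    "<president_of><person>Jane Doe</person> is president of "
    "<organization>IBM</organization></president_of>. "
    "<president_of><person>John Roe</person> leads "
    "<organization>Acme</organization></president_of>."
    "</doc>"
)


def matches(q, candidate):
    # Tags must agree; query text, if present, must equal the candidate's.
    if q.tag != candidate.tag:
        return False
    wanted = (q.text or "").strip()
    if wanted and wanted != (candidate.text or "").strip():
        return False
    # Every child of the query must match some child of the candidate.
    return all(any(matches(qc, cc) for cc in candidate) for qc in q)


def search(q, root):
    return [el for el in root.iter(q.tag) if matches(q, el)]


for hit in search(query, tagged):
    print(hit.find("person").text)
```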

If such tagging were already present in the document, or linked to it by way of an external tagging service, you could skip the rocket-science analysis phase and proceed directly to the query endgame.

Yeah, sure, and if pigs had wings they could fly. UIMA quite reasonably assumes that people cannot and will not compose texts using semantic markup to denote entities and relations. It also assumes that the semantic clues we can find on the public Web -- thanks to linking and, more recently, social tagging -- won’t be as available in the enterprise, given its vastly smaller scale and complex security regime.

The more machine analysis we can do, the better. But we should also keep looking for ways to extract the semantic metadata that people carry around in their heads. As blogging begins to play a greater role in enterprise knowledge management, two strategies will present themselves.

First, there’s social tagging. It’s true that the Web dwarfs the enterprise, but people who use social tagging services form small communities around specific tags. Maybe such communities can flourish at enterprise scale.

The second strategy is microformats. The idea here is that your blogging tool should make it easy to post items that contain nuggets of structure. Examples on the public Web include calendar events and book reviews. In the enterprise, the nuggets would be things like meetings and status reports. People won’t know that these nuggets are embedded as XML fragments within their blog postings. They’ll just appreciate having an easy way to create styled elements, and an easy way to find them later.
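Here is a rough sketch of how a search tool might pull such a nugget back out of a posting. The markup is hypothetical: the `meeting`, `summary`, `when`, and `where` class names are assumptions chosen for illustration, not an established microformat, and the post is assumed to be well-formed XHTML so that Python's standard XML parser can read it.

```python
import xml.etree.ElementTree as ET

# A hypothetical blog posting. The author sees a styled "meeting" box;
# the markup underneath carries the structure.
POST = """<div class="post">
  <p>Notes from this week.</p>
  <div class="meeting">
    <span class="summary">UIMA pilot review</span>
    <span class="when">2005-08-24T10:00</span>
    <span class="where">Room 4B</span>
  </div>
  <p>More later.</p>
</div>"""


def extract_nuggets(xhtml, kind):
    # Return one dict per <div class="kind">, keyed by each child's class.
    root = ET.fromstring(xhtml)
    nuggets = []
    for el in root.iter("div"):
        if el.get("class") == kind:
            nuggets.append(
                {
                    child.get("class"): (child.text or "").strip()
                    for child in el
                    if child.get("class")
                }
            )
    return nuggets


print(extract_nuggets(POST, "meeting"))
```

The point of the sketch is that no analysis phase is needed. The structure was put there when the item was written, so finding every meeting is a simple lookup.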

Will Irving Wladawsky-Berger connect the dots between UIMA and the blogosphere? If he does, I’ll be one of the first to read about it.


©1994-2009 Infoworld, Inc.