January 30, 2004

Content-aware searching

Adding a little structure to HTML content elicits a knowledge management payoff

When I’m deeply engrossed in R&D, as I have been lately, I can become obsessed. So take what I’m about to say with a grain of salt, but I really think I’m on to something: content-aware search.

Way back in the last century, during XML’s formative years, an oft-heard argument for XML was that it would enable smarter Web searching. Well, it didn’t. One reason we don’t have Web searching that exploits structured content is that we never got ubiquitous and easy-to-use writing tools to create well-formed XML content. So there aren’t many pools that can be plumbed with XML-aware search technology.

Another reason is that Google is good enough. At InfoWorld’s 2002 CTO Forum, Google co-founder Sergey Brin threw cold water on the idea of instrumenting content for intelligent search. "I’d rather make progress by having computers understand what humans write," he said, "than by forcing humans to write in ways that computers can understand."

Brin’s pragmatic stance sharply opposes the idealistic view of the Web’s inventor, Tim Berners-Lee, who continues to evangelize his vision of a Semantic Web full of carefully encoded content that we can precisely search and fluidly recombine. My own humble contribution to this debate is a prototype search engine, now running on my Weblog, that tries to steer a middle course between the Scylla of simple full-text search and the Charybdis of unwieldy tagging schemes and brittle ontologies.

The twin enablers of this prototype are XHTML content and XPath search. Because I maintain my Weblog’s content as XHTML (HTML’s well-formed cousin), I can query it using XPath patterns. That means I can answer questions that I can’t answer with ordinary full-text search. Some examples of things I can find this way: paragraphs with links to book-related Web sites, tables with more than five rows of data, and articles with references to audio or video clips.
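To make those three examples concrete, here is a minimal sketch of how such queries might look in Python. It is not the prototype’s actual code. The sample entry, its URLs, and the function names are all invented for illustration. The standard library’s ElementTree supports only a subset of XPath, so each function shows the full XPath 1.0 expression in a comment and does the `contains()` and `count()` parts in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML entry; the markup and URLs are invented for illustration.
XHTML = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>See the <a href="http://books.example.com/isbn/123">book page</a>.</p>
    <p>No links here.</p>
    <table>
      <tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr>
      <tr><td>4</td></tr><tr><td>5</td></tr><tr><td>6</td></tr>
    </table>
    <table><tr><td>only row</td></tr></table>
    <p>Listen to the <a href="http://media.example.com/talk.mp3">clip</a>.</p>
  </body>
</html>
"""

# XHTML elements live in this namespace, so queries need a prefix for it.
NS = {"x": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)


def paragraphs_linking_to(root, fragment):
    """Paragraphs containing a link whose URL includes `fragment`.

    Full XPath 1.0: //x:p[.//x:a[contains(@href, 'books')]]
    """
    return [
        p for p in root.iterfind(".//x:p", NS)
        if any(fragment in a.get("href", "") for a in p.iterfind(".//x:a", NS))
    ]


def tables_with_more_rows_than(root, n):
    """Tables with more than `n` rows.

    Full XPath 1.0: //x:table[count(.//x:tr) > 5]
    """
    return [
        t for t in root.iterfind(".//x:table", NS)
        if len(t.findall(".//x:tr", NS)) > n
    ]


def has_media_link(root, extensions=(".mp3", ".wav", ".mov", ".mpg")):
    """True if any link points at a file with an audio or video extension.

    The extension list is an assumption, not a complete catalog.
    """
    return any(
        a.get("href", "").lower().endswith(extensions)
        for a in root.iterfind(".//x:a[@href]", NS)
    )


print(len(paragraphs_linking_to(root, "books")))
print(len(tables_with_more_rows_than(root, 5)))
print(has_media_link(root))
```

An engine with full XPath 1.0 support could evaluate the expressions in the comments directly. Either way, the point is that nothing in the sample markup was written with search in mind.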

These queries don’t depend on any special HTML coding. They require only that the HTML be well-formed XHTML. Of course, the vast majority of published HTML isn’t well-formed. Does that make this approach a non-starter for most repositories of Web content?

Not necessarily. The next phase of my experiment involves converting the 200 or so Weblogs that I scan every day in my RSS feed reader from HTML to XHTML. Early indications are that this will work reasonably well.
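One way such a conversion could work is sketched below, using only Python’s standard library. It is an illustration under stated assumptions, not the converter used in the experiment: the class name `XhtmlWriter`, the function `to_xhtml`, the short list of void elements, and the choice to wrap each fragment in a `<div>` root are all hypothetical. Real-world tag soup has many more failure modes than this handles.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have content; a partial list, assumed for this sketch.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class XhtmlWriter(HTMLParser):
    """Re-emits sloppy HTML as well-formed markup (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of output markup
        self.open_tags = []  # stack of elements still awaiting a close tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names.
        # dict() drops duplicate attributes; valueless ones get name="name".
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in dict(attrs).items()
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")
            return
        # A new <p> or <li> implicitly closes one that is still open.
        if tag in ("p", "li") and self.open_tags and self.open_tags[-1] == tag:
            self.handle_endtag(tag)
        self.out.append(f"<{tag}{attr_text}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag: drop it
        # Close anything left open inside this element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # escapes &, <, and >


def to_xhtml(fragment):
    """Convert an HTML fragment to a single-rooted, well-formed string."""
    writer = XhtmlWriter()
    writer.feed(fragment)
    writer.close()  # flush any buffered trailing text
    while writer.open_tags:
        writer.out.append(f"</{writer.open_tags.pop()}>")
    return "<div>" + "".join(writer.out) + "</div>"


SAMPLE = "<P>First<br>line <a href=http://example.com/a>link</A><p>Second <b>Tom & Jerry"
converted = to_xhtml(SAMPLE)
tree = ET.fromstring(converted)  # raises ParseError if not well-formed
print(converted)
```

The sample input has uppercase tags, an unquoted attribute, an unclosed `<br>`, unclosed `<p>` and `<b>` elements, and a bare ampersand. After conversion, an XML parser accepts it, which is all that XPath search requires.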

Remember, the pools of HTML content that your people routinely create, and the infinitely vaster pools to which they have access, are full of intrinsic metadata — including the links, tables, images, and other elements that occur naturally within HTML content. Mining that metadata may be more practical than you think.

By implementing content-aware search against existing repositories, you can show people the tangible benefits of more expressive content. Months ago I began writing Weblog entries that identify the sources of quotations, and the programming languages in which included code snippets are written. This was a promise to the future. I knew that I’d later be able to find these things very precisely, and now I can. But most people live in the present. For any extra effort, however modest, they quite rightly expect an immediate payback. Content-aware search is a great way to reward such effort.
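As a rough sketch of that payback, suppose quotation sources are recorded in the `cite` attribute of `blockquote` and a snippet’s language in the `class` attribute of `pre`. These conventions are assumptions made for this example (the article does not say which markup was used), and the sample entry is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; the conventions and the URL are invented for illustration.
ENTRY = """\
<div>
  <blockquote cite="http://example.com/forum-talk">
    <p>A quoted remark.</p>
  </blockquote>
  <pre class="python">print("hello")</pre>
  <pre class="perl">print "hello";</pre>
  <blockquote><p>An unattributed quotation.</p></blockquote>
</div>
"""

entry = ET.fromstring(ENTRY)

# Sources of every quotation that names one.
quote_sources = [q.get("cite") for q in entry.iterfind(".//blockquote[@cite]")]

# Every code snippet labeled as Python.
python_snippets = [pre.text for pre in entry.iterfind(".//pre[@class='python']")]

print(quote_sources)
print(python_snippets)
```

The extra effort is one attribute per quotation or snippet, and each becomes precisely findable as soon as the entry is published.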

©1994-2009 Infoworld, Inc.