Free Newsletters
Technology & Business Daily

InfoWorld
Log-in | Register
STRATEGIC DEVELOPER  

Content-aware searching

Adding a little structure to HTML content elicits a knowledge management payoff

By Jon Udell  
January 30, 2004
 

When I'm deeply engrossed in R&D, as I have been lately, I can become obsessed. So take what I’m about to say with a grain of salt, but I really think I’m on to something — namely, content-aware search.

Free IT resource

Open Source Business Conference (OSBC) May 22-23, 2007

Sponsored by OSBC

Free IT resource

Virtualization Insights from Top Experts - Learn how virtualization gets real!

Sponsored by Dell

Way back in the last century, during XML’s formative years, an oft-heard argument for XML was that it would enable smarter Web searching. Well, it didn’t. One reason we don’t have Web searching that exploits structured content is that we never got ubiquitous and easy-to-use writing tools to create well-formed XML content. So there aren’t many pools that can be plumbed with XML-aware search technology.

Another reason is that Google is good enough. At InfoWorld’s 2002 CTO Forum, Google co-founder Sergey Brin threw cold water on the idea of instrumenting content for intelligent search. "I’d rather make progress by having computers understand what humans write," he said, "than by forcing humans to write in ways that computers can understand."

Brin’s pragmatic stance sharply opposes the idealistic view of the Web’s inventor, Tim Berners-Lee, who continues to evangelize his vision of a Semantic Web full of carefully encoded content that we can precisely search and fluidly recombine. My own humble contribution to this debate is a prototype search engine, now running on my Weblog, that tries to steer a middle course between the Scylla of simple fulltext search and the Charybdis of unwieldy tagging schemes and brittle ontologies. The twin enablers of this prototype are XHTML content and XPath search. Because I maintain my Weblog’s content as XHTML — HTML’s well-formed cousin — I can query it using XPath patterns. That means I can answer questions that I can’t answer with ordinary full-text search. Some examples of things I can find this way: paragraphs with links to book-related Web sites, tables with more than five rows of data, and articles with references to audio or video clips.

These queries don’t depend on any special HTML coding. They require only that the HTML be well-formed XHTML. Of course, the vast majority of published HTML isn’t well-formed. Does that make this approach a non-starter for most repositories of Web content?

Not necessarily. The next phase of my experiment involves converting the 200 or so Weblogs I scan everyday, using my RSS feedreader, from HTML to XHTML. Early indications are that this will work reasonably well.

Remember, the pools of HTML content that your people routinely create, and the infinitely vaster pools to which they have access, are full of intrinsic metadata — including the links, tables, images, and other elements that occur naturally within HTML content. Mining that metadata may be more practical than you think.

By implementing content-aware search against existing repositories, you can show people the tangible benefits of more expressive content. Months ago I began writing Weblog entries that identify the sources of quotations, and the programming languages in which included code snippets are written. This was a promise to the future. I knew that I’d later be able to find these things very precisely, and now I can. But most people live in the present. For any extra effort, however modest, they quite rightly expect an immediate payback. Content-aware search is a great way to reward such effort.





 


 
Jon Udell is lead analyst and blogger in chief at the InfoWorld Test Center.

  More of Jon Udell's column
  Jon Udell's Weblog

Newsletter Check out all of our free newsletters!
Enter e-mail address:




 

TOP NEWS:


»  Four quick tips for choosing an IM security product
71 percent of businesses will invest in real-time messaging this year. If you're one of them, be sure to protect your enterprise

»  Forrester analysts ID hot IT jobs
Research group finds 16 IT roles with a promising future

»  Nvidia claims 10 hours of HD video on Tegra chip
The Tegra 600 and 650 can be used with hard disk drives and are designed partly for mobile Internet devices

»  Database vendors add Google's MapReduce
Greenplum and Aster Data Systems will support Google's programming technique, developed for parallel processing of large data sets across commodity hardware

»  Network management: Tips for managing costs
New technologies, changing requirements, and ongoing equipment maintenance and upgrades cost money, but there are ways to manage expenses

»  EMC targets SMBs, branch offices with new low-end storage
Celerra NX4 highlights include thin provisioning, snapshot technology for data recovery and backups, and Web-based console for management of storage volumes




Remote Access: Maintain Security and Decrease the Burden on IT
Join this interactive webcast to discover how IT Managers can control access rights, end-user security settings and end-point authorization. Sponsor: Citrix(R) GoToMyPC(R) Corporate

»  Click here to view this Webcast
  Planning For A Disaster
This new, comprehensive Solutions Guide is your one stop source for Disaster Recovery. In it you'll learn how to reduce the likelihood of a disaster and to create a rock solid business continuity plan should you face a disaster situation. Sponsored by Equallogic

»  Click here to download now

- Special Advertising Partners -
WHITE PAPERS
 

» Technology White Papers Library

Technology White Papers by Topic

Technology White Papers E-mail Alert

Find out when the latest white paper is available:
 
 
INFOWORLD MARKETPLACE
 
» BUY A LINK NOW
 

FIND PRODUCTS AND COMPANIES
» COMPLETE PRODUCT GUIDE



TECHNOLOGY INDEX
• Applications
• Application Development
• Security
• Networking
• Wireless
• Platforms
• Hardware
• Data Management
• Storage
• Web Services
• Business
• Telecom
• Professional Services
• Standards

TECH WATCH 


What's the 411 on GOOG-411?
Just as Google has become synonymous with "performing a Web search," 411 is understood to mean "information" -- as in "what's the 411?" I was thus surprised to discover, from a billboard, no less, that the king of search is taking on the ...

Apple HTML source reveals 'iPhone Extreme'
"This one's a stretch..." reports AppleInsider. Um, yeah. Reporting on HTML code sightings of product names could be called a stretch, but iPhone Extreme has a ring to it. Now, that sounds like the product Apple should have released first, rather ...

COLUMNISTS

Unified under law
Ephraim Schwartz's Column and Blog (InfoWorld) - In the litigious world we live in, deploying a unified communications platform in your enterprise could...
» MORE COLUMNISTS

MORE INFOWORLD BLOGS


Open Sources 
Product Management
When I joined MySQL four years ago, there was quite a lot of debate about product management. We didn't actually have ...

Zero Day 
Botnet herders tending smaller flocks
New research backs up the theory that botnet operators are keeping their networks smaller in a continued effort to keep ...



• Advice Line
• Database Underground
• The Deep End
• Enterprise Mac
• Geeks in Paradise
• Grid Meter
• The Gripe Line
• InfoWorld Daily
• Inside IT
• IT Troubleshooter
• ITXtreme
• Open Sources
• ProdBlog
• Real World SOA
• Reality Check
• Security Adviser
• SMB IT
• The Storage Network
• Tech Watch
• Virtualization Report
• Zero Day

ADVERTISEMENT


RESOURCE CENTERadvertisement 

GOVERNMENT IT & POLICY
'If you don't go after the network, you're never going to stop these guys. Never.'
From the State Department, All the News for Inquiring Minds
TechPresident, the Internet Citizenry's New Consensus Taker



Sponsored Technology Links

 
 
 HOME  NEWS  BLOGS  PODCASTS  VIDEOS  TECHNOLOGIES  TEST CENTER  EVENTS  CAREERS   About | Advertise | Awards | RSS | Contact Us 

Copyright © 2008, Reprints, Permissions, Licensing, IDG Network, Privacy Policy, Terms of Service.
All Rights reserved. InfoWorld is a leading publisher of technology information and product reviews on topics including viruses,
phishing, worms, firewalls, security, servers, storage, networking, wireless, databases, and web services.

CIO :: ComputerWorld :: CSO :: Demo :: GamePro :: Games.net :: IDG Connect :: IDG World Expo
Industry Standard :: IT World :: JavaWorld :: LinuxWorld :: MacUser :: Macworld :: Network World :: PC World :: Playlist