STRATEGIC DEVELOPER  

Implementing real-world structured searches

Mixing tags with free-text search can bring the promise of XML that much closer to reality

By Jon Udell  
February 25, 2005
 

In the early days of XML, smart search was often cited as a key benefit. Instead of just trawling for single-celled keywords in an ocean of undifferentiated text, the story went, we'd navigate islands of structure looking for more evolved creatures. Product descriptions, calendar events, and media objects are all examples of the kinds of things we were meant to be finding by now.


That vision hasn't materialized yet, but I'm not ready to give up on the idea. A year ago I wrote about my efforts to chart "a middle course between the Scylla of simple full-text search and the Charybdis of unwieldy tagging schemes and brittle ontologies." The Scylla of this myth was Google's Sergey Brin, and the Charybdis was the W3C's Tim Berners-Lee. Between Brin's "we don't need no stinking structure" and Berners-Lee's "wrap everything in RDF (Resource Description Framework) and OWL (Web Ontology Language)," there is a vast, fertile middle ground awaiting discovery.

For example, the current craze for tagging things -- Flickr photos, and del.icio.us and Furl URLs -- shows that people are more likely than you'd guess to add structure to content. Under what conditions will they make the effort? First, tagging must be easy -- a two-second no-brainer. Second, it must deliver both instant gratification and longer-term value to the person doing the tagging. Third and most important, it must occur in a shared context so that network effects can kick in.

Of course, some tags are implicitly woven into the fabric of our content. Consider, for example, the recent Demo conference in Scottsdale, Ariz. As information about the event flowed into the blogosphere, a likely tag to hang on conference-related items would have been the distinctive name Demo@15. And sure enough, that tag was used on both Flickr and del.icio.us, although by only one person. (Hint to conference planners: If you want the blogosphere to synchronize its coverage of your event, pick a tag and promote it.)

But there are also implicit tags -- namely links -- that identify items about the conference, and a new service I built this week is helping me find them. After Jason Hunter showed me Mark Logic's XQuery-based XML database, Content Interaction Server, in a screencast, I set up an instance of it and began pumping in the RSS feeds of all the blogs I read. Then I wrote a query that combines free-text search for items containing the strings "Demo" or "Demo@15" with structured search for items that contain links to demo.com. It yielded a nice list of Demo-related items that I couldn't have built any other way.

The service works by converting the HTML content of my feeds into well-formed XHTML, storing it in the Mark Logic database, and then using the XQuery engine to perform hybrid free-text and structured searches. Although the vocabulary of XHTML is not very rich, certain elements -- notably links -- carry a latent semantic payload.
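
As a rough illustration of the query step, here is a minimal Python sketch of a hybrid search over feed items that are already well-formed XHTML. It is not the Mark Logic XQuery the column describes; it swaps in Python's standard library as a stand-in. The sample items, the function names, and the choice to require both conditions at once (text match AND link match) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical feed items: (title, well-formed XHTML body). Real items would
# come from RSS feeds after an HTML-to-XHTML cleanup step, which is not shown.
ITEMS = [
    ("Notes from Scottsdale",
     '<div><p>Best of Demo@15 so far: see the '
     '<a href="http://www.demo.com/demonstrators/">demonstrators list</a>.</p></div>'),
    ("Audio tools",
     '<div><p>I recorded a demo track with '
     '<a href="http://example.org/audio">this tool</a>.</p></div>'),
    ("Conference season",
     '<div><p>Slides are on <a href="http://demo.com/slides">the event site</a>.</p></div>'),
]


def mentions(root, terms):
    """Free-text part: case-sensitive substring match on the element's text."""
    text = "".join(root.itertext())
    return any(term in text for term in terms)


def links_to(root, domain):
    """Structured part: is there an <a> whose href points at the domain?"""
    for el in root.iter():
        # Strip an XML namespace prefix, if any, before comparing tag names.
        if el.tag.rsplit("}", 1)[-1] != "a":
            continue
        host = urlparse(el.get("href", "")).hostname or ""
        if host == domain or host.endswith("." + domain):
            return True
    return False


def hybrid_search(items, terms, domain):
    """Return titles of items that match the text terms and link to the domain."""
    hits = []
    for title, xhtml in items:
        root = ET.fromstring(xhtml)
        if mentions(root, terms) and links_to(root, domain):
            hits.append(title)
    return hits


print(hybrid_search(ITEMS, ["Demo", "Demo@15"], "demo.com"))
```

Only the first sample item satisfies both conditions: the second mentions a lowercase "demo" and links elsewhere, and the third links to demo.com without naming the event. Replacing `and` with `or` in `hybrid_search` would widen the net to items that satisfy either condition.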

It's also possible to enrich the semantic payload of blog content, and on my own blog I've been doing that for a while. Using my XPath query service, you can easily find quotes by Ward Cunningham, Python code fragments, and a number of other things I'm marking with simple CSS class attributes. Can these ad hoc syntaxes be collaboratively extended? If we can get structured search working for the whole blogosphere, we'll find out.
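
To make the idea concrete, here is a small Python sketch of querying content that has been marked up this way. It assumes the marking is done with `class` attributes on ordinary XHTML elements. The class names ("quote", "ward-cunningham", "code", "python"), the sample entry, and the `by_class` helper are all hypothetical, and the standard library's limited XPath support stands in for a full XPath query service.

```python
import xml.etree.ElementTree as ET

# Hypothetical blog entry; the class names are invented for illustration.
ENTRY = """
<div>
  <p class="quote ward-cunningham">Placeholder quotation.</p>
  <pre class="code python">print("hello")</pre>
  <p>An ordinary paragraph.</p>
  <p class="quote">Another placeholder quotation.</p>
</div>
"""

root = ET.fromstring(ENTRY.strip())


def by_class(root, *wanted):
    """Return elements whose class attribute contains every wanted token."""
    needed = set(wanted)
    return [el for el in root.iter()
            if needed <= set(el.get("class", "").split())]


# Every quotation, whoever said it.
print([el.text for el in by_class(root, "quote")])

# Quotations attributed to one (hypothetical) source tag.
print([el.text for el in by_class(root, "quote", "ward-cunningham")])

# Python code fragments.
print([el.text for el in by_class(root, "code", "python")])

# ElementTree's built-in XPath subset only does exact attribute matches,
# so this finds elements whose class is exactly "quote" and nothing more.
print(len(root.findall(".//p[@class='quote']")))
```

Splitting the attribute into tokens lets one element carry several markers at once, which is what makes a combined query such as "quotes from a particular person" possible without inventing new elements.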





 


 
Jon Udell is lead analyst and blogger in chief at the InfoWorld Test Center.
