February 25, 2005

Implementing real-world structured searches

Mixing tags with free-text search can bring the promise of XML that much closer to reality

In the early days of XML, smart search was often cited as a key benefit. Instead of just trawling for single-celled keywords in an ocean of undifferentiated text, the story went, we'd navigate islands of structure looking for more evolved creatures. Product descriptions, calendar events, and media objects are all examples of the kinds of things we were meant to be finding by now.

That vision hasn't materialized yet, but I'm not ready to give up on the idea. A year ago I wrote about my efforts to chart "a middle course between the Scylla of simple full-text search and the Charybdis of unwieldy tagging schemes and brittle ontologies." The Scylla of this myth was Google's Sergey Brin, and the Charybis was the W3C's Tim Berners-Lee. Between Brin's "we don't need no stinking structure" and Berners-Lee's "wrap everything in RDF (Resource Description Framework) and OWL (Web Ontology Language)," there is a vast, fertile middle ground awaiting discovery.

For example, the current craze for tagging things -- Flickr photos, del.icio.us, and Furl URLs -- shows that people are more likely than you'd guess to add structure to content. Under what conditions will they make the effort? First, tagging must be easy -- a two-second no-brainer. Second, it must deliver both instant gratification and longer-term value to the person doing the tagging. Third and most important, it must occur in a shared context so that network effects can kick in.

Of course, some tags are implicitly woven into the fabric of our content. Consider, for example, the recent Demo conference in Scottsdale, Ariz. As information about the event flowed into the blogosphere, a likely tag to hang on conference-related items would have been the distinctive name Demo@15. And sure enough, that tag was used on both Flickr and del.icio.us, although by only one person. (Hint to conference planners: If you want the blogosphere to synchronize its coverage of your event, pick a tag and promote it.)

But there are also implicit tags -- namely links -- that identify items about the conference, and a new service I built this week is helping me find them. After Jason Hunter showed me Mark Logic's XQuery-based XML database, Content Interaction Server, in a screencast, I set up an instance of it and began pumping in the RSS feeds of all the blogs I read. Then I wrote a query that combines free-text search for items containing the strings "Demo" or "Demo@15" with structured search for items that contain links to demo.com. It yielded a nice list of Demo-related items that I couldn't have built any other way.

The service works by converting the HTML content of my feeds into well-formed XHTML, storing it in the Mark Logic database, and then using the XQuery engine to perform hybrid free-text and structured searches. Although the vocabulary of XHTML is not very rich, certain elements -- notably links -- carry a latent semantic payload.

It's also possible to enrich the semantic payload of blog content, and on my own blog I've been doing that for a while. Using my XPath query service, you can easily find quotes by Ward Cunningham, Python code fragments, and a number of other things I'm marking with simple CSS tags. Can these ad hoc syntaxes be collaboratively extended? If we can get structured search working for the whole blogosphere, we'll find out.

Close

On Twitter now

Architecture

Powered by Twitter

On Twitter now

White Paper

D2D Virtual Tape Library Replication Primer

This whitepaper explains the terminology and concepts behind Data Replication technologies and establishes some sizing rules through worked examples. Learn the new paradigm in disaster tolerance—protect data anywhere.

Download now »

White Paper

An Alternative to Virtualization for Datacenter Cost Savings

Server virtualization is a popular option for dealing with mounting datacenter costs. Another equally promising approach is the use of an Application Delivery Controller. Citrix NetScaler provides a low-cost way for organizations to reduce their server count and accrue cost savings from a reduction in space, cooling, power and personnel.

Download now »

White Paper

Why Your Firewall, VPN, and IEEE 802.11i Aren't Enough to Protect Your Network

The emergence of WLANs has created a new breed of security threats to enterprise networks.

Included in HP ProCurve WLAN solutions is security technology that alleviates threats from WLANs through:
* Monitoring wireless activity inside and out of the enterprise
* Classifying WLAN transmissions into harmful and harmless
* Preventing transmissions that pose a security threat to the enterprise network
* Locating participating devices for physical remediation

Download now »

White Paper

Bringing the Edge to the Data Center

Effectively address data protection challenges, implementing solutions that help store and protect business–critical data while cutting costs and improving efficiency and reliability.

Download now »

Sign up to receive Architecture Resource Alerts

Subscribe to the Today's Headlines: First Look Newsletter

Find out what will be news for the day, with our first-thing-in-the-morning briefing.

©1994-2009 Infoworld, Inc.