May 28, 2004

Optical illusions

How we look at data determines what we can do with it

In 1832, the Swiss crystallographer Louis Albert Necker discovered his famously ambiguous cube, which seems to jump back and forth between two orientations. Given the same raw data — a particular arrangement of a dozen line segments — our brains find different ways to interpret it. Although optical illusions like this one are loads of fun, they have no obvious practical use. However, a recent project reminded me that the ability to switch frames of reference is a powerful lever that can do real work with data.

For example, not long ago, I needed to join two different data sources: One was a collection of Web pages, the other was a Web service, and the join key was an ISBN (International Standard Book Number). The challenge was to select ISBNs from the Web pages, capture each ISBN’s enclosing structure, then join elements of that structure with elements retrieved from the Web service.

I knew that by converting the HTML pages to XHTML (using my favorite tool for this job, HTML Tidy), I’d be able to attack the Web pages in a structured way with XML-oriented tools such as XPath and XSLT (XSL Transformations). Finding the target ISBNs presented something of a problem, though. The obvious solution would be to use a regular expression to match their distinctive 10-digit pattern. A next generation of XML tools (XPath 2.0, XSLT 2.0, and XQuery) can do that, but none of these is yet widely available. Meanwhile, XPath 1.0 and XSLT 1.0, which are ubiquitous, don’t support regular expressions.

The solution I hit on involved a series of Necker-cube-like perceptual flips. I started by pretending that the XHTML pages were just streams of characters, and used good old regular-expression search-and-replace to attach the marker _ISBN_ to each ISBN. So, for example, 1565925378 became _ISBN_1565925378. Then, switching to an interpretation of the pages as structured text, I used XPath queries to find elements containing the _ISBN_ marker, and to extract surrounding context. Then, switching back to the character-stream view, I removed the markers and restored the original ISBNs. Finally, I switched to the XML view again to process the extracts.
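To make the four flips concrete, here is a minimal sketch in Python. It is not the code from the project. The sample markup, the class names, and every function name are hypothetical. Python's standard ElementTree supports only a small XPath subset with no contains() function, so the sketch walks the tree in place of the XPath query described above.

```python
import copy
import re
import xml.etree.ElementTree as ET

MARKER = "_ISBN_"
# Assumption: an ISBN is ten characters, nine digits plus a digit or X.
ISBN_RE = re.compile(r"\b\d{9}[\dX]\b")


def mark_isbns(xhtml):
    """Flip 1 (character view): prefix every ISBN-like run with the marker."""
    return ISBN_RE.sub(lambda m: MARKER + m.group(0), xhtml)


def enclosing_elements(root):
    """Flip 2 (structure view): find elements with a child whose text is marked."""
    hits = []
    for parent in root.iter():
        if any(MARKER in (child.text or "") for child in parent):
            hits.append(parent)
    return hits


def unmark(element):
    """Flip 3 (character view): serialize, strip the marker, and re-parse."""
    clone = copy.copy(element)
    clone.tail = None  # drop trailing text so the fragment re-parses cleanly
    text = ET.tostring(clone, encoding="unicode")
    return ET.fromstring(text.replace(MARKER, ""))


def extract_books(xhtml):
    """Flip 4 (structure view): turn each cleaned extract into a record."""
    root = ET.fromstring(mark_isbns(xhtml))
    books = []
    for hit in enclosing_elements(root):
        clean = unmark(hit)
        books.append({child.get("class"): child.text for child in clean})
    return books


# Hypothetical page fragment; real pages would first pass through HTML Tidy.
SAMPLE = (
    "<ul>"
    "<li><span class='title'>Book A</span><span class='isbn'>1565925378</span></li>"
    "<li><span class='title'>No ISBN here</span></li>"
    "</ul>"
)

if __name__ == "__main__":
    print(extract_books(SAMPLE))
```

Each function keeps to one frame of reference, either characters or tree, which is the point of the exercise. The records that come out could then be joined with Web service results on the ISBN.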

If data sources always described themselves sufficiently well, we wouldn’t have to play these games. Because they often don’t, we must intervene. In this example, the job would have gone more smoothly if both frames of reference — that is, regular expressions and XML — had been integrated into a single processor such as XPath 2.0 or XQuery. But that’s just a convenience. It’s quite feasible to compose solutions by combining specialized processors.

The real integration challenge resides inside our heads. There is no single frame of reference for data. At different times, for different reasons, we’ll choose to perceive the same stuff as a stream of characters or a nested structure or a collection of related sets. XQuery’s capability of accommodating all these perspectives is of special interest, and seems likely to make it a potent force in the long run. I suspect it’ll be quite a while, though, before we truly get comfortable with XQuery. We’re wired to be able to see the same data in different ways, but that latent talent must be cultivated, and it improves with practice. So let’s keep staring at the cube until we learn how to see it.


©1994-2009 Infoworld, Inc.