One of my early uses of Web services, back in 1999, predated SOAP and WSDL. It was a script to calculate what I called Web
mindshare. It combined Yahoo’s capability of enumerating sites in a category with AltaVista’s capability of counting inbound
links to each of those sites. It was a primitive version of what Google, then in beta, went on to prove dramatically: Links
measure authority. What interested me even more, though, was how easily that little script was able to compose a novel service
— ranking everything in a category — from two existing but unrelated services.
I was reminded of the mindshare calculator this week when I noticed that the new book Spidering Hacks by Kevin Hemenway and
Tara Calishain includes an updated version that works with Google. Naturally, I had to try it out. But first I needed to find
my API key because only registered users can call Google’s Web services. Then I had to install Perl’s SOAP::Lite module, which
wasn’t on the machine I was using. Then I had to find a copy of Google’s WSDL file. All this to do exactly what the 1999 script
had done without any of this paraphernalia.
This isn’t just an old-fart story about how simple things used to be simpler. In many cases, simple things still are (or can
be) simple. My recent plunge back into the primordial soup that became Web services reminded me why simplicity is a good thing.
Neither Yahoo nor AltaVista offers a SOAP/WSDL interface. So when I decided to rerun the old script to compare previous results
with current ones, it was reasonable to expect a disaster. Conventional wisdom says HTML screen-scraping is a poor excuse
for a formal API and won’t survive the test of time. And yet in this case, the Yahoo format was unchanged, and there was only
one trivial tweak for AltaVista. (It used to report “about 43,000 pages”; now it reports “found 43,000 results.”)
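
To make the size of that tweak concrete, here is a minimal Python sketch of the kind of pattern match involved. It is not the 1999 script; the function name and the two regular expressions are assumptions, modeled only on the two phrasings quoted above.

```python
import re

# Hypothetical patterns, modeled on the old and new phrasings quoted above.
COUNT_PATTERNS = [
    re.compile(r"about ([\d,]+) pages"),     # older wording
    re.compile(r"found ([\d,]+) results"),   # newer wording
]

def extract_count(page_text):
    """Return the first result count found in page_text, or None if no pattern matches."""
    for pattern in COUNT_PATTERNS:
        match = pattern.search(page_text)
        if match:
            # Strip thousands separators before converting to an integer.
            return int(match.group(1).replace(",", ""))
    return None

print(extract_count("AltaVista found 43,000 results"))
```

Adapting to the changed wording amounts to adding or editing one entry in the pattern list.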
With current AltaVista data in hand, I decided to look at comparable results for Google and AllTheWeb. Because Google now
discourages screen-scraping, I used the SOAP/WSDL method. But for AllTheWeb, which like AltaVista offers no formal API, I
used the original technique, tweaking the URL of the search engine and the pattern of the result count. A SOAP client would need
two analogous tweaks, one to specify the SOAP endpoint and one to handle the XML result. But if I’d had to register for an API key and
locate WSDL documentation for each of the three services whose results I compared, I probably wouldn’t have bothered.
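
The two per-engine tweaks, a URL and a result-count pattern, can be pictured as one table row per search engine. The sketch below is illustrative only: the engine names, URLs, and patterns are placeholders rather than the real ones, and the `fetch` parameter is an assumed stand-in for whatever function retrieves a page as text.

```python
import re
from urllib.parse import quote

# Hypothetical engine table. The URLs and patterns are placeholders,
# not the actual query formats of any real search engine.
ENGINES = {
    "engine_a": ("http://engine-a.example/search?q=link:{site}", r"found ([\d,]+) results"),
    "engine_b": ("http://engine-b.example/find?query=link:{site}", r"([\d,]+) documents found"),
}

def link_count(engine, site, fetch):
    """Ask one engine how many pages link to site; fetch(url) must return page text."""
    url_template, pattern = ENGINES[engine]
    page = fetch(url_template.format(site=quote(site, safe="")))
    match = re.search(pattern, page)
    return int(match.group(1).replace(",", "")) if match else 0

def rank_sites(engine, sites, fetch):
    """Return (site, inbound link count) pairs, highest count first."""
    counts = {site: link_count(engine, site, fetch) for site in sites}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Supporting another engine under this scheme means adding one row to the table, which is the whole cost of the "tweak" described above.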
Of course, sites such as Amazon and Google have reasons to create formal APIs and control access to them. But on an enterprise
intranet the threat is disuse, not overuse. You’re publishing information that you want people to find, exploit, and recombine.
When it’s appropriate to use SOAP and WSDL — for example, when queries require fancy authorization or complex inputs — then
do so. But when a simpler strategy will suffice, don’t be ashamed to use it. Between the primordial tag soup of HTML and the
formal realm of Web services exists a large and fertile middle ground: XHTML.
Information that you publish in XHTML can be directly consumed by browsers, and it is much friendlier to spiders than ill-formed
HTML. It’s true that creating XHTML pages requires more discipline than hacking out HTML, and it may incur some retraining
costs. But if you hope people will mine your intranet, make the job as easy as it can be.
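
As a rough illustration of why well-formed XHTML is friendlier to spiders, the Python sketch below reads a page with a stock XML parser instead of hand-tuned patterns. The sample page and the function name are invented for the example, and it assumes a page without DTD-defined entities such as `&nbsp;`, which this parser would reject.

```python
import xml.etree.ElementTree as ET

# Standard XHTML namespace, bound to a local prefix for path queries.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Invented sample page standing in for an intranet document.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Team directory</title></head>
  <body>
    <ul>
      <li><a href="/projects/alpha">Alpha</a></li>
      <li><a href="/projects/beta">Beta</a></li>
    </ul>
  </body>
</html>"""

def extract_links(xhtml_text):
    """Return (href, link text) pairs for every anchor element in the page."""
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), a.text) for a in root.iterfind(".//x:a", XHTML_NS)]

print(extract_links(SAMPLE_PAGE))
```

Because the parser works from the document structure, a change in surrounding wording or layout does not break the extraction the way it can break a text pattern.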