InfoWorld
STRATEGIC DEVELOPER  

Mining the intranet

Using Web services, with or without SOAP, to cull valuable data from other Web apps

By Jon Udell  
December 12, 2003
 

One of my early uses of Web services, back in 1999, predated SOAP and WSDL. It was a script to calculate what I called Web mindshare. It combined Yahoo’s capability of enumerating sites in a category with AltaVista’s capability of counting inbound links to each of those sites. It was a primitive version of what Google, then in beta, went on to prove dramatically: Links measure authority. What interested me even more, though, was how easily that little script was able to compose a novel service — ranking everything in a category — from two existing but unrelated services.
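The composition described here can be sketched in a few lines. This is an illustration in Python, not the original 1999 script: the function name `rank_by_inbound_links` is hypothetical, the site names and counts are made up, and the live calls to the directory and the search engine are replaced by a plain dictionary lookup.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_inbound_links(
    sites: Iterable[str],
    count_inbound_links: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Pair each site with its inbound-link count, highest count first."""
    scored = [(site, count_inbound_links(site)) for site in sites]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up figures standing in for the two live services:
# the keys play the role of the category listing, the values the link counts.
FAKE_COUNTS = {"alpha.example": 1200, "beta.example": 43000, "gamma.example": 310}

ranking = rank_by_inbound_links(FAKE_COUNTS, FAKE_COUNTS.__getitem__)
for site, count in ranking:
    print(f"{count:>8,}  {site}")
```

Passing the counting step in as a function keeps the ranking logic independent of whichever service supplies the numbers.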

I was reminded of the mindshare calculator this week when I noticed that the new book Spidering Hacks by Kevin Hemenway and Tara Calishain includes an updated version that works with Google. Naturally, I had to try it out. But first I needed to find my API key because only registered users can call Google’s Web services. Then I had to install Perl’s SOAP::Lite module, which wasn’t on the machine I was using. Then I had to find a copy of Google’s WSDL file. All this to do exactly what the 1999 script had done without any of this paraphernalia.

This isn’t just an old-fart story about how simple things used to be simpler. In many cases, simple things still are (or can be) simple. My recent plunge back into the primordial soup that became Web services reminded me why simplicity is a good thing. Neither Yahoo nor AltaVista offers a SOAP/WSDL interface, so when I decided to rerun the old script to compare previous results with current ones, it was reasonable to expect a disaster. Conventional wisdom says HTML screen-scraping is a poor excuse for a formal API and won’t survive the test of time. And yet in this case, the Yahoo format was unchanged, and there was only one trivial tweak for AltaVista. (It used to report “about 43,000 pages”; now it reports “found 43,000 results.”)
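To show how small that kind of tweak is, here is a hedged Python sketch of pulling a result count out of a page. The regular expression is an assumption built only from the two phrasings quoted above; the real pages and the original script's pattern may well have differed.

```python
import re
from typing import Optional

# Assumed pattern: accepts both the old wording ("about 43,000 pages")
# and the new wording ("found 43,000 results").
COUNT_PATTERN = re.compile(
    r"(?:about|found)\s+([\d,]+)\s+(?:pages|results)", re.IGNORECASE
)


def extract_count(page_text: str) -> Optional[int]:
    """Return the reported result count, or None if the phrase is absent."""
    match = COUNT_PATTERN.search(page_text)
    if match is None:
        return None
    # Drop the thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))
```

Returning `None` instead of raising makes a changed page layout easy to detect: a caller can log the miss and move on to the next site.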

With current AltaVista data in hand, I decided to look at comparable results for Google and AllTheWeb. Because Google now discourages screen-scraping, I used the SOAP/WSDL method. But for AllTheWeb, which like AltaVista offers no formal API, I used the original technique, tweaking the URL of the search engine and the pattern of the result count. Pointing a SOAP client at a different service takes two analogous tweaks: the SOAP end point and the shape of the XML result. But if I’d had to register for an API key and locate WSDL documentation for each of the three services whose results I compared, I probably wouldn’t have bothered.
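One way to picture the two tweaks is as a pair of settings per search engine. The sketch below is an assumption-laden illustration: the engine names, the `.example` URLs, the `link:` query form, and the count patterns are all placeholders, not the real addresses or page formats of any service named above.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote_plus


@dataclass(frozen=True)
class Engine:
    name: str
    url_template: str          # tweak 1: where to send the query
    count_pattern: re.Pattern  # tweak 2: how the result count is phrased


# Placeholder engines; every value here is hypothetical.
ENGINES = [
    Engine("engine-a", "http://search-a.example/query?q={query}",
           re.compile(r"found\s+([\d,]+)\s+results")),
    Engine("engine-b", "http://search-b.example/search?q={query}",
           re.compile(r"([\d,]+)\s+hits")),
]


def build_url(engine: Engine, site: str) -> str:
    """Build the inbound-link query URL for one site."""
    return engine.url_template.format(query=quote_plus(f"link:{site}"))


def parse_count(engine: Engine, page_text: str) -> Optional[int]:
    """Extract the result count using the engine's own pattern."""
    match = engine.count_pattern.search(page_text)
    return int(match.group(1).replace(",", "")) if match else None
```

Adding another engine then means adding one more entry to `ENGINES`; the fetching and ranking code does not change.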

Of course, sites such as Amazon and Google have reasons to create formal APIs and control access to them. But on an enterprise intranet the threat is disuse, not overuse. You’re publishing information that you want people to find, exploit, and recombine. When it’s appropriate to use SOAP and WSDL — for example, when queries require fancy authorization or complex inputs — then do so. But when a simpler strategy will suffice, don’t be ashamed to use it. Between the primordial tag soup of HTML and the formal realm of Web services exists a large and fertile middle ground: XHTML.

Information that you publish in XHTML can be directly consumed by browsers, and it is much friendlier to spiders than ill-formed HTML. It’s true that creating XHTML pages requires more discipline than hacking out HTML, and it may incur some retraining costs. But if you hope people will mine your intranet, make the job as easy as it can be.
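As a rough illustration of why well-formed markup is friendlier to spiders, the Python sketch below reads an XHTML page with a stock XML parser instead of regular expressions. The page content, the `projects` table id, and the `name` and `status` class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Invented intranet page used only for this example.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Project status</title></head>
  <body>
    <table id="projects">
      <tr><td class="name">Alpha</td><td class="status">on track</td></tr>
      <tr><td class="name">Beta</td><td class="status">delayed</td></tr>
    </table>
  </body>
</html>"""


def project_statuses(xhtml: str) -> dict:
    """Map each project name to its status, using structure rather than text patterns."""
    root = ET.fromstring(xhtml)
    statuses = {}
    for row in root.findall(".//x:table[@id='projects']/x:tr", XHTML_NS):
        name = row.find("x:td[@class='name']", XHTML_NS)
        status = row.find("x:td[@class='status']", XHTML_NS)
        if name is not None and status is not None:
            statuses[name.text] = status.text
    return statuses


print(project_statuses(PAGE))
```

The same parser rejects ill-formed input such as an unclosed `<br>` with a `ParseError`, which is the discipline the XHTML approach trades for easier mining.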

Jon Udell is lead analyst and blogger in chief at the InfoWorld Test Center.
