InfoWorld
Exploring XML in Office 11

XML capabilities in store for Word and Excel pack a learning curve

By Jon Udell  
February 21, 2003
 

This year's upcoming debut of Microsoft Office 11 will mark the start of a long process of education and adaptation.


Microsoft Word, Excel, and XML


Executive Summary:
Most business information lives in documents, not in databases. With the new XML features in Office 11, IT can start to bring database-like discipline to the creation and querying of those documents.

Test Center Perspective:
For developers, schematization of business documents, such as resumes and expense reports, will be a long and gradual process. But Excel's new ability to read in and analyze XML data -- from XML-aware databases, Web services, and other sources -- will be immediately useful.

Our previous look at the Office 11 beta (see "XML for the rest of us") painted the big picture. We described how and why the pillars of Office — Word and Excel — can make use of XML. But the devil's in the details. So here we'll explore how existing Office documents can benefit from the new features, how developers will prepare XML-aware Office templates, and how users will apply them to create and analyze XML data.

Microsoft's Jean Paoli, the architect of Office 11's XML support, was co-editor of the XML 1.0 specification with Tim Bray. The first thing Paoli showed Bray was that any existing .doc file can be saved as XML: specifically, as WordML, which expresses both the style and the content of the document in pure XML. "When I showed that to Tim," Paoli remarked, "he was jumping for joy." In a separate interview, Bray, an Internet search pioneer and founder of data-visualization provider Antarctica Systems, said the same thing. Although it's true that Google can index Word, PDF, and other formats, .doc files are inherently opaque. WordML is a bridge from the .doc format to the world of XML and its associated technologies of transformation, indexing, and search. In Word 11, you need only Save As XML to enter that world.
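To make the indexing and search point concrete, here is a minimal Python sketch that pulls paragraph text out of a document saved as WordML. It rests on assumptions: the namespace URI and the w:p, w:r, and w:t element names are those of the WordML format as it shipped in Word 2003 and may differ in the beta discussed here, and the sample document and the paragraphs and search helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of WordML as shipped in Word 2003; the beta
# described in this article may have used a different URI.
W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical, heavily trimmed sample of a document saved as XML.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>Status report for </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>February</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>All milestones met.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs(wordml):
    """Return the plain text of each paragraph, ignoring run formatting."""
    root = ET.fromstring(wordml)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        # A paragraph's text is the concatenation of its w:t elements.
        result.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return result


def search(wordml, word):
    """Return indexes of paragraphs containing word (case-insensitive)."""
    needle = word.lower()
    return [i for i, text in enumerate(paragraphs(wordml)) if needle in text.lower()]


print(paragraphs(SAMPLE))
print(search(SAMPLE, "milestones"))
```

Nothing here depends on Word itself. Any XML-aware tool can walk the saved file the same way, which is exactly what a binary .doc does not allow.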

Word 11's Save As XML feature presents a check-box labeled "Save as data only." What data means, here, is tagged elements belonging to an XML Schema. For a preexisting .doc file — a status report, a book chapter — there are no such elements. If you check "Save as data only," Word warns that you'll lose your document formatting. In this case, you'll lose more than that. The output will be an empty file because the document has no data in the XML sense. Let's conjure up some.

The example that Paoli offered began with a standard .dot file — that is, an existing Word template, just like those you already use. To make that template a launchpad for a family of documents that store valid XML data, the first step is to acquire, or create, an XSD (XML Schema Definition) file. And that step is a doozy. As we discussed in "Modeling Biz Docs in XML," few IT professionals have experience modeling data with XML Schema's predecessor, DTD (Document Type Definition), which has been around for more than 15 years. Even fewer have XML Schema experience. After Office 11 ships, we face a classic chicken-and-egg scenario. Developers can't really learn the art of modeling data in business documents without user feedback. But users can't provide that feedback until they start actually working with XML-enriched documents. Office 11's XML support isn't a final solution. Rather, it allows for a long, difficult, and absolutely vital bootstrapping process.

Caveats aside, after the developer has an XSD file (for example, one that defines required structure and data types for a résumé), it's straightforward to map it to a Word template. In the current beta, you use Tools/Templates and Add-Ins/XML Schema to associate your schema with the template. In the XML Structure task pane you then choose the schema's root tag and wrap that element around the document. That exposes its contained elements for more granular mapping. Validation of structure happens interactively. Data-type validation (for example, ensuring only numbers in the date fields) is also an option, although it wasn't working in the early beta we tested.
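The two kinds of checking described above, structure and data type, can be sketched in a few lines of Python. This is not XML Schema validation and not how Word does it; it is a hand-rolled stand-in that uses only the standard library. The Resume element names, the REQUIRED and DATE_FIELDS rules, and the validate helper are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical rules standing in for a résumé schema: the children that
# <Resume> must contain, and the elements whose text must be an ISO date.
REQUIRED = ["Name", "Objective", "Experience"]
DATE_FIELDS = {"StartDate", "EndDate"}


def validate(xml_text):
    """Return a list of human-readable problems; an empty list means valid."""
    root = ET.fromstring(xml_text)
    problems = []

    # Structural check: right root element, all required children present.
    if root.tag != "Resume":
        problems.append(f"root element is <{root.tag}>, expected <Resume>")
    present = {child.tag for child in root}
    for name in REQUIRED:
        if name not in present:
            problems.append(f"missing required element <{name}>")

    # Data-type check: date fields must parse as YYYY-MM-DD.
    for elem in root.iter():
        if elem.tag in DATE_FIELDS:
            try:
                date.fromisoformat((elem.text or "").strip())
            except ValueError:
                problems.append(f"<{elem.tag}> is not a date: {elem.text!r}")
    return problems


GOOD = """<Resume><Name>Pat Doe</Name><Objective>Analyst</Objective>
<Experience><Job><StartDate>2001-03-01</StartDate></Job></Experience></Resume>"""

BAD = """<Resume><Name>Pat Doe</Name>
<Experience><Job><StartDate>last spring</StartDate></Job></Experience></Resume>"""

print(validate(GOOD))
print(validate(BAD))
```

A real XSD expresses these same rules declaratively, along with ordering and cardinality constraints that this sketch ignores.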

All of Word's formatting power is available here. But does that formatting carry over to the saved XML? It depends.

If the XSD file defines a field merely as a string, with no internal XML structure, the formatting will be lost when you save only XML data, not WordML. You can certainly elaborate more structure within that field, but that's the kind of trade-off developers and users will wrestle with for years to come. It's costly for developers to define structure, and costly for users to interact with it. The solution will often be to punt on the more elaborate structure, and focus on the benefit of being able to search for words, say, in the Experience sections of a pile of résumés.
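Even that modest level of structure pays off. The sketch below, in Python, searches only the Experience sections of a small pile of résumés saved as data-only XML. The file names, element names, and the find_in_section helper are hypothetical, and the résumés are inlined as strings so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-only output for three résumés; element names are assumptions.
RESUMES = {
    "doe.xml": "<Resume><Name>Pat Doe</Name>"
               "<Experience>Built XSLT pipelines for invoices.</Experience>"
               "<Education>Coursework in Perl.</Education></Resume>",
    "roe.xml": "<Resume><Name>Lee Roe</Name>"
               "<Experience>Maintained Perl and Python build tools.</Experience></Resume>",
    "poe.xml": "<Resume><Name>Sam Poe</Name>"
               "<Experience>Designed Excel pivot-table reports.</Experience></Resume>",
}


def find_in_section(docs, section, word):
    """Return the names of documents whose named section mentions word."""
    needle = word.lower()
    hits = []
    for filename, text in docs.items():
        root = ET.fromstring(text)
        for elem in root.iter(section):
            # itertext() also picks up text inside any nested elements.
            if needle in "".join(elem.itertext()).lower():
                hits.append(filename)
                break
    return sorted(hits)


print(find_in_section(RESUMES, "Experience", "perl"))
```

A plain full-text search for "perl" would also match doe.xml, where the word appears only under Education. Scoping the query to one element is the benefit the schema buys, even when the element's content is just a string.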

The process of schematizing an Excel template — say, for an expense report — is similar. Starting with a pre-existing spreadsheet template, you create or acquire a schema, and map the schema to the template, element by element. You can then hand the XML-enhanced template to a user. Expense reports spawned from the template are now, necessarily, schema-valid.

Until Microsoft announced InfoPath (formerly XDocs), examples such as Word résumés and Excel expense reports illustrated a new vision for Office as an information-gathering toolset. Word would create documents full of text and graphics; Excel would create documents full of numbers and charts; both would allow IT to exert control over the data. When it arrives as the newest member of the Office family, InfoPath will complicate that picture. InfoPath looks likely, in many cases, to be the strongest tool for gathering semistructured data. It is tuned neither for the complex documents that are Word's forte, nor the data grids that are Excel's, but rather for gathering information that might be viewed in Word, or analyzed in Excel, or injected into a business process via e-mail or SOAP calls.

Will InfoPath, rather than Word or Excel, become the preferred way to gather such data? We've seen InfoPath, but haven't tested it yet, so we'll reserve judgment until then. At first glance, though, InfoPath seems to overlap more with Word than with Excel. Although Word can deal with massively complex documents, its power is frankly often wasted on simpler texts for which InfoPath's built-in XHTML editor might suffice.

But nothing else in the Office suite will have anything like Excel's analytic prowess. Excel 11's newfound ability to absorb arbitrary schema-governed XML data, coupled with the explosion of XML data coming from everywhere (Web services, XML-aware databases, the rest of the Office suite, and other emerging XML applications), makes Excel more valuable than ever.

If you start with a raw XML file — just data, no schema — Excel will read the data and make a best-effort map to the grid. In the resulting worksheet, that data is immediately available for editing, sorting, charting, pivot-table analysis, and more. Of course when the data comes from a Web service, as it increasingly will, it is likely to be schematized. In that case, your options multiply. Once you associate a schema with the XML data, you can select elements shown in the XML Structure task pane.
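What a best-effort mapping of schema-less XML onto a grid might look like can be sketched in Python. This is an illustration of the idea, not Excel's actual algorithm: it assumes each child of the root is a row and each grandchild tag is a column. The sample data and the to_grid helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw XML with no schema; note the rows do not share all fields.
RAW = """<Expenses>
  <Item><Date>2003-02-03</Date><Category>Travel</Category><Amount>412.50</Amount></Item>
  <Item><Date>2003-02-04</Date><Amount>18.75</Amount><Note>Taxi</Note></Item>
</Expenses>"""


def to_grid(xml_text):
    """Map each child of the root to a row and each grandchild tag to a column."""
    root = ET.fromstring(xml_text)

    # First pass: collect column names in order of first appearance.
    columns = []
    for row in root:
        for cell in row:
            if cell.tag not in columns:
                columns.append(cell.tag)

    # Second pass: emit a header row, then one row per element,
    # leaving a blank cell where a row lacks a field.
    grid = [columns]
    for row in root:
        values = {cell.tag: (cell.text or "") for cell in row}
        grid.append([values.get(col, "") for col in columns])
    return grid


for line in to_grid(RAW):
    print(line)
```

Without a schema, the code has to guess which element repeats and which fields are optional. That guesswork is what a schema removes.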

Under the covers, Excel creates the XPath queries that address those elements within the nested structure of the document. By dragging a set of selected elements to the worksheet, you create an XML data range that can absorb data from one or more XML files conforming to the schema. You could write an XSLT transformation to sort that aggregated data on a column. But why bother, when the worksheet will sort it for you? Life's too short, and most users don't (and shouldn't) know beans about XSLT.
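The idea of path-addressed elements feeding one aggregated range can be sketched in Python with the limited XPath support in the standard library's ElementTree. The ExpenseReport structure, the ROW_PATH and COLUMN_PATHS mapping, and the load_range helper are hypothetical; they illustrate the concept and are not the queries Excel generates.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: the path to the repeating element, and for each
# worksheet column the path to its value relative to that element.
ROW_PATH = "./Items/Item"
COLUMN_PATHS = {"Date": "Date", "Category": "Category", "Amount": "Amount"}

# Two hypothetical expense reports conforming to the same structure.
REPORTS = [
    """<ExpenseReport><Employee>Doe</Employee><Items>
         <Item><Date>2003-02-03</Date><Category>Travel</Category><Amount>412.50</Amount></Item>
         <Item><Date>2003-02-04</Date><Category>Meals</Category><Amount>18.75</Amount></Item>
       </Items></ExpenseReport>""",
    """<ExpenseReport><Employee>Roe</Employee><Items>
         <Item><Date>2003-02-10</Date><Category>Lodging</Category><Amount>230.00</Amount></Item>
       </Items></ExpenseReport>""",
]


def load_range(documents):
    """Collect one row per repeating element across all documents."""
    rows = []
    for text in documents:
        root = ET.fromstring(text)
        for item in root.findall(ROW_PATH):
            rows.append({col: item.findtext(path, default="")
                         for col, path in COLUMN_PATHS.items()})
    return rows


rows = load_range(REPORTS)

# Sorting the aggregated rows on a column is one line, with no XSLT.
by_amount = sorted(rows, key=lambda r: float(r["Amount"]), reverse=True)
for r in by_amount:
    print(r["Date"], r["Category"], r["Amount"])
```

The sort at the end is the programmer's equivalent of clicking a column header. Either way, the transformation step disappears once the data sits in a table.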

Controlling the quality of our XML data, by creating it and properly maintaining it, will be a huge step toward smoother business processes and better business intelligence. But the data means nothing until we interpret it. Excel has always been the engine of interpretation. Everything you already know about how to do that still applies. When you fuel it with higher-quality data, Excel becomes an even more powerful analytical engine.





 


 
Jon Udell is lead analyst and blogger in chief at the InfoWorld Test Center.
