Unvalidated XML; Is There a Place for It?

Thursday, April 29, 1999

XML Europe '99, Granada, Spain

Tim Bray, the panel chair, opened the plenary on validation with a bit of perspective. Validation, according to Bray, is based on DTDs. When XML was developed, everyone on the committee agreed that there are situations in which DTDs are not required. At the first conference where XML was introduced (SGML '96), Tim Bray and Michael Sperberg-McQueen conducted a poster session to explain why DTDs are not required. They pointed out that a DTD, according to the standard, is not only the <!ELEMENT and <!ATTLIST declarations but also all the accompanying comments and prose that explain its intended use. Do we validate against that? Of course not. So the issue is whether we really validate at all.

Bray continued his introduction by stating that much of the XML product set today is based on James Clark's parsers, neither of which validates. To add to the irony, W3C has a new working group developing a schema specification to provide more comprehensive and more sophisticated validation. We don't need DTDs, but we need something more powerful?

Bray, although he has worked in SGML/XML for many years, said that he has never yet written a DTD that people use for real work. So he asked the panel, "WHEN SHOULD WE VALIDATE?" Today there is a tremendous amount of debate in our community about the relevance of DTDs and validation. The panel was intended to address these issues.

David Turner of Microsoft Corporation began the "debate" by making the case for unvalidated XML. He first clarified the difference between "unvalidated" and "invalid." Validation is the formal process of testing data against a DTD. Unvalidated data has simply not been put through that test, which is very different from data being invalid, that is, failing to conform to a DTD. So, why not validate? According to Turner, validation adds a tremendous amount of processing overhead, and many people today use parsers without validation for that reason. The best example of data that does not need to be validated is data that has already been validated. If data has not been edited, it makes no sense to validate it each time. Likewise, if documents are pulled from a reliable source (such as a relational database), validation seems to be unnecessary overhead. Validating high volumes of "trustworthy" transactions would cost the system too much for little or no benefit.
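
The distinction between "unvalidated" and "invalid" can be illustrated with a non-validating parser. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which checks well-formedness but performs no DTD validation. The sample documents and the `is_well_formed` helper are assumptions made up for illustration; nothing like this was shown at the panel.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset: <note> must contain exactly one <to> element.
DTD = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>"

# Well-formed but invalid: <from> is not permitted by the DTD above.
invalid_but_well_formed = DTD + "<note><from>Tim</from></note>"

# Not well-formed: the <to> tag is never closed.
not_well_formed = "<note><to>Eve</note>"


def is_well_formed(text):
    """Return True if text parses as XML; no DTD validation is performed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

A non-validating parser accepts the first document even though it breaks its own DTD, and rejects only the second. Whether the first document is invalid is a question such a parser never asks.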

Turner also pointed out that even if data is validated, there is no guarantee that it is correct. One reason is that DTDs have no reliable datatypes. Data may be valid and pass all the rules, yet still be the wrong data, so validated data is not necessarily "trustworthy." Who controls the source of the DTD, and who determines which DTD is "right," also affects the trustworthiness of data. Turner believes that DTDs are useful while data and applications are being developed. But once the system is "trustworthy," eliminating the overhead of validating each time reduces cost.

Eve Maler (Arbortext) is an admitted DTD geek. She agreed with David Turner that there is a place for unvalidated XML. She extended the case by discussing the document class of one, for which having a DTD makes little sense. Another way Eve uses unvalidated XML is to prototype for the eventual creation of a DTD. Finally, she said that validation is not required for simple updates of documents, or for interchange when the receiver of a document simply wants to read it.

Eve participates in the XML Schema working group. According to Eve, this is a group that says "we hate DTDs, we don't need DTDs, give us more of them!" She also discussed the environment where groups must collaborate. Without DTDs and validation, it is almost impossible to create consistent documentation. Eve claims that in such an environment 60 percent of authoring time is spent trying to provide consistency and that DTDs can help alleviate this situation.

Peter Murray-Rust was the third panelist. Murray-Rust came to SGML four years ago. At that time he thought he had to write DTDs, yet the experience of writing DTDs is filled with horror stories. His presentation was written in XML using a tag set he made up, without a DTD. He created an XSL style sheet and ran it through XT to create slides that we could all read. So Murray-Rust asked, "Where is the use of the DTD?"

Now Murray-Rust's CML (Chemical Markup Language) has a DTD, and it has been updated four times. But why is this? It is because DTDs are all about control, and that does not work in the real world. DTDs not only impose rules; they also frighten people. According to Murray-Rust, DTDs just don't really work to validate. Real validation requires datatyping, which he demonstrated by showing us a valid document with incorrect content. He then showed us a document that, although not valid, had correct content. "Which would you choose if your life depended upon it? I rest my case!"
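
The point about datatyping can be sketched in a few lines. A DTD can declare only that an element contains character data (#PCDATA); it cannot say that the text must be a number in a sensible range. The example below is a minimal illustration in Python, assuming a made-up `<temperature>` element measured in kelvin. The element name and the `temperature_errors` function are hypothetical and are not taken from CML.

```python
import xml.etree.ElementTree as ET


def temperature_errors(xml_text):
    """Return a list of problems found in <temperature> values (kelvin)."""
    errors = []
    root = ET.fromstring(xml_text)
    for elem in root.iter("temperature"):
        raw = (elem.text or "").strip()
        try:
            value = float(raw)
        except ValueError:
            # Structurally fine character data, but not a number at all.
            errors.append(f"not a number: {raw!r}")
            continue
        if value < 0:
            # A number, but physically impossible on the kelvin scale.
            errors.append(f"below absolute zero: {value}")
    return errors
```

Both `<temperature>warm</temperature>` and `<temperature>-5</temperature>` would satisfy a DTD that declares the element as #PCDATA, yet both carry incorrect content that only a datatype check of this kind would catch.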

François Chahuneau (AIS), the final panelist, believes that in order to find value in data we must work from models. However, when he examined the question, he found that the reasons people actually used SGML had little to do with validation:

  1. Use a standard instead of proprietary markup
  2. Separation of content and structure from presentation
  3. Facilitate automated processing
  4. Support model-constrained documents

For many years, only SGML could give us these benefits. Now we have XML. XML addresses the first three benefits in exactly the same way that SGML did. In fact, according to Chahuneau, XML may succeed better than SGML did. But whether XML adequately supports model-constrained documents is open to question.

Chahuneau went on to examine the problems with the traditional SGML notion of validity. First, the DTD is both a grammar and a schema, which confuses the notion of parsability with that of model compliance. Next, DTDs lack expressive power (data typing, lexical rules). Finally, validity is defined from the "parser point of view": it is defined only for sequential processing of completed documents. Today, all SGML editors break the true notion of validity when they let us save a partial document that by its nature is not valid.

With XML, DTDs are no longer required to ensure proper understanding of document structure. XML separates the grammar from the schema, and so separates the notion of parsability from model compliance. Related schema proposals provide expressive power beyond that of DTDs. XML offers a range of possibilities:

  • No DTD at all
  • Vocabulary compliance (catalog of element types and attribute names)
  • Full DTDs
  • Schemas (beyond DTDs)
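
The second option in this range, vocabulary compliance, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `CATALOG` of element types and attribute names and the `vocabulary_errors` function are hypothetical, invented here to show the idea, and do not come from Chahuneau's talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: each allowed element type maps to its allowed attribute names.
CATALOG = {
    "report": {"date"},
    "section": {"id"},
    "title": set(),
    "para": set(),
}


def vocabulary_errors(xml_text, catalog=CATALOG):
    """List element types and attribute names that are not in the catalog."""
    errors = []
    # iter() walks every element in document order, starting with the root.
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag not in catalog:
            errors.append(f"unknown element: {elem.tag}")
            continue
        for name in elem.attrib:
            if name not in catalog[elem.tag]:
                errors.append(f"unknown attribute: {elem.tag}/@{name}")
    return errors
```

Unlike a full DTD, this check says nothing about content models: it does not care in what order elements appear or how they nest, only that every name used belongs to the agreed vocabulary.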

When do you need constraints? Chahuneau says that in some areas of business, specifications could be formally expressed using SGML. SGML was a convenient way to express pre-existing sets of rules and constraints. The need for validation is more related to the application than to the technology (XML/SGML).

Chahuneau concluded by saying that there is no clear "Yes" or "No" answer. A DTD might be used when authoring documents that must comply with a specification. However, we would not need validation when program-generated XML fragments are interchanged or viewed in a browser. There is a growing number of cases for semi-structured documents, where we want to check certain objects but do not care about the "validation" of the overall document. Chahuneau pointed out that XML brings us the freedom to use validation only when we need it, and to provide even greater validation when we find it useful.
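
Checking certain objects in a semi-structured document, while ignoring everything around them, might look like the sketch below. It assumes a made-up rule that every `<address>` element, wherever it occurs, must contain a `<street>` and a `<city>` child. The element names, the `REQUIRED_CHILDREN` rule, and the `address_errors` function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: children that every <address> element must contain.
REQUIRED_CHILDREN = ("street", "city")


def address_errors(xml_text):
    """Check only <address> elements; the rest of the document is unconstrained."""
    errors = []
    root = ET.fromstring(xml_text)
    # Number the <address> elements in document order, at any depth.
    for index, address in enumerate(root.iter("address"), start=1):
        for child in REQUIRED_CHILDREN:
            if address.find(child) is None:
                errors.append(f"address {index}: missing <{child}>")
    return errors
```

The surrounding document can use any elements at all. Only the objects the application cares about are held to a rule.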
