Unvalidated XML; Is There a Place for It?
Thursday, April 29, 1999
XML Europe '99, Granada, Spain
Tim Bray, the panel chair, opened the plenary on validation with a bit of perspective. Validation, according to Bray, is based on DTDs. When XML was developed, everyone on the committee agreed that there are situations in which DTDs are not required. At the first conference where XML was introduced (SGML '96), Tim Bray and Michael Sperberg-McQueen conducted a poster session to explain why DTDs are not required. They pointed out that a DTD, according to the standard, is not only the <!ELEMENT and <!ATTLIST declarations but also all the accompanying comments and prose that explain its intended use. Do we validate against that? Of course not. So the issue is, do we really validate at all?
Bray continued his introduction by stating that much of the XML product set today is based on James Clark's parsers, neither of which validates. To add to the irony, the W3C has a new working group developing a schema specification to create more comprehensive and more sophisticated validation. We don't need DTDs, but we need something more powerful?
Bray, although he has worked in SGML/XML for many years, said that he has never yet written a DTD that people use for real work. So he asked the panel, "WHEN SHOULD WE VALIDATE?" Today there is a tremendous amount of debate in our community about the relevance of DTDs and validation. The panel was intended to address these issues.
David Turner of Microsoft Corporation began the "debate" by making the case for unvalidated XML. He first clarified the difference between "unvalidated" and "invalid." Validation is the formal process of testing data against a DTD. Unvalidated data has simply not been put through that process, which is very different from invalid data, which does not conform to a DTD. So, why not validate? According to Turner, validation adds a tremendous amount of processing overhead, and many people today use parsers without validation for that reason. The best example of data that does not need to be validated is data that has already been validated: if it has not been edited since, it makes no sense to validate it each time. Likewise, if documents are pulled from a reliable source (such as a relational database), validation seems to be unnecessary overhead. Validating high volumes of "trustworthy" transactions would cost the system too much for little or no benefit.
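Turner's distinction can be illustrated with a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree module, which checks well-formedness only and never validates against a DTD; the sample documents and the is_well_formed helper are invented for illustration and are not from the panel.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: an order document with no DTD at all.
UNVALIDATED = "<order><item qty='2'>widget</item></order>"
# Not well-formed: the <item> element is never closed.
BROKEN = "<order><item qty='2'>widget</order>"


def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(UNVALIDATED))
    print(is_well_formed(BROKEN))
```

The first document is accepted even though nothing has checked it against a set of declarations: it is unvalidated, not invalid. The second is rejected without any DTD being involved, because a non-validating parser still enforces well-formedness.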
Turner also pointed out that even if data is validated, there is no guarantee that it is correct. One reason is that DTDs have no reliable datatypes. Finally, Turner pointed out that validated data is not necessarily "trustworthy": data may be valid and pass all the rules, yet still be the wrong data. Who controls the source of the DTD, and who determines which DTD is "right," also affects the trustworthiness of data. Turner believes that DTDs are useful while data and applications are being developed. But once the system is "trustworthy," eliminating the overhead of validating each time reduces cost.
Eve Maler (Arbortext) is an admitted DTD geek. She agreed that there is a place for unvalidated XML, as David Turner pointed out. She extended the case by discussing the document class of one, for which having a DTD makes little sense. Eve also uses unvalidated XML to prototype for the eventual creation of a DTD. Finally, she said that validation is not required for simple updates of documents, or for interchange when the receiver of a document simply wants to read it.
Eve participates in the XML Schema working group. According to Eve, this is a group that says "we hate DTDs, we don't need DTDs, give us more of them!" She also discussed environments where groups must collaborate. Without DTDs and validation, it is almost impossible to create consistent documentation. Eve claimed that in such an environment 60 percent of authoring time is spent trying to achieve consistency, and that DTDs can help alleviate this situation.
Peter Murray-Rust was the third panelist. Murray-Rust came to SGML four years ago. At that time he thought he had to do DTDs, yet trying to do DTDs is filled with horror stories. His presentation was written in XML using a tag set he made up, without a DTD. He created an XSL style sheet and ran it through XT to create slides that we could all read. So Murray-Rust asked, "where is the use of the DTD?"
Now, Murray-Rust's CML (Chemical Markup Language) does have a DTD, and it has been updated four times. But why is this? It is because DTDs are all about control, and that does not work in the real world. DTDs not only impose rules; they also frighten people. According to Murray-Rust, DTDs just don't really work to validate. Real validation requires datatyping, which he demonstrated by showing us a valid document with incorrect content. He then showed us a document that, although not valid, had correct content. "Which would you choose if your life depended upon it? I rest my case!"
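The contrast between a valid document with incorrect content and an invalid document with correct content can be sketched in Python. Everything here is an assumption made for illustration: the molecule, name, and weight elements are an invented tag set (not CML), EXPECTED_CHILDREN stands in for a DTD content model, and the two checks are hand-written rather than performed by a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model, standing in for a DTD rule such as
# <!ELEMENT molecule (name, weight)> with both children as #PCDATA.
EXPECTED_CHILDREN = ["name", "weight"]


def structure_ok(root):
    """DTD-style check: the right element names in the right order."""
    return root.tag == "molecule" and [c.tag for c in root] == EXPECTED_CHILDREN


def content_ok(root):
    """Datatype check a DTD cannot express: weight is a positive number."""
    text = root.findtext("weight")
    try:
        return text is not None and float(text) > 0
    except ValueError:
        return False


# Structurally valid, but the weight is nonsense.
valid_but_wrong = ET.fromstring(
    "<molecule><name>water</name><weight>heavy</weight></molecule>"
)
# Structurally invalid (an extra element), but the content is usable.
invalid_but_right = ET.fromstring(
    "<molecule><name>water</name><note>checked</note>"
    "<weight>18.015</weight></molecule>"
)

if __name__ == "__main__":
    for doc in (valid_but_wrong, invalid_but_right):
        print(structure_ok(doc), content_ok(doc))
```

The first document passes the structural check and fails the datatype check; the second does the opposite. A DTD can only express the first kind of rule, since #PCDATA accepts any text.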
François Chahuneau (AIS), the final panelist, believes that in order to find value in data we must work from models. However, when he examined the question, he found that the reasons people really used SGML had little to do with validation:
- Use a standard instead of proprietary markup
- Separation of content and structure from presentation
- Facilitate automated processing
- Support for model-constraint of documents
For many years, only SGML could give us these benefits. Now we have XML. XML addresses the first three benefits in exactly the same way that SGML did. In fact, according to Chahuneau, XML may succeed better than SGML did. But whether XML adequately supports model-constrained documents is questionable.
Chahuneau continued by examining the problems with the traditional SGML notion of validity. First, the DTD is both a grammar and a schema, which confuses the notion of parsability with that of model compliance. Next, DTDs lack expressive power (data typing, lexical rules). Finally, validity is defined from the "parser point of view": it is defined only for sequential processing of completed documents. Today, all SGML editors break the true notion of validity when they let us save a partial document that by its nature is not valid.
With XML, DTDs are no longer required to ensure proper understanding of document structure. XML separates the grammar from the schema, and so separates the notion of parsability from model compliance. Related schema proposals provide expressive power beyond that of DTDs. XML offers a range of possibilities:
- No DTD at all
- Vocabulary compliance (catalog of element types and attribute names)
- Full DTDs
- Schemas (beyond DTDs)
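Of the options in this list, vocabulary compliance is the easiest to sketch. The Python below is one possible reading of "catalog of element types and attribute names," not a method described by the panel: the CATALOG contents and the vocabulary_errors helper are hypothetical, and the check deliberately ignores nesting and order, which is what a full DTD would add.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: allowed element types mapped to their
# allowed attribute names.
CATALOG = {
    "report": {"date"},
    "section": {"id"},
    "para": set(),
}


def vocabulary_errors(text, catalog):
    """List names outside the catalog; nesting and order are not checked."""
    errors = []
    for elem in ET.fromstring(text).iter():
        if elem.tag not in catalog:
            errors.append("unknown element: " + elem.tag)
            continue
        for attr in sorted(elem.attrib):
            if attr not in catalog[elem.tag]:
                errors.append("unknown attribute: %s/@%s" % (elem.tag, attr))
    return errors


if __name__ == "__main__":
    doc = (
        "<report date='1999-04-29'><para/>"
        "<section id='s1' lang='en'><aside/></section></report>"
    )
    for problem in vocabulary_errors(doc, CATALOG):
        print(problem)
```

A document that nests a report inside a para would pass this check, because only the names are compared against the catalog. That is the trade-off of this middle option: cheaper than full validation, and weaker.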
When do you need constraints? Chahuneau said that in some areas of business, specifications could be formally expressed using SGML. SGML was a convenient way to express pre-existing sets of rules and constraints. The need for validation is related more to the application than to the technology (XML/SGML).
Chahuneau concluded by saying that there is no clear "Yes" or "No" answer. A DTD might be used in authoring documents that must comply with a specification. However, we would not need validation when program-generated XML fragments are interchanged or viewed in a browser. There are a growing number of cases for semi-structured documents, where we want to check certain objects but do not care about the "validation" of the overall document. Chahuneau pointed out that XML gives us the freedom to use validation only when we need it, and to provide stronger validation when we find it useful.
