How to maintain a family of DTDs and keep them related using switchboards

Diederik Gerth van Wijk, DTD Manager, Wolters Kluwer Nederland, dgerth@kluwer.nl

Presented at the Extreme Markup Languages 2000 conference, Montréal, 2000.08.18.09.45

This HTML page is based on the slide show used while presenting the paper. It uses CSS to clarify some examples and has only been tested with IE5.5, where it works fine. The full text of the paper is in the conference proceedings, which should be available through GCA.

If you think this talk is about:

Overview

5 minute intro

A family of DTDs

Switches

Switches are parameter entities with possible values of "INCLUDE" or "IGNORE"; they switch marked sections on and off:
<!ENTITY % DO.subtitle.IN.parablock "IGNORE">
<![ %DO.subtitle.IN.parablock; [
  <!ENTITY % HF.subtitle.IN.parablock "subtitle?, ">
]]>
<!ENTITY % HF.subtitle.IN.parablock "">
<!ENTITY % H.parablock "title, %HF.subtitle.IN.parablock;">
<!ENTITY % B.parablock "para+">
<!ENTITY % T.parablock "">
<!ENTITY % C.parablock "(%H.parablock; %B.parablock;%T.parablock;)">
<!ELEMENT parablock %C.parablock;>
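
The effect of the switch can be traced with a small model. The following Python sketch is not part of our tooling; it only mimics two SGML rules the technique relies on: the first declaration of an entity is binding, and a marked section is read only when its keyword resolves to INCLUDE. The helper names expand, declare and parablock_model are hypothetical.

```python
import re

def expand(text, entities):
    """Replace each %name; reference with the value already bound to name."""
    return re.sub(r"%([\w.-]+);", lambda m: entities[m.group(1)], text)

def parablock_model(switch_value):
    """Content model of parablock for a given DO.subtitle.IN.parablock value."""
    entities = {}

    def declare(name, value):
        # The first declaration of an entity is binding; later ones are ignored.
        if name not in entities:
            entities[name] = expand(value, entities)

    declare("DO.subtitle.IN.parablock", switch_value)
    if entities["DO.subtitle.IN.parablock"] == "INCLUDE":  # the marked section
        declare("HF.subtitle.IN.parablock", "subtitle?, ")
    declare("HF.subtitle.IN.parablock", "")  # default, used only if still unbound
    declare("H.parablock", "title, %HF.subtitle.IN.parablock;")
    declare("B.parablock", "para+")
    declare("T.parablock", "")
    declare("C.parablock", "(%H.parablock; %B.parablock;%T.parablock;)")
    return " ".join(entities["C.parablock"].split())  # collapse whitespace

print(parablock_model("IGNORE"))   # (title, para+)
print(parablock_model("INCLUDE"))  # (title, subtitle?, para+)
```

With the switch at IGNORE the empty default is bound and the subtitle disappears from the model; with INCLUDE the declaration inside the marked section is bound first and the later default is ignored.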

Switchboards

DTD x is defined in a switch file x.SWI, which contains only switch definitions:
<!ENTITY % DO.subtitle.IN.parablock "INCLUDE">
Switches can trigger other switches in the central switchboard:
<![ %DO.subtitle.IN.parablock; [
  <!ENTITY % DO.subtitle "INCLUDE">
]]>
The only way a user can change his DTD is by modifying the switch file.
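
How a switch file overrides the switchboard can be sketched as follows, assuming the switch file is read before the central switchboard so that its definitions are bound first. The tuple representation and the name resolve_switches are hypothetical, and the "IGNORE" default for DO.subtitle is an assumption modelled on the defaults shown elsewhere in this talk.

```python
def resolve_switches(*files):
    """Bind switches in reading order; the first binding of a name wins."""
    bound = {}
    for declarations in files:
        for name, value, guard in declarations:
            # guard names the switch of an enclosing <![ %guard; [ ... ]]>,
            # or is None for a declaration outside any marked section.
            # An unbound guard is treated as ignored here; a real parser
            # would report an error instead.
            if guard is None or bound.get(guard) == "INCLUDE":
                bound.setdefault(name, value)
    return bound

# x.SWI: switch definitions only.
X_SWI = [("DO.subtitle.IN.parablock", "INCLUDE", None)]

# Central switchboard: a default, a triggered switch, and its assumed default.
SWITCHBOARD = [
    ("DO.subtitle.IN.parablock", "IGNORE", None),
    ("DO.subtitle", "INCLUDE", "DO.subtitle.IN.parablock"),
    ("DO.subtitle", "IGNORE", None),
]

print(resolve_switches(X_SWI, SWITCHBOARD))
print(resolve_switches([], SWITCHBOARD))
```

With x.SWI in front, both switches end up as INCLUDE; with an empty switch file the switchboard defaults apply and both stay at IGNORE.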

The advantages

Why we think we need so many DTDs

Because of who we were when starting with SGML

Some more data (rough estimates)

  • Loose leaf titles:
    • Legal: 130 (50% SGML)
    • Tax: 150 (30% SGML)
    • 150,000 pages SGML
  • 300 journals (5% SGML), 100 CD-ROMs
  • 300 book series (5% SGML)
  • 800 books a year
  • 16,000 Laws (300 MB)
  • 50,000 Court sentences (750 MB)
  • 600 SGML titles (1750 MB)

Why we need some loose DTDs

Tight DTDs are better

Real world situation

If you encounter a new situation that can't be fixed, the DTD must be modified (loosened) to allow it. For instance, one parablock without a title:
<!ELEMENT pboc (title?, para+, source)>
and then one with more than one source:
<!ELEMENT pboc (title?, para+, source+)>
and one with the source as title:
<!ELEMENT pboc ((title?, para+, source+) |(source+, para+))>

The conflict

Some classic techniques

DocBook / TEI approach

Example of DocBook module

<!ENTITY % msgexplan.module "INCLUDE">
<![ %msgexplan.module; [
  <!ENTITY % local.msgexplan.attrib "">
  <!ENTITY % msgexplan.role.attrib "%role.attrib;">
  <!ELEMENT MsgExplan - - (Title?, (%component.mix;)+)>
  <!ATTLIST MsgExplan
    %common.attrib;
    %msgexplan.role.attrib;
    %local.msgexplan.attrib; >
]]>

Why is DocBook not good enough?

The solution: switchboards

What a simple switch file looks like

<!ENTITY % DO.ugv 'INCLUDE'  -- the root element -->
<!ENTITY % DO.UGV "INCLUDE" 
-- the general switch to include UGV like structures -->
  

Some switchboard stuff

<!-- General settings for  XML/SGML -->
<!ENTITY % DO.XML "IGNORE">
<![ %DO.XML; [
  <!-- Tag omission: illegal in XML -->
  <!ENTITY % NoTagOmit "">
  <!ENTITY % EndTagOmit "">
  <!ENTITY % BothTagOmit "">
  <!ENTITY % StartTagOmit "">
  <!ENTITY % SDATA "">
  <!ENTITY % NAME "NMTOKEN">
  <!ENTITY % NAMES "NMTOKENS">
  <!ENTITY % NMTOKEN "NMTOKEN">
  <!ENTITY % NMTOKENS "NMTOKENS">
  <!ENTITY % NUMBER "NMTOKEN">
  <!ENTITY % NUMBERS "NMTOKENS">
  <!ENTITY % NUTOKEN "NMTOKEN">
  <!ENTITY % NUTOKENS "NMTOKENS">
  <!ENTITY % DO.Exclusions "IGNORE">
]]>
<!-- End of XML stuff; now you get the SGML stuff -->
<!ENTITY % NoTagOmit "- -">
<!ENTITY % EndTagOmit "- o">
<!ENTITY % BothTagOmit "o o">
<!ENTITY % StartTagOmit "o -">
<!ENTITY % SDATA "SDATA">
<!-- Declared attribute values that are more specific in SGML than XML
  allows; %NMTOKEN(S) is just added for convenience -->
<!ENTITY % NAME "NAME">
<!ENTITY % NAMES "NAMES">
<!ENTITY % NMTOKEN "NMTOKEN">
<!ENTITY % NMTOKENS "NMTOKENS">
<!ENTITY % NUMBER "NUMBER">
<!ENTITY % NUMBERS "NUMBERS">
<!ENTITY % NUTOKEN "NUTOKEN">
<!ENTITY % NUTOKENS "NUTOKENS">
<!ENTITY % DO.Exclusions "INCLUDE">
<!ENTITY % DO.UGV "IGNORE">
<![ %DO.UGV; [
  <!ENTITY % DO.A.RunHeadAlg "INCLUDE">
  <!ENTITY % DO.A.sX.RunHeadAlg "IGNORE" -- but not in S1-16 -->
  <!ENTITY % DO.A.contentkind "INCLUDE">
  <!ENTITY % DO.A.s1.contentkind "IGNORE" -- but not for element S1 -->
  <!ENTITY % DO.A.s2.contentkind "IGNORE" -- nor for element S2 -->
  <!ENTITY % DO.cpart "INCLUDE">
  <![ %DO.cpart; [
    <!ENTITY % DO.J.annotgpX "INCLUDE"
     -- allow skipping in annotation group levels -->
  ]]><!-- %DO.cpart; -->
  <!ENTITY % DO.J.sX "INCLUDE" -- allow skip section levels S1-16 -->
  <!ENTITY % DO.s16 "INCLUDE" -- have 16 levels of sections -->
  <!ENTITY % DO.sX.IN.lq "INCLUDE" -- and in long quotations -->
  ...
  <!ENTITY % DO.wctt.IN.sX "INCLUDE">
]]><!-- %DO.UGV -->

A simple element declaration

<!ENTITY % DO.hnar "IGNORE">
<![ %DO.hnar; [
  <!ENTITY % O.hnar "%EndTagOmit;" -- tag omission -->
  <!ENTITY % M.hnar "%MF.atext;" -- mixed content -->
  <!ENTITY % C.hnar "(%M.hnar;)*" -- content model -->
  <!ENTITY % E.hnar "" -- no exclusions -->
  <!ELEMENT hnar %O.hnar; %C.hnar; %E.hnar;>
  <!ENTITY % DO.A.hnar.nr "%DO.A.nr;">
  <![ %DO.A.hnar.nr; [
    <!ENTITY % A.hnar.nr "%A.nr;">
    <!ATTLIST hnar %A.hnar.nr;>
  ]]>
]]>

Some complex declaration stuff

<!ENTITY % DO.pboc "IGNORE">
<![ %DO.pboc; [
  <!ENTITY % O.pboc "%EndTagOmit;">
  <!ENTITY % DO.OPT.ti.IN.pboc "IGNORE">
  <![ %DO.OPT.ti.IN.pboc; [
    <!ENTITY % OC.ti.IN.pboc "?">
  ]]>
  <!ENTITY % OC.ti.IN.pboc "">
  <!ENTITY % DO.ti.IN.pboc "IGNORE">
  <![ %DO.ti.IN.pboc; [
    <!ENTITY % HF.ti.IN.pboc "ti%OC.ti.IN.pboc;, ">
  ]]>
  <!ENTITY % HF.ti.IN.pboc "">
  <!ENTITY % H.pboc "%HF.Meta; %HF.ti.IN.pboc;">
  <!ENTITY % DO.REQ.para.IN.pboc "IGNORE">
  <![ %DO.REQ.para.IN.pboc; [
    <!ENTITY % OC.para.IN.pboc "+">
  ]]>
  <!ENTITY % OC.para.IN.pboc "*">
  <!ENTITY % REQ.para.IN.pboc "+">
  <!ENTITY % DO.MULT.source.IN.pboc "IGNORE">
  <![ %DO.MULT.source.IN.pboc; [
    <!ENTITY % REQ.source.IN.pboc "+">
    <!ENTITY % FAC.source.IN.pboc "*">
  ]]>
  <!ENTITY % REQ.source.IN.pboc "">
  <!ENTITY % FAC.source.IN.pboc "?">
  <!ENTITY % DO.REQ.source.IN.pboc "IGNORE">
  <![ %DO.REQ.source.IN.pboc; [
    <!ENTITY % OC.source.IN.pboc "%REQ.source.IN.pboc;">
  ]]>
  <!ENTITY % OC.source.IN.pboc "%FAC.source.IN.pboc;">
  <!ENTITY % DO.source.IN.H.pboc "IGNORE">
  <!ENTITY % DO.source.IN.T.pboc "IGNORE">

  <![ %DO.source.IN.H.pboc; [
    <![ %DO.source.IN.T.pboc; [
      <!ENTITY % B.pboc "((source%REQ.source.IN.pboc;, para%OC.para.IN.pboc;)
                         |(para%REQ.para.IN.pboc;, source%OC.source.IN.pboc;))">
    ]]>
    <!ENTITY % B.pboc "source%OC.source.IN.pboc;, para%OC.para.IN.pboc;">
  ]]>
  <![ %DO.source.IN.T.pboc; [
    <!ENTITY % B.pboc "para%OC.para.IN.pboc;, source%OC.source.IN.pboc;">
  ]]>
  <!ENTITY % B.pboc "para%OC.para.IN.pboc;">
  <!ENTITY % T.pboc "">
  <!ENTITY % C.pboc "(%H.pboc; %B.pboc; %T.pboc;)">
  <!ENTITY % E.pboc "">
  <!ELEMENT pboc %O.pboc; %C.pboc; %E.pboc;>
]]><!-- %DO.pboc; -->
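
The four possible values of %B.pboc; can be tabulated with a short sketch. It mirrors only the two DO.source.IN.?.pboc switches; the function name body_model is hypothetical, and the occurrence indicators default to what the declarations above yield when every DO.REQ.* and DO.MULT.* switch stays at "IGNORE".

```python
def body_model(source_in_head, source_in_tail,
               oc_para="*", req_para="+", oc_source="?", req_source=""):
    """Value of %B.pboc; for the two DO.source.IN.?.pboc switches."""
    if source_in_head and source_in_tail:
        # Both switches on: source may open or close the block.
        return (f"((source{req_source}, para{oc_para})"
                f"|(para{req_para}, source{oc_source}))")
    if source_in_head:
        return f"source{oc_source}, para{oc_para}"
    if source_in_tail:
        return f"para{oc_para}, source{oc_source}"
    return f"para{oc_para}"  # final default: no source at all

for head in (False, True):
    for tail in (False, True):
        print(head, tail, body_model(head, tail))
```

The order of the branches follows the order of the declarations: the most specific combination is declared first, so it is bound before the looser defaults are reached.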

Things we can rule

Normalizing it

We wrote a Perl script, inspired by NormDTD.exe (Richard Light) and based on DTD2HTML.PL (Earl Hood), that reads the source files and creates a normalized DTD.
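
What such a normalizer has to do can be illustrated with a minimal Python sketch. This is not our Perl script; it assumes that no literal or comment contains ">", handles parameter entities only, and keeps just the ELEMENT and ATTLIST declarations. The miniature source is hypothetical, with (#PCDATA)* standing in for the real content model of hnar.

```python
import re

REF = re.compile(r"%([\w.-]+);")
ENTITY = re.compile(r"ENTITY\s+%\s+(\S+)\s+([\"'])(.*?)\2", re.S)
TOKEN = re.compile(
    r"<!--.*?-->"                      # comment declaration: dropped
    r"|<!\[\s*(?P<key>[^\[]*?)\s*\["   # marked section start
    r"|(?P<end>\]\]>)"                 # marked section end
    r"|<!(?P<decl>[^>]*)>",            # any other declaration
    re.S,
)

def normalize(dtd):
    """Return the declarations that remain after resolving all switches."""
    entities, stack, out = {}, [], []

    def expand(text):
        return REF.sub(lambda m: entities[m.group(1)], text)

    for m in TOKEN.finditer(dtd):
        if m.group("key") is not None:
            # Inside an ignored section, nested sections are ignored as well.
            stack.append(any(stack) or expand(m.group("key")) != "INCLUDE")
        elif m.group("end"):
            if stack:
                stack.pop()
        elif m.group("decl") is not None and not any(stack):
            decl = m.group("decl")
            entity = ENTITY.match(decl)
            if entity:
                name, _, value = entity.groups()
                if name not in entities:  # first declaration wins
                    entities[name] = expand(value)
            else:
                decl = re.sub(r"--.*?--", "", decl, flags=re.S)  # inline comments
                out.append("<!" + " ".join(expand(decl).split()) + ">")
    return "\n".join(out)

SOURCE = """
<!-- hypothetical miniature of the switchboard -->
<!ENTITY % DO.XML "IGNORE">
<![ %DO.XML; [
  <!ENTITY % EndTagOmit "">
]]>
<!ENTITY % EndTagOmit "- o">
<!ENTITY % DO.hnar "INCLUDE">
<![ %DO.hnar; [
  <!ENTITY % C.hnar "(#PCDATA)*" -- stand-in for the real content model -->
  <!ELEMENT hnar %EndTagOmit; %C.hnar;>
]]>
"""

print(normalize(SOURCE))
print(normalize('<!ENTITY % DO.XML "INCLUDE">' + SOURCE))
```

The first call yields the SGML form with tag omission, the second the XML form, because the prepended switch is bound before the default in the source is read.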

Why normalize plain vanilla SGML?

  • Not every parser approves of this kind of parameter entity use; some regard it as obfuscation.
  • If element x is not declared in the DTD but is allowed in a content model, some parsers give error messages.
  • If they don't, their behaviour can't be predicted.
  • A single file with no comments or marked sections parses faster and is easier to transport.

An old reason to normalize

At first our source DTD had empty tokens until the normalizer removed the undeclared elements:
(a, %b.IN.x;, c)
where %b.IN.x; could be "b" or "" (empty). We now write
(a, %HF.b.IN.x; c)
where %HF.b.IN.x; can be "b," or "".
So now we can parse our source DTD with NSGMLS, even before normalizing.
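
The difference between the two styles can be checked mechanically. The sketch below is only an illustration for flat, comma-separated groups; fill and has_empty_token are hypothetical helpers.

```python
def fill(model, name, value):
    """Substitute the value of one parameter entity into a content model."""
    return model.replace(f"%{name};", value)

def has_empty_token(model):
    """True if a flat, comma-separated group contains an empty token."""
    tokens = model.strip("()").split(",")
    return any(not token.strip() for token in tokens)

OLD = "(a, %b.IN.x;, c)"    # connector written outside the entity
NEW = "(a, %HF.b.IN.x; c)"  # connector carried inside the entity value

old_off = fill(OLD, "b.IN.x", "")       # "(a, , c)"
new_off = fill(NEW, "HF.b.IN.x", "")    # "(a,  c)"
new_on = fill(NEW, "HF.b.IN.x", "b,")   # "(a, b, c)"

print(old_off, has_empty_token(old_off))
print(new_off, has_empty_token(new_off))
print(new_on, has_empty_token(new_on))
```

Only the old style produces an empty token when the element is switched off, which is why the old source could not be parsed before normalizing.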

Why schemas don't matter

How to schematize a switch:

If the DTD stuff looks like:
<![ %DO.OPT.title.IN.pboc; [
  <!ENTITY % OC.title.IN.pboc "?">
]]>
<!ENTITY % OC.title.IN.pboc "">
<!ENTITY % H.pboc "title%OC.title.IN.pboc;, ">
then the schema stuff in the internal subset will look like:
<![ %DO.OPT.title.IN.pboc; [
  <!ENTITY MINOC.title.IN.pboc "0">
]]>
<!ENTITY MINOC.title.IN.pboc "1">
and in the document:
<complexType name="pboc">
  <element ref="title" minOccurs="&MINOC.title.IN.pboc;"/>
  ...
</complexType>
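
The same idea extends to the other occurrence indicators. As a sketch, assuming the attribute names and the "unbounded" keyword of the XML Schema Recommendation, each DTD indicator maps to a minOccurs/maxOccurs pair; the names OCCURS and element_particle are hypothetical.

```python
# DTD occurrence indicator -> (minOccurs, maxOccurs)
OCCURS = {
    "":  ("1", "1"),          # exactly once
    "?": ("0", "1"),          # optional
    "+": ("1", "unbounded"),  # one or more
    "*": ("0", "unbounded"),  # zero or more
}

def element_particle(name, indicator):
    """Render an element reference with explicit occurrence bounds."""
    min_occurs, max_occurs = OCCURS[indicator]
    return (f'<element ref="{name}" '
            f'minOccurs="{min_occurs}" maxOccurs="{max_occurs}"/>')

print(element_particle("title", "?"))
print(element_particle("para", "+"))
```

A switch that chooses between "" and "?" in the DTD therefore only has to choose between "1" and "0" for minOccurs in the schema.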

How to RELAX a switch

If the DTD stuff still looks like:
<![ %DO.OPT.title.IN.pboc; [
  <!ENTITY % OC.title.IN.pboc "?">
]]>
<!ENTITY % OC.title.IN.pboc "">
<!ENTITY % H.pboc "title%OC.title.IN.pboc;, ">
then the RELAX stuff in the internal subset will look like:
<![ %DO.OPT.title.IN.pboc; [
  <!ENTITY OC.title.IN.pboc "?">
]]>
<!ENTITY OC.title.IN.pboc "">
and in the document:
<elementRule role="pboc">
  <sequence>
    <ref label="title" occurs="&OC.title.IN.pboc;"/>
    ...
  </sequence>
</elementRule>

What is it all about?

Conclusions