Meaning and Interpretation of Markup
not as simple as you think
C. M. Sperberg-McQueen
Claus Huitfeldt
Allen Renear
Extreme Markup Languages
Montréal
15 August 2000
→Overview
- introduction
- a straw man proposal
- problems with the straw man proposal
- framework for a better solution
- outlook for the future
←→Function of markup
- Markup is not random; it has meaning.
- What does it mean to have meaning?
- How does markup have meaning?
←→Relevance / applications
Why worry about this question?
- for better markup language documentation
- for better QA (verification)
- for better automated processes (translation, normalization, query)
- to provide a way to survey current practice (relevant for
software developers)
- ... and because it's interesting.
←→Related work
- Simons 1997 (translating between
marked up texts and database systems)
- Sperberg-McQueen and Burnard 1995 (informal
introduction to the TEI)
- Langendoen and Simons 1995 (also TEI)
- Huitfeldt and others in Bergen
- Renear and others at Brown University
- Welty and Ide 1997 (querying using knowledge representation
system)
- Wadler 1999 (XSLT)
- Ramalho et al. 1999 (semantic verification)
←→Henry Laurens to Lord William Campbell, 1775
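[It was be] [For] When we applied
to Your Excellency for leave to adjourn it was because
we foresaw that we [were] ↑should continue↓
wasting our own time ...
(Square brackets mark words struck through in the MS; arrows mark
words added above the line.)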
Papers of Henry Laurens
←→The markup
<p><del>It was be</del> <del>For</del>
When we applied to Your Excellency
for leave to adjourn it was because
we foresaw that we <del>were</del>
<add>should continue</add>
wasting our own time ... </p>
←→Markup-related Inferences
- Some words are deleted in the MS.
- Some words are added above the line in the MS.
- The whole thing is a paragraph.
←→How does markup mean?
- Because markup means something, ...
- we know certain things.
- I.e. because we see certain markup,
- we are allowed (licensed) to make certain inferences.
I.e. markup licenses inferences.
The meaning of markup is the set of inferences it licenses.
←→Specifying the meaning of markup
So ...
- what inferences are licensed by each element type?
- by each attribute?
- for each location? (i.e. how do you associate the meaning with
a particular instance?)
←→A straw-man proposal
Consider the example:
<p><del>It was be</del> <del>For</del>
When we applied ... </p>
- <del> denotes a property (deletedness)
- from the start-tag ...
- ... to the end-tag
I.e. elements provide information about properties of their contents.
←→A Prolog illustration
Elements and characters represented as nodes:
node(location,nodetype), where nodetype is element(gi) or pcdata(char).
node([1,5,2],element(p)).
node([1,5,2,1],element(del)).
node([1,5,2,1,1],pcdata("I")).
node([1,5,2,1,2],pcdata("t")).
node([1,5,2,1,3],pcdata(" ")).
node([1,5,2,1,4],pcdata("w")).
node([1,5,2,1,5],pcdata("a")).
node([1,5,2,1,6],pcdata("s")).
node([1,5,2,1,7],pcdata(" ")).
node([1,5,2,1,8],pcdata("b")).
node([1,5,2,1,9],pcdata("e")).
node([1,5,2,2],pcdata(" ")).
node([1,5,2,3],element(del)).
node([1,5,2,3,1],pcdata("F")).
node([1,5,2,3,2],pcdata("o")).
node([1,5,2,3,3],pcdata("r")).
←→[Referring to nodes]
We refer to nodes with numeric path expressions.
- Not ideal
- because it is meaningful, not opaque.
- An opaque name would be better ...
- but ID is not a required attribute.
Think of it as a kind of pointing ...
... note also that XML nodes are nodes in trees.
←→Attributes
Attributes are triples: element, attribute name, value:
attr([1,5,2],id,implied).
attr([1,5,2],n,implied).
attr([1,5,2],lang,implied).
attr([1,5,2],rend,implied).
attr([1,5,2],teiform,"p").
←→Inferences from elements
- If each element denotes some property ...
- ... use the GI to name that property.
E.g. for this paragraph:
p([1,5,2]).
del([1,5,2,1]).
del([1,5,2,3]).
add([1,5,2,97]).
←→Inferences from elements
- Nice notation for known properties;
- less nice for known locations. So we use
property_applies(p,[1,5,2]).
property_applies(del,[1,5,2,1]).
property_applies(del,[1,5,2,3]).
property_applies(add,[1,5,2,97]).
←→What does property del mean?
Define properties using skeleton sentences:
"_____ is a paragraph," or "_____
has been deleted (or marked as deleted) in the source," ...
Fill the blanks with a reference to the element.
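Filling the blanks can be sketched in Prolog. This is a minimal illustration, not the authors' implementation: the predicates skeleton/2 and sentence/2, the wording "The element at", and the use of format/3 with an atom(...) sink (available in SWI-Prolog) are all assumptions.

```prolog
% skeleton(Property, Template): a sentence skeleton; ~w marks the blank.
% The wording of the templates is illustrative only.
skeleton(p,   'The element at ~w is a paragraph.').
skeleton(del, 'The element at ~w has been deleted (or marked as deleted) in the source.').

% Properties of the example paragraph, as on the preceding slide.
property_applies(p,   [1,5,2]).
property_applies(del, [1,5,2,1]).
property_applies(del, [1,5,2,3]).

% sentence(?Loc, -Sentence): fill the blank of the matching skeleton
% with a reference to the element (here, its numeric path).
sentence(Loc, Sentence) :-
    property_applies(Property, Loc),
    skeleton(Property, Template),
    format(atom(Sentence), Template, [Loc]).
```

The query ?- sentence([1,5,2,1], S). binds S to the del skeleton with the path written into the blank.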
←→Inferences from attributes
<p lang="eng"><del>It was be</del> <del>For</del> When we applied
to Your Excellency for leave to adjourn it was because
we foresaw that we <del>were</del> <add>should continue</add>
wasting our own time ... </p>
The <p> is in English.
In Prolog:
property_applies(english,[1]).
or
english([1]).
←→Two-argument predicates
A simpler way in Prolog:
language([1],english).
property_applies(language,[1],english).
Our way:
property_applies(language(english),[1]).
←→Propagating inferences downwards (inherited properties)
Properties are inherited:
("It was be" was deleted) → ("It" was deleted) →
(The letter "I" of that word was deleted).
Side question: how far does a property propagate?
- The letter I was deleted.
- *The letter I is in English.
←→Automating inheritance
Inheritance is a general pattern, not element-type specific.
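In Prolog:
% An element has the property named by its own GI ...
infer(Property,Loc) :-
    node(Loc,element(Property)).
% ... and so does every descendant of that element.
infer(Property,Loc) :-
    node(Anc,element(Property)),
    descendant(Loc,Anc).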
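These rules rely on descendant/2, which the slides do not define. A minimal sketch, assuming locations are the numeric paths introduced earlier, so that a descendant's path has its ancestor's path as a proper prefix. The definition of descendant/2 and the use of atoms rather than strings for character data are assumptions made here.

```prolog
:- use_module(library(lists)).   % append/3

% A few node facts from the earlier listing (characters written as atoms).
node([1,5,2],     element(p)).
node([1,5,2,1],   element(del)).
node([1,5,2,1,1], pcdata('I')).
node([1,5,2,1,2], pcdata(t)).
node([1,5,2,2],   pcdata(' ')).

% descendant(?Loc, +Anc): Loc is a known node whose path properly
% extends the path Anc. (Hypothetical definition.)
descendant(Loc, Anc) :-
    node(Loc, _),
    append(Anc, [_|_], Loc).

% The two inheritance rules from the slide.
infer(Property, Loc) :-
    node(Loc, element(Property)).
infer(Property, Loc) :-
    node(Anc, element(Property)),
    descendant(Loc, Anc).
```

With these facts, ?- infer(del,[1,5,2,1,1]). succeeds (the letter "I" inherits deletedness), while ?- infer(del,[1,5,2,2]). fails.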
←→Inheritance for attributes
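% An attribute with a specified value gives its element the property Att(Val) ...
infer(Prop,Loc) :-
    attr(Loc,Att,Val),
    \+ Val = implied,
    Prop =.. [Att,Val].
% ... and the same property is inherited by every descendant.
infer(Prop,Loc) :-
    attr(Anc,Att,Val),
    \+ Val = implied,
    Prop =.. [Att,Val],
    descendant(Loc,Anc).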
←→Summary
Summarizing this first straw-man proposal, we can say:
- Each element type E signals some property prop(E).
- For each instance of E, if P = prop(E), then P(X) holds
for X = E and for every descendant X of E.
- Each attribute A signals a two-argument property, prop(A).
- If E has A = "V" and P = prop(A), then property P(V) is
true of E and of its descendants.
←→Summary
- Inferences licensed for some location
L can be generated: for each element
E which is an ancestor of L in the
document tree, identify each property P which is
attributed to E, and assert that L has
that property: P(L) for one-argument properties, or
P(V,L) for two-argument properties.
←→Illustration
In Prolog, if the example paragraph is node 1.5.2:
?- infer(Property,[1,5,2]).
Property = p ->;
Property = doc ->;
Property = docbody ->;
Property = teiform([112]) ->;
Property = id([72,76,49,48,51,48,53]) ->;
Property = lang([101,110,103]) ->;
no
?-
←→Properties of children
?- infer(Property,[1,5,2|Tail]).
Property = p
Tail = [] ->;
Property = del
Tail = [1] ->;
Property = del
Tail = [3] ->;
Property = del
Tail = [95] ->;
Property = add
Tail = [97] ->;
Property = person
Tail = [184] ->;
Property = del
Tail = [318] ->;
Property = add
Tail = [320]
->
←→Finding a property
What locations have the property del?
?- infer(del,Loc).
Loc = [1,5,2,1] ->;
Loc = [1,5,2,3] ->;
Loc = [1,5,2,95] ->;
Loc = [1,5,2,318] ->;
Loc = [1,5,2,348] ->;
Loc = [1,5,2,717] ->;
Loc = [1,5,2,719,57] ->;
Loc = [1,5,2,866] ->;
Loc = [1,5,2,917] ->
...
←→Problems with the straw-man proposal
Nice in places, but:
- Part and whole are not the same.
- Some inferences may be incompatible.
- Some properties take n arguments (n > 1).
- Predicates in real systems frequently
take arguments other than "the contents of this element" and "the value
of this attribute on this element".
←→Distributed and non-distributed features
What is true of the whole is not always true of the parts.
"I have a dream" is a sentence; the word "have" is not.
Distinguish distributed properties (del) from
non-distributed properties (sentence-hood).
←→Non-distributed properties
Non-distributed properties are true of the element as a whole,
but not true of all of the content. From
<P>Reader, I married him.</P>
we can infer the existence of one paragraph, but not that the word
"Reader" is itself a paragraph. It is, however, within a paragraph.
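One way to record this distinction in the Prolog illustration is sketched below. It is an assumption-laden sketch, not part of the original proposal: distributed/1, holds/2, within/2, and the tiny sample tree are all hypothetical names and data.

```prolog
:- use_module(library(lists)).   % append/3

% A tiny sample tree: a paragraph containing one deletion.
node([1],     element(p)).
node([1,1],   element(del)).
node([1,1,1], pcdata('I')).
node([1,2],   pcdata(' ')).

% distributed(Property): the property is true of the parts as well
% as of the whole. Paragraph-hood is deliberately not listed.
distributed(del).

descendant(Loc, Anc) :-
    node(Loc, _),
    append(Anc, [_|_], Loc).

% holds(Property, Loc): Property is true of the node at Loc itself.
holds(Property, Loc) :-
    node(Loc, element(Property)).
holds(Property, Loc) :-
    distributed(Property),
    node(Anc, element(Property)),
    descendant(Loc, Anc).

% within(Property, Loc): Loc lies inside an element with Property,
% whether or not the property is distributed.
within(Property, Loc) :-
    node(Anc, element(Property)),
    descendant(Loc, Anc).
```

Here ?- holds(del,[1,1,1]). succeeds and ?- holds(p,[1,1,1]). fails, while ?- within(p,[1,1,1]). still succeeds: the letter is not a paragraph, but it is within one.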
←→Distributed properties
Consider this (from Tristram Shandy):
<hi rend="gothic">And this Indenture
further witnesseth</hi> that the said
<hi rend="italic">Walter Shandy</hi>,
merchant, in consideration of the said
intended marriage ...
←→A synonymous example:
Or equivalently*:
<P><HI REND="gothic">And</HI>
<HI REND="gothic">this</HI>
<HI REND="gothic">Indenture</HI>
<HI REND="gothic">further</HI>
<HI REND="gothic">witnesseth</HI> that the said
<HI REND="italic">Walter Shandy</HI>,
merchant, in consideration of the said
intended marriage ... </P>
These examples license the same* inferences.
←→Distributed properties
In general: If x marks a distributed property, then
- adjacent <x> elements may be joined
- one <x> element may be split
<x>abc</x><x>def</x> ≡
<x>abcdef</x>
- occurrences of <x> are not usefully countable.
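The joining rule can be sketched as a normalization over a list-based content representation. This is a hypothetical illustration: the terms text/1 and el/2, the predicate join/2, and distributed/1 are assumptions, and only strictly adjacent siblings are joined (intervening white space, as in the Tristram Shandy example, is not handled).

```prolog
:- use_module(library(lists)).   % append/3

% Content is a list of items: text(Atom) or el(GI, Content).
distributed(x).

% join(+Items, -Joined): merge adjacent sibling elements of the same
% distributed type, and merge adjacent text items.
join([], []).
join([el(G, C1), el(G, C2) | Rest], Joined) :-
    distributed(G), !,
    append(C1, C2, C),
    join([el(G, C) | Rest], Joined).
join([text(A), text(B) | Rest], Joined) :- !,
    atom_concat(A, B, AB),
    join([text(AB) | Rest], Joined).
join([el(G, C) | Rest], [el(G, JC) | Joined]) :- !,
    join(C, JC),          % normalize the element's own content
    join(Rest, Joined).
join([Item | Rest], [Item | Joined]) :-
    join(Rest, Joined).
```

For example, ?- join([el(x,[text(abc)]), el(x,[text(def)])], J). yields J = [el(x,[text(abcdef)])]. Two adjacent elements of a type not declared distributed are left as they are.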
←→Overrides and incompatibilities
Consider:
<doc lang="en">
<p>Wittgenstein wrote:
<q lang="de"><ital>Die Welt ist alles,
was der Fall ist.</ital></q>
It is hard to escape, at first reading,
the suspicion that Wittgenstein is guilty
here of a gross platitude; it is only
after reading the rest of the
<title lang="la">Tractatus</title> that on returning
to its famous first sentence one appreciates
the depths of its intension.</p>
</doc>
For this example, the straw man leads to contradictions.
←→Inferences
We infer:
- The contents of <doc> are in English.
- The contents of the <q> are in German.
- And since attribute values are inherited,
the contents of the <q> are in English.
?- infer(lang("en"),[1]).
yes
?- infer(lang("de"),[1,1,22]).
yes
?- infer(lang("en"),[1,1,22]).
yes
?-
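One possible repair, sketched here as an assumption rather than as the authors' solution, is to let the nearest specified value override inherited ones. The predicates self_or_ancestor/2 and effective/3 are hypothetical, values are written as atoms, and unspecified (implied) attributes are simply left out.

```prolog
:- use_module(library(lists)).   % append/3

% Attribute facts for the example, using the paths from the queries above.
attr([1],      lang, en).
attr([1,1,22], lang, de).

% self_or_ancestor(+Loc, -Anc): Loc itself first, then its ancestors,
% nearest first, obtained by dropping trailing path steps.
self_or_ancestor(Loc, Loc).
self_or_ancestor(Loc, Anc) :-
    append(Parent, [_], Loc),
    self_or_ancestor(Parent, Anc).

% effective(+Att, +Loc, ?Val): Val is the value of Att on the nearest
% self-or-ancestor of Loc that specifies Att; farther values are ignored.
effective(Att, Loc, Val) :-
    self_or_ancestor(Loc, Anc),
    attr(Anc, Att, V),
    !,
    Val = V.
```

Under this rule ?- effective(lang,[1,1,22],de). succeeds and ?- effective(lang,[1,1,22],en). fails, so the contradiction disappears. Whether overriding is the right behavior has to be decided attribute by attribute.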
←→N-ary predicates
Consider the TEI <title> element:
- The contents of this element are a title.
- The contents of this element are the title
of the item described by the enclosing <bibl>
element.
←→n-ary predicates
property_applies(title_of,[1,2,5,3,4],[1,2,5,3]).
property_applies(title_of,[[1,2,5,3,4],[1,2,5,3]]).
property_applies(title_of([1,2,5,3]),[1,2,5,3,4]).
property_applies(has_title([1,2,5,3,4]),[1,2,5,3]).
N.B. these vary in convenience, but all carry the same
basic information.
←→Arguments of predicates
In the straw man proposal, all arguments are the same:
- contents of this element
- value of this attribute
In common markup languages, other arguments may be needed; e.g.
- the nearest ancestor of type <bibl>
Such terms start from some reference point; we call them
deictic expressions.
←→Deictic Expressions
Markup languages vary in the forms of deixis they require.
- For simple languages,
contents(this) or
value(attribute-name,this) may suffice.
- For others (e.g. TEI), we will need
first-ancestor(ancestor-gi,this) etc.
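A deictic expression such as first-ancestor can be sketched over the numeric paths used earlier. The predicate names first_ancestor/3 and title_of/2 and the listBibl node are assumptions; the title and bibl paths are those of the title_of example.

```prolog
:- use_module(library(lists)).   % append/3

node([1,2,5],     element(listBibl)).   % hypothetical container
node([1,2,5,3],   element(bibl)).
node([1,2,5,3,4], element(title)).

% first_ancestor(+GI, +This, -Anc): Anc is the nearest proper ancestor
% of This whose element type is GI; fails if there is none.
first_ancestor(GI, This, Anc) :-
    append(Parent, [_], This),
    (   node(Parent, element(GI))
    ->  Anc = Parent
    ;   first_ancestor(GI, Parent, Anc)
    ).

% The second reading of <title>: the deictic expression supplies
% the second argument of the predicate.
title_of(Title, Bibl) :-
    node(Title, element(title)),
    first_ancestor(bibl, Title, Bibl).
```

The query ?- title_of(T,B). gives T = [1,2,5,3,4] and B = [1,2,5,3].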
←→Requirements for deixis
What do we need in a language for deixis? At least:
- ancestor-dependency (TEI <hi>, <foreign>, <head>).
- ordinal position (first para, last para, ...)
- milestones (TEI <pb>)
- linking: out-of-line or 'standoff' markup: predication
about what the link points to
- upward propagation (attribute-free languages)
More ... ?
←→Languages for deictic expressions
- TEI extended-pointer notation
- XPath
- Caterpillar languages (Brüggemann-Klein/Wood 2000)
Conjecture: only a subset required.
Open question: how big a subset?
←→Deixis as complexity measure
Note:
- Simple languages have few, simple deictic expressions.
- More complex languages have more, and more complex, deictic expressions.
- Use the size of the deictic-expression language, or the complexity
of its expressions, as a complexity measure?
←→A framework (generic parts)
To describe the meaning of the markup in a document, we will need:
- representation of the document (Prolog
representation / DOM / ...)
- generic routines for applying skeleton sentences to document
and generating statements about it (ad hoc? XSLT?)
←→Framework (DTD-specific parts)
For each markup language:
- set of sentence skeletons describing the meaning to be attached
to each construct
- set of deictic expressions to fill blanks in sentence skeletons
- categorization of predicates according to inference rules
(distributed / non-distributed, ...)
- (optionally) rules allowing further inferences from the
properties directly predicated by the markup (e.g. "if something is an
author, and not identified as a corporate author, then it is a person",
or "if something is a person, then it is human")
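The optional further-inference rules map directly onto Prolog clauses. A minimal sketch, assuming hypothetical facts author/1 and corporate_author/1 have already been generated from the markup. The negation is negation as failure, so "not identified as a corporate author" means "not recorded as one".

```prolog
% Properties directly predicated by the markup (hypothetical data).
author(a1).
author(a2).
corporate_author(a2).

% "If something is an author, and not identified as a corporate
% author, then it is a person."
person(X) :-
    author(X),
    \+ corporate_author(X).

% "If something is a person, then it is human."
human(X) :-
    person(X).
```

With these facts, ?- human(a1). succeeds and ?- human(a2). fails.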
←→Open questions / further work
- Prolog instantiation of framework
- language for deictic expressions (pointers, overlap tricks)
- survey of existing markup languages
- sentence skeletons for TEI, HTML, ISO 12083, DocBook, ...
- inference-relevant properties of properties (+/- distributed,
mutual exclusion, ..., unbounded set?)
- exploitation for QA
- exploitation for query
- exploitation for normalization
- semi-automatic translation?