XML Europe 2001, 21-25 May 2001
Internationales Congress Centrum (ICC)
Berlin, Germany

Topic Maps in the News

Daniel Rivers-Moore <daniel.rivers-moore@rivcom.com>

ABSTRACT

NewsML is a powerful new XML-based standard for the management of news items in all media throughout the news lifecycle. It is a flexible standard that has built-in mechanisms for its own controlled evolution as new requirements emerge in the fast-changing environments of the news industry and modern communication technologies. Topic Maps are a general-purpose mechanism for the definition, navigation and manipulation of information objects based on the meaning of the information they contain. One of the strengths of Topic Maps is that they provide a logical overlay to a mass of information resources, and make those resources accessible and navigable on the basis of the topics they cover, and whatever relationships between those topics may be of interest. Topic Maps recognise that the meanings of, and relationships between, things are context-dependent and, through their 'scoping' mechanism, allow that context-dependence to be formalised and used as an aid to producing meaningful views onto, subsets of, and navigable paths through, information resources. The underlying concepts of Topic Maps are built into NewsML at many levels. Firstly, they are used to make NewsML flexible and extensible, through the use of 'controlled vocabularies', which take the form of sets of named topics. Secondly, a NewsML document can carry with it explicit information about the topics that occur within the actual content of the news it conveys. NewsML documents thus provide a perfect complement to topic maps, in that they carry with them an awareness of the topics to which they relate, and also of who (or what system) makes the assertion that these topics are relevant, in what way, and with what degree of confidence. NewsML and Topic Maps between them provide a rich hybrid between human-processable and machine-processable aspects of information.
They can be used as an infrastructure to support human-computer synergy in deriving knowledge and understanding from large quantities of semi-structured content, with the recognition that knowledge is always partial and understanding always perfectible. This presentation will provide an overview of the components and structure of NewsML documents and Topic Maps, and will show how, because of this powerful synergy, Topic Maps and NewsML can combine to make a newsfeed into a knowledge engine, and a news archive into a navigable knowledge repository.

1. Alternative syntaxes with shared semantics

Topic Navigation Maps became an ISO Standard in January 2000. This standard provides an SGML DTD that uses a number of features of HyTime. ISO topic map documents are thus SGML documents which, in order to be fully understood, require some understanding of the HyTime specification. Little more than a year later, in February 2001, XML Topic Maps (XTM) Version 1.0 was formally approved by TopicMaps.Org. This specification provides an XML DTD and allows topic maps to be expressed in XML documents that can be parsed using XML processors.

One important feature of the XTM specification is that it includes not only a definition of syntax (in the form of an XML DTD), but also a formal Conceptual Model, which underpins the syntax by describing the abstract objects that the syntax is designed to represent, and a set of processing requirements which conforming implementations must meet when processing topic maps. The specification states that a topic map document may be expressed in some syntax other than XTM, provided that it conforms to the specification, which is to say that it carries the semantics described in the XTM Conceptual Model, and can be processed by processors that comply with the XTM processing requirements. Once the work has been done to show how the ISO topic map syntax maps to the XTM Conceptual Model, and how processors need to act on the elements of ISO topic map syntax in order to conform to the XTM processing requirements, it will be established that ISO topic maps and XTM are alternative syntaxes that can be used to express the same semantic intent, and that conversions can be carried out between ISO topic map documents and XTM documents without loss of information or meaning.

2. Topic Maps and knowledge representation

We have mentioned that in its conceptual model, XTM provides a formal description of the semantics expressed by topic map documents. At the heart of this conceptual model is the notion of a subject, which is defined as 'anything that can be conceived of or spoken about by a human being'. A subject may be either a resource, which is an object that exists within the computer system, or a non-addressable subject, which is anything else. Each topic in a topic map stands for precisely one subject, and it is the existence of the topic that makes it possible for characteristics to be assigned to the subject within the topic map. In other words, the topic, by acting as a surrogate for the subject within the system, makes the subject accessible to the system, and allows things to be said about it, inferences to be made concerning it, and so on. It is this powerful capability of topic maps that makes them a strong candidate for being the basis for knowledge representation and the development of the so-called Semantic Web.

A further refinement, and a very important one, is that every assignment of a characteristic to a topic occurs within a particular 'scope', which is itself defined as a set of topics. In this way, topic maps explicitly recognise the context-dependence of the truth of any assertion or, to put it another way, the fact that knowledge is always relative, and that our knowledge of the world is always reflective of our point of view. In this sense, topic maps allow the mutual coexistence of alternative viewpoints on the world, which is an essential feature of any knowledge system.
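
The idea of a topic carrying characteristics, each valid within a scope that is itself a set of topics, can be illustrated with a minimal Python sketch. The class names, the `kind` labels and the rule used here for deciding when a scope applies (the scope must be contained in the viewer's context, and an empty scope always applies) are assumptions made for illustration; the XTM specification does not prescribe this data structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Characteristic:
    kind: str                         # illustrative label, e.g. "name"
    value: str
    scope: frozenset = frozenset()    # ids of the topics that make up the scope


@dataclass
class Topic:
    topic_id: str                     # the topic is the surrogate for one subject
    characteristics: list = field(default_factory=list)

    def assign(self, kind, value, scope=()):
        """Assign a characteristic to this topic within the given scope."""
        self.characteristics.append(Characteristic(kind, value, frozenset(scope)))

    def in_scope(self, kind, context):
        """Return the values of `kind` whose scope is contained in `context`."""
        ctx = frozenset(context)
        return [c.value for c in self.characteristics
                if c.kind == kind and c.scope <= ctx]


# Hypothetical sample data: one subject, three names in different scopes.
germany = Topic("germany")
germany.assign("name", "Germany", scope={"english"})
germany.assign("name", "Deutschland", scope={"german"})
germany.assign("name", "DE")          # unconstrained: valid in every context

print(germany.in_scope("name", {"german"}))   # ['Deutschland', 'DE']
```

The same topic thus presents different names to different viewpoints, without either viewpoint invalidating the other.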

It is because of these important features that topic maps in general, and XTM in particular, have been attracting such attention in the XML and knowledge management communities. XTM shares many features with another specification, namely Resource Description Framework (RDF), which has been proposed by the World Wide Web Consortium (W3C) as a potential basis for the Semantic Web. Like XTM, RDF also allows assertions to be made about, and properties to be assigned to, information objects. But RDF sees every object as a resource, and does not have the key concept of the non-addressable subject which exists inherently outside the computer system and yet has to be represented in some way within the system. The topic/subject distinction, and the recognition that only some subjects are resources while others are inherently non-addressable, is perhaps the key difference between XTM and RDF and what makes XTM a true knowledge technology as opposed to an information technology (my thanks to Paul Prueitt for pointing out this key distinction at the Knowledge Technologies 2001 conference in Austin, Texas). Nor does RDF have a mechanism equivalent to the scoping mechanism provided by the topic maps paradigm.

Collaborative discussions are currently under way between the RDF and XTM communities, which will no doubt clarify these issues and perhaps lead to a creative synthesis between the two bodies of work. Such a result will certainly contribute to bringing the advent of the Semantic Web closer.

3. An XML standard for multimedia news

In October 2000, the International Press Telecommunications Council (IPTC) approved NewsML version 1.0, an XML-based standard for the management and delivery of multimedia news items. The IPTC has been in the business of developing standards for the news industry since the 1960s. One of its widely used specifications, News Industry Text Format (NITF), is an XML Document Type Definition (DTD) for the textual content of a news story.

NewsML was designed to meet a number of quite demanding requirements, including:

These requirements have led to NewsML having a number of interesting features which not only make it suitable for the immediate needs of the news industry, but could also turn any news archive, or live news feed, into a knowledge resource. The features that make NewsML special include:

4. Controlled extensibility

[Figure: The structure of a NewsML news item]

As the illustration shows, a NewsML news item can contain nested components, down to arbitrary levels of nesting. At each level components may be equivalent (meaning that they carry the same information but in different forms, such as different languages or formats) or complementary (meaning that they carry additional but related information). The complementary components are pointed out in the diagram, and you will notice that each plays a named role. In order for NewsML to be able to support arbitrarily complex news items in arbitrary mixtures of media, including media that have yet to be invented, it has to be possible to invent new roles as new ways of delivering news are developed. On the other hand, the invention of new roles in an arbitrary and haphazard manner would not allow NewsML systems to be interoperable. If the receiving system were unaware of the meanings of the roles supported by the sending system, the news could not be processed in an appropriate way. There is therefore a requirement for controlled extensibility.

This controlled extensibility has been achieved in NewsML through the requirement that metadata property values are drawn from controlled vocabularies. Each metadata property is associated with a vocabulary, from which the allowed values for that property are drawn. The items in the vocabulary may be identified by one or more naming schemes, and may have rich descriptions, or further properties of their own. Syntactically, the vocabularies are expressed as TopicSet elements, and each item in the vocabulary takes the form of a Topic element. Here is an extract from the default vocabulary for Roles, published by the IPTC as an accompaniment to the NewsML version 1.0 specification:

<TopicSet>
  <Topic Duid="role1">
    <TopicType FormalName="Role" Scheme="IptcType"/>
    <FormalName Scheme="IptcRole">Main</FormalName>
    <Description xml:lang="en">Principal component.</Description>
  </Topic>
  ...
  <Topic Duid="rol5">
    <TopicType FormalName="Role" Scheme="IptcType"/>
    <FormalName Scheme="IptcRole">Thumbnail</FormalName>
    <Description xml:lang="en">A news-component substitute,
      smaller than the original, used for convenience.
    </Description>
  </Topic>
  ...
</TopicSet>
		

This mechanism has considerable expressive power, and considerable rigour. Notice that each Topic in the TopicSet also has a TopicType subelement. In this case, the Topics are Roles, and this is stated by reference to the object whose formal name is Role in the IptcType naming scheme, within the controlled vocabulary that has previously been declared as being applicable to TopicType elements. Here is the relevant extract from the IPTC's default vocabulary for TopicTypes. You will see that one of these has the formal name of Role, and an English-language help text description which provides more information about what it is.

<TopicSet>
  ...
  <Topic Duid="TopicTypes.NewsML.Role">
    <TopicType FormalName="TopicType"/>
    <FormalName Scheme="IptcTopicType">Role</FormalName>
    <Description xml:lang="en">The distinguishing characteristic of a NewsComponent,
      or its relationship to the others with which it is associated within the same
      containing NewsComponent.
    </Description>
  </Topic>
  ...
</TopicSet>
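
To make the notion of controlled extensibility concrete, here is a minimal Python sketch of how a receiving system might check a role against a vocabulary. It assumes the TopicSet has already been retrieved as a string; the abridged sample data and the function names `load_vocabulary` and `is_allowed` are illustrative and are not part of NewsML.

```python
import xml.etree.ElementTree as ET

# Abridged sample data modelled on the Roles extract shown earlier.
ROLES_XML = """
<TopicSet>
  <Topic>
    <TopicType FormalName="Role" Scheme="IptcType"/>
    <FormalName Scheme="IptcRole">Main</FormalName>
    <Description xml:lang="en">Principal component.</Description>
  </Topic>
  <Topic>
    <TopicType FormalName="Role" Scheme="IptcType"/>
    <FormalName Scheme="IptcRole">Thumbnail</FormalName>
    <Description xml:lang="en">A smaller substitute for a news component.</Description>
  </Topic>
</TopicSet>
"""


def load_vocabulary(topicset_xml, topic_type):
    """Map (Scheme, FormalName) to a description, for Topics of one TopicType."""
    vocab = {}
    for topic in ET.fromstring(topicset_xml).iter("Topic"):
        ttype = topic.find("TopicType")
        if ttype is None or ttype.get("FormalName") != topic_type:
            continue
        description = (topic.findtext("Description") or "").strip()
        for name in topic.findall("FormalName"):
            vocab[(name.get("Scheme"), name.text.strip())] = description
    return vocab


def is_allowed(vocab, scheme, formal_name):
    """A value is allowed only if the vocabulary names it in that scheme."""
    return (scheme, formal_name) in vocab


roles = load_vocabulary(ROLES_XML, "Role")
print(is_allowed(roles, "IptcRole", "Thumbnail"))  # True
print(is_allowed(roles, "IptcRole", "Hologram"))   # False
```

A new role becomes usable by adding a Topic to the vocabulary, not by changing the code that processes news items, which is the sense in which the extensibility is controlled.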
		

5. Using topics to describe news content

We have seen above that NewsML uses the notion of a Topic in a TopicSet to provide a rigorous but extensible mechanism for assigning metadata to news components. Another extremely important feature of NewsML is the ability to identify the topics that actually occur within the content of a news component. There is a category of metadata, called DescriptiveMetadata, which is intended to provide information about the content of a text story, photograph, video clip, or any other item of news media.

Here is an example of some DescriptiveMetadata that might apply to a particular news story:

<DescriptiveMetadata>
  <SubjectCode>
    <Subject FormalName="04003005" Scheme="IptcSubjectCodes"/>
  </SubjectCode>
  <TopicOccurrence Topic="topics/companies.xml#nsdq3067"/>
</DescriptiveMetadata>
		

The first part of this metadata is a SubjectCode element containing a Subject element whose formal name in the IptcSubjectCodes naming scheme is 04003005. The following extract from the IPTC subject codes vocabulary tells us that the subject in question is software, and we conclude that this is a story about software.

<TopicSet>
  ...
  <Topic Duid="sr04003005">
    <TopicType Scheme="IptcTopicType" FormalName="SubjectDetail"/>
    <FormalName Scheme="IptcSubjectCodes">04003005</FormalName>
    <Description xml:lang="en" Variant="Name">Software</Description>
  </Topic>
  ...
</TopicSet>
		

The second part of the DescriptiveMetadata tells us something much more specific about the news item, namely that a specific topic occurs within it. The topic is identified by the Topic attribute of the TopicOccurrence subelement of DescriptiveMetadata. This is a pointer to a specific Topic element within a TopicSet. The following is the relevant extract of that TopicSet:

<TopicSet>
  ...
  <Topic Duid="nsdq3067">
    <TopicType FormalName="Company" Scheme="IptcType"/>
    <FormalName Scheme="NASDAQ">MSFT</FormalName>
    <Description>Microsoft Corporation</Description>
  </Topic>
  ...
</TopicSet>
		

We see that this is a topic of type Company, whose formal name in the NASDAQ naming scheme is MSFT, and which is described as Microsoft Corporation. This therefore tells us that there is a reference in our news story to Microsoft Corporation, and it tells us this in an extremely rigorous manner, by reference to a controlled vocabulary (in this case, the NASDAQ stock market listing) of companies.
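
The lookup just described can be sketched in a few lines of Python. The sketch assumes that the part of the pointer after `#` matches the Duid of the target Topic, and it replaces real document retrieval with an in-memory dictionary; `DOCUMENTS`, `resolve_topic` and `describe_occurrences` are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

METADATA_XML = """
<DescriptiveMetadata>
  <TopicOccurrence Topic="topics/companies.xml#nsdq3067"/>
</DescriptiveMetadata>
"""

COMPANIES_XML = """
<TopicSet>
  <Topic Duid="nsdq3067">
    <TopicType FormalName="Company" Scheme="IptcType"/>
    <FormalName Scheme="NASDAQ">MSFT</FormalName>
    <Description>Microsoft Corporation</Description>
  </Topic>
</TopicSet>
"""

# Stand-in for fetching documents: maps a path to its XML text.
DOCUMENTS = {"topics/companies.xml": COMPANIES_XML}


def resolve_topic(pointer, documents):
    """Follow a 'path#fragment' pointer to the Topic whose Duid is the fragment."""
    path, _, fragment = pointer.partition("#")
    for topic in ET.fromstring(documents[path]).iter("Topic"):
        if topic.get("Duid") == fragment:
            return topic
    return None


def describe_occurrences(metadata_xml, documents):
    """List (type, formal name, description) for each topic that occurs."""
    found = []
    for occurrence in ET.fromstring(metadata_xml).iter("TopicOccurrence"):
        topic = resolve_topic(occurrence.get("Topic"), documents)
        if topic is not None:
            found.append((topic.find("TopicType").get("FormalName"),
                          topic.findtext("FormalName"),
                          topic.findtext("Description")))
    return found


print(describe_occurrences(METADATA_XML, DOCUMENTS))
# [('Company', 'MSFT', 'Microsoft Corporation')]
```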

6. Equivalence between Topics

Of course, there may be many formal listings of companies, or of any other kind of entity we might be interested in, and the same object may have different names in different formal naming schemes. The mechanisms provided by NewsML allow us to assert the equivalence of items in different controlled vocabularies. In the following example, I have created my own TopicSet of countries. In my naming scheme, I have given a particular country the name United Kingdom. But I want to go further, and formally assert that this country is the same one that has the name GB in the ISO 2-letter country-codes naming scheme. I can do this by using a TopicSetRef to import the ISO country codes TopicSet (which is provided by IPTC as part of the default set of TopicSets that were published with the NewsML specification), and including within my country Topic a second FormalName subelement that uses the ISO naming scheme, with GB as its content.

<TopicSet Duid="mycountries">
  <TopicSetRef TopicSet="www.iptc.org/NewsML/topicsets/iso-country.xml"/>
  ...
  <Topic Duid="country17">
    <TopicType FormalName="Country" Scheme="myscheme"/>
    <FormalName Scheme="myscheme">United Kingdom</FormalName>
    <FormalName Scheme="ISOalpha2">GB</FormalName>
  </Topic>
  ...
</TopicSet>
		

This has the result that the country which I call United Kingdom is the very same one that ISO calls GB, because NewsML has the rule that any two Topics that have the same FormalName in the same Scheme are considered to be equivalent. This rule is more or less identical in intent to the constraint known as the Topic Naming Constraint, defined in XTM, which states that any two topics that have the same baseName in the same scope must be merged.
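
The equivalence rule lends itself to a short Python sketch. The one below groups topics into equivalence classes using a small union-find structure, so that equivalence is also carried transitively across chains of shared names. That transitive treatment, the topic identifiers and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def equivalence_classes(topics):
    """Group topic ids that share any (Scheme, FormalName) pair.

    `topics` maps a topic id to the set of (scheme, formal_name) pairs it carries.
    """
    parent = {topic_id: topic_id for topic_id in topics}

    def find(topic_id):
        # Walk up to the representative, shortening the path on the way.
        while parent[topic_id] != topic_id:
            parent[topic_id] = parent[parent[topic_id]]
            topic_id = parent[topic_id]
        return topic_id

    first_seen = {}  # (scheme, formal_name) -> first topic id carrying that name
    for topic_id, names in topics.items():
        for name in names:
            if name in first_seen:
                parent[find(topic_id)] = find(first_seen[name])
            else:
                first_seen[name] = topic_id

    groups = defaultdict(set)
    for topic_id in topics:
        groups[find(topic_id)].add(topic_id)
    return sorted(groups.values(), key=sorted)


# Hypothetical sample data: a private country list and an imported ISO list.
topics = {
    "mycountries#c1": {("myscheme", "France"), ("ISOalpha2", "FR")},
    "iso#fr": {("ISOalpha2", "FR")},
    "iso#de": {("ISOalpha2", "DE")},
}
classes = equivalence_classes(topics)
# classes == [{"iso#de"}, {"iso#fr", "mycountries#c1"}]
```

Note that a name only triggers equivalence when both the Scheme and the FormalName match; the same string in two different schemes says nothing about identity.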

7. Relationships between Topics

As well as providing a mechanism for controlled extensions to metadata, and rigorous identification of topics that occur in news stories, NewsML provides a mechanism for supplying additional information about the topics it identifies, and even for asserting relationships among different topics. This is achieved by allowing Topics to have Property subelements, which may in turn have further Property subelements, where the value of any Property may be either a string or some other Topic. In the following example, we see a TopicSet of towns, in which there is a Topic for the town which I call London. This Topic has a Property subelement called Location, whose value is identified as the country Topic which I defined in my previous example.

<TopicSet Duid="mytowns">
  ...
  <Topic Duid="town3">
    <TopicType FormalName="Town" Scheme="myscheme"/>
    <FormalName Scheme="myscheme">London</FormalName>
    <Property FormalName="Location" ValueRef="countries.xml#country17"/>
  </Topic>
  ...
</TopicSet>
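
One way to see the relationships that such Property elements express is to flatten them into (topic, property, value) triples. The following Python sketch does this under stated assumptions: string values are taken to be carried in a `Value` attribute (the example above shows only `ValueRef`), nested property names are joined with `/` purely as a display convention, and `property_triples` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TOWNS_XML = """
<TopicSet Duid="mytowns">
  <Topic Duid="town3">
    <TopicType FormalName="Town" Scheme="myscheme"/>
    <FormalName Scheme="myscheme">London</FormalName>
    <Property FormalName="Location" ValueRef="countries.xml#country17"/>
  </Topic>
</TopicSet>
"""


def property_triples(topicset_xml):
    """Return (topic Duid, property name, value) for every Property found."""
    triples = []

    def walk(subject, element, prefix):
        for prop in element.findall("Property"):
            name = prefix + prop.get("FormalName")
            # Prefer a reference to another Topic; fall back to a string value.
            value = prop.get("ValueRef") or prop.get("Value")
            if value is not None:
                triples.append((subject, name, value))
            walk(subject, prop, name + "/")  # descend into nested Properties

    for topic in ET.fromstring(topicset_xml).iter("Topic"):
        walk(topic.get("Duid"), topic, "")
    return triples


print(property_triples(TOWNS_XML))
# [('town3', 'Location', 'countries.xml#country17')]
```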
		

8. NewsML as a Topic Map syntax

During the development of NewsML, much of the thinking that has gone into the development of Topic Maps was brought to bear on meeting the demanding requirements that had been identified by the IPTC. The names of the Topic, TopicType and TopicSet elements were chosen by a vote among the IPTC members, who did not like the originally proposed names, Thing, TypeOfThing and Vocabulary. The choice was highly appropriate, however, because the semantic force of these elements parallels that of the topic, instanceOf and topicMap elements in XTM.

We have already seen how the mechanism for asserting equivalence of Topics in NewsML mirrors the Topic Naming Constraint in XTM, with NewsML's FormalName element matching XTM's baseName element. The Scheme attribute in NewsML provides a particular kind of scoping mechanism, which behaves much as scope does in XTM. There are many other parallels, and this is no coincidence, as the requirements of NewsML match very closely some of the requirements that Topic Maps were designed to meet.
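
These parallels can be made tangible with a purely illustrative Python sketch that rewrites a NewsML Topic as an XTM topic element. It follows the analogies drawn above (TopicType to instanceOf, FormalName to baseName, Scheme to scope) and is not a defined or conforming conversion. In particular, it assumes that topics with ids equal to the type name and the scheme name exist in the target topic map, and `newsml_topic_to_xtm` is a hypothetical function.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def xtm(tag):
    """Qualify an element name with the XTM 1.0 namespace."""
    return f"{{{XTM_NS}}}{tag}"


def topic_ref(parent, target_id):
    """Append a topicRef pointing at a topic assumed to exist in the same map."""
    return ET.SubElement(parent, xtm("topicRef"),
                         {f"{{{XLINK_NS}}}href": "#" + target_id})


def newsml_topic_to_xtm(newsml_topic):
    """Build an XTM topic element from a parsed NewsML Topic element."""
    topic = ET.Element(xtm("topic"), {"id": newsml_topic.get("Duid")})
    topic_type = newsml_topic.find("TopicType")
    if topic_type is not None:
        # TopicType is read as the class of which the topic is an instance.
        topic_ref(ET.SubElement(topic, xtm("instanceOf")),
                  topic_type.get("FormalName"))
    for formal_name in newsml_topic.findall("FormalName"):
        base_name = ET.SubElement(topic, xtm("baseName"))
        # The naming Scheme is read as the scope of the name.
        topic_ref(ET.SubElement(base_name, xtm("scope")),
                  formal_name.get("Scheme"))
        ET.SubElement(base_name, xtm("baseNameString")).text = formal_name.text
    return topic


source = ET.fromstring(
    '<Topic Duid="nsdq3067">'
    '<TopicType FormalName="Company" Scheme="IptcType"/>'
    '<FormalName Scheme="NASDAQ">MSFT</FormalName>'
    '</Topic>'
)
converted = newsml_topic_to_xtm(source)
print(ET.tostring(converted, encoding="unicode"))
```

The sketch ignores Description and Property elements, and says nothing about the XTM processing requirements; both would need to be addressed before any claim of conformance could be made.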

We have seen that the XTM specification allows that the XTM syntax is not the only one that may be used to express Topic Maps, and that provided some other syntax maps to the XTM conceptual model and the XTM processing requirements are met, documents in that syntax can indeed be considered to be Topic Map documents. It is my belief that it will be possible to identify the correspondences between NewsML and XTM with sufficient rigour and precision to make it possible for NewsML documents to be treated as conforming Topic Map documents in their own right. If this can be achieved, the consequences will be far-reaching, as it will mean that all news archives and news feeds that use NewsML (and before too long, that will be a very significant proportion of the world's news content, since NewsML has been adopted as standard by the world's major news industry consortium) can be processed directly by Topic Map applications. When this occurs, the content of these news feeds and news archives will be able to be made available to a new generation of knowledge-processing software.

Glossary

DTD

Document Type Definition

IPTC

International Press Telecommunications Council

NITF

News Industry Text Format

RDF

Resource Description Framework

W3C

World Wide Web Consortium

XTM

XML Topic Maps

Biography

Daniel Rivers-Moore
Director of New Technologies
RivCom
Swindon
United Kingdom
Email: daniel.rivers-moore@rivcom.com Web: www.rivcom.com

Daniel Rivers-Moore is Director of New Technologies at RivCom, a consultancy and services company specializing in helping businesses adopt XML technologies to meet their information management and distribution needs. He has a decade of experience helping organizations gain business benefit from the adoption of standards-based publishing solutions and has been actively involved in the development of the XML family of standards for many years, having been a member of the original XML Special Interest Group, joint project leader of the STEP/SGML harmonisation initiative under ISO, and software development lead in the recently completed European XML/EDI Pilot Project. Daniel is a Founder Member of TopicMaps.org and a member of the XML Topic Maps Authoring Group. He is the editor of the NewsML standard and chair of the provisional steering group of KnoW (Knowledge on the Web), a collaborative initiative aimed at fostering the development of Web technologies for the sharing and interchange of knowledge.