Selection and utilization of metadata from news articles
Asta Bäck
Find


Abstract
Metadata is needed for various purposes in the publishing process. This paper lists these purposes, analyses some metadata dictionaries with regard to the aspects they cover, and reports a case implementation in which metadata is used to support electronic publishing services for end users and editors. Some of the metadata is created by people and some is collected semiautomatically or automatically. Finally, the paper discusses the experiences gained from using this metadata and from processing the actual XML articles for rendering.

Contents
  1. Introduction
  2. Metadata classification and metadata dictionaries
    1. Basic data and a unique identification
    2. Content
    3. Copyright
    4. Technical information
    5. Relations to other resources
    6. Related components
    7. Resource management
  3. Application of technical news article metadata
    1. Definition of the metadata dictionary
    2. Production and utilization of metadata for content description
  4. Discussion
  5. Bibliography

Introduction
How to define and create metadata is a much-discussed topic in the publishing world, and it has gained importance with the rise of electronic publishing. While the focus used to be on providing sufficient information to find an existing publication, recent developments in publishing have made metadata important also for content that has not been published in the traditional sense of the word. Also, what used to be regarded as a single publication is today often viewed as consisting of many components, which may need metadata of their own. Large web portals and personalized information services are two important examples of new publishing opportunities that require additional metadata.
This paper gives a classification of the different types of metadata and uses it to analyse some metadata dictionaries and projects. Finally, it discusses our own implementation for creating and using metadata relating to technical news articles, with a review of our own experiences and ideas for improvements.
In this paper, the terms 'publisher' and 'publication' are used in a broad sense. The word 'publisher' refers to any organization or company that makes content available to other parties. The words 'document', 'resource' and 'publication' are used interchangeably.
Metadata classification and metadata dictionaries
Basic data and a unique identification
The basic data includes the basic facts of a publication, such as the creator and when and where it was published. This type of information is traditionally regarded as the core metadata. The typical role of the basic data is to provide enough information about a resource for anyone interested to be able to locate it. This information largely aims at unique identification, and systems have been set up to support it. The ISBN numbering for books is an example of this. The ISBN alone would be sufficient to identify any published book uniquely, but for the sake of readability other information is usually provided as well. In the web community, the URL has served as a unique identifier. With regard to documents, its major drawback is that there is no guarantee that the document's location or content remains unchanged. Other methods have therefore been developed for the unique identification of electronic publications. The DOI (Digital Object Identifier) initiative is an example of this [1].
Content
With the metadata that describes the content, we usually try to provide answers to questions like 'Which documents tell us about natural catastrophes in North America during the last decade', or 'Which documents deal with XML in publishing applications'.
Keywords and classifications are the most traditional ways to describe the content of a document or a publication. The Dewey Decimal System is one of the universal classification schemes. There are also innumerable index term lists, for general or special purposes.
Another content description approach is the PICS specification, which is a product of the web era. It was originally designed to help parents and teachers control what children access on the Internet by creating a way to include ratings for web resources. [2]
The goal of the current PRISM project is to define an XML metadata vocabulary that would make it possible to describe the content of articles that are sold to other publishers and portals. The primary areas that the metadata dictionary should cover are magazine publishing, news and book publishing. [3]
The International Press Telecommunications Council has defined a subject list and a property list which allow the content of a news item to be described. [4]
Also the new ISO standard for Topic Maps addresses the issue of content description. Topic Maps can be created to describe the relations between the topics with or without reference to documents which deal with these topics. Topic Maps can be used as a tool to create and manage the metadata of metadata. [5]
Copyright
Copyright-related metadata has gained significantly in importance now that content is distributed in digital form and there are more publishers and publishing channels. The goal of the European indecs project is to develop a data model for describing copyright and the deals concerning content [6]. The DOI project also aims to create an infrastructure for copyright management.
The Xerox Digital Property Rights Language (DPRL) is another copyright-related initiative. It is said to provide a mechanism by which different terms and conditions related to access, fee, and time can be specified and enforced for the different operations on digital documents, such as view, print, and copy. [7]
Most of the existing metadata dictionaries, such as Dublin Core [8] or XMLNews-Meta [9], support the inclusion of only very basic copyright information.
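As an illustration of how basic such copyright information is, the following minimal Python sketch builds a small record using the Dublin Core element set, in which copyright is carried by a single free-text rights element. The wrapper element name `metadata`, the helper `dublin_core_record` and the rights statement itself are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def dublin_core_record(fields):
    """Build a record with one Dublin Core element per (name, value) pair.

    The wrapper element <metadata> is a hypothetical container chosen for
    this sketch; Dublin Core itself does not prescribe it.
    """
    record = ET.Element("metadata")
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


record = dublin_core_record({
    "title": "Selection and utilization of metadata from news articles",
    "creator": "Asta Bäck",
    # Placeholder rights statement, invented for illustration.
    "rights": "Copyright held by the publisher; all rights reserved.",
})
print(ET.tostring(record, encoding="unicode"))
```

A free-text statement like this can be shown to a reader, but it cannot express the machine-enforceable terms that initiatives such as DPRL aim at.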
Technical information
Technical information about the resource has a supporting role in the metadata. The same content may exist in several different formats, which serve the same or almost the same function (e.g. various text formats), whereas some formats may serve totally different purposes even though they contain much of the same information (e.g. a transcript and a video). The metadata should support different instances of the same content.
Relations to other resources
The issue of relations to other resources is of interest to different user groups during the whole life span of any document. When a document is created, its relation to previous documents on the same topic is often stated in the text, and should, of course, be included in the metadata. Many users, particularly in the scientific world, create their own views of the relations between publications. The XLink proposal provides a tool to manage the relations between documents [10].
Related components
When electronic publishing processes are used, one document or publication is composed of several components. Content producers and publishers usually need metadata both at the component level and at the different composite levels, whereas end users usually view the publications as a whole.
Resource management
The version and the status of the document are the most important pieces of management information. In most cases, this type of metadata is needed in content creation and production processes. In quick-paced publication processes, such as news publication, version management should also be supported across companies.
Application of technical news article metadata
Definition of the metadata dictionary
VTT Information Technology publishes a monthly newsletter, GT-Bulletin, consisting of printing and publishing related articles. New articles are written for the printed issue, and after the printed version is completed the articles are stored in the XML format for electronic retrieval. At present, only our own personnel have access to the articles in the XML format, but we also plan to allow our subscribers to access them. For now, the only electronic version available to subscribers is the PDF version of the bulletin.
The basic requirement for the metadata was that it must provide a way to find the articles that deal with certain topics or with certain companies, or their products and services. This should be achieved with a minimum of extra work.
When we designed the application, we had to consider certain restrictions. Most importantly, we had a relational database at our disposal, and the authors' additional work had to be kept to an absolute minimum. Our application begins with ready-made articles, so we have no need for article version control.
The articles of the GT-Bulletin are, by tradition, classified into 14 categories. We took these categories as one way to describe the articles. With only these few categories and with the tendency of listing an article in several categories, the lists of articles are fairly long in every category and it takes time to find the relevant articles. Obviously additional information is needed to make the search easier.
To decide how we should describe the articles, we looked into the existing metadata vocabularies and, in particular, into the XMLNews-Meta and the Dublin Core. Not surprisingly, neither one of these dictionaries includes all the elements we wanted for our application.
The way XMLNews-Meta describes the content of the article was closer to our needs. In our articles, proper names carry important content information, and we wanted to collect them into our metadata document to enable precise searches. This is also the approach in the XMLNews-Meta. For our application, we modified the XMLNews-Meta both by extending it and by discarding some elements.
Production and utilization of metadata for content description
We chose the following elements to describe the contents of our news articles:
We have developed an application in which XML tagging and metadata generation are performed partly automatically and partly with computer assistance. To collect the content-describing elements (elements 9 to 16 in the previous list), TextMorfo software by a Finnish company called Kielikone Oy is used for finding proper names and their basic forms. Our application shows the words to the user for classification. The users have to go through the proposed terms and accept, reject or change their classification. The classifications are stored in a database, so over time the system becomes more proficient at making the right suggestions for classification and can run more automatically.
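The way stored classifications can improve later suggestions may be sketched as follows. This is a minimal Python illustration and not the actual application: the class name `ClassificationMemory`, its methods and the category labels are assumptions, and an in-memory structure stands in for the database.

```python
from collections import Counter, defaultdict


class ClassificationMemory:
    """Remembers how users classified proper names and suggests the most
    frequently accepted category for a name seen before (hypothetical design)."""

    def __init__(self):
        # Lowercased base form of a name -> counts of categories users accepted.
        self._votes = defaultdict(Counter)

    def record(self, name, category):
        """Store one classification that a user accepted or corrected."""
        self._votes[name.lower()][category] += 1

    def suggest(self, name):
        """Return the most common category so far, or None for an unseen name."""
        votes = self._votes.get(name.lower())
        if not votes:
            return None
        return votes.most_common(1)[0][0]


memory = ClassificationMemory()
memory.record("Nokia", "company")
memory.record("Nokia", "company")
memory.record("Nokia", "place")
print(memory.suggest("nokia"))       # category with the most votes
print(memory.suggest("Heidelberg"))  # unseen name, so no suggestion
```

Each confirmation by a user adds a vote, so a name that has been classified consistently can eventually be suggested with little risk of correction.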
At this point, we do not try to classify the articles automatically, but expect the user to make a manual classification as a part of the final check for the metadata. The user also selects the type of the article from a predefined list.
The rest of the metadata elements are marked by the authors as a routine task when they write the articles. The required metadata elements can therefore be picked automatically from the article text into the metadata document.
When the metadata generating process is completed, we produce two XML documents: the metadata document and the article with detailed XML tagging. By detailed XML tagging we mean that all proper nouns included in the metadata document are tagged in the article as well.
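A minimal Python sketch of this kind of detailed tagging is given below. The element name `name`, the attribute `class`, the paragraph element `p` and the sample sentence are assumptions for illustration; they are not taken from our DTD.

```python
import re
import xml.etree.ElementTree as ET


def tag_proper_nouns(text, names):
    """Return a <p> element in which each known proper noun is wrapped in a
    <name class="..."> element. `names` maps a proper noun to its category."""
    para = ET.Element("p")
    if not names:
        para.text = text
        return para
    # Longest names first, so that a longer name wins over a shorter prefix.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(n) + r"\b" for n in ordered))
    pos = 0
    last = None  # most recently created <name> element
    for match in pattern.finditer(text):
        chunk = text[pos:match.start()]
        if last is None:
            para.text = chunk   # text before the first tagged name
        else:
            last.tail = chunk   # text between two tagged names
        last = ET.SubElement(para, "name", {"class": names[match.group(0)]})
        last.text = match.group(0)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        para.text = rest
    else:
        last.tail = rest
    return para


tagged = tag_proper_nouns(
    "Nokia opened a new office in Helsinki.",
    {"Nokia": "company", "Helsinki": "place"},
)
print(ET.tostring(tagged, encoding="unicode"))
```

Because the same name list drives both the metadata document and the tagging, the two outputs stay consistent with each other.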
After the metadata document and the article are created, the metadata is stored in a relational database and the articles in a file system. An HTML browser interface has been built for making queries to the metadata database. The users are offered the following options for content-based searches:
and for basic-data searches
These criteria may be combined.
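How combined criteria can be turned into one query against a relational database may be sketched as follows in Python with SQLite. The table layout, the column names, the particular search criteria and the sample rows are all invented for this illustration and do not reproduce our database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (
        id INTEGER PRIMARY KEY, title TEXT, category TEXT, published TEXT);
    CREATE TABLE proper_name (
        article_id INTEGER REFERENCES article(id), name TEXT, name_class TEXT);
""")
# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", [
    (1, "New digital press announced", "Printing", "1999-11-01"),
    (2, "XML in newsrooms", "Publishing systems", "2000-01-15"),
])
conn.executemany("INSERT INTO proper_name VALUES (?, ?, ?)", [
    (1, "Xerox", "company"),
    (2, "Xerox", "company"),
    (2, "XML", "technology"),
])


def search(conn, name=None, name_class=None, category=None, published_from=None):
    """Combine whichever criteria are given with AND; return (id, title) rows."""
    sql = ("SELECT DISTINCT a.id, a.title FROM article a "
           "LEFT JOIN proper_name p ON p.article_id = a.id WHERE 1 = 1")
    params = []
    conditions = (
        ("p.name = ?", name),
        ("p.name_class = ?", name_class),
        ("a.category = ?", category),
        ("a.published >= ?", published_from),  # ISO dates compare as text
    )
    for condition, value in conditions:
        if value is not None:
            sql += " AND " + condition
            params.append(value)
    sql += " ORDER BY a.id"
    return conn.execute(sql, params).fetchall()


print(search(conn, name="Xerox"))
print(search(conn, name="Xerox", published_from="2000-01-01"))
```

Passing the values as query parameters rather than pasting them into the SQL text keeps user input from altering the query.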
A summary of the articles matching the search criteria is returned to the user, who can retrieve the full articles either in the XML format together with an XSL stylesheet, provided that an IE 5.0 browser is used, or as HTML. The conversion to HTML is made on the fly by servlets.
Discussion
XML and separate metadata provide an open basis for content applications. In this way it is easy to collect and process articles to create a variety of combinations and publications. It is also easy to distribute the metadata or even the articles to other interested parties.
We can collect a lot of descriptive metadata from the articles, but this metadata should be processed further to make sure that it is in a usable format. We already accumulate information about user-made classifications of proper names (example: Nokia is a company), but that is not enough. We should also know that Nokia Corporation refers to the same company, and that there is a town called Nokia in Finland, to ensure that we make the right classification of the word. Some of this metadata of metadata is general and branch-independent, while some is specific to a branch or topic.
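One possible shape for such metadata of metadata is a registry that links alternative name forms to one entity and keeps ambiguous forms apart. The following Python sketch is only an assumption about how this could be modelled; the names `Entity` and `EntityRegistry` and the category labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    canonical: str                      # preferred form of the name
    category: str                       # e.g. "company" or "place"
    aliases: set = field(default_factory=set)


class EntityRegistry:
    """Maps each surface form of a name to every entity it may refer to."""

    def __init__(self):
        self._by_surface = {}           # lowercased form -> list of Entity

    def add(self, entity):
        for form in {entity.canonical, *entity.aliases}:
            self._by_surface.setdefault(form.lower(), []).append(entity)

    def candidates(self, surface):
        """Return all entities that the given form could denote."""
        return list(self._by_surface.get(surface.lower(), []))


registry = EntityRegistry()
registry.add(Entity("Nokia Corporation", "company", {"Nokia"}))
registry.add(Entity("Nokia", "place"))

# "Nokia" is ambiguous, so both entities are offered to the user.
print([(e.canonical, e.category) for e in registry.candidates("Nokia")])
# "Nokia Corporation" is unambiguous.
print([(e.canonical, e.category) for e in registry.candidates("Nokia Corporation")])
```

When a form has more than one candidate, the application would still have to ask the user or use context to choose between them.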
Our application and our metadata DTD treat all the found proper nouns similarly, so that there is no system-supported function to help decide the most significant proper nouns in the article. The proper nouns which are mentioned most frequently and/or are mentioned in headings, in headnotes or in captions are probably the most relevant ones. This kind of information could easily be accumulated automatically, and stored in the metadata document.
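A minimal Python sketch of such automatic accumulation is shown below. The weights and the location labels are assumptions chosen for illustration; suitable values would have to be found by experiment.

```python
from collections import Counter

# Hypothetical weights: a mention in a heading counts for more than a
# mention in the body text.
WEIGHTS = {"heading": 3.0, "headnote": 2.0, "caption": 2.0, "body": 1.0}


def rank_proper_nouns(mentions):
    """Score proper nouns by how often and where they are mentioned.

    `mentions` is an iterable of (name, location) pairs. The result is a
    list of (name, score) pairs, highest score first, ties in name order.
    """
    scores = Counter()
    for name, location in mentions:
        scores[name] += WEIGHTS.get(location, 1.0)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


mentions = [
    ("Xerox", "heading"), ("Xerox", "body"),
    ("Agfa", "body"), ("Agfa", "body"),
    ("Heidelberg", "caption"),
]
print(rank_proper_nouns(mentions))
```

The resulting score could be stored with each proper noun in the metadata document and used to order search results.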
Our metadata documents are fairly large, since excerpts from the article, such as the headnote and the subtitles, are also included in the metadata. Some information is thus stored twice. With future tools and databases for XML documents we can expect to minimize such redundancy. There is also some redundancy in the sense that the content-describing proper nouns are both tagged in the document and included in the metadata. However, the tagging in the document can be used not only to convey the meaning of the content but also to control the rendition of the document.
The classifications of proper names and stories tell a lot about the content of an article and allow precise searches. However, very detailed classifications are not always practical for the people who query the database, because they would need to know the classification principles as well as those who make the classifications. This should be considered very carefully when classification schemes and query user interfaces are designed. In our present search application, users may also search for specified proper nouns without specifying their category.
We might also ask whether a full-text search would be enough in our application. The answer is yes and no. In many search tasks, a full-text search would probably give results as good as those of our metadata-based searches, and in some cases better ones: a full-text search finds all the words in a document, whereas a metadata-based search finds only those words that are included in the metadata. With our approach, however, we can enhance the XML tagging of the articles while the metadata is collected, and the article content and type classifications can conveniently be added to the metadata to increase its usability. The condensed information in the metadata documents may also be utilized and distributed without direct access to the documents.
The articles are now classified by people. There is probably a clear correlation between the classified proper nouns and the article classification - at least in our articles which only discuss topics relating to the printing and publishing industries. It might also become possible to automate this part of the metadata generation.
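One simple way to exploit such a correlation is to let the proper nouns of a new article vote for the categories of earlier, manually classified articles in which they appeared. The Python sketch below illustrates the idea only; the function names, the category labels and the sample history are invented, and the approach has not been evaluated on our articles.

```python
from collections import Counter, defaultdict


def train(history):
    """Count, for each proper noun, the categories of the articles it occurred in.

    `history` is an iterable of (proper_names, categories) pairs taken from
    articles that have already been classified by people.
    """
    table = defaultdict(Counter)
    for names, categories in history:
        for name in names:
            table[name].update(categories)
    return table


def suggest_categories(table, names, top=2):
    """Suggest up to `top` categories for an article with the given proper nouns."""
    votes = Counter()
    for name in names:
        votes.update(table.get(name, {}))
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:top]]


# Invented sample history, for illustration only.
history = [
    ({"Xerox", "DocuTech"}, ["Digital printing"]),
    ({"Xerox", "Adobe"}, ["Digital printing", "Prepress"]),
    ({"Adobe", "PDF"}, ["Prepress"]),
]
table = train(history)
print(suggest_categories(table, {"Xerox"}))
print(suggest_categories(table, {"PDF"}, top=1))
```

Such suggestions would be presented to the user for confirmation in the same way as the proper-name classifications.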
A lot of work is done by various parties and organisations to define metadata dictionaries, and methods and frameworks for the manipulation of metadata across the companies. It is important for individual content producers and publishers to follow these developments, and participate in them, if possible. General metadata dictionaries can, however, seldom meet all the requirements of a company, so companies should find ways to combine their needs and public metadata requirements, and try to automate generating metadata, where possible.
Bibliography
[1] The Digital Object Identifier System. http://www.doi.org/
[2] Platform for Internet Content Selection. http://www.w3.org/PICS/
[3] PRISM - Publishing Requirements for Industry Standard Metadata. http://www.idealliance.org/prism.htm
[4] The News Industry Text Format. http://www.nitf.org
[5] Topic Maps Frequently Asked Questions. http://www.infoloom.com/tmfaq.htm
[6] indecs 1999. Overview. http://www.indecs.org/overview/overview.htm
[7] Digital Property Rights Language. http://www.contentguard.com/overview/tech_dprl.htm
[8] The Dublin Core: A Simple Content Description Model for Electronic Resources. http://purl.org/DC/
[9] XMLNews-Meta Documentation. http://www.xmlnews.org/docs/xmlnews-meta.html
[10] XML Linking Language (XLink). http://www.w3.org/TR/xlink/