Selection and utilization of metadata from news articles
Metadata is needed for various purposes in the publishing process. This
paper lists these purposes, analyses some metadata dictionaries with regard
to the aspects they cover, and reports a case implementation in which metadata
is used to support electronic publishing services for end users and editors.
Some of the metadata is created by people and some is collected semi-automatically
or automatically. Finally, the paper discusses the experience gained from
using this metadata and processing the actual XML articles for rendering.
Introduction
How to define and create metadata is one of the much-discussed topics
in the publishing world. It has gained in importance with the rise of electronic
publishing. While the focus used to be on providing sufficient information
to find an existing publication, recent developments in publishing have made
metadata important also for content that has not been published in the traditional
sense of the word. Also, what used to be regarded as a single publication
is today often viewed as consisting of many components, which may need metadata
of their own. The huge web portals and the different personalized information
services are two important examples of the new publishing opportunities which
also require additional metadata.
This paper presents a classification of the different types of metadata,
analyses some metadata dictionaries and projects using
this classification, and finally discusses our own implementation for creating and using
metadata relating to technical news articles, with a review of
our own experiences and ideas for improvements.
In this paper, the terms 'publisher' and 'publication' are used in a broad
sense. The word 'publisher' refers to any organization or company that makes
content available to other parties. The words 'document', 'resource' and 'publication'
are used interchangeably.
Metadata classification and metadata dictionaries
Basic data and a unique identification
The basic data includes the basic facts of a publication, such as the
creator and when and where it is published. This type of information is traditionally
regarded as the core metadata. The typical role of the basic data is to provide
enough information about a resource so that anyone interested will be able to
locate it. This information largely aims at unique identification, and systems
have been set up to support it. The ISBN numbering for
books is an example of this. The ISBN alone would be sufficient to
uniquely identify any published book, but for the sake of readability other
information is usually provided as well. In the web community, the URL has provided
a unique identification. With regard to documents, its major drawback is that
there is no guarantee that the document's location or content remains unchanged.
Other methods have therefore been developed for the unique identification of electronic
publications. The DOI (Digital Object Identifier) initiative is an example
of this [1].
Content
With the metadata that describes the content, we usually try to provide
answers to questions like 'Which documents tell us about natural catastrophes
in North America during the last decade', or 'Which documents deal with XML
in publishing applications'.
Keywords and classifications are the most traditional ways to describe
the content of a document or a publication. The Dewey Decimal System is one
of the universal classification schemes. There are also innumerable index
term lists, for general or special purposes.
Another content description approach is the PICS specification, which
is a product of the web era. It was originally designed to help parents and
teachers control what children access on the Internet by creating a way to
include ratings for web resources [2].
The goal of the current PRISM project is to define an XML metadata vocabulary
that would make it possible to describe the content of articles that are sold
to other publishers and portals. The primary areas that the metadata dictionary
should cover are magazine, news and book publishing [3].
The International Press Telecommunications Council has defined a subject
list and a property list with which the content of a news item can be described [4].
The new ISO standard for Topic Maps also addresses the issue of content
description. Topic Maps can be created to describe the relations between
topics, with or without reference to the documents which deal with these topics.
Topic Maps can be used as a tool to create and manage the metadata of metadata [5].
Copyright
Copyright-related metadata has gained significantly in importance
now that content is distributed in digital format and there are more
publishers and publishing channels. The goal of the European indecs
project is to develop a data model for describing the copyright and the deals
concerning the content [6]. The DOI project also aims to
create an infrastructure for copyright management.
The Xerox Digital Property Rights Language (DPRL) is another copyright-related
initiative. It is said to provide a mechanism with which different terms
and conditions related to access, fee, and time can be specified and enforced
for the different operations on digital documents, such as view, print, and
copy [7].
Most of the existing metadata dictionaries, such as Dublin Core [8]
or XMLNews-Meta [9], support the
inclusion of only very basic copyright information.
Technical information
Technical information about the resource has a supporting role in the metadata.
The same content may exist in several different formats, which serve the same
or almost the same function (e.g. various text formats), whereas some formats
may serve totally different purposes even though they contain much of the
same information (e.g. a transcript and a video). The metadata should support
different instances of the same content.
Relations to other resources
The issue of relations to other resources is of interest to different
user groups during the whole life span of any document. When a document
is created, its relation to previous documents on the same topic is often
stated in the text, and should, of course, be included in the metadata. Many
users, particularly in the scientific world, create their own views of the
relations between publications. The XLink proposal provides a tool to
manage the relations between documents [10].
Related components
When electronic publishing processes are used, one document or publication
is composed of several components. Content producers and publishers usually
need metadata both at the component level and at the different composite levels,
whereas the end users usually view the publications as a whole.
Resource management
The version and the status of the document are the most important pieces
of management information. In most cases, this type of metadata is needed
in content creation and production processes. In quick-paced publication
processes, like the news publication process, version management should also
be supported across companies.
Application of technical news article metadata
Definition of the metadata dictionary
VTT Information Technology publishes a monthly newsletter, GT-Bulletin,
consisting of articles related to printing and publishing. New articles are written
for the printed issue, and after the printed version is completed the articles
are stored in the XML format for electronic retrieval. At present, only our
own personnel have access to the articles in the XML format, but we also plan
to allow our subscribers to access them. For now, the only version subscribers
can read electronically is the PDF version of the bulletin.
The basic requirement for the metadata was that it must provide a way
to find the articles that deal with certain topics or with certain companies,
or their products and services. This should be achieved with a minimum of
extra work.
When we designed the application, we had to consider certain restrictions.
Most importantly, we had a relational database at our disposal, and the
authors' additional work had to be kept to an absolute minimum. Our application
begins with ready-made articles, so we have no need for article version
control.
The articles of the GT-Bulletin are, by tradition, classified into 14
categories. We took these categories as one way to describe the articles.
With only these few categories and with the tendency to list an article
in several categories, the lists of articles are fairly long in every category
and it takes time to find the relevant articles. Obviously, additional information
is needed to make the search easier.
To decide how we should describe the articles, we looked into the existing
metadata vocabularies and, in particular, into the XMLNews-Meta and the Dublin
Core. Not surprisingly, neither one of these dictionaries includes all the
elements we wanted for our application.
The way XMLNews-Meta describes the content of the article was closer
to our needs. In our articles, proper names carry important content information,
and we wanted to collect them into our metadata document to enable precise
searches. This is also the approach in XMLNews-Meta. For our application,
we modified XMLNews-Meta both by extending it and by discarding some elements.
Production and utilization of metadata for content description
We chose the following elements to describe the contents of our news
articles:
- 1. classification (one or more values from a predefined list)
- 2. type of the article (one value from a predefined list)
- 3. description (free text = headnote or the first paragraph)
- 4. datelineDate
- 5. datelineLocation
- 6. datelineEvent
- 7. source
- 8. subheadings
- 9. company name
- 10. event name
- 11. location name
- 12. URL
- 13. person's name
- 14. product or system
- 15. project
- 16. acronym
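As a rough illustration, the sixteen elements can be pictured as one record per article. The sketch below is an assumption and not our actual metadata DTD: the class name `ArticleMetadata`, the field names and the XML element names are all hypothetical, and elements 9 to 16 are simply grouped as proper names keyed by category.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import xml.etree.ElementTree as ET


@dataclass
class ArticleMetadata:
    """Hypothetical record; field names are illustrative, not the actual DTD."""
    classification: List[str]   # 1: one or more values from a predefined list
    article_type: str           # 2: one value from a predefined list
    description: str            # 3: headnote or first paragraph
    dateline_date: str = ""     # 4
    dateline_location: str = "" # 5
    dateline_event: str = ""    # 6
    source: str = ""            # 7
    subheadings: List[str] = field(default_factory=list)       # 8
    # 9-16: proper names keyed by category, e.g. "company", "person", "acronym"
    names: Dict[str, List[str]] = field(default_factory=dict)

    def to_xml(self) -> ET.Element:
        """Build a flat <metadata> element (element names are assumptions)."""
        root = ET.Element("metadata")
        for value in self.classification:
            ET.SubElement(root, "classification").text = value
        ET.SubElement(root, "type").text = self.article_type
        ET.SubElement(root, "description").text = self.description
        optional = {
            "datelineDate": self.dateline_date,
            "datelineLocation": self.dateline_location,
            "datelineEvent": self.dateline_event,
            "source": self.source,
        }
        for tag, value in optional.items():
            if value:  # skip elements that were not marked in the article
                ET.SubElement(root, tag).text = value
        for heading in self.subheadings:
            ET.SubElement(root, "subheading").text = heading
        for category, values in self.names.items():
            for value in values:
                ET.SubElement(root, "name", category=category).text = value
        return root
```

Calling `ET.tostring(record.to_xml(), encoding="unicode")` on such a record would give a small standalone metadata document that can be stored or distributed separately from the article.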
We have developed an application in which XML tagging and metadata generation
are performed partly automatically and partly with computer assistance. To
collect the content-describing elements (elements 9 to 16 in the previous
list), the TextMorfo software by a Finnish company called Kielikone Oy is used
to search for proper names and find their basic forms. Our application
shows the words to the user for classification. The users have to go through
the proposed terms and accept, reject or change their classification. The
classifications are stored in a database, so over time the system
becomes more proficient in making the right suggestions for classification
and can run more automatically.
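The feedback loop described above can be sketched as a small table of accepted decisions from which the most frequently accepted category is proposed next time. This is only a minimal sketch under assumptions: the table layout and the function names `open_store`, `record_decision` and `suggest` are hypothetical, an in-memory SQLite database stands in for the relational database, and the terms are assumed to be already reduced to their basic forms (the TextMorfo step is not modelled).

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory store of user decisions (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE decisions ("
        "term TEXT, category TEXT, hits INTEGER, "
        "PRIMARY KEY (term, category))"
    )
    return conn


def record_decision(conn: sqlite3.Connection, term: str, category: str) -> None:
    """Remember that a user accepted `category` for `term`."""
    updated = conn.execute(
        "UPDATE decisions SET hits = hits + 1 WHERE term = ? AND category = ?",
        (term, category),
    )
    if updated.rowcount == 0:  # first time this pair is seen
        conn.execute("INSERT INTO decisions VALUES (?, ?, 1)", (term, category))


def suggest(conn: sqlite3.Connection, term: str) -> Optional[str]:
    """Propose the category most often accepted for `term`, if any."""
    row = conn.execute(
        "SELECT category FROM decisions WHERE term = ? "
        "ORDER BY hits DESC, category LIMIT 1",
        (term,),
    ).fetchone()
    return row[0] if row else None
```

A term that has never been classified yields `None`, so the user still has to classify it manually; the suggestion only becomes useful as decisions accumulate.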
At this point, we do not try to classify the articles automatically,
but expect the user to make a manual classification as a part of the final
check for the metadata. The user also selects the type of the article from
a predefined list.
The rest of the metadata elements are marked by the authors as a routine
task when they write the articles. The required metadata elements can therefore
be picked automatically from the article text into the metadata document.
When the metadata generating process is completed, we produce two XML
documents: the metadata document and the article with detailed XML tagging.
By detailed XML tagging we mean that all proper nouns included in the metadata
document are tagged in the article as well.
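To make the idea of detailed tagging concrete, the following sketch wraps each known proper noun in an element named after its category. It is an illustration under assumptions, not our production tagger: the function name `tag_proper_nouns` is hypothetical, the category names are assumed to be valid XML element names, and matching is done on exact surface forms rather than on basic forms.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape


def tag_proper_nouns(text: str, terms: Dict[str, str]) -> str:
    """Wrap each known term in an element named after its category.

    `terms` maps a surface form to a category, e.g. {"Xerox": "company"}.
    """
    if not terms:
        return escape(text)
    # Try longer terms first so "Nokia Corporation" wins over "Nokia".
    ordered = sorted(terms, key=len, reverse=True)
    # \b assumes every term starts and ends with a word character.
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        tag = terms[match.group(1)]
        parts.append(escape(text[last:match.start()]))  # untouched text before the term
        parts.append(f"<{tag}>{escape(match.group(1))}</{tag}>")
        last = match.end()
    parts.append(escape(text[last:]))  # remaining text after the last term
    return "".join(parts)
```

Because the surrounding text is escaped as it is copied, the result can be placed inside a paragraph element of the article without producing malformed XML.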
After the metadata document and the article are created, the metadata
is stored in a relational database and the articles in a file system. An HTML
browser interface has been built to make queries into the metadata database.
The users are offered the following options for content-based searches:
- proper nouns with or without an exact classification (e.g. personal
name, company name, event name)
- classification
and for basic-data searches
- creator
- issue (one or several issues)
These criteria may be combined.
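Combining the criteria amounts to building one query whose conditions are joined with AND. The sketch below assumes a hypothetical three-table schema (`article`, `article_class`, `proper_noun`) and uses SQLite purely for illustration; the actual table layout of our database is not shown here.

```python
import sqlite3
from typing import List, Optional, Sequence

# Hypothetical schema: one row per article, classification and proper noun.
SCHEMA = """
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, creator TEXT, issue TEXT);
CREATE TABLE article_class (article_id INTEGER, classification TEXT);
CREATE TABLE proper_noun (article_id INTEGER, name TEXT, category TEXT);
"""


def search(
    conn: sqlite3.Connection,
    name: Optional[str] = None,
    category: Optional[str] = None,
    classification: Optional[str] = None,
    creator: Optional[str] = None,
    issues: Sequence[str] = (),
) -> List[int]:
    """Return ids of articles matching all the criteria that were given."""
    where, params = [], []
    if name is not None:
        clause = "SELECT article_id FROM proper_noun WHERE name = ?"
        params.append(name)
        if category is not None:  # the exact classification of the noun is optional
            clause += " AND category = ?"
            params.append(category)
        where.append(f"a.id IN ({clause})")
    if classification is not None:
        where.append(
            "a.id IN (SELECT article_id FROM article_class WHERE classification = ?)"
        )
        params.append(classification)
    if creator is not None:
        where.append("a.creator = ?")
        params.append(creator)
    if issues:  # one or several issues
        where.append("a.issue IN (" + ",".join("?" * len(issues)) + ")")
        params.extend(issues)
    sql = "SELECT a.id FROM article a"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY a.id"
    return [row[0] for row in conn.execute(sql, params)]
```

All user-supplied values are passed as query parameters rather than concatenated into the SQL text, which matters when the query is driven from a browser form.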
A summary of the articles matching the search criteria is returned to
the user, who can retrieve the full articles either in the XML format along with
an XSL stylesheet, provided that an IE 5.0 browser is used, or as HTML. The
conversion to HTML is made on the fly by using servlets.
Discussion
XML and separate metadata provide an open basis for content applications.
In this way it is easy to collect and process articles to create a variety
of combinations and publications. It is also easy to distribute the metadata
or even the articles to other interested parties.
We can collect a lot of descriptive metadata from the articles, but
this metadata should be processed further to make sure that it is in a usable
format. We already accumulate information about user-made classifications of
proper names (Example: Nokia is a company), but that is not enough. We should
also know that Nokia Corporation refers to the same company, and that there
is a town called Nokia in Finland, to ensure that we make the right classification
of the word. Some of this metadata of metadata is general and branch-independent,
while some is specific to a branch or topic.
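One way to picture this metadata of metadata is a lookup from surface forms to the entities they may denote. The sketch below is purely illustrative: the names `Entity`, `KNOWN_FORMS`, `candidates` and `is_ambiguous` are hypothetical, and the table holds only the Nokia example from the text.

```python
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    canonical: str  # preferred form stored in the metadata
    category: str   # e.g. "company" or "location"


# Hypothetical knowledge base: surface form -> entities it may refer to.
KNOWN_FORMS: Dict[str, List[Entity]] = {
    "Nokia": [Entity("Nokia", "company"), Entity("Nokia", "location")],
    "Nokia Corporation": [Entity("Nokia", "company")],
}


def candidates(surface_form: str) -> List[Entity]:
    """Return every entity the surface form may denote (empty if unknown)."""
    return KNOWN_FORMS.get(surface_form, [])


def is_ambiguous(surface_form: str) -> bool:
    """True when the form maps to more than one category and needs a human check."""
    return len({entity.category for entity in candidates(surface_form)}) > 1
```

With such a table, "Nokia Corporation" can be normalised to the same company entry as "Nokia", while a bare "Nokia" is flagged for manual confirmation instead of being classified silently.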
Our application and our metadata DTD treat all the proper nouns found
in the same way, so there is no system-supported function to help decide which
proper nouns are the most significant in the article. The proper nouns which are mentioned
most frequently and/or are mentioned in headings, in headnotes or in captions
are probably the most relevant ones. This kind of information could easily
be accumulated automatically and stored in the metadata document.
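A simple way to accumulate such information is to score every mention by where it occurs and sum the scores per proper noun. The sketch below is an assumption about how this could be done, not a feature of our application; the function name `rank_proper_nouns` and the weights are hypothetical and would need tuning against real articles.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical weights: mentions in prominent positions count for more.
WEIGHTS: Dict[str, float] = {
    "heading": 3.0,
    "headnote": 2.0,
    "caption": 2.0,
    "body": 1.0,
}


def rank_proper_nouns(mentions: Iterable[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Rank proper nouns by weighted mention count.

    `mentions` yields (name, position) pairs, one per occurrence, where
    position is "heading", "headnote", "caption" or "body".
    """
    scores: Dict[str, float] = defaultdict(float)
    for name, position in mentions:
        scores[name] += WEIGHTS.get(position, 1.0)  # unknown positions count as body
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The resulting score could be stored as an attribute of each proper noun in the metadata document, so that searches can prefer articles in which the name plays a central role.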
Our metadata documents are fairly large, since excerpts from the
article, such as the headnote and the subtitles, are also included in the metadata.
Some information is therefore stored twice. With future tools and databases for XML
documents we can expect to minimize such redundancy. There is also some redundancy
in the sense that the content-describing proper nouns are both tagged in the
document and included in the metadata. However, the tagging in the document
can be used not only to convey the meaning of the content of the document
but also to control the rendition of the document.
The classifications of proper names and stories tell us a lot
about the content of the article and allow precise searches. But very detailed
classifications are not always practical for the persons who make queries
into the database, because they would have to know the classification principles
as well as those who make the classifications. This should be considered very
carefully when classification schemes and user interfaces for queries are
designed. In our present search application, the users may also search for specified
proper nouns without specifying their category.
We might also ask whether a full-text search would be enough in our
application. The answer is yes and no. For many search tasks, a full-text
search would probably give results as good as, or in some cases better than,
our metadata-based searches do: a full-text search finds all the words in
a document, whereas a metadata-based search only finds those words which are
included in the metadata. With our approach, however, we can enhance the XML tagging
of the articles while the metadata is collected, and the article content and
type classifications can conveniently be added to the metadata to increase
its usability. The condensed information in the metadata documents may also
be utilized and distributed without direct access to the documents.
The articles are now classified by people. There is probably a clear
correlation between the classified proper nouns and the article classification,
at least in our articles, which only discuss topics relating to the printing
and publishing industries. It might therefore become possible to automate this
part of the metadata generation as well.
A lot of work is being done by various parties and organisations to define
metadata dictionaries, and methods and frameworks for the manipulation of
metadata across companies. It is important for individual content producers
and publishers to follow these developments, and to participate in them if possible.
General metadata dictionaries can, however, seldom meet all the requirements
of a company, so companies should find ways to combine their own needs with public
metadata requirements, and try to automate metadata generation where possible.
Bibliography