The role of standard metadata in a portal publishing system
This paper discusses the impact of the Publishing Requirements for
Industry Standard Metadata (PRISM) on editorial content exchange and syndication.
Standard metadata in portal publishing
What is “metadata”? When and how should you begin to capture
it? How much metadata do you need for a particular piece of content? How do
you attach that metadata to the content it describes? How much of the metadata
needs to travel with the content, and for how long? Exactly what metadata
should you be capturing? These are some of the questions PRISM is working
on, and these are precisely the questions Cahners is trying to answer as it
makes the leap from being a print-based publishing company to a “new
economy” electronic information company.
For about a year now, Cahners has been using XML to facilitate the process
of publishing its content electronically. Currently, this process begins with
a QuarkXPress document (QuarkXPress is the layout and pagination tool Cahners
uses for its print publications). Using a Quark extension developed for Cahners,
a Web editor extracts the text portion of the document as well-formed XML, based
on the “Xpress Tag” markup that represents the named styles in the Quark
document. The Web editor then uses an ASP script that employs the XML DOM to
convert the well-formed XML to valid XML, conforming to a DTD Cahners wrote
specifically for its magazine content.
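The conversion step can be illustrated with a minimal sketch. It is not the ASP script described above: it uses Python's standard DOM implementation (xml.dom.minidom) instead, and the style names, the target element names, and the convert_styles function are all hypothetical. It only renames style-derived elements to the names a target DTD might expect; it does not validate the result against a DTD, which would require a validating parser.

```python
from xml.dom.minidom import getDOMImplementation, parseString

# Hypothetical mapping from Quark style names to element names in a target DTD.
STYLE_TO_ELEMENT = {
    "Headline": "title",
    "Byline": "author",
    "BodyText": "para",
}


def convert_styles(well_formed_xml):
    """Rebuild a style-tagged document using the element names of a target DTD."""
    source = parseString(well_formed_xml)
    target = getDOMImplementation().createDocument(None, "article", None)
    root = target.documentElement
    for node in source.documentElement.childNodes:
        if node.nodeType != node.ELEMENT_NODE:
            continue  # skip whitespace and comments between elements
        name = STYLE_TO_ELEMENT.get(node.tagName)
        if name is None:
            continue  # unknown style; a real converter would flag this for review
        text = "".join(
            child.data for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        element = target.createElement(name)
        element.appendChild(target.createTextNode(text.strip()))
        root.appendChild(element)
    return root.toxml()


sample = (
    "<story><Headline>Chip sales rise</Headline>"
    "<Byline>J. Smith</Byline><BodyText> Sales rose. </BodyText></story>"
)
print(convert_styles(sample))
```

Keeping the style-to-element mapping in a table rather than in the traversal code means a new Quark style can be supported without touching the conversion logic.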
Most of the metadata Cahners currently captures is stored as attributes
of the root XML element in the valid document. The conversion process itself
can capture a certain amount of this metadata from the Quark documents (as
much as they contain). The Web editor then adds the rest of the required metadata
by hand using an XML editor. We then use that metadata to route articles
onto the Web, to sort articles by topic or article type, and to filter articles
by re-use rights when we syndicate or otherwise re-purpose an article or
a complete issue.
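As a rough illustration of how root-element attributes can drive sorting and rights filtering, the sketch below reads metadata from the root of each article and keeps only the articles cleared for re-use. The attribute names (topic, type, reuse), their values, and the sample articles are assumptions made for the example, not the attributes defined in the Cahners DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical articles; the attribute names and values are illustrative only.
ARTICLES = [
    '<article topic="semiconductors" type="news" reuse="yes"><title>A</title></article>',
    '<article topic="analog circuits" type="feature" reuse="no"><title>B</title></article>',
    '<article topic="semiconductors" type="feature" reuse="yes"><title>C</title></article>',
]


def root_metadata(xml_text):
    """Return the root element's attributes, plus the title, as a dict."""
    root = ET.fromstring(xml_text)
    meta = dict(root.attrib)
    meta["title"] = root.findtext("title", default="")
    return meta


def syndicatable(articles):
    """Keep articles cleared for re-use, sorted by topic and then article type."""
    records = [root_metadata(a) for a in articles]
    cleared = [r for r in records if r.get("reuse") == "yes"]
    return sorted(cleared, key=lambda r: (r.get("topic", ""), r.get("type", "")))


for record in syndicatable(ARTICLES):
    print(record["topic"], record["type"], record["title"])
```

Because the metadata sits on the root element, a routing or syndication program can make its decision without parsing the article body at all.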
While we are probably ahead of the industry in our use of XML and metadata
to re-publish and syndicate our content, we are not yet where we’d like
to be, and our XML metadata still costs more in production effort than
it should. For example, when an article is first written or
copy-edited, it would be a simple matter for the author or copy editor to
add some basic metadata to the article—the author’s name, for
example, along with the date it was created, the publisher, perhaps the subject
of the article. Later, as an article moves through its editing cycle, it would
be useful to capture some of the other information that accrues about the
article—for example, who owns the primary and secondary rights to the
article text? Who created the illustrations and other graphics? Does Cahners
own re-use rights for all images attached to the article? For what publication
volume and issue was the article first written? Out of necessity, a Web editor
manages to gather all of this information, but the process is much less efficient
than it could be, because the Web editor doesn’t know the answers to
many of those questions and must track them down.
Beyond the issue of basic information is the question of context for
an article. For a piece of content to be truly valuable to Cahners in an electronic
arena, we need to be able to sort and re-collate that content in many different
ways. We need to be able to search our content for specific pieces of information,
and insert electronic “hooks” to capture connections between pieces
of content. For example, Cahners may want to create a Web “portal”
for all of its electronics industry titles. Within this portal, we want to
let our customers view articles (or maybe even just pieces of articles), by
subject matter (e.g., “semiconductors,” or “analog circuits”),
or by content type (e.g., industry news, feature articles, commentary), or
both. We want to send email notification to our readers when an article appears
on a subject they have identified as being of particular interest to them,
or when an article appears about a particular company or a particular product.
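The notification idea reduces to matching an article's metadata against each reader's stated interests. The sketch below assumes that articles carry sets of subject and company terms and that readers register interest in the same terms; the Article and Reader classes, their field names, and the sample addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    subjects: set = field(default_factory=set)
    companies: set = field(default_factory=set)


@dataclass
class Reader:
    email: str
    subjects: set = field(default_factory=set)
    companies: set = field(default_factory=set)


def readers_to_notify(article, readers):
    """Return the addresses of readers who share a subject or company with the article."""
    return [
        reader.email
        for reader in readers
        if (article.subjects & reader.subjects)
        or (article.companies & reader.companies)
    ]


article = Article(
    "New op-amp family",
    subjects={"analog circuits"},
    companies={"Example Devices"},
)
readers = [
    Reader("a@example.com", subjects={"analog circuits"}),
    Reader("b@example.com", subjects={"semiconductors"}),
    Reader("c@example.com", companies={"Example Devices"}),
]
print(readers_to_notify(article, readers))
```

The match only works if articles and reader profiles draw their terms from the same vocabulary, which is one reason a shared set of metadata terms matters.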
Or, on a more granular level, we may want to link a company name within
an article to a profile of that company, or to a list of products that company
currently offers. We may want to link a product name to a profile of the company
that sells that product. From there, we may want to take the next logical
step, and facilitate some sort of transaction between the reader and the company.
This is, after all, the ultimate direction of so-called business-to-business
(“B2B”) e-commerce on the Web.
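One simple way to picture the linking step is a lookup from company names to profile pages, applied to the article text. The sketch below is only an illustration under stated assumptions: the company names, the profile URLs, and the link_companies function are invented for the example, and a production system would more likely rely on names tagged in the XML than on plain string matching.

```python
import re

# Hypothetical company names and profile paths.
PROFILES = {
    "Acme Semiconductor": "/companies/acme-semiconductor",
    "Example Devices": "/companies/example-devices",
}


def link_companies(text, profiles):
    """Wrap each known company name found in the text in a link to its profile."""
    if not profiles:
        return text
    # Try longer names first so a short name never pre-empts a longer one.
    names = sorted(profiles, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

    def make_link(match):
        name = match.group(0)
        return f'<a href="{profiles[name]}">{name}</a>'

    return pattern.sub(make_link, text)


print(link_companies("Acme Semiconductor announced a new amplifier.", PROFILES))
```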
In order to do all of that, Cahners needs a format-neutral, centralized
method for capturing metadata as content is created or acquired, and for storing
that metadata in a way that makes it easy for people or programs to search
and assemble our stored content. We need a workflow that allows content creators,
editors and others to apply metadata to content efficiently, at the moment
it first becomes available, or even to generate the metadata automatically.
And finally, to accomplish everything outlined above, we also need a rich,
comprehensive set of industry-standard metadata terms.
To meet the first need, we are building an XML-based content management
system that will give our editors workflow tools for creating content, adding
metadata, and storing both in a centralized, searchable repository. Another
set of tools will facilitate the creation and publication of portal Web sites,
targeted email, content syndication, and other electronic information products.
The metadata framework that will help power this system will come, we hope,
from the PRISM standard. Because we’re on the cutting edge of the industry
right now, working with PRISM is giving us the opportunity to apply the
findings of the Working Group to our current efforts, as well as to shape
the direction of the industry standard as it develops.