XML and PDF in digital printing
irreconcilable differences?
In the world of trade book publishing, where PDF workflows are just beginning
to be accepted, XML workflows are still largely unknown. This discussion will
highlight one approach to merging the two types of workflow to create a highly
successful digital printing and eBook distribution operation.
Motivation
Early in 1999 Barnes & Noble identified the need to establish a
conversion operation to support its eBook and print-on-demand (POD) plans.
The company was already implementing plans around eBooks with an investment
in NuvoMedia (now Gemstar) and planned efforts with Microsoft to support their
MS Reader launch and with Glassbook to provide distribution of the Glassbook
reader and titles on its site. B&N was also in the midst of planning a
POD deal with IBM to install InfoPrint 4000 and InfoColor 70 equipment in
its Memphis, TN distribution operation.
B&N looked around for partners in establishing the operation, but
couldn’t find companies that were able to achieve the necessary quality
at a reasonable cost. Target quality was 99.998% character accuracy –
better than that provided by traditional typesetting – along with commercial
grade tagging and page make-up at quality levels acceptable to trade publishers.
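To make the accuracy target concrete, the arithmetic can be sketched as follows. The 99.998% figure is from the text; the book length used in the example is a hypothetical value chosen for illustration.

```python
from fractions import Fraction

# Target from the text: 99.998% character accuracy.
TARGET_ACCURACY = Fraction(99998, 100000)


def max_character_errors(total_characters: int) -> int:
    """Largest whole number of wrong characters that still meets the target."""
    error_rate = 1 - TARGET_ACCURACY  # exactly 2 errors per 100,000 characters
    return int(error_rate * total_characters)  # round down to whole characters


# Hypothetical book of 500,000 characters (an assumed length, not from the text).
print(max_character_errors(500_000))  # 10
```

Exact fractions are used rather than floating point so that the allowance is not rounded down by a representation error.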
The operation had to focus on converting hardcopy backlist rather than
electronic files because of the large amount of content in hardcopy form. The trade
book industry is only now moving into fully digital workflows and for most
trade titles published today, the publisher still usually doesn’t have
the file as it was finally printed in a manageable format.
This approach is in direct contrast to the one taken by many technology
companies that are pursuing automated tagging and reformatting of electronic
files. With all of the companies focused in this area, it was anticipated
that there would be opportunities to purchase or license reasonably inexpensive
solutions in the short term. Once text and image conversion from hardcopy
was addressed, the move into electronic file conversion could be pursued either
directly or through alliances with software suppliers.
After evaluating a number of alternatives and finding no one offering
the required cost, quality and mix of services, B&N established
an operation spanning New York, Mexico City and Manila.
Our original conversion process
In the current world of eBook and POD publishing and distribution, the
ultimate deliverables are PDF and HTML variants. PDF is required to drive
POD equipment and some eBook readers. The HTML variants are found in Rocket
eBooks, SoftBook readers, Peanut Press Readers and, of course, the standard
PC browser. As a result, the initial focus was to implement a process to merge
the production of paged and adaptive formats from both hardcopy and electronic
files. This was accomplished in a straight-through, hybrid Quark/HTML workflow
that directly produces all of the variants required using manual processes.
To take advantage of processing economics, an international process was
implemented. Books start in New York, where they are received from a publisher
and pre-edited to identify the types of processing required and any
challenging elements entailed. Each book is then sent to Mexico City for
scanning.
In Mexico City the books are scanned using either a 300 dpi process or
a 600 dpi process depending on the ultimate formats required. If a publisher
is requesting that the title go directly into POD, the 600 dpi process is
used to give appropriate resolution through the print engines. The PDF in
this case is simply a “package” of 600 dpi scanned page images.
If eBook formats are desired, the 300 dpi process is used. In the eBook workflow
the scans are then transmitted to the Manila conversion operation as TIFF
images.
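The scanning rule described above can be sketched as a small routing function. The type and function names (`ScanPlan`, `plan_scan`) and the step descriptions are hypothetical; only the two resolutions and their destinations come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanPlan:
    dpi: int
    next_step: str


def plan_scan(direct_to_pod: bool) -> ScanPlan:
    """Pick the scan resolution from the formats the publisher requested."""
    if direct_to_pod:
        # POD-only titles: higher resolution, page images packaged as a PDF.
        return ScanPlan(dpi=600, next_step="package page images as PDF")
    # eBook titles: lower resolution, TIFF images go on to conversion.
    return ScanPlan(dpi=300, next_step="send TIFF images to Manila")


print(plan_scan(direct_to_pod=True))
print(plan_scan(direct_to_pod=False))
```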
It’s in Manila that much of the interesting work takes place.
The files are zoned, driving images to an image cleanup process and text into
OCR. OCR text streams are then cleaned up using a heavy dose of AI supplemented
with manual editing to produce high quality RTFs, with a small amount of styling
applied.
Once a clean RTF is achieved, the file splits into tagging and page
production. In the tagging process the RTFs are converted into HTML and additional
tagging is applied using conventional HTML editing tools. There are actually
several different files created here, depending on the number of formats that
are being requested by the publisher. In addition to HTML we can produce OEB,
Rocket eBook format, SoftBook format, and Microsoft Reader format.
In page production the file is imported into Quark and standard page
layout techniques are used to get a print- or eBook-ready PDF Normal file
with typography at a level acceptable to New York trade publishers. This PDF
is then filtered to the various formats required by different POD printers
or PDF eBook distribution operations. This is essentially an imposition step.
The files all then return to New York where they are proofed before
final delivery to the publishing customer.
The evolving process
The eBook world is rapidly moving to a single XML standard to replace
the insanity of the multiple conflicting HTML variants. So far it’s
looking as though there will be two overall standards: OEB in most of the
eBook world and PDF for paged representations. There are persistent rumors
that Adobe will be officially recognizing the existence of XML and incorporating
it into PDF – indeed XML is creeping into Frame and a number of their
other applications. Both this shift and the increasing capabilities in Manila
are prompting a move from a traditional straight-through, manual publishing
workflow that generates a number of output formats, to a two-stage process
that yields the same number of formats in an automated manner.
The automated process is similar to the manual process through the point
of the RTF. The big difference is in the post-RTF process. Instead of using
HTML, tagging is done to a standard DTD and placed into an XML repository.
Since XSL and other XML rendering mechanisms haven’t proven themselves
capable of generating the quality of typography needed on the fly, there is
really no alternative but to store two versions of the file: an XML version
to generate the various non-paged outputs and a PDF version for paged outputs.
Output formats such as OEB or HTML 4.0 are then generated on demand
using XSL.
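The idea of generating a non-paged output on demand from the stored XML can be sketched as follows. Python's standard library has no XSLT processor, so this sketch uses `xml.etree.ElementTree` as a stand-in for what a stylesheet would do; it is not our XSL code. The element names, the sample record and the mapping to HTML are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository record; the element names are invented for this sketch.
SOURCE = (
    "<book><title>Sample Title</title>"
    "<chapter><heading>One</heading><para>First paragraph.</para></chapter>"
    "</book>"
)

# Hypothetical mapping from repository elements to HTML elements.
HTML_MAP = {"title": "h1", "heading": "h2", "para": "p"}


def to_html(xml_text: str) -> str:
    """Flatten the tagged book into a simple HTML body, in document order."""
    book = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for node in book.iter():
        target = HTML_MAP.get(node.tag)
        if target is not None:  # structural wrappers such as <chapter> are skipped
            ET.SubElement(body, target).text = node.text
    return ET.tostring(html, encoding="unicode")


# One renderer per non-paged output; others (OEB, for instance) would be added here.
RENDERERS = {"html": to_html}


def render(xml_text: str, output_format: str) -> str:
    """Generate the requested format on demand from the stored XML."""
    return RENDERERS[output_format](xml_text)


print(render(SOURCE, "html"))
```

The stored XML is never modified; each output is derived from it when requested, which is what allows new formats to be added later without reconverting the book.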
Issues
The issues in this process will be familiar to XML practitioners. These
range from using XSL to create high-quality pages and managing a large number
of shifting XSL stylesheets, to issues around where in the process tagging
should be done and a dearth of XML tools appropriate to the Manila production
environment, where a good grounding in the fundamentals of XML and its uses
cannot be presumed.
High quality pages
The first challenge is the creation of high-quality pages. Publishing
has long been grappling with how to apply automation to high quality composition
to reduce or eliminate designer intervention. This is no different in an XML
workflow. The inherent problem is that XML is a non-paged representation of
a text, and if on-demand production of pages is desired, complex design problems
must be resolved on the fly. Some of the more challenging problems include:
- Cross-page implications of widow & orphan control
- Column balancing, either within or across pages
- Kerning, inter-character, and inter-word spacing
- Juxtaposition of non-text elements with text
- Preservation of the design of the original book in the reproduction
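The first item in the list can be illustrated with a deliberately simplified paginator. This is a sketch under stated assumptions, not a composition engine: every page holds a fixed number of lines, each paragraph is reduced to its line count, an "orphan" is taken to mean a paragraph's first line alone at the foot of a page, and a "widow" its last line alone at the head of the next. The function name `paginate` is hypothetical.

```python
def paginate(paragraph_lines: list[int], page_lines: int) -> list[int]:
    """Return how many lines land on each page, avoiding widows and orphans."""
    if page_lines < 4:
        raise ValueError("page_lines must be at least 4 for this sketch")
    pages = []
    used = 0  # lines already placed on the current page
    for total in paragraph_lines:
        remaining = total
        while remaining > 0:
            room = page_lines - used
            if remaining <= room:
                used += remaining  # the rest of the paragraph fits
                break
            take = room
            if remaining - take == 1:
                take -= 1  # widow control: carry two lines over, not one
            if take < 2:
                take = 0  # orphan control: start the paragraph on the next page
            used += take
            remaining -= take
            pages.append(used)  # close the page, possibly short
            used = 0
    if used:
        pages.append(used)
    return pages


print(paginate([7], 6))     # [5, 2]
print(paginate([5, 5], 6))  # [5, 5]
```

In both examples the first page is left one line short. That shortfall is the cross-page implication: a decision made to fix one page moves every later break, so the pages cannot be composed independently.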
Our operation achieves the maximum flexibility possible in this environment
through the use of custom XSL code to transform the tagged XML in our repository
to the final target formats. Overall, however, the jury’s still out
on whether XSL can take files all the way from an XML markup to quality pages
– the solution may very well end up being to take the best composition
one can get from XSL output, hand adjust the output, and then archive the
result.
XSL proliferation
XSL proliferation is another issue. While it’s easy to see that
an XSL style sheet is needed to produce every output format, it’s also
necessary to have a more sophisticated style sheet to deal with styling within
formats, usually publisher-specific, but sometimes even getting down to the
book level. This requires that the style sheet perform two separate functions.
First, it must transform the XML into a format such as HTML (the “transform”
function). Second, it must make the output in each format look like
the original publisher’s house style or, worse, match the book itself
(the “styling” function).
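One way to keep the number of style sheets manageable is to hold the two functions apart, so that each output format needs one transform while styling is layered by publisher and by book. The sketch below shows that separation in Python rather than XSL. The publisher name, book identifier and CSS values are all invented, and the layering order (default, then publisher, then book) is an assumption made for the sketch.

```python
from html import escape
from typing import Optional

# Hypothetical style layers; every name and value here is invented.
HOUSE_STYLES = {
    "default": {"p": "font-family: serif; text-indent: 1.5em"},
    "example-press": {
        "p": "font-family: sans-serif; text-indent: 0",
        "h1": "text-align: center",
    },
}
BOOK_OVERRIDES = {"book-001": {"h1": "text-align: left"}}


def transform(title: str, paragraphs: list[str]) -> str:
    """The "transform" function: one per output format, shared by all publishers."""
    parts = [f"<h1>{escape(title)}</h1>"]
    parts += [f"<p>{escape(text)}</p>" for text in paragraphs]
    return "\n".join(parts)


def styling(publisher: str, book_id: Optional[str] = None) -> str:
    """The "styling" function: later layers override earlier ones."""
    rules = dict(HOUSE_STYLES["default"])
    rules.update(HOUSE_STYLES.get(publisher, {}))
    rules.update(BOOK_OVERRIDES.get(book_id, {}))
    return "\n".join(f"{sel} {{ {decl} }}" for sel, decl in sorted(rules.items()))


print(styling("example-press", "book-001"))
print(transform("Sample Title", ["First paragraph."]))
```

With this split, adding a publisher means adding a style layer, and adding an output format means adding a transform; the two do not multiply.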
Iterative tagging
In the process described there are several areas where XML tagging can
be applied: in the original zoning, in the cleanup process on the RTF (in
the form of RTF styles which are post-processed into tags), and post-cleanup
in the more rarefied XML environment. The choices here represent tradeoffs
between productivity and accuracy – the idea is to be able to touch
the text (or image) once and get as much mileage as possible out of that touch.
For example, it’s necessary to zone page images after scanning to drive
the OCR and image cleanup workflows. To the extent that text elements can
be identified in the images, later tagging effort can be saved by having the
zones also correspond to those elements. This works so long as a great deal
of unnatural effort isn’t induced in the process, and so long as it
isn’t easy for the operator to apply tags that will later prove to be
invalid in accordance with our DTD.
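The guard described here, where early labels are only allowed if they map to something the DTD permits, can be sketched as follows. The RTF style names, the element names and the allowed set are hypothetical. The check covers element names only; whether the elements appear in a valid order or nesting would still need a validating parser.

```python
from html import escape

# Hypothetical subset of element names permitted by the DTD.
ALLOWED_ELEMENTS = {"chapter-title", "para", "blockquote", "footnote"}

# Hypothetical RTF style names applied by operators during cleanup.
STYLE_TO_ELEMENT = {
    "Chapter Head": "chapter-title",
    "Body Text": "para",
    "Extract": "blockquote",
}


def tag_from_style(style_name: str) -> str:
    """Map an RTF style to an element, rejecting anything the DTD does not allow."""
    element = STYLE_TO_ELEMENT.get(style_name)
    if element is None or element not in ALLOWED_ELEMENTS:
        raise ValueError(f"style {style_name!r} has no valid element")
    return element


def tag_runs(runs: list[tuple[str, str]]) -> str:
    """Post-process (style, text) runs from the cleaned RTF into tagged text."""
    lines = []
    for style_name, text in runs:
        tag = tag_from_style(style_name)
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


print(tag_runs([("Chapter Head", "One"), ("Body Text", "A & B")]))
```

Rejecting an unknown style at the point where it is applied is cheaper than discovering the invalid tag after the text has moved on to the XML stage.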
Production tools
B&N has created a highly productive manufacturing environment in
its Manila operation that relies on standardization, process control and bulletproof
tools. Operators in these environments need tools that can be rigorously customized
to eliminate excessive work steps and that aren’t terribly complex –
not because of any lack of ability, but because of high production and quality
expectations. Further, the tools need to be able to be integrated into an
overall workflow and driven from a common production control and content database
that span the entire process.
Conclusion
During the past year Barnes & Noble has come a long way from the
realization that a world-class conversion operation was needed that could
integrate page and tagged content production. The operation has been established
and is working smoothly through a variety of issues around the integration
of XML and PDF workflows. Through efforts such as this and our other POD and
eBook efforts, B&N feels that it is helping the industry to achieve the goals
that many people share today: that any consumer can find the book they need
in the format they can best use wherever and whenever they need it, and that
any publisher can count on being able to produce new eBook formats as they
arise, without the need to reconvert original content.