XML Europe 2001, 21-25 May 2001
Internationales Congress Centrum (ICC)
Berlin, Germany

A World to Discover: A Topic Map for Thomas Mann

Ingrid Schmidt <schmidt@via-iscm.de>
Carolin Müller <mueller@via-iscm.de>

ABSTRACT

Semantic nets expressed in the Topic Map notation are a new approach to knowledge organization. Building one for the Thomas Mann project is a challenging task, mainly because the project has to cope with real-world topics as well as fictional ones, in varying degrees of detail. This paper describes our practical approach to the design of this semantic net.

1. Introduction to the project

The German publisher S. Fischer approached us to create an XML-based information pool for one of the most famous authors of the 20th century: Thomas Mann. This information pool will serve as the text base for the 58-volume print edition as well as for the electronic version. The key idea, however, is to provide the publisher with a valuable information asset that can benefit from both today's and future technologies. It consists not only of Thomas Mann's texts and the editors' commentary texts, but also of extensive archival material such as pictures, film, sound, and text documents, not necessarily in XML format. The main project requirements were to enable intelligent access to all these information units and to improve navigation. Translating these demands into XML DTDs for the documents would have resulted in highly complex structures with almost no chance of ever being finished, especially because literary texts do not easily conform to standardization. We therefore decided to develop highly flexible and less restrictive text structures. They contain very little semantic information and are combined with a metastructure realized as a semantic net using the Topic Map notation.

2. Vision and reality

When we first started to think about such a metastructure for Thomas Mann, we had lots of fantastic ideas about the positive impact it would have on the editorial work as well as on access to information in the electronic version. As we started to translate our visions into concrete requirements, however, the complexity of the resulting model became apparent. In thinking about such a complex model, two guiding questions arose:

  1. Which parts of the model are pure facts and where does interpretation start?

  2. Is such a model feasible with the resources available?

Interpretation was a crucial point because the edition of Thomas Mann's works is meant to meet scholarly standards. Interpretation therefore had to be kept out of all levels, or had to be separated so that it can be clearly identified. Feasibility was equally important. On the one hand, we had limited personnel and financial resources and no experience at all with the emerging workflow; on the other hand, we had to ensure a smooth working environment for building the topic map. We could hardly imagine having the editors work with a plain XML topic map document without any graphical support.

From these reflections we extracted five basic requirements for the metastructure:

Because of the last requirement, we decided on a tool that provides a graphical interface for building the semantic net. We accepted the disadvantage that it does not store a topic map natively but will be able to export one. A graphical view of the metastructure was very important to us: it lets us keep a good overview of the model, especially as its complexity increases in the future, and it gives the editor visual support for her work.

3. Designing the semantic net

With regard to the semantic net, we distinguished between the concept level and the level of individuals. The border between these two levels cannot always be determined unambiguously. Nevertheless, it is the individuals that are connected to their text occurrences in the text base, which can be regarded as a third level.

Figure 1: Schema of the different levels of the metastructure
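
The three levels can be pictured as a small data structure. The following is only a minimal sketch in Python, not the data model of the tool used in the project: the class names, the field names, and the address format of the occurrence are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node on the concept level, e.g. 'work of literature'."""
    name: str


@dataclass
class Individual:
    """A node on the level of individuals; an instance of one concept."""
    name: str
    concept: Concept
    # Third level: pointers into the text base.
    # The address format used below is hypothetical.
    occurrences: list = field(default_factory=list)


work_of_literature = Concept("work of literature")
novel = Individual("Buddenbrooks", work_of_literature)
novel.occurrences.append("textbase/example-commentary.xml#para-1")
```

In this sketch only individuals carry occurrences, which mirrors the statement that it is the individuals that are connected to the text base.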

The actual modeling of the semantic net takes place on the concept level. The concepts and associations decided on shape the character of the metastructure and, moreover, its later processing possibilities. As we could not represent the maximum complexity possible for our subject all at once, we decided on a more pragmatic, step-by-step approach. For the first step we focused on:

The concepts themselves can be divided into basic concepts and more specific ones, such as those referring to literature. On both the concept level and the level of individuals we had to deal with a real-world scope and a fictional scope, and in both cases to consider graded levels of detail, starting with the Thomas Mann specific world, going down to the literature specific world, and finally to everything that remains.

As we are dealing with the literary edition of Thomas Mann's works, works of literature are the main topic and are therefore modeled in the most detail. Related topics are other works of art (e.g. fine arts, music, film), scientific works, the Holy Bible, persons, places, and events. We treated the Bible separately because it cannot be fully assigned to any one of the other categories such as literature, scientific works, or works of art. In order to keep the semantic net transferable and extensible, we embedded the above-mentioned topics into a broader context. For example, in order to get to the concept of works of literature, one possible path down the tree is, starting with the main node #T: objective - object - artifact - artifact of language - artifact in written form - work in written form - work of literature.

Figure 2: Works of literature (concept level)
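
The path named in the text can be written down as a chain of parent links. The sketch below is an illustration in Python, not an excerpt from the project's net: it contains only the one path quoted above, it assumes that "objective" sits directly below the main node #T, and the function names are hypothetical.

```python
# Parent links for the single path quoted in the text.
# Assumption: "objective" is a direct child of the main node "#T".
PARENT = {
    "objective": "#T",
    "object": "objective",
    "artifact": "object",
    "artifact of language": "artifact",
    "artifact in written form": "artifact of language",
    "work in written form": "artifact in written form",
    "work of literature": "work in written form",
}


def path_from_root(concept):
    """Return the concepts from the main node down to `concept`."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def is_a(concept, ancestor):
    """True if `ancestor` is `concept` itself or lies above it in the tree."""
    return ancestor in path_from_root(concept)


print(" - ".join(path_from_root("work of literature")))
```

Embedding the specific concepts in such a chain is what keeps the net extensible: a new concept such as another kind of artifact only needs one additional parent link.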

Taking a closer look at the individuals, we found that we had to strike a balance between the granularity and the extensibility of the associations, and also to consider the resources available to populate the net. We therefore started out by allowing the most basic association, is related to, between all individuals. As a first step, this association was refined only in the areas of greatest importance to our subject. This way we could ensure that, with the available resources, these associations could be completely filled out for every individual they apply to. We want to avoid partial realizations under all circumstances, because they would produce incomplete search results without the user necessarily noticing.

We illustrate the approach of breaking down the basic association with one example. On the most general level, any two persons are connected by the is related to association. Within a family, this association becomes the more specific is family relation of. Theoretically, all branches of a family tree can be expressed by relating family members pairwise, from the top to the bottom of the tree, by the is child of association. In most cases, however, a tremendous amount of research would be necessary to create entire family trees. In reality we usually know the specific relationship between certain persons, e.g. mother of, nephew of. We therefore decided on the most important relationships that can be specified between two persons. In cases where none of the available relationships can be assigned, the basic is family relation of or is related to association takes over. In a further step, all cases where these basic associations were assigned will be analyzed. Depending on the result, they may be broken down into more specific associations. These considerations are finally expressed in a hierarchy of associations between two persons:

Figure 3: Hierarchy of family relations

A similar approach applies to all other associations.
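
The fallback rule for family relations can be sketched as follows. This is a hedged illustration in Python: the table holds only the association names mentioned in the text, its exact shape in Figure 3 is not reproduced here, and the function names are hypothetical.

```python
# Each association type points to its more general parent.
# Only names mentioned in the text are listed; the real hierarchy is larger.
BROADER = {
    "is family relation of": "is related to",
    "is child of": "is family relation of",
    "is mother of": "is family relation of",
    "is nephew of": "is family relation of",
}


def assign(known_relation, same_family):
    """Pick the most specific association that can be assigned."""
    if known_relation in BROADER:
        return known_relation
    # No specific relationship available: a basic association takes over.
    return "is family relation of" if same_family else "is related to"


def generalizations(association):
    """Return the association followed by all broader ones."""
    chain = [association]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def matches(association, query):
    """True if a search for `query` should also find `association`."""
    return query in generalizations(association)
```

With such a hierarchy, a search for is family relation of also finds pairs stored as is mother of, so refining an association later does not hide it from more general queries.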

4. Populating the net

The level of individuals will be populated in two ways: manually and automatically. We make use of four different resources, of which the first three are processed automatically:

We already have back-of-book indexes of an edition of Thomas Mann essays and of books written on Thomas Mann. These indexes have been prepared very carefully and systematically and can therefore be translated fairly easily into well-formed XML, from which individuals and associations between individuals can be derived. This translation is done in a two-step process: a conversion routine followed by manual editing.

Figure 4: Example of index entries
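
The two-step process can be sketched as a conversion routine that flags everything it cannot parse for manual editing. The sketch below is in Python and rests on assumptions: the input format (a heading followed by comma-separated page numbers), the element names, and the review attribute are hypothetical and do not describe the actual indexes or the project's XML.

```python
import re
import xml.etree.ElementTree as ET

# Assumed entry format: heading, then comma-separated pages or page ranges.
ENTRY = re.compile(r"^(.*?)[,\s]+(\d+(?:-\d+)?(?:\s*,\s*\d+(?:-\d+)?)*)$")


def index_to_xml(lines):
    """Step 1: convert index lines to XML, marking doubtful entries."""
    root = ET.Element("index")
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        entry = ET.SubElement(root, "entry")
        match = ENTRY.match(line)
        if match is None:
            # Step 2 happens outside this routine: an editor reviews these.
            entry.set("review", "manual")
            ET.SubElement(entry, "heading").text = line
            continue
        ET.SubElement(entry, "heading").text = match.group(1).strip()
        for page in re.split(r"\s*,\s*", match.group(2)):
            ET.SubElement(entry, "page").text = page
    return root


sample = ["Example Person 12, 45-47", "see also Example Work"]
print(ET.tostring(index_to_xml(sample), encoding="unicode"))
```

Writing the result with an XML library rather than by string concatenation keeps the output well-formed even when headings contain characters such as & or <.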

Most documents of the text base will be provided with metadata, e.g. the type of work of literature. The information contained in the metadata will also be imported into the semantic net. Furthermore, we will define work packages dealing with, e.g., the life spans of persons or the creation dates of works. This data will be entered as valid XML. For each of these packages we have chosen reliable information sources from which to extract the necessary facts.

The commentary texts of the editors serve as a basis for the manual extraction of individuals and their associations. Here in particular, the graphical support of the tool mentioned above is essential.

5. Conclusion

Today there is still a gap between the theoretical possibilities of topic maps and their realization. What we need is a broader range of tools supporting the different tasks involved in creating a topic map and linking it to a text base. Furthermore we have to deal with a technology whose advantages only recently became apparent to a broader audience. In order to get this technology up and running within a certain project, a company has to redefine old tasks, define new ones, and moreover establish a brand new workflow. Even though this is often a very difficult task, especially as we cannot fall back on much experience, it is also today's most challenging and exciting one, involving the chance to discover a myriad of new worlds.

Biography

Ingrid Schmidt
Senior Information Architect
VIA
Heidelberg
Germany
Email: schmidt@via-iscm.de Web: www.via-iscm.de

Ingrid manages VIA informationsarchitekturen. By providing consultancy as a main focus, VIA accompanies firms that want their content resources to outlive rapidly outdated media and thus be prepared for the constantly shifting requirements of the media landscape. Within this context, VIA emphasizes the design and realization of forward-looking XML-based information architectures and semantic nets.

Since 1993 Ingrid has worked as an independent consultant, information architect, and trainer in the field of publishing based on SGML/XML and related standards, for both industry and research. Moreover, she regularly teaches classes on SGML/XML at the German Linguistics and Computational Linguistics departments of Heidelberg University. In 1999, together with Wiebke Möhr, she edited a book on advanced SGML/XML projects in Germany. From 1993 to 1998 she worked on different projects with the Institute for Integrated Publication and Information Systems of GMD in Darmstadt. Her research focused on knowledge-based information access for hypermedia reference works and on the evaluation of machine-aided, automated semantic encoding possibilities based on an object-oriented database system. From 1991 to 1993 she worked at Texcel as an application developer and consultant. From 1988 to 1991 she worked, also in the field of application development and consulting, with Manfred Krüger, later MID/Information Logistics Group in Heidelberg.

Carolin Müller
Information Architect
VIA
Heidelberg
Germany
Email: mueller@via-iscm.de Web: www.via-iscm.de

Carolin has been with VIA informationsarchitekturen since 1998. She is an information architect and independent consultant for XML-based publication environments and semantic nets. Before this she was involved in linguistic research with the main focus on lexicography. Apart from her own studies, she participated in the design of an XML-based lexicological-lexicographical information system developed by the Institut für Deutsche Sprache (IDS) in Mannheim. She has also worked on a dictionary project covering the different language stages of Early New High German.