XML Europe 2001, 21-25 May 2001
Internationales Congress Centrum (ICC)
Berlin, Germany

Harvesting Knowledge from the Organization's Information Assets

Eric Freese <eric@isogen.com>

ABSTRACT

Most of an organization's corporate knowledge is contained in documents or in the minds of its human resources. To make effective use of this corporate knowledge, organizations must be able to access, harvest, organize and redistribute it. In this session a proof-of-concept, topic map-based system with the ability to build and manage topic map documents will be demonstrated. This system can currently identify and interpret the information found within XML documents, but could be expanded to other document formats. The system can be used to build new topic maps and to extend existing ones by reading from any XML document. This is done by developing import rules, which interpret the structure of the source document and build the appropriate topics, associations and occurrences. The information is aggregated to construct and maintain a knowledge base as the document collection grows. The information can also be further enhanced by adding links to other specialized knowledge bases. The system also includes an inferencing engine that allows the user to define rules by which new knowledge can be inferred automatically from the information already known to the system. The rules within the inference engine are themselves topic map structures.

1. Introduction

Studies have discovered that almost 90% of an organization's corporate knowledge is contained within its documents. That means all those memos, letters and reports that are sitting on file servers and network tape backups are not being used to their full potential. Imagine if it were possible to extract the knowledge contained within documents (not just the knowledge about the documents) and manage it in a corporate knowledge base that is continually being updated with new information, both manually and automatically.

Over the past year or two, a great deal of attention has been paid to topic maps, both in the ISO standard (ISO/IEC 13250:2000) variety and its web-enabled cousin, XML Topic Maps (XTM). Some of the focus has been on the knowledge management capabilities that these paradigms give to information owners. Some of the focus has been on the ability to interchange knowledge in a vendor-neutral format based on SGML or XML. Some of the focus has been on the enhanced accuracy of searching for information provided by the associative capabilities of topic maps.

Several organizations have developed topic map browsing and development tools that allow users to view the knowledge stored within a topic map. Some of the systems also allow users to manually build topic maps. However, this process is usually time-intensive and labor-intensive and often requires some sort of subject matter expertise in order to build the intricate associations between topics.

Imagine if it were possible to examine a document or set of documents, identify where topic material was located, and load the information into a topic map automatically. With a few simple actions, large complex topic maps could be constructed with relatively low effort. Inference rules could also be used to further enhance the extracted knowledge. Knowledge bases could be separated or merged as needed for application specific purposes. This paper presents a methodology by which documents can be examined and rules defined which extract the knowledge from the source documents and build a topic map complete with references (modeled as topic occurrences) back to the original source material.

A prototype topic map based system with the ability to build and manage topic map documents will be demonstrated. This system, written in Python, has the ability to identify and interpret the information found currently within XML documents, but could be expanded to other document formats using the grove paradigm described in the HyTime standard. The rules for harvesting are stored and managed within the system, allowing them to be reused on documents of the same type. The harvested information is aggregated to construct and maintain a knowledge base as the document collection grows. Adding links to other specialized knowledge bases can also further enhance the information. This system allows a user to query for specific information or to browse beginning with a piece of information. A user also has the ability to use the system to interpret the knowledge without manually browsing through the nodes. The system also includes an inferencing engine that allows the user to define rules by which new knowledge can be inferred automatically from the information already known to the system. The rules within the inference engine are themselves topic map structures.

2. Embedded Knowledge

The knowledge within an organization comes in many different forms. There is, of course, the knowledge retained by the human resources within the organization. This knowledge is gained from training and experience. In most cases, this knowledge is highly transitory. Once a person is no longer associated with the organization, the person's knowledge is also no longer available. Some organizations attempt to alleviate this problem by codifying a person's expertise in a form that can be used by expert systems or decision support systems. This is in no way the normal practice across the corporate world. Instead this practice is concentrated mostly in environments where specialized expertise is extremely rare, such as research or engineering environments.

A more common and mostly ignored pool of knowledge is contained within the documentation generated by almost every organization. This documentation can range from simple memoranda to technical reports to marketing literature. Within these documents is a wealth of corporate knowledge. Of course technical reports are seen as being very important pieces of information. However, simpler documents such as memos are often not as highly regarded. What is often missed is that these simple documents contain valuable information such as decision drivers, tribal knowledge within a smaller group, policies, procedures, opinions and many other types of information.

The challenge facing an organization is how to harvest this knowledge and manage it in a way that allows it to be applied into the future. Of course it may be possible to require people to enter information into a specialized system that codifies the knowledge. However, the impact on productivity caused by this method is generally seen as negative.

Some organizations place large document management systems as the foundation for information management within the organization. All documents generated by the organization are checked into the repository along with some sort of metadata. This metadata may take the form of keywords, abstracts, and other information dependent on users to enter. One of the main drawbacks is that quite often there are no set standards for the entry of this information or the language to be used. The net result is that inconsistent entry of metadata makes it difficult, if not impossible, to find all the information contained within the repository that may satisfy a particular query. Also, the amount of metadata allowed per item stored in the repository is often limited to the degree that it would be impossible to fully describe the knowledge contained within each managed document.

Another applicable drawback is that knowledge is often located in some unit smaller than a complete document. For example, a technical report may contain a section containing background information important to the overall report. The metadata stored within the repository will most likely not make mention of this background information meaning that any search for the information short of a full-text search will result in the information not being found. This is increasingly critical if the background information comes from a source external to the organization.

The following sections will discuss various methods for harvesting and managing the knowledge locked within the documentation. Unfortunately unlocking the knowledge from the human mind still requires assistance from the human in possession of the actual mind. Knowledge management tools need to be able to allow users to enter and build personal knowledge bases in an intuitive, non-intrusive manner.

Before moving on it would be useful to discuss the difference between data mining and a process the author has called "knowledge harvesting." Data mining is a process of discovering new meaningful correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. In short, data mining turns data into information. Data mining attempts to discover knowledge from large amounts of data through various means of automatic processing and analysis. Knowledge harvesting takes pieces of information that have already been identified in some way and collects them into a knowledge base for use by an application which can process the knowledge. Data mining attempts to discover knowledge in large data pools while knowledge harvesting attempts to use discovered knowledge to create new knowledge or solve a problem.

3. Contextual Knowledge Harvesting

One possible method for harvesting knowledge from documents is based on the context in which pieces of knowledge occur. In documents marked up using SGML or XML, the actual tags surrounding the information may provide this context. A mailing address within a document provides a simple example. Assume there is markup identifying, among other things, an individual, an organization, a city, and a state. The markup allows the specified strings of characters to be identified and have a label applied to them. It also makes it possible to develop rules that can then infer additional knowledge, including the fact that the city is located in the state, the organization is located in the city and the individual is somehow affiliated with the organization.
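The address example can be sketched in a few lines of Python. This is only an illustration of the idea, not the rule format used by any particular tool: the element names (`address`, `individual`, `organization`, `city`, `state`), the sample values and the relation labels are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical address markup; element names and values are illustrative only.
ADDRESS_XML = """
<address>
  <individual>Jane Doe</individual>
  <organization>Example Corp</organization>
  <city>St. Paul</city>
  <state>Minnesota</state>
</address>
"""

def harvest_address(xml_text):
    """Label the marked-up strings, then infer relations from their shared context."""
    root = ET.fromstring(xml_text)
    # Each semantic tag supplies a label for the string it surrounds.
    labels = {child.tag: (child.text or "").strip() for child in root}
    # Because the elements occur together in one address, relations can be inferred.
    facts = [
        (labels["city"], "located-in", labels["state"]),
        (labels["organization"], "located-in", labels["city"]),
        (labels["individual"], "affiliated-with", labels["organization"]),
    ]
    return labels, facts

labels, facts = harvest_address(ADDRESS_XML)
for subject, relation, obj in facts:
    print(subject, relation, obj)
```

The inferred relations come entirely from the co-occurrence of the elements within one `address` element, which is why the quality of the result depends on the consistency of the markup.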

While the example is extremely simplistic, it does illustrate that markup can provide some clue to the pieces of knowledge contained within a data set or document. Two types of markup can be useful in contextual knowledge harvesting. The first is structural markup, which identifies structural items such as chapters, sections or paragraphs. The second is semantic markup, which provides a meaningful label for the information that it marks up. Using the address example, city and state may be considered semantic markup. Structural markup might not be as helpful as semantic markup in identifying knowledge within a data set. It does, however, provide an anchor that can be used to identify the location of a piece of knowledge. Knowledge that is harvested from the documents can also include citations that point to the source of the knowledge. Non-semantic markup can also be applied in positional settings such as tables, where location can be used to identify the information contained therein. For example, the third column may contain a part name that can be harvested from a parts database.
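Positional harvesting from a table might look like the following sketch. The `table`, `row` and `cell` element names and the sample parts are hypothetical; the point is only that the column position identifies the information, while the structural path serves as a citation back to the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical parts table with purely structural markup (row/cell).
TABLE_XML = """
<table>
  <row><cell>1</cell><cell>A-100</cell><cell>Bracket</cell></row>
  <row><cell>2</cell><cell>A-200</cell><cell>Hinge</cell></row>
</table>
"""

def harvest_column(xml_text, column_index):
    """Return (location, text) for the cell at a given position in every row."""
    root = ET.fromstring(xml_text)
    harvested = []
    for row_number, row in enumerate(root.findall("row"), start=1):
        cells = row.findall("cell")
        if column_index < len(cells):
            # The structural position doubles as a pointer back to the source.
            location = f"table/row[{row_number}]/cell[{column_index + 1}]"
            text = (cells[column_index].text or "").strip()
            harvested.append((location, text))
    return harvested

# Third column (zero-based index 2) is assumed to hold the part name.
part_names = harvest_column(TABLE_XML, 2)
```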

It would be naive to assume that all documents are in a form such as XML that can be easily processed. In fact, it is safe to assume that a large share of an organization's document collection is stored in proprietary formats. This does not pose a significant problem as long as the tools used to create the documents are still able to process them. However, history has shown that a document creation process will almost certainly have documents that outlast it. Large collections of documents created using early word processing systems may currently exist electronically, but they are interchangeable only on paper.

The HyTime standard introduces the grove paradigm to solve this problem. Groves provide a mechanism for documents to be processed using a standard methodology. Groves can be applied to XML documents as well as non-XML documents. This is accomplished by creating property sets that identify classes of information contained within an application or document set. The property sets can be as specific or as general as necessary. For example, once a general property set has been developed for an application, it is theoretically possible to be able to build a grove from any document created with the application. Suddenly data locked in a proprietary format can be harvested and used well into the future. Once data is in a grove, it may then be possible to render it as XML or any other useful format.

Depending on the quality and consistency within the original source document, contextual harvesting may or may not be possible. For example, a document containing highly structured information - such as a table containing a parts list - may contain harvestable information, while a document containing only paragraphs may not.

For contextual harvesting to be successful, some analysis is required. The context of an element should be utilized to correctly identify the truly useful pieces of information. Knowledge harvesting exists by the same rule as most other computer processes – “Garbage In, Garbage Out.” By not taking the context of an element into consideration in the harvesting process, it becomes possible for extraneous data to be harvested that, while validly marked up, may not be entirely appropriate for inclusion in the overall knowledge base.

4. Content-based Knowledge Harvesting

Not all markup schemes provide semantic markup that can be useful for knowledge harvesting. An excellent example is HTML. While some HTML experts may cringe at the thought, it is fairly safe to say that many web page developers utilize the elements to specify the format to be applied to a piece of information. This can be easily illustrated by the web pages that use tables not to display tabular information, but to ensure the desired placement of the information on the page.

HTML documents are probably not dependable for contextual harvesting. The fixed tag set provides very little opportunity to identify information by the markup. However, a great many web pages exist which contain valuable knowledge. The challenge is to identify some method by which the knowledge can be harvested from these pages or from documents with only structural markup.

This method would require harvesting based on the content within the document. This method would not include natural language processing, but rather pattern identification and parsing. For example, a label might be included within the text identifying the data, or the data may have a specific format, such as a phone number or a social security number. This sort of pattern recognition falls more into the domain of data mining.
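A minimal sketch of this kind of pattern identification, using Python regular expressions, is shown below. The patterns cover only simple US-style formats and a made-up "Part No." label; they are assumptions for illustration and would need to be considerably stricter for real data.

```python
import re

# Illustrative patterns only; real data would need stricter definitions.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # A label inside the text can also identify the data that follows it.
    "part-number": re.compile(r"Part No\.\s*([A-Z]+-\d+)"),
}

def harvest_patterns(text):
    """Return (kind, value) pairs for every pattern match found in the text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the captured group when the pattern defines one.
            value = match.group(1) if pattern.groups else match.group(0)
            hits.append((kind, value))
    return hits

sample = "Call 651-555-0100 about Part No. AB-1234. SSN on file: 000-12-3456."
hits = harvest_patterns(sample)
```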

One additional feature of HTML documents that may provide useful contextual information is the application of accessibility features and considerations. The same markup and methodologies that make information more accessible to the print impaired may in fact also be used to aid in the harvesting of knowledge from the same documents by identifying specific pieces of information. One example is the use of alternate text for images. This alternate text can also be used to aid knowledge harvesting about the images themselves.
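As a sketch of how alternate text could be harvested, the following uses the `html.parser` module from the Python standard library. The sample page and file names are invented for the example, and the choice to skip empty `alt` attributes is an assumption about what counts as useful information.

```python
from html.parser import HTMLParser

class AltTextHarvester(HTMLParser):
    """Collect (src, alt) pairs for images that carry non-empty alternate text."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            alt = (attributes.get("alt") or "").strip()
            if alt:  # empty alt text says nothing about the image
                self.images.append((attributes.get("src"), alt))

def harvest_alt_text(html_text):
    parser = AltTextHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.images

# Hypothetical page fragment.
page = '<p><img src="logo.png" alt="XML Europe 2001 logo"><img src="spacer.gif" alt=""></p>'
images = harvest_alt_text(page)
```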

As mentioned in the contextual harvesting section, a great deal of care must also be taken in harvesting knowledge using the content-based method. Errant hits might occur when defined patterns are not sufficiently specific. The context of the data may prove to be a useful tool in identifying valid hits. For example, the hits located in titles may not be as desirable as those found within paragraphs.

5. Natural Language-based Knowledge Harvesting

Anyone who has entered a plain-English question into the Microsoft Office Assistant or a search engine knows that computer programs have great difficulty understanding natural language. Instead of locating an answer to the question, the programs often return long lists of irrelevant responses. The problem is that unlike people, computers are unable to pinpoint what is being asked. Why is this?

An important reason is that natural language is highly ambiguous. Consider the text "I am". The "I" could be a subject pronoun, a Roman numeral, or the symbol for electric current. The "am" could be a verb, an abbreviation for before noon, or the symbol for americium. Now consider a 5-word sentence, where each word has 4 meanings. This sentence could conceivably have 4 x 4 x 4 x 4 x 4 = 1024 meanings! Assuming the computer can generate all these meanings, it is then difficult for it to figure out which one was intended.
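The arithmetic can be checked with a short Python sketch. The candidate readings for "I" and "am" are those listed above; the function names are introduced here only for illustration.

```python
from itertools import product
from math import prod

# Candidate readings taken from the "I am" example in the text.
READINGS = {
    "I": ["subject pronoun", "Roman numeral", "symbol for electric current"],
    "am": ["verb", "abbreviation for before noon", "symbol for americium"],
}

def interpretations(words):
    """Enumerate every combination of readings for a sequence of words."""
    return list(product(*(READINGS[word] for word in words)))

def count_interpretations(meanings_per_word):
    """The number of combinations is the product of the per-word counts."""
    return prod(meanings_per_word)

i_am = interpretations(["I", "am"])            # 3 x 3 combinations
five_word_count = count_interpretations([4] * 5)  # 4 x 4 x 4 x 4 x 4
```

Even the two-word text yields nine candidate interpretations, and the count grows multiplicatively with each additional ambiguous word.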

Natural language processing is in its infancy. Computational linguists are only starting to learn how to capture word meanings in a computer and how to build programs that track the meanings of sentences. Current computer hardware is not up to the task, even given the best available software. Still, it seems likely that if we want them to, computers will eventually be able to understand and communicate in natural language as easily as we can.

Until that point, is it feasible to attempt to harvest knowledge from flowing text using natural language processing? Early indications would tend to suggest that it is. Several projects have identified methods for detecting the part of speech each word plays within a sentence. Others have defined heuristics for determining which definition for a particular word is appropriate within the context of the sentence.

Based on this work, it seems likely that knowledge harvesting is possible. However, as stated in the previous sections, care must be taken (probably even more so) when harvesting from natural language information. While it may seem worthwhile to harvest all the nouns or all the personal names from a text, doing so blindly may result in a knowledge base that is difficult to process or organize in any useful fashion.

6. XML Topic Maps (XTM)

A topic map is meant to convey knowledge about a set of resources. This is done by superimposing an information layer over the resources. The knowledge can then be managed separately from the resources it describes. A topic map captures and represents the subjects of the resources, and the relationships between subjects, using XML syntax and a DTD defined in the XTM specification.

Three key building blocks are used in the construction of topic maps: topics, associations and occurrences.

A topic is the representation of a resource within the computer for some real-world subject. Examples of such subjects might be the play Hamlet, the playwright William Shakespeare, or the "authorship" relationship.

Topics can have names. They can also have occurrences, which are information resources that are considered to be relevant in some way to their subject. Finally, topics can participate in relationships, called associations, in which they play roles as members.

Topics have three kinds of characteristics: names, occurrences, and roles played as members of associations. The assignment of such characteristics is considered to be valid within a certain scope, or context.
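One way to picture these building blocks is as a small in-memory data model. The sketch below is a simplification written for this purpose and is not the XTM DTD or any tool's internal model: the class names, the `UNCONSTRAINED` scope label and the example URL are all assumptions.

```python
from dataclasses import dataclass, field

UNCONSTRAINED = "unconstrained"  # assumed label meaning "valid in every context"

@dataclass
class Topic:
    topic_id: str
    names: list = field(default_factory=list)        # (name, scope) pairs
    occurrences: list = field(default_factory=list)  # (resource, scope) pairs

@dataclass
class Association:
    assoc_type: str
    members: dict  # role -> topic_id
    scope: str = UNCONSTRAINED

@dataclass
class TopicMap:
    topics: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)

    def add_topic(self, topic):
        self.topics[topic.topic_id] = topic
        return topic

    def roles_played(self, topic_id):
        """List the (role, association type) pairs in which a topic is a member."""
        return [
            (role, assoc.assoc_type)
            for assoc in self.associations
            for role, member in assoc.members.items()
            if member == topic_id
        ]

tm = TopicMap()
hamlet = tm.add_topic(Topic("hamlet", names=[("Hamlet", UNCONSTRAINED)]))
hamlet.occurrences.append(("http://example.org/hamlet.xml", UNCONSTRAINED))
tm.add_topic(Topic("shakespeare", names=[("William Shakespeare", UNCONSTRAINED)]))
tm.associations.append(
    Association("authorship", {"author": "shakespeare", "work": "hamlet"})
)
```

Each name, occurrence and association carries a scope, reflecting the statement that characteristics are valid only within a certain context.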

Topic maps can be merged. Merging can take place at the discretion of the user or application (at runtime), or may be indicated by the topic map's author at the time of its creation.

7. Knowledge Harvesting using SemanText

The SemanText system is a demonstration topic map based application that builds semantic networks from topic maps. It was first announced at XML Europe 2000. It is written in Python. Semantic network nodes are created from topics and topic types. Links are created from associations between the topics. Additional rule-based information can be added to allow the semantic network processor to infer new knowledge beyond the class-instance relationship that is defined in the standard.

SemanText supports topic map creation, modification and browsing. The current version of SemanText supports only the ISO 13250:2000 Topic Map definition. Work is under way to support the XTM model, now that the specification has been completed.

SemanText uses a customized HTML browser interface that presents the topic map information in a manner that is familiar and intuitive to most users. When running in a Microsoft Windows environment, occurrence links are automatically displayed in the appropriate application. In other operating systems, occurrences are fed to a browser to handle in the most appropriate way.

The user begins browsing through the topic map by selecting a topic or topic type from the menu. All the information associated with the topic, within a given scope, is displayed including any related topic and topic types, associations, and links to all occurrences. Related topics can be selected from this frame or by using the menu selections.

SemanText can be used to create and modify topic maps. Existing topic maps can be read and modified. New topic maps created in SemanText contain a set of published subjects needed to support all the functionality supported within the tool. Users can build topic maps by entering the information manually using a series of dialogs. Topics are created by completing input forms. Specialized types of topics can also be created using similar input forms. Specialized dialogs also allow associations and occurrences to be created.

Topic maps can be merged in two ways. A full merge combines two topic maps into one, connecting and resolving common topics with user intervention. SemanText also allows a different type of merge called a reference merge, where the topic maps remain separate, but links are made to common topics. This allows the base topic map being used to remain separate while still being able to reference one or more other topic maps.
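The difference between the two kinds of merge can be illustrated with a deliberately simplified sketch. It is not SemanText's implementation: here a topic map is reduced to a dictionary from topic name to a set of occurrence references, topics are treated as common when their names are identical, and no user intervention is modeled. The file names are invented.

```python
def full_merge(map_a, map_b):
    """Combine two maps into one new map; same-named topics are unified."""
    merged = {name: set(occurrences) for name, occurrences in map_a.items()}
    for name, occurrences in map_b.items():
        merged.setdefault(name, set()).update(occurrences)
    return merged

def reference_merge(map_a, map_b):
    """Leave both maps untouched and return only the links between common topics."""
    return sorted(name for name in map_a if name in map_b)

# Hypothetical topic maps: topic name -> set of occurrence references.
base = {"Albania": {"factbook.xml#albania"}, "Tirane": {"factbook.xml#tirane"}}
other = {"Albania": {"atlas.xml#al"}, "Greece": {"atlas.xml#gr"}}

merged = full_merge(base, other)     # one combined map
links = reference_merge(base, other) # maps stay separate, linked on "Albania"
```

In the full merge the occurrences of the common topic are pooled into a single topic, while the reference merge records only that the two maps share it.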

SemanText supports contextual knowledge harvesting as described earlier. Users can build topic maps by parsing XML and SGML files and harvesting information from them into topics and associations. This automatic method uses a tree representation of the source file where the user can specify an element and how the element and its contents should be added to the topic map. Users can select a single instance of an element, or select all instances of an element within a given context. This second method requires additional analysis to prevent errant harvesting from occurring.

<country id="s31-1cid-cia-Albania">
 <name>Albania</name>
 <total_area>28750</total_area>
 <population>3249136</population>
 <car_code>AL</car_code>
 <population_growth>1.34</population_growth>
 <infant_mortality>49.2</infant_mortality>
 <gdp_total>4100</gdp_total>
 <gdp_agri>55</gdp_agri>
 <inflation>16</inflation>
 <indep_date>28 11 1912</indep_date>
 <government>emerging democracy</government>
 <encompassed continent="europe">100</encompassed>
 <ethnicgroup name="Greeks">3</ethnicgroup>
 <ethnicgroup name="Albanian">95</ethnicgroup>
 <religion name="Muslim">70</religion>
 <religion name="Roman Catholic">10</religion>
 <religion name="Albanian Orthodox">20</religion>
 <border country="cid-cia-Greece">282</border>
 <border country="cid-cia-Macedonia">151</border>
 <border country="cid-cia-Serbia-and-Montenegro">287</border>
 <city is_country_cap="yes" id="s31-1cty-cid-cia-Albania-Tirane">
 <name>Tirane</name>
 <longitude>10.7</longitude>
 <latitude>46.2</latitude>
 <population year="87">192000</population>
 </city>
 <city id="s31-1stadt-Shkoder-AL-AL">
 <name>Shkoder</name>
 <longitude>19.2</longitude>
 <latitude>42.2</latitude>
 <population year="87">62000</population>
 <located at="lake" ref="lake-Skutarisee"/>
 </city>
 <city id="s31-1stadt-Durres-AL-AL">
 <name>Durres</name>
 <longitude>19.3</longitude>
 <latitude>41.2</latitude>
 <population year="87">60000</population>
 <located at="sea" ref="sea-Mittelmeer"/>
 </city>
 <city id="s31-1stadt-Vlore-AL-AL">
 <name>Vlore</name>
 <longitude>19.3</longitude>
 <latitude>40.3</latitude>
 <population year="87">56000</population>
 <located at="sea" ref="sea-Mittelmeer"/>
 </city>
 <city id="s31-1stadt-Elbasan-AL-AL">
 <name>Elbasan</name>
 <longitude>20.1</longitude>
 <latitude>41.1</latitude>
 <population year="87">53000</population>
 </city>
 <city id="s31-1stadt-Korce-AL-AL">
 <name>Korce</name>
 <longitude>20.5</longitude>
 <latitude>40.4</latitude>
 <population year="87">52000</population>
 </city>
 </country>

In the example above, it would be possible to build a topic map from the CIA World Factbook by simply defining a set of rules for harvesting the information. For instance, a topic could be created for each country, using the id attribute within the country element as the identifier, and taking the content of the name sub-element as the base name. The same rule could be applied to city elements. Other topics could be harvested including religions, states, or provinces. Occurrences could be defined to contain the data about each country such as the population or the infant mortality rate. Information could be harvested from element content or from attribute values.
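Harvesting rules of this kind might be expressed as in the following sketch, which uses an abbreviated copy of the Albania data. The rule format (a dictionary naming an element and the child elements to turn into occurrences) is hypothetical and is not the format SemanText uses; it simply mirrors the rules described above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample country data.
SOURCE = """
<country id="s31-1cid-cia-Albania">
 <name>Albania</name>
 <population>3249136</population>
 <infant_mortality>49.2</infant_mortality>
 <city is_country_cap="yes" id="s31-1cty-cid-cia-Albania-Tirane">
  <name>Tirane</name>
  <population year="87">192000</population>
 </city>
 <city id="s31-1stadt-Shkoder-AL-AL">
  <name>Shkoder</name>
  <population year="87">62000</population>
 </city>
</country>
"""

# Hypothetical rule format: which element becomes a topic, and which of its
# child elements become occurrences of that topic.
RULES = [
    {"element": "country", "occurrences": ["population", "infant_mortality"]},
    {"element": "city", "occurrences": ["population"]},
]

def harvest_topics(xml_text, rules):
    """Create one topic per matching element: id attribute -> topic record."""
    root = ET.fromstring(xml_text)
    topics = {}
    for rule in rules:
        for element in root.iter(rule["element"]):
            topics[element.get("id")] = {
                "type": rule["element"],
                "base_name": element.findtext("name"),
                # findtext looks at direct children only, so a country's
                # population is not confused with that of its cities.
                "occurrences": {
                    child: element.findtext(child)
                    for child in rule["occurrences"]
                    if element.find(child) is not None
                },
            }
    return topics

topics = harvest_topics(SOURCE, RULES)
```

Because the rules are data rather than code, the same set could be reused on any other document of the same type.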

Associations could be created by connecting the harvested topics. For example, cities could be associated with countries, countries could be associated with bordering countries, etc. The associations could be reified to allow occurrences to be added to them. Doing so would make it possible to say that within a specific country a certain religion makes up so much of the population.
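A sketch of association harvesting from the same data follows. The association type names and role names are invented for the example, and reification is approximated by letting each association record carry its own occurrences. The element content is stored as found; interpreting the religion value as a share of the population follows the description above, while the meaning of the border value is not stated in the sample and is left uninterpreted.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample country data.
SOURCE = """
<country id="s31-1cid-cia-Albania">
 <name>Albania</name>
 <religion name="Muslim">70</religion>
 <religion name="Roman Catholic">10</religion>
 <border country="cid-cia-Greece">282</border>
 <city id="s31-1stadt-Durres-AL-AL">
  <name>Durres</name>
 </city>
</country>
"""

def harvest_associations(xml_text):
    """Build association records; type and role names are hypothetical."""
    country = ET.fromstring(xml_text)
    country_id = country.get("id")
    associations = []
    for city in country.findall("city"):
        associations.append({
            "type": "city-in-country",
            "members": {"city": city.get("id"), "country": country_id},
            "occurrences": {},
        })
    for border in country.findall("border"):
        associations.append({
            "type": "borders",
            "members": {"country": country_id, "neighbor": border.get("country")},
            # Element content kept as-is; its meaning is not stated in the sample.
            "occurrences": {"value": border.text},
        })
    for religion in country.findall("religion"):
        associations.append({
            "type": "religion-in-country",
            "members": {"religion": religion.get("name"), "country": country_id},
            # Occurrence attached to the association itself, not to either topic.
            "occurrences": {"share_of_population": religion.text},
        })
    return associations

associations = harvest_associations(SOURCE)
```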

The significance of this capability is that it is now possible to harvest information stored in other XML formats, such as RDF or DAML+OIL, into a topic map. It would then also be possible to combine topic maps created by several different communities, from sources in several different formats, into single merged topic maps.

As new items are harvested into the topic map, the inferencing engine can build new knowledge based on the rules defined and the information already existing within the topic map. This process allows new information to be incorporated into an existing corpus seamlessly.
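The general idea of rule-based inference can be sketched as naive forward chaining over simple facts. This is not SemanText's inferencing engine, whose rules are themselves topic map structures; here rules are plain Python functions and facts are (subject, relation, object) tuples, both chosen only to keep the illustration short. The two rules shown are assumptions about what a user might define.

```python
def infer(facts, rules):
    """Apply every rule repeatedly until no new facts appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Each rule returns a freshly built set, so adding to
            # `known` inside this loop is safe.
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known

def located_in_is_transitive(known):
    """If A is located in B and B is located in C, then A is located in C."""
    return {
        (a, "located-in", c)
        for (a, rel1, b) in known if rel1 == "located-in"
        for (b2, rel2, c) in known if rel2 == "located-in" and b2 == b
    }

def borders_is_symmetric(known):
    """If A borders B, then B borders A."""
    return {(b, "borders", a) for (a, rel, b) in known if rel == "borders"}

harvested = {
    ("Tirane", "located-in", "Albania"),
    ("Albania", "located-in", "Europe"),
    ("Albania", "borders", "Greece"),
}
knowledge = infer(harvested, [located_in_is_transitive, borders_is_symmetric])
```

Running the rules again after each harvest would let newly added facts combine with those already in the knowledge base.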

8. Conclusion

Many of today's knowledge management systems manage little more than documents and metadata about those documents. To really manage the knowledge owned by an organization it must be possible to harvest the knowledge below the document level. This paper has presented different methods that can be employed to do just that.

Topic maps can be used to interchange the knowledge harvested from a document collection. They also provide a mechanism for mapping the knowledge discovered back to the source documents in order to add a new level of context to the documents.

XML provides a way to identify knowledge within documents so that more extensive text analysis and processing tools are not needed to harvest the knowledge within a document. The importance of semantic markup is shown by the power given to knowledge harvesting processes that can work off of the semantic markup.

Biography

Eric Freese
Senior Consultant
ISOGEN International/DataChannel
St. Paul
Minnesota
USA
Email: eric@isogen.com Web: www.isogen.com

Mr. Eric Freese is a Senior Consultant for ISOGEN International, a DataChannel company. Mr. Freese has more than a dozen years of experience in the area of information, document, and knowledge management. His specific expertise is in the development of products and implementation of technologies within the SGML/XML domain. This experience includes research, analysis, specification, design, development, testing, implementation, integration and management of database systems and computer technologies in business, education and government environments. He also has research experience in human interface design, graphical interface development and artificial intelligence. He is the current chairman of TopicMaps.org, the organization responsible for XTM, the topic map standard for the web.

Mr. Freese has spoken at events worldwide, including XML conferences in North America, Europe and Asia/Pacific, Seybold Paris, the International WWW Conference and the Aerospace Applications of Artificial Intelligence Conference, on subjects including structured information, data and document management and artificial intelligence.