TrAC - Internet Column


To cite this article please refer to the printed edition of TrAC: Trends Anal. Chem., 16 (1997) 63

Challenges for chemical information extraction and text retrieval

Bas van Bakel* & Geert Postma**

*Knowledge-based Systems Group, University of Twente, the Netherlands

bakel@cs.utwente.nl

**Dept. of Analytical Chemistry, University of Nijmegen, the Netherlands

geertp@sci.kun.nl

Introduction

Computer and information technology have expanded enormously over the last few years, creating a number of promising possibilities for the (rapid) exchange of information. Concepts like Gopher, ftp, URL, http, and html are now part of the standard knowledge required when dealing with computer-mediated communication (cf. [1], [2]). There is widespread optimism that these new technologies already contribute to a more effective and efficient way of conducting scientific research. We will try to show that this view is somewhat too optimistic by focusing on vital issues concerning automatic document processing.

Document Processing

Until now, new media such as the World Wide Web (WWW) and CD-ROM have been used mainly for conventional information distribution, albeit in digitized form (cf. [2]). Electronic distribution of chemical information, for example, is usually provided by existing institutions such as universities and publishing companies. However, the number of documents available through digital media is growing enormously. In science alone, hundreds of thousands of publications within various disciplines are released in computer-readable form every year. Because of this huge supply of information it is extremely difficult, if not impossible, to select from all these documents (of which just a subset actually consists of written articles) the ones that are of interest, and thus to keep track of the developments in specific scientific disciplines (or in all disciplines, which would be the task of an electronic library service). This has created a demand for computer systems that can track down information of interest and make it available and accessible, and it has resulted in various initiatives to develop computer programs known as Information Retrieval Systems (or Document Retrieval Systems) and Information Extraction Systems.

The main goal of Information Retrieval Systems (well-known examples are the internet search engines) is to locate and open up relevant information within documents in response to a specific request by a user. Information Extraction Systems are concerned with extracting all relevant information from previously selected documents and storing this information in a database, which can then serve as a knowledge resource for scientists (for example, a source of analytical methods and techniques for chemists). Both kinds of systems may contribute to more effective and efficient information acquisition. They may even prove to be a substantial factor in future developments in electronic publishing (cf. [3]), as the role, and with it the viability, of (electronic) publishers might be reduced considerably once scientists have good information retrieval and extraction systems at their disposal. However, this is an image of a still distant future.

The majority of current information retrieval systems are based mainly on statistical methods. Other knowledge, such as knowledge of the domains (e.g. Analytical Chemistry) covered by the documents, or knowledge of natural language processing, is used only to a small extent. The scientific field of Natural Language Processing (NLP) still has to make its first significant contribution to improving document retrieval systems (cf. [4]). Nevertheless, the possibilities of statistics-based programs are limited (cf. [5]). At TREC (Text Retrieval Conference), an annual conference organised by the American National Institute of Standards and Technology, state-of-the-art document retrieval systems have reached an effectiveness of 30% recall (defined to be 100% when all documents relevant to a specific user request are found and 0% when none are found), at a precision of 55% (defined to be 100% when only relevant documents are found, and 0% when no retrieved document is relevant). Whether these results are satisfactory depends on the objectives pursued. The results may suffice for someone interested in a few introductory documents or links to other documents on a specific topic, but for a patent office, for example, they are hardly sufficient. What is more, regardless of the objectives of retrieval, a search engine that misses 70% of the relevant documents, and 45% of whose output consists of irrelevant documents, will not be highly appreciated. Present search engines produce a great amount of digital 'scrap', which is disadvantageous both for the users of these systems (it leaves a lot of work to be done) and for their producers. Nobody wants to pay for digital scrap, or for systems that produce it.
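The definitions of recall and precision given above can be made concrete with a small calculation. The Python sketch below is an illustration only: the function name recall_precision and the document identifiers are our own assumptions, and the toy numbers were chosen to reproduce the 30% recall and 55% precision quoted above; they are not TREC data.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single user request.

    retrieved: set of ids of the documents returned by the system
    relevant:  set of ids of the documents that are actually relevant
    """
    hits = len(retrieved & relevant)          # relevant documents that were found
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Toy request (invented numbers): 110 relevant documents exist and the
# system returns 60 documents, 33 of which are relevant.
relevant = set(range(110))                            # ids 0..109
retrieved = set(range(33)) | set(range(1000, 1027))   # 33 hits + 27 irrelevant
recall, precision = recall_precision(retrieved, relevant)
print(f"recall = {recall:.0%}, precision = {precision:.0%}")
# prints: recall = 30%, precision = 55%
```

In this toy case 77 of the 110 relevant documents are never shown to the user, and 27 of the 60 documents that are shown are irrelevant.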

Automatic extraction of information deals to some extent with problems similar to those of information retrieval. Both involve text analysis of some kind, and both may benefit from the use of knowledge of the domain(s) involved and of NLP knowledge. As in retrieval, information extraction systems based on statistical methods alone perform insufficiently. In extraction, 100% coverage with 100% accuracy cannot be obtained. In some cases even human experts are puzzled as to what exactly is meant. The performance of current NLP-based information extraction systems indicates that further research is still necessary. A comparison of the performance of a series of systems in 1991 revealed that for relatively free text (news articles on a specific subject) an average of 26% recall and 52% precision could be obtained [6]. A more recent investigation [7] showed figures of 46% and 52%, respectively. These figures were the means of the three best performing systems on one domain, but they cannot be compared directly with the older figures, because the information extraction task that resulted in the latter figures was evaluated to be more difficult, partly because the metrics used changed slightly. On the same corpus four human analysts obtained 77% recall and 79% precision. The figures for the automatic systems are not satisfactory, and further research is needed to build systems that perform better. As in Document Retrieval, research should involve the development of deterministic rather than probabilistic systems. This is why much effort is put into projects that employ NLP knowledge and domain knowledge. At present, various research projects of an interdisciplinary nature are being conducted, involving not only software engineers but also knowledge engineers, domain experts and NLP scientists. In the long run these projects are expected to result in knowledge-based systems that perform well in real situations (e.g. as search engines for the internet) and replace the current ones.

Chemical Information Processing

Within the domain of chemistry, several research projects have addressed automatic information extraction, covering various subdomains. For example, Nishida, Takamatsu and Fujita [8] developed a system for extraction and storage of information contained within patent claim sentences in the domain of semiconductor production. Ai, Blower and Ledwith [9] developed a system for the extraction of (part of the) procedural synthesis information from the experimental section of a journal for organic chemistry. Chowdhury & Lynch [10] have worked on a system for extraction, representation and storage of textual descriptions of compounds in a chemical database, and the work of Mars & Van der Vet ([11], [12]) is concerned with information extraction from a set of document descriptions on mechanical properties of ceramic materials.

At the University of Nijmegen, the Netherlands, we have developed a system for information extraction from short analytical method descriptions (title plus abstract) taken from Analytical Abstracts, on four analytical techniques, i.e. High-Performance Liquid Chromatography, Inductively Coupled Plasma, Atomic-Absorption Spectrometry, and Titrimetry. These document descriptions contain relatively free text within the analytical chemistry domain. In addition, some of the sentences are written in a more or less telegraphic style, and the sentences may contain a number of defined abbreviations. Although abstracts do not seem to be a reliable source of information [13], they were used because their specific text characteristics make them an ideal test domain for the system with a view to various applications. The information to be extracted consists of the characteristics of analytical methods (analyte, matrix, working range, applied technique, precision, accuracy and detection limit; see also the content requirements of analytical abstracts as described in [13]) and the described actions that (roughly) comprise the analytical method, together with the participants and circumstances of the actions.
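The extraction target described above can be pictured as a record with one storage field per characteristic. The following Python sketch is only an illustration under our own assumptions: the class and field names (AnalyticalMethod, Action, missing_fields) are hypothetical and do not reflect the storage format of the Nijmegen system. The example values are taken from the title phrase shown in Figure 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """One described step of the method, with its participants and circumstances."""
    verb: str
    participants: List[str] = field(default_factory=list)
    circumstances: List[str] = field(default_factory=list)


@dataclass
class AnalyticalMethod:
    """Characteristics of an analytical method; None means 'not stated in the text'."""
    analyte: Optional[str] = None
    matrix: Optional[str] = None
    working_range: Optional[str] = None
    technique: Optional[str] = None
    precision: Optional[str] = None
    accuracy: Optional[str] = None
    detection_limit: Optional[str] = None
    actions: List[Action] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of the characteristics that the text did not supply."""
        return [name for name, value in vars(self).items() if value is None]


# Record filled from a title alone: only three characteristics are available.
record = AnalyticalMethod(
    analyte="phosphorus",
    matrix="milk",
    technique="electrothermal-atomization atomic-absorption spectrometry",
)
print(record.missing_fields())
# prints: ['working_range', 'precision', 'accuracy', 'detection_limit']
```

Keeping absent values explicit makes it easy to report which characteristics a given document description fails to provide.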

[NLP system building blocks: lexical module, syntactic module, semantic module, postprocessor, discourse module]
Figure 1: Overview of the Nijmegen Information Extraction System.

The information extraction system consists of two major components: a linguistic module and a chemical knowledge module. A strictly modular approach is used for reasons of maintainability and expandability. Extraction of relevant information is conducted as follows. First, a domain-independent semantic analysis of the abstract is produced by identifying and analysing, in that order, the word classes, syntactic structures and thematic relations within all sentences in the text. After this, the chemical knowledge module performs an analytical chemical discourse analysis to extract all relevant information. This process is illustrated in Fig. 1. Discourse (or pragmatic) analysis consists of constructing the 'story' that is told in the abstracts, using background information on the various analytical techniques that are mentioned in the abstracts. This analysis is based on Sowa's Conceptual Graph Theory (cf. [14]), which appears to place no limits on the conceptual structures that can be represented. An example of the system's output is presented in Fig. 2. The linguistic module is based on principles of Chomsky's Government & Binding theory (cf. [15]). This theory is used in other systems for other domains as well (cf. [16], [17], [18], [19], and [20]). Figure 1 illustrates that the system is domain-independent in its set-up, which means that its parts can be used to cover English texts in a different domain, or analytical chemical texts written in a different natural language.
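The modular set-up described above can be sketched as a chain of interchangeable stages. The Python fragment below is a minimal illustration under our own assumptions, not the implementation of the Nijmegen system: every stage is a placeholder, and the names (run_pipeline, LINGUISTIC_MODULE, chemical_discourse) are hypothetical.

```python
from typing import Callable, Dict, List

Analysis = Dict[str, object]
Stage = Callable[[Analysis], Analysis]


def lexical(a: Analysis) -> Analysis:
    # Placeholder: split into tokens; a real module would assign word classes.
    return {**a, "tokens": str(a["text"]).rstrip(".").split()}


def syntactic(a: Analysis) -> Analysis:
    # Placeholder: a real module would build syntactic structures.
    return {**a, "syntax": ("PHRASE", a["tokens"])}


def semantic(a: Analysis) -> Analysis:
    # Placeholder: a real module would assign thematic relations.
    return {**a, "thematic": []}


def chemical_discourse(a: Analysis) -> Analysis:
    # Placeholder for the domain-specific chemical knowledge module.
    return {**a, "domain": "analytical chemistry"}


def run_pipeline(text: str, stages: List[Stage]) -> Analysis:
    """Pass the text through the stages in order, accumulating results."""
    analysis: Analysis = {"text": text}
    for stage in stages:
        analysis = stage(analysis)
    return analysis


LINGUISTIC_MODULE = [lexical, syntactic, semantic]   # domain-independent part
result = run_pipeline("Determination of phosphorus in milk.",
                      LINGUISTIC_MODULE + [chemical_discourse])
print(result["tokens"])
# prints: ['Determination', 'of', 'phosphorus', 'in', 'milk']
```

Because each stage only reads and extends the shared analysis, covering another domain would in this sketch amount to replacing chemical_discourse, and covering another language to replacing the stages in LINGUISTIC_MODULE.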

[Example Conceptual Graph]
Figure 2: Result of the interpretation of the title phrase ``Determination of phosphorus in milk by electrothermal-atomization atomic-absorption spectrometry with L'vov platform and Zeeman background correction.'' The extracted parts of information are presented in italics, together with their appropriate type or storage field.
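As a rough textual stand-in for such a graph, the title phrase of Figure 2 can be written as typed concepts linked by relations. The Python sketch below rests on our own assumptions: the concept types and the relation labels (OBJ, IN, INST) are hypothetical choices for illustration and are not copied from the system's output, and the L'vov platform and Zeeman background correction are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Concept:
    type: str       # concept type or storage field, e.g. "ANALYTE"
    referent: str   # the text fragment the concept refers to


# A relation is (source concept, relation label, target concept).
Relation = Tuple[Concept, str, Concept]

determination = Concept("ACTION", "determination")
phosphorus = Concept("ANALYTE", "phosphorus")
milk = Concept("MATRIX", "milk")
technique = Concept(
    "TECHNIQUE", "electrothermal-atomization atomic-absorption spectrometry")

graph: List[Relation] = [
    (determination, "OBJ", phosphorus),   # what is determined
    (determination, "IN", milk),          # in which matrix
    (determination, "INST", technique),   # by which technique
]


def linear_form(relations: List[Relation]) -> str:
    """Render each relation as '[TYPE: referent] -> (REL) -> [TYPE: referent]'."""
    def box(c: Concept) -> str:
        return f"[{c.type}: {c.referent}]"
    return "\n".join(f"{box(s)} -> ({r}) -> {box(t)}" for s, r, t in relations)


print(linear_form(graph))
```

The first line printed is "[ACTION: determination] -> (OBJ) -> [ANALYTE: phosphorus]"; the typed concepts correspond to the storage fields mentioned in the caption.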

The above is just a very brief description of the information extraction system that was developed. In [21], [22], or [23], the system is described in great detail.

Discussion

In four years of research we have built a knowledge-based, deterministic system that can conduct automatic information extraction from analytical chemical document descriptions. The linguistic module is capable of performing a thematic analysis of the input, provided that sufficient information on words, word classes and (in the case of verbs and nouns) thematic roles is available. In addition, a discourse analysis module was developed, together with a knowledge base, that can construct the stories told in the document descriptions. However, so far the entire system works only on a small body of texts, although the lexical and syntactic analysers cover a great deal more of the same body. Initially, the goal of the Nijmegen research was to use a body of 124 texts and to divide it into two subsets, developing the system using one subset and testing it with the other. This meant that the system should be able to process 62 abstracts. During the research this goal proved to be too ambitious, most of all because of the limited time and manpower available. We faced a number of problems that are typical of document processing systems, some of which we will discuss below.

Prior to the research discussed here, two pilot studies were conducted (cf. [24], [25]) in which the possibilities of building an information extraction system based on the use of domain and linguistic knowledge were explored. We assumed that, by elaborating on the findings of these pilot studies, it would be possible to build a working implementation for the entire body of texts within four years, i.e. with an investment of only eight man-years. However, this turned out to be too optimistic. During the research we were confronted with a number of practical problems that we had not anticipated, or had underestimated, and that hindered the development of a large-scale operational system. For example, in order to process a large number of document descriptions, the system should have a large lexicon containing all the required syntactic and semantic information in a processable format. We had to build such a lexicon ourselves, as every existing source was either unusable or too expensive. The same holds for knowledge bases. Scientific publishers, or institutions like Chemical Abstracts Service, could play an important role in this respect. These difficulties regarding knowledge resources are among the main problems in building real-world document processing systems. Creating these facilities is expensive not only because it is labour-intensive, but also because the required knowledge resources can only be obtained at considerable cost. Obviously, this need not be a problem for commercial R&D teams.

Other problems we encountered during the research involved the quality of the document descriptions. Misspelled words, ungrammatical sentences and even missing information on the characteristics of the analytical methods described (despite all the criteria that these document descriptions have to meet) complicated automatic information extraction considerably. Given the limited amount of time available, we were forced to deviate from the original research objectives. We therefore decided to focus on finding out exactly what (kind of) knowledge should be used in the information extraction system, and in what way; in other words, to follow a principle-based approach. The system thus built should first and foremost have sound theoretical foundations rather than be based on relatively ad hoc analysis strategies. As a consequence of this change of objective, we had to abandon the original objective of processing 62 document descriptions and focus on a small body of representative descriptions instead. This task was far from simple, as principle-based automatic information extraction turned out to be virgin territory. Despite these setbacks, a system was built based on a set of principles that proved to be fundamental for linguistics-based and knowledge-based information extraction. For example, a strategy was developed which does away with the notorious problem of syntactic and semantic ambiguities within texts. The combination of linguistic and domain knowledge has proved to be particularly powerful in this respect.

The system produced can be regarded as a good starting point for building real-world applications of document processing systems. This is acknowledged by another research group: the Knowledge-Based Systems Group of the University of Twente uses parts of the linguistic module in Condorcet, a domain-specific Information Retrieval (IR) system for two domains: mechanical properties of engineering ceramics, as a subfield of engineering, and epilepsy, as a subfield of medicine. In this project the problems sketched here are tackled, as the main objective is to build a prototype real-world IR system that is robust and fast, and performs better than existing state-of-the-art IR systems.

Conclusions

Information technology still has a long way to go before document processing systems are fit for the job we want them to do. Systems based on statistical methods have their limitations, and principle-based commercial applications for real-life situations still have to be developed. However, there is no reason to be too pessimistic about this. At a recent conference on full-text processing in Bath, England ([26]), producers of commercial text search engines unanimously acknowledged that the recent past has taught us that million-dollar investments in manpower and computer systems do not guarantee well-performing commercial products. What is more, it is now also a generally accepted view that future systems should employ NLP-based and knowledge-based techniques. It is therefore very likely that investments will be made in the development of such systems, and possibly concurrently in projects such as those discussed above, although the objectives will most likely be more commercially oriented. Perhaps this will lead to good document processing systems that will perform considerably better than we now dare hope. Only then can we truly say that the new technologies have contributed to more effective and efficient scientific research.

References

  1. Steven M. Bachrach
    Chemistry on the Internet: The Northern Illinois University Chemistry WWW/Gopher Site (10 April 1995)
  2. Brian M. Tissue
    Distributing and Retrieving Chemical Information Using the World-Wide Web (28 July 1995)
  3. Stephen R. Heller
    Publishing on the Internet: a Proposal for the Future (28 July 1995)
  4. Donna Harman, Peter Schäuble and Alan Smeaton
    "Document Processing", in: Ronald A. Cole et al. [eds.], Survey of the State of the Art in Human Language Technology
  5. Karen Sparck Jones
    "Information Access"; lecture at the Language and Technology Awareness Day, organised by the European Committee, Prague (18 November 1994)
  6. Lehnert, W. & Sundheim, B.
    "A performance evaluation of text-analysis technologies," in AI Magazine 12 1991, pp. 81-94.
  7. Fifth Message Understanding Conference (MUC-5)
    Proceedings of a conference held in Baltimore, Maryland, August 25-27, 1993, Morgan Kaufmann Publishers, San Francisco, 1993
  8. F. Nishida, S. Takamatsu and Y. Fujita
    "Semi-automatic indexing of structured information on text", in: Journal of Chemical Information and Computer Science 24 1990, pp. 163-169.
  9. C.S. Ai, P.E. Blower and R.H. Ledwith
    "Extraction of Chemical Information from Primary Journal Texts", in: Journal of Chemical Information and Computer Science 30 1990, pp. 163-169.
  10. G.G. Chowdhury & M.F. Lynch
    "Automatic interpretation of the texts of chemical patent abstracts; 2: Processing and results", in: Journal of Chemical Information and Computer Science 32 1992, pp. 468-473.
  11. Nicolaas J.I. Mars & Paul E. van der Vet
    "A semi-automatically generated knowledge base for direct answers to user questions", in: Czap, H. & W. Nedobity [eds.]: TKE '90: Terminology and knowledge engineering, Indeks Verlag, Frankfurt am Main 1990, pp. 352-362.
  12. Paul E. van der Vet & Nicolaas J.I. Mars
    "Structured system of concepts for storing, retrieving, and manipulating chemical information", in: Journal of Chemical Information and Computer Science 33 1993, pp. 564-568.
  13. Geert Postma & G. Kateman
    "The Quality of Analytical Information Contained within Abstracts and Papers on New Analytical Methods", in: Anal. Chim. Acta 265 1992, pp. 133-155.
  14. J.F. Sowa
    Conceptual structures: information processing in mind and machine, Reading, Mass. 1984.
  15. Noam Chomsky
    Lectures on Government & Binding, Dordrecht 1981.
  16. J. Fargues, M-C. Landau, A. Dugourd and L. Catach
    "Conceptual graphs for semantics and knowledge processing," in: IBM Journal of Research and Development 30 1986, pp. 70-79.
  17. M.L. McHale & S.H. Myaeng
    "Integration of Conceptual Graphs and Government-Binding Theory", in: Knowledge-Based Systems 5 1992, pp. 213-222.
  18. M. Schröder
    "Knowledge-based processing of medical language: a language engineering approach", in: GWAI-92: Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence 671, Ohlbach, H.J. Ed., 1992, Springer-Verlag Berlin, pp. 221-234.
  19. P. Zweigenbaum [ed.]
    "Consortium Menelas, Menelas: An Access System for Medical Records using Natural Language, Computer Methods and Programs in Biomedicine", Special issue on AIM 45 1994, pp. 117-120
  20. A.-M. Rassinoux, R.H. Baud & J.-R. Scherrer
    "A multilingual analyser of medical texts", in: Conceptual Structures: Current Practices. Second International Conference on Conceptual Structures, ICCS'94, College Park, Maryland, USA, August 1994, Proceedings, Springer-Verlag, Berlin, 1994, pp. 84-96.
  21. Geert Postma
    The information contained within analytical chemical method descriptions: its quality, structure, automatic extraction and storage, Ph.D. Thesis, Nijmegen 1996
  22. Bas van Bakel
    A Linguistic Approach to Automatic Information Extraction, Ph.D. Thesis, Nijmegen 1996
  23. Geert Postma, Bas van Bakel and G. Kateman
    "Automatic extraction of Analytical Chemical Information. System description, inventory of tasks and problems, and preliminary results", in: Journal of Chemical Information and Computer Science 36 1996, pp. 770-785.
  24. Postma, G.J., B. van der Linden, J.R.M. Smits, G. Kateman
    "TICA: A System for the Extraction of Data from Analytical Chemical Text", in: Journal of Chemometrics and Intelligent Laboratory Systems 9 1990, pp. 65-74.
  25. Bas van Bakel
    Semantische Analyse van Engelse Zinnen; Een Computer-model. M.A. Thesis Computational Linguistics, Nijmegen 1988.
  26. Bath-meeting of full text processing
    Bath 1996

© 1996 Elsevier Science bv.