TrAC - Internet Column


To cite this article please refer to the printed edition of TrAC: Trends Anal. Chem. 18 (1999) xxx

CCRC-Net: an Internet-based spectral database for complex carbohydrates using artificial neural networks search engines

Faramarz Valafar and Homayoun Valafar

University of Georgia - CCRC, 220 Riverbend Road, Athens, Georgia 30602-4712 USA

faramarz@ccrc.uga.edu

Complex carbohydrates are important biomolecules that play a role in many biological functions such as providing physical strength (connective tissue in animals and woody tissue in plants) and a source of energy reserves (glycogen in animals and starch in plants). These molecules are also known to be directly and widely involved in biological recognition and regulatory processes in normal growth and development as well as in disease processes. A large number of studies have been triggered in particular to better understand the role of abnormal (structurally altered) complex carbohydrates in disease development. These investigations will provide valuable information for the development of diagnostics, drugs, and other therapeutics for diseases involving complex carbohydrate molecules.

The enormous chemical complexity and diversity of complex carbohydrates make their structural elucidation a particularly challenging task, one that scientists would not wish to duplicate unnecessarily. Since an estimated 70% of all the N-linked oligosaccharides that researchers encounter have already been discovered and structurally characterized, the potential for duplicated efforts in structural characterization of carbohydrates is high. Furthermore, a linear tetrasaccharide composed of four different hexoses, for instance, has close to 100,000 different theoretical structures due to the variety of possible ring forms and linkages between the sugars. (The number is much larger if other factors such as branched structures are taken into account.) By comparison, the number of possible different structures of a tetranucleotide or a tetrapeptide is several orders of magnitude lower than that of a tetrasaccharide. The analytical task is made more difficult when the carbohydrate to be analyzed is available in only minute quantities, which is often true of biologically active molecules. The problems are further magnified by the fact that the carbohydrates of glycoproteins have tissue-, organ-, and species-specific structures, and that their structures differ between normal and transformed (malignant) cells derived from the same tissue or organ (e.g., liver vs. hepatoma).

Carbohydrate scientists, therefore, need a means to know if the carbohydrate structure represented by an analytical spectrum they have obtained has already been identified by others and a way to access quickly what is known about the carbohydrate. It would also be extremely useful for the non-expert to be able to find out if the spectrum in question is closely related to other known structures. The Complex Carbohydrate Research Center (CCRC) has developed the CCRC-Net to solve these problems. CCRC-Net’s databases now contain nearly 500 proton nuclear magnetic resonance (1H-NMR) spectra or gas chromatography-electron impact mass spectra (GC-EIMS) of various common complex carbohydrates. Funding for the CCRC-Net project has been provided as part of the technological research projects of the NIH Resource Center for Biomedical Complex Carbohydrates.

The CCRC-Net has been designed and implemented by Drs. Faramarz Valafar and Homayoun Valafar. It is a World Wide Web-based structural identification system for complex carbohydrates located on the CCRC's home page at www.ccrc.uga.edu. The goal of the CCRC-Net system is to identify complex carbohydrates from their analytical chemical signatures (spectra), and provide chemical structural information of the complex carbohydrate(s) in question via the World Wide Web. Scientists can submit a spectrum of a carbohydrate that they have isolated from an animal or a plant and receive structural information about the carbohydrate with minimal effort or knowledge of how to interpret carbohydrate spectra. The CCRC-Net's pattern recognition system analyzes the submitted spectrum, and if the carbohydrate in question is found in its database, its chemical structure is provided. A "clean" spectrum of the carbohydrate from CCRC-Net’s own database is also presented next to the submitted spectrum for the purpose of visual verification of pattern-matching results and for possible downloading. Additional information about the carbohydrate from published scientific articles is also available via a link to CarbBank and the Complex Carbohydrate Structure Database (CCSD)(1).

CCRC-Net's search engines use state-of-the-art artificial neural network (ANN) pattern recognition technology to "clean up" and identify the submitted spectra. Due to the natural variability present in these spectra, the clean-up process is an essential component of the identification process. For instance, 1H-NMR spectra in general suffer from environmental, instrumental, and other types of variations that manifest themselves in a variety of aberrations. Low signal-to-noise ratio [1, 2, 4], baseline drifts [3, 5, 7], frequency shifts due to temperature variations, line broadening and negative peaks due to phasing problems, and malformed (or overlapped) peaks due to inaccurate shimming are among the most prominent and common aberrations. Figure 1 shows two 1H-NMR spectra of a complex carbohydrate. The spectrum labeled (B) in this figure suffers from a variety of the above-mentioned aberrations, as well as from contamination by lactate, which is frequently introduced by touching laboratory glassware with bare hands (2). For the purpose of automated identification of these spectra, elimination of these aberrations becomes essential, as they can lead to erroneous identification [1-7].
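Two of the aberrations listed above, noise and baseline drift, can be illustrated with classical techniques alone. The following is a minimal sketch, not the CCRC-Net clean-up modules: the synthetic spectrum, the function names, the window size, and the assumption that both ends of the spectrum are free of signal are all choices made here for illustration.

```python
import math
import random


def moving_average(values, window=5):
    """Smooth a spectrum with a centered moving average (a simple low-pass filter)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def remove_linear_baseline(values, edge=10):
    """Subtract a straight line drawn through the mean of each edge region.

    Assumption: the first and last `edge` points contain no real signal.
    """
    n = len(values)
    left = sum(values[:edge]) / edge
    right = sum(values[-edge:]) / edge
    x_left = (edge - 1) / 2            # center of the left edge region
    x_right = n - 1 - (edge - 1) / 2   # center of the right edge region
    slope = (right - left) / (x_right - x_left)
    return [v - (left + slope * (i - x_left)) for i, v in enumerate(values)]


def rms_error(a, b):
    """Root-mean-square difference between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def synthetic_spectrum(n=200, seed=0):
    """Return (clean, distorted): one peak, then the same peak with drift and noise."""
    rng = random.Random(seed)
    clean = [math.exp(-((i - 100) / 3.0) ** 2) for i in range(n)]
    distorted = [c + 0.002 * i + rng.gauss(0, 0.02) for i, c in enumerate(clean)]
    return clean, distorted


clean, distorted = synthetic_spectrum()
cleaned = remove_linear_baseline(moving_average(distorted))
print("RMS error before clean-up:", round(rms_error(distorted, clean), 3))
print("RMS error after clean-up: ", round(rms_error(cleaned, clean), 3))
```

Smoothing trades noise reduction for some peak broadening, which is one reason a real system would likely combine several techniques rather than rely on a single filter.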

The CCRC-Net uses a collection of classical (e.g. filtering, windowing) and ANN techniques in its clean-up modules. After the spectra have been preprocessed by these modules, they are submitted to the pattern recognition modules for identification. These modules all use various ANN techniques for their identification task. These techniques range from standard backpropagation techniques to more recent techniques developed in-house. Each database of the CCRC-Net system uses a specific combination of clean-up and pattern recognition techniques.
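To make the idea of an ANN identification module concrete, the sketch below trains a single-layer softmax network by gradient descent, which is the simplest case of backpropagation. It is an illustration under stated assumptions, not the CCRC-Net engine: the three six-point "spectra", the class names, the learning rate, and the noise level are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, x):
    """Compute class probabilities for one input spectrum."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train(references, epochs=200, lr=0.5, noise=0.05, seed=0):
    """Train on noisy copies of each reference spectrum (hypothetical data)."""
    rng = random.Random(seed)
    names = sorted(references)
    n_in = len(references[names[0]])
    weights = [[0.0] * n_in for _ in names]
    biases = [0.0] * len(names)
    for _ in range(epochs):
        for target, name in enumerate(names):
            x = [v + rng.gauss(0, noise) for v in references[name]]
            probs = forward(weights, biases, x)
            for k in range(len(names)):
                # Gradient of the cross-entropy loss with respect to score k.
                grad = probs[k] - (1.0 if k == target else 0.0)
                biases[k] -= lr * grad
                for j in range(n_in):
                    weights[k][j] -= lr * grad * x[j]
    return names, weights, biases


def classify(model, x):
    """Return the most probable class name and its probability."""
    names, weights, biases = model
    probs = forward(weights, biases, x)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], probs[best]


# Hypothetical reference "spectra": six intensity values each.
references = {
    "A": [1.0, 0.0, 0.0, 0.5, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
    "C": [0.0, 0.0, 1.0, 0.0, 0.0, 0.5],
}
model = train(references)
print(classify(model, [0.9, 0.05, 0.0, 0.55, 0.0, 0.05]))
```

Training on noisy copies of each reference is one simple way to make a classifier tolerant of the spectral variability described earlier; real spectra have far more points and far more classes than this toy case.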

Figure 1. (A) A high quality 1H-NMR spectrum of a xyloglucan oligosaccharide. (B) A poor quality 1H-NMR spectrum of the same oligosaccharide, with baseline drift, noise, negative signals, and large contaminant and standard signals.

The CCRC-Net currently contains five databases of 1H-NMR and mass spectroscopic data of families of complex carbohydrates. These data have been obtained primarily from the spectral information about plant and animal carbohydrates available from CCRC scientists. The services of the databases and search engines are available through the CCRC’s Web site. CCRC-Net is capable of hosting numerous databases of analytical spectra and their search engines as well as being linked to other information resources. We have linked CarbBank/CCSD and CCRC-Net, for example, so that a researcher using any CCRC-Net database will have easy access (via a mouse-click) to all published data about the oligosaccharide that is available in the CCSD.

The CCRC-Net system's current databases and search engines cover analytical spectra of the following molecules: (1) gas chromatography-electron impact mass spectra (GC-EIMS) of partially methylated alditol acetates (PMAAs), which are derivatives used to determine glycosyl-linkage compositions; (2) one-dimensional 1H-NMR spectra of asparagine-linked oligosaccharides (N-linked oligosaccharides); (3) 1H-NMR spectra of xyloglucan subunit oligosaccharides; (4) 1H-NMR spectra of glucuronoxylomannans from strains of Cryptococcus neoformans, a microbe that becomes deadly in immuno-compromised patients; and (5) GC-EIMS of partially methylated anhydroalditols generated by reductive cleavage (an alternative to the PMAA procedure for determining glycosyl linkages).

Using the graphical user interface of the CCRC-Net, a user can either browse CCRC-Net's libraries and download the analytical spectra available or submit a spectrum to one of the five libraries for search and identification purposes. Figure 2 shows a sample download page for a 1H-NMR spectrum of an N-linked oligosaccharide. The spectrum is shown in the top window of the screen, and the structure and composition of the carbohydrate are shown in the bottom window. A link is provided to download the spectral data to the user's local computer for later use. A link to CarbBank and the CCSD is provided at the top of the page: by clicking on the "CarbBank structure identification number", the user automatically retrieves all published information about the structure available in the CCSD, including a display of the molecule’s chemical structure.

Figure 2. An example of a download page from the CCRC-Net library of the 1H-NMR spectra of N-linked oligosaccharides.

A user can also submit the spectrum of an unknown carbohydrate to the CCRC-Net for identification. The CCRC-Net uses leading-edge artificial neural network (ANN) technology to identify the carbohydrate from the submitted spectrum. If the carbohydrate from which the spectrum originated exists in CCRC-Net's library, a "clean" spectrum of the carbohydrate from CCRC-Net's own library will be displayed next to the submitted spectrum for visual verification of the identification results. Figure 3 shows a sample results page.

Figure 3. An example of a CCRC-Net results screen.

The results page also contains information about the structure of the carbohydrate in one of a few available formats. The format depends on the type of carbohydrate in question. For instance, in the case of the library of the 1H-NMR spectra of N-linked oligosaccharides, a click of the mouse will guide the user to CarbBank/CCSD, where all structural and other information available in CCSD about the carbohydrate will be displayed. In the case of the GC-EIMS library of PMAAs, the structural information is displayed directly on the results page. Links between CarbBank/CCSD and all the CCRC-Net databases are planned.

If the carbohydrate in question is not available in the CCRC-Net libraries, the ANN search engine of the system will return a "zero hit" message. In this case, the user will be given the choice of submitting the unknown spectrum to the CCRC-Net administrator for manual identification by experts at the CCRC and addition of the spectrum to the appropriate CCRC-Net library.
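The "zero hit" behavior amounts to rejecting a match whose score falls below some confidence level. The sketch below shows that decision logic only. It deliberately swaps in cosine similarity for the ANN search engine, and the library entries, the `identify` function, and the 0.95 threshold are hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors (1.0 = same shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(spectrum, library, threshold=0.95):
    """Return (name, score) for the best match, or (None, score) for a zero hit."""
    best_name, best_score = None, 0.0
    for name, reference in library.items():
        score = cosine_similarity(spectrum, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        # Nothing in the library is similar enough: report a zero hit.
        return None, best_score
    return best_name, best_score


# Hypothetical library of reference spectra, five intensity values each.
library = {
    "compound_A": [0.0, 1.0, 0.2, 0.0, 0.5],
    "compound_B": [0.7, 0.0, 0.0, 1.0, 0.1],
}

hit = identify([0.05, 0.9, 0.25, 0.0, 0.45], library)
miss = identify([1.0, 1.0, 1.0, 1.0, 1.0], library)
print("hit: ", hit)
print("miss:", miss)
```

In a workflow like the one described above, a zero hit would be the point at which the user is offered manual identification by an expert.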

The CCRC-Net has attracted many scientists in its short period of experimental existence. CCRC-Net's first library (GC-EIMS library of 100 PMAAs) came on line in September 1997. The most recent library of CCRC-Net (1H-NMR spectra of 23 N-linked oligosaccharides) came on line in October 1998. Since its initiation in 1997, CCRC-Net has not been advertised, as it is an experimental system. However, scientists around the world have managed to find it and use it in their structural analyses of complex carbohydrates. So far, CCRC-Net has recorded 2456 log-in instances (the number of times users have logged into the system), 2920 spectra viewed and downloaded from its various libraries, and, perhaps most importantly, 3019 carbohydrates correctly identified from their spectral data. We estimate that, on average, such an identification task takes an expert at the CCRC about half a day to complete manually. Analyzing all 3019 spectra manually would therefore take 1509.5 man-days, whereas the CCRC-Net system has taken only 15,095 minutes (about 5 minutes per spectrum), or roughly 10.5 days, to do the same job. This means that approximately 1499 man-days of expert manpower that would have been devoted to routine structural analysis were saved.
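The time-savings estimate can be checked with a few lines of arithmetic. The figures below are taken from the paragraph above; the only assumption added here is that the 10.5 days are 24-hour days of machine time, which is what 15,095 minutes works out to.

```python
identified = 3019           # carbohydrates correctly identified (from the text)
manual_days_each = 0.5      # expert estimate: half a day per identification
automated_minutes = 15095   # total CCRC-Net processing time (from the text)

manual_total_days = identified * manual_days_each
minutes_per_spectrum = automated_minutes / identified
automated_days = automated_minutes / (60 * 24)  # assumes 24-hour days
saved_days = manual_total_days - automated_days

print("Manual effort (man-days):     ", manual_total_days)
print("Minutes per spectrum:         ", minutes_per_spectrum)
print("Automated time (days):        ", round(automated_days, 1))
print("Expert time saved (man-days): ", round(saved_days))
```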

We plan to continue the development of the CCRC-Net on two fronts. First, we plan to increase the size of the CCRC-Net’s databases. This step is crucial, as the completeness of these databases is vital to their usefulness and reliability. Second, as the databases grow in size, there will be a need for stronger and more robust search engines. Advancing basic research in pattern recognition and spectral identification of complex carbohydrates is, therefore, our second goal. The CCRC has committed significant resources to the development of CCRC-Net and to making this system freely available to everyone. We are counting on the cooperation of all scientists in helping us enlarge CCRC-Net's databases, as it is not within the scope of a single organization to purify and collect spectra of all complex carbohydrates.

Acknowledgments:

· Dr. Peter Albersheim, Principal Investigator of the NIH Resource Center for Biomedical Complex Carbohydrates, and Co-Director of the Complex Carbohydrate Research Center.

· Dr. Alan Darvill, Co-Director of the Complex Carbohydrate Research Center.

· Drs. William York, John Glushka, Larry Elvebak, Sandeep Kalelkar, Parastoo Azadi, and many other scientists for their contributions to the CCRC-Net project.

Notes

(1) CarbBank is the search program developed at the CCRC for the Complex Carbohydrate Structure Database (CCSD). The CCSD contains approximately 49,000 records that give published information about the complex carbohydrate structures in its database. Among other functions, CarbBank can search the CCSD for published articles about a specific carbohydrate.

(2) It is also important to realize that this spectrum by no means represents a worst case scenario, and it does not demonstrate the complexity of the instrument-independent identification task of these spectra. Spectrum (B) is merely a demonstration of some types of possible aberrations.

References

1. Van Huffel, S. 1993. Enhanced resolution based on minimum variance estimation and exponential data modeling. Signal Processing, 33, 333-355.

2. Van den Boogaart, A., F. A. Howe, L. M. Rodrigues, M. Stubbs, and J. R. Griffiths. 1995. In vivo 31P MRS: absolute concentrations, signal-to-noise and prior knowledge. NMR in Biomedicine, 8, 87-93.

3. Blumler, P., M. Greferath, B. Blumich, and H. W. Spiess. 1993. NMR imaging of objects containing similar substructures. Journal of Magnetic Resonance, Series A, 103, 142-150.

4. Angelidis, P. A. 1996. Spectrum estimation and the Fourier transform in imaging and spectroscopy. Concepts in Magnetic Resonance, 8(5), 339-381.

5. Wabuyele, B. W., and P. Harrington. 1994. Optimal associative memory for background correction of spectra. Analytical Chemistry, 66, 2047-2051.

6. Goodacre, R., E. M. Timmins, A. Jones, D. B. Kell, J. Maddock, M. Heginbothom, and J. T. Magee. 1997. On mass spectrometer instrument standardization and interlaboratory calibration transfer using neural networks. Analytica Chimica Acta, 348, 511-532.

7. Wabuyele, B. W., and P. Harrington. 1995. Quantitative comparison of bidirectional optimal associative memories for background prediction of spectra. Chemometrics and Intelligent Laboratory Systems, 29, 51-61.
