The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting algorithms to generate electronic bibliographic citation data from paper biomedical journal articles. The OCR server incorporated in MARS performs well in general, but fares less well with text printed in small or italic fonts. Affiliations are often printed in small italic fonts in the journals processed by MARS. Consequently, although the automatic processes generate much of the citation data correctly, the affiliation field frequently contains incorrect data, which must be manually corrected by verification operators. In contrast, author names are usually printed in large, normal fonts that are correctly converted to text by the OCR server.
The National Library of Medicine’s MEDLINE database contains 11 million indexed citations for biomedical journal articles. This paper documents our effort to use the historical author/affiliation relationships from this large dataset to find potential correct affiliations for MARS articles based on the author and the affiliation in the OCR output. Preliminary tests using a table of about 400,000 author/affiliation pairs extracted from the corrected data from MARS indicated that about 44% of the author/affiliation pairs were repeats and that about 47% of newly converted author names would be found in this set. A text-matching algorithm was developed to determine the likelihood that an affiliation found in the table for the OCR text of the first author is the current, correct affiliation. The algorithm compares each affiliation retrieved from the author/affiliation table with the OCR output affiliation and calculates a score indicating their similarity. Using a ground truth set of 519 OCR author/OCR affiliation/correct affiliation triples, the matching algorithm selects a correct affiliation for the author 43% of the time, with a false positive rate of 6%, a true negative rate of 44% and a false negative rate of 7%.
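The lookup-and-score step described above can be illustrated with a minimal sketch. It is not the MARS implementation: the table contents, the `select_affiliation` name, the 0.6 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, since the abstract does not specify how the score is computed.

```python
from difflib import SequenceMatcher

# Hypothetical author/affiliation table. The real table holds pairs
# extracted from corrected MARS data; these entries are invented.
AUTHOR_TABLE = {
    "smith ja": [
        "Department of Biology, Example University, Springfield, USA.",
        "Institute of Imaging, Sample College, Rivertown, USA.",
    ],
}


def similarity(candidate, ocr_text):
    """Return a score in [0, 1]; 1.0 means identical ignoring case."""
    return SequenceMatcher(None, candidate.lower(), ocr_text.lower()).ratio()


def select_affiliation(ocr_author, ocr_affiliation, table=AUTHOR_TABLE,
                       threshold=0.6):
    """Return (affiliation, score) for the best-scoring table entry.

    Returns (None, best_score) when the author is unknown or no candidate
    reaches the (assumed) acceptance threshold.
    """
    candidates = table.get(ocr_author.strip().lower(), [])
    best, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(candidate, ocr_affiliation)
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```

Raising the threshold in a sketch like this would trade false positives for false negatives, which is the balance the reported rates describe.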
MEDLINE citations with United States affiliations typically include the zip code. In addition to using author names as clues to correct affiliations, we are investigating the value of the OCR text of zip codes as clues to correct USA affiliations. Current work includes generation of an author/affiliation/zipcode table from the entire MEDLINE database and development of a daemon module to implement affiliation selection and matching for the MARS system using both author names and zip codes. Preliminary results from the initial version of the daemon module and the partially filled author/affiliation/zipcode table are encouraging.
Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary, were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the advantage of producing scores that have a useful and practical interpretation.
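As an illustration of two of the methods named above, the following sketch implements the generic edit distance and a variant that accepts a substitution cost function. The confusion costs in `CONFUSIONS` are invented placeholders, not values from the study, and the function names are hypothetical; the cross correlation and Bayesian methods are not shown.

```python
def edit_distance(source, target, sub_cost=None):
    """Edit distance between two strings using dynamic programming.

    Insertions and deletions cost 1. `sub_cost(a, b)` gives the cost of
    replacing character a with b; by default it is 0 for equal characters
    and 1 otherwise, which yields the generic (Levenshtein) distance.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        curr = [float(i)]
        for j, b in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete a
                curr[j - 1] + 1.0,             # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b, or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical costs for (OCR character, intended character) pairs;
# likely OCR confusions are made cheaper than arbitrary substitutions.
CONFUSIONS = {("1", "l"): 0.2, ("0", "o"): 0.2, ("5", "s"): 0.3}


def ocr_sub_cost(a, b):
    """Substitution cost that discounts the assumed OCR confusions."""
    if a == b:
        return 0.0
    return CONFUSIONS.get((a, b), 1.0)


def best_match(word, dictionary, sub_cost=None):
    """Return the dictionary entry with the smallest distance to word."""
    return min(dictionary,
               key=lambda entry: edit_distance(word, entry, sub_cost))
```

In a probabilistic version, the costs would presumably be derived from observed confusion frequencies (for example, as negative log probabilities) rather than set by hand.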
A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. This 5-engine OCR system, while exhibiting high performance overall, does not reliably convert very small characters, especially those that are in italics. As a result, the 'affiliations' field, which typically contains such characters in most journals, is not captured accurately and requires a disproportionately high amount of manual input. To correct this problem, dictionaries have been created from words occurring in this field (e.g., university, department, street addresses, names of cities) from 230,000 articles already processed. The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification. This paper outlines the techniques employed and the results of a comparative evaluation.
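The dictionary lookup with ranked candidates can be sketched as follows. This is only an illustration: the word list is invented, `ranked_candidates` is a hypothetical name, and Python's `difflib.get_close_matches` (which ranks by a sequence-similarity ratio) stands in for whichever approximate string-matching techniques the paper actually evaluates.

```python
import difflib
import re

# Hypothetical dictionary of words seen in affiliation fields; the real
# dictionaries were built from 230,000 processed articles.
AFFILIATION_WORDS = [
    "department", "university", "medicine", "bethesda",
    "maryland", "pediatrics", "hospital",
]


def ranked_candidates(ocr_affiliation, dictionary=AFFILIATION_WORDS,
                      n=3, cutoff=0.6):
    """Map each unrecognized OCR token to its ranked dictionary matches.

    Tokens already in the dictionary are skipped. For the rest, up to n
    candidates scoring at least `cutoff` are returned, best match first,
    so that an operator could verify or reject them.
    """
    suggestions = {}
    for token in re.findall(r"[A-Za-z0-9]+", ocr_affiliation.lower()):
        if token in dictionary:
            continue
        matches = difflib.get_close_matches(token, dictionary,
                                            n=n, cutoff=cutoff)
        if matches:
            suggestions[token] = matches
    return suggestions
```

Calling `ranked_candidates("Departrnent of Medicine, Univcrsity Hospita1")` would propose corrections for the three damaged tokens and leave "medicine" alone because it is already a dictionary word.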
The optical character recognition (OCR) system selected by the National Library of Medicine (NLM) as part of its system for automating the production of MEDLINE® records frequently segments the scanned page images into zones which are inappropriate for NLM's application. Software has been created in-house to correct the zones using character coordinate and character attribute information provided as part of the OCR output data. The software correctly delineates over 97% of the zones of interest tested to date.
The Lister Hill National Center for Biomedical Communications, an R&D division of the National Library of Medicine, has developed a PC-based system for semi-automated entry of journal citation data into MEDLINE™. The system, called MARS for Medical Article Records System, includes many automated features but requires a few manual tasks such as scanning and the entry of certain data that are not located on the scanned page. Now that considerable computing power and speed are routinely available on desktop PCs, we think it may be possible to include speech recognition as an optional user interface to reduce operator burden and to improve speed and quality for document scanning and data entry. We undertook a study to determine if speech recognition was sufficiently accurate, reliable and immune to noise to warrant integration with MARS workstations.
Redundant Array of Inexpensive Disks (RAID) vendors rely on multi-megabyte files and
large numbers of physical disks to achieve the high transfer rates and Input/Output
Operations Per Second (IOPS) quoted in the promotional literature. Practical image
database applications do not always deliver such large files and cannot always afford the
cost of the large numbers of disks required to match the vendors' performance claims.
Because the user is often waiting on-line to view the images, applications deployed on the
World Wide Web (WWW) are especially sensitive to keeping inline images relatively
small. For such applications, the expected performance advantages of RAID storage
may not be achieved.
The Lister Hill National Center for Biomedical Communications houses three image
datasets on a SPARCstorage Array RAID system. Applications deliver these images to
users via the Internet using the WWW and other client/server programs. Although
approximately 3% of the images exceed 1 MB in size, the average file size is less than
200 KB and approximately 60% of the files are less than 100 KB in size. A study was
undertaken to determine the configuration of the RAID system that will provide the
fastest retrieval of these image files and to discover general principles of RAID
performance. Average retrieval times with single processes and with concurrent processes
are measured and compared for several configurations of RAID levels 5 and 0+1. A few
trends have emerged showing a tradeoff between optimally configuring the RAID for a
single process or for concurrent processes.
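A measurement of this kind can be sketched as below. The sketch is an assumption-laden illustration, not the study's test harness: the function names are hypothetical, threads stand in for the concurrent processes used in the study, and nothing is done to defeat operating system file caching, which a real benchmark would need to control.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_read(path, chunk_size=64 * 1024):
    """Read a whole file and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while handle.read(chunk_size):
            pass
    return time.perf_counter() - start


def average_retrieval_time(paths, workers=1):
    """Average per-file retrieval time with `workers` concurrent readers.

    workers=1 approximates the single-process case; larger values
    approximate concurrent access to the same storage.
    """
    if not paths:
        raise ValueError("paths must not be empty")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        times = list(pool.map(timed_read, paths))
    return sum(times) / len(times)
```

Running `average_retrieval_time` over the same file list for each RAID configuration, once with `workers=1` and once with a larger value, would give the two averages whose tradeoff the abstract describes.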
The Lister Hill National Center for Biomedical Communications, a research and development division of the National Library of Medicine, is evaluating an optical disk jukebox as a digital image store to support prototype systems for image distribution over the Internet. This paper summarizes a study undertaken to determine the performance characteristics of the jukebox to support multiple image databases simultaneously accessed by multiple users. A motivation for this investigation is the need to provide users access to digitized images of medical documents and radiographs.
The Lister Hill National Center for Biomedical Communications is a Research and Development Division of the National Library of Medicine. One of the Center's current research projects involves the conversion of entire journals to bitmapped binary page images. In an effort to reduce operator errors that sometimes occur during document capture, three back error propagation networks were designed to automatically identify the journal title based on features in the binary image of the journal's front cover page. For all three network designs, twenty-five journal titles were randomly selected from the stored database of image files. Seven cover page images from each title were selected as the training set. For each title, three other cover page images were selected as the test set. Each bitmapped image was initially processed by counting the total number of black pixels in 32-pixel wide rows and columns of the page image. For the first network, these counts were scaled to create 122-element count vectors as the input vectors to a back error propagation network. The network had one output node for each journal classification. Although the network was successful in correctly classifying the 25 journals, the large input vector resulted in a large network and, consequently, a long training period. In an alternative approach, the first thirty-five coefficients of the Fast Fourier Transform of the count vector were used as the input vector to a second network. A third approach was to train a separate network for each journal using the original count vectors as input and with only one output node. The output of the network could be 'yes' (it is this journal) or 'no' (it is not this journal). This final design promises to be most efficient for a system in which journal titles are added or removed, as it does not require retraining a large network for each change.
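The feature extraction described above can be sketched as follows, under several assumptions that the abstract does not settle: the image is represented as a list of rows of 0/1 values with 1 meaning black, scaling divides by the largest count, and the transform features are taken as coefficient magnitudes. A direct discrete Fourier transform is used for brevity; it yields the same coefficients an FFT would. All function names are hypothetical, and the networks themselves are not shown.

```python
import cmath


def band_counts(image, band=32):
    """Count black pixels in bands of `band` rows, then bands of columns.

    `image` is a list of equal-length rows of 0/1 ints (1 = black).
    Returns the row-band counts followed by the column-band counts.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    row_totals = [sum(row) for row in image]
    col_totals = [sum(image[y][x] for y in range(height))
                  for x in range(width)]
    row_bands = [sum(row_totals[i:i + band]) for i in range(0, height, band)]
    col_bands = [sum(col_totals[i:i + band]) for i in range(0, width, band)]
    return row_bands + col_bands


def scale(vector):
    """Scale counts into [0, 1] by the largest count (assumed scheme)."""
    peak = max(vector, default=0)
    if not peak:
        return [0.0 for _ in vector]
    return [v / peak for v in vector]


def dft_magnitudes(vector, keep):
    """Magnitudes of the first `keep` discrete Fourier coefficients."""
    n = len(vector)
    magnitudes = []
    for k in range(min(keep, n)):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(vector))
        magnitudes.append(abs(coeff))
    return magnitudes
```

With a full page image, `scale(band_counts(page))` would play the role of the count vector fed to the first network, and `dft_magnitudes(count_vector, 35)` the shorter input used for the second.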
A pilot project of the Center involves automatic document delivery in response to computerized interlibrary loan requests. Each document request includes an unstructured comment field that patrons occasionally use to indicate whether or not they want the National Library of Medicine to fill that request. These comments vary widely in content, but were found to always contain the text 'NLM.' This paper describes a technique to automatically reduce the amount of operator intervention needed to resolve ambiguities in the intent of the patron as to whether the request should be filled or not.