Ophthalmologists use the cup-to-disc ratio as one factor in diagnosing glaucoma. The optic disc in a fundus image is the area where blood vessels and optic nerve fibers enter the retina. A cup-to-disc ratio (the diameter of the cup divided by the diameter of the optic disc) greater than 0.3 is considered suggestive of glaucoma. We are therefore developing automatic methods to estimate the optic disc and cup areas and the cup-to-disc ratio. Estimating the ratio takes four steps: detection of a region of interest (ROI) centered on the optic disc in the fundus image, optic disc segmentation from the ROI, cup segmentation from the optic disc area, and cup-to-disc ratio estimation. This paper proposes an automated method to segment the optic disc from the ROI using deep learning. A Fully Convolutional Network (FCN) with a U-Net architecture is used for the segmentation. The experiments use fundus images from MESSIDOR, a public dataset containing 1,200 fundus images. We divide the dataset into five equal subsets for training and independent testing (in each run, four subsets are used for training and one for testing). The proposed method outperforms other existing algorithms. The results show a Jaccard index of 0.94, sensitivity of 0.98, specificity of 0.99, and accuracy of 0.99.
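The four segmentation scores reported above can all be derived from pixel-wise counts of true and false positives and negatives. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `segmentation_metrics` and the representation of masks as nested lists of 0/1 values are assumptions made for illustration.

```python
from typing import Dict, Sequence

Mask = Sequence[Sequence[int]]  # assumed format: rows of 0/1 pixel labels


def segmentation_metrics(pred: Mask, truth: Mask) -> Dict[str, float]:
    """Compare a predicted binary mask with a ground-truth mask.

    A value of 1 marks an optic disc pixel and 0 marks background.
    """
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # disc pixel correctly detected
            elif p:
                fp += 1      # background labeled as disc
            elif t:
                fn += 1      # disc pixel missed
            else:
                tn += 1      # background correctly rejected

    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("masks contain no pixels")

    return {
        # Jaccard index: intersection over union of the two disc regions.
        "jaccard": tp / (tp + fp + fn) if (tp + fp + fn) else 1.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 1.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 1.0,
        "accuracy": (tp + tn) / total,
    }


# Toy 2x3 masks, invented for this example only.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
scores = segmentation_metrics(predicted, reference)
```

Because the optic disc usually covers a small part of the ROI, specificity and accuracy are dominated by background pixels, which is one reason the Jaccard index is commonly reported alongside them.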
This paper describes an automated system to label zones containing Investigator Names (IN) in biomedical articles, a key
item in a MEDLINE® citation. The correct identification of these zones is necessary for the subsequent extraction of IN
from these zones. A hierarchical classification model is proposed using two Support Vector Machine (SVM) classifiers.
The first classifier is used to identify an IN zone with highest confidence, and the other classifier identifies the remaining
IN zones. Eight sets of word lists are collected to train and test the classifiers, each set containing collections of words
ranging from 100 to 1,200. Experiments based on a test set of 105 journal articles show 0.88 Precision, 0.97 Recall,
0.92 F-Measure, and 0.99 Accuracy.
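As a quick consistency check on the figures above, the F-Measure can be recomputed from Precision and Recall. The sketch below assumes the abstract refers to the balanced F-measure (F1, the harmonic mean of the two); the function name `f_measure` is chosen for this example.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported in the abstract above.
reported_f = round(f_measure(0.88, 0.97), 2)
print(reported_f)  # 0.92
```

Under that assumption, the recomputed value agrees with the reported 0.92.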
This paper describes two classifiers, Naïve Bayes and Support Vector Machine (SVM), to classify sentences containing
Databank Accession Numbers, a key piece of bibliographic information, from online biomedical articles. The correct
identification of these sentences is necessary for the subsequent extraction of these numbers. The classifiers use words
that occur most frequently in sentences as features for the classification. Twelve sets of word features are collected to train
and test the classifiers. Each set has a different number of word features ranging from 100 to 1,200. The performance of
each classifier is evaluated using four measures: Precision, Recall, F-Measure, and Accuracy. The Naïve Bayes classifier
shows performance above 93.91% at 200 word features for all four measures. The SVM shows 98.80% Precision at 200
word features, 94.90% Recall at 500 and 700, 96.46% F-Measure at 200, and 99.14% Accuracy at 200 and 400. To
improve classification performance, we propose two merging operators, Max and Harmonic Mean, to combine results of
the two classifiers. The final results show a measurable improvement in Recall, F-Measure, and Accuracy rates.
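The abstract does not say how the Max and Harmonic Mean operators are applied, so the following is only one plausible reading: each classifier is assumed to output a positive-class score in [0, 1] for a sentence, the two scores are merged, and the merged score is compared with a threshold. The names `merge_max`, `merge_harmonic_mean`, and `classify`, and the 0.5 threshold, are hypothetical.

```python
from typing import Callable

Merge = Callable[[float, float], float]


def merge_max(p_nb: float, p_svm: float) -> float:
    """Keep the more confident of the two classifier scores."""
    return max(p_nb, p_svm)


def merge_harmonic_mean(p_nb: float, p_svm: float) -> float:
    """Harmonic mean of the two scores; pulled toward the lower one."""
    total = p_nb + p_svm
    return 2 * p_nb * p_svm / total if total > 0 else 0.0


def classify(p_nb: float, p_svm: float, merge: Merge,
             threshold: float = 0.5) -> bool:
    """Label a sentence positive if the merged score reaches the threshold."""
    return merge(p_nb, p_svm) >= threshold


# Invented scores: Naive Bayes is unsure (0.3), the SVM is confident (0.8).
by_max = classify(0.3, 0.8, merge_max)
by_harmonic = classify(0.3, 0.8, merge_harmonic_mean)
print(by_max, by_harmonic)  # True False
```

In this reading, Max accepts a sentence when either classifier is confident, which tends to raise Recall, while the Harmonic Mean requires both scores to be reasonably high.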
The Medical Article Records System (MARS) has been developed at the U.S. National Library of Medicine (NLM) for automated data entry of bibliographical information from medical journals into MEDLINE, the premier bibliographic citation database at NLM. Currently, a rule-based algorithm (called ZoneCzar) is used for labeling important bibliographical fields (title, author, affiliation, and abstract) on medical journal article page images. While rules have been created for medical journals with regular layout types, new rules have to be created manually for any input journal with an arbitrary or new layout type. It is therefore of interest to label journal articles independently of their layout styles. In this paper, we first describe a system (called ZoneMatch) for automated generation of crucial geometric and non-geometric features of important bibliographical fields based on string-matching and clustering techniques. The rule-based algorithm is then modified to use these features to perform style-independent labeling. We then describe a performance evaluation method for quantitatively evaluating our algorithm and characterizing its error distributions. Experimental results show that the labeling performance of the rule-based algorithm is significantly improved when the generated features are used.
A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, abstract, affiliation and others) from online biomedical journals to populate the National Library of Medicine’s MEDLINE database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented zones in an article’s HTML pages as specific bibliographic data. Results from experiments conducted with 1,149 medical articles from forty-seven journal issues are presented.
The National Library of Medicine (NLM) is developing an automated system to produce bibliographic records for its MEDLINE® database. This system, named Medical Article Record System (MARS), employs document image analysis and understanding techniques and optical character recognition (OCR). This paper describes a key module in MARS called the Automated Labeling (AL) module, which labels all zones of interest (title, author, affiliation, and abstract) automatically. The AL algorithm is based on 120 rules that are derived from an analysis of journal page layouts and features extracted from OCR output. Experiments carried out on more than 11,000 articles in over 1,000 biomedical journals show the accuracy of this rule-based algorithm to exceed 96%.