KEYWORDS: Principal component analysis, Algorithm development, Data mining, Databases, Machine learning, Health informatics, Medicine, Social sciences, Information technology, Statistical methods
Data characterization techniques have been developed to guide learning algorithm selection by using statistical
measurements of a dataset. To expand the framework of meta-learning, it is important to consider the results of
other learning algorithms. We therefore consider a method that reuses objective rule evaluation indices of
classification rules. Objective rule evaluation indices such as support, precision and recall are calculated from
a rule set and a validation dataset. This data-driven approach is often used to filter out rules that are not useful from
the rule set obtained by a rule learning algorithm. At the same time, these indices can detect differences between
two validation datasets when applied with the same rule set, because the definitions of the indices do not depend
on any particular rule or dataset. In this paper, we present a method to characterize given datasets based
on objective rule evaluation indices by using differences in the correlation coefficients between each pair of indices. By
comparing the differences, we describe the resulting groups of similar and dissimilar datasets.
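The idea of characterizing a dataset through rule evaluation indices can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the authors' implementation: a rule is modeled as a hypothetical (condition, target_class) pair, support is taken as the fraction of records matching both condition and class, and the dataset difference is a simple sum of absolute differences between correlation coefficients.

```python
from math import sqrt

INDEX_NAMES = ("support", "precision", "recall")


def rule_indices(rule, dataset):
    """Return (support, precision, recall) of one rule on a validation dataset.

    rule    -- (condition, target_class); condition is a predicate on a record
    dataset -- list of (record, class_label) pairs
    """
    condition, target = rule
    n = len(dataset)
    covered = sum(1 for rec, _ in dataset if condition(rec))
    positives = sum(1 for _, label in dataset if label == target)
    hits = sum(1 for rec, label in dataset if condition(rec) and label == target)
    support = hits / n if n else 0.0
    precision = hits / covered if covered else 0.0
    recall = hits / positives if positives else 0.0
    return support, precision, recall


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def index_correlations(rules, dataset):
    """Correlation between each pair of indices over a non-empty rule set."""
    columns = list(zip(*(rule_indices(rule, dataset) for rule in rules)))
    return {
        (INDEX_NAMES[i], INDEX_NAMES[j]): pearson(columns[i], columns[j])
        for i in range(len(INDEX_NAMES))
        for j in range(i + 1, len(INDEX_NAMES))
    }


def dataset_difference(rules, dataset_a, dataset_b):
    """Assumed distance: sum of absolute differences of the correlations."""
    corr_a = index_correlations(rules, dataset_a)
    corr_b = index_correlations(rules, dataset_b)
    return sum(abs(corr_a[key] - corr_b[key]) for key in corr_a)
```

Pairwise values of `dataset_difference` computed with one shared rule set could then be fed to any grouping method to obtain similar and dissimilar groups of datasets.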
About twenty years have passed since clinical information began to be stored electronically in hospital information
systems in the 1980s. Stored data range from accounting information to laboratory data, and even patient
records have now started to be accumulated: in other words, a hospital cannot function without the information
system, where almost all the pieces of medical information are stored as multimedia databases. In this paper,
we applied temporal data mining and exploratory data analysis techniques to hospital management data. The
analysis yielded several interesting results, which suggest that the reuse of stored data will provide a powerful tool for
hospital management.
This paper proposes an application of data mining to medical risk management, where data mining techniques
were applied to the detection, analysis and evaluation of risks potentially existing in clinical environments. We
applied this technique to the following two medical domains: risk aversion of nurse incidents and infection
control. The results show that data mining methods were effective for the detection and aversion of risk factors.
KEYWORDS: Liver, Data mining, Bessel functions, Health informatics, Medicine, Time series analysis, Information assurance, Convolution, Multiscale representation, Information technology
This paper proposes a new approach to temporal trajectory analysis for clinical laboratory examinations. When
we select m laboratory examinations, their temporal evolution for one patient can be viewed as a trajectory
in m-dimensional space. A multiscale comparison technique can be applied for segmentation and calculation of
structural similarities of such trajectories. Then, clustering can be applied to the calculated similarities for
classification of these trajectories. The proposed method was evaluated on hepatitis datasets, and the results show
that the clustering captured several interesting patterns for severe chronic hepatitis.
This paper reports the results of temporal analysis of platelet (PLT) data in a chronic hepatitis dataset. First,
we briefly introduce a cluster analysis system for temporal data that we have developed. Second, we show the
results of cluster analysis of PLT sequences. Third, we show the results of PLT value-based temporal analysis
aiming at finding years for reaching F4, years elapsed between stages, and their relationships with virus types
and fibrotic stages. The results of cluster analysis indicate that the temporal courses of PLT can be grouped
into several patterns each of which presents similarity in average PLT level and increase/decrease trends. The
results of value-based analysis suggest that liver fibrosis may proceed faster in the exacerbating cases.
This paper presents an approach to analyzing hospital management data by using statistical data mining. For the analysis, distribution analysis, correlation and univariate regression analysis, and generalized linear models were applied. The analysis yielded several interesting results, which suggest that the reuse of stored data will provide a powerful tool to support the long-term management of a university hospital.
KEYWORDS: Visualization, Data mining, Databases, Liver, Chemical vapor deposition, Health informatics, Medicine, Information assurance, Data processing, Lab on a chip
This paper proposes a visualization approach to show the similarity relations between rules based on multidimensional scaling (MDS), which assigns a two-dimensional Cartesian coordinate to each data point from the information about similarities between that data point and the others. First, semantic and syntactic similarities of rules are obtained after rules are induced from a dataset. Then, MDS is applied to each similarity. MDS visualizes the difference between semantic and syntactic similarities. This method was evaluated on two medical data sets, and the experimental results show that knowledge useful for domain experts could be found.
KEYWORDS: Linear algebra, Health informatics, Medicine, Data mining, Matrices, Statistical analysis, Feature selection, Surgery, Computer intrusion detection, Information assurance
This paper gives a relation between the degree of granularity and that of dependence of contingency tables. From the results on determinantal divisors, it seems that the divisors provide information on the degree of dependence between the matrix of the whole elements and its submatrices, and that an increase in the degree of granularity may lead to an increase in dependence. However, this paper shows that the constraint on the sample size of a contingency table is very strong, which leads to an evaluation formula in which an increase in the degree of granularity yields a decrease in dependency.
KEYWORDS: Liver, Data mining, Convolution, Medicine, Pattern recognition, Neptunium, Health informatics, Data analysis, Information assurance, Computer intrusion detection
This paper presents a novel method for clustering time-series medical data based on improved multiscale matching. Multiscale matching, developed originally as a pattern recognition technique, can compare two shapes by partly changing observation scales. We have made some improvements to conventional multiscale matching in order to enable the cross-scale, granularity-based comparison of long-term time-series sequences. The key idea is the development of a new segment representation that avoids the problem of shrinkage. We derived shape parameters of a segment at a high scale directly from the base segments at the lowest scale, instead of using shapes represented by multiscale description. We examined the usefulness of the method on the cylinder-bell-funnel dataset and a chronic hepatitis dataset. The results demonstrated that the dissimilarity matrix produced by the proposed method, combined with conventional clustering techniques, leads to successful clustering for both synthetic and real-world data.
KEYWORDS: Probability theory, Data mining, Picosecond phenomena, Statistical analysis, Health informatics, Medicine, Information assurance, Surgery, Computer intrusion detection, Network security
A contingency table summarizes the conditional frequencies of two attributes and shows how these two attributes are dependent on each other with the information on a partition of universe generated by these attributes. Thus, this table can be viewed as a relation between two attributes with respect to information granularity.
This paper focuses on several characteristics of linear and statistical independence in a contingency table from the viewpoint of granular computing, and shows that statistical independence in a contingency table is a special form of linear dependence. The discussion also shows that when a contingency table whose attributes are statistically independent is viewed as a matrix, called a contingency matrix, its rank is equal to 1. Thus rank, as a measure of the degree of independence, plays a very important role in extracting a probabilistic model from a given contingency table.
Furthermore, it is found that in some cases, partial rows or columns will satisfy the condition of statistical independence, which can be viewed as the process of solving Diophantine equations.
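The link between statistical independence and matrix rank can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, the table is assumed to be a non-empty list of rows of non-negative integer counts, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def matrix_rank(table):
    """Rank of a contingency matrix via exact Gaussian elimination."""
    rows = [[Fraction(value) for value in row] for row in table]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next(
            (r for r in range(rank, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def is_statistically_independent(table):
    """True when every cell satisfies x_ij * N == (row sum i) * (column sum j)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return all(
        table[i][j] * total == row_sums[i] * col_sums[j]
        for i in range(len(table))
        for j in range(len(col_sums))
    )
```

For example, the table `[[2, 4], [3, 6]]` is statistically independent and has rank 1, while `[[1, 2], [3, 4]]` is not independent and has rank 2.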
A contingency table summarizes the conditional frequencies of two attributes and shows how these two attributes are dependent on each other. Thus, this table is a fundamental tool for pattern discovery with conditional probabilities, such as rule discovery. In this paper, a contingency table is interpreted from the viewpoint of statistical independence and granular computing. The first important observation is that a contingency table compares two attributes with respect to the number of equivalence classes. For example, an n x n table compares two attributes with the same granularity, while an m x n (m ≥ n) table compares two attributes with different granularities. The second important observation is that matrix algebra is a key to the analysis of this table. In particular, rank plays a very important role in evaluating the degree of statistical independence. Relations between rank and the degree of dependence are also investigated.
This paper presents a comparative study of the characteristics of clustering methods for inhomogeneous time-series medical datasets. Using various combinations of comparison methods and grouping methods, we performed clustering experiments on the hepatitis dataset and evaluated the validity of the results. The results suggested that (1) the complete-linkage (CL) criterion in agglomerative hierarchical clustering (AHC) outperformed the average-linkage (AL) criterion in terms of the interpretability of the dendrogram and clustering results, (2) the combination of dynamic time warping (DTW) and CL-AHC consistently produced interpretable results, (3) the combination of DTW and rough clustering (RC) could be used to find the core sequences of the clusters, and (4) multiscale matching may suffer from the treatment of 'no-match' pairs; however, the problem may be avoided by using RC as a subsequent grouping method.
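The combination of DTW and complete-linkage AHC can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the experimental setup of the study: sequences are plain lists of numbers of possibly different lengths, the local cost is the absolute difference, and merging stops at a caller-chosen number of clusters instead of producing a full dendrogram.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def complete_linkage_clusters(sequences, n_clusters):
    """Agglomerative clustering; cluster distance is the largest pairwise DTW."""
    count = len(sequences)
    dist = {
        (i, j): dtw_distance(sequences[i], sequences[j])
        for i in range(count)
        for j in range(i + 1, count)
    }

    def linkage(c1, c2):
        return max(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)

    clusters = [[i] for i in range(count)]
    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest complete-linkage distance.
        p, q = min(
            (
                (p, q)
                for p in range(len(clusters))
                for q in range(p + 1, len(clusters))
            ),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```

Because DTW aligns sequences of unequal length, no resampling is needed before clustering, which is one reason it suits irregularly sampled series.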
KEYWORDS: Databases, Lab on a chip, Spine, Detection and tracking algorithms, Chemical vapor deposition, Health informatics, Medicine, Data mining, Diagnostics, Statistical modeling
This paper presents a new approach to extract hierarchical decision rules, which consists of the following three procedures. First, the characterization set of each given target concept is extracted from databases and the concept hierarchy for given classes is calculated. Second, based on the hierarchy, rules for each hierarchical level are induced from data. Third, for each given class, rules for all the hierarchical levels are integrated into one rule. The proposed method was evaluated on a medical database, and the experimental results show that the induced rules correctly represent experts' decision processes.
KEYWORDS: Data mining, Data acquisition, Quantitative analysis, Head, Medicine, Convolution, Phase shifts, Health informatics, Data processing, Electroencephalography
This paper reports characteristics of dissimilarity measures used in multiscale matching. Multiscale matching is a method for comparing two planar curves by partially changing observation scales. Throughout all scales, it finds the best set of pairs of partial contours that contains no mismatched or over-matched contours and that minimizes the accumulated differences between the partial contours. In order to make this method applicable to the comparison of temporal sequences, we have proposed a dissimilarity measure that compares subsequences according to the following aspects: rotation angle, length, phase and gradient. However, it became empirically apparent that it was difficult to tell from the results which aspects actually contributed to the resultant dissimilarity of the sequences. In order to investigate the fundamental characteristics of the dissimilarity measure, we performed a quantitative analysis of the induced dissimilarities using a simple sine wave and its variants. The results showed that differences in amplitude, phase and trends were respectively captured by the terms on rotation angle, phase and gradient, although they also showed a weakness in linearity.
Rule induction methods have been in use since the 1980s, and
many applications show that they are very useful for acquiring simple
patterns from large databases. However, when a database is
very large, the methods generate too many rules, which makes
it difficult for domain experts to interpret all the rules. Moreover, since rules only
show the relations between attribute-value pairs, it is very
difficult to capture the relations between concepts or among induced
rules. In order to solve this problem, many kinds of visualization
have been introduced. Rough set theory has a technique for
conflict analysis with qualitative distance obtained from attributes,
which gives graphical relations between classes or rules. On the
other hand, statistical methods have a graphical modeling method,
which gives graphical relations between attributes by using
partial coefficients or other indices. In this paper, we introduce
a new approach which combines conflict analysis and graphical
modeling. The results show that the combination of these two methods
gives another type of visualization of rules, and also gives a
formal mathematical model for rule visualization.
Rough set based rule induction methods have been applied to knowledge discovery in databases, and the empirical results obtained show that they are very powerful and that some important knowledge has been extracted from datasets. However, quantitative evaluation of lower and upper approximations is based not on statistical evidence but on rather naive indices, such as conditional probabilities and functions of conditional probabilities. In this paper, we introduce a new approach to the quantitative evaluation of the lower and upper approximations of the original and variable precision rough set models, which can be viewed as a statistical test for rough set methods. For this extension, the chi-square distribution, the F-test and the likelihood ratio test play an important role in statistical evaluation. The chi-square test statistic measures statistical information about an information table, and the F-test statistic and the likelihood ratio statistic are used to measure the difference between two tables.
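As a hedged illustration of the chi-square statistic mentioned above, the sketch below computes Pearson's chi-square for a table of counts. It assumes, for illustration only, that a rule is summarized as a contingency table such as (condition satisfied or not) by (class or not); how the paper derives its tables from lower and upper approximations is not shown here, and the function name is hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under statistical independence of rows and columns.
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, degrees_of_freedom
```

A statistic near zero indicates that the condition carries little statistical information about the class; the value would normally be compared against a chi-square distribution with the returned degrees of freedom.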
KEYWORDS: Statistical analysis, Databases, Knowledge discovery, Data mining, Data modeling, Medicine, Distance measurement, Algorithm development, Information science, Binary data
Rule induction methods have been applied to knowledge discovery in databases and data mining. The empirical results obtained show that they are very powerful and that important knowledge has been extracted from datasets. However, comparison and evaluation of rules are based not on statistical evidence but on rather naive indices, such as conditional probabilities and functions of conditional probabilities. In this paper, we introduce two approaches to the statistical comparison of induced rules. For the statistical evaluation, the likelihood ratio test and Fisher's exact test play an important role: the likelihood ratio statistic measures statistical information about an information table and is used to measure the difference between two tables.
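The two tests named above can be sketched for a single table of counts. This is a minimal sketch with hypothetical function names, assuming a rule is summarized as a 2 x 2 table [[a, b], [c, d]]; the way the paper builds and compares tables for pairs of rules is not reproduced here. The two-sided Fisher p-value sums the probabilities of all tables with the same margins that are no more likely than the observed one.

```python
from math import comb, log


def likelihood_ratio_statistic(table):
    """G statistic: 2 * sum(observed * ln(observed / expected))."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # cells with zero count contribute nothing
                expected = row_sums[i] * col_sums[j] / total
                g += observed * log(observed / expected)
    return 2.0 * g


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= p_observed * (1 + 1e-9)
    )
```

Fisher's exact test is usually preferred when cell counts are small, where the chi-square and likelihood ratio approximations become unreliable.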
One of the key concepts in data mining is to give a suitable partition of datasets in an automatic way. On one hand, classification methods find the partitions given by combinations of attribute-value pairs which best fit the partition given by target concepts. On the other hand, clustering methods find the partitions which best characterize given datasets by using a similarity measure. Therefore, the choice of distance or similarity measures is one of the most important research topics in data mining. However, empirical comparisons of such measures have never been studied in the literature. In this paper, several types of similarity measures were compared in the following three clinical contexts: the first is for datasets composed only of categorical attributes, the second is for those with a mixture of categorical and numerical attributes, and the third is for those with only numerical attributes. Experimental results show that simple similarity measures perform as well as newly proposed measures.
KEYWORDS: Databases, Data mining, Bacteria, Data analysis, Data processing, Knowledge discovery, Medical research, Data acquisition, Medicine, Data storage
Since the early 1980s, rapidly growing hospital information systems have stored large amounts of laboratory examination data in databases. Thus, it is highly expected that knowledge discovery and data mining (KDD) methods will find interesting patterns in these databases through the reuse of stored data and will be important for medical research and practice, because human beings cannot deal with such a huge amount of data. However, there are still few empirical approaches which discuss the whole data mining process from the viewpoint of medical data. In this paper, a KDD process for a hospital information system is presented by using two medical datasets. This empirical study shows that preprocessing and data projection are the most time-consuming processes, which very few data mining studies have discussed so far, and that the application of rule induction methods is much easier than preprocessing.
Rough set based rule induction methods have been applied to knowledge discovery in databases. The empirical results obtained show that they are very powerful and that some important knowledge has been extracted from datasets. However, quantitative evaluation of induced rules is based not on statistical evidence but on rather naive indices, such as conditional probabilities and functions of conditional probabilities. In this paper, we introduce a new approach to the quantitative evaluation of induced rules, which can be viewed as a statistical extension of rough set methods. For this extension, the chi-square distribution and the F-distribution play an important role in statistical evaluation.
Conventional studies on knowledge discovery in databases (KDD) show that the combination of rule induction methods and attribute-oriented generalization is very useful for extracting knowledge from data. However, attribute-oriented generalization, in which a concept hierarchy is used for the transformation of attributes, assumes that the given hierarchy is consistent. Thus, if this condition is violated, the application of hierarchical knowledge generates inconsistent rules. In this paper, first, we show that this phenomenon is easily found in data mining contexts: when we apply attribute-oriented generalization to attributes in databases, generalized attributes will have fuzziness for classification. Then, we introduce two approaches to solve this problem, one of which suggests that the combination of rule induction and attribute-oriented generalization can be used to validate a concept hierarchy. Finally, we briefly discuss the mathematical generalization of this solution, in which context-free fuzzy sets are a key idea.
Rule discovery methods have been introduced to find useful and unexpected patterns in databases. However, one of the most important problems with these methods is that extracted rules contain only positive knowledge and do not include the negative information that medical experts need to confirm whether a patient will suffer from symptoms caused by drug side-effects. This paper first discusses the characteristics of medical reasoning and defines positive and negative rules based on the rough set model. Then, algorithms for the induction of positive and negative rules are introduced. Finally, the proposed method was evaluated on clinical databases, and the experimental results show that several interesting patterns were discovered, such as a rule describing a relation between urticaria caused by antibiotics and food.