Knowledge discovery presupposes mining the inherent relationships between the numerous data records and categories within a given database. Clustering is one of the popular approaches to such data mining. Most clustering techniques require a priori knowledge of the intrinsic number of clusters or groups in the database and, in addition, their interpretation needs some externally definable basis of affinity underlying each of the clusters (cluster class labels). However, in a real-world knowledge discovery process, such a priori knowledge is often not available, and the user is interested in discovering this as well. In this study, a new approach is proposed, aimed at discovering the intrinsic number of groups at the most elemental level, which may then be combined to form metagroups to the extent desired. This approach is based upon a concept referred to here as reciprocal relationship bonds. The method initially identifies all data record pairs in the database with such reciprocal relationship bonds. These represent the cores of the data record groups, to be formed by step-wise bonding of all the records in the entire database. Various levels of relationship can then be defined between any pair of records, depending on the number of bonds required to connect these data records. The strength of each record's bond to the cluster can be ordered, based on how far removed it is from the cluster core. The lower the order, the stronger its bond to the cluster, and the higher the likelihood that the data record truly belongs to the corresponding cluster. The approach also provides a mechanism for flagging out-of-norm or unusual data, which may indicate incomplete or error-prone records, by clustering them separately from the normal data records.
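A minimal sketch of how such bond cores might be identified, assuming numeric attributes, a plain Euclidean distance, and treating a reciprocal relationship bond as a mutual-nearest-neighbour relation (an illustrative reading, not necessarily the authors' exact bond definition):

```python
import numpy as np

def reciprocal_bond_pairs(X):
    """Return index pairs (i, j) whose records are mutual nearest neighbours.

    X : (n_records, n_features) array of numeric attributes.
    A pair forms a 'reciprocal relationship bond' when each record is the
    other's nearest neighbour; such pairs serve as cluster cores.
    """
    # Pairwise Euclidean distances with the diagonal masked out.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                      # nearest neighbour of each record
    return [(i, j) for i, j in enumerate(nn) if nn[j] == i and i < j]

# Example: two tight pairs form two cluster cores.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [2.5, 2.5]])
print(reciprocal_bond_pairs(X))                # [(0, 1), (2, 3)]
```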
This paper proposes a clustering methodology for sequence data, using a hidden Markov model (HMM) representation. The proposed methodology improves upon existing HMM-based clustering methods in two ways: (i) it enables HMMs to dynamically change their model structure, to obtain a better-fitting model for the data during the clustering process, and (ii) it provides an objective criterion function to select the optimal clustering partition. The algorithm is presented in terms of four nested levels of search: (i) the search for the optimal number of clusters in a partition, (ii) the search for the optimal structure for a given partition, (iii) the search for the optimal HMM structure for each cluster, and (iv) the search for the optimal HMM parameters for each HMM. Preliminary results are given to support the proposed methodology.
Clustering is concerned with the unsupervised discovery of natural groupings of records in a database. Recently, considerable effort has been put into the development of effective clustering algorithms for spatial data mining. These algorithms are mainly developed to deal with continuous-valued attributes. They typically use a distance measure defined in Euclidean space to determine whether two records should be placed in the same or in disjoint clusters. Since such a distance measure cannot be validly defined in non-Euclidean space, these algorithms cannot handle databases that contain records also characterized by discrete qualitative or categorical values. Since many real-world databases contain a mixture of both continuous- and discrete-valued attributes, it is important that a clustering algorithm be developed to handle data mining tasks involving them. In this paper, we present such a clustering algorithm. The algorithm can be divided into two phases: a discretization phase and a genetic algorithm-based clustering phase. The first phase discretizes continuous attributes into discrete attributes, based on an information-theoretic measure that minimizes the loss of information during the process. These discretized attributes and the originally discrete attributes are then used in the second phase, in which clustering is carried out using a genetic algorithm. By representing a specific grouping of records in a chromosome and using a weight-of-evidence measure as the fitness measure to determine whether such a grouping is meaningful, we obtain an effective GA for data clustering. To evaluate the effectiveness of the proposed technique, we tested it using real data. The experimental results showed that it is very promising.
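A minimal sketch of the genetic search described above, assuming a chromosome assigns one cluster label per record and taking the fitness function as a caller-supplied placeholder (the paper's weight-of-evidence measure is not reproduced here):

```python
import random

def evolve_clustering(records, k, fitness, pop_size=30, generations=50, mut_rate=0.05):
    """Toy genetic search over cluster assignments.

    A chromosome is a list of cluster labels, one per record; `fitness`
    scores how meaningful a grouping is (e.g. a weight-of-evidence measure).
    """
    def random_chrom():
        return [random.randrange(k) for _ in records]

    def crossover(a, b):
        cut = random.randrange(1, len(a))               # single-point crossover
        return a[:cut] + b[cut:]

    def mutate(chrom):
        return [random.randrange(k) if random.random() < mut_rate else g for g in chrom]

    population = [random_chrom() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)       # keep the fitter half as parents
        parents = population[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

# usage (illustrative): best_grouping = evolve_clustering(records, k=3, fitness=my_weight_of_evidence)
```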
Clustering is among the oldest techniques used in data mining applications. Typical implementations of hierarchical agglomerative clustering methods (HACM) require O(N²) space when there are N data objects, making such algorithms impractical for problems involving large datasets. The well-known clustering algorithm RNN-CLINK requires only O(N) space, but O(N³) time in the worst case, although the average time appears to be O(N² log N). We provide a probabilistic interpretation of the average time complexity of the algorithm. We also report experimental results, using randomly generated bit vectors and NETNEWS articles as input, to support our theoretical analysis.
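A simplified sketch in the spirit of reciprocal-nearest-neighbour (RNN) agglomeration with complete linkage, which needs only the current chain and cluster list as extra storage; this is an illustrative reconstruction, not the authors' RNN-CLINK implementation:

```python
def complete_link(d, a, b):
    """Complete-link distance between clusters a and b (tuples of point indices)."""
    return max(d[i][j] for i in a for j in b)

def rnn_chain_clustering(d, n_clusters):
    """Nearest-neighbour-chain agglomeration using only O(N) extra space.

    d is an N x N point-distance matrix. Clusters are merged by following
    nearest neighbours until a reciprocal nearest-neighbour pair is found.
    """
    clusters = [(i,) for i in range(len(d))]
    while len(clusters) > n_clusters:
        chain = [clusters[0]]
        while True:
            current = chain[-1]
            nearest = min((c for c in clusters if c != current),
                          key=lambda c: complete_link(d, current, c))
            if len(chain) > 1 and nearest == chain[-2]:
                break                                   # reciprocal nearest neighbours found
            chain.append(nearest)
        a, b = chain[-1], chain[-2]
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters
```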
This paper surveys public domain supervised learning algorithms and analyzes the accuracy (error rate) of their classification performance on unseen instances for twenty-nine of the University of California at Irvine machine learning datasets. The learning algorithms represent three types of classifiers: decision trees, neural networks and rule-based classifiers. The study performs data analysis and examines the effect of irrelevant attributes to explain the performance characteristics of the learning algorithms. The survey concludes with some general recommendations about the selection of public domain machine learning algorithms relative to the properties of the data examined.
The induction of decision rules from data is important to many disciplines, including artificial intelligence and pattern recognition. To improve the state of the art in this area, we introduced the genetic rule and classifier construction environment (GRaCCE). It was previously shown that GRaCCE consistently evolved decision rule sets from data that were significantly more compact than those produced by other methods (such as decision tree algorithms). The primary disadvantage of GRaCCE, however, is its relatively poor run-time execution performance. In this paper, a concurrent version of the GRaCCE architecture is introduced, which improves the efficiency of the original algorithm. A prototype of the algorithm is tested on an in-house parallel processor configuration and the results are discussed.
Efficient association rule mining algorithms already exist; however, as the size of databases increases, the number of patterns mined by these algorithms increases to such an extent that their manual evaluation becomes impractical. Automatic evaluation methods are therefore required in order to sift through the initial list of rules that the data mining algorithm outputs. These evaluation methods, or criteria, rank the association rules mined from the dataset. We empirically examined several such statistical criteria: new criteria as well as previously known ones. The empirical evaluation was conducted using several databases, including a large real-life dataset acquired from an order-by-phone grocery store, a dataset composed of WWW proxy logs, and several datasets from the UCI repository. We were interested in discovering whether the rankings produced by the various criteria are similar or easily distinguishable. Where significant differences exist, our evaluation detected three patterns of behavior among the eight criteria we examined. There is an obvious dilemma in determining how many association rules to choose (in accordance with support and confidence parameters). The tradeoff is between having stringent parameters and, therefore, few rules, or lenient parameters and, thus, a multitude of rules. In many cases, our empirical evaluation revealed that most of the rules found with comparably strict parameters also ranked highly according to the interestingness criteria when lax parameters (producing significantly more association rules) were used. Finally, we discuss the association rules that ranked highest, explain why these results are sound, and how they direct future research.
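For illustration, a small sketch of how standard interestingness criteria can be computed from rule counts and used to rank mined rules; these are well-known measures (support, confidence, lift, leverage, conviction), not the paper's newly proposed criteria:

```python
def interestingness(sup_xy, sup_x, sup_y, n):
    """Standard interestingness measures for a rule X -> Y, from counts.

    sup_xy, sup_x, sup_y are absolute supports; n is the number of transactions.
    Returns a dict of criteria that can be used to rank mined rules.
    """
    p_xy, p_x, p_y = sup_xy / n, sup_x / n, sup_y / n
    confidence = p_xy / p_x
    return {
        "support": p_xy,
        "confidence": confidence,
        "lift": confidence / p_y,
        "leverage": p_xy - p_x * p_y,
        "conviction": float("inf") if confidence == 1 else (1 - p_y) / (1 - confidence),
    }

# Ranking a list of mined rules by one criterion, e.g. lift (illustrative):
# rules.sort(key=lambda r: interestingness(*r.counts)["lift"], reverse=True)
```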
Our goal is to design a knowledge discovery tool that has the ability to accurately generate rules, using concepts and structured data values extracted from semi-structured documents. To date, two of our major contributions have been the design of a system architecture that facilitates the discovery of rules from HTML documents, and the development of an efficient association rule algorithm that generates rule sets based on user-specified constraints. This paper discusses each of these contributions within the framework of our prototype system IRIS. IRIS allows users to specify a set of constraints associated with a particular domain and then generates association rules based on these constraints. One of the unique features of IRIS is that it generates rules using the more structured components of the HTML documents, as well as the conceptual knowledge extracted from the unstructured blocks of text.
We suggest a hybrid expert system of memory-based and neural network-based learning. Neural networks (NN) and memory-based reasoning (MBR) have common advantages over other learning strategies. NN and MBR can be directly applied to classification and regression problems without additional transform mechanisms. They also have strength in learning the dynamic behavior of a system over a period of time. Unfortunately, they have shortcomings. The knowledge representation of an NN is unreadable to human beings, and this 'black box' property restricts the application of NNs to areas that need proper explanations as well as precise predictions. On the other hand, MBR suffers from the feature weighting problem: when MBR measures the distance between cases, some features should be weighted more heavily than others. Although previous researchers have provided several feature weighting mechanisms to overcome this difficulty, those methods were mainly applicable only to classification problems. In our hybrid system of NN and MBR, the feature weight set calculated from the trained neural network plays the core role in connecting both learning strategies. Moreover, an explanation of a prediction can be given by presenting the most similar cases from the case base. In this paper, we present the basic idea of the hybrid system and preliminary experimental results. Experimental results show that the hybrid system predicts the yield with relatively high accuracy and has a distinct potential in feature selection.
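A minimal sketch of the coupling described above, assuming feature weights have already been extracted from the trained network (how they are derived is left abstract here) and are used in a weighted nearest-neighbour distance:

```python
import numpy as np

def weighted_knn_predict(query, cases, targets, weights, k=5):
    """Memory-based prediction with feature weights taken from a trained NN.

    cases   : (n_cases, n_features) array of stored cases
    targets : (n_cases,) array of known outcomes (e.g. process yields)
    weights : (n_features,) feature importances derived from the trained network
    Returns the prediction and the indices of the k most similar cases,
    which can be shown to the user as an explanation.
    """
    d = np.sqrt(((cases - query) ** 2 * weights).sum(axis=1))   # weighted Euclidean distance
    nearest = np.argsort(d)[:k]
    return targets[nearest].mean(), nearest                     # mean for regression-style yield prediction
```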
Sensor data can be mined to discover a description of an implicit property in an application domain. The main problems in the sensor data domain are the huge volume of noisy data and, therefore, the difficulty of locating the relevant information. KDD is the field where these problems are being addressed. This paper proposes a feature selection approach to discover relevant features, which will characterize the unknown property. The obtained description improves the expert knowledge and forms a basis for the prediction task. The feature selection task becomes intractable when the feature set is large. Recurrent neural networks have been shown to be very useful for obtaining the global minimum of hard problems. This paper proposes two recurrent neural network models for feature selection: the graph model and the cluster model. Experiments show the advantage of the new methods by comparing them with heuristic methods such as RELIEF-F, DTM, and the wrapper approach. We have selected databases from the UCI repository and experimental data from steel machining processes. The new methods provide better solutions than the heuristic methods, resolve the tradeoff between accuracy and efficiency, and improve generalization capability, which addresses the noise problem.
This paper addresses the problem of building a model for text documents of interest. Specifically, it considers a scenario in which a large collection of documents is given, for example, the result of an Internet search using one of the popular search engines. Each document is indexed by certain keywords or terms. It is assumed that the user has identified a subset of documents that fits the user's needs. The goal is to build a term association model for the documents of interest, so that it can be used either for refining the user's search or for export to other search engines/agents for further search of documents of interest. The model built is in the form of a unate Boolean function of the terms or keywords used in the initial search of documents. The proposed document model building algorithm is based on a modified version of the pocket algorithm for perceptron learning and a mapping method for converting neurons into equivalent symbolic representations.
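A minimal sketch of the standard pocket algorithm on binary term vectors, which underlies the (modified) learning step described above; the subsequent mapping of the learned neuron into a unate Boolean function is not reproduced here:

```python
import numpy as np

def pocket_perceptron(X, y, epochs=100, rng=np.random.default_rng(0)):
    """Pocket algorithm: perceptron updates, keeping the best weights seen.

    X : (n_docs, n_terms) 0/1 term-occurrence matrix
    y : (n_docs,) labels in {-1, +1} (document of interest or not)
    Returns the 'pocketed' weights and bias with the highest training accuracy.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    best_w, best_b, best_acc = w.copy(), b, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) <= 0:           # misclassified: perceptron update
                w, b = w + y[i] * X[i], b + y[i]
                acc = (np.sign(X @ w + b) == y).mean()
                if acc > best_acc:                    # keep the best weights in the pocket
                    best_w, best_b, best_acc = w.copy(), b, acc
    return best_w, best_b
```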
A conceptual model of a database, such as an Entity-Relationship (ER) model, is a specification of objects, attributes, and their relationships. A conceptual model plays important roles in developing successful database applications. Although critical, a conceptual model of a legacy database may not always be available in practice, and discovering and constructing such a model from the data, and from the data only, is a challenging problem. In this paper, we develop a new approach to address object identification and model construction. Our approach has many favorable features, including its robustness in dealing with noisy data and its scalability to large databases and data sets. We implement this approach in a system called McKey (Model Construction with Key identification) for discovering and building ER models from instances of large legacy databases. We apply McKey to three very large legacy databases and obtain comprehensive models within hours, which represents many orders of magnitude of savings in manpower.
Successful model building techniques in the natural sciences can usually be classified into the two categories of inductive and deductive methods. If numerical data is the only source of available knowledge in a certain problem, then inductive techniques are the obvious method of choice. Recently, data mining has received much attention as a means of extracting hidden knowledge from very large databases and is, therefore, in this framework considered to be an inductive approach. Deductive methods, on the contrary, rely on a rigorous application and the explicit knowledge of the underlying domain, using first principles to establish a mathematical model of the problem, typically yielding governing equations. This paper focuses on the establishment of an intermediate knowledge representation level, between inductive and deductive knowledge identification techniques. This intermediate knowledge representation level is based on the notion of group transforms, and can be shown to be a mathematically necessary condition for the existence of any correct model representation. The synergy effect of using group transforms in both the inductive data mining and the deductive similarity theory approach can be demonstrated to be conceptually advantageous and is observed to be numerically superior to conventional techniques. The limitations and underlying assumptions of the approach are identified and discussed using an example.
One of the most important directions in the improvement of data mining and knowledge discovery is the integration of the multiple classification techniques of an ensemble of classifiers. An integration technique should be able to estimate and select the most appropriate component classifiers from the ensemble. We present two variations of an advanced dynamic integration technique with two distance metrics. The technique is a variation of the stacked generalization method, with the assumption that each of the component classifiers is the best one within a certain subarea of the entire domain. Our technique includes two phases: the learning phase and the application phase. During the learning phase, a performance matrix for each component classifier is derived, using the instances of the training set. Each matrix thus encodes information concerning the 'competence area' of the corresponding component classifier. These matrices are used during the application phase to predict the performance of each component classifier on each new instance. The technique is evaluated on three data sets, taken from the UCI machine learning repository, with which well-known classification methods have not proved successful. The comparison results show that our dynamic integration technique outperforms the weighted voting and cross-validation majority techniques on some datasets.
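A minimal sketch of dynamic selection by local competence, assuming scikit-learn-style fitted classifiers and a 0/1 performance matrix recorded on the training set during the learning phase; the concrete distance metrics and selection rule of the paper are not reproduced here:

```python
import numpy as np

def dynamic_select_predict(x, classifiers, train_X, correct, k=7):
    """Dynamic integration: pick the classifier that performed best near x.

    classifiers : list of fitted models with a .predict method
    train_X     : (n_train, n_features) training instances
    correct     : (n_classifiers, n_train) 0/1 matrix recording whether each
                  classifier got each training instance right (the 'performance
                  matrix' derived during the learning phase)
    """
    d = np.linalg.norm(train_X - x, axis=1)
    neighbours = np.argsort(d)[:k]                    # local competence area around x
    local_acc = correct[:, neighbours].mean(axis=1)   # estimated local accuracy per classifier
    best = int(local_acc.argmax())
    return classifiers[best].predict(x.reshape(1, -1))[0]
```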
Recently, many computerized data mining tools and environments have been proposed for finding interesting patterns in large data collections. These tools employ techniques that originate from research in various areas, such as machine learning, statistical data analysis, and visualization. Each of these techniques makes assumptions concerning the composition of the data collection to be analyzed. If the particular data collection does not meet these assumptions well, the technique usually performs poorly. For example, decision tree tools such as C4.5 rely on rectangular approximations, which do not perform well if the boundaries between different classes have other shapes, such as a 45-degree line or elliptical shapes. However, if we could find a transformation f of the original attribute space into one in which class boundaries are more rectangular, better rectangular approximations could be obtained. In this paper, we address the problem of finding such transformations f. We describe the features of the tool WOLS, whose goal is the discovery of ingredients for such transformation functions f, which we call building blocks. The tool employs genetic programming and symbolic regression for this purpose. We also present and discuss the results of case studies, using the building block analysis tool, in the areas of decision tree learning and regression analysis.
Discovery of non-obvious relationships between time series is an important problem in many domains, such as financial, sensory, and scientific data analysis. We consider data mining in aligned time series, which arise, e.g., in numerous online monitoring applications, and we are interested in finding time series which reflect the same external events. The time series can have different vertical positions, scales and overall trends, yet still show related features at the same locations. The features can be short-term, such as small peaks and turns, or long-term, such as wider mountains and valleys. We propose using a wavelet transformation of a time series to produce a natural set of features for the sequence. The wavelet transformation yields features which describe properties of the sequence, both at various locations and at varying time granularities. In the proposed method, these features are processed so that they are insensitive to changes in the vertical position, scaling, and overall trend of the time series. We discuss the use of these features in data mining, in tasks such as clustering. We demonstrate how the features allow a flexible analysis of different aspects of the similarity: we show how one can examine how the similarity between time series changes as a function of time or as a function of the time granularity considered. We present experimental results with real financial data sets. Experiments indicate that the proposed method can produce useful results. For instance, important similarities can be found in time series which would be considered unrelated by visual inspection. Experiments with compression give encouraging results for the application of the method in mining massive time series data sets.
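As an illustration, a simple Haar wavelet decomposition that yields detail coefficients per time granularity; discarding the final approximation coefficient already removes the vertical offset, while the paper's further normalizations for scale and trend are not reproduced here:

```python
import numpy as np

def haar_features(series):
    """Haar wavelet detail coefficients at all scales for a 1-D series.

    The length is padded to a power of two; detail coefficients describe local
    changes at each time granularity and are insensitive to the vertical
    position of the series (the overall mean is dropped with the final
    approximation coefficient).
    """
    x = np.asarray(series, dtype=float)
    n = 1 << int(np.ceil(np.log2(len(x))))
    x = np.pad(x, (0, n - len(x)), mode="edge")
    features = []
    while len(x) > 1:
        approx = (x[0::2] + x[1::2]) / 2.0
        detail = (x[0::2] - x[1::2]) / 2.0
        features.append(detail)            # one array of features per time granularity
        x = approx
    return features                        # finest scale first; the final mean is discarded
```

Dividing each detail array by its norm would additionally make the features insensitive to scaling, in the spirit of the processing described above.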
With three centuries of existence, the study of population behavior involves the manipulation of large amounts of incomplete, imprecise and high-dimensional data. By virtue of its multidisciplinary character, work in demography involves at least historians, statisticians and computer scientists/programmers. Moreover, successful demographic analysis requires qualified experts, who have succeeded in analysing data through many views and in relating different sources of information, including their personal knowledge of the epoch or regions under study. In this paper, we present an intelligent system to study demographic evolution (ISSDE). This system has a module based on on-line analytical processing (OLAP), which permits conducting multiple analyses combining many data dimensions. It has a deductive database system, which allows the execution of elaborate queries over the database. It has another module for data treatment (generalization and/or reduction); and, finally, a data mining module to discover nontrivial relations hidden within the data. We describe the data treatment procedure, which has two phases: data generalization and data reduction. In data generalization, aggregation of attribute values is performed, utilizing knowledge about concept hierarchies and the relevance of the data. In the data reduction phase, rough set theory is applied to compute the minimal attribute set. We highlight the advantages of combining attribute value generalization with rough set theory, to find a subset of attributes that lets the mining process discover more useful patterns, by providing results from the application of the C5.0 algorithm to a demographic relational database.
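A minimal brute-force sketch of the rough set reduction step: it searches for the smallest attribute subset that still determines the decision attribute, assuming the full attribute set is consistent (practical reduct algorithms are far more efficient than this illustrative version):

```python
from itertools import combinations

def partition(rows, attrs):
    """Group row indices by their values on the given attributes."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return groups

def is_consistent(rows, decisions, attrs):
    """True if rows identical on attrs always share the same decision value."""
    return all(len({decisions[i] for i in g}) == 1
               for g in partition(rows, attrs).values())

def minimal_reduct(rows, decisions, attrs):
    """Smallest attribute subset preserving the decision dependency (brute force)."""
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if is_consistent(rows, decisions, subset):
                return subset
    return tuple(attrs)
```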
The paper focuses on the discovery of knowledge needed to establish the shared meaning of attributes in a network of distributed autonomous databases. In this paper, we concentrate on the role of equations as definitions of attribute values. We briefly describe various applications of such definitions, including prediction, knowledge verification, intelligent query answering and several others. We present an interface between a distributed autonomous knowledge system, DAKS, and a discovery system, 49er. To find knowledge useful in defining attributes missing in one database, the discovery mechanism of 49er can be applied to other databases. DAKS makes requests for definitions and then manages the discovered definitions, verifies their consistency and applies them in its query-answering mechanism. To put a system of equation-based attribute definitions on a firm theoretical foundation, we introduce a semantics that justifies empirical equations in their definitional role. This semantics augments the earlier-developed semantics for rules used as attribute definitions.
Customer-defined problems don't match data mining heuristics. A basic component of the data mining process is the transformation of the business problem into a form that fits one of these basic heuristics. This means changing the problem size or composition, adding variables, and using more than one data mining heuristic. We will also discuss the basics of data mining tool selection. After all, having the right type of tool can make the project easier to accomplish and more successful.
Diamond Eye is a distributed software architecture, which enables users (scientists) to analyze large image collections by interacting with one or more custom data mining servers via a Java applet interface. Each server is coupled with an object-oriented database and a computational engine, such as a network of high-performance workstations. The database provides persistent storage and supports querying of the 'mined' information. The computational engine provides parallel execution of expensive image processing, object recognition, and query-by-content operations. Key benefits of the Diamond Eye architecture are: (1) the design promotes trial evaluation of advanced data mining and machine learning techniques by potential new users (all that is required is to point a web browser to the appropriate URL), (2) software infrastructure that is common across a range of science mining applications is factored out and reused, and (3) the system facilitates closer collaborations between algorithm developers and domain experts.
Physicians, in their ever-demanding jobs, are looking to decision support systems for aid in clinical diagnosis. However, clinical decision support systems need to be of sufficiently high accuracy that they help, rather than hinder, the physician in his/her diagnosis. Decision support systems with patient-state determination accuracies of greater than 80 percent are generally perceived to be sufficiently accurate to fulfill the role of helping the physician. We have previously shown that data mining techniques have the potential to provide the underpinning technology for clinical decision support systems. In this paper, an extension of the work in reference 2, we describe how changes in data mining methodologies for the analysis of 12-lead ECG data improve the accuracy with which data mining algorithms determine which patients are suffering from heart disease. We show that the accuracy of patient state prediction, for all the algorithms we investigated, can be increased by up to 6 percent, using the combination of appropriate test-training ratios and 5-fold cross-validation. The use of cross-validation greater than 5-fold appears to reduce the improvement in algorithm classification accuracy gained by the use of this validation method. The accuracy of 84 percent in patient state predictions, obtained using the OC1 algorithm, suggests that this algorithm will be capable of providing the required accuracy for clinical decision support systems.
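A generic sketch of the k-fold cross-validation step described above (not the authors' pipeline); `model_factory` is an illustrative placeholder for any classifier with scikit-learn-style fit/predict methods:

```python
import numpy as np

def cross_val_accuracy(model_factory, X, y, folds=5, seed=0):
    """Estimate classification accuracy with k-fold cross-validation.

    model_factory() returns a fresh classifier; X and y hold the ECG-derived
    features and the corresponding patient states.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    accs = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)               # remaining instances form the training set
        model = model_factory()
        model.fit(X[train], y[train])
        accs.append((model.predict(X[fold]) == y[fold]).mean())
    return float(np.mean(accs))
```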
The main problem related to the retrieval of information from the World Wide Web is the enormous number of unstructured documents and resources, i.e., the difficulty of locating and tracking appropriate sources. This paper presents a web mining environment (WME), which is capable of finding, extracting and structuring information related to a particular domain from web documents, using general-purpose indices. The WME architecture includes a web engine filter (WEF), to sort and reduce the answer set returned by a web engine; a data source pre-processor (DSP), which processes HTML layout cues in order to collect and qualify page segments; and a heuristic-based information extraction system (HIES), to finally retrieve the required data. Furthermore, we present a web mining environment generator, WMEG, that allows naive users to generate a WME specific to a given domain by providing a set of specifications.
The explosive growth of commercial and scientific databases has outpaced our ability to manually analyze and interpret this data. The newly emerging interdisciplinary field of knowledge discovery in databases (KDD) provides methodologies for seeking valuable and useful information from these databases. In this paper, we describe a methodology for identifying high-fault modules in software metrics databases. It employs a radial basis function model for the data mining phase of the KDD process, based on our newly developed algorithm. We use the well-known bootstrap method for model validation and accuracy estimation of the classification task. As an example, a genuine problem from a NASA software database is explored.
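A minimal sketch of the two ingredients named above, a Gaussian radial basis function classifier fitted by least squares and a bootstrap accuracy estimate; the centers, width and ridge term are illustrative choices, not the authors' algorithm:

```python
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian radial basis function design matrix."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

def fit_rbf(X, y, centers, width, ridge=1e-6):
    """Least-squares output weights for an RBF classifier (y in {0, 1})."""
    H = rbf_design(X, centers, width)
    return np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)

def bootstrap_accuracy(X, y, centers, width, reps=200, seed=0):
    """Bootstrap estimate of classification accuracy for the RBF model."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(reps):
        idx = rng.integers(0, len(y), len(y))          # resample with replacement
        w = fit_rbf(X[idx], y[idx], centers, width)
        oob = np.setdiff1d(np.arange(len(y)), idx)     # out-of-bag test instances
        if len(oob):
            pred = (rbf_design(X[oob], centers, width) @ w) > 0.5
            accs.append((pred == y[oob].astype(bool)).mean())
    return float(np.mean(accs))
```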
With the emergence of data warehousing, decision support systems have evolved to their best. At the core of these warehousing systems lies a good database management system. The database server used for data warehousing is responsible for providing robust data management, scalability, high-performance query processing and integration with other servers. Oracle, an initiator among warehousing servers, provides a wide range of features for facilitating data warehousing. This paper is designed to review the features of data warehousing, conceptualizing the concept of data warehousing and, lastly, the features of Oracle servers for implementing a data warehouse.