Clustering of texture images is a demanding part of multimedia database mining. Most natural textures are non-homogeneous in terms of color and textural properties, and in many cases a system is needed that can divide non-homogeneous texture images into visually similar clusters. In this paper, we introduce a new method for this purpose. In our clustering technique, the texture images are ordered into a queue based on their visual similarity, and similar texture images can then be selected from this queue. In the similarity evaluation, we use feature distributions based on the color and texture properties of the sample images. The color correlogram is a distribution that has proved effective in characterizing the color and texture properties of non-homogeneous texture images. The correlogram is based on the co-occurrence matrix, a standard statistical tool in texture analysis. In this work, we use gray-level and hue correlograms to characterize colored textures. The similarity between the distributions is measured using several different distance measures, and the queue of texture images is formed from the distances between the samples. In this paper, we use a test set that contains non-homogeneous texture images of ornamental stones.
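The abstract does not give the correlogram computation in detail; the following is a minimal illustrative sketch of a gray-level co-occurrence matrix and an autocorrelogram-style distribution built from it. Function names and the four-displacement averaging are our simplifications, not the paper's implementation.

```python
import numpy as np

def cooccurrence(img, dx, dy, levels):
    """Gray-level co-occurrence matrix for displacement (dx, dy)."""
    h, w = img.shape
    M = np.zeros((levels, levels), dtype=np.int64)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            M[img[y, x], img[y + dy, x + dx]] += 1
    return M

def correlogram(img, distances, levels):
    """Autocorrelogram: estimated probability that a pixel at distance d
    has the same gray level g as the starting pixel, per level g."""
    counts = np.bincount(img.ravel(), minlength=levels).astype(float)
    counts[counts == 0] = 1.0            # avoid division by zero
    result = np.zeros((len(distances), levels))
    for i, d in enumerate(distances):
        diag = np.zeros(levels)
        # average the co-occurrence diagonal over 4 axis displacements
        for ddx, ddy in [(d, 0), (-d, 0), (0, d), (0, -d)]:
            diag += np.diag(cooccurrence(img, ddx, ddy, levels))
        result[i] = diag / (4.0 * counts)
    return result
```

A hue correlogram, as used in the paper, would apply the same computation to a quantized hue channel instead of gray levels.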
The observation that isomorphic relations have isomorphic high-frequency patterns implies some unexpected properties of association rules. First, the patterns are properties of the isomorphism class, not of an individual relation. Second, the counts of itemsets, association rules, and so on are invariants under isomorphism, and hence a probability theory based on such counts is again a theory of the whole class, not of an individual relation. On the other hand, examples show that the "interestingness" of association rules is a property of an individual relation, not of the whole isomorphism class. As a corollary, contrary to many authors' beliefs, we conclude that interestingness cannot be characterized by such a probability theory.
Clustering of the images stored in a large database is one of the basic tasks in image database mining. In this paper we present a clustering method for an industrial imaging application: a defect detection system used in the paper industry. The system produces gray-level images of defects that occur on the paper surface and stores them in an image database. These defects have different causes, and it is important to associate the defect causes with different types of defect images. In the clustering procedure presented in this paper, the image database is indexed using distinguishing features extracted from the database images. The clustering uses an algorithm based on the k-nearest-neighbor classifier, which can form arbitrarily shaped clusters in the feature space. The algorithm is applied to the database images in a hierarchical way, so several different feature spaces can be used in the clustering procedure. The images in the resulting clusters are associated with the real defect causes in the industrial process. The experimental results show that the clusters agree well with the traditional classification of the defects.
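The abstract does not specify how the k-nearest-neighbor classifier is turned into a clustering algorithm; one common construction with the stated property (arbitrarily shaped clusters) links each sample to its k nearest neighbors and takes connected components of the resulting graph, so that chains of neighbors form elongated clusters. The sketch below illustrates that idea and is our assumption, not the paper's algorithm.

```python
import numpy as np

def knn_clusters(X, k):
    """Cluster samples by linking each one to its k nearest neighbours
    and flood-filling the resulting neighbour graph; chained neighbours
    yield arbitrarily shaped clusters in the feature space."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.argsort(d[i])[:k]:   # k nearest neighbours of i
                if labels[j] < 0:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```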
Finding all closed frequent itemsets is a key step of association rule mining, since the non-redundant association rules can be inferred from them. In this paper we present a new method for finding closed frequent itemsets based on an attribute-value lattice. We argue that a vertical data representation combined with the attribute-value lattice can find all closed frequent itemsets efficiently, thus greatly improving the efficiency of association rule mining, and we discuss how these techniques are applied. In our method, the data are represented vertically: each frequent attribute value is associated with its granule, represented as a hybrid bitmap. Based on the partial order defined between the attribute values in the database, an attribute-value lattice is constructed, which is much smaller than the original database. Instead of searching all the items in the database, as almost all association rule algorithms do to find frequent itemsets, our method searches only the attribute-value lattice, using a bottom-up, breadth-first approach to find the closed frequent itemsets.
This paper presents a new approach to extracting hierarchical decision rules, consisting of three procedures. First, the characterization set of each given target concept is extracted from the database and the concept hierarchy for the given classes is calculated. Second, based on the hierarchy, rules for each hierarchical level are induced from the data. Finally, for each given class, the rules for all hierarchical levels are integrated into one rule. The proposed method was evaluated on a medical database; the experimental results show that the induced rules correctly represent experts' decision processes.
Associations (not necessarily in rule form) as patterns in data are critically analyzed. We build a theory based only on what the data says, with no other implicit assumptions. Data mining is regarded as a deductive science: first, we observe that isomorphic relations have isomorphic associations. Somewhat surprisingly, this simple observation turns out to have far-reaching consequences. It implies that associations are properties of an isomorphism class, not of an individual relation. A similar conclusion can be made for probability theory based on item counting; hence it is not adequate to characterize "interestingness," since the latter is a property of an individual relation. As a by-product of this analysis, we find that all generalized associations can be found by simply solving a set of integral linear inequalities - a striking result. Finally, we observe from the structure of the relation lattice that random sampling may lose substantial information about patterns.
An approach is explored that embeds a fuzzy-logic-based resource manager in an electronic game environment. Game agents can function under their own autonomous logic or under human control. This approach automates the data mining problem: the game automatically creates a cleansed database reflecting the domain expert's knowledge, calls a data mining function, a genetic algorithm, as required, and allows easy evaluation of the extracted information. The co-evolutionary fitness functions, chromosomes, and stopping criteria for ending the game are discussed. Genetic algorithm and genetic program based data mining procedures that automatically discover new fuzzy rules and strategies are discussed. The strategy tree concept and its relationship to co-evolutionary data mining are examined, as well as the associated phase space representation of fuzzy concepts. The overlap of fuzzy concepts in phase space reduces the effective strategies available to adversaries. Co-evolutionary data mining alters the geometric properties of the overlap region, known as the admissible region of phase space, significantly enhancing the performance of the resource manager. Procedures for validating the mined information are discussed and significant experimental results are provided.
Rule induction methods have been introduced since the 1980s, and many applications show that they are very useful for acquiring simple patterns from large databases. However, when a database is very large, these methods generate too many rules, which makes it difficult for domain experts to interpret them all. Moreover, since rules only show relations between attribute-value pairs, it is very difficult to capture relations between concepts or among induced rules. To address this problem, many kinds of visualization have been introduced. Rough set theory offers a technique for conflict analysis based on a qualitative distance obtained from attributes, which yields graphical relations between classes or rules. Statistical methods, on the other hand, offer graphical modeling, which yields graphical relations between attributes by using partial correlation coefficients or other indices. In this paper, we introduce a new approach that combines conflict analysis and graphical modeling. The results show that the combination of these two methods gives another type of rule visualization, together with a formal mathematical model for it.
Finding a reduct is a core theme in rough set theory, and can be considered a form of data mining. However, finding the "perfect" reduct has been proved to be an NP-hard problem. In this paper, we compute a reduct based on a granular data model, in which each granule is represented by a bit string. The computation is very fast.
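As a rough illustration of the granular idea (the paper's actual fast algorithm is not given in the abstract, and the brute-force search below is our placeholder for it): each equivalence class of the indiscernibility relation can be encoded as a bit string over the rows, and a reduct is a minimal attribute subset that induces the same granules as the full attribute set.

```python
from itertools import combinations

def partition_bits(table, attrs):
    """Granules of the indiscernibility relation as bit strings: one
    integer bitmask per equivalence class of rows agreeing on attrs."""
    classes = {}
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        classes[key] = classes.get(key, 0) | (1 << i)
    return frozenset(classes.values())

def reduct(table, attrs):
    """Smallest attribute subset inducing the same granules as attrs
    (brute-force sketch; granular algorithms prune this search)."""
    full = partition_bits(table, attrs)
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            if partition_bits(table, subset) == full:
                return subset
    return tuple(attrs)
```

The bit-string encoding is what makes granule comparison cheap: equality of partitions reduces to comparing sets of machine integers.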
Rough set theory is emerging as a new tool for dealing with fuzzy and uncertain data. In this paper, a theory is developed to express, measure, and process uncertain information and uncertain knowledge, based on our results on the uncertainty measure of decision tables and decision rule systems. Based on Skowron's propositional default rule generation algorithm, we develop an initiative learning model with a rough-set-based initiative rule generation algorithm. Simulation results illustrate its efficiency.
Rough set theory is a mathematical theory developed in recent years that can deal with imprecise, uncertain, and vague information. It has been applied successfully in fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. In this paper, we make a comparative study of the algebra view and the information view of rough set theory. Some inequivalent relationships between these two views in inconsistent decision table systems are discovered. This corrects an error made by many researchers, namely the belief that the algebra view and the information view of rough set theory are equivalent, and it is helpful for developing heuristic knowledge reduction algorithms for inconsistent decision table systems.
As Web technology progresses, XML (Extensible Markup Language) has become a new data exchange format for future Web mining and applications. Various XML middleware systems have been developed for transferring Web data stored in relational databases to XML documents, enabling a uniform data searching technique. However, these systems are not efficient because they use extra memory and database resources, which leads to poor scalability in Web mining development. In this paper, we explore building an efficient XML middleware for large-scale Web mining applications. Our approach is that if the XML structure can be properly embedded in the creation of relational content during XML middleware table construction, the data in the relational database can be retrieved with a minimum amount of memory and database resources. The results of our approach are analyzed and compared with related research.
Intelligent data mining techniques have useful e-Business applications. Because an e-Commerce application involves multiple domains, such as statistical analysis, market competition, price comparison, profit improvement, and personal preferences, this paper presents a hybrid knowledge-based e-Commerce system that fuses intelligent techniques, statistical data mining, and personal information to enhance the QoS (Quality of Service) of e-Commerce. A Web-based e-Commerce application, the eDVD Web Shopping Center, was successfully implemented using Java servlets and an Oracle8i database server. Simulation results show that the hybrid intelligent e-Commerce system is able to make smart decisions for different customers.
We present compact image data structures and associated packet delivery techniques for effective Web caching architectures. Presently, images on a web page are stored inefficiently, one image per file. Our approach uses clustering to merge similar images into a single file in order to exploit the redundancy between images. Our studies indicate that a 30-50% reduction in image data size can be achieved by eliminating redundant color indexes. New metadata attached to this file permits easy extraction of the individual images. This approach permits more efficient use of the cache, since a shorter list of cache references is required. Packet and transmission delays can be reduced by 50% by eliminating redundant TCP/IP headers and connection time. Thus, this paradigm for redundancy elimination may provide valuable benefits for optimizing packet delivery in IP networks by reducing latency and minimizing bandwidth requirements.
One main technical means of fighting spam is to build filters along the email transfer route. However, many junk mail filters do not make use of all the security information in an email, much of which exists in the mail header rather than in the body and attachments. In this paper, data mining based on rough sets is used to design a new anti-spam filter. First, by recording and analyzing the header of every collected email sample, we obtain the raw data. Next, by selecting and computing features from the header data, we build a decision table with several condition attributes and one decision attribute. Then, a rough-set-based data mining technique, consisting mainly of relative reduction and rule generation, is applied to mine this decision table, yielding useful anti-spam knowledge from the email headers. Finally, we tested the resulting rules on different mails. The tests demonstrate that, when mining a malicious email corpus with a specific spam rate, our anti-spam filter has high efficiency and a high identification rate. By mining email headers, we can also find potential security problems of some email systems and the cheating methods of spam senders.
If you have ever used a popular search engine on the Internet to search for a specific topic, you know that most of the results you get back are unrelated or do not have the information you are searching for. Usually you end up looking through many Web pages before you find it. Different search engines give you differently ranked results, so how do you choose which one to use? Buddy solves these problems for you. With Buddy you can search multiple search engines with many different queries. Using topic trees to create in-depth search queries and utilizing the power of many renowned search engines, with the ability to create and delete queries on the fly, Buddy gives you the results you want on the information you are looking for. Using its unique ranking algorithm, the results from multiple search engines are correlated and fused together, removing duplicate document hits. This paper discusses the motivation for and the capabilities of Buddy.
The world is dynamic and ever changing. Databases that are current at one moment can be out of date a minute later. Many times, database updates and the accuracy of the data are secondary concerns. How do we continuously update these databases and associate/fuse new and diverse pieces of data without modifying the database schema? In this paper, we explore updating a database that contains information on various pieces of equipment worldwide, and extending its sources and contents by including the association of multi-source documents.
Over the past decade many techniques have been developed that attempt to predict possible events through the use of given models or patterns of activity. These techniques work quite well when one has a model or a valid representation of activity. In reality, however, the majority of the time this is not the case. Models that do exist were in many cases hand-crafted, required many man-hours to develop, and are very brittle in the dynamic world in which we live. Data mining techniques have shown some promise in providing a set of solutions. In this paper we provide the details of our motivation, theory, and techniques, as well as the results of a set of experiments.
Traditionally, the engineering modeling process is based on first principles, which usually yields large, complex, and detailed models of a dynamic system. As an alternative, classical system identification procedures often produce simple (linear) models that ignore additional domain knowledge. In this context, numerical models, e.g. multi-body models or finite-element simulations based on first principles, are used to predict the system behavior. If only a small number of simulation outputs is needed, massive computational power is wasted on computing grid data that is of no further interest. In this paper a different approach is used: engineering techniques such as dimensional analysis are coupled with knowledge discovery methods, such as neural networks or k-nearest-neighbor search, to predict the dynamic system behavior from only a few characteristic input parameters and the given initial or boundary conditions. The dynamic system is thus modeled as a nonlinear static mapping whose parameters are estimated from experiments as well as from simulations. This static mapping allows very fast prediction times compared with computationally intensive numerical simulations. Additionally, some of the mapping methods allow the calculation of sensitivities, which in turn allow, for example, ranking the inputs according to their contribution to the output parameters. The presented approach to dynamic system analysis is first described in detail, then some of the methods used are described, and the usefulness of the approach is demonstrated on the example of a nonlinear spring-mass-damper system.
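The k-nearest-neighbor variant of such a static mapping can be sketched in a few lines: the prediction for a new set of characteristic input parameters is an inverse-distance-weighted average of the outputs of the k closest stored experiments or simulation runs. This is a generic illustration under our own simplifications, not the paper's specific estimator.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Nonlinear static mapping: predict a scalar output from a few
    characteristic input parameters by inverse-distance-weighted
    averaging over the k nearest stored experiments/simulations."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]              # k nearest training points
    w = 1.0 / (d[idx] + 1e-12)           # closer points weigh more
    return float(np.dot(w, y_train[idx]) / w.sum())
```

Evaluating this mapping costs a single pass over the stored points, which is what makes it orders of magnitude faster than re-running a finite-element simulation for each new parameter set.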
Hypoplastic left heart syndrome (HLHS) affects infants and is uniformly fatal without surgery. Post-surgery mortality rates are highly variable and dependent on postoperative management. The high mortality after the first-stage surgery usually occurs within the first few days after the procedure. Typically, the deaths are attributed to the unstable balance between the pulmonary and systemic circulations. An experienced team of physicians, nurses, and therapists is required to successfully manage the infant. However, even the most experienced teams report significant mortality due to the extremely complex relationships among physiologic parameters in a given patient. A data acquisition system was developed for the simultaneous collection of 73 physiologic, laboratory, and nurse-assessed variables. Data records were created at intervals of 30 seconds, and an expert-validated wellness score was computed for each data record. A training data set consisting of over 5000 data records from multiple patients was collected. Preliminary results demonstrated that the knowledge discovery approach was over 94.57% accurate in predicting the "wellness score" of an infant. The discovered knowledge can improve the care of complex patients through the development of an intelligent simulator that can be used to support decisions.
In this paper, a novel data mining approach to damage detection within large-scale complex structures is proposed. Every structure is defined by a set of finite elements, which also determine the number of target variables. Since large-scale complex structures may have an extremely large number of elements, predicting the failure of every single element using the original set of natural frequencies as features is an exceptionally time-consuming task. Therefore, in order to reduce the time complexity, we propose a hierarchical localized approach that partitions the entire structure into substructures and predicts failures within these substructures. Unlike our previous sub-structuring approach, which is based on physical substructures, here we propose to partition the structure into substructures using a hierarchical clustering algorithm, which also allows localizing the damage in the structure. Finally, when the identified substructure with a failure contains a sufficiently small number of target variables, the extent of the damage in its elements is predicted. A numerical example analysis of an electric transmission tower frame is presented to demonstrate the effectiveness of the proposed method.
The astronomy research community is about to become the
beneficiary of huge multi-terabyte databases from a host of sky
surveys. The rich and diverse information content within this "virtual sky" and the array of results to be derived from it will far exceed the current capacity of data search and research tools. The new digital surveys have the potential to facilitate a wide range of scientific discoveries about the Universe. To enable this, the astronomical community is embarking on an ambitious endeavor: the creation of a National Virtual Observatory (NVO), which will in fact develop into a Global Virtual Observatory. To facilitate the new type of science enabled by the NVO, new techniques in data mining and knowledge discovery in large databases must be developed and deployed, and the next generation of astronomers must be trained in these techniques. This activity will benefit greatly from developments in information technology, computer science, and statistics. Aspects of the NVO initiative, including sample science user scenarios and user requirements, will be presented. The value of scientific data mining and some early test-case results will be discussed in the context of the speaker's research interests in colliding and merging galaxies.
This paper reports characteristics of dissimilarity measures used in multiscale matching. Multiscale matching is a method for comparing two planar curves by partially changing observation scales. Across all scales, it finds the best set of pairs of partial contours that contains no mismatched or over-matched contours and that minimizes the accumulated differences between the partial contours. In order to make this method applicable to the comparison of temporal sequences, we previously proposed a dissimilarity measure that compares subsequences according to the following aspects: rotation angle, length, phase, and gradient. However, it empirically became apparent that it was difficult to tell from the results which aspects really contributed to the resultant dissimilarity of the sequences. In order to investigate the fundamental characteristics of the dissimilarity measure, we performed a quantitative analysis of the induced dissimilarities using a simple sine wave and its variants. The results showed that differences in amplitude, phase, and trend were captured by the terms on rotation angle, phase, and gradient, respectively, although these terms also showed weakness in linearity.
The paradigms of OLAP, multidimensional modeling, and data mining first emerged in the areas of market analysis and finance to address the needs of people working in those areas. Does this mean that they are useful and applicable in these areas only? Or can they also be applied in the more traditional areas of science and engineering? What characterizes the systems for which these paradigms are suitable? What are the goals of these paradigms? How do they relate to the traditional body of knowledge developed over the centuries in mathematics, statistics, systems science, and engineering? Where, how, and to what extent can we leverage the conventional wisdom accumulated in these disciplines to develop a foundational basis for the above paradigms? The goal of this paper is to address these questions at the foundational level. We argue that the paradigms of OLAP, multidimensional modeling, and data mining can also be applied successfully to complex engineering systems, such as membrane-based water/wastewater treatment plants. We develop a mathematically based axiomatic definition of the concepts of 'dimension,' 'dimension level,' 'dimension hierarchy,' and 'measure' using set theory and equivalence relations.
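The set-theoretic view mentioned at the end can be made concrete: a dimension level is the partition induced by an equivalence relation (objects are equivalent when they share an attribute value), and rolling up to a coarser level aggregates a measure over the blocks of the finer partition. The sketch below is our illustration of that idea, not the paper's axiomatization; the count measure and the month-to-quarter mapping are assumed examples.

```python
def dimension_level(objects, attr):
    """A dimension level as the partition induced by an equivalence
    relation: objects are equivalent iff attr gives the same value."""
    blocks = {}
    for o in objects:
        blocks.setdefault(attr(o), set()).add(o)
    return blocks

def roll_up(fine_blocks, coarser):
    """Roll up a finer level to a coarser one, aggregating a count
    measure over the finer blocks each coarse member covers."""
    out = {}
    for key, members in fine_blocks.items():
        out[coarser(key)] = out.get(coarser(key), 0) + len(members)
    return out
```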
The recent advances in mobile communication technologies and their widespread use call for a host of new value-added services for the mobile user. In their current avatar, these devices are not much more than mere communication equipment. Now, consumer-oriented, mobile, internet-connected devices that are location-aware (capable of determining and transmitting their current geographical location) are becoming available everywhere. The availability of internet access and location awareness in portable devices such as cell phones and Personal Digital Assistants opens up a host of new opportunities for services that can capitalize on the location of the user. Besides providing navigational information, additional push information can be sent to the user based on his profile and preferences. The domain is wide and the number of applications is enormous. This paper presents the design and implementation of a basic location-aware service.
The evolution of artificial intelligence systems, driven by the growing complexity of their application domains and by scientific progress, has resulted in a diversification of the methods and algorithms for knowledge representation and use in these systems. For this reason it is often very difficult to design effective methods of knowledge discovery and manipulation for such systems.
In this work the authors offer a method for the unified representation of a system's knowledge about objects of the external world by rank transformation of their descriptions, made in different feature spaces: deterministic, probabilistic, fuzzy, and others. A proof is presented that the information about the rank configuration of the object states in the feature space is sufficient for decision making. It is shown that the geometrical and combinatorial models of the set of rank configurations can be grouped into a system of incidence, which allows the information about them to be stored in a compressed form. A method of describing rank configurations by a DRP code (distance-rank-preserving code) is offered, and its completeness, information capacity, noise immunity, and privacy are reviewed. It is shown that the capacity of a transmission channel for this representation is greater than one, since the code words contain information both about the object states and about the distance ranks between them. An effective data clustering algorithm for object state identification, based on this code, is described. Representing knowledge with rank configurations allows decision-making algorithms to be unified and simplified by performing logical operations on the DRP code words.
Examples of the operation of the proposed clustering technique on given sample sets, the rank configurations of the resulting clusters, and their DRP codes are presented.
The expert system development tool ESDT-HKD is a general-purpose language for knowledge engineering. Its knowledge structure combines rules, frames, and a blackboard, and it is a complete system implemented on a personal computer, providing users with a particular knowledge structure as well as generating and running environments. As a tool for assisting expert system design, ESDT-HKD has many original aspects, such as the citation of explanation frameworks, the open rule description language PBRL, dynamic query of the blackboard, knowledge base maintenance, and two-level task scheduling, which together give the system its complete functionality. This article describes in detail the scheduling and reasoning of ESDT-HKD system tasks during consultation.
The interest in analyzing data has grown tremendously in recent years. To analyze data, a multitude of technologies is needed, namely technologies from the fields of data warehousing, data mining, and On-Line Analytical Processing (OLAP). This paper presents a new data warehouse architecture for CIMS, based on the CRGC-CIMS application engineering project. The data source of this architecture is the database of the CRGC-CIMS system. The data is placed in a global data set by extraction, filtering, and integration, and then translated into the data warehouse according to information requests. We describe two advantages of the new model in the CRGC-CIMS application. In addition, a data warehouse contains many materialized views over the data provided by the distributed heterogeneous databases, for the purpose of efficiently supporting decision making, OLAP queries, or data mining. It is important to select the right views to materialize to answer a given set of queries. In this paper, we therefore also design algorithms for selecting a set of views to be materialized in a data warehouse so as to answer the most queries under a given space constraint. First, we give a cost model for selecting materialized views; then we give algorithms that proceed recursively from bottom to top, with a description and realization of each. Finally, we discuss the advantages and shortcomings of our approach and future work.
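The abstract does not spell out its selection algorithm; a standard greedy scheme for this problem picks, at each step, the view with the best ratio of query cost saved to space consumed until the space budget is exhausted. The sketch below illustrates that scheme under our own simplified cost model (each view has a size and a set of queries it answers outright), which is an assumption, not the paper's model.

```python
def select_views(views, queries, space_limit):
    """Greedy materialized-view selection under a space constraint.
    views:   view -> (size, set of queries it answers)
    queries: query -> cost of answering it from the base data
    Repeatedly picks the fitting view with the best benefit/size ratio."""
    chosen, used, answered = [], 0, set()
    while True:
        best, best_ratio = None, 0.0
        for v, (size, qs) in views.items():
            if v in chosen or used + size > space_limit or size <= 0:
                continue
            benefit = sum(queries[q] for q in qs - answered)
            if benefit / size > best_ratio:
                best, best_ratio = v, benefit / size
        if best is None:
            return chosen
        chosen.append(best)
        used += views[best][0]
        answered |= views[best][1]
```

The greedy ratio heuristic is the usual practical compromise, since exact view selection is intractable for realistic lattices.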
In this paper, we discuss the potential applications of data
mining techniques for the design of Web based information retrieval
support systems (IRSS). In particular, we apply clustering methods
for the granulation of different entities involved in IRSS. Two
types of granulations, single-level and multi-level granulations,
are investigated. Issues of document space granulation, query space
granulation, term space granulation, and retrieval results granulation are studied in detail. It is demonstrated that each different granulation supports a different user task.
This paper addresses some fundamental issues related to
the foundations of data mining. It is argued that there is an urgent
need for formal and mathematical modeling of data mining. A
formal framework provides a solid basis for a systematic study of
many fundamental issues, such as representations and
interpretations of primitive notions of data mining, data mining
algorithms, explanations and applications of data mining results.
A multi-level framework is proposed for modeling data mining
based on results from many related fields. Formal concepts
are adopted as the primitive notion. A concept is jointly defined as a pair consisting of the intension and the extension of the concept,
namely, a formula in a certain language and a subset of the
universe. An object satisfies the formula of a concept if the
object has the properties as specified by the formula, and the
object belongs to the extension of the concept. Rules are used
to describe relationships between concepts. A rule is expressed
in terms of the intensions of the two concepts and is interpreted
in terms of the extensions of the concepts. Several different
types of rules are investigated. The usefulness and meaningfulness
of discovered knowledge are examined using a utility model and
an explanation model.
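The intension/extension pairing described above has a direct computational reading, sketched below with our own illustrative names: a predicate stands in for the formula (intension), the satisfying subset of the universe is the extension, and a rule stated via intensions is interpreted via the extensions.

```python
def extension(universe, formula):
    """Extension of a concept: the objects of the universe that
    satisfy its intension (here a predicate standing in for a formula)."""
    return {x for x in universe if formula(x)}

def rule_confidence(universe, phi, psi):
    """A rule phi => psi is expressed via intensions but interpreted
    via extensions: confidence = |ext(phi) & ext(psi)| / |ext(phi)|."""
    e_phi, e_psi = extension(universe, phi), extension(universe, psi)
    return len(e_phi & e_psi) / len(e_phi)
```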
In this paper, we examine many real-world examples of information granules and construct a granular deductive reasoning system for these domains. Objects are ordered pairs: the first element is an assertion (a logical formula) and the second is the semantic set corresponding to the assertion. The granular language and its model thus involve both logic and set theory: "logic" means that the reasoning obeys the syntax of the logical language, and "set theory" means that the operations on the semantic sets of logical formulas obey set-theoretic methods. The evaluation of truth values and the computation rules for granular formulas are established.