SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomical catalog, provides a set of tools that allows data access for astronomers and for scientific education. One of SkyServer's data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as a basis for more complex ones. This interface has logged over 330 million queries submitted since 2001. Analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc., and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements to the user interfaces and to the user experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned: the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data and, most importantly, it can result in very different queries being clustered together.
Citizen science projects are those which recruit volunteers to participate as assistants in scientific studies. Since these projects depend on volunteer effort, understanding the motivation that drives a volunteer to collaborate is important to ensure their success. One way to understand motivation is by interviewing the volunteers. While this approach may elicit detailed information on the volunteers' motivation and actions, it is restricted to a subset of willing participants. For web-based projects we can instead use logs of volunteers' activities, which record which volunteer did what, and when, for all volunteers in a project. In this work we present some metrics that can be calculated from such logs, based on a model of interaction. We also comment on the applicability of those metrics, describe ongoing work that may yield more precise logs and metrics, and point out issues for further research.
Malware detection may be accomplished through the analysis of its infection behavior. To do so, dynamic analysis systems run malware samples and extract their operating system activities and network traffic. This traffic may represent malware accessing external systems, either to steal sensitive data from victims or to fetch other malicious artifacts (configuration files, additional modules, commands). In this work, we propose the use of visualization as a tool to identify compromised systems, based on representing malware communications as graphs and finding isomorphisms between them. We produced graphs from over 6 thousand distinct network traffic files captured during malware execution and analyzed the relationships among malware samples and IP addresses.
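The graph-based grouping idea can be sketched minimally as follows. Sample names and IP addresses below are hypothetical, and real communication graphs would be extracted from captured network traffic files; this sketch groups samples whose star-shaped communication graphs contact the same set of addresses, a simple special case of graph matching.

```python
from collections import defaultdict

# Hypothetical traffic records: (sample_id, contacted_ip) pairs, as could be
# extracted from captured traffic; names and addresses are illustrative only.
traffic = [
    ("mal_a", "10.0.0.1"), ("mal_a", "10.0.0.2"),
    ("mal_b", "10.0.0.1"), ("mal_b", "10.0.0.2"),
    ("mal_c", "10.0.0.9"),
]

# Build one star-shaped communication graph per sample: sample -> set of IPs.
graphs = defaultdict(set)
for sample, ip in traffic:
    graphs[sample].add(ip)

# Group samples with identical contact sets: such samples have trivially
# isomorphic communication graphs and likely share C&C infrastructure.
clusters = defaultdict(list)
for sample, ips in graphs.items():
    clusters[frozenset(ips)].append(sample)

for ips, samples in clusters.items():
    print(sorted(samples), "->", sorted(ips))
```

Samples landing in the same cluster point at shared infrastructure, which is what the visual analysis of the graphs exposes at scale.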
Citizen science projects are those in which volunteers are asked to collaborate in scientific endeavors, usually by donating idle computer time for distributed data processing efforts or by actively labeling or classifying information. Shapes of galaxies, whale sounds and historical records are all examples of data that users of citizen science projects label or classify by accessing a data collection system.
In order to be successful, a citizen science project must captivate users and keep them interested in the project and in the science behind it, thereby increasing the time users spend collaborating with the project. Understanding the behavior of citizen scientists and their interaction with the data collection systems may help increase user involvement, categorize users according to different parameters, facilitate their collaboration with the systems, design better user interfaces, and allow better planning and deployment of similar projects and systems.
Users' behavior can be actively monitored or derived from their interaction with the data collection systems. Records of the interactions can be analyzed using visualization techniques to identify patterns and outliers. In this paper we present some results on the visualization of more than 80 million interactions of almost 150 thousand users with the Galaxy Zoo I citizen science project. Visualization of the attributes extracted from their behavior was done with a clustering neural network (the Self-Organizing Map) and a selection of icon- and pixel-based techniques. These techniques allow the visual identification of groups of similar behavior in several different ways.
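As a rough illustration of the clustering step, the sketch below trains a tiny Self-Organizing Map on synthetic user-behavior vectors. The feature choice (sessions, classifications, mean session length) and all parameters are assumptions for illustration, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic behavior vectors: (sessions, classifications, mean session length).
users = np.vstack([
    rng.normal([1, 5, 2], 0.2, size=(50, 3)),     # casual-looking users
    rng.normal([30, 80, 40], 2.0, size=(50, 3)),  # heavy-looking users
])

grid_w, grid_h = 4, 4
weights = rng.random((grid_w * grid_h, 3)) * users.max(axis=0)

def bmu(x):
    """Index of the best-matching unit (closest neuron) for vector x."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

# 2-D grid coordinate of each neuron, used for the neighborhood kernel.
coords = np.array([(i % grid_w, i // grid_w) for i in range(grid_w * grid_h)])
for t in range(200):
    lr = 0.5 * (1 - t / 200)               # decaying learning rate
    radius = 2.0 * (1 - t / 200) + 0.5     # decaying neighborhood radius
    x = users[rng.integers(len(users))]
    d = np.linalg.norm(coords - coords[bmu(x)], axis=1)
    h = np.exp(-(d ** 2) / (2 * radius ** 2))[:, None]  # neighborhood weight
    weights += lr * h * (x - weights)      # pull neighborhood toward sample

# Users mapped to the same (or nearby) units behave similarly.
print(bmu(users[0]), bmu(users[-1]))
```

After training, plotting which users land on which map unit gives the kind of behavior grouping the abstract describes; the icon- and pixel-based views are then built on top of those groups.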
Recent technological advances have allowed the creation and use of internet-based systems where many users can collaborate by gathering and sharing information for specific or general purposes: social networks, e-commerce review systems, collaborative knowledge systems, etc. Since most of the data collected in these systems is user-generated, understanding the motivations and general behavior of users is a very important issue.
Of particular interest are citizen science projects, where users without scientific training are asked to collaborate by labeling and classifying information (either automatically, by donating idle computer time, or manually, by actually looking at data and providing information about it). Understanding the behavior of users of these data collection systems may help increase user involvement, categorize users according to different parameters, facilitate their collaboration with the systems, design better user interfaces, and allow better planning and deployment of similar projects and systems.
The behavior of those users can be estimated through analysis of their collaboration track: records of which user did what, and when, can be easily and unobtrusively collected in several different ways, the simplest being a log of activities.
In this paper we present some results on the visualization and characterization of almost 150,000 users and their more than 80,000,000 collaborations with a citizen science project, Galaxy Zoo I, which asked users to classify images of galaxies. Basic visualization techniques are not applicable due to the number of users, so techniques to characterize users' behavior based on feature extraction and clustering are used.
Analysis of user interaction with computer systems can be used for several purposes, the most common being analysis of the effectiveness of the interfaces used for interaction (in order to adapt or enhance their usefulness) and analysis of the intention and behavior of users when interacting with these systems. For web applications, the analysis of user interaction is often done using the web server logs collected for every document sent to the user in response to a request. To capture more detailed data on users' interaction with sites, one can collect the actions the user performs on the client side. An effective approach to this is the USABILICS system, which also allows the definition and analysis of tasks in web applications. The fine granularity of the data collected by USABILICS yields a much more detailed log of users' interaction with a web application. These logs can be converted into graphs where vertices are users' actions and edges are the paths users take to accomplish a task. Graph analysis and visualization tools and techniques allow the analysis of the actions taken in relation to an expected action path, and the characterization of common (and uncommon) paths of interaction with the application. This paper describes how to estimate users' behavior and characterize their intentions during interaction with a web application, presents analysis and visualization tools for those graphs, and shows practical results with an educational site, commenting on the results and on the implications of using these techniques.
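The log-to-graph conversion can be sketched roughly as follows. The action names and log format below are hypothetical, not the actual USABILICS data model: consecutive actions of the same user become directed edges, and edge counts are compared against an evaluator-defined expected path.

```python
from collections import Counter

# Hypothetical fine-grained log: (user, action) in chronological order.
log = [
    ("u1", "home"), ("u1", "search"), ("u1", "results"), ("u1", "download"),
    ("u2", "home"), ("u2", "results"), ("u2", "download"),
]

# Build the interaction graph: vertices are actions, edges are observed
# transitions, weighted by how many times each transition occurred.
edges = Counter()
last = {}
for user, action in log:
    if user in last:
        edges[(last[user], action)] += 1
    last[user] = action

# Expected action path for the task, defined by the evaluator.
expected = [("home", "search"), ("search", "results"), ("results", "download")]
followed = sum(edges[e] for e in expected)
total = sum(edges.values())
print(f"{followed}/{total} transitions on the expected path")
```

Transitions off the expected path (here, `u2` skipping the search step) are exactly the common and uncommon deviations that the graph visualization makes apparent.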
SkyServer is an Internet portal to data from the Sloan Digital Sky Survey, the largest online archive of astronomy data in the world. It provides free access to hundreds of millions of celestial objects for science, education and outreach purposes. Logs of accesses to SkyServer comprise around 930 million hits, 140 million web services accesses and 170 million submitted SQL queries, collected over the past 10 years. These logs also contain indications of compromise attempts on the servers. In this paper, we show some threats that were detected in ten years of stored logs, compare them with known threats in those years, and present an analysis of the evolution of those threats over this period.
Malware spread via the Internet is a major security threat, so studying its behavior is important to identify and classify it. Using SSDT hooking, we can obtain malware behavior by running a sample in a controlled environment and capturing its interactions with the target operating system regarding file, process, registry, network and mutex activities. This generates a chain of events that can be compared with those of other known malware. In this paper we present a simple approach to convert malware behavior into activity graphs and show some visualization techniques that can be used to analyze malware behavior, individually or in groups.
Malicious code (malware) that spreads through the Internet, such as viruses, worms and trojans, is a major threat to information security nowadays and a profitable business for criminals. There are several approaches to analyzing malware by monitoring its actions while it runs in a controlled environment, which helps to identify malicious behaviors. In this article we propose a tool to analyze malware behavior in a non-intrusive and effective way, extending the analysis possibilities to cover malware samples that bypass current approaches and fixing some issues with those approaches.
This work proposes a different approach to the use of turning function space to change shapes in accordance with shape descriptions and consistently with spectral information. The main steps are: (1) segmentation; (2) contour extraction; (3) turning function space transform; (4) classification; (5) shape analysis; and (6) blob enhancement in image space. In the shape analysis step, the boundary is modified based on both image and model, and constraints are imposed on portions of the turning function. Shape modeling can be done by defining criteria such as linearity, angles and sizes. Results on synthetic examples are presented.
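Step (3), the turning function space transform, can be sketched as follows: the turning function of a polygonal contour maps normalized arc length to the cumulative turning angle along the boundary. The polygon used here is a toy example, not data from the paper.

```python
import math

def turning_function(points):
    """Turning function of a closed polygon given as a list of vertices.

    Returns a list of (arc-length fraction, cumulative angle) pairs,
    one per edge, in traversal order.
    """
    n = len(points)
    headings, lengths = [], []
    for i in range(n):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
        headings.append(math.atan2(y1 - y0, x1 - x0))
        lengths.append(math.hypot(x1 - x0, y1 - y0))
    perimeter = sum(lengths)
    theta, s, fn = headings[0], 0.0, []
    for i in range(n):
        fn.append((s / perimeter, theta))
        s += lengths[i]
        if i + 1 < n:
            turn = headings[i + 1] - headings[i]
            # Wrap the turning angle into (-pi, pi].
            turn = math.atan2(math.sin(turn), math.cos(turn))
            theta += turn
    return fn

# Unit square: four equal edges, each turn is a right angle (pi/2).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
for frac, ang in turning_function(square):
    print(f"{frac:.2f} {ang:.2f}")
```

Constraining portions of this piecewise-constant function (e.g. forcing a segment's angle, enforcing linearity) is what lets the boundary be modified consistently in image space.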
Malware has become a major threat in recent years due to the ease of spreading through the Internet. Malware detection has become difficult with the use of compression, polymorphic methods and techniques to detect and disable security software. These and other obfuscation techniques pose a problem for detection and classification schemes that analyze malware behavior. In this paper we propose a distributed architecture to improve malware collection using different honeypot technologies to increase the variety of malware collected. We also present a daemon tool developed to collect malware distributed through spam, and a pre-classification technique that uses antivirus technology to separate malware into generic classes.
As the amount and types of remote network services increase, the analysis of their logs has become a very difficult and time consuming task. There are several ways to filter relevant information and provide a reduced log set for analysis, such as whitelisting and intrusion detection tools, but all of them require extensive fine-tuning work and human expertise. Researchers are currently evaluating data mining approaches for intrusion detection in network logs, using techniques such as genetic algorithms, neural networks, clustering algorithms, etc. Some of those techniques yield good results, yet require a very large number of attributes gathered from network traffic to detect useful information. In this work we apply and evaluate some data mining techniques (K-Nearest Neighbors, Artificial Neural Networks and Decision Trees), using a reduced number of attributes, on log data sets acquired from a real network and a honeypot, in order to classify traffic logs as normal or suspicious. The results obtained allow us to identify unlabeled logs and to describe which attributes were used for the decision. This approach provides a much reduced amount of logs to the network administrator, improving the analysis task and aiding in discovering new kinds of attacks.
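The classification step can be illustrated with a minimal K-Nearest-Neighbors sketch over a reduced, synthetic attribute set; the attribute names and values are assumptions for illustration, not the actual log data.

```python
from collections import Counter
import math

# Synthetic labeled log entries with a reduced attribute set:
# (connections per minute, distinct ports contacted, mean payload bytes).
train = [
    ((2, 1, 120), "normal"), ((3, 2, 200), "normal"), ((1, 1, 80), "normal"),
    ((40, 25, 60), "suspicious"), ((55, 30, 40), "suspicious"),
    ((48, 22, 90), "suspicious"),
]

def knn(x, k=3):
    """Label x by majority vote among its k nearest training entries."""
    nearest = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn((50, 28, 70)))  # port-scan-like entry
print(knn((2, 2, 150)))   # ordinary-looking entry
```

Because the decision is a vote over concrete neighbors, the attributes driving each classification can be reported back to the administrator, which is the explainability property the abstract highlights.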
In this paper we propose a method for the creation of rules for image classification using fuzzy expert systems. The method consists of the analysis of the clusters formed by the application of a biased clustering algorithm to the image pixels. Biased clustering algorithms are partially supervised classification algorithms which allow the use of imprecise, incomplete or conflicting expectations of assignment of data points to classes, and which, by iterative clustering, attempt to solve the conflicts and incompleteness and obtain labeled clusters. The resulting clusters can be used to create new rules or membership functions, which can lead to more and/or better rules for classification of the data using a fuzzy expert system. The new rules and membership functions can also be compared with the ones used to create the original expectations of assignment for validation. Examples of application of the proposed method to synthetic and image data are presented. The classification results are evaluated and compared, conclusions on the problems, advantages and overall features of the proposed method are presented, and future work directions are considered.
Clustering algorithms are often used as unsupervised classifiers when minimal information about the classification problem is available. Clustering will usually assign a unique label, corresponding to a class, to each of the data points. In most implementations of clustering algorithms, those labels are just a class index and do not convey information about which class it is. Identification of the classes corresponding to the formed clusters can be done with heuristics or using information from points in the clusters with known classes. This paper describes a hybrid clustering approach based on a biased fuzzy C-means algorithm. Bias values, corresponding to the expectation that a data point will be assigned to a class, are derived from simple image processing operations and included as weighting factors in the clustering algorithm. The final labels for the data retain the order imposed by the biases and can therefore be used to identify the classes for the clusters. The basic fuzzy C-means algorithm and the modifications for the use of biases are presented. Results for both synthetic and imagery data classification with the method are presented and compared with the non-biased clustering results. The results obtained with the biased method are qualitatively superior to those of the non-biased method when conservative biases are used for the classes, and the method can be applied when it is difficult or impractical to use a completely supervised method.
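A minimal sketch of the biased fuzzy C-means idea, assuming a simple multiplicative bias on the membership update; the exact weighting scheme in the paper may differ, and the data is synthetic. The biases break the arbitrary ordering of the clusters, so the final cluster indices match the intended classes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated synthetic 2-D clusters (stand-ins for pixel features).
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

c, m = 2, 2.0                      # number of clusters, fuzzifier
# bias[i, j]: expectation that point i belongs to class j, as could come
# from simple image processing; conservative (mild) preferences only.
bias = np.ones((len(X), c))
bias[:30, 0] = 2.0                 # first half expected in class 0
bias[30:, 1] = 2.0                 # second half expected in class 1

U = rng.random((len(X), c))
U /= U.sum(axis=1, keepdims=True)  # fuzzy memberships sum to 1 per point
for _ in range(50):
    Um = U ** m
    centers = (Um.T @ X) / Um.sum(axis=0)[:, None]      # weighted means
    d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
    U = 1.0 / (d ** (2 / (m - 1)))  # standard FCM membership (unnormalized)
    U *= bias                       # bias memberships toward the priors
    U /= U.sum(axis=1, keepdims=True)

labels = U.argmax(axis=1)
print(labels[:5], labels[-5:])
```

Without the bias term, which cluster gets index 0 depends on the random initialization; with it, cluster 0 is the class the biases favored, which is what makes the labels directly interpretable.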
Map images are complex documents generated from several layers of information overlapped and printed on paper, and usually the only available information is the digitized image of the map. Recovering the original layers of the map, so that its components could be analyzed independently, would be useful but requires several steps. A first step could separate the image into conceptual layers using basic spectral and spatial properties, giving layers corresponding to basic features in the map image, which would serve as input for more sophisticated algorithms that could produce more detailed information, and so on, until a complete high-level description of the map information is obtained. Extraction of the conceptual map layers is often a complex task, since the pixels that correspond to the categories in a map image are spectrally and spatially mixed with the pixels of other classes. This paper presents the selective attention filter (SAF), which is able to filter out pixels that are not relevant to the information being extracted or to enhance pixels of categories of interest. The SAF is robust in the presence of noise, and results of classification with images filtered with it are quantitatively better than results obtained with other commonly used filters.
A very common task in image processing is the segmentation of an image into areas that are uniform with respect to their features. Various applications can benefit even from partial segmentation, which is performed without the need of physical or semantic knowledge. Several segmentation methods exist, but none is applicable to all tasks. We use color and perceptual texture information to segment color images. Perceptual texture features are features that can be qualified in simple descriptions by humans. Color information is represented in a perceptual way, using hue, value and saturation. These feature values are represented by histograms that integrate texture information around a small area. Segmentation and classification are obtained by comparing the histograms of the classes with the histogram of the area around the pixel being classified. We built a small application that uses remote sensing imagery and allows a user to interactively segment a Landsat TM-5 image using color and texture information. The steps and intermediate results of the classification are shown. The results are visually good, and the segmentation using color and texture information is more coherent than that using color alone.
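The histogram comparison step can be sketched as follows, using a synthetic single-channel image in place of HSV Landsat data; the class names, window size and distance measure are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def hist(values, bins=8):
    """Normalized histogram of feature values in [0, 1]."""
    h, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    return h / h.sum()

# Synthetic one-feature "image" (e.g. a hue channel): dark left half,
# bright right half.
img = np.hstack([rng.uniform(0.0, 0.3, (16, 16)),
                 rng.uniform(0.7, 1.0, (16, 16))])

# Each class is described by a reference histogram built from samples.
classes = {"water": hist(rng.uniform(0.0, 0.3, 500)),
           "soil": hist(rng.uniform(0.7, 1.0, 500))}

def classify(y, x, win=3):
    """Assign the pixel at (y, x) to the class whose reference histogram
    is closest (L1 distance) to the histogram of the window around it."""
    window = img[max(0, y - win):y + win + 1, max(0, x - win):x + win + 1]
    h = hist(window.ravel())
    return min(classes, key=lambda c: np.abs(classes[c] - h).sum())

print(classify(8, 4), classify(8, 28))
```

Because each pixel is judged by the distribution over its neighborhood rather than by its own value alone, the comparison naturally integrates local texture, which is why the combined color-and-texture segmentation is more coherent than a per-pixel color rule.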