We are building a cognitive vision system for mobile robots that works in a manner similar to the human vision
system, using saccadic, vergence and pursuit movements to extract information from visual input. At each fixation,
the system builds a 3D model of a small region, combining information about distance, shape, texture and motion to
create a local dynamic spatial model. These local 3D models are composed to create an overall 3D model of the
robot and its environment. This approach turns the computer vision problem into a search problem whose goal is the
acquisition of sufficient spatial understanding for the robot to succeed at its tasks.
The research hypothesis of this work is that the movements of the robot’s cameras are only those that are necessary
to build a sufficiently accurate world model for the robot’s current goals. For example, if the goal is to navigate
through a room, the model needs to contain any obstacles that would be encountered, giving their approximate
positions and sizes. Other information does not need to be rendered into the virtual world, so this approach trades
model accuracy for speed.
In previous work, we have shown how a 3D model can be built in real time and synchronized with the environment.
This world model permits a robot to predict dynamics in its environment and classify behaviors. In this paper
we evaluate the effect of such a 3D model on the accuracy and speed of various computer vision algorithms,
including tracking, optical flow and stereo disparity. We report results based on the KITTI database and on our own data.
We describe a cognitive vision system for a mobile robot. This system works in a manner similar to the human vision
system, using saccadic, vergence and pursuit movements to extract information from visual input. At each fixation,
the system builds a 3D model of a small region, combining information about distance, shape, texture and motion.
These 3D models are embedded within an overall 3D model of the robot's environment. This approach turns the
computer vision problem into a search problem, with the goal of constructing a physically realistic model of the entire environment.
At each step, the vision system selects a point in the visual input to focus on. The distance, shape, texture and motion
information are computed in a small region and used to build a mesh in a 3D virtual world. Background knowledge is
used to extend this structure as appropriate, e.g. if a patch of wall is seen, it is hypothesized to be part of a large wall
and the entire wall is created in the virtual world, or if part of an object is recognized, the whole object's mesh is
retrieved from the library of objects and placed into the virtual world. The input from the real camera is then compared with the input from the virtual camera using local Gaussians, creating an error mask that indicates the main differences between them. This mask is then used to select the next points to focus on.
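To make the fixation loop above concrete, the following minimal sketch (Python with OpenCV and NumPy) shows how such an error mask could drive the choice of the next fixation point. The use of Gaussian smoothing as the "local Gaussians", the threshold value and the function names are illustrative assumptions, not the system's actual implementation.

import cv2
import numpy as np

def error_mask(real_frame, virtual_frame, ksize=15, sigma=5.0):
    """Compare locally smoothed real and virtual-camera frames; large values
    mark regions where the virtual world disagrees with the camera input."""
    real_s = cv2.GaussianBlur(real_frame.astype(np.float32), (ksize, ksize), sigma)
    virt_s = cv2.GaussianBlur(virtual_frame.astype(np.float32), (ksize, ksize), sigma)
    return np.abs(real_s - virt_s)

def next_fixation(real_frame, virtual_frame, threshold=20.0):
    """Return the pixel with the largest local disagreement, or None if the
    virtual world already explains the input well enough."""
    mask = error_mask(real_frame, virtual_frame)
    y, x = np.unravel_index(np.argmax(mask), mask.shape)
    return (x, y) if mask[y, x] > threshold else None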
This approach permits us to use computationally expensive algorithms on small localities, thus generating very accurate models. It is also task-oriented, permitting the robot to use its knowledge about its task and goals to decide which parts of the environment need to be examined.
The software components of this architecture include PhysX for the 3D virtual world, OpenCV and the Point Cloud
Library for visual processing, and the Soar cognitive architecture, which controls the perceptual processing and robot
planning. The hardware is a custom-built pan-tilt stereo color camera.
We describe experiments using both static and moving objects.
We consider a scenario in which an autonomous platform searching or traversing a building may observe unstable masonry or may need to travel over unstable rubble. A purely behaviour-based system may handle these challenges but produce behaviour that works against long-term goals such as reaching a victim as quickly as possible. We extend our
work on ADAPT, a cognitive robotics architecture that incorporates 3D simulation and image fusion, to allow the robot
to predict the behaviour of physical phenomena, such as falling masonry, and take actions consonant with long-term
goals. We experimentally evaluate cognitive-only and reactive-only approaches to traversing a building filled with varying numbers of challenges and compare their performance. The reactive-only approach succeeds only 38% of the time, while the cognitive-only approach succeeds 100% of the time. While the cognitive-only approach produces very impressive behaviour, our results indicate how much better the combination of cognitive and behaviour-based control can be.
We are building a robot cognitive architecture that constructs a real-time virtual copy of itself and its environment,
including people, and uses the model to process perceptual information and to plan its movements. This paper describes
the structure of this architecture.
The software components of this architecture include PhysX for the virtual world, OpenCV and the Point Cloud Library
for visual processing, and the Soar cognitive architecture that controls the perceptual processing and task planning. The
RS (Robot Schemas) language is implemented in Soar, providing the ability to reason about concurrency and time. This
Soar/RS component controls visual processing, deciding which objects and dynamics to render into PhysX, and the
degree of detail required for the task.
As the robot runs, its virtual model diverges from physical reality, and errors grow. The Match-Mediated Difference
component monitors these errors by comparing the visual data with corresponding data from virtual cameras, and
notifies Soar/RS of significant differences, e.g. a new object that appears, or an object that changes direction.
Soar/RS can then run PhysX much faster than real-time and search among possible future world paths to plan the robot's
actions. We report experimental results in indoor environments.
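As a rough illustration of searching among possible future world paths, the sketch below rolls out candidate action sequences against a generic physics-simulation interface. The snapshot/restore/step interface, the cost function and the plan representation are assumptions made for exposition; the actual system drives PhysX from Soar/RS rather than this toy loop.

def best_action_sequence(sim, state, candidate_plans, horizon, cost_fn):
    """Roll out each candidate plan from the same snapshot of the virtual
    world and return the plan with the lowest predicted cost."""
    best_plan, best_cost = None, float("inf")
    snapshot = sim.snapshot(state)
    for plan in candidate_plans:
        sim.restore(snapshot)
        for t in range(horizon):
            sim.apply(plan[t % len(plan)])
            sim.step()  # one simulated timestep, run much faster than real time
        cost = cost_fn(sim.current_state())
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan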
An important component of cognitive robotics is the ability to mentally simulate physical processes and to
compare the expected results with the information reported by a robot's sensors. In previous work, we have proposed an
approach that integrates a 3D game-engine simulation into the robot control architecture. A key part of that architecture
is the Match-Mediated Difference (MMD) operation, an approach to fusing sensory data and synthetic predictions at the
image level. The MMD operation insists that the real and predicted scenes are similar in terms of the appearance of
the objects in the scene. This is an overly restrictive constraint on the simulation since parts of the predicted scene may
not have been previously viewed by the robot.
In this paper we propose an extended MMD operation that relaxes the constraint and allows the real and
synthetic scenes to differ in some features but not in (selected) other features. We develop image difference operations that allow a real image to be compared with a synthetic image generated from an arbitrarily colored graphical model of a scene. Scenes with the same content show a zero difference, and for scenes with varying foreground objects the comparison can be controlled to focus on the color, size and shape of the foreground.
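A minimal sketch of such a feature-selective comparison is given below (Python with OpenCV), comparing only the size and shape of the foreground while ignoring its color. The crude thresholding used for foreground extraction and the tolerance values are assumptions made for illustration, not the extended MMD operation itself.

import cv2
import numpy as np

def foreground_mask(image_bgr, thresh=30):
    """Crude foreground extraction: assume a dark background and threshold
    intensity. The real system's segmentation is not specified here."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return mask

def selective_difference(real_bgr, synthetic_bgr, shape_tol=0.1, size_tol=0.15):
    """Compare only the size and shape of the foreground, ignoring its color,
    so an arbitrarily colored graphical model can still match the real scene."""
    real_mask = foreground_mask(real_bgr)
    synt_mask = foreground_mask(synthetic_bgr)
    shape_diff = cv2.matchShapes(real_mask, synt_mask, cv2.CONTOURS_MATCH_I1, 0.0)
    real_area, synt_area = cv2.countNonZero(real_mask), cv2.countNonZero(synt_mask)
    size_diff = abs(real_area - synt_area) / max(real_area, synt_area, 1)
    return shape_diff < shape_tol and size_diff < size_tol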
This paper describes our work on integrating distributed, concurrent control in a cognitive architecture, and using it to
classify perceived behaviors. We are implementing the Robot Schemas (RS) language in Soar. RS is a CSP-type
programming language for robotics that controls a hierarchy of concurrently executing schemas. The behavior of every
RS schema is defined using port automata. This provides precision to the semantics and also a constructive means of
reasoning about the behavior and meaning of schemas. Our implementation uses Soar operators to build, instantiate and
connect port automata as needed. Our approach is to use comprehension through generation (similar to NLSoar) to
search for ways to construct port automata that model perceived behaviors. The generality of RS permits us to model
dynamic, concurrent behaviors. A virtual world (Ogre) is used to test the accuracy of these automata. Soar's chunking
mechanism is used to generalize and save these automata. In this way, the robot learns to recognize new behaviors.
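The following sketch suggests, in Python, what a port automaton of the kind used to define RS schemas might look like: named input and output ports, a state, and a transition function stepped once per cycle. The class layout and the example schema are illustrative assumptions; in the actual system these structures are built, instantiated and connected by Soar operators.

class PortAutomaton:
    def __init__(self, initial_state, transition):
        self.state = initial_state
        self.transition = transition  # (state, inputs) -> (state, outputs)
        self.in_ports, self.out_ports = {}, {}

    def write(self, port, value):
        self.in_ports[port] = value

    def step(self):
        self.state, self.out_ports = self.transition(self.state, dict(self.in_ports))
        return self.out_ports

# Hypothetical example: a schema that outputs "approach" while a tracked object is far away.
def approach_transition(state, inputs):
    distance = inputs.get("distance", float("inf"))
    return ("done" if distance < 0.5 else "active",
            {"command": "stop" if distance < 0.5 else "approach"})

schema = PortAutomaton("active", approach_transition)
schema.write("distance", 2.0)
print(schema.step())  # {'command': 'approach'}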
A mobile robot moving in an environment in which there are other moving objects and active agents, some of which may represent threats and some of which may represent collaborators, needs to be able to reason about the potential future behaviors of those objects and agents. In previous work, we presented an approach to tracking targets with complex behavior, leveraging a 3D simulation engine to generate predicted imagery and comparing that against real imagery. We introduced an approach to compare real and simulated imagery using an affine image transformation that maps the real scene to the synthetic scene in a robust fashion. In this paper, we present an approach to continually synchronize the real and synthetic video by mapping the affine transformation yielded by the real/synthetic image comparison to a new pose for the synthetic camera. We show a series of results for pairs of real and synthetic scenes containing objects, including both similar and differing scenes.
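One plausible way to map an estimated affine transformation to a pose correction for the synthetic camera is sketched below, under small-angle pinhole assumptions. The decomposition and the focal-length parameters are our assumptions for illustration rather than the paper's exact formulation.

import numpy as np

def pose_correction_from_affine(A, fx, fy):
    """Given a 2x3 similarity/affine transform A mapping the real image onto
    the synthetic image, derive small pan/tilt/roll corrections for the
    synthetic camera."""
    roll = np.arctan2(A[1, 0], A[0, 0])  # in-plane rotation component
    tx, ty = A[0, 2], A[1, 2]            # pixel translation component
    pan = np.arctan2(tx, fx)             # horizontal shift -> pan angle
    tilt = np.arctan2(ty, fy)            # vertical shift   -> tilt angle
    return pan, tilt, roll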
A mobile robot moving in an environment in which there are other moving objects and active agents, some of which
may represent threats and some of which may represent collaborators, needs to be able to reason about the potential
future behaviors of those objects and agents. In this paper we present an approach to tracking targets with complex
behavior, leveraging a 3D simulation engine to generate predicted imagery and comparing that against real imagery. We
introduce an approach to compare real and simulated imagery and present results using this approach to locate and track
objects with complex behaviors.
In this approach, the salient points in the real and synthetic images are identified and an affine image transformation that maps
the real scene to the synthetic scene is generated. An image difference operation is developed that ensures that the
matched points in both images produce a zero difference. In this way, synchronization differences are reduced and
content differences enhanced. A number of image pairs are processed and presented to illustrate the approach.
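A minimal sketch of this salient-point registration and differencing step is given below, using OpenCV's ORB features, RANSAC affine estimation and image differencing; the specific detector, matcher and parameter choices are assumptions and may differ from those used in the paper.

import cv2
import numpy as np

def registered_difference(real_gray, synthetic_gray):
    """Detect salient points in both images, estimate an affine transform that
    maps the real scene onto the synthetic one, and difference the registered
    images so that matched structure cancels and content differences remain."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(real_gray, None)
    kp2, des2 = orb.detectAndCompute(synthetic_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    h, w = synthetic_gray.shape
    warped_real = cv2.warpAffine(real_gray, A, (w, h))
    return cv2.absdiff(warped_real, synthetic_gray)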
The ADAPT project is a collaboration of researchers in robotics, linguistics and artificial intelligence at three universities to create a cognitive architecture specifically designed to be embodied in a mobile robot. There are major respects in which existing cognitive architectures are inadequate for robot cognition. In particular, they lack support for true concurrency and for active perception. ADAPT addresses these deficiencies by modeling the world as a network of concurrent schemas, and modeling perception as problem solving. Schemas are represented using the RS (Robot Schemas) language, and are activated by spreading activation. RS provides a powerful language for distributed control of concurrent processes. Also, the formal semantics of RS provides the basis for the semantics of ADAPT's use of natural language. We have implemented the RS language in Soar, a mature cognitive architecture originally developed at CMU and used at a number of universities and companies. Soar's subgoaling and learning capabilities enable ADAPT to manage the complexity of its environment and to learn new schemas from experience. We describe the issues faced in developing an embodied cognitive architecture, and our implementation choices.
We have designed and implemented a fast predictive vision system for a mobile robot based on the principles of active vision. This vision system is part of a larger project to design a comprehensive cognitive architecture for mobile robotics. The vision system represents the robot's environment with a dynamic 3D world model based on a 3D gaming platform (Ogre3D). This world model contains a virtual copy of the robot and its environment, and outputs graphics showing what the virtual robot "sees" in the virtual world; this is what the real robot expects to see in the real world. The vision system compares this output in real time with the visual data. Any large discrepancies are flagged and sent to the robot's cognitive system, which constructs a plan for focusing on the discrepancies and resolving them, e.g. by updating the position of an object or by recognizing a new object. An object is recognized only once; thereafter its observed data are monitored for consistency with the predictions, greatly reducing the cost of scene understanding. We describe the implementation of this vision system and how the robot uses it to locate and avoid obstacles.
VMSoar is a cognitive network security agent designed for both network configuration and long-term security management. It performs automatic vulnerability assessments by exploring a configuration's weaknesses and also performs network intrusion detection. VMSoar is built on the Soar cognitive architecture, and benefits from the general cognitive abilities of Soar, including learning from experience, the ability to solve a wide range of complex problems, and use of natural language to interact with humans. The approach used by VMSoar is very different from that taken by other vulnerability assessment or intrusion detection systems. VMSoar performs vulnerability assessments by using VMWare to create a virtual copy of the target machine and then attacking the simulated machine with a wide assortment of exploits. VMSoar uses this same ability to perform intrusion detection. When trying to understand a sequence of network packets, VMSoar uses VMWare to make a virtual copy of the local portion of the network and then attempts to generate the observed packets on the simulated network by performing various exploits. This approach is initially slow, but VMSoar's learning ability significantly speeds up both vulnerability assessment and intrusion detection. This paper describes the design and implementation of VMSoar, and initial experiments with Windows NT and XP.
Semantic Encoding is a new, patented technology that greatly increases the speed of transmission of distributed databases over networks, especially over ad hoc wireless networks, while providing a novel method of data security. It reduces bandwidth consumption and storage requirements, while speeding up query processing, encryption and computation of digital signatures. We describe the application of Semantic Encoding in a wireless setting and provide an example of its operation in which a compression of 290:1 would be achieved.