We introduce our device and the steps taken to build it: we present what already exists, describe our approach through specific components such as real-time distortion correction and visual odometry, and discuss the performance and limitations of our system.
First, we discuss the state of the art in virtual and augmented reality devices and how our approach differs from what is currently available to consumers. After presenting our architecture and early results, we elaborate on technical aspects such as the device's inside-out tracking system and our distortion methods. In a final section, we present the results of performance tests and consider the development of our device for industrial use cases, and how our large-field-of-view, AR-capable device can enable new tools and experiences.
Several components, including our SDK, will be released as open source, so that anyone can contribute to the system or develop their own based on our vision.
Many augmented reality devices take the optical see-through approach, which has its drawbacks, such as the inability to reproduce “true black” and a very narrow field of view. We believed there was room for a video see-through device, with its own pros and cons, for specific use cases.
We also believe our approach to an augmented reality device can have an impact on many sectors. Our video see-through method has raised interest in fields such as medicine, defense, training and education, and remote operations. We are convinced that this device can bring more traceability, security, and efficiency to these processes, as well as make an impact in the cultural and educational sectors.
STATE OF THE ART
As of today, there are two kinds of headsets: those for virtual reality and those for augmented reality. The best VR headsets consist of a dual screen mounted behind Fresnel lenses, plus an external tracking system. Some systems, like Oculus and Google Daydream, bypass the external tracking device by using inside-out tracking (odometry), usually performed visually with cameras. The human-machine interface in these systems generally relies on controllers held by the user and tracked by an external sensor as well.
AR systems, on the other hand, take the holographic display approach: points of light projected onto special glass. HoloLens is still the state-of-the-art device in this area. It is fully portable, performs accurate SLAM*, and can detect some user gestures, such as a pinch. Two major downsides are observed with this device: it cannot create occlusion or ”true black”, and its field of view (30°) is considered too narrow for immersive AR experiences. In addition, interaction between the user and virtual objects is limited to a few gestures.
We call the HoloLens display an optical see-through device: the user has a direct view of his physical surroundings. Our approach differs in this respect; we call it a video see-through device.
Our device has two main components, the headset and the computing unit. See Fig. 1 for the details.
The computing unit includes an Intel i7 mobile processor and a Jetson TX2 unit. The Jetson TX2 features 256 CUDA cores which allows us to perform our SLAM algorithm in real-time.
The headset is a heavily modified VR headset featuring a 110° field of view, with a Leap Motion controller and two cameras mounted on the front panel. We used a 3D printer to customize the front panel for our prototypes. The cameras are centered, 66 millimeters apart, to match the average interpupillary distance.
Our system needs to manage two kinds of distortion: the lens distortion of the cameras, and that of the Fresnel lenses located in the headset. These distortions have to be corrected in real time to sustain a rendering rate of 60 frames per second and preserve the user's comfort.
The Leap Motion device mounted on the headset comes with built-in software distortion correction. The only parameter to adjust was its software position relative to the cameras, so that the virtual and physical hands superimpose correctly on the display.
The camera distortion is handled by a polynomial model. Inspired by the Brown-Conrady model2 for radial and tangential distortions, our approach is to calibrate the two cameras and then apply the obtained coefficients in a fragment shader on both rendered textures to match the distortion. The fragment shader displaces the pixels on the texture with the exact equations of the Brown-Conrady model; for each pixel of coordinates (x, y), with r² = x² + y²:

x_d = x (1 + k1 r² + k2 r⁴ + k3 r⁶) + 2 p1 x y + p2 (r² + 2x²)
y_d = y (1 + k1 r² + k2 r⁴ + k3 r⁶) + p1 (r² + 2y²) + 2 p2 x y

with k1, k2, k3 being the coefficients of a simple radial distortion and p1, p2 the tangential distortion coefficients of a camera following the pinhole model.
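The same mapping can be sketched in plain Python (an illustrative sketch only: in our system this runs per pixel in the fragment shader, and the coefficient values shown below are hypothetical, not taken from our calibration):

```python
def brown_conrady(x, y, k1, k2, k3, p1, p2):
    """Apply Brown-Conrady radial + tangential distortion to a
    normalized image coordinate (x, y)."""
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_d, y_d

# Hypothetical calibration coefficients, for illustration only.
xd, yd = brown_conrady(0.5, 0.25, k1=-0.1, k2=0.01, k3=0.0,
                       p1=0.001, p2=-0.002)
```

With all coefficients at zero the mapping is the identity, which is a convenient sanity check on a calibration pipeline.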
The distortion applied to the final image projected in the headset uses the distortion-mesh approach. During the final render pass, the texture coordinates of each point are adjusted so that each visible pixel is read from its corresponding location. This undistortion can be done either in the vertex shader, by producing a dense mesh with adjusted texture coordinates for each color, or in the pixel shader, by applying a function to the texture coordinates or by using a texture map that provides, for each color, the new texture coordinate mapping to the proper location on the screen. This allows us to handle the distortion of the Fresnel lenses as well as any distortion coming from the screen mounted in the headset.
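The texture-map variant can be illustrated with a pure-Python lookup table (a minimal sketch under simplifying assumptions: nearest-neighbor sampling, a single color channel, and a made-up undistortion function standing in for the measured inverse of the Fresnel distortion):

```python
def build_lut(width, height, undistort):
    """Precompute, for each output pixel, the (integer) source texel
    to sample: the texture-map variant of the distortion mesh."""
    lut = []
    for j in range(height):
        row = []
        for i in range(width):
            u, v = undistort(i, j, width, height)
            # Clamp to the texture bounds.
            row.append((min(max(u, 0), width - 1),
                        min(max(v, 0), height - 1)))
        lut.append(row)
    return lut

def remap(texture, lut):
    """Render pass: read each visible pixel from its remapped location."""
    return [[texture[v][u] for (u, v) in row] for row in lut]

# Identity "undistortion" for illustration; a real one would invert
# the distortion measured for the headset's lenses.
identity = lambda i, j, w, h: (i, j)
```

Precomputing the table once per calibration keeps the per-frame cost to a single indexed read per pixel, which is what makes the approach compatible with a 60 frames-per-second budget.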
To achieve the best possible immersion between the real and virtual worlds, we need to know, in real time, the translation and orientation of the device in an unknown environment. Our SLAM system is inspired by the ORB-SLAM2 approach.3 Three threads running on the Jetson TX2 are responsible for tracking features with ORB§4 on the GPU¶, mapping with keypoints, and loop closing, as seen in Fig. 4. We modified ORB-SLAM2 to run some parts of the model on the CUDA cores of the Jetson TX2 to achieve visual tracking at 30 Hz or more. Some optimizations were also inspired by forks of the ORB-SLAM2 codebase, such as MultiCol-SLAM.5
This visual SLAM outputs a sparse reconstruction as a point cloud, together with the incremental rotation (as a quaternion) and translation of the headset. We feed only the translation back into Unity3D, because rotation is already handled by the Inertial Measurement Unit in the headset, which runs at 200 Hz with great accuracy (< 1° of precision on the 3 axes). The IMU performs a fusion algorithm that directly outputs a rotation vector in world coordinates. In an upcoming paper, we will present how our system also computes a dense 3D reconstruction through stereo computation, allowing even more immersive AR possibilities.
Our algorithm is implemented as a ROS6ǁ module and is highly configurable, as it depends heavily on the calibration parameters of the cameras. ROS is a collection of tools, libraries, and conventions that aim to simplify the task of creating complex and robust robot behaviour across a wide variety of robotic platforms. In our case, we use it to build software components that communicate between the Windows-based CPU and the Jetson TX2 board. It communicates with Unity3D over WebSockets.
One very interesting feature of using a modified ORB-SLAM2 algorithm is relocalization: we can save a map for later reuse, allowing reliable and precise tracking in known, already-visited environments.
Hand tracking and alignment
Tracking the hands of the user in our system is a great benefit for the creation of more natural interfaces in VR and AR. We mounted a Leap Motion sensor on the front panel of the headset to allow this.
One interesting challenge was to align the virtual and physical hands (from the stereo cameras) on screen, to create immersive AR experiences and credible interfaces on top of the user's hands. In some cases, there is no need for the user to see virtual hands superimposed on his own. The Leap Motion program performs its own distortion correction, and thanks to our method for rendering a realistic view of the physical surroundings, described in the distortion section (4.2), the virtual and physical hands are visually aligned, even if both are not necessarily at the same distance from the virtual camera in the 3D scene. Minor modifications of the SDK provided by Leap Motion were necessary to position the virtual hands correctly in the Unity3D editor.
To enable other developers to create applications on-top of our system, we created a simple scene with the Unity3D game engine. The scene features a single prefab handling the device and the Leap Motion, allowing the engineers to focus on building the experience instead.
The bridge between our device and Unity3D is mainly handled by the OSVR plugin. It creates an SDL** window handling a stereo view rendered in a single pass, benefiting from all the performance optimizations offered by the Unity3D engine.
Developers can work in C# within Unity3D to build VR and AR scenarios. They can also develop ROS modules running on the Jetson TX2 unit and communicate with the Unity3D scene over WebSockets, via Ethernet over USB. The bandwidth between the CPU board and the Jetson unit is sufficient to transfer stereo images, point clouds, and coordinates.
Interacting and developing with our system requires only basic C# knowledge, and optionally C++ to develop ROS modules that use the GPU cores for tasks that do not need to run in real time (e.g., creating a mesh from a point cloud). Developers can use libraries such as OpenCV to compute custom operations and tracking functions for specific use cases.
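As an illustration of the kind of message that can cross the WebSocket bridge between a ROS module and the Unity3D scene, here is a minimal JSON pose payload (the schema and field names are hypothetical, shown for illustration only, not our actual protocol):

```python
import json

def encode_pose(translation, point_count):
    """Serialize a SLAM pose update as a compact JSON payload, ready
    to be pushed over a WebSocket to the Unity3D client."""
    return json.dumps({
        "type": "pose",
        "t": list(translation),   # translation in meters
        "points": point_count,    # size of the sparse map
    })

def decode_pose(payload):
    """Unity-side counterpart: parse the payload back into a dict."""
    msg = json.loads(payload)
    assert msg["type"] == "pose"
    return msg
```

A text-based schema like this is easy to debug and language-neutral between the C++ ROS nodes and the C# Unity3D scripts; bulkier data such as stereo images would use a binary channel instead.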
The device’s performance has to be measured on different aspects: power consumption, rendered frames per second, and video latency as well as location accuracy. Here we present the early results of our measurements.
Regarding the accuracy of the SLAM system, we tested our modified ORB-SLAM2 algorithm against both the KITTI7 and EuRoC8 datasets. Fig. 3 shows that our results are similar in accuracy to the original ORB-SLAM2.
Our minimum target was 60 frames per second: below that rate in augmented reality, many users start to feel discomfort and can perceive latency on screen. We managed to render 60 frames per second with the whole system running, meaning the Unity3D application and our ROS nodes connected to the GPU instance performing SLAM. Our GPU SLAM runs on average 5 times faster than the initial ORB-SLAM2 code running only on the CPU: each frame is processed in 25 to 30 milliseconds, compared to an average of 160 milliseconds on the same hardware with the original code. With this speedup, we can use the ORB-SLAM2 method for real-time processing.
We managed to reduce computation by providing already-rectified images to the GPU, thanks to the fragment shader described in the distortion section (4.2): the GPU does not need to undistort the images, as the epipolar lines of the stereo images are already aligned on the CPU side of our architecture. However, the framerate tends to drop when the Unity3D scene becomes more complex. As in other VR systems, the 3D scene must stay simple, use optimizations such as dynamic occlusion, and minimize compute-heavy tasks like lighting and shadows, because rendering is handled only by a CPU in our current design.
In our observations, running our system on an Intel i7-7600U CPU linked to the Jetson TX2 module, we achieved 60 frames per second rendering on simple scenes, with visual tracking running between 30 and 40 Hz on average. These results make us confident that a fully autonomous, battery-powered system is achievable.
Discussion and limitations
Our system enables a new range of AR applications in simulators, industry, remote assistance, and many other professions. We are confident in bringing new interfaces for a new kind of computer usage, with occlusion and precise hand tracking. However, we are aware of the current limitations. On the hardware side, the screen resolution (currently 2880x1600) does not provide the same level of detail as the human eye: we achieve 25 pixels per horizontal degree, far from the roughly 60 pixels per degree resolved by the human eye. Flat screens are not practical either for a 200° field of view in the current form factor.
On the software side, many optimizations must be performed to handle more complex 3D virtual scenes, and progress must be made on visual and inertial odometry. The main limitation of our current architecture is the lack of a dedicated GPU for 3D rendering: all the heavy lifting is done by an integrated graphics card, which limits performance for more complex scenarios. However, a dedicated graphics card for rendering would consume too much power in the current state and would be impractical to mount on the user's head.
We showed that ”video see-through” head-mounted displays are feasible and accurate enough to find their utility in today's industry. However, the current state of the industry must evolve to provide a consumer-grade device. We hope it will attract the same enthusiasm as the research done on autonomous cars, which use similar SLAM algorithms.
We presented an experimental head-mounted display capable of rendering 3D scenes in real time for virtual and augmented reality scenarios. With real-time correction of the distortions of both the stereo camera lenses and the Fresnel lenses, we can display a stereo view of the user's environment with a comfortable 110° field of view in both virtual and augmented reality.
Due to its unique architecture and performance achievements, we believe it will help in many industries and open the field to a new kind of AR device.
We thank Guillaume Gelée for his precious insight and help alongside this project, and GFI Informatique for their support.
Brown, D. C., “Close-range camera calibration,” Photogrammetric Engineering 37(8), 855–866 (1971).
Urban, S. and Hinz, S., “MultiCol-SLAM - a modular real-time multi-camera SLAM system,” arXiv preprint arXiv:1610.07336 (2016).
Quigley, M., Conley, K., Gerkey, B. P., Faust, J., Foote, T., Leibs, J., Wheeler, R., and Ng, A. Y., “ROS: an open-source Robot Operating System,” (2009).