Toward open benchmark tests for automotive lidars, year 1: static range error, accuracy, and precision

Abstract. This paper describes the initial results from the first of 3 years of planned testing aimed at developing the methods, metrics, and targets necessary to develop standardized tests for automotive lidars. Here, we evaluate range error, accuracy, and precision for eight automotive-grade lidars; a survey-grade lidar is used as a reference. These lidars are tasked with detecting a static, child-sized target at ranges between 5 and 200 m. Our target, calibrated to 10% reflectivity and Lambertian, is a unique feature of this test. We find that lidar range precision is in line with the values reported by each manufacturer. However, we find that maximum range and target detection can be negatively affected by the presence of an adjacent strong reflector. Finally, we observe that design trade-offs made by each manufacturer lead to important performance differences that can be quantified by tests such as the ones proposed here. This paper also includes some lessons learned, planned improvements, and discussion of future iterations of this activity.


Introduction
The options available to automobile manufacturers and Tier 1 integrators for low-cost light detection and ranging (LIDAR or lidar) sensors used for autonomous vehicle (AV) and advanced driver assistance systems (ADAS) applications are growing rapidly. Fundamentally, these lidars all use lasers to measure distance across a set field of view (FoV). Lidar engineers make design trade-offs to gain competitive advantages in performance and cost in what is a rapidly growing, highly competitive market. Some of these trade-offs include operating wavelength (typically between 850 and 1600 nm), range measurement based on either direct detection/Time-of-Flight (ToF) or coherent techniques, beam steering solutions (mechanically rotating components, MEMS mirrors, microlenses), and laser source type [vertical cavity surface-emitting lasers (VCSELs), edge-emitting diodes]. These design choices have trade-offs of their own, with differences in scan patterns, sampling frequency, achievable ranges, susceptibility to interference from other lidars, etc.
Lidar is one of the most important and versatile components of an AV's perception system. These sensors provide the vehicle with a three-dimensional (3D) map of the location of objects around the vehicle in all lighting conditions. They also allow estimation of the vehicle's position with respect to its surroundings, all updated hundreds of times per second. The performance of the complete AV system can be directly linked to the collective performance of all components and subsystems. 1 AV system integrators must consider the likely performance variation of each component and the impact of that variation at the system and subsystem level.
On datasheets and in sales literature, vendors list specifications that assist engineers in designing AV perception systems. Some of these specifications, e.g., FoV and angular sampling rate, are easy to verify. Others are more ambiguous. For example, range is often listed but is only occasionally accompanied by a target reflectivity. Similarly, range precision is often listed but usually as a normal deviation from the mean; it is unclear if this is a reliable assumption. In comparing one lidar to others, integrators often resort to evaluation of engineering samples. This is a costly endeavor and risks introducing bias due to experiment design, etc. Because these results are kept confidential and internal to those performing the test, it is possible for two integrators to draw different conclusions about the same lidar unit. The need for both a large-scale benchmarking activity and some attempt at test standardization is clear.
*Address all correspondence to Jeremy P. Bos, jpbos@mtu.edu; Paul McManamon, paul@excitingtechnology.com
While most lidar testing and benchmarking are confidential, there are some works available in the open literature. Glennie and Lichti 2 assessed a Velodyne HDL-64E and developed a calibration routine for mapping applications. Later, Glennie and Hartzel 3 compared a Livox Mid-40 and an Ouster OS1-64 to their published specifications. In Mittet et al., 4 the range accuracy and precision of an early Quanergy M8 unit were examined. Kutila et al. 5 examined the effects of arctic conditions on lidar sensors. This work is both qualitative and quantitative, comparing intensity and reported range from lidar scans in winter conditions to the average across multiple lidar sensors; it does not include a ground truth. Other work by the group, including Rosenberg et al., 6,7 has focused on developing sensor models for use in simulation and necessarily involves characterizing sensors using metrics and benchmarks.
Attempts at codifying test methods and comprehensive benchmarking activities are a recent development. Cattini et al. 8 proposed a very precise laboratory method. Their procedure is too cumbersome and complex to be used in a field test event such as ours involving potentially dozens of lidars over a single day. However, their findings with respect to unit-to-unit variation, warm-up time, and stability will be useful as standards are developed. An extensive indoor test was performed in Ref. 9 involving ten rotary-style lidars. This testing includes a ground truth and examines accuracy, precision, and intensity variations for a static target containing three materials. However, the specific reflectance behavior of these materials over angle and wavelength is not characterized. In addition, this work presents results in terms of the mean error and standard deviation from the mean. Kim et al. 10 perform a competent evaluation of lidar performance under degraded conditions using a two-way ANOVA test. Their work, however, focuses on number of points and intensity rather than range accuracy and precision and involves only a single lidar. Test repeatability will inevitably be part of this activity as work progresses.
Occurring in parallel to this effort, the work by Schulte-Tigges et al. 11 evaluates six nonrotary automotive lidar devices against different static and dynamic targets. In addition, metrics and processing steps are outlined for each scenario, some of which include a target detection algorithm. All the targets are relatively large and, like those in the work by Lambert et al., 9 are not well characterized. Similarly, results are presented in terms of the deviation about the mean. Also, the measurement references are either hand-measured or derived from GPS.
By way of standards, the National Institute of Standards and Technology (NIST) and ASTM International have published standards for metric assessments of laser-based, scanning, ToF, single-detector 3D imaging systems, 12,13 but no manufacturer advertises that their internal testing and calibrations are done to these standards, nor is there a requirement for them to do so.
In this work, we describe the results of the first year of a proposed 3-year lidar benchmarking exercise. This effort began in 2019 with the intent of an initial public test at the SPIE Defense and Commercial Sensing conference in April of 2020. This event was cancelled due to the worldwide SARS-CoV-2 pandemic and finally reconvened at SPIE Defense and Commercial Sensing in April of 2022.
The aim of the first year of testing was evaluation of range accuracy and precision on static targets with the goal of refining processing, data acquisition, and test setup. The aim of years two and three is to add additional effects like oncoming lidars on the test range, weather, and dynamic targets. A complete description of proposed future efforts is found in the Appendices to this paper.
Our work differs from the previous and concurrent works in several important ways. First, this effort compares results between both rotary and scanning units. Most crucially, it also includes a small, child-sized, target calibrated to 10% reflectivity over the range between 800 and 1600 nm. This target, provided by Labsphere, is also verified as being a purely Lambertian reflector. Our year one testing also involves two configurations: a control configuration and a second identical test setup with highly reflective adjacent objects we call "confusers." Unlike other previous works, we present results in terms of the median sample range and the sample interquartile range (IQR) to avoid the presumption that sample detections can be fit to a normal distribution. The effort described here also does not include any detection algorithms or weather effects though both are likely to be considered in future iterations of this event.
Testing was conducted in an open field in Kissimmee, Florida, among eleven different lidars including three of the same make and model. We present here a comparison between eight of the units tested using a survey-grade lidar as a reference. All the units evaluated have advertised maximum detection ranges of between 100 and 200 m and operating wavelengths between 800 and 950 nm. In our test configuration, the average maximum detection range was 50 m with a minimum of 25 m and a maximum of 120 m. Range accuracy across units was biased short of the reference distance by −0.6 cm excluding outliers. Range precision across all units is estimated as 3.6 cm compared with a typical advertised value of 3 cm for most units. RMS planar fitting errors with respect to the target differed by around 7 cm. The addition of adjacent "confusers" reduces range precision of all units by 25% to 65% depending on the metric. Confusers also significantly reduce the ability of all lidars to score targets at range. For this reason, the "maximum range" self-reported by lidar manufacturers should be considered measured under the best possible conditions. A finding of this work is that the presence of a highly reflective object adjacent to a dim object reduces the probability of detecting the dim object. For this reason, this condition should be considered in the development of any standard.
Over the remainder of this paper, we describe our test setup and method starting in the next section. In the section that follows we describe our test metrics and processing approach. Results are presented in Sec. 3. Conclusions and directions for future work are provided in Sec. 4. In an appendix to this paper, we outline plans for years two and three of the effort and solicit both feedback and participation.

Location
Testing was conducted near the Bridging the Innovation Development Gap (BRIDG) center in Kissimmee, Florida (28.291525° N, 81.371776° W) on April 2 and 3, 2022. A dry run was conducted on the second, when the conditions were overcast with scattered thunderstorms, with temperatures between 61°F and 81°F. Relative humidity was between 79% and 99%. Testing was conducted on the third, when conditions were sunny with temperatures between 73°F and 77°F and relative humidity between 56% and 60%.
The test area comprised the unimproved lot behind the BRIDG facility and Skywater Technologies buildings. An overhead plan view of the test area can be found in Fig. 1. The buildings and courtyard are seen in the lower half of the image. Neocities Way is directly behind the "origin" located just off the adjacent sidewalk. The test area itself is somewhat flat along the major axis of the target field. Black areas in the image indicate shadowed areas and suggest a depression present near the center of the test area and a decline in elevation to the right of the test area (top of the image). The device test location is labeled as "origin" at the center left of the image. Targets are labeled by their approximate distance from the origin with the far 200 m target located at the center right. The line between the 200 m target and the origin makes up the main axis of the test range. The image in Fig. 1 was generated using the Riegl VZ-400i by placing the unit at various locations around the test area. Each black circle in the image is the origin of one scan area. Data from multiple scan areas were combined to create the composite.

Targets
Both calibrated and other objects were present in the test range. As part of this benchmarking effort Labsphere developed a 15 cm × 80 cm flat aluminium target with a Lambertian coating that is 10% reflective from 800 nm to 1600 nm. The size corresponds, approximately, to the cross-section area of a small child when viewed from the side (Fig. 2). Only range data for child size targets are presented in this work. Figure 2 also includes a plot of the measured reflectivity of the target between 800 and 950 nm, the range of operating wavelengths for the lidars tested. The exact mean reflectance over this range is 10.18% with a minimum value of 10.08% and a maximum of 10.34%.
All monostatic lidars, no matter their technology, rely on light emitted from a common source location and received at the same location. Whether scanning or flash illuminated, the received power per solid angle from a target decreases with range. The area occupied by the solid angle sampled by the lidar also increases with distance. For this reason, in a test like this one the targets must be arranged so that they do not overlap. This is straightforward for turntable scanning type lidars, assuming there is enough open area around the device under test (DUT). However, an increasing number of automotive lidars have a limited azimuthal FoV. For this reason, the test design aims to include the maximum number of targets within a 60 deg FoV.
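To make the geometric argument concrete, the following minimal sketch estimates the angle subtended by the 15 cm × 80 cm target and the rough number of angular samples that land on it in a single scan. The 0.2 deg angular sampling used in the example is a hypothetical value chosen for illustration, not a specification of any tested unit, and the target is assumed to be viewed face-on.

```python
import math

TARGET_WIDTH_M = 0.15   # target width given in the text
TARGET_HEIGHT_M = 0.80  # target height given in the text


def angular_subtense_deg(extent_m: float, range_m: float) -> float:
    """Full angle (deg) subtended by a flat extent viewed face-on at range_m."""
    return math.degrees(2.0 * math.atan(extent_m / (2.0 * range_m)))


def expected_points(range_m: float, az_res_deg: float, el_res_deg: float) -> float:
    """Rough count of angular samples falling on the target in one scan.

    az_res_deg and el_res_deg are hypothetical angular sampling intervals.
    """
    cols = angular_subtense_deg(TARGET_WIDTH_M, range_m) / az_res_deg
    rows = angular_subtense_deg(TARGET_HEIGHT_M, range_m) / el_res_deg
    return cols * rows


if __name__ == "__main__":
    for r in (5, 25, 50, 100, 200):
        # 0.2 deg sampling in both axes is an assumed, illustrative value
        print(r, round(expected_points(r, az_res_deg=0.2, el_res_deg=0.2), 2))
```

Under these assumptions the expected sample count falls off approximately with the square of range, which is consistent with the small number of points per scan observed on distant targets.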
Targets were arranged starting with the 200 m target aligned along the intended optical axis down the test range. Starting from the origin, the first two targets were placed in 5 m increments, alternating along the right and then left side of a ±30 deg FoV with respect to the test axis. Additional targets were placed in the same manner in 5 m increments out to 50 m on the left side of the range. The next set of targets was placed in 10 m increments starting on the right side of the range and continuing out to 100 m. At 100 m, the spacing was increased to 20 m out to the final target at 200 m. An idealized plan layout of this configuration can be found in Fig. 3.
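The range spacing described above can be enumerated with a short sketch. It lists only the nominal target distances implied by the stated increments; the left/right side assignment is not reproduced, and the resulting count assumes one target at every increment.

```python
from typing import List


def target_ranges_m() -> List[int]:
    """Nominal target distances (m) implied by the stated spacing."""
    near = list(range(5, 51, 5))     # 5 m increments out to 50 m
    mid = list(range(60, 101, 10))   # 10 m increments out to 100 m
    far = list(range(120, 201, 20))  # 20 m increments out to 200 m
    return near + mid + far


if __name__ == "__main__":
    ranges = target_ranges_m()
    print(len(ranges), ranges)
```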
Each Labsphere target was fixed to a delineator-type traffic cone with stabilizing base using self-adhesive hook-and-loop fastener. Each target was aligned such that the center of the target was approximately aligned with the horizontal optical axis. This alignment was done manually using a spotting scope from the test origin. The nature of the cone and base used to hold the targets means that the target itself is at best orthogonal to the ground patch on which the cone is placed, but it was not possible to ensure that the target itself is orthogonal to the ray between the target center and the origin. For this reason, targets will not be oriented uniformly with respect to the origin. However, each DUT observes each target in the same geometry. Also, using our reference system, we were able to measure the orientation of each target. This information is summarized in Table 5 in the appendix. The limited vertical angular FoV of the test targets is a known weakness of this test setup and will be addressed in subsequent efforts.
Testing was performed in two configurations referred to as "Lane 1" and "Lane 2." Lane 1 consisted of only the targets affixed to the stands. Lane 2 consisted of the targets intermixed with "real-world" objects as scene clutter or "confusers." These objects included: orange and white folding metal traffic barricades with retroreflective panels, orange rubber traffic cones with retroreflective tape, and orange plastic delineator tubes with retroreflective tape along the left side of the lane, and a variety of 48-in. steel traffic control signs containing black text on a retroreflective orange background along the right side (signs consisted of type II road construction signs; one "reverse curve" sign (MUTCD code W1-4L), two "two way traffic signs" (W6-3), two "one lane road ahead" (W20-4), and two "be prepared to stop" signs (W3-4) 14 ). These confusers were placed adjacent to each target. This arrangement provides the opportunity to test the effect of laser power automatic gain control and its impact on range detection performance. Figure 4 shows the test set up of Lane 2, using confusers as well as the test targets. The location of the test targets was the same in both configurations.

Reference lidar
A survey-grade Riegl VZ-400i Terrestrial Lidar Scanner (TLS) collected the high-resolution, high-accuracy point clouds used as our reference data. Lidar scanners of this type and accuracy have been used to collect reference data for similar efforts. 3 Table 1 contains the specifications for this instrument. The TLS was mounted on a leveling tribrach with a removable insert, which was secured atop a sturdy survey tripod. Each scan was collected using 0.02 deg horizontal and vertical angular sampling, at a scan rate of 1.2 MHz. A camera affixed to the top of the scanner collected color images that were used to apply an RGB value to each point in the point cloud. The resulting data had an average of 300 points/m² on the target features. Two complete reference datasets were collected, one for each lane configuration. Each dataset comprised multiple scans collected at different locations along the test lanes, ensuring complete, high-density coverage of the area, ground, targets, and buildings (see Fig. 1).
The top of the tribrach was set ∼1 m off the ground and leveled. After a reference scan, the TLS was removed from the tribrach so each test lidar could be affixed to the tribrach using the removable inserts. This setup ensured that each test lidar's coordinate system and pose could easily be aligned to the project's reference coordinate system using the mounting point offsets supplied in the reference documentation.
One scan in each reference dataset was used to define the coordinate system for each test lane configuration. In processing the reference data, a local Cartesian coordinate system was defined such that the X-Y origin corresponds to the TLS's X-Y origin at the scan position at the end of the test lane and the Z origin corresponds to the top of the tribrach, calculated by subtracting the optical center from the base plate of the instrument. The elevation from the ground of the optical axis for the DUTs would be slightly lower than the TLS at 92 cm.

Test lidars
Data were collected for each lane configuration using eight different lidar devices. Three of the lidars were of the same make and model. Since our objective is to evaluate variation between lidar designs and not individual lidar performance, the make and model are obscured. Instead, we refer to each device with an assigned letter between "A" and "H." Some general observations regarding the DUTs: all of the test lidars operated between 800 and 950 nm. The test pool consisted of nearly equal portions of MEMS or other static scanning lidars and traditional rotating scanning devices. Most DUTs indicated a range precision near 3 cm and an operating range between 100 and 200 m. Each DUT was aligned to the optical axis at the origin of the test setup. DUTs were connected to a laptop computer running Ubuntu 18.04 and Robot Operating System (ROS) Melodic. 16 DUTs were configured using the default settings for the respective ROS driver. Using the rosbag tool, 100 consecutive "pointcloud2" messages were collected from each DUT in each lane configuration.

Data Processing
Data from the Riegl TLS was processed using the manufacturer's software (RiScan Pro 2.14.1). Individual scans were automatically registered together on board the scanner during collection and further refined using multistation adjustment, a plane-fitting registration routine in RiScan Pro. Points with very low intensity or high pulse deviation were then filtered out, and RGB information was added to each point.
Each DUT point cloud was exported from ROS to MATLAB and initially registered to the Riegl reference point cloud manually. Final alignment was completed using iterative closest point (ICP) matching. For each DUT, the 100 "pointcloud2" messages were combined into a single point cloud data object and then aligned to the Riegl reference point-cloud via ICP. The result of this pre-processing step is a best-effort aligned point cloud that uses a single transform between the DUT coordinate system and reference system with all 100 point-cloud messages combined into a single data object.
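For readers unfamiliar with ICP, a minimal point-to-point sketch is given below. It is not the MATLAB routine used in this work: it operates in two dimensions only, uses brute-force nearest-neighbor matching, and all function names are hypothetical. It is meant only to illustrate the alternation between matching points and solving for a single rigid transform.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def transform(points: List[Point], theta: float, tx: float, ty: float) -> List[Point]:
    """Rotate points by theta (rad) about the origin, then translate."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def nearest(p: Point, cloud: List[Point]) -> Point:
    """Brute-force nearest neighbor of p in cloud."""
    return min(cloud, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def best_rigid_2d(src: List[Point], dst: List[Point]) -> Tuple[float, float, float]:
    """Least-squares rotation and translation mapping paired src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def icp_2d(src: List[Point], ref: List[Point], iterations: int = 20) -> List[Point]:
    """Iteratively match src to ref and apply the best rigid transform."""
    current = list(src)
    for _ in range(iterations):
        matches = [nearest(p, ref) for p in current]
        theta, tx, ty = best_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current
```

As in the processing described above, a sketch of this kind depends on a reasonable initial registration; with a poor starting pose the nearest-neighbor matches are wrong and the iteration can settle on an incorrect alignment.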
Data was then further processed for each range target. For each candidate sensor, each target was identified and captured by bounding boxes first from a top-down point-of-view, then a left-facing point-of-view, and finally a forward-facing point-of-view. Special care was taken to include neither the cones that held the targets nor confusers when present. This manual process was repeated until every target scored by each sensor was identified. Figure 5 shows the point clouds collected and the points labeled as the 10 m target with only the initial registration. Scoring was performed by finding the minimum distance from each lidar point reported by the DUT to the closest coplanar point from the reference point cloud considering only range and cross-range dimensions; difference in elevation is not considered.
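The scoring step can be sketched as follows. This is an illustrative reading of the description above rather than the processing code used in this work: the function name is hypothetical, the origin is assumed to be the DUT location, and the sign convention (negative means the DUT reports the target short of the reference) is an assumption.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def range_errors(dut_points: List[Point3], ref_points: List[Point3]) -> List[float]:
    """Signed horizontal range error (m) of each DUT point.

    Each DUT point is paired with the nearest reference point in the
    x-y (range/cross-range) plane; elevation z is ignored throughout.
    """
    errors = []
    for x, y, _z in dut_points:
        rx, ry, _rz = min(ref_points, key=lambda q: (q[0] - x) ** 2 + (q[1] - y) ** 2)
        # Assumed convention: negative error means the DUT point is closer
        # to the origin than its matched reference point.
        errors.append(math.hypot(x, y) - math.hypot(rx, ry))
    return errors
```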

Scoring and Metrics
For this initial examination, only range statistics are examined: range accuracy and precision provided by each DUT with respect to the reference. The RMS plane fit is also provided and refers to the total fit RMS error between each reported DUT target point and the nearest reference point on the target excluding elevation.
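A minimal sketch of these summary statistics is given below, assuming per-point range errors and nearest-point distances are already available. Function names are hypothetical, and the quartile interpolation used by Python's `statistics.quantiles` may differ slightly from that of other tools.

```python
import math
import statistics
from typing import List, Tuple


def median_and_iqr(range_errors_m: List[float]) -> Tuple[float, float]:
    """Median (accuracy) and interquartile range (precision) of range errors."""
    q1, med, q3 = statistics.quantiles(range_errors_m, n=4, method="inclusive")
    return med, q3 - q1


def rms(distances_m: List[float]) -> float:
    """Root-mean-square of DUT-to-reference nearest-point distances."""
    return math.sqrt(sum(d * d for d in distances_m) / len(distances_m))
```

Reporting the median and IQR rather than the mean and standard deviation avoids assuming that the detections are normally distributed.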

Results
Results are presented here with as little interpretation as possible apart from highlighting cases where the field test results further motivate additional testing and the development of standards. Herein, we consider a target to be detected only if 20 or more points over the 100 ROS point clouds score the target. Also, although some results are provided out to the maximum scored range, the number of points on target beyond 50 m is typically <200 points per target, implying <2 points per scan on average. For this reason, results beyond 50 m should be considered informative rather than descriptive.
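The detection criterion can be expressed as a short sketch. The 20-point threshold is taken from the text; the function names and the point counts in the usage example are hypothetical.

```python
from typing import Dict, Optional

MIN_POINTS = 20  # minimum points on target over the 100 accumulated clouds


def is_detected(points_on_target: int) -> bool:
    """Apply the 20-point detection threshold stated in the text."""
    return points_on_target >= MIN_POINTS


def max_scored_range(points_by_range_m: Dict[int, int]) -> Optional[int]:
    """Farthest target range (m) that meets the threshold, or None."""
    detected = [r for r, n in points_by_range_m.items() if is_detected(n)]
    return max(detected) if detected else None


if __name__ == "__main__":
    # Hypothetical counts: the 35 m target is missed, the 40 m target is scored.
    print(max_scored_range({25: 900, 30: 150, 35: 19, 40: 20}))
```

Note that the maximum scored range defined this way does not require every nearer target to be detected, which matches the way intermediate targets can be missed while a farther one is still scored.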
In terms of overall performance, Lidar B was scored out to 120 m in both lanes, though the DUT was not scored on the 80 and 100 m targets in Lane 2 with confusers present. Lidar H was scored out to 90 m in Lane 1 but only to 45 m in Lane 2. Similarly, DUTs C and F were scored out to 50 m in Lane 1 but only to 45 m in Lane 2. DUTs D and E scored to 40 m in Lane 1 and to 30 and 25 m in Lane 2, respectively. Lidar A scored to 30 m in both lanes and lidar G to 25 m, also in both lanes. Only targets between 10 and 25 m are scored by all DUTs.
Tabulated results for the eight DUTs are provided in Tables 2 and 3 representing the results with and without the confusers. Beginning with the top row, we observe that three of the eight units reported the targets closer to the origin compared to the reference. Across all units, the absolute average error in position estimate was 2.9 cm and the minimum to maximum variation (span) was 12.4 cm. Adding confusers increases the absolute average error across all DUTs and targets to 4.8 cm, a 65% increase. With the confusers, the span also increased to 15.4 cm (25%). RMS plane fit error increased from 7.4 to 9.6 cm, an increase of 30%.
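The relative changes quoted above follow from simple arithmetic, sketched here using the rounded values given in the text. Because the inputs are rounded, the computed percentages can differ slightly from the reported ones.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    print(round(percent_increase(2.9, 4.8), 1))    # absolute average error, cm
    print(round(percent_increase(12.4, 15.4), 1))  # span, cm
    print(round(percent_increase(7.4, 9.6), 1))    # RMS plane fit error, cm
```

With these inputs the three changes evaluate to about 65.5%, 24.2%, and 29.7%, respectively.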
Range precision averaged 3.6 cm across all targets and DUTs and 3.1 cm excluding lidar F, in line with the typical advertised value of 3 cm. However, on a target-by-target and device-by-device basis, there is quite a bit of discrepancy, from a minimum variation of 0.7 cm for lidar C on the 5 m target to 15.2 cm for lidar H on the 90 m target. While it may be assumed this is simply an effect of range, the variation of lidar D observing the 35 m target was 14.9 cm. Adding confusers increases the average range ambiguity (decreases precision) across the test population by 26% to 4.6 cm. The minimum range precision remains with lidar C on the 5 m target, while lidar D's variation observing the 35 m target increases to 17 cm. Tables 2 and 3 also report the number of points across all targets. From this information, the reader can glean some information about the relative sampling rate in the angle space occupied by the targets. Immediately relevant is the effect of confusers. Adding the confusers reduces the number of points on target by 24% across all devices and targets.
[...] (1) and the lower distance threshold, $T_{\mathrm{lower}}$, [...] (2). In Eqs. (1) and (2), $R_2$ and $R_3$ represent the second and third quartile ranges around the median.
Outliers are those values that exceed these threshold values and are indicated as red "+" markers. Figures are organized sequentially starting with 5 m targets and include both lanes. Examining the results for the 5 m target in Fig. 6, it can be seen that lidars A and H record no points on the 5 m target. These lidars both have an FoV in the range of 60 deg. In our test configuration, the 5 m target was situated at nearly 60 deg from the test axis connecting the reference center and the 200 m target; the target is likely just outside the FoV of both lidars. While Tables 2 and 3 provide an overview of the performance of the test lidars as a group, Figs. 6-15 tell us much more about the variation in performance of each lidar compared with the others. For example, in Fig. 6 we can see that in "Lane 1" lidars B and C have roughly the same performance when observing the 5 m target. When confusers are introduced, the median reported range, IQR, and number of outliers increase. While the IQR of lidar C increases from 0.7 to 0.9 cm, lidar B's increases by 5 cm. A similar trend is observed in lidar B's scoring of the other targets. In contrast, lidars C, D, E, and F show a decrease in the number of outliers between Lanes 1 and 2. Lidar G, like lidar B, also shows an increase with the introduction of confusers.
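As an illustration of how outliers can be separated from the box-plot data, the sketch below uses the common convention of placing thresholds 1.5 times the IQR beyond the first and third quartiles. This convention and the function names are assumptions made for illustration; the exact form of Eqs. (1) and (2) is not assumed here.

```python
import statistics
from typing import List, Tuple


def outlier_thresholds(samples: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper thresholds, k * IQR beyond the quartiles (assumed rule)."""
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(samples: List[float], k: float = 1.5) -> Tuple[List[float], List[float]]:
    """Partition samples into (inliers, outliers) using the thresholds."""
    lower, upper = outlier_thresholds(samples, k)
    inliers = [s for s in samples if lower <= s <= upper]
    outliers = [s for s in samples if s < lower or s > upper]
    return inliers, outliers
```

For a hypothetical set of reported ranges clustered near 10 m with a single return at 10.5 m, only the 10.5 m return falls outside the thresholds.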

The variation in some reported range values is also interesting to note: half of the values reported by lidar E differ from the median value by >5 cm whether confusers are present for the 5 m target or not. In fact, the difference between the minimum and maximum value reported for the 5 m target in Lane 1 was 18 cm over the 100 recorded scans. These results and others suggest a relatively fat-tailed probability distribution.
A similar box plot is shown in Fig. 7, here comparing the Lane 1 and Lane 2 results for the 10 m target. All units are represented in this plot. In Lane 1, we observe 50% of the reported points within about 2 cm of the median reported value for all DUTs. However, units D, F, and G report values >10 cm from the median value 50% of the time, excluding outliers. For lidar A, the observed deviation from the mean is 20 cm excluding outliers at the 50th percentile. Also of note is that 14% of the points reported by DUT E are considered outliers compared to around 8% to 9% for lidars G and H and compared with <3% for the other units. Here, again, we observe similar performance between units B and C and a reduction in the number of outliers reported after introducing confusers. The trends observed in Figs. 6 and 7 continue at the 15 m target in Fig. 8. New here is an increase in the mean IQR of only 28% on average across all units from Lane 1 to Lane 2. By comparison, at 5 and 10 m the increase was 74% and 146%, respectively; the effect of the confuser is lessened. Conceivably, the confuser here could be far enough from the target as to not have as large an effect. This seems likely, as the trend continues again at 20 m in Fig. 9 and is observed generally for the other range targets. Indeed, the IQR of DUT F increases from 4.2 to 10.9 cm for the 20 m target with the introduction of confusers. For the same device, a similar increase is observed at the 30 m target: 5.2 to 12 cm. While the data suggest that confusers can severely reduce range precision, we cannot rule out issues with the test setup, as the strongest confirming results are on the left side (10, 20, 30, and 40 m targets) of the test range. It is conceivable that ambient illumination or the location of the sun relative to the target and DUTs affected results.
Extrapolating general trends is possible only between the 10 and 35 m targets, where most of the DUTs score every target. For example, we observe in Lane 1 that the average IQR of all DUTs increases from 3.6 to 9.2 cm across the 10, 15, and 25 m target data. The IQR of DUT B in particular increases from 1.4 to 18.5 cm. The performance of these units is as one might expect: IQR increases in general with range. A notable exception can be found in lidar G, where the IQR does not vary considerably between the two test configurations and is also similar across range. However, data are only available out to 25 m for this unit for both lanes.
Continuing through Figs. 12-15, we see a common trend of decreasing range accuracy and precision. However, this appears to be mostly correlated with a drop in the number of points on target, caused by a decreasing occupied solid angle and also the presence of confusers when relevant. The presence of confusers decreases the number of points on target for all lidars and by more than half in some cases. Confirming results can be found in Table 4 in the appendix.
The boxplots in Figs. 16-23 present the same data as Figs. 6-15 but by range for each lidar. This presentation makes clear the difference in performance for each lidar with and without the presence of confusers. DUT A (Fig. 16) is somewhat typical of the test group. In Lane 1, the DUT tends to slightly underestimate range, in this case by about 3 cm on average. In Lane 2, the IQR increases slightly from 5 to 6.3 cm and the average estimated range error increases to −0.3 cm, still short of the target.
The effect of confusers is probably most obvious in the results for DUT B. In Fig. 17, performance is seen to be remarkably consistent in Lane 1 out to 70 m. The opposite story is told by the results for Lane 2 in the same figure. Because of this inconsistency, we cannot rule out processing artifacts or even a spurious event on the test range, a strong gust of wind for example. Interesting over the remaining figures is the uniqueness of each lidar compared with the others. For example, if we exclude the 10 and 20 m targets, DUT C consistently reports a similar range for these targets with high precision and is only slightly affected by the confusers. For DUT D, the IQR increases sharply beyond 25 m in both tests. This increase is likely related to a sharp drop in points on target (<400).
We suspect that other tests and metrics will allow a unique fingerprinting of lidar performance based on make or certain common design decisions. For example, if we were to assume that ambient lighting differences or sun angle are driving the difference in the left/right sides of the test course, then lidar E may be said to be particularly affected by this difference; the same could be said of lidar F. In Fig. 21, the range error increases with distance, with deviations from this trend at 10, 20, and 40 m. The nature of the test setup also likely biased results against some of the units. Observe that in Tables 2 and 3 the total number of points for lidars E and G is similar, though E was able to detect targets out to 40 m. This is likely due to aspects of the test setup favoring lidar E over lidar G. In this instance, vertical scanning of the lidars, or larger or longer targets, may have been more equitable. Despite this, and as commented upon previously, the performance of lidar G is notable. The IQR for lidar G increased by only 0.2 cm in the presence of confusers, less than any other DUT. Lidar G also has the most obvious skewness in its distribution of detection samples.
One last unique performance characteristic can be observed in Fig. 23. Here, lidar H was able to detect the 90 m target but reported the target nearly a meter closer than the reference, with an IQR of 26.5 cm.
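The range error and IQR figures quoted in this discussion can be illustrated with a minimal sketch. It assumes that the range error for a target is the median of the DUT range samples minus the reference lidar range, and that precision is summarized by the interquartile range; the function name and the sample values are hypothetical and are not taken from the test data.

```python
import numpy as np

def range_error_stats(dut_ranges_m, reference_range_m):
    """Return (median range error, IQR) in metres for one target.

    Assumption: error is the median DUT range minus the reference range.
    """
    ranges = np.asarray(dut_ranges_m, dtype=float)
    q25, q50, q75 = np.percentile(ranges, [25, 50, 75])
    return q50 - reference_range_m, q75 - q25

# Hypothetical range samples (m) on a target that the reference places at 90.1 m.
samples = [89.0, 89.1, 89.2, 89.3, 89.4]
error_m, iqr_m = range_error_stats(samples, reference_range_m=90.1)
print(f"range error = {error_m:+.2f} m, IQR = {iqr_m * 100:.1f} cm")
# range error = -0.90 m, IQR = 20.0 cm
```

A negative error means the DUT reports the target closer than the reference does, as was observed for lidar H at 90 m.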

Conclusion
We have presented here the results of a first attempt at benchmarking eight automotive grade lidars. This effort is the first to use calibrated targets along with a reference and adjacent highly reflective confusers. Our purpose in this work is to motivate the development of test standards in this area and highlight variations in performance between lidars when stated specifications are similar. In this test, all the lidars operated near 900 nm, claimed either 100 m or 200 m maximum ranges, and range precisions of no worse than 3 cm as one standard deviation about the mean. Testing in this first year involved the first use of calibrated, Lambertian targets with 10% reflectivity in two test lanes with and without adjacent highly retro-reflective confuser targets.
In this initial examination of the test data, we focused only on range accuracy and precision. This comparison was made possible via a survey-grade reference lidar. Across all devices tested, we observed an average absolute range accuracy of 2.9 cm with respect to the reference across all targets. Average range precision was estimated at 3.6 cm. Introduction of the confusers in the second test decreased the number of points reported on target by 24% and increased range uncertainty by 34%. Additionally, the detection range, or range where 200 points were placed on target and averaged across all DUTs, decreased by half from 100 to 50 m. Only one DUT was able to detect targets beyond 90 m, and the typical maximum detection range in our second test was 40 m. In addition, the results presented here indicate that, due to inherent design trade-offs, the performance of each lidar is unique and can be characterized up to a point.
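The detection range criterion used above (at least 200 points on target) can be sketched as follows. This is a simplified reading that takes the farthest target meeting the threshold for a single DUT; the averaging across DUTs is not reproduced, and the point counts are hypothetical.

```python
def detection_range(points_by_range, min_points=200):
    """Farthest target range (m) with at least min_points on target, or None."""
    detected = [r for r, n in points_by_range.items() if n >= min_points]
    return max(detected) if detected else None

# Hypothetical points-on-target counts for one DUT, keyed by target range in metres.
lane_1 = {10: 5000, 20: 2100, 40: 650, 70: 210, 100: 40}   # no confusers
lane_2 = {10: 4100, 20: 1500, 40: 220, 70: 90, 100: 0}     # with confusers

print(detection_range(lane_1))  # 70
print(detection_range(lane_2))  # 40
```

The threshold is a parameter so that the sensitivity of the result to the 200-point criterion can be examined.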
Generalizing across the tested devices, we can say that while the specifications listed by each vendor are representative, they do not adequately describe performance on their own, further justifying the need for standards. Advertised maximum range would appear to depend upon a very specific, and undocumented, set of circumstances coming together to detect a target. With respect to range precision, all the devices tested demonstrated precision similar to their stated specifications. However, it was common for the distribution of range estimates to be heavy tailed, with a dominant concentration near the mean value. Outliers were typically between 1% and 3% but were as high as 14% for some units and targets. This finding may have implications for object detection and tracking algorithms that assume detections will be normally distributed about the mean. As we laid out in the introduction, the results presented here are from only the first year of a proposed 3-year effort. However, there is still more to be done to improve our analysis of the year one data. Notably absent from this manuscript is an estimate of test uncertainty, repeatability, and reproducibility. Similarly, there are some inconsistencies in the test data that bear investigation. For example, there was a tendency for detections on some targets on the left of the range to have higher variability. It has been suggested that this may be due to background illumination or solar angle, effects that have not been accounted for in the test setup.
To that end, proposed additions to the testing for years two and three are included in an appendix to this paper. These additions include more complex targets, dynamic targets, placing corner cubes or identical lidars on the test range, and weather effects.
Years two and three also include plans to repeat the testing from previous years, incorporating lessons learned. As we plan for year two, the improvements to the Lane 1 and Lane 2 tests presented here include maintaining the lidars in a power-on state prior to testing and monitoring and recording ambient light. Some changes to the test setup are also likely in order. Carefully orienting confusers adjacent to the targets and increasing the overall target height are likely to be considered. Other work includes improvements to the processing pipeline to accommodate a planned 30 lidars in year two. Finally, as we continue to develop these tests and standards, a careful analysis of repeatability and error contributions is warranted.

Appendix A: Additional Tables
Table 4 contains aggregated statistics across all lidars, excluding the ground truth, for all targets. Table 5 provides detailed position and alignment information of each test target with respect to the ground truth lidar. The last column provides a double dot product misalignment loss factor.

Appendix B: Year 2 Expected Approach
Major goals of year two testing include examining eye safety and interference. Testing is planned for Friday, April 28 and Saturday, April 29, before the SPIE DCS conference the following week, at the same facility as the year one test. We tentatively expect 30 lidars designed for the automotive sector. Prior to on-site testing, each lidar will undergo a series of eye safety measurements conducted by Exciting Technology (ET) in their Dayton, Ohio, optical labs. The purpose of these eye safety tests is to determine the nominal ocular hazard distance (NOHD) for each lidar for both unaided and aided viewing. The general concept is to integrate the output of a high bandwidth InGaAs detector over a time period of 10 s for various distances from the lidar. The aperture size of the unaided optics will be ∼7 mm, representative of a dark-adapted eye, and 50 mm in diameter for the aided optics, representative of a 50 mm diameter binocular.
Table 6 summarizes the required eye safety measurement equipment. Ideally, we would use a single detector for every lidar wavelength, but the range of wavelengths manufacturers use could be quite broad. We have initially selected an extended range InGaAs detector from LabSphere with responsivity from 800 to 2600 nm, peaking at 2200 nm. This particular detector has more spectral range than is required for this application, as the highest wavelength lidar we anticipate testing is 1550 nm. Figure 24 contains all the information we currently have for the detector. Before the test, we would need the responsivity specifications for each lidar wavelength as well as the temporal bandwidth, as sub-ns rise/fall times (ideally a bandwidth from DC to ≥2 GHz) are desired to capture peak power from pulsed lidars for more accurate integration.
For the on-site lidar field testing, we will implement two new lane configurations. The first test lane will combine the unconfused and confused lanes of the year one tests, leveraging the observation that confusers not in close proximity to targets had no effect on the targets' detection. Additional elements such as cement barricades, simulated tire fragments, and simulated negative obstacles may also be included (positive obstacles are convex relative to the ground plane; negative obstacles are concave and are more difficult to detect). The second test lane will examine the impact and susceptibility of the various DUTs to interference events.
While interference may occur organically during the course of driving or naturally due to certain environments, it may also occur intentionally and possibly maliciously. Another test will be primarily designed to identify naturally occurring interference effects on automotive lidars. Interference may manifest itself as false positives or ghost targets that do not physically exist at the detected location, false negatives or missing targets, or detected targets shifted in position. Each of these cases can cause potentially dangerous results.
In order for interference to occur, the victim DUT must receive an interference event within its range gate time, and the DUT must be spatially aligned with the source in some way. For direct interference, the devices need to be spatially aligned so that the two DUTs are facing each other with overlapping FoVs. For indirect interference, an interfering alignment can occur with the DUTs imaging the same object at the same time, and the victim lidar must interpret the received interference as a target. Depending on its receiver architecture, a lidar may have some resistance to interference, and that resistance can differ between architectures. Specifications regarding interference are not typically released by manufacturers.
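To give a sense of scale for the range gate time, the sketch below assumes a direct-detection (ToF) lidar whose gate stays open for the round-trip time to its maximum range, t = 2R/c. The 100 kHz pulse rate is a hypothetical value chosen only for illustration, and the gate-open fraction is a simple duty-cycle estimate, not a model of any DUT.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum

def range_gate_s(max_range_m):
    """Round-trip time of flight to max_range_m, in seconds."""
    return 2.0 * max_range_m / C_M_PER_S

def gate_open_fraction(max_range_m, pulse_rate_hz):
    """Fraction of time the receiver gate is open (capped at 1.0).

    Assumption: one gate of length range_gate_s() per emitted pulse.
    """
    return min(1.0, range_gate_s(max_range_m) * pulse_rate_hz)

gate_200 = range_gate_s(200.0)
gate_100 = range_gate_s(100.0)
fraction = gate_open_fraction(200.0, pulse_rate_hz=100e3)  # hypothetical pulse rate

print(f"200 m gate: {gate_200 * 1e6:.3f} us")  # 200 m gate: 1.334 us
print(f"100 m gate: {gate_100 * 1e6:.3f} us")  # 100 m gate: 0.667 us
print(f"gate open:  {fraction:.1%}")           # gate open:  13.3%
```

Under these assumptions, a longer maximum range widens the window in which an interfering pulse can be accepted.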
Despite best efforts, individual lidars will not be phase locked to each other. Thus, the temporal alignment between lidars can be modeled as random, or quasi-periodic at best. The typical motion of lidars mounted on vehicles is expected to be moderately complex. While cars generally move in translational motion, rotational motion is also expected due to the vehicles' suspension, potholes, and the natural curvature of the driving path, among other factors. Furthermore, the scan patterns of many commercial lidars use a dynamic but repeatable scanning mechanism to cover a scene. As a result, the FoVs of the lidars will point at each other and overlap at some instants.
Static interference testing will generally follow the procedure outlined in Popko. 17 Each lidar will be tested individually against each of the other DUTs as both the victim lidar and the interfering lidar. The existing test setup in year two will use lane one with the addition of a second pedestal on which the interfering lidar will be placed. Its height should correspond to a typical mounting location on a vehicle. Future efforts will place each lidar on a stage that permits naturally occurring rotational motion (yaw, pitch, and roll).
During each test, a point cloud will be collected from the victim lidar while the interfering lidar is turned off. After scanning the scene for a fixed amount of time, this point cloud will be considered the "truth" for the victim lidar. Next, the experiment will be repeated with both the victim and the interfering lidar on. The two point clouds will be compared, and any new or shifted targets (identified based on a tolerance value to be determined) will be quantified. The number, location (space and angle), repetition, etc., of return points will be characterized in a confusion matrix. Primarily, the deviation from the "interference-free" case will be noted, as lidars may have vastly different error and accuracy metrics.
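The comparison step could be sketched as below. This is a simplified illustration rather than the planned processing pipeline: it matches points by brute-force nearest neighbor, uses a hypothetical 10 cm tolerance (the actual tolerance is still to be determined), and reports raw counts instead of a full confusion matrix.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 3), the distance to its nearest point in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def compare_clouds(baseline, test, tol_m=0.10):
    """Count test points with no baseline match and baseline points with no test match."""
    baseline = np.asarray(baseline, dtype=float)
    test = np.asarray(test, dtype=float)
    new_mask = nearest_distances(test, baseline) > tol_m      # candidate ghost or shifted returns
    missing_mask = nearest_distances(baseline, test) > tol_m  # candidate missed returns
    return {
        "new_or_shifted": int(new_mask.sum()),
        "missing": int(missing_mask.sum()),
        "unchanged": int((~new_mask).sum()),
    }

# Hypothetical clouds (x, y, z in metres): interferer off, then interferer on.
baseline = [[10.0, 0.0, 0.0], [20.0, 0.0, 0.0], [30.0, 0.0, 0.0]]
with_interferer = [[10.02, 0.0, 0.0], [20.0, 0.0, 0.0], [25.0, 1.0, 0.0]]

print(compare_clouds(baseline, with_interferer))
# {'new_or_shifted': 1, 'missing': 1, 'unchanged': 2}
```

The pairwise distance array grows as N × M, so a spatial index such as a k-d tree would likely be needed for full-size point clouds; both clouds are also assumed to be non-empty.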
The methods above will allow us to evaluate interference from a limited number of lidars. We will also examine using large corner cubes to simulate returns from identical, oncoming lidars, but this is the pathological worst-case scenario because the signal is fixed in range and is at exactly the same frequency and in phase. We will also examine using modulated corner cubes, such as NRL has used for two-way communications, as this may simulate a more realistic interference scenario. If we can validate the test methodology of using large corner cubes, we can easily simulate many more interfering lidars.

Appendix C: Year 3 Expected Approach
In year 3, we plan on replicating the tests of year 2, incorporating lessons learned, and additionally testing the DUTs' susceptibility to weather-related performance degradation.
We wish to develop test protocols and associated metrics to measure the performance of lidars under varying weather conditions, such as fog and rain. The international definition of fog is visibility <1 km (35 dB/km attenuation at 0.5 km visibility, ∼92.3% transmission at 10 m), and mist is visibility between 1 and 2 km (8 dB/km attenuation at 1.5 km visibility, ∼98.2% transmission at 10 m). 18 Rain intensity is defined by the US Geological Survey as light rain falling at <0.5 mm/h, moderate rain between 0.5 and 4.0 mm/h, heavy rain between 4.0 and 8.0 mm/h, and very heavy rain in excess of 8.0 mm/h. 19 We will implement repeatable test conditions that simulate weather events for the duration of the test and maintain uniformity across the testing platform, and we will evaluate lidar performance by measuring the reflectivity of chosen targets at various fog and rain intensities. The lidars will be tested under at least two different fog levels and rain intensities, ideally moderate and heavy. Additionally, we propose to test lidars both during simulated rain events, when the surfaces are wet and rain is actively falling, and again after a rain event, when only the surfaces are wet.
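The transmission percentages quoted for fog and mist follow from the attenuation coefficients by converting decibels over the path length to a power ratio, T = 10^(−αL/10). The short sketch below reproduces them; the function name is illustrative, and the round-trip example at 100 m is an added illustration rather than a planned test point.

```python
def transmission(attenuation_db_per_km, path_m):
    """One-way fractional transmission over path_m for a given attenuation in dB/km."""
    loss_db = attenuation_db_per_km * path_m / 1000.0
    return 10.0 ** (-loss_db / 10.0)

fog_10m = transmission(35.0, 10.0)    # fog, 35 dB/km
mist_10m = transmission(8.0, 10.0)    # mist, 8 dB/km
# A lidar return crosses the path twice, so a 100 m target sees a 200 m path.
fog_round_trip_100m = transmission(35.0, 2 * 100.0)

print(f"{fog_10m:.1%}")              # 92.3%
print(f"{mist_10m:.1%}")             # 98.2%
print(f"{fog_round_trip_100m:.1%}")  # 20.0%
```

The round-trip figure illustrates why target distance matters when choosing the length of a weather test lane.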
The weather tests could take place during the same test event or at a separate facility at a different location, but using an existing facility that supports weather testing with a wide range of control on the testing parameters may be the best option. Various state departments of motor vehicles (DMVs) have testing facilities used to examine vehicle safety under real-world weather and lighting conditions. For example, Virginia Tech's Smart Roads Program 20 has a 2.2 mile highway section with controlled lighting and weather systems that are capable of producing fog and rain of varying intensities and droplet sizes, as shown in Fig. 25. Arranging test times at such facilities is possible but would require additional travel and logistics.
The Naval Research Laboratory's Laboratory for Autonomous Systems Research (LASR) facility 21 in Washington, DC, is a smaller testing facility that could be used. The facility contains a 40′ × 60′ × 46′ tropical high bay that simulates a southeast Asian rain forest, as seen in Fig. 26. The temperature is held constant at 80°F with 80% humidity. The tropical high bay is capable of producing fog and rain with varying rates up to 15 mm/h. A catwalk at the 15′ level along the perimeter allows access for mounting equipment, and a separate observation room provides dry space for electronics and computers.
If dedicated facilities are unavailable, limited weather conditions can be simulated nearly anywhere using commercially available hardware. To simulate fog, we propose to use a fog generation system such as a pulley drive mist pump 22 that we have utilized in laboratory testing. Sandia National Lab uses one such system, 23 shown in operation in Fig. 27, to replicate low visibility fog by driving water through a series of standard misting nozzles in buckets via hoses connected to a single pump. Two green laser beams are used to measure the transmission through the fog and verify its uniformity. In order to sustain such a fog, the humidity of the enclosed area has to be maintained at >80%. This requirement poses a challenge for outdoor testing, as the tents we aim to use for such weather testing need to allow for almost complete closure. Transmission measurements at specified spatial intervals should be implemented to monitor the visibility conditions as a function of space and time (day-to-day) to ensure uniform testing conditions during the relevant test periods.
Fig. 25 The rain testing section of Virginia Tech Transportation Institute's highway section. 20
Not having prior experience with rain simulation, we plan to adapt simple rain machine techniques used in the film industry. Commonly available hardware can be used to build a network of PVC pipes with various sprinkler heads to achieve optimum coverage of the testing area, as shown in Fig. 28. The flow rate of water through the system controls the amount of rain produced and the droplet size. Control over the flow rate is needed if we wish to evaluate the performance of lidars under different rain conditions. The raindrop distribution needs to be monitored as a function of time to ensure that consistent conditions exist throughout the experiment. Droplet size can be measured by direct imaging (shadowgraphy) or with a phase-Doppler anemometer, but this equipment is not at hand, and multiple units would be needed to monitor the extent of the test site. Setting up such a system and testing it to ensure repeatable and controlled performance is imperative for successful implementation. Prior experience testing autonomous lidars suggests that distances on the order of tens of meters may be required before measurable performance degradation can be observed. A long-range test facility to simulate rain with good control, together with equipment for testing droplet size distributions at set intervals, will be a challenge.
Fig. 26 The LASR facility at Naval Research Laboratory features a tropical high bay capable of simulating rain and fog. 21
Fig. 27 The fog facility at Sandia National Laboratory, using lasers to measure and calibrate transmission. 23