Bringing pedestrian detection for autos to the next level

(12-11-2021) The latest research from imec at Ghent University into radar-video sensor fusion and automatic tone mapping can improve pedestrian detection for automobiles by up to 30%.

Studies from the National Highway Traffic Safety Administration (NHTSA) show that human error (e.g., speeding, fatigue, and drunk or distracted driving) causes 94 to 96 percent of all motor vehicle accidents. Therefore, equipping cars with advanced driver assistance systems (ADAS) that anticipate and mitigate people’s driving errors will likely reduce the number of traffic fatalities.

In pursuit of ever more powerful Advanced Driver-Assistance Systems (ADAS) features, industry and academia are looking into the potential of multiple sensor technologies and sensor suite configurations – including video, ultrasound, radar, and lidar. But since no one system covers all needs, scenarios, and (traffic/weather) conditions, the next generation of powerful ADAS will likely stem from a combination of technologies.

In this article, Jan Aelterman and David Van Hamme from IPI (Imec’s Image Processing and Interpretation research group at Ghent University, Belgium) report on two of the latest breakthroughs that bring pedestrian detection for automotive to the next level: radar-video sensor fusion and automatic tone mapping.

Radar and video: a particularly interesting sensor fusion match

The ability of tomorrow’s cars to detect road users and obstacles rapidly and accurately will be instrumental in reducing the number of traffic fatalities. Yet, no single sensor or perceptive system covers all needs, scenarios, and traffic/weather conditions.

Cameras, for instance, do not work well at night or in dazzling sunlight; and radar can get confused by reflective metal objects. But when combined, their respective strengths and weaknesses perfectly complement one another. Enter radar-video sensor fusion.

Sensor fusion enables the creation of an improved perceptive (3D) model of a vehicle’s surroundings, using a variety of sensory inputs. Based on that information and leveraging deep learning approaches, detected objects are classified into categories (e.g., cars, pedestrians, cyclists, buildings, and sidewalks). In turn, those insights form the basis of ADAS’ intelligent driving and anti-collision decisions.

Cooperative radar-video sensor fusion: the new kid on the block

Today’s most popular type of sensor fusion is called late fusion. It only fuses sensor data after each sensor has performed object detection and has made its own ‘decisions’ based on its own, limited collection of data. The main drawback of late fusion is that every sensor throws away all the data it deems irrelevant. As such, a lot of sensor fusion potential is lost. In practice, it might even cause a car to run into an object that has remained under a single sensor’s detection threshold.

In contrast, early fusion (or low-level data fusion) combines all low-level data from every sensor in one intelligent system that sees everything. The downside is that it requires large amounts of computing power and massive bandwidth – including high-bandwidth links from every sensor to the system’s central processing engine.

Fig 1: Late fusion: sensor data are fused after each individual sensor has performed object detection and has drawn its own ‘conclusions’. Source: imec.

Fig 2: Early fusion builds on all low-level data from every sensor – and combines those in one intelligent system that sees everything. Source: imec.

In response to these shortcomings, the concept of cooperative radar-video sensor fusion has been developed. It features a feedback loop, with different sensors exchanging low-level or middle-level information to influence each other’s detection processing. If a car’s radar system suddenly experiences strong reflection, for instance, the threshold of the on-board cameras will automatically be adjusted to compensate for this. As such, a pedestrian that would otherwise be hard to detect will effectively be spotted – without the system becoming overly sensitive and being subject to false positives.
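The difference between late fusion and the cooperative feedback loop can be illustrated with a minimal sketch. It is not imec's implementation: the `Candidate` structure, the score ranges, the thresholds, and the linear adjustment rule are all hypothetical, chosen only to show how evidence from one sensor might lower another sensor's detection threshold.

```python
from dataclasses import dataclass

# All names and numbers below are hypothetical, chosen only for illustration.
CAMERA_THRESHOLD = 0.5     # confidence a camera detection normally needs
RADAR_THRESHOLD = 0.5      # normalised reflection strength radar normally needs
MAX_THRESHOLD_SHIFT = 0.3  # how far radar evidence may lower the camera threshold


@dataclass
class Candidate:
    """One candidate object observed by both sensors."""
    label: str
    camera_score: float  # camera detector confidence in [0, 1]
    radar_score: float   # radar reflection strength, normalised to [0, 1]


def late_fusion(c: Candidate) -> bool:
    """Each sensor decides on its own; only the yes/no decisions are combined."""
    camera_says_yes = c.camera_score >= CAMERA_THRESHOLD
    radar_says_yes = c.radar_score >= RADAR_THRESHOLD
    return camera_says_yes or radar_says_yes


def cooperative_fusion(c: Candidate) -> bool:
    """Radar evidence lowers the camera threshold before the camera decides."""
    adjusted = CAMERA_THRESHOLD - MAX_THRESHOLD_SHIFT * c.radar_score
    return c.camera_score >= adjusted


dusk_pedestrian = Candidate("pedestrian", camera_score=0.40, radar_score=0.45)
sensor_noise = Candidate("noise", camera_score=0.20, radar_score=0.10)

for c in (dusk_pedestrian, sensor_noise):
    print(c.label, "late:", late_fusion(c), "cooperative:", cooperative_fusion(c))
```

With these made-up scores, the weak pedestrian evidence stays below both individual thresholds, so late fusion drops it, while the radar-adjusted camera threshold accepts it. The low-scoring noise candidate is rejected in both cases, which mirrors the point about not becoming overly sensitive.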

A 15% accuracy improvement over late fusion in challenging traffic & weather conditions  

Studies conducted in the course of last year already showed that cooperative sensor fusion outperforms the late fusion method commonly used today. On top of that, it is easier to implement than early fusion since it does not come with the same bandwidth issues and practical implementation limitations. Concretely, evaluated on a dataset of complex traffic scenarios in a European city center, cooperative sensor fusion was shown to track pedestrians and cyclists 20% more accurately than a camera-only system. What is more, its first moment of detection came a quarter of a second earlier than that of competing approaches. And over the past months, the system has been fine-tuned even more – improving its pedestrian detection accuracy even further, particularly in challenging traffic and weather conditions.

When applied to easy scenarios – i.e., in the daytime, without occlusions, and for scenes that are not too complex – the cooperative sensor fusion approach now comes with a 41% accuracy improvement over camera-only systems and a 3% accuracy improvement over late fusion. But perhaps even more important is the progress that has been made in the case of bad illumination, pedestrians emerging from occluded areas, crowded scenes, etc. After all, these are the instances when pedestrian detection systems really have to prove their worth. In such difficult circumstances, the gains brought by cooperative radar-video sensor fusion are even more impressive, featuring a 15% improvement over late fusion.

Fig 3: Comparing the F2 scores of various pedestrian detection methods, both in easy scenarios (graph on the left) and more challenging circumstances (graph on the right). The F2 score allows the systems’ accuracy to be assessed objectively, with a high weight being attributed to miss rates (false negatives). In both scenarios, cooperative radar-video sensor fusion outperforms its camera-only and late fusion contenders. Source: imec.
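To make the weighting in the F2 score concrete, here is a small sketch that assumes the standard F-beta definition with beta = 2. The detection counts are hypothetical and are not taken from the imec evaluation.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta score from true positives, false positives, false negatives.

    With beta = 2, each missed pedestrian (false negative) weighs four times
    as heavily in the denominator as a false alarm (false positive).
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical counts: two detectors that make the same number of errors,
# but of different kinds.
misses_only = f_beta(tp=90, fp=0, fn=10)        # 10 pedestrians missed
false_alarms_only = f_beta(tp=90, fp=10, fn=0)  # 10 false alarms instead

print(f"F2 with 10 misses:       {misses_only:.3f}")
print(f"F2 with 10 false alarms: {false_alarms_only:.3f}")
```

Under the balanced F1 score (beta = 1) these two hypothetical detectors would score the same. Under F2, the one that misses pedestrians scores lower, which is why the metric suits safety-critical detection.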

Significant latency improvements

When it comes to minimizing latency, or tracking delay, a lot of progress has been made as well. For example, in difficult weather and traffic conditions, a latency of 411ms is achieved. That is an improvement of more than 40% over both camera-only systems (725ms) and late fusion (696ms).

Fig 4: Evaluated on the intersection of criteria {‘Twilight,’ ‘Nighttime’}, ‘occluded’ and ‘many vulnerable road users (VRUs),’ the latency that comes with cooperative radar-video sensor fusion has been reduced to 411ms. Source: imec
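The "more than 40%" claim can be checked directly from the three latency figures quoted in the text. The helper name `relative_reduction` below is ours, introduced only for this calculation.

```python
def relative_reduction(baseline_ms: float, new_ms: float) -> float:
    """Latency reduction relative to a baseline, in percent."""
    return (baseline_ms - new_ms) / baseline_ms * 100


COOPERATIVE_MS = 411  # cooperative radar-video fusion, from the article
baselines = {"camera-only": 725, "late fusion": 696}  # from the article

for name, latency_ms in baselines.items():
    gain = relative_reduction(latency_ms, COOPERATIVE_MS)
    print(f"vs {name} ({latency_ms}ms): {gain:.1f}% lower latency")
```

This gives roughly 43.3% against camera-only systems and 40.9% against late fusion, so both comparisons clear the 40% mark.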

In pursuit of further breakthroughs

These gains show that cooperative sensor fusion holds great potential. In fact, it may very well become a serious competitor to today’s pedestrian detection technology, which typically makes use of more complex, more cumbersome, and more expensive lidar solutions.

And more breakthroughs in this domain are being pursued. Expanding cooperative radar-video sensor fusion to other cases, such as vehicle detection, is one important research track. Building clever systems that can deal with sudden defects or malfunctions is another. And the same goes for advancing the underlying neural networks.

For instance, one concrete shortcoming of today’s AI engines is that they are trained to detect as many vulnerable road users as possible. But that might not be the best approach to reduce the number of traffic fatalities. Is it really mandatory to detect that one pedestrian who has just finished crossing the road fifty meters in front of your car? Maybe not. Instead, those computational resources might better be spent elsewhere. Translating that idea into a neural network that prioritizes detecting the vulnerable road users in a car’s projected trajectory is a topic that warrants further research too.

Adding automatic tone mapping for automotive vision

As discussed, camera technology is a cornerstone of today’s ADAS. But it, too, has its shortcomings. Cameras based on visible light, for instance, perform poorly at night or in harsh weather conditions (heavy rain, snow, etc.). Moreover, regular cameras feature a limited dynamic range – which typically results in a loss of contrast in scenes with difficult lighting conditions.

Surely, some of those limitations can be offset by equipping cars with high dynamic range (HDR) cameras. But HDR cameras make signal processing more complex, as they generate high bitrate video streams that could very well overburden ADAS’ underlying AI engines.

Combining the best of both worlds, the concept of automatic tone mapping translates a high bitrate HDR feed into low bitrate, low dynamic range (LDR) images without losing any information crucial to automotive perception.
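As a rough illustration of what tone mapping does, the sketch below compresses 16-bit HDR pixel values to 8 bits in two ways: a plain linear rescale and a simple global logarithmic curve. This is a generic textbook-style operator, not the CNN-based approach described in this article, and the function names and pixel values are made up for the example.

```python
import math


def tone_map_linear(hdr_pixels: list[int], max_value: int) -> list[int]:
    """Naive rescale of linear HDR values in [0, max_value] to 8 bits."""
    return [round(255.0 * v / max_value) for v in hdr_pixels]


def tone_map_log(hdr_pixels: list[int], max_value: int) -> list[int]:
    """Global logarithmic curve: spends more of the 8-bit range on dark values."""
    scale = 255.0 / math.log1p(max_value)
    return [round(scale * math.log1p(v)) for v in hdr_pixels]


# Hypothetical 16-bit night scene: black background, two dark pixels on a
# pedestrian's clothing, and one saturated headlight pixel.
scene = [0, 50, 200, 65535]

print("linear:", tone_map_linear(scene, 65535))
print("log:   ", tone_map_log(scene, 65535))
```

The linear rescale squeezes both pedestrian pixels into the bottom one or two grey levels, so they become indistinguishable from the background, while the logarithmic curve keeps them clearly separated at the cost of detail near the headlight. A learned tone mapper would make that trade-off based on what matters for perception rather than with a fixed curve.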

Losing data does not (necessarily) mean losing information   

Tone mapping has existed for a while, with a generic flavor being used in today’s smartphones. But applying tone mapping to the automotive use case and pedestrian detection, in particular, calls for a whole new set of considerations and trade-offs. It raises the question of which data need to be kept and which data can be thrown away without risking people’s lives.

Fig 5: Evaluating the automatic tone mapping approach from imec – Ghent University (scenes on the right) against the use of existing tone mapping software (scenes on the left). Source: imec.

The result is visualized in the picture above. It consists of four scenes captured by an HDR camera. The images on the left-hand side have been tone mapped using existing software, while those on the right-hand side were compiled using a novel convolutional neural network (CNN) tone mapping approach developed by researchers from imec and Ghent University.

The lower left-hand side image shows the car’s headlights in great detail. But, as a consequence, no other shapes can be observed. In the lower right-hand side image, however, the headlights’ details might have been lost – but a pedestrian can now be discerned.

It is a perfect illustration of what automatic tone mapping can do. Using an automotive data set, the underlying neural network is trained to look for low-level image details that are likely to be relevant for automotive perception and throw away data deemed irrelevant. In other words: some data are lost, but all crucial information on the presence of vulnerable road users is preserved.

And yet another benefit of the novel neural-network-based approach is that various other features can be integrated as well – such as noise mitigation or debayering algorithms and ultimately even algorithms that remove artefacts caused by disturbing weather conditions (such as heavy fog or rain).

Displaying tone-mapped imagery in a natural way

Preserving life-saving information in an image is one thing. But displaying that information in a natural way is important too. It is yet another consideration that needs to be taken into account.

After all, generic tone-mapped images sometimes look very awkward. A pedestrian could be displayed brighter than the sun, halos could appear around cyclists, and colors could be boosted to unnatural, fluorescent levels. For an AI engine, such artefacts matter little. But human drivers should be able to make decisions based on tone-mapped imagery too, for instance when images are integrated into a car’s digital rearview camera or side mirrors.

A 10% to 30% accuracy improvement

The visual improvements brought by automatic tone mapping are apparent. But the technology also translates into measurable gains.

Research has shown, for instance, that the number of pedestrians that remain undetected drops by 30% compared to using a standard dynamic range (SDR) camera. The approach also outperforms the tone mapping approaches described in the literature by 10%. At first sight, that might not seem like a lot, but in practice, every pedestrian that goes by undetected poses a serious risk.

Jan Aelterman is a professor at the Image Processing and Interpretation (IPI) group, an imec research group at Ghent University (Belgium) – where he has specialized in image modeling and inverse problems. His research deals with image and video restoration, reconstruction, and estimation problems in application fields such as HDR video, MRI, CT, (electron) microscopy, photography, and multi-view processing.

David Van Hamme is a senior researcher at IPI, imec’s Image Processing and Interpretation research group at Ghent University, which he joined in 2007. His research topics include video segmentation, fire and object detection for industrial safety, industrial inspection, and intelligent vehicle perception systems, on which he obtained a PhD in 2016.