Monitoring Safety Properties for Autonomous Driving Systems with Vision-Language Models

Published:

Authors: Felipe Toledo, Sebastian Elbaum, Divya Gopinath, Ramneet Kaur, Ravi Mangal, Corina S. Pasareanu, Anirban Roy, and Susmit Jha

Venue: 2025 IEEE Engineering Reliable Autonomous Systems (ERAS)

Abstract:

With the increased adoption of autonomous vehicles comes the need to ensure they reliably follow safe driving properties. Formally specifying and monitoring such properties is challenging because of the semantic mismatch between the high-level properties (e.g., assertions on spatial relationships between the ego vehicle and other entities in a road scene) and the sensed inputs of the vehicles (e.g., raw pixels). For this reason, existing monitoring methods are applicable only in limited simulation settings where the ground-truth spatial relationships are available. To bridge this gap, we investigate the use of Vision-Language Models (VLMs) for extracting spatial relationships from real images of driving scenes. Towards this goal, we automate the extraction of triplets of the form <subject, relation, ego> from real image datasets such as nuScenes, Waymo, and KITTI to create DriST, a dataset of road-scene images annotated with corresponding triplets. We use DriST to evaluate the spatial reasoning capabilities of state-of-the-art VLMs in the driving domain. Our experiments show that, while standard VLMs have limited capability on this task, fine-tuning significantly improves their performance, raising the F1 score from 0.56 to 0.93 and demonstrating the utility of DriST. We then incorporate the improved VLM into monitors of safety properties specified in formal temporal logic. The study shows the potential of the approach: it detects most of the violations (27 out of 34) found with ground-truth data, while producing only four false positives. We make our dataset, evaluation, and trained VLMs available at https://github.com/less-lab-uva/DriST.
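
To give a sense of how such a monitor fits together, below is a minimal sketch of checking a "globally" (G) safety property over per-frame spatial triplets of the form <subject, relation, ego>, as a VLM-based extractor like the one described above might produce. The relation vocabulary, the example property, and the `query_vlm` stub are illustrative assumptions, not part of the DriST artifact or the authors' implementation.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, "ego")


def query_vlm(image_path: str) -> Set[Triplet]:
    """Placeholder for a fine-tuned VLM that extracts spatial triplets
    from one driving-scene image. Replace with a real model call."""
    raise NotImplementedError


def monitor_always(frames: Iterable[Set[Triplet]],
                   forbidden: Callable[[Set[Triplet]], bool]) -> List[int]:
    """Check a simple 'globally' safety property frame by frame and
    return the indices of frames where it is violated."""
    violations = []
    for i, triplets in enumerate(frames):
        if forbidden(triplets):
            violations.append(i)
    return violations


# Example property (illustrative relation name): a pedestrian must never
# be directly in front of the ego vehicle at close range.
def pedestrian_in_front_close(triplets: Set[Triplet]) -> bool:
    return ("pedestrian", "inFrontOf_near", "ego") in triplets


if __name__ == "__main__":
    # Hypothetical per-frame extractions standing in for VLM output.
    trace = [
        {("car", "toLeftOf", "ego")},
        {("pedestrian", "inFrontOf_near", "ego"), ("car", "behind", "ego")},
    ]
    print(monitor_always(trace, pedestrian_in_front_close))  # -> [1]
```

In practice, the per-frame triplet sets would come from the fine-tuned VLM applied to camera images, and the property would be written in a formal temporal logic rather than as an ad hoc predicate; the sketch only illustrates the overall extract-then-monitor flow.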

Download: [Pre-print] [Artifact]