LLM-TP: Enhancing Trajectory Prediction for Autonomous Driving with LLMs and Chain-of-Thought Prompting

Haicheng Liao*, Hanlin Kong*, Chengyue Wang, Junxian Yang et al.

1. Abstract

Accurate trajectory prediction is crucial for the safety of autonomous driving. In this study, we introduce LLM-TP (Large Language Model for Trajectory Prediction), a novel approach that enhances trajectory prediction through a chain-of-thought reasoning process tailored to specific traffic scenarios. By leveraging the power of large language models (LLMs), LLM-TP generates semantic annotations that significantly improve the understanding of complex traffic environments, thereby boosting prediction accuracy and robustness. We also present two new datasets, Highway-Text and Urban-Text, specifically designed to fine-tune lightweight language models for generating context-specific semantic annotations. This approach not only enhances prediction accuracy but also addresses inference cost concerns through efficient textual data handling. Comprehensive evaluations on real-world datasets (NGSIM, HighD, MoCAD, ApolloScape, and nuScenes) demonstrate that LLM-TP outperforms existing models, achieving improvements in prediction accuracy of 12.1%, 23.1%, 11.3%, and 5.1% on the NGSIM, HighD, MoCAD, and ApolloScape datasets, respectively. These results highlight LLM-TP's superiority in effectively handling complex traffic scenarios.

2. Dataset

This study contributes to the field of trajectory prediction by introducing two scene-description datasets: Highway-Text and Urban-Text. Together, they encompass over 10 million words describing various traffic scenarios. The Highway-Text dataset contains scene descriptions from 4,327 traffic scenarios derived from the Next Generation Simulation (NGSIM) dataset and 2,279 scenarios from the Highway Drone Dataset (HighD). Meanwhile, the Urban-Text dataset includes multi-agent scene descriptions from 3,255 samples in the Macao Connected Autonomous Driving (MoCAD) dataset and 2,176 samples from ApolloScape, covering diverse environments such as campus roads, urban roads, intersections, and roundabouts. Both datasets are divided into training (70%), validation (10%), and testing (20%) sets. To our knowledge, these are the first datasets in the field to leverage the linguistic capabilities of GPT-4 Turbo, using regularized CoT prompting to generate detailed semantic descriptions. These descriptions encompass interaction analysis, risk assessment, and movement prediction, offering a comprehensive semantic understanding of traffic scenarios.
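As a concrete illustration of this partition, a minimal Python sketch is given below. It assumes the scenario descriptions are stored as a list of JSON records; the file name and record layout are hypothetical placeholders, as the released datasets may use a different format.

    import json
    import random

    # Minimal sketch of the 70/10/20 split described above. The file name
    # "highway_text.json" is a placeholder, not the released data layout.
    with open("highway_text.json") as f:
        scenarios = json.load(f)  # one record per annotated traffic scenario

    random.seed(42)  # fixed seed so the partition is reproducible
    random.shuffle(scenarios)

    n = len(scenarios)
    n_train, n_val = int(0.7 * n), int(0.1 * n)

    train = scenarios[:n_train]
    val = scenarios[n_train:n_train + n_val]
    test = scenarios[n_train + n_val:]  # remaining ~20%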

A detailed presentation of our CoT prompting is given below. The dialogue is structured to follow a conventional cognitive process, progressing through four stages: Background and Statistics, Interaction, Risks, and Prediction. Within each stage, we systematically supply the LLM with specific domain knowledge and illustrative examples.
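To make the staged dialogue concrete, a rough Python sketch follows. It assumes the OpenAI Python client and a hypothetical scene serialization; the stage instructions are illustrative paraphrases, not the exact prompts used to build Highway-Text and Urban-Text.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Four stages mirroring the CoT progression described above. The wording
    # is illustrative, not the actual prompts used for the datasets.
    STAGES = [
        ("Background and Statistics",
         "Summarize the road layout and each agent's position, speed, and heading."),
        ("Interaction",
         "Analyze how the target agent interacts with surrounding agents."),
        ("Risks",
         "Assess potential collision risks and their severity."),
        ("Prediction",
         "Predict the target agent's likely maneuver and short-term motion."),
    ]

    def annotate_scene(scene_text: str) -> str:
        """Walk the LLM through the four CoT stages for one serialized scene."""
        messages = [
            {"role": "system",
             "content": "You are a traffic-scene analyst for autonomous driving."},
            {"role": "user", "content": f"Scene data:\n{scene_text}"},
        ]
        for name, instruction in STAGES:
            messages.append({"role": "user", "content": f"[{name}] {instruction}"})
            reply = client.chat.completions.create(
                model="gpt-4-turbo", messages=messages)
            answer = reply.choices[0].message.content
            messages.append({"role": "assistant", "content": answer})
        return answer  # the final Prediction-stage annotation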

3. Methodology Overview

An illustration of the Language-Instructed Encoder workflow is given below. The encoder performs a multimodal fusion of semantic annotations and spatio-temporal data, with the annotations generated by a fine-tuned language model (LM). This edge LM is trained on text data sampled from real-world datasets and labeled via CoT prompting with GPT-4 Turbo, enabling it to distill knowledge from LLMs and thereby inherit their powerful contextual learning capabilities.
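The sketch below shows one plausible version of this fusion in PyTorch: trajectory features attend to token-level embeddings of the LM's annotation. The dimensions, layer choices, and cross-attention design are assumptions for illustration, not the exact encoder architecture.

    import torch
    import torch.nn as nn

    class LanguageInstructedEncoder(nn.Module):
        """Illustrative fusion of LM annotations with spatio-temporal data."""

        def __init__(self, traj_dim=2, text_dim=1024, hidden=128, heads=4):
            super().__init__()
            self.traj_encoder = nn.GRU(traj_dim, hidden, batch_first=True)
            self.text_proj = nn.Linear(text_dim, hidden)  # project LM embeddings
            self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.out = nn.Linear(hidden, hidden)

        def forward(self, traj, text_emb):
            # traj: (B, T, 2) observed positions; text_emb: (B, L, text_dim)
            # token-level embeddings of the semantic annotation.
            h, _ = self.traj_encoder(traj)      # (B, T, hidden)
            txt = self.text_proj(text_emb)      # (B, L, hidden)
            # Trajectory tokens query the annotation tokens.
            fused, _ = self.cross_attn(query=h, key=txt, value=txt)
            return self.out(h + fused)          # residual fusion, (B, T, hidden)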

4. Experiment

Below is a comparison of four LMs: parameter count (a) and performance on Urban-Text (b) and Highway-Text (c). The metric is the F1 score from BERTScore.
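For reference, this F1 metric can be computed with the bert-score package, as in the minimal sketch below; the candidate and reference strings are placeholders, not samples from our datasets.

    from bert_score import score  # pip install bert-score

    # Placeholder annotation pair: a fine-tuned LM's output vs. the
    # GPT-4 Turbo reference description.
    candidates = ["The target vehicle decelerates and prepares a left turn."]
    references = ["The target agent slows down, indicating an upcoming left turn."]

    P, R, F1 = score(candidates, references, lang="en", verbose=False)
    print(f"BERTScore F1: {F1.mean().item():.4f}")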

Additionally, a comparison of the semantic annotations output by four different LMs for a specific traffic scene in the nuScenes dataset is given below. In the “Traffic Scene” subfigure, the target agent is marked in red, while the ground-truth trajectory in the “Ground Truth” subfigure is depicted in green. It is evident that the 0.13B-parameter GPT-Neo model cannot provide valuable semantic annotations based on the actual scene, whereas the other three LMs can understand complex urban scenarios and offer detailed semantic annotations, such as identifying the turning and deceleration trends of the target agent. Based on these comprehensive evaluations, we select Qwen-1.5 as the scene-understanding component of our trajectory prediction model.
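To reproduce this kind of annotation, a minimal sketch of querying a Qwen-1.5 chat checkpoint with Hugging Face Transformers is given below. The checkpoint name, system prompt, and scene string are assumptions for illustration; substitute the fine-tuned weights in practice.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Public checkpoint used for illustration only.
    name = "Qwen/Qwen1.5-1.8B-Chat"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

    scene = "Target agent at (3.2, 0.0) m, speed 7.8 m/s, approaching a T-junction."
    messages = [
        {"role": "system",
         "content": "Describe interactions, risks, and the agent's likely motion."},
        {"role": "user", "content": scene},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))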

Qualitative results comparing LLM-TP on the NGSIM dataset against its variant without the Language-Instructed Encoder (LI) and against WSiP are provided below. The semantic annotations provided by the fine-tuned LM for each scenario are also displayed at the bottom of the visualization.

Qualitative results comparing LLM-TP on the nuScenes dataset (without using HD map information) against its variant without the Language-Instructed Encoder are provided below. The semantic annotations provided by the fine-tuned LM for each scenario are also displayed at the bottom of the visualization.

5. Contact

If you have any questions, feel free to contact Hanlin Kong (hanlinkong@foxmail.com).