
TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction

Kojiro Takeyama1,2 Yimeng Liu1 and Misha Sra1 1Kojiro Takeyama, Yimeng Liu, and Misha Sra are with the Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106-5080, USA takeyama@ucsb.edu, yimengliu@ucsb.edu, sra@ucsb.edu2Kojiro Takeyama is with Toyota Motor North America, Ann Arbor, MI 48105-9748, USA kojiro.takeyama@toyota.com
Abstract

Accurate prediction of human behavior is crucial for AI systems to effectively support real-world applications, such as autonomous robots anticipating and assisting with human tasks. Real-world scenarios frequently present challenges such as occlusions and incomplete scene observations, which can compromise predictive accuracy; traditional video-based methods often struggle under these conditions because of their limited temporal and spatial perspectives. Large Language Models (LLMs) offer a promising alternative. Having been trained on large text corpora describing human behaviors, LLMs likely encode plausible sequences of human actions in a home environment. However, because LLMs are trained primarily on text data, they lack inherent spatial awareness and real-time environmental perception, and they struggle with physical constraints and spatial geometry. To make them effective in real-world spatial scenarios, we propose a multimodal prediction framework that enhances LLM-based action prediction by integrating physical constraints derived from human trajectories. Our experiments demonstrate that combining LLM predictions with trajectory data significantly improves overall prediction performance. This enhancement is particularly notable in situations where the LLM receives limited scene information, highlighting the complementary nature of linguistic knowledge and physical constraints in understanding and anticipating human behavior.

Project page:

https://sites.google.com/view/trllm?usp=sharing

Github repo:

https://github.com/kojirotakeyama/TR-LLM/blob/main/readme.md

I INTRODUCTION

Predicting human behavior is essential for AI systems to integrate seamlessly into our lives and provide effective support. This capability is particularly crucial for applications like support robots, which need to anticipate human actions and proactively perform tasks in a home environment. While traditional approaches have relied heavily on video data for action prediction [6, 16], they often fall short due to their limited temporal and spatial scope. Human behavior is influenced by complex environmental and personal factors that extend beyond what can be captured in video footage. These include temporal elements like time of day, environmental conditions such as room temperature, the state and position of objects in the scene, and human-specific attributes including personal characteristics, action history, and relationships with others present in the room. These factors exert significant influence on human behavior, as exemplified by the contrast between retrieving an item from the fridge during a time-constrained morning routine versus during a more leisurely evening meal preparation, or between weekday and weekend routines. To truly understand and predict human behavior, we must take this broader contextual landscape into account, moving beyond the constraints of conventional video-based methods.

Figure 1: Overview of our approach: We propose a multi-modal action prediction framework that incorporates both an LLM and human trajectories. The core idea is to integrate two different perspectives—physical and semantic factors—through an object-based action prediction framework to reduce uncertainties and enhance action prediction accuracy.

Recent studies have explored the use of Large Language Models (LLMs) for predicting human actions [20, 21]. These models leverage extensive knowledge of human behavior in home environments to forecast actions across diverse scene contexts and generally perform robustly even in zero-shot settings. However, observable scene contexts in the real world are often insufficient for predicting a person’s intentions accurately. For instance, scene context derived from sensor data (e.g., cameras, microphones) frequently suffers from limitations in coverage, sensitivity, or resolution, leading to increased uncertainty in predictions. This presents an inherent limitation in LLM-based action predictions.

To address this shortcoming, we propose a multi-modal prediction framework that enhances LLM-based action prediction by integrating physical constraints. Our approach uses a person’s past trajectory to infer their likely destination, thereby imposing physical constraints on their next target object and narrowing down potential actions (Fig.1). Recognizing that target area prediction is highly dependent on factors such as room layout, location, and speed, we leverage an indoor human locomotion dataset [33] to learn these complex relationships, enabling us to derive a reliable probabilistic distribution of target areas.

Unlike vision-language models (VLMs) [34], which are adept at recognizing and interpreting images and text yet often struggle to predict physical quantities under complex image constraints, our method incorporates a dedicated physically-aware prediction module trained on practical datasets to overcome these challenges.

The contribution of our work is described as follows.

  • Incorporating LLMs into human action anticipation, taking into account a wide variety of semantic scene contexts.

  • Proposing a multi-modal action prediction framework that combines LLMs with trajectory data, imposing physical constraints on the person’s next target object and helping to narrow down potential actions.

  • Building an evaluation dataset for our multi-modal approach, which includes scene maps, trajectories, and scene contexts.

  • Demonstrating that our method markedly enhances prediction performance compared with LLM and VLM baselines, especially in scenarios where the LLM has limited access to scene information.

II Related works

II-A Vision-based approach

Previous works have primarily focused on vision-based human action prediction. One line of research has focused on video-based action prediction, utilizing the recent past sequence of video frames to forecast future actions. Given the availability of large-scale action video datasets (e.g., Ego4D [2], Human3.6M [9], Home Action Genome [1]), many researchers have adopted machine-learning-based approaches [8, 7, 5, 4], often employing sequential models such as LSTM [6] or transformers [15, 14]. Some studies have integrated multimodal models to improve performance, combining video with other modalities (e.g., video-acoustic models [3], video-text models [11, 10]). Others have focused on object-based cues within the video to predict future actions [12, 13]. Another research direction involves predicting future motions from past motions, using only the human body’s pose data. For instance, [19] introduced a stochastic model for predicting diverse future actions with probability, while [17, 18] incorporated graph convolutional networks with attention mechanisms to improve prediction accuracy. Additionally, [16] proposed the use of a diffusion model to account for multi-person interactions.

In existing video-based action prediction research, a limitation lies in the spatial and temporal restrictions of scene context. Most studies focus on information within the limited field of view of ego-centric or third-person cameras, which only provides localized data. This narrow perspective hinders understanding of broader, more complex behaviors in real-world environments. Additionally, privacy concerns restrict the observation of individuals’ daily activities over extended periods in uncontrolled settings, limiting the ability to learn a diverse range of daily action patterns from existing datasets. To overcome these constraints, we propose leveraging large language models (LLMs), which encode vast everyday knowledge in natural language. By incorporating LLMs, we remove spatial and temporal limitations, enabling a more flexible and generalized approach to action prediction that captures broader patterns applicable to diverse real-world scenarios.

II-B Language model-based approach

With the rise of large language models (LLMs), many studies have leveraged LLMs to analyze social human behaviors. Notably, [23, 22] made a significant impact on the field by simulating town-scale, multi-agent human daily life, incorporating social interactions between agents throughout the entire pipeline using LLMs. This work has inspired subsequent efforts to extend the scale and complexity of such simulations [24, 25, 26]. However, a notable gap exists between their work and ours, as their focus is primarily on macroscopic-scale social behavior simulations. Additionally, their approach centers on intention-conditioned action generation within these simulations, which contrasts with our goal of predicting human actions in real-world scenarios.

To the best of our knowledge, only a few studies have explored LLM-based human action prediction. [20] utilized an LLM to predict the next human activity by identifying the object in the environment that the person is most likely to interact with. [21] focused on LLM-based human action prediction to enable a robot to plan corresponding supportive actions. In their approach, they provided the LLM with object lists and action history to predict the next human action. However, while they account for scene context, they do not consider physical constraints such as human motion or the spatial structure of the environment.

Furthermore, vision-language models (VLMs)[34] have recently emerged as a prominent technology. These models possess the capability to recognize and understand both images and text, and they are utilized in a wide range of applications. Although they could potentially be applied to our task, pre-trained VLMs generally lack the ability to predict physical quantities under complex image constraints since they are primarily trained to establish correspondences between images and text. In contrast, our method can evaluate complex image constraints and predict the target object by employing a physically-aware prediction module that operates independently of the LLM module.

III Method

III-A Problem definition

In daily life, individuals perform sequential actions while interacting with various objects. For instance, one might take a glass from a cupboard, fill it with a drink from the fridge, and then sit on the couch to drink. In this study, we focus on predicting the transition between these sequential actions, specifically forecasting the next action as a person moves toward the target object. We assume the scene is a room-scale home environment where we can observe various scene states except for the person’s internal states (Fig.2). The input comprises both text-based and image-based scene contexts. The text-based context includes semantic details—such as the time of day, object lists, a person’s action history, and conversations—while the image-based context provides physical scene information through a 2D image that shows an object layout map and the person’s trajectory. In our task, we first predict the target object using both contexts and then forecast the future action based on this prediction.
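To make the problem setup concrete, the following sketch shows one possible representation of the inputs described above; all field and type names are illustrative rather than the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SceneContext:
    """Illustrative container for the inputs described above (names are ours)."""
    # Text-based semantic context
    time_of_day: str            # e.g., "weekday, 7:30 am"
    object_list: List[str]      # objects present in the room
    action_history: List[str]   # person's recent actions
    conversation: str           # observed dialogue, possibly empty
    # Image-based physical context
    scene_map: np.ndarray       # H x W binary obstacle/walkable map
    trajectory: np.ndarray      # T x 2 observed (x, y) positions

# The task: predict the target object from both contexts,
# then predict the next action conditioned on that object.
```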

Figure 2: Scene states in a room

III-B Method overview

Fig.1 illustrates an overview of our approach. We propose a multi-modal action prediction framework that incorporates both an LLM and human trajectories. The core idea is to integrate two different perspectives—physical and semantic factors—through an object-based action prediction framework. Our framework consists of two primary steps: target object prediction (III-C) and action prediction (III-D). In the target object prediction step, we first utilize an LLM to predict a person’s target object based on the input scene context, generating a probability distribution over potential objects in the room from a semantic perspective (III-C1). Subsequently, we incorporate the person’s past trajectory to infer their likely destination, applying physical constraints to refine the prediction of the next target object (III-C2).

In the action prediction phase, the individual’s subsequent action is predicted using an LLM, based on the identified target object.

Further details of the approach are provided in the following sections.

III-C Target object prediction

III-C1 LLM-based prediction

Fig.3 shows the pipeline of LLM-based target object prediction. In the initial step, various scene contexts—such as time of day, the person’s state (including location, action history, and conversation), and a list of objects in the room—are provided as a text prompt and input into the LLM. Additionally, an order prompt is used to guide the LLM in generating a comprehensive, free-form list of potential target objects and their associated actions. This intermediate step facilitates the identification of probable target objects by considering their corresponding actions. Consequently, we derive a ranking of objects in the environment based on the target object candidates given in the previous step and the initial scene context prompt. The ranking is divided into four probability levels: A (high probability), B (moderate probability), C (low probability), and D (very low probability). The output from the LLM is subsequently converted into numerical scores and normalized into probabilities that sum to 1. In this study, scores of 15, 10, 5, and 1 were assigned to categories A, B, C, and D, respectively.
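As a concrete illustration, the sketch below converts the LLM's letter grades into a normalized probability distribution using the scores stated above (15, 10, 5, and 1); the function name and input format are ours, not the paper's implementation.

```python
def grades_to_probabilities(object_grades: dict[str, str]) -> dict[str, float]:
    """Convert LLM-assigned grades (A-D) into a normalized probability
    distribution over candidate objects, using the scores stated in the text."""
    score_map = {"A": 15.0, "B": 10.0, "C": 5.0, "D": 1.0}
    scores = {obj: score_map[grade] for obj, grade in object_grades.items()}
    total = sum(scores.values())
    return {obj: s / total for obj, s in scores.items()}

# Example: the LLM rates the fridge "A", the couch "C", and the sink "D".
probs = grades_to_probabilities({"fridge": "A", "couch": "C", "sink": "D"})
# -> {'fridge': 0.714..., 'couch': 0.238..., 'sink': 0.047...}
```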

Figure 3: Pipeline of LLM-based prediction

III-C2 Trajectory-based prediction

We utilize past trajectory data and scene map information to constrain the target area of a person. Since target area prediction is heavily influenced by factors such as room layout, the person’s location, and speed, we leverage the LocoVR dataset [33] to learn these intricate relationships. The LocoVR dataset consists of over 7,000 human trajectories captured in 131 indoor environments, with each trajectory segment spanning from a randomly assigned start to a goal location within the room. We employed a simple U-Net model to identify potential target areas based on past trajectories and scene maps. Specifically, the model takes as input a binary obstacle scene map (H×W×1) and a trajectory represented as points on time-series images (H×W×t), and it outputs the target object’s position as an image (H×W×1). During training, we compute the binary cross-entropy loss between the predicted and ground truth target object positions represented as images. The trained model predicts the probabilistic distribution of the target area, and we calculate the overlap between this predicted distribution and the spatial location of objects in the scene to estimate the probability of each object being the target.
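The final overlap step can be sketched as follows, assuming the U-Net outputs an H×W target-area heat map and each object is available as a binary pixel mask; the function name and normalization choice are illustrative.

```python
import numpy as np

def object_probs_from_heatmap(target_heatmap: np.ndarray,
                              object_masks: dict[str, np.ndarray]) -> dict[str, float]:
    """Estimate the probability of each object being the target by summing the
    predicted target-area distribution over that object's pixel region,
    then normalizing across all objects in the scene."""
    raw = {name: float((target_heatmap * mask).sum())
           for name, mask in object_masks.items()}
    total = sum(raw.values()) or 1.0   # guard against an all-zero heat map
    return {name: v / total for name, v in raw.items()}
```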

Fig.4 illustrates the process of trajectory-based target object prediction. In an early stage of trajectory (left image), the potential target area is broadly distributed. As the person progresses, the predicted target area and potential target objects are progressively narrowed down (center and right images). This refinement is based on a learned policy reflecting typical human behavior: when moving toward a specific goal, individuals generally avoid retracing their steps or taking unnecessary detours.

Figure 4: Trajectory-based target object prediction: At the beginning of the trajectory (left image), the potential target area is widely distributed. As the person advances, both the predicted target area and the range of potential target objects become progressively more focused (center and right images).

Since both LLM-based prediction and trajectory-based prediction assign probabilities to each object, we integrate these by multiplying their respective probabilities. This approach allows us to refine and identify target object candidates that are highly probable from both semantic and physical perspectives.
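A minimal sketch of this fusion step (names are ours): the two per-object distributions are multiplied element-wise and renormalized.

```python
def fuse_predictions(llm_probs: dict[str, float],
                     traj_probs: dict[str, float]) -> dict[str, float]:
    """Combine semantic (LLM) and physical (trajectory) evidence by multiplying
    per-object probabilities and renormalizing over all candidate objects."""
    objects = set(llm_probs) | set(traj_probs)
    fused = {obj: llm_probs.get(obj, 0.0) * traj_probs.get(obj, 0.0) for obj in objects}
    total = sum(fused.values()) or 1.0
    return {obj: p / total for obj, p in fused.items()}
```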

III-D Action prediction

At this stage, we employ the LLM to predict the action corresponding to the target object identified in the previous step. The input to the LLM includes the scene context and the predicted target object, and the LLM outputs the most plausible action the person is likely to perform within the given scene context.

IV Experiment

IV-A Evaluation data

To evaluate our multimodal method, we required input data with a text prompt describing the scene context, trajectory, and scene map. As no existing dataset met these criteria, we constructed a new dataset for evaluation.

Fig.5 illustrates the pipeline we employed to generate the evaluation dataset. We utilized the Habitat-Matterport 3D Semantics Dataset (HM3DSem) [27] to obtain scene map and object data in home environments. HM3DSem provides a diverse set of 3D models of home environments with semantic labels. From HM3DSem, we extracted scene maps represented as binary grids, where a value of 1 indicates walkable areas (within 0.3 meters of the floor height) and 0 corresponds to all other regions. These maps have dimensions of 256 × 256 pixels, mapping to a physical space of 10 m × 10 m. Additionally, we identified the pixel regions corresponding to each object within the scene map.
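For illustration, the stated resolution (256 × 256 pixels covering 10 m × 10 m, i.e., roughly 3.9 cm per pixel) implies the following coordinate conversion and walkability test; the map origin and axis conventions are assumptions on our part.

```python
import numpy as np

MAP_SIZE_PX = 256
MAP_SIZE_M = 10.0
M_PER_PX = MAP_SIZE_M / MAP_SIZE_PX   # ~0.039 m per pixel

def world_to_pixel(x_m: float, y_m: float) -> tuple[int, int]:
    """Map world coordinates (meters, origin assumed at a map corner)
    to pixel indices on the 256x256 scene map."""
    col = int(np.clip(x_m / M_PER_PX, 0, MAP_SIZE_PX - 1))
    row = int(np.clip(y_m / M_PER_PX, 0, MAP_SIZE_PX - 1))
    return row, col

def walkable_mask(height_map_m: np.ndarray, floor_height_m: float) -> np.ndarray:
    """Binary map: 1 where the surface lies within 0.3 m of the floor height."""
    return (np.abs(height_map_m - floor_height_m) <= 0.3).astype(np.uint8)
```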

Furthermore, we manually designed daily action scenarios, detailing factors such as day/time, persona, current location, target object, subsequent actions, conversations, and more. These scenarios were transformed into text prompts for input to the LLM. We generated a trajectory from the current location to the target object by leveraging a model trained on the LocoVR dataset, which incorporates the scene map to guide the path (visit our website [32] for details).

Finally, we obtained 67 data pairs from the combination of 10 action scenarios and 9 scene maps. The total number of pairs is lower than the combination count because cases where objects from the action scenarios were absent in the scene maps were excluded.

Figure 5: Pipeline of evaluation data generation

IV-B Benchmarks

Since action prediction using room-scale scene context is a novel task, few existing studies can be directly applied to our setting. We therefore introduce the following three baseline methods as benchmarks:

LLM: In this method, the LLM predicts the target object and its corresponding human action based on a text-based scene context. The input comprises day/time information, action history, conversation, and an object candidate list. The LLM produces scores for each target object and selects the predicted action associated with the highest-scoring target object. This approach is equivalent to the LLM-based action prediction modules described in [20, 21], though our implementation differs because the original code was unavailable.

VLM: We employed pre-trained VLM models to benchmark baseline performance on multimodal data. The VLM input consists of a text-based scene context and an image that displays the scene map and the trajectory. In the image, each object and trajectory is uniquely color-coded, and a color list of the objects (e.g., sink: cyan, table: gold, trajectory: red) is included in the text-based scene context to help the model accurately identify object locations and trajectories. The VLM then outputs scores for the target objects and selects the corresponding action for the most plausible object, similar to an LLM.
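As a concrete illustration of this input construction, the sketch below renders each object's mask in a unique color, overlays the trajectory in red, and produces the color legend appended to the text prompt. The function name, palette format, and rendering details are our own assumptions, not the paper's implementation.

```python
import numpy as np

def render_colored_map(object_masks: dict[str, np.ndarray],
                       palette: dict[str, tuple[int, int, int]],
                       trajectory_px: list[tuple[int, int]]) -> tuple[np.ndarray, str]:
    """Render each object's pixel region in a unique color, draw the trajectory
    in red, and return the image plus the color legend for the text prompt."""
    h, w = next(iter(object_masks.values())).shape
    image = np.full((h, w, 3), 255, dtype=np.uint8)      # white background
    for name, mask in object_masks.items():
        image[mask.astype(bool)] = palette[name]
    for r, c in trajectory_px:
        image[r, c] = (255, 0, 0)                         # trajectory: red
    legend = ", ".join(f"{name}: rgb{palette[name]}" for name in palette)
    legend += ", trajectory: rgb(255, 0, 0)"
    return image, legend
```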

Trajectory: We evaluated a standalone trajectory-based method, which is a component of our approach. This serves as an ablation study to assess how much our method improves target object prediction compared with the standalone trajectory-based method.

For both LLM and VLM, we used two types of pre-trained models—ChatGPT-4o-mini and Llama3.2—to evaluate performance across different model architectures.

IV-C Experimental setup

We provided three types of text-based scene context as input data to assess the model’s robustness under degraded conditions: (1) full scene context, (2) scene context without conversation, and (3) scene context without conversation and past action history. These variations simulate real-world scenarios where certain information may be unavailable. The evaluation examines how our method mitigates performance degradation, demonstrating its robustness to missing information in practical settings. Additionally, we evaluate performance under three trajectory progress conditions: d>1m, d>2m, and d>3m. The performance differences across these conditions reveal how trajectory information contributes to each method.
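A minimal sketch of how the trajectory progress condition could be computed, assuming d is the cumulative distance the person has traveled along the observed trajectory (function names are illustrative):

```python
import numpy as np

def traveled_distance(traj_xy_m: np.ndarray) -> float:
    """Cumulative path length (meters) of an observed trajectory of shape (T, 2)."""
    if len(traj_xy_m) < 2:
        return 0.0
    return float(np.linalg.norm(np.diff(traj_xy_m, axis=0), axis=1).sum())

def meets_progress_condition(traj_xy_m: np.ndarray, d_min_m: float) -> bool:
    """True if the person has already traveled more than d_min_m (e.g., 1, 2, or 3 m)."""
    return traveled_distance(traj_xy_m) > d_min_m
```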

IV-D Evaluation tasks

We conducted evaluations on both target object prediction and action prediction tasks. For both tasks, we conducted three trials for each evaluation and used the average of these trials as the final result to ensure reproducibility.

Target object prediction: This task serves as an essential intermediate module for action prediction. The objective is to identify the ground truth target object from a set of 30–40 objects in each scene. For evaluation, we used the top-5 accuracy metric, which considers a prediction correct if the ground truth object appears among the top five predicted objects.

Action prediction: In this task, we evaluate action prediction accuracy by computing the cosine similarity between the predicted actions and their corresponding ground truth values using the SBERT library.
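Both metrics can be sketched as follows; the use of the sentence-transformers package and the specific SBERT model name are assumptions, as the text only states that the SBERT library is used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def top5_accuracy(ranked_objects: list[list[str]], ground_truth: list[str]) -> float:
    """Fraction of samples whose ground-truth object appears in the top-5 predictions."""
    hits = sum(gt in ranked[:5] for ranked, gt in zip(ranked_objects, ground_truth))
    return hits / len(ground_truth)

def action_similarity(predicted: str, ground_truth: str,
                      model_name: str = "all-MiniLM-L6-v2") -> float:
    """Cosine similarity between SBERT embeddings of the predicted and
    ground-truth action descriptions."""
    model = SentenceTransformer(model_name)
    a, b = model.encode([predicted, ground_truth])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```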

IV-E Results

IV-E1 Quantitative result

Table I presents the accuracy of target object prediction. We draw the following conclusions from Table I:

(a) Our method outperforms both LLM and VLM across GPT and Llama models. Notably, as the available text-based scene context decreases, the prediction accuracy of both LLM and VLM deteriorates significantly, whereas our method maintains robust performance. This demonstrates that the physical cues extracted from trajectories effectively compensate for the reduced semantic context.

(b) VLM performance is comparable to—or slightly better than—that of LLM, and both degrade similarly as the available scene context decreases. These results suggest that pre-trained VLMs may struggle with solving physical problems under complex geometric constraints, which could limit their added value when combined with text-based scene contexts. This is because pre-trained VLMs are not designed to predict physical quantities under complex image constraints, as they are primarily trained to establish correspondences between images and text.

(c) Notably, our method’s performance significantly exceeds the average of the standalone LLM and Trajectory approaches, indicating that integrating LLM and trajectory information effectively compensates for individual weaknesses while leveraging their respective strengths.

(d) Regarding trajectory progress distance (d), performance improves as d increases because longer trajectories offer more spatial cues to narrow down potential destination areas, as shown in Fig.4. This trend is evident in both our method and the standalone trajectory approach. In contrast, VLM does not exhibit a clear correlation between traveled distance and accuracy, suggesting that it struggles to process physical information effectively.

Table II displays the performance on action prediction. A similar trend to that observed in target object prediction is evident, emphasizing that the target object is a key cue for forecasting future actions. These results underscore the effectiveness of our approach compared to other baselines.

TABLE I: Accuracy of target object prediction (top-5 accuracy, %)

Method                                All    wo/conv   wo/conv-hist
LLM-llama                             43.2   35.0      26.5
VLM-llama (d>1m)                      40.3   31.6      31.4
VLM-llama (d>2m)                      40.9   32.2      31.2
VLM-llama (d>3m)                      41.4   32.0      31.2
Ours, LLM-llama+Trajectory (d>1m)     47.9   45.9      41.4
Ours, LLM-llama+Trajectory (d>2m)     49.9   48.1      43.5
Ours, LLM-llama+Trajectory (d>3m)     51.4   49.8      46.2
LLM-gpt                               44.7   37.2      28.0
VLM-gpt (d>1m)                        50.9   40.9      27.8
VLM-gpt (d>2m)                        51.5   42.4      30.3
VLM-gpt (d>3m)                        51.6   43.0      31.0
Ours, LLM-gpt+Trajectory (d>1m)       54.6   51.2      44.3
Ours, LLM-gpt+Trajectory (d>2m)       56.4   53.5      47.3
Ours, LLM-gpt+Trajectory (d>3m)       58.6   56.3      49.1
Trajectory (d>1m)                     33.3   33.3      33.3
Trajectory (d>2m)                     36.7   36.7      36.7
Trajectory (d>3m)                     39.7   39.7      39.7
TABLE II: Accuracy of action prediction (cosine similarity)

Method                                All     wo/conv   wo/conv-hist
LLM-llama                             0.220   0.201     0.176
VLM-llama (d>1m)                      0.208   0.199     0.176
VLM-llama (d>2m)                      0.207   0.198     0.187
VLM-llama (d>3m)                      0.203   0.200     0.185
Ours, LLM-llama+Trajectory (d>1m)     0.218   0.210     0.207
Ours, LLM-llama+Trajectory (d>2m)     0.222   0.214     0.210
Ours, LLM-llama+Trajectory (d>3m)     0.227   0.218     0.213
LLM-gpt                               0.335   0.278     0.236
VLM-gpt (d>1m)                        0.368   0.313     0.256
VLM-gpt (d>2m)                        0.370   0.311     0.254
VLM-gpt (d>3m)                        0.370   0.305     0.243
Ours, LLM-gpt+Trajectory (d>1m)       0.387   0.353     0.310
Ours, LLM-gpt+Trajectory (d>2m)       0.397   0.365     0.322
Ours, LLM-gpt+Trajectory (d>3m)       0.409   0.379     0.331

IV-E2 Qualitative result

In this section, we first present visual comparisons between our method and baseline approaches in sample scenes (Fig.6), and then detail how target object prediction evolves as the trajectory progresses (Fig.7).

Fig.6 displays the target object prediction results using LLM-gpt, VLM-gpt, the standalone trajectory-based method (Trajectory), and our proposed method (LLM-gpt+Trajectory). The red point indicates the starting position of the trajectory, while the green line and point represent the observed trajectory and current position, respectively. The yellow distribution illustrates the predicted target area based on the observed trajectory, and the objects—color-coded from blue to red—indicate predicted target object probabilities from low to high. The figure shows that the LLM assigns high probabilities to several objects, including the target; however, some mispredictions persist due to the inherent difficulty of inferring a person’s intentions solely from the semantic scene context. Similarly, the VLM produces some incorrect predictions because of its limited capability for physically-aware prediction. Although the exact mechanism remains unclear for the pre-trained model, we observe that the VLM tends to assign relatively high probabilities to objects near the trajectory, suggesting that it does not actually predict the future target area. In contrast, the standalone trajectory-based prediction assigns high probabilities broadly around the ground truth target area, yet it struggles to accurately pinpoint the target object—especially when the trajectory has just begun. By leveraging both semantic and physical cues, our method narrows down the target object more effectively than either the LLM or the trajectory data alone.

Fig.7 illustrates the detailed behavior of the trajectory-based target object prediction by showing two distinct trajectory patterns over time within the same room layout. Initially (upper row), the target area probability distributions (shown in yellow) are broadly spread in both scenarios. As the trajectories progress (middle and bottom rows), these distributions narrow to regions near the ground truth target object, demonstrating the successful refinement process over time.

Figure 6: Target object prediction: LLM and VLM identify several object candidates, but their predictions include errors due to limited physically-aware capabilities. In contrast, while the standalone trajectory-based method roughly predicts target object areas, it lacks precision. Our method narrows down the target object more effectively by leveraging both semantic and physical cues.
Figure 7: Behavior of trajectory-based target object prediction over time: In the top row, the target area probability distributions (highlighted in yellow) are widely spread across both scenarios. As the trajectory progresses, the distributions in the middle and bottom rows become increasingly concentrated around the ground truth target object areas.

V Limitations and Future Work

Introducing Full-Body Motion: While this study focused on trajectories as the primary form of physical information, future work could explore full-body motion, which may provide more refined and highly available cues for action prediction. Incorporating signals such as hand, foot, head, and gaze movements could serve as valuable indicators for predicting actions even during periods without locomotion.

Application to Larger Environments: The experiments in this study were conducted in private home environments within a 10-meter-square area. Applying the proposed method to larger public spaces, such as schools, offices, stores, and stations, could give our framework greater lead time for its predictions. This would expand the potential for a wider range of service applications.

Performance enhancement: Further improvement in prediction accuracy is expected by independently enhancing both the LLM and trajectory prediction components. The refinement of each component would offer greater benefits through the integration of the two modalities.

Application to Service Development: A promising application of the proposed method lies in the development of intelligent systems that leverage LLMs to deliver context-aware and adaptive services tailored to predicted human actions. By anticipating user behavior, these systems could provide more timely and relevant responses, significantly enhancing user experience across various domains.

VI Conclusion

We leverage LLMs to predict human actions by incorporating diverse scene contexts in home environments. To improve robustness against insufficient scene observations, we propose a multimodal prediction framework that combines LLM-based action prediction with physical constraints derived from human trajectories. The key idea is to integrate two different perspectives—physical and semantic factors—through an object-based action prediction framework, which compensates for the limitations of each perspective and effectively refines the prediction. The experiments show that our method markedly enhances prediction performance compared with LLM and VLM baselines, especially in scenarios where the LLM has limited access to scene information. This improvement underscores the complementary roles of linguistic knowledge and physical constraints in comprehending and anticipating human behavior.

References

  • [1] N. Rai, H. Chen, J. Ji, R. Desai, K. Kozuka, S. Ishizaka, E. Adeli, and J. C. Niebles, ”Home action genome: Cooperative compositional action understanding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11184-11193.
  • [2] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V. Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, A. Gebreselasie, C. González, J. Hillis, X. Huang, Y. Huang, W. Jia, W. Khoo, J. Kolář, S. Kottur, A. Kumar, F. Landini, C. Li, Y. Li, Z. Li, K. Mangalam, R. Modhugu, J. Munro, T. Murrell, T. Nishiyasu, W. Price, P. Ruiz, M. Ramazanova, L. Sari, K. Somasundaram, A. Southerland, Y. Sugano, R. Tao, M. Vo, Y. Wang, X. Wu, T. Yagi, Z. Zhao, Y. Zhu, P. Arbeláez, D. Crandall, D. Damen, G. M. Farinella, C. Fuegen, B. Ghanem, V. K. Ithapu, C. V. Jawahar, H. Joo, K. Kitani, H. Li, R. Newcombe, A. Oliva, H. S. Park, J. M. Rehg, Y. Sato, J. Shi, M. Z. Shou, A. Torralba, L. Torresani, M. Yan, and J. Malik, ”Ego4D: Around the world in 3,000 hours of egocentric video,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18995-19012.
  • [3] Z. Zhong, D. Schneider, M. Voit, R. Stiefelhagen, and J. Beyerer, ”Anticipative feature fusion transformer for multi-modal action anticipation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2023, pp. 6068-6077.
  • [4] E. V. Mascaró, H. Ahn, and D. Lee, ”Intention-conditioned long-term human egocentric action anticipation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2023, pp. 6048-6057.
  • [5] Y. Wu, L. Zhu, X. Wang, Y. Yang, and F. Wu, ”Learning to anticipate egocentric actions by imagination,” IEEE Trans. Image Process., vol. 30, pp. 1143-1152, 2020.
  • [6] A. Furnari and G. M. Farinella, ”Rolling-unrolling LSTMs for action anticipation from first-person video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 11, pp. 4021-4036, 2020.
  • [7] B. Fernando and S. Herath, ”Anticipating human actions by correlating past with the future with Jaccard similarity measures,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13224-13233.
  • [8] D. Surís, R. Liu, and C. Vondrick, ”Learning the predictability of the future,” CoRR, vol. abs/2101.01600, 2021. [Online]. Available: https://arxiv.org/abs/2101.01600.
  • [9] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, ”Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325-1339, 2013.
  • [10] H. Mittal, N. Agarwal, S.Y. Lo, and K. Lee, ”Can’t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 18580-18590.
  • [11] Q. Zhao, S. Wang, C. Zhang, C. Fu, M. Q. Do, N. Agarwal, K. Lee, and C. Sun, ”AntGPT: Can large language models help long-term action anticipation from videos?,” arXiv preprint arXiv:2307.16368, 2023.
  • [12] C. Zhang, C. Fu, S. Wang, N. Agarwal, K. Lee, C. Choi, and C. Sun, ”Object-centric video representation for long-term action anticipation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 6751-6761.
  • [13] R.G. Pasca, A. Gavryushin, M. Hamza, Y.L. Kuo, K. Mo, L. Van Gool, O. Hilliges, and X. Wang, ”Summarize the past to predict the future: Natural language descriptions of context boost multimodal object interaction anticipation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 18286-18296.
  • [14] J. Wang, G. Chen, Y. Huang, L. Wang, and T. Lu, ”Memory-and-anticipation transformer for online action understanding,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 13824-13835.
  • [15] R. Girdhar and K. Grauman, ”Anticipative video transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 13505-13515.
  • [16] J. Tanke, L. Zhang, A. Zhao, C. Tang, Y. Cai, L. Wang, P.C. Wu, J. Gall, and C. Keskin, ”Social diffusion: Long-term multiple human motion anticipation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 9601-9611.
  • [17] W. Mao, M. Liu, and M. Salzmann, ”History repeats itself: Human motion prediction via motion attention,” in Proc. Comput. Vis. ECCV 2020: 16th Eur. Conf., Glasgow, U.K., Aug. 2020, pp. 474-489.
  • [18] W. Mao, M. Liu, M. Salzmann, and H. Li, ”Multi-level motion attention for human motion prediction,” Int. J. Comput. Vis., vol. 129, no. 9, pp. 2513-2535, 2021.
  • [19] S. Aliakbarian, F. S. Saleh, M. Salzmann, L. Petersson, and S. Gould, ”A stochastic conditioning scheme for diverse human motion prediction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5223-5232.
  • [20] M. A. Graule and V. Isler, ”GG-LLM: Geometrically grounding large language models for zero-shot human activity forecasting in human-aware task planning,” in 2024 IEEE Int. Conf. Robot. Autom. (ICRA), 2024, pp. 568-574.
  • [21] Y. Liu, L. Palmieri, S. Koch, I. Georgievski, and M. Aiello, ”Towards human awareness in robot task planning with large language models,” arXiv preprint arXiv:2404.11267, 2024.
  • [22] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, ”Generative agents: Interactive simulacra of human behavior,” in Proc. 36th Annu. ACM Symp. User Interface Softw. Technol., 2023, pp. 1-22.
  • [23] J. S. Park, L. Popowski, C. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, ”Social simulacra: Creating populated prototypes for social computing systems,” in Proc. 35th Annu. ACM Symp. User Interface Softw. Technol., 2022, pp. 1-18.
  • [24] C. Gao, X. Lan, Z. Lu, J. Mao, J. Piao, H. Wang, D. Jin, and Y. Li, ”S3: Social-network simulation system with large language model-empowered agents,” arXiv preprint arXiv:2307.14984, 2023.
  • [25] S. Li, J. Yang, and K. Zhao, ”Are you in a masquerade? Exploring the behavior and impact of large language model driven social bots in online social networks,” arXiv preprint arXiv:2307.10337, 2023.
  • [26] K. Zhao, M. Naim, J. Kondic, M. Cortes, J. Ge, S. Luo, G. R. Yang, and A. Ahn, ”Lyfe agents: Generative agents for low-cost real-time social interactions,” arXiv preprint arXiv:2310.02172, 2023.
  • [27] K. Yadav, R. Ramrakhya, S. K. Ramakrishnan, T. Gervet, J. Turner, A. Gokaslan, N. Maestre, A. X. Chang, D. Batra, M. Savva, and others, ”Habitat-Matterport 3D Semantics Dataset,” arXiv preprint arXiv:2210.05633, 2022. [Online]. Available: https://arxiv.org/abs/2210.05633.
  • [28] D. P. Kingma and J. Ba, ”Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [29] N. Tax, ”Human activity prediction in smart home environments with LSTM neural networks,” in 2018 14th Int. Conf. Intell. Environ. (IE), 2018, pp. 40-47.
  • [30] M.J. Tsai, C.L. Wu, S. K. Pradhan, Y. Xie, T.Y. Li, L.C. Fu, and Y.C. Zeng, ”Context-aware activity prediction using human behavior pattern in real smart home environments,” in 2016 IEEE Int. Conf. Autom. Sci. Eng. (CASE), 2016, pp. 168-173.
  • [31] S. Choi, E. Kim, and S. Oh, ”Human behavior prediction for smart homes using deep learning,” in 2013 IEEE RO-MAN, 2013, pp. 173-179.
  • [32] K. Takeyama. ”TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction”, TR-LLM project page. [Online]. Available: https://sites.google.com/view/tr-llm. Accessed: Sep. 16, 2024.
  • [33] K. Takeyama, Y. Liu, and M. Sra, ”LocoVR: Multiuser indoor locomotion dataset in virtual reality,” in Proc. Int. Conf. Learn. Represent., 2025.
  • [34] F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, et al., ”An introduction to vision-language modeling,” arXiv preprint arXiv:2405.17247, 2024.

-A Implementation details

-A1 LLM-based prediction

Here, we present the specific sequence of prompts input into the LLM. Fig.8 and Fig.9 illustrate the prompts used for LLM-based target object prediction, corresponding to the first and second prompts provided to the LLM as shown in Fig.3, respectively. Note that [predicted target objects and actions] in Fig.9 is derived from the results produced by the order prompt shown in Fig.8.

Fig.10 shows the order prompt used to predict an action given a target object. It is designed to generate a concise description of the action expected to occur at the specified target object.

Additionally, Fig.11 presents the order prompt used for scoring the predicted action during the evaluation stage. It outputs 1 if the predicted action is sufficiently similar to the ground-truth action, and 0 otherwise.

Figure 8: Order prompt to generate free-form list of potential target objects and their associated actions.
Figure 9: Order prompt to generate ranks of objects in the environment.
Figure 10: Order prompt to predict action based on the predicted target object.
Figure 11: Order prompt used to evaluate the correctness of action prediction outcomes.

-A2 Trajectory-based prediction

Training data We trained the model on the LocoVR dataset, which includes over 7,000 trajectories collected in 131 indoor environments. We split the data into training (85%) and validation (15%) sets.

Model and parameters We employed a five-layer U-Net model to predict the goal area from the past trajectory. The model parameters are listed below.

In the U-Net models, time-series trajectory data is represented in a multi-channel image format. Specifically, the 2D coordinates of a trajectory point are plotted on a blank 256×256-pixel image using a Gaussian distribution, with the time series encoded across multiple channels. The goal position is likewise encoded as an image and used as the ground truth, while the binary scene map is concatenated with the multi-channel trajectory image as input to the model.
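A sketch of this Gaussian point encoding; the standard deviation and normalization are assumptions not specified in the text.

```python
import numpy as np

def encode_point_as_gaussian(row: int, col: int, size: int = 256,
                             sigma: float = 3.0) -> np.ndarray:
    """Render a single 2D point as a Gaussian blob on a size x size image channel."""
    rr, cc = np.mgrid[0:size, 0:size]
    blob = np.exp(-((rr - row) ** 2 + (cc - col) ** 2) / (2.0 * sigma ** 2))
    return (blob / blob.max()).astype(np.float32)

def encode_trajectory(points_px: list[tuple[int, int]], size: int = 256) -> np.ndarray:
    """Stack one Gaussian-encoded channel per time step: output shape (T, size, size)."""
    return np.stack([encode_point_as_gaussian(r, c, size) for r, c in points_px], axis=0)
```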

We employed the Adam optimizer [28] to train the model, with a learning rate of 5e-5 and a batch size of 16. Each model is trained for up to 100 epochs on a single NVIDIA RTX 4080 graphics card with 8 GB of memory.

  • Input: (181×H×W)

    • Past trajectory of p1 for 90 epochs (90×H×W)

    • Past heading directions of p1 for 90 epochs (90×H×W)

    • Binary scene map (1×H×W)

  • Output: (1×H×W)

    • p1’s goal position (1×H×W)

  • Groundtruth: (1×H×W)

    • p1’s goal position (1×H×W)

  • Loss: BCELoss between the output and ground-truth

  • U-Net channels:

    • encoder: 256, 256, 512, 512, 512

    • decoder: 512, 512, 512, 256, 256

  • Calculation time for training: 30-35 hours on LocoVR
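Combining the optimizer settings and the input/output specification above, a minimal training-loop sketch could look as follows; the U-Net class itself and the dataset object are assumed to be defined elsewhere with the listed channel configuration.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train(model: nn.Module, train_set, epochs: int = 100) -> None:
    """Training loop matching the stated setup: Adam, lr 5e-5, batch size 16, BCE loss."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=5e-5)
    criterion = nn.BCELoss()
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    for epoch in range(epochs):
        for inputs, gt_goal in loader:       # inputs: (B, 181, H, W), gt_goal: (B, 1, H, W)
            inputs, gt_goal = inputs.to(device), gt_goal.to(device)
            optimizer.zero_grad()
            pred = model(inputs)             # assumed to end in a sigmoid, as required by BCELoss
            loss = criterion(pred, gt_goal)
            loss.backward()
            optimizer.step()
```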

-B Evaluation data

-B1 Scenes

We converted the 3D models from HM3DSem [27] into two-dimensional scene maps, incorporating semantic labels and the corresponding object pixel regions. Fig.12 illustrates these scene maps along with their associated object labels. Each scene contains approximately 30 to 40 distinct objects, with each object represented by a unique color.

Figure 12: Scene maps and object labels

-B2 Trajectory synthesis

To evaluate the performance of our multimodal human action prediction model, we utilized evaluation data that includes human trajectory information. Given that our method predicts target areas based on observed trajectories, it is crucial that the trajectory data be realistic in terms of both spatial configuration and velocity profile. To ensure this, we employed a machine learning-based approach for trajectory generation. Using the LocoVR dataset, which provides comprehensive data on start positions, target positions, scene maps, and trajectories, the model was trained to learn the relationships between these elements. Specifically, the model takes start positions, target positions, and scene maps as inputs and predicts the corresponding trajectories. By incorporating both trajectory shape and velocity variations into the loss function, the model is able to generate realistic position-velocity sequences.