Echo: A Large Language Model with Temporal Episodic Memory

WenTao Liu RuoHua Zhang Aimin Zhou Feng Gao JiaLi Liu

Abstract

Research on large language models (LLMs) has shown remarkable performance in domains such as mathematics, programming, and literary creation. However, most studies have focused on semantic memory-based question answering, neglecting LLMs’ potential to handle episodic memory (EM)-related queries. This oversight has led to suboptimal performance in applications requiring EM, including emotional companionship, personal AI assistants, and AI teachers. To address this gap, we introduce Echo, a LLM enhanced with temporal episodic memory. We propose a Multi-Agent Data Generation Framework that guides the model in generating multi-turn, complex scenario episodic memory dialogue data (EM-Train). Temporal information is innovatively incorporated into the LLM training process, and Echo is trained using the EM-Train. Furthermore, We develop an EM-Test benchmark specifically designed to evaluate LLMs’ episodic memory capabilities. The EM-Test assesses performance across various time spans and difficulty levels, providing a comprehensive evaluation of multi-turn episodic memory dialogues. Our experiments demonstrate that Echo significantly outperforms state-of-the-art LLMs on EM-Test. Additionally, a qualitative analysis reveals Echo’s potential to exhibit human-like episodic memory capabilities. We will open-source all datasets, code, and model weights.

Machine Learning, ICML

Refer to caption — Figure 1: The performance of LLMs across 7 time spans and and two difficulty levels in our EM-Test.

1 Introduction

Research on large language models (LLMs) has made significant advances in many fields (Naveed et al., 2023; Zhao et al., 2023), such as mathematical problem (Liu et al., 2023), programming (Zhang et al., 2023), and tool usage (Qin et al., 2024). However, these tasks primarily rely on semantic memory, with little focus on evaluating and enhancing the LLMs’ episodic memory capabilities.

Episodic memory is a crucial component of human memory (Tulving, 1983, 1972). As shown in Fig. 2, long-term memory (Squire, 2004) is mainly divided into declarative and non-declarative memory. Declarative memory further comprises semantic memory and episodic memory. Semantic memory involves the recollection of widely accepted concrete facts, which relate to knowledge independent of its context of acquisition (Moscovitch et al., 2016). It includes world knowledge, entity memory, language memory, and concept memory, among others. For example, ”China is in Asia” or ”1+2=3”. In contrast, episodic memory refers to time-related event memories centered around the individual, such as ”Last night, I bought tomatoes at Walmart”. In fact, episodic memory is not only a fundamental ability of humans but also a critical capability for LLMs, impacting their performance in any multi-turn Q&A scenarios, such as role-playing (Wang et al., 2023), psychological counseling (Ke et al., 2024), and AI teaching (Dan et al., 2023). Unfortunately, even the most advanced models (e.g., GPT-4) still perform poorly in terms of episodic memory, often suffering from logical inconsistencies and hallucinations.

Some methods (Zhong et al., 2024; Bärmann et al., 2024; Fountas et al., 2024; Packer et al., 2023; Gao & Zhang, 2024; Hu et al., 2023) have been proposed to enhance the long-term memory capabilities of LLMs. These methods primarily use external storage to retain historical records and design operations to help LLMs retrieve this information for responses. However, these approaches can be time-consuming due to the operations on external storage, and context information may be arbitrarily segmented, leading to information loss. Additionally, these methods do not improve the model’s inherent ability to process episodic memory. Episodic memory is thought to be constructive, meaning recall is the (re)construction of a past experience rather than the retrieval of a copy (Sprott, 1933; Schacter, 2012).

In practice, generative models have an inherent capability to construct and consolidate memories (Spens & Burgess, 2024). We argue that LLMs face a significant challenge in developing robust episodic memory capabilities due to the limited availability of high-quality episodic memory data. Such data is essential for training models to effectively handle complex, context-dependent interactions.

First, we propose MADGF, a innovative Multi-Agent Data Generation Framework. MADGF simulates and controls multi-turn scenario dialogues between multiple human roles and an AI assistant. The collected dialogue data, named EM-Train, is used to train our Echo model. In MADGF, three key components are designed: characters, plots, and environments. The design of characters and environments ensures a diverse range of dialogues, while plots guide the LLM to generate dialogue data with enhanced episodic memory capabilities. Additionally, the LLM’s training paradigm is modified by incorporating temporal information into each conversation, enriching the temporal background in the interaction process.

Next, we introduce EM-Test, a novel multi-turn dialogue benchmark designed to evaluate episodic memory capabilities. Each instance in EM-Test may contain multiple evaluation points, requiring the model not only to process long-context text effectively but also to recall, reason, and cognitively handle episodic memory information. Each evaluation point is tagged with both time and difficulty levels, enabling a comprehensive assessment. To reduce manual evaluation efforts, we propose an approach that assesses model performance based on semantic similarity. The feasibility and effectiveness of approach is validated by its strong correlation with human evaluations.

Finally, we conducted both quantitative and qualitative experiments. The quantitative results show that Echo significantly outperforms state-of-the-art LLMs on the EM-Test. Additionally, the qualitative analysis reveals Echo’s potential to exhibit human-like episodic memory capabilities.

2 Related Work

Methods for Enhancing Long-Term Memory Capabilities

Some methods have been proposed to enhance the long-term memory capabilities of large models, such as MemoryBank (Zhong et al., 2024), H-EMV (Bärmann et al., 2024), EM-LLM (Fountas et al., 2024), MemGPT (Packer et al., 2023), MS (Gao & Zhang, 2024), and CHATDB (Hu et al., 2023). These methods use external storage to retain historical information and design various operations to help LLMs utilize information.

MemoryBank (Zhong et al., 2024) introduces a novel memory mechanism specifically designed for LLM. This mechanism processes historical conversation to extract summary information and user portrait. When a user poses a question, the mechanism retrieves relevant information based on similarity and combines it with the summary information and user portrait, to form a Meta Prompt that assists the model in generating responses. EM-LLM (Fountas et al., 2024) adopts a similar method by incorporating key information into preceding prompts. This method effectively handles nearly unlimited context lengths while maintaining high computational efficiency. MemGPT (Packer et al., 2023) enables LLMs to perform tasks beyond the current context limits by simulating extended virtual memory through paging between physical memory and disk storage, akin to how operating systems manage memory to extend LLM context. MS (Gao & Zhang, 2024), H-EMV (Bärmann et al., 2024), and CHATDB (Hu et al., 2023) introduce distinct data structures designed for the storage of historical information: namely, a memory-sharing framework, a tree-based storage structure, and a specialized database, respectively. Each of these architectures facilitates the retrieval of pertinent historical data to support the response generation.

These methods require various operations on external storage that can be time-consuming. Moreover, they primarily focus on retrieving a copy of the data, rather than implementing the constructive nature of episodic memory (Sprott, 1933; Schacter, 2012), failing to enhance the model’s inherent ability to process episodic memory.

Methods for Data Generation utilizing LLM

Manually annotated data is expensive, so many methods (Xu et al., 2023; Luo et al., 2023; Zhao et al., 2024; Wang et al., 2022; Ding et al., 2023; Li et al., 2023) have been proposed to automate data generation utilizing LLMs. Besides obtaining data through user interactions on online platforms using ChatGPT, like WILDCHAT (Zhao et al., 2024), Self-Instruct (Wang et al., 2022) was one of the first to propose generating instructions, inputs, and outputs using LLMs to build instruction fine-tuning data. To increase the diversity of instructions, WizardLM (Xu et al., 2023) introduced an evolutionary instruction approach starting from a small set of seed instructions to generate more complex and diverse instruction. Further, WizardMath (Luo et al., 2023) incorporated a reward model to select better instruction data from multiple outputs, collecting higher-quality generated data. Additionally, some methods (Ding et al., 2023; Li et al., 2023) propose having LLMs play the roles of both AI assistant and user to collect data, which allows for the collection of multi-turn dialogues. UltraChat (Ding et al., 2023) uses this approach to extract instruction data covering various tasks, such as Questions about the World and Creation and Generation. In contrast, CAMEL (Li et al., 2023) focuses on generating instruction data for specific tasks, such as ”Develop a trading bot for the stock market.”

These LLM-based data generation methods primarily focus on extracting high-quality instruction fine-tuning data grounded in semantic memory from LLMs. In contrast, our MADGF mainly aims to simulate real-life scenarios to generate dialogue content rich in episodic memory.

3 Mutil-Agent Data Generation Framework

The purpose of the Multi-Agent Data Generation Framework (MADGF) is to design multiple human characters interacting with an AI assistant. Through simulating daily conversations, a large multi-turn dialogue dataset enriched with episodic memories is collected for the training of the Echo model. To enhance the diversity and effectiveness of the conversation content, we initially devised three key elements: characters, plots, and environments. Extensive character cards, plots, and temporal information were then generated. Subsequently, we formulated a data generation process that utilizes this information to guide the LLM in producing high-quality episodic memory data (EM-Train).

3.1 Characters, plots, and Environments

Characters

As illustrated in Figure 3, the design of character cards encompasses seven attributes: ”Name,” ”Occupation,” ”Age,” ”Gender,” ”Hobbies,” ”Personality,” and ”Social Relationships.” Specifically, we randomly generated attribute values for all attributes except for ”Social Relationships.” Subsequently, we utilized the LLM to generate the ”Social Relationships” attribute values based on the other six attributes.

Plots

The plots generated by LLMs differ significantly from actual real-life scenarios. Therefore, we manually created an event library, from which 20 events are sampled to form a plot. The library contains three types of events: common events, real events, and hallucinatory events. Common events are designed to enable the model to generate data based on semantic memory questions and answers while enriching the context. They include routine occurrences in daily life, such as greetings, inquiries about common knowledge, and discussions about career-related issues. Real events are events that have actually occurred and are related to episodic memory. They are used to prompt the human role to ask the Echo assistant if it remembers a related event. Hallucinatory events are fabricated events that have never occurred. They are used to prompt the human agent to ask the AI assistant about non-existent events and simultaneously remind the AI assistant not to be misled. Notably, since all event prompts are removed during the training of the Echo model, hallucinatory events help reduce the LLM’s tendency to generate false information and enhance the model’s understanding and reasoning abilities regarding episodic memory.

Environments

In the design of environments, we initially considered only temporal information. We first established a series of time-stamped nodes arranged in chronological order (e.g., Monday, September 4, 2006, 21:42:56, Monday, September 4, 2006, 21:55:38). These time-stamped nodes are then automatically added to the conversation history between the human role and the Echo assistant, indicating the time at which each round of dialogue takes place.

3.2 Data generation process

Prompt Design

As illustrated in Figure 4, we designed distinct prompt templates for both the human role and the Echo assistant. The highlighted sections in the figure are replaced with information from Section 3.1. Specifically:

•

Human Role Prompt: This includes the character card and all plot details, enabling the LLM to assume various human roles and engage in dialogues with the AI assistant according to different plots.
•

AI Assistant Prompt: This incorporates both hallucinatory plots and common plots. This setup helps the LLM acting as the AI assistant to reduce episodic memory hallucinations and proactively seek relevant information in a human-like manner.

Based on these prompt templates, we generate initial prompts for the human and the Echo assistant, denoted as $P_{u}$ and $P_{a}$ , respectively.

The Pseudocode of Data Generation Process

Algorithm 1 provides the pseudocode for the data generation process. We initialize and maintain two separate history records, $H_{u}$ and $H_{a}$ , for the human role and the AI assistant using initial prompts $P_{u}$ and $P_{a}$ , respectively. In lines 4-12 of Algorithm 1, we alternately control the two agents representing the human and the assistant to engage in dialogue. Temporal information is incorporated during the conversation in lines 6-8. We check if farewell phrases such as ”goodbye” or ”talk to you later” appear in the response. If any of these phrases are detected, or if the number of conversation rounds exceeds 60, the stopping criterion is considered to be met, and the current data generation process is terminated. Finally, we remove the initial prompt $P_{a}$ from $H_{a}$ to obtain the final dataset, denoted as $Data$ , which constitutes one piece of data in our EM-Train dataset.

Algorithm 1 Pseudocode of Data Generation Process

0: Initial prompts

P_{u}

and

P_{a}

Data

1: Initialize

H_{u}

H_{a}

H_{u}\leftarrow H_{u}\cup P_{u}

H_{a}\leftarrow H_{a}\cup P_{a}

4: while stopping criterion not met do

answer_{u}\leftarrow LLM(H_{u})

time\leftarrow RandomNextTime(time)

H_{u}\leftarrow H_{u}\cup answer_{u}\cup time

H_{a}\leftarrow H_{a}\cup answer_{u}\cup time

answer_{a}\leftarrow LLM(H_{a})

10:

H_{u}\leftarrow H_{u}\cup answer_{a}

11:

H_{a}\leftarrow H_{a}\cup answer_{a}

12: end while

13:

Data\leftarrow H_{a}\setminus P_{a}

4 Dataset

4.1 EM-Train and Training Paradigm

Based on MADGF in Section 3, we collected and created EM-Train. It consists of 15, 533 data entries, with an average of 16.75 conversation rounds per data entry and an average length of 8, 597 characters. Then, we trained the Echo model using the EM-Train dataset.

Compared to the conventional LLM training paradigm (user-assistant), we modified the training paradigm to user-time-assistant. As shown in Figure 5 (a), in traditional LLMs, the chat template for instruction fine-tuning alternates between two roles: user and assistant. In our modified approach, as highlighted in red in Figure 5 (b), we introduced an additional role, ”observation,” which includes temporal information. During training, the content of the observation does not participate in gradient updates, and the attention mask remains consistent with the traditional decoder-only method. During inference, whenever a user inputs a prompt, real-time information is automatically integrated into the context, enabling the creation of a time-aware AI assistant.

4.2 EM-Test

We manually developed a benchmark called EM-Test for evaluating the episodic memory of LLMs. Each test instance consists of multi-turn dialogues. In addition to dialogues that are not directly related to episodic memory testing, these dialogues may include multiple related historical dialogues and a corresponding test point, as shown in Figure 6. At each test point, we annotate the test question (Question), the temporal context (Observation), and the reference answer (Reference Answer).

During testing, we provide all historical dialogues as conversation history to the model. Then, we input the test question and the temporal context to obtain the model’s output. The model’s output is either manually scored or compared against the reference answer to quantitatively evaluate the episodic memory capabilities of the LLMs.

We labeled the time span and difficulty of the test points to achieve more granular results. In terms of time span, we categorized them into eight types based on the required duration of episodic memory for answering questions: ”just now,” ”one day,” ”few days,” ”one month,” ”few months,” ”one year,” ”few years,” and ”several decades.” We also divided the difficulty of test points into easy and hard levels. For an easy-level test point, the model only needs to recall a simple scenario. For a hard-level test point, the model must possess complex episodic memory capabilities. Figure 6 provides an example of a hard-level test point. In this example, to answer “Did I take any days off from school this year?”, the model needs to recall writing a leave note for the user and the timing of that event being in the same year as the current one, to deduce the correct answer.

Additionally, we manually created the EM-Test-Without-Time scenario test set to evaluate the model’s episodic memory ability without considering time information. Compared to EM-Test, EM-Test-Without-Time does not include temporal context and only considers easy and hard difficulty levels. Table 1 presents the relevant statistical information for both EM-Test and EM-Test-Without-Time.

Table 1: Detailed statistics of EM-Test.

Time Span	Easy	Hard	Total
EM-Test
just now	18	7	25
one day	5	5	10
few days	10	8	18
one month	4	4	8
few months	4	7	11
one year	5	4	9
few years	7	9	16
several decades	4	5	9
Overall Number	57	49	106
EM-Test-Without-Time
Overall Number	89	34	123

5 Experiment

5.1 Experimental Setups

Selected LLMs.

We evaluate a series of LLMs on EM-Test, including the current state-of-the-art open-source and closed-source models. Particularly, we select LLAMA3-8b (Dubey et al., 2024), ChatGLM3-6B (GLM et al., 2024) for open-source models, and for closed-source models, we employ GPT-3.5-turbo (OpenAI, 2023), GPT-4 (OpenAI, 2023), ChatGLM3-trubo (GLM et al., 2024).

Implementation Details.

In MADGF, the LLM serving as the agent is Qwen2-72B-Instruct (Yang et al., 2024), which is a high-performance open-source LLM. We use chatglm3-6B (GLM et al., 2024) as the base model for Echo, and implement it with full fine-tuning.

Evaluation Methods and Metrics.

In the quantitative analysis, we first collect the responses of LLMs at the test points, then ask human annotators to score these responses on a scale of 1 to 10. Additionally, we use the widely adopted Sentence Transformer model, all-MiniLM-L6-v2¹¹1https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, to encode the LLMs’ responses and the reference standard outputs provided by the test set, obtaining $\mathcal{E}_{\text{LLM}}$ and $\mathcal{E}_{\text{Standard}}$ . We then calculate the cosine similarity $\mathcal{S}$ using the Equation 1.

\mathcal{S}=cos\_sim(\mathcal{E}_{\text{LLM}},\mathcal{E}_{\text{Standard}})\times 100

(1)

We consider using the Pearson correlation coefficient, denoted by $\mathcal{R}$ , to measure the correlation between the human score and the similarity metric. It is calculated using Equation 2.

\mathcal{R}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}

(2)

where:

•

$x_{i}$ and $y_{i}$ are individual sample points indexed with $i$ ,
•

$\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ respectively,
•

$n$ is the number of sample points.

If the $\mathcal{R}$ value between two datasets is greater than 0.8, it is considered to be highly positively correlated (Cohen, 2013).

5.2 Quantitative Analysis

Overall Performance

Table 2: Comparison of model results based on human score. The first and second highest score are marked in red and blue. JN: just now, OD: one day, FD: few days, OM: one month, FM: few months, OY: one year, FY: few years, SD: several decades.

Models	Overall	JN	OD	FD	OM	FM	OY	FY	SD
Easy
ChatGLM3-6B	2.7	2.8	7.2	2.7	1.8	2.8	0.0	1.0	3.5
LLAMA3-8B	5.6	6.0	4.2	6.1	4.6	6.0	5.6	4.7	8.0
ChatGLM3-Turbo	3.9	4.5	5.6	3.5	3.0	2.0	4.2	3.3	5.0
GPT-3.5-turbo	5.2	5.4	4.6	6.6	5.3	5.8	4.6	4.1	5.6
GPT-4	5.8	5.0	6.6	6.1	5.0	5.8	6.0	4.0	7.8
Echo (Ours)	6.7	6.2	6.4	6.4	8.0	6.8	6.6	6.1	7.5
Mean Value (Easy)	5.0	5.0	5.8	5.2	4.5	4.8	4.5	3.9	6.2
Hard
ChatGLM3-6B	2.6	2.0	4.8	1.9	3.5	1.0	1.8	1.9	2.0
LLAMA3-8B	5.5	7.0	5.2	4.0	4.5	6.4	5.0	5.1	7.0
ChatGLM3-Turbo	3.5	2.7	6.2	2.4	1.8	4.9	4.3	3.2	3.0
GPT-3.5-turbo	4.0	5.7	4.4	2.9	1.8	6.1	4.8	3.0	3.0
GPT-4	5.4	5.7	5.4	5.0	3.5	5.6	6.0	5.1	7.0
Echo (Ours)	5.9	5.9	6.0	6.1	5.5	6.9	5.8	5.7	5.6
Mean Value (Hard)	4.5	4.8	5.3	3.7	3.4	5.1	4.6	4.0	4.6

Table 3: Comparison of model results based on similarity metric.

Models	Overall	JN	OD	FD	OM	FM	OY	FY	SD
Easy
ChatGLM3-6B	57.0	42.7	76.7	50.8	52.2	37.2	50.1	68.2	77.9
LLAMA3-8B	70.2	64.3	77.0	73.6	64.2	56.6	78.1	59.7	87.7
ChatGLM3-Turbo	58.3	45.6	68.0	66.2	43.5	40.4	57.4	63.0	81.9
GPT-3.5-turbo	74.8	68.2	66.5	90.4	74.8	56.7	74.9	85.8	81.5
GPT-4	72.3	63.6	72.3	76.8	64.2	53.5	81.0	80.8	86.5
Echo (Ours)	84.0	85.6	77.1	86.2	81.5	79.4	86.7	88.1	87.1
Mean Value (Easy)	68.4	60.7	71.9	73.0	62.4	53.0	70.4	73.3	82.8
Hard
ChatGLM3-6B	56.3	53.2	71.8	45.6	58.9	37.6	60.9	71.1	51.4
LLAMA3-8B	64.3	72.9	63.8	52.7	71.1	64.1	53.4	69.3	67.4
ChatGLM3-Turbo	52.5	55.2	59.2	44.3	46.9	55.0	40.6	66.9	51.8
GPT-3.5-turbo	65.1	76.1	63.3	52.0	71.8	63.8	53.8	74.0	66.1
GPT-4	67.7	71.9	66.8	59.1	70.7	68.9	64.3	70.5	69.8
Echo (Ours)	74.5	82.1	73.1	69.9	78.6	71.7	73.3	75.4	72.0
Mean Value (Hard)	63.4	68.6	66.3	53.9	66.4	60.2	57.7	71.2	63.1

We present the results of LLMs on human scores and similarity metrics in Table 2 and Table 3, respectively. Our experimental analysis is provided from the following aspects: performance of several LLMs, human score and similarity metrics, comparisons across easy and hard levels, comparisons across different time spans, and consistency between human score and similarity metric.

Performance of Several LLMs. It can be observed that our Echo model achieved the best performance in both Human Score and Similarity Metric. Specifically, it scored 6.7 and 5.9 in the easy and hard levels of Human Score, respectively, and 84.0 and 74.5 in the Similarity Metric. Over different time spans, Echo nearly obtained the best or second-best scores across all metrics, except for the easy level ”One day” and ”Several Decades” in Human Score. These results indicate that the Echo model excels in EM capability. In contrast, among all models, the open-source ChatGLM3-6B performed the worst overall. As the base model of Echo, this indirectly demonstrates the effectiveness of the EM-Train data generated using the MADGF framework in enhancing a model’s EM capability. Moreover, GPT-4 also showed excellent comprehensive performance, achieving second-best scores in the easy level of Human Score and the hard level of Similarity Metric. In the hard level of Human Score, GPT-4 (5.4) narrowly trailed behind LLAMA3-8B (5.5), which had the second-best performance. In the easy level of Similarity Metric, GPT-4 (72.3) slightly lagged behind LLAMA3-8B (74.8), which also had the second-best performance.

Comparisons across Easy and Hard Levels. Most LLMs perform well on easy-level problems, but their performance drops at the hard level. Specifically, we found that all LLMs performed better overall at the Easy Level compared to the Hard Level. For instance, in Table 2, GPT-3.5-turbo scored 5.2 at the Easy Level, but only 4.0 at the Hard Level. Additionally, we observed that the average score for all models at the Hard Level in Table 2 was 4.5, which is 0.5 lower than the Easy Level (5.0). Similar results were seen in Table 3.

Comparisons across Different Time Spans. Models exhibit different performances across various time spans. Due to the inconsistency in model performance between the human score and similarity metric, we first consider the common performance under both metrics, then focus on the analysis based primarily on the human score. On the easy time span, we found that models perform better on ”several decades” questions. This is because the mean values for ”several decades” are the highest for all models according to both metrics (6.2 for human score and 82.8 for similarity metric). Meanwhile, we observed that models perform poorly on certain time spans (few days, few months, few years), which may be attributed to the difficulty models have in understanding these temporal concepts.

Consistency Between Human Scores and Similarity Metrics. We consider calculating the Pearson correlation coefficient $\mathcal{R}$ for two metrics to observe their correlation. The overall results of the Human Score (i.e., 2.7, 5.6, …, 6.7) and the Similarity Metrics (i.e., 57.0, 70.2, …, 84.0) are used to compute similarity at the Easy Level, and similarly for the Hard Level. Using Equation 2, we obtained Pearson correlation coefficients $\mathcal{R}$ of 0.935 for the Easy Level and 0.842 for the Hard Level, both greater than 0.8. Therefore, the results of the two metrics can be considered highly positively correlated. Given that the Human Score requires expensive manual evaluation, Similarity Metrics can be considered as a cost-effective alternative for subsequent model evaluations. Additionally, we found inconsistencies in the performance evaluation of models by Human Score and Similarity Metrics across different time spans. We calculated the Pearson correlation coefficient $\mathcal{R}$ between the mean values of the two metrics over different time spans, resulting in coefficients of 0.429 (moderate correlation) and 0.003 (very weak correlation) for the Easy Level and Hard Level, respectively. Our hypothesis is that the EM-Test may have insufficient data points for each time span, leading to inadequate statistical samples. This insufficiency results in inconsistent outcomes between Human Scores and Similarity Metrics.

Performance in Episodic Memory Without Temporal Information

We tested the EM capability of the models without considering time information, as shown in Table 4. We found that the models perform similarly whether or not time information is considered. Our Echo model and GPT-4 still performed well, achieving the first and second highest scores, respectively; while ChatGLM3-6B continued to perform the worst. These results indicate that the dataset EM-Train, obtained using our MADGF framework, can effectively improve the EM capability of models even when time information is not considered. In addition, we provide extended experiments regarding the model’s temporal awareness and reasoning capability in the appendix for further analysis.

Table 4: Comparison of performance in episodic memory without temporal information.

Models	Easy	Hard
ChatGLM3-6B	79.9	66.9
LLAMA3-8B	89.0	73.3
ChatGLM3-Turbo	84.3	67.0
GPT-3.5-turbo	90.0	77.8
GPT-4	90.4	78.3
Echo (Ours)	96.0	81.8
Mean Value	88.3	74.2

5.3 Qualitative Analysis

Analysis of Complex Episodic Memory Ability

We conducted experiments on the Echo model using real-life scenarios that require complex Episodic Memory abilities, as shown in Figure 7. Some dialogues unrelated to episodic memory have been omitted using vertical ellipses. These dialogues are intended to increase the challenge of improving the model’s Episodic Memory ability in long texts. Additionally, some content details unrelated to episodic memory skills have also been omitted. To test the model’s performance over longer time spans, all time information was manually provided.

The test dialogue on the left side of Figure 7 demonstrates that the model can accurately recall recent conversation content and timing, indicating its ability to understand time and associate it with events, which is a sign of Episodic Memory capability. In the test questions on the right side, the model shows even stronger Episodic Memory ability by recalling what human characters ate on a specific day from lengthy historical records, and understanding and judging whether conversations took place during a certain period.

Analysis of Episodic Memory Ability Without Temporal Information

We tested Echo using questions that do not require considering time information for responses, as shown in Figure 8. From the test dialogue, it is clear that Echo can accurately recall the human character’s favorite band and food, and provide relevant information even after multiple rounds of dialogue. Additionally, in the final round of test questions, Echo did not confuse any content that we had not actually told it, avoiding the hallucination issue. This problem often occurs when conversing with other LLMs.

6 Conclusion

In this paper, we investigate the episodic memory capabilities of LLMs. We propose an innovative Multi-agent data generation framework to collect high-quality, context-rich fine-tuning data, named EM-Train, which we used to further train the Echo model. We innovatively introduce time information into the training paradigm of LLMs. We also develop a multi-round dialogue test set, EM-Test, to evaluate the episodic memory capabilities of LLMs. Experimental results show that EM-Train significantly improves the Episodic Memory of LLMs. The experiments also verify that LLMs can gain time perception and reasoning abilities by incorporating time information into their training paradigms. Furthermore, qualitative experimental analysis indicates that Echo exhibits some human-like episodic memory capabilities. Our research provides a preliminary exploration of complex episodic memory capabilities with temporal information for LLMs.

References

Bärmann et al. (2024) Bärmann, L., DeChant, C., Plewnia, J., Peller-Konrad, F., Bauer, D., Asfour, T., and Waibel, A. Episodic memory verbalization using hierarchical representations of life-long robot experience. arXiv preprint arXiv:2409.17702, 2024.
Cohen (2013) Cohen, J. Statistical power analysis for the behavioral sciences. routledge, 2013.
Dan et al. (2023) Dan, Y., Lei, Z., Gu, Y., Li, Y., Yin, J., Lin, J., Ye, L., Tie, Z., Zhou, Y., Wang, Y., et al. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint arXiv:2308.02773, 2023.
Ding et al. (2023) Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
Dubey et al. (2024) Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Fountas et al. (2024) Fountas, Z., Benfeghoul, M. A., Oomerjee, A., Christopoulou, F., Lampouras, G., Bou-Ammar, H., and Wang, J. Human-like episodic memory for infinite context llms. arXiv preprint arXiv:2407.09450, 2024.
Gao & Zhang (2024) Gao, H. and Zhang, Y. Memory sharing for large language model based agents. arXiv preprint arXiv:2404.09982, 2024.
GLM et al. (2024) GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Rojas, D., Feng, G., Zhao, H., Lai, H., Yu, H., Wang, H., Sun, J., Zhang, J., Cheng, J., Gui, J., Tang, J., Zhang, J., Li, J., Zhao, L., Wu, L., Zhong, L., Liu, M., Huang, M., Zhang, P., Zheng, Q., Lu, R., Duan, S., Zhang, S., Cao, S., Yang, S., Tam, W. L., Zhao, W., Liu, X., Xia, X., Zhang, X., Gu, X., Lv, X., Liu, X., Liu, X., Yang, X., Song, X., Zhang, X., An, Y., Xu, Y., Niu, Y., Yang, Y., Li, Y., Bai, Y., Dong, Y., Qi, Z., Wang, Z., Yang, Z., Du, Z., Hou, Z., and Wang, Z. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
Hu et al. (2023) Hu, C., Fu, J., Du, C., Luo, S., Zhao, J., and Zhao, H. Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901, 2023.
Ke et al. (2024) Ke, L., Tong, S., Chen, P., and Peng, K. Exploring the frontiers of llms in psychological applications: A comprehensive review. arXiv preprint arXiv:2401.01519, 2024.
Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
Li et al. (2023) Li, G., Hammoud, H., Itani, H., Khizbullin, D., and Ghanem, B. Camel: Communicative agents for” mind” exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023.
Liu et al. (2023) Liu, W., Hu, H., Zhou, J., Ding, Y., Li, J., Zeng, J., He, M., Chen, Q., Jiang, B., Zhou, A., et al. Mathematical language models: A survey. arXiv preprint arXiv:2312.07622, 2023.
Luo et al. (2023) Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., and Zhang, D. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
Moscovitch et al. (2016) Moscovitch, M., Cabeza, R., Winocur, G., and Nadel, L. Episodic memory and beyond: the hippocampus and neocortex in transformation. Annual review of psychology, 67(1):105–134, 2016.
Naveed et al. (2023) Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., and Mian, A. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023.
OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
Packer et al. (2023) Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.
Qin et al. (2024) Qin, Y., Hu, S., Lin, Y., Chen, W., Ding, N., Cui, G., Zeng, Z., Huang, Y., Xiao, C., Han, C., Fung, Y. R., Su, Y., Wang, H., Qian, C., Tian, R., Zhu, K., Liang, S., Shen, X., Xu, B., Zhang, Z., Ye, Y., Li, B., Tang, Z., Yi, J., Zhu, Y., Dai, Z., Yan, L., Cong, X., Lu, Y., Zhao, W., Huang, Y., Yan, J., Han, X., Sun, X., Li, D., Phang, J., Yang, C., Wu, T., Ji, H., Liu, Z., and Sun, M. Tool learning with foundation models, 2024. URL https://arxiv.org/abs/2304.08354.
Schacter (2012) Schacter, D. L. Constructive memory: past and future. Dialogues in clinical neuroscience, 14(1):7–18, 2012.
Spens & Burgess (2024) Spens, E. and Burgess, N. A generative model of memory construction and consolidation. Nature Human Behaviour, 8(3):526–543, 2024.
Sprott (1933) Sprott, W. Remembering: A study in experimental and social psychology., 1933.
Squire (2004) Squire, L. R. Memory systems of the brain: a brief history and current perspective. Neurobiology of learning and memory, 82(3):171–177, 2004.
Tan et al. (2023) Tan, Q., Ng, H. T., and Bing, L. Towards benchmarking and improving the temporal reasoning capability of large language models. arXiv preprint arXiv:2306.08952, 2023.
Tulving (1972) Tulving, E. Episodic and semantic memory. Organization of memory/Academic Press, 1972.
Tulving (1983) Tulving, E. Elements of episodic memory, 1983.
Wang et al. (2022) Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
Wang et al. (2023) Wang, Z. M., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Yang, J., et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746, 2023.
Xu et al. (2023) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
Yang et al. (2024) Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
Zhang et al. (2023) Zhang, Z., Chen, C., Liu, B., Liao, C., Gong, Z., Yu, H., Li, J., and Wang, R. Unifying the perspectives of nlp and software engineering: A survey on language models for code. arXiv preprint arXiv:2311.07989, 2023.
Zhao et al. (2024) Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. Wildchat: 1m chatgpt interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024.
Zhao et al. (2023) Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Zhong et al. (2024) Zhong, W., Guo, L., Gao, Q., Ye, H., and Wang, Y. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 19724–19731, 2024.

Appendix

Appendix A Implementation Details of MADGF

A.1 Plots used in Human Prompt Template

Table 5 presents the plots used in the human prompt template. It includes 20 plots, with the numbers of true episodic memories and hallucinatory episodic memories marked in blue and red, respectively. The final plot is fixed as ”say goodbye” to guide the conclusion of the conversation.

Table 5: Example of plots for human prompt.

No.	Plot Description
1	Ask what day of the week it is today
2	Request AI to inform you of the current date and time
3	Ask if we talked the day before yesterday, and if AI answers yes, then ask what topic we discussed
4	Ask a question about earth science
5	Ask AI for its name and call it by that name instead of AI from now on
6	Ask AI to remember your fitness plan
7	Ask what day the next working day is
8	Ask a piece of information you haven’t told AI before: the date you first attended an online course
9	Ask AI how it is feeling today
10	Ask a piece of information you haven’t told AI before: your cherished books
11	Inquire about AI’s perspective on the development of artificial intelligence
12	Ask AI to remember your grandfather’s favorite news source
13	Ask AI if it remembers your fitness plan
14	Ask a career-related question
15	Ask AI if it remembers your grandfather’s favorite news source, and if it does, ask when you shared it
16	Ask what day the last working day was
17	Ask a simple physics question
18	Ask a piece of information you haven’t told AI before: the date of your first marathon completion
19	Ask a piece of information you haven’t told AI before: your private collection inventory
20	Say goodbye

A.2 Hallucinatory Plots used in Assistant Prompt Template

Table 6 provides an example of hallucinatory plots used in the assistant prompt template, aimed at guiding the assistant to avoid hallucination issues. The example includes four memories that did not occur in actual conversations, corresponding to the plots marked in red (8, 10, 18, and 19) in Table 5.

Table 6: Example of hallucinatory plots for assistant prompt.

No.	Noteworthy Hallucinatory Plots
1	first day attending an online course
2	cherished books
3	first marathon completion date
4	private collection inventory

A.3 Common Plots used in Assistant Prompt Template

The common plots designed to prompt the AI assistant to proactively seek relevant information in a human-like manner. One example is ”name, old, hobby, gender”.

Appendix B Extended Experiments on Temporal Awareness and Reasoning Capability of the Model

To enhance and evaluate the temporal awareness and reasoning capabilities of the model, we have developed temporally aware and reasoning-enhanced training and testing datasets. We then conducted both quantitative and qualitative experimental analyses of Echo.

B.1 Temporal Reasoning Dataset

Training Dataset

We improved upon a portion of the training set proposed by Tan et.al (Tan et al., 2023) to create a dataset that emphasizes temporal awareness and reasoning. In the work by Tan et.al (Tan et al., 2023), the data were entirely synthesized programmatically, with questions being relatively simplistic, lacking inquiries about specific days of the week or recent dates. Utilizing both programming techniques and manual annotations, we constructed an 8K training dataset. The data format adheres to Echo’s training paradigm of user-time-assistant, making it highly suitable for Echo model training. Table 7 provides examples from our training dataset, which includes various complex scenarios for temporal reasoning questions, aiding in developing Echo’s robust temporal awareness and reasoning skills after training.

Table 7: Examples of temporal reasoning dataset.

Training Dataset
No.	Question, Time, Answer
1	Question: What is the time 10 years and 6 months after June 1856? Time: Monday, September 9, 2024, 16:03:45 Answer: The time is Dec, 1866
2	Question: What was the date a year ago today? Time: Friday, April 3, 2020, 5:04:46 Answer: Today is 2020-4-3, therefore a year ago today should be 2019-4-3.
3	Question: What day of the week was yesterday? Time: Friday, April 23, 2049, 23:30:07 Answer: Yesterday was Thursday.
4	Question: Will the day after tomorrow be a weekend? Time: Saturday, July 3, 2027, 19:17:33 Answer: No, the day after tomorrow will be Monday, July 5th, 2027.
5	Question:How can I use PHP to compare the existentialist philosophies of … Time: Tuesday, June 22nd, 2038, 11: 01: 39 Answer: As an AI assistant, I can provide guidance on how to approach the task, … Question: How long ago was our last chat? Time: Tuesday, June 22nd, 2038, 11: 31: 7 Answer: Our last conversation was just now, 30 minutes ago, at 11:1:39.
Evaluation Dataset
No.	Question, Time, Answer, key word
1	Question: 432 years ago today, which day was it? Time: Tuesday, September 3, 2013, 23: 42: 54 Answer: Today, 432 years ago, was 1581-9-3. key word: ”1581”, ”9\|Sep\|September”, ”3”

Evaluation Dataset

We manually annotated a temporally aware and reasoning-enhanced evaluation dataset consisting of 292 instances, including 32 short-term (within one week) and 260 long-term test questions, as shown in Table 8. Each test question provides all possible keywords contained in the standard answer, allowing for accurate quantitative analysis of whether the model’s output is correct through string matching.

Table 8: Statistics of the evaluation dataset for temporal reasoning

Time Span	Short	Long	Total
Overall Number	32	260	292

B.2 Quantitative Analysis of Temporal Perception and Reasoning Ability

On the Evaluation Dataset for Temporal Reasoning, we conducted a quantitative analysis. As in Section 5, we selected LLAMA3-8b (Dubey et al., 2024) and ChatGLM3-6B (GLM et al., 2024) for open-source models, and for closed-source models, we employed GPT-3.5-turbo (OpenAI, 2023), GPT-4 (OpenAI, 2023), and ChatGLM3-turbo (GLM et al., 2024) for evaluation and comparison. When calculating the metrics, we detected keywords within the models’ responses.

As shown in Table 9, we found that our Echo model still performs the best, with time-aware and reasoning abilities exceeding 90 in both short-term (98.1) and long-term (94.6) scenarios. In contrast, ChatGLM3-6B performed very poorly, with time-aware and reasoning abilities below 10 in both short-term (9.4) and long-term (8.8) scenarios. This indicates that the EM-Train dataset significantly improves the time-aware and reasoning capabilities of the models. Additionally, we observed that GPT-4 achieved suboptimal performance on long-term tests, but did not achieve suboptimal performance on short-term tests. Upon examining the model’s outputs, we noticed that GPT-4 tends to produce errors and hallucinations in short-term temporal reasoning. For example, the correct answer was ”The date the day before yesterday was July 1st, 2023.”, but GPT-4’s output was ”The day before yesterday would have been July 2, 2023.”.

Table 9: Comparison of performance in time perception and reasoning.

Models	Short-term	Long-term
ChatGLM3-6B	9.4	8.8
LLAMA3-8B	90.6	65.0
ChatGLM3-Turbo	46.9	34.6
GPT-3.5-turbo	56.2	78.1
GPT-4	78.1	93.8
Echo (Ours)	98.1	94.6
Mean Value	63.2	62.5

B.3 Qualitative Analysis of Temporal Perception and Reasoning Ability

We present the qualitative analysis results of Echo in Figure 9. It is evident that the model can perceive the current time and perform reasoning tasks, such as correctly answering questions about the current season and how many years have passed since the first moon landing. Additionally, the model can also perceive and reason about past and future times. For example, it accurately answered questions about what the date will be 100 years and 20 years from now, how long until November, and how much time has passed since the first chat session.