
Actional Atomic-Concept Learning for Demystifying
Vision-Language Navigation

Bingqian Lin1, Yi Zhu2, Xiaodan Liang1,3, Liang Lin4, Jianzhuang Liu2
(Part of this work was done during an internship in Huawei Noah’s Ark Lab. Corresponding author.)
Abstract

Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position. Most existing VLN agents directly learn to align the raw directional features and visual features trained using one-hot labels to linguistic instruction features. However, the large semantic gap among these multi-modal inputs makes the alignment difficult and therefore limits the navigation performance. In this paper, we propose Actional Atomic-Concept Learning (AACL), which maps visual observations to actional atomic concepts for facilitating the alignment. Specifically, an actional atomic concept is a natural language phrase containing an atomic action and an object, e.g., “go up stairs”. These actional atomic concepts, which serve as the bridge between observations and instructions, can effectively mitigate the semantic gap and simplify the alignment. AACL contains three core components: 1) a concept mapping module to map the observations to the actional atomic concept representations through the VLN environment and the recently proposed Contrastive Language-Image Pretraining (CLIP) model, 2) a concept refining adapter to encourage more instruction-oriented object concept extraction by re-ranking the predicted object concepts by CLIP, and 3) an observation co-embedding module which utilizes concept representations to regularize the observation representations. Our AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks. Moreover, the visualization shows that AACL significantly improves the interpretability of action decisions.

Figure 1: Comparison between existing VLN agents and the proposed AACL. Through mapping the visual observations to actional atomic concepts, AACL can simplify the multi-modal alignment and distinguish different observation candidates easily to make accurate action decisions.

Introduction

Vision-Language Navigation (VLN) (Anderson et al. 2018; Ku et al. 2020; Chen et al. 2019; Nguyen and Daumé 2019) has attracted increasing interest in robotic applications since an instruction-following navigation agent is practical and flexible in real-world scenarios. To accomplish successful navigation, a VLN agent needs to align complicated visual observations to language instructions to reach the required target position. For example, when asked to “turn left to the bathroom”, the agent should choose the observation which not only contains the mentioned object “bathroom” but also matches the direction “turn left”.

Most early VLN approaches adopt the LSTM-based encoder-decoder framework (Fried et al. 2018; Tan, Yu, and Bansal 2019; Wang et al. 2019; Ma et al. 2019; Zhu et al. 2020), which encodes both the visual observations and language instructions and then generates the action sequence. With the development of large-scale cross-modal pretraining models in vision-language tasks (Li et al. 2020a, b; Chen et al. 2020; Lu et al. 2019), emerging works attempt to introduce them into VLN tasks (Hao et al. 2020; Hong et al. 2021; Chen et al. 2021; Moudgil et al. 2021). However, both non-pretraining-based and pretraining-based approaches represent the visual observations by raw directional features and visual features trained with one-hot labels, which are difficult to align to the linguistic instruction features due to the large semantic gap among them. This direct alignment also leads to poor interpretability of action decisions and therefore makes VLN agents unreliable for deployment in real environments.

In this work, we aim to mitigate the semantic gap and simplify the alignment in VLN by proposing a new framework, called Actional Atomic-Concept Learning (AACL). Since instructions usually consist of atomic action concepts, e.g., “turn right”, and object concepts (in this work, we also treat scene concepts, e.g., “bathroom”, as object concepts), e.g., “kitchen”, in AACL the visual observations are mapped to actional atomic concepts, which are natural language phrases each containing an action and an object. The actions are drawn from a predefined atomic action set. These actional atomic concepts, which can be viewed as the bridge between observations and instructions, effectively facilitate the alignment and provide good interpretability for action decisions.

AACL consists of three main components. Firstly, a concept mapping module is constructed to map each single-view observation to an actional atomic concept. For deriving the object concept, we resort to the recently proposed Contrastive Language-Image Pretraining (CLIP) model (Radford et al. 2021) rather than image classification or object detection models pretrained on a fixed category set. Benefiting from the powerful open-world object recognition ability of CLIP, AACL can better generalize to diverse navigation scenarios. The sequential direction information in the VLN environment during navigation is mapped to the action concept. Secondly, to encourage more instruction-oriented object concept extraction for facilitating the multi-modal alignment, a concept refining adapter is further introduced to re-rank the predicted object concepts of CLIP according to the instruction. Lastly, an observation co-embedding module embeds each observation and its paired actional atomic concept, and then uses the concept representations to regularize the observation representations through an observation contrast strategy. Figure 1 presents an action selection comparison between existing VLN agents and our AACL. Through mapping visual observations to actional atomic concepts formed by language, AACL can simplify the modality alignment and distinguish different action candidates more easily to make correct action decisions.

We conduct experiments on several popular VLN benchmarks, including one with fine-grained instructions (R2R (Anderson et al. 2018)) and two with high-level instructions (REVERIE (Qi et al. 2020) and R2R-Last (Chen et al. 2021)). Experimental results show that our AACL outperforms the state-of-the-art approaches on all benchmarks. Moreover, benefiting from these actional atomic concepts, AACL shows excellent interpretability in making action decisions, which is a step closer towards developing reliable VLN agents for real-world applications.

Figure 2: Overview of our Actional Atomic-Concept Learning (AACL). At each timestep $t$, the agent receives the instruction $I$, the observation $O_t$, and the navigation history $H_t$. For each $O_{t,n}$ in $O_t$ containing the single-view image $B_{t,n}$ and the direction $A_{t,n}$, object concept mapping and action concept mapping are conducted based on the concept refining adapter to obtain the actional atomic concept representations $\mathbf{\tilde{u}}_{t,n}$. Then $\mathbf{\tilde{u}}_{t,n}$ is used to regularize the visual representation $\mathbf{v}_{t,n}$ and the directional representation $\mathbf{e}_{A_{t,n}}$ through the observation co-embedding module for action selection. For simplicity, we omit the learning process of the navigation history $H_t$, which is similar to that of the observations $O_t$.

Related Work

Vision-Language Navigation. Developing navigation agents which can follow natural language instructions has attracted increasing research interest in recent years (Anderson et al. 2018; Ku et al. 2020; Chen et al. 2019; Nguyen and Daumé 2019; Qi et al. 2020). Most early Vision-Language Navigation (VLN) approaches employ the LSTM-based encoder-decoder framework (Fried et al. 2018; Tan, Yu, and Bansal 2019; Wang et al. 2019; Ma et al. 2019; Zhu et al. 2020; Wang, Wu, and Shen 2020; Qi et al. 2020; Fu et al. 2020). Speaker-Follower (Fried et al. 2018) introduces a novel instruction augmentation strategy to mitigate the annotation burden of high-quality instructions. EnvDrop (Tan, Yu, and Bansal 2019) employs environmental dropout to augment limited training data by mimicking unseen environments. Due to the success of Transformer-based cross-modal pretraining (Li et al. 2020a, b; Chen et al. 2020; Lu et al. 2019; Li et al. 2019), recent works have introduced transformer architectures into VLN tasks (Hao et al. 2020; Hong et al. 2021; Chen et al. 2021; Moudgil et al. 2021; Qi et al. 2021; Guhur et al. 2021). HAMT (Chen et al. 2021) develops a history-aware multimodal transformer to better encode the long-horizon navigation history. DUET (Chen et al. 2022) constructs a dual-scale graph transformer for joint long-term action planning and fine-grained cross-modal understanding. HOP (Qiao et al. 2022) introduces a new history-and-order aware pretraining paradigm for encouraging the learning of spatio-temporal multimodal correspondence. However, these pretraining-based methods still learn to align the raw directional features and visual features trained with one-hot labels to the linguistic instruction features, leading to limited performance due to the large semantic gap among these multi-modal inputs.

In contrast to the above-mentioned VLN approaches, in this work we build a bridge among the multi-modal inputs to facilitate the alignment by introducing actional atomic concepts formed by language. Through these actional atomic concepts, the alignment can be significantly simplified and good interpretability can be provided.

Contrastive Language-Image Pretraining (CLIP). CLIP (Radford et al. 2021) is a cross-modal model pretrained on 400 million image-text pairs collected from the web. Through natural language supervision (Jia et al. 2021; Sariyildiz, Perez, and Larlus 2020; Desai and Johnson 2021) rather than one-hot labels over a fixed set of object categories, CLIP has shown great potential in open-world object recognition. Recently, many works have introduced CLIP into various computer vision (CV) and vision-language (V&L) tasks to improve the generalization ability of downstream models (Song et al. 2022; Subramanian et al. 2022; Khandelwal et al. 2022; Shen et al. 2022; Rao et al. 2022; Dai et al. 2022; Liang et al. 2022). DenseCLIP (Rao et al. 2022) introduces CLIP into dense prediction tasks, e.g., semantic segmentation, by converting the original image-text matching in CLIP to pixel-text matching. (Dai et al. 2022) distills the vision-language knowledge learned in CLIP to enhance multimodal generation models. EmbCLIP (Khandelwal et al. 2022) investigates the ability of CLIP’s visual representations to improve embodied AI tasks. Some works have also applied CLIP to VLN tasks (Liang et al. 2022; Shen et al. 2022). ProbES (Liang et al. 2022) utilizes the knowledge learned by CLIP to build an in-domain dataset via self-exploration for pretraining. (Shen et al. 2022) replaces the ImageNet-pretrained ResNet visual encoder in conventional VLN models with the pretrained CLIP visual encoder.

In this paper, we resort to the powerful object recognition ability of CLIP to provide object concepts for each single-view observation. To encourage instruction-oriented object concept extraction for better alignment, a concept refining adapter is further introduced on top of CLIP to re-rank its predicted object concepts according to the instruction.

Table 1: Atomic Action Concept Mapping.
Heading $\tilde{\psi}_{t,n}$: $(-2\pi, -3\pi/2]$ | $(-3\pi/2, -\pi/2)$ | $[-\pi/2, 0)$ | $0$ | $(0, \pi/2]$ | $(\pi/2, 3\pi/2)$ | $[3\pi/2, 2\pi)$
Elevation $\tilde{\theta}_{t,n} > 0$: go up (for any heading)
Elevation $\tilde{\theta}_{t,n} < 0$: go down (for any heading)
Elevation $\tilde{\theta}_{t,n} = 0$: turn right | go back | turn left | go forward | turn right | go back | turn left

Preliminaries

VLN Problem Setup

Given a natural language instruction $I=\{w_1,...,w_l\}$ with $l$ words, a VLN agent is asked to navigate from a start viewpoint $S$ to the goal viewpoint $G$. At each timestep $t$, the agent receives a panoramic observation containing $N_o$ image views $O_t=\{O_{t,n}\}_{n=1}^{N_o}$. Each $O_{t,n}$ contains the image $B_{t,n}$ and the attached direction information $A_{t,n}$. The visual feature $\mathbf{v}_{t,n}$ for $B_{t,n}$ is obtained by a pretrained ResNet (He et al. 2016) or ViT (Dosovitskiy et al. 2021). $A_{t,n}$ is usually composed of the heading $\psi_{t,n}$ and the elevation $\theta_{t,n}$. Each panoramic observation contains $d$ navigable viewpoints $C_t=\{C_{t,i}\}_{i=1}^{d}$ as the action candidates. At timestep $t$, the agent predicts an action $\mathbf{a}_t$ from $C_t$ based on the instruction $I$ and the current visual observation $O_t$.

Baseline Agents

Our AACL can be applied to many previous VLN models. In this work, two strong baseline agents, HAMT (Chen et al. 2021) and DUET (Chen et al. 2022), are selected. In this section, we briefly describe the HAMT baseline. In HAMT, the agent receives the instruction $I$, the panoramic observation $O_t$, and the navigation history $H_t$ at each timestep $t$. $H_t$ is a sequence of historical visual observations. A standard BERT (Devlin et al. 2019) is used to obtain the instruction feature $\mathbf{f}_I$ for $I$. For each view $n$ with the angle information $\langle\psi_{t,n},\theta_{t,n}\rangle$ in $O_t$, the direction feature is defined by $\mathbf{e}_{A_{t,n}}=(\sin\psi_{t,n},\cos\psi_{t,n},\sin\theta_{t,n},\cos\theta_{t,n})$. With the visual feature $\mathbf{v}_{t,n}$ and the direction feature $\mathbf{e}_{A_{t,n}}$, the observation embedding $\mathbf{o}_{t,n}$ for each view $n$ is calculated by:

$$\mathbf{o}_{t,n}=\mathrm{Dr}(\mathrm{LN}(\mathrm{LN}(\mathbf{W}_{v}\mathbf{v}_{t,n})+\mathrm{LN}(\mathbf{W}_{a}\mathbf{e}_{A_{t,n}})+\mathbf{e}_{t,n}^{N}+\mathbf{e}_{v}^{T})), \qquad (1)$$

where $\mathrm{LN}(\cdot)$ and $\mathrm{Dr}(\cdot)$ denote layer normalization (Ba, Kiros, and Hinton 2016) and dropout, respectively. $\mathbf{W}_v$ and $\mathbf{W}_a$ are learnable weights, and $\mathbf{e}_{t,n}^{N}$ and $\mathbf{e}_{v}^{T}$ denote the navigable embedding and the type embedding, respectively (Chen et al. 2021). The observation feature $\mathbf{o}_t$ is represented by $\mathbf{o}_t=[\mathbf{o}_{t,1};...;\mathbf{o}_{t,N_o}]$, where $N_o$ is the number of views. A hierarchical vision transformer (Chen et al. 2021) is constructed to obtain the history feature $\mathbf{h}_t=[\mathbf{h}_{t,1};...;\mathbf{h}_{t,t-1}]$ for the navigation history $H_t$.

Then $\mathbf{f}_I$, $\mathbf{o}_t$, and $\mathbf{h}_t$ are fed into a cross-modal transformer encoder $E^c(\cdot)$, resulting in:

$$\mathbf{\tilde{f}}_{I}^{t},\mathbf{\tilde{o}}_{t},\mathbf{\tilde{h}}_{t}=E^{c}(\mathbf{f}_{I},[\mathbf{o}_{t};\mathbf{h}_{t}]). \qquad (2)$$

The updated instruction feature $\mathbf{\tilde{f}}_{I}^{t}$ and observation feature $\mathbf{\tilde{o}}_{t}$ are used for action prediction:

$$\mathbf{a}_{t}=E^{a}(\mathbf{\tilde{f}}_{I}^{t},\mathbf{\tilde{o}}_{t}), \qquad (3)$$

where $E^a(\cdot)$ is a two-layer fully-connected network. For more model details, refer to (Chen et al. 2021).
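To make the baseline's observation embedding concrete, below is a minimal PyTorch-style sketch of Eq. 1. The hidden dimensions, dropout rate, and module name are illustrative assumptions rather than the released HAMT code.

```python
import torch
import torch.nn as nn

class ObservationEmbedding(nn.Module):
    """Minimal sketch of Eq. 1: fuse visual and directional features per view."""
    def __init__(self, vis_dim=768, dir_dim=4, hidden_dim=768, dropout=0.1):
        super().__init__()
        self.W_v = nn.Linear(vis_dim, hidden_dim)   # W_v in Eq. 1
        self.W_a = nn.Linear(dir_dim, hidden_dim)   # W_a in Eq. 1
        self.ln_v = nn.LayerNorm(hidden_dim)
        self.ln_a = nn.LayerNorm(hidden_dim)
        self.ln_out = nn.LayerNorm(hidden_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, v, e_a, e_nav, e_type):
        # v: (N_o, vis_dim); e_a: (N_o, 4); e_nav, e_type: (N_o, hidden_dim)
        fused = self.ln_v(self.W_v(v)) + self.ln_a(self.W_a(e_a)) + e_nav + e_type
        return self.drop(self.ln_out(fused))  # o_{t,n} of Eq. 1
```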

In Eq. 1, HAMT obtains the observation feature $\mathbf{o}_{t,n}$ directly from the pretrained visual feature $\mathbf{v}_{t,n}$ and the raw direction feature $\mathbf{e}_{A_{t,n}}$. In AACL, we map the observations $O_{t,n}$ to actional atomic concepts $U_{t,n}$ formed by language and use $U_{t,n}$ to obtain the new observation feature $\mathbf{o}'_{t,n}$. In this way, the gap between $O_t$ and $I$ can be effectively mitigated to simplify the alignment.

Actional Atomic-Concept Learning

In this section, we describe our AACL in detail; its overview is presented in Figure 2. At timestep $t$, the agent receives the multi-modal inputs $I$, $O_t$, and $H_t$ as in HAMT. For each $O_{t,n}$ in $O_t$ containing the single-view image $B_{t,n}$ and the direction $A_{t,n}$, AACL first conducts object concept mapping and atomic action concept mapping to obtain the object concept $U_{t,n}^{obj}$ and the action concept $U_{t,n}^{act}$. A concept refining adapter is built to re-rank $U_{t,n}^{obj}$ according to the instruction $I$ for better alignment. The actional atomic concept $U_{t,n}$ is then obtained by concatenating $U_{t,n}^{act}$ and $U_{t,n}^{obj}$, and fed to the concept encoder $E^t(\cdot)$ to get the concept feature $\mathbf{\tilde{u}}_{t,n}$. Finally, an observation co-embedding module uses $\mathbf{\tilde{u}}_{t,n}$ to regularize the visual feature $\mathbf{v}_{t,n}$ and the directional feature $\mathbf{e}_{A_{t,n}}$, yielding the new observation feature $\mathbf{o}'_{t,n}$. For $H_t$, which contains historical visual observations, we also use AACL to get the enhanced history feature $\mathbf{h}'_t$, as for $O_t$. Then $\mathbf{o}'_t=\{\mathbf{o}'_{t,n}\}_{n=1}^{N_o}$, $\mathbf{h}'_t$, and the instruction feature $\mathbf{f}_I$ are fed to the cross-modal transformer encoder $E^c(\cdot)$ to calculate the action prediction $\mathbf{a}'_t$.

Actional Atomic-Concept Mapping

Object Concept Mapping. For each observation $O_{t,n}$ containing the single-view image $B_{t,n}$, we map $B_{t,n}$ to the object concept $U_{t,n}^{obj}$ based on a pre-built in-domain object concept repository. Benefiting from large-scale language supervision from 400M image-text pairs, CLIP (Radford et al. 2021) has a more powerful open-world object recognition ability than conventional image classification or object detection models pretrained on a fixed-size object category set. In this work, we resort to CLIP to conduct the object concept mapping considering its good generalizability. Concretely, the object concept repository $\{U_c^{obj}\}_{c=1}^{N_c}$ is constructed by extracting object words from the training dataset, where $N_c$ is the repository size. We obtain the image feature $\mathbf{f}_{B_{t,n}}$ through the pretrained CLIP image encoder $E^v_{\mathrm{CLIP}}(\cdot)$:

$$\mathbf{f}_{B_{t,n}}=E^{v}_{\mathrm{CLIP}}(B_{t,n}). \qquad (4)$$

For the object concept $U_c^{obj}$, we construct the text phrase $T_c$ formed as “a photo of a $\{U_c^{obj}\}$”. Then the text feature $\mathbf{f}_{T_c}$ is derived through the pretrained CLIP text encoder $E^t_{\mathrm{CLIP}}(\cdot)$:

$$\mathbf{f}_{T_{c}}=E^{t}_{\mathrm{CLIP}}(T_{c}). \qquad (5)$$

The mapping probability of the image $B_{t,n}$ regarding the object concept $U_c^{obj}$ is then calculated by:

$$\mathbf{p}(y=U_{c}^{obj}|B_{t,n})=\frac{\exp(\mathrm{sim}(\mathbf{f}_{B_{t,n}},\mathbf{f}_{T_{c}})/\tau)}{\sum_{c'=1}^{N_{c}}\exp(\mathrm{sim}(\mathbf{f}_{B_{t,n}},\mathbf{f}_{T_{c'}})/\tau)}, \qquad (6)$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity and $\tau$ is the temperature parameter. Considering that a single-view image in the observation usually contains more than one salient object, we extract the top $k$ object concepts (text) with the maximum mapping probabilities conditioned on $B_{t,n}$ as its corresponding object concepts, i.e., $U_{t,n}^{obj}=\{U^{obj}_{t,n,i}\}_{i=1}^{k}$.
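As a concrete illustration of Eq. 4–6, the sketch below performs the object concept mapping with the public OpenAI CLIP package. The prompt template follows the paper; the helper name, the temperature value, and the example repository contents are assumptions for illustration only.

```python
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def map_object_concepts(view_image, concept_repo, k=5, tau=0.01):
    """Return the top-k object concepts (Eq. 4-6) for one single-view image."""
    # Text side (Eq. 5): "a photo of a {concept}" prompts for the whole repository.
    tokens = clip.tokenize([f"a photo of a {c}" for c in concept_repo]).to(device)
    image = preprocess(view_image).unsqueeze(0).to(device)  # PIL image -> tensor
    with torch.no_grad():
        f_text = model.encode_text(tokens)   # (N_c, D)
        f_img = model.encode_image(image)    # (1, D), Eq. 4
    # Cosine similarity = dot product of L2-normalized features (Eq. 6).
    f_text = f_text / f_text.norm(dim=-1, keepdim=True)
    f_img = f_img / f_img.norm(dim=-1, keepdim=True)
    probs = ((f_img @ f_text.t()) / tau).softmax(dim=-1).squeeze(0)
    top_p, top_idx = probs.topk(k)
    return [(concept_repo[int(i)], float(p)) for p, i in zip(top_p, top_idx)]

# Example usage with a hypothetical repository extracted from the instructions:
# top5 = map_object_concepts(pil_view, ["kitchen", "stairs", "bathroom"], k=3)
```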

Atomic Action Concept Mapping. The atomic action concept $U_{t,n}^{act}$ for $O_{t,n}$ is derived from its directional information $A_{t,n}$ and the directional information $\tilde{A}_{t-1}$ of the agent's selected action at timestep $t-1$. We first use six basic actions in VLN tasks to build the predefined atomic action set, i.e., go up, go down, go forward, go back, turn right, and turn left. Denote $A_{t,n}=\langle\psi_{t,n},\theta_{t,n}\rangle$, where $\psi_{t,n}\in[0,2\pi)$ and $\theta_{t,n}\in[-\frac{\pi}{2},\frac{\pi}{2}]$ are the heading and the elevation, respectively. Similarly, $\tilde{A}_{t-1}=\langle\tilde{\psi}_{t-1},\tilde{\theta}_{t-1}\rangle$, where $\tilde{\psi}_{t-1}\in[0,2\pi)$ and $\tilde{\theta}_{t-1}\in[-\frac{\pi}{2},\frac{\pi}{2}]$. We calculate the relative direction of $\langle\psi_{t,n},\theta_{t,n}\rangle$ with respect to $\langle\tilde{\psi}_{t-1},\tilde{\theta}_{t-1}\rangle$ by:

$$\tilde{\psi}_{t,n}=\psi_{t,n}-\tilde{\psi}_{t-1},\quad\tilde{\theta}_{t,n}=\theta_{t,n}-\tilde{\theta}_{t-1}. \qquad (7)$$

Then we use $\langle\tilde{\psi}_{t,n},\tilde{\theta}_{t,n}\rangle$ to obtain $U_{t,n}^{act}$. Following the direction judgement rule in VLN (Anderson et al. 2018), we first use $\tilde{\theta}_{t,n}$ to decide whether $U^{act}_{t,n}$ is “go up” or “go down” by comparing it to zero. Otherwise, $U^{act}_{t,n}$ is further determined through $\tilde{\psi}_{t,n}$: if $\tilde{\psi}_{t,n}$ is equal to zero, $U^{act}_{t,n}$ is “go forward”; otherwise, $U^{act}_{t,n}$ is determined to be “turn right”, “turn left”, or “go back”. The detailed mapping rule is listed in Table 1, and a sketch of this rule is given below.
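The following minimal Python sketch implements Eq. 7 and the rule of Table 1; wrapping the relative heading into $(-\pi, \pi]$ collapses the two symmetric halves of the table into one set of conditions. The function name is illustrative.

```python
import math

# The predefined atomic action set used in this work.
ATOMIC_ACTIONS = ["go up", "go down", "go forward", "go back", "turn right", "turn left"]

def map_action_concept(heading, elevation, prev_heading, prev_elevation):
    """Map a candidate direction to an atomic action phrase (Eq. 7 and Table 1)."""
    rel_h = heading - prev_heading        # relative heading, in (-2*pi, 2*pi)
    rel_e = elevation - prev_elevation    # relative elevation
    if rel_e > 0:
        return "go up"
    if rel_e < 0:
        return "go down"
    if rel_h == 0:
        return "go forward"
    # Wrap into (-pi, pi] so one rule covers both halves of Table 1.
    rel_h = (rel_h + math.pi) % (2 * math.pi) - math.pi
    if -math.pi / 2 <= rel_h < 0:
        return "turn left"
    if 0 < rel_h <= math.pi / 2:
        return "turn right"
    return "go back"
```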

Concept Refining Adapter

After getting $\{U^{obj}_{t,n,i}\}_{i=1}^{k}$ and $U_{t,n}^{act}$ for each $O_{t,n}$, the actional atomic concepts $\{U_{t,n,i}\}_{i=1}^{k}$ can be obtained by directly concatenating $U_{t,n}^{act}$ with each $U_{t,n,i}^{obj}$. A direct way to obtain the actional atomic concept feature $\mathbf{u}_{t,n}$ is to feed each $U_{t,n,i}$ to the concept encoder $E^t(\cdot)$ and compute a weighted sum based on the object prediction probabilities $\mathbf{p}_i$ given by CLIP:

$$\mathbf{u}_{t,n}=\sum_{i=1}^{k}\mathbf{p}_{i}\cdot E^{t}(U_{t,n,i}), \qquad (8)$$

where $E^t(\cdot)$ is a BERT-like language encoder. However, even though CLIP can extract informative object concepts for each observation, some noisy object concepts may remain, and extracting instruction-oriented object concepts would be more useful for alignment and action decisions. Inspired by (Gao et al. 2021), we construct a concept refining adapter on top of CLIP to refine the object concepts under the constraint of the instruction. Given a feature $\mathbf{f}$, the concept refining adapter is written as:

$$A(\mathbf{f})=\mathrm{ReLU}(\mathbf{f}^{T}\mathbf{W}_{1})\mathbf{W}_{2}, \qquad (9)$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are learnable parameters, and $\mathrm{ReLU}(\cdot)$ is the rectified linear unit activation. Denote the instruction feature as $\mathbf{f}_I=\{\mathbf{f}_I^{cls},\mathbf{f}_I^{1},...,\mathbf{f}_I^{l}\}$. For the image feature $\mathbf{f}_{B_{t,n}}$ of the single-view image $B_{t,n}$, we obtain the updated image feature $\mathbf{\tilde{f}}_{B_{t,n}}$ by feeding $\mathbf{f}_{B_{t,n}}$ and $\mathbf{f}_I^{cls}$ to $A(\cdot)$:

$$\mathbf{\tilde{f}}_{B_{t,n}}=\alpha\cdot\mathbf{f}_{B_{t,n}}+(1-\alpha)\cdot A([\mathbf{f}_{B_{t,n}};\mathbf{f}_{I}^{cls}]), \qquad (10)$$

where $\alpha$ serves as the residual ratio that adjusts the degree of maintaining the original knowledge for better performance (Gao et al. 2021), and $[\cdot;\cdot]$ denotes feature concatenation. Denote the top $k$ object concept features obtained by CLIP for the single-view image $B_{t,n}$ as $\{\mathbf{f}_{T_i}\}_{i=1}^{k}$. We use the updated image feature $\mathbf{\tilde{f}}_{B_{t,n}}$ to get the re-ranked object prediction probabilities $\mathbf{\tilde{p}}$ over $\{\mathbf{f}_{T_i}\}_{i=1}^{k}$:

$$\mathbf{\tilde{p}}=\mathrm{Softmax}(\mathrm{sim}(\mathbf{\tilde{f}}_{B_{t,n}},\mathbf{f}_{T_{1}}),...,\mathrm{sim}(\mathbf{\tilde{f}}_{B_{t,n}},\mathbf{f}_{T_{k}})). \qquad (11)$$

Then we get the refined concept feature $\mathbf{\tilde{u}}_{t,n}$ by replacing $\mathbf{p}$ in Eq. 8 with $\mathbf{\tilde{p}}$.
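A minimal sketch of the adapter (Eq. 9–11) is given below. The feature dimensions (CLIP image/text features of size 512, an instruction [CLS] feature of size 768) and the class name are illustrative assumptions; the appendix only specifies hidden sizes of 256 and 512 for the two weight matrices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptRefiningAdapter(nn.Module):
    """Sketch of Eq. 9-11: re-rank CLIP's top-k object concepts with the instruction."""
    def __init__(self, img_dim=512, txt_dim=768, bottleneck=256, alpha=0.8):
        super().__init__()
        # W1 / W2 of Eq. 9; the input is the concatenated [image; instruction CLS] feature.
        self.W1 = nn.Linear(img_dim + txt_dim, bottleneck, bias=False)
        self.W2 = nn.Linear(bottleneck, img_dim, bias=False)
        self.alpha = alpha  # residual ratio of Eq. 10

    def forward(self, f_img, f_cls, f_topk_text):
        # f_img: (img_dim,), f_cls: (txt_dim,), f_topk_text: (k, img_dim)
        adapted = self.W2(F.relu(self.W1(torch.cat([f_img, f_cls], dim=-1))))
        f_img_new = self.alpha * f_img + (1.0 - self.alpha) * adapted     # Eq. 10
        sims = F.cosine_similarity(f_img_new.unsqueeze(0), f_topk_text)   # (k,)
        return F.softmax(sims, dim=-1)                                    # Eq. 11: re-ranked p~

# The refined concept feature of Eq. 8 is then the p~-weighted sum of the encoded
# actional atomic concepts: u_tilde = (p_tilde.unsqueeze(-1) * concept_feats).sum(0)
```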

Observation Co-Embedding

After obtaining the concept feature $\mathbf{\tilde{u}}_{t,n}$ for each single-view observation $O_{t,n}$, we introduce an observation co-embedding module that uses $\mathbf{\tilde{u}}_{t,n}$ to bridge the multi-modal inputs and calculate the final observation feature $\mathbf{o}'_{t,n}$. First, we separately embed the visual feature $\mathbf{v}_{t,n}$, the direction feature $\mathbf{e}_{A_{t,n}}$, and the concept feature $\mathbf{\tilde{u}}_{t,n}$ to obtain $\mathbf{o}_{t,n}^{v}$, $\mathbf{o}_{t,n}^{a}$, and $\mathbf{o}_{t,n}^{u}$, respectively:

$$\mathbf{o}_{t,n}^{v}=\mathrm{Dr}(\mathrm{LN}(\mathrm{LN}(\mathbf{\tilde{W}}_{v}\mathbf{v}_{t,n})+\mathbf{e}_{t,n}^{N}+\mathbf{e}_{v}^{T})),$$
$$\mathbf{o}_{t,n}^{a}=\mathrm{Dr}(\mathrm{LN}(\mathrm{LN}(\mathbf{\tilde{W}}_{a}\mathbf{e}_{A_{t,n}})+\mathbf{e}_{t,n}^{N}+\mathbf{e}_{v}^{T})),$$
$$\mathbf{o}_{t,n}^{u}=\mathrm{Dr}(\mathrm{LN}(\mathrm{LN}(\mathbf{\tilde{W}}_{u}\mathbf{\tilde{u}}_{t,n})+\mathbf{e}_{t,n}^{N}+\mathbf{e}_{v}^{T})), \qquad (12)$$

where $\mathbf{\tilde{W}}_v$, $\mathbf{\tilde{W}}_a$, and $\mathbf{\tilde{W}}_u$ are learnable weights.

Unlike HAMT, which combines different features into one embedding (Eq. 1), we keep the separate embeddings as in Eq. 12 so that a new observation contrast strategy can be performed. Concretely, the view embedding $\mathbf{o}_{t,n}^{v}$ and the direction embedding $\mathbf{o}_{t,n}^{a}$ are summed into the visual embedding $\mathbf{o}_{t,n}^{V}=\mathbf{o}_{t,n}^{v}+\mathbf{o}_{t,n}^{a}$. Then $\mathbf{o}_{t,n}^{V}$ of each single-view observation $O_{t,n}$ is forced to stay close to its paired concept embedding $\mathbf{o}_{t,n}^{u}$ while staying far from the concept embeddings $\overline{\mathbf{o}}_{t,n}^{u}$ of the other single-view observations in $O_t$:

$$\mathcal{L}_{\mathrm{c}}=-\sum_{t}\sum_{n}\log\frac{e^{\mathrm{sim}(\mathbf{o}_{t,n}^{V},\mathbf{o}_{t,n}^{u})/\tau}}{e^{\mathrm{sim}(\mathbf{o}_{t,n}^{V},\mathbf{o}_{t,n}^{u})/\tau}+\sum_{\overline{\mathbf{o}}_{t,n}^{u}}e^{\mathrm{sim}(\mathbf{o}_{t,n}^{V},\overline{\mathbf{o}}_{t,n}^{u})/\tau}}, \qquad (13)$$

where $\tau$ is the temperature parameter. Through observation contrast, the discrimination of each single-view observation is effectively enhanced, and the semantic gap between observations and instructions is largely mitigated with the help of the actional atomic concepts. To fully merge the information for each observation, the final observation feature is obtained as $\mathbf{o}'_{t,n}=\mathbf{o}_{t,n}^{V}+\mathbf{o}_{t,n}^{u}$.
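The observation contrast of Eq. 13 is an InfoNCE-style objective over the candidate views of a single timestep; a minimal sketch under that assumption is shown below (the function name and batching are illustrative).

```python
import torch
import torch.nn.functional as F

def observation_contrast_loss(o_V, o_u, tau=0.5):
    """Sketch of Eq. 13 for one timestep.

    o_V: (N_o, D) summed view + direction embeddings of the N_o candidate views.
    o_u: (N_o, D) paired actional atomic-concept embeddings.
    """
    o_V = F.normalize(o_V, dim=-1)
    o_u = F.normalize(o_u, dim=-1)
    logits = o_V @ o_u.t() / tau  # (N_o, N_o) cosine similarities scaled by 1/tau
    targets = torch.arange(o_V.size(0), device=o_V.device)
    # The diagonal (paired concept) is the positive; other views' concepts are negatives,
    # so the row-wise softmax cross-entropy reproduces the log term of Eq. 13.
    return F.cross_entropy(logits, targets)

# The final observation feature simply merges both streams: o_prime = o_V + o_u.
```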

Action Prediction

Similar to the observation feature $\mathbf{o}'_t=\{\mathbf{o}'_{t,n}\}_{n=1}^{N_o}$ ($N_o$ is the number of views), the history feature $\mathbf{h}'_t$ is obtained for $H_t$ through AACL. With $\mathbf{o}'_t$, $\mathbf{h}'_t$, and the instruction feature $\mathbf{f}_I$, the action $\mathbf{a}'_t$ is obtained from the cross-modal transformer encoder $E^c(\cdot)$ and the action prediction module $E^a(\cdot)$ (see Eq. 2 and Eq. 3). Following most existing VLN works (Tan, Yu, and Bansal 2019; Hong et al. 2021; Chen et al. 2021), we combine Imitation Learning (IL) and Reinforcement Learning (RL) to train the agent. Let the imitation learning loss be $\mathcal{L}_{\mathrm{IL}}$ and the reinforcement learning loss be $\mathcal{L}_{\mathrm{RL}}$. The total training objective of AACL is:

$$\mathcal{L}=\mathcal{L}_{\mathrm{RL}}+\lambda_{1}\mathcal{L}_{\mathrm{IL}}+\lambda_{2}\mathcal{L}_{\mathrm{c}}, \qquad (14)$$

where $\lambda_1$ and $\lambda_2$ are balance parameters.

Table 2: Comparison with the SOTA methods on R2R.
Method Val Seen Val Unseen Test Unseen
TL NE \downarrow SR \uparrow SPL \uparrow TL NE \downarrow SR \uparrow SPL \uparrow TL NE \downarrow SR \uparrow SPL \uparrow
Seq2Seq (Anderson et al. 2018) 11.33 6.01 39 - 8.39 7.81 22 - 8.13 7.85 20 18
RCM+SIL(train) (Wang et al. 2019) 10.65 3.53 67 - 11.46 6.09 43 - 11.97 6.12 43 38
EnvDropout (Tan, Yu, and Bansal 2019) 11.00 3.99 62 59 10.70 5.22 52 48 11.66 5.23 51 47
PREVALENT (Hao et al. 2020) 10.32 3.67 69 65 10.19 4.71 58 53 10.51 5.30 54 51
ORIST (Qi et al. 2021) - - - - 10.90 4.72 57 51 11.31 5.10 57 52
VLN$\circlearrowright$BERT (Hong et al. 2021) 11.13 2.90 72 68 12.01 3.93 63 57 12.35 4.09 63 57
HOP (Qiao et al. 2022) 11.51 2.46 76 70 12.52 3.79 64 57 13.29 3.87 64 58
HAMT (Chen et al. 2021) (baseline) 11.15 2.51 76 72 11.46 3.62 66 61 12.27 3.93 65 60
HAMT+AACL (ours) 11.31 2.53 76 72 12.09 3.41 69 63 12.74 3.71 66 61
DUET (Chen et al. 2022) (baseline) 12.32 2.28 79 73 13.94 3.31 72 60 14.73 3.65 69 59
DUET+AACL (ours) 13.32 2.15 80 72 15.01 3.00 74 61 15.47 3.38 71 59
Table 3: Navigation and object grounding performance on REVERIE.
Method Val Seen Val Unseen Test Unseen
TL SR \uparrow OSR\uparrow SPL \uparrow RGS\uparrow RGSPL\uparrow TL SR \uparrow OSR\uparrow SPL \uparrow RGS\uparrow RGSPL\uparrow TL SR \uparrow OSR\uparrow SPL \uparrow RGS\uparrow RGSPL\uparrow
RCM (Wang et al. 2019) 10.70 23.33 29.44 21.82 16.23 15.36 11.98 9.29 14.23 6.97 4.89 3.89 10.60 7.84 11.68 6.67 3.67 3.14
SMNA (Ma et al. 2019) 7.54 41.25 43.29 39.61 30.07 28.98 9.07 8.15 11.28 6.44 4.54 3.61 9.23 5.80 8.39 4.53 3.10 2.39
FAST-MATTN (Qi et al. 2020) 16.35 50.53 55.17 45.50 31.97 29.66 45.28 14.40 28.20 7.19 7.84 4.67 39.05 19.88 30.63 11.60 11.28 6.08
SIA (Lin, Li, and Yu 2021) 13.61 61.91 65.85 57.08 45.96 42.65 41.53 31.53 44.67 16.28 22.41 11.56 48.61 30.80 44.56 14.85 19.02 9.20
VLN$\circlearrowright$BERT (Hong et al. 2021) 13.44 51.79 53.90 47.96 38.23 35.61 16.78 30.67 35.02 24.90 18.77 15.27 15.86 29.61 32.91 23.99 16.50 13.51
HOP (Qiao et al. 2022) 14.05 54.81 56.08 48.05 40.55 35.79 17.16 30.39 35.30 25.10 18.23 15.31 17.05 29.12 32.26 23.37 17.13 13.90
HAMT (Chen et al. 2021) (baseline) 12.79 43.29 47.65 40.19 27.20 25.18 14.08 32.95 36.84 30.20 18.92 17.28 13.62 30.40 33.41 26.67 14.88 13.08
HAMT+AACL (ours) 13.01 42.52 46.66 39.48 28.39 26.48 14.08 34.17 38.54 29.70 20.53 17.69 13.30 35.52 39.57 31.34 18.04 15.96
DUET (Chen et al. 2022) (baseline) 13.86 71.75 73.86 63.94 57.41 51.14 22.11 46.98 51.07 33.73 32.15 23.03 21.30 52.51 56.91 36.06 31.88 22.06
DUET+AACL (ours) 14.54 74.63 76.67 66.04 59.38 52.57 23.77 49.42 53.93 33.54 33.31 22.49 21.88 55.09 59.92 37.08 33.17 22.55

Experiments

Experimental Setup

Datasets. We evaluate the proposed AACL on several popular VLN benchmarks with both fine-grained instructions (R2R (Anderson et al. 2018)) and high-level instructions (REVERIE (Qi et al. 2020) and R2R-Last (Chen et al. 2021)). R2R includes 90 photo-realistic indoor scenes with 7189 trajectories. The dataset is split into train, val seen, val unseen, and test unseen sets with 61, 56, 11, and 18 scenes, respectively. REVERIE replaces the fine-grained instructions in R2R with high-level instructions that mainly target object localization. R2R-Last uses only the last sentence of each original R2R instruction.

Table 4: Comparison on R2R-Last.
Method Val Seen Val Unseen
SR\uparrow SPL\uparrow SR\uparrow SPL\uparrow
EnvDrop (Tan, Yu, and Bansal 2019) 42.8 38.4 34.3 28.3
VLN$\circlearrowright$BERT (Hong et al. 2021) 50.2 45.8 41.6 37.3
HAMT (Chen et al. 2021) (baseline) 53.3 50.3 45.2 41.2
HAMT+AACL (ours) 54.2 51.1 47.2 42.1
Table 5: Ablation Study on R2R. The baseline agent we choose is HAMT (Chen et al. 2021).
Method Val Unseen
NE\downarrow SR\uparrow SPL\uparrow
separate embedding 3.66 66.67 61.19
w/o contrast 3.42 67.94 61.48
w/o refine 3.45 67.82 62.33
full model 3.41 68.54 62.96

Evaluation Metrics. We adopt the common metrics used in previous works (Chen et al. 2021; Anderson et al. 2018; Qi et al. 2020) to evaluate the model performance: 1) Navigation Error (NE) is the average distance between the agent's stop position and the target viewpoint, 2) Trajectory Length (TL) is the average path length in meters, 3) Success Rate (SR) is the ratio of stopping within 3 meters of the goal, 4) Success rate weighted by Path Length (SPL) trades off SR against TL, 5) Oracle Success Rate (OSR) is the ratio of trajectories containing a viewpoint where the target object is visible, 6) Remote Grounding Success rate (RGS) is the ratio of performing correct object grounding when stopping, and 7) Remote Grounding Success weighted by Path Length (RGSPL) weights RGS by path length. Metrics 1)–4), 3)–4), and 2)–7) are used for evaluation on R2R, R2R-Last, and REVERIE, respectively.
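For reference, SR and SPL can be computed as in the sketch below, following their standard definitions (Anderson et al. 2018); the function names and list-based inputs are illustrative.

```python
def success_rate(final_dists, threshold=3.0):
    """SR: fraction of episodes that stop within `threshold` meters of the goal."""
    return sum(d <= threshold for d in final_dists) / len(final_dists)

def spl(successes, shortest_lens, path_lens):
    """SPL: success (0/1) weighted by shortest-path length over max(taken, shortest)."""
    vals = [s * l / max(p, l) for s, l, p in zip(successes, shortest_lens, path_lens)]
    return sum(vals) / len(vals)

# Example: one successful 12 m episode on an 8 m shortest path contributes 8/12 to SPL.
```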

Baselines. In this work, we choose two strong baseline agents, HAMT (Chen et al. 2021) and DUET (Chen et al. 2022) to verify AACL’s effectiveness. In HAMT, a hierarchical transformer is adopted for storing historical observations and actions. In contrast, DUET keeps track of all visited and navigable locations through a topological map.

Implementation Details. We implement our model using the MindSpore Lite tool (MindSpore 2022). The batch size is set to 8, 8, and 4 on R2R, R2R-Last, and REVERIE, respectively. The temperature parameter $\tau$ is set to 0.5. The loss weight $\lambda_1$ is set to 0.2 on all datasets, and the loss weight $\lambda_2$ is set to 1, 1, and 0.01 on R2R, REVERIE, and R2R-Last, respectively. The residual ratio $\alpha$ in Eq. 10 is set to 0.8 empirically. During object concept mapping, we retain the top 5 object predictions for each observation. The learning rate of the concept refining adapter is set to 0.1.

Figure 3: Visualization examples of action selection ((a) and (b)) and object concept mapping ((c)). In (a) and (b), the baseline is HAMT (Chen et al. 2021). The green boxes denote the correct actions and the red boxes denote the wrong ones.

Quantitative Results

Comparison with the State-of-the-Arts (SOTAs). Table 2 (note that the NE value of 2.29 under Val Unseen originally reported in HAMT is a typo; the actual value is 3.62, as confirmed by the authors), Table 3, and Table 4 present the performance comparison between the SOTA methods and AACL, where we can find that AACL establishes new state-of-the-art results on most metrics on R2R, REVERIE, and R2R-Last. These results show that AACL is useful not only when the instructions are fine-grained but also when the instruction information is limited, demonstrating that the proposed actional atomic concepts can effectively enhance the observation features, simplify their alignment to the linguistic instruction features, and therefore improve the agent performance. Moreover, AACL consistently outperforms the two strong baselines on these three benchmarks, especially in unseen scenarios, showing that AACL can serve as a general tool for multi-modal alignment.

Ablation Study. Table 5 gives the ablation study of AACL. “separate embedding” means using the separate embedding scheme (Eq. 12) only for the visual feature and the directional feature. By comparing the results between “separate embedding” and “w/o contrast”, we can find that the direct introduction of actional atomic concepts under the separate embedding strategy can already improve the navigation performance (1.27% increase on SR), showing their effectiveness for enhancing the observation features. By comparing the results between “w/o contrast” and “w/o refine”, we can observe that the proposed observation contrast strategy can effectively regularize the observation representation and improve the performance (0.85% increase on SPL). The comparison between “w/o refine” and “full model” further shows the effectiveness of the concept refining adapter, demonstrating that the instruction-oriented object concept extraction can facilitate better cross-modal alignment.

Qualitative Results

Figure 3 visualizes some results of action decision and object concept mapping. We can find that by introducing the actional atomic concepts, the agent is able to perform better cross-modal alignment to improve its action decisions. In Figure 3(a), although the candidate observations do not contain the visual appearance of “kitchen”, with the help of the actional atomic concepts AACL successfully chooses the right action, whose paired action concept matches the one mentioned in the instruction (“turn right”), while the baseline selects the wrong one. From the instruction attention comparison in the lower part of Figure 3(a), we can also observe that AACL attends to the right part (“right”) of the instruction at step 2, while the baseline attends to the wrong part (“and”). In Figure 3(b), with the help of the actional atomic concepts, AACL successfully chooses the correct action asked in the instruction. In Figure 3(c), we can observe that the probability of “kitchen” given by AACL is higher than that given by CLIP for the GT action (top-1 vs. top-4), showing that the concept refining adapter enables more instruction-oriented object concept extraction, which is useful for selecting correct actions.

Conclusion

In this work, we propose Actional Atomic-Concept Learning (AACL), which helps VLN agents demystify the alignment in VLN tasks through actional atomic concepts formed by language. During navigation, each visual observation is mapped to a specific actional atomic concept through the VLN environment and CLIP. A concept refining adapter is constructed to enable instruction-oriented concept extraction, and an observation co-embedding module is introduced to use the concept features to regularize the observation features. Experiments on public VLN benchmarks show that AACL achieves new SOTA results. Benefiting from these human-understandable actional atomic concepts, AACL shows excellent interpretability in making action decisions.

Acknowledgements

This work was supported in part by National Key R&D Program of China under Grant No. 2020AAA0109700, National Natural Science Foundation of China (NSFC) under Grant No.61976233, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Guangdong Natural Science Foundation under Grant 2017A030312006, Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365), and the Fundamental Research Funds for the Central Universities, Sun Yat-sen University under Grant No. 22lgqb38, CAAI-Huawei MindSpore Open Fund. We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks), and Ascend AI Processor used for this research.

References

  • Anderson et al. (2018) Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sunderhauf, N.; Reid, I.; Gould, S.; and van den Hengel, A. 2018. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In CVPR.
  • Ba, Kiros, and Hinton (2016) Ba, J.; Kiros, J. R.; and Hinton, G. E. 2016. Layer Normalization. ArXiv, abs/1607.06450.
  • Chen et al. (2019) Chen, H.; Suhr, A.; Misra, D. K.; Snavely, N.; and Artzi, Y. 2019. TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments. In CVPR.
  • Chen et al. (2021) Chen, S.; Guhur, P.-L.; Schmid, C.; and Laptev, I. 2021. History Aware Multimodal Transformer for Vision-and-Language Navigation. In NeurIPS.
  • Chen et al. (2022) Chen, S.; Guhur, P.-L.; Tapaswi, M.; Schmid, C.; and Laptev, I. 2022. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation. In CVPR.
  • Chen et al. (2020) Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A. E.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. UNITER: UNiversal Image-TExt Representation Learning. In ECCV.
  • Dai et al. (2022) Dai, W.; Hou, L.; Shang, L.; Jiang, X.; Liu, Q.; and Fung, P. 2022. Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation. In ACL.
  • Desai and Johnson (2021) Desai, K.; and Johnson, J. 2021. VirTex: Learning Visual Representations from Textual Annotations. In CVPR.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.
  • Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
  • Fried et al. (2018) Fried, D.; Hu, R.; Cirik, V.; Rohrbach, A.; Andreas, J.; Morency, L.-P.; Berg-Kirkpatrick, T.; Saenko, K.; Klein, D.; and Darrell, T. 2018. Speaker-Follower Models for Vision-and-Language Navigation. In NeurIPS.
  • Fu et al. (2020) Fu, T.-J.; Wang, X. E.; Peterson, M. F.; Grafton, S. T.; Eckstein, M. P.; and Wang, W. Y. 2020. Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler. In ECCV.
  • Gao et al. (2021) Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; and Qiao, Y. J. 2021. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. ArXiv, abs/2110.04544.
  • Guhur et al. (2021) Guhur, P.-L.; Tapaswi, M.; Chen, S.; Laptev, I.; and Schmid, C. 2021. Airbert: In-domain Pretraining for Vision-and-Language Navigation. In ICCV.
  • Hao et al. (2020) Hao, W.; Li, C.; Li, X.; Carin, L.; and Gao, J. 2020. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training. In CVPR.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In CVPR.
  • Hong et al. (2021) Hong, Y.; Wu, Q.; Qi, Y.; Rodriguez-Opazo, C.; and Gould, S. 2021. VLN BERT: A Recurrent Vision-and-Language BERT for Navigation. In CVPR.
  • Jia et al. (2021) Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q. V.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In ICML.
  • Khandelwal et al. (2022) Khandelwal, A.; Weihs, L.; Mottaghi, R.; and Kembhavi, A. 2022. Simple but Effective: CLIP Embeddings for Embodied AI. In CVPR.
  • Ku et al. (2020) Ku, A.; Anderson, P.; Patel, R.; Ie, E.; and Baldridge, J. 2020. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding. In EMNLP.
  • Li et al. (2020a) Li, G.; Duan, N.; Fang, Y.; Gong, M.; and Jiang, D. 2020a. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training. In AAAI.
  • Li et al. (2019) Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. ArXiv, abs/1908.03557.
  • Li et al. (2020b) Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; Choi, Y.; and Gao, J. 2020b. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV.
  • Liang et al. (2022) Liang, X.; Zhu, F.; Li, L.; Xu, H.; and Liang, X. 2022. Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration. In ACL.
  • Lin, Li, and Yu (2021) Lin, X.; Li, G.; and Yu, Y. 2021. Scene-Intuitive Agent for Remote Embodied Visual Grounding. In CVPR.
  • Lu et al. (2019) Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS.
  • Ma et al. (2019) Ma, C.-Y.; Lu, J.; Wu, Z.; AlRegib, G.; Kira, Z.; Socher, R.; and Xiong, C. 2019. Self-Monitoring Navigation Agent via Auxiliary Progress Estimation. In ICLR.
  • MindSpore (2022) MindSpore. 2022. MindSpore. https://www.mindspore.cn/.
  • Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous Methods for Deep Reinforcement Learning. In ICML.
  • Moudgil et al. (2021) Moudgil, A.; Majumdar, A.; Agrawal, H.; Lee, S.; and Batra, D. 2021. SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation. In NeurIPS.
  • Nguyen and Daumé (2019) Nguyen, K.; and Daumé, H. 2019. Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning. In EMNLP.
  • Qi et al. (2021) Qi, Y.; Pan, Z.; Hong, Y.; Yang, M.-H.; van den Hengel, A.; and Wu, Q. 2021. The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation. In ICCV.
  • Qi et al. (2020) Qi, Y.; Pan, Z.; Zhang, S.; van den Hengel, A.; and Wu, Q. 2020. Object-and-Action Aware Model for Visual Language Navigation. In ECCV.
  • Qi et al. (2020) Qi, Y.; Wu, Q.; Anderson, P.; Wang, X.; Wang, W. Y.; Shen, C.; and van den Hengel, A. 2020. REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In CVPR.
  • Qiao et al. (2022) Qiao, Y.; Qi, Y.; Hong, Y.; Yu, Z.; Wang, P.; and Wu, Q. 2022. HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation. In CVPR.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML.
  • Rao et al. (2022) Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; and Lu, J. 2022. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. In CVPR.
  • Sariyildiz, Perez, and Larlus (2020) Sariyildiz, M. B.; Perez, J.; and Larlus, D. 2020. Learning Visual Representations with Caption Annotations. In ECCV.
  • Shen et al. (2022) Shen, S.; Li, L. H.; Tan, H.; Bansal, M.; Rohrbach, A.; Chang, K.-W.; Yao, Z.; and Keutzer, K. 2022. How Much Can CLIP Benefit Vision-and-Language Tasks? In ICLR.
  • Song et al. (2022) Song, H.; Dong, L.; Zhang, W.; Liu, T.; and Wei, F. 2022. CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment. In ACL.
  • Subramanian et al. (2022) Subramanian, S.; Merrill, W.; Darrell, T.; Gardner, M.; Singh, S.; and Rohrbach, A. 2022. ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension. In ACL.
  • Tan, Yu, and Bansal (2019) Tan, H.; Yu, L.; and Bansal, M. 2019. Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout. In NAACL-HLT.
  • Wang, Wu, and Shen (2020) Wang, H.; Wu, Q.; and Shen, C. 2020. Soft Expert Reward Learning for Vision-and-Language Navigation. In ECCV.
  • Wang et al. (2019) Wang, X.; Huang, Q.; Celikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.-F.; Wang, W. Y.; and Zhang, L. 2019. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In CVPR.
  • Zhu et al. (2020) Zhu, F.; Zhu, Y.; Chang, X.; and Liang, X. 2020. Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks. In CVPR.

Appendix

Training Objectives

In this section, we describe the imitation learning (IL) loss and the reinforcement learning (RL) loss used in training. Denote the predicted action at timestep $t$ as $\mathbf{a}_t$. The IL loss $\mathcal{L}_{\mathrm{IL}}$ is calculated by:

$$\mathcal{L}_{\mathrm{IL}}=\sum_{t}-\mathbf{a}_{t}^{*}\log(\mathbf{a}_{t}), \qquad (15)$$

where $\mathbf{a}_t^{*}$ is the teacher action of the ground-truth path at timestep $t$. The RL loss is formulated as:

$$\mathcal{L}_{\mathrm{RL}}=\sum_{t}-\mathbf{a}_{t}^{s}\log(\mathbf{a}_{t})A_{t}, \qquad (16)$$

where $\mathbf{a}_t^{s}$ is the action sampled from the agent's action prediction $\mathbf{a}_t$, and $A_t$ is the advantage calculated by the A2C algorithm (Mnih et al. 2016).
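Below is a minimal sketch of how Eq. 14–16 combine per-step action logits into the training objective. The function signature, the per-step loop, and treating the advantage as a precomputed constant are illustrative assumptions rather than the exact HAMT/AACL training code.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_seq, teacher_actions, sampled_actions, advantages,
               contrast_loss, lambda_1=0.2, lambda_2=1.0):
    """Sketch of Eq. 14-16: L = L_RL + lambda_1 * L_IL + lambda_2 * L_c."""
    il_loss = logits_seq[0].new_zeros(())
    rl_loss = logits_seq[0].new_zeros(())
    for logits, a_star, a_s, adv in zip(logits_seq, teacher_actions,
                                        sampled_actions, advantages):
        log_probs = F.log_softmax(logits, dim=-1)  # logits over action candidates
        il_loss = il_loss - log_probs[a_star]      # Eq. 15: teacher-forced action
        rl_loss = rl_loss - log_probs[a_s] * adv   # Eq. 16: sampled action x A2C advantage
    return rl_loss + lambda_1 * il_loss + lambda_2 * contrast_loss
```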

Implementation Details

Following (Chen et al. 2021), we also add an additional all-zero feature as the “stop” feature to the observation features at each timestep, and its paired atomic action concept is manually set to “stop”. Since the observation contrast strategy is introduced to enhance the discrimination of the action candidates for accurate action decisions, it is not conducted on the navigation history features. We construct the object concept repository by extracting object words from the training dataset and the augmentation dataset (Hong et al. 2021) of R2R to ensure the diversity of the object concepts. During the object concept mapping, we directly use the CLIP visual features (ViT-B/32) released by (Chen et al. 2021), and the CLIP text encoder is kept fixed. The dimensionalities of the learnable parameters $\mathbf{W}_1$ and $\mathbf{W}_2$ in the concept refining adapter are set to 256 and 512, respectively.

More Quantitative Results

In this section, we give more quantitative results in Table 6 and Table 7. Table 6 presents the comparison of different object numbers $k$ in object concept mapping, where we can find that $k$=5 achieves the best performance on most metrics. This is reasonable because the number of salient objects in a single-view observation is usually not large. Table 7 investigates the results when using different kinds of concepts for regularizing the observation features. From Table 7, we can observe that using both action and object concepts as actional atomic concepts is more useful for improving the alignment and the navigation performance than using only object or action concepts.

More Visualization Results

In this section, we first give more visualization results of the action decisions by the baseline method (Chen et al. 2021) and AACL in Figure 4 to Figure 6. From Figure 4, we can find that through AACL, the agent can better align the observation to the instruction to make the correct action decision: the correct action chosen by AACL contains the actional atomic concept “turn right bedrooms”, which matches the key information mentioned in the sub-instruction “walk through the open bedroom door on the right”. Figure 5 and Figure 6 present two failure cases. From Figure 5, we can find that although the action chosen by AACL is judged wrong, it can also be seen as a correct action for “walk out of the dining room” due to the instruction ambiguity. In Figure 6, although the action chosen by AACL is different from the ground-truth one, its paired actional atomic concept is similar to that of the ground-truth action.

Figure 7 gives more visualization results of the object predictions of CLIP (Radford et al. 2021) and our AACL. We can find that the concept refining adapter can contribute to more instruction-oriented object concept extraction. For example, in Figure 7(b), the probability of “bathroom area” of AACL is higher than that of CLIP for the ground-truth action candidate (top-1 vs. top-4). These visualization results show the effectiveness of the proposed concept refining adapter.

Table 6: Comparison of different object numbers.
Setting NE\downarrow SR\uparrow SPL\uparrow
$k$=1 3.52 67.52 62.24
$k$=5 3.42 67.94 61.48
$k$=10 3.47 66.88 60.99
Table 7: Comparison of different kinds of concepts.
Setting NE\downarrow SR\uparrow SPL\uparrow
object 3.60 66.37 61.39
action 3.51 67.13 61.34
action+object 3.42 67.94 61.48
Figure 4: Visualization of the action selections by the baseline method (Chen et al. 2021) and our AACL. The green boxes denote the correct actions and the red boxes denote the wrong ones. After step 6 marked with the grey dashed box, the baseline and AACL make different trajectories.
Figure 5: Failure case of AACL. The green boxes denote the correct actions and the red boxes denote the wrong ones. After step 1 marked with the grey dashed box, the ground-truth and AACL have different trajectories.
Figure 6: Failure case of AACL. The green boxes denote the correct actions and the red boxes denote the wrong ones. After the panoramic views marked with the grey dashed boxes, the ground-truth and AACL have different trajectories.
Figure 7: Visualization of the object predictions by CLIP (Radford et al. 2021) and our AACL.