Learning Hierarchical Relational Representations through
Relational Convolutions
Abstract
An evolving area of research in deep learning is the study of architectures and inductive biases that support the learning of relational feature representations. In this paper, we address the challenge of learning representations of hierarchical relations—that is, higher-order relational patterns among groups of objects. We introduce “relational convolutional networks”, a neural architecture equipped with computational mechanisms that capture progressively more complex relational features through the composition of simple modules. A key component of this framework is a novel operation that captures relational patterns in groups of objects by convolving graphlet filters—learnable templates of relational patterns—against subsets of the input. Composing relational convolutions gives rise to a deep architecture that learns representations of higher-order, hierarchical relations. We present the motivation and details of the architecture, together with a set of experiments to demonstrate how relational convolutional networks can provide an effective framework for modeling relational tasks that have hierarchical structure.
1 Introduction
Objects in the real world rarely exist in isolation; modeling the relationships between them is essential to accurately capturing complex systems. As increasingly powerful machine learning models advance towards building internal “world models,” it becomes crucial to explore natural inductive biases to enable efficient learning of relational representations. The computational challenge lies in developing the components required to construct robust, flexible, and progressively more complex relational representations.

An important component of relational cognitive processing in humans is an ability to reason about higher-order relational patterns between groups of objects. To illustrate this, it is instructive to consider experimental tasks from the cognitive psychology literature that probe abstract relational reasoning ability. Consider, for example, the task depicted to the right in Figure 1 which is a variant of a relational “match-to-sample” task [10, 37]. The subject is presented with a source triplet of objects and several target triplets, with each triplet having a particular relational pattern. The task is to match the source to a target triplet with the same relational pattern (in this case, the source has an “ABA” pattern that matches the second target). This task requires going beyond reasoning about pairwise relations; the subject must reason about each triplet of objects as a group, determine its relational pattern, then compare it to those of the target triplets, inferring the abstract rule in the process. The ability to infer generalizable abstract rules in such tasks is believed to be unique to humans [9].
Compositionality—used here to mean an ability to compose modules together to build iteratively more complex feature representations—is essential to the success of deep representation learning. For example, in the domain of visual processing, CNNs are able to extract higher-level features (e.g., textures and object-specific features) by composing simpler feature maps [41], resulting in a flexible architecture for computing “features of features”. In contrast, existing work on relational representation learning has primarily been limited to “shallow” first-order architectures (e.g., only explicitly capturing pairwise relations).
In this work, we propose relational convolutional networks as a compositional framework for learning hierarchical relational representations. The key to the framework involves formalizing the concept of convolving learnable templates of a relational pattern against a larger relation tensor. This operation produces a sequence of vectors representing the relational pattern within each group of objects. Crucially, composing relational convolutions captures higher-order relational features—i.e., relations between relations. Specifically, our proposed architecture introduces the following concepts and computational mechanisms.
• Graphlet filters. A “graphlet filter” is a template for the pattern of relations between a (small) collection of objects. Since pairwise relations can be viewed as edges on a graph, the term “graphlet” refers to a subgraph, and the term “filter” refers to a learnable template or pattern.
• Relational convolutions. We formalize a notion of relational convolution, analogous to spatial convolution in CNNs, in which a graphlet filter is matched against the relations within groups of objects to obtain a representation of the relational pattern in different groupings of the input.
• Grouping mechanisms. For large problem instances, considering relational convolutions across all object combinations would be intractable. To achieve scalability, we introduce a learnable attention-based grouping mechanism that identifies the relevant groups for the downstream task.
• Compositional relational modules. The proposed architecture supports composable modules, where each module has learnable graphlet filters and groups. This enables learning higher-order relationships between objects—relations between relations.
The components of the architecture are presented in detail in Sections 2 and 3, and a schematic of the proposed architecture is shown in Figure 2. In a series of experiments, we show how relational convolutional networks provide a powerful framework for relational learning. We first carry out experiments on the “relational games” benchmark for relational reasoning proposed by [32], which consists of a suite of binary classification tasks for identifying abstract relational rules between a set of geometric objects represented as images. We next carry out experiments on a version of the Set card game, which requires processing of higher-order relations across multiple attributes. We find that relational convolutional networks outperform Transformers, graph neural networks, and existing relational architectures. These results demonstrate that both compositionality and relational inductive biases are essential for efficiently learning representations of complex higher-order relations.

1.1 Related Work
To place our framework in the context of previous work, we briefly discuss related forms of relational learning below, pointing first to the review of relational learning inductive biases by [6].
Graph neural networks (GNNs) are a class of neural network architectures which operate on graphs and process “relational” data [e.g., 26, 21, 31, 34, 20, 38]. A defining feature of GNN models is their use of a form of neural message-passing, wherein the hidden representation of a node is updated as a function of the hidden representations of its neighbors on a graph [12]. Typical examples of tasks that GNNs are applied to include node classification, graph classification, and link prediction [14].
In GNNs, the ‘relations’ are given to the model as input via edges in a graph. In contrast, our architecture, as well as the relational architectures described below, operate on collections of objects without any relations given as input. Instead, such relational architectures must infer the relevant relations from the objects themselves. Still, graph neural networks can be applied to these relational tasks by passing in the collection of objects along with a complete graph.
Several works have proposed architectures with the ability to model relations by incorporating an attention mechanism [e.g., 33, 34, 29, 40, 22]. Attention mechanisms, such as self-attention in Transformers [33], model relations between objects implicitly as an intermediate step in an information-retrieval operation to update the representation of each object as a function of its context.
There also exists a growing literature on neural architectures that aim to explicitly model relational information between objects. An early example is the relation network proposed by [30], which produces an embedding representation for a set of objects based on aggregated pairwise relations. [32] proposes the PrediNet architecture, which aims to learn relational representations that are compatible with predicate logic. [18] proposes CoRelNet, a simple architecture based on ‘similarity scores’ that aims to distill the relational inductive biases discovered in previous work into a minimal architecture. [3, 2] explore relational inductive biases in the context of Transformers, and propose a view of relational inductive biases as a type of selective “information bottleneck” which disentangles relational information from object-level features. [36] provides a cognitive science perspective on this idea, arguing that a relational information bottleneck may be a mechanism for abstraction in the human mind.
2 Multi-Dimensional Inner Product Relation Module
A relation function maps a pair of objects to a vector that represents the relationship between them. For example, a relation may represent comparisons along different attributes of the two objects, such as “$x$ has the same color as $y$, $x$ is larger than $y$, and $x$ is to the left of $y$”. In principle, this can be modeled by an arbitrary learnable function on the concatenation of the two objects’ feature representations. For example, [30] use multilayer perceptrons (MLPs) to model relations by processing the concatenated feature vectors of object pairs. However, this approach lacks crucial inductive biases. While it is theoretically capable of modeling relations, it imposes no constraints to ensure that the learned pairwise function reflects meaningful relational patterns. In particular, it entangles the feature representations of the two objects without explicitly comparing their features.
Following previous work [e.g., 33, 37, 18, 3], we propose modeling pairwise relations between objects via inner products of feature maps. This introduces added structure to the pairwise function that explicitly incorporates a comparison operation (the inner product). The advantage of this approach is that it provides added pressure to learn explicitly relational representations, disentangling relational information from attributes of individual objects, and inducing a geometry on the object space $\mathcal{X}$. For example, in the symmetric case, the inner product relation satisfies symmetry and positive definiteness, and induces a pseudometric on $\mathcal{X}$. The triangle inequality of the pseudometric expresses a transitivity property—if $x$ is related to $y$ and $y$ is related to $z$, then $x$ must be related to $z$.
More generally, we can allow for multi-dimensional relations by having multiple encoding functions, each extracting a feature to compute a relation on. Furthermore, we can allow for asymmetric relations by having different encoding functions for each object. Hence, we model relations by
$$r(x, y) = \big( \langle \phi_1(x), \psi_1(y) \rangle, \ldots, \langle \phi_{d_r}(x), \psi_{d_r}(y) \rangle \big), \qquad (1)$$
where $\phi_1, \psi_1, \ldots, \phi_{d_r}, \psi_{d_r}$ are learnable functions. The intuition is that, for each dimension, the encoders extract, or ‘filter’ out, particular attributes of the objects and the inner products compute similarity across each attribute. A relation, in this sense, is similarity across a particular attribute. In the asymmetric case, the attributes extracted from the two objects are different, resulting in an asymmetric relation where one attribute of the first object is compared with a different attribute of the second object. For example, this can model relations of the form “$x$ is brighter than $y$” (an antisymmetric relation).
[1] analyzes the function approximation properties of neural relation functions of the form of Equation 1. In particular, the function class of inner products of neural networks is characterized in both the symmetric case and the asymmetric case. In the symmetric case (i.e., $\phi_k = \psi_k$), it is shown that inner products of MLPs are universal approximators for symmetric positive definite kernels. In the asymmetric case, inner products of MLPs are universal approximators for continuous bivariate functions. The efficiency of approximation is characterized in terms of a bound on the number of neurons needed to achieve a particular approximation error.
To promote weight sharing, we can have one common non-linear map $\phi$ shared across all dimensions, together with different linear projections for each dimension of the relation. That is, $r(x, y)$ is given by
$$r(x, y) = \big( \langle W_1 \phi(x), \bar{W}_1 \phi(y) \rangle, \ldots, \langle W_{d_r} \phi(x), \bar{W}_{d_r} \phi(y) \rangle \big), \qquad (2)$$
where the learnable parameters are $\phi$ and $W_1, \bar{W}_1, \ldots, W_{d_r}, \bar{W}_{d_r}$. The non-linear map $\phi$ may be an MLP, for example, and $W_k, \bar{W}_k$ are matrices. The class of functions realizable by Equation 2 is the same as Equation 1 but enables greater weight sharing.
The “Multi-dimensional Inner Product Relation” (MD-IPR) module receives a sequence of objects $x_1, \ldots, x_n$ as input and models the pairwise relations between them by Equation 2, returning an $n \times n \times d_r$ relation tensor $R$, with $R[i, j] = r(x_i, x_j)$ describing the relation between each pair of objects.
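To make the MD-IPR computation concrete, the following is a minimal NumPy sketch of Equation 2; the function names, shapes, and the toy shared map are illustrative choices of ours, not the reference implementation.

```python
import numpy as np

def md_ipr(X, phi, W, W_bar):
    """Compute an (n, n, d_r) relation tensor from n objects X of shape (n, d).

    phi: shared non-linear map, R^d -> R^{d_phi}
    W, W_bar: projections of shape (d_r, d_proj, d_phi), one pair per relation dimension.
    """
    Z = np.stack([phi(x) for x in X])            # (n, d_phi)
    left = np.einsum('kpd,nd->nkp', W, Z)        # W_k phi(x_i)     -> (n, d_r, d_proj)
    right = np.einsum('kpd,nd->nkp', W_bar, Z)   # W_bar_k phi(x_j) -> (n, d_r, d_proj)
    # R[i, j, k] = <W_k phi(x_i), W_bar_k phi(x_j)>
    return np.einsum('ikp,jkp->ijk', left, right)

# Toy usage: a random tanh map stands in for the shared MLP phi.
n, d, d_phi, d_proj, d_r = 5, 8, 16, 4, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(d_phi, d))
R = md_ipr(rng.normal(size=(n, d)), lambda x: np.tanh(A @ x),
           rng.normal(size=(d_r, d_proj, d_phi)),
           rng.normal(size=(d_r, d_proj, d_phi)))
print(R.shape)  # (5, 5, 3)
```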
3 Relational Convolutions with Graphlet Filters
3.1 Relational Convolutions with Discrete Groups
In this section, we formalize a relational convolution operation which processes pairwise relations between objects to produce representations of the relational patterns within groups of objects. Suppose that we have a sequence of objects $x_1, \ldots, x_n$ and a relation tensor $R \in \mathbb{R}^{n \times n \times d_r}$ describing the pairwise relations between them, obtained from an MD-IPR layer via $R[i, j] = r(x_i, x_j)$. The key idea is to learn a template of relations between a small set of objects, and to “convolve” the template with the relation tensor, matching it against the relational patterns in different groups of objects. This transforms the relation tensor into a sequence of vectors, each summarizing the relational pattern in some group of objects. Crucially, this can now be composed with another relational layer to compute higher-order relations—i.e., relations on relations.
Fix some filter size $s$, where $s$ is a hyperparameter of the relational convolution layer. One ‘filter’ of size $s$ is given by $f \in \mathbb{R}^{s \times s \times d_r}$. This is a template for the pairwise relations between a group of $s$ objects. Since pairwise relations can be viewed as edges on a graph, we use the term “graphlet filter” to refer to a template of pairwise relations between a small set of objects. Let $g = (i_1, \ldots, i_s) \in [n]^s$ be a group of $s$ objects among $x_1, \ldots, x_n$. Then, denote the relation sub-tensor associated with this group by $R[g] = \big( R[i_k, i_l] \big)_{k, l \in [s]} \in \mathbb{R}^{s \times s \times d_r}$. We define the ‘relational inner product’ between this relation subtensor and the filter $f$ by
$$\langle R[g], f \rangle = \sum_{k, l \in [s]} \langle R[i_k, i_l], f[k, l] \rangle. \qquad (3)$$
This is simply the standard inner product in the corresponding Euclidean space $\mathbb{R}^{s \times s \times d_r}$. This quantity represents how much the relational pattern in $R[g]$ matches the template $f$.
In a relational convolution layer, we learn $n_f$ different filters. Denote the collection of filters by $\boldsymbol{f} = (f_1, \ldots, f_{n_f}) \in \mathbb{R}^{s \times s \times d_r \times n_f}$, which we call a graphlet filter. We define the relational inner product of a relation subtensor with the graphlet filters as the $n_f$-dimensional vector consisting of the relational inner products with each individual filter,
$$\langle R[g], \boldsymbol{f} \rangle = \big( \langle R[g], f_1 \rangle, \ldots, \langle R[g], f_{n_f} \rangle \big) \in \mathbb{R}^{n_f}. \qquad (4)$$
This vector summarizes various aspects of the relational pattern within a group, captured by several different filters. (We have overloaded the notation $\langle \cdot, \cdot \rangle$, but will use the convention that a collection of filters is denoted by a bold symbol, e.g., $\boldsymbol{f}$ vs. $f$, to distinguish between the two forms of the relational inner product.) Each filter corresponds to one dimension of the output. This is reminiscent of convolutional neural networks, where each filter gives us one channel in the output tensor.
For a given group $g$, the relational inner product with a graphlet filter, $\langle R[g], \boldsymbol{f} \rangle$, gives us a vector summarizing the relational patterns inside that group. Let $\mathcal{G}$ be a set of groupings of the objects, each of size $s$. The relational convolution between a relation tensor $R$ and a relational graphlet filter $\boldsymbol{f}$ is defined as the sequence of relational inner products with each group in $\mathcal{G}$,
$$R \ast \boldsymbol{f} = \big( \langle R[g], \boldsymbol{f} \rangle \big)_{g \in \mathcal{G}} \in \mathbb{R}^{|\mathcal{G}| \times n_f}. \qquad (5)$$
In this section, we assumed that $\mathcal{G}$ was given. If some prior information is known about which groupings are relevant, this can be encoded in $\mathcal{G}$. Otherwise, if $n$ is small, $\mathcal{G}$ can be all $\binom{n}{s}$ combinations of size $s$. However, when $n$ is large, considering all combinations will be intractable. In the next subsection, we consider the problem of differentiably learning the relevant groups.
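As an illustration, here is a small NumPy sketch of the relational inner product and the discrete-group relational convolution (Equations 3–5), assuming a relation tensor of shape (n, n, d_r) as produced by an MD-IPR layer; names and shapes are our own choices.

```python
import numpy as np
from itertools import combinations

def relational_inner_product(R_g, filters):
    """R_g: (s, s, d_r) relation subtensor; filters: (n_f, s, s, d_r).
    Returns an (n_f,) vector of matches against each graphlet filter."""
    return np.einsum('abd,fabd->f', R_g, filters)

def relational_convolution(R, filters, groups=None):
    """Convolve graphlet filters against the relation tensor R of shape (n, n, d_r).
    If no groups are given, use all size-s combinations of the n objects."""
    n = R.shape[0]
    s = filters.shape[1]
    if groups is None:
        groups = list(combinations(range(n), s))
    out = []
    for g in groups:
        idx = np.array(g)
        R_g = R[np.ix_(idx, idx)]                   # (s, s, d_r) relation subtensor
        out.append(relational_inner_product(R_g, filters))
    return np.stack(out)                            # (|groups|, n_f)

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 6, 4))                      # relation tensor from an MD-IPR layer
filters = rng.normal(size=(8, 3, 3, 4))             # n_f = 8 graphlet filters of size s = 3
print(relational_convolution(R, filters).shape)     # (20, 8), since C(6, 3) = 20
```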

3.2 Relational Convolutions with Group Attention
In the above formulation, the groups are ‘discrete’. Having discrete groups can be desirable for interpretability if the relevant groupings are known a priori or if considering every possible grouping is computationally and statistically feasible. However, if the relevant groupings are not known, then considering all possible combinations results in a rapid growth of the number of objects at each layer.
In order to address these issues, we can explicitly model and learn the relevant groups. This allows us to control the number of objects in the output sequence of a relational convolution such that only relevant groups are considered. We propose modeling groups via an attention operation.
Consider the input $x_1, \ldots, x_n$. Let $n_g$ be the number of groups to be learned and $s$ be the size of the graphlet filter (and hence the size of each group). These are hyperparameters of the model that we control. For each group $k \in [n_g]$, we learn $s$ different queries, $q_{k,1}, \ldots, q_{k,s}$, that will be used to retrieve a group of size $s$ via attention. The $t$-th object in the $k$-th group is retrieved as follows,
$$\bar{x}_{k,t} = \sum_{i=1}^{n} \alpha_{k,t,i} \, x_i, \qquad \alpha_{k,t} = \mathrm{Softmax}\big( ( \langle q_{k,t}, e_i \rangle / \tau )_{i \in [n]} \big), \qquad (6)$$
where $\bar{x}_{k,t}$ is the $t$-th object retrieved in the $k$-th group, $q_{k,t}$ is the query for retrieving the $t$-th object in the $k$-th group, $e_i$ is the key associated with the object $x_i$, and $\tau$ is a temperature scaling parameter.
The key $e_i$ for each object is computed as a function of its position, features, and/or context. For example, to group objects based on their position, the key can be a positional embedding, $e_i = p_i$. To group based on features, the key can be a linear projection of the object’s feature vector, $e_i = W_e x_i$. To group based on both position and features, the key can be a sum or concatenation of the above. Finally, computing keys after a self-attention operation allows objects to be grouped based on the context in which they occur.
The relation subtensor for each group is computed using a shared MD-IPR layer $r(\cdot, \cdot)$,
$$R_k[t, t'] = r(\bar{x}_{k,t}, \bar{x}_{k,t'}), \qquad t, t' \in [s]. \qquad (7)$$
The relational convolution is computed as before via,
$$R \ast \boldsymbol{f} = \big( \langle R_k, \boldsymbol{f} \rangle \big)_{k \in [n_g]} \in \mathbb{R}^{n_g \times n_f}. \qquad (8)$$
Overall, relational convolution with group attention can be summarized as follows: 1) learn $n_g$ groupings of objects, retrieving $s$ objects per group; 2) compute the relation tensor of each group using an MD-IPR module; 3) compute a relational convolution with a learned set of graphlet filters $\boldsymbol{f}$, producing a sequence of vectors, each describing the relational pattern within a (learned) group of objects.
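The sketch below strings these three steps together in NumPy under simplifying assumptions (keys and queries passed in as fixed arrays, and a shared MD-IPR parameterization as in Equation 2); it is illustrative rather than the reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_attention_relconv(X, keys, queries, W, W_bar, phi, filters, tau=1.0):
    """X: (n, d) objects; keys: (n, d_k); queries: (n_g, s, d_k).
    Returns the (n_g, n_f) relational-convolution output and the attention scores."""
    scores = softmax(np.einsum('gsk,nk->gsn', queries, keys) / tau)  # (n_g, s, n)
    Xg = np.einsum('gsn,nd->gsd', scores, X)          # soft-retrieved group members
    out = []
    for g in range(Xg.shape[0]):
        Z = np.stack([phi(x) for x in Xg[g]])         # (s, d_phi), shared map
        left = np.einsum('kpd,sd->skp', W, Z)
        right = np.einsum('kpd,sd->skp', W_bar, Z)
        R_g = np.einsum('ikp,jkp->ijk', left, right)  # (s, s, d_r), shared MD-IPR
        out.append(np.einsum('abd,fabd->f', R_g, filters))
    return np.stack(out), scores

# Toy usage with random parameters.
rng = np.random.default_rng(0)
n, d, d_k, n_g, s = 9, 8, 8, 2, 3
d_phi, d_proj, d_r, n_f = 16, 4, 4, 8
A = rng.normal(size=(d_phi, d))
out, scores = group_attention_relconv(
    X=rng.normal(size=(n, d)), keys=rng.normal(size=(n, d_k)),
    queries=rng.normal(size=(n_g, s, d_k)),
    W=rng.normal(size=(d_r, d_proj, d_phi)), W_bar=rng.normal(size=(d_r, d_proj, d_phi)),
    phi=lambda x: np.tanh(A @ x), filters=rng.normal(size=(n_f, s, s, d_r)))
print(out.shape, scores.shape)  # (2, 8) (2, 3, 9)
```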
Computing input-dependent queries. In the simplest case, the query vectors are simply learned parameters of the model, representing a fixed criterion for selecting the groups. The queries can also be produced in an input-dependent manner. There are many ways to do this. For example, the input can be processed with some sequence or set embedder (e.g., through a self-attention operation) producing a vector embedding that can be mapped to different queries using learned linear maps.
Entropy regularization. Intuitively, we would like the group attention scores in Equation 6 to be close to discrete assignments. To encourage the model to learn more structured group assignments, we add an entropy regularization term to the loss function, $\mathcal{L}_{\mathrm{ent}} = \sum_{k \in [n_g]} \sum_{t \in [s]} H(\alpha_{k,t})$, where $H$ is the Shannon entropy. As a heuristic, this regularization can be scaled by a small factor so that it doesn’t dominate the underlying task’s loss. Sparsity regularization in neural attention has been explored in several previous works, including through entropy regularization [e.g., 25, 24, 4].
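A minimal sketch of such an entropy regularizer on the group-attention scores follows; the scaling factor here is an assumed hyperparameter, not a value prescribed above.

```python
import numpy as np

def entropy_regularizer(scores, scale=0.1, eps=1e-9):
    """scores: (n_g, s, n) group-attention distributions over the n objects.
    Returns scale * (sum of Shannon entropies), to be added to the task loss."""
    H = -(scores * np.log(scores + eps)).sum(axis=-1)   # (n_g, s) entropies
    return scale * H.sum()
```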
Symmetric relational inner products. So far, we have considered ordered groups. That is, the relational pattern computed by the relational inner product for the group $g = (i_1, i_2, \ldots, i_s)$ is different from that of a permutation of the same group, e.g., $(i_2, i_1, \ldots, i_s)$. In some scenarios, symmetry in the representation of the relational pattern is a useful inductive bias. To capture this, we define a symmetric variant of the relational inner product that is invariant to the ordering of the elements in the group. This can be done by pooling over all permutations of the group. In particular, we suggest max-pooling or average-pooling, although any set-aggregator would be valid. We define the permutation-invariant relational inner product as
$$\langle R[g], \boldsymbol{f} \rangle_{\mathrm{sym}} = \mathrm{Pool}_{\sigma \in \mathfrak{S}(g)} \, \langle R[\sigma(g)], \boldsymbol{f} \rangle, \qquad (9)$$
where $\mathfrak{S}(g)$ denotes the set of permutations of the group $g$, and pooling is done independently across the $n_f$ dimensions.
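A sketch of Equation 9 in NumPy, pooling the ordered relational inner product over all permutations of a group (function and argument names are our own):

```python
import numpy as np
from itertools import permutations

def symmetric_relational_inner_product(R_g, filters, pool=np.max):
    """R_g: (s, s, d_r); filters: (n_f, s, s, d_r); pool: np.max or np.mean."""
    s = R_g.shape[0]
    vals = []
    for perm in permutations(range(s)):
        idx = np.array(perm)
        R_perm = R_g[np.ix_(idx, idx)]                   # reorder rows and columns
        vals.append(np.einsum('abd,fabd->f', R_perm, filters))
    return pool(np.stack(vals), axis=0)                  # (n_f,), pooled over s! orderings
```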
Computational efficiency. Equation 6 can be computed in parallel with $O(n \cdot n_g \cdot s \cdot d_k)$ operations, where $d_k$ is the key dimension. When the hyperparameters of the model are fixed, this is linear in the sequence length $n$. Equation 7 can be computed in parallel via efficient matrix multiplication with $O(n_g \cdot s^2)$ evaluations of the relation function. Finally, Equation 8 can be computed in parallel with $O(n_g \cdot n_f \cdot s^2 \cdot d_r)$ operations. The latter two computations do not scale with the number of objects in the input, and are only a function of the hyperparameters of the model.
3.3 Deep Relational Architectures by Composing Relational Convolutions
A relational convolution block (including an MD-IPR module) is a simple neural module that can be composed to build a deep architecture that learns iteratively more complex relational feature representations.
Following the notation in Figure 2, let $n^{(\ell)}$ denote the number of objects and $d^{(\ell)}$ the object dimension at layer $\ell$. A relational convolution block receives as input a sequence of objects of shape $n^{(\ell)} \times d^{(\ell)}$ and returns a sequence of objects of shape $n^{(\ell+1)} \times d^{(\ell+1)}$ representing the relational patterns among groupings of objects. The output dimension $d^{(\ell+1)}$ corresponds to the number of graphlet filters $n_f$, and is a hyperparameter. The sequence length $n^{(\ell+1)}$ corresponds to the number of groups, and is $\binom{n^{(\ell)}}{s}$ in the case of given discrete groups (Section 3.1) or a hyperparameter $n_g$ in the case of learned groups via group attention (Section 3.2). Each composition of a relational convolution block computes relational features of one degree higher (i.e., relations between relations).
A common recipe for building modern deep learning architectures is to use residual connections [15] and normalization [5]. This can be achieved for relational convolutional networks by fixing the number-of-groups and number-of-filters hyperparameters to be the same across all layers, so that the input and output shapes remain the same. Then, letting $z^{(\ell)}$ denote the hidden representation at layer $\ell$, the overall architecture becomes $z^{(\ell+1)} = z^{(\ell)} + W^{(\ell)} \, \mathrm{RelConv}^{(\ell)}(z^{(\ell)})$, where $W^{(\ell)}$ is a linear transformation that controls where information is written to in the residual stream. This ResNet-style architecture allows the hidden representation to encode relational information at multiple layers of hierarchy, retaining the information at shallower layers. Additionally, we can insert MLP layers to process the relational representations before the next relational convolution layer. In this paper, we limit our exploration to relatively shallow networks.
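The residual composition described above can be sketched as follows, assuming blocks whose input and output shapes match (same number of groups and filters at every layer) and omitting normalization and the optional MLP layers:

```python
import numpy as np

def relconvnet_residual(z, blocks, linears):
    """z: (n_g, n_f) hidden objects; blocks: callables mapping (n_g, n_f) -> (n_g, n_f)
    (e.g., MD-IPR followed by a relational convolution); linears: (n_f, n_f) matrices.
    Applies z <- z + RelConv(z) @ W.T at each layer, as in the residual form above."""
    for block, W in zip(blocks, linears):
        z = z + block(z) @ W.T   # each composition adds one level of relational hierarchy
    return z

# Toy usage on a (4 groups, 8 filters) representation, with elementwise stand-in blocks.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8))
blocks = [np.tanh, np.tanh]                             # stand-ins for RelConv blocks
linears = [rng.normal(size=(8, 8)) * 0.1] * 2
print(relconvnet_residual(z0, blocks, linears).shape)   # (4, 8)
```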
4 Experiments
In this section, we empirically evaluate the proposed relational convolutional network architecture (abbreviated RelConvNet) to assess its effectiveness at learning relational tasks. We compare this architecture to several existing relational architectures as well as general-purpose sequence models. The common input to all models is a sequence of objects $x_1, \ldots, x_n$. We evaluate against the following baselines.
• Transformer [33]. The Transformer is a powerful general-purpose sequence model. It consists of alternating self-attention and multi-layer perceptron blocks. Self-attention performs an information retrieval operation, which updates the internal representation of each object as a function of its context. Dot-product attention is computed via $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\big( Q K^\top / \sqrt{d_k} \big) V$, and the MLP is applied independently to each object’s internal representation. The attention scores computed as an intermediate step in dot-product attention can perhaps be thought of as relations that determine what information to retrieve.
• PrediNet [32]. The PrediNet architecture is an explicitly relational architecture inspired by predicate logic. At a high level, PrediNet computes relations between pairs of objects, with the pairs selected via a learned attention operation. The relations are differences between low-dimensional embeddings of the selected objects. More precisely, for each head $h$, a pair of objects is retrieved via an attention operation, and the final output of PrediNet is the set of difference relations between linear projections of the two selected objects, concatenated across heads.
• CoRelNet [18]. The CoRelNet architecture is proposed as a minimal relational architecture distilling the core inductive biases that the authors argue are important for relational tasks. The CoRelNet module simply computes inner products between object representations and applies a Softmax normalization, returning an $n \times n$ “similarity matrix”. That is, the objects are processed independently to produce embeddings $z_1, \ldots, z_n$, and the similarity matrix is computed as $A = \mathrm{Softmax}\big( [\langle z_i, z_j \rangle]_{i, j \in [n]} \big)$. The similarity matrix is then flattened and passed through an MLP to produce the final output.
• Graph Neural Networks. Graph neural networks are a class of neural network architectures which operate on graph-structured data. A graph neural network typically receives two inputs: a graph described by a set of edges, and feature vectors for each node in the graph. GNNs can be described through the unifying framework of neural message-passing. Under this framework, graph-structured data is processed through an iterative message-passing operation given by $h_i^{(t+1)} = f_\theta\big( h_i^{(t)}, \{ h_j^{(t)} : j \in \mathcal{N}(i) \} \big)$, where $\mathcal{N}(i)$ denotes the neighborhood of node $i$. That is, each node’s internal representation is iteratively updated as a function of its neighborhood. Here, $f_\theta$ is parameterized by a neural network, and the variation between different GNN architectures lies in the architectural design of this update process (a minimal sketch is given after this list). We use Graph Convolution Networks [21], Graph Attention Networks [34], and Graph Isomorphism Networks [38] as representative GNN baselines.
• CNN. As a non-relational baseline, we test a regular convolutional neural network which processes the raw image input. The central modules in the baselines above receive an object-centric representation as input: a sequence of vector embeddings produced by a small CNN, each corresponding to one of the objects in the input. Here, instead, a deeper CNN model processes the raw image input representing the entire “scene” in an end-to-end manner.
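For concreteness, here is a minimal sketch of one neural message-passing step with a mean aggregator and an MLP-style update; this is a simplification rather than the update rule of any specific GNN baseline.

```python
import numpy as np

def message_passing_step(H, adjacency, update_fn):
    """One message-passing update. H: (n, d) node features; adjacency: (n, n) 0/1
    matrix (a complete graph for the relational tasks considered here);
    update_fn: maps a concatenated (2*d,) vector to an updated (d,) vector."""
    deg = np.maximum(adjacency.sum(axis=1, keepdims=True), 1)
    messages = adjacency @ H / deg                      # mean over each node's neighbors
    return np.stack([update_fn(np.concatenate([h, m])) for h, m in zip(H, messages)])
```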
4.1 Relational Games
The relational games dataset was contributed as a benchmark for relational reasoning by [32]. It consists of a family of binary classification tasks for identifying abstract relational rules among a set of objects represented as images. The object images depict simple geometric shapes and come in three different splits with different visual styles for evaluating out-of-distribution generalization, referred to as “pentominoes”, “hexominoes”, and “stripes”. The input is a sequence of objects arranged in a $3 \times 3$ grid. Each task corresponds to some relationship between the objects, and the target is to classify whether the relationship holds among the objects in the input or not (see Figure 4).
In our experiments, we evaluate out-of-distribution generalization by training on the pentominoes objects and evaluating on the hexominoes and stripes objects. The input to the models is presented as a sequence of objects, with each object represented as an RGB image. All models share a common architecture of the form CNN embedder → central module → output MLP, where the central module is the component being tested. In all models, the objects are first processed independently by a CNN with a shared architecture. The processed objects are then passed to the central module of the model. The final prediction is produced by an MLP with a shared architecture. In this section, we focus our comparison on four models: RelConvNet (ours), CoRelNet [18], PrediNet [32], and a Transformer [33]. (The GNN baselines failed to learn the relational games tasks in a way that generalizes, often severely overfitting; for clarity of presentation, we defer results on the GNN baselines to Appendix A.) The pentominoes split is used for training, and the hexominoes and stripes splits are used to test out-of-distribution generalization after training. We train for 50 epochs using the categorical cross-entropy loss and the Adam optimizer with a batch size of 512. For each model and task, we run 5 trials with different random seeds. Appendix A describes further experimental details about the architectures and training setup.


Out-of-distribution generalization. Figure 5 reports model performance on the two hold-out object sets after training. On the hexominoes objects, which are visually similar to the pentominoes objects used for training, RelConvNet and CoRelNet perform nearly perfectly. PrediNet and the Transformer do well on the simpler tasks, but struggle with the more difficult ‘match pattern’ task. The ‘stripes’ objects are visually more distinct from the objects in the training split, making generalization more difficult. We observe an overall drop in performance for all models. The drop is particularly dramatic for CoRelNet. (The experiments in [18] on the relational games benchmark use a technique called “context normalization” [35] as a preprocessing step. We choose not to use this technique since it is an added confounder; we discuss this choice in Appendix C.) The separation between RelConvNet and the other models is largest on the ‘match pattern’ task of the stripes split (the most difficult task and the most difficult generalization split). Here, RelConvNet maintains a mean accuracy of 87% while the other models drop below 65%. We attribute this to RelConvNet’s ability to naturally represent higher-order relations and model groupings of objects. The CNN baseline learns the easier ‘same’, ‘between’, and ‘occurs’ tasks nearly perfectly, but completely fails to learn the more difficult ‘xoccurs’ and ‘match pattern’ tasks. This hard boundary suggests that explicit relational architectural inductive biases are necessary for learning more difficult relational tasks.


Data efficiency. We observe that the relational inductive biases of RelConvNet, and relational models more generally, grant a significant advantage in sample-efficiency. Figure 6 shows the training accuracy over the first 2,000 batches for each model. RelConvNet, CoRelNet, and PrediNet are explicitly relational architectures, whereas the Transformer is not. The Transformer is able to process relational information through its attention mechanism, but this information is entangled with the features of individual objects (which, for these relational tasks, is extraneous information). The Transformer consistently requires the largest amount of data to learn the relational games tasks. PrediNet tends to be more sample-efficient. RelConvNet and CoRelNet are the most sample-efficient, with RelConvNet only slightly more sample-efficient on most tasks.
On the ‘match pattern’ task, which is the most difficult, RelConvNet is significantly more sample-efficient. We attribute this to the fact that RelConvNet is able to model higher-order relations through its relational convolution module. The ‘match pattern’ task can be thought of as a second-order relational task—it involves computing the relational pattern in each of two groups, and comparing the two relational patterns. The relational convolution module naturally models this kind of situation since it learns representations of the relational patterns within groups of objects.

Learning groups via group attention. Next, we analyze RelConvNet’s ability to learn useful groupings through group attention in an end-to-end manner. We train a relational convolutional network with learned groups and a graphlet size of $s = 3$. We group based on position by using positional embeddings for the keys. In Figure 7, we visualize the group attention scores (see Equation 6) learned from one of the training runs. For each group, the figure depicts a $3 \times 3$ grid representing the objects attended to in that group. Since each group contains $s = 3$ objects, we represent the attention values in the 3-channel HSV color representation. We observe that 1) group attention learns to ignore the middle row, which contains no relevant information; and 2) the selection of objects in the top row and the bottom row is structured. In particular, one group considers the relational pattern within the bottom row and another considers the relational pattern in the top row, which is exactly how a human would tackle this problem. We refer to Figure 11 for an exploration of the effect of entropy regularization on group attention. We find that entropy regularization is necessary for the model to learn, and causes the group attention scores to converge to interpretable discrete assignments.
4.2 Set: Grouping and Compositionality in Relational Reasoning
Set is a card game that forms a simple-to-describe but challenging relational task. The ‘objects’ are a deck of 81 cards varying along four attributes, each of which takes one of three possible values: ‘color’ can be red, green, or purple; ‘number’ can be one, two, or three; ‘shape’ can be diamond, squiggle, or oval; and ‘fill’ can be solid, striped, or empty. A ‘set’ is a triplet of cards such that each attribute is either a) the same on all three cards, or b) different on all three cards.
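The ‘set’ rule is easy to state programmatically; the short sketch below (with cards encoded as 4-tuples of attribute indices, a convention of ours) checks the rule and counts the cards and ‘sets’ in the full deck.

```python
from itertools import combinations, product

def is_set(card_a, card_b, card_c):
    # A triplet is a 'set' iff every attribute is all-same (1 distinct value)
    # or all-different (3 distinct values) across the three cards.
    return all(len({a, b, c}) in (1, 3) for a, b, c in zip(card_a, card_b, card_c))

deck = list(product(range(3), repeat=4))                    # 3^4 = 81 cards
n_sets = sum(is_set(*t) for t in combinations(deck, 3))
print(len(deck), n_sets)                                    # 81 1080
```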


In Set, the task is: given a hand of cards, find a ‘set’ among them. Figure 8(a) depicts a positive and a negative example, with the ‘set’ in the positive example highlighted. This task is deceptively challenging, and is representative of the type of relational reasoning that humans excel at but machine learning systems still struggle with. To solve the task, one must process the sensory information of individual cards to identify the values of each attribute, and then reason about the relational pattern in each triplet of cards. The construct of relational convolutions proposed in this paper is a step towards developing machine learning systems that can perform this kind of relational reasoning.
In this section, we evaluate RelConvNet on a task based on Set and compare it to several baselines. The task is: given a collection of images of Set cards, determine whether or not they contain a ‘set’. All models share a common architecture of the form CNN embedder → central module → output MLP, where the central module is the component being tested. The CNN embedder is pre-trained on the task of classifying the four attributes of the cards, and an intermediate layer is used to generate embeddings. The output MLP architecture is shared across all models. Further architectural details can be found in Appendix A.
In Set, there exist $\binom{81}{3} = 85{,}320$ triplets of cards, of which 1080 are a ‘set’. We partition the ‘sets’ into training (70%), validation (15%), and test (15%) sets. The training, validation, and test datasets are generated by sampling tuples of cards such that, with a fixed probability, the tuple does not contain a set, and otherwise it contains a set drawn from the corresponding partition of ‘sets’. Partitioning the data in this way allows us to measure the models’ ability to “learn the rule” and identify new, unseen ‘sets’. We train for 100 epochs with the same loss, optimizer, and batch size as the experiments in the previous section. For each model, we run 10 trials with different random seeds.
When using the default optimizer hyperparameters from the previous experiment, without hyperparameter tuning, we find that RelConvNet is the only model able to meaningfully learn the task in a manner that generalizes to unseen ‘sets’. In particular, we observe that many baselines severely overfit the training data, failing to learn the rule and generalize (see Section B.1). Although RelConvNet did not require hyperparameter tuning, we carry out an extensive hyperparameter sweep for each of the other baselines in order to validate our conclusions against the best-achievable performance of each. We ran a total of 1600 experimental runs searching over combinations of architectural hyperparameters (number of layers) and optimization hyperparameters (weight decay, learning rate schedule) for each baseline, with the goal of finding a hyperparameter configuration representative of the best achievable performance of each model class on this task. The results of the hyperparameter sweep are summarized in Appendix B.
Figure 8(b) shows the hold-out test accuracy for each model. Figure 8(c) shows the training and validation accuracy over the course of training. Here, RelConvNet uses the Adam optimizer with the default TensorFlow hyperparameters (constant learning rate of $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$), while each baseline has its own individually-optimized hyperparameters, described in Appendix B.
We observe a sharp separation between RelConvNet and all other baselines. While RelConvNet is able to learn the task and generalize to new ‘sets’ with near-perfect accuracy (avg: 97.9%), no other model is able to reach a comparable generalization accuracy even after hyperparameter tuning. The next best is the GAT model (avg: 67.5%). Several models are able to fit the training data, reaching near-perfect training accuracy, but they are unable to “learn the rule” in a way that generalizes to the validation or test sets. This suggests that while these models are powerful function approximators, they lack the inductive biases to learn hierarchical relations.

In Figure 9 we analyze the geometry of the representations learned by the relational convolution layer. We consider all triplets of cards, compute the relation subtensor of each using the learned MD-IPR layer, and compute the relational inner product with the learned graphlet filter $\boldsymbol{f}$. The result is an $n_f$-dimensional vector for each triplet of cards. We perform PCA to plot this in two dimensions, and color-code each triplet of cards according to whether or not it forms a ‘set’. We find that the relational convolution layer learns a representation of the relational pattern in groups of objects that separates ‘sets’ and ‘non-sets’. In particular, the two classes form clusters that are linearly separable even when projected down to two dimensions by PCA. This explains why RelConvNet is able to learn the task in a way that generalizes while the other models are not. In Appendix E we expand on this discussion and further analyze the representations learned by the MD-IPR layer, showing that the learned relations map onto the color, number, shape, and fill attributes.
It is perhaps surprising that models like GNNs and Transformers perform poorly on these relational tasks, given their apparent ability to process relations through neural message-passing and attention, respectively. We remark that GNNs operate in a different domain compared to relational models like RelConvNet, PrediNet, and CoRelNet. In GNNs, the relations are an input to the model, received in the form of a graph, and are used to dictate the flow of information in a neural message-passing operation. By contrast, in relational convolutional networks, the input is simply a set of objects without relations—the relations need to be inferred as part of the feature representation process. Thus, GNNs operate in domains where relational information is already present (e.g., analysis of social networks, biological networks, etc.), whereas our framework aims to solve tasks that rely on relations that must be inferred end-to-end. This offers a partial explanation for the inability of GNNs to learn this task—GNNs are good at processing network-style relations when they are given as input, but may not be able to infer and hierarchically process relations when they are not given. In the case of Transformers, relations are modeled implicitly to direct information retrieval in attention, but are not encoded explicitly in the final representations. By contrast, RelConvNet operates on collections of objects and possesses inductive biases for learning iteratively more complex relational representations, guided only by the supervisory signal of the downstream task.
Models like CoRelNet and PrediNet have relational inductive biases, but lack compositionality. On the other hand, deep models like Transformers and GNNs are compositional, but lack relational inductive biases. This experiment suggests that compositionality and relational inductive biases are both necessary ingredients to efficiently learn representations of higher-order relations. RelConvNet is a compositional architecture imbued with relational inductive biases and a demonstrated ability to tackle hierarchical relational tasks.
5 Discussion
Summary
In this paper, we proposed a compositional architecture and framework for learning hierarchical relational representations via a novel relational convolution operation. The relational convolution operation we propose here is a ‘convolution’ in the sense that it considers a patch of the relation tensor, given by a subset of objects, and compares the relations within it to a template graphlet filter via an appropriately-defined inner product. This is analogous to convolutional neural networks, where an image filter is compared against different patches of the input image. Moreover, we propose an attention-based mechanism for modeling useful groupings of objects in order to maintain scalability. By alternating inner product relation layers and relational convolution layers, we obtain an architecture that naturally models hierarchical relations.
Discussion on relational inductive biases
In our experiments, we observed that general-purpose sequence models like the Transformer struggle to learn tasks that involve relational reasoning in a data-efficient manner. The relational inductive biases of RelConvNet, CoRelNet, and PrediNet result in significantly improved performance on the relational games tasks. These models each implement different kinds of relational inductive biases, and are each designed with different motivations in mind. For example, PrediNet’s architecture is loosely inspired by the structure of predicate logic, but can be understood as ultimately producing representations of pairwise difference-relations, with pairs of objects selected by an attention operation. CoRelNet is a minimal relational architecture that consists of computing an inner product similarity matrix followed by a softmax normalization. RelConvNet, our proposed architecture, provides further flexibility across several dimensions. Like CoRelNet, it models relations as inner products of feature maps, but it achieves greater representational capacity by learning multi-dimensional relations through multiple learned feature maps or filters. More importantly, the relational convolution operation enables learning higher-order relations between groups of objects. This is in contrast to both PrediNet and CoRelNet, which are limited to pairwise relations. Our experiments show that the inductive biases of RelConvNet result in improved performance on relational reasoning tasks. In particular, the Set task, where RelConvNet was the only model able to generalize non-trivially, demonstrates the necessity of explicit inductive biases that support learning hierarchical relations.
Limitations and future work
The tasks considered here are solvable by modeling only second-order relations. In the case of the relational games benchmark of [32], we observe that the tasks are saturated by the relational convolutional network architecture. While the “contains set” task demonstrates a sharp separation between relational convolutional networks and existing baselines, it too involves only second-order relations. A more thorough evaluation of this architecture, and of future architectures for modeling hierarchical relations, would require the development of new benchmark tasks and datasets that involve a larger number of objects and higher-order relations. This is a subtle and non-trivial task that we leave for future work.
The modules proposed in this paper assume object-centric representations as input. In particular, the tasks considered in our experiments have an explicit delineation between different objects. In more general settings, object information may need to be extracted from raw stimuli by the system (e.g., a natural image containing multiple objects in a priori unknown positions). Learning object-centric representations is an active area of research [28, 13, 22, 19], and is related to but separate from learning relational representations. These methods produce a set of embedding vectors, each describing a different object in the scene, which can then be passed to a central processing module (e.g., a relational processing module such as RelConvNet). In future work, it will be important to explore how well RelConvNet integrates with methods for learning object-centric representations in an end-to-end system.
The experiments considered here are synthetic relational tasks designed for a controlled evaluation. In more realistic settings, we envision relational convolutional networks as modules embedded in a broader architecture. For example, a relational convolutional network can be embedded into an RL agent to enable performing tasks involving relational reasoning. Similarly, relational convolutions can perhaps be integrated into general-purpose sequence models, such as Transformers, to enable improved relational reasoning while retaining the generality of the architecture.
Code and Reproducibility
The project repository can be found here: https://github.com/Awni00/Relational-Convolutions. It includes an implementation of the relational convolutional networks architecture, code and instructions for reproducing our experimental results, and links to experimental logs.
Acknowledgment
This work is supported by the funds provided by the National Science Foundation and by DoD OUSD (R&E) under Cooperative Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence).
References
- [1] Awni Altabaa and John Lafferty “Approximation of relation functions and attention mechanisms”, 2024 arXiv:2402.08856 [cs.LG]
- [2] Awni Altabaa and John Lafferty “Disentangling and Integrating Relational and Sensory Information in Transformer Architectures”, 2024 arXiv:2405.16727 [cs.LG]
- [3] Awni Altabaa, Taylor Whittington Webb, Jonathan D. Cohen and John Lafferty “Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers” In The Twelfth International Conference on Learning Representations, 2024
- [4] Giuseppe Attanasio, Debora Nozza, Dirk Hovy and Elena Baralis “Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists” In Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 1105–1119
- [5] Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton “Layer Normalization”, 2016 arXiv: https://arxiv.org/abs/1607.06450
- [6] Peter W. Battaglia et al. “Relational Inductive Biases, Deep Learning, and Graph Networks” arXiv, 2018 arXiv:1806.01261 [cs, stat]
- [7] Patricia A Carpenter, Marcel A Just and Peter Shell “What one intelligence test measures: a theoretical account of the processing in the Raven Progressive Matrices Test.” In Psychological review 97.3 American Psychological Association, 1990, pp. 404
- [8] CE Englund, DL Reeves, CA Shingledecker, DR Thorne, KP Wilson and FW Hegge “Unified tri-service cognitive performance assessment battery (UTC-PAB) I. Design and Specification of the Battery” In Naval Health Research Center Report. San Diego, California, 1987
- [9] Joël Fagot, Edward A Wasserman and Michael E Young “Discriminating the relation between relations: the role of entropy in abstract conceptualization by baboons (Papio papio) and humans (Homo sapiens).” In Journal of Experimental Psychology: Animal Behavior Processes 27.4 American Psychological Association, 2001, pp. 316
- [10] Charles Bohris Ferster “Intermittent reinforcement of matching to sample in the pigeon” In Journal of the Experimental Analysis of Behavior 3.3 Society for the Experimental Analysis of Behavior, 1960, pp. 259
- [11] Stefan L Frank, Rens Bod and Morten H Christiansen “How hierarchical is language use?” In Proceedings of the Royal Society B: Biological Sciences 279.1747 The Royal Society, 2012, pp. 4522–4531
- [12] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals and George E. Dahl “Neural Message Passing for Quantum Chemistry” In International Conference on Machine Learning PMLR, 2017, pp. 1263–1272
- [13] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick and Alexander Lerchner “Multi-Object Representation Learning with Iterative Variational Inference” In Proceedings of the 36th International Conference on Machine Learning 97, Proceedings of Machine Learning Research PMLR, 2019, pp. 2424–2433 URL: https://proceedings.mlr.press/v97/greff19a.html
- [14] William L Hamilton “Graph Representation Learning”, Synthesis Lectures on Artificial Intelligence and Machine Learning San Rafael, CA: Morgan & Claypool, 2020
- [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep residual learning for image recognition” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
- [16] Jean-Rémy Hochmann, Arin S. Tuerk, Sophia Sanborn, Rebecca Zhu, Robert Long, Megan Dempster and Susan Carey “Children’s representation of abstract relations in relational/array match-to-sample tasks” In Cognitive Psychology 99, 2017, pp. 17–43
- [17] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick and Ross Girshick “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910
- [18] Giancarlo Kerg, Sarthak Mittal, David Rolnick, Yoshua Bengio, Blake Richards and Guillaume Lajoie “On Neural Architecture Inductive Biases for Relational Tasks”, 2022 arXiv:2206.05056 [cs]
- [19] Thomas Kipf, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy and Klaus Greff “Conditional Object-Centric Learning from Video” In International Conference on Learning Representations, 2022 URL: https://openreview.net/forum?id=aD7uesX1GF_
- [20] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling and Richard Zemel “Neural relational inference for interacting systems” In International conference on machine learning, 2018
- [21] Thomas N. Kipf and Max Welling “Semi-Supervised Classification with Graph Convolutional Networks” arXiv, 2017 DOI: 10.48550/arXiv.1609.02907
- [22] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy and Thomas Kipf “Object-centric learning with slot attention” In Advances in Neural Information Processing Systems 33, 2020, pp. 11525–11538
- [23] Gary F Marcus, Sugumaran Vijayan, Shoba Bandi Rao and Peter M Vishton “Rule learning by seven-month-old infants” In Science 283.5398 American Association for the Advancement of Science, 1999, pp. 77–80
- [24] André Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro Aguiar and Mario Figueiredo “Sparse and continuous attention mechanisms” In Advances in Neural Information Processing Systems 33, 2020, pp. 20989–21001
- [25] Vlad Niculae and Mathieu Blondel “A regularized framework for sparse and structured neural attention” In Advances in neural information processing systems 30, 2017
- [26] Mathias Niepert, Mohamed Ahmed and Konstantin Kutzkov “Learning convolutional neural networks for graphs” In International conference on machine learning, 2016 PMLR
- [27] Barbara Rosario, Marti A Hearst and Charles J Fillmore “The descent of hierarchy, and selection in relational semantics” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 247–254
- [28] Sara Sabour, Nicholas Frosst and Geoffrey E Hinton “Dynamic routing between capsules” In Advances in neural information processing systems 30, 2017
- [29] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu and Timothy Lillicrap “Relational Recurrent Neural Networks” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018
- [30] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia and Timothy Lillicrap “A simple neural network module for relational reasoning” In Advances in neural information processing systems 30, 2017
- [31] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov and Max Welling “Modeling relational data with graph convolutional networks” In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15, 2018, pp. 593–607 Springer
- [32] Murray Shanahan, Kyriacos Nikiforou, Antonia Creswell, Christos Kaplanis, David Barrett and Marta Garnelo “An Explicitly Relational Neural Network Architecture” In Proceedings of the 37th International Conference on Machine Learning 119, Proceedings of Machine Learning Research PMLR, 2020, pp. 8593–8603
- [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser and Illia Polosukhin “Attention is all you need” In Advances in neural information processing systems 30, 2017
- [34] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò and Yoshua Bengio “Graph Attention Networks” In International Conference on Learning Representations, 2018 URL: https://openreview.net/forum?id=rJXMpikCZ
- [35] Taylor Webb, Zachary Dulberg, Steven Frankland, Alexander Petrov, Randall O’Reilly and Jonathan Cohen “Learning Representations that Support Extrapolation” In Proceedings of the 37th International Conference on Machine Learning 119, Proceedings of Machine Learning Research PMLR, 2020, pp. 10136–10146
- [36] Taylor W. Webb, Steven M. Frankland, Awni Altabaa, Kamesh Krishnamurthy, Declan Campbell, Jacob Russin, Randall O’Reilly, John Lafferty and Jonathan D. Cohen “The Relational Bottleneck as an Inductive Bias for Efficient Abstraction” arXiv, 2024 arXiv:2309.06629 [cs]
- [37] Taylor Whittington Webb, Ishan Sinha and Jonathan Cohen “Emergent Symbols through Binding in External Memory” In International Conference on Learning Representations, 2021
- [38] Keyulu Xu, Weihua Hu, Jure Leskovec and Stefanie Jegelka “How Powerful are Graph Neural Networks?” In International Conference on Learning Representations, 2019
- [39] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov and Alexander J Smola “Deep sets” In Advances in neural information processing systems 30, 2017
- [40] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap and Edward Lockhart “Deep reinforcement learning with relational inductive biases” In International conference on learning representations, 2018
- [41] Matthew D Zeiler and Rob Fergus “Visualizing and understanding convolutional networks” In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, 2014, pp. 818–833 Springer
Appendix A Experiments Supplement
A.1 Relational Games (Section 4.1)
The pentominoes split is used for training, and the hexominoes and stripes splits are used to test out-of-distribution generalization after training. We hold out 1000 samples for validation (during training) and 5000 samples for testing (after training), and use the rest as the training set. We train for 50 epochs using the categorical cross-entropy loss and the Adam optimizer, with a batch size of 512. For each model and task, we run 5 trials with different random seeds. Table 1 contains text descriptions of each task in the relational games dataset used in the experiments of Section 4.1. Table 2 contains a description of the architectures of each model (or shared component) in the experiments. Table 3 reports the accuracy on the hold-out object sets (i.e., the numbers depicted in Figure 5 of the main text). Figures 11 and 10 explore the effect of entropy regularization in group attention on learning, using the “match pattern” task as an example.
Task | Description
---|---
same | Two random cells out of nine are occupied by an object. They are the “same” if they have the same color, shape, and orientation (i.e., identical image).
occurs | The top row contains one object and the bottom row contains three objects. The “occurs” relationship holds if at least one of the objects in the bottom row is the same as the object in the top row.
xoccurs | Same as occurs, but the relationship holds if exactly one of the objects in the bottom row is the same as the object in the top row.
between | The grid is occupied by three objects in a line (horizontal or vertical). The “between” relationship holds if the outer objects are the same.
row match pattern | The first and third rows of the grid are occupied by three objects each. The “match pattern” relationship holds if the relational pattern in each row is the same (e.g., AAA, AAB, ABC, etc.).
Model / Component | Architecture |
---|---|
Common CNN Embedder | Conv2D → MaxPool2D → Conv2D → MaxPool2D → Flatten. Conv2D: num filters = 16, filter size = , activation = relu. MaxPool2D: stride = 2. |
Common output MLP | Dense(64, ‘relu’) → Dense(2). |
RelConvNet | CNN Embedder → MD-IPR → RelConv → Flatten → MLP. MD-IPR: relation dim = 16, projection dim = 4, symmetric. RelConv: num filters = 16, filter size = 3, discrete groups = combinations. |
CoRelNet | CNN Embedder → CoRelNet → Flatten → MLP. Standard CoRelNet has no hyperparameters. |
PrediNet | CNN Embedder → PrediNet → Flatten → MLP. PrediNet: key dim = 4, number of heads = 4, num relations = 16. |
Transformer | CNN Embedder → TransformerEncoder → AveragePooling → MLP. TransformerEncoder: num layers = 1, num heads = 8, feedforward intermediate size = 32, activation = relu. |
GCN | CNN Embedder → AddPosEmb → (GCNConv → Dense) → AveragePooling → MLP. GCNConv: channels = 32; Dense: num neurons = 32, activation = relu. |
GAT | CNN Embedder → AddPosEmb → (GATConv → Dense) → AveragePooling → MLP. GATConv: channels = 32; Dense: num neurons = 32, activation = relu. |
GIN | CNN Embedder → AddPosEmb → (GINConv → Dense) → AveragePooling → MLP. GINConv: channels = 32; Dense: num neurons = 32, activation = relu. |
CNN | (Conv2D → MaxPool2D) → Flatten → Dense(128, ‘relu’) → Dense(2). Conv2D: num filters = [16, 16, 32, 32, 64, 64, 128, 128], filter size = ; MaxPool2D: stride = 2, applied every other layer. |
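To make the shared structure above concrete, the following is a minimal Keras-style sketch (TensorFlow, matching the layer names used in Table 2) of how the common CNN embedder and output MLP wrap around a model-specific relational core. The `relational_module` argument is a placeholder for, e.g., an MD-IPR → RelConv stack, CoRelNet, or PrediNet, whose implementations are not reproduced here; the input image shape and the 3×3 convolutional filter size are assumptions for illustration, not the exact experiment code.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_model(relational_module, num_objects, image_shape=(12, 12, 3), num_classes=2):
    """Compose: CNN embedder -> relational core -> Flatten -> output MLP.

    `relational_module` is any Keras layer mapping (batch, num_objects, embed_dim)
    to a feature tensor; `image_shape` and the 3x3 filter size are assumptions.
    """
    # Common CNN embedder (Table 2), applied to each object image independently.
    cnn_embedder = tf.keras.Sequential([
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPool2D(strides=2),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPool2D(strides=2),
        layers.Flatten(),
    ])

    inputs = layers.Input(shape=(num_objects, *image_shape))
    embeddings = layers.TimeDistributed(cnn_embedder)(inputs)

    # Model-specific relational core (e.g., MD-IPR -> RelConv for RelConvNet).
    rel_features = relational_module(embeddings)

    # Common output MLP (Table 2).
    x = layers.Flatten()(rel_features)
    x = layers.Dense(64, activation="relu")(x)
    logits = layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, logits)


# Shape check with a trivial stand-in for the relational core:
model = build_model(layers.Dense(16), num_objects=9)
```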
Task | Model | Hexos Accuracy | Stripes Accuracy |
---|---|---|---|
same | RelConvNet | | |
 | CoRelNet | | |
 | PrediNet | | |
 | Transformer | | |
 | CNN | | |
between | RelConvNet | | |
 | CoRelNet | | |
 | PrediNet | | |
 | Transformer | | |
 | CNN | | |
occurs | RelConvNet | | |
 | CoRelNet | | |
 | PrediNet | | |
 | Transformer | | |
 | CNN | | |
xoccurs | RelConvNet | | |
 | CoRelNet | | |
 | PrediNet | | |
 | Transformer | | |
 | CNN | | |
match pattern | RelConvNet | | |
 | CoRelNet | | |
 | PrediNet | | |
 | Transformer | | |
 | CNN | | |
A.2 Set (Section 4.2)
We train for 100 epochs using the cross-entropy loss. RelConvNet uses the Adam optimizer; the baselines each use their own individually tuned optimization hyperparameters, described in Appendix B. We use a batch size of 512. For each model and task, we run 5 trials with different random seeds. Table 4 describes the architecture of each model in the “contains set” experiments of Section 4.2. Table 5 reports the generalization accuracies on the hold-out ‘sets’ (i.e., the numbers depicted in Figure 8(b) of the main text). Figure 12 explores the effect of different RelConvNet hyperparameters on the model’s ability to learn the Set task.
Model / Component | Architecture |
---|---|
Common CNN Embedder | Conv2D → MaxPool2D → Conv2D → MaxPool2D → Flatten → Dense(64, ‘relu’) → Dense(64, ‘tanh’). Conv2D: num filters = 32, filter size = , activation = relu. MaxPool2D: stride = 4. |
Common output MLP | Dense(64, ‘relu’) → Dense(32, ‘relu’) → Dense(2). |
RelConvNet | CNN Embedder → MD-IPR → RelConv → Flatten → MLP. MD-IPR: relation dim = 16, projection dim = 16, symmetric. RelConv: num filters = 16, filter size = 3, discrete groups = combinations, symmetric relational inner product with ‘max’ aggregator. |
CoRelNet | CNN Embedder → CoRelNet → Flatten → MLP. Standard CoRelNet has no hyperparameters. |
PrediNet | CNN Embedder → PrediNet → Flatten → MLP. PrediNet: key dim = 4, number of heads = 4, num relations = 16. |
Transformer | CNN Embedder → TransformerEncoder → AveragePooling → MLP. TransformerEncoder: num layers = 2, num heads = 8, feedforward intermediate size = 128, activation = relu. |
GCN | CNN Embedder → (GCNConv → Dense) → AveragePooling → MLP. GCNConv: channels = 128; Dense: num neurons = 128, activation = relu. |
GAT | CNN Embedder → (GATConv → Dense) → AveragePooling → MLP. GATConv: channels = 128; Dense: num neurons = 128, activation = relu. |
GIN | CNN Embedder → (GINConv → Dense) → AveragePooling → MLP. GINConv: channels = 128; Dense: num neurons = 128, activation = relu. |
CNN | (Conv2D → MaxPool2D) → Flatten → Dense(128, ‘relu’) → Dense(2). Conv2D: num filters = [16, 16, 32, 32, 64, 64, 128, 128, 128, 128], filter size = ; MaxPool2D: stride = [(2,2), NA, (2,2), NA, (2,2), NA, (2,2), (1,2), (1,2), (2,2)]. |
Model | Accuracy |
---|---|
RelConvNet | |
CoRelNet | |
PrediNet | |
Transformer | |
GCN | |
GIN | |
GAT | |
LSTM | |
CNN | |



Appendix B Hyperparameter sweep for baseline models
In order to ensure that we compare RelConvNet against the best achievable performance of each baseline architecture, we carry out an extensive sweep over combinations of architectural and optimization hyperparameters. In particular, as seen in Section B.1, the baseline models severely overfit on the Set task, fitting the training data but failing to generalize to unseen ‘sets’. Hence, we explore whether overfitting can be avoided or alleviated through an appropriate choice of hyperparameters.
In Figure 13, we vary the number of layers in the baseline models to select an optimal configuration of each architecture. We find that increased depth beyond 2 layers is generally detrimental on this task. Based on these results, we choose the optimal number of layers as 2 for the Transformer, GCN, GIN baselines and 1 for the GAT baseline.
In Figure 14, we vary the level of weight decay. As expected, larger weight decay results in decreased training accuracy. Generally, weight decay has a small effect on validation performance (e.g., no discernible effect for CoRelNet or the CNN), though for some models certain choices of weight decay improve validation performance. Based on these results, we use a weight decay of 0 for CoRelNet/CNN, 0.032 for Transformer/GAT/GIN, and 1.024 for PrediNet/GCN/LSTM.
In Figure 15, we explore the effect of the learning rate schedule, comparing a cosine decay schedule against our default constant learning rate. For most models, there is no significant difference, with a constant learning rate sometimes slightly better. On the GAT model, however, the cosine learning rate schedule results in significantly improved performance. Based on these results, we use a cosine learning rate schedule for GAT and a constant learning rate for all other models.
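Collecting the choices above, the final per-baseline settings can be summarized as follows. The dictionary below is only a restatement of those choices for reference; the structure and key names are ours, not the original training code.

```python
# Final baseline hyperparameters selected by the sweep (Appendix B).
# Key names and structure are illustrative, not the original experiment code.
baseline_hparams = {
    "Transformer": {"num_layers": 2, "weight_decay": 0.032, "lr_schedule": "constant"},
    "GCN":         {"num_layers": 2, "weight_decay": 1.024, "lr_schedule": "constant"},
    "GIN":         {"num_layers": 2, "weight_decay": 0.032, "lr_schedule": "constant"},
    "GAT":         {"num_layers": 1, "weight_decay": 0.032, "lr_schedule": "cosine"},
    "PrediNet":    {"weight_decay": 1.024, "lr_schedule": "constant"},
    "LSTM":        {"weight_decay": 1.024, "lr_schedule": "constant"},
    "CoRelNet":    {"weight_decay": 0.0,   "lr_schedule": "constant"},
    "CNN":         {"weight_decay": 0.0,   "lr_schedule": "constant"},
}
```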



B.1 Results without hyperparameter tuning
Figures 16 and 6 show the results of the Set experiment with a common default optimizer, without individual hyperparameter tuning.


Model | Accuracy |
---|---|
RelConvNet | |
CoRelNet | |
PrediNet | |
Transformer | |
GCN | |
GAT | |
GIN | |
LSTM | |
GRU | |
CNN | |
Appendix C Discussion on use of TCN in evaluating relational architectures
In Section 4.1, the CoRelNet model of [18] was among the baselines we compared against. In that work, the authors also evaluate their model on the relational games benchmark. A difference between their experimental setup and ours is that they use a method called “context normalization” as a preprocessing step on the sequence of objects.
“Context normalization” was proposed by [35]. The proposal is simple: given a sequence of objects $x_1, \ldots, x_n$ and a set of context windows $C_1, \ldots, C_m$ which partition the objects, each object is normalized along each dimension with respect to the other objects in its context. That is, for $i \in C_k$, the normalized object $\bar{x}_i$ is computed as
$$\bar{x}_{i,d} = \gamma_d \, \frac{x_{i,d} - \mu_{k,d}}{\sigma_{k,d}} + \beta_d, \qquad \mu_{k,d} = \frac{1}{|C_k|} \sum_{j \in C_k} x_{j,d}, \qquad \sigma_{k,d}^2 = \frac{1}{|C_k|} \sum_{j \in C_k} \left( x_{j,d} - \mu_{k,d} \right)^2,$$
where $\gamma_d, \beta_d$ are learnable gain and shift parameters for each dimension (initialized at 1 and 0, respectively, as with batch normalization). The context windows represent logical groupings of objects that are assumed to be known. For instance, [37, 18] consider a “relational match-to-sample” task where 3 pairs of objects are presented in sequence, and the task is to identify whether the relation in the first pair is the same as the relation in the second pair or the third pair. Here, the context windows would be the pairs of objects. In the relational games “match rows pattern” task, the context windows would be each row.
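For reference, a minimal NumPy sketch of this preprocessing step is given below. It assumes per-dimension statistics within each context window and a small ε for numerical stability; it is not necessarily identical in detail to the implementation used in [35].

```python
import numpy as np

def context_normalize(objects, context_windows, gamma=None, beta=None, eps=1e-8):
    """Normalize each object per dimension w.r.t. the other objects in its context window.

    objects:         (n, d) array of object embeddings.
    context_windows: list of index lists partitioning range(n).
    gamma, beta:     (d,) gain and shift (default 1 and 0, as at initialization).
    """
    n, d = objects.shape
    gamma = np.ones(d) if gamma is None else gamma
    beta = np.zeros(d) if beta is None else beta
    out = np.empty_like(objects, dtype=float)
    for window in context_windows:
        x = objects[window]                  # objects in this context window
        mu = x.mean(axis=0)                  # per-dimension mean over the window
        sigma = x.std(axis=0)                # per-dimension std over the window
        out[window] = gamma * (x - mu) / (sigma + eps) + beta
    return out

# Example: a "match rows pattern"-style input where each row is a context window.
objs = np.arange(12, dtype=float).reshape(6, 2)
print(context_normalize(objs, context_windows=[[0, 1, 2], [3, 4, 5]]))
```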
It is reported in [37, 18] that context normalization significantly accelerates learning and improves out-of-distribution generalization. Since [37, 18] use context normalization in their experiments, in this section we aim to explain our choice to exclude it. We argue that context normalization is a confounder and that an evaluation of relational architectures without such preprocessing is more informative.
To understand how context normalization works, consider first a context window of size 2, and let $x_1, x_2$ be the two objects it contains. Then, along each dimension $d$, we have
$$\bar{x}_{1,d} = \gamma_d \operatorname{sign}(x_{1,d} - x_{2,d}) + \beta_d, \qquad \bar{x}_{2,d} = -\gamma_d \operatorname{sign}(x_{1,d} - x_{2,d}) + \beta_d \quad \text{if } x_{1,d} \neq x_{2,d},$$
and $\bar{x}_{1,d} = \bar{x}_{2,d} = \beta_d$ if $x_{1,d} = x_{2,d}$.
In particular, with two objects, context normalization outputs (along each dimension, at the initialization $\gamma_d = 1$, $\beta_d = 0$) the value $0$ if the two entries are the same and $\pm 1$ if they are different, with the sign encoding which is larger. Hence, it makes the context-normalized output independent of the original feature representation. For tasks like relational games, where the key relation to model is same/different, this preprocessing directly encodes that information in a “symbolic” way. In particular, for two objects $x_1, x_2$, context-normalized to produce $\bar{x}_1, \bar{x}_2$, we have that $\bar{x}_1 = \bar{x}_2$ if and only if $x_1 = x_2$. This makes out-of-distribution generalization trivial, and does not properly test a relational architecture’s ability to model the same/different relation.
Similarly, consider a context window of size 3 containing objects $x_1, x_2, x_3$. Then, along each dimension $d$, at the initialization $\gamma_d = 1$, $\beta_d = 0$, we have
$$(\bar{x}_{1,d},\, \bar{x}_{2,d},\, \bar{x}_{3,d}) = \begin{cases} (0,\, 0,\, 0) & \text{if } x_{1,d} = x_{2,d} = x_{3,d}, \\[4pt] \left( \tfrac{s}{\sqrt{2}},\, \tfrac{s}{\sqrt{2}},\, -\sqrt{2}\, s \right), \;\; s = \operatorname{sign}(x_{1,d} - x_{3,d}), & \text{if } x_{1,d} = x_{2,d} \neq x_{3,d}, \end{cases}$$
and analogously for the other “two equal, one different” patterns. Again, context normalization symbolically encodes the relational pattern: for any triplet of objects, regardless of the values they take, context normalization produces identical output in the cases above (up to the sign encoding which value is larger). With context windows larger than 3, the behavior becomes more complex.
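A quick numerical check illustrates this: at initialization ($\gamma = 1$, $\beta = 0$), triplets with the same same/different pattern map to the same normalized values regardless of the underlying feature values.

```python
import numpy as np

def cn(x, eps=1e-8):
    # Per-dimension z-scoring within a single context window (gamma=1, beta=0).
    return (x - x.mean()) / (x.std() + eps)

# Two different "AAB"-patterned triplets yield identical normalized output.
print(np.round(cn(np.array([5.0, 5.0, 2.0])), 3))    # [ 0.707  0.707 -1.414]
print(np.round(cn(np.array([10.0, 10.0, 3.0])), 3))  # [ 0.707  0.707 -1.414]
# An "AAA" triplet maps to all zeros, whatever the value of A.
print(np.round(cn(np.array([7.0, 7.0, 7.0])), 3))    # [0. 0. 0.]
```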
These properties of context normalization make it a confounder in the evaluation of relational architectures: for small context windows in particular, it symbolically encodes the relevant relational information as a preprocessing step. Experiments on relational architectures should instead evaluate the architectures’ ability to learn those relations from data. Hence, we do not use context normalization in our experiments.
Appendix D Higher-order relational tasks
As noted in the discussion, the tasks considered in this paper are solvable by modeling at most second-order relations. One of the main innovations of the relational convolutions architecture over existing relational architectures is its compositionality and its ability to model higher-order relations. An important direction for future research is to test the architecture’s ability to model hierarchical relations of increasingly higher order. Constructing such benchmarks is non-trivial and requires careful design; this was outside the scope of this paper, but we provide an initial discussion here that may be useful for future work.
Propositional logic. Consider evaluating a Boolean formula such as
$$\big( (a \lor b) \land (c \lor d) \big) \lor (e \land f).$$
Evaluating this logical expression (in this form) requires iteratively grouping objects and computing the relations between them. For instance, we begin by computing the relation within $(a \lor b)$ and the relation within $(c \lor d)$, then we compute the relation between the groups $(a \lor b)$ and $(c \lor d)$, and so on up the formula. For a task which involves logical reasoning of this hierarchical form, one might imagine the group attention in RelConvNet learning the relevant groups and the relational convolution operation computing the relations within each group. Taking inspiration from logical reasoning with such hierarchical structure may lead to interesting benchmarks of higher-order relational representation.
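As a minimal illustration (using a hypothetical tree representation of formulas, not any particular benchmark format), the bottom-up grouping described above corresponds to a simple recursive evaluation:

```python
from typing import Tuple, Union

# A formula is either a Boolean literal or ("and"/"or", left subformula, right subformula).
Formula = Union[bool, Tuple[str, "Formula", "Formula"]]

def evaluate(formula: Formula) -> bool:
    """Evaluate a nested Boolean formula bottom-up: resolve each inner group first,
    then combine the groups' values with the connective relating them."""
    if isinstance(formula, bool):
        return formula
    op, left, right = formula
    l, r = evaluate(left), evaluate(right)          # relations within each group
    return (l and r) if op == "and" else (l or r)   # relation between the groups

# ((a or b) and (c or d)) or (e and f), with a..f = True, False, False, True, False, False
formula = ("or",
           ("and", ("or", True, False), ("or", False, True)),
           ("and", False, False))
print(evaluate(formula))  # True
```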
Sequence modeling. In sequence modeling (e.g., language modeling), modeling the relations between objects is usually essential. For example, syntactic and semantic relations between words are crucial to parsing language. Higher-order relations are also important, capturing syntactic and semantic relational features across different locations in the text and across multiple length scales and layers of hierarchy (see, for example, relevant work in linguistics [11, 27]). The attention matrix in Transformers can be thought of as implicitly representing relations between tokens, and it is possible that composing Transformer layers also learns hierarchical relations. However, as shown in this work and previous work on relational representation, Transformers have limited efficiency in representing relations. Thus, incorporating relational convolutions into Transformer-based sequence models may yield meaningful improvements in the relational aspects of sequence modeling. One way to do this is by cross-attending to the sequence of relational objects produced by relational convolutions, each of which summarizes the relations within a group of objects at some level of hierarchy.
Set embedding. The objective of set embedding is to map a collection of objects to a Euclidean vector that represents the important features of the objects in the set [39]. Depending on what the set embedding will be used for, it may need to represent a combination of object-level features and relational information, perhaps including relations of higher order. A set embedder that incorporates relational convolutions may be able to generate representations that summarize relations between objects at multiple layers of hierarchy.
Visual scene understanding. In a visual scene, there are typically several objects with spatial, visual, and semantic relations between them which are crucial for parsing the scene. The CLEVR benchmark on visual scene understanding [17] was used in early work on relational representation [30]. In more complex situations, the objects in the scene may fall into natural groupings, and the spatial, visual, and semantic relations between those groups may be important for parsing a scene (e.g., objects forming larger components with functional dependence determined by the relations between them). Integrating relational convolutions into a visual scene understanding system may enable reasoning about such higher-order relations.
Appendix E Geometry of representations learned by MD-IPR and Relational Convolutions
In this section, we explore and visualize the representations learned by the MD-IPR and RelConv layers. In particular, we visualize the representations produced by the RelConvNet model trained on the Set task described in Section 4.2. Recall that the MD-IPR layer learns a pair of encoders for each of the 16 relation dimensions. In this model the two encoders in each pair are tied (so that learned relations are symmetric), and each encoder is a linear transformation to a 16-dimensional projection space. The representations learned by a selection of 6 encoders are visualized in Figure 17. For each of the 81 possible Set cards, we apply each encoder in the MD-IPR layer, reduce to 2 dimensions via PCA, and visualize how each encoder separates the 4 attributes: number, color, fill, and shape. Observe, for example, that “Encoder 0” disentangles color and shape, “Encoder 2” disentangles fill, and “Encoder 3” disentangles number.
Next, we explore the geometry of the learned relation vectors, that is, the 16-dimensional vector of inner products produced by the MD-IPR layer for each pair of objects. For each pair of Set cards, we compute the 16-dimensional relation vector learned by the MD-IPR layer, reduce to 2 dimensions via PCA, and visualize how the learned relation disentangles the latent same/different relations among the four attributes. This is shown in Figure 18. We see some separation of the underlying same/different relations among the four attributes, even with only two dimensions out of 16.
Finally, we visualize the representations learned by the relational convolution layer. Recall that this layer learns a set of graphlet filters which form templates of relational patterns against which groups of objects are compared. In our experiments, the filter size is 3 and the number of filters is 16. Hence, for each group of 3 Set cards, the relational convolution layer produces a 16-dimensional vector summarizing the relational structure of the group. Of the $\binom{81}{3}$ possible triplets of Set cards, we create a balanced sample of “sets” and “non-sets”. We then compute the relational convolution output for each sampled triplet and reduce it to 2 dimensions via PCA. Figure 9 shows that the representations learned by the relational convolution layer clearly separate triplets of cards that form a set from those that do not.
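To make the computations described in this appendix concrete, the following is a simplified NumPy sketch of the two operations as configured for the Set model (symmetric relations, relation dimension 16, projection dimension 16, filter size 3, 16 filters). It omits details of the actual layers (e.g., the ‘max’ relational inner-product aggregator and any nonlinearities or normalization), so it should be read as an illustration of the tensor shapes rather than the exact implementation.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

n, d = 6, 64            # number of objects and object-embedding dimension (illustrative)
d_rel, d_proj = 16, 16  # relation dimension and encoder projection dimension
k, n_filters = 3, 16    # graphlet filter size and number of filters

objects = rng.normal(size=(n, d))               # object embeddings (e.g., CNN-embedder outputs)
encoders = rng.normal(size=(d_rel, d_proj, d))  # one linear encoder per relation dimension (tied/symmetric)

# MD-IPR (simplified, symmetric): the r-th entry of the relation vector for the
# pair (i, j) is the inner product of the r-th encodings of objects i and j.
proj = np.einsum("rpd,nd->nrp", encoders, objects)   # (n, d_rel, d_proj)
relations = np.einsum("irp,jrp->ijr", proj, proj)    # (n, n, d_rel) relation tensor

# Relational convolution (simplified): compare each size-k group's relation
# sub-tensor against each learnable graphlet filter via an inner product.
filters = rng.normal(size=(n_filters, k, k, d_rel))  # graphlet filters (templates of relational patterns)
groups = list(combinations(range(n), k))             # all size-k groups of objects
group_features = np.empty((len(groups), n_filters))
for g_idx, group in enumerate(groups):
    sub = relations[np.ix_(group, group)]                          # (k, k, d_rel) relations within the group
    group_features[g_idx] = np.einsum("fijr,ijr->f", filters, sub)

print(group_features.shape)  # (20, 16): one 16-dimensional relational-pattern vector per group of 3
```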


