Expressiveness and Approximation Properties of Graph Neural Networks
Abstract
Characterizing the separation power of graph neural networks (GNNs) provides an understanding of their limitations for graph learning tasks. Results regarding separation power are, however, usually geared at specific architectures, and tools for understanding arbitrary architectures are generally lacking. We provide an elegant way to easily obtain bounds on the separation power of GNNs in terms of the Weisfeiler-Leman (WL) tests, which have become the yardstick to measure the separation power of GNNs. The crux is to view GNNs as expressions in a procedural tensor language describing the computations in the layers of the GNNs. Then, by a simple analysis of the obtained expressions, in terms of the number of indices and the nesting depth of summations, bounds on the separation power in terms of the WL-tests readily follow. We use our tensor language to define Higher-Order Message-Passing Neural Networks (or k-MPNNs), a natural extension of MPNNs. Furthermore, the tensor language point of view allows for the derivation of universality results for classes of GNNs in a natural way. Our approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WL-tests. We also provide insights into what is needed to boost the separation power of GNNs.
1 Introduction
Graph Neural Networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) cover many popular deep learning methods for graph learning tasks (see Hamilton (2020) for a recent overview). These methods typically compute vector embeddings of vertices or graphs by relying on the underlying adjacency information. Invariance (for graph embeddings) and equivariance (for vertex embeddings) of GNNs ensure that these methods are oblivious to the precise representation of the graphs.
Separation power.
Our primary focus is on the separation power of GNN architectures, i.e., on their ability to separate vertices or graphs by means of the computed embeddings. It has become standard to characterize GNN architectures in terms of the separation power of graph algorithms such as color refinement (CR) and the k-dimensional Weisfeiler-Leman tests (k-WL), as initiated in Xu et al. (2019) and Morris et al. (2019). Unfortunately, understanding the separation power of any given architecture requires complex proofs, geared at the specifics of the architecture. We provide a tensor language-based technique to analyze the separation power of general GNNs.
Tensor languages.
Matrix query languages (Brijder et al., 2019; Geerts et al., 2021b) were defined to assess the expressive power of linear algebra. Balcilar et al. (2021a) observe that, by casting various GNNs into the MATLANG matrix query language (Brijder et al., 2019), one can use existing separation results (Geerts, 2021) to obtain upper bounds on the separation power of GNNs in terms of CR and 2-WL. In this paper, we considerably extend this approach by defining, and studying, a new general-purpose tensor language specifically designed for modeling GNNs. As in Balcilar et al. (2021a), our focus on tensor languages allows us to obtain new insights about GNN architectures. First, since tensor languages can only define invariant and equivariant graph functions, any GNN that can be cast in our tensor language inherits these desired properties. More importantly, the separation power of our tensor language is closely tied to CR and k-WL. Loosely speaking, if tensor language expressions use k+1 indices, then their separation power is bounded by k-WL. Furthermore, if the maximum nesting depth of summations in the expressions is t, then t rounds of k-WL suffice to obtain an upper bound on the separation power. A similar connection is obtained for CR and a fragment of the tensor language that we call the “guarded” tensor language.
We thus reduce the problem of assessing the separation power of any specific GNN architecture to the problem of specifying it in our tensor language, analyzing the number of indices used, and counting the summation depth. This is usually much easier than dealing with the intricacies of CR and k-WL, as casting GNNs in our tensor language is often as simple as writing down their layer-based definition. We believe that this provides a convenient toolbox for GNN designers to assess the separation power of their architecture. We use this toolbox to recover known results about the separation power of specific architectures such as GIN (Xu et al., 2019), GCN (Kipf & Welling, 2017), Folklore GNNs (Maron et al., 2019b), k-GNNs (Morris et al., 2019), and several others. We also derive new results: we answer an open problem posed by Maron et al. (2019a) by showing that the separation power of k-th order Invariant Graph Networks (k-IGNs), introduced by Maron et al. (2019b), is bounded by (k−1)-WL. In addition, we revisit the analysis by Balcilar et al. (2021b) of ChebNet (Defferrard et al., 2016), and show that CayleyNet (Levie et al., 2019) is bounded by 2-WL.
When writing down GNNs in our tensor language, the fewer indices needed, the stronger the bounds in terms of k-WL we obtain. After all, k-WL is known to be strictly less separating than (k+1)-WL (Otto, 2017). Thus, it is important to minimize the number of indices used in tensor language expressions. We connect this number to the notion of treewidth: expressions of treewidth k can be translated into expressions using only k+1 indices. This corresponds to optimizing expressions, as done in many areas of machine learning, by reordering the summations (a.k.a. variable elimination).
Approximation and universality.
We also consider the ability of GNNs to approximate general invariant or equivariant graph functions. Once more, instead of focusing on specific architectures, we use our tensor languages to obtain general approximation results, which naturally translate to universality results for GNNs. We show: (k+1)-index tensor language expressions suffice to approximate any (invariant/equivariant) graph function whose separation power is bounded by k-WL, and we can further refine this by comparing the number of rounds in k-WL with the summation depth of the expressions. These results provide a finer picture than the one obtained by Azizian & Lelarge (2021). Furthermore, focusing on “guarded” tensor expressions yields a similar universality result for CR, a result that, to our knowledge, was not known before. We also provide the link between approximation results for tensor expressions and GNNs, enabling us to transfer our insights into universality properties of GNNs. As an example, we show that k-IGNs can approximate any graph function that is less separating than (k−1)-WL. This case was left open in Azizian & Lelarge (2021).
In summary, we draw new and interesting connections between tensor languages, GNN architectures, and classic graph algorithms. We provide a general recipe to bound the separation power of GNNs, optimize them, and understand their approximation power. We show the usefulness of our method by recovering several recent results, as well as deriving new ones, some of which were left open in previous work.
Related work.
Separation power has been studied for specific classes of GNNs (Morris et al., 2019; Xu et al., 2019; Maron et al., 2019b; Chen et al., 2019; Morris et al., 2020; Azizian & Lelarge, 2021). A first general result concerns the bounds in terms of CR and 1-WL on Message-Passing Neural Networks (MPNNs) (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019). Balcilar et al. (2021a) use the MATLANG matrix query language to obtain upper bounds on the separation power of various GNNs. MATLANG can only be used to obtain bounds up to 2-WL and is limited to matrices. Our tensor language is more general and flexible and allows for reasoning over the number of indices, treewidth, and summation depth of expressions. These are all crucial for our main results. The tensor language introduced here resembles sum-MATLANG (Geerts et al., 2021b), but with the added ability to represent tensors. Neither separation power nor guarded fragments were considered in Geerts et al. (2021b). See Section A in the supplementary material for more details. For universality, Azizian & Lelarge (2021) is closest in spirit. Our approach provides an elegant way to recover and extend their results. Azizian & Lelarge (2021) describe how their work (and hence also ours) encompasses previous works (Keriven & Peyré, 2019; Maron et al., 2019c; Chen et al., 2019). Our results use connections between k-WL and counting logics (Immerman & Lander, 1990; Cai et al., 1992), and between CR and guarded counting logics (Barceló et al., 2020). The optimization of algebraic computations and the use of treewidth relate to the approaches by Aji & McEliece (2000) and Abo Khamis et al. (2016).
2 Background
We denote sets by $\{\cdot\}$ and multisets by $\{\!\{\cdot\}\!\}$. For $n \in \mathbb{N}$, we let $[n] := \{1,\ldots,n\}$. Vectors are denoted by boldface lowercase letters such as $\mathbf{v}$, matrices by boldface uppercase letters such as $\mathbf{M}$, and tensors by boldface uppercase letters such as $\mathbf{T}$. Furthermore, $\mathbf{v}_i$ is the $i$-th entry of vector $\mathbf{v}$, $\mathbf{M}_{ij}$ is the $(i,j)$-th entry of matrix $\mathbf{M}$, and $\mathbf{T}_{i_1\cdots i_k}$ denotes the $(i_1,\ldots,i_k)$-th entry of a tensor $\mathbf{T}$. If certain dimensions are unspecified, then this is denoted by a “$\bullet$”. For example, $\mathbf{M}_{i\bullet}$ and $\mathbf{M}_{\bullet j}$ denote the $i$-th row and $j$-th column of matrix $\mathbf{M}$, respectively. Similarly for slices of tensors.
We consider undirected simple graphs $G = (V,E)$ equipped with a vertex-labelling $\chi : V \to \mathbb{R}^{\ell}$. We assume that graphs have size $n$, so $V$ consists of $n$ vertices, and we often identify $V$ with $[n]$. For a vertex $v$, $N_G(v)$ denotes its set of neighbors in $G$. We let $\mathcal{G}_n$ be the set of all graphs of size $n$ and let $\mathcal{G}_{n,k}$ be the set of pairs $(G,\bar v)$ with $G \in \mathcal{G}_n$ and $\bar v \in V^k$. Note that $\mathcal{G}_{n,0}$ can be identified with $\mathcal{G}_n$.
The color refinement algorithm (CR) (Morgan, 1965) iteratively computes vertex labellings based on neighboring vertices, as follows. For a graph $G$ and vertex $v$, $\mathsf{cr}^{(0)}(v) := \chi(v)$. Then, for $t > 0$, $\mathsf{cr}^{(t)}(v) := \big(\mathsf{cr}^{(t-1)}(v), \{\!\{\mathsf{cr}^{(t-1)}(u) \mid u \in N_G(v)\}\!\}\big)$. We collect all vertex labels to obtain a label for the entire graph by defining $\mathsf{cr}^{(t)}(G) := \{\!\{\mathsf{cr}^{(t)}(v) \mid v \in V\}\!\}$. The $k$-dimensional Weisfeiler-Leman algorithm ($k$-WL) (Cai et al., 1992) iteratively computes labellings of $k$-tuples of vertices. For a $k$-tuple $\bar v = (v_1,\ldots,v_k)$, its atomic type in $G$, denoted by $\mathsf{atp}(G,\bar v)$, is a vector encoding the following information. The first $k^2$ entries are $0/1$-values encoding the equality type of $\bar v$, i.e., whether $v_i = v_j$ for $i,j \in [k]$. The second $k^2$ entries are $0/1$-values encoding adjacency information, i.e., whether $\{v_i,v_j\} \in E$ for $i,j \in [k]$. The last $k\ell$ real-valued entries correspond to $\chi(v_i)$ for $i \in [k]$. Initially, for a graph $G$ and $\bar v \in V^k$, $k$-WL assigns the label $\mathsf{wl}_k^{(0)}(\bar v) := \mathsf{atp}(G,\bar v)$. For $t > 0$, $k$-WL revises the label according to $\mathsf{wl}_k^{(t)}(\bar v) := \big(\mathsf{wl}_k^{(t-1)}(\bar v), \{\!\{(\mathsf{wl}_k^{(t-1)}(\bar v[1/u]),\ldots,\mathsf{wl}_k^{(t-1)}(\bar v[k/u])) \mid u \in V\}\!\}\big)$, where $\bar v[i/u]$ denotes $\bar v$ with its $i$-th entry replaced by $u$. We use $k$-WL to assign labels to vertices and graphs by defining: $\mathsf{wl}_k^{(t)}(v) := \mathsf{wl}_k^{(t)}(v,\ldots,v)$, for vertex-labellings, and $\mathsf{wl}_k^{(t)}(G) := \{\!\{\mathsf{wl}_k^{(t)}(\bar v) \mid \bar v \in V^k\}\!\}$, for graph-labellings. We use $\mathsf{cr}$, $\mathsf{wl}_k$, $\mathsf{cr}(G)$, and $\mathsf{wl}_k(G)$ to denote the stable labellings produced by the corresponding algorithms over an arbitrary number of rounds. Our version of $1$-WL differs from CR in that $1$-WL also uses information from non-adjacent vertices; this distinction only matters for vertex embeddings (Grohe, 2021). We use the “folklore” $k$-WL of Cai et al. (1992), although the indexing in Cai et al. is shifted relative to ours. While equivalent to the “oblivious” $(k+1)$-WL (Grohe, 2021), used in some other works on GNNs, care is needed when comparing those works to ours.
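To make the above definitions concrete, the following minimal Python sketch (our own illustration, not part of the paper) runs t rounds of color refinement on an adjacency-list graph; the k-WL test is analogous but iterates over k-tuples instead of vertices.

```python
from collections import Counter

def color_refinement(adj, labels, rounds):
    """adj: dict vertex -> list of neighbors; labels: dict vertex -> initial label."""
    colors = {v: labels[v] for v in adj}
    for _ in range(rounds):
        # New color = (old color, multiset of neighbor colors), hashed to keep labels small.
        colors = {
            v: hash((colors[v], frozenset(Counter(colors[u] for u in adj[v]).items())))
            for v in adj
        }
    return colors

def graph_label(adj, labels, rounds):
    # Graph label = multiset of vertex colors.
    return Counter(color_refinement(adj, labels, rounds).values())

# Example: a triangle and a path on three vertices are separated after one round.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
init = {v: 0 for v in range(3)}
print(graph_label(triangle, init, 1) == graph_label(path, init, 1))  # False
```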
Let $G$ be a graph with vertex set $V = [n]$ and let $\pi$ be a permutation of $[n]$. We denote by $\pi(G)$ the isomorphic copy of $G$ obtained by applying the permutation $\pi$. Similarly, for $\bar v \in V^k$, $\pi(\bar v)$ is the permuted version of $\bar v$. Let $\mathcal{Y}$ be some feature space. A function $\xi : \mathcal{G}_n \to \mathcal{Y}$ is called invariant if $\xi(\pi(G)) = \xi(G)$ for any permutation $\pi$. More generally, $\xi : \mathcal{G}_{n,k} \to \mathcal{Y}$ is equivariant if $\xi(\pi(G),\pi(\bar v)) = \xi(G,\bar v)$ for any permutation $\pi$. The functions $\mathsf{cr}^{(t)}$ and $\mathsf{wl}_k^{(t)}$ on vertices and tuples are equivariant, whereas $\mathsf{cr}^{(t)}(G)$ and $\mathsf{wl}_k^{(t)}(G)$ are invariant, for any $t$ and $k$.
3 Specifying GNNs
Many GNNs use linear algebra computations on vectors, matrices or tensors, interleaved with the application of activation functions or MLPs. To understand the separation power of GNNs, we introduce a specification language, TL, short for tensor language, that allows us to specify any algebraic computation in a procedural way by explicitly stating how each entry is to be computed. We gauge the separation power of GNNs by specifying them as TL expressions, and syntactically analyzing the components of such expressions. This technique gives rise to Higher-Order Message-Passing Neural Networks (or k-MPNNs), a natural extension of MPNNs (Gilmer et al., 2017). For simplicity, we present TL using summation aggregation only, but arbitrary aggregation functions on multisets of real values can be used as well (Section C.5 in the supplementary material).
To introduce TL, consider a typical GNN layer of the form $\mathbf{F}' := \sigma(\mathbf{A}\mathbf{F}\mathbf{W})$, where $\mathbf{A}$ is an adjacency matrix, $\mathbf{F} \in \mathbb{R}^{n\times\ell}$ are vertex features such that $\mathbf{F}_{v\bullet}$ is the feature vector of vertex $v$, $\sigma$ is a non-linear activation function, and $\mathbf{W} \in \mathbb{R}^{\ell\times\ell'}$ is a weight matrix. By exposing the indices in the matrices and vectors we can equivalently write, for $v \in [n]$ and $j \in [\ell']$:
$\mathbf{F}'_{vj} = \sigma\Big(\textstyle\sum_{u=1}^{n}\sum_{i=1}^{\ell} \mathbf{A}_{vu}\,\mathbf{F}_{ui}\,\mathbf{W}_{ij}\Big).$
In TL, we do not work with specific matrices or indices ranging over $[n]$, but focus instead on expressions applicable to any matrix. We use index variables $x$ and $y$ instead of $v$ and $u$, replace $\mathbf{A}$ with a placeholder $E$ and the feature columns $\mathbf{F}_{\bullet i}$ with placeholders $P_i$, for $i \in [\ell]$. We then represent the above computation in TL by expressions $\varphi_j(x)$, one for each feature column $j \in [\ell']$, as follows:
$\varphi_j(x) := \sigma\Big(\textstyle\sum_{y}\, E(x,y)\cdot\big(\mathbf{W}_{1j}\cdot P_1(y) + \cdots + \mathbf{W}_{\ell j}\cdot P_\ell(y)\big)\Big).$
These are purely syntactical expressions. To give them a semantics, we assign to $E$ a matrix $\mathbf{A} \in \mathbb{R}^{n\times n}$, to $P_1,\ldots,P_\ell$ the column vectors $\mathbf{F}_{\bullet 1},\ldots,\mathbf{F}_{\bullet \ell}$, and to $x$ an index $v \in [n]$. By letting the variable $y$ under the summation range over $[n]$, the expression $\varphi_j(x)$ evaluates to $\mathbf{F}'_{vj}$. As such, the GNN layer can be represented as a specific instance of the above expressions. Throughout the paper we reason about expressions in TL rather than specific instances thereof. Importantly, by showing that certain properties hold for expressions in TL, these properties are inherited by all of their instances. We use TL to enable a theoretical analysis of the separating power of GNNs; it is not intended as a practical programming language for GNNs.
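As a sanity check on the entry-wise rewriting above, the following NumPy sketch (our own illustration; the variable names are assumptions) confirms that evaluating the summation formula entry by entry reproduces the matrix form $\sigma(\mathbf{A}\mathbf{F}\mathbf{W})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, l_out = 4, 3, 2
A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T  # symmetric adjacency
F = rng.standard_normal((n, l))       # vertex features
W = rng.standard_normal((l, l_out))   # weight matrix
relu = lambda x: np.maximum(x, 0)

# Matrix form of the layer.
F_matrix = relu(A @ F @ W)

# Entry-wise form: F'[v, j] = relu(sum_u sum_i A[v, u] * F[u, i] * W[i, j]).
F_entry = np.zeros((n, l_out))
for v in range(n):
    for j in range(l_out):
        F_entry[v, j] = relu(sum(A[v, u] * F[u, i] * W[i, j]
                                 for u in range(n) for i in range(l)))

print(np.allclose(F_matrix, F_entry))  # True
```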
Syntax.
We first give the syntax of TL expressions. We have a binary predicate $E(x,y)$, to represent adjacency matrices, and unary vertex predicates $P_1,\ldots,P_\ell$, to represent column vectors encoding the $\ell$-dimensional vertex labels. In addition, we have a (possibly infinite) set $\Omega$ of functions, such as activation functions or MLPs. Then, TL expressions are defined by the following grammar:
$\varphi ::= E(x,y) \mid P_i(x) \mid x = y \mid x \neq y \mid a \mid \varphi + \varphi \mid \varphi \cdot \varphi \mid f(\varphi,\ldots,\varphi) \mid \textstyle\sum_{x}\varphi,$
where $x$, $y$ are index variables that specify entries in tensors, $i \in [\ell]$, $a \in \mathbb{R}$, and $f \in \Omega$. Summation aggregation is captured by $\sum_x$. (We can replace $\sum_x$ by a more general aggregation construct for arbitrary functions that assign a real value to multisets of real values. We refer to the supplementary material (Section C.5) for details.) We sometimes make explicit which functions are used in expressions in TL by writing TL(Ω) for expressions using functions in Ω. For example, the expressions $\varphi_j(x)$ described earlier are in TL({σ}).
The set of free index variables of an expression $\varphi$, denoted by $\mathrm{free}(\varphi)$, determines the order of the tensor represented by $\varphi$. It is defined inductively: $\mathrm{free}(E(x,y)) := \{x,y\}$, $\mathrm{free}(P_i(x)) := \{x\}$, $\mathrm{free}(x = y) = \mathrm{free}(x \neq y) := \{x,y\}$, $\mathrm{free}(a) := \emptyset$, $\mathrm{free}(\varphi_1 + \varphi_2) = \mathrm{free}(\varphi_1 \cdot \varphi_2) := \mathrm{free}(\varphi_1) \cup \mathrm{free}(\varphi_2)$, $\mathrm{free}(f(\varphi_1,\ldots,\varphi_p)) := \mathrm{free}(\varphi_1) \cup \cdots \cup \mathrm{free}(\varphi_p)$, and $\mathrm{free}(\sum_x \varphi) := \mathrm{free}(\varphi)\setminus\{x\}$. We sometimes explicitly write the free indices, as in $\varphi(x_1,\ldots,x_k)$. In our example expressions $\varphi_j(x)$, $x$ is the only free index variable.
An important class of TL expressions are those that only use the index variables $x_1,\ldots,x_k$. We denote by TL_k the $k$-index variable fragment of TL. The expressions $\varphi_j(x)$ are in TL_2.
Semantics.
We next define the semantics of TL expressions. Let $G$ be a vertex-labelled graph. We start by defining the interpretation of the predicates $E$, $P_i$ and the (dis)equality predicates, relative to $G$ and a valuation $\nu$ assigning a vertex to each index variable:
$[\![E(x,y)]\!](G,\nu) := \mathbf{A}_{\nu(x)\nu(y)},\quad [\![P_i(x)]\!](G,\nu) := \chi(\nu(x))_i,\quad [\![x = y]\!](G,\nu) := \mathbb{1}[\nu(x) = \nu(y)],\quad [\![x \neq y]\!](G,\nu) := \mathbb{1}[\nu(x) \neq \nu(y)].$
In other words, $E$ is interpreted as the adjacency matrix $\mathbf{A}$ of $G$ and the $P_i$'s interpret the vertex-labelling $\chi$. Furthermore, we lift interpretations to arbitrary expressions in TL, as follows:
$[\![a]\!](G,\nu) := a,\quad [\![\varphi_1 + \varphi_2]\!](G,\nu) := [\![\varphi_1]\!](G,\nu) + [\![\varphi_2]\!](G,\nu),\quad [\![\varphi_1 \cdot \varphi_2]\!](G,\nu) := [\![\varphi_1]\!](G,\nu)\cdot[\![\varphi_2]\!](G,\nu),$
$[\![f(\varphi_1,\ldots,\varphi_p)]\!](G,\nu) := f\big([\![\varphi_1]\!](G,\nu),\ldots,[\![\varphi_p]\!](G,\nu)\big),\quad\text{and}\quad [\![\textstyle\sum_x\varphi]\!](G,\nu) := \textstyle\sum_{v\in V} [\![\varphi]\!](G,\nu[x\mapsto v]),$
where $\nu[x\mapsto v]$ is the valuation $\nu$ but which now maps the index $x$ to the vertex $v$. For simplicity, we identify valuations with their images. For example, $[\![\varphi]\!](G,v)$ denotes $[\![\varphi]\!](G,\nu)$ for the valuation $\nu$ mapping the unique free index variable of $\varphi$ to $v$. To illustrate the semantics, for each $j \in [\ell']$, our example expressions satisfy $[\![\varphi_j]\!](G,v) = \mathbf{F}'_{vj}$ when $\mathbf{A}$ is the adjacency matrix of $G$ and $P_1,\ldots,P_\ell$ represent the vertex labels.
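The semantics above is just a recursive evaluation over the structure of the expression; the following small Python interpreter (our own illustrative sketch, with an assumed AST encoding using nested tuples) mirrors it directly.

```python
import numpy as np

def evaluate(expr, A, X, val):
    """Evaluate a TL expression over adjacency matrix A, vertex labels X (n x l),
    and a valuation val mapping index-variable names to vertices.
    Expressions are nested tuples, e.g. ('sum', 'y', ('mul', ('E', 'x', 'y'), ('P', 0, 'y')))."""
    op = expr[0]
    if op == 'E':        # E(x, y) -> adjacency entry
        return A[val[expr[1]], val[expr[2]]]
    if op == 'P':        # P_i(x) -> i-th label of vertex x
        return X[val[expr[2]], expr[1]]
    if op == 'const':
        return expr[1]
    if op == 'add':
        return evaluate(expr[1], A, X, val) + evaluate(expr[2], A, X, val)
    if op == 'mul':
        return evaluate(expr[1], A, X, val) * evaluate(expr[2], A, X, val)
    if op == 'fun':      # f(phi) for a unary function f
        return expr[1](evaluate(expr[2], A, X, val))
    if op == 'sum':      # sum over all vertices for the index variable expr[1]
        return sum(evaluate(expr[2], A, X, {**val, expr[1]: v}) for v in range(A.shape[0]))
    raise ValueError(op)

# phi(x) = sum_y E(x, y) * P_0(y): weighted degree w.r.t. the first label column.
phi = ('sum', 'y', ('mul', ('E', 'x', 'y'), ('P', 0, 'y')))
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.ones((3, 1))
print([evaluate(phi, A, X, {'x': v}) for v in range(3)])  # [2.0, 1.0, 1.0]
```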
k-MPNNs.
Consider a function $\xi : \mathcal{G}_{n,k} \to \mathbb{R}^{s}$ for some $s \geq 1$. We say that the function $\xi$ can be represented in TL if there exist expressions $\varphi_1(x_1,\ldots,x_k),\ldots,\varphi_s(x_1,\ldots,x_k)$ in TL such that for each graph $G$ and each $k$-tuple $\bar v$ of vertices:
$\xi(G,\bar v) = \big([\![\varphi_1]\!](G,\bar v),\ldots,[\![\varphi_s]\!](G,\bar v)\big).$
Of particular interest are $k$-th order MPNNs (or $k$-MPNNs), which refer to the class of functions that can be represented in TL_k. We can regard GNNs as functions $\mathcal{G}_{n,k} \to \mathbb{R}^{s}$. Hence, a GNN is a $k$-MPNN if its corresponding functions are $k$-MPNNs. For example, we can interpret the GNN layer from above as a function $\xi$ such that $\xi(G,v) := \mathbf{F}'_{v\bullet}$. We have seen that $\xi(G,v) = ([\![\varphi_1]\!](G,v),\ldots,[\![\varphi_{\ell'}]\!](G,v))$ for each $v$, with the $\varphi_j$ in TL_2. Hence, $\xi$ belongs to the class of $2$-MPNNs and thus our example GNN layer is a $2$-MPNN.
TL represents equivariant or invariant functions.
We make a simple observation which follows from the type of operators allowed in expressions in TL.
Proposition 3.1.
Any function represented in TL is equivariant (invariant if it has no free index variables, i.e., if $k = 0$).
An immediate consequence is that when a GNN is a $k$-MPNN, it is automatically invariant or equivariant, depending on whether graph or vertex-tuple embeddings are considered.
4 Separation Power of Tensor Languages
Our first main results concern the characterization of the separation power of tensor languages in terms of the color refinement and $k$-dimensional Weisfeiler-Leman algorithms. We provide a fine-grained characterization by taking the number of rounds of these algorithms into account. This will allow for measuring the separation power of classes of GNNs in terms of their number of layers.
4.1 Separation Power
We define the separation power of graph functions in terms of an equivalence relation, based on the definition from Azizian & Lelarge (2021), hereby first focusing on their ability to separate vertices. (We differ slightly from Azizian & Lelarge (2021) in that they only define equivalence relations on graphs.)
Definition 1.
Let $\mathcal{F}$ be a set of functions $\xi : \mathcal{G}_{n,1} \to \mathbb{R}^{s}$. The equivalence relation $\rho(\mathcal{F})$ on $\mathcal{G}_{n,1}\times\mathcal{G}_{n,1}$ is defined as follows: $\big((G,v),(H,w)\big) \in \rho(\mathcal{F})$ if and only if $\xi(G,v) = \xi(H,w)$ for all $\xi \in \mathcal{F}$. ∎
In other words, when $\big((G,v),(H,w)\big) \in \rho(\mathcal{F})$, no function in $\mathcal{F}$ can separate $v$ in $G$ from $w$ in $H$. For example, we can view $\mathsf{cr}^{(t)}$ and $\mathsf{wl}_k^{(t)}$ as functions from $\mathcal{G}_{n,1}$ to some label space. As such, $\rho(\mathsf{cr}^{(t)})$ and $\rho(\mathsf{wl}_k^{(t)})$ measure the separation power of these algorithms. The following strict inclusions are known: for all $k \geq 1$, $\rho(\mathsf{wl}_{k+1}) \subsetneq \rho(\mathsf{wl}_k)$ and $\rho(\mathsf{wl}_1) \subsetneq \rho(\mathsf{cr})$ (Otto, 2017; Grohe, 2021). It is also known that more rounds ($t$) increase the separation power of these algorithms (Fürer, 2001).
For a fragment $L$ of TL expressions, we define $\rho(L)$ as the equivalence relation associated with all functions that can be represented in $L$. By definition, we thus consider here expressions in $L$ with one free index variable, resulting in vertex embeddings.
4.2 Main Results
We first provide a link between $k$-WL and tensor language expressions using $k+1$ index variables:
Theorem 4.1.
For each $k \geq 1$ and any collection $\Omega$ of functions, $\rho(\mathsf{wl}_k) \subseteq \rho(\mathrm{TL}_{k+1}(\Omega))$.
This theorem gives us new insights: if we wish to understand how a new GNN architecture compares against the $k$-WL algorithms, all we need to do is to show that such an architecture can be represented in TL_{k+1}, i.e., that it is a $(k+1)$-MPNN, an arguably much easier endeavor. As an example of how to use this result, it is well known that triangles can be detected by $2$-WL but not by $1$-WL. Thus, in order to design GNNs that can detect triangles, layer definitions in TL_3 rather than TL_2 should be used.
We can do much more, relating the number of rounds of $k$-WL to the notion of summation depth of TL expressions. We also present similar results for functions computing graph embeddings.
The summation depth of a TL expression measures the nesting depth of the summations in the expression. It is defined inductively: $\mathrm{sd}(E(x,y)) = \mathrm{sd}(P_i(x)) := 0$, $\mathrm{sd}(x = y) = \mathrm{sd}(x \neq y) = \mathrm{sd}(a) := 0$, $\mathrm{sd}(\varphi_1 + \varphi_2) = \mathrm{sd}(\varphi_1 \cdot \varphi_2) := \max\{\mathrm{sd}(\varphi_1),\mathrm{sd}(\varphi_2)\}$, $\mathrm{sd}(f(\varphi_1,\ldots,\varphi_p)) := \max_{i\in[p]}\mathrm{sd}(\varphi_i)$, and $\mathrm{sd}(\sum_x\varphi) := \mathrm{sd}(\varphi) + 1$. For example, the expressions $\varphi_j(x)$ above have summation depth one. We write $\mathrm{TL}_k^{(t)}$ for the class of expressions in TL_k of summation depth at most $t$, and use $k$-MPNN$^{(t)}$ for the corresponding class of $k$-MPNNs. We can now refine Theorem 4.1, taking into account the number of rounds used in $k$-WL.
Theorem 4.2.
For all $k, t \geq 1$, and any collection $\Omega$ of functions, $\rho(\mathsf{wl}_k^{(t)}) \subseteq \rho(\mathrm{TL}_{k+1}^{(t)}(\Omega))$.
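Since the summation depth is the key quantity in this theorem, a computation of it over the same tuple-based AST used in the evaluator sketched earlier is worth making explicit; the sketch below is our own illustration and follows the inductive definition literally.

```python
def summation_depth(expr):
    """Summation depth of a TL expression given as nested tuples
    (same encoding as the evaluator sketched above)."""
    op = expr[0]
    if op in ('E', 'P', 'const', 'eq', 'neq'):
        return 0                                   # atomic expressions
    if op in ('add', 'mul'):
        return max(summation_depth(expr[1]), summation_depth(expr[2]))
    if op == 'fun':                                # function application does not add depth
        return summation_depth(expr[2])
    if op == 'sum':                                # each summation adds one level of nesting
        return 1 + summation_depth(expr[2])
    raise ValueError(op)

# phi(x) = sum_y E(x, y) * (sum_z E(y, z) * P_0(z)) has summation depth 2.
phi = ('sum', 'y', ('mul', ('E', 'x', 'y'),
                    ('sum', 'z', ('mul', ('E', 'y', 'z'), ('P', 0, 'z')))))
print(summation_depth(phi))  # 2
```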
Guarded TL and color refinement.
As noted by Barceló et al. (2020), the separation power of vertex embeddings of simple GNNs, which propagate information only through neighboring vertices, is usually weaker than that of $1$-WL. For these types of architectures, Barceló et al. (2020) provide a relation with the weaker color refinement algorithm, but only in the special case of first-order classifiers. We can recover and extend this result in our general setting, with a guarded version of TL which, as we will show, has the same separation power as color refinement.
The guarded fragment GTL of TL_2 is inspired by the use of adjacency matrices in simple GNNs. In GTL only equality predicates of the form $x = x$ (the constant $1$) and $x \neq x$ (the constant $0$) are allowed, addition and multiplication require the component expressions to have the same (single) free index, and summation must occur in a guarded form $\sum_y E(x,y)\cdot\varphi(y)$, for $\varphi$ in GTL. Guardedness means that summation only happens over neighbors. In GTL, all expressions have a single free variable and thus only functions from $\mathcal{G}_{n,1}$ to $\mathbb{R}^{s}$ can be represented. Our example expressions $\varphi_j(x)$ are guarded. The fragment $\mathrm{GTL}^{(t)}$ consists of expressions in GTL of summation depth at most $t$. We denote by GMPNN and GMPNN$^{(t)}$ the corresponding “guarded” classes of $2$-MPNNs. (For the connection to classical MPNNs (Gilmer et al., 2017), see Section H in the supplementary material.)
Theorem 4.3.
For all $t \geq 0$ and any collection $\Omega$ of functions: $\rho(\mathsf{cr}^{(t)}) = \rho(\mathrm{GTL}^{(t)}(\Omega))$.
As an application of this theorem, to detect the existence of paths of length $t$ starting from a vertex, the number of guarded layers in a GNN should account for a representation in GTL of summation depth at least $t$. We recall that $\rho(\mathsf{wl}_1) \subsetneq \rho(\mathsf{cr})$ which, combined with our previous results, implies that TL_2 (resp., $2$-MPNNs) is strictly more separating than GTL (resp., guarded MPNNs).
Graph embeddings.
We next establish connections between the graph versions of $\mathsf{cr}$ and $\mathsf{wl}_k$, and TL expressions without free index variables. To this aim, we use $\rho_{\mathrm{gr}}(\mathcal{F})$, for a set $\mathcal{F}$ of functions $\xi : \mathcal{G}_n \to \mathbb{R}^{s}$, as the equivalence relation over $\mathcal{G}_n\times\mathcal{G}_n$ defined in analogy to $\rho(\mathcal{F})$: $(G,H) \in \rho_{\mathrm{gr}}(\mathcal{F})$ if and only if $\xi(G) = \xi(H)$ for all $\xi \in \mathcal{F}$. We thus consider separation power on the graph level. For example, we can consider $\rho_{\mathrm{gr}}(\mathsf{cr}^{(t)})$ and $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)})$ for any $t$ and $k$. Also here, but different from vertex embeddings, $\rho_{\mathrm{gr}}(\mathsf{wl}_1) = \rho_{\mathrm{gr}}(\mathsf{cr})$ (Grohe, 2021). We define $\rho_{\mathrm{gr}}(L)$ for a fragment $L$ of TL by considering expressions without free index variables.
The connection between the number of index variables in expressions and $k$-WL continues to hold. Apart from color refinement, no clean relationship exists between summation depth and number of rounds, however. (Indeed, for general tensor language expressions the number of rounds needed can be substantially larger than the summation depth; this follows from Cai et al. (1992) and connections to finite variable logics.)
Theorem 4.4.
For all $k, t \geq 1$, and any collection $\Omega$ of functions, we have that: (1) $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)}) \subseteq \rho_{\mathrm{gr}}(\mathrm{TL}_{k+1}^{(t+1)}(\Omega))$, and (2) $\rho_{\mathrm{gr}}(\mathsf{cr}^{(t)}) \subseteq \rho_{\mathrm{gr}}(\mathrm{GTL}^{(t+1)}(\Omega))$, where in (2) the additional summation is the final unguarded readout summation over all vertices.
Intuitively, in (1) the increase in summation depth by one is incurred by the additional aggregation needed to collect all vertex labels computed by $k$-WL.
Optimality of number of indices.
Our results so far tell us that graph functions represented in TL_{k+1} are at most as separating as $k$-WL. What is left unaddressed is whether all $k+1$ index variables are needed for the graph functions under consideration. It may well be, for example, that there exists an equivalent expression using fewer index variables. This would imply a stronger upper bound on the separation power, by $k'$-WL for some $k' < k$. We next identify a large class of expressions, those of treewidth $k$, for which the number of index variables can be reduced to $k+1$.
Proposition 4.5.
Expressions in TL of treewidth $k$ are equivalent to expressions in TL_{k+1}.
Treewidth is defined in the supplementary material (Section G); a treewidth of $k$ implies that the computation of tensor language expressions can be decomposed, by reordering summations, such that each local computation requires at most $k+1$ indices (see also Aji & McEliece (2000)). As a simple example, consider $\varphi(x) := \sum_y\sum_z E(x,y)\cdot E(y,z)$ in TL_3, which counts the number of paths of length two starting from $x$. This expression has a treewidth of one. And indeed, it is equivalent to the expression $\sum_y E(x,y)\cdot\big(\sum_x E(y,x)\big)$ in TL_2 (reusing the variable $x$ under the inner summation, and in fact expressible in GTL). As a consequence, no more vertices can be separated by $\varphi$ than by $\mathsf{cr}$, rather than by $2$-WL as the original expression in TL_3 suggests.
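The reordering argument is just variable elimination; the following NumPy sketch (our own illustration) checks that the three-index sum over y and z equals the two-stage computation in which z is summed out first.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T  # random adjacency

# Original form: phi(x) = sum_y sum_z A[x, y] * A[y, z]  (three indices alive at once).
phi_3idx = np.array([sum(A[x, y] * A[y, z] for y in range(n) for z in range(n))
                     for x in range(n)])

# Reordered form: first sum out z (degrees), then a guarded sum over neighbors y.
deg = A.sum(axis=1)          # deg[y] = sum_z A[y, z]
phi_2idx = A @ deg           # phi(x) = sum_y A[x, y] * deg[y]

print(np.array_equal(phi_3idx, phi_2idx))  # True
```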
On the impact of functions.
All separation results for TL and fragments thereof hold regardless of the chosen functions in $\Omega$, including when no functions are present at all. Function applications hence do not add separation power. While this may seem counter-intuitive, it is due to the presence of summation and multiplication in TL, which are already enough to separate graphs or vertices.
5 Consequences for GNNs
We next interpret the general results on separation power from Section 4 in the context of GNNs.
1. The separation power of any vertex embedding architecture which is a guarded MPNN of summation depth $t$ is bounded by the power of $t$ rounds of color refinement.
We consider the Graph Isomorphism Networks (GIN) (Xu et al., 2019) and show that these are guarded MPNNs. To do so, we represent them in GTL. Let $M$ be such a network; it updates vertex embeddings as follows. Initially, $\mathbf{F}^{(0)}_{v\bullet} := \chi(v)$. For layer $t > 0$, $\mathbf{F}^{(t)}$ is given by: $\mathbf{F}^{(t)}_{v\bullet} := \mathsf{MLP}^{(t)}\big((1+\epsilon^{(t)})\,\mathbf{F}^{(t-1)}_{v\bullet} + \sum_{u\in N_G(v)}\mathbf{F}^{(t-1)}_{u\bullet}\big)$, with $\epsilon^{(t)} \in \mathbb{R}$ and $\mathsf{MLP}^{(t)}$ an MLP. We denote by GIN$^{(T)}$ the class of GINs consisting of $T$ layers. Clearly, $\mathbf{F}^{(0)}$ can be represented in $\mathrm{GTL}^{(0)}$ by considering the expressions $\varphi_i^{(0)}(x) := P_i(x)$ for each $i \in [\ell]$. To represent $\mathbf{F}^{(t)}$, assume that we have expressions $\varphi_i^{(t-1)}(x)$ in $\mathrm{GTL}^{(t-1)}$ representing $\mathbf{F}^{(t-1)}$. That is, we have $[\![\varphi_i^{(t-1)}]\!](G,v) = \mathbf{F}^{(t-1)}_{vi}$ for each vertex $v$ and $i$. Then $\mathbf{F}^{(t)}$ is represented by expressions $\varphi_i^{(t)}(x)$ defined as:
$\varphi_i^{(t)}(x) := \mathsf{MLP}^{(t)}_i\Big((1+\epsilon^{(t)})\cdot\varphi_1^{(t-1)}(x) + \textstyle\sum_y E(x,y)\cdot\varphi_1^{(t-1)}(y),\;\ldots,\;(1+\epsilon^{(t)})\cdot\varphi_{\ell}^{(t-1)}(x) + \textstyle\sum_y E(x,y)\cdot\varphi_{\ell}^{(t-1)}(y)\Big),$
where $\mathsf{MLP}^{(t)}_i$ denotes the function computing the $i$-th output of $\mathsf{MLP}^{(t)}$,
which are now expressions in $\mathrm{GTL}^{(t)}(\Omega)$, where $\Omega$ consists of the functions $\mathsf{MLP}^{(s)}_i$ for $s \leq t$. We have $[\![\varphi_i^{(t)}]\!](G,v) = \mathbf{F}^{(t)}_{vi}$ for each $v$ and $i$, as desired. Hence, Theorem 4.3 tells us that $T$-layered GINs cannot be more separating than $T$ rounds of color refinement, in accordance with known results (Xu et al., 2019; Morris et al., 2019). We thus simply cast GINs in GTL to obtain an upper bound on their separation power. In the supplementary material (Section D) we give similar analyses for GraphSage with various aggregation functions (Hamilton et al., 2017), GCNs (Kipf & Welling, 2017), simplified GCNs (SGCs) (Wu et al., 2019), Principal Neighbourhood Aggregation (PNAs) (Corso et al., 2020), and revisit the analysis of ChebNet (Defferrard et al., 2016) given in Balcilar et al. (2021a).
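For concreteness, a minimal NumPy sketch of the GIN update above is given next (our own illustration; the two-layer perceptron and epsilon value are assumptions), highlighting that each layer only aggregates over neighbors, i.e., only uses guarded summation.

```python
import numpy as np

def gin_layer(A, F, W1, W2, eps):
    """One GIN layer: MLP((1 + eps) * F[v] + sum of neighbor features),
    with a two-layer MLP given by weight matrices W1 and W2."""
    H = (1.0 + eps) * F + A @ F          # (1 + eps) * own features + guarded neighbor sum
    return np.maximum(H @ W1, 0) @ W2    # small MLP with a ReLU in between

rng = np.random.default_rng(2)
n, d = 4, 3
A = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]], dtype=float)
F = rng.standard_normal((n, d))
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

F1 = gin_layer(A, F, W1, W2, eps=0.1)    # layer 1
F2 = gin_layer(A, F1, W1, W2, eps=0.1)   # layer 2: two layers, summation depth two
print(F2.shape)  # (4, 3)
```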
2. The separation power of any vertex embedding architecture which is a $(k+1)$-MPNN of summation depth $t$ is bounded by the power of $t$ rounds of $k$-WL.
For $k = 1$, we consider extended Graph Isomorphism Networks (Barceló et al., 2020). For an extended GIN, $\mathbf{F}^{(0)}$ is defined as for GINs, but for layer $t > 0$, $\mathbf{F}^{(t)}_{v\bullet}$ is obtained by applying an MLP to both the GIN aggregate and an additional global readout term $\sum_{u\in V}\mathbf{F}^{(t-1)}_{u\bullet}$. The difference with GINs is the use of this readout, which corresponds to the unguarded summation $\sum_y\varphi(y)$. This implies that TL_2 rather than GTL needs to be used. In a similar way as for GINs, we can represent $T$-layered extended GINs in $\mathrm{TL}_2^{(T)}$. That is, each extended GIN is a $2$-MPNN. Theorem 4.2 tells us that $T$ rounds of $1$-WL bound the separation power of $T$-layered extended GINs, in accordance with Barceló et al. (2020). More generally, any GNN looking to go beyond color refinement must use non-guarded aggregations.
For $k \geq 2$, it is straightforward to show that $T$-layered “folklore” GNNs ($k$-FGNNs) (Maron et al., 2019b) are $(k+1)$-MPNNs of summation depth $T$ and thus, by Theorem 4.2, $T$ rounds of $k$-WL bound their separation power. One merely needs to cast the layer definitions in TL and observe that $k+1$ indices and summation depth $T$ are needed. We thus refine and recover the $k$-WL bound for $k$-FGNNs by Azizian & Lelarge (2021). We also show that the separation power of $k$-th order Invariant Graph Networks ($k$-IGNs) (Maron et al., 2019b) is bounded by $(k-1)$-WL, albeit with an increase in the required number of rounds.
Theorem 5.1.
For any $k \geq 2$, the separation power of a $T$-layered $k$-IGN is bounded by the separation power of $kT$ rounds of $(k-1)$-WL.
We hereby answer open problem 1 in Maron et al. (2019a). The case $k = 2$ was solved in Chen et al. (2020) by analyzing properties of $2$-IGNs. By contrast, Theorem 4.2 shows that one can focus on expressing $k$-IGNs in TL_k and analyzing the summation depth of the resulting expressions. The proof of Theorem 5.1 requires non-trivial manipulations of tensor language expressions; it is a simplified proof of Geerts (2020). The additional rounds ($kT$ rather than $T$) are needed because $k$-IGNs aggregate information in one layer that only becomes accessible to $(k-1)$-WL after several rounds. We defer details to Section E in the supplementary material, where we also identify a simple class of $T$-layered $k$-IGNs that are as powerful as general $k$-IGNs but whose separation power is bounded by $T$ rounds of $(k-1)$-WL.
We also consider “augmented” MPNNs, which are MPNNs combined with a preprocessing step in which higher-order graph information is computed. In the supplementary material (Section D.3) we show how TL encodes the preprocessing step, and how this leads to separation bounds in terms of $k$-WL, where $k$ depends on the treewidth of the graph information used. Finally, our approach can also be used to show that the spectral CayleyNets (Levie et al., 2019) are bounded in separation power by $2$-WL. This result complements the spectral analysis of GNNs given in Balcilar et al. (2021b).
3. The separation power of any graph embedding architecture which is a $(k+1)$-MPNN is bounded by the power of $k$-WL.
Graph embedding methods are commonly obtained from vertex (tuple) embedding methods by including a readout layer in which all vertex (tuple) embeddings are aggregated. For example, $\mathbf{F}(G) := \sum_{v\in V}\mathbf{F}^{(T)}_{v\bullet}$ is a typical readout layer for GINs. Since $\mathbf{F}^{(T)}$ can be represented in $\mathrm{GTL}^{(T)}$, the readout layer can be represented in $\mathrm{TL}_2^{(T+1)}$, using one extra summation. So graph-level GINs are $2$-MPNNs. Hence, their separation power is bounded by $\rho_{\mathrm{gr}}(\mathsf{cr})$, in accordance with Theorem 4.4. This holds more generally. If vertex embedding methods are $(k+1)$-MPNNs, then so are their graph versions, which are then bounded by $k$-WL by our Theorem 4.4.
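As a small illustration (ours, not the paper's), the readout is literally one more unguarded summation on top of the vertex embeddings, which is where the extra unit of summation depth in Theorem 4.4 comes from.

```python
import numpy as np

def readout(F_vertices):
    """Graph embedding = sum over all vertex embeddings (one extra, unguarded summation)."""
    return F_vertices.sum(axis=0)

F_T = np.array([[1.0, 2.0], [0.5, 0.0], [1.5, 1.0]])  # vertex embeddings after T layers
print(readout(F_T))  # [3. 3.]
```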
4. To go beyond the separation power of $k$-WL, it is necessary to use GNNs whose layers are represented by expressions of treewidth larger than $k$.
Hence, to design expressive GNNs one needs to define the layers such that the treewidth of the resulting expressions is large enough. For example, to go beyond $\mathsf{cr}$, linear algebra operations whose expressions have treewidth at least two should be used. Treewidth also sheds light on the open problem from Maron et al. (2019a) where it was asked whether polynomial layers (in $\mathbf{A}$) increase the separation power. Indeed, consider a layer of the form $\mathbf{F}' := \sigma(\mathbf{A}^3\mathbf{F}\mathbf{W})$, which raises the adjacency matrix to the power three. Translated in TL, the layer expressions resemble $\sum_y\sum_z\sum_w E(x,y)\cdot E(y,z)\cdot E(z,w)\cdot P_i(w)$, of treewidth one. Proposition 4.5 tells us that the layer is bounded by $1$-WL (and in fact by $\mathsf{cr}$) in separation power. If instead, the layer is of the form $\mathbf{F}' := \sigma(\mathbf{B}\mathbf{F}\mathbf{W})$ where $\mathbf{B}_{vu}$ holds the number of $3$-cliques (triangles) containing the edge $\{v,u\}$, then in TL we get expressions containing $\sum_z E(x,y)\cdot E(y,z)\cdot E(z,x)$. The variables $x$, $y$, $z$ form a $3$-clique, resulting in expressions of treewidth two. As a consequence, the separation power will be bounded by $2$-WL. These examples show that it is not the number of multiplications (in both cases two) that gives power, it is how the variables are connected to each other.
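The two layer variants discussed above can be contrasted directly; in the NumPy sketch below (our own illustration) the cube of the adjacency matrix is computed by repeated two-index products (never more than two indices alive at once), whereas the triangle-count matrix genuinely couples three indices at a time.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=int)

# A^3 via repeated two-index products: a treewidth-one style evaluation.
A3 = A @ (A @ A)

# B[v, u] = number of triangles containing edge {v, u}: the inner sum couples x, y, z.
n = A.shape[0]
B = np.array([[A[v, u] * sum(A[v, z] * A[z, u] for z in range(n)) for u in range(n)]
              for v in range(n)])

print(A3[0, 1], B[0, 1])  # walks of length 3 from 0 to 1 vs. triangles on edge {0, 1}
```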
6 Function Approximation
We next provide characterizations of the functions that can be approximated by TL expressions, when interpreted as functions on graphs. We recover and extend results from Azizian & Lelarge (2021) by taking the number of layers of GNNs into account. We also provide new results related to color refinement.
6.1 General TL Approximation Results
We assume that $\mathcal{G}_{n,k}$ is a compact space by requiring that vertex labels come from a compact set $\mathcal{X} \subseteq \mathbb{R}^{\ell}$. Let $\mathcal{F}$ be a set of functions $\mathcal{G}_{n,k} \to \mathbb{R}^{s}$ (for varying $s$) and define its closure $\overline{\mathcal{F}}$ as all functions $\xi$ from $\mathcal{G}_{n,k}$ to $\mathbb{R}^{s}$ for which there exists a sequence $\xi_1,\xi_2,\ldots$ in $\mathcal{F}$ such that $\|\xi_i - \xi\| \to 0$ for some norm. We assume $\mathcal{F}$ to satisfy two properties. First, $\mathcal{F}$ is concatenation-closed: if $\xi_1 : \mathcal{G}_{n,k} \to \mathbb{R}^{s_1}$ and $\xi_2 : \mathcal{G}_{n,k} \to \mathbb{R}^{s_2}$ are in $\mathcal{F}$, then their concatenation $(\xi_1,\xi_2)$ is also in $\mathcal{F}$. Second, $\mathcal{F}$ is function-closed, for a fixed $s$: for any $\xi : \mathcal{G}_{n,k} \to \mathbb{R}^{s'}$ in $\mathcal{F}$, also $f\circ\xi$ is in $\mathcal{F}$ for any continuous function $f : \mathbb{R}^{s'} \to \mathbb{R}^{s}$. For such $\mathcal{F}$, we let $\mathcal{F}_s$ be the subset of functions in $\mathcal{F}$ from $\mathcal{G}_{n,k}$ to $\mathbb{R}^{s}$. Our next result is based on a generalized Stone-Weierstrass theorem (Timofte, 2005), also used in Azizian & Lelarge (2021).
Theorem 6.1.
For any $s \geq 1$, and any set $\mathcal{F}$ of functions that is concatenation- and function-closed for $s$, we have: $\overline{\mathcal{F}_s} = \{\xi : \mathcal{G}_{n,k} \to \mathbb{R}^{s}\ \text{continuous} \mid \rho(\mathcal{F}) \subseteq \rho(\{\xi\})\}$.
This result gives us insight into which functions can be approximated by, for example, a set $\mathcal{F}$ of functions originating from a class of GNNs. In this case, $\overline{\mathcal{F}_s}$ represents all functions approximated by instances of such a class, and Theorem 6.1 tells us that this set corresponds precisely to the set of all functions that are equally or less separating than the GNNs in this class. If, in addition, $\mathcal{F}$ is as separating as $\mathsf{cr}$ or $\mathsf{wl}_k$, then we can say more. Let $\rho^{\ast}$ denote one of the relations $\rho(\mathsf{cr}^{(t)})$ or $\rho(\mathsf{wl}_k^{(t)})$.
Corollary 6.2.
Under the assumptions of Theorem 6.1, and if $\rho(\mathcal{F}) = \rho^{\ast}$, then $\overline{\mathcal{F}_s} = \{\xi : \mathcal{G}_{n,k} \to \mathbb{R}^{s}\ \text{continuous} \mid \rho^{\ast} \subseteq \rho(\{\xi\})\}$.
The properties of being concatenation- and function-closed are satisfied for sets of functions representable in our tensor languages, provided that $\Omega$ contains all continuous functions $f : \mathbb{R}^{p} \to \mathbb{R}$, for any $p$, or alternatively, all MLPs (by Lemma 32 in Azizian & Lelarge (2021)). Together with our results in Section 4, the corollary implies that GTL (and hence guarded MPNNs) and TL_{k+1} (and hence $(k+1)$-MPNNs) can approximate all functions with equal or less separation power than $\mathsf{cr}$ and $\mathsf{wl}_k$, respectively.
Proposition 3.1 also tells us that the closure consists of invariant (when $k = 0$) and equivariant (when $k \geq 1$) functions.
6.2 Consequences for GNNs
All our results combined provide a recipe to guarantee that a given function can be approximated by GNN architectures. Indeed, suppose that your class of GNNs is a class of guarded MPNNs (respectively, $2$-MPNNs, $(k+1)$-MPNNs, or graph-level $(k+1)$-MPNNs, for some $k$). Then, since most classes of GNNs are concatenation-closed and allow the application of arbitrary MLPs, this implies that your GNNs can only approximate functions that are no more separating than $\mathsf{cr}$ (respectively, $1$-WL, or $\mathsf{wl}_k$). To guarantee that these functions can indeed be approximated, one additionally has to show that your class of GNNs matches the corresponding labeling algorithm in separation power.
For example, GINs whose layers use MLPs are guarded MPNNs, and thus their closure contains any equivariant function $\xi$ satisfying $\rho(\mathsf{cr}) \subseteq \rho(\{\xi\})$. Similarly, $k$-FGNNs are $(k+1)$-MPNNs, so their closure contains any function $\xi$ satisfying $\rho(\mathsf{wl}_k) \subseteq \rho(\{\xi\})$; and when extended with a readout layer, their closures consist of functions $\xi$ satisfying $\rho_{\mathrm{gr}}(\mathsf{wl}_k) \subseteq \rho_{\mathrm{gr}}(\{\xi\})$. Finally, $k$-IGNs are $k$-MPNNs, so their closures consist of functions $\xi$ such that $\rho(\mathsf{wl}_{k-1}) \subseteq \rho(\{\xi\})$. We thus recover and extend results by Azizian & Lelarge (2021) by including layer information ($t$) and by treating color refinement separately from $1$-WL for vertex embeddings. Furthermore, Theorem 5.1 implies that the closure of graph-level $k$-IGNs consists of invariant functions $\xi$ satisfying $\rho_{\mathrm{gr}}(\mathsf{wl}_{k-1}) \subseteq \rho_{\mathrm{gr}}(\{\xi\})$, a case left open in Azizian & Lelarge (2021).
7 Conclusion
Connecting GNNs and tensor languages allows us to use our analysis of tensor languages to understand the separation and approximation power of GNNs. The number of indices and the summation depth needed to represent the layers of GNNs determine their separation power in terms of the color refinement and Weisfeiler-Leman tests. The framework of $k$-MPNNs provides a handy toolbox to understand existing and new GNN architectures, and we demonstrate this by recovering several results about the power of GNNs presented recently in the literature, as well as proving new results.
8 Acknowledgements & Disclosure of Funding
This work is partially funded by ANID–Millennium Science Initiative Program–Code ICN17_002, Chile.
References
- Abo Khamis et al. (2016) Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. FAQ: Questions Asked Frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS, pp. 13–28. ACM, 2016. URL https://doi.org/10.1145/2902251.2902280.
- Aji & McEliece (2000) Srinivas M. Aji and Robert J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000. URL https://doi.org/10.1109/18.825794.
- Azizian & Lelarge (2021) Waiss Azizian and Marc Lelarge. Expressive power of invariant and equivariant graph neural networks. In Proceedings of the 9th International Conference on Learning Representations, ICLR, 2021. URL https://openreview.net/forum?id=lxHgXYN4bwl.
- Balcilar et al. (2021a) Muhammet Balcilar, Pierre Héroux, Benoit Gaüzère, Pascal Vasseur, Sébastien Adam, and Paul Honeine. Breaking the limits of message passing graph neural networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 599–608. PMLR, 2021a. URL http://proceedings.mlr.press/v139/balcilar21a.html.
- Balcilar et al. (2021b) Muhammet Balcilar, Guillaume Renton, Pierre Héroux, Benoit Gaüzère, Sébastien Adam, and Paul Honeine. Analyzing the expressive power of graph neural networks in a spectral perspective. In Proceedings of the 9th International Conference on Learning Representations, ICLR, 2021b. URL https://openreview.net/forum?id=-qh0M9XWxnv.
- Barceló et al. (2020) Pablo Barceló, Egor V Kostylev, Mikael Monet, Jorge Pérez, Juan Reutter, and Juan Pablo Silva. The logical expressiveness of graph neural networks. In Proceedings of the 8th International Conference on Learning Representations, ICLR, 2020. URL https://openreview.net/forum?id=r1lZ7AEKvB.
- Barceló et al. (2021) Pablo Barceló, Floris Geerts, Juan L. Reutter, and Maksimilian Ryschkov. Graph neural networks with local graph parameters. In Advances in Neural Information Processing Systems, volume 34, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/d4d8d1ac7e00e9105775a6b660dd3cbb-Abstract.html.
- Bodnar et al. (2021) Cristian Bodnar, Fabrizio Frasca, Yuguang Wang, Nina Otter, Guido F. Montúfar, Pietro Lió, and Michael M. Bronstein. Weisfeiler and Lehman go topological: Message passing simplicial networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 1026–1037. PMLR, 2021. URL http://proceedings.mlr.press/v139/bodnar21a.html.
- Bouritsas et al. (2020) Giorgos Bouritsas, Fabrizio Frasca, Stefanos Zafeiriou, and Michael M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. In Graph Representation Learning and Beyond (GRL+) Workshop at the 37 th International Conference on Machine Learning, 2020. URL https://arxiv.org/abs/2006.09252.
- Brijder et al. (2019) Robert Brijder, Floris Geerts, Jan Van den Bussche, and Timmy Weerwag. On the expressive power of query languages for matrices. ACM TODS, 44(4):15:1–15:31, 2019. URL https://doi.org/10.1145/3331445.
- Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, ICLR, 2014. URL https://openreview.net/forum?id=DQNsQf-UsoDBa.
- Cai et al. (1992) Jin-yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identifications. Comb., 12(4):389–410, 1992. URL https://doi.org/10.1007/BF01305232.
- Chen et al. (2019) Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with GNNs. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/71ee911dd06428a96c143a0b135041a4-Paper.pdf.
- Chen et al. (2020) Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count substructures? In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc/paper/2020/file/75877cb75154206c4e65e76b88a12712-Paper.pdf.
- Corso et al. (2020) Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc/paper/2020/file/99cad265a1768cc2dd013f0e740300ae-Paper.pdf.
- Csanky (1976) L. Csanky. Fast parallel matrix inversion algorithms. SIAM J. Comput., 5(4):618–623, 1976. URL https://doi.org/10.1137/0205040.
- Curticapean et al. (2017) Radu Curticapean, Holger Dell, and Dániel Marx. Homomorphisms are a good basis for counting small subgraphs. In Proceedings of the 49th Symposium on Theory of Computing, STOC, pp. 210–223, 2017. URL http://dx.doi.org/10.1145/3055399.3055502.
- Damke et al. (2020) Clemens Damke, Vitalik Melnikov, and Eyke Hüllermeier. A novel higher-order weisfeiler-lehman graph convolution. In Proceedings of The 12th Asian Conference on Machine Learning, ACML, volume 129 of Proceedings of Machine Learning Research, pp. 49–64. PMLR, 2020. URL http://proceedings.mlr.press/v129/damke20a.html.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, volume 30, 2016. URL https://proceedings.neurips.cc/paper/2016/file/04df4d434d481c5bb723be1b6df1ee65-Paper.pdf.
- Fürer (2001) Martin Fürer. Weisfeiler-Lehman refinement requires at least a linear number of iterations. In Proceedings of the 28th International Colloqium on Automata, Languages and Programming, ICALP, volume 2076 of Lecture Notes in Computer Science, pp. 322–333. Springer, 2001. URL https://doi.org/10.1007/3-540-48224-5_27.
- Geerts (2020) Floris Geerts. The expressive power of kth-order invariant graph networks. CoRR, abs/2007.12035, 2020. URL https://arxiv.org/abs/2007.12035.
- Geerts (2021) Floris Geerts. On the expressive power of linear algebra on graphs. Theory Comput. Syst., 65(1):179–239, 2021. URL https://doi.org/10.1007/s00224-020-09990-9.
- Geerts et al. (2021a) Floris Geerts, Filip Mazowiecki, and Guillermo A. Pérez. Let’s agree to degree: Comparing graph convolutional networks in the message-passing framework. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 3640–3649. PMLR, 2021a. URL http://proceedings.mlr.press/v139/geerts21a.html.
- Geerts et al. (2021b) Floris Geerts, Thomas Muñoz, Cristian Riveros, and Domagoj Vrgoc. Expressive power of linear algebra query languages. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS, pp. 342–354. ACM, 2021b. URL https://doi.org/10.1145/3452021.3458314.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1263–1272, 2017. URL http://proceedings.mlr.press/v70/gilmer17a/gilmer17a.pdf.
- Grohe (2021) Martin Grohe. The logic of graph neural networks. In Proceedings of the 36th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS, pp. 1–17. IEEE, 2021. URL https://doi.org/10.1109/LICS52264.2021.9470677.
- Hamilton (2020) William L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 14(3):1–159, 2020. URL https://doi.org/10.2200/S01045ED1V01Y202009AIM046.
- Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://proceedings.neurips.cc/paper/2017/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf.
- Hammond et al. (2011) David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011. ISSN 1063-5203. doi: https://doi.org/10.1016/j.acha.2010.04.005. URL https://www.sciencedirect.com/science/article/pii/S1063520310000552.
- Immerman & Lander (1990) Neil Immerman and Eric Lander. Describing graphs: A first-order approach to graph canonization. In Complexity Theory Retrospective: In Honor of Juris Hartmanis on the Occasion of His Sixtieth Birthday, pp. 59–81. Springer, 1990. URL https://doi.org/10.1007/978-1-4612-4478-3_5.
- Keriven & Peyré (2019) Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems, volume 32, pp. 7092–7101, 2019. URL https://proceedings.neurips.cc/paper/2019/file/ea9268cb43f55d1d12380fb6ea5bf572-Paper.pdf.
- Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR, 2017. URL https://openreview.net/pdf?id=SJU4ayYgl.
- Levie et al. (2019) Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Process., 67(1):97–109, 2019. URL https://doi.org/10.1109/TSP.2018.2879624.
- Maron et al. (2019a) Haggai Maron, Heli Ben-Hamu, and Yaron Lipman. Open problems: Approximation power of invariant graph networks. In NeurIPS 2019 Graph Representation Learning Workshop, 2019a. URL https://grlearning.github.io/papers/31.pdf.
- Maron et al. (2019b) Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. In Advances in Neural Information Processing Systems, volume 32, 2019b. URL https://proceedings.neurips.cc/paper/2019/file/bb04af0f7ecaee4aae62035497da1387-Paper.pdf.
- Maron et al. (2019c) Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In Proceedings of the 7th International Conference on Learning Representations, ICLR, 2019c. URL https://openreview.net/forum?id=Syx72jC9tm.
- Merkwirth & Lengauer (2005) Christian Merkwirth and Thomas Lengauer. Automatic generation of complementary descriptors with molecular graph networks. J. Chem. Inf. Model., 45(5):1159–1168, 2005. URL https://doi.org/10.1021/ci049613b.
- Morgan (1965) H. L. Morgan. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of Chemical Documentation, 5(2):107–113, 1965. URL https://doi.org/10.1021/c160017a018.
- Morris et al. (2019) Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 4602–4609, 2019. URL https://doi.org/10.1609/aaai.v33i01.33014602.
- Morris et al. (2020) Christopher Morris, Gaurav Rattan, and Petra Mutzel. Weisfeiler and Leman go sparse: Towards scalable higher-order graph embeddings. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc//paper/2020/file/f81dee42585b3814de199b2e88757f5c-Paper.pdf.
- Otto (2017) Martin Otto. Bounded Variable Logics and Counting: A Study in Finite Models, volume 9 of Lecture Notes in Logic. Cambridge University Press, 2017. URL https://doi.org/10.1017/9781316716878.
- Otto (2019) Martin Otto. Graded modal logic and counting bisimulation. ArXiv, 2019. URL https://arxiv.org/abs/1910.00039.
- Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009. URL https://doi.org/10.1109/TNN.2008.2005605.
- Timofte (2005) Vlad Timofte. Stone–Weierstrass theorems revisited. Journal of Approximation Theory, 136(1):45–59, 2005. URL https://doi.org/10.1016/j.jat.2005.05.004.
- Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
- Wu et al. (2019) Felix Wu, Amauri H. Souza Jr., Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, volume 97 of Proceedings of Machine Learning Research, pp. 6861–6871. PMLR, 2019. URL http://proceedings.mlr.press/v97/wu19e.html.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
- Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://proceedings.neurips.cc/paper/2017/file/f22e4747da1aa27e363d86d40ff442fe-Paper.pdf.
Supplementary Material
Appendix A Related Work Cnt’d
We provide additional details on how the tensor language considered in this paper relates to recent work on other matrix query languages. Closest to TL is the matrix query language sum-MATLANG (Geerts et al., 2021b), whose syntax is close to that of TL. There are, however, key differences. First, although sum-MATLANG uses index variables (called vector variables), they all must occur under a summation. In other words, the concept of free index variables is missing, which implies that no general tensors can be represented. In TL, we can represent arbitrary tensors, and the presence of free index variables is crucial to define vertex, or more generally, $k$-tuple embeddings in the context of GNNs. Furthermore, no notion of summation depth was introduced for sum-MATLANG. In TL, the summation depth is crucial to assess the separation power in terms of the number of rounds of color refinement and $k$-WL. And in fact, the separation power of sum-MATLANG was not considered before, nor were finite variable fragments of sum-MATLANG or connections to color refinement and $k$-WL studied before. Finally, no other aggregation functions were considered for sum-MATLANG. We detail in Section C.5 that TL can be gracefully extended to a version supporting an arbitrary set of aggregation functions.
Connections to CR and $k$-WL, and the separation power of another matrix query language, MATLANG (Brijder et al., 2019), were established in Geerts (2021). Yet, the design of MATLANG is completely different in spirit from that of TL. Indeed, MATLANG does not have index variables or explicit summation aggregation. Instead, it only supports matrix multiplication, matrix transposition, function applications, and turning a vector into a diagonal matrix. As such, MATLANG can be shown to be included in TL. Similarly as for sum-MATLANG, MATLANG cannot represent general tensors, has no (free) index variables, and summation depth is not considered (in view of the absence of explicit summation).
We also emphasize that neither for MATLANG nor for sum-MATLANG was a guarded fragment considered. The guarded fragment is crucial to make connections to color refinement (Theorem 4.3). Furthermore, the analysis in terms of the number of index variables, summation depth and treewidth (Theorems 4.1 and 4.2 and Proposition 4.5) was not considered before in the matrix query language literature. For none of these matrix query languages were approximation results considered (Section 6.1).
Matrix query languages are used to assess the expressive power of linear algebra. Balcilar et al. (2021a) use MATLANG and the above-mentioned connections to CR and $k$-WL to assess the separation power of GNNs. More specifically, similar to our work, they show that several GNN architectures can be represented in MATLANG, or fragments thereof. As a consequence, bounds on their separation power easily follow. Furthermore, Balcilar et al. (2021a) propose new GNN architectures inspired by special operators in MATLANG. The use of TL can thus be seen as a continuation of their approach. We note, however, that TL is more general than MATLANG (which is included in TL), allows representing more complex linear algebra computations by means of summation (or other) aggregation, and, finally, provides insights into the number of iterations needed for color refinement and $k$-WL. The connection between the number of variables (or treewidth) and $k$-WL is not present in the work by Balcilar et al. (2021a), and neither is the notion of a guarded fragment, needed to connect to color refinement. We believe that it is precisely these latter two insights that make the tensor language approach valuable for any GNN designer who wishes to upper bound their architecture.
Appendix B Details of Section 3
B.1 Proof of Proposition 3.1
Let $G$ be a graph and let $\pi$ be a permutation of $[n]$. As usual, we define $\pi(G)$ as the graph with vertex set $\pi(V)$, edge set such that $\{\pi(v),\pi(u)\}$ is an edge if and only if $\{v,u\}$ is an edge in $G$, and vertex-labelling $\chi_{\pi(G)}(\pi(v)) := \chi(v)$. We need to show that for any expression $\varphi$ in TL, either $[\![\varphi]\!](\pi(G),\pi(\nu)) = [\![\varphi]\!](G,\nu)$, or, when $\varphi$ has no free index variables, $[\![\varphi]\!](\pi(G)) = [\![\varphi]\!](G)$. We verify this by a simple induction on the structure of expressions in TL.
- If $\varphi$ is $x = y$ (or $x \neq y$), then for a valuation $\nu$ mapping $x$ to $v$ and $y$ to $u$ in $G$:
$[\![x = y]\!](\pi(G),\pi(\nu)) = \mathbb{1}[\pi(v) = \pi(u)] = \mathbb{1}[v = u] = [\![x = y]\!](G,\nu),$
where we used that $\pi$ is a permutation.
- If $\varphi = P_i(x)$, then for a valuation $\nu$ mapping $x$ to $v$ in $G$:
$[\![P_i(x)]\!](\pi(G),\pi(\nu)) = \chi_{\pi(G)}(\pi(v))_i = \chi(v)_i = [\![P_i(x)]\!](G,\nu),$
where we used the definition of $\chi_{\pi(G)}$.
- Similarly, if $\varphi = E(x,y)$, then for a valuation assigning $v$ to $x$ and $u$ to $y$:
$[\![E(x,y)]\!](\pi(G),\pi(\nu)) = \mathbf{A}_{\pi(G)}[\pi(v),\pi(u)] = \mathbf{A}_{G}[v,u] = [\![E(x,y)]\!](G,\nu),$
where we used the definition of $\pi(G)$.
- If $\varphi = \varphi_1 + \varphi_2$, then for a valuation $\nu$ from the free index variables of $\varphi$ to $V$:
$[\![\varphi_1 + \varphi_2]\!](\pi(G),\pi(\nu)) = [\![\varphi_1]\!](\pi(G),\pi(\nu)) + [\![\varphi_2]\!](\pi(G),\pi(\nu)) = [\![\varphi_1]\!](G,\nu) + [\![\varphi_2]\!](G,\nu) = [\![\varphi_1 + \varphi_2]\!](G,\nu),$
where we used the induction hypothesis for $\varphi_1$ and $\varphi_2$. The cases $\varphi_1 \cdot \varphi_2$ and constants $a$ are dealt with in a similar way.
- If $\varphi = f(\varphi_1,\ldots,\varphi_p)$, then
$[\![f(\varphi_1,\ldots,\varphi_p)]\!](\pi(G),\pi(\nu)) = f\big([\![\varphi_1]\!](\pi(G),\pi(\nu)),\ldots,[\![\varphi_p]\!](\pi(G),\pi(\nu))\big) = f\big([\![\varphi_1]\!](G,\nu),\ldots,[\![\varphi_p]\!](G,\nu)\big) = [\![f(\varphi_1,\ldots,\varphi_p)]\!](G,\nu),$
where we used again the induction hypothesis for $\varphi_1,\ldots,\varphi_p$.
- Finally, if $\varphi = \sum_x\psi$ then for a valuation $\nu$ of the free index variables of $\varphi$ to $V$:
$[\![\textstyle\sum_x\psi]\!](\pi(G),\pi(\nu)) = \textstyle\sum_{w\in V}[\![\psi]\!](\pi(G),\pi(\nu)[x\mapsto w]) = \textstyle\sum_{v\in V}[\![\psi]\!](\pi(G),\pi(\nu)[x\mapsto \pi(v)]) = \textstyle\sum_{v\in V}[\![\psi]\!](G,\nu[x\mapsto v]) = [\![\textstyle\sum_x\psi]\!](G,\nu),$
where we used the induction hypothesis for $\psi$ and that summing over $w \in V$ is the same as summing over $\pi(v)$ for $v \in V$, because $\pi$ is a permutation.
We remark that when $\varphi$ does not contain free index variables, then $[\![\varphi]\!](\pi(G)) = [\![\varphi]\!](G)$ for any valuation, from which invariance follows from the previous arguments. This concludes the proof of Proposition 3.1.
Appendix C Details of Section 4
In the following sections we prove Theorems 4.1, 4.2, 4.3 and 4.4. More specifically, we start by showing these results in the setting where TL only supports summation aggregation ($\sum$) and in which the vertex-labellings in graphs take values in $\{0,1\}^{\ell}$. In this context, we introduce classical logics in Section C.1 and recall and extend connections between the separation power of these logics and the separation power of color refinement and $k$-WL in Section C.2. We connect TL and these logics in Section C.3, to finally obtain the desired proofs in Section C.4. We then show how these results can be generalized in the presence of general aggregation operators in Section C.5, and to the setting where vertex-labellings take values in $\mathbb{R}^{\ell}$ in Section C.6.
C.1 Classical Logics
In what follows, we consider graphs with Boolean vertex labels, i.e., $\chi : V \to \{0,1\}^{\ell}$. We start by defining the $k$-variable fragment $\mathsf{C}^k$ of first-order logic with counting quantifiers, followed by the definition of the guarded fragment of $\mathsf{C}^2$. Formulae in $\mathsf{C}^k$ are defined over the set of variables $\{x_1,\ldots,x_k\}$ and are formed by the following grammar:
$\varphi ::= E(x,y) \mid P_i(x) \mid x = y \mid \neg\varphi \mid \varphi\wedge\varphi \mid \varphi\vee\varphi \mid \exists^{\geq m}x\,\varphi,$
where $x, y \in \{x_1,\ldots,x_k\}$, $E$ is a binary predicate, $P_i$ for $i \in [\ell]$ are unary predicates for some $\ell$, and $m \in \mathbb{N}$. The semantics of formulae in $\mathsf{C}^k$ is defined in terms of interpretations relative to a given graph $G$ and a (partial) valuation $\nu$. Such an interpretation maps formulae, graphs and valuations to Boolean values $\{0,1\}$, in a similar way as we did for tensor language expressions.
More precisely, given a graph $G$ and partial valuation $\nu$, we define $[\![\varphi]\!](G,\nu)$ for valuations defined on the free variables in $\varphi$. That is, we define:
$[\![E(x,y)]\!](G,\nu) := \mathbb{1}[\{\nu(x),\nu(y)\}\in E],\quad [\![P_i(x)]\!](G,\nu) := \chi(\nu(x))_i,\quad [\![x = y]\!](G,\nu) := \mathbb{1}[\nu(x) = \nu(y)],$
$[\![\neg\varphi]\!](G,\nu) := 1 - [\![\varphi]\!](G,\nu),\quad [\![\varphi_1\wedge\varphi_2]\!](G,\nu) := \min\{[\![\varphi_1]\!](G,\nu),[\![\varphi_2]\!](G,\nu)\},\quad [\![\varphi_1\vee\varphi_2]\!](G,\nu) := \max\{[\![\varphi_1]\!](G,\nu),[\![\varphi_2]\!](G,\nu)\},$
$[\![\exists^{\geq m}x\,\varphi]\!](G,\nu) := \mathbb{1}\big[\#\{v\in V \mid [\![\varphi]\!](G,\nu[x\mapsto v]) = 1\} \geq m\big].$
In the last expression, $\nu[x\mapsto v]$ denotes the valuation $\nu$ modified such that it maps $x$ to vertex $v$.
We will also need the guarded fragment of $\mathsf{C}^2$, in which we only allow equality conditions of the form $x = x$, component formulae of conjunctions and disjunctions should have the same single free variable, and counting quantifiers can only occur in guarded form: $\exists^{\geq m}y\,(E(x,y)\wedge\varphi(y))$ or $\exists^{\geq m}x\,(E(y,x)\wedge\varphi(x))$. The semantics of formulae in the guarded fragment is inherited from formulae in $\mathsf{C}^2$.
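As a small worked example (ours, not the paper's), the guarded formula on the left expresses that a vertex has at least two neighbors carrying label $P_1$, whereas the unguarded variant on the right, which counts arbitrary vertices, is allowed in $\mathsf{C}^2$ but not in the guarded fragment:
$\varphi_{\mathrm{g}}(x) := \exists^{\geq 2}y\,\big(E(x,y)\wedge P_1(y)\big), \qquad \varphi_{\mathrm{u}}(x) := \exists^{\geq 2}y\, P_1(y).$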
Finally, we will also consider $\mathsf{C}^k_{\infty}$, that is, the logic $\mathsf{C}^k$ extended with infinitary disjunctions and conjunctions. More precisely, we add to the grammar of formulae the following constructs:
$\textstyle\bigvee_{i\in I}\varphi_i \quad\text{and}\quad \textstyle\bigwedge_{i\in I}\varphi_i,$
where the index set $I$ can be arbitrary, even containing uncountably many indices. We define the infinitary guarded fragment in the same way, by allowing infinitary disjunctions and conjunctions of guarded formulae. The semantics is as expected: $[\![\bigvee_{i\in I}\varphi_i]\!](G,\nu) = 1$ if for at least one $i \in I$, $[\![\varphi_i]\!](G,\nu) = 1$, and $[\![\bigwedge_{i\in I}\varphi_i]\!](G,\nu) = 1$ if for all $i \in I$, $[\![\varphi_i]\!](G,\nu) = 1$.
We define the free variables of formulae just as for TL expressions, and similarly, quantifier rank is defined analogously to summation depth (only counting quantifiers increase the quantifier rank). For any of the above logics $L$ we define $L^{(t)}$ as the set of formulae in $L$ of quantifier rank at most $t$.
To capture the separation power of logics, we define $\rho(L)$ as the equivalence relation on $\mathcal{G}_{n,1}\times\mathcal{G}_{n,1}$ defined by
$\big((G,v),(H,w)\big)\in\rho(L) \iff [\![\varphi]\!](G,\nu) = [\![\varphi]\!](H,\mu) \text{ for all } \varphi \in L \text{ with one free variable},$
where $\nu$ is any valuation such that $\nu$ maps the free variable to $v$, and likewise $\mu$ maps it to $w$. The relation $\rho_{\mathrm{gr}}(L)$ is defined in a similar way, except that now the relation is only over pairs of graphs, and the characterization is over all formulae with no free variables (also called sentences). Finally, we also use, and define, the relation $\rho_k(L)$, which relates pairs from $\mathcal{G}_{n,k}$, consisting of a graph and a $k$-tuple of vertices. The relation is defined as
$\big((G,\bar v),(H,\bar w)\big)\in\rho_k(L) \iff [\![\varphi]\!](G,\nu) = [\![\varphi]\!](H,\mu) \text{ for all } \varphi \in L \text{ with free variables } x_1,\ldots,x_k,$
where $x_1,\ldots,x_k$ are the free variables and $\nu$ is a valuation assigning the $i$-th variable to the $i$-th entry of $\bar v$, for any $i \in [k]$ (and likewise $\mu$ for $\bar w$).
C.2 Characterization of Separation Power of Logics
We first connect the separation power of the color refinement and $k$-dimensional Weisfeiler-Leman algorithms to the separation power of the logics we just introduced. Although most of these connections are known, we present them in a somewhat more fine-grained way. That is, we connect the number of rounds used in the algorithms to the quantifier rank of formulae in the above logics.
Proposition C.1.
For any $t \geq 0$, we have the following identities:
(1) $\rho(\mathsf{cr}^{(t)}) = \rho(\mathsf{GC}^{(t)})$, $\rho(\mathsf{wl}_1^{(t)}) = \rho(\mathsf{C}^{2,(t)})$, and $\rho_{\mathrm{gr}}(\mathsf{cr}^{(t)}) = \rho_{\mathrm{gr}}(\mathsf{C}^{2,(t+1)})$;
(2) For $k \geq 1$, $\rho_k(\mathsf{wl}_k^{(t)}) = \rho_k(\mathsf{C}^{k+1,(t)})$ and $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)}) = \rho_{\mathrm{gr}}(\mathsf{C}^{k+1,(t+1)})$.
As a consequence, $\rho_{\mathrm{gr}}(\mathsf{wl}_k) = \rho_{\mathrm{gr}}(\mathsf{C}^{k+1})$.
Proof.
For (1), the first identity is known and can be found, for example, in Theorem V.10 in Grohe (2021). The second identity can be found in Proposition V.4 in Grohe (2021). The third identity is a consequence of the identity shown in (2), instantiated for $k = 1$.
For (2), we use that $\rho_k(\mathsf{wl}_k) = \rho_k(\mathsf{C}^{k+1})$, see e.g., Theorem V.8 in Grohe (2021). We argue that this identity also holds when the number of rounds and the quantifier rank are taken into account. Indeed, suppose that $(G,\bar v)$ and $(H,\bar w)$ are not in $\rho_k(\mathsf{wl}_k^{(t)})$, i.e., $\mathsf{wl}_k^{(t)}(G,\bar v) \neq \mathsf{wl}_k^{(t)}(H,\bar w)$. It is known that with each label $c$ computed after $t$ rounds one can associate a formula $\varphi_c$ in $\mathsf{C}^{k+1,(t)}$ such that a tuple has label $c$ if and only if $\varphi_c$ holds at that tuple. Taking $c := \mathsf{wl}_k^{(t)}(G,\bar v)$, we obtain $[\![\varphi_c]\!](G,\bar v) \neq [\![\varphi_c]\!](H,\bar w)$, and hence $(G,\bar v)$ and $(H,\bar w)$ are not in $\rho_k(\mathsf{C}^{k+1,(t)})$ either. In other words, the inclusion $\rho_k(\mathsf{C}^{k+1,(t)}) \subseteq \rho_k(\mathsf{wl}_k^{(t)})$ follows. Conversely, if $(G,\bar v)$ and $(H,\bar w)$ are not in $\rho_k(\mathsf{C}^{k+1,(t)})$, then there is a formula $\varphi$ in $\mathsf{C}^{k+1,(t)}$ such that $[\![\varphi]\!](G,\bar v) \neq [\![\varphi]\!](H,\bar w)$. Then it is readily shown, by induction on $t$, that $\mathsf{wl}_k^{(t)}(G,\bar v) \neq \mathsf{wl}_k^{(t)}(H,\bar w)$, and thus $(G,\bar v)$ and $(H,\bar w)$ are not in $\rho_k(\mathsf{wl}_k^{(t)})$. Hence, we also have the inclusion $\rho_k(\mathsf{wl}_k^{(t)}) \subseteq \rho_k(\mathsf{C}^{k+1,(t)})$, from which the first identity in (2) follows.
It remains to show the graph-level identity. For one direction, if $(G,H)$ is not in $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)})$ then the multisets of labels $\mathsf{wl}_k^{(t)}(G)$ and $\mathsf{wl}_k^{(t)}(H)$ differ. It is known that with each label $c$ one can associate a formula $\varphi_c$ in $\mathsf{C}^{k+1,(t)}$ such that a tuple has label $c$ if and only if $\varphi_c$ holds at that tuple. So, if the multisets are different, there must be a label $c$ that occurs more often in one multiset than in the other one. This can be detected by a formula of the form $\exists^{\geq m}\bar x\,\varphi_c(\bar x)$ which is satisfied if there are at least $m$ tuples with label $c$. It is now easily verified that the latter formula can be converted into a sentence in $\mathsf{C}^{k+1}$ of the required quantifier rank. Hence, the inclusion $\rho_{\mathrm{gr}}(\mathsf{C}^{k+1,(t+1)}) \subseteq \rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)})$ follows.
For the converse inclusion, we show that if $(G,H)$ is in $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)})$, then $[\![\varphi]\!](G,\nu) = [\![\varphi]\!](H,\nu)$ for all sentences $\varphi$ in $\mathsf{C}^{k+1,(t+1)}$ and any valuation $\nu$ (notice that $\nu$ is superfluous in this definition when formulas have no free variables). Assume that $(G,H)$ is in $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)})$. Since any formula of quantifier rank $t+1$ is a Boolean combination of formulas of lower rank or a formula of the form $\exists^{\geq m}x\,\psi$ where $\psi$ is of quantifier rank at most $t$, without loss of generality consider a formula of the latter form, and assume for the sake of contradiction that $[\![\exists^{\geq m}x\,\psi]\!](G) = 1$ but $[\![\exists^{\geq m}x\,\psi]\!](H) = 0$. Since $[\![\exists^{\geq m}x\,\psi]\!](G) = 1$, there must be at least $m$ elements satisfying $\psi$ in $G$. More precisely, let $v_1,\ldots,v_p$ in $G$ be all vertices in $G$ such that for each valuation $\nu$ mapping $x$ to $v_i$ it holds that $[\![\psi]\!](G,\nu) = 1$. As mentioned, it must be that $p$ is at least $m$. Using again the fact that formulas of quantifier rank at most $t$ are determined by the labels computed after $t$ rounds, we infer that the label assigned after $t$ rounds already determines that $\psi$ holds, for each such $v_i$.
Now since $(G,H)$ is in $\rho_{\mathrm{gr}}(\mathsf{wl}_k^{(t)})$, it is not difficult to see that there must be exactly $p$ vertices in $H$ satisfying $\psi$ as well. Otherwise, it would simply not be the case that the aggregation (multiset) of the labels assigned after $t$ rounds is the same in $G$ and $H$. By the connection to logic, we again know that for each such vertex $w$ of $H$ and valuation $\mu$ mapping $x$ to $w$ it holds that $[\![\psi]\!](H,\mu) = 1$. It then follows that $[\![\exists^{\geq m}x\,\psi]\!](G) = [\![\exists^{\geq m}x\,\psi]\!](H)$ for any valuation, a contradiction, which was to be shown.
Finally, we remark that follows from the preceding inclusions in (2). ∎
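To make the role of the number of rounds concrete, the following minimal Python sketch (our own illustration, not part of the original development; graphs are assumed to be given as adjacency lists with initial vertex labels) runs a fixed number of rounds of color refinement. Two vertices receive the same color after a given number of rounds exactly when they are related by the corresponding round-bounded equivalence used above.

```python
from collections import Counter

def color_refinement(adj, labels, rounds):
    """Run a fixed number of rounds of color refinement (1-WL).
    adj    : dict mapping each vertex to its list of neighbors
    labels : dict mapping each vertex to its initial label
    Returns the vertex coloring after `rounds` rounds."""
    colors = dict(labels)
    for _ in range(rounds):
        # New color = old color together with the multiset of neighbors' old colors.
        signatures = {
            v: (colors[v], tuple(sorted(Counter(colors[w] for w in adj[v]).items())))
            for v in adj
        }
        # Re-encode signatures as small integers; the concrete names are irrelevant.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# A path on three vertices: the endpoints are separated from the middle vertex
# after one round already, and remain so in later rounds.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(color_refinement(adj, {v: 0 for v in adj}, rounds=2))   # {0: 0, 1: 1, 2: 0}
```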
Before moving to tensor languages, where we will use infinitary logics to simulate expressions in and , we recall that, when considering the separation power of logics, we can freely move between the logics and their infinitary counterparts:
Theorem C.2.
The following identities hold for any , and :
-
(1)
;
-
(2)
.
Proof.
For identity (1), notice that we only need to prove that ; the other direction follows directly from the definition. We point out the well-known fact that two tuples and belong to if and only if the unravelling of rooted at up to depth is isomorphic to the unravelling of rooted at up to depth . Here the unravelling is the infinite tree whose root is the root node and whose children are the neighbors of the root node (see e.g. Barceló et al. (2020); Otto (2019)). Now for the connection with infinitary logic. Assume that the unravellings of rooted at and of rooted at up to level are isomorphic, but assume for the sake of contradiction that there is a formula in such that , where and are valuations mapping variable to and , respectively. Since and are finite graphs, one can construct, from formula , a formula in such that . Notice that this contradicts our assumption that the unravellings were isomorphic and therefore indistinguishable by formulae in . To construct , consider an infinitary disjunction . Since and have a finite number of vertices, and the formulae have a finite number of variables, the number of different valuations from the variables to the vertices in or is also finite. Thus, one can remove any extra copies of disjuncts whose value is the same in and . The final result is a finite disjunction whose truth value over and coincides with that of the original infinitary disjunction.
For identity (2) we refer to Corollary 2.4 in Otto (2017). ∎
C.3 From to and
We are now finally ready to make the connection between expressions in and the infinitary logics introduced earlier.
Proposition C.3.
For any expression in and , there exists an expression in such that if and only if for any graph in and . Furthermore, if then . Finally, if has summation depth then has quantifier rank .
Proof.
We define inductively on the structure of expressions in .
-
•
. Assume first that is “”. We distinguish between (a) and (b) . For case (a), if , then we define , if , then we define , and if , then we define . For case (b), if , then we define , and for any , we define . The case when is “” is treated analogously.
-
•
. If , then we define , if , then we define . For all other , we define .
-
•
. If , then we define , if , then we define . For all other , we define .
-
•
. We observe that if and only if there are such that and and . Hence, it suffices to define
where and are the expressions such that if and only if and if and only if , which exist by induction.
-
•
. This case is analogous to the previous one. Indeed, if and only if there are such that and and . Hence, it suffices to define
-
•
. This case is again dealt with in a similar way. Indeed, if and only if there is a such that and . Hence, it suffices to define
-
•
with . We observe that if and only if there are such that and for . Hence, it suffices to define
-
•
. We observe that implies that we can partition into parts , of sizes , respectively, such that for each , and such that all ’s are pairwise distinct and . It now suffices to consider the following formula
where is shorthand notation for , and denotes .
This concludes the construction of . We observe that we only introduce quantifiers when and hence if we assume by induction that summation depth and quantifier rank are in sync, then if has summation depth and thus has quantifier rank for any , then has summation depth , and as can be seen from the definition of , this formula has quantifier rank , as desired.
It remains to verify the claim about guarded expressions. This is again verified by induction. The only case requiring some attention is for which we can define
which is a formula in again only adding one to the quantifier rank of the formulae for . So also here, we have the one-to-one correspondence between summation depth and quantifier rank. ∎
C.4 Proof of Theorem 4.1, 4.2, 4.3 and 4.4
Proposition C.4.
We have the following inclusions: For any and any collection of functions:
-
•
;
-
•
; and
-
•
.
Proof.
We first show the second bullet by contraposition. That is, we show that if and are not in , then neither are they in . Indeed, suppose that there exists an expression in such that . From Proposition C.3 we know that there exists a formula in such that and . Hence, and do not belong to . Theorem C.2 implies that and also do not belong to . Finally, Proposition C.1 implies that and do not belong to , as desired. The third bullet is shown in precisely the same way, but using the identities for rather than , and rather than .
We next show that our tensor languages are also more separating than the color refinement and -dimensional Weisfeiler-Leman algorithms.
Proposition C.5.
We have the following inclusions: For any and any collection of functions:
-
•
;
-
•
; and
-
•
.
Proof.
For any of these inclusions to hold, for any , we need to show the inclusion without the use of any functions. We again use the connections between the color refinement and -dimensional Weisfeiler-Leman algorithms and finite variable logics as stated in Proposition C.1. More precisely, we show that for any formula there exists an expression such that for any graph in , implies and implies . By appropriately selecting and and by observing that when then , the inclusions follow.
The construction of is by induction on the structure of formulae in .
-
•
. Then, we define .
-
•
. Then, we define .
-
•
. Then, we define .
-
•
. Then, we define .
-
•
. Then, we define .
-
•
. Consider a polynomial such that for and for . Such a polynomial exists by interpolation. Then, we define .
We remark that we here crucially rely on the assumption that contains graphs of fixed size and that is closed under linear combinations and product. Clearly, if , then the above translation results in an expression . Furthermore, the quantifier rank of is in one-to-one correspondence with the summation depth of .
We can now apply Proposition C.1. That is, if and are not in then by Proposition C.1, there exists a formula in such that . We have just shown that when we consider in , also holds. Hence, and are not in either, for any . Hence, holds. The other bullets are shown in the same way, again by relying on Proposition C.1 and using that we can move from and to logical formulae, and to expressions in and , respectively, to separate from or from , respectively. ∎
C.5 Other aggregation functions
As is mentioned in the main paper, our upper bound results on the separation power of tensor languages (and hence also of represented in those languages) generalize easily when aggregation functions other than summation are used in expressions.
To clarify what we understand by an aggregation function, let us first recall the semantics of summation aggregation. Let , where represents summation aggregation, let be a graph, and let be a valuation assigning index variables to vertices in . The semantics is then given by:
as explained in Section 3. Semantically, we can alternatively view as a function which takes the sum of the elements in the following multiset of real values:
One can now consider, more generally, an aggregation function as a function which assigns to any multiset of values in a single real value. For example, could be , , , . Let be such a collection of aggregation functions. We next incorporate general aggregation functions in the tensor language.
First, we extend the syntax of expressions in by generalizing the construct in the grammar of expressions. More precisely, we define as the class of expressions, formed just like tensor language expressions, but in which two additional constructs, unconditional and conditional aggregation, are allowed. For an aggregation function we define:
where in the latter construct (conditional aggregation) the expression represents an expression whose only free variable is . The intuition behind these constructs is that unconditional aggregation allows for aggregating, using aggregate function , over the values of where ranges unconditionally over all vertices in the graph. In contrast, for conditional aggregation , aggregation by of the values of is conditioned on the neighbors of the vertex assigned to . That is, the vertices for range only among the neighbors of the vertex assigned to .
More specifically, the semantics of the aggregation constructs is defined as follows:
We remark that we can also consider aggregation functions over multisets of values in for some . This requires extending the syntax with for unconditional aggregation and with for conditional aggregation. The semantics is as expected: and .
The need for considering conditional and unconditional aggregation separately is due to the use of arbitrary aggregation functions. Indeed, suppose that one uses an aggregation function for which is a neutral value. That is, for any multiset of real values, the equality holds. For example, the summation aggregation function satisfies this property. We then observe:
In other words, unconditional aggregation can simulate conditional aggregation. In contrast, when is not a neutral value of the aggregation function , conditional and unconditional aggregation behave differently. Indeed, in such cases and may evaluate to different values, as illustrated in the following example.
As aggregation function we take the average for multisets of real values. We remark that ’s in contribute to the size of and hence is not a neutral element of . Now, let us consider the expressions
Let be such that . Then, results in applying the average to the multiset which includes the value for every and a for every non-neighbor . In other words, results in . In contrast, results in applying the average to the multiset . In other words, this multiset only contains the value for each , ignoring any information about the non-neighbors of . In other words, results in . Hence, conditional and unconditional aggregation behave differently for the average aggregation function.
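This difference can also be checked numerically. The following small sketch (our own illustration; the graph and the feature values are arbitrary) compares averaging over the neighbors only with averaging the adjacency-weighted products over all vertices, where each non-neighbor contributes a zero to the multiset.

```python
import numpy as np

# Adjacency matrix of a 4-vertex graph; vertex 0 has neighbors 1 and 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
phi = np.array([4.0, 2.0, 6.0, 10.0])    # some value phi(w) per vertex w

v = 0
neighbors = np.nonzero(A[v])[0]

# Conditional aggregation: average over the multiset {phi(w) : w a neighbor of v}.
cond = np.mean(phi[neighbors])            # (2 + 6) / 2 = 4.0

# Unconditional aggregation of A[v, w] * phi(w): every non-neighbor contributes
# a 0 to the multiset, so the result is rescaled by deg(v) / n.
uncond = np.mean(A[v] * phi)              # (2 + 6) / 4 = 2.0

print(cond, uncond)   # 4.0 2.0 -- the two constructs disagree for the average
```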
This said, one could alternatively use a more general variant of conditional aggregation of the form with as semantics where one creates a multiset only for those valuations for which the condition evaluates to a non-zero value. This general form of aggregation includes conditional aggregation, by replacing with and restricting , and unconditional aggregation, by replacing with the constant function , e.g., . In order not to overload the syntax of expressions, we will not discuss this general form of aggregation further.
The notion of free index variables for expressions in is defined as before, where now , and where (recall that in conditional aggregation). Moreover, summation depth is replaced by the notion of aggregation depth, , defined in the same way as summation depth except that and . Similarly, the fragments and its aggregation depth restricted fragment are defined as before, using aggregation depth rather than summation depth.
For the guarded fragment, , expressions are now restricted such that aggregations must occur only in the form , for . In other words, aggregation only happens on multisets of values obtained from neighboring vertices.
We now argue that our upper bound results on the separation power remain valid for the extension of with arbitrary aggregation functions .
Proposition C.6.
We have the following inclusions: For any , any collection of functions and any collection of aggregation functions:
-
•
;
-
•
; and
-
•
.
Proof.
It suffices to show that Proposition C.3 also holds for expressions in the fragments of considered. In particular, we only need to revise the case of summation aggregation (that is, ) in the proof of Proposition C.3. Indeed, let us consider the more general case when one of the two aggregation constructs is used.
-
•
. We then define
where now consists of all such that
-
•
. We then define
where again consists of all such that
It is readily verified that iff , and iff , as desired.
For the guarded case, we note that the expression above yields a guarded expression as long as conditional aggregation of the form with is used, so we can reuse the argument in the proof of Proposition C.3 for the guarded case.∎
We will illustrate later on (Section D) that this generalization allows for assessing the separation power of that use a variety of aggregation functions.
The choice of supported aggregation functions has, of course, an impact on the ability of to match color refinement or the procedures in separation power. The same holds for , as shown by Xu et al. (2019). And indeed, the proof of Proposition C.5 relies on the presence of summation aggregation. We note that most lower bounds on the separation power of in terms of color refinement or the procedures assume summation aggregation since summation suffices to construct injective sum-decomposable functions on multisets (Xu et al., 2019; Zaheer et al., 2017), which are used to simulate color refinement and . A more in-depth analysis of lower bounding with less expressive aggregation functions, possibly using weaker versions of color refinement and is left as future work.
C.6 Generalization to Graphs with real-valued vertex labels
We next consider the more general setting in which for some . That is, vertices in a graph can carry real-valued vectors. We remark that no changes to either the syntax or the semantics of expressions are needed, yet note that is now an element in rather than or , for each .
A first observation is that the color refinement and procedures treat each real value as a separate label. That is, two values that differ only by an arbitrarily small are considered different. The proofs of Theorem 4.1, 4.2, 4.3 and 4.4 rely on connections between color refinement and and the finite variable logics and , respectively. In the discrete context, the unary predicates used in the logical formulas indicate which label vertices have. That is, iff . To accommodate real values in the context of separation power, these logics now need to be able to differentiate between different labels, that is, different real numbers. We therefore extend the unary predicates allowed in formulas. More precisely, for each dimension , we now have uncountably many predicates of the form , one for each . In any formula in or only a finite number of such predicates may occur. The Boolean semantics of these new predicates is as expected:
In other words, in our logics, we can now detect which real-valued labels vertices have. Although, in general, the introduction of infinitely many predicates may cause problems, we here consider a specific setting in which the vertices in a graph have a unique label. This is commonly assumed in graph learning. Given this, it is easily verified that all results in Section C.2 carry over, where all logics involved now use the unary predicates with and .
The connection between and logics also carries over. First, for Proposition C.3 we now need to connect expressions, that use a finite number of predicates , for , with the extended logics having uncountably many predicates , for and , at their disposal. It suffices to reconsider the case in the proof of Proposition C.3. More precisely, can now be an arbitrary value . We now simply define . By definition if and only if , as desired.
The proof for the extended version of proposition C.5 now needs a slightly different strategy, where we build the relevant formula after we construct the contrapositive of the Proposition. Let us first show how to construct a formula that is equivalent to a logical formula on any graph using only labels in a specific (finite) set of real numbers.
In other words, given a set of real values, we show that for any formula using unary predicates such that , we can construct the desired . As mentioned, we only need to reconsider the case . We define
Then, evaluates to
Indeed, if , then and hence , resulting in the same numerator and denominator in the above fraction. If , then for some value with . In this case, the numerator in the above fraction becomes zero. We remark that this revised construction still results in a guarded expression, when the input logical formula is guarded as well.
Coming back to the proof of the extended version of Proposition C.5, let us show the proof for one of the items, the other two being analogous. Assume that there is a pair and which is not in . Then, by Proposition C.1, applied on graphs with real-valued labels, there exists a formula in such that . We remark that uses finitely many predicates. Let be the set of real values used in both and (and ). We note that is finite. We invoke the construction sketched above, and obtain a formula in such that . Hence, and is not in either, for any , which was to be shown.
Appendix D Details of Section 5
We here provide some additional details on the encoding of layers of in our tensor languages, and how, as a consequence of our results from Section 4, one obtains a bound on their separation power. This section showcases that it is relatively straightforward to represent in our tensor languages. Indeed, often, a direct translation of the layers, as defined in the literature, suffices.
D.1 Color Refinement
We start with architectures related to color refinement, or in other words, architectures which can be represented in our guarded tensor language.
GraphSage.
We first consider a “basic” , that is, an instance of GraphSage (Hamilton et al., 2017) in which sum aggregation is used. The initial features are given by where is a one-hot encoding of the th vertex label in . We can represent the initial embedding easily in , without the use of any summation. Indeed, it suffices to define for . We have for , and thus the initial features can be represented by simple expressions in .
Assume now, by induction, that we can also represent the features computed by a basic in layer . That is, let be those features and for each let be expressions in representing them. We assume that, for each , . We remark that we assume that a summation depth of is needed for layer .
Then, in layer , a basic computes the next features as
where is the adjacency matrix of , and are weight matrices in , is a (constant) bias matrix consisting of copies of , and is some activation function. We can simply use the following expressions , for :
Here, , and are real values corresponding to the weight matrices and bias vector in layer . These are expressions in since the additional summation is guarded, and combined with the summation depth of of , this results in a summation depth of for layer . Furthermore, , as desired. If we denote by the class of -layered basic , then our results imply
and thus the separation power of basic is bounded by the separation power of color refinement. We thus recover known results by Xu et al. (2019) and Morris et al. (2019).
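For concreteness, a minimal dense numpy sketch of one such layer with sum aggregation is given below. The weight and bias names are ours, and the sketch is only meant to mirror the layer definition above, not the actual GraphSage implementation.

```python
import numpy as np

def basic_gnn_layer(A, X, W1, W2, b, sigma=np.tanh):
    """One 'basic' sum-aggregation layer: every vertex combines its own
    features (X @ W1) with the sum of its neighbors' features (A @ X @ W2),
    adds a bias and applies the activation sigma."""
    return sigma(X @ W1 + A @ X @ W2 + b)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)        # path graph on three vertices
X = np.eye(3)                                 # one-hot initial vertex labels
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(basic_gnn_layer(A, X, W1, W2, np.zeros(4)).shape)   # (3, 4)
```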
Furthermore, if one uses a readout layer in basic to obtain a graph embedding, one typically applies a function in the form of , in which aggregation takes place over all vertices of the graph. This corresponds to an expression in : , where is the projection of the readout function on the coordinate. We note that this is indeed not a guarded expression anymore, and thus our results imply that
More generally, GraphSage allows for the use of general aggregation functions on the multiset of features of neighboring vertices. To cast the corresponding layers in , we need to consider the extension with an appropriate set of aggregation functions, as described in Section C.5. In this way, we can represent layer by means of the following expressions , for .
which is now an expression in and hence the bound in terms of iterations of color refinement carries over by Proposition C.6. Here, simply consists of the aggregation functions used in the layers in GraphSage.
GCNs.
Graph Convolution Networks () (Kipf & Welling, 2017) operate like basic except that a normalized Laplacian is used to aggregate features, instead of the adjacency matrix . Here, is the diagonal matrix consisting of the reciprocals of the square roots of the vertex degrees in plus . The initial embedding is just as before. We use again to denote the number of features in layer . In layer , a computes . If, in addition to the activation function we add the function to , we can represent the layer, as follows. For , we define the expressions
where we omitted the bias vector for simplicity. We again observe that only guarded summations are needed. However, we remark that in every layer we now add two to the overall summation depth, since we need an extra summation to compute the degrees. In other words, a -layered corresponds to expressions in . If we denote by the class of -layered , then our results imply
We remark that another representation can be provided, in which the degree computation is factored out (Geerts et al., 2021a), resulting in a better upper bound . In a similar way as for basic , we also have .
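A corresponding sketch of a layer (assuming the usual normalization with self-loops; the helper names are ours) shows where the extra summation for the degree computation comes from.

```python
import numpy as np

def gcn_layer(A, X, W, sigma=np.tanh):
    """One GCN-style layer: X_new = sigma(N X W) with
    N = D^{-1/2} (A + I) D^{-1/2} and D the degree matrix of A + I.
    Computing the degrees is itself a (guarded) summation over neighbors,
    which is why the naive encoding adds two to the summation depth per layer."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))          # degree computation
    N = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]  # normalized propagation matrix
    return sigma(N @ X @ W)

A = np.array([[0, 1], [1, 0]], dtype=float)
print(gcn_layer(A, np.eye(2), np.ones((2, 3))).shape)      # (2, 3)
```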
SGCs.
As another example, we consider a variation of Simple Graph Convolutions () (Wu et al., 2019), which use powers of the adjacency matrix and only apply a non-linear activation function at the end. That is, for some and . We remark that actually use powers of the normalized Laplacian, that is, but this only incurs an additional summation depth as for . We focus here on our simpler version. It should be clear that we can represent the architecture in by means of the expressions:
for . A naive application of our results would imply an upper bound on their separation power by . We can, however, use Proposition 4.5. Indeed, it is readily verified that these expressions have a treewidth of one, because the variables form a path. And indeed, when for example, , we can equivalently write as
by reordering the summations and reusing index variables. This holds for arbitrary . We thus obtain guarded expressions in and our results imply that -layered are bounded by for vertex embeddings, and by for .
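The reuse of index variables corresponds to the elementary observation that a power of the adjacency matrix applied to the features can be computed by repeated neighbor aggregation instead of first materializing the matrix power; the following numpy check (our own illustration) confirms this for a power of two.

```python
import numpy as np

rng = np.random.default_rng(1)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # a random symmetric adjacency matrix
X = rng.normal(size=(5, 3))                    # initial vertex features

lhs = (A @ A) @ X      # two "fresh" summation indices, as in the naive encoding
rhs = A @ (A @ X)      # nested guarded sums, reusing one extra index variable
print(np.allclose(lhs, rhs))   # True: the summations can be reordered
```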
Principal Neighbourhood Aggregation.
Our next example is a in which different aggregation functions are used: Principal Neighborhood Aggregation is an architecture proposed by Corso et al. (2020) in which aggregation over neighboring vertices is done by means of , , and , and this in parallel. In addition, after aggregation, three different scalers are applied. Scalers are diagonal matrices whose diagonal entries are a function of the vertex degrees. Given the features for each vertex computed in layer , that is, , a computes ’s new features in layer in the following way (see layer definition (8) in (Corso et al., 2020)). First, vectors are computed such that
where is the projection of an on the th coordinate. Then, three different scalers are applied. The first scaler is simply the identity, the other two scalers and depend on the vertex degrees. As such, vectors are constructed as follows:
where and are functions from (see (Corso et al., 2020) for details). Finally, the new vertex embedding is obtained as
for some . The above layer definition translates naturally into expressions in , the extension of with aggregate functions (Section C.5). Indeed, suppose that for each we have expressions such that for any vertex . Then, simply corresponds to the guarded expressions
for , and similarly for the other components of using the respective aggregation functions, , and . Then, corresponds to
where we use summation aggregation to compute the degree information used in the functions in the scalers and . And finally,
represents . We see that all expressions only use two index variables and aggregation is applied in a guarded way. Furthermore, in each layer, the aggregation depth increases by one. As such, a -layered can be represented in , where consists of the and functions used in scalers, and consists of (for computing vertex degrees), and , , and . Proposition C.6 then implies a bound on the separation power by .
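The following sketch (our own simplification; the exact scaler constants and the final combination step differ in Corso et al. (2020)) illustrates the parallel aggregators and the degree-based scalers.

```python
import numpy as np

def pna_aggregate(A, X, delta=1.0):
    """Aggregate neighbor features with mean/max/min/std in parallel and apply
    an identity, an amplification and an attenuation scaler based on the
    vertex degree (the log-based scalers follow the spirit of Corso et al.
    (2020); the exact constants differ)."""
    out = []
    for v in range(A.shape[0]):
        nbrs = X[np.nonzero(A[v])[0]]
        if len(nbrs) == 0:
            nbrs = np.zeros((1, X.shape[1]))
        aggs = np.concatenate([nbrs.mean(0), nbrs.max(0), nbrs.min(0), nbrs.std(0)])
        deg = A[v].sum()
        s_amp = np.log(deg + 1.0) / delta          # amplification scaler
        s_att = delta / np.log(deg + 2.0)          # attenuation scaler
        out.append(np.concatenate([aggs, s_amp * aggs, s_att * aggs]))
    return np.stack(out)

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
print(pna_aggregate(A, np.eye(3)).shape)   # (3, 36): 4 aggregators x 3 scalers x 3 features
```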
Other examples.
In the same way, one can also easily analyze (Velickovic et al., 2018) and show that these can be represented in as well, and thus bounds by color refinement can be obtained.
D.2 -dimensional Weisfeiler-Leman tests
We next discuss architectures related to the -dimensional Weisfeiler-Leman algorithms. For , we discussed the extended in the main paper. We here focus on arbitrary .
Folklore GNNs.
We first consider the “Folklore” or for short (Maron et al., 2019b). For , computes tensors. In particular, the initial tensor encodes for each . We can represent this tensor by the following expressions in :
for and . We note: for all , as desired. We let and set .
Then, in layer , a computes a tensor
where , for , and and are . We here use to denote combinations of indices in for and in for .
Let be the tensor computed by an in layer . Assume that for each tuple of elements in we have an expression satisfying and such that it is an expression in . That is, we need index variables and a summation depth of to represent layer .
Then, for layer , for each , it suffices to consider the expression
where and are the projections of the on the -coordinates. We remark that we need index variables, and one extra summation is needed. We thus obtain expressions in for the th layer, as desired. We remark that the expressions are simple translations of the defining layer definitions. Also, in this case, consists of all . When a is used for vertex embeddings, we now simply add to each expression a factor . As an immediate consequence of our results, if we denote by the class of -layered , then for vertex embeddings:
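For , a minimal dense sketch of such a layer in the style of Maron et al. (2019b) is given below; the weight names and the use of an elementwise product after the matrix-product-style aggregation over the extra index are our own illustrative choices.

```python
import numpy as np

def fgnn2_layer(H, W0, W1, W2, sigma=np.tanh):
    """One folklore 2-GNN layer on a pair tensor H of shape (n, n, d):
    the new embedding of the pair (i, j) combines its old embedding with
    sum_k f(H[i, k]) * g(H[k, j]), i.e. an aggregation over one extra
    index variable k, as in the tensor expression above."""
    mixed = np.einsum('ikp,kjp->ijp', H @ W1, H @ W2)   # sum over k, per feature p
    return sigma(H @ W0 + mixed)

rng = np.random.default_rng(0)
n, d = 4, 5
H = rng.normal(size=(n, n, d))
W0, W1, W2 = (rng.normal(size=(d, d)) for _ in range(3))
print(fgnn2_layer(H, W0, W1, W2).shape)   # (4, 4, 5)
```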
in accordance with the known results from Azizian & Lelarge (2021). When used for graph embeddings, an aggregation layer over all -tuples of vertices is added, followed by the application of an . This results in expressions with no free index variables, and of summation depth , where the increase with stems from the aggregation process over all -tuples. In view of our results, for graph embeddings:
in accordance again with Azizian & Lelarge (2021). We here emphasize that the upper bounds in terms of are obtained without the need to know how works. Indeed, one can really just focus on casting layers in the right tensor language!
We remark that Azizian & Lelarge (2021) define vertex embedding in a different way. Indeed, for a vertex , its embedding is obtained by aggregating over all tuples in the remaining coordinates of the tensors. They define accordingly. From the tensor language point of view, this corresponds to the addition of to the summation depth. Our results indicate that we lose the connection between rounds and layers, as in Azizian & Lelarge (2021). This is the reason why we defined vertex embedding in a different way and can ensure a correspondence between rounds and layers for vertex embeddings.
Other higher-order examples.
It is readily verified that -layered - (Morris et al., 2019) can be represented in , recovering the known upper bound by (Morris et al., 2019). It is an equally easy exercise to show that -convolutions (Damke et al., 2020) and Ring- (Chen et al., 2019) are bounded by , by simply writing their layers in . The invariant graph networks () (Maron et al., 2019b) will be treated in Section E, as their representation in requires some work.
D.3 Augmented GNNs
Higher-order architectures such as -, and , incur a substantial cost in terms of memory and computation (Morris et al., 2020). Some recent proposals infuse more efficient with higher-order information by means of some pre-processing step. We next show that the tensor language approach also enables us to obtain upper bounds on the separation power of such “augmented” .
We first consider (Barceló et al., 2021) in which the initial vertex features are augmented with homomorphism counts of rooted graph patterns. More precisely, let be a connected rooted graph (with root vertex ), and consider a graph and vertex . Then, denotes the number of homomorphisms from to , mapping to . We recall that a homomorphism is an edge-preserving mapping between vertex sets. Given a collection of rooted patterns, an runs an on the augmented initial vertex features:
Now, take any architecture that can be cast in or and assume, for simplicity of exposition, that a -layer corresponds to expressions in or . In order to analyze the impact of the augmented features, one only needs to revise the expressions that represent the initial features. In the absence of graph patterns, , as we have seen before. By contrast, to represent we need to cast the computation of in . Assume that the graph pattern consists of vertices and let us identify the vertex set with . Furthermore, without loss of generality, we assume that vertex “” is the root vertex in . To obtain we need to create an indicator function for the graph pattern and then count how many times this indicator value is equal to one in . The indicator function for is simply given by the expression . Then, counting just boils down to summation over all index variables except the one for the root vertex. More precisely, if we define
then . This encoding results in an expression in . However, it is well-known that we can equivalently write as an expression in where is the treewidth of the graph . As such, our results imply that are bounded in separation power by where is the maximal treewidth of graphs in . We thus recover the known upper bound as given in Barceló et al. (2021) using our tensor language approach.
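The brute-force encoding of the rooted homomorphism count sketched above corresponds to the following Python snippet (our own illustration, exponential in the number of pattern vertices): a sum over all assignments of the non-root pattern vertices of a product of edge indicators.

```python
import itertools

def rooted_hom_count(pattern_edges, k, adj, v):
    """Count homomorphisms from a rooted pattern on vertices {0,...,k-1}
    (root = 0) to a graph with symmetric adjacency `adj`, mapping the root
    to v: sum over all assignments of the non-root pattern vertices of a
    product of edge indicators, mirroring the tensor expression above."""
    n = len(adj)
    count = 0
    for assignment in itertools.product(range(n), repeat=k - 1):
        h = (v,) + assignment                      # pattern vertex i is mapped to h[i]
        if all(h[b] in adj[h[a]] for a, b in pattern_edges):
            count += 1
    return count

# Triangle pattern rooted at vertex 0, with edges {0,1}, {1,2}, {0,2}.
triangle = [(0, 1), (1, 2), (0, 2)]
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}   # a triangle with a pendant vertex
print(rooted_hom_count(triangle, 3, graph, 0))          # 2: one per orientation of the triangle
```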
Another example of augmented architectures are the Graph Substructure Networks () (Bouritsas et al., 2020). By contrast to , subgraph isomorphism counts rather than homomorphism counts are used to augment the initial features. At the core of a thus lies the computation of , the number of subgraphs in isomorphic to (and such that the isomorphisms map to ). In a similar way as for homomorphism counts, we can directly cast the computation of in , resulting again in the use of index variables. A possible reduction in terms of index variables, however, can be obtained by relying on the result (Theorem 1.1) by Curticapean et al. (2017) in which it is shown that can be computed in terms of homomorphism counts of graph patterns derived from . More precisely, Curticapean et al. (2017) define as the set of graphs consisting of all possible homomorphic images of . It is then readily verified that if the maximal treewidth of the graphs in is , then can be cast as an expression in . Hence, using a pattern collection can be represented in , where is the maximal treewidth of graphs in any of the spasms of patterns in , and thus are bounded in separation power in accordance with the results by Barceló et al. (2021).
As a final example, we consider the recently introduced Message Passing Simplicial Networks (s) (Bodnar et al., 2021). In a nutshell, s are run on simplicial complexes of graphs instead of on the original graphs. We sketch how our tensor language approach can be used to assess the separation power of s on clique complexes. We use the simplified version of s which have the same expressive power as the full version of s (Theorem 6 in Bodnar et al. (2021)).
We recall some definitions. Let denote the set of all cliques in . Given two cliques and in , define if and there exists no in , such that . We define and .
For each in we have an initial feature vector . Bodnar et al. (2021) initialize all initial features with the same value. Then, in layer , for each , features are updated as follows:
where and are aggregation functions and , and are . With some effort, one can represent these computations by expressions in where is the largest clique in . As such, the separation power of clique-complex s on graphs of clique size at most is bounded by . And indeed, Bodnar et al. (2021) consider Rook’s graph, which contains a -clique, and the Shrikhande graph, which does not contain a -clique. As such, the analysis above implies that clique-complex s are bounded by on the Shrikhande graph, and by on Rook’s graph, consistent with the observation in Bodnar et al. (2021). A more detailed analysis of s in terms of summation depth and for other simplicial complexes is left as future work.
This illustrates again that our approach can be used to assess the separation power of a variety of architectures in terms of , by simply writing them as tensor language expressions. Furthermore, bounds in terms of can be used for augmented which form a more efficient way of incorporating higher-order graph structural information than higher-order .
D.4 Spectral GNNs
In general, spectral are defined in terms of eigenvectors and eigenvalues of the (normalized) graph Laplacian (Bruna et al., 2014; Defferrard et al., 2016; Levie et al., 2019; Balcilar et al., 2021b). The diagonalization of the graph Laplacian is, however, avoided in practice, due to its excessive cost. Instead, by relying on approximation results in spectral graph analysis (Hammond et al., 2011), the layers of practical spectral are defined in terms of propagation matrices consisting of functions, which operate directly on the graph Laplacian. This viewpoint allows for a spectral analysis of spectral and “spatial” in a uniform way, as shown by Balcilar et al. (2021b). In this section, we consider two specific instances of spectral : (Defferrard et al., 2016) and (Levie et al., 2019), and assess their separation power in terms of tensor logic. Our general results then provide bounds on their separation power in terms of color refinement and , respectively.
Chebnet.
The separation power of (Defferrard et al., 2016) was already analyzed in Balcilar et al. (2021a) by representing them in the matrix query language (Brijder et al., 2019). It was shown (Theorem 2 (Balcilar et al., 2021a)) that it is only the maximal eigenvalue of the graph Laplacian used in the layers of that may result in the separation power of to go beyond . We here revisit and refine this result by showing that, when ignoring the use of , the separation power of is bounded already by color refinement (which, as mentioned in Section 2, is weaker than for vertex embeddings). In a nutshell, the layers of a are defined in terms of Chebyshev polynomials of the normalized Laplacian and these polynomials can be easily represented in . One can alternatively use the graph Laplacian in a , which allows for a similar analysis. The distinction between the choice of and only shows in the needed summation depth (in a similar way as for the described earlier). We only consider the normalized Laplacian here.
More precisely, following Balcilar et al. (2021a; b), in layer , vertex embeddings are updated in a according to:
with
and where denotes the maximum eigenvalue of . We next use a similar analysis as in Balcilar et al. (2021a). That is, we ignore for the moment the maximal eigenvalue and redefine as for some constant . We thus see that each is a polynomial of the form with scalar functions and where we interpret . To upper bound the separation power using our tensor language approach, we can thus shift our attention entirely to representing for powers . Furthermore, since is again a polynomial of the form , we can further narrow down the problem to represent
in , for powers . And indeed, combining our analysis for and results in expressions in . As an example let us consider , that is we use a power of two. It then suffices to define, for each output dimension , the expressions:
where the are expressions representing layer . It is then readily verified that we can use to cast layer of a in with consisting of , , and the used activation function . We thus recover (and slightly refine) Theorem 2 in Balcilar et al. (2021a):
Corollary D.1.
On graphs sharing the same values, the separation power of is bounded by color refinement, both for graph and vertex embeddings.
A more fine-grained analysis of the expressions is needed when interested in bounding the summation depth and thus the number of rounds needed for color refinement. Moreover, as shown by Balcilar et al. (2021a), when graphs have non-regular components with different values, can distinguish them, whilst cannot. To our knowledge, cannot be computed in for any . This implies that it is not clear whether an upper bound on the separation power can be obtained for taking into account. It is an interesting open question whether there are two graphs and which cannot be distinguished by but can be distinguished based on . A positive answer would imply that the computation of is beyond reach for and other techniques are needed.
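To make the polynomial structure used in the above analysis concrete, the following sketch builds Chebyshev-style propagation matrices with the maximal eigenvalue replaced by a constant, exactly as in the argument above; the helper names are ours.

```python
import numpy as np

def cheb_propagation_matrices(A, K, c=2.0):
    """Chebyshev-style propagation matrices T_0, ..., T_{K-1} built from a
    rescaled normalized Laplacian.  Following the analysis above, lambda_max
    is replaced by the constant c, so every matrix is a polynomial in the
    normalized adjacency with degree-based coefficients."""
    n = A.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # normalized Laplacian
    L_tilde = (2.0 / c) * L - np.eye(n)                             # rescaled Laplacian
    T = [np.eye(n), L_tilde]
    for _ in range(2, K):
        T.append(2.0 * L_tilde @ T[-1] - T[-2])                     # Chebyshev recursion
    return T[:K]

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(len(cheb_propagation_matrices(A, K=3)))   # 3 propagation matrices
```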
CayleyNet.
We next show how the separation power of (Levie et al., 2019) can be analyzed. To our knowledge, this analysis is new. We show that the separation power of is bounded by . Following Levie et al. (2019) and Balcilar et al. (2021b), in each layer , a updates features as follows:
with
where is a constant, is the imaginary unit, and maps a complex number to its real part. We immediately observe that a requires the use of complex numbers and matrix inversion. So far, we considered real numbers only, but when our separation results are concerned, the choice between real or complex numbers is insignificant. In fact, only the proof of Proposition C.3 requires a minor modification when working on complex numbers: the infinite disjunctions used in the proof now need to range over complex numbers. For matrix inversion, when dealing with separation power, one can use different expressions in for computing the matrix inverse, depending on the input size. And indeed, it is well-known (see e.g., Csanky (1976)) that based on the characteristic polynomial of , for any matrix can be computed as a polynomial if and where each coefficient is a polynomial in , for various . Here, is the trace of a matrix. As a consequence, layers in can be viewed as polynomials in with coefficients polynomials in . One now needs three index variables to represent the trace computations . Indeed, let be the expression representing . Then, for example, can be computed in using
and hence is represented by . In other words, we obtain expressions in . The polynomials in can be represented in just as for . This implies that each layer in can be represented, on graphs of fixed size, by expressions, where includes the activation function and the function . This suffices to use our general results and conclude that s are bounded in separation power by . An interesting question is to find graphs that can be separated by a but not by . We leave this as an open problem.
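The fact that the inverse of a fixed-size matrix is a polynomial in the matrix, with coefficients that are themselves polynomials in traces, can be made explicit via the Faddeev-LeVerrier recurrence (one way to instantiate the Csanky (1976) observation used above). The sketch below is our own illustration and assumes an invertible input.

```python
import numpy as np

def inverse_via_traces(A):
    """Faddeev-LeVerrier: compute A^{-1} as a polynomial in A whose
    coefficients are built from traces tr(A M_k).  Assumes det(A) != 0."""
    n = A.shape[0]
    M = np.zeros_like(A)
    c = 1.0
    for k in range(1, n + 1):
        M = A @ M + c * np.eye(n)          # next matrix in the recurrence
        c = -np.trace(A @ M) / k           # next characteristic-polynomial coefficient
    return -M / c                          # -M_n / c_0 equals A^{-1}

A = np.array([[2.0, 1.0], [0.0, 3.0]])
print(np.allclose(inverse_via_traces(A) @ A, np.eye(2)))   # True
```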
Appendix E Proof of Theorem 5.1
We here consider another higher-order proposal: the invariant graph networks or of Maron et al. (2019b). By contrast to , are linear architectures. If we denote by the class of layered , then the following inclusions are known (Maron et al., 2019b)
The reverse inclusions were posed as open problems in Maron et al. (2019a) and were shown to hold by Chen et al. (2020) for , by means of an extensive case analysis and by relying on properties of . In this section, we show that the separation power of is bounded by that of , for arbitrary . Theorem 4.2 implies that we can entirely shift our attention to showing that the layers of can be represented in . In other words, we only need to show that index variables are needed for the layers. As we will see below, this requires a bit of work since a naive representation of the layers of uses index variables. Nevertheless, we show that this can be reduced to index variables only.
By inspecting the expressions needed to represent the layers of in , we obtain that a layer requires expressions of summation depth . In other words, the correspondence between layers and summation depth is precisely in sync. This implies, by Theorem 4.2:
where we ignore the number of layers. We similarly obtain that , hereby answering the open problem posed in Maron et al. (2019a). Finally, we observe that the used in Maron et al. (2019b) to show the inclusion are of a very simple form. By defining a simple class of , denoted by , we obtain
hereby recovering the layer/round connections.
We start with the following lemma:
Lemma E.1.
For any , a layer can be represented in .
Before proving this lemma, we recall . These are architectures that consist of linear equivariant layers. Such linear layers allow for an explicit description. Indeed, following Maron et al. (2019c), let be the equality pattern equivalence relation on such that for , if and only if for all . We denote by the equivalence classes induced by . Let us denote by the tensor computed by an in layer . Then, in layer , a new tensor in is computed, as follows. For and :
(1)
for activation function , constants and in and where and are indicator functions for the -tuple to be in the equivalence class and the -tuple to be in class . As initial tensor one defines with where is the number of initial vertex labels, just as for .
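The equality-pattern equivalence classes appearing in this description are simply set partitions of the index positions. A small sketch (our own illustration) computes the canonical equality pattern of a tuple and groups all tuples accordingly; the number of classes is the relevant Bell number.

```python
from itertools import product

def equality_pattern(tup):
    """Canonical equality pattern of a tuple: positions get the same block id
    iff they hold equal values (block ids in order of first occurrence)."""
    seen = {}
    return tuple(seen.setdefault(x, len(seen)) for x in tup)

def pattern_classes(k, n):
    """Group all 2k-tuples over {0,...,n-1} by their equality pattern; these
    classes index the basis of the linear equivariant layers described above."""
    buckets = {}
    for t in product(range(n), repeat=2 * k):
        buckets.setdefault(equality_pattern(t), []).append(t)
    return buckets

print(len(pattern_classes(k=1, n=3)))   # 2  = Bell(2): patterns "x == y" and "x != y"
print(len(pattern_classes(k=2, n=4)))   # 15 = Bell(4) equality patterns on 4-tuples
```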
We remark that the need for having a summation depth of in the expressions in , or equivalently for requiring rounds of , can intuitively be explained by the fact that each layer of a aggregates more information from “neighbouring” -tuples than does. Indeed, in each layer, an can use previous tuple embeddings of all possible -tuples. In a single round of only previous tuple embeddings from specific sets of -tuples are used. It is only after an additional rounds that gets access to the information about arbitrary -tuples, whereas this information is available in a in one layer directly.
Proof of Lemma E.1.
We have seen how can be represented in when dealing with . We assume now that also the th layer can be represented by expressions in and show that the same holds for the th layer.
We first represent in , based on the explicit description given earlier. The expressions use index variables and . More specifically, for we consider the expressions:
(2)
where is a product of expressions of the form encoding the equality pattern , and similarly, is a product of expressions of the form , and encoding the equality pattern . These expressions are indicator functions for their corresponding equality patterns. That is,
We remark that in the expressions we have two kinds of summations: those ranging over a fixed number of elements (over equality patterns, feature dimension), and those ranging over the index variables . The latter are the only ones contributing to the summation depth. The former are just concise representations of a long summation over a fixed number of expressions.
We now only need to show that we can equivalently write as expressions in , that is, using only indices . As such, we can already ignore the term since this is already in . Furthermore, this expression does not affect the summation depth.
Furthermore, as just mentioned, we can expand expression into linear combinations of other simpler expressions. As such, it suffices to show that index variables suffice for each expression of the form:
(3)
obtained by fixing and in expression (2). To reduce the number of variables, as a first step we eliminate any disequality using the inclusion-exclusion principle. More precisely, we observe that can be written as:
(4)
for some sets , and of pairs of indices in , and where , and . Here we use that , and and use the inclusion-exclusion principle to obtain a polynomial in equality conditions only.
In view of expression (4), we can push the summations over in expression (3) to the subexpressions that actually use . That is, we can rewrite expression (3) into the equivalent expression:
(5)
By fixing , and , it now suffices to argue that
(6)
can be equivalently expressed in .
Since our aim is to reduce the number of index variables from to , it is important to know which variables are the same. In expression (6), some equalities that hold between the variables may not be explicitly mentioned. For this reason, we expand , and with their implied equalities. That is, is added to , if for any such that
holds. Similar implied equalities and are added to and , respectively. Let us denote them by , and . It should be clear that we can add these implied equalities to expression (6) without changing its semantics. In other words, expression (6) can be equivalently represented by
(7)
There are now two types of index variables among the : those that are equal to some , and those that are not. Now suppose that , and thus , and that also , and thus . Since we included the implied equalities, we also have , and thus . There is no reason to keep as it is implied by and . We can thus safely remove all pairs from such that (and thus also ). We denote by the reduced set of pairs of indices obtained from in this way. We have that expression (7) can be equivalently written as
(8)
where we also switched the order of equalities in and . Our construction of and ensures that none of the variables with belonging to a pair in is equal to some .
By contrast, the variables occurring in are equal to . We observe, however, that also certain equalities among the variables hold, as represented by the pairs in . Let and define as a unique representative element in . For example, one can take to be the smallest index in . We use this representative index (and corresponding -variable) to simplify . More precisely, we replace each pair with the pair . In terms of variables, we replace with . Let be the set modified in that way. Expression (8) can thus be equivalently written as
(9)
where the free index variables of the subexpression
(10)
are precisely the index variables for . Recall that our aim is to reduce the variables from to . We are now finally ready to do this. More specifically, we consider a bijection in which ensures that for each there is a such that and . Furthermore, among the summations we can ignore those for which holds. After all, they only contribute for a given value. Let be those indices in such that for some . Then, we can equivalently write expression (9) as
(11)
where denotes the expression obtained by renaming the variables in into -variables according to . This is our desired expression in . If we analyze the summation depth of this expression, we have by induction that the summation depth of is at most . In the above expression, we are increasing the summation depth by at most . The largest size of is , which occurs when none of the -variables are equal to any of the -variables. As a consequence, we obtain an expression of summation depth at most , as desired. ∎
As a consequence, when using for vertex embeddings, one simply pads the layer expression with which does not affect the number of variables or summation depth. When using for graph embeddings, an additional invariant layer is added to obtain an embedding from . Such invariant layers have a similar (simpler) representation as given in equation 1 (Maron et al., 2019c), and allow for a similar analysis. One can verify that expressions in are needed when such an invariant layer is added to previous layers. Based on this, Theorem 4.2, Lemma E.1 and Theorem 1 in Maron et al. (2019b), imply that and hold.
-dimensional GINs.
We can recover a layer-based characterization for that compute vertex embeddings by considering a special subset of . Indeed, the used in Maron et al. (2019b) to show are of a very special form. We extract the essence of these special in the form of -dimensional . That is, we define the class to consist of layers defined as follows. The initial layers are just as for . Then, for :
where , and are . It is now an easy exercise to show that can be represented in (remark that the summations used increase the summation depth by one only in each layer). Combined with Theorem 4.2 and by inspecting the proof of Theorem 1 in Maron et al. (2019b), we obtain:
Proposition E.2.
For any and any : .
We can define the invariant version of by adding a simple readout layer of the form
as is used in Maron et al. (2019b). We obtain, , by simply rephrasing the readout layer in .
Appendix F Details of Section 6
Let be the class of all continuous functions from to . We always assume that forms a compact space. For example, when vertices are labeled with values in , is a finite set which we equip with the discrete topology. When vertices carry labels in we assume that these labels come from a compact set . In this case, one can represent graphs in by elements in and the topology used is the one induced by some metric on the reals. Similarly, we equip with the topology induced by some metric .
Consider and define as the closure of in under the usual topology induced by . In other words, a continuous function is in if there exists a sequence of functions such that . The following theorem provides a characterization of the closure of a set of functions. We state it here modified to our setting.
Theorem F.1 ((Timofte, 2005)).
Let such that there exists a set satisfying and . Then,
where . We can equivalently replace by in the expression for .∎
We will use this theorem to show Theorem 6.1 in the setting that consists of functions that can be represented in , and more generally, sets of functions that satisfy two conditions, stated below. We more generally allow to consist of functions , where the may depend on . We will require to satisfy the following two conditions:
- concatenation-closed:
-
If and are in , then is also in .
- function-closed:
-
For a fixed , for any such that , also is in for any continuous function .
We denote by the subset of of functions from to . See 6.1
Proof.
The proof consists of (i) verifying the existence of a set as mentioned in Theorem F.1; and (ii) eliminating the pointwise convergence condition in the closure characterization in Theorem F.1.
For showing (ii) we argue that such that the condition is automatically satisfied for any . Indeed, take an arbitrary and consider the constant functions with the th basis vector. Since is function-closed for , so is . Hence, as well. Furthermore, if , for , then and thus is closed under scalar multiplication. Finally, consider . For and in , since is concatenation-closed. As a consequence, the function is in , showing that is also closed under addition. All combined, this shows that is closed under taking linear combinations and since the basis vectors of can be attained, , as desired.
For (i), we show the existence of a set such that and hold. Similarly as in Azizian & Lelarge (2021), we define
We remark that for and , , with being pointwise multiplication, is also in . Indeed, with the concatenation of and and being pointwise multiplication.
It remains to verify . Assume that and are not in . By definition, this implies the existence of a function such that with . We argue that and are then not in either. Indeed, Proposition 1 in Maron et al. (2019b) implies that there exist natural numbers such that the mapping satisfies , with . Since (and thus also ) is function-closed, for any . In particular, and concatenation-closure implies that is in too. Hence, , by definition. It now suffices to observe that , and thus and are not in , as desired. ∎
When we know more about we can say a bit more. In the following, we let and only consider the setting where is either (invariant graph functions) or (equivariant graph/vertex functions). See 6.2
Proof.
This is a mere restatement of Theorem 6.1 in which the condition is replaced by , where for and for . ∎
To relate all this to functions representable by tensor languages, we make the following observations. First, if we consider to be the set of all functions that can be represented in , , or , then will be automatically concatenation and function-closed, provided that consists of all functions in . Hence, Theorem 6.1 applies. Furthermore, our results from Section 4 tell us that for all , and , , , , and . As a consequence, Corollary 6.2 applies as well. We thus easily obtain the following characterizations:
Proposition F.2.
For any and :
-
•
If consists of all functions representable in , then ;
-
•
If consists of all functions representable in , then ;
-
•
If consists of all functions representable in , then ; and finally,
-
•
If consists of all functions representable in , then ,
provided that consists of all functions in .
In fact, Lemma 32 in Azizian & Lelarge (2021) implies that we can equivalently populate with all instead of all continuous functions. We can thus use and continuous functions interchangeably when considering the closure of functions.
At this point, we want to make a comparison with the results and techniques in Azizian & Lelarge (2021). Our proof strategy is very similar and is also based on Theorem F.1. The key distinguishing feature is that we consider functions instead of functions from graphs alone. This has the great advantage that no separate proofs are needed to deal with invariant or equivariant functions. Equivariance incurs quite some complexity in the setting considered in Azizian & Lelarge (2021). A second major difference is that, by considering functions representable in tensor languages, and based on our results from Section 4, we obtain a more fine-grained characterization. Indeed, we obtain characterizations in terms of the number of rounds used in and . In Azizian & Lelarge (2021), is always set to , that is, an unbounded number of rounds is considered. Furthermore, when it concerns functions , we recall that is different from . Only is considered in Azizian & Lelarge (2021). Finally, another difference is that we define the equivariant version in a different way than is done in Azizian & Lelarge (2021), because in this way, a tighter connection to logics and tensor languages can be made. In fact, if we were to use the equivariant version of from Azizian & Lelarge (2021), then we necessarily have to consider an unbounded number of rounds (similarly as in our case).
We conclude this section by providing a few more details about the consequences of the above results for . As we already mentioned in Section 6.2, many common architectures are concatenation and function-closed (using instead of continuous functions). This holds, for example, for the classes , , and and , as described in Section 5 and further detailed in Section E and D. Here, the subscript refers to the dimension of the embedding space.
We now consider a function that is not more separating than (respectively, , or , for some ), and want to know whether can be approximated by a class of . Proposition F.2 implies that such can be approximated by a class of as long as these are at least as separating as (respectively, , or ). This, in turn, amounts to showing that the can be represented in the corresponding tensor language fragment, and that they can match the corresponding labeling algorithm in separation power. We illustrate this for the architectures mentioned above.
- •
- •
-
•
In Section 5 we mentioned (see details in Section D) that can be represented in . Theorem 4.2 then implies that . Furthermore, Maron et al. (2019b) showed that . As a consequence, . Similarly, for the special class of described in Section E. No restrictions are in place for the lower bounds and hence real-valued vertex-labelled graphs can be considered.
-
•
When or are extended with a readout layer, we showed in Section 5 that these can be represented in . Theorem 4.4 and the results by Xu et al. (2019) and Barceló et al. (2020) then imply that and coincide with the separation power of these architectures with a readout layer. Here again, discrete labels need to be considered.
-
•
Similarly, when or are used for graph embeddings, we can represent these in , implying again that their separation power coincides with that of . No restrictions are again in place on the vertex labels.
So for all these architectures, Corollary 6.2 applies and we can characterize the closures of these architectures in terms of functions that are not more separating than their corresponding versions of or , as described in the main paper. In summary,
Proposition F.3.
For any :
and when extended with a readout layer:
Furthermore, for any
and when converted into graph embeddings:
where the closures of the tensor languages are interpreted as the closure of the graph or graph/vertex functions that they can represent. For results involving or , the graphs considered should have discretely labeled vertices.
As a side note, we remark that in order to simulate on graphs with real-valued labels, one can use a architecture of the form , which translates in as expressions of the form
The upper bound in terms of follows from our main results. To show that can be simulated, it suffices to observe that one can approximate the function used in Proposition 1 in Maron et al. (2019b) to injectively encode multisets of real vectors by means of . As such, a continuous version of the first bullet in the previous proposition can be obtained.
Appendix G Details on Treewidth and Proposition 4.5
As an extension of our main results in Section 4, we enrich the class of tensor language expressions for which connections to exist. More precisely, instead of requiring expressions to belong to , that is, to only use index variables, we investigate when expressions in are semantically equivalent to an expression using variables. Proposition 4.5 identifies a large class of such expressions, namely those of treewidth . As a consequence, even when representing architectures may require more than index variables, this number can sometimes be reduced, and our results then imply that their separation power is in fact upper bounded by for a smaller . Stated otherwise, to boost the separation power of , the expressions representing the layers of the must have large treewidth.
We next introduce some concepts related to treewidth. We here closely follow the exposition given in Abo Khamis et al. (2016), which introduces treewidth by means of variable elimination sequences of hypergraphs.
In this section, we restrict ourselves to summation aggregation.
G.1 Elimination sequences
We first define elimination sequences for hypergraphs. Later on, we show how to associate such hypergraphs to expressions in tensor languages, allowing us to define elimination sequences for tensor language expressions.
By a multi-hypergraph we simply mean a multiset of subsets of vertices . An elimination sequence for a hypergraph is a vertex ordering of the vertices of . With such a sequence , we can associate a sequence of multi-hypergraphs as follows. We define
and for
The induced width on by is defined as . We further consider the setting in which has some distinguished vertices. As we will see shortly, these distinguished vertices correspond to the free index variables of tensor language expressions. Without loss of generality, we assume that the distinguished vertices are . When such distinguished vertices are present, an elimination sequence is just as before, except that the distinguished vertices come first in the sequence. If are the distinguished vertices, then we define the induced width of the sequence as . In other words, we count the number of distinguished vertices and augment this count with the induced width of the sequence, starting from to to , thereby ignoring the distinguished variables in the ’s. One could, more generally, also try to reduce the number of free index variables, but we assume that this number is fixed, similar to how operate.
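The following is a small sketch of the induced-width computation just described, assuming the multi-hypergraph is given as a list of vertex sets; the function name and the two toy examples are illustrative only.

```python
from itertools import chain

def induced_width(edges, ordering, distinguished=frozenset()):
    """Induced width of a vertex ordering on a multi-hypergraph.

    `edges` is a list of vertex sets, `ordering` lists the non-distinguished
    vertices in elimination order, and `distinguished` vertices are never
    eliminated (they play the role of free index variables and are ignored
    when measuring the size of the unions)."""
    edges = [set(e) for e in edges]
    width = 0
    for v in ordering:
        # Union of all edges containing v.
        touching = [e for e in edges if v in e]
        U = set(chain.from_iterable(touching))
        width = max(width, len(U - distinguished))
        # Eliminate v: drop the edges containing it, add the new edge U \ {v}.
        edges = [e for e in edges if v not in e] + [U - {v}]
    return len(distinguished) + width

# Counting triangles: edges {i,j}, {j,k}, {k,i}, no free indices -> width 3.
print(induced_width([{'i', 'j'}, {'j', 'k'}, {'k', 'i'}], ['i', 'j', 'k']))

# Walks of length two between distinguished endpoints i and j -> width 3.
print(induced_width([{'i', 'k'}, {'k', 'j'}], ['k'],
                    distinguished=frozenset({'i', 'j'})))
```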
G.2 Conjunctive expressions and treewidth
We start by considering a special form of expressions, which we refer to as conjunctive expressions, in analogy to conjunctive queries in database research and logic. A conjunctive expression is of the form
where denote the free index variables, contains all index variables under the scope of a summation, and finally, is a product of base predicates in . That is, is a product of and with variables in or . With such a conjunctive expression, one can associate a multi-hypergraph in a canonical way (Abo Khamis et al., 2016). More precisely, given a conjunctive expression we define as:
- consists of all index variables in and ;
- : for each atomic base predicate in we have an edge containing the indices occurring in the predicate; and
- the vertices corresponding to the free index variables form the distinguished set of vertices.
We now define an elimination sequence for as an elimination sequence for taking the distinguished vertices into account. The following observation ties elimination sequences of to the number of variables needed to express .
Proposition G.1.
Let be a conjunctive expression for which an elimination sequence of induced width exists. Then is equivalent to an expression in .
Proof.
We show this by induction on the number of vertices in which are not distinguished. For the base case, all vertices are distinguished and hence does not contain any summation and is an expression in itself.
Suppose that in there are undistinguished vertices. That is,
By assumption, we have an elimination sequence of the undistinguished vertices. Assume that is first in this ordering. Let us write
where is the product of predicates corresponding to the edges , that is, those not containing , and is the product of all predicates corresponding to the edges , that is, those containing the predicate . Note that, because of the induced width of , contains all indices in , which is of size . We now replace the previous expression with another expression
where is regarded as an -ary predicate over the indices in . It is now easily verified that is the hypergraph corresponding to the variable ordering . We note that this is a hypergraph over undistinguished vertices. We can apply the induction hypothesis and replace with its equivalent expression in . To obtain the expression of , it now remains to replace the new predicate with its defining expression. We note again that contains at most indices, so it will occur in in the form where . In other words, one of the variables in is not used, say , and we can simply replace by . ∎
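To make the rewriting in the proof concrete, the following sketch uses numpy's einsum as a stand-in for conjunctive expressions under summation aggregation (all names are illustrative): the walk-of-length-three count syntactically involves four index variables, but eliminating the inner variables one at a time only ever materializes predicates over three indices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)  # adjacency matrix of a toy graph

# Conjunctive expression with four indices i, k, l, j.
walks_direct = np.einsum('ik,kl,lj->ij', A, A, A)

# Variable elimination: summing out k introduces a fresh binary predicate
# B over (i, l); summing out l afterwards again involves only three indices.
B = np.einsum('ik,kl->il', A, A)           # eliminate k
walks_elim = np.einsum('il,lj->ij', B, A)  # eliminate l

assert np.allclose(walks_direct, walks_elim)
```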
As a consequence, one way of showing that a conjunctive expression in is equivalently expressible in is to find an elimination sequence of induced width . This, in turn, is equivalent to having a treewidth of , as is shown, e.g., in Abo Khamis et al. (2016). As usual, we define the treewidth of a conjunctive expression in as the treewidth of its associated hypergraph .
We recall the definition of treewidth (modified to our setting): A tree decomposition of with is such that
- For any , there is a such that ; and
- For any corresponding to a non-distinguished index variable, the set is not empty and forms a connected sub-tree of .
The width of a tree decomposition is given by . Now, the treewidth of is the minimum width over all of its tree decompositions. We denote by the treewidth of . Again, similar modifications are used when distinguished vertices are in place. Referring again to Abo Khamis et al. (2016), is equivalent to having a variable elimination sequence for of induced width . Hence, combining this observation with Proposition G.1 results in:
Corollary G.2.
Let be a conjunctive expression of treewidth . Then is equivalent to an expression in .
That is, we have established Proposition 4.5 for conjunctive expressions. We next lift this to arbitrary expressions.
G.3 Arbitrary expressions
First, we observe that any expression in can be written as a linear combination of conjunctive expressions. This readily follows from the linearity of the operations in and from the fact that equality and inequality predicates can be eliminated. More specifically, we may assume that in is of the form
with a finite set of indices and , and conjunctive expressions. We now define
for expressions in . To deal with expressions in that may contain function applications, we define as the maximum treewidth of the expressions: (i) those obtained by replacing each top-level function application by a new predicate with free indices ; and (ii) all expressions occurring in a top-level function application in . We note that these expressions either have no function applications (as in (i)) or have function applications of lower nesting depth (in , as in (ii)). In other words, applying this definition recursively, we end up with expressions without function applications, for which treewidth was already defined. With this notion of treewidth at hand, Proposition 4.5 readily follows.
Appendix H Higher-order MPNNs
We conclude the supplementary material by elaborating on - and by relating them to classical (Gilmer et al., 2017). As the underlying tensor language we use , which includes arbitrary functions () and aggregation functions (), as defined in Section C.5.
We recall from Section 3 that - refer to the class of embeddings for some that can be represented in . When considering an embedding , the notion of being represented is defined in terms of the existence of expressions in , which together provide each of the components of the embedding in . We remark, however, that we can alternatively include concatenation in tensor language. As such, we can concatenate separate expressions into a single expression. As a positive side effect, for to be represented in tensor language, we can then simply define it by requiring the existence of a single expression, rather than separate ones. This results in a slightly more succinct way of reasoning about -.
In order to reason about - as a class of embeddings, we can obtain an equivalent definition for the class of - by inductively stating how new embeddings are computed from old embeddings. Let be a set of distinct variables. In the following, denotes a tuple of vertices that has at least as many components as the highest index of the variables used in the expressions. Intuitively, variable refers to the th component in . We also denote the image of a graph and tuple under an expression , i.e., the semantics of given and , as rather than by . We further simply refer to embeddings rather than expressions.
We first define “atomic” - embeddings which extract basic information from the graph and the given tuple of vertices.
- Label embeddings of the form , with , and defined by , are -;
- Edge embeddings of the form , with , and defined by
are -; and
- (Dis-)equality embeddings of the form , with , and defined by
are -.
We next inductively define new - from “old” -. That is, given - , the following are also - (a small illustrative sketch in code follows this list):
- Function applications of the form are -, where , and defined by
Here, if , then for some . That is, generates an embedding in . We remark that our function applications include concatenation.
- Unconditional aggregations of the form are -, where and , and defined by
Here, if generates an embedding in , then is an aggregation function assigning to multisets of vectors in a vector in , for some . So, generates an embedding in .
- Conditional aggregations of the form are -, with , and defined by
As before, if generates an embedding in , then is an aggregation function assigning to multisets of vectors in a vector in , for some . So again, generates an embedding in .
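As announced above, here is a small, hypothetical sketch of a single conditional-aggregation step over pairs of vertices, assuming sum aggregation and dense numpy arrays; it is meant only to illustrate the shape of the operation, not to restate a definition from the paper.

```python
import numpy as np

def conditional_aggregation(A, H, agg=np.sum):
    """One conditional-aggregation step over vertex pairs.

    `H[v, w]` holds the current embedding of the pair (v, w).  The new
    embedding of (v, w) aggregates the embeddings of the pairs (v, u) over
    all u adjacent to w, i.e. aggregation over one tuple position
    conditioned on the edge predicate."""
    n = A.shape[0]
    out = np.zeros_like(H)
    for v in range(n):
        for w in range(n):
            neighbours = np.nonzero(A[w])[0]
            if len(neighbours):
                out[v, w] = agg(H[v, neighbours], axis=0)
    return out

# Toy usage: pairs of vertices of a path graph embedded in R^2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.arange(3 * 3 * 2, dtype=float).reshape(3, 3, 2)
print(conditional_aggregation(A, H).shape)  # (3, 3, 2)
```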
As defined in the main paper, we also consider the subclass - by only considering - defined in terms of expressions of aggregation depth at most . Our main results, phrased in terms of -, are:
Hence, if the embeddings computed by are -, one obtains an upper bound on the separation power in terms of .
The classical (Gilmer et al., 2017) are a subclass of - in which no unconditional aggregation can be used and, furthermore, function applications require input embeddings with the same single variable ( or ), and only and are allowed. In other words, they correspond to guarded tensor language expressions (Section 4.2). We denote this class of - by and by when restrictions on the aggregation depth are in place. Indeed, the classical way of describing as
corresponds to - that satisfy the above-mentioned restrictions. Without readouts, compute vertex embeddings and hence our results imply
Furthermore, with a readout function fall into the category of -:
where unconditional aggregation is used. Hence,
We thus see that - gracefully extend and can be used for obtaining upper bounds on the separation power of classes of .
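To close, a minimal sketch of the guarded setting just described: a message-passing layer that only aggregates over neighbours (aggregation guarded by the edge relation), followed by an unconditional readout over all vertices. The sum aggregation, tanh update, and weight shapes are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def mpnn_layer(A, H, W_self, W_nbr, b):
    """One guarded message-passing layer: each vertex combines its own
    embedding with the sum of its neighbours' embeddings (aggregation is
    conditioned on the edge relation only)."""
    messages = A @ H                                   # sum over adjacent vertices
    return np.tanh(H @ W_self + messages @ W_nbr + b)  # placeholder update

def readout(H):
    """Unconditional aggregation over all vertices, turning the vertex
    embeddings into a single graph embedding."""
    return H.sum(axis=0)

# Toy usage on a path graph with 2-dimensional vertex labels.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3, 2)
rng = np.random.default_rng(1)
W_self, W_nbr, b = rng.normal(size=(2, 2)), rng.normal(size=(2, 2)), np.zeros(2)
print(readout(mpnn_layer(A, H, W_self, W_nbr, b)))
```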