
Conditional Contrastive Learning with Kernel

Yao-Hung Hubert Tsai1, Tianqin Li1, Martin Q. Ma1, Han Zhao2,
Kun Zhang1,3, Louis-Philippe Morency1, & Ruslan Salakhutdinov1
1Carnegie Mellon University 2University of Illinois at Urbana-Champaign
3Mohamed bin Zayed University of Artificial Intelligence
{yaohungt, tianqinl, qianlim, kunz1, morency, rsalakhu}@cs.cmu.edu
{hanzhao}@illinois.edu
Abstract

Conditional contrastive learning frameworks consider the conditional sampling procedure that constructs positive or negative data pairs conditioned on specific variables. Fair contrastive learning constructs negative pairs, for example, from the same gender (conditioning on sensitive information), which in turn reduces undesirable information in the learned representations; weakly supervised contrastive learning constructs positive pairs with similar annotative attributes (conditioning on auxiliary information), which in turn incorporates the auxiliary information into the representations. Although conditional contrastive learning enables many applications, the conditional sampling procedure can be challenging if we cannot obtain sufficient data pairs for some values of the conditioning variable. This paper presents Conditional Contrastive Learning with Kernel (CCL-K), which converts existing conditional contrastive objectives into alternative forms that mitigate the insufficient data problem. Instead of sampling data according to the value of the conditioning variable, CCL-K uses the Kernel Conditional Embedding Operator, which samples data from all available data and assigns weights to each sampled data point according to the kernel similarity between the values of the conditioning variable. We conduct experiments using weakly supervised, fair, and hard negative contrastive learning, showing that CCL-K outperforms state-of-the-art baselines.

Equal contribution. Code available at: https://github.com/Crazy-Jack/CCLK-release.

1 Introduction

Contrastive learning algorithms (Oord et al., 2018; Chen et al., 2020; He et al., 2020; Khosla et al., 2020) learn similar representations for positively-paired data and dissimilar representations for negatively-paired data. For instance, self-supervised visual contrastive learning (Hjelm et al., 2018) defines two views of the same image (obtained by applying different image augmentations) as a positive pair and two different images as a negative pair. Supervised contrastive learning (Khosla et al., 2020) defines data with the same label as a positive pair and data with different labels as a negative pair. Distinct contrastive approaches thus construct positive and negative pairs differently, according to their learning goals.

In conditional contrastive learning, positive and negative pairs are constructed conditioned on specific variables. The conditioning variables can be downstream labels (Khosla et al., 2020), sensitive attributes (Tsai et al., 2021c), auxiliary information (Tsai et al., 2021a), or data embedding features (Robinson et al., 2020; Wu et al., 2020). For example, in fair contrastive learning (Tsai et al., 2021c), conditioning on variables such as gender or race is performed to remove undesirable information from the learned representations. Conditioning is achieved by constructing negative pairs from the same gender. As a second example, in weakly supervised contrastive learning (Tsai et al., 2021a), the aim is to include extra information in the learned representations. This extra information could be, for example, freely available attributes for images collected from social media. The conditioning is performed by constructing positive pairs with similar annotative attributes. The cornerstone of conditional contrastive learning is the conditional sampling procedure: efficiently constructing positive or negative pairs while properly enforcing the conditioning.

The conditional sampling procedure requires access to sufficient data for each state of the conditioning variables. For example, if we condition on the “age” attribute to reduce age bias, the conditional sampling procedure works best when we can create a sufficient number of data pairs for each age group. However, in many real-world situations, some values of the conditioning variable may have few data points, or even none at all. The sampling problem is exacerbated when the conditioning variable is continuous.

In this paper, we introduce Conditional Contrastive Learning with Kernel (CCL-K), which helps mitigate the problem of insufficient data by providing an alternative formulation using similarity kernels (see Figure 1). Given a specific value of the conditioning variable, instead of sampling only data associated with exactly this value, we can also sample data whose values of the conditioning variable are similar. We leverage the Kernel Conditional Embedding Operator (Song et al., 2013) for the sampling process, which uses kernels (Schölkopf et al., 2002) to measure the similarities between the values of the conditioning variable. This new formulation, with a weighting scheme based on kernel similarity, allows us to use all training data when conditionally creating positive and negative pairs. It also enables contrastive learning with continuous conditioning variables.

Figure 1: Illustration of the main idea in CCL-K, best viewed in color. Suppose we select color as the conditioning variable and we want to sample red data points. Left figure: the traditional conditional sampling procedure only samples red points (i.e., the points in the circle). Right figure: the proposed CCL-K samples all data points (i.e., the sampled set expands from the inner circle to the outer circle) with a weighting scheme based on the similarities between the conditioning variable’s values. The higher the weight, the higher the probability of a data point being sampled. For example, CCL-K can sample orange data with a high probability, because orange resembles red. In this illustration, the weight ranges from 0 to 1, with white as 0 and black as 1.

To study the generalization of CCL-K, we conduct experiments on three tasks. The first task is weakly supervised contrastive learning, which incorporates auxiliary information by conditioning on annotative attributes to improve downstream task performance. For the second task, fair contrastive learning, we condition on the sensitive attribute to remove its information from the representations. The last task is hard negative contrastive learning, which samples negative pairs that have similar outcomes of the conditioning variable and learns dissimilar representations for them. We compare CCL-K with baselines tailored for each of the three tasks, and CCL-K outperforms all baselines on downstream evaluations.

2 Conditional Sampling in Contrastive Learning

In Section 2.1, we first present technical preliminaries of contrastive learning. Next, we introduce the conditional sampling procedure in Section 2.2, showing that it instantiates recent conditional contrastive frameworks. Last, in Section 2.3, we discuss the limitation of insufficient data in the current framework and propose converting existing objectives into alternative forms with kernels to alleviate this limitation. Throughout the paper, we use uppercase letters (e.g., $X$) for random variables, $P_{\cdot}$ for distributions (e.g., $P_X$ denotes the distribution of $X$), lowercase letters (e.g., $x$) for outcomes of a variable, and calligraphic letters (e.g., $\mathcal{X}$) for the sample space of a variable.

2.1 Technical Preliminaries - Unconditional Contrastive Learning

Contrastive methods learn similar representations for positive pairs and dissimilar representations for negative pairs (Chen et al., 2020; He et al., 2020; Hjelm et al., 2018). In prior literature (Oord et al., 2018; Tsai et al., 2021b; Bachman et al., 2019; Hjelm et al., 2018), the construction of the positive and negative pairs can be understood as sampling from the joint distribution $P_{XY}$ and the product of marginals $P_X P_Y$. To see this, we begin by reviewing one popular contrastive approach, the InfoNCE objective (Oord et al., 2018):

${\rm InfoNCE} := \sup_{f}\ \mathbb{E}_{(x,\, y_{\rm pos}) \sim P_{XY},\ \{y_{{\rm neg},i}\}_{i=1}^{n} \sim P_Y^{\otimes n}} \Big[ \log \frac{e^{f(x,\, y_{\rm pos})}}{e^{f(x,\, y_{\rm pos})} + \sum_{i=1}^{n} e^{f(x,\, y_{{\rm neg},i})}} \Big],$   (1)

where $X$ and $Y$ represent the data and $x$ is the anchor data. $(x, y_{\rm pos})$ is positively-paired and sampled from $P_{XY}$ ($x$ and $y$ are different views of each other, e.g., augmented variants of the same image), and $\{(x, y_{{\rm neg},i})\}_{i=1}^{n}$ are negatively-paired and sampled from $P_X P_Y$ (e.g., $x$ and $y$ are two random images). $f(\cdot,\cdot)$ defines a mapping $\mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, which is parameterized via neural nets (Chen et al., 2020) as:

$f(x, y) := {\rm cosine\ similarity}\big(g_{\theta_X}(x), g_{\theta_Y}(y)\big) / \tau,$   (2)

where $g_{\theta_X}(x)$ and $g_{\theta_Y}(y)$ are embedded features, $g_{\theta_X}$ and $g_{\theta_Y}$ are neural networks ($g_{\theta_X}$ can be the same as $g_{\theta_Y}$) parameterized by $\theta_X$ and $\theta_Y$, and $\tau$ is a hyper-parameter that rescales the score from the cosine similarity. The InfoNCE objective aims to maximize the similarity score between a data pair sampled from the joint distribution (i.e., $(x, y_{\rm pos}) \sim P_{XY}$) and minimize the similarity score between a data pair sampled from the product of marginals (i.e., $(x, y_{\rm neg}) \sim P_X P_Y$) (Tschannen et al., 2019).
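
To make the scoring function and the loss concrete, below is a minimal PyTorch sketch of equation 1; it is not the authors' released implementation, the names (info_nce, x_feat, y_feat, tau) are illustrative, and the whole batch is treated as in-batch negatives.

```python
# Minimal sketch of the InfoNCE objective (equation 1), assuming precomputed
# embeddings g_{theta_X}(x_i) and g_{theta_Y}(y_i) for a batch of size b.
import torch
import torch.nn.functional as F

def info_nce(x_feat: torch.Tensor, y_feat: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    x_feat = F.normalize(x_feat, dim=-1)
    y_feat = F.normalize(y_feat, dim=-1)
    # logits[i, j] = f(x_i, y_j) = cosine similarity / tau (equation 2)
    logits = x_feat @ y_feat.t() / tau
    # diagonal entries are positives (x_i, y_i) ~ P_XY; off-diagonal entries
    # act as negatives drawn from P_X P_Y
    labels = torch.arange(x_feat.size(0), device=x_feat.device)
    return F.cross_entropy(logits, labels)
```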

2.2 Conditional Contrastive Learning

Recent literature (Robinson et al., 2020; Tsai et al., 2021a; c) has modified the InfoNCE objective to achieve different learning goals by sampling positive or negative pairs under a conditioning variable $Z$ (with outcome $z$). These different conditional contrastive learning frameworks share one common technical challenge: the conditional sampling procedure. The conditional sampling procedure samples positive or negative data pairs from the product of conditional distributions, $(x, y) \sim P_{X|Z=z} P_{Y|Z=z}$, where $x$ and $y$ are sampled given the same outcome of the conditioning variable (e.g., two random images with a blue-sky background, when selecting $z$ to be the blue-sky background). We summarize how different frameworks use the conditional sampling procedure in Table 1.

Framework | Conditioning Variable $Z$ | Positive Pairs from | Negative Pairs from
Weakly Supervised (Tsai et al., 2021a) | Auxiliary information | $P_{X|Z=z} P_{Y|Z=z}$ | $P_X P_Y$
Fair (Tsai et al., 2021c) | Sensitive information | $P_{XY|Z=z}$ | $P_{X|Z=z} P_{Y|Z=z}$
Hard Negatives (Wu et al., 2020) | Feature embedding of $X$ | $P_{XY}$ | $P_{X|Z=z} P_{Y|Z=z}$
Table 1: Conditional sampling procedures of weakly supervised, fair, and hard negative contrastive learning. A data pair $(x, y)$ sampled from $P_{XY}$ or $P_{XY|Z}$ can be regarded as strongly correlated, such as two views of the same image obtained by applying different image augmentations; $(x, y)$ sampled from $P_{X|Z} P_{Y|Z}$ as two random data points that are both associated with the same outcome of the conditioning variable, such as two random images with the same annotative attributes; and $(x, y)$ sampled from $P_X P_Y$ as two uncorrelated data points, such as two random images.

Weakly Supervised Contrastive Learning.

Tsai et al. (2021a) consider the auxiliary information from data (e.g., annotation attributes of images) as a weak supervision signal and propose a contrastive objective to incorporate the weak supervision in the representations. This work is motivated by the argument that the auxiliary information implies semantic similarities. With this motivation, the weakly supervised contrastive learning framework learns similar representations for data with the same auxiliary information and dissimilar representations for data with different auxiliary information. Embracing this idea, the original InfoNCE objective can be modified into the weakly supervised InfoNCE (abbreviated as WeaklySup-InfoNCE) objective:

${\rm WeaklySup_{InfoNCE}} := \sup_{f}\ \mathbb{E}_{z \sim P_Z,\ (x,\, y_{\rm pos}) \sim P_{X|Z=z} P_{Y|Z=z},\ \{y_{{\rm neg},i}\}_{i=1}^{n} \sim P_Y^{\otimes n}} \Big[ \log \frac{e^{f(x,\, y_{\rm pos})}}{e^{f(x,\, y_{\rm pos})} + \sum_{i=1}^{n} e^{f(x,\, y_{{\rm neg},i})}} \Big].$   (3)

Here $Z$ is the conditioning variable representing the auxiliary information of the data, and $z$ is an outcome of the auxiliary information sampled from $P_Z$. $(x, y_{\rm pos})$ is a positive pair sampled from $P_{X|Z=z} P_{Y|Z=z}$; in this design, the positive pair always has the same outcome of the conditioning variable. $\{(x, y_{{\rm neg},i})\}_{i=1}^{n}$ are negative pairs sampled from $P_X P_Y$.

Fair Contrastive Learning.

Another recent work (Tsai et al., 2021c) proposes to remove undesirable sensitive information (such as gender) from the representation by sampling negative pairs conditioned on sensitive attributes. The paper argues that fixing the outcome of the sensitive variable prevents the model from using the sensitive information to distinguish positive pairs from negative pairs (since all positive and negative samples share the same outcome), so the model will ignore the effect of the sensitive attribute during contrastive learning. Embracing this idea, the original InfoNCE objective can be modified into the Fair-InfoNCE objective:

${\rm Fair_{InfoNCE}} := \sup_{f}\ \mathbb{E}_{z \sim P_Z,\ (x,\, y_{\rm pos}) \sim P_{XY|Z=z},\ \{y_{{\rm neg},i}\}_{i=1}^{n} \sim P_{Y|Z=z}^{\otimes n}} \Big[ \log \frac{e^{f(x,\, y_{\rm pos})}}{e^{f(x,\, y_{\rm pos})} + \sum_{i=1}^{n} e^{f(x,\, y_{{\rm neg},i})}} \Big].$   (4)

Here $Z$ is the conditioning variable representing the sensitive information (e.g., gender), $z$ is an outcome of the sensitive information (e.g., female), and the anchor data $x$ is associated with $z$ (e.g., a data point whose gender attribute is female). $(x, y_{\rm pos})$ is a positive pair sampled from $P_{XY|Z=z}$, where $x$ and $y_{\rm pos}$ are constructed to have the same $z$. $\{(x, y_{{\rm neg},i})\}_{i=1}^{n}$ are negative pairs sampled from $P_{X|Z=z} P_{Y|Z=z}$. In this design, the positive and negative pairs always have the same outcome of the conditioning variable.

Hard-negative Contrastive Learning.

Robinson et al. (2020) and Kalantidis et al. (2020) argue that contrastive learning can benefit from hard negative samples (i.e., samples $y$ that are difficult to distinguish from an anchor $x$). Rather than considering two arbitrary data points as negatively-paired, these methods construct a negative data pair from two random data points that are not too far from each other (Wu et al. (2020) argue that a better construction of negative data pairs selects two random data points that are neither too far from nor too close to each other). Embracing this idea, the original InfoNCE objective is modified into the Hard Negative InfoNCE (abbreviated as HardNeg-InfoNCE) objective:

${\rm HardNeg_{InfoNCE}} := \sup_{f}\ \mathbb{E}_{(x,\, y_{\rm pos}) \sim P_{XY},\ z \sim P_{Z|X=x},\ \{y_{{\rm neg},i}\}_{i=1}^{n} \sim P_{Y|Z=z}^{\otimes n}} \Big[ \log \frac{e^{f(x,\, y_{\rm pos})}}{e^{f(x,\, y_{\rm pos})} + \sum_{i=1}^{n} e^{f(x,\, y_{{\rm neg},i})}} \Big].$   (5)

Here $Z$ is the conditioning variable representing the embedding feature of $X$; in particular, $z = g_{\theta_X}(x)$ (see the definition in equation 2; we refer to $g_{\theta_X}(x)$ as the embedding feature of $x$ and $g_{\theta_Y}(y)$ as the embedding feature of $y$). $(x, y_{\rm pos})$ is a positive pair sampled from $P_{XY}$. To construct negative pairs $\{(x, y_{{\rm neg},i})\}_{i=1}^{n}$, we sample $\{y_{{\rm neg},i}\}_{i=1}^{n}$ from $P_{Y|Z=g_{\theta_X}(x)}$. We realize this sampling as selecting data points from $Y$ whose embedding features are close to $z = g_{\theta_X}(x)$: sampling $y$ such that $g_{\theta_Y}(y)$ is close to $g_{\theta_X}(x)$.

2.3 Conditional Contrastive Learning with Kernel

The conditional sampling procedure common to all these conditional contrastive frameworks has a limitation when we have insufficient data points associated with some outcomes of the conditioning variable. In particular, given an anchor data point $x$ and its corresponding outcome $z$ of the conditioning variable, if $z$ is uncommon, it is challenging to sample $y$ associated with $z$ via $y \sim P_{Y|Z=z}$ (if $x$ is the anchor and $z$ is its corresponding outcome, then for $y \sim P_{Y|Z=z}$, the data pair $(x, y)$ can be seen as being sampled from $P_{X|Z=z} P_{Y|Z=z}$). The insufficient data problem becomes more serious when the cardinality $|Z|$ of the conditioning variable is large, which happens when $Z$ takes many discrete values or when $Z$ is continuous ($|Z| = \infty$). In light of this limitation, we propose converting these objectives into alternative forms that avoid sampling data from $P_{Y|Z}$ while retaining the same function as the original forms. We name this new family of formulations Conditional Contrastive Learning with Kernel (CCL-K).

High Level Intuition.

The high-level idea of our method is that, instead of sampling $y$ from $P_{Y|Z=z}$, we sample $y$ from the existing data of $Y$ whose associated conditioning variable's outcome is close to $z$. For example, assuming the conditioning variable $Z$ is age and $z$ is 80 years old, instead of directly sampling data points at the age of 80, we sample from all data points, assigning the highest weights to data points at ages from 70 to 90, given their proximity to 80. Our intuition is that data with similar outcomes of the conditioning variable should be used in support of the conditional sampling. Mathematically, instead of sampling from $P_{Y|Z=z}$, we sample from a distribution proportional to the weighted sum $\sum_{j=1}^{N} w(z_j, z) P_{Y|Z=z_j}$, where $w(z_j, z)$ represents how similar $z_j$ and $z$ are in the space of the conditioning variable $Z$. This similarity is computed for all data points $j = 1, \dots, N$. In this paper, we use the Kernel Conditional Embedding Operator (Song et al., 2013) for this approximation, where we represent the similarity using a kernel (Schölkopf et al., 2002).
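
As a small numeric illustration of this weighting idea (not the operator itself, which is introduced in Step III), the sketch below assumes an RBF kernel over ages with an arbitrary bandwidth and normalizes the similarities purely for readability.

```python
# Toy illustration of the weighting intuition: data points whose conditioning
# value (age) is close to the anchor's value receive most of the weight.
import numpy as np

ages = np.array([25.0, 40.0, 70.0, 80.0, 90.0])   # z_j for five data points
z, sigma = 80.0, 10.0                              # anchor's value, RBF bandwidth
k = np.exp(-(ages - z) ** 2 / (2 * sigma ** 2))    # RBF similarity to z
weights = k / k.sum()                              # normalized for readability
print(dict(zip(ages, weights.round(3))))
# ages 70, 80, and 90 receive nearly all of the mass; ages 25 and 40 are ignored
```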

Step I - Problem Setup.

We want to avoid the conditional sampling procedure in existing conditional contrastive objectives (equations 3, 4, and 5), and hence we do not assume access to data pairs from the conditional distribution $P_{X|Z} P_{Y|Z}$. Instead, the only given data is a batch of triplets $\{(x_i, y_i, z_i)\}_{i=1}^{b}$, independently sampled from the joint distribution $P_{XYZ}^{\otimes b}$ with $b$ being the batch size. In particular, when $(x_i, y_i, z_i) \sim P_{XYZ}$, $(x_i, y_i)$ is a pair of data sampled from the joint distribution $P_{XY}$ (e.g., two augmented views of the same image) and $z_i$ is the associated outcome of the conditioning variable (e.g., the annotative attribute of the image). To convert previous objectives into alternative forms that avoid the conditional sampling procedure, we need to estimate the scoring function $e^{f(x,y)}$ for $(x, y) \sim P_{X|Z} P_{Y|Z}$ in equations 3, 4, and 5 given only $\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}$.

Step II - Kernel Formulation.

We now reformulate the previous objectives using kernel expressions. We denote by $K_{XY} \in \mathbb{R}^{b \times b}$ a kernel gram matrix between $X$ and $Y$: the $i$-th row and $j$-th column of $K_{XY}$ is the exponential scoring function $e^{f(x_i, y_j)}$ (see equation 2), i.e., $[K_{XY}]_{ij} := e^{f(x_i, y_j)}$. $K_{XY}$ is a gram matrix because the design of $e^{f(x,y)}$ satisfies a kernel (cosine similarity is a proper kernel, and the exponential of a proper kernel is also a proper kernel) between $g_{\theta_X}(x)$ and $g_{\theta_Y}(y)$:

$e^{f(x,y)} = \exp\Big({\rm cosine\ similarity}\big(g_{\theta_X}(x), g_{\theta_Y}(y)\big) / \tau\Big) := \big\langle \phi\big(g_{\theta_X}(x)\big), \phi\big(g_{\theta_Y}(y)\big) \big\rangle_{\mathcal{H}},$   (6)

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ and $\phi$ is the corresponding feature map. $K_{XY}$ can also be represented as $K_{XY} = \Phi_X \Phi_Y^{\top}$ with $\Phi_X = \big[\phi\big(g_{\theta_X}(x_1)\big), \cdots, \phi\big(g_{\theta_X}(x_b)\big)\big]^{\top}$ and $\Phi_Y = \big[\phi\big(g_{\theta_Y}(y_1)\big), \cdots, \phi\big(g_{\theta_Y}(y_b)\big)\big]^{\top}$. Similarly, we denote by $K_Z \in \mathbb{R}^{b \times b}$ a kernel gram matrix for $Z$, where $[K_Z]_{ij}$ represents the similarity between $z_i$ and $z_j$: $[K_Z]_{ij} := \big\langle \gamma(z_i), \gamma(z_j) \big\rangle_{\mathcal{G}}$, where $\gamma(\cdot)$ is an arbitrary kernel embedding for $Z$ and $\mathcal{G}$ is its corresponding RKHS. $K_Z$ can also be represented as $K_Z = \Gamma_Z \Gamma_Z^{\top}$ with $\Gamma_Z = \big[\gamma(z_1), \cdots, \gamma(z_b)\big]^{\top}$.
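
The two gram matrices can be formed directly from a mini-batch. The following sketch assumes PyTorch tensors, uses the exponentiated cosine kernel of equation 6 for $K_{XY}$, and uses an RBF kernel for $K_Z$ purely as an example (the experiments also study cosine, linear, and Laplacian kernels); the helper names are illustrative.

```python
# Sketch of the gram matrices in Step II.
import torch
import torch.nn.functional as F

def gram_xy(x_feat: torch.Tensor, y_feat: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """[K_XY]_{ij} = exp(cos_sim(g_X(x_i), g_Y(y_j)) / tau), shape (b, b)."""
    x_feat = F.normalize(x_feat, dim=-1)
    y_feat = F.normalize(y_feat, dim=-1)
    return torch.exp(x_feat @ y_feat.t() / tau)

def gram_z_rbf(z: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """[K_Z]_{ij} = exp(-||z_i - z_j||^2 / (2 sigma^2)) for z of shape (b, d)."""
    sq_dist = torch.cdist(z, z) ** 2
    return torch.exp(-sq_dist / (2 * sigma ** 2))
```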

Step III - Kernel-based Scoring Function ef(x,y)e^{f(x,y)} Estimation.

We present the following:

Definition 2.1 (Kernel Conditional Embedding Operator (Song et al., 2013)).

By the Kernel Conditional Embedding Operator (Song et al., 2013), the finite-sample kernel estimation of $\mathbb{E}_{y \sim P_{Y|Z=z}}\big[\phi\big(g_{\theta_Y}(y)\big)\big]$ is $\Phi_Y^{\top}(K_Z + \lambda \mathbf{I})^{-1} \Gamma_Z\, \gamma(z)$, where $\lambda$ is a hyper-parameter.

Figure 2: $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ vs. $K_Z$. We apply min-max normalization $\big(x - \min(x)\big) / \big(\max(x) - \min(x)\big)$ to both matrices for better visualization. $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ can be seen as a smoothed version of $K_Z$, which suggests that each entry in $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ represents the similarity between $z$'s.
Proposition 2.2 (Estimation of $e^{f(x_i, y)}$ when $y \sim P_{Y|Z=z_i}$).

Given $\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}$, the finite-sample kernel estimation of $e^{f(x_i, y)}$ when $y \sim P_{Y|Z=z_i}$ is $\big[K_{XY}(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{ii}$. Moreover, $\big[K_{XY}(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{ii} = \sum_{j=1}^{b} w(z_j, z_i)\, e^{f(x_i, y_j)}$ with $w(z_j, z_i) = \big[(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{ji}$.

Proof.

For any $Z = z$, Definition 2.1 gives the estimate of $\phi\big(g_{\theta_Y}(y)\big)$ when $y \sim P_{Y|Z=z}$: $\mathbb{E}_{y \sim P_{Y|Z=z}}\big[\phi\big(g_{\theta_Y}(y)\big)\big] \approx \Phi_Y^{\top}(K_Z + \lambda \mathbf{I})^{-1} \Gamma_Z\, \gamma(z)$. Then, we plug in the result for the data pair $(x_i, z_i)$ to estimate $e^{f(x_i, y)}$ when $y \sim P_{Y|Z=z_i}$:
$\big\langle \phi\big(g_{\theta_X}(x_i)\big), \Phi_Y^{\top}(K_Z + \lambda \mathbf{I})^{-1} \Gamma_Z\, \gamma(z_i) \big\rangle_{\mathcal{H}} = {\rm tr}\Big( \phi\big(g_{\theta_X}(x_i)\big)^{\top} \Phi_Y^{\top}(K_Z + \lambda \mathbf{I})^{-1} \Gamma_Z\, \gamma(z_i) \Big) = [K_{XY}]_{i*}(K_Z + \lambda \mathbf{I})^{-1}[K_Z]_{*i} = [K_{XY}]_{i*}\big[(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{*i} = \big[K_{XY}(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{ii}.$ ∎

$[K_{XY}(K_Z + \lambda \mathbf{I})^{-1} K_Z]_{ii}$ is the kernel estimation of $e^{f(x_i, y)}$ when $(x_i, z_i) \sim P_{XZ}$ and $y \sim P_{Y|Z=z_i}$. It defines the similarity between the data pair sampled from $P_{X|Z=z_i} P_{Y|Z=z_i}$. Hence, $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ can be seen as a transformation applied to $K_{XY}$ (which defines the similarity between $X$ and $Y$), converting unconditional similarities between $X$ and $Y$ into conditional ones (conditioning on $Z$). Proposition 2.2 also rewrites this estimation as a weighted sum over $\{e^{f(x_i, y_j)}\}_{j=1}^{b}$ with weights $w(z_j, z_i) = \big[(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{ji}$. Figure 2 compares $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ and $K_Z$, showing that $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ can be seen as a smoothed version of $K_Z$ and suggesting that the weight $\big[(K_Z + \lambda \mathbf{I})^{-1} K_Z\big]_{ji}$ captures the similarity between $(z_j, z_i)$. To conclude, we use the Kernel Conditional Embedding Operator (Song et al., 2013) to avoid explicitly sampling $y \sim P_{Y|Z=z}$, which alleviates the limitation of having insufficient data from $Y$ associated with $z$. It is worth noting that our method neither generates raw data directly nor requires additional training.

In terms of computational complexity, calculating the inverse $(K_Z + \lambda \mathbf{I})^{-1}$ costs $O(b^3)$, where $b$ is the batch size, or $O(b^{2.376})$ using more efficient algorithms such as Coppersmith and Winograd (1987). We use the $O(b^3)$ approach, which is not an issue for our method: mini-batch training constrains the size of $b$, and the inverse $(K_Z + \lambda \mathbf{I})^{-1}$ does not require gradients. The computational bottleneck lies in computing gradients and updating parameters.
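
Under these assumptions, Proposition 2.2 can be computed as in the sketch below; it reuses the hypothetical gram-matrix helpers from the sketch in Step II and solves a linear system instead of forming the inverse explicitly, which is numerically preferable but mathematically equivalent.

```python
# Sketch of Proposition 2.2: estimate e^{f(x_i, y)} for y ~ P_{Y|Z=z_i}.
import torch

def conditional_similarity(k_xy: torch.Tensor, k_z: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    b = k_z.size(0)
    # smoother = (K_Z + lambda I)^{-1} K_Z; smoother[j, i] = w(z_j, z_i)
    smoother = torch.linalg.solve(k_z + lam * torch.eye(b, device=k_z.device), k_z)
    # diagonal of K_XY @ smoother gives [K_XY (K_Z + lambda I)^{-1} K_Z]_{ii}
    return torch.diagonal(k_xy @ smoother)
```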

Step IV - Converting Existing Contrastive Learning Objectives.

We shorthand $K_{XY}(K_Z + \lambda \mathbf{I})^{-1} K_Z$ as $K_{X \perp\!\!\!\perp Y|Z}$, following notation from prior work (Fukumizu et al., 2007) that shorthands $P_{X|Z} P_{Y|Z}$ as $P_{X \perp\!\!\!\perp Y|Z}$. Now, we plug the estimation of $e^{f(x,y)}$ from Proposition 2.2 into the WeaklySup-InfoNCE (equation 3), Fair-InfoNCE (equation 4), and HardNeg-InfoNCE (equation 5) objectives, converting them into Conditional Contrastive Learning with Kernel (CCL-K) objectives:

${\rm WeaklySup_{CCLK}} := \mathbb{E}_{\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}} \Big[ \log \frac{[K_{X \perp\!\!\!\perp Y|Z}]_{ii}}{[K_{X \perp\!\!\!\perp Y|Z}]_{ii} + \sum_{j \neq i} [K_{XY}]_{ij}} \Big].$   (7)
${\rm Fair_{CCLK}} := \mathbb{E}_{\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}} \Big[ \log \frac{[K_{XY}]_{ii}}{[K_{XY}]_{ii} + (b-1)[K_{X \perp\!\!\!\perp Y|Z}]_{ii}} \Big].$   (8)
${\rm HardNeg_{CCLK}} := \mathbb{E}_{\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}} \Big[ \log \frac{[K_{XY}]_{ii}}{[K_{XY}]_{ii} + (b-1)[K_{X \perp\!\!\!\perp Y|Z}]_{ii}} \Big].$   (9)
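
Read literally, equations 7-9 can be implemented as in the sketch below, which again reuses the hypothetical helpers from the previous sketches; this is an illustrative reading of the objectives, not the released implementation.

```python
# Sketch of the CCL-K objectives; losses are negated log-ratios averaged over
# the batch, so minimizing them maximizes equations 7-9.
import torch

def weaklysup_cclk(k_xy: torch.Tensor, k_z: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    pos = conditional_similarity(k_xy, k_z, lam)      # [K_{X ind Y|Z}]_{ii}
    neg = k_xy.sum(dim=1) - torch.diagonal(k_xy)      # sum over j != i of [K_XY]_{ij}
    return -torch.log(pos / (pos + neg)).mean()       # equation 7

def fair_or_hardneg_cclk(k_xy: torch.Tensor, k_z: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    # equations 8 and 9 share the same form; they differ only in what Z is
    pos = torch.diagonal(k_xy)                        # [K_XY]_{ii}
    neg = (k_xy.size(0) - 1) * conditional_similarity(k_xy, k_z, lam)
    return -torch.log(pos / (pos + neg)).mean()
```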

3 Related Work

The majority of the literature on contrastive learning focuses on self-supervised learning tasks (Oord et al., 2018), which leverage unlabeled samples for pretraining representations and then use the learned representations for downstream tasks. Its applications span various domains, including computer vision (Hjelm et al., 2018), natural language processing (Kong et al., 2019), speech processing (Baevski et al., 2020), and interdisciplinary settings across vision and language (Radford et al., 2021). Beyond the empirical success, Arora et al. (2019); Lee et al. (2020); Tsai et al. (2021d) provide theoretical guarantees, showing that contrastive learning can reduce the sample complexity of downstream tasks. Standard self-supervised contrastive frameworks consider objectives that require only the data's pairing information: they learn similar representations between different views of the same data (augmented variants of the same image (Chen et al., 2020) or an image-caption pair (Radford et al., 2021)) and dissimilar representations between two random data points. We refer to these frameworks as unconditional contrastive learning, in contrast to this paper's focus, conditional contrastive learning, which considers contrastive objectives that take additional conditioning variables into account. Such conditioning variables can be sensitive information (Tsai et al., 2021c), auxiliary information (Tsai et al., 2021a), downstream labels (Khosla et al., 2020), or the data's embedded features (Robinson et al., 2020; Wu et al., 2020; Kalantidis et al., 2020). It is worth noting that, with additional conditioning variables, conditional contrastive frameworks extend the self-supervised setting to weakly supervised (Tsai et al., 2021a) or supervised learning (Khosla et al., 2020).

Our paper also relates to the literature on few-shot conditional generation (Sinha et al., 2021), which aims to model the conditional generative probability (generating instances according to a conditioning variable) given only a limited amount of paired data (pairs of an instance and its corresponding conditioning variable). Its applications span conditional mutual information estimation (Mondal et al., 2020), noisy signal recovery (Candes et al., 2006), image manipulation (Park et al., 2020; Sinha et al., 2021), etc. These applications require generating authentic data, which is notoriously challenging (Goodfellow et al., 2014; Arjovsky et al., 2017). On the contrary, our method models the conditional generative probability via the Kernel Conditional Embedding Operator (Song et al., 2013), which generates kernel embeddings rather than raw data. Ton et al. (2021) is related to our work in that it also uses conditional mean embeddings to perform estimation with respect to the conditional distribution. The difference is that Ton et al. (2021) aims to improve conditional density estimation, while this paper aims to resolve the challenge of insufficient samples for the conditioning variable. Also, both Ton et al. (2021) and this work consider noise-contrastive methods: Ton et al. (2021) discusses noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010), while this work builds on the InfoNCE objective (Oord et al., 2018), which is inspired by NCE.

Our proposed method can also be connected to domain generalization (Blanchard et al., 2017) if we treat each $z_i$ as a domain or task indicator (Tsai et al., 2021c). Specifically, Tsai et al. (2021c) consider a conditional contrastive learning setup in which one task performs contrastive learning on data from multiple domains, conditioning on domain indicators to reduce domain-specific information for better generalization. This paper further extends that idea by using conditional mean embeddings to address the challenge of insufficient data for certain outcomes of the conditioning variable (in this case, domains).

4 Experiments

We conduct experiments on the conditional contrastive learning frameworks discussed in Section 2.2: Section 4.1 for weakly supervised contrastive learning, Section 4.2 for fair contrastive learning, and Section 4.3 for hard negative contrastive learning.

Experimental Protocol.

We consider the setup from the recent contrastive learning literature (Chen et al., 2020; Robinson et al., 2020; Wu et al., 2020), which consists of pre-training, fine-tuning, and evaluation stages. In the pre-training stage, on the data's training split, we update the parameters of the feature encoder (i.e., $g_{\theta_{\cdot}}(\cdot)$ in equation 2) using the contrastive learning objectives (e.g., ${\rm InfoNCE}$ (equation 1), ${\rm WeaklySup_{InfoNCE}}$ (equation 3), or ${\rm WeaklySup_{CCLK}}$ (equation 7)). In the fine-tuning stage, we fix the parameters of the feature encoder and add a small fine-tuning network on top of it; on the data's training split, we fine-tune this small network with the downstream labels. In the evaluation stage, we evaluate the fine-tuned representations on the data's test split. We adopt ResNet-50 (He et al., 2016) or LeNet-5 (LeCun et al., 1998) as the feature encoder and a linear layer as the fine-tuning network. All experiments are performed using the LARS optimizer (You et al., 2017). More details can be found in the Appendix and our released code.
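
For concreteness, a minimal sketch of the fine-tuning stage is given below, assuming a frozen pretrained encoder and a linear classification head; the actual hyper-parameters and the LARS setup are omitted, and plain SGD is used here only for brevity.

```python
# Sketch of the fine-tuning stage: freeze the encoder, train a linear head.
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, train_loader, num_classes: int, feat_dim: int, epochs: int = 10):
    encoder.eval()                                   # feature encoder stays fixed
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=0.1)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images)              # frozen representations
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```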

4.1 Weakly Supervised Contrastive Learning

In this subsection, we perform experiments within the weakly supervised contrastive learning framework (Tsai et al., 2021a), which considers auxiliary information as the conditioning variable $Z$. It aims to learn similar representations for a pair of data with similar outcomes from the conditioning variable (i.e., similar auxiliary information), and vice versa.

Datasets and Metrics.

We consider three visual datasets in this set of experiments. Data $X$ and $Y$ represent images after applying arbitrary image augmentations. 1) UT-Zappos (Yu and Grauman, 2014): it contains 50,025 shoe images over 21 shoe categories. Each image is annotated with 7 binomially-distributed attributes as auxiliary information, which we convert into 126 binary (Bernoulli-distributed) attributes. 2) CUB (Wah et al., 2011): it contains 11,788 bird images spanning 200 fine-grained bird species, with 312 binary attributes attached to each image. 3) ImageNet-100 (Russakovsky et al., 2015): a subset of the ImageNet-1k dataset (Russakovsky et al., 2015) containing 0.12 million images spanning 100 categories. This dataset does not come with auxiliary information, so we extract 512-dimensional visual features from the CLIP model (Radford et al., 2021) (a large pre-trained visual model with natural language supervision) to serve as its auxiliary information. Note that we consider different types of auxiliary information: for UT-Zappos and CUB, discrete human-annotated attributes; for ImageNet-100, continuous pre-trained language-enriched features. We report the top-1 accuracy as the metric for the downstream classification task.

Methodology.

We consider the ${\rm WeaklySup_{CCLK}}$ objective (equation 7) as the main method. For ${\rm WeaklySup_{CCLK}}$, we perform the sampling process $\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}$ by first sampling an image ${\rm im}_i$ along with its auxiliary information $z_i$, and then applying different image augmentations to ${\rm im}_i$ to obtain $(x_i, y_i)$. We also study different types of kernels for $K_Z$. We select two baseline methods. The first is the unconditional contrastive learning method: the ${\rm InfoNCE}$ objective (Oord et al., 2018; Chen et al., 2020) (equation 1). The second is the conditional contrastive learning baseline: the ${\rm WeaklySup_{InfoNCE}}$ objective (equation 3). The difference between ${\rm WeaklySup_{CCLK}}$ and ${\rm WeaklySup_{InfoNCE}}$ is that the latter requires sampling a pair of data with the same outcome of the conditioning variable (i.e., $(x, y) \sim P_{X|Z=z} P_{Y|Z=z}$). However, as discussed in Section 2.3, directly performing conditional sampling is challenging when there is not enough data to support the conditional sampling procedure. This limitation exists in our datasets: CUB has on average 1.001 data points per configuration of $Z$, and ImageNet-100 has only 1 data point per configuration of $Z$, since its conditioning variable is continuous and each instance in the dataset has a unique $Z$. To circumvent this limitation, ${\rm WeaklySup_{InfoNCE}}$ clusters the data so that data within the same cluster are abundant and have similar auxiliary information, and then treats the cluster assignment as the new conditioning variable. The result of ${\rm WeaklySup_{InfoNCE}}$ is reported using the optimal number of clusters selected via cross-validation.
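
A sketch of one ${\rm WeaklySup_{CCLK}}$ training step under these assumptions is shown below; augment and encoder are placeholders, the attribute kernel is cosine (one of the kernel choices studied in Table 2), and the loss helpers are the hypothetical ones sketched in Section 2.3.

```python
# Sketch of one WeaklySup_CCLK step on a batch of raw images with attributes.
import torch
import torch.nn.functional as F

def weaklysup_step(images, attributes, encoder, augment, tau=0.1, lam=1e-2):
    x, y = augment(images), augment(images)          # two augmented views
    k_xy = gram_xy(encoder(x), encoder(y), tau)      # (b, b) view similarities
    z = F.normalize(attributes.float(), dim=-1)
    k_z = z @ z.t()                                  # cosine kernel over attributes
    return weaklysup_cclk(k_xy, k_z, lam)
```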

Method | UT-Zappos | CUB | ImageNet-100
Unconditional Contrastive Learning Methods
InfoNCE | 77.8 ± 1.5 | 14.1 ± 0.7 | 76.2 ± 0.3
Conditional Contrastive Learning Methods
${\rm WeaklySup_{InfoNCE}}$ | 84.6 ± 0.4 | 20.6 ± 0.5 | 81.4 ± 0.4
${\rm WeaklySup_{CCLK}}$ (ours) | 86.6 ± 0.7 | 29.9 ± 0.3 | 82.4 ± 0.5

Kernel | UT-Zappos | CUB | ImageNet-100
RBF | 86.5 ± 0.5 | 32.3 ± 0.5 | 81.8 ± 0.4
Laplacian | 86.8 ± 0.3 | 32.1 ± 0.5 | 80.2 ± 0.3
Linear | 86.5 ± 0.4 | 29.4 ± 0.8 | 77.5 ± 0.3
Cosine | 86.6 ± 0.7 | 29.9 ± 0.3 | 82.4 ± 0.5

Table 2: Object classification accuracy (%) under the weakly supervised contrastive learning setup. Top: results of the proposed method and the baselines. Bottom: different kernel choices for ${\rm WeaklySup_{CCLK}}$.

Results.

We show the results in Table 2. First, ${\rm WeaklySup_{CCLK}}$ shows consistent improvements over the unconditional baseline InfoNCE, with absolute improvements of 8.8%, 15.8%, and 6.2% on UT-Zappos, CUB, and ImageNet-100, respectively. This is because the conditional method utilizes the additional auxiliary information (Tsai et al., 2021a). Second, ${\rm WeaklySup_{CCLK}}$ performs better than ${\rm WeaklySup_{InfoNCE}}$, with absolute improvements of 2%, 9.3%, and 1%. We attribute the better performance of ${\rm WeaklySup_{CCLK}}$ over ${\rm WeaklySup_{InfoNCE}}$ to the following: ${\rm WeaklySup_{InfoNCE}}$ first performs clustering on the auxiliary information and treats the resulting clusters as the conditioning variable, while ${\rm WeaklySup_{CCLK}}$ directly uses the auxiliary information as the conditioning variable. The clustering in ${\rm WeaklySup_{InfoNCE}}$ may lose precision of the auxiliary information and may negatively affect the quality of the auxiliary information incorporated into the representation. The ablation study on the kernel choice shows consistent performance of ${\rm WeaklySup_{CCLK}}$ across different kernels on UT-Zappos, CUB, and ImageNet-100, where we consider the RBF, Laplacian, Linear, and Cosine kernels. Most kernels perform similarly, except that the linear kernel is worse than the others on ImageNet-100 (by at least 2.7%).

4.2 Fair Contrastive Learning

In this subsection, we perform experiments within the fair contrastive learning framework (Tsai et al., 2021c), which considers sensitive information as the conditioning variable $Z$. It fixes the outcome of the sensitive variable for both the positively-paired and negatively-paired samples in the contrastive learning process, which leads the representations to ignore the effect of the sensitive variable.

Datasets and Metrics.

Our experiments focus on continuous sensitive information, to echo the limitation of the conditional sampling procedure discussed in Section 2.3. Nonetheless, existing datasets mostly consider discrete sensitive variables, such as gender or race. Therefore, we synthetically create the ColorMNIST dataset, which randomly assigns a continuous RGB color value to the background of each handwritten digit image in the MNIST dataset (LeCun et al., 1998). We consider the background color as sensitive information. For statistics, ColorMNIST has 60,000 colored digit images across 10 digit labels. Similar to Section 4.1, data $X$ and $Y$ represent images after applying arbitrary image augmentations. Our goal is two-fold: we want to see how well the learned representations 1) perform on the downstream classification task and 2) ignore the effect of the sensitive variable. For the former, we report the top-1 accuracy as the metric; for the latter, we report the Mean Square Error (MSE) when trying to predict the color information. Note that for the MSE score, higher is better, since we would like the learned representations to contain less color information.

Method | Top-1 Accuracy (↑) | MSE (↑)
Unconditional Contrastive Learning Methods
InfoNCE | 84.1 ± 1.8 | 48.8 ± 4.5
Conditional Contrastive Learning Methods
${\rm Fair_{InfoNCE}}$ | 85.9 ± 0.4 | 64.9 ± 5.1
${\rm Fair_{CCLK}}$ (ours) | 86.4 ± 0.9 | 64.7 ± 3.9
Table 3: Classification accuracy (%) under the fair contrastive learning setup, and the MSE (higher is better) between the color in an image and the color predicted from the image's representation. A higher MSE indicates that less color information from the original image is contained in the learned representation.

Methodology.

We consider the ${\rm Fair_{CCLK}}$ objective (equation 8) as the main method. For ${\rm Fair_{CCLK}}$, we perform the sampling process $\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}$ by first sampling a digit image ${\rm im}_i$ along with its sensitive information $z_i$ and then applying different image augmentations to ${\rm im}_i$ to obtain $(x_i, y_i)$. We select the unconditional contrastive learning method, the ${\rm InfoNCE}$ objective (equation 1), as our baseline. We also consider the ${\rm Fair_{InfoNCE}}$ objective (equation 4) as a baseline, by clustering the continuous conditioning variable $Z$ into 3, 5, 10, 15, or 20 clusters using K-means.

Results.

We show the results in Table 3. ${\rm Fair_{CCLK}}$ is consistently better than InfoNCE: the absolute accuracy improvement is 2.3% and the relative improvement in MSE (higher is better) is 32.6%. We report the result using the Cosine kernel and provide an ablation study of different kernel choices in the Appendix. This result suggests that the proposed ${\rm Fair_{CCLK}}$ achieves better downstream classification accuracy while ignoring more sensitive information (color information) compared to the unconditional baseline, i.e., our method achieves a better level of fairness (by excluding color bias) without sacrificing performance. Next we compare to the ${\rm Fair_{InfoNCE}}$ baseline; we report its result using the 10-cluster partition of $Z$, which achieves the best top-1 accuracy. Compared to ${\rm Fair_{InfoNCE}}$, ${\rm Fair_{CCLK}}$ is better in downstream accuracy and slightly worse in MSE (a difference of 0.2). For ${\rm Fair_{InfoNCE}}$, as the number of discretized values of $Z$ increases, the MSE generally grows, but the accuracy peaks at 10 clusters and then declines. This suggests that ${\rm Fair_{InfoNCE}}$ can remove more sensitive information as the granularity of $Z$ increases, but may hurt downstream task performance. Overall, ${\rm Fair_{CCLK}}$ performs slightly better than ${\rm Fair_{InfoNCE}}$ and does not need clustering to discretize $Z$.

4.3 Hard-Negatives Contrastive Learning

In this subsection, we perform experiments within the hard negative contrastive learning framework (Robinson et al., 2020), which considers the embedded features as the conditioning variable $Z$. Different from conventional contrastive learning methods (Chen et al., 2020; He et al., 2020), which learn dissimilar representations for any pair of random data, hard negative contrastive learning methods learn dissimilar representations for a pair of random data when they have similar embedded features (i.e., similar outcomes of the conditioning variable).

Datasets and Metrics.

We consider two visual datasets in this set of experiments. Data $X$ and $Y$ represent images after applying arbitrary image augmentations. 1) CIFAR-10 (Krizhevsky et al., 2009): it contains 60,000 images spanning 10 classes, e.g., automobile, plane, or dog. 2) ImageNet-100 (Russakovsky et al., 2015): the same dataset used in Section 4.1. We report the top-1 accuracy as the metric for the downstream classification task.

Methodology.

We consider the ${\rm HardNeg_{CCLK}}$ objective (equation 9) as the main method. For ${\rm HardNeg_{CCLK}}$, we perform the sampling process $\{(x_i, y_i, z_i)\}_{i=1}^{b} \sim P_{XYZ}^{\otimes b}$ by first sampling an image ${\rm im}_i$, then applying different image augmentations to ${\rm im}_i$ to obtain $(x_i, y_i)$, and finally defining $z_i = g_{\theta_X}(x_i)$. We select two baseline methods. The first is the unconditional contrastive learning method: the ${\rm InfoNCE}$ objective (Oord et al., 2018; Chen et al., 2020) (equation 1). The second is the conditional contrastive learning baseline: the ${\rm HardNeg_{InfoNCE}}$ objective (Robinson et al., 2020), for which we report results obtained with the authors' released code.
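
A sketch of one ${\rm HardNeg_{CCLK}}$ step under these assumptions is shown below; detaching the embedding that plays the role of $Z$ is an assumption of this sketch (in line with the gradient-free smoothing matrix noted in Section 2.3), not a statement about the released code, and the RBF kernel over embeddings is likewise only an example.

```python
# Sketch of one HardNeg_CCLK step: the conditioning variable is the anchor's
# own embedding, z_i = g_{theta_X}(x_i).
import torch

def hardneg_step(images, encoder, augment, tau=0.1, sigma=1.0, lam=1e-2):
    x, y = augment(images), augment(images)
    fx, fy = encoder(x), encoder(y)
    k_xy = gram_xy(fx, fy, tau)
    k_z = gram_z_rbf(fx.detach(), sigma)             # kernel over detached embeddings
    return fair_or_hardneg_cclk(k_xy, k_z, lam)
```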

Method | CIFAR-10 | ImageNet-100
Unconditional Contrastive Learning Methods
InfoNCE | 89.9 ± 0.2 | 77.8 ± 0.4
Conditional Contrastive Learning Methods
${\rm HardNeg_{InfoNCE}}$ | 91.4 ± 0.2 | 79.2 ± 0.5
${\rm HardNeg_{CCLK}}$ (ours) | 91.7 ± 0.1 | 81.2 ± 0.2
Table 4: Classification accuracy (%) under the hard negative contrastive learning setup.

Results.

From Table 4, first, ${\rm HardNeg_{CCLK}}$ consistently performs better than the InfoNCE baseline, with absolute improvements of 1.8% and 3.1% on CIFAR-10 and ImageNet-100, respectively. This suggests that hard negative sampling effectively improves downstream performance, in accordance with the observation by Robinson et al. (2020). Next, ${\rm HardNeg_{CCLK}}$ also performs better than the ${\rm HardNeg_{InfoNCE}}$ baseline, with absolute improvements of 0.3% and 2.0% on CIFAR-10 and ImageNet-100, respectively. Both methods construct hard negatives by assigning a higher weight to a randomly paired data point that is close in the embedding space and a lower weight to one that is far away. The implementation by Robinson et al. (2020) uses Euclidean distances to measure similarity, while ${\rm HardNeg_{CCLK}}$ uses the smoothed kernel similarity (i.e., $(K_Z + \lambda \mathbf{I})^{-1} K_Z$ in Proposition 2.2). Empirically, our approach performs better.

5 Conclusion

In this paper, we present CCL-K, the Conditional Contrastive Learning objectives with Kernel expressions. CCL-K avoids the need to perform explicit conditional sampling in conditional contrastive learning frameworks, alleviating the insufficient data problem for the conditioning variable. CCL-K uses the Kernel Conditional Embedding Operator, which first defines kernel similarities between the conditioning variable's values and then samples data that have similar values of the conditioning variable. CCL-K can directly work with continuous conditioning variables, while prior work requires binning or clustering to ensure sufficient data for each bin or cluster. Empirically, CCL-K also outperforms conditional contrastive baselines tailored for weakly supervised contrastive learning, fair contrastive learning, and hard negative contrastive learning. An interesting future direction is to add more flexibility to CCL-K by relaxing the kernel similarity to arbitrary similarity measurements.

6 Ethics Statement

Because our method improves the removal of sensitive information from contrastive learning representations, our contribution can have a positive impact on fairness and privacy, where biases or user-specific information should be excluded from the representation. However, the conditioning variable must be predefined, so our method cannot directly remove biases that are implicit and not captured by a variable in the dataset.

7 Reproducibility Statement

The supplementary material provides an anonymous source code link for reproducing our results, along with complete files that reproduce the data processing steps for each dataset we use.

Acknowledgements

The authors would like to thank the anonymous reviewers for helpful comments and suggestions. This work is partially supported by the National Science Foundation IIS1763562, IARPA D17PC00340, ONR Grant N000141812861, Facebook PhD Fellowship, BMW, National Science Foundation awards 1722822 and 1750439, and National Institutes of Health awards R01MH125740, R01MH096951 and U01MH116925. KZ would like to acknowledge the support by the National Institutes of Health (NIH) under Contract R01HL159805, by the NSF-Convergence Accelerator Track-D award 2134901, and by the United States Air Force under Contract No. FA8650-17-C7715. HZ would like to thank support from a Facebook research award. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors, and no official endorsement should be inferred.

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Arora et al. [2019] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Bachman et al. [2019] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
  • Baevski et al. [2020] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
  • Blanchard et al. [2017] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910, 2017.
  • Candes et al. [2006] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • Coppersmith and Winograd [1987] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 1–6, 1987.
  • Fukumizu et al. [2007] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In NIPS, volume 20, pages 489–496, 2007.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Gutmann and Hyvärinen [2010] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • Hjelm et al. [2018] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • Kalantidis et al. [2020] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. arXiv preprint arXiv:2010.01028, 2020.
  • Khosla et al. [2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
  • Kong et al. [2019] Lingpeng Kong, Cyprien de Masson d’Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350, 2019.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee et al. [2020] Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.
  • Liu and Nocedal [1989] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503–528, 1989.
  • Mondal et al. [2020] Arnab Mondal, Arnab Bhattacharjee, Sudipto Mukherjee, Himanshu Asnani, Sreeram Kannan, and AP Prathosh. C-mi-gan: Estimation of conditional mutual information using minmax formulation. In Conference on Uncertainty in Artificial Intelligence, pages 849–858. PMLR, 2020.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Park et al. [2020] Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman, Alexei A Efros, and Richard Zhang. Swapping autoencoder for deep image manipulation. arXiv preprint arXiv:2007.00653, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • Robinson et al. [2020] Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592, 2020.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • Schölkopf et al. [2002] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • Sinha et al. [2021] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-denoising models for few-shot conditional generation. arXiv preprint arXiv:2106.06819, 2021.
  • Song et al. [2013] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
  • Ton et al. [2021] Jean-Francois Ton, Lucian Chan, Yee Whye Teh, and Dino Sejdinovic. Noise contrastive meta-learning for conditional density estimation using kernel mean embeddings. In International Conference on Artificial Intelligence and Statistics, pages 1099–1107. PMLR, 2021.
  • Tsai et al. [2021a] Yao-Hung Hubert Tsai, Tianqin Li, Weixin Liu, Peiyuan Liao, Ruslan Salakhutdinov, and Louis-Philippe Morency. Integrating auxiliary information in self-supervised learning. arXiv preprint arXiv:2106.02869, 2021a.
  • Tsai et al. [2021b] Yao-Hung Hubert Tsai, Martin Q Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, and Ruslan Salakhutdinov. Self-supervised representation learning with relative predictive coding. arXiv preprint arXiv:2103.11275, 2021b.
  • Tsai et al. [2021c] Yao-Hung Hubert Tsai, Martin Q Ma, Han Zhao, Kun Zhang, Louis-Philippe Morency, and Ruslan Salakhutdinov. Conditional contrastive learning: Removing undesirable information in self-supervised representations. arXiv preprint arXiv:2106.02866, 2021c.
  • Tsai et al. [2021d] Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. In ICLR, 2021d.
  • Tschannen et al. [2019] Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
  • Wah et al. [2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • Wu et al. [2020] Mike Wu, Milan Mosse, Chengxu Zhuang, Daniel Yamins, and Noah Goodman. Conditional negative sampling for contrastive learning of visual representations. arXiv preprint arXiv:2010.02037, 2020.
  • You et al. [2017] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
  • Yu and Grauman [2014] Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, 2014.

Appendix A Code

Our code is available at the following GitHub link: https://github.com/Crazy-Jack/CCLK-release

Appendix B Ablation of Different Choices of Kernels

We report the kernel choice ablation study for FairCCLK and HardNegCCLK (Table 5). For FairCCLK, we consider the synthetically created ColorMNIST dataset, where the background color of each digit image is randomly assigned. We report the accuracy of the downstream classification task, as well as the Mean Square Error (MSE) between the assigned background color and the color predicted from the learned representation. A higher MSE score is better because we would like the learned representations to contain less color information. In Table 5, we consider the following kernel functions: RBF, Polynomial (degree 3), Laplacian, and Cosine. The performances using different kernels are consistent.

For HardNegCCLK, we consider the CIFAR-10 dataset for the ablation study and use the top-1 object classification accuracy as the metric. In Table 5, we consider the following kernel functions: Linear, RBF, and Polynomial (degree 3). The performances using different kernels are consistent.

Lastly, we also provide an ablation of the bandwidth σ² of the RBF kernel. Table 7 shows the results of WeaklySupCCLK with different values of σ². Using σ² = 1 significantly hurts performance (only 3.0%), and σ² = 1000 is also sub-optimal, while the performances for σ² = 10, 100, and 500 are close. The performance is therefore stable over a fairly wide range of bandwidths but degrades at extreme values.
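For concreteness, the sketch below shows one way the kernel similarity matrices used in these ablations could be computed over a batch of conditioning variables. The function name, the Laplacian scale parameterization, and the example shapes are ours for illustration, not taken from the released code.

```python
import torch

def kernel_matrix(z, kernel="rbf", sigma2=100.0, degree=3):
    """Pairwise kernel similarity K(Z, Z) over a batch of conditioning variables.

    z: (batch_size, dim) tensor. The choices mirror the ablation: linear, RBF
    (bandwidth sigma^2), polynomial (degree 3), Laplacian, and cosine.
    """
    if kernel == "linear":
        return z @ z.t()
    if kernel == "rbf":
        sq_dists = torch.cdist(z, z, p=2) ** 2
        return torch.exp(-sq_dists / (2.0 * sigma2))
    if kernel == "laplacian":
        # One possible scale parameterization; other choices are equally valid.
        return torch.exp(-torch.cdist(z, z, p=1) / sigma2 ** 0.5)
    if kernel == "polynomial":
        return (z @ z.t() + 1.0) ** degree
    if kernel == "cosine":
        z_norm = torch.nn.functional.normalize(z, dim=1)
        return z_norm @ z_norm.t()
    raise ValueError(f"unknown kernel: {kernel}")

# Example: an RBF kernel matrix over a batch of 128 conditioning vectors.
z = torch.randn(128, 7)
K = kernel_matrix(z, kernel="rbf", sigma2=100.0)  # shape (128, 128)
```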

Appendix C Additional Results on CIFAR10

In Section 4.3 in the main text, we report the results of HardNegCCLK and the baseline methods on the CIFAR-10 dataset trained for 400 epochs with a batch size of 256. Here we also include results with a larger batch size (512) and longer training (1000 epochs), summarized in Table 6. For reference, we refer to the 256-batch-size, 400-epoch procedure as setting 1 and the 512-batch-size, 1000-epoch procedure as setting 2. From Table 6 we observe that HardNegCCLK still outperforms HardNegInfoNCE, suggesting that our method remains effective. However, the performance gap between the vanilla InfoNCE method and the hard negative mining approaches shrinks in general, because the benefit of hard negative mining mainly lies in training efficiency, i.e., fewer training iterations are spent discriminating negative samples that have already been pushed apart.

Appendix D FairInfoNCE on ColorMNIST

Here we include the results of performing FairInfoNCE on the ColorMNIST dataset as a baseline. We implemented FairInfoNCE based on Tsai et al. [2021c], where the idea is to remove the information of the conditioning variable by sampling positive and negative pairs from the same outcome of the conditioning variable at each iteration. As in our previous setup in Section 4.2, we evaluate the results by the accuracy of the downstream classification task and by the MSE value; for both metrics, higher is better, since a larger MSE means the representation contains less color information for reconstructing the background color. To perform FairInfoNCE conditioned on the continuous color information, we need to discretize the color variable. We use K-means to cluster samples based on their color values, with the following numbers of clusters: {3, 5, 10, 15, 20}.
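A minimal sketch of this discretization step, assuming the background colors are stored as RGB vectors (the array contents below are placeholders for the actual ColorMNIST colors):

```python
import numpy as np
from sklearn.cluster import KMeans

# colors: (num_samples, 3) array of RGB background colors, one per image.
colors = np.random.rand(60000, 3)  # placeholder for the actual ColorMNIST colors

# Discretize the continuous color variable into k clusters so that
# FairInfoNCE can condition on a discrete outcome at each iteration.
for k in [3, 5, 10, 15, 20]:
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(colors)
    # cluster_ids[i] is the discrete conditioning outcome assigned to image i.
```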

As shown in Table 8, as the number of clusters increases, the downstream accuracy first increases and then drops, peaking at 10 clusters. The MSE values continue to increase as the number of clusters increases. This suggests that FairInfoNCE can remove more sensitive information as the granularity of Z increases, but the downstream task performance may decrease.

Appendix E Performance under insufficient number of samples

We illustrate why insufficient samples are a problem for conditional contrastive learning by comparing performance under different amounts of conditioning data. Specifically, in Figure 3 the x-axis is the average number of data samples per discretized conditioning variable (cluster) when using a framework such as WeaklySupInfoNCE. The dataset is UT-Zappos and the conditioning variable is the annotative attributes; the discretization is done by grouping instances that share the same annotative attributes into the same cluster. The blue line is WeaklySupInfoNCE, which requires discretized conditioning variables, and the black line is the proposed WeaklySupCCLK, which does not require discretization. The performance of WeaklySupInfoNCE suffers when the number of samples per cluster is small (towards the origin of the figure), and WeaklySupCCLK outperforms WeaklySupInfoNCE in all cases, with the largest margin when the data is most insufficient.

Figure 3: Illustration of the problem of insufficient samples in conditional contrastive learning. When the average number of samples per outcome (cluster) of the conditioning variable is small (towards the left of the x-axis), the previous conditional contrastive learning framework WeaklySupInfoNCE (blue) suffers, while the proposed WeaklySupCCLK (black) outperforms WeaklySupInfoNCE significantly and remains stable regardless of whether the samples are sufficient (towards the right of the x-axis) or insufficient (towards the left).

Appendix F Dataset Details

We provide the training details of experiments conducted on the following datasets: UT-Zappos50K [Yu and Grauman, 2014], CUB-200-2011 [Wah et al., 2011], CIFAR-10 [Krizhevsky et al., 2009], ColorMNIST (our creation), and ImageNet-100 [Russakovsky et al., 2015].

F.1 UT-Zappos50K

The following section describes the experiments we performed on the UT-Zappos50K dataset.

Accessibility

The dataset was introduced by Yu and Grauman [2014] and is available at: http://vision.cs.utexas.edu/projects/finegrained/utzap50k. The dataset is for non-commercial use only.

FairCCLK | Top-1 Accuracy (↑) | MSE (↑)
RBF kernel | 86.2 ± 0.5 | 57.6 ± 10.6
Polynomial kernel | 86.7 ± 0.5 | 61.3 ± 9.4
Laplacian kernel | 85.0 ± 0.9 | 72.8 ± 13.2
Cosine kernel | 86.4 ± 0.9 | 64.7 ± 3.9

HardNegCCLK | CIFAR-10 Top-1 Accuracy (↑)
Linear kernel | 91.5 ± 0.2
RBF kernel | 91.7 ± 0.1
Polynomial kernel | 90.3 ± 0.4

Table 5: Ablation study of different kernel choices. Top: digit classification accuracy of FairCCLK on ColorMNIST, and MSE between the color in the original image and the color predicted from the learned representation of that image; higher MSE is better because we intend to remove color information from the representation. Bottom: classification accuracy of HardNegCCLK on CIFAR-10 object classification. The performances using different kernels are consistent in both settings.
Method | CIFAR-10 (setting 1) | CIFAR-10 (setting 2)
Unconditional Contrastive Learning Methods
InfoNCE | 89.9 ± 0.2 | 93.4 ± 0.1
Conditional Contrastive Learning Methods
HardNegInfoNCE | 91.4 ± 0.2 | 93.6 ± 0.2
HardNegCCLK (ours) | 91.7 ± 0.1 | 93.9 ± 0.1
Table 6: Results of HardNegCCLK on the CIFAR-10 dataset under two training settings (setting 1: batch size 256, 400 epochs; setting 2: batch size 512, 1000 epochs).
RBF σ² | 1 | 10 | 100 | 500 | 1000
Accuracy (%) | 3.0 ± 1.5 | 30.9 ± 0.3 | 32.2 ± 0.4 | 32.0 ± 0.2 | 24.4 ± 0.9
Table 7: Results of WeaklySupCCLK under different hyper-parameters σ² of the RBF kernel on the CUB dataset.
Number of Clusters | Top-1 Accuracy (↑) | MSE (↑)
3 | 82.12 ± 0.3 | 56.27 ± 4.9
5 | 84.55 ± 0.4 | 58.67 ± 4.8
10 | 85.90 ± 0.4 | 64.91 ± 5.1
15 | 85.22 ± 0.4 | 65.02 ± 5.0
20 | 84.23 ± 0.3 | 65.11 ± 4.9
Table 8: Results of FairInfoNCE on the ColorMNIST dataset with different numbers of clusters.

Data Processing

The dataset contains images of shoes from Zappos.com. We downsample the images to 32×32. The official dataset has 4 top-level categories that are further divided into 21 sub-categories; we use the 21 sub-categories for all our classification tasks. The dataset comes with 7 discrete attributes as auxiliary information, which we binarize into 126 binary attributes. We take this binary attribute vector as the conditioning variable Z.
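As an illustration of the binarization step, the sketch below expands discrete attributes into binary indicator columns with one-hot encoding; the attribute values shown are invented placeholders, not the actual UT-Zappos attribute vocabulary.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Placeholder attribute table: each row holds the 7 discrete annotative
# attributes of one shoe image (values here are made up for illustration).
attrs = np.array([
    ["boot", "lace-up", "rubber", "round", "high", "leather", "casual"],
    ["sandal", "slip-on", "rubber", "open", "flat", "synthetic", "sport"],
])

# Expand each discrete attribute into binary indicator columns; the
# concatenation of all indicators forms the conditioning variable Z.
z = OneHotEncoder().fit_transform(attrs).toarray()
print(z.shape)  # (num_images, total_number_of_binary_attributes)
```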

Training and Test Split: We randomly split the images into training and validation sets with a 7:3 ratio, resulting in 35,017 training images and 15,008 validation images.

Network Design and Optimization

We use the ResNet-50 architecture as the backbone of the encoder. To compensate for the 32×32 image size, we change the first 7×7 2D convolution to a 3×3 2D convolution and remove the first max-pooling layer of the standard ResNet-50 (see code for details), which allows finer-grained information processing. On top of the modified ResNet-50 encoder, we include a 2048-2048-128 Multi-Layer Perceptron (MLP) as the projection head, with batch normalization after each 2048-dimensional layer. During evaluation, we discard the projection head and train a linear layer on top of the encoder's output. We train all experiments for 1000 epochs with the LARS optimizer (base learning rate 1.5, scaled by the batch size divided by 256) and a batch size of 152 on 4 NVIDIA 1080Ti GPUs. It takes about 16 hours to finish the 1000-epoch training.
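A minimal sketch of the small-image modification and projection head described above, based on torchvision's ResNet-50; this is an approximation for illustration, not an excerpt from the released code.

```python
import torch.nn as nn
from torchvision.models import resnet50

def small_image_resnet50(proj_dim=128):
    """ResNet-50 encoder adapted to 32x32 inputs, plus a 2048-2048-128 projection head."""
    encoder = resnet50()
    # Replace the 7x7 stride-2 stem with a 3x3 stride-1 convolution and drop
    # the first max-pooling layer so small images are not downsampled too early.
    encoder.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    encoder.maxpool = nn.Identity()
    encoder.fc = nn.Identity()  # expose the 2048-dimensional pooled features

    projection_head = nn.Sequential(
        nn.Linear(2048, 2048),
        nn.BatchNorm1d(2048),
        nn.ReLU(inplace=True),
        nn.Linear(2048, proj_dim),
    )
    return encoder, projection_head
```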

F.2 CUB-200-2011

The following section describes the experiments we performed on the CUB-200-2011 dataset.

Accessibility

CUB-200-2011 was created by Wah et al. [2011] and is a fine-grained dataset of bird species. It can be downloaded from: http://www.vision.caltech.edu/visipedia/CUB-200-2011.html. Its usage is restricted to non-commercial research and educational purposes.

Data Processing

The original dataset contains 200 bird categories over 11,788 images, with 312 binary attributes attached to each image. Images are rescaled to 224×224.

Train and Test Split: The official split contains 5,994 training images and 5,794 test images. We combine the original training and validation sets as our training set and use the original test set as our validation set; the resulting training set contains 6,871 images and the validation set contains 6,918 images.

Network Design and Optimization

We use the ResNet-50 architecture as the encoder and a 2048-2048-128 MLP as the projection head, with batch normalization after each 2048-dimensional layer. Unlike for the UT-Zappos dataset, we directly employ the original ResNet-50 design since we train on 224×224 images. As before, LARS is used for optimization during contrastive pretraining, and after pretraining we fine-tune a linear layer with the Limited-memory BFGS (L-BFGS [Liu and Nocedal, 1989]) optimizer. All experiments are run with 1000 pretraining epochs and 500 L-BFGS fine-tuning steps. We use a batch size of 128 and train on 4 NVIDIA 1080Ti GPUs; it takes about 13 hours to finish the 1000-epoch training.

F.3 CIFAR-10

The following section describes the experiments we performed on CIFAR-10.

Accessibility

CIFAR-10 [Krizhevsky et al., 2009] is an object classification dataset with 60,000 32×32 images in 10 classes, of which 10,000 images form the test set. The dataset can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html.

Data Processing and Train and Test split

We use the training and test split from the original dataset.

Network Design and Optimization

We employ the ResNet-50 backbone architecture, but change the first 7×7 2D convolution to a 3×3 2D convolution and remove the first max-pooling layer of the standard ResNet-50 (see code for details), which yields better results on CIFAR-10 since the dataset consists of 32×32 images. A 2048-2048-128 projection head is used during contrastive learning, with batch normalization after each 2048-dimensional layer. We consider two CIFAR-10 training settings. The first setting, reported in the main text, trains contrastive learning with a batch size of 256 for 400 epochs; pretraining takes 8 hours on a machine with 4 NVIDIA 1080Ti GPUs. The second setting trains with a batch size of 512 for 1000 epochs and takes 48 hours on a DGX-1 machine. We use the LARS optimizer for all CCL-K related experiments, with a base learning rate of 1.5 and a base batch size of 256.
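The learning-rate scaling mentioned above follows the usual linear scaling rule; a small sketch (the helper name is ours):

```python
def scaled_learning_rate(base_lr=1.5, batch_size=256, base_batch_size=256):
    """Linear scaling rule used with LARS: lr = base_lr * batch_size / base_batch_size."""
    return base_lr * batch_size / base_batch_size

# Setting 1 (batch size 256) and setting 2 (batch size 512) from Table 6:
lr_setting_1 = scaled_learning_rate(batch_size=256)  # 1.5
lr_setting_2 = scaled_learning_rate(batch_size=512)  # 3.0
```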

F.4 Creation of ColorMNIST

Figure 4: Creation of ColorMNIST for the FairInfoNCE validation experiments.

Accessibility

We create the ColorMNIST dataset for the experiments in Section 4.2 in the main text. The train and test split images can be accessed from our GitHub link (Section A). We allow any non-commercial usage of our dataset.

Data Processing

As discussed in Section 4.2 in the main text, we create the ColorMNIST dataset by assigning a randomly sampled color to each MNIST image's background. Images are converted to 32×32 resolution; only the background is filled with the sampled color, while the digit stroke pixels remain black. Examples of ColorMNIST images are shown in Figure 4.
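A minimal sketch of how such an image could be constructed from a grayscale MNIST digit, assuming a simple threshold separates stroke from background pixels; the threshold value and function name are illustrative.

```python
import numpy as np

def colorize_background(digit, rng, stroke_threshold=0.1):
    """Turn a grayscale MNIST digit (H, W) in [0, 1] into an RGB image whose
    background is a randomly sampled color while the digit stroke stays black."""
    color = rng.uniform(0.0, 1.0, size=3)        # random RGB background color
    background = digit < stroke_threshold         # pixels that are not part of the stroke
    image = np.zeros((*digit.shape, 3), dtype=np.float32)
    image[background] = color                     # paint only the background
    return image                                  # stroke pixels remain black (zeros)

rng = np.random.default_rng(0)
fake_digit = rng.random((32, 32))                 # stands in for a resized MNIST digit
colored = colorize_background(fake_digit, rng)    # shape (32, 32, 3)
```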

Training and Test Split: We follow the original MNIST train/test split, resulting in 60,000 training images and 10,000 test images spanning 10 digit categories.

Network Design and Optimization

To train our model, we use LeNet-5 [LeCun et al., 1998] as the backbone architecture and a 2-layer linear projection head that projects the features to 128 dimensions. We use LARS [You et al., 2017] as the optimizer. After the network is pretrained with contrastive learning, we discard the head and fine-tune a linear layer with the Limited-memory BFGS (L-BFGS [Liu and Nocedal, 1989]) optimizer. All experiments are run with 1175 pretraining iterations and 500 L-BFGS fine-tuning steps.
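A minimal sketch of the L-BFGS linear-evaluation step on frozen features; the function name and the single-call training loop are illustrative, not the exact released script.

```python
import torch
import torch.nn as nn

def lbfgs_linear_eval(features, labels, num_classes=10, steps=500):
    """Fit a linear classifier on frozen encoder features with L-BFGS."""
    linear = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.LBFGS(linear.parameters(), max_iter=steps)
    criterion = nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = criterion(linear(features), labels)
        loss.backward()
        return loss

    optimizer.step(closure)  # L-BFGS re-evaluates the closure internally
    return linear

# features: (N, D) frozen encoder outputs; labels: (N,) digit classes (placeholders).
features, labels = torch.randn(1000, 128), torch.randint(0, 10, (1000,))
classifier = lbfgs_linear_eval(features, labels)
```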

F.5 ImageNet-100

The following section describes the experiments we performed on the ImageNet-100 dataset in Section 4 of the main text.

Accessibility

This dataset is a subset of the ImageNet-1K dataset, which comes from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012–2017 [Russakovsky et al., 2015]. ILSVRC is for non-commercial research and educational purposes; we refer to the ImageNet official site for more information: https://www.image-net.org/download.php.

Data Processing

We select 100 classes from ImageNet-1K to conduct experiments. The selected class names can be accessed from our GitHub link.

Training and Test Split: The training split contains 128,783 images and the test split contains 5,000 images. The images are rescaled to 224×224.

Network Design and Optimization Hyper-parameters

We use the conventional ResNet-50 as the backbone of the encoder. A 2048-2048-128 MLP and an ℓ2 normalization layer are used after the encoder during training and discarded in the linear evaluation protocol, with batch normalization after each 2048-dimensional layer. For optimization, we use a batch size of 128 for the WeaklySupCCLK setting and 512 for HardNegCCLK. The OpenAI CLIP model [Radford et al., 2021] is used to extract continuous features from the raw images. For WeaklySupInfoNCE, we discretize the Z space using K-means clustering with k = 100, 200, 500, 2500; the best WeaklySupInfoNCE result is produced by k = 200. All experiments are trained for 200 epochs and require 53 hours of training on a DGX machine with 8 Tesla P100 GPUs.
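For reference, a sketch of extracting such continuous features with the open-source CLIP package; the model variant ("ViT-B/32"), the normalization step, and the variable names are assumptions, and the exact preprocessing used in our experiments may differ.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# image_paths: list of paths to ImageNet-100 training images (placeholder).
image_paths = ["example.jpg"]

with torch.no_grad():
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        z = model.encode_image(image)          # continuous conditioning variable Z
        z = z / z.norm(dim=-1, keepdim=True)   # optional: normalize the feature
```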