
Active Learning for Direct Preference Optimization

Branislav Kveton    Xintong Li    Julian McAuley    Ryan Rossi    Jingbo Shang    Junda Wu    Tong Yu
Abstract

Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline. We propose efficient algorithms for both settings. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We prove that the errors in our DPO logit estimates diminish with more feedback. We show the effectiveness of our algorithms empirically in the setting that matches our theory and also on large language models.


1 Introduction

Reinforcement learning from human feedback (RLHF) has been effective in aligning and fine-tuning large language models (LLMs) (Ouyang et al., 2022; Rafailov et al., 2023). The main difference from classic reinforcement learning (RL) (Sutton and Barto, 1998) is that the agent learns from human feedback, which is expressed as preferences for different potential choices (Christiano et al., 2017). The human feedback allows LLMs to be adapted beyond the distribution of data that was used for their pre-training and generate more human-like responses. The feedback can be incorporated by learning a reward model (Ouyang et al., 2022) from preferences over two (Bradley and Terry, 1952) or multiple (Plackett, 1975; Luce, 2005) choices. Proximal policy optimization (PPO) (Schulman et al., 2017) is then used to maximize the expected reward of the LLM policy under the reward model. Learning of reward models can be avoided by directly optimizing the policy with preferential feedback, known as direct preference optimization (DPO) (Rafailov et al., 2023).

Learning of human preferences for LLM optimization has two main components: preference modeling (Rafailov et al., 2023; Ethayarajh et al., 2024) and how the preferences are elicited (Lightman et al., 2024). We focus on the latter and note that this problem is analogous to classic active learning (Bishop, 2006). Prior works formulated this problem as identifying a subset of prompts with candidate responses, either online or offline, where preferential feedback would improve policy learning by RLHF, either through a reward model or DPO. These works differ in how the prompts are selected: Mehta et al. (2023); Ji et al. (2024); Muldrew et al. (2024) choose prompts based on differences of estimated rewards to their responses; Mukherjee et al. (2024); Scheid et al. (2024); Thekumparampil et al. (2024) derive optimal policies for offline exploration using D-optimal designs (Pukelsheim, 2006); and Das et al. (2024); Liu et al. (2024) solve D-optimal designs online using a greedy algorithm. Most works prove that the errors in learned reward models diminish with more feedback. Interestingly, many works propose two kinds of algorithms (Mehta et al., 2023; Das et al., 2024; Ji et al., 2024), which are either analyzable or practical. We present the first analysis of active learning in DPO and our algorithms are practical.

We study active learning in direct preference optimization. At a high level, we collect preferential feedback to improve DPO policies learned from it. We study two settings: online and offline. In the online setting, the input is a dataset of $N$ prompts with two candidate responses per prompt. The human feedback is unknown in advance and we elicit it online. This setting is motivated by statistical efficiency; we elicit the most informative feedback within a fixed budget on human labor. In the offline setting, the input is a dataset of $N$ prompts with two candidate responses per prompt, and logged preferential feedback for the responses. This setting is motivated by computational efficiency; even if the human feedback is known in advance, we may not have computational resources to learn from all of it. We solve both settings in a unified way. The key idea in our work is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and identify the most informative subset of $n$ prompts out of $N$ using a D-optimal design (Pukelsheim, 2006). D-optimal designs are a well-established tool in adaptive learning (Lattimore and Szepesvari, 2019) for near-optimal information gathering. Several recent papers applied them to learning reward models in RLHF (Das et al., 2024; Mukherjee et al., 2024; Liu et al., 2024; Scheid et al., 2024).

We make the following contributions:

  1. We formalize active learning for DPO as choosing a subset of $n$ data points out of $N$ such that the error in DPO logits, the log odds of preferring one response to the other, is minimized (Section 3).

  2. This is the first work that derives a D-optimal design for DPO (Section 4). The key idea is to assume log-linear policies, which linearize the DPO objective at the last layer of the neural network policy representation. The derived D-optimal design resembles that of logistic regression, with additional terms due to the reference policy and regularization by it. We propose two computationally efficient algorithms, ADPO and ADPO+, which select the most informative data points for DPO. ADPO elicits preferential feedback online and ADPO+ leverages previously logged preferential feedback to have a better design.

  3. We analyze ADPO and ADPO+, and show that their logit errors are $\tilde{O}(d/\sqrt{n})$, where $d$ is the number of features in the linearized DPO policies and $n$ is the budget on preferential human feedback. This is the first analysis for DPO and has several novel technical aspects. The main technical trick is relating the feedback model and the policy parameter under the assumption of log-linear policies. Therefore, we can argue for concentration of the policy parameter with more feedback. The analysis is also under a practical assumption that preferential feedback can be elicited at most once per prompt. To attain an $\tilde{O}(d/\sqrt{n})$ rate in this setting, we introduce a novel assumption on the sufficient diversity of prompts and candidate responses.

  4. We evaluate ADPO and ADPO+ empirically. We experiment with both log-linear DPO policies, which match our theory, and with LLMs. Our methods perform well empirically, despite the fact that they are the first ones with an analysis for active learning in DPO.

The paper is structured as follows. In Section 2, we introduce classic methods for training LLMs. In Section 3, we introduce active learning for DPO. We introduce our algorithms in Section 4 and analyze them in Section 5. In Section 6, we evaluate our algorithms empirically. We review related work in detail in Appendix C and conclude in Section 7.

2 Background

We start by introducing our notation. The prompt is a string $x \in \mathcal{Z}$, where $\mathcal{Z}$ is the space of all strings. The response is a string $y \in \mathcal{Z}$. A large language model (LLM) is a policy that maps $x$ to $y$. We denote the probability of generating response $y$ to prompt $x$ by a policy parameterized by $\theta \in \Theta$ by $\pi(y \mid x; \theta)$, where $\Theta$ is the space of policy parameters. To simplify terminology, we call $\theta$ a policy when it is clear that we refer to $\pi(\cdot \mid \cdot; \theta)$. Pre-trained LLMs can be optimized by supervised fine-tuning (Mangrulkar et al., 2022; Hu et al., 2022) and reinforcement learning from human feedback, which may require learning of a reward model (Ouyang et al., 2022) or not (Rafailov et al., 2023). These methods are introduced next.

2.1 Supervised Fine-Tuning

Supervised fine-tuning (SFT) (Mangrulkar et al., 2022; Hu et al., 2022) is a direct application of supervised learning to LLMs. The objective of SFT is to minimize the negative log-likelihood (loglik) of response $y$ given prompt $x$,

$\mathcal{L}_{\textsc{sft}}(\theta) = -\mathbb{E}_{x,y}\left[\log \pi(y \mid x; \theta)\right]\,,$ (1)

in expectation over prompt-response pairs $(x, y)$ sampled from a training set. One limitation of SFT is that we learn only from positive examples. Therefore, it is hard to learn not to generate certain $y$ given $x$. This motivates learning of policies through rewards in Section 2.2.
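Below is a minimal sketch of how (1) can be estimated with a Hugging Face causal LM; the model name and the toy prompt-response pair are placeholders, not the setup used in this paper.

```python
# Minimal sketch of the SFT objective (1), assuming a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sft_nll(prompt: str, response: str) -> torch.Tensor:
    """Mean per-token negative log-likelihood of the response given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    return model(full_ids, labels=labels).loss  # proportional to -log pi(y | x; theta)

loss = sft_nll("Question: What is 2 + 2? Answer:", " 4")
loss.backward()  # gradient for one SFT step
```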

2.2 Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has two stages: reward model learning and policy optimization. The reward model $r: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is learned from human feedback (Ouyang et al., 2022). The LLM policy is then optimized to maximize the expected reward under the reward model using proximal policy optimization (PPO) (Schulman et al., 2017). The objective is

$\mathcal{L}_{\textsc{rlhf}}(\theta) = \mathbb{E}_{x, y \sim \pi(\cdot \mid x; \theta)}\left[r(x, y) - \beta \log \frac{\pi(y \mid x; \theta)}{\pi_{0}(y \mid x)}\right]\,,$ (2)

where $x$ is a prompt sampled from a training set. The first term is the reward of response $y$ to prompt $x$. The second term penalizes deviations of policy $\theta$ from a reference policy $\pi_{0}$, usually obtained by SFT (Section 2.1). The regularization is needed because the reward model is usually learned from data collected by $\pi_{0}$ and thus cannot estimate the value of significantly different policies well. The parameter $\beta \geq 0$ trades off the two terms. We define the optimal RLHF policy as $\theta_{\textsc{rlhf}} = \operatorname*{arg\,max}_{\theta \in \Theta} \mathcal{L}_{\textsc{rlhf}}(\theta)$.

2.3 Direct Preference Optimization

Direct preference optimization (DPO) (Rafailov et al., 2023) recasts RLHF as follows. Under the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 2005) of human feedback, a response with reward $r(x, y_{1})$ is preferred to that with reward $r(x, y_{2})$ with probability

$p(y_{1} \succ y_{2} \mid x) = \mu(r(x, y_{1}) - r(x, y_{2}))\,,$

where $\mu(v) = 1/(1 + \exp[-v])$ is a sigmoid function. The key observation in DPO is that the policy that maximizes (2) has a closed form

$\pi(y \mid x; \theta_{\textsc{rlhf}}) = \frac{1}{Z(x)} \pi_{0}(y \mid x) \exp\left[\frac{1}{\beta} r(x, y)\right]\,,$

where $Z(x)$ is the normalizer (Rafailov et al., 2023). This holds for any prompt $x$ and response $y$, under the assumption that the space of optimized policies can represent each conditional distribution exactly. This can be rearranged as $r(x, y) = \beta \log \frac{\pi(y \mid x; \theta_{\textsc{rlhf}})}{\pi_{0}(y \mid x)} + \beta \log Z(x)$ and thus

$p(y_{1} \succ y_{2} \mid x; \theta) = \mu\left(\beta \log \frac{\pi(y_{1} \mid x; \theta)}{\pi_{0}(y_{1} \mid x)} - \beta \log \frac{\pi(y_{2} \mid x; \theta)}{\pi_{0}(y_{2} \mid x)}\right)$ (3)

holds when $\theta = \theta_{\textsc{rlhf}}$. A nice property of this substitution is that the normalizers $Z(x)$, which are difficult to estimate when the space of responses is infinite, cancel out.

Therefore, instead of learning a reward model and optimizing (2), we can directly optimize the policy in (3). Specifically, let $s \in \{0, 1\}$ be a random variable such that $s = 1$ when $y_{1}$ is preferred to $y_{2}$ given $x$, and $s = 0$ when $y_{2}$ is preferred to $y_{1}$ given $x$. This problem can be viewed as fitting (3) to the distribution of $s \mid x, y_{1}, y_{2}$ and written as minimizing the negative loglik

$\mathcal{L}_{\textsc{dpo}}(\theta) = -\mathbb{E}\left[s \log p(y_{1} \succ y_{2} \mid x; \theta) + (1 - s) \log p(y_{2} \succ y_{1} \mid x; \theta)\right]\,,$ (4)

where the expectation is over prompt-candidate response pairs $(x, y_{1}, y_{2})$ sampled from a training set, and stochastic preferential feedback $s \mid x, y_{1}, y_{2}$. We define the optimal DPO policy as

$\theta_{*} = \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta)$ (5)

and note that it is the maximum likelihood estimate (MLE) for (4). Note that (4) is equivalent to the more classic objective

$\mathcal{L}_{\textsc{dpo}}(\theta) = -\mathbb{E}\left[\log p(y_{w} \succ y_{l} \mid x; \theta)\right]$

when the winning response is $y_{w} = s y_{1} + (1 - s) y_{2}$ and the losing response is $y_{l} = (1 - s) y_{1} + s y_{2}$. We use the reparameterized objective in (4) because it clearly separates the random variable $s$ from the rest of the objective.

We also note that (3) can be rewritten as

$p(y_{1} \succ y_{2} \mid x; \theta) = \mu\left(\beta \log \frac{\pi(y_{1} \mid x; \theta)}{\pi(y_{2} \mid x; \theta)} - \beta \log \frac{\pi_{0}(y_{1} \mid x)}{\pi_{0}(y_{2} \mid x)}\right)\,,$

where $\log \frac{\pi_{0}(y_{1} \mid x)}{\pi_{0}(y_{2} \mid x)}$ depends on the reference policy $\pi_{0}$ but not on the optimized policy $\theta$. We use this algebraic form because it separates the optimized part of the objective from terms that are essentially constant.
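As a concrete illustration of (3), the sketch below computes the DPO preference probability from sequence log-probabilities; the arguments are hypothetical precomputed values of $\log \pi(y \mid x; \theta)$ and $\log \pi_{0}(y \mid x)$, not part of the paper's setup.

```python
import math

def dpo_preference_prob(logp_y1, logp_y2, logp0_y1, logp0_y2, beta=0.1):
    """Probability that y1 is preferred to y2 under (3).

    Arguments are sequence log-probabilities: log pi(y | x; theta) under the
    optimized policy and log pi_0(y | x) under the reference policy.
    """
    logit = beta * (logp_y1 - logp0_y1) - beta * (logp_y2 - logp0_y2)
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid mu(.)

# Example with made-up log-probabilities: y1 gained probability mass relative
# to the reference policy and y2 lost it, so y1 is preferred with probability > 0.5.
print(dpo_preference_prob(-12.0, -15.0, -13.0, -14.0))
```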

3 Setting

We study active learning in DPO (Section 2.3). Simply put, instead of assuming that (4) is approximated using a fixed dataset, we choose the dataset actively with the objective of learning policies that are close to $\theta_{*}$. We study two variants of this problem, offline and online, which we present next.

Offline feedback. The input to this setting is a dataset of size $N$ with preferential human feedback for all data points. The dataset is $\mathcal{D} = \{(x_{i}, y_{i,1}, y_{i,2}, s_{i})\}_{i=1}^{N}$, where $x_{i}$ is the prompt in data point $i \in [N]$, $y_{i,1}$ and $y_{i,2}$ are the candidate responses, and $s_{i}$ is the preferential feedback. Specifically, $s_{i} = 1$ if the preferred response is $y_{i,1}$, and $s_{i} = 0$ if the preferred response is $y_{i,2}$. Our goal is to select a subset of $\mathcal{D}$ of size $n$ so that the DPO policy on this subset is "close" to $\theta_{*}$. This setting is motivated by computational efficiency. In particular, even if preferential feedback $s_{i}$ is known, we may not have computational resources to learn from all of it. Choosing the most informative subset of $\mathcal{D}$ of size $n$ is a natural way of maximizing the information gain within the computational cost constraint.

Online feedback. The input to this setting is a dataset of size $N$ without preferential human feedback. The dataset is $\mathcal{D} = \{(x_{i}, y_{i,1}, y_{i,2})\}_{i=1}^{N}$, where $x_{i}$ is the prompt in data point $i \in [N]$, and $y_{i,1}$ and $y_{i,2}$ are the candidate responses. The human feedback $s_{i}$ is elicited online. This setting is motivated by statistical efficiency. We want to collect the most informative feedback using only information about the prompts $x_{i}$ and the candidate responses $y_{i,1}$ and $y_{i,2}$.

Let $\mathcal{S}_{n} \subseteq [N]$ be a subset of $n$ data point indices from $\mathcal{D}$, either collected online or offline. After the algorithm selects $\mathcal{S}_{n}$, we minimize an empirical approximation to (4) on $\mathcal{S}_{n}$. Before we define it, we introduce a more compact notation. Let

$\mu_{i}(\theta) = \mu\left(\beta \log \frac{\pi(y_{i,1} \mid x_{i}; \theta)}{\pi(y_{i,2} \mid x_{i}; \theta)} - \beta b_{i}\right)$

be the probability that response $y_{i,1}$ is preferred to $y_{i,2}$ given $x_{i}$ under policy $\theta$, where $b_{i} = \log\left(\frac{\pi_{0}(y_{i,1} \mid x_{i})}{\pi_{0}(y_{i,2} \mid x_{i})}\right)$ is the bias due to the reference policy $\pi_{0}$. Let

$\mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}) = -\sum_{i \in \mathcal{S}} \left[s_{i} \log \mu_{i}(\theta) + (1 - s_{i}) \log(1 - \mu_{i}(\theta))\right]$ (6)

be the DPO negative loglik on $\mathcal{S} \subseteq [N]$. Then (4) can be approximated on $\mathcal{S}_{n}$ by $\frac{1}{n} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}_{n})$. We propose algorithms for choosing $\mathcal{S}_{n}$ in Section 4.

Objective. Now we are ready to state our objective. Let $\theta_{*}$ be the optimal DPO policy in (5). Let

$\mathcal{E}(\theta, \theta_{*}) = \max_{i \in [N]} \left|\beta \log \frac{\pi(y_{i,1} \mid x_{i}; \theta)}{\pi(y_{i,2} \mid x_{i}; \theta)} - \beta \log \frac{\pi(y_{i,1} \mid x_{i}; \theta_{*})}{\pi(y_{i,2} \mid x_{i}; \theta_{*})}\right|$ (7)

be the maximum logit error under policy $\theta$, the difference of DPO logits under $\theta$ and $\theta_{*}$. Note that the biases cancel. Let $\hat{\theta}_{n} = \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}_{n})$ denote the optimal DPO policy on $\mathcal{S}_{n}$. We want $\hat{\theta}_{n}$ to be close to $\theta_{*}$ in terms of (7). Specifically, we want $\mathcal{E}(\hat{\theta}_{n}, \theta_{*})$ to decrease with $n$ with a high probability. The motivation for (7) is that it can bound many other errors. For instance, since the Lipschitz factor of $\mu$ is $1/4$, we get

$\max_{i \in [N]} |\mu_{i}(\hat{\theta}_{n}) - \mu_{i}(\theta_{*})| \leq \frac{1}{4} \mathcal{E}(\hat{\theta}_{n}, \theta_{*})\,.$

Therefore, when the maximum logit error is small, the estimated probability that $y_{i,1}$ is preferred to $y_{i,2}$ under policy $\hat{\theta}_{n}$, for any data point $i \in [N]$, is close to that under $\theta_{*}$.

4 Algorithms

The key idea in our paper is to linearize the policy at the last layer of its neural network representation and use linear algebra for active learning. Active learning on linearized neural networks was popularized in regret minimization by Riquelme et al. (2018). Das et al. (2024); Mukherjee et al. (2024); Thekumparampil et al. (2024); Liu et al. (2024); Scheid et al. (2024) applied it recently to learning reward models. In our work, we linearize policies and formalize it as follows.

Assumption 1.

All policies are log-linear,

$\pi(y \mid x; \theta) \propto \exp[\phi(x, y)^{\top} \theta]\,,$ (8)

where $\phi(x, y) \in \mathbb{R}^{d}$ is the feature vector for pair $(x, y)$ and $\theta \in \mathbb{R}^{d}$ is a policy parameter.

We make this assumption for the rest of the paper. Under this assumption, $\mu_{i}(\theta)$ in (6) becomes

$\mu_{i}(\theta) = \mu(\beta(\phi_{i}^{\top} \theta - b_{i}))\,,$ (9)

where $\phi_{i} = \phi(x_{i}, y_{i,1}) - \phi(x_{i}, y_{i,2})$ is the difference of the feature vectors of responses $y_{i,1}$ and $y_{i,2}$ given $x_{i}$. We note that the normalizers of $\pi(y \mid x; \theta)$ cancel out. We also note that when (9) is substituted into (6), we obtain an expression similar to the negative loglik of logistic regression, except for the bias $b_{i}$ and $\beta$. The key idea in our algorithms is to optimize the Hessian of the DPO negative loglik.
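For concreteness, here is a small sketch of (6) and (9) under log-linear policies; the feature differences `Phi`, biases `b`, and feedback `s` are assumed to be given as arrays.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mu(theta, Phi, b, beta):
    """mu_i(theta) in (9): preference probabilities under a log-linear policy.

    Phi[i] = phi(x_i, y_{i,1}) - phi(x_i, y_{i,2}) and b[i] is the reference-policy bias.
    """
    return sigmoid(beta * (Phi @ theta - b))

def dpo_nll(theta, Phi, b, s, beta):
    """DPO negative log-likelihood (6) on the data points represented by Phi, b, s."""
    p = mu(theta, Phi, b, beta)
    return -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))

# Example: a random instance with d = 4 features and 10 data points.
rng = np.random.default_rng(0)
Phi, b = rng.normal(size=(10, 4)), rng.normal(size=10)
s = rng.integers(0, 2, size=10)
print(dpo_nll(np.zeros(4), Phi, b, s, beta=1.0))
```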

Lemma 1.

Let $\pi(y \mid x; \theta)$ be a log-linear policy. Then the Hessian of $\mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S})$ in (6) with respect to $\theta$ is

$\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}) = \beta^{2} \sum_{i \in \mathcal{S}} \mu_{i}(\theta)(1 - \mu_{i}(\theta)) \phi_{i} \phi_{i}^{\top}\,.$

It is also positive semi-definite.

Proof.

The proof is in Section A.1. ∎

The Hessian $\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S})$ can be used to derive the covariance matrix of the MLE of $\mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S})$ and is also known as the Fisher information matrix (Fisher, 1922). Therefore, it can be used for both uncertainty quantification and information gathering (Lattimore and Szepesvari, 2019). Since the MLE of $\mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S})$ is a policy, we can use the Hessian to select a subset of data points to learn better policies.

Specifically, let $\mathcal{S}_{n}$ be a subset of $n$ data point indices and $\hat{\theta}_{n} = \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}_{n})$ be the corresponding MLE. We show in Theorem 2 that the error in the logit estimate at data point $i \in [N]$ is bounded with a high probability as

$|\phi_{i}^{\top}(\hat{\theta}_{n} - \theta_{*})| \leq \sqrt{d \, \phi_{i}^{\top} (\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n}))^{-1} \phi_{i}}$

up to logarithmic factors. To minimize it, we want to maximize all eigenvalues of $\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n})$. We achieve this by maximizing $\log \operatorname{det}(\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n}))$ over $\mathcal{S}_{n}$.

This optimization problem is challenging for two reasons. First, it is a discrete optimization problem over $\mathcal{S}_{n}$. In our work, we maximize $\log \operatorname{det}(\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n}))$ greedily. An informal justification for this approach is that $\log \operatorname{det}(X)$ is monotone and concave in $X$ for $X \succeq 0$, and thus a greedy algorithm should be near optimal (Nemhauser et al., 1978). We prove this formally in Section 5. Second, $\theta_{*}$ is unknown. We overcome this by using its plug-in estimates (Stufken and Yang, 2012).
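The following sketch shows this greedy $\log\det$ maximization on top of the Hessian from Lemma 1; it assumes the feature differences `Phi`, biases `b`, and a plug-in estimate `theta_hat` are given, and it evaluates the determinant naively for clarity.

```python
import numpy as np

def greedy_d_optimal(Phi, b, theta_hat, n, beta=1.0, gamma=1e-3):
    """Greedily pick n data points maximizing log det of the regularized DPO Hessian."""
    N, d = Phi.shape
    p = 1.0 / (1.0 + np.exp(-beta * (Phi @ theta_hat - b)))   # plug-in mu_i(theta_hat)
    V = beta * np.sqrt(p * (1.0 - p))[:, None] * Phi          # v_i from Lemma 1
    H = gamma * np.eye(d)                                     # regularized design matrix
    chosen = []
    for _ in range(n):
        best_i, best_gain = None, -np.inf
        for i in range(N):
            if i in chosen:
                continue
            gain = np.linalg.slogdet(H + np.outer(V[i], V[i]))[1]
            if gain > best_gain:
                best_i, best_gain = i, gain
        H += np.outer(V[best_i], V[best_i])
        chosen.append(best_i)
    return chosen
```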

4.1 Active DPO with Online Preferential Feedback

Algorithm 1 ADPO: Active DPO with online feedback.
 1: Input: Dataset $\mathcal{D} = \{(x_{i}, y_{i,1}, y_{i,2})\}_{i=1}^{N}$
 2: $H_{0} \leftarrow \gamma I_{d}$, $\mathcal{S}_{0} \leftarrow \emptyset$
 3: for $t = 1, \dots, n$ do
 4:     Solve $\hat{\theta}_{t-1} \leftarrow \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}_{t-1})$
 5:     Let $v_{t,i} \leftarrow \beta \sqrt{\mu_{i}(\hat{\theta}_{t-1})(1 - \mu_{i}(\hat{\theta}_{t-1}))} \, \phi_{i}$
 6:     $I_{t} \leftarrow \operatorname*{arg\,max}_{i \in [N] \setminus \mathcal{S}_{t-1}} \log \operatorname{det}(H_{t-1} + v_{t,i} v_{t,i}^{\top})$
 7:     Get preferential feedback $s_{I_{t}}$ on $(x_{I_{t}}, y_{I_{t},1}, y_{I_{t},2})$
 8:     $H_{t} \leftarrow H_{t-1} + v_{t,I_{t}} v_{t,I_{t}}^{\top}$
 9:     $\mathcal{S}_{t} \leftarrow \mathcal{S}_{t-1} \cup \{I_{t}\}$
10: Output: Data point indices $\mathcal{S}_{n}$ for learning a model

Our first algorithm does not have access to any preferential feedback initially. It collects it online, re-estimates $\theta_{*}$, and approximately maximizes $\log \operatorname{det}(\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n}))$.

The pseudo-code of the algorithm is in Algorithm 1 and we call it active DPO (ADPO). ADPO chooses data points in $n$ rounds. The indices of the data points chosen in the first $t$ rounds are denoted by $\mathcal{S}_{t}$ and the corresponding Hessian is $H_{t}$. We refer to it as the design matrix since it is used to select the next data points. The design matrix is initialized to $\gamma I_{d}$, where $\gamma > 0$ is a constant that guarantees that all $H_{t}$ are well defined. In round $t$, ADPO selects the index $I_{t}$ that greedily maximizes the information gain given $H_{t-1}$ and the empirical estimate of $\theta_{*}$ up to round $t$, $\hat{\theta}_{t-1}$ (line 6). This is because

$v_{t,i} v_{t,i}^{\top} = \beta^{2} \mu_{i}(\hat{\theta}_{t-1})(1 - \mu_{i}(\hat{\theta}_{t-1})) \phi_{i} \phi_{i}^{\top}$

can be viewed as the incremental gain due to data point $i$ in Lemma 1. After the data point $I_{t}$ is chosen, we observe preferential feedback on it (line 7) and update all statistics (lines 8-9). Finally, after $n$ rounds, ADPO outputs the $n$ chosen indices (line 10) and an LLM policy is optimized on them using DPO.

The time complexity of ADPO is $O(n^{2} + nN)$. The former term is due to training on all past feedback in each round (line 4) and the latter is due to maximizing exactly in line 6. In experiments, we reduce the former to $O(n \log n)$ by estimating $\hat{\theta}_{t-1}$ only a logarithmic number of times, when $t = 2^{i}$ for some integer $i > 0$. We reduce the latter to $O(n)$ by replacing $[N] \setminus \mathcal{S}_{t-1}$ with a random subset of a fixed size $256$. Finally, note that $I_{t}$ in line 6 can be equivalently expressed (Section A.3) as

$I_{t} = \operatorname*{arg\,max}_{i \in [N] \setminus \mathcal{S}_{t-1}} v_{t,i}^{\top} H_{t-1}^{-1} v_{t,i}\,.$ (10)

Therefore, the determinant does not need to be computed. The inverse $H_{t-1}^{-1}$ can be computed incrementally using the Sherman-Morrison formula, with $O(d^{2})$ update time. The statistical efficiency of ADPO is analyzed in Section 5.
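Below is a minimal sketch of the selection loop of Algorithm 1 using the equivalent rule (10) and Sherman-Morrison updates of $H_{t-1}^{-1}$; the feedback elicitation is omitted and `refit_dpo` is a hypothetical routine that would minimize (6) on the indices chosen so far.

```python
import numpy as np

def adpo_select(Phi, b, refit_dpo, n, beta=1.0, gamma=1e-3):
    """Greedy D-optimal selection via (10) with incremental inverse updates."""
    N, d = Phi.shape
    H_inv = np.eye(d) / gamma                 # (gamma I_d)^{-1}
    theta_hat = np.zeros(d)
    chosen, remaining = [], set(range(N))
    for t in range(1, n + 1):
        if t & (t - 1) == 0:                  # re-fit on a logarithmic schedule (t a power of two)
            theta_hat = refit_dpo(chosen)
        p = 1.0 / (1.0 + np.exp(-beta * (Phi @ theta_hat - b)))
        w = beta * np.sqrt(p * (1.0 - p))     # per-point scaling of phi_i
        idx = np.array(sorted(remaining))
        V = w[idx, None] * Phi[idx]
        gains = np.einsum("ij,jk,ik->i", V, H_inv, V)   # v^T H^{-1} v per candidate
        i_t = idx[int(np.argmax(gains))]
        v = w[i_t] * Phi[i_t]
        Hv = H_inv @ v                         # Sherman-Morrison rank-one update
        H_inv -= np.outer(Hv, Hv) / (1.0 + v @ Hv)
        chosen.append(i_t)
        remaining.remove(i_t)
    return chosen
```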

4.2 Active DPO with Offline Preferential Feedback

Our second algorithm has access to preferential feedback initially. All feedback is used to estimate $\theta_{*}$, which is then used to approximately maximize $\log \operatorname{det}(\nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n}))$.

The pseudo-code of our algorithm is in Algorithm 2 and we call it ADPO+, where $+$ indicates that ADPO+ has access to more information than ADPO. ADPO+ differs from ADPO in two steps. First, $\theta_{*}$ is estimated initially (line 3) from all preferential feedback. Second, no preferential feedback is collected online. Similarly to ADPO, the time complexity of ADPO+ is $O(nN)$ because of the exact maximization in line 7. We reduce it to $O(n)$ in experiments as in Section 4.1.

Algorithm 2 ADPO+: Active DPO for offline feedback.
 1: Input: Dataset $\mathcal{D} = \{(x_{i}, y_{i,1}, y_{i,2}, s_{i})\}_{i=1}^{N}$
 2: $H_{0} \leftarrow \gamma I_{d}$, $\mathcal{S}_{0} \leftarrow \emptyset$
 3: Solve $\hat{\theta} \leftarrow \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta; [N])$
 4: for $t = 1, \dots, n$ do
 5:     $\hat{\theta}_{t-1} \leftarrow \hat{\theta}$
 6:     Let $v_{t,i} \leftarrow \beta \sqrt{\mu_{i}(\hat{\theta}_{t-1})(1 - \mu_{i}(\hat{\theta}_{t-1}))} \, \phi_{i}$
 7:     $I_{t} \leftarrow \operatorname*{arg\,max}_{i \in [N] \setminus \mathcal{S}_{t-1}} \log \operatorname{det}(H_{t-1} + v_{t,i} v_{t,i}^{\top})$
 8:     $H_{t} \leftarrow H_{t-1} + v_{t,I_{t}} v_{t,I_{t}}^{\top}$
 9:     $\mathcal{S}_{t} \leftarrow \mathcal{S}_{t-1} \cup \{I_{t}\}$
10: Output: Data point indices $\mathcal{S}_{n}$ for learning a model

5 Analysis

In this section, we provide a unified analysis for ADPO and ADPO+. This is possible because the algorithms only differ in how the instance-specific factors in the design matrix are estimated. In ADPO+, they are estimated from all preferential feedback. In ADPO, only the online elicited feedback up to round $t$ is used. We state our assumptions first.

We assume that all policies are log-linear (Assumption 1) and that the collected feedback $s_{I_{t}}$ is conditionally independent given all feedback up to round $t$, for all $t \in [n]$. Under this assumption, the negative loglik in (6) is similar to that of logistic regression and we can use existing concentration inequalities (Abbasi-Yadkori et al., 2011).

Assumption 2.

[Boundedness] For any $i \in [N]$, $\|\phi_{i}\|_{2} \leq 1$ and $|b_{i}| \leq 1$. We assume that $\Theta$ is a unit sphere, and hence $\|\theta_{*}\|_{2} \leq 1$ and $\|\hat{\theta}_{n}\|_{2} \leq 1$.

Assumptions on feature vectors, comprising $\phi_{i}$ and $b_{i}$, are standard in the analyses of generalized linear models (Li et al., 2017; Kveton et al., 2020; Mukherjee et al., 2024). Our assumption on $\theta_{*}$ and $\hat{\theta}_{n}$ can be guaranteed by applying DPO to a unit sphere $\Theta$. The assumption can be weakened to $\|\hat{\theta}_{n} - \theta_{*}\|_{2} \leq 1$ using initial exploration (Li et al., 2017; Kveton et al., 2020).

We can analyze ADPO and ADPO+ in a unified way because the instance-specific factors in their design matrices can be bounded from below by $c_{\min}$ and above by $c_{\max}$.

Assumption 3.

[Design matrix] For any $i \in [N]$ and $\theta \in \Theta$, we have $0 \leq c_{\min} \leq \beta^{2} \mu_{i}(\theta)(1 - \mu_{i}(\theta)) \leq c_{\max}$.

These constants obviously exist and can be easily derived. For instance, since $\max_{x \in \mathbb{R}} \mu(x)(1 - \mu(x)) = 0.25$, we get $c_{\max} = 0.25 \beta^{2}$. Moreover, under Assumption 2, we have for any $\mu_{i}(\theta) \leq 0.5$ that

$\beta^{2} \mu_{i}(\theta)(1 - \mu_{i}(\theta)) \geq \beta^{2} \mu_{i}^{2}(\theta) \geq \beta^{2} \mu(-4\beta) = c_{\min}\,.$

The argument for $\mu_{i}(\theta) \geq 0.5$ is similar. The constants $c_{\min}$ and $c_{\max}$ appear in our bounds.

The last assumption is that the dataset is sufficiently diverse.

Assumption 4.

[Diverse dataset] There exists a constant $\kappa \geq 1$ such that $v_{t,i}^{\top} H_{t-1}^{-1} v_{t,i} \leq \kappa \, v_{t,I_{t}}^{\top} H_{t-1}^{-1} v_{t,I_{t}}$ holds for any $i \in [N]$ and $t \in [n]$.

This assumption says that the maximizer in (10) is an approximate upper bound, up to a multiplicative $\kappa \geq 1$, on the information gain at each data point, including those previously chosen that cannot be chosen again. We note that the assumption holds for $\kappa = 1$ when repeated independent observations of the data points are allowed, as in all prior works (Appendix C). In this case, the maximization in (10) would be over $i \in [N]$.

5.1 Main Result

We state our main claim below.

Theorem 2.

Let $\hat{\theta}_{n} = \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}_{n})$. Then the maximum logit error under ADPO and ADPO+ is

$\mathcal{E}(\hat{\theta}_{n}, \theta_{*}) = \tilde{O}(d \sqrt{\log(1/\delta)/n})$

with probability at least $1 - \delta$, where $\tilde{O}$ hides all logarithmic factors but those in $\delta$.

We prove the claim as follows. For log-linear policies, (7) reduces to $\max_{i \in [N]} |\phi_{i}^{\top}(\hat{\theta}_{n} - \theta_{*})|$. By the Cauchy-Schwarz inequality, for any data point $i \in [N]$,

$|\phi_{i}^{\top}(\hat{\theta}_{n} - \theta_{*})| \leq \|\phi_{i}\|_{\Sigma_{n}^{-1}} \|\hat{\theta}_{n} - \theta_{*}\|_{\Sigma_{n}}\,,$ (11)

where $\Sigma_{n} = \gamma I_{d} + \nabla^{2} \mathcal{L}_{\textsc{dpo}}(\theta_{*}; \mathcal{S}_{n})$ is a regularized Hessian at the optimal DPO policy $\theta_{*}$. To bound the second term, we note that the feedback at data point $i$ is distributed as

$s_{i} \sim \mu_{i}(\theta_{*}) = \mu(\beta(\phi_{i}^{\top} \theta_{*} - b_{i}))\,.$ (12)

This assumption follows from the definition of DPO in (3), which says that $\mu_{i}(\theta_{*})$ is the probability that response $y_{i,1}$ is preferred to $y_{i,2}$ given $x_{i}$. Thus we can build on existing concentration results for sub-Gaussian random variables to prove the following.

Theorem 3.

For any set of $n$ indices $\mathcal{S}_{n} \subseteq [N]$,

$\|\hat{\theta}_{n} - \theta_{*}\|_{\Sigma_{n}} \leq \sqrt{\frac{\beta^{2} d}{c_{\min}} \log\left(\frac{1 + c_{\min} n / \gamma}{\delta}\right)} + 2\gamma^{\frac{1}{2}}$

holds with probability at least $1 - \delta$.

To bound the first term in (11), we use the fact that the standard errors of the logit estimates do not increase over time and decrease at a desired rate if Assumption 4 holds for some constant $\kappa \geq 1$.

Theorem 4.

For any data point $i \in [N]$,

$\phi_{i}^{\top} \Sigma_{n}^{-1} \phi_{i} \leq \frac{c_{\max}^{3} \log\left(1 + \frac{c_{\max} n}{\gamma d}\right)}{c_{\min} \gamma \log(1 + c_{\max}/\gamma)} \frac{\kappa d}{n}\,.$

All proofs are in Appendix A.

5.2 Discussion

The bound in Theorem 2 is $\tilde{O}(d \sqrt{\log(1/\delta)/n})$ and holds with probability at least $1 - \delta$. As a result, the maximum logit error decreases with more feedback $n$ and increases with the number of learned policy parameters $d$. The bound is not directly comparable to prior works in Appendix C because they bound reward model errors, while we bound a policy learning error. That being said, the dependence on $n$ and $\delta$ is similar. The linear dependence on $d$ arises because Theorem 4 is proved through a self-normalizing bound in Theorem 3 that would apply even to infinitely-large datasets. We would get an $\tilde{O}(\sqrt{d \log(N) \log(1/\delta)/n})$ bound, where $N$ is the dataset size, if we followed the analysis of Kveton et al. (2020) and applied a union bound over all data points.

6 Experiments

Figure 1: Experiments with log-linear policies on the CIFAR-10 (first row) and CIFAR-100 (second row) datasets.

We experiment with both log-linear (Section 6.1) and LLM (Section 6.2) policies. The log-linear experiments validate that ADPO and ADPO+ work as analyzed. The LLM experiments show that ADPO and ADPO+ perform well in practice when applied to LLMs. We conduct more experiments with log-linear policies in Appendix B.

6.1 Log-Linear Policies

This experiment is designed as follows. First, we take an existing multi-class classification dataset and turn it into a preferential feedback dataset. More specifically, we choose a random positive label and generate $N$ vectors $\{\phi_{i}\}_{i=1}^{N}$, where $\phi_{i} \in \mathbb{R}^{d}$ is the difference of the feature vectors of random positive and negative examples. Second, we label all $\phi_{i}$ with $1$ and learn a logistic regression model to simulate preferential feedback. Let $\bar{\theta}$ and $\bar{\Sigma}$ be the learned model parameter and its covariance, respectively. Third, we generate preferential feedback $s_{i} \sim \mathrm{Ber}(\mu(\phi_{i}^{\top} \bar{\theta}))$ for all $\phi_{i}$ and get a dataset $\mathcal{D} = \{(\phi_{i}, s_{i})\}_{i=1}^{N}$. Fourth, we generate a reference policy as $\theta_{0} \sim \mathcal{N}(\bar{\theta}, \bar{\Sigma})$ and set the bias as $b_{i} = \phi_{i}^{\top} \theta_{0}$. Simply put, $\theta_{0}$ is close to $\bar{\theta}$, as measured by the uncertainty of $\bar{\theta}$. Finally, we compute the optimal DPO policy $\theta_{*}$ on $\mathcal{D}$. All compared methods apply DPO to their selected subset $\mathcal{S}_{n}$ of $\mathcal{D}$ and learn $\hat{\theta}_{n} = \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}_{\textsc{dpo}}(\theta; \mathcal{S}_{n})$.
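A compact sketch of this construction is below. It uses scikit-learn's logistic regression for $\bar{\theta}$; the mirrored-negative trick for fitting with all-positive labels and the inverse-Fisher approximation of $\bar{\Sigma}$ are implementation choices of this sketch, not details given in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_preference_data(X, labels, pos_label, N, rng):
    """Turn a multi-class dataset into a synthetic DPO instance (features, feedback, bias)."""
    pos, neg = X[labels == pos_label], X[labels != pos_label]
    # Step 1: feature differences of random positive and negative examples.
    Phi = pos[rng.integers(len(pos), size=N)] - neg[rng.integers(len(neg), size=N)]
    # Step 2: all Phi labeled 1; fitting on (Phi, 1) and (-Phi, 0) has the same MLE.
    clf = LogisticRegression(fit_intercept=False).fit(
        np.vstack([Phi, -Phi]), np.r_[np.ones(N), np.zeros(N)])
    theta_bar = clf.coef_.ravel()
    p = 1.0 / (1.0 + np.exp(-Phi @ theta_bar))
    # Covariance of theta_bar approximated by the inverse Fisher information (an assumption).
    Sigma_bar = np.linalg.inv(Phi.T @ (Phi * (p * (1 - p))[:, None]) + 1e-6 * np.eye(Phi.shape[1]))
    # Step 3: simulate preferential feedback from theta_bar.
    s = rng.binomial(1, p)
    # Step 4: a reference policy close to theta_bar defines the bias b_i.
    theta0 = rng.multivariate_normal(theta_bar, Sigma_bar)
    b = Phi @ theta0
    return Phi, s, b
```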

We compare $\hat{\theta}_{n}$ to $\theta_{*}$ in three metrics. The first metric is the maximum logit error, $\max_{i \in [N]} |\phi_{i}^{\top}(\hat{\theta}_{n} - \theta_{*})|$, which we bound in Theorem 2. The second metric is the mean logit error $\frac{1}{N} \sum_{i=1}^{N} |\phi_{i}^{\top}(\hat{\theta}_{n} - \theta_{*})|$. Although we do not analyze it, our methods minimize it indirectly through the maximum error. The last metric is the error rate,

$\frac{1}{N} \sum_{i=1}^{N} \mathds{1}\left\{\mathrm{sgn}(\phi_{i}^{\top} \hat{\theta}_{n} - b_{i}) \neq \mathrm{sgn}(\phi_{i}^{\top} \theta_{*} - b_{i})\right\}\,,$

which is the fraction of incorrectly ordered responses by $\hat{\theta}_{n}$ when $\theta_{*}$ is the ground truth.

Figure 2: Experiments with LLM policies on the Nectar dataset. We use Llama-3.2 (first row) and Phi-3 (second row) models.

We compare five algorithms. The first two algorithms are ADPO and ADPO+. We expect ADPO+ to perform better because it has access to more information. We consider three baselines: Uniform, APO, and PMC. Uniform selects data points uniformly at random. While simple, it is known to be competitive in real-world problems where feature vectors may cover the feature space close to uniformly (Ash et al., 2020, 2021; Mukherjee et al., 2024; Muldrew et al., 2024). APO is the practical incremental D-optimal design for linear models proposed in Das et al. (2024). The main difference from ADPO is that APO neglects the logistic model factors and $\beta$ (Lemma 1). Therefore, while it selects diverse $\phi_{i}$, they do not necessarily maximize the information gain in DPO. The last baseline is PMC of Muldrew et al. (2024), which selects data points with the highest differences between the estimated rewards of their responses.

We experiment with the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). The features are a random subset of ResNet-50 embeddings (He et al., 2016) of size $d = 384$. The dataset size is $N = 2^{16}$. We set the DPO regularizer to $\beta = 1$ and experiment with other $\beta$ in Appendix B. Our CIFAR-10 results are reported in the first row of Figure 1. ADPO+ is the best performing method in all metrics. Many improvements are major. For instance, the lowest maximum logit error of Uniform ($n = 2^{15}$) is attained by ADPO+ at $n < 2^{13}$. The lowest maximum logit error of APO ($n = 2^{15}$) is attained by ADPO+ at $n < 2^{14}$. ADPO is the second best method in the maximum logit error. It is never worse than Uniform, APO, and PMC. ADPO improves in all metrics over all baselines at larger sample sizes. Our CIFAR-100 results are reported in the second row of Figure 1 and we observe the same trends as on the CIFAR-10 dataset.

6.2 LLM Policies

We also experiment with a real-world preference dataset, Nectar (Zhu et al., 2023), and two LLM policies: Llama-3.2 (3B parameters) (Dubey et al., 2024) and Phi-3 (Abdin et al., 2024). We sample $N = 5\,000$ prompts $\{x_{i}\}_{i=1}^{N}$ from the dataset, each with two responses. The accepted responses $\{y_{i,w}\}_{i=1}^{N}$ and rejected responses $\{y_{i,l}\}_{i=1}^{N}$ are determined based on the ground truth in the dataset. The feature vector $\phi(x, y)$ is the embedding of the concatenated prompt and response from the last hidden layer of the LLM, of size $d = 4\,096$. The bias term is $b_{i} = \log \pi_{0}(y_{i,w} \mid x_{i}) - \log \pi_{0}(y_{i,l} \mid x_{i})$, where $\pi_{0}$ is the initial LLM reference policy.
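A sketch of how these quantities can be extracted with Hugging Face transformers is below; the model name is a placeholder and the pooling of the last hidden layer (last-token embedding) is an assumption, since the paper does not specify the pooling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-3B"  # placeholder reference policy pi_0
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

@torch.no_grad()
def feature_and_logprob(prompt: str, response: str):
    """Return phi(x, y) (last-hidden-layer embedding) and log pi_0(y | x)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    ids = tok(prompt + response, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    phi = out.hidden_states[-1][0, -1]                 # last token, last layer
    logprobs = out.logits[:, :-1].log_softmax(-1)      # next-token log-probabilities
    resp = ids[0, prompt_ids.shape[1]:]                # response token ids
    logp = logprobs[0, prompt_ids.shape[1] - 1:, :].gather(-1, resp[:, None]).sum()
    return phi, logp.item()

# Bias b_i for one data point: difference of reference log-probabilities of the
# accepted and rejected responses; feature differences phi_i are built analogously.
phi_w, logp_w = feature_and_logprob("Prompt text", " accepted response")
phi_l, logp_l = feature_and_logprob("Prompt text", " rejected response")
b_i = logp_w - logp_l
phi_i = phi_w - phi_l
```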

We report three metrics. The accuracy measures how well we distinguish between positive and negative responses,

$\frac{1}{N} \sum_{i=1}^{N} \mathds{1}\left\{\log \frac{\pi(y_{i,w} \mid x_{i}; \theta)}{\pi_{0}(y_{i,w} \mid x_{i})} > \log \frac{\pi(y_{i,l} \mid x_{i}; \theta)}{\pi_{0}(y_{i,l} \mid x_{i})}\right\}\,.$

This metric is $1$ minus the error rate in Figure 1 and thus identical, up to how we plot it. We could not plot the other two metrics from Figure 1 because they require knowing $\theta_{*}$. Therefore, we decided to plot two other metrics that reflect the confidence in distinguishing the responses. The margin is the advantage of a positive response over a negative one,

$\frac{1}{N} \sum_{i=1}^{N} \beta \log \frac{\pi(y_{i,w} \mid x_{i}; \theta)}{\pi_{0}(y_{i,w} \mid x_{i})} - \beta \log \frac{\pi(y_{i,l} \mid x_{i}; \theta)}{\pi_{0}(y_{i,l} \mid x_{i})}\,.$

The negative loglik is the logistic regression loss,

$-\frac{1}{N} \sum_{i=1}^{N} \log \mu\left(\beta \log \frac{\pi(y_{i,w} \mid x_{i}; \theta)}{\pi_{0}(y_{i,w} \mid x_{i})} - \beta \log \frac{\pi(y_{i,l} \mid x_{i}; \theta)}{\pi_{0}(y_{i,l} \mid x_{i})}\right)\,.$

Our results with the Llama-3.2 and Phi-3 models are reported in Figure 2. We observe similar trends to Figure 1. ADPO+ is clearly the best performing method in both the margin and the negative loglik. ADPO is among the best three methods for larger sample sizes. The least clear trend is in accuracy. We believe that this is because many responses are of a similar quality. Therefore, they cannot be easily distinguished and lie close to the decision boundary, which can be impacted by even minor changes in the LLM.

7 Conclusions

We propose an active learning framework for DPO. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We propose two algorithms. One is for the online setting, where the human feedback is elicited online, and the other is for the offline setting, where the feedback has already been collected and we choose its subset to improve the computational efficiency of DPO. We analyze both algorithms and also evaluate them empirically, both in the setting that matches our theory and on LLMs.

This is the first work that applies optimal designs to DPO. The main difference from prior works is that the optimal design is applied to policy optimization. A natural direction for future work is other policy optimization frameworks, such as KTO (Ethayarajh et al., 2024). Our analysis could also be improved in several aspects. For instance, it is for log-linear policies and we have not derived an upper bound on $\kappa$ in Assumption 4. In the setting of prior works, where multiple independent observations of preferential feedback for the same prompt are possible, $\kappa = 1$.

References

  • Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.
  • Abdin et al. [2024] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
  • Ash et al. [2020] Jordan Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Ash et al. [2021] Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade. Gone fishing: Neural active learning with Fisher embeddings. In Advances in Neural Information Processing Systems 34, 2021.
  • Audibert et al. [2010] Jean-Yves Audibert, Sebastien Bubeck, and Remi Munos. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 41–53, 2010.
  • Azizi et al. [2022] Mohammad Javad Azizi, Branislav Kveton, and Mohammad Ghavamzadeh. Fixed-budget best-arm identification in structured bandits. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, 2022.
  • Bayer and Reuter [2024] Markus Bayer and Christian Reuter. Activellm: Large language model-based active learning for textual few-shot scenarios. arXiv preprint arXiv:2405.10808, 2024.
  • Bishop [2006] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
  • Bradley and Terry [1952] Ralph Allan Bradley and Milton Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952.
  • Bubeck et al. [2009] Sebastien Bubeck, Remi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pages 23–37, 2009.
  • Chen et al. [2024] Yifang Chen, Shuohang Wang, Ziyi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, and Yelong Shen. Cost-effective proxy reward model construction with on-policy and active learning. arXiv preprint arXiv:2407.02119, 2024.
  • Christiano et al. [2017] Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30, 2017.
  • Das et al. [2024] Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimization for sample efficient RLHF. CoRR, abs/2402.10500, 2024. URL https://arxiv.org/abs/2402.10500.
  • Doucet et al. [2024] Paul Doucet, Benjamin Estermann, Till Aczel, and Roger Wattenhofer. Bridging diversity and uncertainty in active learning with self-supervised pre-training. arXiv preprint arXiv:2403.03728, 2024.
  • Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Ethayarajh et al. [2024] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. In Proceedings of the 41th International Conference on Machine Learning, 2024.
  • Fisher [1922] Ronald Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London: Series A, 222:309–368, 1922.
  • Guo et al. [2024] Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Hu et al. [2022] Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, 2022.
  • Ji et al. [2024] Kaixuan Ji, Jiafan He, and Quanquan Gu. Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401, 2024.
  • Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Kveton et al. [2015] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • Kveton et al. [2020] Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
  • Lagree et al. [2016] Paul Lagree, Claire Vernade, and Olivier Cappe. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, pages 1597–1605, 2016.
  • Lattimore and Szepesvari [2019] Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019.
  • Li et al. [2017] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 2071–2080, 2017.
  • Li et al. [2016] Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1245–1253, 2016.
  • Lightman et al. [2024] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In Proceedings of the 12th International Conference on Learning Representations, 2024.
  • Liu et al. [2024] Pangpang Liu, Chengchun Shi, and Will Wei Sun. Dual active learning for reinforcement learning from human feedback. CoRR, abs/2410.02504, 2024. URL https://arxiv.org/abs/2410.02504.
  • Luce [2005] Robert Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Dover Publications, 2005.
  • Mangrulkar et al. [2022] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  • Margatina et al. [2023] Katerina Margatina, Timo Schick, Nikolaos Aletras, and Jane Dwivedi-Yu. Active learning principles for in-context learning with large language models. arXiv preprint arXiv:2305.14264, 2023.
  • Mehta et al. [2023] Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, and Willie Neiswanger. Sample efficient reinforcement learning from human feedback via active exploration. CoRR, abs/2312.00267, 2023. URL https://arxiv.org/abs/2312.00267.
  • Mukherjee et al. [2024] Subhojyoti Mukherjee, Anusha Lalitha, Kousha Kalantari, Aniket Deshmukh, Ge Liu, Yifei Ma, and Branislav Kveton. Optimal design for human preference elicitation. In Advances in Neural Information Processing Systems 37, 2024.
  • Muldrew et al. [2024] William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. arXiv preprint arXiv:2402.08114, 2024.
  • Nemhauser et al. [1978] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35, 2022.
  • Plackett [1975] Robin Lewis Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
  • Pukelsheim [2006] Friedrich Pukelsheim. Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006.
  • Radlinski et al. [2008] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791, 2008.
  • Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36, 2023.
  • Riquelme et al. [2018] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • Scheid et al. [2024] Antoine Scheid, Etienne Boursier, Alain Durmus, Michael Jordan, Pierre Menard, Eric Moulines, and Michal Valko. Optimal design for reward modeling in RLHF. CoRR, abs/2410.17055, 2024. URL https://arxiv.org/abs/2410.17055.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL https://arxiv.org/abs/1707.06347.
  • Stufken and Yang [2012] John Stufken and Min Yang. Optimal designs for generalized linear models. In Design and Analysis of Experiments, pages 137–164. John Wiley & Sons, 2012.
  • Sutton and Barto [1998] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
  • Thekumparampil et al. [2024] Kiran Thekumparampil, Gaurush Hiranandani, Kousha Kalantari, Shoham Sabach, and Branislav Kveton. Comparing few to rank many: Active human preference learning using randomized Frank-Wolfe. CoRR, abs/2412.19396, 2024. URL https://arxiv.org/abs/2412.19396.
  • Wang et al. [2024] Jiahao Wang, Bolin Zhang, Qianlong Du, Jiajun Zhang, and Dianhui Chu. A survey on data selection for LLM instruction tuning. arXiv preprint arXiv:2402.05123, 2024.
  • Yang and Tan [2022] Junwen Yang and Vincent Tan. Minimax optimal fixed-budget best arm identification in linear bandits. In Advances in Neural Information Processing Systems 35, 2022.
  • Zhang et al. [2022] Yiming Zhang, Shi Feng, and Chenhao Tan. Active example selection for in-context learning. arXiv preprint arXiv:2211.04486, 2022.
  • Zhu et al. [2023] Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF, November 2023.
  • Zong et al. [2016] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016.

Appendix A Proofs and Supporting Lemmas

This section contains proofs of our main claims and supporting lemmas.

A.1 Proof of Lemma 1

Let $v \in \mathbb{R}$ and $\mu(v) = 1/(1+\exp[-v])$. Then

\[
\frac{\partial}{\partial v}\mu(v)
= -\frac{1}{(1+\exp[-v])^{2}}\frac{\partial}{\partial v}\exp[-v]
= \frac{\exp[-v]}{(1+\exp[-v])^{2}}
= \mu(v)(1-\mu(v))\,.
\]

We start by computing the gradient of (6),

\begin{align*}
\nabla\mathcal{L}_{\textsc{dpo}}(\theta;\mathcal{S})
&= -\sum_{i\in\mathcal{S}} s_{i}\frac{\nabla\mu_{i}(\theta)}{\mu_{i}(\theta)} - (1-s_{i})\frac{\nabla\mu_{i}(\theta)}{1-\mu_{i}(\theta)}
= \beta\sum_{i\in\mathcal{S}} (1-s_{i})\mu_{i}(\theta)\phi_{i} - s_{i}(1-\mu_{i}(\theta))\phi_{i} \\
&= \beta\sum_{i\in\mathcal{S}} (\mu_{i}(\theta) - s_{i})\phi_{i}\,.
\end{align*}

It follows that the Hessian is

\[
\nabla^{2}\mathcal{L}_{\textsc{dpo}}(\theta;\mathcal{S})
= \nabla(\nabla\mathcal{L}_{\textsc{dpo}}(\theta;\mathcal{S}))
= \beta\sum_{i\in\mathcal{S}} \phi_{i}\nabla\mu_{i}(\theta)^{\top}
= \beta^{2}\sum_{i\in\mathcal{S}} \mu_{i}(\theta)(1-\mu_{i}(\theta))\phi_{i}\phi_{i}^{\top}\,.
\]

The term $\phi_{i}\phi_{i}^{\top}$ is an outer product, which is positive semi-definite. Because $\mu_{i}(\theta)(1-\mu_{i}(\theta)) \geq 0$, the Hessian is a weighted sum of positive semi-definite matrices, and thus a positive semi-definite matrix.
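The identities above are easy to sanity-check numerically. The sketch below is illustrative only: it assumes the parameterization $\mu_{i}(\theta) = \mu(\beta(\phi_{i}^{\top}\theta - b_{i}))$, consistent with (13), and uses synthetic $\phi_{i}$, $b_{i}$, and $s_{i}$ in place of the quantities defined in the main text.

# Minimal numerical check of Lemma 1 (gradient and Hessian of the DPO loss).
# Assumption of this sketch: mu_i(theta) = sigmoid(beta * (phi_i^T theta - b_i)),
# with synthetic phi_i, b_i, s_i, and beta standing in for the paper's quantities.
import numpy as np

rng = np.random.default_rng(0)
d, m, beta = 4, 10, 1.0
Phi = rng.normal(size=(m, d))      # feature vectors phi_i (rows)
b = rng.normal(size=m)             # offsets b_i
s = rng.integers(0, 2, size=m)     # binary preference labels s_i
theta = rng.normal(size=d)

def mu(v):
    return 1.0 / (1.0 + np.exp(-v))

def loss(theta):
    p = mu(beta * (Phi @ theta - b))
    return -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))

def grad(theta):
    p = mu(beta * (Phi @ theta - b))
    return beta * Phi.T @ (p - s)          # Lemma 1: beta * sum_i (mu_i - s_i) phi_i

def hess(theta):
    p = mu(beta * (Phi @ theta - b))
    w = beta ** 2 * p * (1 - p)
    return Phi.T @ (w[:, None] * Phi)      # beta^2 * sum_i mu_i (1 - mu_i) phi_i phi_i^T

# Compare against central finite differences.
eps = 1e-6
g_num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                  for e in np.eye(d)])
print(np.allclose(g_num, grad(theta), atol=1e-4))          # expected: True
print(np.all(np.linalg.eigvalsh(hess(theta)) >= -1e-10))   # PSD, as claimed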

A.2 Proof of Theorem 3

Let $\hat{\Sigma}_{n} = \nabla^{2}\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})$. We start by noting that $\hat{\Sigma}_{n}$ is a positive semi-definite matrix (Lemma 1). Therefore, $\mathcal{L}_{\textsc{dpo}}(\theta;\mathcal{S}_{n})$ is strongly convex in $\theta$ and

\[
\mathcal{L}_{\textsc{dpo}}(\hat{\theta}_{n};\mathcal{S}_{n})
\geq \mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})
+ \langle\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n}),\hat{\theta}_{n}-\theta_{*}\rangle
+ \frac{1}{2}\|\hat{\theta}_{n}-\theta_{*}\|_{\hat{\Sigma}_{n}}^{2}
\]

holds. Now we use that $\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n}) \geq \mathcal{L}_{\textsc{dpo}}(\hat{\theta}_{n};\mathcal{S}_{n})$ and that $\hat{\Sigma}_{n} = \Sigma_{n} - \gamma I_{d}$, rearrange the inequality, and get

\[
\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}^{2}
\leq 2\langle\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n}),\theta_{*}-\hat{\theta}_{n}\rangle
+ \gamma\|\hat{\theta}_{n}-\theta_{*}\|_{2}^{2}\,.
\]

Then we apply the Cauchy–Schwarz inequality to the right-hand side and get

\[
\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}^{2}
\leq 2\|\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})\|_{\Sigma_{n}^{-1}}\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}
+ \gamma\|\hat{\theta}_{n}-\theta_{*}\|_{2}^{2}\,.
\]

Now we divide both sides by $\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}} > 0$ and get

\[
\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}
\leq 2\|\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})\|_{\Sigma_{n}^{-1}}
+ \frac{\gamma\|\hat{\theta}_{n}-\theta_{*}\|_{2}^{2}}{\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}}
\leq 2\|\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})\|_{\Sigma_{n}^{-1}} + 2\gamma^{\frac{1}{2}}\,.
\]

The last inequality follows from

\[
\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}
= \sqrt{(\hat{\theta}_{n}-\theta_{*})^{\top}\Sigma_{n}(\hat{\theta}_{n}-\theta_{*})}
\geq \sqrt{\gamma}\,\|\hat{\theta}_{n}-\theta_{*}\|_{2}\,,
\]

which is proved using $\Sigma_{n} \succeq \gamma I_{d}$; combined with $\|\hat{\theta}_{n}-\theta_{*}\|_{2} \leq 2$, this yields the last inequality above.

Therefore, to bound $\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}$, it suffices to show that $\|\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})\|_{\Sigma_{n}^{-1}}$ is small with a high probability. We show this next. We start by recalling from Lemma 1 that

\[
\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})
= \beta\sum_{i\in\mathcal{S}_{n}} (\mu_{i}(\theta_{*}) - s_{i})\phi_{i}\,,
\]

where $s_{i}$ is a binary random variable with mean $\mathbb{E}[s_{i}] = \mu_{i}(\theta_{*})$, as described in (12). Let $Z_{i} = \mu_{i}(\theta_{*}) - s_{i}$. Since

\[
\Sigma_{n} \succeq c_{\min}\left(\frac{\gamma}{c_{\min}} I_{d} + \sum_{i\in\mathcal{S}_{n}}\phi_{i}\phi_{i}^{\top}\right)\,,
\]

we get

\[
\|\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})\|_{\Sigma_{n}^{-1}}
\leq \frac{\beta}{\sqrt{c_{\min}}}\Big\|\sum_{i\in\mathcal{S}_{n}} Z_{i}\phi_{i}\Big\|_{V_{n}^{-1}}
\]

for $V_{n} = \gamma I_{d}/c_{\min} + \sum_{i\in\mathcal{S}_{n}}\phi_{i}\phi_{i}^{\top}$. Since the $s_{i}$ are conditionally independent given the history and their variance proxy is $0.25$, we can apply Theorem 1 of Abbasi-Yadkori et al. [2011] and get that

\[
\Big\|\sum_{i\in\mathcal{S}_{n}} Z_{i}\phi_{i}\Big\|_{V_{n}^{-1}}
\leq \sqrt{\frac{d}{4}\log\left(\frac{1 + c_{\min}n/\gamma}{\delta}\right)}
\]

holds with probability at least $1-\delta$. Finally, we collect all inequalities and get that

\[
\|\hat{\theta}_{n}-\theta_{*}\|_{\Sigma_{n}}
\leq 2\|\nabla\mathcal{L}_{\textsc{dpo}}(\theta_{*};\mathcal{S}_{n})\|_{\Sigma_{n}^{-1}} + 2\gamma^{\frac{1}{2}}
\leq \sqrt{\frac{\beta^{2}d}{c_{\min}}\log\left(\frac{1 + c_{\min}n/\gamma}{\delta}\right)} + 2\gamma^{\frac{1}{2}}
\]

holds with probability at least $1-\delta$.
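As an illustration of the message of Theorem 3 rather than its proof, the sketch below fits a logistic model of the form analyzed above to synthetic preferences and shows that the parameter error shrinks as the number of observations grows. The zero offsets, the plain gradient-descent fit, and all data are assumptions of the sketch, not the paper's setup.

# Illustrative check (not part of the proof): with more preference feedback, the
# estimate theta_hat of a DPO-style logistic model approaches theta_*.
# Assumptions: offsets b_i = 0, synthetic Gaussian features, gradient descent as the fitter.
import numpy as np

rng = np.random.default_rng(7)
d, beta = 5, 1.0
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

def mu(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit(Phi, s, steps=2000, lr=0.1):
    theta = np.zeros(d)
    for _ in range(steps):
        g = beta * Phi.T @ (mu(beta * (Phi @ theta)) - s)   # gradient from Lemma 1 (b_i = 0)
        theta -= lr * g / len(s)
    return theta

for n in [100, 1000, 10000]:
    Phi = rng.normal(size=(n, d))
    s = (rng.random(n) < mu(beta * (Phi @ theta_star))).astype(float)  # feedback from the true model
    print(n, np.linalg.norm(fit(Phi, s) - theta_star))                 # error shrinks with n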

A.3 Proof of Theorem 4

First, we introduce $\mu_{t,i} = \mu_{i}(\hat{\theta}_{t-1})$, and note that $v_{t,i}$ in \texttt{ADPO} and $\texttt{ADPO}^{+}$ can be redefined as

\[
v_{t,i} = \beta\sqrt{\mu_{t,i}(1-\mu_{t,i})}\,\phi_{i}\,.
\]

Now note that

\[
\|\phi_{i}\|_{\Sigma_{n}^{-1}}^{2}
= \phi_{i}^{\top}\Sigma_{n}^{-1}\phi_{i}
\leq \frac{c_{\max}}{c_{\min}}\phi_{i}^{\top}H_{n}^{-1}\phi_{i}
\]

because $H_{t} = \gamma I_{d} + \sum_{i\in\mathcal{S}_{t}} v_{t,i}v_{t,i}^{\top}$. Next we utilize the fact that the standard errors of the estimates decrease with more observations.

Lemma 5.

For any $i\in[N]$ and $t\in[n]$,

\[
\phi_{i}^{\top}H_{t}^{-1}\phi_{i} \leq \phi_{i}^{\top}H_{t-1}^{-1}\phi_{i}\,.
\]
Proof.

The proof follows from the Sherman–Morrison formula. Specifically, since

\[
H_{t}^{-1}
= H_{t-1}^{-1} - \frac{H_{t-1}^{-1}\phi_{i}\phi_{i}^{\top}H_{t-1}^{-1}}{1 + \phi_{i}^{\top}H_{t-1}^{-1}\phi_{i}}
\preceq H_{t-1}^{-1}\,,
\]

we get $v^{\top}H_{t}^{-1}v \leq v^{\top}H_{t-1}^{-1}v$ for any vector $v\in\mathbb{R}^{d}$. This completes the proof. ∎
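The monotonicity in Lemma 5 can also be illustrated numerically; the sketch below uses random matrices and vectors as stand-ins and is not part of the proof.

# Numerical illustration of Lemma 5: a rank-one update cannot increase
# quadratic forms of the inverse, v^T H_t^{-1} v <= v^T H_{t-1}^{-1} v.
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 5, 1.0
H_prev = gamma * np.eye(d) + sum(np.outer(x, x) for x in rng.normal(size=(3, d)))
u = rng.normal(size=d)                  # the vector added in round t
H_next = H_prev + np.outer(u, u)

v = rng.normal(size=d)                  # any test vector, e.g. a feature phi_i
q_prev = v @ np.linalg.solve(H_prev, v)
q_next = v @ np.linalg.solve(H_next, v)
print(q_next <= q_prev + 1e-12)         # expected: True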

Lemma 5 implies that

\[
\phi_{i}^{\top}H_{n}^{-1}\phi_{i}
\leq \frac{1}{n}\sum_{t=1}^{n}\phi_{i}^{\top}H_{t-1}^{-1}\phi_{i}
\leq \frac{c_{\max}}{n}\sum_{t=1}^{n} v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i}
\]

holds for any $i\in[N]$. This allows us to attribute the quality of the solution to individual greedy steps in \texttt{ADPO} and $\texttt{ADPO}^{+}$. The next step is to relate $v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i}$ to $v_{t,I_{t}}^{\top}H_{t-1}^{-1}v_{t,I_{t}}$. The key observation is that

\begin{align*}
I_{t}
&= \operatorname*{arg\,max\,}_{i\in[N]\setminus\mathcal{S}_{t-1}} \log\det(H_{t-1} + v_{t,i}v_{t,i}^{\top})
= \operatorname*{arg\,max\,}_{i\in[N]\setminus\mathcal{S}_{t-1}} \log\det(I_{d} + H_{t-1}^{-\frac{1}{2}}v_{t,i}v_{t,i}^{\top}H_{t-1}^{-\frac{1}{2}}) \\
&= \operatorname*{arg\,max\,}_{i\in[N]\setminus\mathcal{S}_{t-1}} \log(1 + v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i})
= \operatorname*{arg\,max\,}_{i\in[N]\setminus\mathcal{S}_{t-1}} v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i}\,.
\end{align*}

The first equality holds because $\log\det(H_{t-1})$ is fixed when $I_{t}$ is selected and can thus be dropped from the maximization. The second equality uses the matrix determinant lemma, $\det(I_{d} + xx^{\top}) = 1 + x^{\top}x$. The last equality holds because the logarithm is a monotone function. It follows that $I_{t}$ is the index of the feature vector with the maximum variance.
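The equivalence above also suggests how the greedy step can be implemented: select the not-yet-chosen index with the largest $v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i}$ and update $H_{t}^{-1}$ with the Sherman–Morrison formula. The sketch below is a minimal illustration with synthetic features; in the actual algorithms, $v_{t,i}$ is recomputed in each round from $\hat{\theta}_{t-1}$, which the sketch omits.

# Minimal sketch of the greedy selection analyzed above: in each round, pick the
# unselected index maximizing v^T H^{-1} v, equivalently log det(H + v v^T).
# Features are synthetic stand-ins and are kept fixed across rounds for simplicity.
import numpy as np

rng = np.random.default_rng(2)
N, d, gamma, n = 50, 8, 1.0, 10
V = rng.normal(size=(N, d))             # rows play the role of v_{t,i}

H_inv = np.eye(d) / gamma               # H_0 = gamma * I_d
selected = []
for _ in range(n):
    scores = np.einsum('ij,jk,ik->i', V, H_inv, V)    # v_i^T H^{-1} v_i for all i
    scores[selected] = -np.inf                        # scope is [N] \ S_{t-1}
    i = int(np.argmax(scores))
    selected.append(i)
    Hv = H_inv @ V[i]
    H_inv -= np.outer(Hv, Hv) / (1.0 + V[i] @ Hv)     # Sherman–Morrison rank-one update
print(selected)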

If the scope of the maximization were $i\in[N]$, the inequality $v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i} \leq v_{t,I_{t}}^{\top}H_{t-1}^{-1}v_{t,I_{t}}$ would hold for any $i\in[N]$. Since the scope is $i\in[N]\setminus\mathcal{S}_{t-1}$, we make Assumption 4, which amounts to assuming that the feature vectors $\phi_{i}$ are sufficiently diverse. We also use the following logarithmic transformation.

Lemma 6.

For any $v\in\mathbb{R}^{d}$ with $\|v\|_{2}^{2} \leq c_{\max}$ and any $t\in[n]$,

\[
v^{\top}H_{t-1}^{-1}v \leq \frac{c_{\max}}{\gamma\log(1 + c_{\max}/\gamma)}\log(1 + v^{\top}H_{t-1}^{-1}v)\,.
\]
Proof.

We start with an upper bound on $v^{\top}H_{t-1}^{-1}v$. By Weyl's inequalities, we have

\[
\lambda_{1}(H_{t-1}^{-1}) = \lambda_{d}^{-1}(H_{t-1}) \leq \lambda_{d}^{-1}(\gamma I_{d}) = 1/\gamma\,.
\]

Thus, under the assumption that $\|v\|_{2}^{2} \leq c_{\max}$, we have $v^{\top}H_{t-1}^{-1}v \leq c_{\max}/\gamma$. Now note that for $y\in[0,y_{\max}]$,

\[
y = \frac{y}{\log(1+y)}\log(1+y)
\leq \left(\max_{y\in[0,y_{\max}]}\frac{y}{\log(1+y)}\right)\log(1+y)
= \frac{y_{\max}}{\log(1+y_{\max})}\log(1+y)\,.
\]

Finally, we set $y = v^{\top}H_{t-1}^{-1}v$ and $y_{\max} = c_{\max}/\gamma$, and get our claim. ∎

Now we apply Assumption 4 and Lemma 6, use the telescoping property of the sum, and get

\begin{align*}
\sum_{t=1}^{n} v_{t,i}^{\top}H_{t-1}^{-1}v_{t,i}
&\leq \kappa\sum_{t=1}^{n} v_{t,I_{t}}^{\top}H_{t-1}^{-1}v_{t,I_{t}}
\leq c\sum_{t=1}^{n}\log(1 + v_{t,I_{t}}^{\top}H_{t-1}^{-1}v_{t,I_{t}})
= c\sum_{t=1}^{n}\log\det(I_{d} + H_{t-1}^{-\frac{1}{2}}v_{t,I_{t}}v_{t,I_{t}}^{\top}H_{t-1}^{-\frac{1}{2}}) \\
&= c\sum_{t=1}^{n}\left[\log\det(H_{t-1} + v_{t,I_{t}}v_{t,I_{t}}^{\top}) - \log\det(H_{t-1})\right]
= c\sum_{t=1}^{n}\left[\log\det(H_{t}) - \log\det(H_{t-1})\right] \\
&= c\left(\log\det(H_{n}) - \log\det(H_{0})\right)
= c\log\det(H_{0}^{-\frac{1}{2}}H_{n}H_{0}^{-\frac{1}{2}})\,,
\end{align*}

where $c = \frac{c_{\max}\kappa}{\gamma\log(1 + c_{\max}/\gamma)}$. Furthermore,

\begin{align*}
\log\det(H_{0}^{-\frac{1}{2}}H_{n}H_{0}^{-\frac{1}{2}})
&\leq d\log\left(\frac{1}{d}\operatorname{tr}(H_{0}^{-\frac{1}{2}}H_{n}H_{0}^{-\frac{1}{2}})\right)
= d\log\left(1 + \frac{1}{d}\sum_{t=1}^{n}\operatorname{tr}(H_{0}^{-\frac{1}{2}}v_{t,I_{t}}v_{t,I_{t}}^{\top}H_{0}^{-\frac{1}{2}})\right) \\
&= d\log\left(1 + \frac{1}{d}\sum_{t=1}^{n} v_{t,I_{t}}^{\top}H_{0}^{-1}v_{t,I_{t}}\right)
\leq d\log\left(1 + \frac{c_{\max}n}{\gamma d}\right)\,.
\end{align*}

Finally, we combine all claims and get

\[
\phi_{i}^{\top}H_{n}^{-1}\phi_{i}
\leq \frac{1}{n}\sum_{t=1}^{n}\phi_{i}^{\top}H_{t-1}^{-1}\phi_{i}
\leq \frac{c_{\max}\kappa}{n}\sum_{t=1}^{n} v_{t,I_{t}}^{\top}H_{t-1}^{-1}v_{t,I_{t}}
\leq \frac{c_{\max}^{2}\log\left(1 + \frac{c_{\max}n}{\gamma d}\right)}{\gamma\log(1 + c_{\max}/\gamma)}\,\frac{\kappa d}{n}\,.
\]

This completes the proof.

Appendix B Ablation Study

Figure 3: Experiments with log-linear policies on the CIFAR-10 dataset, with $\beta=2$ (first row) and $\beta=5$ (second row).
Figure 4: Experiments with log-linear policies on the CIFAR-10 (first row) and CIFAR-100 (second row) datasets with $\alpha=0$ in (13).

In Section 6.1, we experiment with $\beta=1$. There is nothing special about this choice. In Figure 3, we report results for $\beta\in\{2,5\}$ and observe improvements in both settings.

To increase the stability of our algorithms at small sample sizes, we replace $\mu_{i}(\hat{\theta}_{t})(1-\mu_{i}(\hat{\theta}_{t}))$ with a high-probability upper confidence bound (UCB). Let $\hat{\Sigma}_{t}$ be the covariance matrix for $\hat{\theta}_{t}$. Then the UCB is computed as

\[
U_{i} = \mu(z_{i})(1-\mu(z_{i}))\,, \quad
z_{i} = \max\left\{\left|\beta(\phi_{i}^{\top}\hat{\theta}_{t} - b_{i})\right| - \alpha\sqrt{\phi_{i}^{\top}\hat{\Sigma}_{t}\phi_{i}},\ 0\right\} \tag{13}
\]

for some $\alpha > 0$. We set $\alpha = 3$ in Section 6. In Figure 4, we set $\alpha = 0$ and observe that this has no major impact on our trends as the number of data points $n$ increases.
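For concreteness, (13) can be computed as in the sketch below. It is illustrative only: the features, offsets $b_{i}$, and covariance estimate are synthetic stand-ins for the quantities in the main text.

# Minimal sketch of the UCB in (13): shrink the |logit| by a confidence width and
# clip at zero, which can only increase mu(z)(1 - mu(z)). All inputs are synthetic.
import numpy as np

def mu(v):
    return 1.0 / (1.0 + np.exp(-v))

def ucb_weights(Phi, b, theta_hat, Sigma_hat, beta=1.0, alpha=3.0):
    logits = beta * (Phi @ theta_hat - b)                       # beta (phi_i^T theta_t - b_i)
    widths = alpha * np.sqrt(np.einsum('ij,jk,ik->i', Phi, Sigma_hat, Phi))
    z = np.maximum(np.abs(logits) - widths, 0.0)                # z_i in (13)
    return mu(z) * (1.0 - mu(z))                                # U_i

rng = np.random.default_rng(3)
N, d = 20, 6
Phi = rng.normal(size=(N, d))
b = rng.normal(size=N)
theta_hat = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma_hat = A @ A.T / d + 1e-3 * np.eye(d)   # any PSD covariance estimate
print(ucb_weights(Phi, b, theta_hat, Sigma_hat)[:5])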

Appendix C Related Work

The closest related works are on active learning with preferential feedback, and we review them first (Section C.1). Then we review active learning for fine-tuning (Section C.2) and other related works (Section C.3).

C.1 Active Learning for Preferential Feedback

Mehta et al. [2023] applied active learning to DPO in Section 5 of their paper. Their acquisition function is

\[
I_{t} = \operatorname*{arg\,max\,}_{i\in[N]}\left(\max_{j\in[2]} U(x_{i},y_{i,j}) - \max_{j\in[2]} L(x_{i},y_{i,j})\right)\,,
\]

where $U(x,y)$ is the UCB and $L(x,y)$ is the lower confidence bound (LCB) of $r(x,y)$. The analysis is for dueling the UCB response with a random response. Their optimized metric is the maximum gap

\[
\max_{i\in[N]}\left(\max_{j\in[2]} r(x_{i},y_{i,j}) - r(x_{i},\hat{y}_{i})\right)\,, \tag{14}
\]

where $\hat{r}$ is the estimated reward model and $\hat{y}_{i} = \operatorname*{arg\,max\,}_{j\in[2]}\hat{r}(x_{i},y_{i,j})$ is the best response given $x_{i}$. They prove that the maximum gap is $O(1/\sqrt{n})$ for sampling with replacement.

Das et al. [2024] proposed two algorithms for active RLHF. The acquisition function in APO is

\[
I_{t} = \operatorname*{arg\,max\,}_{i\in[N]} \|\phi_{i}\|_{H_{t}^{-1}(\hat{\theta}_{t-1})}\,,
\]

where $H_{t}(\hat{\theta}_{t-1})$ is a logistic-regression Hessian in round $t$, which is re-estimated in each round. They prove that (14) is $O(1/\sqrt{n})$ for sampling with replacement. APO is not evaluated empirically. This is the closest algorithm design to \texttt{ADPO}. The main difference in \texttt{ADPO} is that we maximize the information gain (line 6) and do not compute $H_{t}^{-1}(\hat{\theta}_{t-1})$. Das et al. [2024] also proposed a practical variant of APO,

\[
I_{t} = \operatorname*{arg\,max\,}_{i\in[N]} \|\phi_{i}\|_{H_{t}^{-1}}\,,
\]

where $H_{t}$ is a linear-regression Hessian in round $t$. The practical APO is not analyzed. We use it as a baseline in Section 6.
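For reference, a minimal sketch of this baseline acquisition is given below, under the assumption that $H_{t}$ is the ridge linear-regression Hessian $\gamma I_{d} + \sum_{i\in\mathcal{S}_{t}}\phi_{i}\phi_{i}^{\top}$ over the already-acquired features; all inputs are synthetic stand-ins.

# Minimal sketch of the practical APO acquisition (a baseline in Section 6):
# pick the prompt whose feature has the largest norm ||phi_i||_{H_t^{-1}}.
# Assumption: H_t is a ridge linear-regression Hessian over acquired features.
import numpy as np

def practical_apo_pick(Phi, acquired, gamma=1.0):
    d = Phi.shape[1]
    H = gamma * np.eye(d) + sum(np.outer(Phi[i], Phi[i]) for i in acquired)
    H_inv = np.linalg.inv(H)
    scores = np.einsum('ij,jk,ik->i', Phi, H_inv, Phi)   # ||phi_i||^2_{H_t^{-1}}
    scores[list(acquired)] = -np.inf                     # do not re-acquire
    return int(np.argmax(scores))

rng = np.random.default_rng(4)
Phi = rng.normal(size=(30, 5))
print(practical_apo_pick(Phi, acquired=[0, 1]))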

Mukherjee et al. [2024] studied active learning with absolute and ranking feedback over $K \geq 2$ responses. For $K = 2$, their algorithm Dope samples $I_{t} \sim \pi_{*}$, where $\pi_{*}$ is a distribution over the $N$ prompts with $2$ responses obtained by the D-optimal design. They prove that

\[
\max_{i\in[N]} |\phi_{i}^{\top}(\hat{\theta} - \theta_{*})| = O(1/\sqrt{n})
\]

for sampling with replacement, where $\theta_{*}$ is the true model parameter and $\hat{\theta}$ is its estimate from $n$ observations. Dope is evaluated on RLHF datasets. Thekumparampil et al. [2024] extended Mukherjee et al. [2024] to ranking $N$ items from $K$-way feedback with $K \leq N$.

Liu et al. [2024] extended APO of Das et al. [2024] to selecting both the prompt and the teacher model. They prove that (14) is $O(1/\sqrt{n})$ for sampling with replacement. The proposed algorithm is empirically evaluated.

Scheid et al. [2024] proposed offline and online algorithms for active learning of reward models in RLHF. The offline algorithm, which is in the same setting as our work, computes the D-optimal design, similarly to Mukherjee et al. [2024] for $K=2$, and explores by sampling with replacement. They prove an $O(1/\sqrt{n})$ bound on (14). The paper does not contain any experiments.

Ji et al. [2024] proposed two active learning algorithms: APPO and ADPO. APPO is a regret-minimizing algorithm similar to those in dueling bandits. In round $t$, APPO is given a prompt as an input and proposes two responses to duel. APPO is analyzed. ADPO is a heuristic that queries responses on prompts where the agent is uncertain. The response is uncertain if $|r(x_{i},y_{i,1}) - r(x_{i},y_{i,2})|$ in the DPO objective is high.

Muldrew et al. [2024] proposed an active learning algorithm for DPO that repeatedly acquires labels and fine-tunes on them. The data are acquired in batches until a budget is met. The acquisition function is

\[
I_{t} = \operatorname*{arg\,max\,}_{i\in[N]} |\hat{r}(x_{i},y_{i,1}) - \hat{r}(x_{i},y_{i,2})|\,,
\]

where $\hat{r}$ is the estimated reward model. We use it as a baseline in Section 6.
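A minimal sketch of this acquisition rule is given below; the estimated rewards are synthetic stand-ins for $\hat{r}$ evaluated on the candidate pairs.

# Minimal sketch of the reward-margin acquisition of Muldrew et al. [2024]
# (a baseline in Section 6): pick the pair whose estimated rewards differ most.
# r_hat_pairs holds synthetic stand-ins for the scores of a reward model.
import numpy as np

def margin_pick(r_hat_pairs):
    # r_hat_pairs[i] = (r_hat(x_i, y_{i,1}), r_hat(x_i, y_{i,2}))
    margins = np.abs(r_hat_pairs[:, 0] - r_hat_pairs[:, 1])
    return int(np.argmax(margins))

rng = np.random.default_rng(5)
print(margin_pick(rng.normal(size=(30, 2))))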

Guo et al. [2024] proposed online DPO from AI feedback. The key is to elicit AI feedback instead of human feedback and then use it in DPO. This is an empirical paper.

Chen et al. [2024] proposed active learning with coresets for reward models. They learn cluster centroids in the space of prompt embeddings that minimize the maximum distance of any prompt to its closest centroid. This is an empirical paper.

C.2 Active Learning for Fine-Tuning

There are many related works on active learning in LLMs [Margatina et al., 2023, Bayer and Reuter, 2024, Zhang et al., 2022]. A recent survey by Wang et al. [2024] categorizes existing methods for data selection in instruction tuning. Most of these methods rely on heuristics, such as uncertainty sampling, clustering, or diversity-based strategies, which often lack theoretical grounding. Doucet et al. [2024] proposed a method that bridges diversity and uncertainty in active learning by leveraging self-supervised pre-training to address the cold-start problem and improve data efficiency. However, these approaches do not align data selection directly with the task-specific objective, which limits their effectiveness in optimizing downstream performance. Zhang et al. [2022] used LLMs to select instances for in-context learning. More recently, Bayer and Reuter [2024] proposed ActiveLLM, a pool-based sampling method that leverages LLMs to select batches of instances for humans to label. Although this setting differs from ours, they also study two variants of their approach, one that incorporates feedback and one that does not.

C.3 Multi-Armed Bandits

Our setting is also related to multi-armed bandits. Due to the budget $n$, it is reminiscent of fixed-budget best arm identification (BAI) [Bubeck et al., 2009, Audibert et al., 2010, Azizi et al., 2022, Yang and Tan, 2022]. The main difference is that we do not want to identify the best arm; we want a good worst-case estimate over a set of arms, which are essentially pairs of items. Online learning to rank has also been studied extensively [Radlinski et al., 2008, Kveton et al., 2015, Zong et al., 2016, Li et al., 2016, Lagree et al., 2016]. We neither minimize cumulative regret nor try to identify the best arm.