
Optimal Decision Rules for Simple Hypothesis Testing under General Criterion Involving Error Probabilities

Berkan Dulek, Cuneyd Ozturk, Student Member, IEEE, and Sinan Gezici, Senior Member, IEEE. B. Dulek is with the Department of Electrical and Electronics Engineering, Hacettepe University, Beytepe Campus, Ankara 06800, Turkey, e-mail: berkan@ee.hacettepe.edu.tr. C. Ozturk and S. Gezici are with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey, e-mails: {cuneyd,gezici}@ee.bilkent.edu.tr.
Abstract

The problem of simple $M$-ary hypothesis testing under a generic performance criterion that depends on arbitrary functions of error probabilities is considered. Using results from convex analysis, it is proved that an optimal decision rule can be characterized as a randomization among at most two deterministic decision rules of a form reminiscent of the Bayes rule, if the boundary points corresponding to each rule have zero probability under each hypothesis. Otherwise, a randomization among at most $M(M-1)+1$ deterministic decision rules is sufficient. The form of the deterministic decision rules is explicitly specified. Likelihood ratios are shown to be sufficient statistics. Classical performance measures including Bayesian, minimax, Neyman-Pearson, generalized Neyman-Pearson, restricted Bayesian, and prospect theory based approaches are all covered under the proposed formulation. A numerical example is presented for prospect theory based binary hypothesis testing.

Index Terms– Hypothesis testing, optimal tests, convexity, likelihood ratio, randomization.

I Problem Statement

Consider a detection problem with $M$ simple hypotheses:

$$\mathcal{H}_{j}:\boldsymbol{Y}\sim f_{j}(\cdot),\text{ with }j=0,1,\ldots,M-1, \qquad (1)$$

where the random observation $\boldsymbol{Y}$ takes values from an observation set $\Gamma$ with $\Gamma\subset\mathbb{R}^{N}$. Depending on whether the observed random vector $\boldsymbol{Y}\in\Gamma$ is continuous-valued or discrete-valued, $f_{j}(\boldsymbol{y})$ denotes either the probability density function (pdf) or the probability mass function (pmf) under hypothesis $\mathcal{H}_{j}$. For compactness of notation, the term density is used for both pdf and pmf. In order to decide among the hypotheses, we consider the set of pointwise randomized decision functions, denoted by $\mathsf{D}$, i.e., $\boldsymbol{\delta}:=(\delta_{0},\delta_{1},\ldots,\delta_{M-1})\in\mathsf{D}$ such that $\sum_{i=0}^{M-1}\delta_{i}(\boldsymbol{y})=1$ and $\delta_{i}(\boldsymbol{y})\in[0,1]$ for $0\leq i\leq M-1$ and $\boldsymbol{y}\in\Gamma$. More explicitly, given the observation $\boldsymbol{y}$, the detector decides in favor of hypothesis $\mathcal{H}_{i}$ with probability $\delta_{i}(\boldsymbol{y})$. Then, the probability of choosing hypothesis $\mathcal{H}_{i}$ when hypothesis $\mathcal{H}_{j}$ is true, denoted by $p_{ij}$ with $0\leq i,j\leq M-1$, is given by

$$p_{ij}:=\mathbb{E}_{j}[\delta_{i}(\boldsymbol{y})]=\int_{\Gamma}\delta_{i}(\boldsymbol{y})f_{j}(\boldsymbol{y})\,\mu(d\boldsymbol{y}), \qquad (2)$$

where $\mathbb{E}_{j}[\cdot]$ denotes expected value under hypothesis $\mathcal{H}_{j}$ and $\mu(d\boldsymbol{y})$ is used in (2) to denote the $N$-fold integral and sum for continuous and discrete cases, respectively. Let $\boldsymbol{p}(\boldsymbol{\delta})$ denote the (column) vector containing all pairwise error probabilities $p_{ij}$ for $0\leq i,j\leq M-1$ and $i\neq j$ corresponding to the decision rule $\boldsymbol{\delta}$. It is sufficient to include only the pairwise error probabilities in $\boldsymbol{p}(\boldsymbol{\delta})$, i.e., $p_{ij}$ with $i\neq j$. To see this, note that (2) in conjunction with $\sum_{i=0}^{M-1}\delta_{i}(\boldsymbol{y})=1$ implies $\sum_{i=0}^{M-1}p_{ij}=1$, from which we get the probability of correctly identifying hypothesis $\mathcal{H}_{j}$ as $p_{jj}=1-\sum_{i=0,i\neq j}^{M-1}p_{ij}$.
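For a finite observation set, the error probabilities in (2) reduce to matrix products. Below is a minimal sketch with a hypothetical three-point observation set, hypothetical pmfs, and a hypothetical randomized rule (all numbers are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical discrete setup: Gamma = {0, 1, 2} and M = 2 hypotheses.
# Rows of F are the pmfs f_j over Gamma; rows of delta give delta_i(y).
F = np.array([[0.7, 0.2, 0.1],     # f_0
              [0.1, 0.3, 0.6]])    # f_1
delta = np.array([[1.0, 0.5, 0.0],   # delta_0(y): probability of deciding H_0
                  [0.0, 0.5, 1.0]])  # delta_1(y): probability of deciding H_1

# p[i, j] = E_j[delta_i(Y)] = sum_y delta_i(y) f_j(y), i.e., equation (2).
p = delta @ F.T

# Each column of p sums to one, so p_jj = 1 - sum over i != j of p_ij.
assert np.allclose(p.sum(axis=0), 1.0)
```

Here the vector $\boldsymbol{p}(\boldsymbol{\delta})$ of the text collects the off-diagonal entries `p[1, 0]` and `p[0, 1]`.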

For $M$-ary hypothesis testing, we consider a generic decision criterion that can be expressed in terms of the error probabilities as follows:

$$\begin{aligned}
\underset{\boldsymbol{\delta}\in\mathsf{D}}{\text{minimize}}\quad & g_{0}(\boldsymbol{p}(\boldsymbol{\delta}))\\
\text{subject to}\quad & g_{i}(\boldsymbol{p}(\boldsymbol{\delta}))\leq 0,\quad i=1,2,\ldots,m\\
& h_{j}(\boldsymbol{p}(\boldsymbol{\delta}))=0,\quad j=1,2,\ldots,p
\end{aligned} \qquad (3)$$

where gig_{i} and hjh_{j} denote arbitrary functions of the pairwise error probability vector. Classical hypothesis testing criteria such as Bayesian, minimax, Neyman-Pearson (NP) [1], generalized Neyman-Pearson [2], restricted Bayesian [3], and prospect theory based hypothesis testing [4] are all special cases of the formulation in (3). For example, in the restricted Bayesian framework, the Bayes risk with respect to (w.r.t.) a certain prior is minimized subject to a constraint on the maximum conditional risk [3]:

$$\begin{aligned}
\underset{\boldsymbol{\delta}\in\mathsf{D}}{\text{minimize}}\quad & r_{B}(\boldsymbol{\delta})\\
\text{subject to}\quad & \max_{0\leq j\leq M-1} R_{j}(\boldsymbol{\delta})\leq\alpha
\end{aligned} \qquad (4)$$

for some $\alpha\geq\alpha_{m}$, where $\alpha_{m}$ is the maximum conditional risk of the minimax procedure [1]. The conditional risk when hypothesis $\mathcal{H}_{j}$ is true, denoted by $R_{j}(\boldsymbol{\delta})$, is given by $R_{j}(\boldsymbol{\delta})=\sum_{i=0}^{M-1}c_{ij}p_{ij}$, and the Bayes risk is expressed as $r_{B}(\boldsymbol{\delta})=\sum_{j=0}^{M-1}\pi_{j}R_{j}(\boldsymbol{\delta})$, where $\pi_{j}$ denotes the a priori probability of hypothesis $\mathcal{H}_{j}$ and $c_{ij}$ is the cost incurred by choosing hypothesis $\mathcal{H}_{i}$ when in fact hypothesis $\mathcal{H}_{j}$ is true. Hence, (4) is a special case of (3).
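As a sanity check on these definitions, the sketch below evaluates the conditional risks $R_j$, the Bayes risk $r_B$, and the restricted-Bayes constraint for hypothetical uniform costs, equal priors, and illustrative error probabilities (none of these numbers come from the paper):

```python
import numpy as np

# Hypothetical M = 2 setup with uniform error costs and equal priors.
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])       # C[i, j] = c_ij
pi = np.array([0.5, 0.5])        # prior probabilities pi_j
p = np.array([[0.8, 0.25],
              [0.2, 0.75]])      # p[i, j] from (2); each column sums to one

R = (C * p).sum(axis=0)          # conditional risks R_j = sum_i c_ij p_ij
r_B = pi @ R                     # Bayes risk r_B = sum_j pi_j R_j
alpha = 0.3
feasible = R.max() <= alpha      # the constraint in (4)
```

With uniform costs, $R_j$ is simply the conditional error probability under $\mathcal{H}_j$, which is why the minimax and restricted-Bayes criteria are directly comparable in this special case.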

In this letter, for the first time in the literature, we provide a unified characterization of optimal decision rules for simple hypothesis testing under a general criterion involving error probabilities.

II Preliminaries

Let 𝒗\boldsymbol{v} be a real (column) vector of length M(M1)M(M-1) whose elements are denoted as vijv_{ij} for 0i,jM10\leq i,j\leq M-1 and iji\neq j. Next, we present an optimal deterministic decision rule that minimizes the weighted sum of pijp_{ij}’s with arbitrary real weights 𝒗\boldsymbol{v}.111In classical Bayesian MM-ary hypothesis testing, vij=πj(cijcjj)v_{ij}=\pi_{j}(c_{ij}-c_{jj}).

II-A Optimal decision rule that minimizes $\boldsymbol{v}^{T}\boldsymbol{p}(\boldsymbol{\delta})$

The corresponding weighted sum of pairwise error probabilities can be written as

$$\begin{aligned}
\boldsymbol{v}^{T}\boldsymbol{p}(\boldsymbol{\delta}) &=\sum_{i=0}^{M-1}\sum_{j=0,j\neq i}^{M-1}v_{ij}p_{ij}\\
&=\int_{\Gamma}\sum_{i=0}^{M-1}\delta_{i}(\boldsymbol{y})\left(\sum_{j=0,j\neq i}^{M-1}v_{ij}f_{j}(\boldsymbol{y})\right)\mu(d\boldsymbol{y}),
\end{aligned} \qquad (5)$$

where (2) is substituted for $p_{ij}$ in (5). Defining $V_{i}(\boldsymbol{y}):=\sum_{j=0,j\neq i}^{M-1}v_{ij}f_{j}(\boldsymbol{y})$, we get

$$\begin{aligned}
\boldsymbol{v}^{T}\boldsymbol{p}(\boldsymbol{\delta}) &=\int_{\Gamma}\sum_{i=0}^{M-1}\delta_{i}(\boldsymbol{y})V_{i}(\boldsymbol{y})\,\mu(d\boldsymbol{y})\\
&\geq\int_{\Gamma}\min_{0\leq i\leq M-1}\{V_{i}(\boldsymbol{y})\}\,\mu(d\boldsymbol{y}).
\end{aligned} \qquad (6)$$

The lower bound in (6) is achieved if, for all $\boldsymbol{y}\in\Gamma$, we set

$$\delta_{\ell}(\boldsymbol{y})=1\ \text{ for }\ \ell=\operatorname*{argmin}_{0\leq i\leq M-1}V_{i}(\boldsymbol{y}) \qquad (7)$$

(and hence, $\delta_{i}(\boldsymbol{y})=0$ for all $i\neq\ell$), i.e., each observed vector $\boldsymbol{y}$ is assigned to the hypothesis that minimizes $V_{i}(\boldsymbol{y})$ over all $0\leq i\leq M-1$. When multiple hypotheses achieve the same minimum value $V_{\ell}(\boldsymbol{y})$ for a given observation $\boldsymbol{y}$, the ties can be broken by arbitrarily selecting one of them, since the boundary decision does not affect the decision criterion $\boldsymbol{v}^{T}\boldsymbol{p}(\boldsymbol{\delta})$. However, the pairwise probabilities for erroneously selecting hypotheses $\mathcal{H}_{i}$ and $\mathcal{H}_{j}$ will change if the set of boundary points

$$\mathsf{B}_{i,j}(\boldsymbol{v}):=\{\boldsymbol{y}\in\Gamma\,:\,V_{i}(\boldsymbol{y})=V_{j}(\boldsymbol{y})\leq V_{k}(\boldsymbol{y})\text{ for all }0\leq k\leq M-1,\,k\neq i,\,k\neq j\} \qquad (8)$$

has nonzero probability. We also define the set of all boundary points

$$\mathsf{B}(\boldsymbol{v}):=\bigcup_{0\leq i<j\leq M-1}\mathsf{B}_{i,j}(\boldsymbol{v}) \qquad (9)$$

and the complementary set where $V_{i}(\boldsymbol{y})$ for some $0\leq i\leq M-1$ is strictly smaller than the rest:

$$\bar{\mathsf{B}}(\boldsymbol{v}):=\Gamma\setminus\mathsf{B}(\boldsymbol{v})=\{\boldsymbol{y}\in\Gamma\,:\,V_{i}(\boldsymbol{y})<V_{j}(\boldsymbol{y})\text{ for some }0\leq i\leq M-1\text{ and all }0\leq j\leq M-1,\,j\neq i\} \qquad (10)$$
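A minimal sketch of the rule in (7) on a discrete observation set (hypothetical weights and densities, chosen for illustration) confirms that its weighted error $\boldsymbol{v}^{T}\boldsymbol{p}$ attains the lower bound in (6):

```python
import numpy as np

# Hypothetical discrete setup: Gamma = {0, 1, 2}, M = 2, illustrative weights.
F = np.array([[0.7, 0.2, 0.1],   # f_0 over Gamma
              [0.1, 0.3, 0.6]])  # f_1 over Gamma
v = np.array([[0.0, 1.0],        # v_01 = 1
              [2.0, 0.0]])       # v_10 = 2 (diagonal entries unused)

M = F.shape[0]
# V_i(y) = sum over j != i of v_ij f_j(y)
V = np.array([sum(v[i, j] * F[j] for j in range(M) if j != i) for i in range(M)])
decision = V.argmin(axis=0)      # rule (7): decide argmin_i V_i(y)

# Error probabilities of this deterministic rule via (2), then its weighted error.
p = np.array([[float(decision[y] == i) for y in range(F.shape[1])]
              for i in range(M)]) @ F.T
vTp = sum(v[i, j] * p[i, j] for i in range(M) for j in range(M) if j != i)

# The weighted error equals the lower bound sum_y min_i V_i(y) from (6).
assert abs(vTp - V.min(axis=0).sum()) < 1e-12
```

In this example no observation lies in $\mathsf{B}(\boldsymbol{v})$, so the rule is unique up to sets of zero probability.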

II-B The set of achievable pairwise error probability vectors

Let 𝖯\mathsf{P} denote the set of all pairwise error probability vectors that can be achieved by randomized decision functions 𝜹𝖣\boldsymbol{\delta}\in\mathsf{D}, i.e., 𝖯:={𝒑(𝜹):𝜹𝖣}\mathsf{P}\vcentcolon=\{\boldsymbol{p}(\boldsymbol{\delta})\,:\,\boldsymbol{\delta}\in\mathsf{D}\}. In this part, we present some properties of 𝖯\mathsf{P}.

Property 1: $\mathsf{P}$ is a convex set.

Proof: Let $\boldsymbol{p}^{1}(\boldsymbol{\delta}^{1})$ and $\boldsymbol{p}^{2}(\boldsymbol{\delta}^{2})$ be two pairwise error probability vectors obtained by employing randomized decision functions $\boldsymbol{\delta}^{1}$ and $\boldsymbol{\delta}^{2}$, respectively. Then, for any $\theta$ with $0\leq\theta\leq 1$, $\boldsymbol{p}_{\theta}=\theta\boldsymbol{p}^{1}(\boldsymbol{\delta}^{1})+(1-\theta)\boldsymbol{p}^{2}(\boldsymbol{\delta}^{2})\in\mathsf{P}$ since $\boldsymbol{p}_{\theta}$ is the pairwise error probability vector corresponding to the randomized decision rule $\theta\boldsymbol{\delta}^{1}+(1-\theta)\boldsymbol{\delta}^{2}$, as seen from (2).
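The linearity argument behind this proof can be checked numerically; below is a small sketch with hypothetical densities and two hypothetical deterministic rules:

```python
import numpy as np

# Sketch: the error vector of the mixture theta*delta1 + (1-theta)*delta2 equals
# the same convex combination of the individual error vectors.
F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])            # f_0, f_1 over a 3-point set
d1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])           # deterministic rule delta^1
d2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 1.0]])           # deterministic rule delta^2
theta = 0.3

p = lambda d: d @ F.T                      # error probabilities via (2)
p_mix = p(theta * d1 + (1 - theta) * d2)   # randomize pointwise, then evaluate
assert np.allclose(p_mix, theta * p(d1) + (1 - theta) * p(d2))
```

The identity holds because (2) is linear in $\boldsymbol{\delta}$, which is exactly why $\mathsf{P}$ is convex.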

Property 2: Let $\boldsymbol{p}_{0}$ be a point on the boundary of $\mathsf{P}$. There exists a hyperplane $\{\boldsymbol{p}\,:\,\boldsymbol{v}^{T}\boldsymbol{p}=\boldsymbol{v}^{T}\boldsymbol{p}_{0}\}$ that is tangent to $\mathsf{P}$ at $\boldsymbol{p}_{0}$, with $\boldsymbol{v}^{T}\boldsymbol{p}\geq\boldsymbol{v}^{T}\boldsymbol{p}_{0}$ for all $\boldsymbol{p}\in\mathsf{P}$.

Proof: Follows immediately from the supporting hyperplane theorem [5, Sec. 2.5.2].

III Characterization of Optimal Decision Rule

In order to characterize the solution of (3), we first present the following lemma.

Lemma: Let $\boldsymbol{p}_{0}$ be a point on the boundary of $\mathsf{P}$ and $\{\boldsymbol{p}\,:\,\boldsymbol{v}^{T}\boldsymbol{p}=\boldsymbol{v}^{T}\boldsymbol{p}_{0}\}$ be a supporting hyperplane to $\mathsf{P}$ at the point $\boldsymbol{p}_{0}$.
Case 1: Any deterministic decision rule of the form given in (7) corresponding to the weights specified by $\boldsymbol{v}$ yields $\boldsymbol{p}_{0}$ if $\mathsf{B}(\boldsymbol{v})$, defined in (9), has zero probability under all hypotheses.
Case 2: $\boldsymbol{p}_{0}$ is achieved by a randomization among at most $M(M-1)$ deterministic decision rules of the form given in (7), all corresponding to the same weights specified by $\boldsymbol{v}$, if $\mathsf{B}(\boldsymbol{v})$, defined in (9), has nonzero probability under some hypotheses.

Proof: See Appendix A.

It should be noted that the condition in case 1 of the lemma, i.e., that $\mathsf{B}(\boldsymbol{v})$ has zero probability under all hypotheses, is not difficult to satisfy. A simple example is when the observation under hypothesis $\mathcal{H}_{i}$ is Gaussian distributed with mean $\mu_{i}$ and variance $\sigma^{2}$ for all $0\leq i\leq M-1$. Furthermore, the lemma implies that any extreme point of the convex set $\mathsf{P}$, i.e., any point on the boundary of $\mathsf{P}$ that is not a convex combination of any other points in the set, can be achieved by a deterministic decision rule of the form (7) without any randomization. The points that are on the boundary but are not extreme points can be obtained via randomization, as stated in case 2.

Next, we present a unified characterization of the optimal decision rule for problems that are in the form of (3). We suppose that the problem in (3) is feasible, and let $\boldsymbol{\delta}^{\ast}$ and $\boldsymbol{p}^{\ast}(\boldsymbol{\delta}^{\ast})$ denote an optimal decision rule and the corresponding pairwise error probability vector, respectively.

Theorem: An optimal decision rule that solves (3) can be obtained as
Case 1: a randomization among at most two deterministic decision rules of the form given in (7), each specified by some real $\boldsymbol{v}$, if $\mathsf{B}(\boldsymbol{v})$, defined in (9), has zero probability under all hypotheses for all real $\boldsymbol{v}$; otherwise,
Case 2: a randomization among at most $M(M-1)+1$ deterministic decision rules of the form given in (7), one specified by some real $\boldsymbol{v}$ and the remaining $M(M-1)$ corresponding to the same weights specified by another real $\boldsymbol{v}$.

Proof: If the optimal point $\boldsymbol{p}^{\ast}(\boldsymbol{\delta}^{\ast})$ is on the boundary of $\mathsf{P}$, the lemma takes care of the proof. Here, we consider the case when $\boldsymbol{p}^{\ast}(\boldsymbol{\delta}^{\ast})$ is an interior point of $\mathsf{P}$. First, we pick an arbitrary $\boldsymbol{v}^{1}\in\mathbb{R}^{M(M-1)}$ and derive the optimal deterministic decision rule according to (7). Let $\boldsymbol{p}^{1}$ denote the pairwise error probability vector corresponding to the employed decision rule. Then, we move along the ray that originates from $\boldsymbol{p}^{1}$ and passes through $\boldsymbol{p}^{\ast}(\boldsymbol{\delta}^{\ast})$. Since $\mathsf{P}$ is bounded, this ray intersects the boundary of $\mathsf{P}$ at some point, say $\boldsymbol{p}^{2}$. If the condition in case 1 is satisfied, then by case 1 of the lemma, there exists a deterministic decision rule of the form given in (7) that yields $\boldsymbol{p}^{2}$. Otherwise, by case 2 of the lemma, $\boldsymbol{p}^{2}$ is achieved by a randomization among at most $M(M-1)$ deterministic decision rules of the form given in (7), all sharing the same weight vector $\boldsymbol{v}^{2}$. Since $\boldsymbol{p}^{\ast}(\boldsymbol{\delta}^{\ast})$ resides on the line segment that connects $\boldsymbol{p}^{1}$ to $\boldsymbol{p}^{2}$, it can be attained by appropriately randomizing among the decision rules that yield $\boldsymbol{p}^{1}$ and $\boldsymbol{p}^{2}$. $\blacksquare$

When the optimization problem in (3) possesses certain structure, the maximum number of deterministic decision rules required to achieve optimal performance may be smaller than stated in the theorem. For example, suppose that the objective is a concave function of $\boldsymbol{p}$ and there are a total of $n$ constraints in (3), all linear in $\boldsymbol{p}$ (i.e., the feasible set, denoted by $\mathsf{P}^{\prime}$, is the intersection of $\mathsf{P}$ with halfspaces and hyperplanes). It is well known that the minimum of a concave function over a closed bounded convex set is achieved at an extreme point [5]. Hence, in this case, the optimal point $\boldsymbol{p}^{\ast}$ is an extreme point of $\mathsf{P}^{\prime}$. By Dubins' theorem [6], any extreme point of $\mathsf{P}^{\prime}$ can be written as a convex combination of $n+1$ or fewer extreme points of $\mathsf{P}$. Since any extreme point of $\mathsf{P}$ can be achieved by a deterministic decision rule of the form (7), the optimal decision rule is obtained as a randomization among at most $n+1$ deterministic decision rules of the form (7). If there are no constraints in (3), i.e., $n=0$, the deterministic decision rule given in (7) is optimal, and no randomization is required with a concave objective function.

An immediate and important corollary of the theorem is given below.

Corollary: Likelihood ratios are sufficient statistics for simple $M$-ary hypothesis testing under any decision criterion that is expressed in terms of arbitrary functions of error probabilities as specified in (3).

Proof: The theorem states that a solution of the generic optimization problem in (3) can be expressed in terms of decision rules of the form given in (7). These decision rules only involve comparisons among the $V_{i}(\boldsymbol{y})$'s, which are linear w.r.t. the densities $f_{i}(\boldsymbol{y})$. Normalizing the $f_{i}(\boldsymbol{y})$'s by $f_{0}(\boldsymbol{y})$ and defining $L_{i}(\boldsymbol{y}):=f_{i}(\boldsymbol{y})/f_{0}(\boldsymbol{y})$, we see that an optimal decision rule that solves the problem in (3) depends on the observation $\boldsymbol{y}$ only through the likelihood ratios. $\blacksquare$
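The normalization step can be illustrated numerically: dividing each $V_{i}(\boldsymbol{y})$ by $f_{0}(\boldsymbol{y})>0$ rescales all candidates by the same positive factor for each $\boldsymbol{y}$, so the argmin in (7) is unchanged. A sketch with hypothetical densities and weights:

```python
import numpy as np

# Hypothetical densities (f_0 > 0 everywhere) and weights over Gamma = {0, 1, 2}.
F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
v = np.array([[0.0, 1.0],
              [2.0, 0.0]])

M = F.shape[0]
V = np.array([sum(v[i, j] * F[j] for j in range(M) if j != i) for i in range(M)])
L = F / F[0]   # likelihood ratios L_i(y) = f_i(y) / f_0(y), with L_0 = 1
W = np.array([sum(v[i, j] * L[j] for j in range(M) if j != i) for i in range(M)])

# Per-observation scaling by 1/f_0(y) > 0 preserves the comparisons in (7).
assert (V.argmin(axis=0) == W.argmin(axis=0)).all()
```

Consequently, a detector may compute and compare the $V_i$'s directly from the likelihood ratios without access to the raw densities.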

IV Numerical Examples

In this section, numerical examples are presented by considering a binary hypothesis testing problem, i.e., $M=2$ in (1). Suppose that a bit ($0$ or $1$) is sent over two independent binary channels to a decision maker, which aims to make an optimal decision based on the binary channel outputs. The output of binary channel $k$ is denoted by $y_{k}\in\{0,1\}$, $k=1,2$, and the decision maker declares its decision based on $\boldsymbol{y}=[y_{1},y_{2}]$. The probability that the output of binary channel $k$ is $i$ when bit $j$ is sent is denoted by $p_{ij}^{(k)}$ for $0\leq i,j\leq 1$, with $p_{0j}^{(k)}+p_{1j}^{(k)}=1$. Then, the pmf of $\boldsymbol{y}$ under $\mathcal{H}_{j}$ is given by

$$f_{j}(\boldsymbol{y})=\begin{cases}p_{0j}^{(1)}p_{0j}^{(2)}, & \text{if }\boldsymbol{y}=[0,0]\\ p_{0j}^{(1)}p_{1j}^{(2)}, & \text{if }\boldsymbol{y}=[0,1]\\ p_{1j}^{(1)}p_{0j}^{(2)}, & \text{if }\boldsymbol{y}=[1,0]\\ p_{1j}^{(1)}p_{1j}^{(2)}, & \text{if }\boldsymbol{y}=[1,1]\end{cases} \qquad (11)$$

for $j\in\{0,1\}$. As in the previous sections, the pairwise error probability vector of the decision maker for a given decision rule $\boldsymbol{\delta}$ is represented by $\boldsymbol{p}(\boldsymbol{\delta})$, which is expressed as $\boldsymbol{p}(\boldsymbol{\delta})=[p_{10},p_{01}]^{T}$ in this case. It is assumed that the decision maker knows the conditional pmfs in (11).
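The pmf in (11) can be written out directly; the crossover probabilities below are the values used later in the first example ($p_{10}^{(k)}=0.4$, $p_{01}^{(k)}=0.1$):

```python
from itertools import product

# Channel crossover probabilities: on channel k,
# p10[k] = P(output 1 | bit 0 sent) and p01[k] = P(output 0 | bit 1 sent).
p10 = [0.4, 0.4]
p01 = [0.1, 0.1]

def f(j, y):
    """pmf of y = [y1, y2] under H_j, as in (11); channels are independent."""
    out = 1.0
    for k in range(2):
        if j == 0:
            out *= p10[k] if y[k] == 1 else 1.0 - p10[k]
        else:
            out *= 1.0 - p01[k] if y[k] == 1 else p01[k]
    return out

# Each f_j is a valid pmf over the four possible channel-output pairs.
for j in (0, 1):
    assert abs(sum(f(j, y) for y in product((0, 1), repeat=2)) - 1.0) < 1e-12
```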

In this section, a special case of (3) is considered based on prospect theory by focusing on a behavioral decision maker [4, 7, 8, 9]. In particular, there exist no constraints (i.e., $m=p=0$ in (3)) and the objective function in (3) is expressed as

$$g_{0}(\boldsymbol{p}(\boldsymbol{\delta}))=\sum_{i=0}^{1}\sum_{j=0}^{1}w\big(P(\mathcal{H}_{i}\text{ is selected \& }\mathcal{H}_{j}\text{ is true})\big)\,v(c_{ij}) \qquad (12)$$

where $w(\cdot)$ is a weight function and $v(\cdot)$ is a value function, which characterize how a behavioral decision maker distorts probabilities and costs, respectively [4], and $P(\cdot)$ denotes the probability of its argument. In the numerical examples, the following weight function is employed: $w(p)=\frac{p^{\kappa}}{(p^{\kappa}+(1-p)^{\kappa})^{1/\kappa}}$ [4, 7, 8, 9]. In addition, the other parameters are set as $v(c_{00})=3$, $v(c_{01})=10$, $v(c_{10})=20$, and $v(c_{11})=7$. Furthermore, the prior probabilities of bit $0$ and bit $1$ are assumed to be equal.
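The weight function and an objective of the form (12) can be sketched as follows. This is only a sketch under the assumption that, with equal priors, $P(\mathcal{H}_i \text{ is selected \& } \mathcal{H}_j \text{ is true}) = \tfrac{1}{2}p_{ij}$; it is not claimed to reproduce the reported example values:

```python
import numpy as np

def w(p, kappa):
    """Probability weighting function w(p) = p^k / (p^k + (1-p)^k)^(1/k)."""
    p = np.asarray(p, dtype=float)
    return p**kappa / (p**kappa + (1.0 - p)**kappa)**(1.0 / kappa)

# Value-function outputs from the examples: vc[i, j] = v(c_ij).
vc = np.array([[3.0, 10.0],
               [20.0, 7.0]])

def g0(p10, p01, kappa):
    """Objective of the form (12), assuming equal priors (pi_0 = pi_1 = 1/2)."""
    P = 0.5 * np.array([[1.0 - p10, p01],
                        [p10, 1.0 - p01]])  # joint selection/truth probabilities
    return float(np.sum(w(P, kappa) * vc))
```

Note that $w$ fixes the endpoints ($w(0)=0$, $w(1)=1$) while distorting intermediate probabilities, which is what makes randomization beneficial in this setting.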

The aim of the decision maker is to obtain a decision rule that minimizes (12). In the first example, $\kappa$ is set to $5$, and the parameters of the binary channels are selected as $p_{10}^{(1)}=p_{10}^{(2)}=0.4$ and $p_{01}^{(1)}=p_{01}^{(2)}=0.1$. In this case, it can be shown via (11) that there exist $6$ different deterministic decision rules in the form of (7), which achieve the pairwise error probability vectors marked with blue stars in Fig. 1. The convex hull of these pairwise error probability vectors is also illustrated in the figure. Over these deterministic decision rules (i.e., in the absence of randomization), the minimum achievable value of (12) becomes $0.1901$, which corresponds to the pairwise error probability vector shown with the green square in Fig. 1. If randomization between two deterministic decision rules in the form of (7) is considered, the resulting minimum objective value becomes $0.0422$, and the corresponding pairwise error probability vector is indicated with the red triangle in the figure. On the other hand, in compliance with the theorem (case 2), the minimum value of (12) is achieved via randomization among (at most) three deterministic decision rules in the form of (7) (since $M(M-1)+1=3$). In this case, the optimal decision rule randomizes among $\delta_{1}$, $\delta_{2}$, and $\delta_{3}$, with randomization coefficients $0.41$, $0.51$, and $0.08$, respectively, as given below:

$$\begin{aligned}
\delta_{1}(\boldsymbol{y}) &=0~\text{for all }\boldsymbol{y}\\
\delta_{2}(\boldsymbol{y}) &=\begin{cases}0, & \text{if }\boldsymbol{y}\in\{[0,1],[1,0],[1,1]\}\\ 1, & \text{if }\boldsymbol{y}=[0,0]\end{cases}\\
\delta_{3}(\boldsymbol{y}) &=\begin{cases}0, & \text{if }\boldsymbol{y}=[1,1]\\ 1, & \text{if }\boldsymbol{y}\in\{[0,0],[0,1],[1,0]\}\end{cases}
\end{aligned} \qquad (13)$$

This optimal decision rule achieves the lowest objective value of $0.0400$, and the corresponding pairwise error probability vector is marked with the black circle in Fig. 1. Hence, this example shows that randomization among three deterministic decision rules may be required to obtain the solution of (3).

Figure 1: Convex hull of the pairwise error probability vectors corresponding to the deterministic decision rules in (7), together with the pairwise error probability vectors of the decision rules that yield the minimum objective values attained via no randomization (square), randomization of two (triangle), and randomization of three deterministic decision rules (circle), where $p_{10}^{(1)}=p_{10}^{(2)}=0.4$, $p_{01}^{(1)}=p_{01}^{(2)}=0.1$, and $\kappa=5$.

In the second example, the parameters are taken as $\kappa=1.5$, $p_{10}^{(1)}=0.3$, $p_{10}^{(2)}=0.2$, $p_{01}^{(1)}=0.4$, and $p_{01}^{(2)}=0.25$. In this case, there exist $8$ different deterministic decision rules in the form of (7), which achieve the pairwise error probability vectors marked with blue stars in Fig. 2. The minimum value of (12) among these deterministic decision rules is $3.9278$, which corresponds to the pairwise error probability vector shown with the green square in the figure. In addition, the pairwise error probability vectors corresponding to the solutions with randomization of two and three deterministic decision rules are marked with the red triangle and the black circle, respectively. In this scenario, the minimum objective value ($3.8432$) can be achieved via randomization of only two deterministic decision rules. This is again in compliance with the theorem (case 2), which states that an optimal decision rule can be obtained as a randomization among at most $M(M-1)+1$ deterministic decision rules of the form given in (7).

Figure 2: Convex hull of the pairwise error probability vectors corresponding to the deterministic decision rules in (7), together with the pairwise error probability vectors of the decision rules that yield the minimum objective values attained via no randomization (square), randomization of two (triangle), and randomization of three deterministic decision rules (circle), where $p_{10}^{(1)}=0.3$, $p_{10}^{(2)}=0.2$, $p_{01}^{(1)}=0.4$, $p_{01}^{(2)}=0.25$, and $\kappa=1.5$.

V Concluding Remarks

This letter presents a unified characterization of optimal decision rules for simple $M$-ary hypothesis testing under a generic performance criterion that depends on arbitrary functions of error probabilities. It is shown that optimal performance with respect to the design criterion can be achieved by randomizing among at most two deterministic decision rules of a form reminiscent of (but not necessarily identical to) the Bayes rule when points on the decision boundary do not contribute to the error probabilities. In the general case, the search for an optimal decision rule reduces to a search over two weight coefficient vectors, each of length $M(M-1)$. Likelihood ratios are shown to be sufficient statistics. Classical performance measures including Bayesian, minimax, Neyman-Pearson, generalized Neyman-Pearson, restricted Bayesian, and prospect theory based approaches all appear as special cases of the considered framework.

Finally, we point out that the form of optimal local sensor decision rules for the problem of distributed detection [10, 11, 12, 13] with conditionally independent observations at the sensors and an arbitrary fusion rule can be characterized using the proposed framework.

Appendix A Proof of Lemma

Since $\{\boldsymbol{p}\,:\,\boldsymbol{v}^{T}\boldsymbol{p}=\boldsymbol{v}^{T}\boldsymbol{p}_{0}\}$ is a supporting hyperplane to $\mathsf{P}$ at the point $\boldsymbol{p}_{0}$, we get $\boldsymbol{v}^{T}\boldsymbol{p}\geq\boldsymbol{v}^{T}\boldsymbol{p}_{0}$ for all $\boldsymbol{p}\in\mathsf{P}$. Furthermore, the deterministic decision rule given in (7), denoted here by $\boldsymbol{\delta}^{\ast}$, minimizes $\boldsymbol{v}^{T}\boldsymbol{p}$ among all decision rules $\boldsymbol{\delta}\in\mathsf{D}$ (and consequently over all $\boldsymbol{p}\in\mathsf{P}$). Since $\boldsymbol{p}_{0}\in\mathsf{P}$ as well, the deterministic decision rule given in (7) achieves a performance score of $\boldsymbol{v}^{T}\boldsymbol{p}_{0}$. Any other decision rule that does not agree with $\boldsymbol{\delta}^{\ast}$ on a subset of $\bar{\mathsf{B}}(\boldsymbol{v})$ with nonzero probability measure will have a strictly greater performance score than $\boldsymbol{v}^{T}\boldsymbol{p}_{0}$ (due to the optimality of $\boldsymbol{\delta}^{\ast}$), and hence cannot be on the supporting hyperplane.
Case 1: We prove the first part by contraposition. Suppose that the deterministic decision rule $\boldsymbol{\delta}^{\ast}$ given in (7) yields $\boldsymbol{p}^{\ast}\neq\boldsymbol{p}_{0}$, meaning that $\boldsymbol{p}_{0}$ is achieved by some other decision rule $\boldsymbol{\delta}^{0}\in\mathsf{D}$. Since $\boldsymbol{\delta}^{\ast}$ minimizes $\boldsymbol{v}^{T}\boldsymbol{p}$ over all $\boldsymbol{p}\in\mathsf{P}$, $\boldsymbol{v}^{T}\boldsymbol{p}^{\ast}=\boldsymbol{v}^{T}\boldsymbol{p}_{0}$ holds, and both $\boldsymbol{p}^{\ast}$ and $\boldsymbol{p}_{0}$ are located on the supporting hyperplane $\{\boldsymbol{p}\,:\,\boldsymbol{v}^{T}\boldsymbol{p}=\boldsymbol{v}^{T}\boldsymbol{p}_{0}\}$. This implies that $\boldsymbol{\delta}^{\ast}$ and $\boldsymbol{\delta}^{0}$ must agree on every subset of $\bar{\mathsf{B}}(\boldsymbol{v})$ with nonzero probability measure. As a result, the difference between the pairwise error probability vectors $\boldsymbol{p}^{\ast}$ and $\boldsymbol{p}_{0}$ must stem from the difference between $\boldsymbol{\delta}^{\ast}$ and $\boldsymbol{\delta}^{0}$ over $\mathsf{B}(\boldsymbol{v})$. Consequently, the set $\mathsf{B}(\boldsymbol{v})$ cannot have zero probability under all hypotheses.
Case 2: Suppose that the set of boundary points specified by $\mathsf{B}(\boldsymbol{v})$ has nonzero probability under some hypotheses. In this case, each point in $\mathsf{B}_{i,j}(\boldsymbol{v})$ can be assigned arbitrarily (or in a randomized manner) to hypotheses $\mathcal{H}_{i}$ and $\mathcal{H}_{j}$. Since the way the ties are broken does not change $\boldsymbol{v}^{T}\boldsymbol{p}$, the resulting error probability vectors are all located on the intersection of the set $\mathsf{P}$ with the $(M(M-1)-1)$-dimensional supporting hyperplane $\{\boldsymbol{p}\,:\,\boldsymbol{v}^{T}\boldsymbol{p}=\boldsymbol{v}^{T}\boldsymbol{p}_{0}\}$. By Carathéodory's theorem [14], any point (including $\boldsymbol{p}_{0}$) in the intersection set, whose dimension is at most $M(M-1)-1$, can be represented as a convex combination of at most $M(M-1)$ extreme points of this set. Since these extreme points can only be obtained via deterministic decision rules that all agree with $\boldsymbol{\delta}^{\ast}$ on the set $\bar{\mathsf{B}}(\boldsymbol{v})$, $\boldsymbol{p}_{0}$ can be achieved by a randomization among at most $M(M-1)$ deterministic decision rules of the form given in (7), all corresponding to the weights specified by $\boldsymbol{v}$. $\blacksquare$

References

  • [1] H. V. Poor, An Introduction to Signal Detection and Estimation. New York: Springer-Verlag, 1994.
  • [2] E. L. Lehmann, Testing Statistical Hypotheses, 2nd ed. New York, USA: Chapman & Hall, 1986.
  • [3] J. L. Hodges, Jr. and E. L. Lehmann, “The use of previous experience in reaching statistical decisions,” Ann. Math. Stat., vol. 23, no. 3, pp. 396–407, Sep. 1952.
  • [4] S. Gezici and P. K. Varshney, “On the optimality of likelihood ratio test for prospect theory-based binary hypothesis testing,” IEEE Signal Process. Lett., vol. 25, no. 12, pp. 1845–1849, Dec 2018.
  • [5] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: Cambridge University Press, 2004.
  • [6] H. Witsenhausen, “Some aspects of convexity useful in information theory,” IEEE Trans. Inform. Theory, vol. 26, no. 3, pp. 265–271, May 1980.
  • [7] R. Gonzalez and G. Wu, “On the shape of the probability weighting function,” Cognitive Psychology, vol. 38, no. 1, pp. 129–166, 1999.
  • [8] D. Prelec, “The probability weighting function,” Econometrica, vol. 66, no. 3, pp. 497–527, 1998.
  • [9] A. Tversky and D. Kahneman, “Advances in prospect theory: Cumulative representation of uncertainty,” Journal of Risk and Uncertainty, vol. 5, pp. 297–323, 1992.
  • [10] C. Altay and H. Delic, “Optimal quantization intervals in distributed detection,” IEEE Trans. Aerosp. Electron. Syst., vol. 52, no. 1, pp. 38–48, February 2016.
  • [11] C. A. M. Sotomayor, R. P. David, and R. Sampaio-Neto, “Adaptive nonassisted distributed detection in sensor networks,” IEEE Trans. Aerosp. Electron. Syst., vol. 53, no. 6, pp. 3165–3174, Dec 2017.
  • [12] A. Ghobadzadeh and R. S. Adve, “Separating function estimation test for binary distributed radar detection with unknown parameters,” IEEE Trans. Aerosp. Electron. Syst., 2018.
  • [13] D. Warren and P. Willett, “Optimum quantization for detector fusion: some proofs, examples, and pathology,” J. Franklin Inst., vol. 336, no. 2, pp. 323–359, 1999.
  • [14] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton University Press, 1968.