Retrieving Data Permutations from Noisy Observations: High and Low Noise Asymptotics
Abstract
This paper considers the problem of recovering the permutation of an $n$-dimensional random vector $\mathbf{X}$ observed in Gaussian noise. First, a general expression for the probability of error is derived when a linear decoder (i.e., a linear estimator followed by a sorting operation) is used. The derived expression holds with minimal assumptions on the distribution of $\mathbf{X}$ and allows the noise to have memory. Second, for the case of isotropic noise (i.e., noise with covariance matrix $\sigma^2 \mathbf{I}_n$), the rates of convergence of the probability of error are characterized in the high and low noise regimes. In the low noise regime, for every dimension $n$, the probability of error is shown to behave proportionally to $\sigma$, where $\sigma$ is the noise standard deviation. Moreover, the slope is computed exactly for several distributions and it is shown to behave quadratically in $n$. In the high noise regime, for every dimension $n$, the probability of correctness is shown to approach $1/n!$ at a rate proportional to $1/\sigma$, and the exact expression for the rate of convergence is also provided.
I Introduction
The problem of recovering data permutations from noisy observations is becoming a common task in modern communication and computing systems. For example, systems based on data sorting operations, such as recommender systems or data analysis systems, make use of data permutations and leverage the information that can be obtained from the data ordering. In particular, recommender systems clearly utilize the sorting information in order to optimize their next recommendation. As in the case of a recommender system, data analysis systems are also often interested in rankings of massive data sets rather than in the exact values of the data. In such systems, users may wish to conceal their data when it contains sensitive information. A common solution for privatizing individual data is to add a sufficient amount of random noise to guarantee the desired privacy level [1]. However, adding too much noise can render the task of recovering a permutation impossible, as the data will be too noisy. Therefore, for a given noise level, it is important to understand the fundamental limits of the data permutation recovery problem.
In this work, following the preliminary works in [2] and [3], we study the data permutation recovery problem within an $n!$-ary hypothesis testing framework. The specific goal of this paper is to study fundamental limits of this problem under the constraint that a linear decoder (i.e., a linear estimator followed by a sorting operation) is employed. Studying linear decoders is interesting for several reasons. First, as shown in [2], linear decoders are optimal (i.e., they lead to the smallest probability of error) when the noise is isotropic and the distribution of the input data is exchangeable. Second, the optimal decoder can be linear even if the noise is colored; see [3] for the exact conditions. Third, linear decoders have at most polynomial complexity in the data dimension and hence, they are suitable for practical implementations.
The structure of the paper is as follows. In Section II, we introduce the notation and formally define the problem. In Section III, we characterize the probability of error when linear decoders are used. The derived expression holds with minimal assumptions on the distribution of the data and holds when the noise has memory. In Section IV, we utilize the expression for the error probability derived in Section III and characterize the asymptotic behavior of the probability of error for the isotropic noise case (i.e., when the noise covariance matrix is $\sigma^2 \mathbf{I}_n$) in the low and high noise regimes. For example, we show that the probability of error increases linearly in $\sigma$ (i.e., the standard deviation of the noise) in the low noise regime (i.e., when $\sigma \to 0$). We derive the exact slope and we show it to be at most a quadratic function of the data dimension $n$ for a general class of distributions. In addition, we show that the gap between the probability of correctness and $1/n!$ in the high noise regime (i.e., when $\sigma \to \infty$) is proportional to $1/\sigma$, and we characterize the exact slope.
I-A Related work
Permutation-related estimation problems have recently gained significant importance and are studied in various fields [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. The ranking (i.e., data permutation) estimation problem under a joint Gaussian distribution was investigated in [4, 5, 6, 7]. In particular, in [4] the author considered a pairwise ordering for the bivariate case; the extension to the $n$-dimensional case was considered in [5]. The generalization from the assumption of a Gaussian distribution to an elliptically contoured distribution can be found in [6, 7]. The authors in [4, 5, 6, 7] analyzed the structure of the covariance matrix that maximizes the probability of correctness of such estimation problems when the minimum mean square error (MMSE) estimator is used. In [3], the MMSE estimator was shown to be the only linear estimator that achieves the minimum probability of error for the ranking estimation problem. Most recent works study a linear regression framework in which the measurements are premultiplied by an unknown permutation matrix, which suitably models problems with unknown labels. In [8], the feature matching problem in computer vision was formulated as a permutation recovery problem. The multivariate linear regression model with an unknown permutation was studied in [9, 10]. The authors provided necessary and sufficient conditions on the signal-to-noise ratio for exact permutation recovery and characterized the minimax prediction error. Isotonic regression without data labels, namely uncoupled isotonic regression, was discussed in [11]. Data estimation given randomly selected measurements – referred to as unlabeled sensing – was studied in [12, 13, 14]. In [12], the authors characterized a necessary condition on the dimension of the observation vector for uniquely recovering the original data in the noiseless case. A generalized framework of unlabeled sensing was presented in [15, 16, 17]. The estimation of a sorted vector based on noisy observations was considered in [18], where the MMSE estimator of the sorted data was characterized as a linear combination of estimators of the unsorted data.
II Notation and Framework
Notation. Boldface upper case letters (e.g., $\mathbf{X}$) denote vector random variables; the boldface lower case letter $\mathbf{x}$ indicates a specific realization of $\mathbf{X}$; $X_{(i)}$ denotes the $i$-th order statistic of $\mathbf{X}$; $\|\mathbf{x}\|$ is the Euclidean norm of $\mathbf{x}$; $[n]$ is the set of integers from $1$ to $n$; $\mathbf{I}_n$ is the identity matrix of dimension $n$; $\mathbf{0}_n$ is the column vector of dimension $n$ of all zeros; calligraphic letters indicate sets; $|\mathcal{A}|$ is the cardinality of $\mathcal{A}$; for two sets $\mathcal{A}$ and $\mathcal{B}$, $\mathcal{A} \setminus \mathcal{B}$ is the set of elements that belong to $\mathcal{A}$ but not to $\mathcal{B}$, $\mathcal{A} \cap \mathcal{B}$ is the set of elements that belong to both $\mathcal{A}$ and $\mathcal{B}$, and $\mathcal{A} \cup \mathcal{B}$ is the set of elements that are in either set. For a set $\mathcal{A} \subseteq \mathbb{R}^n$, $\mathrm{Vol}(\mathcal{A})$ denotes its volume, i.e., its $n$-dimensional Lebesgue measure. For two $n$-dimensional vectors $\mathbf{x}$ and $\mathbf{y}$, if for all $i \in [n]$ the $i$-th element of $\mathbf{x}$ is larger than or equal to the $i$-th element of $\mathbf{y}$, then we write $\mathbf{x} \geq \mathbf{y}$. Finally, the multiplication of a matrix $\mathbf{A}$ by a set $\mathcal{S}$ is denoted and defined as $\mathbf{A}\mathcal{S} = \{\mathbf{A}\mathbf{s} : \mathbf{s} \in \mathcal{S}\}$.
We consider the framework in Fig. 1, where an $n$-dimensional random vector $\mathbf{X}$ is first generated according to a certain distribution and then passed through an additive noisy channel with Gaussian transition probability, the output of which is denoted as $\mathbf{Y}$. Thus, we have $\mathbf{Y} = \mathbf{X} + \mathbf{Z}$, with $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}_n, \mathbf{K}_Z)$, where $\mathbf{K}_Z$ is the covariance matrix of the additive noise $\mathbf{Z}$, and where $\mathbf{X}$ and $\mathbf{Z}$ are independent.
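To make the observation model concrete, the following minimal sketch (in Python, with illustrative names and parameters that are not part of the paper) draws a realization of $\mathbf{X}$ and produces a noisy observation $\mathbf{Y} = \mathbf{X} + \mathbf{Z}$:

```python
import numpy as np

def observe(x, noise_cov, rng):
    """Return a noisy observation y = x + z with z ~ N(0, noise_cov)."""
    n = x.shape[0]
    z = rng.multivariate_normal(np.zeros(n), noise_cov)
    return x + z

# Example: n = 4, isotropic noise with standard deviation sigma = 0.1.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)               # a realization of the input vector X
y = observe(x, (0.1 ** 2) * np.eye(4), rng)
```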
In this work, we are interested in studying the probability of error of the “data permutation recovery” problem formulated in [2, 3] that, given the observation of $\mathbf{Y}$, seeks to retrieve the permutation (among the $n!$ possible ones) according to which the vector $\mathbf{X}$ is sorted. Specifically, this problem can be formulated within a hypothesis testing framework with $n!$ hypotheses $\{\mathcal{H}_{\pi}\}_{\pi \in \mathcal{P}_n}$, where $\mathcal{P}_n$ is the collection of all permutations of the elements of $[n]$, and where $\mathcal{H}_{\pi}$ is the hypothesis that $\mathbf{X}$ is an $n$-dimensional vector sorted according to the permutation $\pi$, that is
$$\mathcal{H}_{\pi}: \; X_{\pi(1)} \leq X_{\pi(2)} \leq \cdots \leq X_{\pi(n)}, \qquad (1)$$
with $X_i$ being the $i$-th element of $\mathbf{X}$, and $\pi(i)$ being the $i$-th element of $\pi$. Given this, the optimal decoder in Fig. 1 will output $\hat{\pi}$ such that
(2) |
where $\Pi$ denotes the permutation according to which the random vector $\mathbf{X}$ is sorted. In particular, the decoder will declare that the input vector is sorted according to $\pi$ (i.e., $\hat{\pi} = \pi$) if and only if the observation vector $\mathbf{y} \in \mathcal{R}_{\pi}(\mathbf{K}_Z)$, where $\{\mathcal{R}_{\pi}(\mathbf{K}_Z)\}_{\pi \in \mathcal{P}_n}$ are the so-called optimal decision regions (the notation $\mathcal{R}_{\pi}(\mathbf{K}_Z)$ indicates that, in general, the decision regions might be functions of the noise covariance matrix $\mathbf{K}_Z$), which can be derived by leveraging the maximum a posteriori probability (MAP) criterion [19, Appendix 3C] and are given by [2, 3]
(3) |
where $f_{\pi}(\mathbf{y})$ denotes the conditional probability density function of $\mathbf{Y}$ given that $\mathcal{H}_{\pi}$ holds. In order to guarantee that the collection $\{\mathcal{R}_{\pi}(\mathbf{K}_Z)\}_{\pi \in \mathcal{P}_n}$ is a partition of the $n$-dimensional space, we assume that if $\mathbf{y}$ attains the maximum in (3) for more than one permutation, then one of the corresponding hypotheses is arbitrarily selected.
III Probability of Error with Linear Decoder
In this section, we focus on characterizing the probability of error of the data permutation recovery problem introduced in Section II. Given the hypotheses and decision regions defined in (1) and (3), we have that the error probability is given by
$$P_e = 1 - P_c \qquad (4a)$$
$$\phantom{P_e} = 1 - \sum_{\pi \in \mathcal{P}_n} \Pr\!\left(\mathbf{Y} \in \mathcal{R}_{\pi}(\mathbf{K}_Z),\, \mathcal{H}_{\pi}\right), \qquad (4b)$$
where $P_c$ is the probability of correctness.
In particular, we assess the probability of error when a linear decoder is employed. This decoder first computes a permutation-independent linear transformation of $\mathbf{y}$, i.e., $\mathbf{W}\mathbf{y} + \mathbf{b}$, where $\mathbf{W} \in \mathbb{R}^{n \times n}$ and $\mathbf{b} \in \mathbb{R}^{n}$ are the same for all permutations, and then it outputs the permutation according to which $\mathbf{W}\mathbf{y} + \mathbf{b}$ is sorted. The decision regions in (3) when a linear decoder is used become
$$\mathcal{R}_{\pi}(\mathbf{W},\mathbf{b}) = \left\{\mathbf{y} \in \mathbb{R}^{n} : [\mathbf{W}\mathbf{y}+\mathbf{b}]_{\pi(1)} \leq [\mathbf{W}\mathbf{y}+\mathbf{b}]_{\pi(2)} \leq \cdots \leq [\mathbf{W}\mathbf{y}+\mathbf{b}]_{\pi(n)}\right\}. \qquad (5)$$
Our choice of assessing the probability of error performance of a linear decoder stems primarily from its low complexity (at most polynomial in $n$) compared to a brute force evaluation of the optimal test in (3), which requires examining all $n!$ hypotheses and hence has a practically prohibitive complexity. Moreover, it has been shown in [3] that a linear decoder is indeed optimal, i.e., it minimizes the probability of error, under certain conditions on the noise covariance matrix $\mathbf{K}_Z$.
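For illustration, a minimal sketch of such a linear decoder is given below (Python; the default choice $\mathbf{W} = \mathbf{I}_n$, $\mathbf{b} = \mathbf{0}_n$, i.e., plain sorting of $\mathbf{y}$, is used only as an example):

```python
import numpy as np

def linear_decoder(y, W=None, b=None):
    """Linear decoder: apply the permutation-independent map W y + b,
    then output the permutation that sorts the result (via argsort)."""
    n = y.shape[0]
    W = np.eye(n) if W is None else W
    b = np.zeros(n) if b is None else b
    s = W @ y + b
    return tuple(np.argsort(s))  # estimated sorting permutation (0-based)
```

The cost is dominated by the matrix-vector product and the sort, i.e., at most polynomial in $n$, in contrast with the $n!$ hypotheses of the brute force test.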
We next derive an expression for the probability of error when a linear decoder is used. Towards this end, for each $\pi \in \mathcal{P}_n$, we define a matrix $\mathbf{M}_{\pi} \in \mathbb{R}^{(n-1) \times n}$ such that
$$\left[\mathbf{M}_{\pi}\mathbf{x}\right]_{i} = x_{\pi(i+1)} - x_{\pi(i)}, \quad i \in [n-1], \qquad (6)$$
where $[\mathbf{M}_{\pi}]_{i,j} = 1$ if and only if $j = \pi(i+1)$, $[\mathbf{M}_{\pi}]_{i,j} = -1$ if and only if $j = \pi(i)$, and $[\mathbf{M}_{\pi}]_{i,j}$ is equal to zero otherwise. For instance, the construction of $\mathbf{M}_{\pi}$ for $n = 3$ is illustrated in the sketch below.
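The following sketch builds $\mathbf{M}_{\pi}$ under the convention assumed above, namely $[\mathbf{M}_{\pi}\mathbf{x}]_i = x_{\pi(i+1)} - x_{\pi(i)}$ (the sign and indexing convention is an assumption made here for illustration; 0-based indices are used):

```python
import numpy as np

def difference_matrix(pi):
    """Build the (n-1) x n matrix M_pi with [M_pi x]_i = x[pi[i+1]] - x[pi[i]],
    so that M_pi @ x >= 0 exactly when x is sorted according to pi."""
    n = len(pi)
    M = np.zeros((n - 1, n))
    for i in range(n - 1):
        M[i, pi[i + 1]] = 1.0
        M[i, pi[i]] = -1.0
    return M

# For n = 3 and the identity permutation (0, 1, 2) this prints
# [[-1.  1.  0.]
#  [ 0. -1.  1.]]
print(difference_matrix((0, 1, 2)))
```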
The theorem below provides an expression for the error probability of the data permutation recovery problem when a linear decoder is used.
Theorem 1.
Let $\mathbf{X}$ be an exchangeable random vector (a sequence of random variables $X_1, \ldots, X_n$ is said to be exchangeable if, for any permutation $\pi$ of the indices $[n]$, the vector $(X_{\pi(1)}, \ldots, X_{\pi(n)})$ is equal in distribution to $(X_1, \ldots, X_n)$). Then, for any invertible $\mathbf{W}$ and any $\mathbf{b}$ as defined in (5), and any noise covariance matrix $\mathbf{K}_Z$, the probability of error is given by
(7) |
where $\mathbf{\Sigma} = \mathbf{M}_{\pi}\mathbf{W}\mathbf{K}_Z\mathbf{W}^{T}\mathbf{M}_{\pi}^{T}$ with $\mathbf{M}_{\pi}$ given by (6), and where $Q_{\mathbf{\Sigma}}$ is the multivariate Gaussian Q-function with covariance $\mathbf{\Sigma}$.
Proof:
By substituting the decision regions in (5) into (4) and by using Bayes' theorem, we obtain
(8) |
where follows from the fact that is exchangeable and hence, and letting , and is due to the law of total expectation.
We now focus on the conditional probability inside the conditional expectation in (8), for which we have
(9) |
where the last equality follows by letting . Note that .
Then, given , the event inside the conditional probability in (III) can be expressed as
(10) |
where the last equality follows by using the definition of in (6). By introducing a random vector , where , we have an equivalent expression for (III) as . By substituting this inside (III), we obtain
(11) |
where the last equality follows by letting be the multivariate Gaussian Q-function with covariance . We conclude the proof of Theorem 1 by using . ∎
We highlight that (7) holds under minimal assumptions on the distribution of $\mathbf{X}$ (i.e., exchangeability) and hence, it can be used to study the error probability of the data permutation recovery problem in various noise settings, e.g., when the noise has memory or when the noise is isotropic. In the remainder of this paper, we focus on the isotropic noise scenario, i.e., we assume that $\mathbf{K}_Z = \sigma^2 \mathbf{I}_n$.
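Since (7) involves the multivariate Gaussian Q-function, a quick Monte Carlo estimate is a convenient sanity check. The sketch below (illustrative only; the decoder is plain sorting, i.e., $\mathbf{W} = \mathbf{I}_n$ and $\mathbf{b} = \mathbf{0}_n$, and the input distribution is an arbitrary example) estimates the probability of error by simulation:

```python
import numpy as np

def monte_carlo_pe(sample_x, noise_cov, trials=100_000, seed=0):
    """Estimate P_e of the sorting decoder: draw X, add Gaussian noise,
    sort the observation, and check whether the permutation is recovered."""
    rng = np.random.default_rng(seed)
    n = noise_cov.shape[0]
    errors = 0
    for _ in range(trials):
        x = sample_x(rng)
        z = rng.multivariate_normal(np.zeros(n), noise_cov)
        if tuple(np.argsort(x + z)) != tuple(np.argsort(x)):
            errors += 1
    return errors / trials

# Example: i.i.d. standard Gaussian data, isotropic noise with sigma = 0.5.
pe_hat = monte_carlo_pe(lambda rng: rng.standard_normal(3), 0.25 * np.eye(3))
```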
IV Isotropic Noise
We here study the error probability of the data permutation recovery problem when the noise is isotropic, i.e., $\mathbf{K}_Z = \sigma^2 \mathbf{I}_n$. Under this assumption, the regions in (5) depend on the noise covariance only through the parameter $\sigma$ and hence, we let $\mathcal{R}_{\pi}(\sigma)$ denote them. Moreover, when $\mathbf{X}$ is exchangeable, it has been shown in [2] that the optimal (MAP) decision regions are achieved by a linear decoder, i.e., for the isotropic noise setting the optimal decoder is indeed linear and hence, the probability of error in Theorem 1 is the minimum one.
In Section IV-A, we evaluate the probability of error in (7) when $\mathbf{K}_Z = \sigma^2 \mathbf{I}_n$, and then in Section IV-B and Section IV-C we use this expression to derive the rates of convergence of the probability of error in the low noise regime (i.e., $\sigma \to 0$) and in the high noise regime (i.e., $\sigma \to \infty$), respectively.
IV-A Probability of Error
Under the assumption $\mathbf{K}_Z = \sigma^2 \mathbf{I}_n$, we have from [2] that the optimal decision regions are those induced by plain sorting and hence, with reference to (5), we have that $\mathbf{W} = \mathbf{I}_n$ and $\mathbf{b} = \mathbf{0}_n$. Moreover, by substituting these values inside $\mathbf{\Sigma}$ in Theorem 1, we obtain
(12) |
that is, $\mathbf{\Sigma}$ is a tridiagonal Toeplitz matrix.
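The exact entries of (12) are not reproduced here; however, under the difference-matrix convention sketched after (6), $\mathbf{M}_{\pi}\mathbf{M}_{\pi}^{T}$ is, for every $\pi$, the $(n-1)\times(n-1)$ tridiagonal Toeplitz matrix with $2$ on the diagonal and $-1$ on the first off-diagonals (this explicit form is an assumption consistent with, but not verbatim from, the text):

```python
import numpy as np

def tridiag_toeplitz(n):
    """(n-1) x (n-1) tridiagonal Toeplitz matrix: 2 on the diagonal,
    -1 on the first off-diagonals; equals M_pi @ M_pi.T for any pi
    under the difference-matrix convention assumed earlier."""
    return 2.0 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)

# Sanity check against the earlier construction (same result for every pi):
# M = difference_matrix((2, 0, 3, 1)); np.allclose(M @ M.T, tridiag_toeplitz(4))
```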
The probability of error in the isotropic noise scenario is then given by the next corollary.
Corollary 1.
Let $\mathbf{X}$ be an exchangeable random vector and let $\mathbf{K}_Z = \sigma^2 \mathbf{I}_n$. Then, for an arbitrary $\sigma > 0$, the probability of error is given by
(13) |
where is defined in (12) and where is the multivariate Gaussian Q-function with covariance .
Proof:
We note that (14) is a function of $\sigma$ and hence, in what follows we will use $P_e(\sigma)$ to highlight this dependence.
IV-B Low Noise Asymptotic
We here focus on the asymptotic behavior of the probability of error in the low noise regime (i.e., $\sigma \to 0$). In particular, the next result, proved in Appendix A, shows that the probability of error in this regime is approximately linear in $\sigma$.
Theorem 2.
Let consist of i.i.d. random variables generated according to . Let be an independent copy of and assume that
(16) |
Then,
(17) |
where .
Remark 1.
Remark 2.
We now show that, under the assumption that the probability density function of the data is bounded, the asymptotic behavior of the probability of error in the low noise regime for an i.i.d. $\mathbf{X}$ is upper bounded by a quantity that grows quadratically in $n$. In particular, we have the following lemma.
Lemma 1.
Assume that the probability density function of $X_1$ is upper bounded by a constant $B$. Then,
(20) |
Proof:
We conclude this section by providing some examples of (17) for a few distributions.
Example 3. Consider . Then,
(24) |
Note that the upper bound in (24) follows from Lemma 1, where we used the fact that . For the lower bound we use the following inequality [22, Lemma 10.1]:
(25) |
where the last step follows since . Combining the expression in (19) and the bound in (25), we arrive at the following lower bound,
(26) |
which implies the lower bound in (24).
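As a rough numerical companion to Theorem 2, the following sketch (illustrative only; i.i.d. standard Gaussian data and the plain sorting decoder are assumed as an example) estimates the ratio $P_e(\sigma)/\sigma$ for decreasing $\sigma$, which should settle near a constant slope:

```python
import numpy as np

def pe_over_sigma(sigma, n=3, trials=400_000, seed=2):
    """Empirical P_e(sigma)/sigma for i.i.d. N(0,1) data with isotropic noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))
    y = x + sigma * rng.standard_normal((trials, n))
    err = np.any(np.argsort(x, axis=1) != np.argsort(y, axis=1), axis=1)
    return err.mean() / sigma

for sigma in (0.2, 0.1, 0.05, 0.02):
    print(sigma, pe_over_sigma(sigma))
```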
IV-C High Noise Asymptotic
We now focus on the asymptotic behavior of the probability of error in the high noise regime (i.e., $\sigma \to \infty$). It is not difficult to argue that if $\mathbf{X}$ is exchangeable, then we have that
$$\lim_{\sigma \to \infty} P_c(\sigma) = \frac{1}{n!}. \qquad (27)$$
The interpretation is that if $\sigma$ is large, then the output $\mathbf{Y}$ carries no information about $\mathbf{X}$, and the decoder can only rely on the prior knowledge; hence, the best the decoder can do is to guess one of the $n!$ equally likely hypotheses.
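A quick simulation illustrates this behavior; the sketch below (illustrative assumptions: i.i.d. standard Gaussian data, plain sorting decoder, $n = 3$) shows the empirical probability of correctness approaching $1/n! = 1/6$ as $\sigma$ grows:

```python
import numpy as np
from math import factorial

def correct_rate(sigma, n=3, trials=200_000, seed=1):
    """Empirical probability that sorting the noisy observation recovers
    the permutation of an i.i.d. N(0,1) input under isotropic noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))
    y = x + sigma * rng.standard_normal((trials, n))
    hits = np.all(np.argsort(x, axis=1) == np.argsort(y, axis=1), axis=1)
    return hits.mean()

for sigma in (0.1, 1.0, 10.0, 100.0):
    print(sigma, correct_rate(sigma), "target:", 1 / factorial(3))
```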
The next result, proved in Appendix B, sharpens the limit in (27) by finding the rate of convergence.
Theorem 3.
Let be an exchangeable random vector such that . Then,
(28) |
where and
(29) |
where is defined in (1), is the -dimensional ball centered at the origin with unitary radius, and is the -dimensional ellipsoid centered at the origin with unit radii along standard axes except a radius along the -th axis.
Finding a closed-form expression for the terms in (29) does not appear to be an easy task. In the next lemma, we provide upper and lower bounds on these terms, which lead to expressions that are amenable to computation.
Lemma 2.
In the high noise regime, the convergence rate of the probability of correctness can be bounded as
where .
Proof:
We start by observing that
(30) |
that is, the ellipsoid : (i) contains the ball since has minimum radius equal to ; and (ii) is contained inside the ball since has maximum radius equal to .
Thus, from (30) we obtain
(31) |
where the last equality follows since the region of integration is a cone that occupies a fraction $\frac{1}{n!}$ of the space and hence, its intersection with the unit ball has volume equal to the volume of the ball divided by $n!$.
Similarly, from (30) we obtain
(32) |
where in the equality we have used the facts that: (i) , (ii) , and (iii) for any invertible matrix and any set .
We conclude this section by providing some examples of the range for a few common distributions (see Appendix E for the detailed computations). In particular, these examples show that the term dominates in the expression of the rate for several distributions of interest.
Example 1. Consider . Then,
Example 2. Consider . Then,
(34) |
Example 3. Let $X_1$ be $\sigma_0$-sub-Gaussian (a random variable $X$ is $\sigma_0$-sub-Gaussian if $\mathbb{E}\!\left[e^{\lambda (X - \mathbb{E}[X])}\right] \leq e^{\lambda^2 \sigma_0^2 / 2}$ for all $\lambda \in \mathbb{R}$). Then [22],
Appendix A Proof of Theorem 2
Before proceeding with the proof of Theorem 2, we first present two ancillary results.
Lemma 3.
Let consist of i.i.d. random variables generated according to . Let be an independent copy of and assume that
Then, the following holds
where .
Proof:
The proof is provided in Appendix C. ∎
Lemma 4.
Let $\mathbf{V} \sim \mathcal{N}(\mathbf{0}_{n-1}, \mathbf{\Sigma})$, where $\mathbf{\Sigma}$ is defined in (12). Then, for any subset $\mathcal{S} \subseteq [n-1]$,
(35) |
Proof:
The bound in (35) holds if the random vector $\mathbf{V}$ consists of negatively associated random variables [24]. Observe that the Gaussian random vector $\mathbf{V}$ consists of either negatively correlated or independent random variables (see the structure of $\mathbf{\Sigma}$ in (12)). As was shown in [24], this implies that the random variables in $\mathbf{V}$ are negatively associated. This concludes the proof of Lemma 4. ∎
We now leverage the two lemmas above to prove Theorem 2. From Corollary 1 we have that
(36) |
where . The expression in (36) can be equivalently written as
(37) |
where is due to the law of total expectation, and follows from the inclusion-exclusion principle where with .
From the expression in (37) it follows that
(38) |
In what follows, we therefore analyze . We have that
(39) |
where is a -dimensional random vector with entries for .
We next consider two separate cases.
Case 1: . Let ; then, we can write (A) as
where the last equality follows by applying a change of variables. Thus, we have that
(40) |
where the first equality follows from the dominated convergence theorem, which is verifiable since for any , where the fact that is shown in Lemma 3.
Appendix B Proof of Theorem 3
We start by noting that, in view of the limit in (27), we have that . We now consider the following limit,
(45) |
Instead of working with , we parameterize the problem in terms of . Then, (45) can be equivalently expressed as
(46) |
where the last equality can be argued by using the definition of the derivative or L'Hôpital's rule.
From Corollary 1, the probability of correctness is given by
(47) |
where we let and , and we used the exchangeability of $\mathbf{X}$.
Using the expression in (47), the derivative of with respect to is now given by
(48) |
where in we used Leibniz's integral rule, and follows since
(49) |
where the labeled equalities follow from: since each entry of is positive (we ignore the case when , and in fact for an i.i.d. ); the product rule and the fact that , where is the Dirac delta function; the scaling property of the Dirac delta function.
We now consider the integral inside the expectation in (48). By using the sifting property of the Dirac delta function, the integral becomes
(50) |
where is obtained by retaining all the entries of except the -th one. We next substitute (50) inside (48) and we compute the limit in (46). We obtain
(51) |
where the labeled equalities follow from: using the dominated convergence theorem, which is verified since , where is assumed to be absolutely integrable; and using the following,
To finalize the proof, it remains to compute . This can be done as follows,
(52) |
where the labeled equalities follow from: the definition of and writing it in terms of standard normal; the law of total expectation, where we abbreviate ; using the fact that
letting , defining a diagonal matrix with the -th element equal to and the others equal to one, and recalling that from (1) we have ; using the -dimensional volume expression for the probability of a standard normal vector [2]; the fact that for any invertible matrix and any set ; letting be the -dimensional ellipsoid centered at the origin with unit radii along standard axes except a radius along the -th axis.
Appendix C Proof of Lemma 3
We start by noting that the joint density function is given by [20]
where is the cumulative distribution function of and is the probability density function of .
By using the upper bounds of and , we obtain
where the second inequality follows since the integrand is always positive, and the last inequality is due to the assumption that . This shows that the joint density is bounded everywhere.
For the marginal density, we obtain
where the inequality follows from the fact that we have shown above that
This concludes the proof of Lemma 3.
Appendix D Examples for the Low Noise Regime
D-A Uniform Distribution
D-B Exponential Distribution
Appendix E Examples for the High Noise Regime
The key to the proof is to use the following expressions from [23],
First, consider . Then,
and hence, we obtain
Next, let . Then,
and hence, we obtain
References
- [1] C. Dwork, “Differential privacy: A survey of results,” in International conference on theory and applications of models of computation. Springer, 2008, pp. 1–19.
- [2] M. Jeong, A. Dytso, M. Cardone, and H. V. Poor, “Recovering structure of noisy data through hypothesis testing,” in Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), June 2020, pp. 1307–1312.
- [3] ——, “Recovering data permutations from noisy observations: The linear regime,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 854–869, 2020.
- [4] S. R. Searle et al., “Prediction, mixed models, and variance components,” 1973.
- [5] S. Portnoy, “Maximizing the probability of correctly ordering random variables using linear predictors,” Journal of Multivariate Analysis, vol. 12, no. 2, pp. 256–269, 1982.
- [6] K. Nomakuchi and T. Sakata, “Characterizations of the forms of covariance matrix of an elliptically contoured distribution,” Sankhyā: The Indian Journal of Statistics, Series A, pp. 205–210, 1988.
- [7] ——, “Characterization of conditional covariance and unified theory in the problem of ordering random variables,” Annals of the Institute of Statistical Mathematics, vol. 40, no. 1, pp. 93–99, 1988.
- [8] O. Collier and A. S. Dalalyan, “Minimax rates in permutation estimation for feature matching,” The Journal of Machine Learning Research, vol. 17, no. 6, pp. 1–31, January 2016.
- [9] A. Pananjady, M. J. Wainwright, and T. A. Courtade, “Linear regression with shuffled data: Statistical and computational limits of permutation recovery,” IEEE Transactions on Information Theory, vol. 64, no. 5, pp. 3286–3300, May 2018.
- [10] ——, “Denoising linear models with permuted data,” in Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 446–450.
- [11] P. Rigollet and J. Weed, “Uncoupled isotonic regression via minimum Wasserstein deconvolution,” Information and Inference: A Journal of the IMA, vol. 8, no. 4, pp. 691–717, December 2019.
- [12] J. Unnikrishnan, S. Haghighatshoar, and M. Vetterli, “Unlabeled sensing with random linear measurements,” IEEE Transactions on Information Theory, vol. 64, no. 5, pp. 3237–3253, May 2018.
- [13] S. Haghighatshoar and G. Caire, “Signal recovery from unlabeled samples,” IEEE Transactions on Signal Processing, vol. 66, no. 5, pp. 1242–1257, March 2018.
- [14] H. Zhang, M. Slawski, and P. Li, “Permutation recovery from multiple measurement vectors in unlabeled sensing,” in Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), July 2019, pp. 1857–1861.
- [15] I. Dokmanić, “Permutations unlabeled beyond sampling unknown,” IEEE Signal Processing Letters, vol. 26, no. 6, pp. 823–827, April 2019.
- [16] M. Tsakiris and L. Peng, “Homomorphic sensing,” in Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97, June 2019, pp. 6335–6344.
- [17] M. C. Tsakiris, “Eigenspace conditions for homomorphic sensing,” arXiv:1812.07966, April 2019.
- [18] A. Dytso, M. Cardone, M. S. Veedu, and H. V. Poor, “On estimation under noisy order statistics,” in Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), July 2019, pp. 36–40.
- [19] S. M. Kay, Fundamentals of Statistical Signal Processing, vol. 2: Detection Theory. Prentice Hall PTR, 1998.
- [20] R. Pyke, “Spacings,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 27, no. 3, pp. 395–449, 1965. [Online]. Available: http://www.jstor.org/stable/2345793
- [21] J. E. Angus, “The probability integral transform and related results,” SIAM Review, vol. 36, no. 4, pp. 652–654, 1994.
- [22] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
- [23] H. A. David and H. N. Nagaraja, “Order statistics,” Encyclopedia of statistical sciences, 2004.
- [24] K. Joag-Dev and F. Proschan, “Negative association of random variables with applications,” The Annals of Statistics, pp. 286–295, 1983.