
HIIT Basic Research Unit, Laboratory of Computer and Information Science, Helsinki University of Technology, Finland. Email: ntatti@cc.hut.fi

Safe Projections of Binary Data Sets

Nikolaj Tatti
(January, 2006)
Abstract

Selectivity estimation of a boolean query based on frequent itemsets can be solved by formulating the problem as a linear program. However, the number of variables in the program is exponential in the number of attributes, rendering the approach tractable only for low-dimensional cases. One natural remedy is to project the data onto the variables occurring in the query. This can, however, change the outcome of the linear program.

We introduce the concept of safe sets: projecting the data to a safe set does not change the outcome of the linear program. We characterise safe sets using graph theoretic concepts and give an algorithm for finding minimal safe sets containing given attributes. We describe a heuristic algorithm for finding almost-safe sets given a size restriction, and show empirically that these sets outperform the trivial projection.

We also show a connection between safe sets and Markov Random Fields and use it to further reduce the number of variables in the linear program, given some regularity assumptions on the frequent itemsets.

Keywords:
Itemsets · Boolean Query Estimation · Linear Programming
MSC:
68R10 90C05
CR:
G.3
journal: Acta Informatica

1 Introduction

Consider the following problem: given a large, sparse matrix of boolean values and a boolean formula on the columns of the matrix, approximate the probability that the formula is true for a random row of the matrix. A straightforward exact solution is to evaluate the formula on each row. Now consider the same problem when, instead of the original matrix, we are given only a family of frequent itemsets, i.e., sets of columns whose true values co-occur in a large fraction of all rows [1, 2]. An optimal solution is obtained by applying linear programming in the space of probability distributions [11, 19, 3], but since a distribution has exponentially many components, the number of variables in the linear program is also large, and this makes the approach infeasible. However, if the target formula refers to a small subset of the columns, it may be possible to remove most of the other columns without degrading the solution; somewhat surprisingly, it is not safe to remove all columns that do not appear in the formula. In this paper we investigate which columns may be safely removed. Let us clarify this scenario with the following simple example.

Example 1

Assume that we have three attributes, say $a$, $b$, and $c$, and a data set $D$ having five transactions

D = \left\{(1,0,1),\,(0,0,1),\,(0,1,1),\,(1,1,0),\,(1,0,0)\right\}.

Let us consider five itemsets, namely $a$, $b$, $c$, $ab$, and $ac$. The frequency of an itemset is the fraction of transactions in which all the attributes appearing in the itemset occur simultaneously. This gives us the frequencies $\theta_a = \frac{3}{5}$, $\theta_b = \frac{2}{5}$, $\theta_c = \frac{3}{5}$, $\theta_{ab} = \frac{1}{5}$, and $\theta_{ac} = \frac{1}{5}$. Let $\theta = [\theta_a, \theta_b, \theta_c, \theta_{ab}, \theta_{ac}]^T$. Let us now assume that we want to estimate the frequency of the formula $b \land c$. Consider a distribution $p$ defined on these three attributes. We assume that the distribution satisfies the frequencies, that is, $p(a=1) = \theta_a$, $p(a=1, b=1) = \theta_{ab}$, etc. We want to find a distribution minimising/maximising $p(b \land c = 1)$. To convert this problem into a linear program we view $p$ as a real vector having $2^3 = 8$ elements. To guarantee that $p$ is indeed a distribution we must require that $p$ sums to $1$ and that $p \geq 0$. The requirement that $p$ satisfy the frequencies can be expressed in the form $Ap = \theta$ for a certain matrix $A$. In addition, $p(b \land c = 1)$ can be expressed as $c^T p$ for a certain vector $c$. Thus we have transformed the original problem into the linear program

\min c^T p \quad \text{ s.t. } \textstyle\sum p = 1,\; p \geq 0,\; Ap = \theta.

Solving this program (and also the max-version of the program) gives us an interval $I = \left[\frac{1}{5}, \frac{2}{5}\right]$ for possible frequencies of $p(b \land c = 1)$. This interval has the following property: a rational frequency $\eta \in I$ if and only if there is a data set having the frequencies $\theta$ in which $\eta$ is the fraction of the transactions satisfying the formula $b \land c$. If we, however, delete the attribute $a$ from the data set and evaluate the boundaries using only the frequencies $\theta_b$ and $\theta_c$, we obtain a different interval $I' = \left[0, \frac{2}{5}\right]$.
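For concreteness, the linear program of Example 1 can be solved with an off-the-shelf LP solver. The following sketch is our own illustration, not part of the paper; it assumes SciPy's `linprog`, but any LP solver would do.

```python
import numpy as np
from scipy.optimize import linprog

# Enumerate the 2^3 = 8 states of (a, b, c); p is indexed by these states.
states = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def indicator(itemset):
    """Row of A: the value S_U(z) for each state z (1 iff all items of U are 1)."""
    return [1.0 if all(z[i] == 1 for i in itemset) else 0.0 for z in states]

# Constraints: sum(p) = 1 and E_p[S_U] = theta_U for U in {a, b, c, ab, ac}.
A_eq = np.array([[1.0] * 8,          # normalisation
                 indicator([0]),      # theta_a
                 indicator([1]),      # theta_b
                 indicator([2]),      # theta_c
                 indicator([0, 1]),   # theta_ab
                 indicator([0, 2])])  # theta_ac
b_eq = np.array([1.0, 3/5, 2/5, 3/5, 1/5, 1/5])

# Objective: p(b AND c = 1); linprog minimises, so negate it for the maximum.
# Default variable bounds (0, None) enforce p >= 0.
obj = np.array(indicator([1, 2]))
lo = linprog(obj, A_eq=A_eq, b_eq=b_eq).fun
hi = -linprog(-obj, A_eq=A_eq, b_eq=b_eq).fun
print(lo, hi)  # the interval I = [1/5, 2/5]
```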

The problem is motivated by data mining, where fast methods for computing frequent itemsets are a recurring research theme [10]. A potential new application for the problem is privacy-preserving data mining, where the data is not made available except indirectly, through some statistics. The idea of using itemsets as a surrogate for data stems from [16], where inclusion-exclusion is used to approximate boolean queries. Another approach is to assume a model for the data, such as maximum entropy [21]. The linear programming approach requires no model assumptions.

The boolean query scenario can be seen as a special case of the following minimisation problem: let $K$ be the number of attributes. Given a family $\mathcal{F}$ of itemsets, frequencies $\theta$ for $\mathcal{F}$, and some function $f$ that maps any distribution defined on the set $\{0,1\}^K$ to a real number, find a distribution satisfying the frequencies $\theta$ and minimising $f$. To reduce the dimension $K$ we assume that $f$ depends only on a small subset, say $B$, of items, that is, if $p$ is a distribution defined on $\{0,1\}^K$ and $p_B$ is $p$ marginalised to $B$, then we can write $f(p) = f(p_B)$. The projection is done by removing from $\mathcal{F}$ all the itemsets that have attributes outside $B$.

The question is, then, how the projection to $B$ alters the solution of the minimisation problem. Clearly, the solution remains the same if we can always extend a distribution defined on $B$ satisfying the projected family of itemsets to a distribution defined on all items and satisfying all itemsets in $\mathcal{F}$. We describe necessary and sufficient conditions for this extension property, in terms of a certain graph extracted from the family $\mathcal{F}$. We call the set $B$ safe if it satisfies the extension property.

If the set $B$ is not safe, then we can find a safe set $C$ containing $B$. We describe an efficient polynomial-time algorithm for finding the safe set $C$ containing $B$ that has the minimal number of items, and we show that this set is unique. We also provide a heuristic algorithm for finding a restricted set $C$ having at most $M$ elements. This set is not necessarily safe and the solution to the minimisation problem may change; however, we believe it is the best solution obtainable using only $M$ elements.

The rest of the paper is organised as follows: some preliminaries are described in Section 2. The concept of a safe set is presented in Section 3 and the construction algorithm is given in Section 4. In Section 5 we explain the boolean query scenario in more detail. In Section 6 we study the connection between safe sets and MRFs. Section 7 is devoted to restricted safe sets. We present empirical tests in Section 8 and conclude the paper in Section 9. Proofs of the theorems are given in the Appendix.

2 Preliminaries

We begin by giving some basic definitions. A 0–1 database is a pair $\langle D, A \rangle$, where $A$ is a set of items $\{a_1, \ldots, a_K\}$ and $D$ is a data set, that is, a multiset of subsets of $A$.

A subset $U \subseteq A$ of items is called an itemset. We define an itemset indicator function $S_U : \{0,1\}^K \to \{0,1\}$ such that

S_U(z) = \begin{cases} 1, & z_i = 1 \text{ for all } a_i \in U \\ 0, & \text{otherwise.} \end{cases}

Throughout the paper we will use the following notation: we denote a random binary vector of length $K$ by $X = X_A$. Given an itemset $U$ we define $X_U$ to be the binary vector of length $|U|$ obtained from $X$ by taking only the elements corresponding to $U$.

The frequency of the itemset $U$ taken with respect to $D$, denoted by $U(D)$, is the mean of $S_U$ taken with respect to $D$, that is, $U(D) = \frac{1}{|D|} \sum_{z \in D} S_U(z)$. For more information on itemsets, see e.g. [1].
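As a quick illustration of this definition, here is a minimal sketch of our own, with transactions represented as item sets as in Example 1:

```python
from fractions import Fraction

def frequency(U, D):
    """Frequency of itemset U in data set D: the fraction of
    transactions containing every item of U."""
    return Fraction(sum(1 for t in D if U <= t), len(D))

# The data set of Example 1, with transactions written as item sets:
# (1,0,1) -> {a,c}, (0,0,1) -> {c}, (0,1,1) -> {b,c},
# (1,1,0) -> {a,b}, (1,0,0) -> {a}
D = [{'a', 'c'}, {'c'}, {'b', 'c'}, {'a', 'b'}, {'a'}]

print(frequency({'a'}, D))       # 3/5
print(frequency({'a', 'b'}, D))  # 1/5
```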

An antimonotonic family $\mathcal{F}$ of itemsets is a collection of itemsets such that for each $U \in \mathcal{F}$ each subset of $U$ also belongs to $\mathcal{F}$. We define straightforwardly the itemset indicator function $S_{\mathcal{F}} = \{S_U \mid U \in \mathcal{F}\}$ and the frequency $\mathcal{F}(D) = \{U(D) \mid U \in \mathcal{F}\}$ for families of itemsets.

If we assume that $\mathcal{F}$ is an ordered family, then we can treat $S_{\mathcal{F}}$ as an ordinary function $S_{\mathcal{F}} : \{0,1\}^K \to \{0,1\}^L$, where $L$ is the number of elements in $\mathcal{F}$. It also makes sense to consider the frequencies $\mathcal{F}(D)$ as a vector (rather than a set); we will often use $\theta$ to denote this vector. We say that a distribution $p$ defined on $\{0,1\}^K$ satisfies the frequencies $\theta$ if $\operatorname{E}_p[S_{\mathcal{F}}] = \theta$.

Given a set of items $C$, we define a projection operator in the following way: a data set $D_C$ is obtained from $D$ by deleting the attributes outside $C$. A projected family of itemsets $\mathcal{F}_C = \{U \in \mathcal{F} \mid U \subseteq C\}$ is obtained from $\mathcal{F}$ by deleting the itemsets that have attributes outside $C$. The projected frequency vector $\theta_C$ is defined similarly. In addition, if we are given a distribution $p$ defined on $\{0,1\}^K$, we define a distribution $p_C$ to be the marginalisation of $p$ to $C$. Given a distribution $q$ over $C$ we say that $p$ is an extension of $q$ if $p_C = q$.
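The projection operators can be sketched in a few lines (our illustration; transactions and itemsets are represented as Python sets):

```python
def project_dataset(D, C):
    """D_C: delete the attributes outside C from every transaction."""
    return [t & C for t in D]

def project_family(F, C):
    """F_C: keep only the itemsets contained entirely in C."""
    return {U for U in F if U <= C}

# The five itemsets of Example 1 and the projection to C = {b, c}:
F = {frozenset('a'), frozenset('b'), frozenset('c'),
     frozenset('ab'), frozenset('ac')}
C = {'b', 'c'}
print(sorted(''.join(sorted(U)) for U in project_family(F, C)))  # ['b', 'c']
```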

3 Safe Projection

In this section we define a safe set and describe how such sets can be characterised using certain graphs.

We assume that we are given a set of items $A = \{a_1, \ldots, a_K\}$, an antimonotonic family $\mathcal{F}$ of itemsets, and a frequency vector $\theta$ for $\mathcal{F}$. We define $\mathbb{P}$ to be the set of all probability distributions defined on the set $\{0,1\}^K$. We assume that we are given a function $f : \mathbb{P} \to \mathbb{R}$ mapping a distribution to a real number. Let us consider the following problem:

\begin{array}{ll} \textsc{Problem P:} & \\ \text{Minimise} & f(p) \\ \text{subject to} & p \in \mathbb{P} \\ & \operatorname{E}_p[S_{\mathcal{F}}] = \theta. \end{array} \quad (1)

That is, we are looking for the minimum value of $f$ among the distributions satisfying the frequencies $\theta$. Generally speaking, this is a very difficult problem: each distribution in $\mathbb{P}$ has $2^K$ entries and for large $K$ even the evaluation of $f(p)$ may become infeasible. This forces us to make some assumptions on $f$. We assume that there is a relatively small set $C$ such that $f$ does not depend on the attributes outside $C$. In other words, we can define $f$ by a function $f_C$ such that $f_C(p_C) = f(p)$ for all $p$. Similarly, we define $\mathbb{P}_C$ to be the set of all distributions defined on the set $\{0,1\}^{|C|}$. We will now consider the following projected problem:

\begin{array}{ll} \textsc{Problem P}_C\text{:} & \\ \text{Minimise} & f_C(q) \\ \text{subject to} & q \in \mathbb{P}_C \\ & \operatorname{E}_q[S_{\mathcal{F}_C}] = \theta_C. \end{array}

Let us denote the minimising distribution of Problem P by $\hat{p}$ and the minimising distribution of Problem P$_C$ by $\hat{q}$. It is easy to see that $f(\hat{p}) \geq f_C(\hat{q})$, since every feasible $p$ yields a feasible $q = p_C$ with $f_C(q) = f(p)$. In order to guarantee that $f(\hat{p}) = f_C(\hat{q})$, we need $C$ to be safe, as defined below.

Definition 1

Given an antimonotonic family $\mathcal{F}$ and frequencies $\theta$ for $\mathcal{F}$, a set $C$ is $\theta$-safe if for any distribution $q \in \mathbb{P}_C$ satisfying the frequencies $\theta_C$, there exists an extension $p \in \mathbb{P}$ satisfying the frequencies $\theta$. If $C$ is $\theta$-safe for all $\theta$, we say that it is safe.

Example 2

Let us continue Example 1. We saw that the outcome of the linear program changes if we delete the attribute $a$. Let us now show that the set $C = \{b, c\}$ is not a safe set. Let $q$ be a distribution defined on the set $C$ such that $q(b=0, c=0) = 0$, $q(b=1, c=0) = \frac{2}{5}$, $q(b=0, c=1) = \frac{3}{5}$, and $q(b=1, c=1) = 0$. Obviously, this distribution satisfies the frequencies $\theta_b$ and $\theta_c$. However, we cannot extend this distribution to $a$ such that all the frequencies are satisfied. Thus, $C$ is not a safe set.
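The non-extendability of $q$ can be verified mechanically as an LP feasibility check: we ask for a distribution $p$ on $\{0,1\}^3$ that both marginalises to $q$ and satisfies $\theta$, and the solver reports infeasibility. A sketch of ours, assuming SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

states = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def row(pred):
    return [1.0 if pred(z) else 0.0 for z in states]

rows, rhs = [], []
# p must marginalise to q on {b, c}:
for vb, vc, qv in [(0, 0, 0.0), (1, 0, 0.4), (0, 1, 0.6), (1, 1, 0.0)]:
    rows.append(row(lambda z, vb=vb, vc=vc: z[1] == vb and z[2] == vc))
    rhs.append(qv)
# ...and satisfy the remaining frequencies theta_a, theta_ab, theta_ac:
for items, th in [((0,), 0.6), ((0, 1), 0.2), ((0, 2), 0.2)]:
    rows.append(row(lambda z, items=items: all(z[i] == 1 for i in items)))
    rhs.append(th)

# Zero objective: we only ask whether the constraints are feasible.
res = linprog(np.zeros(8), A_eq=np.array(rows), b_eq=np.array(rhs))
print(res.success)  # False: q has no extension satisfying theta
```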

We will now describe a sufficient condition for safeness. We define a dependency graph $G$ such that the vertices of $G$ are the items, $V(G) = A$, and the edges correspond to the itemsets in $\mathcal{F}$ having two items, $E(G) = \{\{a_i, a_j\} \mid a_i a_j \in \mathcal{F}\}$. The edges are undirected. Assume that we are given a subset $C$ of items and select $x \notin C$. A path $P = (a_{i_1}, \ldots, a_{i_L})$ from $x$ to $C$ is a graph path such that $x = a_{i_1}$ and only $a_{i_L} \in C$. We define the frontier of $x$ with respect to $C$ to be the set of the last items of all paths from $x$ to $C$:

\operatorname{front}(x, C) = \left\{a_{i_L} \mid P = (a_{i_1}, \ldots, a_{i_L}) \text{ is a path from } x \text{ to } C\right\}.

Note that $\operatorname{front}(x, C) = \operatorname{front}(y, C)$ if $x$ and $y$ are connected by a path not going through $C$. The following theorem gives a sufficient condition for safeness.

Theorem 3.1

Let $\mathcal{F}$ be an antimonotonic family of itemsets. Let $C \subseteq A$ be a set of items such that for each $x \notin C$ the frontier of $x$ is in $\mathcal{F}$, that is, $\operatorname{front}(x, C) \in \mathcal{F}$. It follows that $C$ is a safe set.

The vague intuition behind Theorem 3.1 is the following: $x$ has influence on $C$ only through $\operatorname{front}(x, C)$. If $\operatorname{front}(x, C) \in \mathcal{F}$, then the distributions marginalised to $\operatorname{front}(x, C)$ are fixed by the frequencies. This means that $x$ has no influence on $C$ and hence it can be removed.

We saw in Examples 1 and 2 that the projection changes the outcome if the projection set is not safe. This holds also in the general case:

Theorem 3.2

Let $\mathcal{F}$ be an antimonotonic family of itemsets. Let $C \subseteq A$ be a set of items such that there exists $x \notin C$ whose frontier is not in $\mathcal{F}$, that is, $\operatorname{front}(x, C) \notin \mathcal{F}$. Then there are frequencies $\theta$ for $\mathcal{F}$ such that $C$ is not $\theta$-safe.

Safeness implies that we can extend every satisfying distribution $q$ in Problem P$_C$ to a satisfying distribution $p$ in Problem P. This implies that the optimal values of the two problems are equal:

Theorem 3.3

Let $\mathcal{F}$ be an antimonotonic family of itemsets. If $C$ is a safe set, then the minimum value of Problem P is equal to the minimum value of Problem P$_C$ for any query function and for any frequencies $\theta$ for $\mathcal{F}$.

If the safeness condition does not hold, that is, there is a distribution $q$ that cannot be extended, then we can define a query $f$ that returns $0$ if the input distribution is $q$, and $1$ otherwise. This construction proves the following theorem:

Theorem 3.4

Let $\mathcal{F}$ be an antimonotonic family of itemsets. If $C$ is not a safe set, then there is a function $f$ and frequencies $\theta$ for $\mathcal{F}$ such that the minimum value of Problem P is strictly larger than the minimum value of Problem P$_C$.

Example 3

Assume that we have $6$ attributes, namely $\{a, b, c, d, e, f\}$, and an antimonotonic family $\mathcal{F}$ whose maximal itemsets are $ab$, $bc$, $cd$, $ad$, $de$, $ce$, and $af$. The dependency graph is given in Fig. 1.

Figure 1: An example of a dependency graph.

Let $C_1 = \{a, b, c\}$. This set is not a safe set since $\operatorname{front}(d, C_1) = ac \notin \mathcal{F}$. On the other hand, the set $C_2 = \{a, b, c, d\}$ is safe since $\operatorname{front}(f, C_2) = a \in \mathcal{F}$ and $\operatorname{front}(e, C_2) = cd \in \mathcal{F}$.

The proof of Theorem 3.1 also reveals an interesting fact:

Theorem 3.5

Let $\mathcal{F}$ be an antimonotonic family of itemsets and let $\theta$ be frequencies for $\mathcal{F}$. Let $C$ be a safe set. Let $p^{ME}$ be the maximum entropy distribution defined on $A$ and satisfying $\theta$. Let $q^{ME}$ be the maximum entropy distribution defined on $C$ and satisfying the projected frequencies $\theta_C$. Then $q^{ME}$ is $p^{ME}$ marginalised to $C$.

The theorem tells us that if we want to obtain the maximum entropy distribution marginalised to $C$ and the set $C$ is safe, then we can remove the items outside $C$. This is useful since finding the maximum entropy distribution using the Iterative Fitting Procedure requires an exponential amount of time [7, 12]. Using the maximum entropy distribution for estimating the frequencies of itemsets has been shown to be an effective method in practice [21]. In addition, if we estimate the frequencies of several boolean formulae using the maximum entropy distribution marginalised to safe sets, then the frequencies are consistent. By this we mean that the frequencies are all evaluated from the same distribution, namely $p^{ME}$.
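For small $K$ the Iterative Fitting Procedure can be sketched directly on the full joint table, which makes the exponential cost explicit. The family and frequencies below are illustrative choices of ours, not taken from the paper's examples:

```python
from itertools import product

items = ['a', 'b', 'c']
states = list(product((0, 1), repeat=len(items)))

# Illustrative family F = {a, b, c, ab, bc}, itemsets given as index tuples.
constraints = {(0,): 0.5, (1,): 0.5, (2,): 0.5, (0, 1): 0.3, (1, 2): 0.3}

# Iterative proportional fitting: start from the uniform distribution and
# repeatedly rescale so that E_p[S_U] = theta_U for each itemset U.  The
# fixed point is the maximum entropy distribution satisfying theta.
p = {z: 1.0 / len(states) for z in states}
for _ in range(1000):
    for U, theta in constraints.items():
        cur = sum(p[z] for z in states if all(z[i] == 1 for i in U))
        for z in states:
            hit = all(z[i] == 1 for i in U)
            p[z] *= theta / cur if hit else (1 - theta) / (1 - cur)

# The ME estimate for the itemset ac, which lies outside the family:
freq_ac = sum(p[z] for z in states if z[0] == 1 and z[2] == 1)
print(round(freq_ac, 4))
```

Since the fitted family factorises over the cliques $\{a,b\}$ and $\{b,c\}$, the ME estimate for $ac$ is $\sum_b p(a{=}1 \mid b)\, p(c{=}1 \mid b)\, p(b) = 0.26$ here.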

4 Constructing a Safe Set

Assume that we are given a function $f$ that depends only on a set $B$, not necessarily safe. In this section we consider the problem of finding a safe set $C$ such that $B \subseteq C$ for a given $B$. Since there are usually several safe sets that include $B$ (for example, the set of all attributes $A$ is always a safe set), we want to find a safe set having the minimal number of attributes. We describe an algorithm for finding such a safe set and show that this particular safe set is unique.

The idea behind the algorithm is to augment $B$ until the safeness condition is satisfied. However, the order in which we add the items into $B$ matters, so we need to order the items. To do this we define a few concepts: the neighbourhood $N(x \mid r)$ of an item $x$ of radius $r$ is the set of the items reachable from $x$ by a graph path of length at most $r$, that is,

N(x \mid r) = \left\{y \mid \exists P : x \to y,\ |P| \leq r\right\}. \quad (2)

In addition, we define a restricted neighbourhood $N_C(x \mid r)$, which is similar to $N(x \mid r)$ except that now we require that only the last element of the path $P$ in Eq. 2 may belong to $C$. Note that $N_C(x \mid r) \cap C \subseteq \operatorname{front}(x, C)$ and that equality holds for sufficiently large $r$.

The rank of an item $x$ with respect to $C$, denoted by $\operatorname{rank}(x \mid C)$, is a vector $v$ of length $|A| - 1$ such that $v_i$ is the number of elements in $C$ to which the shortest path from $x$ has length $i$, that is,

v_i = \left|C \cap \left(N_C(x \mid i) - N_C(x \mid i-1)\right)\right|.

We can compare ranks using the lexicographic order. In other words, if we let $v = \operatorname{rank}(x \mid C)$ and $w = \operatorname{rank}(y \mid C)$, then $\operatorname{rank}(x \mid C) < \operatorname{rank}(y \mid C)$ if and only if there is an integer $M$ such that $v_M < w_M$ and $v_i = w_i$ for all $i = 1, \ldots, M-1$.

We are now ready to describe our search algorithm. The idea is to search for the items that violate the condition of Theorem 3.1. If there are several candidates, then the items having the maximal rank are selected. For efficiency reasons, we do not look for violations by calculating $\operatorname{front}(x, C)$. Instead, we check whether $N_C(x \mid r) \cap C \in \mathcal{F}$. This is sufficient because

N_C(x \mid r) \cap C \notin \mathcal{F} \implies \operatorname{front}(x, C) \notin \mathcal{F}.

This is true because $N_C(x \mid r) \cap C \subseteq \operatorname{front}(x, C)$ and $\mathcal{F}$ is antimonotonic. The process is described in full detail in Algorithm 1.

Algorithm 1 The algorithm for finding a safe set $C$. The required input is $B$, the set that should be contained in $C$, and an antimonotonic family $\mathcal{F}$ of itemsets. The graph $G$ is the dependency graph evaluated from $\mathcal{F}$.
$C \Leftarrow B$.
repeat
  $r \Leftarrow 1$.
  $V \Leftarrow \{x \mid \exists y \in C, xy \in E(G)\} - C$ {$V$ contains the neighbours of $C$.}
  repeat
   For each $x \in V$, $U_x \Leftarrow N_C(x \mid r) \cap C$.
   if there exists $U_x$ such that $U_x \notin \mathcal{F}$ then
    Break {A violation is found.}
   end if
   $r \Leftarrow r + 1$.
  until no $U_x$ changed
  if there is a violation then
   $W \Leftarrow \{x \in V \mid U_x \notin \mathcal{F}\}$ {$W$ contains the violating items.}
   $v \Leftarrow \max\{\operatorname{rank}(x \mid C) \mid x \in W\}$.
   $Z \Leftarrow \{x \in W \mid \operatorname{rank}(x \mid C) = v\}$
   $C \Leftarrow C \cup Z$ {Augment $C$ with the violating items having the largest rank.}
  end if
until there are no violations.
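A simplified Python sketch of this construction (our illustration: it computes $\operatorname{front}(x, C)$ directly by breadth-first search instead of growing the radius $r$, which is equivalent but less efficient) recovers the safe set of Example 3:

```python
from itertools import combinations
from collections import deque

def closure(maximal):
    """All non-empty subsets of the maximal itemsets (antimonotonic family)."""
    fam = set()
    for m in maximal:
        for r in range(1, len(m) + 1):
            fam.update(frozenset(s) for s in combinations(sorted(m), r))
    return fam

def dependency_graph(fam):
    """Adjacency lists built from the 2-itemsets of the family."""
    adj = {}
    for s in fam:
        if len(s) == 2:
            x, y = tuple(s)
            adj.setdefault(x, set()).add(y)
            adj.setdefault(y, set()).add(x)
    return adj

def restricted_dists(x, C, adj):
    """BFS from x along paths whose interior avoids C; for each reachable
    c in C, record the shortest such path length (keys form front(x, C))."""
    dist, seen, queue = {}, {x}, deque([(x, 0)])
    while queue:
        v, d = queue.popleft()
        for w in adj.get(v, ()):
            if w in seen:
                continue
            seen.add(w)
            if w in C:
                dist[w] = d + 1        # the path must stop on entering C
            else:
                queue.append((w, d + 1))
    return dist

def rank(x, C, adj, n):
    """rank(x | C): v[i-1] counts the C-items at restricted distance i."""
    v = [0] * (n - 1)
    for d in restricted_dists(x, C, adj).values():
        v[d - 1] += 1
    return v

def safe_set(B, fam, A):
    """Grow B by repeatedly adding the violating items of maximal rank
    until every frontier belongs to the family."""
    adj = dependency_graph(fam)
    C = set(B)
    while True:
        fronts = {x: frozenset(restricted_dists(x, C, adj)) for x in A - C}
        W = [x for x, fr in fronts.items() if fr and fr not in fam]
        if not W:
            return C
        v = max(rank(x, C, adj, len(A)) for x in W)  # lists compare lexicographically
        C |= {x for x in W if rank(x, C, adj, len(A)) == v}

# Example 3: maximal itemsets ab, bc, cd, ad, de, ce, af.
fam = closure([set('ab'), set('bc'), set('cd'), set('ad'),
               set('de'), set('ce'), set('af')])
print(sorted(safe_set({'a', 'b', 'c'}, fam, set('abcdef'))))  # ['a', 'b', 'c', 'd']
```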

We will refer to the safe set Algorithm 1 produces as $\operatorname{safe}(B \mid \mathcal{F})$. We will now show that $\operatorname{safe}(B \mid \mathcal{F})$ is the smallest possible, that is,

\left|\operatorname{safe}(B \mid \mathcal{F})\right| = \min\left\{|Y| \mid B \subseteq Y,\ Y \text{ is a safe set}\right\}.

The following theorem shows that in Algorithm 1 we add only necessary items into $C$ during each iteration.

Theorem 4.1

Let $C$ be the set of items during some iteration of Algorithm 1 and let $Z = \{x \in W \mid \operatorname{rank}(x \mid C) = v\}$ be the set of items as it is defined in Algorithm 1. Let $Y$ be any safe set containing $C$. Then it follows that $Z \subseteq Y$.

Corollary 1

The safe set containing $B$ with the minimal number of items is unique. Moreover, this set is contained in every safe set containing $B$.

Corollary 2

Algorithm 1 produces the optimal safe set.

Example 4

Let us continue Example 3. Assume that our initial set is $B = \{a, b, c\}$. We note that $\operatorname{front}(d, B) = \operatorname{front}(e, B) = ac \notin \mathcal{F}$. Therefore, $B$ is not a safe set. The ranks are $\operatorname{rank}(d \mid B) = [2]^T$ and $\operatorname{rank}(e \mid B) = [1, 1]^T$ (the trailing zeros are removed). It follows that the rank of $d$ is larger than the rank of $e$, and therefore $d$ is added into $B$ during Algorithm 1. The resulting set $C = \{a, b, c, d\}$ is the minimal safe set containing $B$.

5 Frequencies of Boolean Formulae

A boolean formula $f : \{0,1\}^K \to \{0,1\}$ maps a binary vector to a binary value. Given a family $\mathcal{F}$ of itemsets and frequencies $\theta$ for $\mathcal{F}$ we define a frequency interval, denoted by $\operatorname{fi}(f \mid \mathcal{F}, \theta)$, to be

\operatorname{fi}(f \mid \mathcal{F}, \theta) = \left\{\operatorname{E}_p[f] \mid \operatorname{E}_p[S_{\mathcal{F}}] = \theta\right\},

that is, the set of possible frequencies arising from the distributions satisfying the given frequencies. For example, if the formula $f$ is of the form $a_1 \land \ldots \land a_M$, then we are approximating the frequency of a possibly unknown itemset.

Note that this set is truly an interval and its boundaries can be found using the optimisation problem given in Eq. 1. It has been shown that finding the boundaries can be reduced to linear programming [11, 19, 3]. However, the problem is exponential in $K$ and therefore it is crucial to reduce the dimension. Let us assume that the boolean formula depends only on the variables of some set, say $B$. We can then use Algorithm 1 to find a safe set $C$ including $B$ and thus reduce the dimension.

Example 5

Let us continue Example 3. We assign the following frequencies to the itemsets: $\theta_x = 0.5$ for $x \in \{a, b, c, d, e, f\}$, $\theta_{bd} = 0.5$, $\theta_{cd} = 0.4$, and the frequencies of the remaining itemsets in $\mathcal{F}$ are equal to $0.25$. We consider the formula $f = b \land c$. In this case $f$ depends only on $B = \{b, c\}$. If we project directly to $B$, then the frequency interval is $\operatorname{fi}(f \mid \mathcal{F}_B, \theta_B) = [0, 0.5]$.

The minimal safe set containing $B$ is $C = \{a, b, c, d\}$. Since $\theta_{bd} = 0.5$ it follows that $b$ is equivalent to $d$. This implies that the frequency of $f$ must be equal to $\operatorname{fi}(f \mid \mathcal{F}_C, \theta_C) = \theta_{cd} = 0.4$.

There exist many problems similar to ours. A well-studied one is PSAT, in which we are given a CNF formula and probabilities for each clause, and we ask whether there is a distribution satisfying these probabilities. This problem is NP-complete [9]. A reduction technique for the minimisation problem in which the constraints and the query are allowed to be conditional is given in [14]. However, this technique does not work in our case since we are working only with unconditional queries. A more general problem in which first-order logic conditional sentences are allowed as constraints/queries is studied in [15] and shown to be NP-complete. Though these problems are of more general form, they can be emulated with itemsets [4]. However, we should note that in the general case this construction does not result in an antimonotonic family.

There are many alternative ways of approximating boolean queries based on statistics: for example, the use of wavelets has been investigated in [17]. Query estimation using histograms was studied in [18] (though this approach does not work for binary data). We can also consider assigning some probability model to the data, such as a Chow–Liu tree model or a mixture model (see e.g. [22, 21, 6]). Finally, if $B$ is an itemset, $B$ is safe, and we know all the proper subsets of $B$, then to estimate the frequency of $B$ we can use the inclusion-exclusion formulae given in [5].

6 Safe Sets and Junction Trees

Theorem 3.1 suggests that there is a connection between safe sets and Markov Random Fields (see e.g. [13] for more information on MRFs). In this section we will describe how the minimal safe sets can be obtained from junction trees. We will demonstrate through a counter-example that this connection cannot be used directly. We will also show that we can use junction trees to reformulate the optimisation problem and possibly reduce the computational burden.

6.1 Safe Sets and Separators

Let us assume that the dependency graph $G$ obtained from a family $\mathcal{F}$ of itemsets is triangulated, that is, the graph does not contain chordless cycles of size $4$ or larger. In this case we say that $\mathcal{F}$ is triangulated. For simplicity, we assume that the dependency graph is connected. We need some concepts from Markov Random Field theory (see e.g. [13]): the clique graph is a graph having the cliques of $G$ as vertices, and two vertices are connected if the corresponding cliques share a mutual item. Note that this graph is connected. A spanning tree of the clique graph is called a junction tree if it has the running intersection property: if two cliques contain the same item, then each clique along the path between them in the junction tree also contains that item. An edge between two cliques is called a separator, and we associate with each separator the set of items mutual to both cliques.

We also make a further assumption concerning the family $\mathcal{F}$: let $V$ be the set of items of some clique of the dependency graph. We assume that every proper subset of $V$ is in $\mathcal{F}$. If $\mathcal{F}$ satisfies this property for each clique, then we say that $\mathcal{F}$ is clique-safe. We do not need to have $V \in \mathcal{F}$ because there is no node having an entire clique as a frontier.

Let us now investigate how safe sets and junction trees are connected. First, fix some junction tree, say $T$, obtained from $G$. Assume that we are given a set $B$ of items, not necessarily safe. For each item $b \in B$ we select some clique $Q_b \in V(T)$ such that $b \in Q_b$ (the same clique can be associated with several items). Let $b, c \in B$ and consider the path in $T$ from $Q_b$ to $Q_c$. We call the separators along such paths inner separators; the other separators are called outer separators. We always choose the cliques $Q_b$ such that the number of inner separators is the smallest possible. This does not necessarily make the choice of the cliques unique, but the set of inner separators is always unique. We also define an inner clique to be a clique incident to some inner separator. We refer to the other cliques as outer cliques.

Example 6

Let us assume that we have $5$ items, namely $\{a, b, c, d, e\}$. The dependency graph, its clique graph, and the possible junction trees are given in Figure 2.

Figure 2: An example of a dependency graph, a corresponding clique graph, and the possible junction trees.

Let $B = \{a, d\}$. Then the inner separator in the upper junction tree is the left edge. In the lower junction tree both edges are inner separators.

The following three theorems describe the relation between the safe sets containing BB and the inner separators.

Theorem 6.1

Let $\mathcal{F}$ be an antimonotonic, triangulated and clique-safe family of itemsets. Let $T$ be a junction tree. Let $C$ be the set containing $B$ and all the items from the inner separators of $B$. Then $C$ is a safe set.

The following corollary follows from Corollary 1.

Corollary 3

Let $\mathcal{F}$ be an antimonotonic, triangulated and clique-safe family of itemsets. Let $T$ be a junction tree. The minimal safe set containing $B$ may contain (in addition to the set $B$) only items from the inner separators of $B$.

Theorem 6.2

Let $\mathcal{F}$ be an antimonotonic, triangulated and clique-safe family of itemsets. There exists a junction tree such that the minimal safe set is precisely the set $B$ and the items from the inner separators of $B$.

Theorem 6.2 raises the following question: is there a tree, not depending on $B$, such that the minimal safe set is precisely the set $B$ and the items from its inner separators? Unfortunately, this is not the case, as the following example shows.

Example 7

Let us continue Example 6. Let $B_1=\{a,d\}$ and $B_2=\{a,e\}$. The corresponding minimal safe sets are $C_1=\{a,b,d\}$ and $C_2=\{a,b,e\}$. The first case corresponds to the upper junction tree given in Figure 2, and the latter corresponds to the lower junction tree.

6.2 Reformulation of the Optimisation Problem Using Junction Trees

We have seen that an optimisation problem can be reduced to a problem having $2^{|C|}$ variables, where $C$ is a safe set. However, it may be the case that $C$ is very large. For example, imagine that the dependency graph is a single path $(a_{i_1},\ldots,a_{i_L})$ and we are interested in finding the frequency of $a_{i_1}\land a_{i_L}$. Then the safe set contains the entire path. In this section we will try to reduce the computational burden even further.

The main benefit of an MRF is that we are able to represent the distribution as a fraction of certain marginal distributions. We can use this factorisation to encode the constraints. A small drawback is that we may not be able to express easily the distribution defined on $B$, the set on which the query depends. This happens when $B$ is not contained in any clique. This can be remedied by adding edges to the dependency graph.

Let us make the previous discussion more rigorous. Let $f$ be a query function and let $B$ be the set of attributes on which $f$ depends. Let $C=\operatorname{safe}(B\mid\mathcal{F})$ be the minimal safe set containing $B$. Project out the items outside $C$ and let $G$ be the connectivity graph obtained from $\mathcal{F}_C$. We add some additional edges to $G$: first, we make the set $B$ fully connected; second, we triangulate the graph. Let $T$ be a junction tree of the resulting graph.

Since $B$ is fully connected, there is a clique $Q_r$ such that $B\subseteq Q_r$. For each clique $Q_i$ in $T$ we define $p_i$ to be a distribution defined on $Q_i$. Similarly, for each separator $S_j$ we define $q_j$ to be a distribution defined on $S_j$. Denote by $\mathbb{S}_i$ the collection of separators of a clique $Q_i$.

\begin{array}{ll}\textsc{Problem LP:}\\ \text{Minimise}&f(p_r)\\ \text{subject to}&\text{for each }Q_i\in V(T),\\ &\quad p_i\text{ satisfies }\theta_{Q_i}\text{ and}\\ &\quad p_i\text{ is an extension of }q_j\text{ for each }S_j\in\mathbb{S}_i.\end{array}\qquad(3)

The following theorem states that the above formulation is correct:

Theorem 6.3

The problem in Eq. 3 correctly solves the optimisation problem.

Note that we can remove all $q_j$ by combining the constraining equations. Thus we have replaced the original optimisation problem having $2^{|C|}$ variables with a problem having $\sum_i 2^{|Q_i|}$ variables. The number of cliques in $T$ is bounded by $|C|$, the number of attributes in the safe set. To see this, select any leaf clique $Q_i$. This clique must contain a variable that is not contained in any other clique, because otherwise $Q_i$ would be contained in its parent clique. We remove $Q_i$ and repeat this procedure. Since there are only $|C|$ attributes, there can be only $|C|$ cliques. Let $M$ be the size of the maximal clique. Then the number of variables is bounded by $|C|\,2^M$. If $M$ is small, then solving the problem is much easier than solving the original formulation.

Example 8

Assume that we have a family of itemsets whose dependency graph $G$ is a path $(a_{i_1},\ldots,a_{i_L})$ and that we want to evaluate the bounds for the formula $a_{i_1}\land a_{i_L}$. We cannot neglect any variable inside the path; hence we have a linear program having $2^L$ variables.

By adding the edge $\{a_{i_1},a_{i_L}\}$ to $G$ we obtain a cycle. To triangulate the graph we add the edges $\{a_{i_1},a_{i_j}\}$ for $3\leq j\leq L-1$. The junction tree consists of $L-2$ cliques of the form $a_{i_1}a_{i_j}a_{i_{j+1}}$, where $2\leq j\leq L-1$. The reformulation of the linear program gives us a program containing only $(L-2)\,2^3$ variables.
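The saving in Example 8 is easy to check arithmetically; a small sketch (the function names are ours, chosen for illustration):

```python
# Variables in the naive LP over the whole path versus the junction-tree
# reformulation of Example 8 (L-2 cliques of size 3 along the fill-in edges).
def naive_vars(L):
    return 2 ** L            # one variable per joint state of the path

def junction_tree_vars(L):
    return (L - 2) * 2 ** 3  # L-2 cliques, 2^3 = 8 variables each

# For a path of length 20 the reduction is from about a million to 144.
assert naive_vars(20) == 1048576
assert junction_tree_vars(20) == 144
```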

7 Restricted Safe Sets

Given a set $B$, Algorithm 1 constructs the minimal safe set $C$. However, the set $C$ may still be too large. In this section we study a scenario where we require that the set $C$ have at most $M$ items. Even if such a safe set does not exist, we try to construct $C$ such that the solution of the original minimisation problem described in Eq. 1 does not change. As a solution we describe a heuristic algorithm that uses the information available from the frequencies.

First, let us note that the definition of a safe set requires that we can extend the distribution for any frequencies. In other words, we assume that the frequencies are the worst possible. This is also seen in Algorithm 1, since the algorithm does not use any information available from the frequencies.

Let us now consider how we can use the frequencies. Assume that we are given a family $\mathcal{F}$ of itemsets and frequencies $\theta$ for $\mathcal{F}$. Let $C$ be some (not necessarily safe) set. Let $x\notin C$ be some item violating the safeness condition. Assume that each path from $x$ to $C$ has an edge $e=(u,v)$ with the following property: let $\theta_{uv}$, $\theta_u$, and $\theta_v$ be the frequencies of the itemsets $uv$, $u$, and $v$, respectively; we assume that $\theta_{uv}=\theta_u\theta_v$ and that the itemset $uv$ is not contained in any larger itemset in $\mathcal{F}$. We denote the set of such edges by $E$.

Let $W$ be the set of items reachable from $x$ by paths not using the edges in $E$. Note that the set $W$ has the same property as $x$. We argue that we can remove the set $W$. This is true since if we are given a distribution $p$ defined on $A-W$, then we can extend this distribution, for example, by setting $p(X_A)=p^{ME}(X_W)\,p(X_{A-W})$, where $p^{ME}(X_W)$ is the maximum entropy distribution defined on $W$. Note that if we remove the edges $E$, then Algorithm 1 will not include $W$.

Let us now consider how we can use this observation in practice. Assume that we are given a function $w$ which assigns to each edge a non-negative weight. This weight represents the correlation of the edge and should be $0$ if the independence assumption holds. Assume that we are given an item $x\notin C$ violating the safeness condition but we cannot afford to add $x$ to $C$. Define $H$ to be the subgraph containing $x$, the frontier $\operatorname{front}(x,C)$, and all the intermediate nodes along the paths from $x$ to $C$. We consider finding a set of edges $E$ that cuts $x$ from its frontier and has the minimal cost $\sum_{e\in E}w(e)$. This is a well-known min-cut problem and it can be solved efficiently (see e.g. [20]). We can now use this in our algorithm in the following way: we build the minimal safe set containing the set $B$. For each added item we construct a cut with minimal cost. If the safe set is larger than a constant $M$, we select from the cuts the one having the smallest weight. During this selection we neglect the items that were added before the constraint $M$ was exceeded. We remove the edges and the corresponding itemsets and restart the construction. The algorithm is given in full detail in Algorithm 2.

Algorithm 2 The algorithm for finding a restricted safe set $C$. The required input is $B$, the set that should be contained in $C$; an antimonotonic family $\mathcal{F}$ of itemsets; a constant $M$ which is an upper bound for $|C|$; and a weight function $w$ for the edges. The graph $G$ is the dependency graph evaluated from $\mathcal{F}$.
$C \Leftarrow B$.
repeat
  Find a violating item $x$ having the largest rank.
  if $|C|+1 > M$ then
    Let $H$ be the graph containing $x$, $\operatorname{front}(x,C)$ and all the intermediate nodes.
    Let $E_x$ be the min-cut of $H$ cutting $x$ and $\operatorname{front}(x,C)$ from each other.
    Let $v_x$ be the cost of $E_x$.
  end if
  $C \Leftarrow C + x$.
until there are no violations.
if $|C| > M$ then
  Let $x$ be the item such that $v_x$ is the smallest possible.
  Remove the edges $E_x$ from the dependency graph.
  Remove the itemsets corresponding to the edges from $\mathcal{F}$.
  Remove also possible higher-order itemsets to preserve the antimonotonicity of $\mathcal{F}$.
  Restart the algorithm.
end if
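The min-cut step of Algorithm 2 can be sketched with a standard Edmonds-Karp max-flow computation. The version below is our own minimal implementation, not the paper's code: it collapses the frontier $\operatorname{front}(x,C)$ into a super-sink and reads the cut off the residual graph.

```python
from collections import deque

def min_cut(nodes, weights, source, sinks):
    """Minimum-weight edge cut separating `source` from the set `sinks`.

    `weights` maps undirected edges frozenset({u, v}) to non-negative
    costs. A super-sink absorbs all frontier nodes; returns (cut_edges,
    total_cost) via Edmonds-Karp max-flow = min-cut."""
    INF = float("inf")
    t = object()  # fresh super-sink
    cap = {}
    adj = {u: set() for u in list(nodes) + [t]}
    def add(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)
        adj[u].add(v); adj[v].add(u)
    for e, w in weights.items():
        u, v = tuple(e)
        add(u, v, w); add(v, u, w)   # undirected edge: capacity both ways
    for s in sinks:
        add(s, t, INF)
    while True:  # augment along shortest residual paths
        parent = {source: None}
        queue = deque([source])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[(parent[v], v)])
            v = parent[v]
        v = t
        while parent[v] is not None:
            cap[(parent[v], v)] -= bottleneck
            cap[(v, parent[v])] += bottleneck
            v = parent[v]
    # Nodes reachable from the source in the residual graph are on the x-side.
    reach, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in reach and cap[(u, v)] > 0:
                reach.add(v); queue.append(v)
    cut = [e for e in weights if len(e & reach) == 1]
    return cut, sum(weights[e] for e in cut)
```

On the graph of Example 9 (edges $ab$, $ac$ with weight $0$ and $bd$, $cd$ with positive weights), cutting $a$ from $\{b,c\}$ indeed returns $E_a=\{(a,b),(a,c)\}$ at cost $0$.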
Example 9

We continue Example 5. As a weight function for the edges we use the mutual information. This gives us $w_{bd}=0.6931$ and $w_{cd}=0.1927$. The rest of the weights are $0$. Let $B=\{b,c\}$. We set the upper bound for the size of the safe set to be $M=3$. The minimal safe set is $C=\{a,b,c,d\}$. The min cuts are $E_a=\{(a,b),(a,c)\}$ and $E_d=\{(d,b),(d,c)\}$. The corresponding weights are $v_a=0$ and $v_d=w_{bd}+w_{cd}>0$. Thus by cutting the edges $E_a$ we obtain the set $C^r=\{b,c,d\}$. The frequency interval for the formula $b\land c$ is $\operatorname{fi}(f\mid\mathcal{F}_{C^r},\theta_{C^r})=0.4$, which is the same as in Example 5.
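The mutual-information weights used above can be reconstructed from the pairwise itemset frequencies alone; a sketch in nats (note that two perfectly correlated fair bits give $\ln 2\approx 0.6931$, matching $w_{bd}$ under that assumption about the underlying data of Example 5):

```python
from math import log

def pair_mi(theta_u, theta_v, theta_uv):
    """Mutual information (in nats) of two binary attributes, reconstructed
    from the itemset frequencies theta_u, theta_v and theta_uv."""
    mi = 0.0
    for xu in (0, 1):
        for xv in (0, 1):
            # joint cell probability via inclusion-exclusion
            p = (theta_uv if (xu, xv) == (1, 1) else
                 theta_u - theta_uv if (xu, xv) == (1, 0) else
                 theta_v - theta_uv if (xu, xv) == (0, 1) else
                 1 - theta_u - theta_v + theta_uv)
            pu = theta_u if xu else 1 - theta_u
            pv = theta_v if xv else 1 - theta_v
            if p > 0:
                mi += p * log(p / (pu * pv))
    return mi
```

An independent pair ($\theta_{uv}=\theta_u\theta_v$) gets weight $0$, which is exactly the condition under which Section 7 allows an edge to be cut for free.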

8 Empirical Tests

We performed empirical tests to assess the practical relevance of the restricted safe sets, comparing them to the (possibly) unsafe trivial projection. We mined itemset families from two data sets, and estimated boolean queries using both the safe projection and the trivial projection. The first data set, which we call Paleo (constructed from NOW public release 030717, available from [8]), describes fossil findings: the attributes correspond to genera of mammals, the transactions to excavation sites. The Paleo data is sparse, and the genera and sites exhibit strong correlations. The second data set, which we call Mushroom, was obtained from the FIMI repository (http://fimi.cs.helsinki.fi). The data is relatively dense.

First we used the Apriori [2] algorithm to retrieve some families of itemsets. A problem with Apriori was that the obtained itemsets were concentrated on the attributes having high frequency. A random query conducted on such a family will be safe with high probability, and such a query is trivial to solve. More interesting families would be the ones having almost all variables interacting with each other, that is, families whose dependency graphs have only a small number of isolated nodes. Hence we modified Apriori: let $A$ be the set containing all items, and for each $a\in A$ let $m(a)$ be the frequency of $a$. Let $m$ be the smallest frequency $m=\min_{a\in A}m(a)$ and define $s(a)=m(a)/m$. Let $U$ be an itemset and let $\theta_U$ be its frequency. Define $\eta_U=\prod_{a\in U}s(a)$. We modify Apriori such that the itemset $U$ is in the output if and only if the ratio $\theta_U/\eta_U$ is larger than a given threshold $\sigma$. Note that this family is antimonotonic, and so Apriori can be used. By this modification we are trying to give sparse items a fair chance, and in our tests the relative frequencies did produce more scattered families.
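The modified selection criterion can be sketched as follows. This is a brute-force levelwise version for illustration only; a real implementation would exploit the Apriori pruning that the antimonotonicity of the criterion permits.

```python
from itertools import combinations

def relative_itemsets(rows, items, sigma):
    """Itemsets U kept by the modified Apriori criterion
    theta_U / eta_U > sigma, where eta_U = prod_{a in U} m(a)/m.

    `rows` is a list of transactions, each a set of items."""
    n = len(rows)
    m_a = {a: sum(a in r for r in rows) / n for a in items}
    m = min(m_a.values())                 # smallest single-item frequency
    s = {a: m_a[a] / m for a in items}    # relative weight s(a) >= 1
    out = []
    for size in range(1, len(items) + 1):
        for U in combinations(items, size):
            theta = sum(set(U) <= r for r in rows) / n
            eta = 1.0
            for a in U:
                eta *= s[a]
            if theta / eta > sigma:
                out.append(frozenset(U))
    return out
```

Since $s(a)\geq 1$, dividing by $\eta_U$ lowers the effective support requirement for itemsets built from sparse items, which is what spreads the output family over more attributes.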

For each family of itemsets we evaluated $10000$ random boolean queries. We varied the size of the queries between $2$ and $4$. At first such queries seem too simple, but our initial experiments showed that these queries do result in large safe sets. A few examples are given in Figure 3. In most of the queries the trivial projection is safe, but there are also very large safe sets. Needless to say, we are forced to use restricted safe sets.

Figure 3: Distributions of the sizes of safe sets. The left histogram is obtained from Paleo data by using $\sigma=3\times 10^{-3}$ as the threshold parameter for modified Apriori. The right histogram is obtained from Mushroom data with $\sigma=0.8\times 10^{-8}$.

Given a query $f$ we calculated two intervals $i_1(f)=\operatorname{fi}(f\mid\mathcal{F}_B,\theta_B)$ and $i_2(f)=\operatorname{fi}(f\mid\mathcal{F}_C,\theta_C)$, where $B$ contains the attributes of $f$ and $C$ is the restricted safe set obtained from $B$ using Algorithm 2. In other words, $i_1(f)$ is obtained by using the trivial projection and $i_2(f)$ is obtained by projecting to the restricted safe set. As parameters for Algorithm 2 we set the upper bound $M=8$ and the weight function $w$ to be the mutual information.

We divided queries into two classes. A class Trivial contained the queries in which the trivial projection and the restricted safe set were equal. The rest of the queries were labelled as Complex. We also defined a class All that contained all the queries.

As a measure of goodness for a frequency interval we considered the difference between the upper and the lower bound. Clearly $i_2(f)\subseteq i_1(f)$, so if we define the ratio $r(f)=\left\|i_2(f)\right\|/\left\|i_1(f)\right\|$, then it is always guaranteed that $0\leq r(f)\leq 1$. Note that the ratio for the queries in Trivial is always $1$.

The ratios were divided into appropriate bins. The results obtained from the Paleo data are shown in Tables 1 and 2, and the results for the Mushroom data are given in Tables 3 and 4.

                            $\sigma\times 10^{-3}$
Class     $r\geq$  $r<$    3      3.25   3.5    3.75   4
Complex   0        0.2     1      0      0      0      0
          0.2      0.4     0      1      1      0      0
          0.4      0.6     15     11     10     5      4
          0.6      0.8     74     53     50     55     45
          0.8      1       238    173    124    99     68
          1                3289   1931   1353   1116   868
Trivial   1                6383   7831   8462   8725   9015

Table 1: Counts of queries obtained from Paleo data, classified according to the ratio $r(f)$, which gives the relative tightness of the bounds from restricted safe sets compared to the trivial projections. Each column represents a family of itemsets used as the constraints. The parameter $\sigma$ is the threshold given to the modified Apriori. The class Trivial contains the queries in which the projections were equal; Complex contains the remaining queries. For example, there were $15$ complex queries having ratios between $0.4$ and $0.6$ in the first family.
                 $\sigma\times 10^{-3}$
Class     3       3.25    3.5     3.75    4
Complex   91.0%   89.0%   88.0%   87.5%   88.1%
All       96.7%   97.6%   98.1%   98.4%   98.8%

Table 2: Probability of $r(f)=1$ among the complex queries and among all queries. The queries were obtained from Paleo data. Each column represents a family of itemsets used as the constraints. The parameter $\sigma$ is the threshold given to the modified Apriori.
                            $\sigma\times 10^{-6}$
Class     $r\geq$  $r<$    0.8    0.9    1
Complex   0.0      0.2     46     38     42
          0.2      0.4     96     81     80
          0.4      0.6     302    261    260
          0.6      0.8     96     86     69
          0.8      1       168    118    109
          1                4738   4146   3993
Trivial   1                4554   5270   5447

Table 3: Counts of queries obtained from Mushroom data, classified according to the ratio $r(f)$, which gives the relative tightness of the bounds from restricted safe sets compared to the trivial projections. Each column represents a family of itemsets used as the constraints. The parameter $\sigma$ is the threshold given to the modified Apriori. The class Trivial contains the queries in which the projections were equal; Complex contains the remaining queries.
                 $\sigma\times 10^{-6}$
Class     0.8     0.9     1
Complex   87.0%   87.7%   87.7%
All       92.9%   94.2%   94.4%

Table 4: Probability of $r(f)=1$ among the complex queries and among all queries. The queries were obtained from Mushroom data. Each column represents a family of itemsets used as the constraints. The parameter $\sigma$ is the threshold given to the modified Apriori.

By examining Tables 1 and 2 we conclude the following: if we conduct a random query $f$, then in $97\%$–$99\%$ of the cases the frequency intervals are equal, $i_1(f)=i_2(f)$. However, if we limit ourselves to the cases where the projections differ (the class Complex), then the frequency intervals are equal only in about $90\%$ of the cases. In addition, the probability of $i_1(f)$ being equal to $i_2(f)$ increases as the threshold $\sigma$ grows.

The same observations apply to the results for the Mushroom data (Tables 3 and 4): in $93\%$–$94\%$ of the cases the frequency intervals are equal, $i_1(f)=i_2(f)$, but if we consider only the cases where the projections differ, then the percentage drops to $88\%$. The percentages are slightly smaller than those obtained from the Paleo data, and there are also relatively many queries whose ratios are very small.

The computational burden of a trivial query is equivalent for both the trivial projection and the restricted safe set. Hence, we examine complex queries, in which there is an actual difference in the computational burden. The results suggest that in about $10\%$ of the complex queries the restricted safe sets produced a tighter interval.

9 Conclusions

We started our study by considering the following problem: given a family $\mathcal{F}$ of itemsets, frequencies for $\mathcal{F}$, and a boolean formula, find the bounds of the frequency of the formula. This can be solved by linear programming, but the problem is that the program has an exponential number of variables. This can be remedied by neglecting the variables not occurring in the boolean formula and thus reducing the dimension. The downside is that the solution may change.

In the paper we defined the concept of safeness: given an antimonotonic family $\mathcal{F}$ of itemsets, a set $C$ of attributes is safe if the projection to $C$ does not change the solution of a query, regardless of the query function and the given frequencies for $\mathcal{F}$. We characterised this concept by using graph theory. We also provided an efficient algorithm for finding the minimal safe set containing some given set.

We should point out that while our examples and experiments were focused on conjunctive queries, our theorems work with a query function of any shape.

If the family of itemsets satisfies certain requirements, that is, it is triangulated and clique-safe, then we can obtain safe sets from junction trees. We also showed that the factorisation obtained from a junction tree can be used to reduce the computational burden of the optimisation problem.

In addition, we provided a heuristic algorithm for finding restricted safe sets. The algorithm tries to construct a set of items such that the optimisation problem does not change for some given itemset frequencies.

We ask ourselves: in practice, should we use safe sets rather than the trivial projections? The advantage is that the (restricted) safe sets always produce an outcome at least as good as the trivial approach. The downside is the additional computational burden. Our tests indicate that if a user makes a random query, then in about $93\%$–$99\%$ of the cases the bounds are equal in both approaches. However, this comparison is unfair because there is a large number of queries where the projection sets are equal. To get a better picture we divide the queries into two classes, Trivial and Complex, the first containing the queries such that the projection sets are equal and the second containing the remaining queries. In the first class there is no improvement in the outcome, but there is also no additional computational burden (checking that the set is safe is cheap compared to the linear programming). If a query was in Complex, then in $10\%$ of the cases projecting on restricted safe sets did produce tighter bounds.

Acknowledgements.
The author wishes to thank Heikki Mannila and Jouni Seppänen for their helpful comments.

References

  • [1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 26–28  1993.
  • [2] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and Aino Inkeri Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press/The MIT Press, 1996.
  • [3] Artur Bykowski, Jouni K. Seppänen, and Jaakko Hollmén. Model-independent bounding of the supports of Boolean formulae in binary data. In Pier Luca Lanzi and Rosa Meo, editors, Database Support for Data Mining Applications: Discovering Knowledge with Inductive Queries, LNCS 2682, pages 234–249. Springer Verlag, 2004.
  • [4] Toon Calders. Computational complexity of itemset frequency satisfiability. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004.
  • [5] Toon Calders and Bart Goethals. Mining all non-derivable frequent itemsets. In Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2002.
  • [6] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, May 1968.
  • [7] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
  • [8] Mikael Forselius. Neogene of the old world database of fossil mammals (NOW). University of Helsinki, http://www.helsinki.fi/science/now/, 2005.
  • [9] George Georgakopoulos, Dimitris Kavvadias, and Christos H. Papadimitriou. Probabilistic satisfiability. Journal of Complexity, 4(1):1–11, March 1988.
  • [10] Bart Goethals and Mohammed Javeed Zaki, editors. FIMI ’03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings, 2003.
  • [11] Theodore Hailperin. Best possible inequalities for the probability of a logical function of events. The American Mathematical Monthly, 72(4):343–359, Apr. 1965.
  • [12] Radim Jiroušek and Stanislav Přeušil. On the effective implementation of the iterative proportional fitting procedure. Computational Statistics and Data Analysis, 19:177–189, 1995.
  • [13] Michael I. Jordan, editor. Learning in graphical models. MIT Press, 1999.
  • [14] Thomas Lukasiewicz. Efficient global probabilistic deduction from taxonomic and probabilistic knowledge-bases over conjunctive events. In Proceedings of the sixth international conference on Information and knowledge management, pages 75–82, 1997.
  • [15] Thomas Lukasiewicz. Probabilistic logic programming with conditional constraints. ACM Transactions on Computational Logic (TOCL), 2(3):289–339, July 2001.
  • [16] Heikki Mannila and Hannu Toivonen. Multiple uses of frequent sets and condensed representations (extended abstract). In Knowledge Discovery and Data Mining, pages 189–194, 1996.
  • [17] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 448–459, 1998.
  • [18] M. Muralikrishna and David DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 28–36, 1988.
  • [19] Nils Nilsson. Probabilistic logic. Artificial Intelligence, 28(1):71–87, 1986.
  • [20] Christos Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover, 2nd edition, 1998.
  • [21] Dmitry Pavlov, Heikki Mannila, and Padhraic Smyth. Beyond independence: Probabilistic models for query approximation on binary transaction data. IEEE Transactions on Knowledge and Data Engineering, 15(6):1409–1421, 2003.
  • [22] Dmitry Pavlov and Padhraic Smyth. Probabilistic query models for transaction data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 164–173, 2001.

Appendix A Appendix

This section contains the proofs for the theorems presented in the paper.

A.1 Proof of Theorem 3.1

Let $\theta$ be any consistent frequencies for $\mathcal{F}$. Let $\mathcal{H}=\mathcal{F}_C$. To prove the theorem we will show that any distribution defined on the items $C$ and satisfying the frequencies $\theta_C$ can be extended to a distribution defined on the set $A$ and satisfying the frequencies $\theta$.

Let $W=A-C$. Partition $W$ into connected blocks $W_i$ such that $x,y\in W_i$ if and only if there is a path $P$ from $x$ to $y$ such that $P\cap C=\emptyset$. Note that the items coming from the same $W_i$ have the same frontier. Therefore, $\operatorname{front}(W_i,C)$ is well-defined. We denote $\operatorname{front}(W_i,C)$ by $V_i$.

Let $p^{ME}$ be the maximum entropy distribution defined on the items $A$ and satisfying $\theta$. Note that there is no chord containing elements from $W_i$ and from $C-V_i$ at the same time. This implies that we can write $p^{ME}$ as

p^{ME}(X_A)=p^{ME}(X_C)\prod_{i}\frac{p^{ME}\left(X_{W_i},X_{V_i}\right)}{p^{ME}\left(X_{V_i}\right)}.

Let $p$ be any distribution defined on $C$ and satisfying the frequencies $\theta_C$. Note that $p^{ME}\left(X_{V_i}\right)=p\left(X_{V_i}\right)$, and hence we can extend $p$ to the set $A$ by defining

p(X_A)=p(X_C)\prod_{i}\frac{p^{ME}\left(X_{W_i},X_{V_i}\right)}{p^{ME}\left(X_{V_i}\right)}.

To complete the proof we need to show that $p$ satisfies the frequencies $\theta$. Select any itemset $U\in\mathcal{F}$. There are two possible cases. The first case is $U\subseteq C$, which implies that $U\in\mathcal{H}$, and since $p$ satisfies $\theta_C$ it follows that $p$ also satisfies $\theta_U$.

The other case is that $U$ has elements outside $C$. Note that $U$ can have elements in only one $W_i$, say $W_j$. This in turn implies that $U$ cannot have elements in $C-\operatorname{front}(W_j,C)$, that is, $U\subseteq W_j\cup V_j$. Note that $p^{ME}\left(X_{W_j},X_{V_j}\right)=p\left(X_{W_j},X_{V_j}\right)$. Since $p^{ME}$ satisfies $\theta$, $p$ satisfies $\theta_U$. This completes the proof.

A.2 Proof of Theorem 3.2

Assume that we are given a family $\mathcal{F}$ of itemsets and a set $C$ such that there exists $x\notin C$ with $\operatorname{front}(x,C)\notin\mathcal{F}$. Select $Y\subseteq\operatorname{front}(x,C)$ to be some subset of the frontier such that $Y\notin\mathcal{F}$ and each proper subset of $Y$ is contained in $\mathcal{F}$. We can also assume that the paths from $x$ to $Y$ are of length $1$. This is done by setting the intermediate attributes lying on the paths to be equivalent with $x$. We can also set the rest of the attributes to be equivalent with $0$. Therefore, we can redefine $C=Y$, the underlying set of attributes to consist only of $Y$ and $x$, and $\mathcal{F}$ to be

\mathcal{F}=\left\{Z\mid Z\subset C,Z\neq C\right\}\cup\left\{yx\mid y\in C\right\}.

Let $\theta=\{\theta_Z\mid Z\in\mathcal{F}\}$ be the frequencies for the itemset family $\mathcal{F}$ such that

\begin{array}{rcll}\theta_Z&=&0.5^{\left|Z\right|}&\text{if }Z\subset C\\ \theta_Z&=&0.5&\text{if }Z=x\\ \theta_Z&=&c&\text{if }Z=xy\text{ for }y\in C,\end{array}\qquad(4)

where cc is a constant (to be determined later).

Define $n$ to be the number of elements in $C$. Let $k$ be the number of ones in the random bit vector $X_C$. Let us now consider the following three distributions defined on $C$:

p_1(X_C)=\begin{cases}2^{-n+1}&\text{if }n-k\text{ is even}\\ 0&\text{if }n-k\text{ is odd}\end{cases}\qquad p_2(X_C)=2^{-n}\qquad p_3(X_C)=\begin{cases}2^{-n+1}&\text{if }n-k\text{ is odd}\\ 0&\text{if }n-k\text{ is even.}\end{cases}

Note that all three distributions satisfy the first condition in Eq. 4. Note also that $p_i(X_C)$ depends only on the number of ones in $X_C$. We will slightly abuse the notation and denote $p_i(k)=p_i(X_C)$, where $X_C$ is a random vector having $k$ ones.

Assume that we have extended $p_i(X_C)$ to $p_i(X_C,X_x)$ satisfying $\theta$. We can assume that $p_i(X_C,X_x)$ depends only on the number of ones in $X_C$ and the value of $X_x$. Define $c_i(n,k)=p_i(X_C,X_x=1)$, where $X_C$ is a random vector having $k$ ones. Note that

0.5=p_i(X_x=1)=\sum_{k=0}^{n}\binom{n}{k}c_i(n,k).

If we select any attribute $z\in C$, then

c=p_i(X_z=1,X_x=1)=\sum_{k=1}^{n}\binom{n-1}{k-1}c_i(n,k).

If we now consider the conditions given in Eq. 4 and require that $p_i(X_x=1)=\theta_x=0.5$, and also require that $p_i(X_z=1,X_x=1)=c$ is the largest possible, then we get the following three optimisation problems:

\begin{array}{lrcl}\textsc{Problem P}_i:\\ \text{Maximise}&c_i(n)&=&\sum_{k=1}^{n}\binom{n-1}{k-1}c_i(n,k)\\ \text{subject to}&c_i(n,k)&\geq&0\\ &c_i(n,k)&\leq&p_i(k)\\ &0.5&=&\sum_{k=0}^{n}\binom{n}{k}c_i(n,k)\end{array}\qquad(5)
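Problem P$_i$ has enough structure to be solved greedily: the payoff per unit of probability mass placed at level $k$ is $\binom{n-1}{k-1}/\binom{n}{k}=k/n$, which increases with $k$, so an optimal solution fills the levels from $k=n$ downwards until the budget $0.5$ is spent. A sketch (ours, not the paper's code):

```python
from math import comb

def solve_Pi(n, p):
    """Greedy solution of Problem P_i in Eq. 5: maximise
    sum_k C(n-1, k-1) c(k)  subject to  0 <= c(k) <= p(k)  and
    sum_k C(n, k) c(k) = 0.5.

    Since the per-unit payoff at level k is k/n (increasing in k),
    filling from k = n downwards is optimal. Returns (c_i(n), c)."""
    budget = 0.5
    c = [0.0] * (n + 1)
    for k in range(n, -1, -1):
        mass = comb(n, k) * p(k)   # probability mass available at level k
        take = min(mass, budget)
        c[k] = take / comb(n, k)
        budget -= take
        if budget <= 0:
            break
    value = sum(comb(n - 1, k - 1) * c[k] for k in range(1, n + 1))
    return value, c
```

For the uniform distribution $p_2(k)=2^{-n}$ the greedy fill reproduces exactly the closed form of Eq. 6: full mass $2^{-n}$ for $k>n/2$, half mass $2^{-n-1}$ at $k=n/2$ when $n$ is even, and $0$ below.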

If we can show that the statement

c_1(n)=c_2(n)=c_3(n)

is false, then by setting $c=\max(c_1(n),c_2(n),c_3(n))$ in Eq. 4 we obtain frequencies such that at least one of the distributions $p_i$ cannot be extended to $x$. We will prove our claim by assuming otherwise and showing that the assumption leads to a contradiction.

Note that $\binom{n-1}{k-1}/\binom{n}{k}=k/n$. This implies that the maximal solution $c_2(n)$ has the unique form

c_2(n,k)=\begin{cases}2^{-n}&\text{if }k>\frac{n}{2}\\ 2^{-n-1}&\text{if }k=\frac{n}{2}\text{ and }n\text{ is even}\\ 0&\text{otherwise.}\end{cases}\qquad(6)

Define the series $b(n,k)=\frac{1}{2}\left(c_1(n,k)+c_3(n,k)\right)$. Note that $b(n,k)$ is a feasible solution for Problem P$_2$ in Eq. 5. Moreover, since we assume that $c_2(n)=c_1(n)=c_3(n)$, it follows that $b(n,k)$ produces the optimal solution $c_2(n)$. Therefore, $b(n,k)=c_2(n,k)$. This implies that $c_1(n,k)$ and $c_3(n,k)$ have the forms

c1(n,k)={2c2(n,k),nk is even0,nk is oddc_{1}(n,k)=\left\{\begin{array}[]{lll}2c_{2}(n,k)&,&n-k\text{ is even}\\ 0&,&n-k\text{ is odd}\\ \end{array}\right. (7)
c_{3}(n,k)=\left\{\begin{array}{lll}2c_{2}(n,k)&,&n-k\text{ is odd}\\ 0&,&n-k\text{ is even.}\end{array}\right. \qquad (8)

Assume now that $n$ is odd. The conditions of Problems P${}_{1}$ and P${}_{3}$ imply that

\sum_{k=0}^{n}{n\choose k}c_{1}(n,k)=0.5=\sum_{k=0}^{n}{n\choose k}c_{3}(n,k).

By applying Eqs. 6–8 to this equation we obtain, depending on $n$, either the identity

{n\choose n}+{n\choose n-2}+\ldots+{n\choose\frac{n+1}{2}}={n\choose n-1}+{n\choose n-3}+\ldots+{n\choose\frac{n+3}{2}}

or

{n\choose n}+{n\choose n-2}+\ldots+{n\choose\frac{n+3}{2}}={n\choose n-1}+{n\choose n-3}+\ldots+{n\choose\frac{n+1}{2}}.

Both of these identities are false, since the side containing the term ${n\choose\frac{n+1}{2}}$ is always strictly larger. This proves our claim for the case where $n$ is odd.
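This dominance is easy to check numerically. The sketch below (an illustration only, not part of the proof) splits the coefficients ${n\choose k}$ for $k=\frac{n+1}{2},\ldots,n$ into the two alternating sums appearing in the identities and verifies that the side containing ${n\choose\frac{n+1}{2}}$ is strictly larger for odd $n$.

```python
from math import comb

def alternating_halves(n):
    """For odd n, split C(n,k), k = (n+1)/2, ..., n, into the terms
    reached from k = n in steps of 2 and those reached from k = n-1."""
    assert n % 2 == 1
    m = (n + 1) // 2
    from_n  = sum(comb(n, k) for k in range(n,     m - 1, -2))
    from_n1 = sum(comb(n, k) for k in range(n - 1, m - 1, -2))
    return from_n, from_n1
```

Which of the two sums contains the middle term $k=\frac{n+1}{2}$ depends on the parity of $n-\frac{n+1}{2}=\frac{n-1}{2}$, matching the two cases in the text.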

Assume now that $n$ is even. The assumption $c_{1}(n)=c_{3}(n)$ together with Eqs. 6–8 implies the identity

{n-1\choose n-1}+{n-1\choose n-3}+\ldots+\frac{1}{2}{n-1\choose\frac{n}{2}-1}={n-1\choose n-2}+{n-1\choose n-4}+\ldots+{n-1\choose\frac{n}{2}}.

We apply the identity

{n\choose k}={n-1\choose k}+{n-1\choose k-1} \qquad (9)

to this equation and cancel out the equal terms from both sides. This gives us the identity

\frac{1}{2}{n-1\choose\frac{n}{2}-1}={n-2\choose\frac{n}{2}-1}.

By applying Eq. 9 again we obtain

{n-2\choose\frac{n}{2}-2}={n-2\choose\frac{n}{2}-1}.

This equality holds for no $n$, and thus we have proved our claim.
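As a quick sanity check (an illustration only), the final identity indeed fails for every even $n\geq 4$: below the middle of a row of Pascal's triangle, adjacent binomial coefficients are strictly increasing.

```python
from math import comb

def final_identity_holds(n):
    """Check whether C(n-2, n/2-2) == C(n-2, n/2-1) for even n >= 4."""
    assert n % 2 == 0 and n >= 4
    return comb(n - 2, n // 2 - 2) == comb(n - 2, n // 2 - 1)
```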

A.3 Proof of Theorem 3.5

Denote by $\mathcal{E}\left(p\right)$ the entropy of a distribution $p$. We know that $\mathcal{E}\left(q^{ME}\right)\geq\mathcal{E}\left(p^{ME}_{C}\right)$. Assume now that $q$ is a distribution satisfying the frequencies $\theta_{C}$. Let us extend $q$ as we did in the proof of Theorem 3.1:

p(X_{A})=q(X_{C})\prod_{i}\frac{p^{ME}\left(X_{W_{i}},X_{V_{i}}\right)}{p^{ME}\left(X_{V_{i}}\right)}.

The entropy of this distribution has the form $\mathcal{E}\left(p\right)=\mathcal{E}\left(q\right)+c$, where

c=\sum_{i}\mathcal{E}\left(p^{ME}_{W_{i}\cup V_{i}}\right)-\mathcal{E}\left(p^{ME}_{V_{i}}\right)

is a constant not depending on $q$. This characterisation is valid because $p^{ME}_{V_{i}}=q_{V_{i}}$. If we let $q=q^{ME}$, it follows that

\mathcal{E}\left(p^{ME}\right)\geq\mathcal{E}\left(p\right)=\mathcal{E}\left(q^{ME}\right)+c\geq\mathcal{E}\left(p^{ME}_{C}\right)+c.

If we now let $q=p^{ME}_{C}$, it follows that $p=p^{ME}$, and this implies that $\mathcal{E}\left(p^{ME}\right)=\mathcal{E}\left(p^{ME}_{C}\right)+c$. Thus $\mathcal{E}\left(q^{ME}\right)=\mathcal{E}\left(p^{ME}_{C}\right)$. The distribution maximising the entropy is unique; thus $p^{ME}_{C}=q^{ME}$.
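The entropy decomposition $\mathcal{E}(p)=\mathcal{E}(q)+c$ used above can be illustrated numerically. The sketch below uses an assumed toy setting, a single pair $W=\{c\}$, $V=\{b\}$ over binary attributes with a random stand-in for $p^{ME}$, and checks that the entropy of the extension splits into $\mathcal{E}(q)$ plus a constant that depends only on the conditional factor.

```python
import math
import random

random.seed(1)

def entropy(dist):
    return -sum(v * math.log(v) for v in dist.values() if v > 0)

# A random joint q over (a, b), playing the role of q(X_C).
q = {(a, b): random.random() for a in (0, 1) for b in (0, 1)}
z = sum(q.values()); q = {s: v / z for s, v in q.items()}
q_b = {b: q[(0, b)] + q[(1, b)] for b in (0, 1)}

# A distribution r over (b, c) whose b-marginal matches q,
# standing in for p^ME(X_W, X_V): this is the condition p^ME_V = q_V.
cond = {b: random.random() for b in (0, 1)}   # P(c = 1 | b)
r = {(b, c): q_b[b] * (cond[b] if c else 1 - cond[b])
     for b in (0, 1) for c in (0, 1)}
r_b = {b: r[(b, 0)] + r[(b, 1)] for b in (0, 1)}

# The extension p(a, b, c) = q(a, b) * r(b, c) / r_b(b).
p = {(a, b, c): q[(a, b)] * r[(b, c)] / r_b[b]
     for a in (0, 1) for b in (0, 1) for c in (0, 1)}

# The constant c of the proof: E(r) - E(r_b).
c_const = entropy(r) - entropy(r_b)
```

Because the marginals over $V$ agree, the cross terms cancel and $\mathcal{E}(p)$ equals $\mathcal{E}(q)$ plus this constant, exactly as in the proof.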

A.4 Proof of Theorem 4.1

Assume that there is $x\in Z$ such that $x\notin Y$. Let $U_{x}=\left\{u_{1},\ldots,u_{L}\right\}$ be as defined in Algorithm 1. Let $P_{i}$ be the shortest path from $x$ to $u_{i}$ and define $v_{i}$ to be the first item on $P_{i}$ belonging to $Y$. There are two possible cases: either $v_{i}=u_{i}$, which implies that $u_{i}\in\operatorname{front}\left(x,Y\right)$, or $u_{i}$ is blocked by some other element in $Y$. If $U_{x}\subseteq\operatorname{front}\left(x,Y\right)$, then the safeness condition is violated. Therefore, there exists $u_{j}$ such that $v_{j}\neq u_{j}$.

We will prove that $v_{j}$ outranks $x$, that is, $\operatorname{rank}\left(v_{j}\mid C\right)>\operatorname{rank}\left(x\mid C\right)$. It is easy to see that it suffices to prove that $\operatorname{rank}\left(v_{j}\mid U_{x}\right)>\operatorname{rank}\left(x\mid U_{x}\right)$. To do this, note that $\left\{v_{1},\ldots,v_{L}\right\}\subseteq\operatorname{front}\left(x,Y\right)\in\mathcal{F}$. Therefore, because of the antimonotonic property of $\mathcal{F}$, there is an edge from $v_{j}$ to each $v_{i}$. This implies that there is a path $R_{i}$ from $v_{j}$ to $u_{i}$ such that $\left|R_{i}\right|\leq\left|P_{i}\right|$, that is, the length of $R_{i}$ is at most the length of $P_{i}$. Also note that, since $v_{j}$ lies on $P_{j}$, there exists a path $R_{j}$ from $v_{j}$ to $u_{j}$ such that $\left|R_{j}\right|<\left|P_{j}\right|$. This implies that $\operatorname{rank}\left(v_{j}\mid U_{x}\right)>\operatorname{rank}\left(x\mid U_{x}\right)$.

Also note that $U_{x}\subset N\left(v_{j}\mid r\right)$, where $r$ is the search radius defined in Algorithm 1. This implies that $v_{j}$ is discovered during the search phase, that is, $v_{j}$ is one of the violating nodes.

To complete the proof we need to show that $v_{j}$ is a neighbour of $C$. Since $x$ is a neighbour of $C$, there is $u_{k}$ such that there is an edge between $x$ and $u_{k}$. This implies that $v_{k}=u_{k}$. Since there is an edge between $v_{j}$ and $v_{k}$, it follows that $v_{j}$ is a neighbour of $C$.

A.5 Proof of Theorem 6.1

Let $a$ be an item belonging to some inner clique $Q$ but not belonging to any inner separator. The clique $Q$ is unique, and the only reachable items of $C$ from $a$ are the inner separators incident to $Q$. Since $Q$ is a clique, it follows from the clique-safeness assumption that the frontier of $a$ is included in $\mathcal{F}$.

Let now $a$ be any item that is not included in any inner clique. There exists a unique inner clique $Q$ such that all the paths from $a$ to $C$ go through this clique. This implies that the frontier of $a$ again consists of the inner separators incident to $Q$.

A.6 Proof of Theorem 6.2

We will prove that if an item $a$ comes from some inner separator and is not included in the minimal safe set, then we can alter the junction tree so that $a$ is no longer included in the inner separators. For the sake of clarity, we illustrate the modification process with an example in Figure 4.

Figure 4: Two equivalent junction trees. Our goal is to find the minimal safe set for $B=\left\{a,d,g\right\}$. The left junction tree is before the modification and the right is after the modification. We see that the attribute $x$ is not included in the inner separators in the right tree. The sets appearing in the proof are as follows: the minimal safe set $C$ is $adgbeh$; $I$ consists of 3 separators $bx$, $ex$, and $hx$; the other separators belong to $J$; $V$ consists of 4 cliques $bcx$, $efx$, $hix$, and $behx$; the clique $Q$ is $behx$.

Let $G$ be the dependency graph and $T$ the current junction tree. Let $C$ be the minimal safe set containing $B$ and let $a\notin C$ be an item coming from some inner separator. Let us consider paths (in $G$) from $a$ to its frontier. For the sake of clarity, we prove only the case where the paths from $a$ to $C$ have length $1$. The proof for the general case is similar.

Let $I$ be the collection of inner separators containing $a$. Let $V$ be the collection of (inner) cliques incident to the inner separators included in $I$. The pair $(V,I)$ defines a subtree of $T$. Let $J$ be the set of inner separators incident to some clique in $V$ but not included in $I$. Note that each item coming from the inner separators included in $J$ must be included in $C$, because otherwise we would violate the assumption that the paths from $a$ to its frontier have length $1$.

The frontier of $a$ consists of the items of the inner separators in $J$ and possibly of some items from the set $B$. By assumption the frontier is in $\mathcal{F}$ and thus it is fully connected. It follows that there is a clique $Q$ containing the frontier. If $Q\notin V$, a clique from $V$ closest to $Q$ also contains the frontier. Hence we can assume that $Q\in V$.

Select a separator $E\in J$. Let $U\notin V$ be the clique incident to $E$. We modify the tree by cutting the edge $E$ and reattaching $U$ to $Q$. This procedure is performed for each separator in $J$. The obtained tree satisfies the running intersection property, since $Q$ contains the items coming from each inner separator included in $J$. If the frontier contained any items included in $B$, then $Q$ contains these items. It is easy to see that each clique in $V$, except for the clique $Q$, becomes outer. Therefore, $a$ is no longer included in any inner separator.

A.7 Proof of Theorem 6.3

Let $\hat{p}$ be the optimal distribution. By marginalising we can obtain $\hat{p}_{i}$ and $\hat{q}_{j}$, which produce the same solution for the reduced problem.

To prove the other direction, let $\hat{p}_{i}$ and $\hat{q}_{j}$ be the optimal distributions for the reduced problem. Since the running intersection property holds, we can define the joint distribution $\hat{p}$ by $\hat{p}=\prod_{i}\hat{p}_{i}/\prod_{j}\hat{q}_{j}$. It is straightforward to see that $\hat{p}$ satisfies the frequencies. This proves the statement.
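This recomposition can be illustrated on a toy chain of cliques. The sketch below assumes cliques $\{a,b\}$ and $\{b,c\}$ with separator $\{b\}$ over binary attributes (an illustration, not the general construction): dividing the product of clique marginals by the separator marginal yields a joint distribution that still satisfies the clique frequencies.

```python
import itertools
import random

random.seed(0)

# A toy joint distribution over three binary attributes (a, b, c).
joint = {s: random.random() for s in itertools.product((0, 1), repeat=3)}
z = sum(joint.values())
joint = {s: v / z for s, v in joint.items()}

def marginal(dist, idx):
    """Marginalise a distribution over tuples onto the given positions."""
    out = {}
    for s, v in dist.items():
        key = tuple(s[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

# Cliques {a, b} and {b, c}, separator {b}: running intersection holds.
p_ab = marginal(joint, (0, 1))
p_bc = marginal(joint, (1, 2))
q_b  = marginal(joint, (1,))

# Recomposed joint: product of clique marginals over the separator marginal.
p_hat = {s: p_ab[(s[0], s[1])] * p_bc[(s[1], s[2])] / q_b[(s[1],)]
         for s in joint}
```

In general $\hat{p}$ differs from the original joint, but its marginals on the cliques, and hence the frequencies, are preserved, which is the point of the theorem.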