Multi-rubric Models for Ordinal Spatial Data with Application to Online Ratings Data

Antonio R. Linero, Jonathan R. Bradley, and Apurva Desai
Department of Statistics, Florida State University
Email: arlinero@stat.fsu.edu
Abstract

Interest in online rating data, in which users of a website such as Yelp! or Amazon provide ordinal ratings of products or local businesses, has increased in recent years. One source of heterogeneity in ratings is that users apply different standards when supplying their ratings; even if two users benefit from a product the same amount, they may translate their benefit into ratings in different ways. In this article we propose an ordinal data model, which we refer to as a multi-rubric model, that treats the criteria used to convert a latent utility into a rating as user-specific random effects, with the distribution of these random effects modeled nonparametrically. We demonstrate that this approach accounts for this type of variability in addition to the usual sources of heterogeneity due to item quality, user biases, interactions between items and users, and the spatial structure of the users and items. We apply the model developed here to publicly available data from the website Yelp! and demonstrate that it produces interpretable clusterings of users according to their rating behavior, in addition to providing better predictions of ratings and better summaries of overall item quality.


Key words and phrases: Bayesian hierarchical model; data augmentation; nonparametric Bayes; ordinal data; recommender systems; spatial prediction.

1 Introduction

In recent years, the complexity of data used to make decisions has increased dramatically. A prime example of this is the use of online reviews to decide whether to purchase a product or visit a local business; we refer to the objects being reviewed as items. Consider data provided by Yelp! (see, http://www.yelp.com/), which allows users to rate items, such as restaurants, convenience stores, and so forth, on a discrete scale from one to five “stars.” Additional features of the businesses are also known, such as the spatial location and type of business. Datasets of this type are typically very large and exhibit complex dependencies.

As an example of this complexity, users of Yelp! effectively determine their own standards when rating a local business. We refer to the particular standards a user applies as a rubric. We might imagine a latent variable Y_{iu} representing the utility, or benefit, user u obtained from item i. For a given level of utility, however, different users may still give different ratings due to having different standards for the ratings; for example, one user may rate a restaurant 5 stars as long as it provides a non-offensive experience, a second user might require an exceptional experience to rate the same restaurant 5 stars, and a third user may rate all items with 1 star in order to “troll” the website. Each of these users is applying a different rubric in translating their utility to a rating for the restaurant. In addition, we also expect user-specific selection bias in the sense that some users may rate every restaurant they attend, while other users may only rate restaurants that they feel strongly about.

This article makes several contributions. First, we develop a semiparametric Bayesian model which accounts for the existence of multiple rubrics for ratings data that are observed over multiple locations. To do this, we use a spatial cumulative probit model (e.g., see Higgs and Hoeting,, 2010; Berret and Calder,, 2012; Schliep and Hoeting,, 2015) in which the break-points are modeled as user-specific random effects. This requires a flexible model for the distribution FF of the random effects, which we model as a discrete mixture. A by-product of our approach is that we obtain a clustering of users according to the rubrics they are using.

Second, we use the multi-rubric model to address novel inferential questions. For example, ratings displayed to a user might be adjusted to match that user’s rubric, or we might report the distribution of the rating that a user would provide conditional on having a particular rubric. This user-specific standardization of ratings may provide users with better intuition for the overall quality of an item.

This adjustment of restaurant quality for the rubrics is similar to, but distinct from, the task of predicting a user’s ratings. Good predictive performance is required for collaborative filtering, which refers to the task of processing the rating history of a user and producing a list of recommendations (for a review, see Bobadilla et al., 2013). As a third contribution, we show that allowing for multiple rubrics improves predictions.

The model proposed here also has interesting statistical features. A useful feature of our model is that it allows for more accurate comparisons across items. For example, if a user rates all items with 1 star, then the model discounts this user’s ratings. This behavior is desirable for two reasons. First, if a user genuinely rates all items with 1 star, then their ratings are unhelpful. Second, it down-weights the ratings of users who are exhibiting selection bias and only rating items which they feel strongly about, which is desirable as comparisons across items will be more indicative of true quality if they are based on individuals who are not exhibiting large degrees of selection bias.

Additionally, the rubrics themselves may be of intrinsic interest. We demonstrate that the rubrics learned by our model are highly interpretable. For example, when analyzing the Yelp! dataset in Section 4, we obtain Figure 7 which displays the ratings observed for users assigned to a discrete collection of rubrics and reveals several distinct rating patterns displayed by users.

Other features of our model are also of potentially independent interest. The multi-rubric model can be interpreted as a novel semiparametric random-effects model for ordinal data, even for problems in which the intuition behind the multi-rubric model in terms of latent utility does not hold. Other study designs in which the multi-rubric analogy may be useful include longitudinal survey studies, or more general ordinal repeated-measures designs. Additionally, the cumulative probit model we use to model latent user preferences includes a spatial process to account for spatial dependencies across local businesses. Recovering an underlying spatial process allows for recommending entire regions to visit, rather than singular items. The development of low-rank spatial methodology for large-scale dependent ordinal data is of interest within the spatial literature, as the current spatial literature for ordinal data does not typically address datasets of a size comparable to the Yelp! dataset (e.g., see De Oliveira, 2004, 2000; Chen and Dey, 2000; Cargnoni et al., 1997; Knorr-Held, 1995; Carlin and Polson, 1992; Higgs and Hoeting, 2010; Berret and Calder, 2012; Velozo et al., 2014, among others). We model the underlying spatial process using a low-rank approximation (Cressie and Johannesson, 2008) to a desired Gaussian process (Banerjee et al., 2008; Bradley et al., 2015).

Starting from Koren and Sill, (2011), several works in the recommender systems literature have considered ordinal matrix factorization (OMF) procedures which are similar in many respects to our model (see also Paquet et al., 2012 and Houlsby et al., 2014). Our work differs non-trivially from these works in that the multi-rubric model treats the break-points as user-specific random effects, with a nonparametric prior used for the random effects distribution FF. For the Yelp! dataset, this extra flexibility leads to improved predictive performance. Additionally, our focus in this work extends to inferential goals beyond prediction; for example, depending on the distribution of the rubrics of users who rate a given item, the estimate of overall quality for that item can be shrunk to a variety of different centers, producing novel multiple-shrinkage effects. Several works in the Bayesian nonparametric literature have also considered flexible models for random effects in multivariate ordinal models (Kottas et al.,, 2005; DeYoreo and Kottas,, 2014; Bao and Hanson,, 2015), but do not treat the break-points themselves as random effects.

The paper is organized as follows. In Section 2, we develop the multi-rubric model, with an eye towards the Yelp! dataset, and provide implementation details. In Section 3, we illustrate the methodology on synthetic data designed to mirror features of the Yelp! dataset, and demonstrate that we can accurately recover the number and structure of the rubrics when the model holds, as well as effectively estimate the underlying latent utility field. In Section 4, we illustrate the methodology on the Yelp! dataset. We conclude with a discussion in Section 5. In supplementary material, we present simulation experiments which demonstrate identifiability of key components of the model.

2 The Multi-rubric model

2.1 Preliminary notation

We consider ordinal response variables ZiuZ_{iu} taking values in {1,,K}\{1,\ldots,K\}. In the context of online ratings data, ZiuZ_{iu} represents the rating that user uu provides for item ii. In the context of survey data, on the other hand, ZiuZ_{iu} might represent the response subject uu gives to question ii. We do not assume that ZiuZ_{iu} is observed for all (i,u)(i,u) pairs, but instead we observe (i,u)𝒮{1,,I}×{1,,U}(i,u)\in\mathcal{S}\subseteq\{1,\ldots,I\}\times\{1,\ldots,U\}, where UU is the total number of subjects and II is the total number of items. For fixed ii we let 𝒰i={u:(i,u)𝒮}\mathcal{U}_{i}=\{u:(i,u)\in\mathcal{S}\} be the set of users that rate item ii, and similarly for fixed uu we let u={i:(i,u)𝒮}\mathcal{I}_{u}=\{i:(i,u)\in\mathcal{S}\} be the set of items that user uu rates.

2.2 Review of Cumulative Probit Models

Cumulative probit models (Albert and Chib,, 1993, 1997) provide a convenient framework for modeling ordinal rating data. Consider the univariate setting, with ordinal observations {Zi:1iN}\{Z_{i}:1\leq i\leq N\} taking values in {1,,K}\{1,\ldots,K\}. We assume that ZiZ_{i} is a rounded version of a latent variable YiY_{i} such that Zi=kZ_{i}=k if θk1Yi<θk\theta_{k-1}\leq Y_{i}<\theta_{k}. Here, =θ0θ1θK=-\infty=\theta_{0}\leq\theta_{1}\leq\cdots\leq\theta_{K}=\infty are unknown break-points. When YiY_{i} has the Gaussian distribution YiGau(xiγ,1)Y_{i}\sim\operatorname{Gau}(x_{i}^{\top}\gamma,1) this leads to the ordinal probit model, where Pr(Zi=kθ,γ)=Φ(θkxiγ)Φ(θk1xiγ)\Pr(Z_{i}=k\mid\theta,\gamma)=\Phi(\theta_{k}-x_{i}^{\top}\gamma)-\Phi(\theta_{k-1}-x_{i}^{\top}\gamma).

We assume Var(Yi)=1\operatorname{Var}(Y_{i})=1, as the variance of YiY_{i} is confounded with the break-points θ=(θ1,,θK1)\theta=(\theta_{1},\ldots,\theta_{K-1}). Any global intercept term is also confounded with the θ\theta’s; there are two resolutions to this issue. The first is to fix one of the θk\theta_{k}’s, e.g., θ10\theta_{1}\equiv 0. The second is to exclude an intercept term from xix_{i}. While the former approach is often taken (Albert and Chib,, 1997; Higgs and Hoeting,, 2010), it is more convenient in the multi-rubric setting to use the latter approach to avoid placing asymmetric restrictions on the break-points.

The ordinal probit model is convenient for Bayesian inference in part because it admits a simple data augmentation algorithm which iterates between sampling YiindepTruncGau(xiγ,1,θZi1,θZi)Y_{i}\stackrel{{\scriptstyle\textnormal{indep}}}{{\sim}}\operatorname{TruncGau}(x_{i}^{\top}\gamma,1,\theta_{Z_{i}-1},\theta_{Z_{i}}) for 1iN1\leq i\leq N and, assuming a flat prior for γ\gamma, sampling γGau{(XX)1X𝒀,(XX)1},\gamma\sim\operatorname{Gau}\{(X^{\top}X)^{-1}X^{\top}\bm{Y},(X^{\top}X)^{-1}\}, where XX has ithi^{\text{th}} row xix_{i}^{\top} and 𝒀=(Y1,,YN)\bm{Y}=(Y_{1},\ldots,Y_{N}). Here, TruncGau(μ,σ2,a,b)\operatorname{TruncGau}(\mu,\sigma^{2},a,b) denotes the Gaussian distribution truncated to the interval (a,b)(a,b). Additionally, an update for θ\theta is needed. Efficient updates for θ\theta can be implemented by using a Metropolis-within-Gibbs step to update θ\theta as a block (for details, see Albert and Chib,, 1997, as well as Cowles,, 1996 for alternative MCMC schemes).
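To make the data augmentation concrete, the following Python sketch implements the two conditional draws described above for the univariate ordinal probit with fixed break-points; the function and variable names are illustrative and not taken from the paper.

```python
# Minimal sketch of Albert-Chib data augmentation for the univariate ordinal
# probit with fixed break-points theta and a flat prior on gamma.
import numpy as np
from scipy.stats import truncnorm

def ordinal_probit_gibbs(Z, X, theta, n_iter=1000):
    """Z: (N,) ratings in {1,...,K}; X: (N, p) covariates; theta: (K-1,) break-points."""
    N, p = X.shape
    brk = np.concatenate(([-np.inf], theta, [np.inf]))   # theta_0 = -inf, theta_K = +inf
    XtX_inv = np.linalg.inv(X.T @ X)
    gamma = np.zeros(p)
    draws = []
    for _ in range(n_iter):
        # Sample Y_i | Z_i, gamma ~ TruncGau(x_i'gamma, 1, theta_{Z_i - 1}, theta_{Z_i})
        mu = X @ gamma
        Y = mu + truncnorm.rvs(brk[Z - 1] - mu, brk[Z] - mu)
        # Sample gamma | Y ~ Gau((X'X)^{-1} X'Y, (X'X)^{-1}) under the flat prior
        gamma = np.random.multivariate_normal(XtX_inv @ X.T @ Y, XtX_inv)
        draws.append(gamma.copy())
    return np.array(draws)
```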

2.3 Description of the proposed model

2.3.1 The multi-rubric model

We develop an extension of the cumulative probit model to generic repeated-measures ordinal data {Ziu:(i,u)𝒮}\{Z_{iu}:(i,u)\in\mathcal{S}\}. Following Albert and Chib, (1997) we introduce latent utilities YiuY_{iu}, but specify a generic ANOVA model

Yiu=fiu+νu+ξi+ϵiu,ϵiuiidGau(0,1),\displaystyle Y_{iu}=f_{iu}+\nu_{u}+\xi_{i}+\epsilon_{iu},\qquad\epsilon_{iu}\stackrel{{\scriptstyle\textnormal{iid}}}{{\sim}}\operatorname{Gau}(0,1), (1)

where νu\nu_{u} and ξi\xi_{i} are main effects and fiuf_{iu} is an interaction effect. The multi-rubric model modifies the cumulative probit model by replacing the break-point parameter θ\theta with uu-specific random effects θu=(θu0,,θuK)\theta_{u}=(\theta_{u0},\ldots,\theta_{uK}) with [θuF]indepF[\theta_{u}\mid F]\stackrel{{\scriptstyle\textnormal{indep}}}{{\sim}}F for some unknown FF. As before, we let Ziu=kZ_{iu}=k if θu(k1)Yiuθuk\theta_{u(k-1)}\leq Y_{iu}\leq\theta_{uk}.

For concreteness, we take FF to be a finite mixture F=m=1Mωmδθ(m)F=\sum_{m=1}^{M}\omega_{m}\delta_{\theta^{(m)}} for some large MM, with θ(m)iidH\theta^{(m)}\stackrel{{\scriptstyle\textnormal{iid}}}{{\sim}}H and ωDirichlet(a,,a)\omega\sim\operatorname{Dirichlet}(a,\ldots,a), where δθ(m)\delta_{\theta^{(m)}} is a point-mass distribution at θ(m)\theta^{(m)}. We note that it is also straight-forward to use a nonparametric prior for FF such as a Dirichlet process (Escobar and West,, 1995; Ferguson,, 1973). We refer to the random effects θ(1),,θ(M)\theta^{(1)},\ldots,\theta^{(M)} as rubrics. Note that for each subject uu there exists a latent class mm such that θu=θ(m)\theta_{u}=\theta^{(m)}.
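As a small illustration of the generative process just described, the following Python sketch simulates ratings from the finite-mixture version of the model; the dimensions and parameter values are arbitrary, and the mean term is a stand-in for the ANOVA specification in (1).

```python
# Illustrative simulation from the multi-rubric model: each user draws a rubric
# (vector of break-points) from the discrete mixture F, and the rating Z_iu is
# the latent utility Y_iu rounded using that user's own break-points.
import numpy as np

rng = np.random.default_rng(0)
I, U, K, M = 50, 100, 5, 3                                        # items, users, levels, rubrics

theta = np.sort(rng.normal(0.0, 2.0, size=(M, K - 1)), axis=1)    # rubrics theta^(m)
omega = rng.dirichlet(np.full(M, 1.0))                            # mixture weights
C = rng.choice(M, size=U, p=omega)                                # latent class of each user

mu = rng.normal(0.0, 1.0, size=(I, U))      # stand-in for f_iu + nu_u + xi_i in (1)
Y = mu + rng.normal(size=(I, U))            # latent utilities with unit error variance

Z = np.empty((I, U), dtype=int)
for u in range(U):
    # Z_iu = k iff theta_{u(k-1)} <= Y_iu < theta_{uk}, with theta_{u0} = -inf
    Z[:, u] = np.searchsorted(theta[C[u]], Y[:, u]) + 1
```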

Figure 1 displays the essential idea for the model. Viewing Y as a latent utility, the rubric that the user is associated to leads to different values of the observed rating Z. In this example, the second rubric is associated to users who rate many items with a 3, while the first rubric is associated to users who do not rate many items with a 3.

Figure 1: Visualization of the multi-rubric model. The point on the density indicates the realized value of YY.

Treating the break-points as random effects has several benefits. First, it offers additional flexibility over approaches for ordinal data which incorporate a random intercept (Gill and Casella,, 2009). Due to the fact that the θu\theta_{u}’s are confounded with both the location and scale of YiuY_{iu}, treating the break-points as random effects is at least as flexible as treating the location and scale of the distribution of the YiuY_{iu}’s as random effects. We require this additional flexibility, as merely treating the location and scale of the YiuY_{iu}’s as random effects does not allow for the variety of rating behaviors exhibited by users. By treating the break-points as random effects, we are able to capture any distribution of ratings in a given rubric (see, e.g., Figure 7). In addition to flexibility, specifying FF as a discrete mixture induces a clustering of users into latent classes. To each user uu we associate a latent variable CuC_{u} such that Cu=mC_{u}=m if θu=θ(m)\theta_{u}=\theta^{(m)}. As will be demonstrated in Section 4, the latent classes of users discovered in this way are highly interpretable.

2.3.2 Model for the Yelp! data

Our model for the Yelp! data takes YiuGau(μiu,1)Y_{iu}\sim\operatorname{Gau}(\mu_{iu},1) where

μiu=xiγ+αuβi+Wi+bi,Wi=ψ(si)η.\displaystyle\mu_{iu}=x_{i}^{\top}\gamma+\alpha_{u}^{\top}\beta_{i}+W_{i}+b_{i},\qquad W_{i}=\psi(s_{i})^{\top}\eta.

This is model (1) with fiu=αuβif_{iu}=\alpha_{u}^{\top}\beta_{i}, ξi=xiγ+Wi+bi\xi_{i}=x_{i}^{\top}\gamma+W_{i}+b_{i}, and νu\nu_{u} removed. This model can be motivated as a combination of the fixed-rank kriging approach of Cressie and Johannesson, (2008) with the probabilistic matrix factorization approach of Salakhutdinov and Mnih, (2007). The terms xiγx_{i}^{\top}\gamma, WiW_{i}, and bib_{i} are used to account for heterogeneity in the items. The term xiγx_{i}^{\top}\gamma accounts for known covariates xipx_{i}\in\mathbb{R}^{p} associated to each item. The term WiW_{i} is used to capture spatial structure, and is modeled with a basis function expansion Wi=ψ(si)ηW_{i}=\psi(s_{i})^{\top}\eta where sis_{i} denotes the longitude-latitude coordinates associated to the item and ψ(s)=(ψ1(s),ψr(s))\psi(s)=(\psi_{1}(s)\ldots,\psi_{r}(s))^{\top} is a vector of basis functions. We note that it is straight-forward to replace our low-rank approach for WiW_{i} with more elaborate approaches such as the full-scale approach of Sang and Huang, (2012). The term bib_{i} is an item-specific random effect which is used to capture item heterogeneity which cannot be accounted for by the covariates or the low-rank spatial structure.

The vectors αu\alpha_{u} and βi\beta_{i} intuitively correspond to unmeasured user-specific and item-specific latent features. The term αuβi\alpha_{u}^{\top}\beta_{i} is large/positive when αu\alpha_{u} and βi\beta_{i} point in the same direction (i.e., the user’s preferences align with the item’s characteristics), and is large/negative when αu\alpha_{u} and βi\beta_{i} point in opposite directions. This allows the model to account not only for user-specific biases (θu\theta_{u}) and item-specific biases (xi,Wi,bi)(x_{i},W_{i},b_{i}), but also interaction effects.

The multi-rubric model can be summarized by the following hierarchical model. For each model, we implicitly assume the statements hold conditionally on all variables in the models below, and that conditional independence holds within each model unless otherwise stated.

Response model:

Ziu=kZ_{iu}=k with probability wiuk=Φ(θukμiu)Φ(θu(k1)μiu)w_{iuk}=\Phi(\theta_{uk}-\mu_{iu})-\Phi(\theta_{u(k-1)}-\mu_{iu}) and μiu=xiγ+αuβi+Wi+bi\mu_{iu}=x_{i}^{\top}\gamma+\alpha_{u}^{\top}\beta_{i}+W_{i}+b_{i}.

Random effect model:

θuiidF\theta_{u}\stackrel{{\scriptstyle\textnormal{iid}}}{{\sim}}F, αuGau(0,σα2I)\alpha_{u}\sim\operatorname{Gau}(0,\sigma^{2}_{\alpha}\operatorname{I}), βiGau(0,σβ2I)\beta_{i}\sim\operatorname{Gau}(0,\sigma^{2}_{\beta}\operatorname{I}), and biGau(0,σb2)b_{i}\sim\operatorname{Gau}(0,\sigma^{2}_{b}).

Spatial process model:

Wi=ψ(si)ηW_{i}=\psi(s_{i})^{\top}\eta where ηGau(0,Ση)\eta\sim\operatorname{Gau}(0,\Sigma_{\eta}).

Parameter model:

γFlat\gamma\sim\operatorname{Flat} and F=m=1Mωmδθ(m)F=\sum_{m=1}^{M}\omega_{m}\delta_{\theta^{(m)}} where ωDirichlet(a,,a)\omega\sim\operatorname{Dirichlet}(a,\ldots,a) and θ(m)iidH\theta^{(m)}\stackrel{{\scriptstyle\textnormal{iid}}}{{\sim}}H.

To complete the model we must specify values for the hyperparameters σα,σβ,σb,Ση,a\sigma_{\alpha},\sigma_{\beta},\sigma_{b},\Sigma_{\eta},a, and HH, as well as the number of rubrics MM and the number of latent factors LL. In our illustrations we place half-Gaussian priors for the scale parameters, with (σβ,σb)iidGau+(0,1)(\sigma_{\beta},\sigma_{b})\stackrel{{\scriptstyle\textnormal{iid}}}{{\sim}}\operatorname{Gau}_{+}(0,1), and σα1\sigma_{\alpha}\equiv 1. We let Ση=diag(ση2,,ση2)\Sigma_{\eta}=\operatorname{diag}(\sigma^{2}_{\eta},\ldots,\sigma^{2}_{\eta}) and set σηGau+(0,1)\sigma_{\eta}\sim\operatorname{Gau}_{+}(0,1). Here, Gau+(0,1)\operatorname{Gau}_{+}(0,1) denotes a standard Gaussian distribution truncated to the positive reals. For a discussion of prior specification for variance parameters, see Gelman et al., (2006) and Simpson et al., (2017).

In our illustrations we use M=20. For the Yelp! dataset, the choice of M=20 rubrics is conservative, and by setting a=\kappa/M for some fixed \kappa>0, we encourage \omega to be nearly-sparse (Ishwaran and Zarepour, 2002; Linero, 2016). This strategy effectively lets the data determine how many rubrics are needed, as the prior encourages \omega_{m}\approx 0 if rubric m is not needed. The prior H for \theta^{(1)},\ldots,\theta^{(M)} is chosen to have density proportional to h(\theta)=\prod_{k=1}^{K-1}\operatorname{Gau}(\theta_{k}\mid 0,\sigma_{\theta}^{2})I(\theta_{1}\leq\cdots\leq\theta_{K-1}), so that \theta^{(m)} has the distribution of the order statistics of K-1 independent \operatorname{Gau}(0,\sigma^{2}_{\theta}) variables.
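A minimal sketch of drawing these hyperparameters, assuming illustrative values of M, K, kappa, and sigma_theta, is given below; it simply encodes the nearly-sparse Dirichlet weights and the order-statistics prior H described above.

```python
# Prior draws for the rubric component: nearly-sparse weights omega and
# rubrics theta^(m) as order statistics of K-1 independent Gaussians.
import numpy as np

rng = np.random.default_rng(1)
M, K, kappa, sigma_theta = 20, 5, 1.0, 2.0     # illustrative hyperparameter values

omega = rng.dirichlet(np.full(M, kappa / M))                              # omega ~ Dirichlet(kappa/M, ..., kappa/M)
theta = np.sort(rng.normal(0.0, sigma_theta, size=(M, K - 1)), axis=1)    # theta^(m) ~ H
```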

2.4 Evaluating item quality

A commonly used measure of item quality is the expected rating from a user drawn from the population, \lambda_{i}=E(Z_{iu}\mid x_{i},\phi_{i},\gamma), where \phi_{i}=(\beta_{i},b_{i},W_{i}). This quantity is given by

λi\displaystyle\lambda_{i} =k=1KkPr(Ziu=kxi,ϕi,γ)\displaystyle=\sum_{k=1}^{K}k\cdot\Pr(Z_{iu}=k\mid x_{i},\phi_{i},\gamma)
=k=1Km=1MkωmPr(Ziu=kxi,ϕi,αu,γ,Cu=m)Gau(α0,σα2I)𝑑α.\displaystyle=\sum_{k=1}^{K}\sum_{m=1}^{M}k\cdot\omega_{m}\cdot\int\Pr(Z_{iu}=k\mid x_{i},\phi_{i},\alpha_{u},\gamma,C_{u}=m)\,\operatorname{Gau}(\alpha\mid 0,\sigma^{2}_{\alpha}\operatorname{I})\ d\alpha.

Using properties of the Gaussian distribution, and recalling that σα2=1\sigma^{2}_{\alpha}=1, it can be shown that

λi=k=1Km=1Mkωm{Φ(θk(m)ξi1+βi2)Φ(θk1(m)ξi1+βi2)},\displaystyle\lambda_{i}=\sum_{k=1}^{K}\sum_{m=1}^{M}k\cdot\omega_{m}\cdot\left\{\Phi\left(\frac{\theta^{(m)}_{k}-\xi_{i}}{\sqrt{1+\|\beta_{i}\|^{2}}}\right)-\Phi\left(\frac{\theta^{(m)}_{k-1}-\xi_{i}}{\sqrt{1+\|\beta_{i}\|^{2}}}\right)\right\}, (2)

where \xi_{i}=x_{i}^{\top}\gamma+b_{i}+W_{i}. In Section 4, we demonstrate that the particular users who rated item i exert a strong influence on the \lambda_{i}’s, particularly for restaurants with few ratings.

Rather than focusing on an omnibus measure of overall quality, we can also adjust the overall quality of an item to be rubric-specific. This amounts to calculating \lambda_{im}=E(Z_{iu}\mid x_{i},\phi_{i},\gamma,C_{u}=m), which represents the average rating of item i if all users applied rubric m. Similar to (2), this quantity can be computed as

λim=k=1Kk{Φ(θk(m)ξi1+βi2)Φ(θk1(m)ξi1+βi2)}.\displaystyle\lambda_{im}=\sum_{k=1}^{K}k\cdot\left\{\Phi\left(\frac{\theta^{(m)}_{k}-\xi_{i}}{\sqrt{1+\|\beta_{i}\|^{2}}}\right)-\Phi\left(\frac{\theta^{(m)}_{k-1}-\xi_{i}}{\sqrt{1+\|\beta_{i}\|^{2}}}\right)\right\}. (3)

In Section 4, we use both (2) and (3) to understand the statistical features of the multi-rubric model.
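For reference, the following Python sketch evaluates (2) and (3) for a single item given one posterior draw of the relevant quantities; the function and argument names are ours and not part of the paper.

```python
# Rubric-averaged and rubric-specific expected ratings, equations (2) and (3).
import numpy as np
from scipy.stats import norm

def item_quality(theta, omega, xi_i, beta_i):
    """theta: (M, K-1) rubric break-points; omega: (M,) weights;
    xi_i: scalar x_i'gamma + b_i + W_i; beta_i: (L,) latent factors."""
    M, Km1 = theta.shape
    scale = np.sqrt(1.0 + np.sum(beta_i ** 2))
    # augment each rubric with theta_0 = -inf and theta_K = +inf
    brk = np.concatenate([np.full((M, 1), -np.inf), theta,
                          np.full((M, 1), np.inf)], axis=1)
    probs = np.diff(norm.cdf((brk - xi_i) / scale), axis=1)   # Pr(Z = k | rubric m)
    ks = np.arange(1, Km1 + 2)
    lambda_im = probs @ ks        # equation (3): expected rating under rubric m
    lambda_i = omega @ lambda_im  # equation (2): average over the rubric mixture
    return lambda_i, lambda_im
```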

2.5 Implementation Details

We use the reduced rank model W=Ψη+bW=\Psi\eta+b where ΨI×r\Psi\in\mathbb{R}^{I\times r} has ithi^{\text{th}} row given by ψ(si)\psi(s_{i})^{\top}. We choose Ψ\Psi so that Cov(Ψη)\operatorname{Cov}(\Psi\eta) is an optimal low-rank approximation to ση2Ξ\sigma^{2}_{\eta}\Xi where Ξ\Xi is associated to a target positive semi-definite covariogram. This is accomplished by taking Ψ\Psi composed of the first rr columns of ΓD1/2\Gamma D^{1/2} where Ξ=ΓDΓ\Xi=\Gamma D\Gamma^{\top} is the spectral decomposition of Ξ\Xi. The Eckart-Young-Mirsky theorem states that this approximation is optimal with respect to both the operator norm and Frobenius norm (see, e.g., Rasmussen and Williams,, 2005, Chapter 8). A similar strategy is used by Bradley et al., (2016, 2015), who use an optimal low-rank approximation of a target covariance structure ΞΨΣηΨ\Xi\approx\Psi\Sigma_{\eta}\Psi^{\top} where the basis Ψ\Psi is held fixed but Ση\Sigma_{\eta} is allowed to vary over all positive-definite r×rr\times r matrices. In our illustrations, we use the squared-exponential covariance, i.e., Ξij=exp(ρsisj2)\Xi_{ij}=\exp({-{\rho}\|s_{i}-s_{j}\|^{2}}) (Cressie,, 2015).

To complete the specification of the model, we must specify the bandwidth \rho, the number of latent factors L, and the number of basis functions r. We regard L as a tuning parameter, which can be selected by assessing prediction performance on a held-out subset of the data. In principle, a prior can be placed on \rho; however, this results in a large computational burden. We instead evaluate several fixed values of \rho chosen according to some rules-of-thumb and select the value with the best performance. For the Yelp! dataset, we selected \rho=1000, which corresponds to undersmoothing the spatial field relative to Scott’s rule (see, e.g., Härdle and Müller, 2000) by roughly a factor of two, and remark that substantively similar results are obtained with other bandwidths. Finally, r can be selected so that the proportion of the variance \sum_{d=1}^{r}D_{dd}^{2}/\sum_{d=1}^{n}D_{dd}^{2} in \Xi accounted for by the low-rank approximation exceeds some preset threshold; for the Yelp! dataset, we chose r=500 to account for 99% of the variance in \Xi.
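The following Python sketch illustrates this construction under the squared-exponential covariogram, building Psi from the leading eigenpairs of Xi and choosing r by the stated variance criterion; the helper name and default threshold are illustrative.

```python
# Fixed-rank basis: eigendecompose the squared-exponential covariance Xi and
# keep the first r columns of Gamma D^{1/2}, with r set by a variance threshold.
import numpy as np
from scipy.spatial.distance import cdist

def low_rank_basis(s, rho, var_threshold=0.99):
    """s: (I, 2) longitude-latitude coordinates; rho: bandwidth."""
    Xi = np.exp(-rho * cdist(s, s, "sqeuclidean"))        # Xi_ij = exp(-rho ||s_i - s_j||^2)
    evals, evecs = np.linalg.eigh(Xi)                     # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]            # reorder to descending
    evals = np.clip(evals, 0.0, None)                     # guard against tiny negatives
    frac = np.cumsum(evals ** 2) / np.sum(evals ** 2)     # proportion of variance retained
    r = int(np.searchsorted(frac, var_threshold)) + 1
    Psi = evecs[:, :r] * np.sqrt(evals[:r])               # first r columns of Gamma D^{1/2}
    return Psi, r
```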

When specifying the number of rubrics MM, we have found that the model is most reliable when MM is chosen large and a=κ/Ma=\kappa/M for some κ>0\kappa>0; under these conditions, the prior for FF is approximately a Dirichlet process with concentration κ\kappa and base measure HH (see, e.g., Teh et al.,, 2006). We recommend choosing MM to be conservatively large and allowing the model to remove unneeded rubrics through the sparsity-inducing prior on ω\omega. We have found that taking MM large is necessary for good performance even in simulations in which the true number of rubrics is small and known.

We use Markov chain Monte Carlo to approximately sample from the posterior distribution of the parameters. A description of the sampler is given in the appendix.

2.6 A note on selection bias

Let Δiu=1\Delta_{iu}=1 if (i,u)𝒮(i,u)\in\mathcal{S}, and Δiu=0\Delta_{iu}=0 otherwise. In not modeling the distribution of Δiu\Delta_{iu}, we are implicitly modeling the distribution of the ZiuZ_{iu}’s conditional on Δiu=1\Delta_{iu}=1. When selection bias is present, this may be quite different than the marginal distribution of ZiuZ_{iu}’s. Experiments due to Marlin and Zemel, (2009) provide evidence that selection bias may be present in practice.

A useful feature of the approach presented here is that it naturally down-weights users who are exhibiting selection bias. For example, if user uu only rates items they feel negatively about, they will be assigned to a rubric mm for which θ1(m)\theta^{(m)}_{1} is very large; this has the effect of ignoring their ratings, as there will be effectively no information in the data about their latent utility. As a result, when estimating overall item quality, the model naturally filters out users who are exhibiting extreme selection bias, which may be desirable.

In the context of prediction, the predictive distribution for Z_{iu} should be understood as being conditional on the event \Delta_{iu}=1; that is, the prediction is made with the additional information that user u chose to rate item i. This is the case for nearly all collaborative filtering methods, as correcting for the selection bias necessitates collecting Z_{iu}’s for which \Delta_{iu}=0 would have occurred naturally; for example, as done by Marlin and Zemel (2009), we might assess selection bias by conducting a pilot study which forces users to rate items they would not have normally rated. With the understanding that all methods are predicting ratings conditional on \Delta_{iu}=1, the results in Section 4 show that the multi-rubric model leads to increased predictive performance.

Selection bias should also be taken into account when interpreting the latent rubrics produced by our model. Our model naturally provides a clustering of users into latent classes, which we presented as representing differing standards in user ratings; however, we expect that the model is also detecting differences in selection bias across users. We emphasize that our goal is to identify and account for heterogeneity in rating patterns, and we avoid speculating on whether heterogeneity is caused by different rating standards or selection bias. For example, a user who rates items with only one-star or five-stars might be either (i) using a rubric which results in extreme behavior, with most of the break-points very close together; or (ii) actively choosing to rate items which they feel strongly about.

3 Simulation Study

The goal of this simulation is to illustrate that we can accurately learn the existence of multiple rubrics in settings where one would expect them to be difficult to detect. We consider a situation where the data are generated according to two rubrics that are similar to each other. This allows us to assess the robustness of our model to various “degrees” of the multi-rubric assumption. The performance of our multi-rubric model is assessed relative to the single-rubric model, which is the standard assumption made in the ordinal data literature.

We calibrate components of the simulation model towards the Yelp! dataset to produce realistic simulated data. Specifically, we set η\eta and σb2\sigma^{2}_{b} equal to the posterior means obtained from fitting the model to the Yelp! dataset in Section 4. We set Ση=0.5I\Sigma_{\eta}=0.5\operatorname{I}, corresponding to a much stronger spatial effect than what was observed in the data, and for simplicity we removed the latent-factor aspect of the model by fixing σβ20\sigma^{2}_{\beta}\equiv 0. A two-rubric model is used with ω1=ω2=0.5\omega_{1}=\omega_{2}=0.5. We also use the same spatial basis functions and observed values of (i,u)(i,u) as in the Yelp! analysis in Section 4.

We now describe how the two rubrics \theta_{1} and \theta_{2} were chosen. First, \theta_{1} was selected so that \{Z_{iu}:C_{u}=1,i=1,\ldots,I\} was evenly distributed among the five responses. Associated to \theta_{1} is a probability vector p_{1}=(0.2,0.2,0.2,0.2,0.2). To specify \theta_{2}, we use the same approach with a different choice of p. Let p_{2}=(0,0.25,0.5,0.25,0). Then \theta_{2}^{(\tau)} is associated to \tau p_{1}+(1-\tau)p_{2} in the same manner as \theta_{1} is associated to p_{1}. Here, \tau indexes the similarity of \theta_{1} and \theta_{2}, and it can be shown that the total variation distance between the empirical distribution of \{Z_{iu}:C_{u}=1\} and \{Z_{iu}:C_{u}=2\} is 0.8(1-\tau). Thus, values of \tau near 1 imply that the rubrics are similar, while values of \tau near 0 imply that they are dissimilar. Figure 2 presents the distribution of the Z_{iu}’s with C_{u}=2 when \tau=0, 0.8, and 1.
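A small sketch of this construction is given below; it assumes, purely for illustration, that the latent utilities are approximately standard Gaussian, so that a rubric matching a target rating distribution p is obtained from Gaussian quantiles of the cumulative sums of p.

```python
# Break-points matched to a target rating distribution p = (p_1, ..., p_K),
# and the interpolated rubric theta_2(tau) used in the simulation study.
import numpy as np
from scipy.stats import norm

def rubric_from_probs(p):
    """Returns theta_1, ..., theta_{K-1} so that a Gau(0,1) utility yields ratings
    with (approximately) the probabilities in p."""
    return norm.ppf(np.cumsum(p)[:-1])

p1 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
p2 = np.array([0.0, 0.25, 0.5, 0.25, 0.0])
tau = 0.8
theta1 = rubric_from_probs(p1)
theta2_tau = rubric_from_probs(tau * p1 + (1 - tau) * p2)
```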

Figure 2: Empirical distribution of the ZiuZ_{iu}’s in the simulation model, for θ1\theta_{1}, θ2(0.8)\theta^{(0.8)}_{2}, and θ2(0)\theta_{2}^{(0)}.

We fit a 10-rubric and a single-rubric model for \tau=0.0,0.1,\ldots,1.0. Figure 3 displays the proportion of individuals assigned to each rubric for a given value of \tau. If the model is accurately recovering the underlying rubric structure, we expect to see half of the observations assigned to one rubric, and half to another; due to permutation invariance, which of the 10 rubrics are associated to \theta_{1} and \theta_{2}^{(\tau)} varies by simulation. Up to \tau=0.9, the model is capable of accurately recovering the existence of two rubrics. We also see that, even at \tau=0.8, the model accurately recovers the empirical distribution of the Z_{iu}’s associated to each rubric.

Figure 3: Top: proportion of individuals assigned to each rubric at the last iteration of the Markov chain. Bottom: The empirical distribution of ZiuZ_{iu} for the two rubrics associated to Cu=1C_{u}=1 and Cu=2C_{u}=2 when τ=0.8\tau=0.8; compare with the left and middle plots in Figure 2.

Next, we assess the benefit of using the multi-rubric model to predict missing values. For each value of \tau, we fit a single-rubric and a multi-rubric model. Using the same train-test split as in our real-data illustration, we compute the log-likelihood on the held-out data, \text{loglik}_{\text{test}}=\sum_{(i,u)\in\mathcal{S}_{\text{test}}}\log\Pr(Z_{iu}\mid D), which is discussed in detail in Section 4. Figure 4 shows the difference in held-out log-likelihood for the single-rubric and multi-rubric models as a function of \tau. Up to \tau=0.8, there is a meaningful increase in the held-out log-likelihood obtained from using the multi-rubric model. The case where \tau=1 is also particularly interesting, as this implies that the data were generated from the single-rubric model. Here the predictive performance of our model at missing values appears to be robust to the case where the multi-rubric assumption is incorrect.

Figure 4: Difference in logliktest\text{loglik}_{\text{test}} for the single-rubric and multi-rubric model obtained in the simulation study, as a function of τ\tau. Above each point, we provide the proportion of users whose most likely rubric assignment matched their true rubric.

Displayed above each point in Figure 4 is the proportion of users assigned to the correct rubric, where each user is assigned to their most likely rubric. When the rubrics are far apart, the model is capable of accurately assigning users to rubrics. As the rubrics get closer together, the task of assigning users to rubrics becomes much more difficult.

This simulation study suggests that the model specified here is able to disentangle the two-rubric structure, even when the rubrics are only subtly different. This leads to clear improvements in predictive performance for small and moderate values for τ\tau. Additionally, when the multi-rubric assumption is negligible, or even incorrect, our model performs as well as the single-rubric model.

4 Analysis of Yelp data

Figure 5: Estimate of the underlying spatial field W(s)=ψ(s)ηW(s)=\psi(s)^{\top}\eta at each realized restaurant location using its posterior mean.

We now apply the multi-rubric model to the Yelp! dataset, which is publicly available at https://www.yelp.com/dataset_challenge. We begin by preprocessing the data to include only reviews between January 1st, 2013 and December 31st, 2016, and restrict attention to restaurants in Phoenix and its surrounding areas. We further narrow the data to include only users who rated at least 10 restaurants; this filtering is done in an attempt to minimize selection bias, as we believe that “frequent raters” should be less influenced by selection bias.

We first evaluate the performance of the single-rubric and multi-rubric models for various values of the latent factor dimension L. We set M=20 and induce sparsity in \omega by setting \omega\sim\operatorname{Dirichlet}(1/20,\ldots,1/20). We divide the indices (i,u)\in\mathcal{S} into a training set \mathcal{S}_{\text{train}} and a testing set \mathcal{S}_{\text{test}} of equal sizes by randomly allocating half of the indices to the training set. We evaluate predictions using the held-out log-likelihood criterion

logliktest=|𝒮test|1(i,u)𝒮testlogPr(Ziu𝒟),|𝒮test|1(i,u)𝒮testlogT1t=1TPr(ZiuCu(t),θ(t),μiu(t)),\displaystyle\begin{split}\text{loglik}_{\text{test}}&=|\mathcal{S}_{\text{test}}|^{-1}\sum_{(i,u)\in\mathcal{S}_{\text{test}}}\log\Pr(Z_{iu}\mid\mathcal{D}),\\ &\approx|\mathcal{S}_{\text{test}}|^{-1}\sum_{(i,u)\in\mathcal{S}_{\text{test}}}\log T^{-1}\sum_{t=1}^{T}\Pr(Z_{iu}\mid C_{u}^{(t)},\theta^{(t)},\mu_{iu}^{(t)}),\end{split} (4)

where \mathcal{D}=\{Z_{iu}:(i,u)\in\mathcal{S}_{\text{train}}\}, \Pr(Z_{iu}\mid\mathcal{D}) denotes the posterior predictive distribution of Z_{iu}, and t=1,\ldots,T indexes the approximate draws from the posterior obtained by MCMC. Results for the values L=1,3, and 5, over 10 splits into training and test data, are given in Figure 6. We also compare our methodology to ordinal matrix factorization (Paquet et al., 2012) with learned break-points and spatial smoothing, and the mixture multinomial model (Marlin and Zemel, 2009) with 10 mixture components. The multi-rubric model leads to an increase in the held-out data log-likelihood (4) of roughly 5% over ordinal matrix factorization and 8% over the mixture multinomial model. Additionally, we note that the held-out log-likelihood was very stable over replications. The single-rubric model is essentially equivalent to ordinal matrix factorization.
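A short sketch of how (4) can be computed from MCMC output is given below; prob_draws is an assumed array holding, for each posterior draw and each held-out pair, the probability the model assigns to the observed rating.

```python
# Held-out log-likelihood (4): average the per-draw predictive probabilities
# over MCMC iterations, then average the logs over the test pairs.
import numpy as np

def heldout_loglik(prob_draws):
    """prob_draws: (T, n_test) array with entries Pr(Z_iu | C_u^(t), theta^(t), mu_iu^(t))."""
    pred = prob_draws.mean(axis=0)      # Monte Carlo estimate of Pr(Z_iu | D)
    return float(np.mean(np.log(pred)))
```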

The dimension of the latent factors \alpha_{u} and \beta_{i} has little effect on the quality of the model. We attribute this to the fact that |\mathcal{U}_{i}| and |\mathcal{I}_{u}| are typically small, making it difficult for the model to recover the latent factors. On other datasets where this is not the case, such as the Netflix challenge dataset, latent-factor models represent the state of the art and are likely essential for the multi-rubric model. In the supplementary material we show in simulation experiments that the \alpha_{u}’s, \beta_{i}’s, and L are identified.

Figure 6: Boxplots of 2.0logliktest-2.0\cdot\text{loglik}_{\text{test}} for the mixture multinomial model (MMM, which does not have latent factors), ordinal matrix factorization (OMF), the single rubric model (Single) and the multi-rubric model (Multi), for 10 splits into training and testing data.

Figure 5 displays the learned spatial field \widehat{W}(s)=\psi(s)^{\top}\widehat{\eta} where \widehat{\eta} is the posterior mean of \eta. The results suggest that the downtown Phoenix business district and the area surrounding the affluent Paradise Valley possess a higher concentration of highly-rated restaurants than the rest of the Phoenix area. More sparsely populated areas such as Litchfield Park, or areas with lower income such as Guadalupe, seem to have fewer highly-rated restaurants.

Figure 7: Top: bar chart giving the number of users assigned to each rubric, where users are assigned to rubrics by minimizing Binder’s loss function. Bottom: bar charts giving the proportions of the observed ratings ZiuZ_{iu} for each item-user pair for the top 9 most common rubrics.

We now examine the individual rubrics. First, we obtain a clustering of users into their rubrics by minimizing Binder’s loss function (Binder, 1978) L(\bm{c})=\sum_{u,u^{\prime}}|\delta_{c_{u},c_{u^{\prime}}}-\Pi_{u,u^{\prime}}|, where \delta_{ij}=I(i=j) is the Kronecker delta, \bm{c}=(c_{1},\ldots,c_{U}) is an assignment of users to rubrics, and \Pi_{u,u^{\prime}} is the posterior probability of C_{u}=C_{u^{\prime}}. See Fritsch and Ickstadt (2009) for additional approaches to clustering objects using samples of the C_{u}’s.
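A simple way to carry out this step in practice, sketched below under illustrative names, is to estimate the co-clustering probabilities Pi from the MCMC draws of the C_u's and then select the sampled clustering that minimizes Binder's loss.

```python
# Clustering users by (approximately) minimizing Binder's loss over the
# clusterings visited by the Markov chain.
import numpy as np

def binder_clustering(C_draws):
    """C_draws: (T, U) integer array of sampled rubric assignments C_u."""
    T, U = C_draws.shape
    Pi = np.zeros((U, U))
    for c in C_draws:                                   # posterior co-clustering matrix
        Pi += (c[:, None] == c[None, :])
    Pi /= T
    losses = []
    for c in C_draws:                                   # Binder's loss of each sampled clustering
        same = (c[:, None] == c[None, :]).astype(float)
        losses.append(np.abs(same - Pi).sum())
    return C_draws[int(np.argmin(losses))]
```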

The multi-rubric model produces interesting effects on the overall estimate of restaurant quality. Consider the rubric corresponding to m=7 in Figure 7. Users assigned to this rubric give the majority of restaurants a rating of five stars. As a result, a rating of 5 stars from the m=7 rubric is less valuable to a restaurant than a rating of 5 stars from a user with the m=6 rubric. Similarly, a rating of 3 stars from the m=7 rubric is more damaging to the estimate of a restaurant’s quality than a rating of 3 stars from the m=6 rubric.

For restaurants with a large number of reviews, the effect mentioned above is negligible, as the restaurants typically have a good mix of users from different rubrics. The effect on restaurants with a small number of reviews, however, can be much more pronounced. To illustrate this effect, Figure 8 displays the posterior distribution of the quantity \lambda_{i} defined in (2) for the restaurants with i\in\{3356,3809,9\}. Each of these businesses has 4 reviews total, with empirically averaged ratings of 4.25, 3.75, and 3 stars. For i=3809 and i=9, the users are predominantly from the rubric with m=7; as a consequence, the fact that these restaurants do not have an average rating closer to five stars is quite damaging to the estimate of the restaurant quality. In the case of i=3809, the effect is strong enough that what was ostensibly an above-average restaurant is actually estimated to be below average by the multi-rubric model. Conversely, item i=3356 has ratings of 4, 5, 5, and 3 stars, but one of the 5-star ratings comes from a user assigned to the rubric m=2, which gave a 5-star rating to only 8% of businesses. As a result, the 5-star ratings are weighted more heavily than they would otherwise be, causing the distribution of \lambda_{i} to be shifted slightly upwards.

Figure 8: Posterior density of λi\lambda_{i} for i{9,3356,3809}i\in\{9,3356,3809\}. The dashed line is the empirical average rating of item ii; the dotted line is the overall average of all ratings. Error bars are centered at the posterior mean with a radius of one standard deviation.

Lastly, we consider rescaling the average ratings according to a specific rubric. This may be of interest, for example, if one is interested in standardizing the ratings to match a rubric which disperses ratings evenly across the possible stars. To do this, we examine the rubric-adjusted average ratings \lambda_{im} given by (3). Figure 9 displays the posterior density of \lambda_{im} for i=24 and i=44, for the 9 most common rubrics. These two restaurants have over 100 reviews, and so the overall quality can be estimated accurately. We see some expected features; for example, the quality of each restaurant has been adjusted downwards for users of the m=10 rubric, and upwards for the m=7 rubric. The multi-rubric model allows for more nuanced behavior of the adjusted ratings than simple upward/downward shifts. For example, for the mediocre item i=44, we see that little adjustment is made for the m=13 rubric, while for the high-quality item i=24 a substantial downward adjustment is made. This occurs because the model interprets the users with m=13 as requiring a relatively large amount of utility to rate an item 5 stars, so that a downward adjustment is made for the high-quality item; on the other hand, users with m=13 tend to rate items near 3.5 stars, so little adjustment needs to be made for the mediocre item.

Figure 9: Posterior density of λim\lambda_{im} for i=44,24i=44,24. Horizontal lines display the empirical average rating for each item. Rubrics are organized by the average rating assigned to i=44i=44 for visualization purposes.

5 Discussion

In this paper we introduced the multi-rubric model for the analysis of rating data and applied it to public data from the website Yelp!. We found that the multi-rubric model yields improved predictions and induces sophisticated shrinkage effects on the estimated quality of the items. We also showed how the model developed can be used to partition the users into interpretable latent classes.

There are several exciting areas for future work. First, while Markov chain Monte Carlo works well for the Yelp! dataset (it took 90 minutes to fit the model of Section 4), it would be desirable to develop a more scalable procedure, such as stochastic variational inference (Hoffman et al., 2013). Second, the model described here features limited modeling of the users. Information regarding which items the users have rated has been shown in other settings to improve predictive performance; temporal heterogeneity may also be present in users.

The latent class model described here can also be extended to allow for more flexible models for the \alpha_{u}’s and \beta_{i}’s. For example, a referee pointed out the possibility of inferring how controversial an item is across latent classes, which could be accomplished naturally by using a mixture model for the \alpha_{u}’s.

A fruitful area for future research is the development of methodology for when the missing-at-random (MAR) assumption fails. One possibility for future work is to extend the model to also model the missing data indicators \Delta_{iu}. This is complicated by the fact that, while \{Y_{iu}:1\leq i\leq I,1\leq u\leq U\} is not completely observed, \{\Delta_{iu}:1\leq i\leq I,1\leq u\leq U\} is. As a result, the data become much larger when modeling the \Delta_{iu}’s.

Acknowledgements

The authors thank Eric Chicken for helpful discussions. This work was partially supported by the Office of the Secretary of Defense under research program #SOT-FSU-FATs-06 and NSF grants NSF-SES #1132031 and NSF-DMS #1712870.

Appendix A Markov chain Monte Carlo algorithm

Before describing the algorithm, we define several quantities. First, define

R_{iu}^{(\alpha)} = Y_{iu}-\mu_{iu}+\alpha_{u}^{\top}\beta_{i}, \qquad R_{iu}^{(b)} = Y_{iu}-\mu_{iu}+b_{i},
R_{iu}^{(\gamma)} = Y_{iu}-\mu_{iu}+x_{i}^{\top}\gamma, \qquad R_{iu}^{(\eta)} = Y_{iu}-\mu_{iu}+\psi(s_{i})^{\top}\eta.

Let \bm{R}^{(\alpha)}_{u}=\operatorname{vec}(R_{iu}^{(\alpha)}:i\in\mathcal{I}_{u}), \bm{R}^{(\beta)}_{i}=\operatorname{vec}(R_{iu}^{(\alpha)}:u\in\mathcal{U}_{i}), \bm{R}^{(b)}_{i}=\operatorname{vec}(R_{iu}^{(b)}:u\in\mathcal{U}_{i}), \bm{R}^{(\gamma)}=\operatorname{vec}(R_{iu}^{(\gamma)}:(i,u)\in\mathcal{S}), and \bm{R}^{(\eta)}=\operatorname{vec}(R_{iu}^{(\eta)}:(i,u)\in\mathcal{S}). Then we can write

\bm{R}^{(\beta)}_{i} = \bm{A}_{i}\beta_{i}+\bm{\epsilon}_{i}, \qquad \bm{R}^{(b)}_{i} = \mathds{1}b_{i}+\bm{\epsilon}_{i},
\bm{R}^{(\alpha)}_{u} = \bm{B}_{u}\alpha_{u}+\bm{\epsilon}_{u}, \qquad \bm{R}^{(\gamma)} = \bm{X}\gamma+\bm{\epsilon},
\bm{R}^{(\eta)} = \bm{\Psi}\eta+\bm{\epsilon},

where 𝑨i\bm{A}_{i} has rows composed of α\alpha’s associated to users who rated item ii, 𝑩u\bm{B}_{u} has rows composed of β\beta’s associated to items which were rated by user uu, and 𝑿\bm{X} and 𝚿\bm{\Psi} are design matrices associated to the covariates and basis functions respectively. Holding the other parameters fixed, each term above on the left-hand-side is sufficient for its associated parameter on the right-hand-side.

A data augmentation strategy similar to the one proposed by Albert and Chib, (1997) is employed. The updates for the parameters η,α,β\eta,\alpha,\beta, and γ\gamma use a back-fitting strategy based on the 𝑹\bm{R}’s above. The Markov chain operates on the state space (C,Y,θ,b,α,β,γ,η,ω,σα,σβ,ση)(C,Y,\theta,b,\alpha,\beta,\gamma,\eta,\omega,\sigma_{\alpha},\sigma_{\beta},\sigma_{\eta}). We perform the following updates for each iteration of the sampling algorithm, where each step is understood to be done for each relevant uu and ii.

  1. Draw C_{u}\sim\operatorname{Categorical}(\widehat{\omega}_{u1},\ldots,\widehat{\omega}_{uM}) where \widehat{\omega}_{um} is proportional to \omega_{m}\prod_{i\in\mathcal{I}_{u}}[\Phi(\theta_{Z_{iu}}^{(m)}-\mu_{iu})-\Phi(\theta_{Z_{iu}-1}^{(m)}-\mu_{iu})].

  2. Draw Y_{iu}\sim\operatorname{TruncGau}(\mu_{iu},1,\theta^{(C_{u})}_{Z_{iu}-1},\theta^{(C_{u})}_{Z_{iu}}), for (i,u)\in\mathcal{S}.

  3. Draw \alpha_{u}\sim\operatorname{Gau}(\widehat{\alpha}_{u},\widehat{\Sigma}_{\alpha_{u}}) where \widehat{\Sigma}_{\alpha_{u}}=(\bm{B}_{u}^{\top}\bm{B}_{u}+\sigma_{\alpha}^{-2}I)^{-1} and \widehat{\alpha}_{u}=\widehat{\Sigma}_{\alpha_{u}}\bm{B}_{u}^{\top}\bm{R}^{(\alpha)}_{u}.

  4. Draw \beta_{i}\sim\operatorname{Gau}(\widehat{\beta}_{i},\widehat{\Sigma}_{\beta_{i}}) where \widehat{\Sigma}_{\beta_{i}}=(\bm{A}_{i}^{\top}\bm{A}_{i}+\sigma^{-2}_{\beta}I)^{-1} and \widehat{\beta}_{i}=\widehat{\Sigma}_{\beta_{i}}\bm{A}_{i}^{\top}\bm{R}^{(\beta)}_{i}.

  5. Draw \gamma\sim\operatorname{Gau}(\widehat{\gamma},\widehat{\Sigma}_{\gamma}) where \widehat{\Sigma}_{\gamma}=(\bm{X}^{\top}\bm{X})^{-1} and \widehat{\gamma}=\widehat{\Sigma}_{\gamma}\bm{X}^{\top}\bm{R}^{(\gamma)}.

  6. Draw b_{i}\sim\operatorname{Gau}(\widehat{b}_{i},\widehat{\sigma}^{2}_{b_{i}}) where \widehat{\sigma}^{2}_{b_{i}}=(\sigma^{-2}_{b}+|\mathcal{U}_{i}|)^{-1} and \widehat{b}_{i}=\widehat{\sigma}^{2}_{b_{i}}\sum_{u\in\mathcal{U}_{i}}R_{iu}^{(b)}.

  7. Draw \eta\sim\operatorname{Gau}(\widehat{\eta},\widehat{\Sigma}_{\eta}) where \widehat{\Sigma}_{\eta}=(\bm{\Psi}^{\top}\bm{\Psi}+\Sigma^{-1}_{\eta})^{-1} and \widehat{\eta}=\widehat{\Sigma}_{\eta}\bm{\Psi}^{\top}\bm{R}^{(\eta)}.

  8. Draw \omega\sim\operatorname{Dirichlet}(\widehat{a}_{1},\ldots,\widehat{a}_{M}) where \widehat{a}_{m}=a+\sum_{u:C_{u}=m}1.

  9. Make an update to \sigma^{2}_{b} which leaves \operatorname{Ga}(\sigma^{2}_{b}\mid 0.5,0.5)\prod_{i=1}^{I}\operatorname{Gau}(b_{i}\mid 0,\sigma_{b}^{2}) invariant.

  10. Make an update to \sigma^{2}_{\beta} which leaves \operatorname{Ga}(\sigma^{2}_{\beta}\mid 0.5,0.5)\prod_{i=1}^{I}\prod_{\ell=1}^{L}\operatorname{Gau}(\beta_{i\ell}\mid 0,\sigma_{\beta}^{2}) invariant.

  11. Make an update to \sigma^{2}_{\eta} which leaves \operatorname{Ga}(\sigma^{2}_{\eta}\mid 0.5,0.5)\prod_{j=1}^{r}\operatorname{Gau}(\eta_{j}\mid 0,\sigma_{\eta}^{2}) invariant.

  12. Make an update to \theta^{(m)} which leaves \operatorname{Gau}(\theta^{(m)}\mid\bm{0},\sigma_{\theta}^{2}\operatorname{I})I(\theta_{1}^{(m)}<\cdots<\theta_{K-1}^{(m)})\cdot\prod_{u:C_{u}=m}\prod_{i\in\mathcal{I}_{u}}[\Phi(\theta_{Z_{iu}}^{(m)}-\mu_{iu})-\Phi(\theta^{(m)}_{Z_{iu}-1}-\mu_{iu})] invariant.

In our illustrations, we use slice sampling (Neal, 2003) to do updates 9–11. The chain is initialized by simulation from the prior with a=1. The only non-trivial step is constructing an update for the \theta^{(m)}’s. We use a modification of the approach outlined in Albert and Chib (1997), which uses a Laplace approximation to construct a proposal for the full conditional of the parameters \delta^{(m)}_{1}=\theta^{(m)}_{1} and \delta^{(m)}_{k}=\log(\theta^{(m)}_{k}-\theta^{(m)}_{k-1}) for k=2,\ldots,K-1. To alleviate the computational burden, the proposal is updated every 50th iteration.
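For concreteness, the following Python sketch shows how steps 1 and 2 of the sampler can be implemented for a single sweep; the array layout and the argument items_of (the list of items rated by each user) are our own illustrative conventions.

```python
# Steps 1-2 of the sampler: categorical update of the rubric label C_u and
# truncated-Gaussian update of the latent utilities Y_iu at observed pairs.
import numpy as np
from scipy.stats import norm, truncnorm

def update_C_and_Y(Z, mu, theta, omega, items_of):
    """Z, mu: (I, U) arrays; theta: (M, K-1) rubrics; omega: (M,) weights."""
    I, U = Z.shape
    M = theta.shape[0]
    brk = np.concatenate([np.full((M, 1), -np.inf), theta,
                          np.full((M, 1), np.inf)], axis=1)
    C = np.empty(U, dtype=int)
    Y = np.full((I, U), np.nan)
    for u in range(U):
        items = np.asarray(items_of[u])
        # Step 1: omega_hat_um proportional to omega_m times a product of interval probabilities
        logp = np.log(omega).copy()
        for m in range(M):
            hi = norm.cdf(brk[m, Z[items, u]] - mu[items, u])
            lo = norm.cdf(brk[m, Z[items, u] - 1] - mu[items, u])
            logp[m] += np.sum(np.log(hi - lo + 1e-300))
        p = np.exp(logp - logp.max())
        C[u] = np.random.choice(M, p=p / p.sum())
        # Step 2: Y_iu ~ TruncGau(mu_iu, 1, theta_{Z_iu - 1}, theta_{Z_iu}) under rubric C_u
        a = brk[C[u], Z[items, u] - 1] - mu[items, u]
        b = brk[C[u], Z[items, u]] - mu[items, u]
        Y[items, u] = mu[items, u] + truncnorm.rvs(a, b)
    return C, Y
```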

References

  • Albert and Chib, (1997) Albert, J. and Chib, S. (1997). Bayesian methods for cumulative, sequential and two-step ordinal data regression models. Technical report, Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH.
  • Albert and Chib, (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679.
  • Banerjee et al., (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:825–848.
  • Bao and Hanson, (2015) Bao, J. and Hanson, T. E. (2015). Bayesian nonparametric multivariate ordinal regression. Canadian Journal of Statistics, 43(3):337–357.
  • Berret and Calder, (2012) Berret, C. and Calder, C. A. (2012). Data augmentation strategies for the Bayesian spatial probit regression model. Computational Statistics & Data Analysis, 56:478–490.
  • Binder, (1978) Binder, D. A. (1978). Bayesian cluster analysis. Biometrika, 65(1):31–38.
  • Bobadilla et al., (2013) Bobadilla, J., Ortega, F., Hernando, A., and Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46:109–132.
  • Bradley et al., (2015) Bradley, J. R., Holan, S. H., and Wikle, C. K. (2015). Multivariate spatio-temporal models for high-dimensional areal data with application to longitudinal employer-household dynamics. Annals of Applied Statistics, 9:1761–1791.
  • Bradley et al., (2016) Bradley, J. R., Wikle, C. K., and Holan, S. H. (2016). Bayesian spatial change of support for count-valued survey data with application to the american community survey. Journal of the American Statistical Association, 111:472–487.
  • Cargnoni et al., (1997) Cargnoni, C., Müller, P., and West, M. (1997). Bayesian forecasting of multinomial time series through conditionally Gaussian dynamic models. Journal of the American Statistical Association, 92:640–647.
  • Carlin and Polson, (1992) Carlin, B. P. and Polson, N. G. (1992). Monte Carlo Bayesian methods for discrete regression models and categorical time series. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A., editors, Bayesian Statistics 4. Oxford, UK: Oxford Univ. Press.
  • Chen and Dey, (2000) Chen, M. H. and Dey, D. K. (2000). A unified Bayesian approach for analysing correlated ordinal response data. Brazilian Journal of Probability and Statistics, 14:87–111.
  • Cowles, (1996) Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-link generalized linear models. Statistics and Computing, 6(2):101–111.
  • Cressie, (2015) Cressie, N. (2015). Statistics for Spatial Data. John Wiley & Sons, New York, NY.
  • Cressie and Johannesson, (2008) Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:209–226.
  • De Oliveira, (2000) De Oliveira, V. (2000). Bayesian prediction of clipped Gaussian random fields. Computational Statistics & Data Analysis, 34(3):299–314.
  • De Oliveira, (2004) De Oliveira, V. (2004). A simple model for spatial rainfall fields. Stochastic Environmental Research and Risk Assessment, 18(2):131–140.
  • DeYoreo and Kottas, (2014) DeYoreo, M. and Kottas, A. (2014). Bayesian nonparametric modeling for multivariate ordinal regression. arXiv preprint arXiv:1408.1027.
  • Escobar and West, (1995) Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588.
  • Ferguson, (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230.
  • Fritsch and Ickstadt, (2009) Fritsch, A. and Ickstadt, K. (2009). Improved criteria for clustering based on the posterior similarity matrix. Bayesian Analysis, 4(2):367–391.
  • Gelman et al., (2006) Gelman, A. et al. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–534.
  • Gill and Casella, (2009) Gill, J. and Casella, G. (2009). Nonparametric priors for ordinal Bayesian social science models: Specification and estimation. Journal of the American Statistical Association, 104(486):453–454.
  • Härdle and Müller, (2000) Härdle, W. and Müller, M. (2000). Multivariate and semiparametric kernel regression. In Schimek, M., editor, Smoothing and Regression: Approaches, Computation, and Application. New York: John Wiley & Sons, Inc.
  • Higgs and Hoeting, (2010) Higgs, M. D. and Hoeting, J. A. (2010). A clipped latent variable model for spatially correlated ordered categorical data. Computational Statistics & Data Analysis, 54:1999–2011.
  • Hoffman et al., (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.
  • Houlsby et al., (2014) Houlsby, N., Hernández-Lobato, J. M., and Ghahramani, Z. (2014). Cold-start active learning with robust ordinal matrix factorization. In International Conference on Machine Learning, pages 766–774.
  • Ishwaran and Zarepour, (2002) Ishwaran, H. and Zarepour, M. (2002). Dirichlet prior sieves in finite normal mixtures. Statistica Sinica, 12:941–963.
  • Knorr-Held, (1995) Knorr-Held, L. (1995). Dynamic cumulative probit models for ordinal panel-data; a Bayesian analysis by Gibbs sampling. Technical report, Ludwig-Maximilians-Universitat.
  • Koren and Sill, (2011) Koren, Y. and Sill, J. (2011). OrdRec: an ordinal model for predicting personalized item rating distributions. In Proceedings of the fifth ACM conference on Recommender systems, pages 117–124. ACM.
  • Kottas et al., (2005) Kottas, A., Müller, P., and Quintana, F. (2005). Nonparametric Bayesian modeling for multivariate ordinal data. Journal of Computational and Graphical Statistics, 14(3):610–625.
  • Linero, (2016) Linero, A. R. (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association. To appear.
  • Marlin and Zemel, (2009) Marlin, B. M. and Zemel, R. S. (2009). Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pages 5–12. ACM.
  • Neal, (2003) Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31:705–767.
  • Paquet et al., (2012) Paquet, U., Thomson, B., and Winther, O. (2012). A hierarchical model for ordinal matrix factorization. Statistics and Computing, 22(4):945–957.
  • Rasmussen and Williams, (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, Cambridge, MA.
  • Salakhutdinov and Mnih, (2007) Salakhutdinov, R. and Mnih, A. (2007). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264.
  • Sang and Huang, (2012) Sang, H. and Huang, J. Z. (2012). A full scale approximation of covariance functions for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):111–132.
  • Schliep and Hoeting, (2015) Schliep, E. M. and Hoeting, J. A. (2015). Data augmentation and parameter expansion for independent or spatially correlated ordinal data. Computational Statistics & Data Analysis, 90:1–14.
  • Simpson et al., (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., Sørbye, S. H., et al. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28.
  • Teh et al., (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • Velozo et al., (2014) Velozo, P. L., Alves, M. B., and Schmidt, A. M. (2014). Modelling categorized levels of precipitation. Brazilian Journal of Probability and Statistics, 28:190–208.

Supplementary Material Antonio R. Linero, Jonathan R. Bradley, Apurva S. Desai

Appendix S.1 Identifiability

We conduct a brief simulation experiment to illustrate that the model proposed in Section 2.3 is capable of (i) identifying the correct number of latent factors L and (ii) accruing evidence about the individual \alpha_{u}’s and \beta_{i}’s. We simulate from the response model, random effect model, spatial process model, and parameter model described in Section 2.3 with the dimension of the latent factor set to L=4. For simplicity, we omit the item-specific covariates given by x_{i}. For the random effects model we set \sigma_{\alpha}=2, \sigma_{\beta}=5, and \sigma_{b}=3. We set \eta\sim\operatorname{Gau}(0,3\operatorname{I}) and r=20, with the basis functions \psi_{j}(s) given by Gaussian radial basis functions with knots placed uniformly at random throughout the spatial domain. For the parameter model, we used M=3 rubrics with \omega=(1/3,1/3,1/3) and selected \theta^{(m)} in the manner described in the simulation of Section 3. We set U=200 and I=200, and select user/item pairs for inclusion by sampling uniformly at random, with a total of 3981 ratings.

After simulating these data, we fit the multi-rubric model using the default prior described in Section 2.3 and the correct choice of basis functions \psi_{j}(s) (i.e., the correct \Psi), with M=10 and \omega\sim\operatorname{Dirichlet}(1/10,\ldots,1/10).

We first assess whether the model is capable of recovering the true number of latent factors. We fit the model for L\in\{1,\ldots,7\} and evaluated each value of L by the held-out log-likelihood criterion (4) after splitting the data randomly into training and testing sets. Results are given in Figure 10. We see that the model with the highest held-out log-likelihood corresponds to L=4, the correct number of latent factors.

Figure 10: Held-out log likelihood for different values of LL.

Next, we assess whether the model is capable of learning the individual αu\alpha_{u}’s and βi\beta_{i}’s. First, we note that for any orthonormal matrix OO with OO=IO^{\top}O=\operatorname{I} we have

\alpha_{u}^{\top}\beta_{i}=(O\alpha_{u})^{\top}(O\beta_{i}).

Moreover, \alpha_{u} and O\alpha_{u} are equal in distribution (as are \beta_{i} and O\beta_{i}), so the prior does not provide any additional identification. Consequently, we can only expect to identify \alpha_{u} and \beta_{i} up to an arbitrary rotation. While it is possible to impose constraints on the \alpha_{u} and \beta_{i} (say, by fixing \alpha_{1},\ldots,\alpha_{L} to known values), this is undesirable because it breaks the symmetry of the prior. In view of this, it is standard in the recommender systems literature to not impose any constraints in the prior (Salakhutdinov and Mnih, 2007).

With these points in mind, Figure 11 displays the prior distribution of ζi=βi1/σβ\zeta_{i}=\beta_{i1}/\sigma_{\beta} along with the posterior distribution of ζi\zeta_{i}’s for 9 randomly selected items. We see that, while the overall distribution of the ζi\zeta_{i}’s is in agreement with the prior when considered as a group, for the individual ζi\zeta_{i}’s the prior and posterior differ considerably. This indicates that the model is capable of detecting differences in the βi\beta_{i}’s across items.

Figure 11: Posterior density estimate of ζi=βi1/σβ\zeta_{i}=\beta_{i1}/\sigma_{\beta} for randomly selected items (solid) and the prior density ζiGau(0,1)\zeta_{i}\sim\operatorname{Gau}(0,1) (dashed).