Assessing Copula Models for Mixed Continuous-Ordinal Variables
Abstract
Vine pair-copula constructions exist for a mix of continuous and ordinal variables. In some steps, this can involve estimating a bivariate copula for a pair of mixed continuous-ordinal variables. To assess the adequacy of copula fits for such a pair, diagnostic and visualization methods based on normal score plots and conditional Q-Q plots are proposed. The former utilizes a latent continuous variable for the ordinal variable. Using the Kullback-Leibler divergence, existing probability models for mixed continuous-ordinal variable pairs are assessed for the adequacy of fit with simple parametric copula families. The effectiveness of the proposed visualization and diagnostic methods is illustrated on simulated and real datasets.
Keywords: parametric copula, empirical beta copula, Kullback-Leibler divergence, location-scale mixture models, normal scores, ordinal regression, polyserial correlation.
1 Introduction
Vine pair-copula constructions have been used for a mix of continuous and discrete/ordinal variables in Stöber et al., (2015), Section 3.9.5 of Joe, (2014), and Chang and Joe, (2019). The last of these is concerned with the use of vine constructions for prediction models based on explanatory variables that are a mix of continuous and ordinal variables.
In some steps of the vine construction, the estimation involves a bivariate copula for a pair of mixed continuous-ordinal variables. The main objective of this paper is to assess the adequacy of the fit of parametric copula families for a pair of variables, one of which is continuous and the other is ordinal with a few categories. To decide on possible suitable bivariate parametric copula families, the two variables can be visualized using normal score plots by converting the ordinal variable into an appropriate latent continuous variable. After fitting candidate copula families, quantile-quantile (Q-Q) plots of the continuous variable conditioned on each category of the ordinal variable can be used to assess the adequacy of the fit.
To theoretically assess whether commonly used parametric copula families can fit mixed continuous-ordinal variables well, we consider some bivariate distributions or models proposed in the statistical literature for one ordinal and one continuous variable. The Kullback-Leibler (KL) divergence is used to assess the adequacy of copula-based approximations. For mixture models, we find that simple 1- or 2-parameter parametric copula families can lead to good approximations when there are homoscedastic and roughly equally spaced components and the number of mixture components is small. For conditional probit and logit models, Gaussian or t copulas can provide perfect or near perfect matches when the continuous variable follows a normal distribution. Otherwise, simple parametric bivariate copula families can be inadequate. In particular, in mixture models, as the locations become more dispersed or when there is heteroscedasticity, the best approximating simple parametric copula families based on the KL divergence can have reflection asymmetry, tail asymmetry, or tail dependence, but the fit can still be inadequate based on the conditional Q-Q plots.
When simple parametric copula families do not provide good fits, nonparametric copulas can be fitted with appropriate adaptations to an ordinal variable using the latent continuous variable. As an example, we indicate how to adapt the empirical beta copula (Segers et al., (2017)) for use with a pair of mixed continuous and ordinal variables.
The remainder of this paper is organized as follows. Section 2 gives an overview of copulas when fitted to a pair of mixed continuous-ordinal variables. Section 3 proposes visualization methods via normal score plots and copula model diagnostic methods via conditional Q-Q plots for a pair of mixed continuous-ordinal variables. Section 4 discusses approaches to assessing the adequacy of approximations by computing the KL divergence of a copula-based model from a given density. Section 5 covers procedures for fitting parametric and nonparametric copula models to mixed continuous-ordinal variables. The proposed visualization and diagnostic methods are illustrated on simulated datasets in Section 6 and demonstrated on a real dataset in Section 7. Section 8 has final discussions.
2 Copulas for Mixed Continuous-Ordinal Variables
A copula is a multivariate distribution function with univariate Uniform$(0,1)$ margins. According to Sklar’s theorem (Sklar, (1959)), a $d$-variate distribution $F$ is a composition of a copula $C$ and its univariate marginal distributions $F_1, \ldots, F_d$; that is, $F(\mathbf{x}) = C(F_1(x_1), \ldots, F_d(x_d))$, for $\mathbf{x} \in \mathbb{R}^d$. If $F$ is a continuous $d$-variate distribution function with univariate margins $F_1, \ldots, F_d$, and quantile functions $F_1^{-1}, \ldots, F_d^{-1}$, then the copula $C(\mathbf{u}) = F(F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d))$, for $\mathbf{u} \in [0,1]^d$, is the unique choice. If $F$ is a $d$-variate distribution function of mixed continuous-ordinal variables, then the copula is only unique on the set $\mathrm{Range}(F_1) \times \cdots \times \mathrm{Range}(F_d)$. Hence, the copula associated with $F$ is non-unique when some variables are ordinal. Nevertheless, parametric copula families can still be used for data applications. Many non-uniqueness results are shown in Genest and Nešlehová, (2007) for the completely discrete/ordinal case. For the bivariate case of one ordinal and one continuous variable, one candidate for the non-unique copula is based on conditionally uniform given identities that the copula must satisfy, extending an idea on page 122 of Joe, (2014). This corresponds to defining an appropriate latent continuous variable.
Consider an ordinal variable that can take distinct ordered values and a continuous variable . Assume the labeling for without loss of generality. Let be the cumulative distribution function (CDF) of , with copula that is unique on . Let be a continuous latent variable associated with with distribution . If and
(1) |
then it can be shown that holds for and .
The proof is as follows. Note that in (1) has distribution. Hence, there is a unique copula for since is continuous. Let and . By Sklar’s theorem, any copula for satisfies
For a given , implies that can take a value in . The corresponding thus follows one of the uniform distributions . This implies that also holds. Similarly, implies that or . Hence, and are equivalent events, and
Thus matches on .
Note that the condition in (1) only requires that follows a piecewise uniform distribution given different values of . However, there are no restrictions on the joint distribution of and given . Given different joint distributions of , the resulting copulas are also different. This shows the non-uniqueness of copulas for the case of one ordinal and one continuous variable. Examples of generating different sets of values that satisfy the condition in (1) are illustrated in the next section.
3 Copula Model Diagnostics for Mixed Continuous-Ordinal Variables
There are diagnostic plots and measures of asymmetry or dependence (Chapter 1 and Sections 2.12–2.15 of Joe, (2014)) to assess whether 1-parameter or 2-parameter bivariate copula families are useful for pairs of continuous variables with moderate to strong dependence. In this section, we discuss plots to help decide on possible parametric copula families for a continuous-ordinal pair of variables, and assess the adequacy of fitted copulas for a random sample with mixed continuous-ordinal variables. Visualization using normal score plots is presented in Section 3.1. A diagnostic procedure for copula estimates using conditional Q-Q plots is given in Section 3.2.
With an ordinal variable and a continuous variable , suppose there is a random sample from consisting of pairs for , with . For each , let be the cardinality of . Let be the empirical distribution function of . For the continuous variable , a parametric or nonparametric model can be applied to estimate with . For the th observation, let , , and .
3.1 Visualizations via Normal Score Plots
For a pair of continuous variables, normal score plots are often used to visualize the copula for the variables and check for deviations from Gaussian dependence. With the ordinal variable , is a set of discrete values. To visualize the relationship between and , we use an appropriate latent continuous variable associated with .
For the normal score plots, we assume that the ordinal variable is generated from a latent standard normal variable with estimated cutpoints for , where and are the CDF and quantile function of the standard normal distribution. Let and . We would like the latent variable to be generated in such a way that the correlation between and is close to the correlation between and in order for the visualization to preserve the strength of correlation from the original variables.
One measure of the association between an ordinal and a continuous variable is the polyserial correlation (see Olsson et al., (1982) and Section 2.12.7 of Joe, (2014)). The polyserial correlation between and is defined as
where is the probability density function (PDF) of the standard normal distribution.
Let , , be the set of indices of taking value . We propose to generate the normal scores of the latent variable for the ordinal variable in the following steps.
1. Compute the polyserial correlation between and . For , perform steps 2 to 5 on :
2. For each , independently generate . Let
assuming a bivariate Gaussian copula with correlation . Note that the generated are in the range of . Let be a vector of length .
3. For each , independently generate for , or for . Let be a vector of length .
4. For each with , find its rank in . Replace the value of by that has the same rank in , and denote this value by . Let .
5. Let for each .
The generated in step 2 does not have a uniform distribution. The additional steps 3 and 4 match the ranks of with a random vector generated from the uniform distribution. This procedure ensures that the generated normal scores fall into the desired bins separated by the cutpoints and marginally follow the standard normal distribution. Within each bin, and also have similar correlation to . The two sets of normal scores, and , can then be plotted against each other to decide on suitable parametric copula families. If a bivariate copula model can fit well, then it can also potentially provide adequate fits to .
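The steps above can be sketched in code. The following is only an illustration under stated assumptions, not the authors' implementation: the helper names (`latent_normal_scores`, `clamp`) are hypothetical, the ordinal labels are assumed to be 1, ..., K, and step 2 is assumed to draw from the Gaussian conditional distribution of the latent score given the continuous normal score, truncated to the bin of the observed category. The polyserial correlation is taken as a given input `rho` with |rho| < 1.

```python
import random
from statistics import NormalDist

PHI = NormalDist()   # standard normal: PHI.cdf and PHI.inv_cdf
EPS = 1e-12          # keeps probabilities strictly inside (0, 1)

def clamp(p):
    return min(max(p, EPS), 1.0 - EPS)

def latent_normal_scores(y_ord, z_cont, rho, seed=0):
    """Latent normal scores for ordinal labels 1..K (hypothetical helper).

    y_ord  : ordinal labels in {1, ..., K}
    z_cont : normal scores of the continuous variable
    rho    : assumed latent correlation, e.g. a polyserial estimate, |rho| < 1
    """
    rng = random.Random(seed)
    n, K = len(y_ord), max(y_ord)
    # Empirical CDF of the ordinal variable at 0, 1, ..., K.
    cdf = [sum(y <= k for y in y_ord) / n for k in range(K + 1)]
    s = (1.0 - rho * rho) ** 0.5
    scores = [0.0] * n
    for k in range(1, K + 1):
        idx = [i for i in range(n) if y_ord[i] == k]
        lo, hi = cdf[k - 1], cdf[k]
        t_lo, t_hi = PHI.inv_cdf(clamp(lo)), PHI.inv_cdf(clamp(hi))
        # Step 2 (assumed form): truncated Gaussian conditional draws.
        draws = []
        for i in idx:
            m = rho * z_cont[i]
            p_lo, p_hi = PHI.cdf((t_lo - m) / s), PHI.cdf((t_hi - m) / s)
            p = clamp(p_lo + rng.random() * (p_hi - p_lo))
            draws.append(m + s * PHI.inv_cdf(p))
        # Step 3: uniform draws on the probability bin of category k.
        unif = sorted(clamp(rng.uniform(lo, hi)) for _ in idx)
        # Step 4: rank-match the uniform draws to the step-2 draws.
        order = sorted(range(len(idx)), key=lambda j: draws[j])
        for rank, j in enumerate(order):
            # Step 5: transform back to the normal-score scale.
            scores[idx[j]] = PHI.inv_cdf(unif[rank])
    return scores
```

Because step 4 keeps only the ranks of the step-2 draws, the sketch works directly with the truncated normal draws rather than converting them to the uniform scale first.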
Let for . As , for . Therefore, each element of in Step 3 of the proposed algorithm above follows and satisfies the condition (1) as . Similarly, each element of also satisfies the condition (1) as , since is a permutation of . However, the correlation for is stronger than the correlation for . The bivariate empirical copulas of the two sets and evaluated at are the same for empirical CDF values and arbitrary .
3.2 Diagnostics of Estimated Copula
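With $F_{12} = C(F_1, F_2)$ for a copula $C$, where $F_1$ and $F_2$ denote the marginal CDFs of the ordinal variable $Y_1$ and the continuous variable $Y_2$, the conditional distribution of the continuous variable given $Y_1 = k$, where $k$ is a possible value of the ordinal variable, is

$$F_{2|1}(y \mid k) = \frac{C(F_1(k), F_2(y)) - C(F_1(k-1), F_2(y))}{F_1(k) - F_1(k-1)}. \tag{2}$$

If $\hat{C}$ is an estimate from a parametric family, let

$$\hat{F}_{2|1}(y \mid k) = \frac{\hat{C}(\hat{F}_1(k), \hat{F}_2(y)) - \hat{C}(\hat{F}_1(k-1), \hat{F}_2(y))}{\hat{F}_1(k) - \hat{F}_1(k-1)}$$

be the estimated conditional distribution. Given a quantile level $p \in (0,1)$, the conditional quantile is obtained as $\hat{F}_2^{-1}(u_p)$, where $u_p$ is the root to the equation

$$\frac{\hat{C}(\hat{F}_1(k), u) - \hat{C}(\hat{F}_1(k-1), u)}{\hat{F}_1(k) - \hat{F}_1(k-1)} = p.$$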
To assess the fit of , conditional Q-Q plots can be generated based on for . For the th conditional Q-Q plot, the quantiles are plotted against the quantiles of the empirical distribution of , that is, the model-based quantiles for are plotted against the sorted values of . If the points in each conditional Q-Q plot lie closely along the diagonal line, it indicates that the copula estimate fits the sample well.
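As a rough illustration of how the model-based conditional quantiles might be computed, the sketch below evaluates the conditional CDF as a normalized difference of copula CDF values at the two ordinal CDF values bracketing category k, and inverts it by bisection. The Frank copula is used only because its CDF has a closed form; the function names and the list `cum_probs` (with `cum_probs[0] = 0` and `cum_probs[k]` the ordinal CDF at k) are assumptions made for this example. The result is on the uniform scale and would still need to be mapped through the quantile function of the continuous margin.

```python
import math

def frank_cdf(u, v, theta):
    """Frank copula CDF C(u, v; theta), theta != 0."""
    a = math.expm1(-theta * u)
    b = math.expm1(-theta * v)
    d = math.expm1(-theta)
    return -math.log1p(a * b / d) / theta

def cond_cdf(u, k, cum_probs, theta):
    """Conditional CDF of the continuous variable (uniform scale) given category k."""
    lo, hi = cum_probs[k - 1], cum_probs[k]
    return (frank_cdf(hi, u, theta) - frank_cdf(lo, u, theta)) / (hi - lo)

def cond_quantile_u(p, k, cum_probs, theta, tol=1e-10):
    """Solve cond_cdf(u) = p for u in (0, 1) by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cond_cdf(mid, k, cum_probs, theta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical three-category example with category probabilities 0.3, 0.4, 0.3.
cum_probs = [0.0, 0.3, 0.7, 1.0]
medians = [cond_quantile_u(0.5, k, cum_probs, theta=5.0) for k in (1, 2, 3)]
```

Bisection is chosen here because the conditional CDF is monotone in u, so no derivative of the copula is needed.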
4 Assessing Copula Approximations of Probability Models for Mixed Continuous-Ordinal Variables
For a mix of several continuous variables and one ordinal variable, there are several classes of probability models in the statistical literature. Since our primary goal is to examine whether simple parametric bivariate copula families within the vine pair-copula construction are adequate, we focus on probability models for one continuous and one ordinal variable and assess the adequacy of approximations by parametric bivariate copula families with a few parameters.
This section discusses methods to theoretically assess the adequacy of copula approximations by computing the Kullback-Leibler (KL) divergence of with a parametric copula family relative to a given bivariate distribution with margins and . This leads to summaries of conditions for when parametric copula families with a few parameters are adequate, and other conditions for when they are inadequate. In particular, Section 4.1 gives an overview of common probability models for a pair of mixed continuous-ordinal variables. Computational details of the KL divergence are covered in Section 4.2. Some representative concrete examples are considered in Section 4.3 to show a range of KL divergence values.
4.1 Probability Models for Mixed Continuous-Ordinal Variables
There are two classes of models for a mix of continuous and ordinal variables depending on the order of conditioning.
For the first class, a multinomial distribution is assumed for the ordinal variable and the conditional distribution of the continuous variable given the ordinal variable can be either Gaussian with different mean and variance parameters, or a more general location-scale family (Little and Schluchter, (1985) and Krzanowski, (1993)). For the second class, the continuous variable is transformed to have a parametric distribution such as and the ordinal variable conditioned on the continuous variable is an ordinal probit or logit model; references are Cox, (1972) and Krzanowski, (1993). We introduce the notation for these two classes of models with an ordinal variable and a continuous variable .
For the first class of models, a location-scale mixture model is considered. The conditional distribution is a general location-scale model with
where is a density function on the real line. The ordinal variable has a probability distribution representing the mixture components. If can only take positive values, it can be transformed via a logarithm to get a location-scale model.
For the second class of models, the conditional distribution is
where is a standard normal or logistic CDF, is a slope parameter, and is an offset depending on the ordinal category. The continuous variable is assumed to have a unimodal distribution. If , simpler calculation results can be obtained when is the standard normal CDF.
4.2 Assessing Copula Approximation via the KL Divergence
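Let $F, G$ be two bivariate distributions with densities $f, g$ on the respective relevant measure spaces. For ordinal $Y_1$ with values in $\{1, \ldots, K\}$ and absolutely continuous $Y_2$, the product measure comes from a counting measure and the Lebesgue measure on the real line. The non-negative KL divergence of $g$ from $f$ is

$$\mathrm{KL}(f \,\|\, g) = \sum_{k=1}^{K} \int_{-\infty}^{\infty} f(k, y) \log \frac{f(k, y)}{g(k, y)} \, dy. \tag{3}$$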
Let be a bivariate copula family with . If with conditional probability mass function , then the joint density is with . In the setting of a mixture model that specifies and , .
The copula with parameter estimate
is the member in the family that is the closest to . The value is considered as the KL divergence of the family from .
In the results below, we consider different parametric copula families denoted by with corresponding densities . Given , this leads to for model . A copula family that has a smaller value of KL divergence is considered to have a better approximation of .
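To make the computation concrete, here is a minimal numerical sketch under several assumptions: the true model is a location-scale mixture of normal components, the approximating copula is the Frank family (chosen only for its closed-form conditional CDF), the sum over categories is combined with a trapezoid rule over the continuous variable, and the minimization is a coarse grid search rather than a proper optimizer. All function names are hypothetical. The copula-based joint density is taken to be the difference of the copula conditional CDF at the two bracketing ordinal CDF values, multiplied by the marginal density of the continuous variable.

```python
import math
from statistics import NormalDist

def frank_h(u, v, theta):
    """dC(u, v)/dv for the Frank copula, i.e. P(U1 <= u | U2 = v)."""
    a = math.expm1(-theta * u)
    b = math.expm1(-theta * v)
    d = math.expm1(-theta)
    return a * math.exp(-theta * v) / (d + a * b)

def kl_mixture_vs_frank(probs, mus, sigmas, theta, n_grid=1001):
    """KL divergence of a Frank-copula model from a normal mixture model."""
    comps = [NormalDist(m, s) for m, s in zip(mus, sigmas)]
    cum = [0.0]
    for p in probs:
        cum.append(cum[-1] + p)
    cum[-1] = 1.0
    lo = min(m - 8 * s for m, s in zip(mus, sigmas))
    hi = max(m + 8 * s for m, s in zip(mus, sigmas))
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for j in range(n_grid):
        y = lo + j * step
        w = step * (0.5 if j in (0, n_grid - 1) else 1.0)  # trapezoid weights
        f2 = sum(p * c.pdf(y) for p, c in zip(probs, comps))   # marginal density
        F2 = sum(p * c.cdf(y) for p, c in zip(probs, comps))   # marginal CDF
        F2 = min(max(F2, 1e-15), 1.0 - 1e-15)
        for k, (p, c) in enumerate(zip(probs, comps), start=1):
            f = p * c.pdf(y)  # true joint density at (k, y)
            g = (frank_h(cum[k], F2, theta) - frank_h(cum[k - 1], F2, theta)) * f2
            if f > 0.0 and g > 0.0:
                total += w * f * math.log(f / g)
    return total

def best_frank_theta(probs, mus, sigmas, grid=None):
    """Coarse grid search; returns (minimized KL, theta)."""
    grid = grid or [0.5 * j for j in range(1, 41)]
    return min((kl_mixture_vs_frank(probs, mus, sigmas, t), t) for t in grid)

# Hypothetical two-component example with close locations and a common scale.
kl_min, theta_hat = best_frank_theta([0.4, 0.6], [0.0, 1.0], [1.0, 1.0])
```

The same loop could be reused for other copula families by swapping in their conditional CDF in place of `frank_h`.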
Next are steps to find a sequence of that leads to smaller values of the KL divergence of a family from a given . Because many simple parametric copula families only have positive dependence, we assume that and have been oriented to have positive dependence. We also assume that has a distribution that is close to unimodal (otherwise one might use a non-copula-based model for ). Unless an additional reference is given, properties of the listed copula families can be found in Chapter 4 of Joe, (2014).
1. Compute the KL divergence of the Gaussian copula family as a baseline.
2. For copula family candidates with more dependence in one joint tail, compute the KL divergence of the Gumbel and survival Gumbel copula families with asymmetric dependence in the joint upper and lower tails. If one of these two families leads to smaller KL divergence, then try copula families with even more tail asymmetry in the same direction.
3. For copula family candidates with less dependence in both joint tails (compared to Gaussian), compute the KL divergence of the Frank and Plackett copula families with reflection symmetric tail quadrant independence. If one of these two copula families leads to smaller KL divergence, then compute the KL divergence for the BB8 and BB10 families with reflection tail asymmetry and their survival counterparts. (Note that Kadhem and Nikoloulopoulos, (2021) show that there are parameters that can lead to non-convex contours of copula densities with N(0,1) margins for the BB10 copula.)
4. For copula family candidates with more dependence in both joint tails, compute the KL divergence for the t copula family.
- 5.
The concept of tail order (Hua and Joe, (2011)), covering copula families with different strengths of dependence in the joint lower and upper tails, is used in the above steps. The main distinctions are intermediate tail dependence (such as Gaussian copula with positive dependence), strong tail dependence (more dependence in the joint tail than Gaussian), and tail quadrant independence (less dependence in the joint tail than Gaussian). The concept of tail order may be less important when one variable is ordinal, because there are no observations in the joint upper or lower tail. However, copula families with tail asymmetries are still important to provide more flexibility when finding approximations to a given .
4.3 Examples of KL Divergence for Mixed Continuous-Ordinal Variables
In this section, we consider a variety of concrete examples of the probability models in Section 4.1 which lead to in (3). In all of these examples, the marginal density of is close to unimodal. Otherwise, could be large and “clusters” might be seen in bivariate scatterplots. After examining a large number of cases, we find that a KL divergence value less than 0.003 usually indicates a good approximation from a copula family, and a KL divergence value greater than 0.01 usually indicates a poor approximation when using the conditional Q-Q diagnostic plots in Section 3.2.
The tables in Section 4.3.1 summarize some representative examples to illustrate what happens for (a) mixture models with different separation of location parameters and common versus varying scale parameters, and (b) conditional probit or logit models. Section 4.3.2 contains conclusions drawn from these examples.
4.3.1 Representative Cases
Tables 1 and 2 have some illustrative examples for 2 and 3 ordinal categories, respectively. The top parts of these tables have mixture components that are Gaussian, t, or skew normal. The parametric bivariate copula families that lead to the smallest KL divergence, as well as the minimized KL divergence values, are shown in these tables.
For mixture models of different distributions, the mixing proportion vector of can vary across different cases. These examples have unequal mixing proportions and some examples have a dominant component. In Table 1 with two ordinal categories, models (A1), (B1), and (C1) have close locations and constant scales; models (A2), (B2), and (C2) have locations that are farther apart and constant scales; models (A3), (B3), and (C3) have close locations and non-constant scales; models (A4), (B4), and (C4) have locations that are farther apart and non-constant scales. In Table 2 with three ordinal categories, models (E1), (F1), and (G1) have equally spaced locations and constant scales; models (E2), (F2), and (G2) have unequally spaced locations and constant scales; models (E3), (F3), and (G3) have equally spaced locations and non-constant scales; models (E4), (F4), and (G4) have unequally spaced locations and non-constant scales. The tables show that the KL divergence values tend to be smaller for cases of close or equally spaced locations and constant scales, and be larger for cases of distant or unequally spaced locations and non-constant scales. With enough heterogeneity, the conditional Q-Q plots based on the parametric copula family with the smallest KL divergence typically show some deviation in the tails. When the ordinal variable has four or more categories, it is more difficult for simple parametric copula families to approximate mixture models well.
If there are asymmetries in the proportions in or the scale parameters are unequal, the best parametric copulas based on the KL divergence will have reflection or tail asymmetry, but need not be good fits based on the conditional Q-Q plots; see Section 6 for examples of cases (E1) to (E4). With asymmetries and unobserved extremes of the ordinal variable, the best parametric copula family based on the KL divergence could have tail dependence in the joint upper and/or lower tail, or in neither joint tail. However, the concept of tail dependence is less meaningful when one of the variables is ordinal.
The bottom parts of Tables 1 and 2 have ordinal regression models for the ordinal variable with a continuous covariate . For the conditional probit model (D1), , where and is independent from . As a result, . Let . Then and . The binary variable can thus be considered as being generated from a latent Gaussian variable with the cutpoint . Therefore, no matter what values of and are used to specify the Bernoulli distributions for the conditional distribution of given , a bivariate Gaussian copula with always provides a perfect match, leading to a KL divergence of 0.
For the conditional logit models (D2) and (D3) when has a normal distribution, Gaussian copulas can provide a good approximation since the logit function is very close to the probit function, while t copulas (degrees of freedom 28 and 11) approximate these models slightly better with the KL divergence being very close to 0. For the conditional probit or logit models (D4) to (D7) when has other distributions such as t or extreme value (EV) distributions, the approximations of copula-based models are slightly worse than in the previous two scenarios, but still adequate.
Table 1 (two ordinal categories)

Mixture of normal distributions
Model | Model parameters | Copula family | KL Divergence
---|---|---|---
(A1) | | Gaussian | 0.0001
(A2) | | BB8 | 0.0010
(A3) | | Survival BB1 | 0.0040
(A4) | | BB1 | 0.0087

Mixture of t distributions (: degrees of freedom)
Model | Model parameters | Copula family | KL Divergence
---|---|---|---
(B1) | | Survival BB10 | 0.0022
(B2) | | Survival BB10 | 0.0057
(B3) | | Survival BB10 | 0.0040
(B4) | | Survival BB10 | 0.0050

Mixture of skew normal distributions (: skew)
Model | Model parameters | Copula family | KL Divergence
---|---|---|---
(C1) | | Survival Joe | 0.0050
(C2) | | Survival Joe | 0.0073
(C3) | | Clayton | 0.0065
(C4) | | Survival Joe | 0.0056

Conditional probit or logit models
Model | Model specifications | Copula family | KL Divergence
---|---|---|---
(D1) | | Gaussian | 0
(D2) | | t(28) |
(D3) | | t(11) |
(D4) | | t(8) | 0.0021
(D5) | | Gaussian | 0.0021
(D6) | | Gaussian | 0.0019
(D7) | | Gaussian | 0.0011
Results are similar with more than two ordinal categories. For the conditional ordinal probit model (H1) with three categories 1, 2, and 3, the categorical variable can be considered as being generated from a latent Gaussian variable with cutpoints and with , where and . No matter what constants , , and are used to specify the probabilities of the three categories, the bivariate Gaussian copula with always provides a perfect match, leading to KL divergences of 0. The same conclusion extends to conditional ordinal probit models with an arbitrary number of categories when follows a normal distribution. For the conditional logit model (H2) when has a normal distribution, the t copula (degrees of freedom 20) provides a good approximation with KL divergence being very close to 0. For the conditional probit models (H3) and (H4) when has t or EV distributions, the approximation of simple parametric copula families becomes slightly worse.
Table 2 (three ordinal categories)

Mixture of normal distributions
Model | Model parameters | Copula family | KL Divergence
---|---|---|---
(E1) | | Gaussian | 0.0016
(E2) | | Survival BB1 | 0.0031
(E3) | | Asymmetric Gumbel | 0.0267
(E4) | | Survival BB1 | 0.0472

Mixture of t distributions (: degrees of freedom)
Model | Model parameters | Copula family | KL Divergence
---|---|---|---
(F1) | | Plackett | 0.0028
(F2) | | Survival BB10 | 0.0432
(F3) | | Survival BB10 | 0.0075
(F4) | | BB8 | 0.0039

Mixture of skew normal distributions (: skew)
Model | Model parameters | Copula family | KL Divergence
---|---|---|---
(G1) | | Survival Gumbel | 0.0102
(G2) | | Survival BB10 | 0.0424
(G3) | | t(2) | 0.1421
(G4) | | BB8 | 0.2540

Conditional probit or logit models
Model | Model specifications | Copula family | KL Divergence
---|---|---|---
(H1) | | Gaussian | 0
(H2) | | t(20) |
(H3) | | Gaussian | 0.0038
(H4) | | Gaussian | 0.0095
4.3.2 Conclusions on Copula Approximations to Models in Section 4.1
Based on the representative examples in the previous section, the following general conclusions can be drawn. Simple parametric copula-based models provide good approximations mainly for (a) mixture models with roughly equally spaced components and similar scale parameters, and (b) ordinal regression models that are close to probit models with a unimodal distribution for . For mixture models with components that have unequally spaced location parameters or quite different scale parameters, the effective number of parameters is greater than 2, so it is not surprising that the simple parametric copula families do not lead to good approximations. This motivates the use of nonparametric copulas in Section 5.2.
5 Fitting Copula Models to a Mixed Continuous-Ordinal Pair
This section explains the procedures for fitting copula models to a pair of mixed continuous-ordinal variables. For a random sample , are as defined in Section 3. Details for fitting parametric and nonparametric copula models are elaborated in Sections 5.1 and 5.2, respectively.
5.1 Parametric Bivariate Copula Families
Suppose there are parametric bivariate copula models to consider as candidate families. For a parametric bivariate copula model with parameter vector and , its log-likelihood function is
The maximum likelihood estimator is for a sample of size . As , converges in probability to the value that minimizes the KL divergence for family . Parametric copula models are then compared using model selection criteria such as AIC and BIC. Models with smaller values of AIC or BIC are usually considered to fit the data better.
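The sketch below illustrates one way the copula part of such a log-likelihood and the information criteria could be coded. It is an assumption-laden example rather than the paper's implementation: each observation is assumed to contribute the log of the model probability of its ordinal category given the continuous value (a difference of copula conditional CDFs at the two bracketing ordinal CDF values), the marginal density term is omitted because it does not involve the copula parameter, and the Frank family and all function and argument names are chosen only for illustration. The AIC and BIC formulas are the standard ones.

```python
import math

def frank_h(u, v, theta):
    """dC(u, v)/dv for the Frank copula, i.e. P(U1 <= u | U2 = v)."""
    a = math.expm1(-theta * u)
    b = math.expm1(-theta * v)
    d = math.expm1(-theta)
    return a * math.exp(-theta * v) / (d + a * b)

def mixed_pair_loglik(theta, a_lo, a_hi, u2):
    """Copula log-likelihood for n mixed pairs (hypothetical helper).

    a_lo[i], a_hi[i] : ordinal CDF just below and at the observed category
    u2[i]            : estimated CDF value of the continuous observation
    """
    total = 0.0
    for lo, hi, u in zip(a_lo, a_hi, u2):
        total += math.log(frank_h(hi, u, theta) - frank_h(lo, u, theta))
    return total

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n):
    return n_params * math.log(n) - 2 * loglik

# Tiny made-up sample with three categories (probabilities 0.3, 0.4, 0.3).
a_lo = [0.0, 0.3, 0.7, 0.3]
a_hi = [0.3, 0.7, 1.0, 0.7]
u2 = [0.1, 0.5, 0.9, 0.6]
ll = mixed_pair_loglik(4.0, a_lo, a_hi, u2)
print(aic(ll, 1), bic(ll, 1, len(u2)))
```

In practice the log-likelihood would be maximized over the copula parameter for each candidate family before the criteria are compared.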
5.2 Nonparametric Bivariate Copulas
Since (2) only involves the copula CDF but not the copula density, we consider the empirical beta copula (Segers et al., (2017)) as a nonparametric alternative fitted to pairs , . The empirical beta copula density performs less well than the KDE copula estimator in Nagler, (2018), even after some averaging over distinct subsamples. However, the empirical beta copula CDF has the advantage of being a proper copula, while the KDE approach only leads to a distribution that is approximately a copula.
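Let $(v_{i1}, u_{i2})$, $i = 1, \ldots, n$, denote these pairs, where $v_{i1}$ is the latent uniform score for the ordinal variable and $u_{i2}$ is the estimated CDF value of the continuous variable. For $(u_1, u_2) \in [0,1]^2$, the bivariate empirical beta copula CDF is given by

$$\hat{C}^{\beta}_n(u_1, u_2) = \frac{1}{n} \sum_{i=1}^{n} F_{n, r_{i1}}(u_1) \, F_{n, r_{i2}}(u_2), \tag{4}$$

where $F_{n,r}$ is the CDF of the $\mathrm{Beta}(r, n+1-r)$ distribution, $r_{i1}$ is the rank of $v_{i1}$ in $\{v_{11}, \ldots, v_{n1}\}$, and $r_{i2}$ is the rank of $u_{i2}$ in $\{u_{12}, \ldots, u_{n2}\}$. Note that (4) is a continuous differentiable function on $[0,1]^2$. The $r$th order statistic in a random sample of size $n$ generated from $U(0,1)$ follows a $\mathrm{Beta}(r, n+1-r)$ distribution, which leads to the two beta distributions in (4).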
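A small sketch of the empirical beta copula CDF follows, assuming no ties in either coordinate (which holds when the ordinal variable has been replaced by a continuous latent score). The function names are hypothetical. The beta CDF with integer parameters is evaluated through the binomial tail identity, which is exact but only practical for small to moderate sample sizes; a library beta CDF would be preferable for large n.

```python
import math

def beta_cdf_int(u, r, n):
    """CDF at u of Beta(r, n + 1 - r), via P(Binomial(n, u) >= r)."""
    return sum(math.comb(n, j) * u**j * (1.0 - u)**(n - j)
               for j in range(r, n + 1))

def ranks(x):
    """Ranks 1..n of the entries of x (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def empirical_beta_copula(u1, u2, x1, x2):
    """Empirical beta copula CDF at (u1, u2) for paired samples x1, x2."""
    n = len(x1)
    r1, r2 = ranks(x1), ranks(x2)
    return sum(beta_cdf_int(u1, a, n) * beta_cdf_int(u2, b, n)
               for a, b in zip(r1, r2)) / n

# Made-up latent uniform scores and continuous CDF values.
x1 = [0.2, 0.5, 0.9, 0.1, 0.7]
x2 = [0.3, 0.4, 0.8, 0.2, 0.9]
value = empirical_beta_copula(0.5, 0.5, x1, x2)
```

Because each set of ranks is a permutation of 1, ..., n, both margins of this estimate are exactly uniform, which is the sense in which the empirical beta copula is a proper copula.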
Assuming a consistent estimator for , the consistency of the empirical beta copula with the latent vector is shown in Section 5.3 below.
5.3 Consistency of the Empirical Beta Copula Estimate with a Latent Variable
Let be the empirical distribution of the ordinal variable . Let be a consistent estimate of . Let be the empirical beta copula estimate in (4) and let be the empirical copula of the same sample, defined as
The empirical copula is a step function.
Proposition 2.8 in Segers et al., (2017) states that
This indicates that for arbitrary and as . It is shown in Section 3.1 that can be considered as being sampled from a uniform distribution that satisfies the condition (1) as . Based on results on the empirical copula processes (Segers, (2012)), converges weakly to , where is the latent Gaussian variable for in Section 3.1. Since matches at the CDF values of when satisfies the condition (1), for and arbitrary as . This shows the consistency of the empirical beta copula estimate in (4).
6 Illustrations on Simulated Datasets
In this section, we illustrate the visualization, estimation, and diagnostic techniques proposed in the previous sections on simulated datasets. Bivariate datasets are generated from four mixture models of normal distributions, denoted by (E1), (E2), (E3), and (E4) in Table 2. In each case, the sample size is 1000.
In Table 2, the minimized KL divergence values for cases (E1) and (E2) are much smaller than those for cases (E3) and (E4), indicating that parametric copula families fit (E1) and (E2) better than (E3) and (E4). Normal score plots of the continuous variable versus the latent Gaussian variable generated from the ordinal variable based on the steps in Section 3.1 are shown in Figure 1. It can be seen that the normal score plots of (E1) and (E2) have an approximate elliptical shape that can match some commonly used parametric copula families. In contrast, for (E3) and (E4), heteroscedasticity among different components of the mixture model leads to asymmetries and unusual shapes in the normal score plots. Simple parametric copula families are inadequate for the data generated in these two cases.

Diagnostic techniques in Section 3.2 are applied to assess the fits of the parametric copula families in Table 2 with the smallest KL divergence. The conditional Q-Q plots by category of the ordinal variable are shown in Figure 2. There are significant departures from the diagonal line for the bivariate copula families fitted to cases (E3) and (E4), indicating inadequate fits. This aligns with the conclusions drawn from the normal score visualizations in Figure 1.

With the empirical beta copula estimate in Section 5.2, the conditional Q-Q plots by category based on the ordinal variable are shown in Figure 3. It can be seen that the points in all Q-Q plots are closely aligned with the diagonal line, indicating that the empirical beta copula estimate provides a much more adequate fit to the data generated in all of these four cases.

7 Data Example
In this section, we illustrate the proposed visualization, estimation, and diagnostic techniques for a pair of mixed continuous-ordinal variables on the Auto MPG dataset (Quinlan, (1993), available at https://archive.ics.uci.edu/ml/datasets/Auto+MPG).
This dataset contains the fuel consumption data of 398 cars from 1970 to 1982. The goal is to predict the mpg (miles per gallon as an indicator of fuel efficiency) of each car based on a set of explanatory variables. The variable cylinders can take five unique ordinal values: 3, 4, 5, 6, and 8. We merge category 3 with 4 and category 5 with 6 since only 4 and 3 observations have 3 and 5 cylinders. Some summaries of the important variables along with the transformations to achieve positive correlations with the response variable mpg are given in Table 3. The nominal variable origin indicates where the car is from (1: USA, 2: Europe, 3: Japan). Since mpg tends to increase as origin changes from USA to Europe to Japan, we treat origin as an ordinal variable for the vine structure selection and conditional Q-Q plots mentioned below.
Variable | Type | Correlation with mpg | Transformation
---|---|---|---
cylinders | ordinal | -0.781 | change sign
horsepower | continuous | -0.771 | change sign
weight | continuous | -0.832 | change sign
acceleration | continuous | 0.420 |
model year | continuous | 0.579 |
origin | nominal | 0.563 |
mpg | continuous | |
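The category merging and sign changes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `MERGE_RULE`, `merge_cylinders` and `change_sign` are hypothetical, the merged categories are labeled 4 and 6 for convenience, and the input is toy data rather than the Auto MPG records.

```python
from collections import Counter

# Merge rule taken from the text: 3 -> 4 and 5 -> 6.
# The dictionary name and the labels of the merged categories are illustrative.
MERGE_RULE = {3: 4, 4: 4, 5: 6, 6: 6, 8: 8}


def merge_cylinders(values):
    """Collapse the sparse cylinder categories into their neighbours."""
    return [MERGE_RULE[v] for v in values]


def change_sign(values):
    """Negate a variable so that its association with mpg becomes positive."""
    return [-v for v in values]


# Toy values for illustration only (not the Auto MPG records).
toy_cylinders = [3, 4, 4, 5, 6, 8, 8]
merged = merge_cylinders(toy_cylinders)
category_counts = Counter(merged)      # number of observations per merged category
neg_cylinders = change_sign(merged)    # ordering of categories is reversed
```

After merging, three ordinal categories remain, and negating the variable reverses their order without changing the category counts.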
If the maximum spanning tree criterion based on absolute correlation (correlation of normal scores, or polyserial/polychoric correlation) as in Chang and Joe, (2019) is used to select the vine structure for this dataset, the first tree of the resulting vine regression model includes the two ordinal variables cylinders and origin, each connected to weight by an edge among the explanatory variables. The summary statistics of the distribution of weight (before sign change), the conditional distribution of weight given cylinders (before sign change), and the conditional distribution of weight given origin are shown in Table 4. Note that when fitting the vine copula model, the signs of weight and cylinders are changed so that the correlations between negative weight, negative cylinders, and origin are all positive. The conditional distributions of weight given cylinders have roughly equally spaced means and similar standard deviations; this is close to the setting of a mixture model with equally spaced locations and constant variance. Therefore, parametric copula families can fit this pair well. In contrast, the conditional distributions of weight given origin have unequally spaced means and very different standard deviations; this is similar to a mixture model with unequally spaced locations and non-constant variance. Therefore, it is more difficult for parametric copula families to provide good approximations to this second pair.
Subset | n | Min | Q1 | Median | Mean | Q3 | Max | SD
---|---|---|---|---|---|---|---|---
None | 398 | 1613 | 2224 | 2804 | 2970 | 3608 | 5140 | 847
cylinders 3 or 4 | 208 | 1613 | 2049 | 2240 | 2310 | 2567 | 3270 | 345
cylinders 5 or 6 | 87 | 2472 | 2938 | 3193 | 3195 | 3431 | 3907 | 332
cylinders 8 | 103 | 3086 | 3799 | 4140 | 4115 | 4404 | 5140 | 449
origin 1 (USA) | 249 | 1800 | 2720 | 3365 | 3362 | 4054 | 5140 | 795
origin 2 (Europe) | 70 | 1825 | 2067 | 2240 | 2423 | 2770 | 3820 | 490
origin 3 (Japan) | 79 | 1613 | 1985 | 2155 | 2221 | 2412 | 2930 | 320
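The contrast between the two pairs can be checked with simple arithmetic on the conditional means and standard deviations in Table 4. The sketch below is only an informal illustration: the function `spacing_summary` and the two ratios it reports are hypothetical summaries chosen here, not diagnostics defined in this paper.

```python
def spacing_summary(means, sds):
    """Return gaps between consecutive means, the ratio of the largest to the
    smallest absolute gap, and the ratio of the largest to the smallest SD."""
    gaps = [b - a for a, b in zip(means, means[1:])]
    gap_ratio = max(abs(g) for g in gaps) / min(abs(g) for g in gaps)
    sd_ratio = max(sds) / min(sds)
    return gaps, gap_ratio, sd_ratio


# Conditional means and SDs of weight (before sign change), read from Table 4.
# Rows 2 to 4: weight given cylinders; rows 5 to 7: weight given origin.
cyl_gaps, cyl_gap_ratio, cyl_sd_ratio = spacing_summary(
    [2310, 3195, 4115], [345, 332, 449]
)
ori_gaps, ori_gap_ratio, ori_sd_ratio = spacing_summary(
    [3362, 2423, 2221], [795, 490, 320]
)
# cyl_gaps is [885, 920]: nearly equal spacing between consecutive means.
# ori_gaps is [-939, -202]: the first gap is more than four times the second.
```

For cylinders, the two gaps differ by about 4% and the largest SD is about 1.35 times the smallest; for origin, the gap ratio is about 4.6 and the SD ratio about 2.5, consistent with the description of unequally spaced locations and non-constant variance.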
The best-fitting parametric copula family for the pair negative weight and negative cylinders is Gaussian. Conditional Q-Q plots for weight by category of cylinders are shown in the first row of Figure 4. For the pair of negative weight and origin, the large proportion of observations in a single category of origin causes the best-fitting parametric copula families to have tail asymmetry with more dependence in the joint lower tail. The 1-parameter Clayton, 2-parameter BB1 and 2-parameter BB7 copulas have the largest (and approximately equal) log-likelihoods and similar conditional Q-Q plots. The conditional Q-Q plots of weight by category of origin are shown in the second row of Figure 4 with the Clayton copula. The two parametric copula families in this figure generally fit the data well. Some deviations from the diagonal line can still be observed in the last plot for the weight and origin pair, but inference based on this copula should be acceptable. When the empirical beta copulas are fitted to these two pairs of variables, all conditional Q-Q plots show alignment with the diagonal line, including the last plot.
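To make the model side of a conditional Q-Q plot concrete, the following sketch computes the conditional distribution of a continuous variable given one category of an ordinal variable under a Clayton copula. It relies on the standard identity P(Y ≤ y | X = k) = [C(u, p_k) − C(u, p_{k−1})] / (p_k − p_{k−1}), with u = F_Y(y) and p_k the cumulative probability of category k. The function names are hypothetical, the value `theta = 1.0` is an arbitrary placeholder rather than the fitted parameter, and the category proportions are taken from the counts 249, 70 and 79 in Table 4. The exact construction in Section 3.2 may differ in detail.

```python
def clayton_cdf(u, v, theta):
    """Clayton copula C(u, v) for theta > 0, with boundary cases handled."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    if u >= 1.0:
        return min(v, 1.0)
    if v >= 1.0:
        return u
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def conditional_cdf(u, k, cum_probs, theta):
    """P(U <= u | X = k), where cum_probs = [0, p_1, ..., 1] and k = 1, ..., K."""
    lo, hi = cum_probs[k - 1], cum_probs[k]
    return (clayton_cdf(u, hi, theta) - clayton_cdf(u, lo, theta)) / (hi - lo)


def conditional_quantile(q, k, cum_probs, theta, tol=1e-10):
    """Invert conditional_cdf in u by bisection on [0, 1]."""
    a, b = 0.0, 1.0
    while b - a > tol:
        m = 0.5 * (a + b)
        if conditional_cdf(m, k, cum_probs, theta) < q:
            a = m
        else:
            b = m
    return 0.5 * (a + b)


# Cumulative proportions of the three origin categories (counts from Table 4).
cum_probs = [0.0, 249 / 398, (249 + 70) / 398, 1.0]
theta = 1.0  # placeholder value, NOT the estimate obtained in the paper

# Model-based conditional medians on the uniform scale, one per category.
medians_u = [conditional_quantile(0.5, k, cum_probs, theta) for k in (1, 2, 3)]
```

The resulting quantiles are on the uniform scale; applying the inverse of the fitted (or empirical) marginal distribution of the continuous variable gives model-based quantiles that can be plotted against the empirical quantiles within each category.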

8 Conclusion
Through two types of diagnostic plots and theoretical assessments using the Kullback-Leibler divergence, we show that simple parametric bivariate copula families with a few parameters can sometimes be inadequate for a pair of mixed continuous-ordinal variables. For such a pair, visualizations are proposed based on normal score plots using an appropriate latent continuous variable and conditional Q-Q plots of the continuous variable given the ordinal variable. Existing probability models for mixed continuous-ordinal variables are considered to assess the adequacy of fits of simple parametric copula families using the Kullback-Leibler divergence. When a pair of mixed continuous-ordinal variables is generated from mixture models of distributions with roughly equally spaced locations and constant scales or from conditional probit/logit models, simple parametric copula families can provide good fits. Otherwise, nonparametric counterparts can be fitted to provide better approximations. Applications to simulated and real datasets demonstrate the effectiveness of the proposed methods in identifying the lack of fit of simple parametric copula families and in improving the adequacy of fits with nonparametric copulas.
The results in this paper can be used to understand when and how some standard regression methods with ordinal and continuous explanatory variables can be approximated by the vine copula regression methodology as considered in Chang and Joe, (2019). Details will be provided in future research.
Acknowledgments
This research has been supported by the Four-Year Doctoral Fellowship of the University of British Columbia, NSERC Discovery Grant GR010293, and a Mercator Fellowship associated with Deutsche Forschungsgemeinschaft.
References
- Chang and Joe, (2019) Chang, B. and Joe, H. (2019). Prediction based on conditional distributions of vine copulas. Computational Statistics & Data Analysis, 139:45–63.
- Cox, (1972) Cox, D. R. (1972). The analysis of multivariate binary data. Applied Statistics, pages 113–120.
- Genest and Nešlehová, (2007) Genest, C. and Nešlehová, J. (2007). A primer on copulas for count data. ASTIN Bulletin, 37(2):475–515.
- Hua and Joe, (2011) Hua, L. and Joe, H. (2011). Tail order and intermediate tail dependence of multivariate copulas. Journal of Multivariate Analysis, 102(10):1454–1471.
- Joe, (2014) Joe, H. (2014). Dependence Modeling with Copulas. Chapman & Hall/CRC, Boca Raton, FL.
- Kadhem and Nikoloulopoulos, (2021) Kadhem, S. H. and Nikoloulopoulos, A. K. (2021). Factor copula models for mixed data. British Journal of Mathematical and Statistical Psychology, 74:365–403.
- Krzanowski, (1993) Krzanowski, W. (1993). The location model for mixtures of categorical and continuous variables. Journal of Classification, 10:25–49.
- Little and Schluchter, (1985) Little, R. J. and Schluchter, M. D. (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika, 72(3):497–512.
- Nagler, (2018) Nagler, T. (2018). kdecopula: An R package for the kernel estimation of bivariate copula densities. Journal of Statistical Software, 84:1–22.
- Olsson et al., (1982) Olsson, U., Drasgow, F., and Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3):337–347.
- Quinlan, (1993) Quinlan, J. R. (1993). Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 236–243.
- Segers, (2012) Segers, J. (2012). Asymptotics of empirical copula processes under non-restrictive smoothness assumptions. Bernoulli, 18(3):764–782.
- Segers et al., (2017) Segers, J., Sibuya, M., and Tsukahara, H. (2017). The empirical beta copula. Journal of Multivariate Analysis, 155:35–51.
- Sklar, (1959) Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8:229–231.
- Stöber et al., (2015) Stöber, J., Hong, H. G., Czado, C., and Ghosh, P. (2015). Comorbidity of chronic diseases in the elderly: Patterns identified by a copula design for mixed responses. Computational Statistics & Data Analysis, 88:28–39.
- Yoshiba, (2018) Yoshiba, T. (2018). Maximum likelihood estimation of skew-t copulas with its applications to stock returns. Journal of Statistical Computation and Simulation, 88:2489–2506.