Theory of self-learning Q-matrix
Abstract
Cognitive assessment is a growing area in psychological and educational measurement, where tests are given to assess mastery/deficiency of attributes or skills. A key issue is the correct identification of attributes associated with items in a test. In this paper, we set up a mathematical framework under which theoretical properties may be discussed. We establish sufficient conditions to ensure that the attributes required by each item are learnable from the data.
doi: 10.3150/12-BEJ430
1 Introduction
Cognitive diagnosis has recently gained prominence in educational assessment, psychiatric evaluation, and many other disciplines. A key task is the correct specification of item-attribute relationships. A widely used mathematical formulation is the well-known Q-matrix [27]. Under the setting of the Q-matrix, a typical modeling approach assumes a latent variable structure in which each subject possesses a vector of K attributes and responds to m items. The so-called Q-matrix is an m × K binary matrix establishing the relationship between responses and attributes by indicating the required attributes for each item. The entry in the ith row and jth column indicates whether item i requires attribute j (see Example 2.3 for a demonstration of a Q-matrix). A short list of further developments of cognitive diagnosis models (CDMs) based on the Q-matrix includes the rule space method [28, 29], the reparameterized unified/fusion model (RUM) [5, 7, 30], the conjunctive (noncompensatory) DINA and NIDA models [12, 26, 4, 31, 3], the compensatory DINO and NIDO models [32, 31], the attribute hierarchy method [13], and clustering methods [1]; see also [11, 33, 23] for more approaches to cognitive diagnosis.
Statistical analysis with CDMs typically assumes a known Q-matrix provided by experts such as those who developed the questions [20, 10, 19, 25]. Such a priori knowledge, when correct, is certainly very helpful for both model estimation and, eventually, identification of subjects’ latent attributes. On the other hand, model fitting is usually sensitive to the choice of the Q-matrix, and its misspecification could seriously affect the goodness of fit. This is one of the main sources of lack of fit. Various diagnostic tools and testing procedures have been developed [21, 2, 8, 14, 9]. A comprehensive review of diagnostic classification models can be found in [22].
Despite the importance of the Q-matrix in cognitive diagnosis, its estimation is largely an unexplored area. Unlike typical inference problems, inference for the Q-matrix is particularly challenging for the following reasons. First, in many cases, the Q-matrix is simply nonidentifiable. One typical situation is that multiple Q-matrices lead to an identical response distribution. Therefore, we can only expect to identify the Q-matrix up to some equivalence relation (Definition 2.2). In other words, two Q-matrices in the same equivalence class are not distinguishable based on data. Our first task is to define a meaningful and identifiable equivalence class. Second, the Q-matrix lives on a discrete space – the set of matrices with binary entries. This discrete nature makes the analysis particularly difficult because calculus tools are not applicable. In fact, most analyses are combinatorics based. Third, the model makes explicit distributional assumptions on the (unobserved) attributes, dictating the law of the observed responses. The dependence of responses on attributes via the Q-matrix is a highly nonlinear discrete function. The nonlinearity also adds to the difficulty of the analysis.
The primary purpose of this paper is to provide theoretical analyses of the learnability of the underlying Q-matrix. In particular, we obtain definitive answers to the identifiability of the Q-matrix for one of the most commonly used models – the DINA model – by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding consistent estimators. We believe that the results (especially the intermediate results) and analysis strategies can be extended to other conjunctive models [15, 12, 31, 32, 18].
The rest of this paper is organized as follows. In Section 2, we present the basic inference result for Q-matrices in a conjunctive model with no slipping or guessing. In addition, we introduce all the necessary terminology and technical conditions. In Section 3, we extend the results of Section 2 to the DINA model with known slipping and guessing parameters. In Section 4, we further generalize the results to the DINA model with unknown slipping parameters. Further discussion is provided in Section 5. Proofs are given in Section 6. Lastly, the proofs of two key propositions are given in the Appendix.
2 Model specifications and basic results
We start the discussion with a simplified situation, under which the responses depend on the attribute profile deterministically (with no uncertainty). We describe our estimation procedure under this simple scenario. The results for the general cases are given in Sections 3 and 4.
2.1 Basic model specifications
The model specifications consist of the following concepts.
Attributes: a subject’s (unobserved) mastery of certain skills. Consider that there are K attributes. Let α = (α^1, …, α^K) be the vector of attributes and α^j ∈ {0, 1} be the indicator of the presence or absence of the jth attribute.
Responses: a subject’s binary responses to items. Consider that there are m items. Let R = (R^1, …, R^m) be the vector of responses and R^i ∈ {0, 1} be the response to the ith item.
Both α and R are subject specific. We assume that the integers K and m are known.
Q-matrix: the link between item responses and attributes. We define an m × K matrix Q = (Q_{ij}). For each i = 1, …, m and j = 1, …, K, Q_{ij} = 1 when item i requires attribute j and Q_{ij} = 0 otherwise.
Furthermore, we define
\[ \xi^i = \xi^i(Q, \alpha) = \prod_{j=1}^{K} (\alpha^j)^{Q_{ij}} = I(\alpha^j \ge Q_{ij} \text{ for all } 1 \le j \le K), \tag{1} \]
which indicates whether a subject with attribute profile α is capable of providing a positive response to item i. This model is conjunctive, meaning that it is necessary and sufficient to master all the required skills to be capable of solving a problem. Possessing additional attributes does not compensate for the absence of necessary attributes. In this section, we consider the simplest situation, in which there is no uncertainty in the response, that is,
\[ R^i = \xi^i(Q, \alpha) \tag{2} \]
for i = 1, …, m. Therefore, the responses are completely determined by the attributes. We assume that all items require at least one attribute; equivalently, the Q-matrix does not have zero row vectors. Subjects who do not possess any attribute are not capable of responding positively to any item.
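For instance (an illustrative case of ours, not from the original): if the ith row of Q is q_i = (1, 1, 0), then a subject with attribute profile α = (1, 0, 1) has
\[ \xi^i(Q, \alpha) = I(1 \ge 1)\, I(0 \ge 1)\, I(1 \ge 0) = 0, \]
so, under (2), the subject responds negatively to item i: the extra third attribute does not compensate for the missing second one.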
We use subscripts to indicate different subjects. For instance, R_n is the response vector of subject n. Similarly, α_n is the attribute vector of subject n. We observe R_1, …, R_N, where we use N to denote the sample size. The attributes α_1, …, α_N are not observed. Our objective is to make inference on the Q-matrix based on the observed responses.
2.2 Estimation of the Q-matrix
We first introduce a few quantities for the presentation of an estimator.
T-matrix
In order to provide an estimator of Q, we first introduce one central quantity, the T-matrix T(Q), which connects the Q-matrix with the response and attribute distributions. Matrix T(Q) has 2^K − 1 columns, each of which corresponds to one nonzero attribute vector α. Instead of labeling the columns of T(Q) by ordinal numbers, we label them by binary vectors of length K. For instance, the αth column of T(Q) is the column that corresponds to attribute profile α, for all α ≠ 0.
Let i be a generic notation for positive responses to item i. Let “∧” stand for the “and” combination. For instance, i_1 ∧ i_2 denotes positive responses to both items i_1 and i_2. Each row of T(Q) corresponds to one item or one “and” combination of items, for instance, i, or i_1 ∧ i_2, …. If T(Q) contains all the single items and all “and” combinations, T(Q) contains 2^m − 1 rows. We will later say that such a T(Q) is saturated (Definition 2.1 in Section 2.4).
We now describe each row vector of T(Q). We define B_i(Q) as a (2^K − 1)-dimensional row vector. Using the same labeling system as that of the columns of T(Q), the αth element of B_i(Q) is defined as ξ^i(Q, α), which indicates if a subject with attribute profile α is able to solve item i.
Using a similar notation, we define
\[ B_{i_1 \wedge \cdots \wedge i_l}(Q) = B_{i_1}(Q) \circ \cdots \circ B_{i_l}(Q), \tag{3} \]
where the operator “∘” is element-by-element multiplication: x ∘ y = (x^1 y^1, …, x^n y^n) for x = (x^1, …, x^n) and y = (y^1, …, y^n). Therefore, B_{i_1∧⋯∧i_l}(Q) is the vector indicating the attribute profiles that are capable of responding positively to all of items i_1, …, i_l. The row in T(Q) corresponding to i_1 ∧ ⋯ ∧ i_l is B_{i_1∧⋯∧i_l}(Q).
β-vector
We let β be a column vector whose length equals the number of rows of T(Q). Each element of β corresponds to one row vector of T(Q). The element in β corresponding to B_{i_1∧⋯∧i_l}(Q) is defined as N_{i_1∧⋯∧i_l}/N, where N_{i_1∧⋯∧i_l} denotes the number of people who have positive responses to all of items i_1, …, i_l; that is, for each i_1 ∧ ⋯ ∧ i_l, we let
\[ N_{i_1 \wedge \cdots \wedge i_l} = \#\{\, n : R_n^{i_k} = 1 \text{ for all } k = 1, \ldots, l \,\}. \tag{4} \]
If (2) is strictly respected, then
\[ T(Q)\, p = \beta, \tag{5} \]
where p = (p_α : α ≠ 0) is arranged in the same order as the columns of T(Q). This is because each row of T(Q) indicates the attribute profiles corresponding to subjects capable of responding positively to that set of item(s). Vector p contains the proportions of subjects with each attribute profile. For each set of items, the matrix multiplication sums up the proportions corresponding to each attribute profile capable of responding positively to that set of items, giving us the total proportion of subjects who respond positively to the corresponding items.
The estimator of the Q-matrix
For each m × K binary matrix Q, we define
\[ S(Q) = \min_{p} \| T(Q)\, p - \beta \|, \tag{6} \]
where p = (p_α : α ≠ 0). The above minimization is subject to the constraint that p ⪰ 0 and Σ_α p_α ≤ 1; ‖·‖ is the Euclidean distance. An estimator of the Q-matrix can be obtained by minimizing S(Q),
\[ \hat{Q} = \arg\min_{Q} S(Q), \tag{7} \]
where “arg min” is the minimizer of the minimization problem over all m × K binary matrices. Note that the minimizers are not unique. We will later prove that the minimizers are in the same meaningful equivalence class. Because of (5), the true Q-matrix is always among the minimizers because S(Q_0) = 0.
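To illustrate how (6) and (7) can be computed, the following is a minimal sketch in Python, not the authors’ implementation: the function names (xi, build_T_and_beta, S) and the use of SciPy’s SLSQP solver are our choices, and the enumeration of all item combinations limits it to small m and K.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def xi(Q, alpha):
    # Capability indicators (1): xi^i = 1 iff alpha dominates row i of Q.
    return np.all(alpha >= Q, axis=1).astype(float)

def build_T_and_beta(Q, R):
    # Saturated T-matrix (columns: nonzero attribute profiles; rows: all
    # nonempty "and" combinations of items) and the beta-vector of (4).
    m, K = Q.shape
    profiles = [np.array(a) for a in itertools.product([0, 1], repeat=K)][1:]
    B = np.column_stack([xi(Q, a) for a in profiles])          # rows B_i(Q)
    subsets = [s for r in range(1, m + 1)
               for s in itertools.combinations(range(m), r)]
    T = np.vstack([B[list(s)].prod(axis=0) for s in subsets])  # eq. (3)
    beta = np.array([(R[:, list(s)].min(axis=1) == 1).mean() for s in subsets])
    return T, beta

def S(Q, R):
    # Objective (6): constrained least squares over the attribute proportions p.
    T, beta = build_T_and_beta(Q, R)
    n = T.shape[1]
    res = minimize(lambda p: np.linalg.norm(T @ p - beta),
                   np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "ineq", "fun": lambda p: 1.0 - p.sum()}],
                   method="SLSQP")
    return res.fun
```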
2.3 Example
We illustrate the above construction with one simple example. We emphasize that this example is discussed to explain the estimation procedure in a concrete and simple setting. The proposed estimator is certainly able to handle much larger Q-matrices. We consider the following Q-matrix,
\[ Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}. \tag{8} \]
There are two attributes and three items. We consider the contingency table of attributes,

| | No multiplication | Multiplication |
|---|---|---|
| No addition | p_00 | p_01 |
| Addition | p_10 | p_11 |

In the above table, p_00 is the proportion of people who master neither addition nor multiplication. Similarly, we define p_01, p_10 and p_11, where the first index indicates addition and the second indicates multiplication. The table of proportions is not observed.
Just for illustration, we construct a simple nonsaturated T-matrix. Suppose the relationship in (2) is strictly respected. Then, we should be able to establish the following identities:
\[ \frac{N_1}{N} = p_{10} + p_{11}, \qquad \frac{N_2}{N} = p_{01} + p_{11}, \qquad \frac{N_3}{N} = p_{11}. \tag{9} \]
Therefore, if we let p = (p_{10}, p_{01}, p_{11})^⊤, the above display imposes three linear constraints on the vector p. Together with the natural constraint that p_{00} = 1 − p_{01} − p_{10} − p_{11} ≥ 0, p solves the linear equation
\[ T(Q)\, p = \beta, \tag{10} \]
subject to the constraints that p ⪰ 0 and p_{01} + p_{10} + p_{11} ≤ 1, where
\[ T(Q) = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} N_1/N \\ N_2/N \\ N_3/N \end{pmatrix}. \tag{11} \]
Each column of T(Q) corresponds to one attribute profile. The first column corresponds to α = (1, 0), the second column to α = (0, 1), and the third column to α = (1, 1). The first row corresponds to item 1, the second row to item 2 and the last row to item 3. For this particular situation, T(Q) has full rank and there exists one unique solution to (10). In fact, we would not expect the constrained solution to the linear equation in (10) to always exist unless (2) is strictly followed. This is the topic of the next section.
The identities in (9) only consider the marginal rate of each question. There are additional constraints if one considers “combinations” among items. For instance,
\[ \frac{N_{1 \wedge 2}}{N} = \frac{N_3}{N} = p_{11}: \]
people who are able to solve problem 3 must have both attributes and are therefore able to solve both problems 1 and 2. Again, if (2) is not strictly followed, this is not necessarily respected in the real data, though it is a logical conclusion. The DINA model in the next section handles such a case. Upon considering N_1, N_2, N_3 and N_{1∧2}, the new T-matrix is
\[ T(Q) = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}. \tag{12} \]
The last row is added corresponding to B_{1∧2}. With (2) in force, we have
\[ T(Q)\, p = \beta = (N_1/N,\ N_2/N,\ N_3/N,\ N_{1 \wedge 2}/N)^\top \tag{13} \]
if Q is the true matrix.
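To connect this with the estimation procedure, a short usage sketch (our own, with hypothetical attribute proportions) simulates responses under (2) for the Q-matrix in (8) and checks that the true matrix attains S(Q) ≈ 0, as (13) predicts; it reuses the functions sketched at the end of Section 2.2.

```python
# Simulate the addition/multiplication example with hypothetical proportions.
rng = np.random.default_rng(0)
Q = np.array([[1, 0], [0, 1], [1, 1]])                  # the Q-matrix in (8)
profiles = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
p = [0.2, 0.3, 0.1, 0.4]                                # (p_00, p_01, p_10, p_11)
A = profiles[rng.choice(4, size=5000, p=p)]             # latent attribute profiles
R = np.all(A[:, None, :] >= Q[None, :, :], axis=2).astype(int)  # responses, eq. (2)
print(S(Q, R))   # approximately 0: the true Q satisfies (5) and (13)
```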
2.4 Basic results
Before stating the main result, we provide a list of notations which will be used in the discussions.

• ⟨v_1, …, v_n⟩ denotes the linear space spanned by vectors v_1, …, v_n.
• For a matrix M, M_{1:k} denotes the submatrix containing the first k rows and all columns of M.
• Vector e_j denotes a column vector whose jth element is one and whose remaining elements are zero. When there is no ambiguity, we omit the length index of e_j.
• Matrix I denotes the identity matrix.
• For a matrix M, C(M) is the linear space generated by the column vectors of M. It is usually called the column space of M.
• C_M denotes the set of column vectors of M.
• R_M denotes the set of row vectors of M.
• Vector 0 denotes the zero vector, (0, …, 0). When there is no ambiguity, we omit the index of length.
• Scalar p_α denotes the probability that a subject has attribute profile α. For instance, p_{10} is the probability that a subject has attribute one but not attribute two.
• Define a (2^K − 1)-dimensional vector p = (p_α : α ≠ 0).
• Let u and v be two n-dimensional vectors. We write u ⪰ v if u_j ≥ v_j for all j.
• We write u ≻ v if u_j > v_j for all j.
• Matrix Q_0 denotes the true matrix and Q denotes a generic m × K binary matrix.
The following definitions will be used in subsequent discussions.
Definition 2.1.
We say that T(Q) is saturated if all combinations of the form B_{i_1∧⋯∧i_l}(Q), for 1 ≤ i_1 < ⋯ < i_l ≤ m, are included in T(Q).
Definition 2.2.
We write Q ∼ Q′ if and only if Q and Q′ have identical column vectors, which could be arranged in different orders; otherwise, we write Q ≁ Q′.
Remark 2.1.
It is not hard to show that “∼” is an equivalence relation. Q ∼ Q′ if and only if they are identical after an appropriate permutation of the columns. Each column of Q is interpreted as an attribute. Permuting the columns of Q is equivalent to relabeling the attributes. For Q ∼ Q′, we are not able to distinguish Q from Q′ based on data.
Definition 2.3.
A Q-matrix Q is said to be complete if {e_j : j = 1, …, K} ⊂ R_Q (R_Q is the set of row vectors of Q); otherwise, we say that Q is incomplete.
A Q-matrix is complete if and only if, for each attribute, there exists an item requiring only that attribute. Completeness implies that m ≥ K. We will show that completeness is among the sufficient conditions to identify Q_0.
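In code, completeness is a simple row check; a minimal sketch (the function name is ours):

```python
import numpy as np

def is_complete(Q):
    # Q is complete iff every unit row vector e_j appears among its rows,
    # i.e. each attribute has an item requiring that attribute alone.
    K = Q.shape[1]
    rows = {tuple(r) for r in Q}
    return all(tuple(np.eye(K, dtype=int)[j]) in rows for j in range(K))
```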
Remark 2.2.
One of the main objectives of cognitive assessment is to identify the subjects’ attributes; see [22] for other applications. It has been established in [1] that the completeness of the Q-matrix is a sufficient and necessary condition for a set of items to consistently identify attributes if (2) is strictly followed. Thus, it is usually recommended to use a complete Q-matrix. For a precise formulation, see [1].
Listed below are the assumptions which will be used in the subsequent development.

C1. Q_0 is complete.
C2. T(Q) is saturated.
C3. α_1, …, α_N are i.i.d. random vectors following the distribution P(α_n = α) = p_α. We further let p = (p_α : α ≠ 0).
C4. p_α > 0 for all α, that is, the population is fully diversified.
C5. Each attribute has been required by at least two items.
With these preparations, we are ready to introduce the first theorem, the proof of which is given in Section 6.
Theorem 2.4.
Suppose that conditions C1–C5 hold and that the responses follow model (2). Let Q̂ be as in (7). Then Q̂ ∼ Q_0 almost surely as N → ∞. Furthermore, define the estimator of the attribute distribution
\[ \hat{p} = \arg\min_{p \succeq 0,\ \sum_{\alpha} p_\alpha \le 1} \| T(\hat{Q})\, p - \beta \|. \tag{15} \]
Then, with an appropriate rearrangement of the columns of Q̂, p̂ → p almost surely as N → ∞.
Remark 2.3.
If Q ∼ Q_0, the two matrices only differ by a column permutation and will be considered to be the “same”. Therefore, we expect to identify the equivalence class that Q_0 belongs to. Also, note that S(Q) = S(Q′) if Q ∼ Q′.
Remark 2.4.
In order to obtain the consistency of Q̂ (subject to a column permutation), it is necessary that p does not live on some sub-manifold. To see a counterexample, suppose that p_α = 1 for α = (1, …, 1). Then, for all Q, ξ^i(Q, α_n) = 1, that is, all subjects are able to solve all problems. Therefore, the distribution of the responses is independent of Q. In other words, the Q-matrix is not identifiable. More generally, if there exist j and j′ such that α^j = α^{j′} with probability one, then the Q-matrix is not identifiable based on the data. This is because one cannot tell if an item requires attribute j alone, attribute j′ alone, or both; see [16, 17] for similar cases for the multidimensional IRT models.
Remark 2.5.
Note that the estimator of the attribute distribution, p̂, in (15) depends on the order of the columns of Q̂. In order to achieve consistency, we will need to arrange the columns of Q̂ to match those of Q_0 whenever Q̂ ∼ Q_0.
Remark 2.6.
One practical issue associated with the proposed procedure is the computation. For a specific Q, the computation of S(Q) only involves a constrained minimization of a quadratic function. However, if m or K is large, the computational overhead of searching for the minimizer of S(Q) over the space of all m × K binary matrices could be substantial; see the brute-force sketch after this remark. One practical solution is to break the Q-matrix into smaller sub-matrices. For instance, one may divide the m items into groups (possibly with nonempty overlap across different groups), then apply the proposed estimator to each group of items. This is equivalent to breaking a big m by K Q-matrix into several smaller matrices and estimating each of them separately. Lastly, combine the estimated sub-matrices together to form a single estimate. The consistency results can be applied to each of the sub-matrices, and therefore the combined matrix is also a consistent estimator. A similar technique has been discussed in Chapter 8.6 of [29].
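To make the computational point concrete, a brute-force minimizer of S(Q), as in (7), might look as follows (a sketch reusing the S function above; the exhaustive enumeration is exactly the overhead this remark warns about, so it is usable only for very small m and K):

```python
import itertools
import numpy as np

def estimate_Q(R, m, K):
    # Exhaustive search for arg min_Q S(Q); feasible only for tiny m and K.
    best_Q, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=m * K):
        Q = np.array(bits).reshape(m, K)
        if (Q.sum(axis=1) == 0).any():   # every item must require an attribute
            continue
        val = S(Q, R)
        if val < best_val:
            best_Q, best_val = Q, val
    return best_Q
```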
Remark 2.7.
Conditions C1 and C2 are imposed to guarantee consistency. They may not always be necessary. Furthermore, constructing a saturated T-matrix is sometimes computationally infeasible, especially when the number of items is large. In practice, one may include the combinations of one item, two items, …, up to l items, where the choice of l depends on the sample size and the computational resources. Condition C5 is required for technical purposes. Nonetheless, one can in fact construct counterexamples, that is, Q-matrices that are not identifiable up to the relation “∼”, if C5 is violated.
3 DINA model with known slipping and guessing parameters
3.1 Model specification
In this section, we extend the inference results of the previous section to the situation in which the responses do not deterministically depend on the attributes. In particular, we consider the DINA (Deterministic Input, Noisy Output “AND” gate) model [12]. We introduce two parameters: the slipping parameter s_i and the guessing parameter g_i. Here 1 − s_i is the probability of a subject responding positively to item i given that s/he is capable of solving that problem, and g_i is the probability of a positive response given that s/he is not. To simplify the notation, we denote 1 − s_i by c_i. An extension of (2) to include slipping and guessing specifies the response probabilities as
\[ P(R^i = 1 \mid \alpha) = c_i^{\xi^i(Q,\alpha)}\, g_i^{1 - \xi^i(Q,\alpha)}, \tag{16} \]
where ξ^i is the capability indicator defined in (1). In addition, conditional on α, R^1, …, R^m are jointly independent.
In this context, the T-matrix needs to be modified accordingly. Throughout this section, we assume that both the s_i’s and g_i’s are known. We discuss the case where the s_i’s are unknown in the next section.
We first consider the case that g_i = 0 for all i. We introduce a diagonal matrix D_c. If the kth row of matrix T(Q) corresponds to i_1 ∧ ⋯ ∧ i_l, then the kth diagonal element of D_c is c_{i_1} ⋯ c_{i_l}. Then, we let
\[ T_c(Q) = D_c\, T(Q), \tag{17} \]
where T(Q) is the binary matrix defined previously. In other words, we multiply each row of T(Q) by a common factor and obtain T_c(Q). Note that in the absence of slipping (s_i = 0 for each i) we have that T_c(Q) = T(Q).
There is another equivalent way of constructing T_c(Q). We define B_{c,i}(Q) = c_i B_i(Q) and
\[ B_{c,\, i_1 \wedge \cdots \wedge i_l}(Q) = B_{c,i_1}(Q) \circ \cdots \circ B_{c,i_l}(Q), \tag{18} \]
where “∘” refers to element-by-element multiplication. Let the row vector in T_c(Q) corresponding to i_1 ∧ ⋯ ∧ i_l be B_{c, i_1∧⋯∧i_l}(Q).
For instance, with c = (c_1, c_2, c_3), the T_c(Q) corresponding to the T-matrix in (12) would be
\[ T_c(Q) = \begin{pmatrix} c_1 & 0 & c_1 \\ 0 & c_2 & c_2 \\ 0 & 0 & c_3 \\ 0 & 0 & c_1 c_2 \end{pmatrix}. \tag{19} \]
Lastly, we consider the situation that both the probability of making a mistake and the probability of guessing correctly could be strictly positive. By this, we mean that the probability that a subject responds positively to item i is c_i = 1 − s_i if s/he is capable of doing so; otherwise the probability is g_i. We create a corresponding T_{c,g}(Q) by slightly modifying T_c(Q). We define the row vector
\[ B_{g,i}(Q) = g_i \mathbf{1} + (c_i - g_i)\, B_i(Q), \]
where 1 denotes the row vector of ones. When there is no ambiguity, we omit the length index of 1. Now, let
\[ B_{g,\, i_1 \wedge \cdots \wedge i_l}(Q) = B_{g,i_1}(Q) \circ \cdots \circ B_{g,i_l}(Q). \tag{20} \]
Let the row vector in T_{c,g}(Q) corresponding to i_1 ∧ ⋯ ∧ i_l be B_{g, i_1∧⋯∧i_l}(Q). For instance, the matrix T_{c,g}(Q) corresponding to the T_c(Q) in (19) is
\[ T_{c,g}(Q) = \begin{pmatrix} c_1 & g_1 & c_1 \\ g_2 & c_2 & c_2 \\ g_3 & g_3 & c_3 \\ c_1 g_2 & g_1 c_2 & c_1 c_2 \end{pmatrix}. \tag{21} \]
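The same construction is easy to express in code. The following sketch (build_T_noisy is our name, not from the paper) produces the rows of T_{c,g}(Q) as in (20) and (21): for an item set and attribute profile, the entry is the product over the items of c_i or g_i according to the capability indicator, which follows from (16) and conditional independence.

```python
import itertools
import numpy as np

def build_T_noisy(Q, s, g):
    # Saturated T_{c,g}-matrix: for item subset sub and nonzero profile alpha,
    # the entry is prod_{i in sub} [ c_i if alpha masters item i, else g_i ].
    m, K = Q.shape
    c, g = 1.0 - np.asarray(s, float), np.asarray(g, float)
    profiles = [np.array(a) for a in itertools.product([0, 1], repeat=K)][1:]
    P = np.column_stack([np.where(np.all(a >= Q, axis=1), c, g)
                         for a in profiles])          # per-item success probs
    subsets = [sub for r in range(1, m + 1)
               for sub in itertools.combinations(range(m), r)]
    return np.vstack([P[list(sub)].prod(axis=0) for sub in subsets])
```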
3.2 Estimation of the Q-matrix and consistency results
Having concluded our preparations, we are now ready to introduce our estimators for Q_0. Given c and g, we define
\[ S_{c,g}(Q) = \min_{p} \| T_{c,g}(Q)\, p - \beta \|, \tag{22} \]
where p = (p_α : α ≠ 0), and
\[ \beta = \bigl( N_{i_1 \wedge \cdots \wedge i_l} / N \bigr) \tag{23} \]
is the vector of empirical proportions from Section 2.2; the labels to the right of the vector indicate the corresponding row vectors in T_{c,g}(Q). The minimization in (22) is subject to the constraints that p ⪰ 0 and Σ_α p_α ≤ 1. The vector g = (g_1, …, g_m) contains the probabilities of providing positive responses to items simply by guessing. We propose an estimator of the Q-matrix through a minimization problem, that is,
\[ \hat{Q}(c, g) = \arg\min_{Q} S_{c,g}(Q). \tag{24} \]
We write c and g in the argument to emphasize that the estimator depends on c and g. The computation of the minimization in (22) consists of minimizing a quadratic function subject to finitely many linear constraints. Therefore, it can be done efficiently.
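A sketch in the spirit of (22) is given below. One caveat in our version: with positive guessing probabilities, subjects with the all-zero profile also respond positively by guessing, so we include that profile as an extra column (whose entries are products of the g_i’s) and constrain the proportions to sum to one. This is our way of accounting for the guessing contribution mentioned above, not necessarily the paper’s exact parameterization.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def S_cg(Q, R, s, g):
    # Moment-matching objective in the spirit of (22) for the DINA model.
    m, K = Q.shape
    c, g = 1.0 - np.asarray(s, float), np.asarray(g, float)
    profiles = [np.array(a) for a in itertools.product([0, 1], repeat=K)]
    P = np.column_stack([np.where(np.all(a >= Q, axis=1), c, g)
                         for a in profiles])   # includes the all-zero profile
    subsets = [sub for r in range(1, m + 1)
               for sub in itertools.combinations(range(m), r)]
    T = np.vstack([P[list(sub)].prod(axis=0) for sub in subsets])
    beta = np.array([(R[:, list(sub)].min(axis=1) == 1).mean() for sub in subsets])
    n = T.shape[1]
    res = minimize(lambda p: np.linalg.norm(T @ p - beta),
                   np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
                   method="SLSQP")
    return res.fun
```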
Theorem 3.1.
Suppose that c and g are known and that conditions C1–C5 are in force. For subject n, the responses are generated independently such that
\[ P(R_n^i = 1 \mid \alpha_n) = c_i^{\xi^i(Q_0,\alpha_n)}\, g_i^{1 - \xi^i(Q_0,\alpha_n)}, \tag{25} \]
where the distribution of the α_n is as in Theorem 2.4. Let Q̂(c, g) be defined as in (24). If c_i > g_i for all i and p does not have zero elements, then Q̂(c, g) ∼ Q_0 with probability tending to one as N → ∞.
Furthermore, let
\[ \hat{p} = \arg\min_{p} \bigl\| T_{c,g}\bigl(\hat{Q}(c,g)\bigr)\, p - \beta \bigr\|, \]
subject to the constraint that p ⪰ 0 and Σ_α p_α ≤ 1. Then, with an appropriate rearrangement of the columns of Q̂(c, g), for any ε > 0, P(|p̂ − p| > ε) → 0 as N → ∞.
Remark 3.1.
There are various metrics one can employ to measure the distance between the vectors T_{c,g}(Q) p and β. In fact, any metric that generates the same topology as the Euclidean metric is sufficient to obtain the consistency results in the theorem. For instance, a principled choice of objective function would be the likelihood with p profiled out. The reason we prefer the Euclidean metric (versus, for instance, the full likelihood) is that the evaluation of S_{c,g}(Q) is easier than the evaluation based on other metrics. More specifically, the computation of the current S_{c,g}(Q) consists of quadratic programming types of well-oiled optimization techniques.
4 Extension to the situation with unknown slipping probabilities
In this section, we further extend our results to the situation where the slipping probabilities are unknown and the guessing probabilities are known. In the context of standard exams, the guessing probabilities can typically be set to zero for open problems. For instance, the chance of guessing the correct answer to an open-ended arithmetic problem is very small. On the other hand, for multiple choice problems, the guessing probabilities cannot be ignored. In that case, g_i can be considered to be 1/k_i when there are k_i choices; see Remark 4.2 for more discussion.
4.1 Estimator of s
We provide two estimators of s given Q and g. One is applicable to all Q-matrices, but is computationally intensive. The other is computationally easy, but requires a certain structure of Q. We then combine them into a single estimator.
A general estimator
We first provide an estimator of s that is applicable to all Q-matrices. Considering that the estimator of the Q-matrix minimizes the objective function S_{c,g}(Q), we propose the following estimator of s:
\[ \hat{s}(Q, g) = \arg\min_{s} S_{1-s,\, g}(Q). \tag{26} \]
A moment estimator
The computation of ŝ is typically intensive. When the Q-matrix has a certain structure, we are able to estimate s consistently based on estimating equations.
For a particular item i, suppose that there exist items i_1, …, i_l (different from i) such that
\[ q_i \preceq q_{i_1} \vee \cdots \vee q_{i_l}, \tag{27} \]
that is, the attributes required by item i are a subset of the attributes jointly required by items i_1, …, i_l.
We borrow a result which will be given in the proof of Proposition 6.6 (Section 6.1) to say that there exists a matrix (only depending on g) transforming T_{c,g}(Q) so that its rows corresponding to i ∧ i_1 ∧ ⋯ ∧ i_l and i_1 ∧ ⋯ ∧ i_l differ exactly by the factor c_i. Let a and b be the row vectors of that transformation corresponding to i ∧ i_1 ∧ ⋯ ∧ i_l and i_1 ∧ ⋯ ∧ i_l. Then,
\[ E(a\,\beta) = c_i\, E(b\,\beta), \]
where the vectors a and b only depend on Q and g. Therefore, the corresponding estimator of s_i would be
\[ \tilde{s}_i = 1 - \frac{a\,\beta}{b\,\beta}. \tag{29} \]
Note that the computation of s̃ only consists of affine transformations and is therefore very fast.
Proposition 4.1.
Suppose that (27) holds for item i and that the conditions of Theorem 4.2 are in force. Then s̃_i → s_i in probability as N → ∞.
Proof.
By the law of large numbers,
\[ \beta \to E(\beta) \]
in probability as N → ∞. By the construction of a and b, we have
\[ E(a\,\beta) = c_i\, E(b\,\beta). \]
Thanks to (27), it follows that s̃_i = 1 − a β / (b β) → 1 − c_i = s_i in probability.
∎
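For intuition, in the special case g = 0 the estimating equation reduces to a ratio of empirical proportions: under (27), every subject capable of all of i_1, …, i_l is also capable of item i, so the rate of solving i among those who solved i_1, …, i_l estimates c_i = 1 − s_i. A minimal sketch (our own simplification of (29); moment_slip is a hypothetical name):

```python
import numpy as np

def moment_slip(R, i, others):
    # Moment estimator of s_i for g = 0, assuming (27): the attributes of
    # item i are covered by the items in `others`.
    num = (R[:, [i] + list(others)].min(axis=1) == 1).mean()  # beta_{i ^ others}
    den = (R[:, list(others)].min(axis=1) == 1).mean()        # beta_{others}
    return 1.0 - num / den
```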
Combined estimator
Lastly, we combine ŝ and s̃. For each Q, we partition s into two sub-vectors. For each item i in the first sub-vector, (27) holds; the corresponding slipping parameters are estimated by s̃_i as defined in (29) (element by element). For the remaining items, we use the general estimator ŝ. Finally, we combine the two into a single estimated vector. Furthermore, each element of the combined estimator greater than one is set to be one and each element less than zero is set to be zero. Equivalently, we impose the constraint that the estimated slipping probabilities lie in [0, 1].
4.2 Consistency result
Theorem 4.2.
Suppose that g is known and that the conditions of Theorem 3.1 are in force. Let the combined estimator of s in Section 4.1 be plugged into (24). Then the resulting estimator of the Q-matrix satisfies Q̂ ∼ Q_0 with probability tending to one as N → ∞.
Remark 4.1.
The consistency of Q̂ does not rely on the consistency of the plugged-in estimator of s, which is mainly because of the central intermediate result in Proposition 6.6. The consistency of the estimator of s is, however, a necessary condition for the consistency of p̂.
For most usual situations, s is estimable based on the data given a correctly specified Q-matrix. Nonetheless, there are some rare occasions in which nonidentifiability does exist. We provide one example, explained at the intuitive level, to illustrate that it is not always possible to consistently estimate s and g. This example is simply to justify that the existence of the consistent estimator for s in the above theorem is not an empty assumption. Consider a complete matrix with m = K, that is, Q = I_K. The degrees of freedom of a K-way binary table is 2^K − 1. On the other hand, the dimension of the parameters (p, s, g) is 2^K − 1 + 2K. Therefore, p, s and g cannot be consistently identified without additional information. This problem is typically tackled by introducing additional parametric assumptions, such as requiring s and g to satisfy certain functional forms, or, in the Bayesian setting, (weakly) informative prior distributions [6]. Given that the emphasis of this paper is the inference of the Q-matrix, we do not further investigate the identifiability of s and g. Nonetheless, estimation of s and g is definitely an important issue.
Remark 4.2.
Assuming that the guessing probabilities are known is somewhat strong. In complicated situations, such as multiple choice problems in which the incorrect choices do not look “equally incorrect”, the guessing probability is typically not simply one over the number of choices. In Theorem 4.2, we make this assumption mostly for technical reasons.
One can certainly provide the same treatment to the unknown guessing probabilities as to the slipping probabilities, by plugging in a consistent estimator of g or profiling it out (like ŝ). However, the rigorous establishment of the consistency results is certainly much more difficult and additional technical conditions may be needed. We leave the analysis of the problem with unknown guessing probabilities to future study.
5 Discussion
This paper provides basic theoretical results for the estimation of the Q-matrix, a key element in modern cognitive diagnosis. Under the conjunctive model assumption, sufficient conditions are developed for the Q-matrix to be identifiable up to an equivalence relation, and the corresponding consistent estimators are constructed. The equivalence relation defines a natural partition of the space of Q-matrices and may be viewed as the finest “resolution” that is possibly distinguishable based on the data, unless there is additional information about the specific meaning of each attribute. Our results provide the first steps for statistical inference about Q-matrices by explicitly specifying the conditions under which two Q-matrices lead to different response distributions. We believe that these results, especially the intermediate results in Section 6, can also be applied to general conjunctive models.
There are several directions along which further exploration may be pursued. First, some conditions may be modified to reflect practical circumstances. For instance, if the population is not fully diversified, meaning that certain attribute profiles may never exist, then condition C4 cannot be expected to hold. To ensure identifiability, we will need to impose certain structures on the Q-matrix. In the addition-multiplication example of Section 2.3, if individuals capable of multiplication are also capable of addition, then we may need to impose the natural constraint that every item that requires multiplication should also require addition, which also implies that the Q-matrix is never complete.
Second, when a priori “expert” knowledge of the Q-matrix is available, we may wish to incorporate such information into the estimation. This could be in the form of an additive penalty function attached to the objective function S(Q). Such information, if correct, not only improves estimation accuracy but also reduces the computational complexity: one can just perform a minimization of S(Q) in a neighborhood around the expert’s Q-matrix.
Third, throughout this paper we assume that the number of attributes K (the dimension) is known. In practice, it would be desirable to develop a data-driven way to estimate the dimension, not only to deal with the situation of unknown dimension, but also to check whether the assumed dimension is correct. One possible way to tackle the problem is to introduce a penalty function similar to that of BIC [24], which would give a consistent estimator of the Q-matrix even if the dimension is unknown.
Fourth, one issue of both theoretical and practical importance is the inference for the parameters additional to the Q-matrix, such as the slipping (s) and guessing (g) parameters and the attribute distribution p. In the current paper, given that the main parameter of interest is the Q-matrix, the estimation of s, g and p is treated as a by-product of the main results. On the other hand, given a known Q-matrix, the identifiability and estimation of these parameters are important topics. In the previous discussion, we provided a few examples of potential identifiability issues. Further careful investigation is definitely of great importance and poses challenges.
Fifth, the rate of convergence of the estimator is of both theoretical and practical importance. From a practical point of view, it is crucial to study the rate of convergence as the scale of the problem becomes large in terms of the number of attributes and the number of items.
Lastly, the optimization of S(Q) over the space of binary matrices is a nontrivial problem. It consists of evaluating the function 2^{m × K} times. This is a substantial computational load if m and K are reasonably large. As mentioned previously, this computation might be reduced by additional information about the Q-matrix or by splitting the Q-matrix into small sub-matrices. Nevertheless, it would be highly desirable to explore the structures of the Q-matrix and the function S so as to compute the estimator more efficiently.
6 Proofs of the theorems
6.1 Several propositions and lemmas
To make the discussion smooth, we postpone several long proofs to the Appendix.
Proposition 6.1.
Suppose that Q is complete and the matrix T(Q) is saturated. Then, we are able to arrange the columns and rows of Q and T(Q) such that T_{1:(2^K − 1)}(Q) has full rank and T(Q) has full column rank.
Proof.
Provided that Q is complete, without loss of generality we assume that the jth row vector of Q is e_j^⊤ for j = 1, …, K, that is, item j only requires attribute j for each j ≤ K. Let the first rows of T(Q) be associated with these items. In particular, we let the first K rows correspond to B_1(Q), …, B_K(Q) and the first K columns of T(Q) correspond to α’s that have only one attribute. We further arrange the next rows of T(Q) to correspond to combinations of two of the first K items, and the next columns of T(Q) to correspond to α’s that have exactly two positive attributes. Similarly, we arrange for combinations of three, four, and up to K items. Therefore, the first 2^K − 1 rows of T(Q) admit a block upper triangular form. In addition, we are able to further arrange the columns within each block such that the diagonal blocks are identities, so that T_{1:(2^K − 1)}(Q) has the form
\[ T_{1:(2^K-1)}(Q) = \begin{pmatrix} I & * & \cdots & * \\ 0 & I & \cdots & * \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & I \end{pmatrix}. \tag{30} \]
Note that T_{1:(2^K − 1)}(Q) has 2^K − 1 columns and obviously has full rank; therefore T(Q) has full column rank. ∎
From now on, we assume that Q_0 and the first 2^K − 1 rows of T(Q) are arranged in the order as in (30).
Proposition 6.2.
Suppose that Q is complete, T(Q) is saturated, and c_i > g_i ≥ 0 for all i. Then, T_c(Q) and T_{c,g}(Q) have full column rank.
Proof. Since c ≻ 0, the diagonal matrix D_c in (17) is invertible, so T_c(Q) = D_c T(Q) has the same column rank as T(Q), which is full by Proposition 6.1. The claim for T_{c,g}(Q) follows by an analogous row-transformation argument (cf. Lemma 6.7). ∎
The following two propositions, which compare the column spaces of T(Q) and T(Q_0), are central to the proofs of all the theorems. Their proofs are delayed to the Appendix.
The first proposition discusses the case where Q is complete. We can always rearrange the columns of Q so that its first K rows form the identity matrix. In addition, according to the proof of Proposition 6.1, the last column vector of T_c(Q) corresponds to the attribute profile α = (1, …, 1). Therefore, this column vector has all nonzero entries.
Proposition 6.3.
Assume that Q_0 is a complete matrix and T is saturated. Without loss of generality, let the first K rows of Q_0 form the identity matrix. Assume that the first K rows of Q form a complete matrix. Further, assume that Q ≁ Q_0. If g = 0, under the conditions in Theorem 4.2, T_c(Q_0) p is not in the column space C(T_{c′}(Q)) for all c′ ≻ 0.
The next proposition discusses the case where Q is incomplete.
Proposition 6.4.
Assume that Q_0 is a complete matrix and T is saturated. Without loss of generality, let the first K rows of Q_0 form the identity matrix. If g = 0 and Q is incomplete, under the conditions in Theorem 4.2, T_c(Q_0) p is not in the column space C(T_{c′}(Q)) for all c′ ≻ 0.
The next result is a direct corollary of these two propositions. It follows by setting c_i = 1 and g_i = 0 for all i.
Corollary 6.5.
If Q ≁ Q_0, under the conditions of Theorem 4.2, T(Q_0) p is not in the column space C(T(Q)).
To obtain a similar proposition for the cases where the g_i’s are nonzero, we will need to expand the T-matrix as follows. As previously defined, let
\[ \widetilde{T}_{c,g}(Q) = \begin{pmatrix} T_{c,g}(Q) \\ \mathbf{1}^\top \end{pmatrix}. \tag{31} \]
The last row of T̃_{c,g}(Q) consists entirely of ones. Vector β is defined as in (23).
Proposition 6.6.
Suppose that Q_0 is a complete matrix, Q ≁ Q_0, T is saturated and c_i > g_i ≥ 0 for all i. Under the conditions of Theorem 4.2, the corresponding expanded vector is not in the column space C(T̃_{c′,g}(Q)) for all admissible c′. In addition, T̃_{c,g}(Q) is of full column rank.
To prove Proposition 6.6, we will need the following lemma.
Lemma 6.7.
Consider two matrices M_1 and M_2 of the same dimension. If C(M_1) ⊂ C(M_2), then for any matrix D of appropriate dimension for multiplication, we have C(D M_1) ⊂ C(D M_2).
Conversely, if for some D, a vector D v does not belong to C(D M_2), then v does not belong to C(M_2).
Proof.
Note that D M is just a linear row transform of M for any D. The conclusion is immediate by basic linear algebra. ∎
Proof of Proposition 6.6 Thanks to Lemma 6.7, we only need to find a matrix D such that the transformed vector does not belong to the column space of D T̃_{c′,g}(Q) for all admissible c′.
We define
We claim that there exists a matrix such that
and
where the choice of the matrix does not depend on c or Q. In the rest of the proof, we construct such a matrix for the first case. The verification for the other case is completely analogous. Note that each row in the product is just a linear combination of rows of the original matrix. Therefore, it suffices to show that every row vector of the form
can be written as a linear combination of the row vectors of . We prove this by induction. First note that for each ,
\[ B_{g,i}(Q) = g_i \mathbf{1} + (c_i - g_i)\, B_i(Q). \tag{32} \]
Suppose that all rows of the form
for all can be written as linear combinations of the row vectors of with coefficients only depending on . Thanks to (32), the case of holds. Suppose the statement holds for some general . We consider the case of . By definition,
Let “∘” denote element-by-element multiplication. For every generic vector of appropriate length,
We expand the right-hand side of (6.1). The last term would be
From the induction assumption and definition (18), the other terms on both sides of (6.1) belong to the row space of . Therefore, is also in the row space of . In addition, all the corresponding coefficients only consist of . Therefore, one can construct a matrix such that
Because is free of and , we have
In addition, thanks to Propositions 6.3 and 6.4, is not in the column space for all . Therefore, by Lemma 6.7, is not in the column space for all .
In addition,
is of full column rank, where the appended row is a row vector with the last element being one and the rest being zero. Therefore, the expanded matrix is also of full column rank.
6.2 Proof of the theorems
Using the results of the previous propositions and lemmas, we now proceed to prove our theorems.
Proof of Theorem 2.4 Consider and saturated. Recall that is the vector containing ’s with , where
For any , since almost surely, according to Corollary 6.5, by (5), and , there exists such that,
and
Given that there are finitely many binary matrices, Q̂ ∼ Q_0 almost surely as N → ∞. In fact, we can arrange the columns of Q̂ such that Q̂ = Q_0 as N → ∞.
Note that satisfies the identity
In addition, since is of full rank (Proposition 6.1), the solution to the above linear equation is unique. Therefore, the solution to the optimization problem is unique and is . Notice that when , . Therefore,
Together with the consistency of , the conclusion of the theorem follows immediately.
Proof of Theorem 3.1 Note that for all
By the law of large numbers,
almost surely as . Therefore,
almost surely as .
For any , note that
According to Proposition 6.6 and the fact that , there exists such that is continuous in and
By elementary calculus,
and
Therefore,
as . For the same , we have
The above minimization on the left of the equation is subject to the constraint that
Together with the fact that there are only finitely many binary matrices, we have
We arrange the columns of so that as .
Now we proceed to the proof of consistency for . Note that
Since is a full column rank matrix and , in probability.
Proof of Theorem 4.2 Assuming is known, note that
is a continuous function of . According to the results of Proposition 4.1, the definition in (26), and the definition of in Section 4.1, we obtain that
in probability as . In addition, thanks to Proposition 6.6 and with a similar argument as in the proof of Theorem 3.1, is a consistent estimator.
Furthermore, if is a consistent estimator, then is also consistent. Then, the consistency of follows from the facts that is consistent and is of full column rank.
Appendix: Technical proofs
Proof of Proposition 6.3 Note that . Let be arranged as in (30). Then, . Given that , we have . We assume that , where is the entry in the th row and th column. Since , it is necessary that .
Suppose that the th row of the corresponds to an item that requires attributes . Then, we consider , such that the th row of is . Then, the th row vector and the th row vector of are identical.
Since , we have for . If and , the matrices and look like
and
Case 1.
Either the th or th row vector of is a zero vector. The conclusion is immediate because all the entries of are nonzero.
Case 2.
The two row vectors of Q in question are nonzero vectors. There are three different situations according to the true Q-matrix: (a) the item in the first row requires strictly more attributes than the item in the second row, (b) the item in the first row requires strictly fewer attributes than the item in the second row, (c) otherwise. We consider these three situations, respectively.
(a)
Under the true Q-matrix, there are two types of sub-populations in consideration: people who are able to answer the item(s) in one of the two rows only, and people who are able to answer the items in both rows. Comparing the corresponding sub-matrices, we now claim that the two factors must be equal (otherwise the conclusion holds), for the following reason.
Consider the following two rows of : row A corresponding to the combination that contains all the items; row B corresponding to the row that contains all the items except for the one in row .
Rows A and B are in fact identical in . This is because all the attributes are used at least twice (condition C5). Then, the attributes in row are also required by some other item(s) and rows A and B require the same combination of items. Thus, the corresponding entries of all the column vectors of are different by a factor of .
For the candidate matrix, rows A and B are also identical. This is because the two rows have identical attribute requirements. Thus, the corresponding entries of all the column vectors are different by a common factor. Thus, the two factors must be identical; otherwise the vector is not in the column space.
Similarly, we obtain that . Given that and , we now consider row and row . Notice that all the column vectors in have their entries in row and row different by a factor of . On the other hand, the and th entries of are NOT different by a factor of as long as the proportion of is positive. Thereby, we conclude this case.
(b)
Consider the following two types of sub-populations: people who are able to answer the item(s) in one row only and people who are able to answer the items in both rows. With exactly the same argument as in (a), we conclude the same equalities and, further, that the vector is not in the column space.
(c)
Consider the following three types of sub-populations: people who are able to answer the item(s) in the first row only, people who are able to answer the item(s) in the second row only, and people who are able to answer the items in both rows. With the same argument, we obtain the corresponding equalities. On considering the two rows, we conclude that the vector is not in the column space. ∎
Proof of Proposition 6.4 Let T(Q_0) be arranged as in (30). Consider Q that is incomplete. We discuss the following situations.
1.
There are two row vectors, say the th and th row vectors (), in that are identical. Equivalently, two items require exactly the same attributes according to . With exactly the same argument as in the previous proof, under condition C5, we have that and . We now consider the rows corresponding to and . Note that the elements corresponding to row and row for all the vectors in the column space of are different by a factor of . However, the corresponding elements in the vector are NOT different by a factor of as long as the population is fully diversified.
2.
No two row vectors in are identical. Then, among the first rows of there is at least one row vector containing two or more nonzero entries. That is, there exists such that
This is because if each of the first items requires only one attribute and is not complete, there are at least two items that require the same attribute. Then, there are two identical row vectors in and it belongs to the first situation. We define
the number of attributes required by item according to .
Without loss of generality, assume for and for . Equivalently, among the first items, only the first items require more than one attribute while the th through the th items require only one attribute each, all of which are distinct. Without loss of generality, we assume for and for and .
(a)
. Since , there exists an such that . We now consider rows and . With the same argument as before (i.e., the attribute required by row is also required by item 1 in ), we have that (be careful that we cannot claim that ). We now consider the rows 1 and . Note that in these two rows are different by a factor of ; while the corresponding entries in are NOT different by a factor of . Thereby, we conclude the result in this situation.
(b)
and there exists and such that . The argument is identical to that in (2a).
(c)
and for each and , . Let the th row in correspond to . Let the th row in correspond to for .
We claim that there exists an such that the th row and the th row are identical in , that is
(1) If the above claim is true, then the attributes required by item have been required by some other items. Then, we conclude that and must be identical. In addition, rows in corresponding to and are different by a factor of . On the other hand, the corresponding entries in are NOT different by a factor of . Then, we are able to conclude the results for all the cases.
In what follows, we prove the claim in (1) by contradiction. Suppose that there does not exist such an . This is equivalent to saying that for each there exists an such that and for all and . Equivalently, for each , item requires at least one attribute that is not required by other first items. Consider
Let denote the cardinality of a set. Since for each and , , we have that . Note that and for all . Therefore, and . Therefore, . By a similar argument and induction, we have that . This contradicts the fact that . Therefore, there exists an such that (1) is true. As for , we have that
Summarizing the cases in 1, (2a), (2b) and (2c), we conclude the proof. ∎
Acknowledgements
This research was supported in part by Grants NSF CMMI-1069064, SES-1123698, Institute of Education Sciences R305D100017 and NIH R37 GM047845.
References
- [1] Chiu, C.-Y., Douglas, J. A. & Li, X. (2009). Cluster analysis for cognitive diagnosis: Theory and applications. Psychometrika 74 633–665.
- [2] de la Torre, J. (2008). An empirically-based method of Q-matrix validation for the DINA model: Development and applications. Journal of Educational Measurement 45 343–362.
- [3] de la Torre, J. (2008). The generalized DINA model. In International Meeting of the Psychometric Society, Durham, NH.
- [4] de la Torre, J. & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika 69 333–353.
- [5] DiBello, L. V., Stout, W. F. & Roussos, L. (1995). Unified cognitive psychometric assessment likelihood-based classification techniques. In Cognitively Diagnostic Assessment 361–390. Hillsdale, NJ: Erlbaum.
- [6] Gelman, A., Jakulin, A., Pittau, M. G. & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. Ann. Appl. Stat. 2 1360–1383.
- [7] Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality. Ph.D. dissertation, Univ. Illinois, Urbana-Champaign.
- [8] Henson, R. & Douglas, J. (2005). Test construction for cognitive diagnosis. Appl. Psychol. Meas. 29 262–277.
- [9] Henson, R., Roussos, L., Douglas, J. & He, X. (2008). Cognitive diagnostic attribute-level discrimination indices. Appl. Psychol. Meas. 32 275–288.
- [10] Henson, R. A. & Templin, J. L. (2005). Hierarchical log-linear modeling of the skill joint distribution. Technical report, External Diagnostic Research Group.
- [11] Junker, B. W. (1999). Some statistical models and computational methods that may be useful for cognitively-relevant assessment. Technical report. Available at http://www.stat.cmu.edu/~brian/nrc/cfa/documents/final.pdf.
- [12] Junker, B. W. & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Appl. Psychol. Meas. 25 258–272.
- [13] Leighton, J. P., Gierl, M. J. & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement 41 205–237.
- [14] Liu, Y., Douglas, J. A. & Henson, R. (2007). Testing person fit in cognitive diagnosis. In The Annual Meeting of the National Council on Measurement in Education (NCME), Chicago, IL.
- [15] Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika 64 187–212.
- [16] Reckase, M. D. (1990). Unidimensional data from multidimensional tests and multidimensional data from unidimensional tests. In The Annual Meeting of the American Educational Research Association, Boston, MA, April 16–20.
- [17] Reckase, M. D. (2009). Multidimensional Item Response Theory. New York: Springer.
- [18] Roussos, L., DiBello, L. V., Stout, W., Hartz, S., Henson, R. & Templin, J. (2007). The fusion model skills diagnosis system. In Cognitively Diagnostic Assessment for Education: Theory and Practice (J. P. Leighton and M. J. Gierl, eds.) 275–318. New York: Cambridge Univ. Press.
- [19] Roussos, L. A., Templin, J. L. & Henson, R. A. (2007). Skills diagnosis using IRT-based latent class models. Journal of Educational Measurement 44 293–311.
- [20] Rupp, A. A. (2002). Feature selection for choosing and assembling measurement models: A building-block-based organization. Psychometrika 2 311–360.
- [21] Rupp, A. A. & Templin, J. L. (2008). Effects of Q-matrix misspecification on parameter estimates and misclassification rates in the DINA model. Educational and Psychological Measurement 68 78–98.
- [22] Rupp, A. A. & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspective 6 219–262.
- [23] Rupp, A. A., Templin, J. L. & Henson, R. A. (2010). Diagnostic Measurement: Theory, Methods, and Applications. New York: Guilford Press.
- [24] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
- [25] Stout, W. (2007). Skills diagnosis using IRT-based continuous latent trait models. Journal of Educational Measurement 44 313–324.
- [26] Tatsuoka, C. (2002). Data analytic methods for latent partially ordered classification models. J. Roy. Statist. Soc. Ser. C 51 337–350.
- [27] Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement 20 345–354.
- [28] Tatsuoka, K. K. (1985). A probabilistic model for diagnosing misconceptions in the pattern classification approach. Journal of Educational Statistics 12 55–73.
- [29] Tatsuoka, K. K. (2009). Cognitive Assessment: An Introduction to the Rule Space Method. Florence, KY: Routledge.
- [30] Templin, J., He, X., Roussos, L. & Stout, W. (2003). The pseudo-item method: A simple technique for analysis of polytomous data with the fusion model. Technical report, External Diagnostic Research Group.
- [31] Templin, J. L. (2006). CDM: Cognitive diagnosis modeling with Mplus. Computer software. Available at http://jtemplin.coe.uga.edu/research/.
- [32] Templin, J. L. & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychol. Methods 11 287–305.
- [33] von Davier, M. (2005). A general diagnosis model applied to language testing data. Research report, Educational Testing Service.