The Outcome Questionnaire-45 (OQ-45)
The OQ-45 [10, 11] contains 45 Likert items scored from 0 (never) to 4 (almost always). The items form three subscales. The Symptom Distress subscale (SD; 25 items; example items include “I feel fearful” and “I feel worthless”) taps symptoms of the most common types of psychological distress encountered in practice, such as depression and anxiety. The Interpersonal Relations subscale (IR; 11 items; example items include “I am concerned about my family troubles” and “I have an unfulfilling sex life”) measures problems encountered in interpersonal relations. The Social Role subscale (SR; 9 items; example items include “I feel stressed at school/work” and “I enjoy my spare time”) taps distress on a broader social level, including distress encountered at work, during education, and during leisure activities.
Two remarks with respect to the OQ-45 are in order. First, the three-factor structure of the OQ-45 hypothesized by Lambert and colleagues [10] is not always replicable (e.g., [14–17]). In addition, De Jong et al. [11] identified an additional subscale in the Dutch OQ-45 that contains 12 items from the SD subscale. These 12 items measure symptoms of distress related exclusively to anxiety and its physical manifestations. The authors named this subscale Anxiety and Somatic Distress (ASD), but the clinical relevance of ASD as a separate scale of patient functioning is not yet evident. Therefore, we used both the factorial structure hypothesized by De Jong et al. [11] and the empirical structure resulting from our sample to study the OQ-45 for beta and gamma change.
Second, previous studies [11, 18] of the psychometric properties of the Dutch OQ-45 identified four problematic items (i.e., items 11, 12, 26, and 32) that fit poorly with the other items in their subscales. Response shifts cannot be validly detected for these items because they share hardly any variance with the other items, and their poor fit within the scale may also confound other results. Therefore, these four items were excluded from the analyses. After the exclusion, 24 items remained in the SD subscale, 10 in the IR subscale, and 7 in the SR subscale.
Data analysis strategy
Beta and gamma change have to be assessed sequentially; that is, one first has to ascertain that the same latent attribute is measured at both measurement occasions (i.e., no gamma change, but possibly beta change) before investigating possible beta change [19]. Therefore, we first concentrate on gamma change and then on beta change.
Gamma change. To assess gamma change, one has to investigate whether the number of factors has changed and, if not, whether for a fixed number of factors the pattern of fixed and free factor loadings has changed from pretest to posttest [2, 20, 21]. To this end, we first fitted a series of factor models, starting with the one-factor model and proceeding with the two-factor model, the three-factor model, and so on. No restrictions were imposed on the loadings. The model with the smallest number of factors that fitted the data adequately was retained for further analysis. Next, gamma change was assessed by comparing the patterns of loadings and cross-loadings between pretest and posttest in the best-fitting factor model; that is, we tested for so-called configural invariance [22]. Gamma change was inferred when either (1) a particular item had its highest loading on different factors at pretest and posttest, or (2) the number of factors on which the items had substantial loadings changed between pretest and posttest. All factor models were fitted to the polychoric correlation matrix, using Mplus 5.0 [23] and mean-adjusted weighted least squares (WLSM) estimation. Factor analysis of polychoric correlation matrices avoids finding spurious difficulty factors [24].
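The two decision rules for gamma change can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the study: the loading matrices are invented, the factors are assumed to be in the same order at both occasions, rule (2) is applied per item, and the cutoff of 0.30 for a "substantial" loading is a hypothetical choice.

```python
from typing import List

Loadings = List[List[float]]  # rows = items, columns = factors


def dominant_factor(row: List[float]) -> int:
    """Index of the factor on which the item has its largest absolute loading."""
    return max(range(len(row)), key=lambda k: abs(row[k]))


def n_substantial(row: List[float], cutoff: float) -> int:
    """Number of factors on which the item loads at or above the cutoff."""
    return sum(1 for value in row if abs(value) >= cutoff)


def gamma_change_items(pre: Loadings, post: Loadings, cutoff: float = 0.30) -> List[int]:
    """Indices of items flagged by rule (1) or rule (2).

    The cutoff is a hypothetical value chosen for illustration only.
    """
    flagged = []
    for i, (row_pre, row_post) in enumerate(zip(pre, post)):
        rule1 = dominant_factor(row_pre) != dominant_factor(row_post)
        rule2 = n_substantial(row_pre, cutoff) != n_substantial(row_post, cutoff)
        if rule1 or rule2:
            flagged.append(i)
    return flagged


# Invented two-factor loadings for four items (not taken from the OQ-45 data).
pre = [[0.70, 0.10], [0.65, 0.05], [0.10, 0.60], [0.20, 0.55]]
post = [[0.68, 0.12], [0.20, 0.58], [0.15, 0.62], [0.45, 0.50]]

print(gamma_change_items(pre, post))
```

In this invented example, the second item switches its dominant factor (rule 1) and the fourth item acquires a second substantial loading (rule 2), so both would be flagged.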
Beta change. Beta change was assessed for each of the four OQ-45 subscales (i.e., SD, IR, SR, and ASD) separately within the framework of unidimensional IRT [25]. Unidimensional IRT models can be conceived as non-linear factor models for categorical indicators. In particular, we used the graded response model (GRM; [26]), which is suitable for modeling data obtained by means of Likert items, as in the OQ-45. Let \(\theta\) denote the latent variable. The GRM assumes unidimensionality, local independence, and a logistic (i.e., S-shaped) relationship between \(\theta\) and the cumulative response probabilities. For each item, this logistic function is parameterized by one slope parameter (\(a\)) and \(M\) threshold parameters (\(b\)), where \(M\) equals the number of response categories minus 1; that is, for a 5-category Likert item, \(M=4\) (no threshold is needed for score 0, because the probability of a score of at least 0, that is, any score, equals 1). The slope parameter expresses how well an item distinguishes low and high \(\theta\) values, and thus how strongly observed scores are associated with the latent variable. The threshold parameter \(b_{m}\) (\(m=1,\ldots ,4\) for OQ-45 Likert items) denotes the location on the \(\theta\)-scale where the probability of obtaining score \(m\) or higher equals 0.50. Different items usually have different \(a\) and \(b\) parameters. Beta change amounts to change in the item parameters, either \(a\), \(b\), or both, between pretest and posttest, provided that items are calibrated on the same scale at pretest and posttest. The GRM assumptions of unidimensionality and local independence were evaluated using the residual correlations under the one-factor model. The assumptions were considered valid if the residual correlations did not exceed 0.15 [27].
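To make the roles of the slope and threshold parameters concrete, the following Python sketch computes cumulative and category response probabilities for a single item. It assumes the common logistic form \(P(X\ge m\mid \theta )=1/\{1+\exp [-a(\theta -b_{m})]\}\), which is consistent with the definition of \(b_{m}\) given above but may differ from the parameterization reported by a particular software package. The parameter values are invented for illustration.

```python
import math
from typing import List


def cumulative_probs(theta: float, a: float, b: List[float]) -> List[float]:
    """P(X >= m | theta) for m = 1..M under the assumed logistic GRM form."""
    return [1.0 / (1.0 + math.exp(-a * (theta - b_m))) for b_m in b]


def category_probs(theta: float, a: float, b: List[float]) -> List[float]:
    """P(X = k | theta) for k = 0..M.

    Each category probability is the difference between two adjacent
    cumulative probabilities, with P(X >= 0) = 1 and P(X >= M + 1) = 0.
    Thresholds are assumed to be in increasing order.
    """
    cum = [1.0] + cumulative_probs(theta, a, b) + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]


# Invented parameters for one 5-category item (M = 4 thresholds).
a_example = 1.5
b_example = [-1.0, 0.0, 1.0, 2.0]

probs = category_probs(0.0, a_example, b_example)
print([round(p, 3) for p in probs])
```

At \(\theta =0\), which equals the second threshold in this example, the probability of scoring 2 or higher is 0.50 by construction, and the five category probabilities sum to 1.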
For testing beta change, we used likelihood-ratio tests (LRT; e.g., [28]), which are available in FlexMIRT [29]. The LRT compares the likelihoods of two nested models: a restricted model, which assumes that both the \(a\) and \(b\) parameters are equal at pretest and posttest (i.e., no beta change), and a general model, in which the \(a\) and \(b\) parameters of one or more items are freely estimated at pretest and posttest (i.e., beta change). A significant LRT means that the restricted model fits significantly worse than the general model, which suggests that the slopes, the thresholds, or both changed from pretest to posttest.
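The arithmetic of such a likelihood-ratio test can be sketched as follows. This is an illustration, not the procedure implemented in any specific program: the log-likelihood values are invented, and the five degrees of freedom assume that one slope and four thresholds of a single item are freed. The chi-square tail probability is computed from a standard series expansion so that the sketch needs only the Python standard library.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: float) -> float:
    """Upper-tail probability of a chi-square variate.

    Uses the series expansion of the regularized lower incomplete
    gamma function P(df/2, x/2) and returns 1 - P.
    """
    if x <= 0:
        return 1.0
    s, z = df / 2.0, x / 2.0
    term = 1.0 / s
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (s + n)
        total += term
        n += 1
    lower = total * math.exp(-z + s * math.log(z) - math.lgamma(s))
    return max(0.0, 1.0 - lower)


def likelihood_ratio_test(loglik_restricted: float,
                          loglik_general: float,
                          df: int) -> Tuple[float, float]:
    """Return the LRT statistic G2 and its p-value.

    df is the number of parameters that are free in the general model
    but constrained to be equal in the restricted model.
    """
    g2 = 2.0 * (loglik_general - loglik_restricted)
    return g2, chi2_sf(g2, df)


# Invented log-likelihoods; df = 5 assumes one slope and four thresholds
# of a single item are freed across occasions.
g2, p_value = likelihood_ratio_test(-5120.4, -5112.1, df=5)
print(round(g2, 1), p_value < 0.05)
```

If the software reports -2 log-likelihood values instead of log-likelihoods, the statistic is simply the restricted value minus the general value.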
Comparison of factor and IRT approaches. Theoretically, gamma change can also be assessed within an IRT framework. In fact, assuming multivariate normally distributed latent variables, the factor model of polychoric correlations and the multidimensional GRM are equivalent [30], but the models are estimated differently [31]. Parameters of the factor model are estimated from the bivariate associations, which is a limited-information approach. Parameter estimation in multidimensional IRT is based on the likelihood of the response patterns, thus including all higher-order associations, which is a full-information approach. Research [31] has shown that both approaches yield accurate estimates, but that full-information approaches may run into computational problems. Therefore, we chose to factor-analyze the polychoric correlations using the limited-information approach for examining gamma change.
Beta change can also be assessed by means of factor analysis, by testing whether item intercepts and/or factor loadings changed between pretest and posttest (e.g., [2, 32]). Factor loadings are conceptually equivalent to slope (\(a\)) parameters in IRT. However, the interpretation of the item intercept in linear factor models differs somewhat from the interpretation of the \(b\) parameters in IRT models. The intercept in a factor analysis can be conceived as the overall item difficulty, whereas the \(b\) parameters in the GRM define the probability of scoring in a particular category or higher and thus describe item difficulty at the level of the response categories. In practice, item intercepts in factor analysis are rarely used for assessing beta change [13]. More importantly, because the GRM has \(M\) location parameters per item, IRT is better able to reveal subtle forms of beta change in which violations of measurement invariance pertain to some categories but not to all. Such beta changes may not be visible as change in the intercepts in factor models, because the intercept summarizes information that IRT divides across the \(M\) threshold parameters; the thresholds can thus reveal nuances that the intercept hides.
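A small numerical illustration may clarify this last point. The Python sketch below uses invented parameters and the common logistic GRM form; as a rough and admittedly imperfect stand-in for what an intercept summarizes, it uses the mean expected item score over a symmetric grid of \(\theta\) values. Two thresholds are shifted in opposite directions between the hypothetical pretest and posttest, so that the mean item score is unchanged while the category probabilities change noticeably.

```python
import math
from typing import List


def cumulative_probs(theta: float, a: float, b: List[float]) -> List[float]:
    """P(X >= m | theta) for m = 1..M under the assumed logistic GRM form."""
    return [1.0 / (1.0 + math.exp(-a * (theta - b_m))) for b_m in b]


def category_probs(theta: float, a: float, b: List[float]) -> List[float]:
    """P(X = k | theta) for k = 0..M."""
    cum = [1.0] + cumulative_probs(theta, a, b) + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]


def expected_score(theta: float, a: float, b: List[float]) -> float:
    """Expected item score, which equals the sum of the cumulative probabilities."""
    return sum(cumulative_probs(theta, a, b))


# Invented parameters: only the two middle thresholds differ between occasions.
a = 1.5
b_pre = [-1.5, -0.5, 0.5, 1.5]
b_post = [-1.5, -0.8, 0.8, 1.5]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric grid of theta values

mean_pre = sum(expected_score(t, a, b_pre) for t in grid) / len(grid)
mean_post = sum(expected_score(t, a, b_post) for t in grid) / len(grid)

# Probability of the middle category (score 2) at theta = 0.
p2_pre = category_probs(0.0, a, b_pre)[2]
p2_post = category_probs(0.0, a, b_post)[2]

print(round(mean_pre, 3), round(mean_post, 3))
print(round(p2_pre, 3), round(p2_post, 3))
```

Under these assumptions the mean item score is the same at both occasions, whereas the probability of choosing the middle category at \(\theta =0\) increases substantially. A comparison of a single summary per item would miss a change that the separate thresholds expose.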