To estimate test-retest reliability, we can use the same methods as are commonly encountered in the context of inter-rater reliability: Cohen’s kappa for dichotomous or polytomous (i.e. three or more categories) response variables [16], weighted kappa statistics for polytomous ordinal response variables [17], Pearson’s correlation coefficient r for non-categorical (i.e. scale) response variables [23, 24], and the intra-class correlation coefficient (ICC) for both categorical and non-categorical variables [43-46]. Whereas kappa and r can only be used when dealing with two measurements (two time points or two raters), the ICC can also be used with more than two measurements.
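As a rough illustration (not part of the original analyses), the Python sketch below shows how these coefficients could be computed for a two-wave test-retest design. The data are simulated, the variable names are made up, and the third-party packages scikit-learn, SciPy, pandas, and pingouin are assumed to be available.

```python
# Illustrative sketch with simulated data: Cohen's kappa, weighted kappa,
# Pearson's r, and the ICC for a two-wave (test-retest) design.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score
import pingouin as pg

rng = np.random.default_rng(seed=1)

# --- Categorical item (3 categories) answered at two time points ---
t1_cat = rng.integers(0, 3, size=100)
t2_cat = np.where(rng.random(100) < 0.8, t1_cat, rng.integers(0, 3, size=100))

kappa = cohen_kappa_score(t1_cat, t2_cat)                         # Cohen's kappa
w_kappa = cohen_kappa_score(t1_cat, t2_cat, weights="quadratic")  # weighted kappa (ordinal)

# --- Scale (non-categorical) score at two time points ---
t1_scale = rng.normal(50, 10, size=100)
t2_scale = t1_scale + rng.normal(0, 5, size=100)

r, p_value = pearsonr(t1_scale, t2_scale)                         # Pearson's r

# --- ICC: long-format data with subject, time point, and score ---
long_df = pd.DataFrame({
    "subject": np.tile(np.arange(100), 2),
    "time": np.repeat(["t1", "t2"], 100),
    "score": np.concatenate([t1_scale, t2_scale]),
})
icc_table = pg.intraclass_corr(data=long_df, targets="subject",
                               raters="time", ratings="score")

print(f"kappa = {kappa:.2f}, weighted kappa = {w_kappa:.2f}, r = {r:.2f}")
print(icc_table[["Type", "Description", "ICC"]])
```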
Moreover, while r indicates to what extent the scores of a quantitative variable correlate across the two measurements, the ICC can combine an estimate of correlation with a test of the difference in mean scores between the measurements [43, 47]. That is, differences in means do not affect r, but they do lower the ICC to some extent. If a researcher’s interest is solely in the stability of scores (i.e. respondents occupying the same or a similar position relative to each other at both measurements, regardless of the mean scores of the measurements), r can provide an indication of that stability. However, if one wishes to incorporate mean differences in the reliability estimate as well (i.e. a penalty for large differences in mean scores across measurements), one needs to consider specific models that provide an ICC [43-47].
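To make the distinction concrete, the sketch below (simulated data again) adds a constant shift to the second measurement: Pearson’s r is unchanged, whereas an absolute-agreement ICC drops. It assumes pingouin’s labelling of the two-way random-effects, absolute-agreement, single-measurement ICC as type "ICC2".

```python
# Illustrative sketch: a constant mean shift at the second measurement leaves
# Pearson's r unchanged but lowers an absolute-agreement ICC.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import pingouin as pg

rng = np.random.default_rng(seed=2)
t1 = rng.normal(50, 10, size=200)
t2 = t1 + rng.normal(0, 5, size=200)   # stable scores, no mean shift
t2_shifted = t2 + 8                    # same scores plus a constant shift

def icc_absolute(first, second):
    """Absolute-agreement ICC (pingouin type 'ICC2') for two measurements."""
    long_df = pd.DataFrame({
        "subject": np.tile(np.arange(len(first)), 2),
        "time": np.repeat(["t1", "t2"], len(first)),
        "score": np.concatenate([first, second]),
    })
    table = pg.intraclass_corr(data=long_df, targets="subject",
                               raters="time", ratings="score")
    return float(table.loc[table["Type"] == "ICC2", "ICC"].iloc[0])

r_plain, _ = pearsonr(t1, t2)
r_shift, _ = pearsonr(t1, t2_shifted)
print(f"no mean shift: r = {r_plain:.3f}, ICC = {icc_absolute(t1, t2):.3f}")
print(f"mean shift:    r = {r_shift:.3f}, ICC = {icc_absolute(t1, t2_shifted):.3f}")
# r is identical in both rows; the ICC drops once the mean difference is introduced.
```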
Finally, in the context of factor analysis and related methods for latent variable analysis, one can consider including a time component in the model when dealing with repeated measurements [22], which allows one to simultaneously examine whether a factor structure (i.e. sets of items grouped together) is stable across measurements and to obtain information about the correlation between factor scores of the different measurements (i.e. test-retest reliability).
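The joint longitudinal factor model described here would typically be fitted in a structural equation modelling framework. As a much simpler stand-in (simulated data, not the model from [22]), the sketch below fits a one-factor model per measurement occasion with scikit-learn, compares the loading patterns, and correlates the resulting factor scores across occasions; unlike the latent correlation from a proper longitudinal model, this correlation is attenuated by measurement error.

```python
# Simplified stand-in for a longitudinal factor analysis: fit a one-factor
# model per occasion, compare loadings, and correlate factor scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(seed=3)
n, n_items = 300, 6

# Simulate item responses driven by a latent trait that is stable over time.
trait_t1 = rng.normal(size=n)
trait_t2 = 0.9 * trait_t1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
loadings = rng.uniform(0.5, 0.9, size=n_items)
items_t1 = trait_t1[:, None] * loadings + rng.normal(scale=0.6, size=(n, n_items))
items_t2 = trait_t2[:, None] * loadings + rng.normal(scale=0.6, size=(n, n_items))

fa_t1 = FactorAnalysis(n_components=1).fit(items_t1)
fa_t2 = FactorAnalysis(n_components=1).fit(items_t2)

# Similar loading patterns across occasions suggest a stable factor structure.
print("loadings t1:", np.round(fa_t1.components_.ravel(), 2))
print("loadings t2:", np.round(fa_t2.components_.ravel(), 2))

scores_t1 = fa_t1.transform(items_t1).ravel()
scores_t2 = fa_t2.transform(items_t2).ravel()

# The sign of an extracted factor is arbitrary; align occasion 2 with occasion 1.
if np.dot(fa_t1.components_.ravel(), fa_t2.components_.ravel()) < 0:
    scores_t2 = -scores_t2

r, _ = pearsonr(scores_t1, scores_t2)
print(f"factor-score correlation across occasions: r = {r:.2f}")
```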