Introduction

For some traits, it is convenient to have multiple indicators (items). For example, one might have, for a particular disease, 10 symptoms that can each be scored as absent (0) or present (1). For each individual one can then compute a sum score that indicates to what extent the individual is affected by the disease. Such sum scores usually show a normal distribution, or do so after an appropriate transformation. It is typically assumed that the normally distributed scores, or transformations thereof, reflect a continuous interval scale, and the variance of the sum scores is subsequently decomposed. This approach follows classical test theory (CTT), in which the observed score (the sum score) is assumed to be the aggregate of a true score and a random component, usually referred to as measurement error. When the variance of sum scores is decomposed, the measurement error variance (the unreliability) ends up as part of the non-shared environmental variance. As a result, when the reliability of a scale is low (i.e., the measurement error is large) and the analysis is based on sum scores, the heritability of the actual trait is substantially underestimated. Modelling sum scores is appropriate if the sum scores are highly reliable (for instance, because they are based on a large number of correlated items) and well validated. Furthermore, there should be enough variation and the distribution should be more or less normal. Finally, there should be no missing data. If these requirements do not hold, item response theory (IRT) provides a well-established alternative to classical test theory. This paper introduces the basics of the IRT framework, after which its advantages over a sum score approach are discussed. Next, it is argued that IRT models should be estimated simultaneously with the variance decomposition model, which can be done using a Bayesian approach with Markov chain Monte Carlo estimation.
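The attenuation described above can be illustrated with a short simulation. Under CTT, adding measurement error to true scores multiplies the twin correlation by the reliability ρ = σ²_true/(σ²_true + σ²_error), so a heritability estimate based on unreliable sum scores is biased downward by roughly that factor. A minimal sketch (all numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                 # twin pairs
r_true = 0.8                # true-score twin correlation
rel = 0.5                   # scale reliability

# Bivariate normal true scores for twin 1 and twin 2
cov = np.array([[1.0, r_true], [r_true, 1.0]])
t = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Observed score = true score + error; error variance is chosen so that
# reliability = var(true) / var(observed) = rel
err_sd = np.sqrt((1 - rel) / rel)
y = t + rng.normal(0.0, err_sd, size=t.shape)

r_obs = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
print(round(r_obs, 2))      # approximately rel * r_true = 0.40
```

The observed twin correlation, and hence any heritability estimate derived from it, is roughly halved here simply because the measurement-error variance is absorbed into the non-shared environmental component.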
Lastly, a simulation study shows the potential bias when estimating variance components on the basis of sum scores, and the Bayesian method is illustrated with an empirical data set on attention problems.

Item response theory models

In IRT, person j's position on the latent trait is represented by a latent variable θ_j, which is assumed to be normally distributed, θ_j ∼ N(μ, σ²). In the one-parameter normal ogive model, the probability that person j gives a positive response to item i is

$$ p(Y_{ij} = 1) = \Phi(\theta_j - \beta_i), \qquad (1) $$

where Φ(·) denotes the standard normal cumulative distribution function and β_i is the difficulty parameter of item i. An alternative parameterisation replaces the normal ogive by a logistic curve, that is,

$$ p(Y_{ij} = 1) = \Psi(\theta_j - \beta_i), \qquad (2) $$

where

$$ \Psi(x) = \frac{\exp(x)}{1 + \exp(x)}. $$

This logistic one-parameter model is the Rasch model. The two-parameter model additionally includes a discrimination parameter α_i for each item i:

$$ p(Y_{ij} = 1 \mid \theta_j, \alpha_i, \beta_i) = \Psi(\alpha_i \theta_j - \beta_i). \qquad (3) $$

The larger α_i, the more strongly item i is related to the latent trait θ_j. For identification, the mean μ of the latent trait distribution is typically fixed.

IRT models for polytomous data

Often, measurement is based on items or symptoms with more than two categories. For example, answers can be coded as 0 (not at all), 1 (somewhat, sometimes) and 2 (a lot, often). In CTT approaches in behaviour genetics, the sum of these item scores is typically regarded as representing a person's score on the trait of interest and is used for the statistical inference. Polytomous IRT models, such as the graded response and partial credit models, extend the models above to items with M + 1 ordered response categories m = 0, …, M.

Advantages of using an IRT framework compared to analysing sum scores

We will discuss four advantages of using IRT: (1) it supports construct validity and the scoring rule (e.g., a scoring rule might consist of taking the unweighted sum of symptoms as an estimate of a person's liability), (2) it supports the use of incomplete item administration designs and the handling of missing data, (3) it supports accounting for measurement error, and (4) it can handle floor and ceiling effects.

An IRT framework allows one to explicitly model the relationship between item scores and the phenotype of interest. Any combination of items can of course be summed (weighted or unweighted), but this does not guarantee that the sum score reflects a meaningful construct. The meaningfulness of the measurement can be directly assessed in an IRT framework: fit to an IRT model is empirical evidence that the observed responses can be explained by an underlying structure. The latent variable θ_j of the IRT model should, of course, be an appropriate representation of the construct to be measured.

Variance decomposition: the one-step and the two-step approach

In a two-step approach, person parameters θ_j are first estimated with an IRT measurement model, and these estimates are then treated as observed phenotypes in the variance decomposition. There are several disadvantages to this two-step approach. First of all, in the IRT model fitting phase, the usual IRT estimation software cannot handle the dependency in the data inherent in twin and family designs.
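The one- and two-parameter response models introduced above are straightforward to compute directly. A minimal sketch in Python (the item parameters here are made up for illustration; the 2PL follows the paper's parameterisation Ψ(αθ − β)):

```python
import math

def rasch_prob(theta, beta):
    """One-parameter logistic (Rasch) probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def twopl_prob(theta, alpha, beta):
    """Two-parameter logistic probability with discrimination alpha."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))

# A person exactly at the item's difficulty responds positively
# with probability 0.5 under the Rasch model.
print(rasch_prob(0.7, 0.7))           # 0.5

# Higher discrimination steepens the curve around the item location,
# so the same person is separated more sharply from the item threshold.
p_low  = twopl_prob(1.0, 0.5, 0.0)    # alpha = 0.5
p_high = twopl_prob(1.0, 2.0, 0.0)    # alpha = 2.0
print(p_low < p_high)                 # True
```

Setting every α_i = 1 recovers the Rasch model of Eq. (2) as a special case of Eq. (3).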
In some cases, with simple designs such as sibling pairs only, weighting the data would go a long way toward solving this problem, but with more complex family designs weighting is not a satisfactory solution. A second problem is that the uncertainty in the estimated person parameters θ_j, which also depends on the estimated item parameters β and α, is not carried forward into the second step. Moreover, with the two-step approach the heritability coefficient estimate will be about the same as when the analysis is carried out on sum scores. This is because IRT estimates and sum scores correlate highly, well over 0.90 in the case of two-parameter models. When a one-parameter model is applied, the correlation will be practically one, because a basic property of the Rasch model is that the sum score is a sufficient statistic for the score on the latent trait; therefore, all persons with the same sum score will get the same estimate on the latent trait. Thus, a third problem of the two-step approach is that it solves neither the attenuation problem, nor the non-normality, nor the ceiling effects.

In order to take full advantage of the IRT approach, it is critical to estimate the measurement model and the variance decomposition model simultaneously, using a one-step approach. Computationally, however, this is rather challenging. Below, it is shown how this can be done using software for Bayesian estimation procedures. In a later section, we demonstrate the one-step approach for the estimation of heritability with both simulated and empirical data.

Bayesian estimation using a Markov chain Monte Carlo algorithm

Let η denote the collection of all model parameters and Y the observed data. Bayesian inference is based on the posterior distribution P(η | Y), which is obtained from the likelihood P(Y | η) and the prior distribution P(η) through Bayes' rule:

$$ P(\eta \mid Y) \propto P(Y \mid \eta)\,P(\eta). \qquad (4) $$

The prior P(η) expresses what is known about η before the data are observed, and the posterior combines this prior information with the information in the data. Because the joint posterior distribution is usually analytically intractable, it is characterised by sampling from it with a Markov chain Monte Carlo (MCMC) algorithm, such as the Gibbs sampler. In each iteration of an MCMC algorithm, a sample is taken from the conditional posterior distribution of each subset of the parameter space, given the current values of the other parameters. After a number of so-called "burn-in" iterations, necessary for the chain to approach stationarity (i.e., the target distribution: the joint posterior distribution) sufficiently closely, the subsequent draws can be regarded as sampled from the joint posterior distribution. Such algorithms are implemented in the freely available WinBUGS package (http://www.mrc-bsu.cam.ac.uk/bugs/).

For the variance decomposition, the latent trait θ of twin j in family k is modelled as

$$ \theta_{jk} = a1_k + a2_{jk} + c_k + e_{jk}, \qquad (5) $$

where c_k is the shared environmental effect of family k, e_{jk} is the unique environmental effect of twin j in family k, and a1_k + a2_{jk} is the additive genetic effect. For MZ pairs, a1_k ∼ N(0, σ²_a) and a2_{jk} = 0, so that the twins share the complete additive genetic effect; for DZ pairs, a1_k ∼ N(0, ½σ²_a) and a2_{jk} ∼ N(0, ½σ²_a), so that the twins share half of the additive genetic variance. Further, c_k ∼ N(0, σ²_c) and e_{jk} ∼ N(0, σ²_e).

Simulation

Twin data were simulated under an AE model for the latent trait, and item responses were generated under a one-parameter IRT model with item difficulty parameters β. Next, sum scores were computed and these were analysed with an AE model. Since the distribution of the sum scores is positively skewed, the AE analysis was also performed after a logarithmic transformation of the sum scores. The simulations were carried out using the software package . For each replicated data set, we computed the twin correlations for the latent scores, the twin correlations of the sum scores, and the twin correlations for the log-transformed sum scores. The three types of analyses were carried out in WinBUGS. After a burn-in phase of 1000 iterations, the characterisation of the posterior distribution of the model parameters was based on 1000 iterations from 2 independent Markov chains. From each of the 3 (analyses) × 101 (replicated data sets) marginal posterior distributions for the heritability, we took the mean and the median as point estimates.
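The random-effects parameterisation of Eq. (5) can be sketched in a few lines: simulating latent traits this way reproduces the expected MZ and DZ twin correlations. A minimal sketch with illustrative variance components (not the values used in the paper's simulation study):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                          # twin pairs per zygosity group
s2_a, s2_c, s2_e = 0.5, 0.2, 0.3     # illustrative ACE variances

def simulate_pairs(n, mz):
    """theta_jk = a1_k + a2_jk + c_k + e_jk for n twin pairs."""
    if mz:
        # MZ twins share the complete additive genetic effect
        a1 = rng.normal(0, np.sqrt(s2_a), (n, 1)) * np.ones((1, 2))
        a2 = 0.0
    else:
        # DZ twins share half the additive genetic variance
        a1 = rng.normal(0, np.sqrt(s2_a / 2), (n, 1)) * np.ones((1, 2))
        a2 = rng.normal(0, np.sqrt(s2_a / 2), (n, 2))
    c = rng.normal(0, np.sqrt(s2_c), (n, 1)) * np.ones((1, 2))  # shared by the pair
    e = rng.normal(0, np.sqrt(s2_e), (n, 2))                    # unique per twin
    return a1 + a2 + c + e

mz = simulate_pairs(n, mz=True)
dz = simulate_pairs(n, mz=False)
r_mz = np.corrcoef(mz[:, 0], mz[:, 1])[0, 1]  # expect (s2_a + s2_c) / 1.0 = 0.70
r_dz = np.corrcoef(dz[:, 0], dz[:, 1])[0, 1]  # expect (s2_a/2 + s2_c) / 1.0 = 0.45
print(round(r_mz, 2), round(r_dz, 2))
```

In the one-step approach, these latent traits are not observed directly; they enter the model only through the IRT response probabilities, and the variance components are sampled jointly with the item parameters.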
Simulation results

Taking the median parameter values from the 101 data sets, the simulated latent data correlated 0.72 in MZ twins and 0.36 in DZ twins, just as would be expected. The sum scores correlated 0.45 in MZ twins and 0.21 in DZ twins (medians of the 101 data sets), and the log-transformed sum scores correlated 0.41 and 0.20, respectively. Thus, twin correlations are severely attenuated when analysing sum scores, even with 14 items. The heritability estimates are shown in Table 1.

Table 1 Simulation results. Reported heritability values are the medians of the 101 posterior means and medians, standard deviations between parentheses

Method of analysis              Posterior mean     Posterior median
1PL IRT model                   0.7232 (0.0585)    0.7245 (0.0589)
Sum scores, continuous model    0.4364 (0.0393)    0.4369 (0.0395)
Log-transformed sum scores      0.4046 (0.0403)    0.4047 (0.0406)

Fig. 1 (item difficulty parameters β; figure not recoverable)

Thus, also under the IRT model, analysing sum scores leads to an underestimation of shared environmental effects and an overestimation of dominance genetic effects when the sum score distributions are skewed.
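The attenuation pattern in Table 1 can be reproduced in outline with a short simulation: the latent twin correlation survives, but the correlation of sum scores from a short skewed scale is pulled toward zero. A rough sketch (the item difficulties, item count handling, and sample size are illustrative, not the paper's simulation design):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000                  # MZ pairs (illustration only)
r_mz = 0.72                 # latent MZ correlation, as in the simulation study

# Latent traits for MZ pairs
cov = np.array([[1.0, r_mz], [r_mz, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n)   # shape (n, 2)

# 14 Rasch items; mostly positive difficulties give a positively
# skewed sum score distribution, as with symptom checklists
betas = np.linspace(0.0, 2.6, 14)
p = 1.0 / (1.0 + np.exp(-(theta[:, :, None] - betas)))     # shape (n, 2, 14)
y = rng.random(p.shape) < p                                # simulated item responses
sums = y.sum(axis=2)                                       # shape (n, 2) sum scores

r_latent = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
r_sum = np.corrcoef(sums[:, 0], sums[:, 1])[0, 1]
print(round(r_latent, 2))   # close to 0.72
print(r_sum < r_latent)     # True: the sum score correlation is attenuated
```

The gap between the two correlations is exactly the information lost by collapsing item responses to a sum score, which the one-step IRT analysis recovers.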
An application

The one-step approach is illustrated with twin data on attention problems; the seven items of the scale are listed in Table 2.

Table 2 Attention problems items

Item  Description
1     I act too young for my age
2     I have trouble concentrating or paying attention
3     I daydream a lot
4     My school work or job performance is poor
5     I am too dependent on others
6     I fail to finish things I should do
7     My behaviour is irresponsible

The AE variance decomposition of the latent trait was combined with a one-parameter IRT model for the polytomous items, with two threshold parameters β_i1 and β_i2 per item i and a main effect of sex, δ, on the latent trait.

Results

Table 3 gives descriptives of the marginal posterior distributions.

Table 3 Descriptives of marginal posterior distributions for the AE variance decomposition model using the 1PL IRT model for polytomous items with a main effect for sex

Parameter  Mean    SD     2.5th pct  Median  97.5th pct
σ²_a        0.84   0.07    0.71       0.84    0.99
σ²_e        0.32   0.06    0.20       0.32    0.44
δ          −0.13   0.05   −0.24      −0.13   −0.02
β_11        0.25   0.05    0.15       0.25    0.34
β_12        2.76   0.10    2.56       2.76    2.96
β_21       −0.76   0.05   −0.86      −0.76   −0.66
β_22        2.44   0.08    2.30       2.45    2.60
β_31       −0.43   0.05   −0.53      −0.43   −0.33
β_32        1.84   0.08    1.71       1.84    1.98
β_41        1.90   0.06    1.78       1.90    2.02
β_42        3.96   0.20    3.58       3.96    4.36
β_51        0.22   0.05    0.13       0.22    0.32
β_52        3.03   0.10    2.83       3.02    3.23
β_61        0.62   0.05    0.53       0.62    0.73
β_62        3.90   0.15    3.63       3.90    4.19
β_71        2.57   0.07    2.44       2.57    2.71
β_72        4.61   0.31    4.05       4.60    5.27
h²          0.73   0.05    0.63       0.72    0.82

Discussion

Moreover, when the measurement model and the variance decomposition model are estimated simultaneously, the variance decomposition deals appropriately with the dependency in the data when the IRT model parameters are estimated and the model's assumptions are tested. In turn, the IRT measurement model deals appropriately with the estimation of the heritability coefficient (correcting for attenuation to obtain an unbiased point estimate) and with the reporting of confidence intervals (correcting for the location-dependent uncertainty of person scores on the latent trait). The crucial element of the one-step approach that leads to unbiased point estimates is the inclusion of the appropriate probabilistic measurement model, so that the estimation takes into account the unreliability of the measurement.
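As an arithmetic check on Table 3: under the AE model the heritability is the additive genetic proportion of the latent trait variance, h² = σ²_a / (σ²_a + σ²_e), and plugging in the posterior point estimates reproduces the reported h². (In the Bayesian analysis h² is of course computed per posterior draw, not from summary statistics; this is only a consistency check.)

```python
# Posterior means of the variance components from Table 3
s2_a, s2_e = 0.84, 0.32

# Heritability under the AE model: additive genetic share of the variance
h2 = s2_a / (s2_a + s2_e)
print(round(h2, 2))   # 0.72, matching the posterior median of h2 in Table 3
```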
The probabilistic modelling allows for the fact that twins with identical response patterns may have different scores on the latent trait and, conversely, that twins with non-identical response patterns may have exactly the same score on the latent trait. The discriminatory power of the items and the number of items are both crucial to the heritability estimate based on sum scores: the fewer the items, and the worse their discrimination (i.e., the smaller the variance of the latent trait in the one-parameter model, or the smaller the factor loadings in the two-parameter model), the more biased the estimate will be when the analysis is performed on sum scores. High-quality scales with a large number of items (say, more than 50) with high discriminatory power scattered across the entire scale can indeed be analysed with sum scores, but any other scale should be analysed within the IRT framework if one is interested in an unbiased heritability estimate with trustworthy confidence intervals. Future work should focus on the assessment of model fit in the context of genetic models. It is only sensible to apply a one-step IRT approach when the data actually conform to an IRT measurement model. If the data do not fit an IRT model, for instance when there is differential item functioning across subpopulations, the approach will still lead to biased estimates. A crucial first step, therefore, is assessing model fit and checking measurement invariance.
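The claim that bias shrinks as scales become longer can be checked numerically: as the number of well-spread items grows, the twin correlation of the sum scores approaches the latent twin correlation. A rough sketch (scale lengths, difficulties and sample size are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40_000
r_latent = 0.72

# Latent traits for simulated MZ pairs
cov = np.array([[1.0, r_latent], [r_latent, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n)

def sum_score_corr(n_items):
    """Twin correlation of Rasch sum scores for a scale of n_items items."""
    betas = np.linspace(-2.0, 2.0, n_items)    # difficulties spread over the scale
    p = 1.0 / (1.0 + np.exp(-(theta[:, :, None] - betas)))
    y = rng.random(p.shape) < p
    s = y.sum(axis=2)
    return np.corrcoef(s[:, 0], s[:, 1])[0, 1]

r5, r50 = sum_score_corr(5), sum_score_corr(50)
print(r5 < r50 < r_latent)   # True: longer scales are less attenuated
```

Even the 50-item scale does not fully reach the latent correlation, which is why the IRT framework remains preferable whenever an unbiased heritability estimate is the goal.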