Estimating Mean Score Reliability with RMU

Introduction
This tutorial demonstrates how to use Relative Measurement Uncertainty (RMU) to estimate mean score reliability. See our pre-print for more information on RMU.
This brief tutorial assumes you’re relatively familiar with multilevel Bayesian modelling and logistic regression.
Set Up
Imagine we measure the length of 5000 mice 5 times each. Every time we measure a mouse, we will get a slightly different measurement (have you tried holding a mouse with one hand and a ruler in the other?).
In this example, imagine that each measurement from each subject follows a normal distribution with a fixed mean (called the “true score” in psychometrics; \(\theta_i\)) and fixed variance (\(\sigma^2_{\epsilon}\)). Assume that this measurement error variance (\(\sigma^2_{\epsilon}\)) is the same across all participants.
The final important parameter to consider is the variance in mouse true scores across the mouse population (\(\sigma^2_{\theta}\)).
To illustrate, let’s simulate some data:

library(ggdist)
library(brms)
library(cmdstanr)
library(psych)
library(tidyverse)

rm(list = ls())
set.seed(1)

N = 5000 # number of subjects (mice)
J = 5    # number of measurements per subject
true_score_variance = 3
error_variance = 2

df = expand.grid(j = 1:J, mouse = 1:N)

true_scores = rnorm(N, mean = 10, sd = sqrt(true_score_variance))
measurement_error = rnorm(N*J, mean = 0, sd = sqrt(error_variance))

df$measurement = true_scores[df$mouse] + measurement_error

df_average_lengths = df %>%
  group_by(mouse) %>%
  summarise(average_measurement = mean(measurement))
In the above example, we can work out the reliability of the mean score using J, \(\sigma^2_{\epsilon}\) and \(\sigma^2_{\theta}\):
\[ \begin{aligned} R(\hat{\mathbf{y}}) &= \frac{\sigma^2_{\theta}}{\sigma^2_{\theta}+\sigma^2_{\epsilon}/\text{J}} \\ &= \frac{3}{3+2/5}= 0.882 \end{aligned} \]
In case you don’t trust me, the derivation for this formula is in the appendix. However, we can also check it’s correct by computing the squared correlation between the observed mean scores and true scores:
cor(df_average_lengths$average_measurement, true_scores)^2
[1] 0.8902067
Estimating Reliability
It’s worth noting that the Bayesian measurement model we employ in this example effectively makes the same assumptions as coefficient alpha - which, if violated, can lead to bias (this paper covers the assumptions of alpha nicely).
However, with RMU, we can make our model more complex (e.g., use an alternative outcome distribution, add parameters, etc.) to account for any quirks with our measurements. For example, for modelling reaction times, you may want to use a more appropriate distribution.
We will keep our model simple in this example - but if you want to apply this code to real data, it’s worth considering whether the model assumptions are met!
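For instance, a minimal sketch of such a swap might look like the following - assuming a hypothetical long-format reaction-time dataset df_rt with columns rt and mouse (shifted_lognormal() is one of the response distributions brms supports that is commonly used for reaction times):

# Hypothetical sketch: same random-intercept structure, but with a
# shifted-lognormal outcome distribution for reaction times
brms_model_rt = brm(
  rt ~ 1 + (1 | mouse),
  data = df_rt,                 # hypothetical reaction-time dataset
  family = shifted_lognormal(),
  chains = 4,
  cores = 4
)

The rest of the RMU workflow is unchanged: extract the r_mouse draws with as_draws_df() and pass the transposed matrix to the reliability function.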
(1) Coefficient alpha
We can estimate reliability using (unstandardised) coefficient alpha. First, we need to pivot_wider our dataset so that each row represents a mouse and the five measurements sit in five separate columns:
df_wide = df %>%
  pivot_wider(
    id_cols = mouse,
    names_from = j,
    values_from = measurement,
    names_prefix = "measurement"
  )

df_wide %>%
  head()
# A tibble: 6 × 6
mouse measurement1 measurement2 measurement3 measurement4 measurement5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6.77 9.80 6.54 10.6 10.5
2 2 8.57 8.58 11.2 10.7 10.2
3 3 7.41 8.71 8.30 8.63 10.4
4 4 12.1 11.6 12.0 12.4 11.9
5 5 11.1 12.6 12.5 11.2 10.2
6 6 6.54 7.19 6.89 8.49 11.3
The coefficient alpha calculation also gives us roughly .89!
alpha = psych::alpha(df_wide[-1])$total$raw_alpha

alpha_ci = psych::alpha.ci(alpha = alpha, n.obs = N, n.var = J, digits = 4)

alpha_ci
95% confidence boundaries (Feldt)
lower alpha upper
0.88 0.89 0.89
(2) RMU
To calculate RMU, we need to first fit a Bayesian measurement model, and then extract a matrix of posterior draws for each mouse’s length parameter.
We can use a simple random intercept multilevel model, as shown below.
To fit the model, we use a “long” dataset which has a column for measurement, and a column to indicate mouse. Note that multiple measurements from each mouse are stored in the same column (measurement).
head(df)
j mouse measurement
1 1 1 6.770474
2 2 1 9.804690
3 3 1 6.541625
4 4 1 10.583413
5 5 1 10.495552
6 1 2 8.567657
Fit brms Multilevel Model
Let’s fit our model:
set.seed(1)
brms_model = brm(
  measurement ~ 1 + (1 | mouse),
  data = df,
  chains = 4,
  cores = 4,
  backend = "cmdstanr",  # This is to speed up model fitting using parallelization
  threads = threading(2) # This is to speed up model fitting using parallelization
)
And look at our results:
summary(brms_model)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: measurement ~ 1 + (1 | mouse)
Data: df (Number of observations: 25000)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~mouse (Number of levels: 5000)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.77 0.02 1.73 1.80 1.00 1103 2085
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 9.99 0.03 9.94 10.04 1.00 870 1436
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.42 0.01 1.40 1.43 1.00 4648 2956
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Our fitted model has the following structure (brms reports standard deviations, so the variances below are the squares of sd(Intercept) = 1.77 and sigma = 1.42, rounded to whole numbers):
\[ \begin{aligned} measurement_{ij} &= 10 + \theta_{j} + \epsilon_{ij} \\ \theta_{j} &\sim N(0, 3) \quad \text{(mouse random-intercepts)}\\ \epsilon_{ij} &\sim N(0, 2) \quad \text{(measurement error)} \end{aligned} \]
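As a quick sanity check, here is a minimal sketch of a plug-in calculation (assuming VarCorr(), brms’s accessor for the group-level and residual standard deviations): squaring the fitted standard deviations and applying the mean score reliability formula should land near the true value of 0.882.

vc = VarCorr(brms_model)
tau2   = vc$mouse$sd[, "Estimate"]^2       # estimated true score variance (≈ 1.77^2)
sigma2 = vc$residual__$sd[, "Estimate"]^2  # estimated error variance (≈ 1.42^2)
unname(tau2 / (tau2 + sigma2 / 5))         # J = 5; ≈ 0.89

Note this sketch only uses posterior point estimates; RMU, described next, uses the full matrix of posterior draws.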
Calculate RMU
To calculate RMU, extract the posterior draws for each mouse’s random intercept parameter, and plug the matrix of draws into my reliability function.
You can either install my R package gbtoolbox from github, or you can just load the reliability function directly from github. Both methods are outlined below:
Install R package from github:
devtools::install_github("giac01/gbtoolbox")
Or, you can just load the R function using the following code (but make sure you also have the R package ggdist installed):
source("https://raw.githubusercontent.com/giac01/gbtoolbox/refs/heads/main/R/reliability.R")
posterior_draws = brms_model %>%
  as_draws_df() %>%
  select(starts_with("r_mouse")) %>%
  t()
nrow(posterior_draws) # should equal the number of subjects
[1] 5000
ncol(posterior_draws) # should equal the number of post-warmup posterior draws
[1] 4000
gbtoolbox::reliability(posterior_draws)$hdci
[1] "Number of subjects: 5000"
[1] "Number of posterior draws: 4000"
  rmu_estimate hdci_lowerbound hdci_upperbound .width .point .interval
1    0.8854683       0.8808548       0.8903191   0.95   mean      hdci
Remember to check that the matrix of draws has subjects on the rows and draws on the columns. We can check this because we know we have 5000 subjects, and the function has confirmed this is the case.
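If you would rather the script fail loudly when the matrix is accidentally transposed, a minimal sketch of an explicit check (using the quantities from this example) is:

# Sketch: stop early if the draws matrix is transposed
stopifnot(nrow(posterior_draws) == length(unique(df$mouse))) # subjects on the rows
stopifnot(ncol(posterior_draws) == 4000)                     # post-warmup draws on the columns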
Recap of simulation code
Just to check this wasn’t a fluke, we can play around with the parameters J, true_score_variance, and error_variance below to show that we get the correct result across different simulation conditions:
rm(list=ls())
set.seed(1)
N = 5000 # number of subjects (mice)
J = 3    # number of measurements per subject
true_score_variance = 1
error_variance = 10

df = expand.grid(j = 1:J, mouse = 1:N)

true_scores = rnorm(N, mean = 10, sd = sqrt(true_score_variance))
measurement_error = rnorm(N*J, mean = 0, sd = sqrt(error_variance))

df$measurement = true_scores[df$mouse] + measurement_error

df_average_lengths = df %>%
  group_by(mouse) %>%
  summarise(average_measurement = mean(measurement))
cat("Simulation reliability = ")
/(true_score_variance+error_variance/J)
true_score_variance
cat("Squared Correlation between mean and true scores = ")
# This may slightly differ to above due to sampling variance
cor(df_average_lengths$average_measurement, true_scores)^2
# Fit model and calculate RMU
# I've reduced the number of MCMC chains here to two for speed - you may want more in practice!
brms_model = brm(
  measurement ~ 1 + (1 | mouse),
  data = df,
  chains = 2,
  cores = 2,
  refresh = 0,           # This hides the progress bar
  backend = "cmdstanr",  # This is to speed up model fitting using parallelization
  threads = threading(4) # This is to speed up model fitting using parallelization
)
posterior_draws = brms_model %>%
  as_draws_df() %>%
  select(starts_with("r_mouse")) %>%
  t()

gbtoolbox::reliability(posterior_draws)$hdci
For the above example, RMU is pretty similar to the expected value:
cat("Simulation reliability = ")
Simulation reliability =
/(true_score_variance+error_variance/J) true_score_variance
[1] 0.2307692
cat("Squared Correlation between mean and true scores = ")
Squared Correlation between mean and true scores =
# This may slightly differ to above due to sampling variance
cor(df_average_lengths$average_measurement, true_scores)^2
[1] 0.2414736
cat("RMU:")
RMU:
::reliability(posterior_draws)$hdci gbtoolbox
[1] "Number of subjects: 5000"
[1] "Number of posterior draws: 2000"
rmu_estimate hdci_lowerbound hdci_upperbound .width .point .interval
1 0.2493919 0.2155272 0.2852303 0.95 mean hdci
Beyond coefficient alpha
We can leverage the flexibility of brms + RMU to estimate reliability in more complex scenarios.
In this next example, we’ll simulate binary outcome data (1 = success, 0 = fail) where:
- Each subject completes between 3 and 9 trials (a varying number of measurements per subject)
- Subjects improve with practice (probability of success increases on each subsequent trial)
- We’re interested in individual differences in first-trial performance
Because subjects completed different numbers of trials, simply averaging their scores would be misleading - those who completed more trials had more opportunities to practice and improve.
Instead, we’ll use a multilevel logistic regression model to estimate each subject’s probability of success on the first trial, including trial number as a fixed covariate. We’ll code trial number (j) starting at 0 for the first trial, so the random intercept directly represents each subject’s first-trial performance (on the logit scale).
rm(list=ls())
set.seed(1)
N = 5000 # number of subjects (mice)
J = sample(3:9, size = N, replace = TRUE) # number of measurements per subject
true_score_variance = 1
error_variance = 2

df = expand.grid(j = 1:10, mouse = 1:N)

# Keep only the first J[i] trials for each mouse
for(i in 1:N){
  df = df %>%
    filter(!(mouse == i & j > J[i]))
}

prob_success_first_trial = rbeta(N, 3, 3) # probability of success on first trial

df$prob_success_first_trial = prob_success_first_trial[df$mouse] %>% qlogis()

df$y = rbinom(nrow(df), size = 1, prob = plogis(df$prob_success_first_trial + .1*df$j))

df$j = df$j - 1 # recode so that the first trial is j = 0
brms_model_logistic = brm(
  y ~ 1 + j + (1 | mouse),
  data = df,
  family = bernoulli(link = "logit"),
  chains = 4,
  cores = 4,
  backend = "cmdstanr",
  threads = threading(2)
)
Our data structure looks like this:
df %>%
  select(-prob_success_first_trial) %>%
  slice(1:10)
j mouse y
1 0 1 1
2 1 1 0
3 2 1 1
4 0 2 1
5 1 2 1
6 2 2 1
7 3 2 1
8 4 2 0
9 5 2 1
10 0 3 1
Fitted brms model summary:
summary(brms_model_logistic)
Family: bernoulli
Links: mu = logit
Formula: y ~ 1 + j + (1 | mouse)
Data: df (Number of observations: 30080)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~mouse (Number of levels: 5000)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.88 0.02 0.84 0.92 1.00 1419 2482
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.10 0.02 0.05 0.14 1.00 5038 3681
j 0.10 0.01 0.09 0.11 1.00 5333 2918
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We visualise the posterior distribution for each subject’s random intercept below.
By transforming the random intercept with the inverse logit function we can look at our posterior on the probability scale too - these values tell us the estimated probability of success on the first trial for each subject.
posterior_draws = brms_model_logistic %>%
  as_draws_df() %>%
  select(starts_with("r_mouse")) %>%
  t()
posterior_draws %>%
  t() %>% # make sure participants are on the COLUMNS for pivot_longer!
  data.frame() %>%
  `colnames<-`(1:ncol(.)) %>%
  pivot_longer(cols = everything()) %>%
  filter(as.numeric(name) < 201) %>%
  ggplot(aes(x = value, group = name)) +
  geom_density(fill = NA, color = ggplot2::alpha("black", .5)) +
  theme_bw() +
  guides(color = "none") +
  labs(
    subtitle = "Random Intercept Posterior Distributions for each subject \nLogit scale, first 200 subjects only"
  )
posterior_draws %>%
  t() %>% # make sure participants are on the COLUMNS for pivot_longer!
  data.frame() %>%
  `colnames<-`(1:ncol(.)) %>%
  pivot_longer(cols = everything()) %>%
  filter(as.numeric(name) < 201) %>%
  mutate(value = plogis(value)) %>%
  ggplot(aes(x = value, group = name)) +
  geom_density(fill = NA, color = ggplot2::alpha("black", .5)) +
  theme_bw() +
  labs(
    subtitle = "Random Intercept Posterior Distributions for each subject \nProbability scale, first 200 subjects only"
  )
To find the “true” reliability, we can calculate the squared correlation between the estimated probabilities of success for each subject (using posterior means for the random intercepts) and the known probabilities of success.
Rather than working directly with the probabilities, we will calculate the correlation on the logit scale.
posterior_means = rowMeans(posterior_draws)

cor(posterior_means, prob_success_first_trial)^2
[1] 0.4726488
gbtoolbox::reliability(posterior_draws)$hdci
[1] "Number of subjects: 5000"
[1] "Number of posterior draws: 4000"
  rmu_estimate hdci_lowerbound hdci_upperbound .width .point .interval
1    0.4739327       0.4533702       0.4949552   0.95   mean      hdci
Our RMU estimate of reliability closely matches the squared correlation between estimated and true scores - which is reassuring!
A more robust approach to find the “true” reliability would be to simulate data from two time points for each participant, and show that the correlation between estimates from each time point is equivalent to RMU. In the case that each subject’s expected score corresponds to prob_success_first_trial, either approach will yield the same result; however, this is not always the case.
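As an illustration of that idea in the simpler normal-measurement setting from the start of this tutorial (a sketch assuming N = 5000 subjects, J = 5 measurements, true score variance 3, and error variance 2), the correlation between mean scores from two independent measurement occasions estimates the reliability:

set.seed(2)
N = 5000
J = 5
true_scores = rnorm(N, mean = 10, sd = sqrt(3))

# One mean score per subject from a fresh set of measurement errors
mean_score = function() {
  errors = matrix(rnorm(N * J, mean = 0, sd = sqrt(2)), nrow = N)
  true_scores + rowMeans(errors)
}

# Correlation between two independent "time points"
cor(mean_score(), mean_score()) # should be close to 3/(3 + 2/5) = 0.882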
Appendix: Deriving The Mean Score Reliability Formula
Consider a measurement model where we measure N participants, and we measure each participant J times (or across J items). Each participant has a true score \(\theta_i\) which represents the expected measurement for participant i.
We assume that measurements from a given subject follow a normal distribution with mean (\(\theta_i\)) and variance (\(\sigma^2_{\epsilon}\)).
We assume that the distribution of true scores across the population is normally distributed with mean \(\mu\) and variance \(\sigma^2_{\theta}\).
We can describe the distribution of each measurement as follows:
\[ \begin{aligned} \overbrace{y_{ij}}^{\text{measurement } j \text{ for subject } i} &= \overbrace{\theta_i}^{\text{subject } i \text{ true score}} + \overbrace{\epsilon_{ij}}^{\text{measurement error}} \\ \epsilon_{ij} &\sim \text{Normal}(0, \overbrace{\sigma^2_{\epsilon}}^{\text{error variance}}) \\ \overbrace{\theta_1,\theta_2,\dots,\theta_N}^{\text{population true scores}} &\sim \text{Normal}(\overbrace{\mu}^{\text{population mean true score}}, \overbrace{\sigma^2_{\theta}}^{\text{true score variance}}) \end{aligned} \]
Calculating Reliability of Mean Scores
Given that we know the distribution of each measurement, the distribution of a mean of multiple independent measurements is also Normal but with a smaller variance (note that as J increases, the variance decreases):
\[ \begin{aligned} \hat{y}_i &= \frac{1}{\text{J}}\sum_{j=1}^{\text{J}} y_{ij} \\ &= \frac{1}{\text{J}}\sum_{j=1}^{\text{J}} (\theta_i + \epsilon_{ij}) \\ &= \theta_i + \sum_{j=1}^{\text{J}} \frac{\epsilon_{ij}}{\text{J}} \\ &= \theta_i + \mathcal{E}_i \\ \mathcal{E}_i &=\frac{\epsilon_{i1} + \epsilon_{i2} + \dots + \epsilon_{i\text{J}}}{\text{J}} \sim \text{Normal}(0,\frac{ \sigma^2_{\epsilon} }{\text{J}}) \end{aligned} \]
To work out the variance of the mean score for subject i, note that the variance of the sum of J independent normally distributed variables (\(\epsilon_{i1} + \epsilon_{i2} + \dots + \epsilon_{i\text{J}}\)) is equal to \(\text{J}\sigma^2_{\epsilon}\), and dividing by a constant J divides the variance by \(\text{J}^2\).
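A quick numerical check of this step (a sketch using the values from the first simulation, \(\sigma^2_{\epsilon} = 2\) and J = 5):

set.seed(3)
J = 5
sigma2_eps = 2

# Simulate many means of J independent errors and check their variance
error_means = replicate(1e5, mean(rnorm(J, mean = 0, sd = sqrt(sigma2_eps))))
var(error_means) # should be close to sigma2_eps / J = 0.4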
Reliability Formula
Reliability is the ratio of true score variance divided by the observed score variance:
\[ \begin{aligned} \text{R}(\hat{\mathbf{y}}) &= \text{R}(\mathbf{\theta}+\mathcal{E})\\ &=\frac{\text{Var}(\mathbf{\theta})}{\text{Var}(\mathbf{\theta})+\text{Var}(\mathcal{E})} \\ &= \frac{\sigma^2_{\theta}}{\sigma^2_{\theta}+\sigma^2_{\epsilon}/\text{J}} \end{aligned} \]
This formula shows that reliability increases as we add more items (J increases), because the error variance gets divided by J.
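To see this numerically, here is a small sketch using the variance components from the first simulation (\(\sigma^2_{\theta} = 3\), \(\sigma^2_{\epsilon} = 2\)):

sigma2_theta = 3
sigma2_eps   = 2

# Mean score reliability for increasing numbers of measurements per subject
sapply(c(1, 2, 5, 25, 100), function(J) sigma2_theta / (sigma2_theta + sigma2_eps / J))
# 0.600 0.750 0.882 0.974 0.993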