Tutorial: Calculating RMU reliability for a go/no-go task
Categories: reliability, cognition
Author: Giacomo Bignardi
Published: September 11, 2025
Modified: October 2, 2025
Introduction
This is a brief walk-through of how to calculate reliability using the approach (relative measurement uncertainty; RMU) described in our pre-print.
In this example, we will calculate RMU reliability for the d' parameter estimated from a go/no-go task. We will also compare our results to other methods, including test-retest, split-half and empirical reliability.
In this tutorial, I’m going to assume that you’re familiar with the basics of Bayesian modelling and brms, and focus on the reliability methods.
Load Data
We’ll use test-retest data from Hedge, Powell & Sumner’s reliability paradox paper, pooling data from studies 1 and 2, which contain different participants. Retest data was collected three weeks after initial testing. Data from the paper is openly available, downloaded from here.
In this go/no-go task, one of four letters is shown on a screen. Participants are told to press the space bar as quickly as possible if three of the four letters appear on screen (“go” condition trials), and to not press the space bar if one of the letters appears on screen (“no-go” condition trials).
| Condition | Response: Go | Response: No-Go |
|-----------|--------------|--------------------|
| Go | Hits | Misses |
| No-Go | False alarms | Correct rejections |
We will analyse the go/no-go task using a signal detection theory model to derive a measure of performance called sensitivity or d'.
The number of “go” responses, that is, trials when the participant presses the space bar, will be the outcome variable. Condition (“go” or “no-go”) will be our predictor. Participants who are better on the task should have a high probability of responding “go” in the go condition, and a low probability of responding “go” in the no-go condition.
In this example, one could calculate d' directly from each participant's hit rate (\(HR_i\); the proportion of go responses in the go condition) and false alarm rate (\(FAR_i\); the proportion of go responses in the no-go condition), as shown in Equation 1 below. In fact, this gives estimates of d' that are largely identical to the Bayesian method (see the Appendix), and one could then calculate reliability using the split-half method (see Split-Half Reliability). However, this approach won't work for more complex models, which is where RMU is particularly helpful.

\[
d'_i = \Phi^{-1}(HR_i) - \Phi^{-1}(FAR_i) \tag{1}
\]

Note that \(\Phi^{-1}\) represents the inverse cumulative distribution function of the standard normal distribution (in R, this is qnorm()). It converts probabilities between 0 and 1 into z-scores.
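For concreteness, here is a toy illustration of Equation 1 in R. The counts are invented for the example, not taken from the Hedge et al. data:

# Toy example of Equation 1: a participant with 72 hits on 75 go trials
# and 11 false alarms on 25 no-go trials
hit_rate         = 72 / 75
false_alarm_rate = 11 / 25

d_prime = qnorm(hit_rate) - qnorm(false_alarm_rate)
d_prime
# about 1.9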
To estimate RMU, we will model d' using a Bayesian random intercept and slope probit regression model. This approach will allow us to estimate uncertainty in each participant’s fitted d' value, and calculate RMU reliability.
We will contrast code condition so that condition_contrast = -1/2 for “no-go” trials, and condition_contrast = +1/2 for “go” trials.
The random intercept captures individuals’ propensity to generally respond “go” regardless of condition.
The random slope captures how well individuals shift their probability of responding “go” between conditions.
The beta coefficient for condition_contrast corresponds to the difference, on the probit (z-score) scale, in the probability of responding “go” between conditions; in other words, the average d' across participants (see the short derivation below).
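To spell out why the slope is d' (a short derivation added here for clarity), write \(\alpha_i\) for participant \(i\)'s total intercept (fixed plus random) and \(\beta_i\) for their total condition_contrast effect (fixed plus random). On the probit scale the model implies:

\[
\begin{aligned}
\Phi^{-1}(HR_i) &= \alpha_i + \tfrac{1}{2}\beta_i \quad \text{(go trials, condition\_contrast} = +\tfrac{1}{2}\text{)}\\
\Phi^{-1}(FAR_i) &= \alpha_i - \tfrac{1}{2}\beta_i \quad \text{(no-go trials, condition\_contrast} = -\tfrac{1}{2}\text{)}\\
d'_i &= \Phi^{-1}(HR_i) - \Phi^{-1}(FAR_i) = \beta_i
\end{aligned}
\]

So each participant's d' is simply their total slope for condition_contrast.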
Prepare Data for Modelling
# Aggregate trial-level data: count trials and "go" responses per participant,
# session, and condition, and add the +/- 0.5 contrast code for condition
data_long_summarised = data_long %>%
  group_by(pps, session, condition) %>%
  summarise(
    n = n(),
    n_correct = sum(correct),
    .groups = 'drop'
  ) %>%
  ungroup() %>%
  mutate(
    # condition 0 = "go" (correct response is a press), condition 2 = "no-go"
    # (correct response is withholding a press)
    n_go = ifelse(condition == 0, n_correct, n - n_correct),
    condition_contrast = dplyr::case_when(
      condition == 0 ~ .5,
      condition == 2 ~ -.5
    )
  ) %>%
  ungroup() %>%
  mutate(
    # nudge counts of exactly n or 0 by one trial, so no participant has
    # all or no "go" responses within a condition
    n_go = ifelse(n_go == n, n_go - 1, n_go),
    n_go = ifelse(n_go == 0, 1, n_go)
  ) %>%
  arrange(pps)

data_long_summarised %>%
  select(pps, session, condition_contrast, n, n_go) %>%
  slice(1:10) %>%
  knitr::kable(
    caption = "First 10 lines of our cleaned data for analysis",
    col.names = c("Unique Participant ID", "Testing Session",
                  "Condition (contrast coded)", "Number of Trials",
                  "Number of Go Responses in Condition")
  )
Instead of modelling each individual trial, we will sum the number of “go” responses in each condition for each participant.
Because we are assuming that the probability of responding ‘go’ only varies by condition and participant, we do not need the individual trial data.
For example, if we wanted to estimate the probability of a flipped bent coin landing on heads, we just need to know the number of heads and number of throws, not the value of each individual coin toss.
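A quick way to convince yourself of this in R (a small simulation added here for illustration, not part of the original analysis):

# The likelihood for p depends only on the number of "heads", not the individual flips
set.seed(1)
flips  = rbinom(100, size = 1, prob = 0.7)   # 100 individual flips of a bent coin
p_grid = seq(0.01, 0.99, by = 0.01)

# log-likelihood from the individual flips (sum of Bernoulli log-densities)
ll_flips = sapply(p_grid, function(p) sum(dbinom(flips, size = 1, prob = p, log = TRUE)))

# log-likelihood from the summed count (binomial)
ll_count = dbinom(sum(flips), size = length(flips), prob = p_grid, log = TRUE)

# The two differ only by a constant (the log binomial coefficient),
# so both are maximised at the same value of p
p_grid[which.max(ll_flips)]
p_grid[which.max(ll_count)]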
Fit Bayesian Signal Detection Models
In the next code chunk, we fit a model to each session (time point) and extract the posterior draws of each subject's d' value into two matrices, subj_draws_time_1 and subj_draws_time_2.
I will stick with brms’s default, weakly informative priors on the model parameters.
When initially running the models, we got a warning from Stan that “ESS is too low”. We can address this by increasing the number of iterations (draws) per chain to 10,000 (setting iter = 10000). Note that half of these draws are discarded as warmup iterations. Setting chains = 4 and cores = 4 tells Stan to fit 4 chains and run them in parallel across four CPU cores, which will give 20,000 post-warmup draws in total.
For calculating test-retest reliability, it makes sense to fit models at each time point (session) independently.
model_time_1 = brms::brm(
  n_go | trials(n) ~ 1 + condition_contrast + (1 + condition_contrast | pps),
  family = binomial(link = "probit"),
  data = filter(data_long_summarised, session == 1),
  cores = 4,
  chains = 4,
  iter = 10000
)

model_time_2 = brms::brm(
  n_go | trials(n) ~ 1 + condition_contrast + (1 + condition_contrast | pps),
  family = binomial(link = "probit"),
  data = filter(data_long_summarised, session == 2),
  cores = 4,
  chains = 4,
  iter = 10000
)

# Time / Session 1
# Participant random effects of condition (one row per participant, one column per draw)
subj_draws_time_1 = as_draws_df(model_time_1) %>%
  select(ends_with("condition_contrast]")) %>%
  t()

# Main effect of condition (one value per posterior draw)
cond_main_effect_1 = as_draws_df(model_time_1) %>%
  pull(b_condition_contrast) %>%
  as.numeric()

# Add each draw's main effect to that draw's random effects
# (sweep over columns so the draws stay aligned)
subj_draws_time_1 = sweep(subj_draws_time_1, 2, cond_main_effect_1, "+")

# Time / Session 2
# Participant random effects of condition
subj_draws_time_2 = as_draws_df(model_time_2) %>%
  select(ends_with("condition_contrast]")) %>%
  t()

# Main effect of condition
cond_main_effect_2 = as_draws_df(model_time_2) %>%
  pull(b_condition_contrast) %>%
  as.numeric()

# Add each draw's main effect to that draw's random effects
subj_draws_time_2 = sweep(subj_draws_time_2, 2, cond_main_effect_2, "+")

# Calculate point estimates of d' for each participant (posterior means)
# Time / Session 1
subj_est_time_1 = apply(subj_draws_time_1, 1, mean)

# Time / Session 2
subj_est_time_2 = apply(subj_draws_time_2, 1, mean)
Model results
model_time_1
Family: binomial
Links: mu = probit
Formula: n_go | trials(n) ~ 1 + condition_contrast + (1 + condition_contrast | pps)
Data: filter(data_long_summarised, session == 1) (Number of observations: 214)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Multilevel Hyperparameters:
~pps (Number of levels: 107)
Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept) 0.25 0.02 0.22 0.30 1.00
sd(condition_contrast) 0.79 0.06 0.68 0.92 1.00
cor(Intercept,condition_contrast) 0.14 0.11 -0.07 0.35 1.00
Bulk_ESS Tail_ESS
sd(Intercept) 7112 12566
sd(condition_contrast) 5352 9267
cor(Intercept,condition_contrast) 2728 5447
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.78 0.03 0.72 0.84 1.00 7196 10775
condition_contrast 3.37 0.08 3.21 3.53 1.00 3708 7440
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
model_time_2
Family: binomial
Links: mu = probit
Formula: n_go | trials(n) ~ 1 + condition_contrast + (1 + condition_contrast | pps)
Data: filter(data_long_summarised, session == 2) (Number of observations: 214)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Multilevel Hyperparameters:
~pps (Number of levels: 107)
Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept) 0.26 0.02 0.22 0.31 1.00
sd(condition_contrast) 1.00 0.08 0.87 1.16 1.00
cor(Intercept,condition_contrast) 0.33 0.10 0.13 0.51 1.00
Bulk_ESS Tail_ESS
sd(Intercept) 4188 8338
sd(condition_contrast) 2076 4731
cor(Intercept,condition_contrast) 979 2717
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.78 0.03 0.73 0.84 1.00 3159 6115
condition_contrast 3.13 0.10 2.93 3.33 1.01 1205 2961
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Since random effects are centered (i.e., have a mean of zero), in the above code I have added the posterior distribution of the main effect of condition_contrast (which represents the average d' value across participants) to the random slope posteriors. This gives us posterior distributions that have the correct scaling for d', but doesn’t affect our reliability calculations.
At this point, the procedures for estimating RMU reliability are the same regardless of the model we have used to estimate participant effects.
For each of our models fitted to time 1 and time 2 data, we have a matrix with our posterior draws (subj_draws_time_1 and subj_draws_time_2).
Each matrix has 107 rows (one row for each subject) and 20,000 columns. Each row has the posterior draws for a given subject’s d'.
Below, we show data from 4 posterior draws from the first 10 participants.
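If you want to peek at the matrix yourself, something like this will do it (the exact values depend on the MCMC run):

# First 4 posterior draws for the first 10 participants
round(subj_draws_time_1[1:10, 1:4], 2)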
Point estimates of d' for each participant could be estimated using the matrix row means (e.g., rowMeans(subj_draws_time_1)).
To visualize our uncertainty about each subject’s true d’ value, we can plot the posterior distribution for each subject’s d' value.
subj_draws_time_1 %>%
  data.frame() %>%
  rowid_to_column(var = "pps") %>%
  pivot_longer(cols = !matches("pps")) %>%
  ggplot(aes(x = value, group = pps)) +
  geom_density(alpha = .2, fill = "light blue") +
  labs(title = "Session 1: Posterior Distribution for each subject's Go/No-Go d' (sensitivity)")

subj_draws_time_2 %>%
  data.frame() %>%
  rowid_to_column(var = "pps") %>%
  pivot_longer(cols = !matches("pps")) %>%
  ggplot(aes(x = value, group = pps)) +
  geom_density(alpha = .2, fill = "light blue") +
  labs(title = "Session 2: Posterior Distribution for each subject's Go/No-Go d' (sensitivity)")
Just from eyeballing the posteriors, it is evident that there is a lot of variability between participants and that each participant's estimate is quite precise. This suggests good reliability!

As expected, most of the mass of each density lies above 0. A d' below zero would indicate that a participant was more likely to press the space bar (i.e., respond "go") on no-go trials than on go trials, which would be odd.
Reliability Estimation Methods
In the following sections, we will outline a few different methods of calculating reliability and compare their results. We'll estimate RMU, test-retest, split-half, and empirical reliability.
RMU
Theory
RMU estimates reliability using the posterior distributions for each subject’s true d' value.
To estimate RMU, we take pairs of draws from each subject’s posterior and calculate the correlation between these draws to get a single “draw” of RMU reliability. We repeat this process many times to build up a posterior distribution for reliability, from which we can calculate point estimates and credible intervals.
RMU reliability estimation process
One can think of RMU as a way of simulating reliability by drawing test-retest estimates from the posterior distributions. It is closely related to the empirical reliability estimator, which estimates reliability by comparing the variance of the posterior means to the average posterior variance across all subjects.
RMU reliability will increase when subjects’ posterior means have greater variance (more between-subject differences), or when subjects’ posterior distributions become narrower (more precise individual estimates).
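To make that description concrete, here is a rough sketch of the resampling logic in plain R. This is my paraphrase of the procedure described above, not the gbtoolbox implementation used in the next section, so treat it as illustrative only:

# Illustrative sketch of the RMU logic (not the gbtoolbox code). Each replicate
# picks a pair of posterior draws, which act as simulated "test" and "retest"
# scores, and correlates them across subjects.
rmu_sketch = function(draws_matrix, n_reps = 2000) {
  n_draws = ncol(draws_matrix)
  replicate(n_reps, {
    idx = sample(n_draws, 2)                             # a pair of posterior draws
    cor(draws_matrix[, idx[1]], draws_matrix[, idx[2]])  # one "draw" of reliability
  })
}

# Example usage: approximate reliability posterior for session 1
# rmu_posterior_sketch = rmu_sketch(subj_draws_time_1)
# mean(rmu_posterior_sketch); quantile(rmu_posterior_sketch, c(.025, .975))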
Practice
The code to calculate RMU using the posterior draw matrix is relatively simple.
You can either install my gbtoolbox package of miscellaneous R functions, available on GitHub, or load the reliability function directly from GitHub. Both methods are outlined below:
The analyses indicate that reliability at the first time point was 89% (95% Credible Interval = 86%-92%), and at the second time point reliability was 93% (95% Credible Interval = 91%-95%).
Test-Retest Reliability
The test-retest reliability is a little lower than these estimates (r = 0.73):
cor.test(subj_est_time_1, subj_est_time_2)
Pearson's product-moment correlation
data: subj_est_time_1 and subj_est_time_2
t = 10.873, df = 105, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6241361 0.8062195
sample estimates:
cor
0.7277571
If there were no changes in individuals’ latent (true) d' values across time points, the population test-retest reliability and RMU would be identical. Because some change over time is inevitable, test-retest reliability will usually be a bit lower than reliability estimates derived from a single time point.
Our estimate of split-half reliability (see below) is similar to RMU, because it is not affected by changes in participants' performance over time.
Calculating a posterior for test-retest reliability
While cor.test gives us a confidence interval, if we want a posterior distribution for this correlation we can use the following approach (thanks to Solomon Kurz for the code).
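The idea is to fit an intercept-only bivariate normal model in brms, whose residual correlation is the Pearson correlation between the two sets of estimates, with a full posterior. A minimal sketch of that approach (my reconstruction; the original chunk may differ in priors and settings):

library(brms)

# Sketch: intercept-only bivariate model; the residual correlation between the
# two outcomes is the test-retest correlation
test_retest_model = brm(
  bf(mvbind(subj_est_time_2, subj_est_time_1) ~ 1) + set_rescor(TRUE),
  data = data.frame(subj_est_time_1, subj_est_time_2),
  chains = 4,
  cores = 4
)

# Posterior draws; the correlation is the rescor__subjesttime2__subjesttime1 column
test_retest_cor_posterior = as_draws_df(test_retest_model)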
We can then visually compare test-retest reliability with our RMU estimates:
bind_rows(
  data.frame(par = "Test-Retest Posterior",
             val = test_retest_cor_posterior$rescor__subjesttime2__subjesttime1),
  data.frame(par = "RMU Session 1", val = rmu_time1$reliability_posterior),
  data.frame(par = "RMU Session 2", val = rmu_time2$reliability_posterior)
) %>%
  ggplot(aes(x = val, group = par, col = par, fill = par)) +
  geom_density(alpha = .3) +
  theme_bw() +
  theme(
    legend.position = c(.1, .9),
    legend.justification = c(0, 1)
  ) +
  labs(title = "Posterior distribution for reliability estimates")
Split-Half Reliability
If we want to avoid model-fitting, another way of calculating reliability is to use split-half reliability.
To calculate split-half reliability, I created two datasets with odd and even trials respectively. I then calculated d' using the transformed hit and false alarm rates, and calculated the correlation between split-half estimates at each time point.
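The split-half code isn't shown in full here; the sketch below is my reconstruction of that recipe, and the trial-level column names (trial, correct, condition) are assumptions about data_long rather than checked against the Hedge et al. files:

# Sketch of the split-half recipe (column names are assumptions):
# 1) tag odd vs even trials; 2) compute d' per half from hit / false-alarm rates;
# 3) correlate the halves and apply the Spearman-Brown correction
split_half_d = data_long %>%
  mutate(
    half = ifelse(trial %% 2 == 1, "odd", "even"),
    go_response = ifelse(condition == 0, correct, 1 - correct)  # 1 = pressed space
  ) %>%
  group_by(pps, session, half) %>%
  summarise(
    hit_rate         = mean(go_response[condition == 0]),
    false_alarm_rate = mean(go_response[condition == 2]),
    d_prime = qnorm(hit_rate) - qnorm(false_alarm_rate),  # rates of exactly 0 or 1 would need adjusting
    .groups = "drop"
  )

# Session 1: split-half correlation and Spearman-Brown correction
split_half_t1 = split_half_d %>%
  filter(session == 1) %>%
  pivot_wider(id_cols = pps, names_from = half, values_from = d_prime)

r_half_t1 = cor(split_half_t1$odd, split_half_t1$even)
sb_reliability_t1 = 2 * r_half_t1 / (1 + r_half_t1)  # Spearman-Brown corrected reliability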
Split-Half Correlations and Spearman-Brown Corrected Split-Half Reliabilities
| Measure | Estimate |
|---------|----------|
| Split-Half Correlation (T1) | 0.8944 |
| Split-Half Correlation (T2) | 0.9170 |
| Split-Half Reliability (T1) | 0.9442 |
| Split-Half Reliability (T2) | 0.9567 |
Empirical Reliability
Finally, the empirical reliability method involves comparing the variance of the posterior means (\(\hat{\boldsymbol{\theta}}\)) with the average of the posterior variances across subjects (\(\hat{\boldsymbol{V}}_i\)).

Empirical reliability can also be used with maximum likelihood estimates of d' and their error variances, which can make it appealing to those less familiar with Bayesian methods. Instead of brms/Stan, you could use lme4 or other packages to fit the signal detection theory model via maximum likelihood. However, I'm not aware of any methods for getting uncertainty intervals with empirical reliability, beyond perhaps bootstrapping.
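As a concrete sketch, the standard "EAP reliability" formula, \(\mathrm{Var}(\hat{\theta}) / \big(\mathrm{Var}(\hat{\theta}) + \bar{V}\big)\), can be computed straight from the draws matrix. This helper is my own addition, so check it against the definition in the pre-print before relying on it:

# Empirical reliability sketch: variance of the posterior means divided by
# (variance of the posterior means + mean posterior variance across subjects)
empirical_reliability = function(draws_matrix) {
  post_means = rowMeans(draws_matrix)        # one posterior mean per subject
  post_vars  = apply(draws_matrix, 1, var)   # one posterior variance per subject
  var(post_means) / (var(post_means) + mean(post_vars))
}

# empirical_reliability(subj_draws_time_1)
# empirical_reliability(subj_draws_time_2)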
Comparison of RMU and Empirical Reliability Point Estimates
| Method | Reliability |
|--------|-------------|
| RMU Session 1 | 0.8933 |
| RMU Session 2 | 0.9308 |
| Empirical Session 1 | 0.8840 |
| Empirical Session 2 | 0.9202 |
Appendix: Bayesian vs. Simple d’ Estimates
If we just calculate d' using the transformed hit rate and false alarm rate, using the formula shown in Equation 1, does this give us estimates that are very different, or less reliable?
In this case the answer seems to be no: the estimates are very similar regardless of which approach we take:
data_plot %>%
  ggplot(aes(x = Simple.d..T2, y = Bayes.d..T2)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  theme_bw() +
  annotate("text", x = 1, y = 4.5,
           label = paste("Correlation: ",
                         round(cor(two_stage_estimate$d_prime_t2, subj_est_time_2), 4)),
           hjust = 0, vjust = 1, size = 8) +
  labs(
    x = "Simple d' T2",
    y = "Bayes d' T2",
    title = "Correlation between Bayesian d' and simple d' estimates",
    subtitle = "Simple estimates = qnorm(hit rate) - qnorm(false alarm rate)"
  )
This example highlights that there isn't anything magical about Bayesian or fancier statistical methods; they don't provide better estimates by default.

This doesn't mean that the regression approach to calculating d' is pointless. As we have shown, we can easily calculate reliability from our Bayesian regression model, even without retest data. We can also calculate a credible interval for any participant's true d' score, handle more complex signal detection theory models within the regression framework, control for time of day or practice effects, and more.