Technical Manual

Stroop Test

Scientific rationale, metric formulas, and clinical interpretation guide for health professionals and researchers.

Section 1

Overview

The Stroop test is one of the most widely used instruments in cognitive neuropsychology. It was first described by John Ridley Stroop in 1935 and has since generated over 700 empirical studies, making it one of the most replicated paradigms in all of experimental psychology (MacLeod, 1991).

The test exploits a fundamental characteristic of skilled readers: word reading is automatic and difficult to suppress, while colour naming requires controlled attentional processing. When a colour word is printed in an incongruent ink colour (e.g., the word RED printed in blue ink), the automatic word-reading response competes with the slower colour-naming response, generating measurable interference in both reaction time and accuracy.

This interference — the Stroop Effect — is considered a robust index of selective attention, response inhibition, and cognitive control. It reflects the activity of frontal executive systems, particularly the anterior cingulate cortex and dorsolateral prefrontal cortex, which mediate conflict monitoring and resolution.

Key insight: The magnitude of Stroop interference is not fixed; it is sensitive to cognitive load, fatigue, age, neurological condition, and practice. This variability is precisely what makes the test clinically informative.

Section 2

Conditions and Stimuli

This implementation uses up to three conditions depending on the selected mode. All stimuli are colour words displayed in a specific ink colour; the participant must identify the ink colour, ignoring the word meaning.

Congruent

The word meaning and the ink colour match. Example: the word RED printed in red. No conflict exists between automatic reading and colour naming; this condition serves as the cognitive baseline.

Incongruent

The word meaning and the ink colour differ. Example: the word RED printed in blue. This is the conflict condition — the automatic reading response must be suppressed in favour of the correct colour response.

Neutral

A word with no colour connotation (e.g., TABLE, HOUSE) is printed in a colour. Example: the word TABLE printed in any of the ink colours. The word carries no semantic interference for colour naming. This condition allows estimation of baseline processing time without the congruency benefit: the gap between Neutral and Congruent RT estimates the facilitation effect.

Note on accuracy in the Neutral condition: Because the word has no colour meaning, there is no "correct" response driven by word content — any colour name that matches the ink is correct. Accuracy is therefore not computed as a conflict metric for Neutral stimuli.
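
The three conditions can be derived from the word and the ink colour alone. The following is a minimal sketch, not the tool's actual code: the function name `classify_condition` and the four-colour set `COLOUR_WORDS` are assumptions made for illustration (the manual mentions four colour buttons but does not list the colours).

```python
# Hypothetical colour set; the real tool's palette is not documented here.
COLOUR_WORDS = {"RED", "BLUE", "GREEN", "YELLOW"}


def classify_condition(word: str, ink: str) -> str:
    """Return 'congruent', 'incongruent', or 'neutral' for one stimulus."""
    word, ink = word.upper(), ink.upper()
    if word not in COLOUR_WORDS:
        # Words such as TABLE or HOUSE carry no colour meaning.
        return "neutral"
    return "congruent" if word == ink else "incongruent"
```

Under these assumptions, `classify_condition("RED", "blue")` yields the conflict condition, while `classify_condition("TABLE", "green")` yields the neutral one.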

Trial Structure

Each trial follows a fixed sequence: (1) a fixation cross (+) appears at the centre of the screen for the duration of the inter-trial interval (400 ms by default), directing the participant's gaze to the stimulus location; (2) the fixation cross is replaced by the colour word stimulus, which remains visible until a response is made or the time limit is reached; (3) after each response or timeout, the screen clears briefly before the next fixation cross appears. The fixation cross is displayed in a neutral grey and carries no colour connotation — its sole function is to direct spatial attention.

Response Input Modes

Three input methods are available, selected before starting the test:

  • Keyboard shortcuts (recommended for clinical assessment): assign any key to each colour in the Keyboard Shortcuts panel. Enable Keyboard-only mode to hide the colour buttons entirely — this removes the visual search component from the response action and brings the paradigm closer to validated keyboard-based Stroop protocols.
  • Labelled colour buttons (desktop, default): four buttons labelled with the colour names appear below the stimulus. The participant clicks the button matching the ink colour. This adds a visual search component not present in standard Stroop paradigms and may slightly inflate RT.
  • Colour squares (touch devices): on touchscreen devices, buttons are displayed as unlabelled coloured squares, removing the word-reading confound from button labels while preserving direct touch interaction.
Comparability note: RT values obtained with keyboard input, labelled buttons, and touch squares are not directly comparable. For serial monitoring within the same patient, maintain a consistent input method across sessions.

Section 3

Standard Report — Formulas

The standard report is generated immediately after each session. All calculations use only trials from the current session, with no outlier removal applied.

Mean Reaction Time (RT)

For each condition, the mean RT is computed across all trials with a valid (non-timeout) response. Timed-out trials are excluded from RT but counted as errors in the accuracy calculation.

Formula
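$$\overline{RT}_{\mathrm{cond}} = \frac{\sum_{i=1}^{n_{\mathrm{valid}}} RT_i}{n_{\mathrm{valid}}}$$
Where n_valid = trials with a recorded response (excluding timeouts). RT is measured in milliseconds from stimulus onset to the participant's response (key press, click, or touch, depending on the input mode).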
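
As a rough illustration of the mean RT rule, the sketch below averages only the trials that have a response. Encoding a timeout as `None` and the name `mean_rt` are assumptions of this example, not details taken from the tool.

```python
from statistics import mean
from typing import Optional, Sequence


def mean_rt(rts_ms: Sequence[Optional[float]]) -> Optional[float]:
    """Mean RT in ms over valid trials; None marks a timeout (assumed encoding)."""
    valid = [rt for rt in rts_ms if rt is not None]
    # With no valid responses there is nothing to average.
    return mean(valid) if valid else None
```

For example, `mean_rt([600, None, 700, 800])` averages the three recorded responses and ignores the timeout.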

Accuracy

Accuracy is the proportion of trials in which the participant selected the correct ink colour, expressed as a percentage. Timeout trials count as errors.

Formula
$$\mathrm{Acc}_{\mathrm{cond}} = \frac{n_{\mathrm{correct}}}{n_{\mathrm{total}}} \times 100$$
n_total includes all trials in the condition, including timeouts. A timeout is treated as an incorrect response.

The Overall Accuracy shown in the report is computed from Congruent and Incongruent trials only — the Neutral condition is excluded. Because the Neutral condition serves exclusively as an RT baseline, its accuracy is not clinically meaningful for the interference assessment and is therefore not displayed or included in the overall figure.

Formula
$$\mathrm{Acc}_{\mathrm{overall}} = \frac{n_{\mathrm{correct,Con}} + n_{\mathrm{correct,Inc}}}{n_{\mathrm{total,Con}} + n_{\mathrm{total,Inc}}} \times 100$$
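
The two accuracy figures can be sketched as follows. This is an illustration under stated assumptions: each trial is reduced to a boolean (a timeout is passed in as `False`), and the function names and the condition keys `"congruent"`, `"incongruent"`, and `"neutral"` are hypothetical.

```python
from typing import Dict, Sequence


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Percentage of correct trials; timeouts are represented as False."""
    if not correct_flags:
        raise ValueError("accuracy needs at least one trial")
    return 100.0 * sum(correct_flags) / len(correct_flags)


def overall_accuracy(by_condition: Dict[str, Sequence[bool]]) -> float:
    """Pool Congruent and Incongruent trials; Neutral trials are left out."""
    pooled = [
        flag
        for cond in ("congruent", "incongruent")
        for flag in by_condition.get(cond, [])
    ]
    return accuracy(pooled)
```

Pooling the trials before dividing matches the formula above: the overall figure is a ratio of summed counts, not an average of the two per-condition percentages.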

Stroop Interference

Interference is the RT difference between the Incongruent and Congruent conditions. It quantifies the cost of resolving word-colour conflict, expressed in milliseconds. A positive value indicates the expected direction (incongruent slower than congruent).

Formula
$$\mathrm{Interference} = \overline{RT}_{\mathrm{Incongruent}} - \overline{RT}_{\mathrm{Congruent}}$$
Unit: milliseconds (ms). Negative values (congruent slower) indicate response set effects or atypical performance patterns.
Why RT difference and not ratio? Both approaches exist in the literature. The raw RT difference is more intuitive and widely reported in clinical settings. The ratio method (Incongruent / Congruent) offers some correction for baseline speed but can distort comparisons when congruent RT varies markedly between participants.
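
To make the difference-versus-ratio point concrete, here is a small sketch. The function names are hypothetical, and the RT values used in the note below are invented for illustration only.

```python
def interference_ms(rt_incongruent: float, rt_congruent: float) -> float:
    """Raw Stroop interference in ms (the metric reported by the tool)."""
    return rt_incongruent - rt_congruent


def interference_ratio(rt_incongruent: float, rt_congruent: float) -> float:
    """Ratio alternative: incongruent RT relative to congruent RT."""
    return rt_incongruent / rt_congruent
```

Take two hypothetical participants, one with mean RTs of 600 ms (congruent) and 690 ms (incongruent), the other with 900 ms and 990 ms. Both have the same 90 ms difference, but their ratios are 1.15 and 1.10, so the two methods can rank the same pair differently when baseline speed differs.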

Coefficient of Variation (CV)

The Coefficient of Variation (CV) quantifies response consistency relative to the mean reaction time. It expresses the standard deviation as a percentage of the mean, allowing comparison of variability across conditions and across sessions regardless of differences in baseline speed.

Formula
CV = (SD / RT̄) × 100%
SD is the sample standard deviation (Bessel-corrected, n − 1 denominator) computed from all valid response times in the condition. CV is reported for the Congruent and Incongruent conditions only and is stored in the Summary CSV for longitudinal tracking.

A lower CV indicates more consistent, stable responses. A higher CV indicates greater trial-to-trial variability, which may reflect attentional instability, fatigue, or inconsistent strategic control — independent of mean RT. In the Historical Comparison table, CV changes are colour-coded in green (improvement) when the value decreases relative to the historical reference, because lower variability is the desired direction.
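
A minimal sketch of the CV calculation, using the standard library's Bessel-corrected `statistics.stdev`. The function name is an assumption, and the input is assumed to contain only valid (non-timeout) RTs for one condition.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(valid_rts_ms: Sequence[float]) -> float:
    """CV in percent: sample SD (n - 1 denominator) divided by mean RT."""
    if len(valid_rts_ms) < 2:
        raise ValueError("the sample SD needs at least two valid RTs")
    return 100.0 * stdev(valid_rts_ms) / mean(valid_rts_ms)
```

For RTs of 500, 600, and 700 ms the mean is 600 ms and the sample SD is 100 ms, giving a CV of about 16.7%.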

Clinical Report — How It Works

The Clinical Report is a printable PDF document that consolidates the current session's results and, when historical data is available, a longitudinal summary of all previous sessions. It can be generated in two ways:

  • After a live session: complete the test, then load a Summary CSV in the results screen and click Generate Report. The report will include both the current session and the historical series from the CSV.
  • From the initial screen: load a Summary CSV under Generate report from existing data and click Clinical Report. The last row of the CSV is treated as the current session; all preceding rows form the historical baseline.

Historical Comparison Table

The comparison table shows four metrics. For each metric, three values appear side by side: the current session value, the historical reference, and the change between them. The historical reference is computed from all sessions in the CSV excluding the current one. This prevents a circular comparison in which today's result would influence its own reference baseline, a standard precaution in serial performance monitoring.

Formula
$$\mathrm{Historical\ Ref} = \frac{\sum_{s=1}^{n_{\mathrm{prev}}} x_s}{n_{\mathrm{prev}}}$$
x_s = value of the metric in previous session s; n_prev = number of sessions before the current one. The current session is always excluded from this calculation.
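
The reference calculation can be sketched as below. This is an illustration only: the function name, the oldest-to-newest ordering of the input, and the sign convention for the change (current minus reference) are assumptions. The label choice follows the rule the manual gives for one versus several previous sessions.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def historical_reference(
    session_values: Sequence[float],
) -> Optional[Tuple[str, float, float]]:
    """Return (label, reference, change) for one metric.

    session_values is ordered oldest to newest; the last entry is the
    current session and is excluded from the reference.
    """
    *previous, current = session_values
    if not previous:
        return None  # no history, so there is nothing to compare against
    label = "Historical Value" if len(previous) == 1 else "Historical Avg"
    reference = mean(previous)
    return label, reference, current - reference
```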

The column label adapts to the amount of data available:

  • 1 previous session: labelled Historical Value, because the mean of a single value is that value itself; calling it an average would be misleading.
  • 2 or more previous sessions: labelled Historical Avg, the arithmetic mean of all preceding sessions.

Section 4

Exclusion Criteria — Detailed Report

The Detailed Report applies a three-pass exclusion pipeline before computing RT-based statistics. This pipeline follows standard practices in RT research to remove trials that are not representative of genuine cognitive processing. Exclusion thresholds are configurable by the user.

Important: Exclusion criteria apply only to the Detailed Report. The Standard Report uses all valid trials without filtering.

| Pass | Criterion | Default | Excludes from | Rationale |
| --- | --- | --- | --- | --- |
| 1 | Anticipatory response | RT < 200 ms | RT and Accuracy | A response under 200 ms cannot reflect genuine colour identification: it precedes the time required for visual processing and decision-making. It is most likely a pre-motor reflex or key-press accident, not a cognitive response. |
| 2 | Lapse / disengagement | RT > 2000 ms | RT only | Very long RTs typically reflect attentional lapses, momentary disengagement, or distraction rather than genuine processing difficulty. Including them inflates mean RT and increases variance. The response is still counted in accuracy because a decision was eventually made. |
| 3 | Statistical outlier (per condition) | \|RT − M\| > 2.5 × SD | RT only | Trials whose RT deviates more than 2.5 standard deviations from the condition mean are flagged as statistical outliers. This pass operates within each condition separately to preserve genuine between-condition differences. It is applied after Pass 2 to avoid outlier inflation from extreme values. |

SD Outlier Formula

Formula
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (RT_i - \overline{RT})^2}{n - 1}}$$
Sample standard deviation (Bessel-corrected, n − 1 denominator). A trial is excluded when |RT_i − RT̄| > k × SD, where k is the configurable multiplier (default: 2.5).
Outlier condition
$$\text{Exclude if } |RT_i - \overline{RT}_{\mathrm{cond}}| > k \times SD_{\mathrm{cond}}$$
Computed independently for Congruent, Incongruent, and Neutral conditions. Using a per-condition mean and SD avoids cross-condition confounding — a long incongruent RT should not be flagged simply because it is longer than the congruent mean.

The 2.5 SD threshold is widely used in RT research as a balance between preserving statistical power and removing artifactual observations (Ratcliff, 1993). More conservative thresholds (3.0 SD) retain more trials at the cost of greater outlier influence; more liberal thresholds (2.0 SD) remove more data but may exclude legitimate slow responses.
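
The three passes can be sketched roughly as follows. This is a simplified illustration, not the tool's implementation: the `Trial` record, the function name, and the reason labels are hypothetical, the SD rule is applied once rather than iteratively, and the separate handling of accuracy (Pass 1 also removes the trial from accuracy, Passes 2 and 3 do not) is left out.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, List, Tuple


@dataclass
class Trial:
    number: int
    condition: str
    rt_ms: float


def apply_exclusions(
    trials: List[Trial], min_rt: float = 200.0, max_rt: float = 2000.0, k: float = 2.5
) -> Tuple[List[Trial], List[Tuple[Trial, str]]]:
    """Return (trials kept for RT statistics, excluded trials with reasons)."""
    survivors: List[Trial] = []
    excluded: List[Tuple[Trial, str]] = []
    # Passes 1 and 2: absolute thresholds.
    for t in trials:
        if t.rt_ms < min_rt:
            excluded.append((t, "anticipatory"))
        elif t.rt_ms > max_rt:
            excluded.append((t, "lapse"))
        else:
            survivors.append(t)
    # Pass 3: k x SD rule, computed separately within each condition.
    by_condition: Dict[str, List[Trial]] = {}
    for t in survivors:
        by_condition.setdefault(t.condition, []).append(t)
    kept: List[Trial] = []
    for cond_trials in by_condition.values():
        rts = [t.rt_ms for t in cond_trials]
        if len(rts) < 2:  # SD is undefined for a single trial
            kept.extend(cond_trials)
            continue
        m, sd = mean(rts), stdev(rts)
        for t in cond_trials:
            if abs(t.rt_ms - m) > k * sd:
                excluded.append((t, "outlier"))
            else:
                kept.append(t)
    kept.sort(key=lambda t: t.number)
    return kept, excluded
```

Running the absolute thresholds first matters: a 2500 ms lapse left in the data would raise the condition mean and SD and could mask a genuine outlier in Pass 3.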

Section 5

Detailed Report — Analyses and Formulas

The Detailed Report is generated from a Trial Data CSV file and provides six complementary analyses. All analyses operate on the trial-level data after applying the three-pass exclusion pipeline described in Section 4.

5.1 — RT Distribution (Boxplot)

A box-and-whisker plot is displayed for each condition, allowing visual comparison of RT spread and central tendency. The implementation uses the Tukey method (Tukey, 1977).

Formula
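$$Q_1 = \text{25th percentile of } RT_{\mathrm{cond}}$$
$$Q_2 = \text{50th percentile (median)}$$
$$Q_3 = \text{75th percentile}$$
$$IQR = Q_3 - Q_1$$
$$\mathrm{Whisker}_{\mathrm{low}} = Q_1 - 1.5 \times IQR$$
$$\mathrm{Whisker}_{\mathrm{high}} = Q_3 + 1.5 \times IQR$$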
Values outside the whisker range are plotted individually as outlier points. The box spans Q1 to Q3; the median line divides the box. Reported statistics (mean, SD, min, max) are based on the post-exclusion dataset.
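
The boxplot statistics can be sketched with the standard library. The manual does not say which percentile interpolation the tool uses, so the `method="inclusive"` setting of `statistics.quantiles` is an assumption, as are the function name and the dictionary keys.

```python
from statistics import quantiles
from typing import Dict, List, Sequence, Union


def tukey_summary(rts_ms: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Quartiles, 1.5 x IQR whisker limits, and points outside them."""
    # Assumed interpolation; other percentile methods give slightly
    # different quartiles on small samples.
    q1, q2, q3 = quantiles(rts_ms, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [rt for rt in rts_ms if rt < low or rt > high]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whisker_low": low, "whisker_high": high, "outliers": outliers,
    }
```

With the made-up RTs 400, 450, 500, 550, 600, and 1200 ms, the quartiles under this method are 462.5, 525, and 587.5 ms, the whisker limits are 275 and 775 ms, and only the 1200 ms trial is plotted as an outlier point.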

5.2 — Learning Curve

The learning curve plots individual trial RTs over the sequence of trials, overlaid with a moving average to reveal practice or fatigue effects across the session.

Formula
$$MA(i) = \frac{1}{w} \sum_{j=i-2}^{i+2} RT_j$$
Centred moving average with window w = 5. For trials near the edges (i < 3 or i > N − 2, counting trials from 1), only the available neighbours are used and w is reduced accordingly. Excluded trials are omitted from the sequence but do not break the trial index.

A downward trend in the moving average over time indicates a practice effect (RT improvement). An upward trend at the end of the session may indicate cognitive fatigue. A flat curve suggests stable performance throughout.
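
A minimal sketch of a centred moving average with shrinking edge windows, as described above. The function name and the `half_window` parameter are assumptions; a half-window of 2 corresponds to w = 5.

```python
from typing import List, Sequence


def centred_moving_average(rts_ms: Sequence[float], half_window: int = 2) -> List[float]:
    """Centred moving average; near the edges the window shrinks to fit."""
    n = len(rts_ms)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)  # slice end is exclusive
        window = rts_ms[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

In a six-trial sequence, the first and last points average three trials, the second and fifth average four, and only the two middle points use the full five-trial window.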

5.3 — Post-Error Slowing (PES)

Post-error slowing is a well-established cognitive phenomenon in which participants are slower on the trial immediately following an error compared to trials following a correct response. It is interpreted as reflecting error-monitoring processes and adaptive response adjustment (Rabbitt & Rodgers, 1977).

Formula
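$$\overline{RT}_{\text{post-error}} = \text{mean RT of all trials immediately after an error}$$
$$\overline{RT}_{\text{post-correct}} = \text{mean RT of all trials immediately after a correct response}$$
$$PES = \overline{RT}_{\text{post-error}} - \overline{RT}_{\text{post-correct}}$$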
A positive PES value (post-error RT > post-correct RT) is the expected adaptive pattern, indicating active error monitoring. A near-zero or negative PES may suggest reduced error awareness or impulsive responding.
Reliability note: PES estimates require a minimum number of error trials to be stable. With very few errors (fewer than ~5), the PES value should be interpreted with caution as it is based on a very small sample.
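
The PES formula can be sketched as below. This follows the manual's wording literally (every trial after an error counts, whether or not that trial was itself correct); the function name and the parallel-list input are assumptions, and timeouts are assumed to have been removed beforehand.

```python
from statistics import mean
from typing import Optional, Sequence


def post_error_slowing(rts_ms: Sequence[float], correct: Sequence[bool]) -> Optional[float]:
    """PES in ms: mean RT after errors minus mean RT after correct trials."""
    post_error = [rts_ms[i] for i in range(1, len(rts_ms)) if not correct[i - 1]]
    post_correct = [rts_ms[i] for i in range(1, len(rts_ms)) if correct[i - 1]]
    if not post_error or not post_correct:
        return None  # PES is undefined without both kinds of trials
    return mean(post_error) - mean(post_correct)
```

The first trial of the session has no predecessor, so it contributes to neither mean.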

5.4 — Block Analysis

The session is divided into four consecutive temporal blocks of equal size (the last block may be shorter), and mean RT and accuracy are computed per block. This allows detection of practice, fatigue, or attentional fluctuation effects across the session.

Formula
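$$\text{Block size} = \left\lceil \frac{N_{\mathrm{total}}}{4} \right\rceil$$
$$\overline{RT}_{\text{block } k} = \text{mean RT of trials in block } k$$
$$\mathrm{Acc}_{\text{block } k} = \frac{n_{\text{correct in block } k}}{n_{\text{block } k}} \times 100$$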
⌈ ⌉ denotes ceiling division. The last block may contain fewer trials if N_total is not divisible by 4. Dual axes are used in the chart: RT on the left axis (bars), accuracy on the right axis (line).
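
A sketch of the block split under simple assumptions: every trial has a valid RT, trials are in presentation order, and the function name and output keys are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def block_analysis(
    rts_ms: Sequence[float], correct: Sequence[bool], n_blocks: int = 4
) -> List[Dict[str, float]]:
    """Mean RT and accuracy per temporal block, using a ceiling block size."""
    n = len(rts_ms)
    size = math.ceil(n / n_blocks)
    blocks = []
    for k in range(n_blocks):
        start, end = k * size, min((k + 1) * size, n)
        if start >= end:
            break  # fewer trials than blocks can fill
        blocks.append({
            "n": end - start,
            "mean_rt": mean(rts_ms[start:end]),
            "accuracy": 100.0 * sum(correct[start:end]) / (end - start),
        })
    return blocks
```

With 10 trials the block size is 3, so the blocks hold 3, 3, 3, and 1 trials. Very short sessions can leave the final block with a single trial, or with none at all, so per-block figures from small sessions deserve caution.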

5.5 — Error Analysis

Errors are grouped by word-ink combination. For each unique pair, the error rate is computed and the combinations with the highest rates are reported. This reveals which specific stimuli are most cognitively demanding, which can reflect lexical frequency effects, colour-name similarity, or individual semantic associations.

Formula
$$\text{Error rate}(\text{word}, \text{ink}) = \frac{n_{\mathrm{errors}}}{n_{\mathrm{presentations}}} \times 100$$
Up to 12 word-ink pairs with the highest error rates are displayed. Pairs with zero errors are not shown.
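
The per-pair error rates can be sketched as follows. The function name and the `(word, ink, correct)` tuple layout are assumptions for illustration; how the tool breaks ties between pairs with equal rates is not documented, so this sketch simply keeps the order in which the pairs first produced an error.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def top_error_pairs(
    trials: Iterable[Tuple[str, str, bool]], limit: int = 12
) -> List[Tuple[Pair, float]]:
    """Error rate (%) per word-ink pair, highest first, zero-error pairs omitted."""
    shown: Counter = Counter()
    errors: Counter = Counter()
    for word, ink, correct in trials:
        shown[(word, ink)] += 1
        if not correct:
            errors[(word, ink)] += 1
    # Only pairs with at least one error appear in `errors`.
    rates = {pair: 100.0 * errors[pair] / shown[pair] for pair in errors}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:limit]
```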

5.6 — Excluded Trials

All trials removed by the three-pass exclusion pipeline are listed in full, showing the trial number, condition, RT, and the exclusion reason. This transparency allows the clinician or researcher to review every data-quality decision made before analysis.

Section 6

Clinical Interpretation

Interference Score

There is no universally standardised normative range for Stroop interference, as values vary substantially across age, education, and task design. The following descriptive thresholds are used in this tool as a general orientation for clinical screening, not as diagnostic cutoffs:

Minimal (<30 ms) | Moderate (30–100 ms) | Marked (>100 ms)
  • <30 ms: Minimal interference — very fast conflict resolution, possibly due to practice, very high attention, or reduced automaticity of reading.
  • 30–100 ms: Moderate interference — the typical range for healthy adults in standard conditions.
  • >100 ms: Marked interference — warrants attention; may reflect reduced cognitive control, fatigue, anxiety, or clinical factors.
Clinical caveat: These categories are descriptive guides, not diagnostic thresholds. Interference magnitude must be interpreted alongside accuracy (a low-interference, low-accuracy profile suggests a speed-accuracy trade-off), in the context of the individual's baseline, and alongside other clinical information.
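
The descriptive bands can be expressed as a small lookup. This is a sketch under assumptions: the function name is hypothetical, and giving negative values their own label is a choice made here because the manual discusses them separately; how the tool itself labels negative scores is not stated.

```python
def interference_category(interference_ms: float) -> str:
    """Map an interference score (ms) to the manual's descriptive bands."""
    if interference_ms < 0:
        # Assumption: reversed Stroop direction is flagged, not called "Minimal".
        return "Negative (atypical)"
    if interference_ms < 30:
        return "Minimal"
    if interference_ms <= 100:
        return "Moderate"
    return "Marked"
```

The boundaries follow the list above: exactly 30 ms and exactly 100 ms both fall in the Moderate band.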

Negative Interference Values

When interference is negative (e.g., −42 ms), the congruent condition was slower than the incongruent condition — a reversal of the expected Stroop direction. This is atypical in healthy adults and does not indicate superior cognitive control. Possible explanations include response set effects (a consistent speed-accuracy bias toward the incongruent format), practice or habituation effects, or strategic suppression of word reading. Negative interference should be noted as an unusual finding and interpreted cautiously, particularly if it persists across multiple sessions.

Best Session — Historical Report

In the historical report, Best Session identifies the session with the smallest absolute interference value — the session whose score is closest to zero. This criterion reflects the session in which word-colour conflict had the least measurable effect on response time, regardless of direction. A session with +20 ms is therefore considered better than one with −60 ms, because its absolute interference (20 ms) is smaller. This approach avoids misclassifying atypical negative values as markers of exceptional performance.
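
The Best Session rule amounts to minimising the absolute interference. A minimal sketch, assuming sessions are held in a dictionary keyed by a hypothetical session label:

```python
from typing import Dict


def best_session(interference_by_session: Dict[str, float]) -> str:
    """Return the session whose interference is closest to zero."""
    return min(
        interference_by_session,
        key=lambda session: abs(interference_by_session[session]),
    )
```

With scores of −60 ms, +20 ms, and +45 ms, the +20 ms session is selected. If two sessions tie on absolute value, this sketch returns whichever comes first in the dictionary; the tool's tie-breaking rule is not documented.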

Clinical Applications

The Stroop paradigm has been used as a sensitive measure across numerous clinical and research contexts (see MacLeod, 1991, for a review of the experimental literature):

  • ADHD: Increased interference and reduced accuracy reflect impaired response inhibition.
  • Traumatic Brain Injury (TBI): Slowing across all conditions with disproportionate increase in incongruent RT; sensitive to frontal lobe involvement.
  • Healthy ageing: Gradual increase in overall RT and interference with age, associated with white matter changes and reduced processing speed.
  • Schizophrenia: Greater interference and higher error rates, linked to executive dysfunction and reduced top-down cognitive control.
  • Depression and anxiety: Slower responses to emotionally congruent stimuli in emotional Stroop variants; standard Stroop shows slowing under high cognitive load.
  • Neurological monitoring: Serial administration tracks changes in cognitive control over time, useful in rehabilitation and neuromodulation follow-up.

Section 7

Limitations

  • Literacy requirement: The Stroop effect depends on automatic word reading. Results are not interpretable in participants who cannot read the language of the stimuli fluently.
  • Screen and input variability: RT precision depends on display refresh rate and input device latency. Results obtained on different hardware (e.g., touchscreen vs. keyboard) are not directly comparable.
  • Practice effects: Repeated administration within a short period produces RT improvement unrelated to the clinical construct of interest. A minimum interval of 24–48 hours between sessions is recommended for serial comparisons.
  • Colour vision: The task requires reliable discrimination of the ink colours used. Participants with colour vision deficiency (colour blindness) may show artifactual interference patterns and should be screened before administration.
  • Absence of local norms: This tool does not include a normative database. Interference values must be interpreted relative to the individual's own baseline across sessions, or against published norms appropriate for the population being evaluated.
  • Computerised vs. paper format: The computerised format (used here) and classic paper-and-pencil Stroop formats yield different absolute values. Published norms from paper versions should not be directly applied to these data.

Section 8

References

MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative review. Psychological Bulletin, 109(2), 163–203. https://doi.org/10.1037/0033-2909.109.2.163

Rabbitt, P. M. A., & Rodgers, B. (1977). What does a man do after he makes an error? An analysis of response programming. Quarterly Journal of Experimental Psychology, 29(4), 727–743. https://doi.org/10.1080/14640747708400645

Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510–532. https://doi.org/10.1037/0033-2909.114.3.510

Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643–662. https://doi.org/10.1037/h0054651

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.