Case studies have several advantages and limitations.
In biofeedback research, a case study
is a record of patient experiences and behaviors compiled by a therapist and is nonexperimental. A therapist
documents change
in patient symptoms across the phases of treatment but cannot prove that
the change was due to biofeedback training.
A descriptive case study might use an
AB design,
where A is the baseline phase and B is the treatment phase.
Small N Designs
N represents the number of participants required for an experiment. Classic experiments
like Peniston and Kulkosky's (1990) alcoholism study utilized
large N designs
that compared the performance of groups of participants (those who received alpha-theta neurofeedback and those who
received conventional medical treatment).
A
small N design examines one or two participants. The individual featured below
learned to control a synthesizer by manipulating his EEG.
A clinical psychologist could use a small
N design to test a treatment when too few participants are available for a large N study and when an untreated control group would raise ethical concerns. Animal researchers prefer small
N designs to minimize their animal subjects' acquisition and maintenance costs, training time, and possible sacrifice.
Small
N designs have been most extensively used in operant conditioning research. B. F. Skinner
examined the continuous behavior of individual participants in preference to analyzing discrete measurements from
separate groups of participants.
Small
N researchers often use variations of the
ABA reversal design, in which each participant experiences all treatment conditions. A subject is observed in a control condition (A), a treatment condition (B), and then again in the control condition (A). This design requires
that the treatment effect be reversible.
In both large and small
N designs,
baselines are control conditions that
allow us to measure behavior without the influence of the independent variable.
Removing the independent variable and returning to baseline are crucial in ruling out confounding.
The return to baseline, which may
be repeated several times, is needed to rule out threats to internal validity like history and maturation.
An ABABA design is illustrated below.
Baseline 1 measured the number of clothing items the husband left in the living room.
Doing-dishes-contingent 1 penalized the husband with dishwashing when he left more clothes in the living room than his wife did.
Baseline 2 again measured the number of clothes the husband left in the living room, this time without penalty.
Doing-dishes-contingent 2 again penalized the husband with dishwashing when he left more clothes in the living room than his wife did.
Post-checks measured the dependent variable after training ended to assess whether the behavior change (picking up clothing) was maintained. John Balven adapted the illustration below from Myers and Hansen (2012).
Many clinical reversal studies do not return to a final baseline because it would be unethical to risk patient
relapse after treatment appeared to improve behavior.
When a reversal study does not end with a baseline condition, we can't rule out the possibility that an extraneous variable caused the patient's clinical improvement.
A
multiple baseline design overcomes the ethical problem of withdrawing an effective treatment by never
withdrawing a treatment. Instead, this approach uses baselines of different lengths to measure the same behavior at different times or in different settings, or to measure several different behaviors.
Descriptive and Inferential Statistics
Descriptive statistics report the attributes of a sample of scores. Measures of central tendency and variability are two important families of descriptive
statistics (Myers & Hansen, 2012).
Measures of Central Tendency
Measures of central tendency are descriptive statistics that describe the typical score within a sample. Three measures of central tendency are the mean, median, and mode.
Their relative position depends on a distribution's symmetry. The mean is lower than the median and mode when a distribution is skewed to the left. The mean is higher than its counterparts when a distribution is skewed to the right. Graphic © Iamnee/iStockphoto.com.
The
mean is the arithmetic average and the most commonly reported measure of central tendency.
The mean's principal limitation is its vulnerability to outliers (extreme scores).
The
median divides the sample distribution in half. The median is the middle score or the average of two middle scores. Since the median is unaffected by outliers, statisticians prefer it to the mean when extreme scores are present.
The
mode is the most frequently occurring score; it exists only when at least two scores share the same value. A sample may have no mode or multiple modes.
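For readers who want to verify these definitions, here is a minimal Python sketch (using the same sample that appears in the standard deviation example below) that computes the mean, median, and mode with the standard library's statistics module.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # sample from the standard deviation example below

mean = statistics.mean(scores)       # arithmetic average = 5
median = statistics.median(scores)   # average of the two middle scores = 4.5
mode = statistics.mode(scores)       # most frequently occurring score = 4

print(f"mean = {mean}, median = {median}, mode = {mode}")
```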
Measures of Variability
Statistical measures of variability like the range, standard deviation, and variance index the dispersion in our data. There is no variability (all of these measures are 0) when all sample scores are identical.
The
range is the
difference between the highest and lowest values (or, for the inclusive range, this difference plus 1). For example, for 5, 10, 20, and 25, the range is 20. Since the range is calculated from only two scores in a series, it can be strongly influenced by outliers and cannot show whether data are evenly distributed or clumped together.
The
standard deviation is the square root of the average squared deviations from the mean. For example, for 2, 4, 4, 4, 5, 5, 7, and 9, the standard deviation is the square root of 4, which is 2. When the standard deviation is low, the data points cluster more closely about the mean. When it is high, scores are distributed farther from the mean. The standard deviation characterizes the dispersion of scores better than the range because it uses all of the data points and is less distorted by a single extreme score.
When values are normally distributed, the percentage of scores that lie within 1, 2, and 3 standard deviations from the mean are 68%, 95%, and 99.7%, respectively.
Graphic © Peter Hermes Furian/iStockphoto.com.

Clinicians use the standard deviation to determine whether physiological measurements or psychological test scores differ significantly from the standardization sample mean. In neurofeedback,
z-score training protocols attempt to reduce the discrepancy in real-time between instantaneous EEG values and values within a normative database. Graphic retrieved from breakthroughpsychologyprogram.com.
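As an illustration of the underlying arithmetic only (not of any particular z-score training package), the sketch below converts a hypothetical instantaneous EEG value to a z-score against assumed normative values and confirms the familiar 95%-within-2-SD figure with scipy.

```python
from scipy import stats

# Hypothetical numbers for illustration; real normative values come from a database
eeg_value = 28.0    # client's instantaneous EEG metric (assumed)
norm_mean = 20.0    # normative database mean (assumed)
norm_sd = 4.0       # normative database standard deviation (assumed)

# z expresses how many standard deviations the client's value lies from the norm
z = (eeg_value - norm_mean) / norm_sd          # (28 - 20) / 4 = 2.0

# Under a normal distribution, roughly 95% of values fall within 2 SDs of the mean
within_2_sd = stats.norm.cdf(2) - stats.norm.cdf(-2)   # ≈ 0.954

print(f"z = {z:.1f}; proportion within ±2 SD ≈ {within_2_sd:.3f}")
```

In z-score training, feedback rewards the client when such z-scores move closer to 0, that is, closer to the normative mean.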
Check out the YouTube video
What Is The Standard Deviation?
The
variance is the
average squared deviation of scores from their mean and is the square of the standard deviation. For example, for 2, 4, 4, 4, 5, 5, 7, and 9, the variance is 4. A disadvantage of the variance compared with the standard deviation is that it is reported in squared units.
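The text's worked example can be checked with a few lines of Python. Because the formulas above divide by N (the population forms), the sketch uses pstdev and pvariance; stdev and variance would divide by N − 1 instead.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # worked example from the text

value_range = max(scores) - min(scores)    # 9 - 2 = 7
sd = statistics.pstdev(scores)             # population standard deviation = 2.0
var = statistics.pvariance(scores)         # population variance = 4.0

print(f"range = {value_range}, SD = {sd}, variance = {var}")
```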
Inferential Statistics Allow Researchers to Test Hypotheses
Hypotheses are predictions of the relationship between the IV and DV.
Inferential statistics like the
F-test and
t-test allow researchers to test the
null hypothesis, the prediction that the IV did not affect the DV.
When the difference between the experimental and control condition scores is greater than what we would expect from normal variability in the population, we reject the null hypothesis because our findings are
statistically significant (probably not due to chance).
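As a hedged sketch of this decision rule, the example below runs an independent-samples t-test on two small hypothetical groups of outcome scores; the data and the conventional .05 alpha level are assumptions for illustration, not values from any study in this unit.

```python
from scipy import stats

# Hypothetical symptom scores for two conditions (lower = better)
biofeedback_group = [12, 9, 10, 8, 11, 7, 9, 10]
control_group = [14, 13, 15, 12, 16, 13, 14, 15]

# Test the null hypothesis that the two population means are equal
t_stat, p_value = stats.ttest_ind(biofeedback_group, control_group)

alpha = 0.05   # conventional significance level
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: fail to reject the null hypothesis")
```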
Statistical significance is the lowest bar in inferential statistics because it only means a rare event has occurred. Christopher Zerr (2021) likens a significant outcome to a sighting of a Black Swan since these are uncommon. Graphic © Andreas Prott/Shutterstock.com.
Statistical significance does not reveal the magnitude of the treatment effect, called its effect size. Following Christopher's analogy, statistical significance doesn't convey whether the Black Swan is lilliputian or mammoth. Effect size is a higher bar because large effect sizes are more likely to be replicated and produce important clinical or performance outcomes.
Meta-Analysis
A
meta-analysis is a statistical analysis of related studies. It combines and quantifies data from many experiments that use the same operational definitions for their independent and dependent variables to calculate a typical effect size (Bordens & Abbott, 2022).
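The following sketch shows one simple way such a combination can work: a fixed-effect, inverse-variance weighted average of per-study effect sizes. The study values are invented for illustration and do not come from any meta-analysis cited here.

```python
# Fixed-effect meta-analysis sketch: pool study effect sizes using
# inverse-variance weights. All numbers below are hypothetical.
studies = [
    # (Cohen's d, variance of d)
    (0.45, 0.04),
    (0.30, 0.02),
    (0.60, 0.05),
]

weights = [1.0 / var for _, var in studies]                      # inverse-variance weights
pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)

print(f"Pooled Cohen's d = {pooled_d:.2f}")   # ≈ 0.40 for these numbers
```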
Check out the YouTube video
Introduction to Meta-analysis.
Effect size indexes the magnitude of an IV's effect on the DV. Researchers measure effect size using percentage and standard deviation approaches. The percentage method calculates the percentage of variability in the DV that the IV can predict. Statistics like r² and eta² estimate the strength of association between the IV and DV. Possible values range from 0 to 1.0. A small r² = 0.01, medium = 0.09, and large ≥ 0.5. An r² of 0.5 means that the IV accounted for 50% of the variability in the DV.
Graphic © Jacek Fulawka/Shutterstock.com.
The standard deviation approach estimates the degree of change in the DV in standard deviation units. For example, Cohen's
d calculates the effect size for both
t- and
F-tests. A small effect = 0.2, medium = 0.5, and large = 0.8.
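A brief sketch of both approaches, with invented data: r² is computed from a Pearson correlation between an IV and a DV, and Cohen's d from two hypothetical group means divided by a pooled standard deviation (the equal-n form).

```python
import math
import statistics
from scipy import stats

# Percentage approach: r-squared from hypothetical IV and DV measurements
training_minutes = [10, 20, 30, 40, 50, 60]
symptom_change = [1.0, 1.8, 2.1, 3.2, 3.8, 4.5]
r, _ = stats.pearsonr(training_minutes, symptom_change)
r_squared = r ** 2                      # proportion of DV variability predicted by the IV

# Standard deviation approach: Cohen's d from hypothetical group scores
treatment = [12, 9, 14, 10, 13, 11]
control = [10, 8, 12, 9, 11, 10]
pooled_sd = math.sqrt((statistics.variance(treatment) + statistics.variance(control)) / 2)
d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

print(f"r^2 = {r_squared:.2f}, Cohen's d = {d:.2f}")
```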
A meta-analysis may overstate or understate treatment efficacy depending on the representativeness of its selected studies. Treatments may appear more effective than in real-world practice when a meta-analysis
excludes unpublished studies that did not achieve statistically significant results. This is called the
file drawer effect. Conversely, treatments may
appear less effective than they actually are when a meta-analysis includes studies by inexperienced clinicians and those
who utilized inadequate training protocols (e.g., did not train until a criterion was reached).
The Clinical Efficacy of Established Medical Practices
Prasad et al. (2013) examined 363 studies of an accepted drug or medical procedure published in
The New England Journal of Medicine from 2001 to 2010. More than 40% were ineffective or harmful, 38% were beneficial, and 22% had uncertain value.
Examples of ineffective or harmful practices included hormone replacement therapy in postmenopausal women and aggressive blood sugar reduction in Type 2 diabetics treated in intensive care, which increased mortality rates.
The authors observed: “Nevertheless, the reversals we have identified at the very least call these practices into question. Some practices ought to be abandoned, whereas others warrant retesting in more powerful investigations. One of the greatest virtues of medical research is our continual quest to reassess it.” (p. 796)
Criteria for Clinical Efficacy
The following guidelines for evaluating the clinical efficacy of biofeedback and neurofeedback interventions were
recommended by a joint Task Force and adopted by the Boards of Directors of the Association for Applied
Psychophysiology and Biofeedback (AAPB) and the International Society for Neuronal Regulation (ISNR) (LaVaque et al., 2002). We will discuss the
Evidence-Based Practice in Biofeedback and Neurofeedback (4th ed.) ratings in our Applications units.
Level 1: Not empirically supported
Supported only by anecdotal reports and/or case studies in non-peer reviewed venues.
Level 2: Possibly efficacious
At least one study of sufficient statistical power with well-identified outcome measures but lacking
randomized assignment to a control condition internal to the study.
Level 3: Probably efficacious
Multiple observational studies, clinical studies, wait-list controlled studies, and within-subject and
intrasubject replication studies that demonstrate efficacy.
Level 4: Efficacious
a. In comparison with a no-treatment control group, alternative treatment
group, or sham (placebo) control utilizing randomized assignment, the
investigational treatment is shown to be statistically significantly superior
to the control condition or the investigational treatment is equivalent to a
treatment of established efficacy in a study with sufficient power to
detect moderate differences, and
b. The studies have been conducted with a population treated for a specific
problem, for whom inclusion criteria are delineated in a reliable,
operationally defined manner, and
c. The study used valid and clearly specified outcome measures related to
the problem being treated, and
d. The data are subjected to appropriate data analysis, and
e. The diagnostic and treatment variables and procedures are clearly
defined in a manner that permits replication of the study by independent
researchers, and
f. The superiority or equivalence of the investigational treatment has been
shown in at least two independent research settings.
Level 5: Efficacious and specific
The investigational treatment is statistically superior to credible sham therapy, pill, or
alternative bona fide treatment in at least two independent research settings.
Glossary
ABA reversal design: small
N design where a baseline is followed by treatment and then a return to baseline.
assignment threat: a classic threat to internal validity in which individual
differences are not balanced across treatment conditions by the assignment procedure.
assignment-interaction threat: the combination of a selection threat with at least
one other threat (history, maturation, testing, instrumentation, statistical regression, or mortality).
balancing: a method of controlling a physical variable by distributing its
effects across all treatment conditions (running half of each condition's participants in the morning and half in the
evening).
baseline: a control condition in which participants receive a zero level of the
independent variable (sitting quietly without receiving feedback about physiological performance).
baseline-only control condition: participants sit quietly without receiving feedback about their physiological performance.
bidirectional causation: a reason that correlation does not imply causation.
Each of two variables could influence the other.
case study: a nonexperimental descriptive study of a participant’s
experiences, observable behaviors, and archival records kept by an outside observer.
confounding: loss of internal validity when an extraneous variable
systematically changes across the experimental conditions.
constancy of conditions: controlling a physical variable by keeping
it constant across all treatment conditions (running all participants in the evening).
contingent feedback: feedback of a participant's actual physiological
performance.
control group: in an experiment, a group that receives a zero level
of the independent variable (placebo or wait-list).
correlational study: a nonexperimental procedure in which the researcher does
not manipulate an independent variable and only records data concerning traits or behaviors (investigating the
relationship between body mass index and severity of low back pain).
demand characteristics: situational cues (like placing EEG sensors on a client's forehead) that signal expected behavior (increased attention).
dependent variable (DV): the outcome measure the experimenter uses to assess
the change in behavior produced by the independent variable (airway resistance could be used to measure the
effectiveness of HRV training for asthma).
detectable noncontingent feedback: noncontingent feedback that a participant
can detect as false when the display fails to mirror voluntary behavior like muscle bracing.
double-blind crossover experiment: an experimental approach using a double-blind
design where participants start with one treatment and conclude with an alternative treatment (controls demand
characteristics, experimenter bias, and individual differences).
double-blind experiment: the experimenter and participant do not know the
condition to which the participant has been assigned to control demand characteristics and experimenter bias.
effect size: magnitude of an IV's effect on the DV.
elimination: controlling a physical variable by removing it
(soundproofing a room).
ex post facto study: quasi-experimental design in which a researcher compares participants on pre-existing characteristics.
experimenter bias: confounding that occurs when the researcher knows the
participants' treatment condition and acts in a manner that confirms the experimental hypothesis.
extraneous variable (EV): variable not controlled by the experimenter (room
temperature).
file drawer effect: excluding unpublished studies that did not obtain significant treatment effects from a meta-analysis or systematic review.
history threat: a classic threat to internal validity that occurs when an event
outside the experiment threatens internal validity by changing the DV.
independent variable (IV): the variable (antecedent condition) an experimenter
intentionally manipulates (HRV or SEMG biofeedback).
instrumentation threat: a classic threat to internal validity in which changes
in the measurement instrument or procedure threaten internal validity.
internal validity: the degree to which the experiment can demonstrate that
changes in the dependent variable across treatment conditions are due to the independent variable.
large N designs: studies that examine the performance of groups of
participants.
level 1: not empirically supported.
level 2: possibly efficacious.
level 3: probably efficacious.
level 4: efficacious.
level 5: efficacious and specific.
maturation threat: a classic threat to internal validity that occurs when
physical or psychological changes in participants threaten internal validity by changing the dependent variable.
mean: arithmetic average and the most commonly reported measure of central tendency.
measures of central tendency: descriptive statistics (mean, median, and mode) that describe the typical sample score.
measures of variability: descriptive statistics (range, standard deviation, and variance) that describe the dispersion of scores within a sample and allow us to compare different samples.
median: the score that divides a sample distribution in half. It is the middle score or the average of two middle scores.
meta-analysis: a statistical analysis that combines and quantifies data from
many experiments that use the same operational definitions for their independent and dependent variables to
calculate an average effect size.
mode: the most frequently occurring score; it exists only when at least two scores share the same value.
mortality threat: a classic threat to internal validity that occurs when
participants drop out of experimental conditions at different rates.
N: number of participants.
no-biofeedback control group: control condition in which participants do not receive physiological feedback.
nonspecific treatment effect: a measurable symptom change not correlated with a specific psychophysiological change.
observational studies: nonexperimental procedures like naturalistic
observation and correlational studies.
Pearson r: a statistical procedure that calculates the strength of the
relationship (from -1.0 to +1.0) between pairs of variables measured using interval or ratio scales.
personality variables: personal aspects of participants or experimenters like
anxiety or warmth.
physical variables: properties of the physical environment like time of day,
room size, or noise.
placebo response: associatively conditioned homeostatic response.
pre-test/post-test design: researchers measure participants on the DV at least twice, before and after administering training.
pre-registration: submission of a study on the Open Science Framework before data collection.
random sample: a subset of a target population selected using an unbiased method so that every member of the population has an equal chance to be chosen.
randomized controlled trial (RCT): researchers manipulate an independent variable
and randomly assign participants to conditions, with or without prior matching on participant variables.
range: the
difference between the lowest and highest values.
registration: submission of study design to a journal before data collection for Stage 1 and Stage 2 peer review.
relaxation control condition: participants receive a non-biofeedback relaxation procedure.
reverse contingent feedback: feedback that trains participants to produce
changes that are the reverse of those shaped by a clinical protocol (beta decrease and theta increase in children diagnosed with ADHD).
sample: selected subset of a target population.
selection interactions: the combination of a selection threat with at least
one other threat (history, maturation, testing, instrumentation, statistical regression, or subject
mortality).
single-blind experiment: participants are not told their treatment condition.
small N designs: studies involving one or two participants.
social variables: aspects of the relationships between researchers and
participants like demand characteristics and experimenter bias.
specific treatment effect: a measurable symptom change associated with a
measurable psychophysiological change produced by biofeedback.
standard deviation: the square root of the average squared deviations from the mean.
statistical regression threat: a classic threat to internal validity that occurs when participants are assigned to conditions based on extreme scores obtained with an imperfectly reliable measurement procedure. When participants are retested with the same method to show change on the DV, the scores of both extreme groups tend to regress toward the mean, so high scorers score lower and low scorers score higher on the second testing.
testing threat: a classic threat to internal validity that occurs when prior
exposure to a measurement procedure affects performance during the experiment.
third variable problem: a reason that correlation does not mean causation. A
hidden variable may affect both correlated variables. For example, alcohol abuse could both disrupt sleep and
increase depression.
variance: the
average squared deviation of scores from their mean.
wait-list control group: participants are measured like the experimental group(s) but are placed on a waiting list for an experimental treatment.
z-score training: a neurofeedback protocol that reinforces in real-time closer approximations of client EEG values to those in a normative database.
Test Yourself
Click on the ClassMarker logo to take 10-question tests over this unit without an exam password.

REVIEW FLASHCARDS ON QUIZLET
Click on the Quizlet logo to review our chapter flashcards.
Visit the BioSource Software Website
BioSource Software offers
Human Physiology, which satisfies BCIA's Human Anatomy and Physiology requirement, and
Biofeedback100, which provides extensive multiple-choice testing over the Biofeedback Blueprint.
Assignment
Now that you have completed this unit, consider how you could use the case study approach in your clinical
practice to assess treatment efficacy.
References
Bordens, K. S., & Abbott, B. B. (2022).
Research design and methods: A process approach (11th ed.). McGraw Hill.
Brady, J. V. (1958).
Ulcers in "executive" monkeys.
Scientific American, 199(4), 95-100. https://psycnet.apa.org/doi/10.1038/scientificamerican1058-95
Campbell, D. T. (1957).
Factors relevant to the validity of experiments in social settings.
Psychological
Bulletin, 54, 297-312. https://psycnet.apa.org/doi/10.1037/h0040950
Campbell, D. T., & Stanley, J. C. (1966).
Experimental and quasi-experimental designs for research.
Rand McNally.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling.
Psychological Science, 23, 524-532. https://doi.org/10.1177/0956797611430953
Khazan, I., Shaffer, F., Moss, D., Lyle, R., & Rosenthal, S. (2023).
Evidence-based practice in biofeedback and neurofeedback (4th ed.). Association for Applied Psychophysiology and Biofeedback.
Larson,
M. J. (2019, March 13-16).
How can we improve the rigor and replicability of applied psychophysiology? [Keynote address]. Association for Applied Psychophysiology and Biofeedback Annual Meeting, Denver, CO, United States.
LaVaque, T. J., Hammond, D. C., Trudeau, D., Monastra, V., Perry, J., Lehrer, P., Matheson, D., & Sherman, R.
(2002).
Template for developing guidelines for the evaluation of the clinical efficacy of psychophysiological interventions.
Applied Psychophysiology and Biofeedback, 27(4), 273-281. https://dx.doi.org/10.1023/A:1021061318355
Myers, A., & Hansen, C. (2012).
Experimental psychology (7th ed.). Wadsworth.
Peniston, E. G., & Kulkosky, P. J. (1990). Alcoholic personality and alpha-theta brainwave training.
Medical
Psychotherapy, 3, 37-55.
Prasad, V., Vandross, A., Toomey, C., Cheung, M., Rho, J., Quinn, S., . . . Cifu, A. (2013).
A decade of reversal: An analysis of 146 contradicted medical practices.
Mayo Clinic Proceedings, 88(8), 790-798. https://doi.org/10.1016/j.mayocp.2013.05.012
Weiss, J. M. (1972).
Psychological factors in stress and disease.
Scientific American, 226(6),
104-113. https://doi.org/10.1038/scientificamerican0672-104
Wickramasekera, I. A. (1988).
Clinical behavioral medicine: Some concepts
and procedures. Plenum Press.
Wickramasekera, I. A. (1999).
How does biofeedback reduce clinical symptoms and do memories and beliefs have biological consequences? Toward
a model of mind-body healing.
Applied Psychophysiology and Biofeedback,
24(2), 91-105. https://doi.org/10.1023/a:1022201710323
Yaremko, R. M., Harari, H., Harrison, R. C., & Lynn, E. (1982).
Reference handbook of research and statistical methods in psychology: For
students and professionals. Harper & Row, Publishers, Inc.
Yucha, C. B., & Montgomery, D. (2008).
Evidence-based practice in biofeedback and neurofeedback (2nd
ed.). Association for Applied Psychophysiology and Biofeedback.
Zerr, C. L. (2021). Personal communication regarding probability testing.