Case studies have several advantages and limitations.
In biofeedback research, a case study
is a record of patient experiences and behaviors compiled by a therapist and is nonexperimental. A therapist
documents change
in patient symptoms across the phases of treatment but cannot prove that
the change was due to biofeedback training.
A descriptive case study might use an
AB design,
where A is the baseline phase and B is the treatment phase.
Small N Designs
N represents the number of participants required for an experiment. Classic experiments
like Peniston and Kulkosky's (1990) alcoholism study utilized
large N designs
that compared the performance of groups of participants (those who received alpha-theta neurofeedback and those who
received conventional medical treatment).
A
small N design examines one or two participants. The individual featured below
learned to control a synthesizer by manipulating his EEG.
A clinical psychologist could use a small
N design to test a treatment when too few participants are available for a large N study and when an untreated control group would raise ethical concerns. Animal researchers prefer small
N designs to minimize their animal subjects' acquisition and maintenance costs, training time, and possible sacrifice.
Small
N designs have been most extensively used in operant conditioning research. B. F. Skinner
examined the continuous behavior of individual participants in preference to analyzing discrete measurements from
separate groups of participants.
Small
N researchers often use variations of the
ABA reversal design, in which each participant experiences all treatment conditions. A subject is observed in a control condition (A), a treatment condition (B), and then again in the control condition (A). This design requires
that the treatment effect be reversible.
In both large and small
N designs,
baselines are control conditions that
allow us to measure behavior without the influence of the independent variable.
Removing the independent variable and returning to baseline are crucial in ruling out confounding.
The return to baseline, which may
be repeated several times, is needed to rule out threats to internal validity like history and maturation.
An ABABA design is illustrated below.
Baseline 1 measured the number of clothing items the husband left in the living room.
Doing-dishes-contingent 1 penalized the husband with dishwashing when he left more clothes in the living room than his wife did.
Baseline 2 again measured the number of clothes the husband left in the living room, this time without penalty.
Doing-dishes-contingent 2 again penalized the husband with dishwashing when he left more clothes in the living room than his wife did.
Post-checks measured the dependent variable after training ended to assess whether the behavior change (picking up clothing) was maintained. John Balven adapted the illustration below from Myers and Hansen (2012).
Many clinical reversal studies do not return to a final baseline because it would be unethical to risk patient
relapse after treatment appeared to improve behavior.
When a reversal study does not end with a baseline condition, we can't rule out the possibility that an extraneous variable caused the patient's clinical improvement.
A
multiple baseline design overcomes the ethical problem of withdrawing an effective treatment by never
withdrawing a treatment. Instead, this approach uses baselines of different lengths to measure the same behavior at different times or in different settings, or to measure several different behaviors.
Descriptive and Inferential Statistics
Descriptive statistics report the attributes of a sample of scores. Measures of central tendency and variability are two important families of descriptive
statistics (Myers & Hansen, 2012).
Measures of Central Tendency
Measures of central tendency are descriptive statistics that describe the typical score within a sample. Three measures of central tendency are the mean, median, and mode.
Their relative position depends on a distribution's symmetry. The mean is lower than the median and mode when a distribution is skewed to the left. The mean is higher than its counterparts when a distribution is skewed to the right. Graphic © Iamnee/iStockphoto.com.
The
mean is the arithmetic average and the most commonly reported measure of central tendency.
The mean's principal limitation is its vulnerability to outliers (extreme scores).
The
median divides the sample distribution in half. The median is the middle score or the average of two middle scores. Since the median is unaffected by outliers, statisticians prefer it to the mean when extreme scores are present.
The
mode is the most frequently occurring score; it exists only when at least two scores share the same value. A sample may have no mode or multiple modes.
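For readers who want to verify these definitions, here is a minimal Python sketch (using the same sample that appears in the standard deviation example below) that computes the mean, median, and mode with the standard library's statistics module.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # sample from the standard deviation example below

mean = statistics.mean(scores)       # arithmetic average = 5
median = statistics.median(scores)   # average of the two middle scores = 4.5
mode = statistics.mode(scores)       # most frequently occurring score = 4

print(f"mean = {mean}, median = {median}, mode = {mode}")
```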
Measures of Variability
Statistical measures of variability like the range, standard deviation, and variance index the dispersion in our data. There is no variability (all of these measures are 0) when all sample scores are identical.
The
range is the
difference between the highest and lowest values (or, for the inclusive range, this difference plus 1). For example, for 5, 10, 20, and 25, the range is 20. Since the range is calculated from only two scores in a series, it can be strongly influenced by outliers and cannot show whether data are evenly distributed or clumped together.
The
standard deviation is the square root of the average squared deviations from the mean. For example, for 2, 4, 4, 4, 5, 5, 7, and 9, the standard deviation is the square root of 4, which is 2. When the standard deviation is low, the data points cluster more closely about the mean. When it is high, scores are distributed farther from the mean. The standard deviation characterizes the dispersion of scores better than the range because it uses all of the data points and is less distorted by a single extreme score.
When values are normally distributed, the percentage of scores that lie within 1, 2, and 3 standard deviations from the mean are 68%, 95%, and 99.7%, respectively.
Graphic © Peter Hermes Furian/iStockphoto.com.

Clinicians use the standard deviation to determine whether physiological measurements or psychological test scores differ significantly from the standardization sample mean. In neurofeedback,
z-score training protocols attempt to reduce the discrepancy in real-time between instantaneous EEG values and values within a normative database. Graphic retrieved from breakthroughpsychologyprogram.com.
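As an illustration of the underlying arithmetic only (not of any particular z-score training package), the sketch below converts a hypothetical instantaneous EEG value to a z-score against assumed normative values and confirms the familiar 95%-within-2-SD figure with scipy.

```python
from scipy import stats

# Hypothetical numbers for illustration; real normative values come from a database
eeg_value = 28.0    # client's instantaneous EEG metric (assumed)
norm_mean = 20.0    # normative database mean (assumed)
norm_sd = 4.0       # normative database standard deviation (assumed)

# z expresses how many standard deviations the client's value lies from the norm
z = (eeg_value - norm_mean) / norm_sd          # (28 - 20) / 4 = 2.0

# Under a normal distribution, roughly 95% of values fall within 2 SDs of the mean
within_2_sd = stats.norm.cdf(2) - stats.norm.cdf(-2)   # ≈ 0.954

print(f"z = {z:.1f}; proportion within ±2 SD ≈ {within_2_sd:.3f}")
```

In z-score training, feedback rewards the client when such z-scores move closer to 0, that is, closer to the normative mean.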
Check out the YouTube video
What Is The Standard Deviation?
The
variance is the
average squared deviation of scores from their mean and is the square of the standard deviation. For example, for 2, 4, 4, 4, 5, 5, 7, and 9, the variance is 4. A disadvantage of the variance compared with the standard deviation is that it is reported in squared units.
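The text's worked example can be checked with a few lines of Python. Because the formulas above divide by N (the population forms), the sketch uses pstdev and pvariance; stdev and variance would divide by N − 1 instead.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # worked example from the text

value_range = max(scores) - min(scores)    # 9 - 2 = 7
sd = statistics.pstdev(scores)             # population standard deviation = 2.0
var = statistics.pvariance(scores)         # population variance = 4.0

print(f"range = {value_range}, SD = {sd}, variance = {var}")
```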
Inferential Statistics Allow Researchers to Test Hypotheses
Hypotheses are predictions of the relationship between the IV and DV.
Inferential statistics like the
F-test and
t-test allow researchers to test the
null hypothesis, the prediction that the IV did not affect the DV.
When the difference between the experimental and control condition scores is greater than what we would expect from normal variability in the population, we reject the null hypothesis because our findings are
statistically significant (probably not due to chance).
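As a hedged sketch of this decision rule, the example below runs an independent-samples t-test on two small hypothetical groups of outcome scores; the data and the conventional .05 alpha level are assumptions for illustration, not values from any study in this unit.

```python
from scipy import stats

# Hypothetical symptom scores for two conditions (lower = better)
biofeedback_group = [12, 9, 10, 8, 11, 7, 9, 10]
control_group = [14, 13, 15, 12, 16, 13, 14, 15]

# Test the null hypothesis that the two population means are equal
t_stat, p_value = stats.ttest_ind(biofeedback_group, control_group)

alpha = 0.05   # conventional significance level
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: fail to reject the null hypothesis")
```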
Statistical significance is the lowest bar in inferential statistics because it only means a rare event has occurred. Christopher Zerr (2021) likens a significant outcome to a sighting of a Black Swan since these are uncommon. Graphic © Andreas Prott/Shutterstock.com.
Statistical significance does not reveal the magnitude of the treatment effect, called its effect size. Following Christopher's analogy, statistical significance doesn't convey whether the Black Swan is lilliputian or mammoth. Effect size is a higher bar because large effect sizes are more likely to be replicated and produce important clinical or performance outcomes.
Meta-Analysis
A
meta-analysis is a statistical analysis of related studies. It combines and quantifies data from many experiments that use the same operational definitions for their independent and dependent variables to calculate a typical effect size (Bordens & Abbott, 2022).
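The following sketch shows one simple way such a combination can work: a fixed-effect, inverse-variance weighted average of per-study effect sizes. The study values are invented for illustration and do not come from any meta-analysis cited here.

```python
# Fixed-effect meta-analysis sketch: pool study effect sizes using
# inverse-variance weights. All numbers below are hypothetical.
studies = [
    # (Cohen's d, variance of d)
    (0.45, 0.04),
    (0.30, 0.02),
    (0.60, 0.05),
]

weights = [1.0 / var for _, var in studies]                      # inverse-variance weights
pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)

print(f"Pooled Cohen's d = {pooled_d:.2f}")   # ≈ 0.40 for these numbers
```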
Check out the YouTube video
Introduction to Meta-analysis.
Effect size indexes the magnitude of an IV's effect on the DV. Researchers measure effect size using percentage and standard deviation approaches. The percentage method calculates the percentage of variability in the DV that the IV can predict. Statistics like r² and eta² estimate the strength of association between the IV and DV. Possible values range from 0 to 1.0. A small r² = 0.01, medium = 0.09, and large ≥ 0.5. An r² of 0.5 means that the IV accounted for 50% of the variability in the DV.
Graphic © Jacek Fulawka/Shutterstock.com.
The standard deviation approach estimates the degree of change in the DV in standard deviation units. For example, Cohen's
d calculates the effect size for both
t- and
F-tests. A small effect = 0.2, medium = 0.5, and large = 0.8.
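A brief sketch of both approaches, with invented data: r² is computed from a Pearson correlation between an IV and a DV, and Cohen's d from two hypothetical group means divided by a pooled standard deviation (the equal-n form).

```python
import math
import statistics
from scipy import stats

# Percentage approach: r-squared from hypothetical IV and DV measurements
training_minutes = [10, 20, 30, 40, 50, 60]
symptom_change = [1.0, 1.8, 2.1, 3.2, 3.8, 4.5]
r, _ = stats.pearsonr(training_minutes, symptom_change)
r_squared = r ** 2                      # proportion of DV variability predicted by the IV

# Standard deviation approach: Cohen's d from hypothetical group scores
treatment = [12, 9, 14, 10, 13, 11]
control = [10, 8, 12, 9, 11, 10]
pooled_sd = math.sqrt((statistics.variance(treatment) + statistics.variance(control)) / 2)
d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

print(f"r^2 = {r_squared:.2f}, Cohen's d = {d:.2f}")
```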
A meta-analysis may overstate or understate treatment efficacy depending on the representativeness of its selected studies. Treatments may appear more effective than in real-world practice when a meta-analysis
excludes unpublished studies that did not achieve statistically significant results. This is called the
file drawer effect. Conversely, treatments may
appear less effective than they actually are when a meta-analysis includes studies by inexperienced clinicians and those
who utilized inadequate training protocols (e.g., did not train until a criterion was reached).
The Clinical Efficacy of Established Medical Practices
Prasad et al. (2013) examined 363 studies of an accepted drug or medical procedure published in
The New England Journal of Medicine from 2001 to 2010. More than 40% were ineffective or harmful, 38% were beneficial, and 22% had uncertain value.
Examples of ineffective or harmful practices included hormone replacement therapy in postmenopausal women and aggressive blood sugar reduction in Type 2 diabetics treated in intensive care, which increased mortality rates.
The authors observed: “Nevertheless, the reversals we have identified at the very least call these practices into question. Some practices ought to be abandoned, whereas others warrant retesting in more powerful investigations. One of the greatest virtues of medical research is our continual quest to reassess it.” (p. 796)
Criteria for Clinical Efficacy
The following guidelines for evaluating the clinical efficacy of biofeedback and neurofeedback interventions were
recommended by a joint Task Force and adopted by the Boards of Directors of the Association for Applied
Psychophysiology and Biofeedback (AAPB) and the International Society for Neuronal Regulation (ISNR) (LaVaque et al., 2002). We will discuss the
Evidence-Based Practice in Biofeedback and Neurofeedback (4th ed.) ratings in our Applications units.
Level 1: Not empirically supported
Supported only by anecdotal reports and/or case studies in non-peer reviewed venues.
Level 2: Possibly efficacious
At least one study of sufficient statistical power with well-identified outcome measures but lacking
randomized assignment to a control condition internal to the study.
Level 3: Probably efficacious
Multiple observational studies, clinical studies, wait-list controlled studies, and within-subject and
intrasubject replication studies that demonstrate efficacy.
Level 4: Efficacious
a. In comparison with a no-treatment control group, alternative treatment
group, or sham (placebo) control utilizing randomized assignment, the
investigational treatment is shown to be statistically significantly superior
to the control condition or the investigational treatment is equivalent to a
treatment of established efficacy in a study with sufficient power to
detect moderate differences, and
b. The studies have been conducted with a population treated for a specific
problem, for whom inclusion criteria are delineated in a reliable,
operationally defined manner, and
c. The study used valid and clearly specified outcome measures related to
the problem being treated, and
d. The data are subjected to appropriate data analysis, and
e. The diagnostic and treatment variables and procedures are clearly
defined in a manner that permits replication of the study by independent
researchers, and
f. The superiority or equivalence of the investigational treatment has been
shown in at least two independent research settings.
Level 5: Efficacious and specific
The investigational treatment is statistically superior to credible sham therapy, pill, or
alternative bona fide treatment in at least two independent research settings.
Glossary
ABA reversal design: small
N design where a baseline is followed by treatment and then a return to baseline.
assignment threat: a classic threat to internal validity in which individual
differences are not balanced across treatment conditions by the assignment procedure.
assignment-interaction threat: the combination of a selection threat with at least
one other threat (history, maturation, testing, instrumentation, statistical regression, or mortality).
balancing: a method of controlling a physical variable by distributing its
effects across all treatment conditions (running half of each condition's participants in the morning and half in the
evening).
baseline: a control condition in which participants receive a zero level of the
independent variable (sitting quietly without receiving feedback about physiological performance).
baseline-only control condition: participants sit quietly without receiving feedback about their physiological performance.
bidirectional causation: a reason that correlation does not imply causation.
Each of two variables could influence the other.
case study: a nonexperimental descriptive study of a participant’s
experiences, observable behaviors, and archival records kept by an outside observer.
confounding: loss of internal validity when an extraneous variable
systematically changes across the experimental conditions.
constancy of conditions: controlling a physical variable by keeping
it constant across all treatment conditions (running all participants in the evening).
contingent feedback: feedback of a participant's actual physiological
performance.
control group: in an experiment, a group that receives a zero level
of the independent variable (placebo or wait-list).
correlational study: a nonexperimental procedure in which the researcher does
not manipulate an independent variable and only records data concerning traits or behaviors (investigating the
relationship between body mass index and severity of low back pain).
demand characteristics: situational cues (like placing EEG sensors on a client's forehead) that signal expected behavior (increased attention).
dependent variable (DV): the outcome measure the experimenter uses to assess
the change in behavior produced by the independent variable (airway resistance could be used to measure the
effectiveness of HRV training for asthma).
detectable noncontingent feedback: noncontingent feedback that a participant
can detect as false when the display fails to mirror voluntary behavior like muscle bracing.
double-blind crossover experiment: an experimental approach using a double-blind
design where participants start with one treatment and conclude with an alternative treatment (controls demand
characteristics, experimenter bias, and individual differences).
double-blind experiment: the experimenter and participant do not know the
condition to which the participant has been assigned to control demand characteristics and experimenter bias.
effect size: magnitude of an IV's effect on the DV.
elimination: controlling a physical variable by removing it
(soundproofing a room).
ex post facto study: quasi-experimental design in which a researcher compares participants on pre-existing characteristics.
experimenter bias: confounding that occurs when the researcher knows the
participants' treatment condition and acts in a manner that confirms the experimental hypothesis.
extraneous variable (EV): variable not controlled by the experimenter (room
temperature).
file drawer effect: excluding unpublished studies that did not obtain significant treatment effects from a meta-analysis or systematic review.
history threat: a classic threat to internal validity that occurs when an event
outside the experiment threatens internal validity by changing the DV.
independent variable (IV): the variable (antecedent condition) an experimenter
intentionally manipulates (HRV or SEMG biofeedback).
instrumentation threat: a classic threat to internal validity in which changes
in the measurement instrument or procedure threaten internal validity.
internal validity: the degree to which the experiment can demonstrate that
changes in the dependent variable across treatment conditions are due to the independent variable.
large N designs: studies that examine the performance of groups of
participants.
level 1: not empirically supported.
level 2: possibly efficacious.
level 3: probably efficacious.
level 4: efficacious.
level 5: efficacious and specific.
maturation threat: a classic threat to internal validity that occurs when
physical or psychological changes in participants threaten internal validity by changing the dependent variable.
mean: arithmetic average and the most commonly reported measure of central tendency.
measures of central tendency: descriptive statistics (mean, median, and mode) that describe the typical sample score.
measures of variability: descriptive statistics (range, standard deviation, and variance) that describe the dispersion of scores within a sample and allow us to compare different samples.
median: the score that divides a sample distribution in half. It is the middle score or the average of two middle scores.
meta-analysis: a statistical analysis that combines and quantifies data from
many experiments that use the same operational definitions for their independent and dependent variables to
calculate an average effect size.
mode: the most frequently occurring score; it exists only when at least two scores share the same value.
mortality threat: a classic threat to internal validity that occurs when
participants drop out of experimental conditions at different rates.
N: number of participants.
no-biofeedback control group: control condition in which participants do not receive physiological feedback.
nonspecific treatment effect: a measurable symptom change not correlated with a specific psychophysiological change.
observational studies: nonexperimental procedures like naturalistic
observation and correlational studies.
Pearson r: a statistical procedure that calculates the strength of the
relationship (from -1.0 to +1.0) between pairs of variables measured using interval or ratio scales.
personality variables: personal aspects of participants or experimenters like
anxiety or warmth.
physical variables: properties of the physical environment like time of day,
room size, or noise.
placebo response: associatively conditioned homeostatic response.
pre-test/post-test design: researchers measure participants on the DV at least twice, before and after administering training.
pre-registration: submission of a study on the Open Science Framework before data collection.
random sample: a subset of a target population selected using an unbiased method so that every member of the population has an equal chance to be chosen.
randomized controlled trial (RCT): researchers manipulate an independent variable
and randomly assign participants to conditions, with or without prior matching on participant variables.
range: the
difference between the lowest and highest values.
registration: submission of study design to a journal before data collection for Stage 1 and Stage 2 peer review.
relaxation control condition: participants receive a non-biofeedback relaxation procedure.
reverse contingent feedback: feedback that trains participants to produce
changes that are the reverse of those shaped by a clinical protocol (beta decrease and theta increase in children diagnosed with ADHD).
sample: selected subset of a target population.
selection interactions: the combination of a selection threat with at least
one other threat (history, maturation, testing, instrumentation, statistical regression, or subject
mortality).
single-blind experiment: participants are not told their treatment condition.
small N designs: studies involving one or two participants.
social variables: aspects of the relationships between researchers and
participants like demand characteristics and experimenter bias.
specific treatment effect: a measurable symptom change associated with a
measurable psychophysiological change produced by biofeedback.
standard deviation: the square root of the average squared deviations from the mean.
statistical regression threat: a classic threat to internal validity that occurs when participants are assigned to conditions based on extreme scores obtained with an imperfectly reliable measurement procedure. When participants are retested with the same method to show change on the DV, the scores of both extreme groups tend to regress toward the mean, so high scorers score lower and low scorers score higher on the second testing.
testing threat: a classic threat to internal validity that occurs when prior
exposure to a measurement procedure affects performance during the experiment.
third variable problem: a reason that correlation does not mean causation. A
hidden variable may affect both correlated variables. For example, alcohol abuse could both disrupt sleep and
increase depression.
variance: the
average squared deviation of scores from their mean.
wait-list control group: participants are measured like the experimental group(s) but are placed on a waiting list for an experimental treatment.
z-score training: a neurofeedback protocol that reinforces in real-time closer approximations of client EEG values to those in a normative database.
Test Yourself
Click on the ClassMarker logo to take 10-question tests over this unit without an exam password.

REVIEW FLASHCARDS ON QUIZLET
Click on the Quizlet logo to review our chapter flashcards.
Visit the BioSource Software Website
BioSource Software offers
Human Physiology, which satisfies BCIA's Human Anatomy and Physiology requirement, and
Biofeedback100, which provides extensive multiple-choice testing over the Biofeedback Blueprint.
Assignment
Now that you have completed this unit, consider how you could use the case study approach in your clinical
practice to assess treatment efficacy.
References
Bordens, K. S., & Abbott, B. B. (2022).
Research design and methods: A process approach (11th ed.). McGraw Hill.
Brady, J. V. (1958).
Ulcers in "executive" monkeys.
Scientific American, 199(4), 95-100. https://psycnet.apa.org/doi/10.1038/scientificamerican1058-95
Campbell, D. T. (1957).
Factors relevant to the validity of experiments in social settings.
Psychological
Bulletin, 54, 297-312. https://psycnet.apa.org/doi/10.1037/h0040950
Campbell, D. T., & Stanley, J. C. (1966).
Experimental and quasi-experimental designs for research.
Rand McNally.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling.
Psychological Science, 23, 524-532. https://doi.org/10.1177/0956797611430953
Khazan, I., Shaffer, F., Moss, D., Lyle, R., & Rosenthal, S. (2023).
Evidence-based practice in biofeedback and neurofeedback (4th ed.). Association for Applied Psychophysiology and Biofeedback.
Larson,
M. J. (2019, March 13-16).
How can we improve the rigor and replicability of applied psychophysiology? [Keynote address]. Association for Applied Psychophysiology and Biofeedback Annual Meeting, Denver, CO, United States.
LaVaque, T. J., Hammond, D. C., Trudeau, D., Monastra, V., Perry, J., Lehrer, P., Matheson, D., & Sherman, R.
(2002).
Template for developing guidelines for the evaluation of the clinical efficacy of psychophysiological interventions.
Applied Psychophysiology and Biofeedback, 27(4), 273-281. https://dx.doi.org/10.1023/A:1021061318355
Myers, A., & Hansen, C. (2012).
Experimental psychology (7th ed.). Wadsworth.
Peniston, E. G., & Kulkosky, P. J. (1990). Alcoholic personality and alpha-theta brainwave training.
Medical
Psychotherapy, 3, 37-55.
Prasad, V., Vandross, A., Toomey, C., Cheung, M., Rho, J., Quinn, S., . . . Cifu, A. (2013).
A decade of reversal: An analysis of 146 contradicted medical practices.
Mayo Clinic Proceedings, 88(8), 790-798. https://doi.org/10.1016/j.mayocp.2013.05.012
Weiss, J. M. (1972).
Psychological factors in stress and disease.
Scientific American, 226(6),
104-113. https://doi.org/10.1038/scientificamerican0672-104
Wickramasekera, I. A. (1988).
Clinical behavioral medicine: Some concepts
and procedures. Plenum Press.
Wickramasekera, I. A. (1999).
How does biofeedback reduce clinical symptoms and do memories and beliefs have biological consequences? Toward
a model of mind-body healing.
Applied Psychophysiology and Biofeedback,
24(2), 91-105. https://doi.org/10.1023/a:1022201710323
Yaremko, R. M., Harari, H., Harrison, R. C., & Lynn, E. (1982).
Reference handbook of research and statistical methods in psychology: For
students and professionals. Harper & Row, Publishers, Inc.
Yucha, C. B., & Montgomery, D. (2008).
Evidence-based practice in biofeedback and neurofeedback (2nd
ed.). Association for Applied Psychophysiology and Biofeedback.
Zerr, C. L. (2021). Personal communication regarding probability testing.