Reading Research: Details Matter
The Carlat Child Psychiatry Report, Volume 12, Number 5&6, July 2021
Topics: adverse events | control group | double blind | effect size | NNH | NNT | null hypothesis | number needed to harm | number needed to treat | open label | p-value | primary outcome | randomized controlled trial | randomized discontinuation design | RCT | replication | secondary outcome | statistical significance | treatment efficacy | unblinded
Glen Spielmans, PhD.
Professor of Psychology, Metropolitan State University, St. Paul, MN.
Dr. Spielmans has disclosed no relevant financial or other interests in any commercial companies pertaining to this educational activity.
When headlines report on positive treatment results from a study, families will come in asking about those results. This article will help you determine what makes a study a good one, and conversely what makes a not-so-great one. With this information, you can be better prepared to talk with families about research studies and results that can affect treatment decisions.
Open-label studies vs randomized controlled trials
In open-label studies, both researchers and patients know what the treatment is, and there is no control group. Positive results in an open-label study can be misleading because the patients may improve due to positive expectations or the natural course of their illness—rather than the treatment. Thus, open-label studies can offer preliminary hope but not solid evidence of efficacy.
Randomized controlled trials (RCTs) are a better way to evaluate a treatment’s effects—but they must be examined closely. RCTs are supposed to be double-blind, with participants and clinical raters unaware of who receives which intervention until the study is over. But drug side effects can “unblind” such studies and bias their results. Drugs with more obvious physical or psychological effects can lead to greater unblinding; for instance, a high dose of olanzapine is more likely to cause unblinding than a low dose of fluoxetine. Also, the same clinical raters usually evaluate both efficacy and adverse events. A rater who notes drug-specific adverse events (or a lack of adverse events) may guess which patients are receiving the active treatment, defeating the purpose of the RCT.
Defining success: Effect size is key
Researchers often declare that a treatment “works” if it has a statistically significant benefit over placebo. What does “statistical significance” mean? In an RCT, suppose an antidepressant outperforms placebo by 2 points on a depression rating scale. Based on scores obtained through the rating scale, statistical calculations generate a p-value. The p-value is the probability of obtaining a result at least as large as the one observed (a 2-point benefit for the medication) if the null hypothesis, which claims that there is no treatment effect, were true. In other words, if the drug really had no effect, what are the odds that the study would find at least a 2-point benefit for the drug? If the p-value is less than .05 (5%), the result is deemed statistically significant. However, statistical significance does not necessarily mean that the result is important! Among other things, sample size is influential: a very small treatment benefit may reach statistical significance in a large study. Statistical significance gives some confidence that there is a treatment effect, so it’s an important first step, but it’s not the final word on treatment efficacy.
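To make the sample-size caveat concrete, here is a minimal sketch (the numbers are hypothetical, and a normal approximation stands in for the t-test) showing how the same 2-point drug–placebo difference can be non-significant in a small trial yet significant in a large one:

```python
import math

def two_sided_p(mean_diff, sd, n_per_arm):
    """Approximate two-sided p-value for a between-group mean difference,
    using a normal approximation to the two-sample t-test."""
    se = sd * math.sqrt(2 / n_per_arm)       # standard error of the difference
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# Same 2-point benefit, SD of 8 points on the rating scale (hypothetical)
print(two_sided_p(2, 8, 20))   # small trial (20 per arm): p > .05, "not significant"
print(two_sided_p(2, 8, 300))  # large trial (300 per arm): p < .05, "significant"
```

Nothing about the drug changed between the two calls; only the sample size did.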
What we really want to know is the effect size—the magnitude of the treatment effect. Common convention for standardized effect sizes (eg, Cohen’s d) is that 0.20 = small, 0.50 = medium, and 0.80 = large. In psychiatry, effective treatments nearly always generate small to medium effect sizes (compared to placebo). It is now standard practice to report effect size in treatment studies. A study that fails to report effect size may be trying to downplay a minimal treatment benefit.
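As a back-of-the-envelope sketch (the numbers are hypothetical, not drawn from any cited trial), a standardized effect size is simply the between-group difference divided by the pooled standard deviation:

```python
def cohens_d(mean_treatment, mean_placebo, pooled_sd):
    """Standardized mean difference (Cohen's d)."""
    return (mean_treatment - mean_placebo) / pooled_sd

# Hypothetical trial: drug improves scores 2 points more than placebo,
# with a pooled standard deviation of 8 points on the rating scale
d = cohens_d(mean_treatment=2, mean_placebo=0, pooled_sd=8)
print(d)  # 0.25 -> a small effect by conventional benchmarks
```

This is why a "statistically significant" 2-point difference on a scale with an 8-point spread can still be a modest clinical effect.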
Categorical outcomes sound great—but aren’t always impressive
Many studies report improvement in symptom scores along a continuous measure. Clinicians often prefer categorical outcomes, like response and remission. However, categories have arbitrary cutoff points. For instance, in autism we can look at a total ADOS score or a change from the severe to moderate range, from the moderate to mild range, etc. But this is tricky—if a patient’s score goes from the low end of severe to the very high end of moderate, that shift is not necessarily clinically meaningful. Pay attention to total scores along with any categorical outcomes.
Always check the NNT or NNH
For categorical outcomes, one should examine the number needed to treat (NNT) and number needed to harm (NNH). These values refer to the number of people who would need to receive treatment in order to gain an additional positive (NNT) or negative (NNH) outcome over what would have occurred if all participants received placebo. For instance, an NNT of 8 for “response” means 8 patients would need to be treated to gain a response that would not have occurred if all 8 patients had taken placebo. An NNT of 5 is often considered impressive, while an NNT of more than 10 is often considered unimpressive, but there is no firm consensus on this. For NNH, the acceptable range might vary from 10 to 100 depending on the side effect or the likelihood of discontinuation due to a side effect. For severe side effects such as Stevens-Johnson syndrome, you’ll want to see a much higher NNH.
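The arithmetic behind NNT and NNH is the same: take the reciprocal of the absolute difference in event rates and round up to a whole person. A minimal sketch, using hypothetical response and side-effect rates:

```python
import math

def nnt(rate_treatment, rate_placebo):
    """Number needed to treat (or harm): reciprocal of the absolute
    difference in event rates, rounded up to a whole person."""
    return math.ceil(1 / abs(rate_treatment - rate_placebo))

# Hypothetical trial: 50% response on drug vs 35% on placebo
print(nnt(0.50, 0.35))  # 7 -> treat 7 patients to gain one extra response

# Hypothetical side effect: 12% on drug vs 4% on placebo
print(nnt(0.12, 0.04))  # NNH of 13 -> one extra patient harmed per 13 treated
```

Note that a small absolute difference in rates produces a large NNT, which is why impressive-sounding percentages are worth converting before discussing them with families.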
Distinguish primary from secondary outcomes
RCTs typically use several outcomes. Researchers declare a single outcome as the primary outcome before the study starts. This prevents researchers from cherry-picking a positive outcome after the data have been examined, then declaring it as the primary outcome. When reading a study, you should examine all secondary outcomes as well. Patient self-reports, quality of life, levels of functioning in school/work/family life—these measures provide valuable information and should be closely considered along with clinician-rated symptom scales. There may be different results across these outcomes. For instance, antidepressants have been shown to provide small benefits on depression rating scales for youth, but yield no benefits on depression self-reports and quality of life measures compared to placebo (Spielmans GI and Gerwig K, Psychother Psychosom 2014;83(3):158–164).
Beware medication discontinuation designs—they may not mean the drug works
Most RCTs last only a few weeks, while child development is measured in years. You cannot assume long-lasting benefit based on a positive short-term RCT. Most studies of long-term drug efficacy inappropriately use a randomized discontinuation design (El-Mallakh RS and Briscoe B, CNS Drugs 2012;26(2):97–109). These studies start with only participants who have responded to the drug in the short term. By random assignment, some participants are (usually abruptly) switched to placebo while others continue to take the drug. This conflates drug withdrawal effects among those switched to placebo with treatment efficacy in those who keep taking the drug. The worse the drug discontinuation effects, the worse the placebo group performs after being taken off the medication—and the better those who stay on the medication seem in comparison. A better test of long-term effects is to simply lengthen a short-term placebo-controlled RCT (Khan A et al, J Psychiatr Res 2008;42(10):791–796). Longer studies are far more expensive and might reduce the hoped-for positive findings of the researchers, though, so they are rarely done.
Adverse events may be underreported
Ideally, RCTs should accurately detect adverse events. While weight and some lab measures are usually reliably assessed in RCTs, most adverse events are assessed vaguely. For example, until recently, there was little attempt to ask specific questions regarding suicidality in most treatment studies, leading to an underreporting of such events. Adverse events must be systematically assessed; otherwise, studies may be unable to detect them. Also, researchers often don’t report all recorded adverse events in journal articles (Hughes S et al, BMJ Open 2014;4(7):e005535).
Replication of results is crucial
It’s easy to get excited about published positive treatment results, but further research may or may not support the initial findings. For industry studies, replication from a non-industry team is important. For therapy studies, you want to see replication of positive results by a separate research group. Two good RCTs of the same treatment for the same indication showing good effect size represent much more powerful evidence than a single trial. How often does this happen in child and adolescent psychopharmacology? Not often. Some treatments have demonstrated consistently poor results; for example, there are multiple studies showing no significant effect of desvenlafaxine or paroxetine on depression in youth.
“Doctor, what about this study?”
How can we help families understand that much popular press coverage of research is misleading, without sounding cynical? Listen respectfully, then calmly and neutrally describe the hope that an open-label study provides, and stress that controlled trials are needed to know whether a treatment is truly helpful. For example, in CCPR’s Jan/Feb/March 2021 issue, we talked with Dr. Aaron Besterman about how results from pharmacogenomic testing, while interesting, are unlikely in most cases to lead to changes in good treatment.
CCPR Verdict: You will hear about and read research your entire career. Pay attention to the quality of the study and effect size. Educate families about how you use your professional judgment to give them truly evidence-based recommendations. We’ve added a Clinical Research Checklist box below as a guide to help you interpret study results and limitations.
Clinical Research Checklist
- Randomized controlled trial (RCT) or open label?
- For each efficacy outcome, ask: Are there parent reports, self-reports, daily functioning, or quality of life measurements?
- Remember that categorical outcomes (eg, remission, response) are usually based on arbitrary cutoff scores. Always consider categorical outcomes in the context of rating scale scores and other outcomes.
- If an adverse event is not systematically measured, it is likely underreported.