Psychology Reporting and AI (1)

While I’m on the topic of Psychology Reporting…

Everything is up in the air in the science world right now. I just wrote about challenges and assumptions in psychological science reporting. Here’s a topic that was already stirring controversy before the current chaos: “AI.” A recent NY Times article, “This Therapist Helped Clients Feel Better. It Was A.I.,” suggested that “an A.I. chatbot eased mental health symptoms among participants” (NY Times, April 15, 2025). How is psychology reporting addressing the potential of AI?

What “AI” is not

First things first – it is important to recognize that AI is not… AI. It is not intelligent, not in the way we understand human intelligence. For now, it’s a really sophisticated parrot (not to diss parrots – they have far more human-like intelligence than AI does; I’m using the parrot only as a metaphor, not referring to the actual birds).

These systems are built on large language models (LLMs): they are trained on huge swaths of text gathered from across the internet and generate responses by predicting, word by word, what is likely to come next. LLMs use all that textual data to build networks of meaning. Randomness in the output comes from the variation in the vast amounts of human-written data going in, from quirks of the specific model, and from the sampling the model does as it generates each word. “Most AI experts agree that modern generative AI models do not (and maybe never can) have a subjective consciousness despite isolated claims to the contrary” (Scientific American, Jan 17, 2025). AI can say it is in pain, sure. But does it act independently and spontaneously from internal states, or only from external cues and being programmed to respond a certain way?
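To make the “sophisticated parrot” point a bit more concrete, here is a minimal toy sketch in Python of the core move a language model makes when it responds: pick the next word by sampling from a probability distribution learned from its training text. The words and probabilities below are invented for illustration – no real model works with a four-word vocabulary – but the sampling step is one place the output’s randomness comes from.

```python
import random

# Toy illustration (invented numbers, not any real model's internals):
# a language model responds by repeatedly sampling the next word from a
# probability distribution learned from its training text.
next_word_probs = {
    "pain": 0.40,       # most likely continuation of the prompt below
    "trouble": 0.25,
    "awe": 0.20,
    "Cleveland": 0.15,  # unlikely continuations still get picked sometimes
}

def sample_next_word(probs):
    """Draw one word at random, weighted by the model's probabilities."""
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# Same prompt every time, yet the completions vary from run to run.
prompt = "I am in"
for _ in range(5):
    print(prompt, sample_next_word(next_word_probs))
```

The model isn’t deciding anything here; it’s rolling weighted dice over words it has seen before.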

What “AI” is

That said, these systems are increasingly powerful and increasingly sophisticated. They may not be alive, but to us they increasingly respond and feel that way. They are in fact so sophisticated that we no longer have a grasp on how they reach their conclusions. We saw the advent of this many years ago, when a computer first won a match against the reigning world chess champion. But now it’s a daily occurrence that an AI has noted a pattern or reached a conclusion that we simply couldn’t – or at least, didn’t – on our own. This isn’t just faster. This is more sophisticated. We are running into the cliché: AI is playing chess while we are playing checkers.

Here’s what AI won’t do (unless we ask it to, which requires us to know what to ask and how to ask it): AI will not, on its own, tell us whether we have the rules for chess right. It won’t say whether the rules could be improved, or whether perhaps we should be playing a different game altogether.

We’ll come back to this point.

Psychology reporting and AI

“An AI chatbot eased mental health symptoms”

Let’s look at the study in question. The authors randomly assigned people with “clinically significant symptoms” of major depressive disorder or generalized anxiety disorder, or at high risk for an eating disorder, either to receive 4 weeks of access to Therabot, or to receive nothing, with the promise of Therabot access at the end of the study (8 weeks). They took ratings before the intervention, after 4 weeks, and then after an additional 4 weeks.

The NY Times writer reports, “During the trial, participants with depression saw a 51 percent reduction in symptoms after messaging Therabot for several weeks. Many participants who met the criteria for moderate anxiety at the start of the trial saw their anxiety downgraded to ‘mild,’ and some with mild anxiety fell below the clinical threshold for diagnosis.” The study authors report comparisons with the controls and show significantly larger decreases for each Therabot group, at both 4 and 8 weeks, than for the people still waiting.

Sounds nice, right?

Design issues

A couple of quick design problems before we get to the psychosocial aspects. First, as the NY Times article points out, is the study design itself. The authors used a no-intervention control. This means you cannot say what the effects are due to. It could be the screen time. It could be the talking, even if no one (or no thing) is listening. Really, it could be anything. Even more problematic, this design means the intervention is not blinded: people know whether or not they’re getting Therabot. So if people expect it to help, especially a flashy new treatment they’ve chosen to try, it likely will. Meanwhile, people who have to wait may expect that waiting won’t help, especially compared to a flashy new treatment. That could decrease their self-help behaviors and inflate the comparative results.

This type of control is designed to at least compare the treatment with the natural effects of time. That matters, because you’d expect symptoms to decrease over time on their own: people enroll in a study like this when their symptoms are high, and extreme scores tend to drift back toward a person’s typical level when you measure again (this is called “regression to the mean”). Which is exactly what the data here show for the control groups. The study’s control design deals a little bit with the effects of time, but I’m not sure how well, given the lack of blinding.
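To see why a decrease in the waiting group is expected, here is a small simulation with made-up numbers (nothing to do with the study’s actual data). Everyone has a stable underlying symptom level plus day-to-day noise; we “enroll” only people who score high at screening, give them nothing at all, and measure again.

```python
import random

random.seed(0)

# Toy simulation (invented numbers, not the study's data).
NOISE_SD = 5                                                  # day-to-day / measurement noise
true_levels = [random.gauss(15, 4) for _ in range(100_000)]   # stable symptom levels

# Screen everyone once; enroll only those who look "clinically significant" that day.
screened = [(t, t + random.gauss(0, NOISE_SD)) for t in true_levels]
enrolled = [(t, score) for t, score in screened if score >= 22]

# Re-measure the enrolled group later, with no intervention at all.
followup = [t + random.gauss(0, NOISE_SD) for t, _ in enrolled]

baseline_mean = sum(score for _, score in enrolled) / len(enrolled)
followup_mean = sum(followup) / len(followup)
print(f"enrolled: {len(enrolled)} people")
print(f"mean score at enrollment: {baseline_mean:.1f}")
print(f"mean score at follow-up (no treatment): {followup_mean:.1f}")  # noticeably lower
```

The follow-up average comes out lower than the enrollment average even though nobody received anything, simply because people were selected at a high point.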

Caution with the results

Next, a note about the results themselves. Five of the six standard deviations (SDs) for the treatment groups are as large as or larger than the effect size. What does that mean? SDs measure spread: across the different people, how much variation there was in how people fared. The larger the SD, the less consistent the results across people. When the SD is larger than the average effect, it suggests that many people showed no change, or even went up a little on symptoms.
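A quick back-of-envelope calculation, with hypothetical numbers rather than the study’s: suppose the average change in a treatment group is a 5-point drop in symptoms, but the SD of that change across people is 6 points. If individual changes are roughly normally distributed, a sizable chunk of the group improved not at all, or got worse.

```python
from math import erf, sqrt

# Hypothetical numbers for illustration (not taken from the study).
mean_change = -5.0   # average change in symptoms (negative = improvement)
sd_change = 6.0      # spread of individual changes across people

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Assuming changes are roughly normal, the share of people whose symptoms
# did not improve (change >= 0) is:
no_improvement = 1 - normal_cdf(0, mean_change, sd_change)
print(f"~{no_improvement:.0%} show no improvement or get worse")  # ~20%
```

So even with a respectable-looking average effect, roughly one in five people in this hypothetical group would have nothing to show for it.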

Who went which way, and why? This is a major flaw with group designs in general: unless you use a blocked design, where you have a block of people getting the treatment and a block getting the control for each potential factor, you can’t tell any of that. We only know that results varied widely. That leaves any particular person in a statistical guessing game: where will I fall on the bell curve?

Psychosocial Issues

The above issues are standard in this type of research. They are kinks to work out. As the researchers investigate further, they might find these exploratory results hold up, or they may not. And as the article notes, Therabot is generation 3. There will likely be a 4th, and a 5th, each of which will refine and expand. But that’s not the end of our story for right now.

Remember when I wrote above that AI won’t tell us whether we’re using the right rules, or the right game, unless we ask? That’s where we’ll pick up next time…