Psychologist Paul Meehl reviewed the results of 20 studies and analyzed whether clinical predictions made by trained professionals were more accurate than statistical predictions made by combining scores according to a rule. For example, in one study a statistical formula was more accurate at predicting the grades of college freshmen than 11 of 14 professional counselors. Meehl reported similar outcomes in predicting violations of parole, success in pilot training, and criminal recidivism.
Meehl’s findings chip away at the overconfidence that people have in human judgment at large, and particularly in their own abilities. That overconfidence is especially strong when they have expertise in a given field, as the counselors do here.
Meehl’s book prompted shock and disbelief, and a lot of subsequent research was devoted to proving him wrong. Still, about 60% of the studies comparing the two approaches (which cover a variety of medical variables, economic measures, questions of interest to government agencies, and other outcomes like winners of football games and future wine prices) have shown significantly better accuracy for algorithms, while the other studies scored a draw in accuracy.
Even a draw in accuracy between algorithms and humans is, in effect, a blow to the confidence of humans, because algorithms often cost a lot less than professionals. It is also remarkable that algorithms hold up across such a wide range of fields, which demonstrates how ubiquitous overconfidence is.
Meehl suggests that this discrepancy arises because experts sometimes consider complex combinations of features to make predictions, while the algorithms focus on simple combinations of features. People often feel that they can overrule the formula because they have additional information. This is true only in odd cases. For example, a formula that predicts whether a person will go to the movies tonight should be disregarded if one learns that the person broke a leg today. But broken legs are decisive and also very rare, so they seldom justify overruling the formula.
It is interesting that Meehl believes the issue to be a human tendency toward complexity, as Kahneman’s arguments so far have focused on the human desire to make things simple. But in these cases of prediction, where System 2 is already activated, overconfidence actually causes mistakes because some factors are emphasized even though they are less relevant than others.
Humans are also very inconsistent, unlike formulas. When asked to evaluate the same information twice, they frequently give different answers. This can be a matter of real concern, as with radiologists who contradict themselves 20% of the time. These inconsistencies are likely due to the context dependency of System 1. Formulas do not suffer from the same problems.
System 1 plays a big role in human inconsistency because, as Kahneman explained in the earlier chapters, System 1 is unconsciously affected by factors like priming and different ways of framing questions.
Kahneman also introduces the idea that some formulas don’t even require any statistical research. Psychologist Robin Dawes provides an example of this kind of formula: marital stability is predicted by frequency of lovemaking minus frequency of quarrels. Positive numbers signify good results. This formula can compete with an optimally weighted formula and is often better than expert judgment.
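The formula attributed to Dawes above can be written out as a minimal sketch. The function names and the choice of "per week" as the unit are assumptions made for illustration; the text only specifies that the score is the frequency of lovemaking minus the frequency of quarrels, and that positive numbers signify good results.

```python
def marital_stability_score(lovemaking_freq, quarrel_freq):
    """Equal-weight score: frequency of lovemaking minus frequency of quarrels.

    Both frequencies must use the same unit (assumed here: times per week).
    """
    return lovemaking_freq - quarrel_freq


def predicts_good_outcome(lovemaking_freq, quarrel_freq):
    """A positive score is read as a good sign, per the summary above."""
    return marital_stability_score(lovemaking_freq, quarrel_freq) > 0


# Hypothetical couples, for illustration only (not data from any study).
couple_a = marital_stability_score(3, 1)   # 3 - 1 = 2
couple_b = marital_stability_score(1, 4)   # 1 - 4 = -3
```

Note that nothing here is estimated from data: each factor simply gets a weight of +1 or -1, which is what makes this kind of formula usable without any statistical research.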
Human fallibility is made even more apparent when Kahneman shows that formulas based essentially on common sense (and no research) are better able to predict outcomes than human experts are.
An application of this approach was developed to save infants’ lives. Obstetricians had always known that a newborn that is not breathing normally a few minutes after birth is at risk of brain damage or death. Virginia Apgar came up with a scoring system to develop consistent standards for determining which babies were in trouble, and it has been credited with reducing infant mortality. It is still in use every day in every delivery room.
The invention of the Apgar test demonstrates the importance of recognizing human fallibility. Instead of trusting the experts, who may have biased judgment, the creation of a standardized test helped to lower infant mortality.
The hostility of clinical psychologists to Meehl’s ideas stemmed from their own everyday experience that their intuitions and judgments are good. But the tasks at which they fail typically require long-term predictions about the patient’s future, and it is hard for them to know the boundaries of their skill. Additionally, the debate centers on the idea that our sympathies inherently lie with our fellow humans over machines.
Yet even in the face of these examples, people are still hesitant to trust algorithms. This makes some sense: in the same way that we prefer stories over statistics, we often prefer human judgment over mechanical predictions.
The prejudice against algorithms is magnified when the decisions are consequential, because the cause of a mistake often matters to people. The story of a child dying because an algorithm made a mistake draws more outrage than the same tragedy occurring because of human error. But overall, the role of algorithms has been expanding, for example in calculating credit limits or the amount a professional team should pay for a player.
Even though we may feel better if mistakes are made by humans, this is a somewhat illogical position, as fewer mistakes would be made overall if we trusted the predictive results of algorithms.
In 1955, at 21 years old, Kahneman was assigned to set up an interview system for the army. At the time, every soldier completed psychometric tests and then had a personality assessment conducted by other young draftees. This interview procedure was found to be almost useless for predicting success.
Kahneman’s description of his own discovery of some of Meehl’s principles makes even clearer his stand on trusting numerical scales over human judgment.
Kahneman decided that instead of trying to learn about the interviewee’s mental life, the interviewers should gather specific factual information, and that global evaluations should no longer determine the final decision. The interviewers would evaluate several personality traits and score each separately. In this way, he hoped to combat the halo effect.
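The idea of scoring each trait separately and then combining the scores mechanically can be sketched as follows. This is only an illustration under stated assumptions: the trait names, the 1-to-5 range for each trait, and the use of a plain sum as the combining rule are hypothetical choices, not details given in the summary above.

```python
# Assumed rating range for each individual trait (hypothetical).
MIN_RATING = 1
MAX_RATING = 5


def interview_score(trait_ratings):
    """Combine separately scored traits into one number by simple addition.

    trait_ratings maps a trait name to the rating the interviewer gave it.
    Each trait is rated on its own, so a strong impression on one trait
    is not supposed to spill over into the others (the halo effect).
    """
    for trait, rating in trait_ratings.items():
        if not MIN_RATING <= rating <= MAX_RATING:
            raise ValueError(f"rating for {trait!r} is out of range: {rating}")
    return sum(trait_ratings.values())


# Hypothetical recruit, for illustration only.
recruit = {"responsibility": 4, "sociability": 3, "punctuality": 5}
total = interview_score(recruit)  # 4 + 3 + 5 = 12
```

The point of the design is that the interviewer supplies only the individual ratings; the final number comes from the rule, not from an overall impression.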
Kahneman understood that the issue is not only that humans are overconfident in their judgments, but also that their judgments contain many of the flaws that he has already introduced, like the bias to like everything about a person.
The interviewers were displeased to be ordered to switch off their intuition and ask factual questions. Kahneman then compromised and instructed them to also give each recruit a global score on a scale of 1 to 5 at the end of the interview. The new method was a substantial improvement over the old one. Even the intuitive judgment did better, because it was based on more objective information and scales. Kahneman makes clear that this anecdote could be useful for anyone interested in better interview procedures.
Kahneman leaves one glimmer of hope for human judgment: when people are held to a more objective scale, their intuition becomes more predictive. This is similar to the Apgar test that Kahneman described earlier, which became effective when people were required to report answers based on a standard scale.