Judgment Upgrade: Improving Forecasting Techniques

A recent report on best practices in forecasting has the potential for long-range positive impact in many fields, including medical research and the development and operation of healthcare provider organizations. In an article published in Harvard Business Review(1), authors Paul J.H. Schoemaker and Philip E. Tetlock posit that it’s possible for almost any organization to improve its ability to forecast—if leaders have a thick skin. The process could prove embarrassing, since it will most likely expose the ineffectiveness of old methods and demonstrate that even the improved forecasting methods are not as accurate as one would like them to be.

The article takes as its starting point the statement by the U.S. National Intelligence Council, in October 2002, that Iraq possessed chemical and biological weapons of mass destruction and was producing more. The resultant costly and unproductive military action could have been avoided with better intelligence.

From this mistake, considerable research evolved, including a multi-year (2011-2015) prediction tournament funded by the Intelligence Advanced Research Projects Activity, in which five academic research teams competed in addressing various economic and geopolitical questions. The winning team, called the Good Judgment Project (GJP), was co-led by Tetlock and Barbara Mellers, PhD, of the Wharton School at the University of Pennsylvania. This project consisted of a series of forecasting contests, from which the organizers drew three main conclusions:

  • people with broad general knowledge often make better forecasts than specialists,
  • teams often out-predict individuals, and
  • proper training often enhances predictive talent.

These findings, the report says, could influence the way various organizations forecast outcomes such as business trends or employee performance, as well as financial results. The article stresses that the improved methods it recommends are not certain to produce accurate predictions, but if Company A is consistently a little better at judgment calls than Company B, its competitive advantage will grow exponentially.

Methods of improving predictability will be more effective in some circumstances than in others, of course. Many issues already are highly predictable, given the proper tools, without any improvement in subjective judgment skills. Other issues are so little understood—and so unpredictable by their very nature—that new methods of forecasting them are almost impossible to develop.

Predicting the predictable

However, for a great many questions, there exist some data that can be analyzed scientifically, and from which logical conclusions (or at any rate, predictions) can be drawn. These might include the efficacy of a product (or service) in development, the performance of a prospective investment or the suitability of a prospective new hire. Schoemaker and Tetlock warn that their methods and approaches have to be tailored to the specific needs of an organization, but they have identified a set of practices that in general might be useful to any individual or team that is analyzing a problem that requires improved subjective judgment.

“Most predictions made in companies, whether they concern project budgets, sales forecasts or the performance of potential hires or acquisitions, are not the result of cold calculus,” the authors note. “They are colored by the forecaster’s understanding of basic statistical arguments, susceptibility to cognitive biases, desire to influence others’ thinking and concerns about reputation. Indeed, predictions are often intentionally vague to maximize wiggle room should they prove wrong.”

The GJP offers training in probability concepts that measurably boost the accuracy of the trainee’s predictions. These include:

  • regression to the mean,
  • updating of probability estimates to allow for new data,
  • more precise definition of what is to be predicted and the time frame, and
  • basing the prediction on a numeric probability.
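As an illustration of the second item, updating a numeric estimate when new data arrives can be sketched with Bayes' rule. This is a minimal sketch, not the GJP's own training material: the function name `update_probability` and the product-launch numbers are hypothetical, chosen only to show the arithmetic.

```python
def update_probability(prior: float,
                       p_evidence_if_true: float,
                       p_evidence_if_false: float) -> float:
    """Return the revised probability of a hypothesis after seeing evidence.

    prior               -- probability assigned before the new data arrived
    p_evidence_if_true  -- chance of seeing this evidence if the hypothesis holds
    p_evidence_if_false -- chance of seeing this evidence if it does not hold
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    if denominator == 0:
        raise ValueError("evidence has zero probability under both cases")
    return numerator / denominator


# Hypothetical numbers: a forecaster starts at 30% that a launch will hit
# its target, then sees a pilot result that turns up 80% of the time before
# successful launches and 40% of the time before unsuccessful ones.
posterior = update_probability(0.30, 0.80, 0.40)
print(f"Revised estimate: {posterior:.2f}")
```

The revised estimate rises above the prior because the evidence is more likely under success than under failure. Evidence that is equally likely either way leaves the estimate unchanged.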

The article also notes the presence of confirmation bias in many groups and individuals. Researchers usually have preconceived hypotheses of what their conclusions will be (or should be), and will look for data that’s likely to lead to those conclusions. GJP’s methods teach trainees to beware of these biases and to look for evidence that might contradict their hypotheses. GJP also teaches trainees to beware of “streaks,” or other deviations that might look like developing patterns but are usually explicable by the small size of the sample being examined. (For example, a flipped penny that has come up heads eight times in a row still has a 50/50 chance of coming up heads on the ninth flip.)

GJP’s methods include tests of trainees’ confidence in their own knowledge, by asking them questions about both their general knowledge and knowledge of information specific to their organization, and then asking them how confident they are of their responses. “The aim,” the authors explain, “is to measure not participants’ domain-specific knowledge, but, rather, how well they know what they don’t know.”
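One way to score such a confidence test is to compare how sure trainees said they were with how often they were actually right. The sketch below is an assumption about how that comparison might be done, not the GJP's actual procedure; the function names and the sample answers are hypothetical.

```python
from collections import defaultdict


def calibration_table(responses):
    """Group answers by stated confidence and report the hit rate per group.

    responses -- list of (confidence, was_correct) pairs, confidence in [0, 1]
    Returns {confidence: (number_of_answers, fraction_correct)}.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in responses:
        buckets[round(confidence, 1)].append(1 if was_correct else 0)
    return {c: (len(hits), sum(hits) / len(hits))
            for c, hits in sorted(buckets.items())}


def overconfidence(responses):
    """Mean stated confidence minus actual hit rate (positive = overconfident)."""
    mean_confidence = sum(c for c, _ in responses) / len(responses)
    hit_rate = sum(1 for _, ok in responses if ok) / len(responses)
    return mean_confidence - hit_rate


# Hypothetical quiz results: five answers given at 90% confidence and
# five at 60% confidence, with three of each group correct.
answers = [(0.9, True), (0.9, True), (0.9, True), (0.9, False), (0.9, False),
           (0.6, True), (0.6, True), (0.6, True), (0.6, False), (0.6, False)]

table = calibration_table(answers)
gap = overconfidence(answers)
```

In this made-up sample, the trainee is well calibrated at 60% confidence but right only 60% of the time when claiming 90%, which is the kind of gap the test is meant to expose.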

The authors advise organizations to customize their training, to focus on prediction areas that are especially useful to them or on areas where their predictive performance has been historically bad. They also recommend assembling teams that will work together on forming predictions. The teams should be intellectually diverse: that is, at least one member should be an expert in the subject being researched, and at least one should be a generalist. Team members should be aware of tendencies to bias and have sound reasoning skills.

This team should be able to manage the “diverging phase,” in which team members are likely to form different assumptions, approaches and opinions and give more or less weight to different factors; the evaluating phase, in which these various points of view are discussed and weighed; and the converging phase, where the team forms its conclusions and/or predictions. The first phase, in which the team members work independently to some extent, is especially important because it reduces the likelihood of the team forming “anchors”: pre-conceived opinions to which the group might hew for too long.

Trust is essential to a successful team, in almost any walk of life, and it’s especially important when predictive research is involved—because the predictions could turn out not to be to the benefit of one team member or another. A certain forecast could, for example, render one team member’s job redundant, or could indicate that the member’s work performance has been sub-par. In such cases, it’s sometimes difficult to maintain the integrity of the predictions while minimizing any negative effects.

Quantitative feedback

The article also recommends giving predictive teams plentiful, frequent and timely feedback. It notes that many professions that are predictive by their nature rely on this feedback to keep them working hard to improve. A TV meteorologist, for example, is subject to immediate chastisement and ridicule from viewers if a prediction is inaccurate. A professional bridge player relies almost entirely on predictions, based on probabilities and the ability to understand the code languages that are used in bidding and the play of the cards. If his predictions are wrong, he’ll lose points and let his partner down.

“The purest measure for the accuracy of predictions and tracking them over time is the Brier score,” the article states. “It allows companies to make direct, statistically reliable comparisons among forecasters across a series of predictions.”

To arrive at a Brier score, assign a numerical value of 1 to a prediction that came true, and 0 to one that didn’t. Subtract that number from the predictor’s confidence level, expressed as a probability between 0 and 1, and square the result. That’s your Brier score, and the lower the score, the better.

For example, if Mr. X were to say, “I’m 75% sure that the Packers will win on Sunday,” and the Packers do win, you express that confidence as a probability (0.75) and subtract 1. The result is -0.25. That number, squared, is 0.0625. If the Packers lose, you subtract 0 from 0.75 and square that result: 0.5625. When team members have accumulated enough Brier scores to form a representative sample, it becomes possible to measure the efficacy of the team and individual members. In many cases, it’s possible to drill deeper and determine that certain team members are strong in certain areas, and some in others.
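The calculation described above can be sketched in a few lines. This follows the single-outcome form given in this article, where scores run from 0 to 1; some formulations sum the squared error over every possible outcome, which gives a range of 0 to 2. The function names and the three-forecast track record are hypothetical, and averaging the scores is one simple way to compare forecasters over a series of predictions.

```python
def brier_score(probability: float, came_true: bool) -> float:
    """Squared gap between the stated probability and the outcome (1 or 0)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    outcome = 1.0 if came_true else 0.0
    return (probability - outcome) ** 2


def mean_brier(forecasts) -> float:
    """Average Brier score over (probability, came_true) pairs; lower is better."""
    forecasts = list(forecasts)
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum(brier_score(p, ok) for p, ok in forecasts) / len(forecasts)


# The Packers example: a 75% forecast scored against each possible result.
score_if_win = brier_score(0.75, True)    # 0.0625
score_if_loss = brier_score(0.75, False)  # 0.5625

# Hypothetical track record for one forecaster across three predictions.
track_record = [(0.75, True), (0.90, True), (0.60, False)]
average = mean_brier(track_record)
```

A forecaster who always says 50% scores 0.25 on every question under this form, so an average well below 0.25 suggests the forecasts carry real information.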

It’s also important to determine how certain predictions were reached, and why the actual result was what it was, to ensure that an outcome didn’t ensue through luck or accident. The researching and decision-making processes should be documented in real time, from start to finish, to determine—in the case of a mistaken prediction—whether it was the information, the analysis or the team’s organization that was at fault. It often happens that a team, based on its research and analysis, concludes, “We should do A,” but the people at the top of the organization will say, “It would be a huge inconvenience to do A. We’ve done B many times before and nothing has gone wrong.” And the team, as often as not, dares not say “we told you so” after the fact.

Thorough audits of the decision-making process can determine whether the original problem was properly stated, whether the underlying premises were correct, whether critical information was ignored or overlooked, and whether all opinions were aired and considered. What was done right, and what was done wrong, can both be noted to improve best practices for future studies.

Reference: (1) Schoemaker PJH, Tetlock PE. Superforecasting: how to upgrade your company’s judgment. Harv Bus Rev. 2016;94(5):72-78.