- Biases versus noise: how they differ
- How to conceptualize a judgment and judgment error
- What the three different categories of noise look like, and how they can skew decision making
- Why companies that depend on the judgments of many people should conduct a noise audit
- Decision hygiene and preventative strategies to improve decision making
- Practical steps that can be integrated into organizations to optimize judgments and reduce noise
Beyond Bias with Olivier Sibony
Podcast, May 25th, 2021
What is a Judgment?
“What we call a judgment is something where you expect reasonable degrees of agreement between people, but not perfect agreement. If you are expecting perfect agreement, it’s actually a question of fact. If you ask, “How much is 24 times 17?” Well, two reasonable people should agree exactly on that. Now, if you ask, “How likely is it to rain this afternoon?” Actually, that’s a topic on which if I’m looking at the sky right now where I am, and I say, “The probability of rain is 0%,” you’re going to say, “Olivier, you’re crazy. It doesn’t look like a 0% probability of rain.” But if I say it’s 50%, and you say it’s 40%, we can agree to disagree. That’s a reasonable disagreement. So, we’re talking about judgment here, or judgment of probability in this case, we’re talking about a judgment which is characterized by the fact that we expect some disagreement, but not a lot of disagreement. That’s how we’ve defined judgment.”
How do we define noise?
“A bias is simply the average error. The fact that on any given day, all our forecasters have a different probability estimate, that’s noise. That’s the variability of judgments that should basically be identical. And that’s how we define noise. It’s what is left of the error after you’ve removed the shared error, the average error, which is the bias, and it’s therefore the variability in judgements that should be identical.”
How does noise differ conceptually from a bias?
“It’s interesting in itself to ask why we are so insensitive to noise, why we are so deaf to noise essentially. We speak of bias all the time; we never talk about noise. And for many of the people who are listening to us, this may be the first time, of course, the word noise is familiar, but this may be the first time that they hear about this form of error that we call noise. What’s important to bear in mind, and completely non-intuitive, is that it actually contributes as much to error as does bias.”
Noise in the business world
“When we talk about improving decisions, and when we talk about biases specifically, there is a question that comes up quite often, at least it came up quite often in my work when I was a consultant trying to advise companies that were struggling with the problem of bias. And the problem is this: when you look back at some decision you’ve made, especially at some mistake you’ve made, or it’s something that you consider in hindsight to be a mistake, it’s always very easy to identify what bias tripped you. Now, when you try to predict in what direction you are going to be wrong, because some kind of bias is going to affect you, that is much harder.”
Decision hygiene as a preventative measure
“The problem with noise is that we don’t know the direction of the error. We don’t know in what direction we should push or pull. So, the analogy of hygiene here is meant to suggest that preventing noise is a bit like washing your hands. You don’t know if by washing your hands, the virus or the bacteria you’ve avoided are the ones that would have given you COVID, or that would have given you the flu, or that would have given you some other disease. You just know it’s good hygiene. It’s a good thing to do.”
Making measurement scales relative can reduce the level of noise
“We’re actually a lot better at making relative judgments than absolute judgments, and there is a lot less noise between people in relative judgment than there is in absolute judgment. If you take again those two evaluators, one who is very lenient and tends to give the highest ratings to everybody, and one who is very tough and tends to reserve the highest rating for only the very best people, and you ask them to rank their employees, they are much more likely to agree on the ranking than they are to agree on the rating. That is going to help a lot.”
Brooke: Hello, everyone, and welcome to the podcast of The Decision Lab, a socially conscious applied research firm that uses behavioral science to improve outcomes for all of society. My name is Brooke Struck, Research Director at TDL, and I’ll be your host for the discussion. My guest today is Olivier Sibony, former partner at McKinsey, currently teaching at HEC in Paris, and the coauthor of Noise: A Flaw in Human Judgment, which he recently published with Cass Sunstein and Daniel Kahneman. In today’s episode, we’ll be talking about noise, the problems that it creates, and cleaning up signals using decision hygiene. Olivier, great to have you back on the show.
Olivier: Great to be here.
Brooke: Please tell us a bit about how the idea for this book emerged, and how you ended up joining forces with Cass Sunstein and Daniel Kahneman to pull it together.
Olivier: Well, as always, it’s a long story, so, I’ll try to keep it short. Basically, when we talk about improving decisions, and when we talk about biases specifically, there is a question that comes up quite often, at least it came up quite often in my work when I was a consultant trying to advise companies that were struggling with the problem of bias. And the problem is this, when you look back at some decision you’ve made, especially at some mistake you’ve made, or it’s something that you consider in hindsight to be a mistake, it’s always very easy to identify what bias tripped you.
Olivier: Now, when you try to predict in what direction you are going to be wrong, because some kind of bias is going to affect you, that is much harder. If you are going to make an acquisition, are you being overly optimistic? Yeah, quite possibly. Or, are you being too cautious and too loss-averse and too hung up on the status quo bias? Well, that’s a possibility too. And this is a simple example, but every time we were looking at bias, we had the same problem, which is that the multiplicity of the psychological biases that we have actually produces all sorts of bizarre combinations. And the result of those combinations is something that looks quite random, and that’s what we ended up calling noise.
Brooke: So, when you talk about errors, in the book, you talk about these as errors in judgment, and I’d like to explore what judgment means so that we know what the problem space looks like that we’re dealing with here. So, what do you mean by judgment?
Olivier: Absolutely. And we tried to have a fairly narrow definition of judgment so that we actually know what we’re talking about. What we call a judgment is something where you expect reasonable degrees of agreement between people, but not perfect agreement. If you are expecting perfect agreement, it’s actually a question of fact. If you ask, “How much is 24 times 17?” Well, two reasonable people should agree exactly on that. Now, if you ask, “How likely is it to rain this afternoon?” Actually, that’s a topic on which if I’m looking at the sky right now where I am, and I say, “The probability of rain is 0%,” you’re going to say, “Olivier, you’re crazy. It doesn’t look like a 0% probability of rain.”
Olivier: But if I say it’s 50%, and you say it’s 40%, we can agree to disagree. That’s a reasonable disagreement. So, we’re talking about judgment here, or judgment of probability in this case, we’re talking about a judgment which is characterized by the fact that we expect some disagreement, but not a lot of disagreement. That’s how we’ve defined judgment.
Brooke: Okay, great. I think that’s really helpful. And when we look at sets of judgments, we can notice that there are different kinds of patterns that might emerge in those sets. And this is where we have two different kinds of error, as you discussed before: biases and noise.
Olivier: The way we thought about this is, if you take again my simplistic example of forecasting rain, now, suppose that I am a very pessimistic forecaster, and I always forecast that it’s going to rain, I’ve got a bias. Right? Or, maybe you’re a very optimistic forecaster, and you have an optimistic bias and you always underestimate. Now, let’s take the average of all the forecasters, and if it turns out that all forecasters in our weather forecasting department, on average, forecast a 10% higher probability of rain than would be correct, that’s an average error, that is a bias.
Olivier: Now, we can look for an explanation for this bias; maybe there is a psychological explanation for this, maybe there is an incentive, we can come up with all sorts of speculations, but basically, that’s the bias. The bias is simply the average error. The fact that on any given day, all our forecasters have a different probability estimate, that’s noise. That’s the variability of judgments that should basically be identical. And that’s how we define noise. It’s what is left of the error after you’ve removed the shared error, the average error, which is the bias, and it’s therefore the variability in judgements that should be identical.
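The forecasting example can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the five forecasts and the "correct" probability are invented numbers, chosen only so that the average error comes out at the 10 points mentioned in the example.

```python
from statistics import mean, pstdev

# Hypothetical data: five forecasters' rain probabilities (in %) for the same afternoon.
forecasts = [50, 40, 60, 45, 55]

# Hypothetical "correct" probability, assumed known purely for illustration.
true_probability = 40

# Each forecaster's individual error.
errors = [f - true_probability for f in forecasts]

# Bias: the average error, shared by the group as a whole.
bias = mean(errors)

# Noise: the variability of judgments that should have been identical,
# measured here as the standard deviation around the group's own average.
noise = pstdev(forecasts)

print(f"bias = {bias} points, noise = {noise:.2f} points")
```

Note that the noise figure does not depend on knowing the correct answer: it only needs the forecasts themselves, which is why noise can be measured even when the truth is unknown.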
Brooke: That’s very interesting. I think that that really adds a new layer to our understanding of errors in judgment. So much of behavioral science has focused on these systematic patterns, but now shifting the focus to consider the distributions of those errors as well, I think really enriches conversations about how it is that we can go wrong in certain ways. I liked that in the book, you distinguish certain kinds of noise, that we don’t just have kind of a simple dichotomy between systematic error, which is bias, and then distribution error, which is noise. Noise also gets broken down into some categories that I found very useful. The three that come to mind are level error, pattern error, and occasion error. Could you help us to understand those a little bit?
Olivier: Before I do that, let me just echo what you were saying. It is actually a very different type of error from bias, and it’s one that never gets discussed. It’s interesting in itself to ask why we are so insensitive to noise, why we are so deaf to noise essentially. We speak of bias all the time, we never talk about noise. And for many of the people who are listening to us, this may be the first time, of course, the word noise is familiar, but this may be the first time that they hear about this form of error that we call noise. What’s important to bear in mind, and completely non-intuitive, is that it actually contributes as much to error as does bias.
Olivier: Now, conceptually, the variability of your errors and the average amount of your errors contributes to your error, when you measure it as mean squared error, which is the typical way to do it, it contributes to your errors in exactly the same way. And therefore, when we neglect noise, when we are unaware of this important component of how we go wrong, we actually neglect a big reason why we make mistakes. So, I just wanted to echo that because you were making an important point. Now, the different types of noise. We’ve tried to dig into what creates noise, and it’s actually quite simple to explain the three types of noise.
Olivier: The first type is the one that we immediately think of. Let’s consider a very practical example, suppose that you’re in a courthouse and you’re facing a sentencing, and the luck of the draw essentially is going to decide which judge you get assigned to. You might get Judge Brooke, who has a reputation for being an extremely severe and tough hanging judge, or you might get Judge Olivier, who has a reputation for being a bleeding heart, lenient guy. That’s the first type of noise. In any given system, you’ve got people who have different average levels in their judgment. Here, it’s the level of severity of the judges, it’s quite intuitive, and when we think of noise, that’s the first one we think of.
Olivier: There’s a second type of noise, which is that Judge Brooke, who is very tough and who is really, really your bad guy to be assigned to, might luckily turn out to be in a very good mood today. Maybe he just won the lottery, maybe his favorite sports team won, maybe he got married yesterday, maybe he just had a beautiful baby. I mean, even the toughest SOB might be in a great mood. And by the way, the guy who is unusually lenient might be, today, especially badly disposed. So, that’s what we call the second type of noise, it’s occasion noise. The same person on two occasions is going to behave differently because of elements of the context that in theory should be irrelevant, that should not influence decisions, but that in fact do influence decisions.
Olivier: So, we’ve got level noise, and we’ve got occasion noise. But here’s the thing, there is a third type of noise, which is larger than the other two in general, and which actually we never think of, and which is a bit harder to get our minds around. That’s what we call pattern noise, and here’s how to think about it. If Judge Brooke and Judge Olivier are seeing the same 10 defendants, their ranking of those 10 cases is not going to be identical. You, Judge Brooke, might be tougher on average, and I might be more lenient on average, but that doesn’t mean that every time we look at two cases, you’re always going to be tougher than me. There are going to be cases where you’re going to be more lenient than usual, and cases where you’re going to be tougher than usual. And the same is true for me.
Olivier: Essentially, our individual response to each case, to each defendant, to each crime, to each set of circumstances, is going to be partly idiosyncratic, and a lot of that idiosyncrasy is constant. So, you might, for instance, be a tough judge, but unusually lenient with white-collar criminals. And I might be a lenient judge, but extremely tough with people who break the speed limit. And we might have our own reasons for that which come from our experience and our beliefs and our politics or whatever, but this is part of our personality. So, this third type of noise means that each of us has a different pattern of judgments, and that’s why we call it pattern noise.
Olivier: You can think of it as our judgment personality. It’s what makes us individual in the way we make judgments, and that’s how we all differ from one another. So, level noise, occasion noise and pattern noise, the three components add up to noise in any given system that is trying to make homogeneous judgments.
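The two additive claims in this answer can be written compactly. This is a sketch, assuming error is measured as mean squared error (as stated above) and that the components are statistically independent, so that they add in squares rather than linearly. "Pattern noise" here means the stable, person-specific part only, with occasion noise listed separately.

```latex
\begin{align}
\text{MSE} &= \text{Bias}^2 + \text{Noise}^2 \\
\text{Noise}^2 &= \text{Level noise}^2 + \text{Pattern noise}^2 + \text{Occasion noise}^2
\end{align}
```

The first line is why a given amount of noise hurts accuracy exactly as much as the same amount of bias: both enter the total error in the same squared form.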
Brooke: And just to clarify, you noted earlier that the relative magnitudes here are relatively even between bias and noise, and within noise across those three categories, it’s mostly pattern noise, you were mentioning, as the big driver. Is that correct?
Olivier: Yeah. So, it’s very hard to generalize obviously, especially between bias and noise. You might have a system in which there is a lot of bias, and you might have a system in which there is no bias. But by and large, there is good reason to believe that in many systems, there is more noise than bias, simply because if there was bias, it would have been addressed, it would have been fixed, whereas noise is less obvious. And within noise, there is fairly good reason in many of the studies that we looked at, to suspect that pattern noise is a big component of it. And this is actually philosophically interesting, because pattern noise, as I said, is what our individuality creates, what our personality creates.
Olivier: Pattern noise arises from the fact that we are all different, and all interesting, and all wonderfully unique. And that is great for many things, but in a system that expects us all to arrive at identical or at least not very different answers, it becomes a problem that we tend to underestimate because we are so used to considering individuality as a good thing.
Brooke: That, I think, gives us a bit of a clue to unravel some of the mystery perhaps of why noise gets so much less focus than bias. So, if judgements are things about which we expect a reasonable degree of coherence among the answers that we get across individuals, there are so many instances, judging being a perfect example, where in actual practice, we seldom have a set to work with, we’re seldom confronted with a set. We have an N of one. We only get one judgment to look at, and so, it’s hard to identify how noise or bias in that single instance might be contributing to the judgment.
Olivier: That’s so true. And actually, the same would be true, let me extrapolate from the judges, the same would be true of any professional making any judgment in an organization. If you are a doctor making a diagnosis, you don’t pause to ask yourself, “How would another doctor think of this patient? And what could possibly make that doctor disagree with me?” If you are an underwriter in an insurance company, and you say, “The premium we’re going to charge for this particular insurance policy is $100,000,” you don’t ask yourself, “Could someone else in the underwriting department think that it should be 170?” Which, by the way, is the median difference that we observe in underwriting departments.
Olivier: So, you look at each decision on its own, and that is a big reason why we tend to remain unaware of the fact that others, if given the same case, would actually look at it differently. There is another reason which we shouldn’t underestimate, which is that organizations do a very good job of sweeping that problem under the rug. Organizations are designed to give themselves the illusion of consensus, which we’ve called the illusion of agreement, they’re designed to explain away the occasional disagreements, and they are designed to produce harmony and to maintain the credibility of the experts who work within their company.
Brooke: So, in that kind of situation where it’s not always easy to put your hands on a set of judgments in order to measure, and organizationally, in a situation where the immune system of the organization will react very strongly to this outside force, how is it that we can, first of all, get some traction in terms of trying to assess the extent of the problem? And from there, depending on what we find, what is it that we can do to tangibly increase the quality of the judgments?
Olivier: The first step, to answer your first question, is what we call a noise audit. Basically, it’s to create an artificial exercise in which what normally never happens does happen. And what normally never happens is that different people are going to look at the same case. So, as you said earlier about judicial judges, judges would typically see one defendant, and one defendant would see one judge, and we never find out what another judge would have sentenced the same defendant to. Many years ago, scholars who were concerned about the variability of judicial sentencing raised this issue, and studies were done, where actually, the same vignettes describing cases were given to 200 judges, 208 federal judges, and the variability of the sentences was measured by basically asking all judges to look at all the cases and seeing how much level noise there is (are some judges much more severe than others?) and how much pattern noise there is.
Olivier: If it had been designed differently, we could also have seen how much occasion noise there was. So, when you do something like that, which is not very difficult to do, you can actually measure how much noise there is in a system. There could be a lot, and in just about every study that we’ve seen, there was actually quite a lot. It’s possible, it is at least conceptually possible that there will not be a lot, and that you would say, “It’s not worth fixing. Our experts are very disciplined and are very consistent, and whatever differences there are, are well within the bounds of what we consider to be tolerable. That’s fine then.” You’ll be reassured that everything is fine.
Olivier: But we strongly suggest that any organization that relies on the quality of the judgments of many supposedly interchangeable people, should do a noise audit to measure how much variation there is.
Brooke: And that noise audit, how does that help us to identify the three categories of noise that we talked about earlier? What is it that we’re looking for there?
Olivier: It’s tricky in the sense that you will rarely be able to see occasion noise, but you will be able to see level noise, and you will be able to see the rest, which by definition will be pattern noise, including occasion noise. Quite simply, again, take the example of the judges: if I take, as in the study I was talking about, 16 different cases, and I measure the average sentence, which for all judges is seven years in prison, but the average sentence of Judge Brooke is nine years, and the average sentence of Judge Olivier is five years, that’s level noise. That basically means that the minute a defendant gets assigned to you rather than to me, he’s already scored four more years in prison before you actually start looking at his case.
Olivier: This is the real order of magnitude. The mean difference between two judges in that study was 3.8 years. So, it’s actually not quite as bad as the four years I was taking in this example, but nearly. It’s 3.8 years. This should be regarded as outrageous, right? I mean, this is completely unacceptable. If we told you that, all things being equal, black defendants get sentenced to 3.8 more years than white defendants, or people of a particular background or a particular gender are treated differently, we would say, “This bias is intolerable,” and we would be right. We would, of course, not want to tolerate that.
Olivier: Now, when that injustice is triggered simply by the luck of the draw, we somehow don’t pay as much attention to it. Maybe we should. It’s hard to see morally what makes it more justifiable for the personality of the judge to drive as much of the sentencing as these particular studies seem to suggest.
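The arithmetic of a noise audit, as described in this answer, can be sketched in Python. This is an illustration only: the judges, cases and sentences are invented (three judges and four cases rather than the 208 judges and 16 cases of the study), and the residual-based calculation of pattern noise is one common way to split the variance, assumed here rather than taken from the episode. The numbers are chosen so the overall average is seven years, with Judge Brooke at nine and Judge Olivier at five.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

# Hypothetical audit data: sentences in years, one row per judge, one column per case.
# Every judge sees every case, which is what a noise audit arranges artificially.
sentences = {
    "Judge Brooke": [10, 12, 6, 8],
    "Judge Olivier": [4, 8, 5, 3],
    "Judge Third": [7, 10, 4, 7],  # an invented third judge
}
judges = list(sentences)
n_cases = len(sentences[judges[0]])

# Average sentence per judge, per case, and overall.
judge_means = {j: mean(sentences[j]) for j in judges}
case_means = [mean([sentences[j][c] for j in judges]) for c in range(n_cases)]
grand_mean = mean(list(judge_means.values()))

# Level noise: how much the judges' average severity differs.
level_noise = pstdev(list(judge_means.values()))

# What is left after removing each judge's level and each case's average.
# With a single reading per judge and case, this remainder mixes stable
# pattern noise with occasion noise, as noted in the answer above.
residuals = [
    sentences[j][c] - judge_means[j] - case_means[c] + grand_mean
    for j in judges
    for c in range(n_cases)
]
pattern_noise = sqrt(mean([r * r for r in residuals]))

# System noise: average variability between judges looking at the same case.
system_noise = sqrt(
    mean([pvariance([sentences[j][c] for j in judges]) for c in range(n_cases)])
)

print(f"level noise   = {level_noise:.2f} years")
print(f"pattern noise = {pattern_noise:.2f} years (includes occasion noise)")
print(f"system noise  = {system_noise:.2f} years")
```

In this balanced layout the squared system noise equals the squared level noise plus the squared pattern noise, so the audit shows not just how much judges disagree but how much of that disagreement comes from overall severity.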
Interlude – 20:00
Hi there, and welcome back to the Decision Corner, the podcast of the Decision Lab. Today I’m speaking with Olivier Sibony about his recent collaborative project with Daniel Kahneman and Cass Sunstein. Up until this point, we’ve been deconstructing the idea of noise: the variability in human judgments that has been overlooked in behavioral science – until now. We’ve explored the different elements of noise and looked at how they confound our judgments in various applied contexts.
In the next half hour we are going to dive into decision hygiene — what it is, and how it works — along with helpful strategies that, when properly put into practice, can greatly improve our decision making. We’ll also suggest some practical steps for organizations to incorporate these strategies into their own daily functioning.
Cleaning up our choices
Brooke: Luckily, not all is lost, you articulate quite a bit in the book about practical strategies that we can take to take some kind of remediating action against the noise that we find. So, would you like to help us walk through these decision hygiene strategies and how to apply them and where they apply to us?
Olivier: Yeah. Let me first try to explain the bizarre phrase, decision hygiene, that you just mentioned. Why do we call this decision hygiene? The idea is actually quite simple. When you are looking at a bias, and you’re thinking about debiasing, which I suspect a lot of the people who are listening to us will be familiar with this term, you assume that you know what bias you’re fighting. You basically assume, for instance, that you are dealing with people who are doing project planning, you know everything about the planning fallacy, you suspect that they are going to be overly optimistic and that they are going to underestimate the cost of their projects and the time it takes to complete them.
Olivier: So, you put in practices that are going to counteract that bias, that are designed to correct a predictable mistake. That makes a lot of sense when you know the direction of the error, because you know what bias you’re trying to fix. The problem with noise is that we don’t know the direction of the error. We don’t know in what direction we should push or pull. So, the analogy of hygiene here is meant to suggest that preventing noise is a bit like washing your hands. You don’t know if by washing your hands, the virus or the bacteria you’ve avoided are the ones that would have given you COVID, or that would have given you the flu, or that would have given you some other disease. You just know it’s good hygiene. It’s a good thing to do.
Olivier: It’s not at all the same thing as saying, “Oh, I got this disease, let’s find the right treatment,” which would be the analog of debiasing. So, this decision hygiene is to debiasing what hand-washing is to medical treatment basically. It’s a preventative approach where you don’t know what enemy you’re fighting, you don’t need to know what enemy you’re fighting to actually improve the quality of your decision. It’s prevention. What does it consist of? We’ve got a series of techniques, I don’t know which ones you think we should talk about, but basically, the idea is to make your decision process more robust, to make the whole decision process less likely to be a victim of noise.
Brooke: Yeah. I think one of the, I mean, the strategies that you articulated in the book, all strike me as very approachable and very sensible, but some of them really resonated with my own experience. So, I’ll draw on my pattern noise here and talk about, especially scales. So, when we ask people to assess something on a scale, one of the sources of noise is that we don’t necessarily have an agreement about what constitutes a three versus a five, versus a seven, on the scale.
Olivier: Absolutely. And first of all, before we go there, any judgment implies a scale. So, you’re taking the example of a numerical scale, where we say, “Let’s rate the performance of this employee on a scale from one to four,” or, “Let’s give one out of five stars to this restaurant,” on the review we leave on Google. I’m assuming we can actually go to a restaurant, which, of course, sounds like science fiction these days. So, those are quantitative scales. But when I say, “This tumor is very likely to be benign,” very likely is a scale. I could have said likely, I could have said unlikely. This is a scale. So, any judgment, in one way or another, assumes a scale. The scale might be categories, it might be numbers, it might be lots of things, but it’s always a scale.
Olivier: Now, whenever we use a scale, there is a danger that different people will use the scale differently. Take performance ratings, which is one of the domains that we study in the book in great detail and where there is a staggering amount of noise. The big reason for that noise is that there is, in general, a tendency of raters to inflate ratings and to use the high ratings more than the low ratings because it saves them the trouble of having difficult conversations. But between raters, some people think that to get the highest rating, you need to be an absolute once in a lifetime superstar, and others think that, by and large, unless you’ve killed someone, you should get the highest rating.
Olivier: And that difference is, of course, a big source of level noise, because they simply disagree on how to use the scale. They might both think, if you’re asked to describe it in the same terms, that the performance of the person they are looking at is the same. They’ve actually seen the same person and they think the same things about that performance, but the way they use the scale to qualify that performance is quite different. So, whenever we can make the scale more precise, whenever we can make the scale less ambiguous, we are going to reduce the level noise that we’re talking about here.
Olivier: And the best way to do that, it turns out, is to make the scale relative. We’re actually a lot better at making relative judgments than absolute judgments, and there is a lot less noise between people in relative judgment than there is in absolute judgment. If you take again those two evaluators, one who is very lenient and tends to give the highest ratings to everybody, and one who is very tough and tends to reserve the highest rating for only the very best people, and you ask them to rank their employees, they are much more likely to agree on the ranking than they are to agree on the rating. That is going to help a lot.
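The difference between rating and ranking can be shown with a toy model in Python. Everything here is an assumption made for illustration: the five employees, their scores, and the idea that a lenient or tough evaluator simply shifts every impression up or down by a fixed amount before applying the same 1-to-4 scale.

```python
# Hypothetical underlying performance of five employees, on a 0-100 scale.
performance = {"Ana": 92, "Ben": 81, "Chloe": 74, "Dev": 66, "Eli": 58}

# Assumed cutoffs for a 1-4 rating scale, shared by both evaluators.
CUTOFFS = (60, 75, 90)


def rate(score):
    """Convert an impression of performance into an absolute 1-4 rating."""
    return 1 + sum(score >= cutoff for cutoff in CUTOFFS)


def rank(impressions):
    """Order employees from best to worst according to one evaluator's impressions."""
    return sorted(impressions, key=impressions.get, reverse=True)


# Model leniency as a constant shift in how each evaluator perceives everyone.
lenient_view = {name: score + 15 for name, score in performance.items()}
tough_view = {name: score - 15 for name, score in performance.items()}

lenient_ratings = {name: rate(score) for name, score in lenient_view.items()}
tough_ratings = {name: rate(score) for name, score in tough_view.items()}

# Count how many employees receive a different rating from the two evaluators.
rating_disagreements = sum(
    lenient_ratings[name] != tough_ratings[name] for name in performance
)

print("lenient ratings:", lenient_ratings)
print("tough ratings:  ", tough_ratings)
print("employees rated differently:", rating_disagreements)
print("same ranking:", rank(lenient_view) == rank(tough_view))
```

A constant shift moves every absolute rating but leaves the order untouched, which is the sense in which ranking removes level noise. Real evaluators also differ in pattern, so rankings will not agree perfectly in practice.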
Olivier: And of course, you can’t always do that. So, when you cannot do that, one trick to replicate the relative rating is to create a scale that is itself relative, where you use cases as anchors on the scale and you create what we call a case scale, meaning that instead of saying, “Bob is four on,” I don’t know, “client relationship skills,” we’re going to define a three on client relationship skills as being Anna’s level of skill in client relationships, and we’re going to define a two as being Barbara’s level of skill in client relationships.
Olivier: And mentally, all we now have to do is to ask, “Is Bob better than Anna?” “Oh, no, he’s not as good as Anna.” “Okay, is he better than Barbara?” “Yeah, he’s better than Barbara.” “Great. So, he’s between those two cases on the scale.” And that kind of relative ranking by anchoring the scale on specific cases actually makes the ratings a lot less noisy.
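The case scale just described amounts to a short series of pairwise comparisons against named anchors. The Python sketch below is one hypothetical way to express it: the function name, the `is_better_than` callback and the recorded comparisons are all invented for illustration, and ties with an anchor are not handled.

```python
# Anchor cases for "client relationship skills", taken from the example above:
# a three is defined as Anna's level and a two as Barbara's level.
anchors = {2: "Barbara", 3: "Anna"}


def place_on_case_scale(candidate, anchors, is_better_than):
    """Return the (lower, upper) scale points the candidate falls between.

    `is_better_than(candidate, anchor_name)` stands in for the rater's
    relative judgment; either bound is None if the candidate is off the
    end of the anchored range.
    """
    lower, upper = None, None
    for level in sorted(anchors):
        if is_better_than(candidate, anchors[level]):
            lower = level  # candidate clears this anchor, keep climbing
        else:
            upper = level  # first anchor the candidate does not clear
            break
    return lower, upper


# The rater's relative judgments from the dialogue above, recorded as data.
comparisons = {("Bob", "Anna"): False, ("Bob", "Barbara"): True}

placement = place_on_case_scale(
    "Bob", anchors, lambda person, anchor: comparisons[(person, anchor)]
)
print(placement)  # (2, 3): Bob sits between Barbara and Anna
```

The rater never has to decide what a "three" means in the abstract; each step is a relative judgment of the kind people tend to agree on.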
Brooke: The thing that I found so interesting about that is that in going through that exercise myself, I started to realize the power that that kind of thing has if we’re talking about performance assessment for individuals, or if we’re talking about looking at the success of projects that we’ve conducted, it struck me that actually going through that exercise and identifying those cases and then socializing those cases within the organization has a profound impact on the culture of the organization. So, when we sat down and said, “Okay, well, we’d like to start assessing our track record of research projects, how successful were they in terms of smoothness of execution? Then in terms of staying within budget? Being delivered on time? Having impact?” These kinds of things.
Brooke: When we went through the exercise of articulating, “What was our most impactful project?” And asking ourselves why, and having those conversations, and building up the storyline around that so that when we want to convey it to the rest of the organization, “When we are assessing projects, these are the cases that we’re using,” the cultural impacts of going through that process can be profound.
Olivier: Because it makes you look more fact-based in your discussion of those cases. And implicitly, when you’re talking about projects here, you’re using another decision hygiene technique, which we call structuring. Because when you talk about projects, you don’t just say, I assume, “What was our best project? What was our average project? And what was our worst project?” You probably say, “Well, on the quality of the impact that we’ve eventually had, what characterizes a five, a three, and a one? On how smoothly it went in our relationships with the clients, what characterizes a five, a three, and a one? On the happiness of the team throughout the project, what characterizes a five, a three, and a one?”
Olivier: And if you go through that exercise on each dimension separately, and you create anchor cases, which is the name we give to those examples you will use, you will actually change the culture of the organization, as you say, because you will create a shared frame of reference, as we call it, you will have a shared language and a shared frame of reference to describe what it means to have impact, what it means to have a happy team, what it means to have a good client relationship experience, and that actually raises the aspirations of people on different dimensions.
Olivier: The beauty of structuring this, of course, and doing it differently on the different dimensions, is that you force yourself to articulate those dimensions separately, and you force yourself to see that there is probably no project that is perfect on every dimension, and no project that is terrible on every dimension, although they might be, of course, correlated, and that helps a lot too to be more balanced and more nuanced in your evaluation of each project.
Brooke: Yeah. And even that process of articulating what the dimensions are, “Which are the ones that we care enough about assessing?” That conversation is one that has profound impacts. If it comes up in an organization that assessing employee happiness throughout the course of a project is not one of the things that they measure, that says a lot about what kind of culture the organization is going to have.
Olivier: This is very important. And actually, we’re now talking about another decision hygiene technique that we call the Mediating Assessments Protocol. Now, that’s a very bizarre name and probably not the most user-friendly we could’ve come up with, but what you’re describing here is what we call mediating assessments. Basically, what we suggest is that when you have a complex judgment to make, like, “How well did this project go?” Or, “How well does this person perform?” Or, “Should we acquire this company?” Or, “What policy should we adopt to deal with this pandemic?” It could be anything complex. That thing that you’re looking at always has multiple dimensions. And making sure that you spell out what those dimensions are, and that you have what we call mediating assessments on each of those dimensions separately, is what the Mediating Assessments Protocol is all about.
Olivier: So, what we would say is, first of all, specify what actually counts as being a successful project. For instance, if the happiness of the team during the project is a factor, let’s have that. If the development of the team skills between the beginning and the end of the project is another factor, let’s have that. Let’s list those factors. There probably shouldn’t be 20, but there probably should be more than one or two. It’s going to be a list of maybe five, maybe seven, maybe eight, something along those lines. Once you’ve actually identified those mediating assessments, make sure, and this is the hard part, make sure that you evaluate each of them as separately as you can, so that you don’t have too much of a halo effect in your evaluation of those dimensions.
Olivier: If, for instance, you ask your people, “How much have you learned in this project?” At the same time as you ask them, “Were you happy throughout this project?” You are going to have a big bias, a big halo effect in your answer to those two things. So, you should think hard about how you get that information in a way that is as independent as possible. And then make your final evaluation of the projects based on a holistic assessment of those multiple mediating assessments, which may be anchored on the average of those assessments, but it does not need to be, it could simply be a different discussion informed by each of the individual assessments. For every type of complicated multi-dimensional decision, we think that’s actually a pretty good discipline to adopt. It’s difficult, but it’s not that difficult.
Brooke: We’ve laid out a number of strategies there. If we want to put them on a bit of a temporal sequence, the way that this would unfold within an organization, the first thing you might say is, “Well, we need to identify what the dimensions are that we care about.”
Brooke: From there, you would say, “Okay, well, for each dimension, if we want to have some kind of intensity scale, be it numeric or not, we need to identify cases that we think represent various kinds of milestones along the way of that intensity scale.”
Olivier: Correct. The first thing would be to identify the assessments that I mentioned, the second one would be to have appropriate scales, and if possible, relative scales with anchor cases that are anchored in a frame of reference that everybody shares, which gives a common language to the organization and moves this culture forward, as you said. So, that, yes, that would be the second step. The third step would be to make sure that when you evaluate those dimensions, you evaluate them as independently of one another as possible. So, orchestrate your data gathering process in a way that makes it possible for, for instance, the quality of your delivery to the client and the happiness of the team to be evaluated independently of each other.
Olivier: It could be as simple as asking the question at a different time, or, in this example, asking different people. The client might tell you one thing, your team might tell you another thing. So, make sure you have independent assessments. Fourth, make sure that for each of those assessments, you’ve got multiple sources of information. So, if you want to know if your team is happy, don’t just ask the project manager, that’s a fairly obvious example, but take the average of the judgments of all the members of your team. If you want to know if your client is satisfied, don’t just ask one individual like the client, or don’t just ask them once, ask them now, ask them again in a month, and again in three months, and take the average of those data points. That’s the fourth thing.
Olivier: And finally, and this may be the hardest thing, and we haven’t talked about it yet, resist the temptation to have a holistic point of view in the beginning. Now, if you’re doing the evaluation of the project, it’s easy because the project has already happened. But it is harder if what you’re doing is making a decision on something that depends on a number of assessments. Suppose, for instance, and this is a real example where we’re using this technique, suppose you are a venture capital firm, and your assessment of every opportunity that you look at is based on six or seven different dimensions. One might be the quality of the founder team, and another one might be how big the potential market is, and the third one might be how distinctive the product or the technology being developed is, and you can imagine the rest.
Olivier: The challenge here is that the minute we start looking at a company like this, we have an idea forming in our minds, we have an intuition, and we start to filter everything that we look at through that filter, through that intuition. We start to let confirmation bias make us less attuned to what might contradict our initial opinion, and more sensitive to what might actually reinforce it. What we want to try to avoid is precisely that. What we want to try to avoid is forming a holistic impression until we have actually discussed each of those separate assessments. The whole point of the Mediating Assessments Protocol is that they should be mediating assessments. They should be intermediary steps before you come to your final judgment.
Olivier: And at a minimum, if you can’t help forming your own opinion, make sure you don’t discuss it. So, when you are in the final decision meeting, make sure that you do not discuss the final decision before you have discussed each of the assessments. That way, your final intuition, when it comes, will be informed by each of the separate dimensions.
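The sequence described above (list the dimensions, gather several independent ratings for each, and average them dimension by dimension before any overall discussion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dimension names, the 1-to-5 scale, and the ratings are hypothetical, and the function deliberately produces a profile rather than a single overall score, since the final judgment is meant to stay a holistic discussion.

```python
from statistics import mean

# Hypothetical ratings on a 1-5 scale. Each dimension has several
# independent sources, gathered separately to limit the halo effect.
ratings = {
    "client_impact": [4, 5, 4],       # e.g. client asked now, in one month, in three months
    "team_happiness": [3, 2, 4, 3],   # one rating per team member, not just the project manager
    "skill_development": [4, 4, 5],
}

def dimension_profile(ratings):
    """Average the independent sources for each dimension separately.

    No overall score is computed here: the profile is an input to the
    final discussion, not a replacement for it.
    """
    return {dimension: round(mean(scores), 2) for dimension, scores in ratings.items()}

profile = dimension_profile(ratings)
for dimension, score in profile.items():
    print(f"{dimension}: {score}")
```

Keeping the output as one number per dimension, with no total, is a design choice that mirrors the advice to discuss each assessment before anyone states a final view.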
Brooke: We undertook a project last year where we did something like this. And one of the things that we found helpful to address this challenge is around the relative weights of the dimensions, and specifically, using some pre-commitment tools. So, we went through an exercise with the client where we were helping them to identify the relative weights that they would apply to each of these, and then essentially conducting something like a noise audit to identify where the discrepancies were, but also where the averages lie, and using that to fuel a conversation to settle on what the weights would be before opening up the envelope and seeing what the results would give across the whole set of potential investments that they were looking at.
Brooke: Now, one of the questions that comes up in my mind is that that kind of approach is very valuable, but at the beginning, it can be challenging. It can be challenging because until you look at the results that you get from different weighting decisions, it can all feel very ethereal. It doesn’t feel very tangible. And so, there can be something iterative there that’s valuable. In your experience applying this kind of thing, how do you find the kind of pre-commitment strategy versus a bit of reflective equilibrium, where we check out different weightings and see how those affect different outcomes?
Olivier: By pre-commitment you mean pre-committing to the weights that you’re going to give to each of the dimensions?
Brooke: That’s right. So, if we say, “Across these dimensions, we want them all to total up to 100 points. This one’s going to carry 30.”
Olivier: Here’s something that might surprise you, we actually don’t suggest doing that. We know, of course, that decision theory suggests that there are optimum weights, and that you should agree on those weights, and you should pre-commit to those weights, and you should be willing to suspend judgment in that ethereal status where you’re going to be until you see the answer, the problem though is that it’s not only difficult at the beginning, it’s difficult throughout, and it’s difficult at the end. People need to believe in the judgment that they’re making in the end, and if the judgment is the result of a formula, they won’t believe in it.
Olivier: Furthermore, what they will do after a while, if they are not stupid, is that they will reverse-engineer the formula and they will make sure that the ratings they give to each of the dimensions are such that they get to the answer they want. In order to do that, they will in fact decide what answer they want at the beginning, which is exactly what we’re trying to avoid. Right? So, in theory, in decision theory, this approach to decision making is perfect, in practice, in psychological practice and in organizational practice, it tends to be completely counterproductive because people hate decision-making power to be taken away from them, and often, for a good reason, and therefore, they will work around the system to recoup, to regain the authority that’s been taken away from them.
Olivier: And that will, in fact, not just make the whole thing a waste of time, it will make it worse, because it will encourage them to have their opinion from the beginning, and then to give ratings on each of the mediating assessments in such a way that they get what they want. So, we don’t actually suggest doing that. What we suggest doing is tell people from the beginning that it will be a holistic discussion in the end, it will not be the average of the mediating assessments. You will have to make the decision in exactly the same way you would have made it, the only difference is that before you actually come to the discussion in your final decision meeting, you will have discussed each of the dimensions separately, and you will be informed, you will have given yourself a chance to be swayed by the evidence.
Olivier: By the way, the fact that we ask you not to share your ingoing hypothesis, your ingoing view with others, has that nice added benefit that it gives you a little bit more leeway to change your mind. No one will know what you see if you change your mind. You walk into the meeting thinking, “We should make that acquisition. It’s a great business.” And then we look at the quality of the team, and we look at the quality of the markets, and we look at the quality of the products, and your opinion goes a little bit down and down, and in the end, you are not that convinced. Well, maybe ordinarily, you would have stuck to your guns and said, “Yeah, we should still do it,” because you had said it. But just preventing you from stating that opinion gives you the option to actually change your mind on the basis of the facts that you’re going to discover.
Brooke: I feel like part of what I need to do now is leap to my own defense about this work that we were doing. And so, that point about not wanting to relinquish decision authority, I think is a very important one. And in fact, that’s something that we have integrated into our process as well. The idea was to arrive, and part of the context that might be important here is that there were a large number of investment decisions that needed to be made. And so, what we were trying to do is to arrive with a ranked list as the basis for discussion, not to say, “We now have a ranked list, we’re going to draw a hard line between those which have a score above 56, and those which have a score below 56. And everything above the line gets money, and everything below gets cut.”
Brooke: That’s a very important situation to avoid. And I think that that creates a lot of incentive, as you say, to reverse-engineer the system. The challenge is that when we have a large set of decisions to make, if we can’t go through this kind of full Mediating Assessments Protocol for each individual case, we might say, “Well, there’s a lot of discussion to be had about those things that are going on at the front here, but the top 20 are probably pretty safe locks. And we know that anything that’s getting a score below 15 out of 100, we probably don’t need to talk about too, too much.”
Brooke: “So, we need to prioritize our resources, our discussion resources, our time, around these boundary cases.” And the other thing about having that ranking and the individual dimensions is that it helps to give the conversation a bit more structure. When we say we want to prioritize this over another thing, despite the fact that that’s not what the kind of the hard ranking suggests, the question of why we’re moving one thing from its current location to somewhere else has some structure to it that gets away from the kind of horse-trading that can sometimes go on within organizations when you’ve got a bunch of investment decisions to make, and rather than going through this structured discussion, you have a bunch of different executives who hive off in a boardroom or in a quiet corner and say, “Well, I’ll support yours if you’ll support mine.”
Brooke: And ultimately, it ends up being a political discussion and one governed by power dynamics rather than governed by the actual information that’s supposed to inform our choice.
Olivier: First of all, if you’ve got a number of things that you rank comparatively, that you rank relative to one another, and if you’ve got a group of people who are disciplined enough, what you’re doing works fine. Moreover, if you are in fact using it to create big buckets of obvious yes, obvious no, and things we need to talk about, well, then we’re not that far apart. Because if you are doing your mediating assessments and you look at all the things that get a bad score on every dimension, and a very high score on every dimension, these are no-brainers basically. Now, there’s a little bit of a difference between the simple mechanistic approach and the holistic discussion, which is that you could have things that have very contrasted profiles.
Olivier: You could have things that are average across the board, and things that are very high on some dimensions and very low on others, and those are things you probably want to discuss. Unless you’re doing something very mechanistic, it’s hard to think that a formula is going to do justice to that kind of complexity. So, that’s where you will want to discuss the ones that are in the middle, and that’s where you will want that discussion to give people a chance to flex the criteria a little bit, because it’s not that precise.
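The three buckets described here (obvious yes, obvious no, and things to talk about) can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the 1-to-5 scale, the `low` and `high` cut-offs, and the option names are hypothetical, and the function only decides where discussion time should go, not what the final decision is.

```python
def triage(scores, low=2, high=4):
    """Sort one option into a bucket from its per-dimension scores.

    scores: mapping of dimension name -> rating on a 1-5 scale.
    low / high: illustrative cut-offs, not recommended values.
    """
    values = list(scores.values())
    if all(v >= high for v in values):
        return "obvious yes"
    if all(v <= low for v in values):
        return "obvious no"
    if max(values) >= high and min(values) <= low:
        # High on some dimensions and low on others: an average would hide this.
        return "discuss (contrasted profile)"
    return "discuss (middling profile)"

# Hypothetical investment options with hypothetical ratings.
options = {
    "A": {"team": 5, "market": 4, "product": 5},
    "B": {"team": 1, "market": 2, "product": 2},
    "C": {"team": 5, "market": 1, "product": 4},
    "D": {"team": 3, "market": 3, "product": 3},
}

buckets = {name: triage(scores) for name, scores in options.items()}
for name, bucket in buckets.items():
    print(name, "->", bucket)
```

Options C and D would have similar averages, which is why the sketch looks at the whole profile rather than a single weighted score.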
Brooke: Yeah, the illusion of precision is also something that I think we need to avoid here. For instance, one of my favorite examples of this is, in a previous life, putting together data sets to share with clients. I had some colleagues who would leave like five numbers after the decimal, and clients would constantly come back and say, “Well, there’s this difference in the fourth place after the decimal between this and that, so, which one really is better?” And the answer is, “Well, we’re just looking at the wrong orders of magnitude here.” If you need to look that far past the decimal to see a difference, then the difference is probably more a reflection of kind of noise within the system, than it is to do with some kind of precise underlying reality.
Brooke: And so, that kind of humility about what we can rely on the numbers to do, and where we need to depart from them, is something that I think is really important in this system. And credibility and trust within the results that you come to is an incredibly important one.
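The point about differences far past the decimal can be made concrete with a rough Python check. This is a heuristic sketch, not a statistical test: it assumes each option has several ratings, and it simply asks whether the gap between two averages is larger than the spread of the ratings themselves. The function name and all the numbers are hypothetical.

```python
from statistics import mean, stdev

def difference_is_meaningful(a, b):
    """Rough heuristic: is the gap between two averages larger than
    the spread (sample standard deviation) of the ratings behind them?"""
    gap = abs(mean(a) - mean(b))
    noise = max(stdev(a), stdev(b))
    return gap > noise

# Two options whose averages differ only in the fourth decimal place.
option_a = [3.41237, 3.39811, 3.42950]
option_b = [3.41102, 3.40067, 3.43011]

# Two options that are clearly apart.
option_c = [4.6, 4.4, 4.5]
option_d = [2.1, 2.3, 2.0]

close_call = difference_is_meaningful(option_a, option_b)
clear_gap = difference_is_meaningful(option_c, option_d)
print("a vs b meaningful?", close_call)
print("c vs d meaningful?", clear_gap)

# Reporting at a precision the data can support avoids the false comparison.
print(round(mean(option_a), 1), round(mean(option_b), 1))
```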
Olivier: Plus, there is another thing that we haven’t talked about, which is that, however good your mediating assessments are, there might be things that you haven’t thought of, and that are going to be real deal-breakers or deal-clinchers on the decision that you are trying to make. Now, this shouldn’t be an excuse to jettison the whole architecture of decision making, but occasionally, there’s going to be something where you say, “Well, yes, this candidate checks all the boxes, but he was just arrested for dealing drugs. So, no, we hadn’t anticipated that this should be one of our criteria, but we don’t hire convicted felons.” That’s an extreme example obviously. But sometimes there are going to be deal-breakers like that, and you don’t want to blind yourself to that possibility.
Brooke: Yeah. One of the things that we found in conducting this kind of work is that having the dimensions of the mediating assessments available is very helpful for structuring that kind of conversation. So, someone will say, “Ah, but there’s this kind of extenuating circumstance that makes this either something we must do or something we must not do.” The question then is, “Well, that extenuating circumstance, are we really sure that it doesn’t fall within what we’re already capturing?” Or, if it really is external to what we’ve measured, then we can ask productive questions such as, “Is this extenuating circumstance so extreme that it warrants overruling this quite strong signal that we’ve found within the exercise?”
Olivier: Exactly. And this is the kind of discussion that we find is difficult to reduce to a formula. And this is why the predefined weights that you give to all the dimensions are difficult to apply to something like this. Now, let’s not overdo this. I’m not against formulas at all. I’m a big fan of predefined weights, and in fact, we devote a couple of chapters in the book to how much better a formula is than a human being for many types of decisions. And part of the reason, or in fact, the main reason for that is that a formula always applies the same weights to the various inputs, whereas humans tend to change their weights for all kinds of sometimes good, but usually not very good reasons, and therefore, create a lot of noise in trying to be subtle, in trying to be well-informed.
Olivier: So, for many things, for many things that are repetitive, that are fairly simple, and that can be automated, automating the decision with a formula of the type you’re talking about works fine. If you think the judgment is a subtle one that warrants a discussion, then we would suggest that you anchor that discussion in the results of the ratings of the mediating assessments, but you don’t constrain it to the weighted average of the mediating assessments for the reasons we’ve just discussed.
Brooke: Yeah. I like that a lot, using that as an anchor. I think that that’s a good illustration of one of the challenges that I’ve seen in some people’s expectations around behavioral science and around taking quantitative approaches, that they expect that, or they hope perhaps, that taking this more quantitatively driven approach will bring an end to the conversation. Whereas, in fact, in so many of these instances, what we’re trying to do is just build a much, much, much more robust foundation for a good conversation to happen.
Olivier: That’s a very good way to put it. It’s actually the data and the analysis and the formula to start the conversation, not to eliminate it.
Brooke: So, let’s think now about how it is that someone out there listening is perhaps seeing a bit of themselves and their organizations reflected in some of the good or bad practices that we’ve been talking about. What’s the practical step that someone listening to this and finding this kind of approach and this discussion a compelling suggestion for how to undertake conversations, what’s the first step that they can take?
Olivier: First, identify where you are making judgements that are important and that your organization depends upon. If you think there is no such thing, you haven’t looked hard enough, because there are always judgements in every organization. The obvious one that every organization deals with is HR judgements, when you’re making hiring decisions, when you’re making performance evaluation decisions, these are, with very few exceptions, always judgments. When you’re making operational decisions on a day-to-day basis, “How do we price?” Or, “How much rebates do we give to a customer?” Or things like that. So, you’re making judgments, you’re making trade-offs between the cost of giving that rebate and the benefit that you expect to get. So, identify as precisely as you can the areas in which there is judgment that you would want to optimize.
Olivier: Second, measure how much noise there is in those judgments, that’s the idea of the noise audit, and ask yourself if that amount of noise is okay, if it’s tolerable. Or, to put it differently, how much it costs you. The example we haven’t talked about yet that we opened the book with is in an insurance company where we’re looking at the noise in the underwriting department. And underwriting is not an exact science, you need to say, “This policy is worth X.” If you say it’s more, you’re going to lose customers because they are going to get a better quote somewhere else. If you say it’s less, you’re going to get the customer, but you’re probably going to lose money in the long run because you’re undercharging for the risks that you’re insuring, and therefore, you’re going to end up paying more in claims than you are getting in policies. That’s the problem.
Olivier: So, both errors are costly, but it’s non-trivial to compute how much they cost exactly and to decide whether the cost of those mistakes is worth the trouble of reducing the noise. And if the answer is yes, which it will often be, then ask yourself, which of the decision hygiene techniques could be applicable to your situation, bearing in mind that each of those techniques has costs, has downsides, has side effects. There is no free lunch in this, as in many other things. If you use guidelines or rules or the sort of formulas that we’ve just talked about, there is a risk, and that’s what we were talking about, that you will demotivate people or that you will get them to reverse-engineer the formula and to find a work-around essentially to avoid being constrained in that way.
Olivier: If you put in place automation, which is another solution, you might actually get algorithmic biases. If you don’t do it, try it. You don’t have to, but you might. If you ask people to deliberate collegially and to discuss with more people, instead of just having one judgment, then it’s going to be more costly because more people need to be doing the same job that only one person was doing, and there is a risk of political horse-trading, as you were talking about earlier, where you don’t criticize my judgment and I won’t criticize yours. So, none of these techniques is entirely innocuous and entirely without challenges, and it takes some thinking and then some trial and error, of course, to find the best set of decision hygiene techniques.
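The second step above, measuring how much noise there is, can be sketched for the underwriting example. This is a minimal illustration under stated assumptions: the quotes are made-up numbers, and the two summaries used here (the standard deviation, and the median relative difference between pairs of underwriters) are common ways to describe variability rather than a procedure confirmed by the conversation.

```python
from itertools import combinations
from statistics import mean, median, stdev

# Hypothetical premiums quoted by five underwriters for the SAME case.
# With no noise, these would all be identical.
quotes = [9500, 16700, 12000, 10400, 14800]

def relative_difference(a, b):
    """Gap between two judgments as a fraction of their average."""
    return abs(a - b) / mean([a, b])

def noise_summary(judgments):
    """Summarize the variability of judgments that should be identical."""
    pair_diffs = [relative_difference(a, b) for a, b in combinations(judgments, 2)]
    return {
        "mean": mean(judgments),
        "std_dev": stdev(judgments),
        "median_pairwise_relative_difference": median(pair_diffs),
    }

summary = noise_summary(quotes)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Whether a given level of variability is tolerable is the cost question raised above, and it is not something the sketch can answer.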
Brooke: Well, Olivier, I think that that’s a very good place to end off our conversation on a very practical note. Thank you very much for taking the time to speak with us again today, and thank you very much for writing this very compelling book. And I hope that many of our listeners find this discussion informative, and that they go and consult the book as well.
Olivier: Thank you very much, Brooke. It was a pleasure.