Show Notes
In this episode, Andy and Mon-Chaio explore differential diagnosis. They clarify what differential diagnosis truly entails and how it can be applied in technical settings, particularly within software teams. The hosts illustrate its importance through real-world scenarios, emphasizing the balance between gathering information and taking corrective action, the consequences of different types of misdiagnoses, and the significance of judgment in the problem-solving process. Listeners will learn the surprising ways in which multiple causes can coexist and the implications for diagnosis and treatment within teams.
References
- MIT OpenCourseWare Lecture – https://ocw.mit.edu/courses/6-s897-machine-learning-for-healthcare-spring-2019/resources/lecture-11-differential-diagnosis/
- Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis – https://arxiv.org/pdf/1902.06019
Transcript
Andy: Welcome back to another episode of the TTL podcast. Today on this episode, Mon-Chaio and I are going to revisit one topic that we mentioned in the last episode. In the last episode, we talked about different ways that a company can fail or at least not succeed.
And in there, we mentioned that a way of approaching that is to come up with a diagnosis about what’s going on, about how you might understand the situation and what you could do about it. But we didn’t say how you do that. And this is what caused us to come up with this idea. I think I mentioned in there something about differential diagnosis.
Mon-Chaio: Yeah, you may have, or it may have been something we talked about afterward, but in either case, this is what we’re talking about today.
Andy: That’s what we’re talking about today. Differential diagnosis. And going into this, I have to admit, I had a House or ER understanding of differential diagnosis, which is basically some person in a lab coat saying, well, let’s perform a differential diagnosis. And suddenly they know what’s happening.
But I don’t
Mon-Chaio: For listeners who don’t know, which I think are probably few, unless you’re cultural wastelands like me, House and ER are American medical TV dramas.
Andy: Yes. So that was my starting point. Mon-Chaio, I think you had a better starting point because your wife, Kay, is a nurse, and in fact, she was listening in the background when we were discussing, should we do this next? And I got the impression she was incredibly happy that we might be talking about differential diagnosis.
Mon-Chaio: Oh, absolutely. So, my wife is a psychiatric nurse practitioner. What she does is diagnose pediatric patients for mental disorders. And differential diagnosis was a big part of her training, and I think a big part of medical training. It’s probably required for anybody in the medical field would be my guess, but I am not an expert on this.
Andy: I’m not an expert either, but from the reading that I’ve done, I would think, yes, that it absolutely would be.
Mon-Chaio: So she was, you were right, very excited about this. And then of course we’ve proceeded to not talk about it at all until today. So, no,
Andy: you’re going to be sitting there with all these medical textbooks telling you all these techniques of differential diagnosis. Okay. All right.
Mon-Chaio: When you do read the medical textbooks, it is very enlightening as to the depth of this procedure. Because for me, when I first heard about it, like a lot of the things we talk about, it seems like a no-brainer. Okay, so you need to diagnose. Well, let’s step back a little bit, right? Last episode, we talked about the possible ways a startup can fail and, based on research, the most likely ways startups fail.
And given that we hope the theme for this season is turning things around, you have to be able to diagnose what’s wrong before you can turn things around, so differential diagnosis makes a lot of sense. But when I first heard about it, I said, okay, but it’s diagnosis. Everybody does diagnosis all day long. You come home, your house is hotter than normal, you’ve got to do some diagnosis. So what’s the big deal? But it turns out it is a really big deal, and it is a really interesting way of thinking that many people don’t do.
Andy: And I think that’s one of the things I’m most interested in, because a thing that I encounter so often is people who are so highly trained in software engineering or these other things, and I do not see the level of diagnostic rigor that I would expect. And so what I’m thinking is that we’re somehow trapped by our own reasoning.
And in my mind, I remember the empty vessel idea from the feedback episode. Well, let’s go with the empty vessel. I think that people maybe just don’t have some of these techniques, and so they don’t use them. Or if they do use them, it’s more ad hoc than purposeful.
Mon-Chaio: And I agree. I think one thing that, if you read about differential diagnosis, they talk about is that as you become more experienced, your experience can play into the diagnosis and can often become a shortcut for having to, in the formalized methods, do all of these probability calculations, put them in tables, compare them, and all of that sort of work that in the formal definition you have to do for differential diagnosis.
And so if experience plays a part where you can shortcut, and if perhaps people don’t have the formalized definition, all of a sudden you get into the space where it is a lot of, what do they call it, fire, fire aim shoot or something? What is that thing?
Andy: Fire, ready, aim.
Mon-Chaio: Yeah, fire ready or something like that, right?
Andy: Yeah. So let’s actually take a step back because we just said it. And I said I didn’t know much about it other than it’s House and ER. Let’s just actually define some terms.
Mon-Chaio: I love it. We love defining terms here. I love it.
Andy: I was watching an MIT OpenCourseWare lecture, and they started with a definition, and I think it works really well for us.
It’s very simple, it’s very straightforward, and it’s also cribbed from Wikipedia, so it’s easy for us to reference. So they say, diagnosis is simply the identification of the nature and cause of a certain phenomenon. Okay, so we, as you said, we diagnose all the time. You walk into the house, it’s warmer than you expect.
You start figuring out, what’s the thermostat set at? Are all of the windows open and it’s sunny outside? How do you start coming up with an explanation and identification of what’s going on? Differential diagnosis, it says, is the distinguishing of a particular disease or condition from others that present similar clinical features.
Already we got some jargon words in there, disease or condition and clinical features. So we should probably start by just saying like, what are the kinds of things we’re talking about here?
Mon-Chaio: should we talk about what are the things we’re talking about in software or what are the things we’re talking about in medicine?
Andy: I think we can very quickly just go to software. I think we can skip over it and just step into software.
Mon-Chaio: I’m glad you said that because I was going to make up all sorts of medical stuff. We don’t have a lot of listeners in general, but we hopefully don’t have any doctors listening who would be like, that’s not right.
But if
Andy: not what that is.
Mon-Chaio: skip the software,
Andy: let’s just take this and apply it to the domain
Mon-Chaio: okay. Okay.
Andy: So, clinical features in this case are things that you see, observations you can make. They actually made a really interesting distinction that I’d never heard before, but I think is really useful.
They made a distinction between a sign and a symptom.
Mon-Chaio: Ooh. I don’t know the difference. Tell me more.
Andy: Yeah. So, they said that a sign is objective evidence. So a sign that we could use in software. Let’s keep this very technical, mechanical, but then as we talk through it, we can then apply it to social as well.
Mon-Chaio: hmm.
Andy: So a sign is objective evidence. So I can have a sign that my web server is serving a 500 response on every third request. Or, to keep it even simpler than that, I have a sign that it served a 500 response on this request.
Mon-Chaio: hmm.
Andy: That’s a sign. A symptom is subjective. A symptom would be that the customer got upset when they clicked the button and nothing happened.
Mon-Chaio: Mm, okay.
Andy: Now, I think that’s kind of useful because it starts helping us differentiate between what is sitting inside someone’s head.
Mon-Chaio: Mm hm.
Andy: All symptoms are going to be in someone’s head. That’s what makes it subjective. That doesn’t mean it doesn’t exist. Just have to be clear on that. Versus a sign, where we can all objectively see that it’s there, agree that it exists, and work from that. Now, the reason I think that’s important is because when we’re talking about those clinical features, those are both signs and symptoms. So when we’re talking about the diagnosis, we’re taking all these things into account. You’ve got signs and symptoms. So you’ve got a user complaint. That’s a sign, but the symptom is that they were upset by this, and we can ask them why they were upset or how it affected them and that kind of thing.
And there was another sign that we had a 500 response. And maybe there’s something in the log file indicating a stack trace.
Mon-Chaio: Mm hm.
Andy: But maybe the stack trace didn’t ever appear in the log file. I got something that I would think would show me a stack trace, and it never showed up. Those are all the features that we use for our differential diagnosis. What does all of that mean? How does that come together to describe what’s going on? And here’s the thing, and this was the key when I was watching this lecture: more than one thing can be happening at the same time. Suppose you just go off of, well, all of these features are required to say that it’s this kind of a diagnosis, to say that it’s out of memory.
And all of these features are required to say that it’s a deadlock in the database. Maybe I ran out of memory and had a deadlock in the database at the same time. So if I go too strongly to just eliminate everything based on this happened, and this happened, and this happened, which means it’s this.
If I take everything that happened, I will eliminate all of my possibilities if I think that only one of them can happen at a time.
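A minimal Python sketch of the point Andy is making here, under stated assumptions: the cause names and feature names below are hypothetical, invented only for illustration, and real systems will not map this cleanly. It compares a rule that requires one cause to explain every observation against a rule that lets causes combine.

```python
from itertools import combinations

# Hypothetical mapping: the features each cause is expected to produce.
EXPECTED = {
    "out_of_memory": {"oom_kill_in_log", "http_500"},
    "db_deadlock": {"lock_wait_timeout", "http_500"},
}


def single_cause_matches(observed):
    """Causes whose expected features equal the observations exactly."""
    return [cause for cause, feats in EXPECTED.items() if feats == observed]


def multi_cause_matches(observed):
    """Combinations of causes whose pooled features equal the observations."""
    matches = []
    for size in range(1, len(EXPECTED) + 1):
        for combo in combinations(sorted(EXPECTED), size):
            pooled = set().union(*(EXPECTED[cause] for cause in combo))
            if pooled == observed:
                matches.append(combo)
    return matches


# Everything that was seen during the (hypothetical) incident.
observed = {"oom_kill_in_log", "lock_wait_timeout", "http_500"}

print(single_cause_matches(observed))  # no single cause explains everything
print(multi_cause_matches(observed))   # the two causes together do
```

With only one cause allowed, every candidate is eliminated; allowing combinations recovers the explanation that both things happened at once.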
Mon-Chaio: Sure, right. And that starts to get difficult because, in the medical world as well but especially in the software world, a lot of things are these combinations. And so instead of saying out of memory or deadlock in database, now all of a sudden you also have to say out of memory and deadlock in database as you go through your diagnosis.
And what we know from math and combinatorics is the more possibilities there are, the more combinations there are. And so at some point you can’t possibly say that I’m going to evaluate every combination of all of the possibilities, right? And I think that starts to get into the actual practical application of differential diagnosis.
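As a rough illustration of that growth, assuming any non-empty subset of n candidate causes could be present at once, the number of combinations to consider is 2^n - 1. The function name here is made up for the sketch.

```python
def candidate_combinations(n_causes: int) -> int:
    """Count the non-empty subsets of n candidate causes: 2**n - 1."""
    return 2 ** n_causes - 1


for n in (2, 5, 10, 20):
    # 2 -> 3, 5 -> 31, 10 -> 1023, 20 -> 1048575
    print(n, candidate_combinations(n))
```

Twenty plausible causes already give over a million combinations, which is why exhaustive evaluation stops being practical.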
Andy: Yeah. And going through this, you have where it started, which was that they had flowcharts, essentially. In software we might put together flowcharts too. I’ve seen runbooks like this actually, where people say, oh, here’s the runbook for what to do in the case that you get this alert. And you start running through and it says, well, you need to do this and check this and run this test and then do this.
And what the medical profession apparently found was those were too expensive to create, too cumbersome to use, and too fragile to different situations. And in the end, maybe interesting, but not what you could expect someone would just use.
Mon-Chaio: Hehehehehehehehehehe
Andy: And I’ve seen the same thing for runbooks in software. If you try to train people, you get this alert and you just run this runbook and it will diagnose exactly what to do. What I see is people then applying very poor judgment, because in order to stay succinct and stay small, of a size that we could even create, the runbook misses things out.
It leaves out that multiple of these things could be happening at the same time, or some new condition could be coming up that we haven’t taken into account. And so we’ll misdiagnose it and restart the web server when that will end up causing load to go to another place and cause a cascading failure.
Mon-Chaio: Right, and I’m very much not a fan of runbooks. I usually encounter them when people say, well, this is the way that we’re going to scale, or this is the way we’re going to do knowledge transfer. I often think, exactly as you were saying, Andy, that the runbook removes judgment, because it gives you a recipe to follow. That’s the whole point. But if it gives you a recipe to follow, that should just be automated and removed from the runbook.
Andy: Yeah. Give me a script and tell me to run the script.
Mon-Chaio: Right, run these three commands, look at the output, if the output says x, do this. If the output says y, that’s a flowchart, right? That’s just a script. So why don’t you just run that script? Or write that script. That’s my short little segue into why I don’t like runbooks generally.
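A small sketch of what "that’s just a script" could look like in Python. The alert, the thresholds, and the action names are all hypothetical, chosen only to show a runbook branch expressed as code; the only real library call is `shutil.disk_usage` from the standard library.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Measure a sign directly: how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def runbook_action(used_percent: float) -> str:
    """The 'if the output says x, do this' branch of a hypothetical runbook."""
    if used_percent >= 90.0:   # hypothetical threshold
        return "rotate-logs"
    if used_percent >= 75.0:   # hypothetical threshold
        return "open-ticket"
    return "no-action"


if __name__ == "__main__":
    print(runbook_action(disk_used_percent()))
```

A script like this only covers the recipe part. It still leaves out the judgment about whether several things are happening at once or whether the situation is one the recipe never anticipated.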
Andy: Which gets us to the next evolution of these diagnostic techniques, which is that people started writing programs to do this. Because they’re like, you know what, these things are so big and complicated. Maybe there’s these new things called computers. They process information. They started creating programs and they’re like, okay, we’ll do this. But those programs very quickly stopped being simple deterministic things. And they started turning into very complicated, probabilistic systems where they have Bayesian inference and all sorts of other things to come up with what’s going on.
But to me, the interesting thing that comes out of that is that they give you a probability of what could be going on. Because all of these symptoms, as we said, could be happening at the same time, and could be all caused by different diseases, different things happening.
It’s kind of like, well, there’s a 20 percent probability it’s this, and a 50 percent probability it’s that, and a 95 percent probability it’s these two things in combination.
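A minimal sketch of the kind of Bayesian update such systems rely on, not a description of any particular medical program. The priors and likelihoods are made-up numbers, and the three hypotheses are treated as mutually exclusive (with "both" as its own hypothesis) so that the posteriors sum to one.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over mutually exclusive hypotheses.

    priors[h] is P(h); likelihoods[h] is P(observation | h).
    Returns P(h | observation) for every hypothesis h.
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical beliefs before looking at the logs.
priors = {"out_of_memory": 0.5, "db_deadlock": 0.3, "both": 0.2}

# Hypothetical chance of seeing a lock wait timeout under each hypothesis.
likelihoods = {"out_of_memory": 0.1, "db_deadlock": 0.6, "both": 0.9}

for hypothesis, probability in posterior(priors, likelihoods).items():
    print(f"{hypothesis}: {probability:.3f}")
```

The output is a ranking of explanations, not a verdict. Deciding what to do with those numbers is still the judgment call the hosts go on to discuss.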
Mon-Chaio: Mm hmm.
Andy: But I think that’s the thing, is that the lesson I take away from what differential diagnosis in medicine seems to be now, which is a very wide field of lots of probability and lots of epistemology, how do you even know what you know, is that a lot of their training is how do you make sense of an ambiguous signal to come up with a conclusion to do something.
Mon-Chaio: Mm hmm.
Andy: And so, there’s a lot we can learn from them.
Mon-Chaio: And,
Andy: it’s a very wide thing, so we’re not going to go into all of it.
Mon-Chaio: right, and I think the impetus in medicine to do this, I think it kind of aligns well with software and the technical field as well. You can think that there’s two sides of misdiagnosis. One side is the patient dies. Right, which I don’t know what the analog in software is for the patient dying, maybe catastrophic system failure, company going out of business, something like that.
Andy: Horrible downtime where you could never bring it back up, like an actual disaster. But we can also talk about the social level of a software team, where you could destroy the fabric of your team
Mon-Chaio: Mm
Andy: and like, all have to start again.
Mon-Chaio: hmm. Yep, which is death and rebirth in some ways. Yeah. The other side of it is, I think in the medical side they call it overdiagnosis, which ends up leading to a ton of unnecessary treatments. So you can think about it as a patient: you’re given drugs that aren’t necessary. You’re going to the hospital every week to perform a test that’s unnecessary.
And you can definitely see that on the software side. And in fact, I would say that’s probably more prevalent in the way that our profession does diagnosis. Where there’s just a lot of wasted work that happens.
Andy: Yeah. Looking for absolute certainty. And in fact, there was an interesting thing that was brought up. I’m going to keep referencing this lecture. It was the most recent thing I watched and also, I think, in many ways, the most easily digested for me. The lecturer started out, he said, look, here is the entire map, as of about 20 years ago, of how we understand the human circulatory system works.
And it’s huge. Like, if you looked at it as a software engineer, you’d be like, Oh, that’s a large company’s microservices setup. All these connections between things and all these other things. And he said, if you were to diagnose through tests, so for them, diagnose through tests means, like, do a blood sample, do a biopsy, do, whatever else.
If you were to do all of those tests to the point where you could actually fully diagnose purely based on the actual measured deterministic almost system, you would kill your patient.
Mon-Chaio: It takes too long, yeah?
Andy: It takes too long and it’s too invasive. Now, I think in our software systems, what we build instead is there’s this whole observability thing.
You say, well, you need logging, you need metrics, and you need all this other stuff. Because we can change that system to provide us more diagnostic information.
Mon-Chaio: right.
Andy: But if you don’t have that, like, like you don’t for your team, I can’t do all of the invasive probing techniques on my team to understand exactly what went wrong, necessarily because maybe I have to put some of them in a room to see, well, is it in this situation that these two blow up? Oh, okay.
Okay. Well let me try another test. So if I, if I give them, a cheesecake beforehand and put them in there, maybe they’ll be fine. Maybe it’s a blood glucose thing.
Okay. Let me try that.
Mon-Chaio: or maybe it’s a knowledge silo thing. It’s whenever these two people go on vacation. So to prove it, I’m going to send them on vacation for two weeks and measure. That could be a hypothesis and a test, yes?
Andy: Yeah. And, and if you did all of these things, you would destroy your team.
And in medicine, they also have this advantage that they can work from population-level probabilities. We know that when we give antibiotics in these situations, they’ll work 70 percent of the time.
And we know that 25 percent of the time, it won’t work and things will get worse. And we know that 5 percent of the time, it might just kill them. Please, people, take your antibiotics. They won’t kill you 5 percent of the time; this is a particular situation I’m looking at here. But they have these large population-level data that they can work from, which lets them then come up with these decision trees about what to do, and that feeds into these big machine learning algorithms and Bayesian systems and all of that to come up with, these are the things that you should probably do, because they have that information, they have that ability.
We don’t have that ability in a team, do we, Mon-Chaio? We don’t know the probability that when I ask John to do this thing, he’ll find it useful and get better out of it, or he’ll get worse out of it.
Mon-Chaio: Right, we don’t. And even on the more macro level, I don’t think we have enough objective statistics or numbers to help us drive that. So, as an example, a team is not shipping well. They’re missing every other release. We don’t have diagnostic tools that say, in the field, in the population, of the teams that are missing every other release, 60 percent of them don’t have unit tests and 40 percent of them have conflicts between senior engineers and tech leads. We just don’t have that type of information.
Andy: I think the closest we have is the DevOps research, led by Nicole Forsgren, where they found correlations that, I believe, they reported as causes of higher performance. So that might give us some guidance. But I see teams use it as a recipe for what to do, rather than as a way of diagnosing. Rather than saying, well, these are the signs and symptoms of working well, they just mandate, do those things.
Mon-Chaio: Exactly
Andy: and it seems like they’ve actually just skipped the diagnostic step.
Mon-Chaio: I think, I don’t know how many, what percentage of our listeners do this because we don’t have the tools to figure it out. But I’ve certainly heard anecdotally some people aren’t really interested in the first 30 minutes of our podcast.
Andy: Why not?
Mon-Chaio: Well, that’s because we talk through stuff like this. Possibilities, probabilities, research papers. They just want to know what to do. Right. We’re called Tactics for Tech Leadership. They want to get to the tactics.
Andy: Oh, and we get to the tactics in the end.
Mon-Chaio: Exactly. So you just fast forward to the end until we say, well, what are the tactics? And then you just listen to those tactics, right?
Diagnostics, learning how to do differential diagnosis is around learning how to do this higher level thinking. Yes, there’s frameworks about how to go about it and how to calculate probabilities and what probability to calculate.
But at the end of the day, it’s a judgment call
Andy: Yeah,
Mon-Chaio: in reading those probabilities and then determining what is the next correct course of action, given the probabilities that you see and the interactions between those.
Andy: yeah.
Mon-Chaio: And so you have to learn that higher level thinking. And if you’re not interested in learning that higher level thinking,
this is where the machine tells you the next thing you should do is a blood test, and you’re like, yeah, well, the machine told me so, blood test. That doesn’t really help as much, does it? It doesn’t help you steer toward places where you can use your own judgment. And part of being a good leader is being able to inject your great judgment into the situation.
Andy: exactly. And, I want to try to use an example of judgment. Because, as we were talking about this, the diagnostic itself, doing the diagnosis, you could end up having to do tests that could be harmful. And so what I see sometimes is, individuals or teams, because those things could be harmful and they just want to resolve the problem, they will skip all of the testing
and just do the thing that they believe will resolve the issue.
Mon-Chaio: hmm.
Andy: But the thing is that they’re resolving a sign. They’re not resolving the cause. For instance, I was working with a team and we were having an issue where every once in a while, after a release, we would get an error message saying it was unable to load some JavaScript. And then it happened to one of our developers. And we’d never seen it happen before. We’d seen the error message, but we’d never seen it happen firsthand.
And so we couldn’t reproduce it. And it happened to one of the developers, and he could reproduce it. He could get it going again and again. We didn’t have access, me and the developer, didn’t have access to enough of the system. So we pulled in a platform person. Someone who had access to more of the underlying platforms.
The platform person’s very first response was, Oh, we should just clear the caches.
Mon-Chaio: Mm hmm. Mm hmm. Mm
Andy: And I said, no, no, no, no, no, we don’t want to clear the caches. Because if that resolves the issue, it also won’t give us any new information to diagnose what’s going on. And this is that judgment call of how harmful is it to the patient to let something continue versus to take an action that you think would resolve something.
But it would resolve this one case. We’d never understand the cause of it. We’d never understand where this is coming from. And so we started diagnosing. The approach that I use is one of building up a theory of the world, an explanation of what’s going on. And that explanation, that theory means that I can propose hypotheses because that theory makes predictions about something that will be going on. And so we can then propose that hypothesis and go and test it.
This is how diagnosis works in the end. You’ve got all of these things that could be going on, and you need to come up with an explanation of exactly what’s happening in this case. And what was going on here is we went down multiple different paths of, is it the CDN that’s caching the wrong data?
Because when I requested this URL, I got the JavaScript. When the other developer requested this URL, he got an HTML page. Yeah. But I’m in Lancaster and he’s in London. So our system goes through a CDN. For a CDN, we might be going through different nodes of that CDN. So, okay, maybe it’s something about the CDN has cached things improperly.
Okay, how could we explain how it ever would have gotten to these different responses? Now, as we worked our way through this, each one of these, it’s kind of like, well, is that likely? Is that probable? Is that something we can even measure? Is that something we could ever test? The entire process of this was going through different explanations and looking for evidence that such a thing might be happening.
And that evidence may be in, looking for a sign or looking for a symptom. So looking for a sign in terms of, a direct measurement, I can actually see this, or looking for a symptom, in this case I’m going to say it more along the lines of, a thing that we can very reasonably say is connected to this kind of a thing occurring.
Mon-Chaio: Or, couldn’t it have been the result? You could design a test that will give you a result. So, for example, you write a test that hits two different random nodes of a CDN a thousand times. And then that will be a test that gives you some confidence interval around whether CDN caching was actually the issue or not.
Andy: Yeah.
Mon-Chaio: So that would be like a blood test or a diagnostic test of some sort.
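The test Mon-Chaio describes could be sketched roughly as below, in Python with only the standard library. This is an assumption-laden sketch rather than what the team actually ran: the URL is a placeholder, the function names are invented, and a plain client like this has no control over which CDN node answers each request.

```python
from collections import Counter
from urllib.request import urlopen


def fetch_content_type(url: str) -> str:
    """Fetch the URL once and return the Content-Type of the response."""
    with urlopen(url, timeout=10) as response:
        return response.headers.get_content_type()


def tally_responses(url: str, attempts: int, fetch=fetch_content_type) -> Counter:
    """Request the same URL repeatedly and count each kind of answer."""
    return Counter(fetch(url) for _ in range(attempts))


# Example usage (placeholder URL, not run here):
#   counts = tally_responses("https://example.com/static/app.js", 1000)
#   print(counts.most_common())
```

If the tally shows both an HTML type and a JavaScript type for the same URL, that is a sign that the answer depends on something other than the URL. The `fetch` parameter exists so the tally can be exercised with a stand-in fetcher instead of real network calls.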
Andy: yeah. And we had a lot of kind of serendipitous information come in. Things that we didn’t even think of coming up with, but came in. And that’s kind of, for a doctor, that’s kind of like listening to the patient, or watching them closely as they’re telling you something.
Watching them walk into the room and noticing that they’re limping or favoring a leg, or something like that, where it’s like, they didn’t tell me, but just the thing happening gave me extra information.
Mon-Chaio: Mm hmm.
Andy: And one thing we had of that was that one of the developers, just in the background, kept curling the same URL again and again and again.
And finally he says, Hey, it’s not always giving me the same response.
Mon-Chaio: Mm
Andy: Sometimes I get the JavaScript. But most of the time I get the HTML.
Mon-Chaio: Mm hmm.
Andy: Like, huh. Interesting. Start piecing the whole thing together. For those who don’t want to be in suspense, what we eventually figured out is that the web server that serves the JavaScript had a fallback: if you hit a URL that didn’t exist, it would serve you the index page.
So if you hit a JavaScript file that didn’t exist, it would serve you its index HTML.
Mon-Chaio: Okay.
Andy: Okay, that gives us an explanation of how you could get an index HTML when you’re trying to hit a JavaScript file. But what we didn’t have was a further explanation of how you could get the old index.html, because it was the old one. Remember, this was happening during the release.
It was the old index.html when asking for the new JavaScript file. And that had to do with keepalives on HTTP and how Kubernetes services work, which is that the CDN keeps a connection open to the web server. And because of that, some of the CDN nodes, this is our hypothesis, or this is our theory.
Parts of this we can never prove. Because the world moved on and the cost of doing this would just be too high for us to actually prove all of it. But it gave us enough of a theory that we could work from. And it seemed to explain everything. Hence, differential diagnosis, all these different causes.
What we came up with was the CDN does a keepalive. During a deploy, the CDN has a connection open to the old web server. The new web server comes online. The HTTP load balancer sends all new traffic to the new web server. A request comes in through the CDN, goes to the new web server, gets the new index.html, which goes to the browser, which tells the browser to go get the new JavaScript.
That request for the new JavaScript goes to the old web server, and the old web server says, I don’t have that file, and responds with its index.html. And then the CDN caches that for two hours.
Because after two hours, it started working again.
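The theory Andy describes can be walked through with a toy simulation. This Python sketch is a simplified model under explicit assumptions: the class names, file names, and version labels are invented; a kept-alive connection is modeled as a CDN node pinned to one origin server; and the two-hour cache expiry is not modeled at all.

```python
class WebServer:
    """Toy origin server that falls back to index.html for unknown paths."""

    def __init__(self, version: str):
        self.files = {
            "/index.html": f"<script src='/app.{version}.js'></script>",
            f"/app.{version}.js": f"console.log('{version}')",
        }

    def get(self, path: str) -> str:
        # The fallback from the episode: missing files get the index page.
        return self.files.get(path, self.files["/index.html"])


class CdnNode:
    """Toy CDN node whose kept-alive connection pins it to one origin."""

    def __init__(self, origin: WebServer):
        self.origin = origin
        self.cache = {}

    def get(self, path: str) -> str:
        if path not in self.cache:
            self.cache[path] = self.origin.get(path)
        return self.cache[path]


old_server, new_server = WebServer("v1"), WebServer("v2")
stale_node = CdnNode(old_server)  # connection opened before the deploy
fresh_node = CdnNode(new_server)  # connection opened after the deploy

index = fresh_node.get("/index.html")   # new index, points at /app.v2.js
script = stale_node.get("/app.v2.js")   # old server lacks it, serves old index

print(index)
print(script)
```

In this model the browser asks for the new JavaScript file and receives the old index page, and the stale node keeps returning that cached answer, which matches the behavior the team observed. As Andy says, parts of the real explanation were never proven, so this only shows the theory is internally consistent.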
Mon-Chaio: Mm hmm.
Andy: But we went through this entire diagnostic process, differential diagnosis, where we had multiple explanations of what could be going on here. And the thing was, after we did this, something that the platform person said was he was fascinated by the process we used to figure this out. He’d never experienced that before.
Mon-Chaio: Hmm. That’s a little troubling, actually.
Andy: And that is why I said, let’s talk about differential diagnosis.
Mon-Chaio: Because what I was about to say before you said that is, that’s interesting, Andy, but there’s a number of different ways this could have changed and that might have changed how we did differential diagnosis. And the reason I was going to say that is your explanation of this makes complete sense to me.
You have a problem, you list out the probable aspects of what could be happening, you look at those probabilities, you perform some tests to eliminate or strengthen certain probabilities, then use judgment to determine whether you should do more tests until you arrive at a state. You can never in this case prove it a hundred percent, but you had enough of a working model where you determined that you could make changes that would alleviate these signs and symptoms and probably wouldn’t make things worse, because the model sort of fits at least in most of the probabilities, most of the cases.
Andy: Yeah.
Mon-Chaio: And so I would have said that makes a lot of sense and that’s what everybody does, so let’s talk about the things that people might find more interesting, that not everybody does. But apparently what you’re saying is not everybody even does this. So what do they do instead?
Andy: They take their immediate gut reaction to what is happening and they make a change and they see, is the problem still there? And then if the problem’s still there, they’ll make another change. And then they’ll see, is the problem still there? And it’s not that the changes are uninformed, it’s that they’re not geared to understanding the cause.
Mon-Chaio: Right.
Andy: And I think that was the thing I coached them on: when we have a situation like this, I am quite often willing to pay more downtime. This was a testing environment, but even in production I would have actually been willing to pay more downtime to get a better understanding of what’s happening, so that we can actually resolve it and get a much better chance that this isn’t going to recur.
Mon-Chaio: Right. And the challenge with making a change and seeing if it solves it is it can get back into this either failure mode of death of the patient or the over treatment, over diagnosis. So, in your example, let’s say that clearing the caches did work. And so the team says, Well, you know what we should do is we should just have a cron job, that runs every ten minutes that clears the caches.
Andy: That was very early on. It was like, Oh, we just need to clear the cache. In fact, the deploy is supposed to clear the cache.
Mon-Chaio: Mm
Andy: And we looked and the deploy does clear the cache, but it only clears one cache, not the other caches. And then they were like, Oh, well, we should just have it clear all the caches. And I said, Whoa, wait, why would it ever need to do that?
Mon-Chaio: Mm hmm.
Andy: Those other caches shouldn’t matter for the deploy. And they’re like, Well, yeah, but we’re getting this problem.
Mon-Chaio: Exactly. And you might say, well, what’s the problem with that? So you change a deploy script to clear all these other caches. But it’s about a complex system. Remember, the main thing for code, obviously, is to work and serve value to the customer. But 1A is to be able to communicate the state of the system to the people that have to continue to maintain the system and develop the system.
And anybody who’s ever been a software engineer of any size of product has gone into a system and said, what is this thing doing?
Andy: Yeah.
Mon-Chaio: Why is it doing all of this that seems to never be successful? Like it’s always false. Why is there like three things in here that are always false? Like what’s going on here?
Why am I clearing cache, waiting two minutes and then clearing the cache again? Like, I don’t understand that.
Andy: I found a paper and they talk about these treatment purposes. So why you might take a particular action, why you might perform a particular treatment. So in this case, like in this diagnosis, we had treatments of reading the code to do another test. We had running a curl command to do another test.
We had looking up log files to see what was going on. So we had all these different tests, which are all treatments of one form or another. And each one was kind of chosen for different reasoning. And I liked this paper because what they did was they kind of asked a bunch of doctors about how they go about treating, explaining how they treat.
And they came up with this list of different reasons why they did things. And I think it’s really useful for us to go through because, I think it gives tactics for when you’re trying to diagnose reasons you might do different things. So, one was theoretical validity.
Where they were looking for a robustness between signs, symptoms, and diseases, as proved by theories. So they’re doing something because the theories tell them that this is a correct course of action. This isn’t, like, my own theory. This is, like, for me, I’ve had multiple times where I’m just like, no, the theory of how TCP works tells me that it has to work this way.
Mon-Chaio: Mm hmm.
Andy: So I’m going to take an action based on that.
Mon-Chaio: Mm hmm.
Andy: I’ve gotten into big arguments with people about how TCP works and it’s like, no, that’s not what it does. It does this. That’s why the system did that. And they’re like, that’s stupid. Doesn’t matter. That’s what TCP does.
Mon-Chaio: Right. Mm hmm.
Andy: Another one is severity of consequence. So you might choose something because not doing it has such a high consequence or because doing it has such a low consequence. But it has a big upside, maybe. It’s like, this is where I said I can trade downtime for information. But I only do that if the consequence of that downtime is low enough that the information I could gain out of this offsets it.
The next one is time constraint. Like, do they have time? Is this person just about to die? This is the whole triage thing that happens in hospitals. And we also do it in software. We triage bugs and we say, well, we have to deal with this right now. But you might take an action earlier than you’d want to, because it’s the best thing you can do right now.
You’d prefer to do more diagnostic tests to figure out, is this the right action? But time constraints say you don’t have time to do those. And I think sometimes we use this earlier than we should.
Mon-Chaio: Right.
Andy: And that is what this operations person does: he uses the time constraint thing long before he needs to.
Mon-Chaio: Mm hmm.
Andy: Domain of expertise, which is how much do you actually know about this? So you might take a completely different course of action in a thing that you are an expert in versus a thing that you’re not an expert in. And so I think that, that one takes a bit of self reflection on, am I actually an expert in this?
For instance, for the most part, to the level of an application’s need for TCP, I would consider myself a bit of an expert in how TCP will interact with things. If I was working on a core network switch, on its routing and VLANs and all of that, I would not at all consider myself enough of an expert in TCP to be able to handle that.
Mon-Chaio: You wouldn’t use that experience to shortcut some of these diagnostic techniques, because your reflection shows you that you’re not an expert. Whereas an actual expert who’d been working in it for 30 years might say, look, my experience tells me that it’s 90 percent this, and we’re just going to go down some tests for this route.
Andy: Yeah. It’s like the more experienced programmer telling the less experienced one, the compiler is not the thing that’s wrong here.
Mon-Chaio: Mm hmm. Right.
Andy: This is not a bug in the compiler.
Mon-Chaio: Mm hmm.
Andy: Another one is risk avoidance. This was in the medical field, so they described it as the responsibility assigned to a specific doctor and power dynamics between junior and senior doctors. Where’s the risk involved here? Is it like a liability risk? But in software, we also have these risks. Talking about more of the social system, you have those risks in terms of actions you may take, for employment law.
Mon-Chaio: Mm hmm.
Andy: Like, we don’t talk about this all that much, but a lot of the times when you’re doing something as a manager you have to think about what is the actual employment law here.
Mon-Chaio: Mm hmm. Mm hmm. Yep.
Andy: And then the last one is technical feasibility. You might come up with this amazing idea of, oh, we could test this. And then it turns out that your tools cannot do that.
Mon-Chaio: Right.
Andy: And so you have to go with what your tools can do. And that’s like in that story I had. We could probably have come up with something to measure exactly what was going on.
But it’s not technically feasible, really, to do a bunch of this. Or time feasible. Like, I cannot come up with a complete explanation of the internals of the CDN, which in this case was Cloudflare, of how Cloudflare caches things. And like, why this one developer is getting a response of one type six times out of seven and the other response in the other case.
I can go from the very high level, kind of theoretical validity of distributed caches are complicated.
Mon-Chaio: Right.
Andy: And they’ll have weird behaviors sometimes. Because I’m limited by the technical feasibility of actually testing some of these things.
Mon-Chaio: And I like that, I like how this lays it out, because we talked a lot, right at the beginning and through most of this episode, about what we might call the probabilistic approach: determining probabilities, working down the probability tree, figuring out what the next steps are, right? But there are other ways to do differential diagnosis.
You mentioned one that some people call the prognostic approach, which is: if you don’t do it, things are going to fail catastrophically,
Andy: Oh, kind of like you’re predicting the future, like, this is gonna go so horribly wrong, I have to
Mon-Chaio: Well, yeah, not exactly. So like if you did this release and then all of a sudden your website was unavailable to all of the world and you were losing a million dollars a minute, you might just clear that cache,
Andy: yeah,
Mon-Chaio: right? Like, okay, it’s going to destroy evidence, and we’ll worry about that afterwards, but we can’t be losing a million dollars a minute.
So, let’s clear that cache.
Andy: Yeah,
Mon-Chaio: Right. And then there’s also what they call a pragmatic approach. This is one that maybe kind of flows through a bunch of different things. Something about like doing something because it’s more responsive to treatment, and for me that comes down to both, like, the experience side of things.
We talked about that already, which is: hey, I know that it’s likely this, and so let’s go down that route because it’s going to be more responsive and we’re going to gain more information. But it’s also stuff like, well, if I do this, it will definitely show me if it’s X, but that’s only 5 percent of the explanation, right?
Versus if I do this, it gives me a 40 percent chance that it could be 30 different things. But at least I can kind of narrow that down because it’s like, it’s more responsive to what I can do instead of doing like 5 or 1 percent chunks 60 times or whatever the case may be. And it just really gets back down to, again, it’s that judgment call.
Because if you have the wrong judgment, to your point Andy, some people go to the prognostic approach way too early. We got to just fix this issue.
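One way to picture the trade-off between the narrow test and the broad test is to score each by how much uncertainty it is expected to remove. This is only a sketch under a strong assumption: each test gives a clean yes/no answer that perfectly separates the causes it covers from the rest. The cause names and probabilities are invented for illustration.

```python
import math


def binary_entropy(p):
    """Uncertainty, in bits, of a yes/no outcome that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information(beliefs, covers):
    """Expected bits gained from a perfect yes/no test.

    beliefs: {cause: probability}, summing to 1
    covers:  the causes for which the test would come back 'yes'
    """
    p_yes = sum(beliefs[c] for c in covers)
    return binary_entropy(p_yes)


# Hypothetical beliefs: X is the cause we can test for definitively.
beliefs = {"x": 0.05, "a": 0.20, "b": 0.20, "c": 0.30, "d": 0.25}

narrow = expected_information(beliefs, {"x"})       # confirms or rules out only X
broad = expected_information(beliefs, {"a", "b"})   # splits the field roughly 40/60
```

Under these assumptions the broad test is worth several times more than the narrow one, because a test that almost always comes back "no" rarely teaches you anything. Real tests are noisier and cost different amounts, so this is an aid to judgment, not a replacement for it.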
Andy: All right, yeah, I think that gets us to a really good ending point for this discussion of differential diagnosis. I have a feeling we’re going to come across this again. And I think what we can leave the listeners with is, even though we don’t know the probabilities of these events, this is where our experience really comes in.
All of these techniques for deciding what course of action to take to get more information about the situation, the costs and benefits of taking those actions, and why we’re doing it, are all things that we need to consider.
Mon-Chaio: Mm hmm.
Andy: What we end up with is people’s ability to make judgments based on signs rather than guesswork.
Mon-Chaio: Mm hmm.
Andy: So Mon Chiao, I think we should bid our listeners adieu. Give them our standard plea for feedback. We are always interested in hearing from people. We’re always interested in getting what your questions are. Whether or not we’ve actually covered the topic to your satisfaction. Whether we could do more or do less to cover it better for you.
Mon-Chaio: I mean, this is what you do on LinkedIn. We should make a TikTok channel, right? Where it’s just the last five minutes.
Andy: But as always, if you’re interested in hearing about these kinds of diagnostic approaches, or you have teams that could get better at diagnostic approaches, these are things that, as you just heard, Mon Chiao and I both coach people on. We help them get better at having that kind of judgment.
Mon Chiao, anything you want to say before we go?
Mon-Chaio: So if you’re interested, reach out to us. Hosts at the TTL podcast dot com. That’s hosts with an S.
Andy: Excellent. And so until next time, Mon Chiao, be kind and stay curious.