S2E39 – Living on the Edge of Chaos

Show Notes

 Andy and Mon-Chaio challenge the traditional notions of root cause analysis and incident reviews within complex systems. They examine how the framing of ensuring errors ‘never happen again’ can be counterproductive, suggesting a shift towards faster recovery and continuous learning instead. Drawing parallels with After Action Reviews in the military and Netflix’s Chaos Monkey, they advocate for embracing controlled chaos and fostering a culture of practice and micro-decisions. Listeners will gain insights into how technical errors and normative errors are perceived, and why focusing on organizational culture can be more effective than strict process adherence. By the end, listeners will understand the importance of balancing process with flexibility and why living at the edge of chaos is crucial for organizational resilience.

References

Transcript

Andy: And we’re back for another episode of the TTL podcast. There’s these things that we do when something goes wrong. We call them postmortems. We call them root cause analysis. We call them incident analysis.

And in the end, what we’re trying to do is come up with an explanation of what happened and how do we make that never happen again, at least in most framings of it that I ever see. And I don’t think that that framing is actually all that useful.

Mon-Chaio: Hmm. Well, now you're stepping on years and years of tradition, Andy. And it's tradition that goes beyond software teams. So software teams obviously have this concept of root cause analysis, usually for bugs or outages that happen. That's, that's the most popular, uh...

Andy: Mm hmm. Mm hmm.

Mon-Chaio: The military does these things that they call AARs, After Action Reviews. So after every mission, they come back and they retrospect and they figure out what went right, what went wrong. But even beyond these individual portions, humans just do this naturally. And going back to software teams, we see it even without outages. So, as an example, if somebody's deciding, do we need to hire more people to complete this project that we want to do in the next quarter? Some of it can come down to, in the end, what you might call root cause analysis. What's causing us to be slow in the first place? Are we slow? How do we measure we're slow? How do we know whether we're going to be able to achieve this?

Andy: Mm hmm.

Mon-Chaio: And so, what you’re saying is, no, let’s not do that. Let’s go on our merry little way without ever knowing what could have been.

Isn’t that it, Andy? That’s what you’re saying?

Andy: Not, not quite. I’m actually a big fan of the after action review, the, the looking at what happened when you have a bug out in production, of looking at what happened if you didn’t even have a bug in production, but you caught it before production, like trying to understand what happened. What I take exception to is multiple parts of what I often see, one of them being that there is a root cause, that in these vastly complicated systems we can say there’s this one thing, and then from that we can start determining everything that went wrong.

It all came down to that one thing. The other part I have an issue with is the framing of, we need to make sure that we never have this happen again. This very binary, black and white thinking about it. Because what I see happen in those cases is... well, if you're an airline, you should probably try a bit more to not have it happen again, and there are social reasons for why we approach it that way.

But in software, if an outage occurs again, it's probably not the end of things. Maybe it is. But if we approach every single thing as though it can never happen again, we actually are ignoring most of the learning, and we're going more from an emotional reaction than an analytical reaction.

Mon-Chaio: That’s interesting. I imagine the tactics would likely be different too, if you were more analytical than emotional.

Andy: Yeah.

Mon-Chaio: I have certainly seen, spoiler alert, places where they say this can never happen again. So we put 18 layers of alerts on this one thing, and this gets back to the, well, will it actually solve the root cause, or will just this specific incident in this specific system never happen again?

Because is that where we spent our...

Andy: But, but through the system we've now set up, we've made a whole bunch of other stuff much more likely, through alert fatigue, or through a more complicated system that now fewer people understand.

Mon-Chaio: That's right. That's right. A couple of things come to mind for me. One, you used the term complicated system. I think if we were using our favorite Cynefin framework, we would say that it's actually a complex system.

Andy: Yeah, yeah, that's probably a more precise term to use for it, that it's a complex system, where in a lot of cases we can't really predict. All we can do is, is act, sense, and respond.

Mon-Chaio: And in a system like that, I don't think many people would take umbrage at your point that there isn't a single solution, but they might say, Andy, there is a set, a discrete set of solutions to prevent this from ever happening again.

Andy: Ooh, I was gonna... I was going along with you. There's a discrete set of solutions that we could do. Absolutely. That would prevent it from ever happening again. Maybe in isolation, maybe it being this one particular thing. Yeah, maybe in isolation.

Mon-Chaio: And I think that's the difference between the complicated and the complex in Cynefin, right? If it was a complicated system, I think I would be comfortable saying, look, it might not be one solution, but it is a set of four solutions that will solve this, because it's closed form. I know what's going to happen if I implement these four solutions or these six solutions. But in a complex system, like you were saying there, you, you can't have zero solutions.

You have to have something because, uh, the whole sense-react loop, the react is the something, right? Um, but to say that, hey, I have these six things, and that somehow makes a closed-form system where I can put a heavy stamp on it, and I'm like, yep, leadership approved, this retrospective has been closed, and we'll never have this happen again...

I think that is, and I agree with you, a fallacy in thinking. And in culture, actually.

Andy: And, and here's one of the reasons that I believe that to be true. So we have something occur. We have some, we hire someone who we don't think actually fits with the, with the company, or we release some code that caused a major problem for some customers. And then we analyze that and we come up with, well, this happened and this happened.

And maybe some of that is technical. And maybe some of that is like, well, this person did this thing.

Mon-Chaio: Mm hmm.

Andy: And I think one of the ways that we, with our engineering brains, come into this is we want to believe that we're dealing with an objective reality that has known rules about cause and effect, and we're, we're purely just uncovering the cause and effect that occurred here.

And all we’re doing is just naming what’s there. There’s, there’s no judgment involved.

The uncomfortable reality is that when we go through it... and, and to an extent I will say that that is true. To an extent you can say, well, this occurred, then that occurred, and we can say here's the mechanism that connected those together to say it's cause and effect. Here are all these steps that occurred.

But when you take that, that next step a little bit further, and you start saying, ah, this was the error, that was the error, and those are the things that we need to change. You’ve now gone from a description of occurrence, a description of what occurred, into a judgment of things. And that judgment is not an objective reality.

That judgment is a social construction. You have decided that that is an error after the fact, based on criteria that as a group you've possibly agreed to, although almost always not everyone is agreeing that that is the social construct we're working with to say that this is the error.

But then we’re going to operate with it as though this is some sort of empirical scientific fact about the world. that the error was that I typed kubectl delete pod, and I pasted in a different pod than I had intended. And that was the error. And that’s, that’s the cause of our outage.

Except that’s not a scientific conclusion. That’s a social judgment about where the error was.

Mon-Chaio: Wow. I, uh, yeah. I think it's challenging, maybe, for a lot of listeners when we start diving in and saying, look, you're in this matrix-like social construct of which you cannot know inputs and outputs, and there are these errors onto which you're projecting social judgment.

Now, I'll tell you why even I, the analytical mind, believe it. One is, if you remember what we talked about in one of last year's episodes, I believe the feedback fallacy episode, maybe that was early this year, I can't remember when we did it, but it was a while ago. One of the things that we mentioned around the feedback fallacy is this concept of objective truth...

Andy: Mhm.

Mon-Chaio: that when you’re viewing somebody and giving them feedback, the feedback you’re giving, you’re giving as objective truth, that there is no argument to this because I am seeing it this way.

That is the way the world is. And that is a big fallacy when giving feedback. That applies here as well, doesn’t it?

Andy: absolutely.

Mon-Chaio: And so since we can kind of draw a line between some established research and this root cause analysis, it can already feel more real in that way, instead of this abstract social construct type of thing. The other thing your description reminded me of, actually, was physics. And I don't know if this is useful, let's see, but the way I thought about it is: we like to be analytical scientists in some ways as engineers, right?

You were mentioning, you're describing the physical world. You can't blame a proton for acting the way a proton acts. It acts the way it acts. And so when you're explaining an occurrence, you can say, well, the interaction between this proton and this, or the electrical charges here, or the forces governing this structure caused this to happen.

And there's no blame. It just is. When you start to try to describe it in our context, though, we don't live in a complicated system like Newtonian physics. We live in, perhaps this is naive because I don't have a great understanding of this, but I think about ourselves living in a more complex system, like quantum mechanics. And in quantum mechanics, because there's so much uncertainty, because it is a complex system, the same problem may have different solutions at different times. And so trying to quantify that and saying this is the way that this piece acts all the time is, I think, a fallacy. So again, I don't know if that's useful, but those are the kinds of things that came to my mind while you were talking about that.

Andy: I think it is useful. I think it is useful because it's a way of working with this reality that right and wrong is a thing that is constructed. And in order to move along from there, because you're like, okay, well then is this entirely just relativism and there's nothing that we can say about something going wrong?

That's not really where I'm trying to take this. But I think, I think your kind of quantum, things just happen, is actually a kind of useful way of viewing it. Things just happen. And the only way that our systems keep working, our human systems, our technical systems, is through the constant efforts of the people involved doing things that are kind of outside of the rules.

So someone who has just taken this discretionary effort to go and figure out why that pod was crashing or taking some time to go and talk to someone who’s not their direct report and find out what’s going on and then uncovering all of these things that are happening that their own team now needs to take into account or that they’re going to need to hire a person or whatever.

All of those are these kinds of little minute corrections that we make to the system all the time. And what that does is it makes everything very unpredictable. And the fallacy that we have, or the thing that we get wrong, is if we try to drive out all of that unpredictability, to say that that gets us a better system. What it gets us is a more deterministic system that will go wrong more easily.

Mon-Chaio: I think you've hit at least one of the big nails on the head. That's exactly right. In fact, I challenge listeners to look at your own organization through a critical lens. How much of the effective way your organization is running is because of these minute personal judgments that are not documented processes, and in fact may go against a documented process? And are those the things that actually drive your effectiveness? Or is it actually the process itself, the one that tries to rid you of these minute disturbances, that drives the effectiveness? And I would imagine that for almost all software organizations, complex organizations, right, we're not talking about bracket manufacturing or something like that, for all complex organizations I would imagine that the former is more likely than the latter.

Andy: And, and so the question is, what do you do with this? Like, where do you, where do you go? Okay. So, uh, the lifeblood of what we’re doing is these complex interactions, these, uh, random chance, uh, things that are happening. So what do we do with it? We don’t try to drive it out. But also, I would say, it would be completely wrong to entirely embrace it and just say, do more of it.

Mon-Chaio: Right? I...

Andy: Everyone going off and doing whatever they feel like at any time also isn't going to end up with a good system. So what do we do, Mon-Chaio? We're kind of stuck, aren't we?

Mon-Chaio: I will propose something, something that may sound radical, but maybe doesn't. And I will draw on a situation that I had recently. So I was helping out a company recently, and one of the things that I told their leaders was, I said, you are trying to drive compliance through process and documentation instead of driving compliance through culture and high-level, bottoms-up trust. Ooh, that sounds like some weird stuff already. We're talking about culture again, this unquantifiable thing. But I'll tell you where it came in, right? So, they were having code reviews, and they said, yes, we're doing code reviews well. And I said, oh really, are you? Um, do you have a document documenting code reviews?

They said, oh yeah, oh yeah, here's the document that documents code reviews, and it was big, right? It was a document, and it didn't just document style guidelines. It even said, oh, when do you do a PR? What attitude should you take when a PR comes in? How do you express that attitude, right? You're supposed to be helpful.

So look at your questions and see, you know, are there any question marks instead of periods? Like, are you asking...

Andy: How many emojis did you use? Is there a cake in there somewhere?

Mon-Chaio: Right, because these folks, they're engineers, they wanted to be very analytical about this and give step-by-step guidelines. And so the other thing I helped them do was actually do a code review. And as I looked into the code, I'm like, wait, this code isn't that well reviewed, and it isn't that good.

And so I started listing out the things I thought were wrong in the code. And then I said, well, wait, they have this large document of things that they're supposed to check for. So then I cross-checked them. And lo and behold, many of those things weren't in the document. Imagine that. And so as I was talking to these folks about that, I felt like they were saying, well, okay, so you found these six things, so we'll just put them in the document. That's not how you drive, in this case, code cleanliness. Which is a complex problem, because it doesn't have a discrete, closed-form solution. What I said was, any document that listed everything possible that's a bad code smell would be completely useless. Because it would essentially be, I don't know, 7 gigabytes or something.

Andy: And it would be self-contradictory.

Mon-Chaio: And it'd be self-contradictory, yes. You cannot enumerate every possible thing that could go wrong in terms of code cleanliness. You simply cannot in this complex environment. So what can you do instead? Does that mean you don't have a coding document? No, it doesn't. You still have to have a coding guidelines document.

What I proposed was that the problem is that you are not hiring people that care about code cleanliness. And your process isn’t designed to produce code cleanliness because it’s a very rushed process, because it’s not very collaborative, because your PRs don’t have anybody reviewing them, because people are off working on their individual thing and they see PRs as simply a checkbox.

They don't have actual responsibility towards it. Because, because, because, because, because. And what is the root cause analysis of that? That is a cultural issue. That is a how-do-you-structure-your-team issue. What are the stories you're telling your team? Because as you mentioned, Andy, these people have to make minute micro-decisions from the bottoms up, and the only thing that can guide those decisions is not a document, it is the culture you have built.

Andy: What do things look like if you're, if you're starting to accept that it's all these minute little things, minute changes, interactions, knowledge about how the system works?

And you can't just document it all. You can't just say, here's the set process and we've run it. But you also don't want to be off in the anything-goes. So what does it look like? I'm going to try a phrase out, we'll see where it goes: you want to be sitting on the edge of chaos. You want just enough process to keep you from falling into chaos. But you want to sometimes fall into chaos and sometimes fall into not enough chaos. And so there's this kind of dynamic point where, as a group, you're constantly kind of jumping back and forth over this boundary. You're trying to sit on that boundary, but you can't quite do it, because it's always moving and you're always changing and all that.

So for code review, what does a code review look like? Well, a code review looks like an act-sense-respond type environment where you are constantly communicating about: what are people learning about in this code? What are, what are their assessments of it, and all of that? Guided by some principles that they can use, and those principles are part of a feedback cycle that gets them to reinterpret what those rules mean.

Because one of the reasons that you can’t just write it all down is because a lot of it requires interpretation of what’s the context, what is the goal we’re trying to achieve, what does the current code look like, what’s the next logical, small step that we might use to move in the direction that we want to be going.

What is the direction we want to be going? All of those things are a negotiation of the entire group. And as that negotiation solidifies, parts of it will get moved into a process document or something like that. But that process document itself constantly evolves. So you're kind of holding yourself right on the edge of this chaos.

You have just enough. I think in the agile world, early on, one of the mantras was just enough process. And that was kind of a similar idea. We want just enough to keep us from falling into a chaotic maelstrom, and not so much that we get these strictures on us where we can't make the changes that we all know we need to be making. And so in failure scenarios, taking it back to the kind of idea of the root cause and incident analysis...

I think one of the things is to keep yourselves involved in what chaos looks like. Because one of the reasons we sometimes have a really hard time judging where that boundary is, is that sometimes you can get to this point where you kind of feel like if there's anything not in those written process documents, then you're in chaos. But I would say that's not chaos.

That, that's just you being in an unfamiliar land. And so what you kind of need to do is you need to keep practicing: what does it look like to be in chaos? If you work on the operations side of software, sometimes you run scenarios of outages.

Mon-Chaio: Mm hmm.

Andy: You’ve purposefully set your system into chaos to see what does chaos look like.

Mon-Chaio: Mm

Andy: So you might shut down some stuff and then, and then you get to find out like, how do we respond? This is not just a systems thing, but it’s also a human thing. Part of the system just went down. How do we diagnose it? How do we react?

How do we interact with each other? And by pushing yourself into that chaotic area, you can start to learn, Is this really chaos? Or, is it controlled chaos? And that’s, maybe that’s a better term there. You want to be constantly in controlled chaos.

Mon-Chaio: Mm hmm. I like this concept. Even when we were working together in the early 2000s, that may have been when it first came out. Was that when Netflix or whoever first came out with Chaos Monkey?

Andy: Ooh, I think it might be.

Mon-Chaio: For listeners that don't know, this is a thing that often runs in production and generates errors on purpose so that you never get too complacent. One, so you can practice. Two, but also so that you're exposed to errors that may not actually occur that often. And maybe that triggers thoughts like, well, this would never occur, but ooh, that would.

And that’s a close cousin.

Andy: hmm.

Mon-Chaio: So I like that. I like this concept of Chaos Monkey for more than just operations. And when we talk about learning, we do talk about how important practice is. And if you're only practicing during performance time, i.e. the site is down and people can't order anything, that's not practice, right?

That’s triage. And you, you certainly learn things from triage, but not as much as you learn from practice. So you do need to set aside time for practice. So I like this idea of creating intentional chaos.
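(For a rough sense of what this kind of intentional chaos can look like in code, here is a minimal sketch of a Chaos Monkey-style fault injector. The instance names, the terminate call, and the schedule are all hypothetical placeholders for illustration; Netflix's actual Chaos Monkey is its own tool with its own integration into their infrastructure.)

```python
import random
import time

# Hypothetical inventory of service instances; in a real setup this would come
# from your orchestrator or service registry rather than a hard-coded list.
INSTANCES = ["checkout-1", "checkout-2", "search-1", "search-2", "search-3"]

def terminate(instance: str) -> None:
    # Placeholder for a real termination call (for example, asking your
    # scheduler to kill the instance). Here we only print what would happen.
    print(f"[chaos] terminating {instance} on purpose")

def chaos_loop(probability: float = 0.1, interval_seconds: int = 3600) -> None:
    # Every interval, roll the dice and maybe kill one random instance, so the
    # team regularly practices detection and recovery instead of waiting for a
    # real outage to rehearse on.
    while True:
        if random.random() < probability:
            terminate(random.choice(INSTANCES))
        time.sleep(interval_seconds)

if __name__ == "__main__":
    chaos_loop()
```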

Andy: And it connects for me as well to a concept in the book Just Culture by Sidney Dekker. In there he talks about, and I think he got this from someone else, but he talks about the difference between technical errors and normative errors. This is all, in his case, he's talking about these after-the-fact assessments.

This is the social construct of error. You can have technical errors and you can have normative errors. A technical error, when you assess something to be a technical error, what you're saying is, you're admitting that this is something that we could have trained more for. This is something where we, uh, misunderstood the tools.

Through training we would do that better. Through better mental models we would handle this more smoothly. Technical errors are things where you're admitting that as a group you have some sort of agency, you can control it in some way, because you can say, oh well, now that we understand what's happened,

let's make sure that we train on this. A normative error is something about the professional him- or herself relative to the profession. So if you judge something to be a normative error, that's where you say, oh, this guy just doesn't want to understand how networks work. Oh, we don't, we don't have a good enough development team here.

Our management doesn't understand their role, and so they, they meddle with how the teams operate. Those are normative errors. You're, you're saying that they don't perform their role as they should. Part of what we want to do in these cases is we want to try to stick to: how can we keep as many of these things as possible as technical errors, assessing them as, there's a technical thing going on here that we could get better at.

We could, we can handle the chaos better.

Mon-Chaio: I think it's important to keep the focus on what we know to be true, that objective truth, instead of extending that objective truth further into what might be subjective. And I like what you said earlier too, going back to the sitting on the edge of chaos. I think, I think you are right that... well, let me start over here. Yes, we should do just enough process. And I think a lot of companies do that, especially companies that purport to be agile. They will say, we do just enough process. And if you look at a lot of Silicon Valley companies, they will say that they do just enough process, even though one might disagree. So it is true that you do want to be sitting on the edge of chaos a lot of the time. And that gets into, if you're a SaaS software company, you should release. I think Zuckerberg even pointed this out in an interview a couple weeks ago: the difference between Meta and Apple is, we want to release and sometimes be embarrassed about what we release.

Andy: Mm.

Mon-Chaio: That was his statement about how they're different. And I don't know if that's necessarily true. I did work at Meta, I've never worked at Apple. I could see how that perception...

Andy: Yeah. Yeah.

Mon-Chaio: Um, and so that’s kind of like sitting on the edge of chaos, right? Sometimes you want to fall into what you might consider chaos.

That, oh, we're super embarrassed by what we released. And part of not falling into chaos is what you've mentioned, which is, one, maybe what you're falling into isn't chaos, or it's controlled chaos or something, and you should be okay falling into it. But two, I think a lot of people see this concept of no process and they get afraid of it, because they've been in places with what they call no process or limited process and they're always falling into chaos, never falling out of it. Because I think, done poorly, sitting on the edge of chaos is not a 50-50 thing. Done poorly, sitting on the edge of chaos is always falling in and never falling out. And so beyond the practice, what is the difference with teams that sit on the edge of chaos and mostly fall back into normal instead of mostly falling into chaos?

Andy: I think I can tie this back to the very beginning. We were talking, at the beginning, about how I have an issue with a lot of these incident analyses, root cause analyses, whatever, where they focus on, we need to make sure we never have this happen again. And the thing that I coach people now to look at is, rather than trying to frame this as never let this happen again, frame this as, how can we recover faster? Now, that faster recovery could mean that you have a zero-time recovery. It's like, hey, you basically were never there. And I think that's the thing. It's not that you want to sit in chaos all the time, but it's that whenever you find yourself in it, you want to see how quickly you can get back out, and you want to fall into it periodically

to find out how fast you can get back out. So mean time to recovery is the buzzword on this. So, what's your mean time to recovery? When this does happen, how can you recover even faster next time? And sometimes that means you recover before it even makes it to production. Sometimes you recover before the commit even occurs.

Sometimes you recover before the hire happens. It's, how do you recover more quickly? But the only way you're ever going to measure that recovery time is if sometimes you do have to recover.

Mon-Chaio: You also have to identify that you're in chaos. You can't recover if you don't know that you're in it.

Andy: Yeah. This is, I think, something that people often forget about, uh, that mean time to recovery is not from the time that you notice to the time that you're recovered. It's from the time that it started to the time you recover. Detection time is part of recovery. And so I've, I've had places where they're like, oh no, no, we recovered from that in five minutes. Uh uh. You didn't recover in five minutes. You recovered in six months and five minutes. That had been going on for six months.
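(To make that arithmetic concrete, here is a small illustration with entirely made-up timestamps for a single incident; the only point is that recovery is measured from when the fault was introduced, not from when somebody noticed it.)

```python
from datetime import datetime

# Hypothetical timestamps for one incident (all values are invented for illustration).
fault_introduced = datetime(2024, 1, 10, 9, 0)   # bad change ships, silently breaking things
fault_detected   = datetime(2024, 7, 10, 9, 0)   # someone finally notices, six months later
fault_resolved   = datetime(2024, 7, 10, 9, 5)   # rollback lands five minutes after detection

time_to_detect   = fault_detected - fault_introduced
time_after_alert = fault_resolved - fault_detected
time_to_recover  = fault_resolved - fault_introduced  # detection time counts toward recovery

print(f"Detection took:           {time_to_detect}")
print(f"Fix after detection took: {time_after_alert}")
print(f"Actual time to recover:   {time_to_recover}")  # six months and five minutes, not five
```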

Mon-Chaio: I do want to touch on the fact that it is a social system again.

So you’re listening to this podcast and you’re saying, great, I’m going to go back and I’m going to implement this concept. I’m not going to have too many written down rules about how we should monitor production. I have just enough rules. And sometimes that means I don’t catch something in production.

But you know what? I'm going to work on fast recovery, and I'm going to analyze these incidents and try to not project too much objective truth. And so you sit there and you do your thing. But maybe it's not working out well and you're still getting these things. Because what you're saying is, as an example, well, we're blameless here, and what we want to do is be able to recover faster and observe these changes happening quicker. But we really need everybody to document where they spend all of their time. And if you're not spending 60 percent of your time on client work, then that requires a meeting. You're not in trouble, but we need to look into it. Or rewards. We go to a company meeting, and this engineer was great because there was a huge outage and they spent 60 hours that week out of their own personal time solving that outage.

We're rewarding this engineer because, you know, they, they deserve a reward, they spent all this time. Or any sort of that stuff that you talk about, even the language that you use, language that might be good, that can be dangerous. Things like, do you have a regression test for that? That's a great thing to say. Let's have a regression test for that. We should always have regression tests. But when there's a connotation of, therefore, we never have to pay attention to it and it'll never happen again, people feel that. And so then that starts to go against the procedures and processes and ceremonies that you've put into place.

So I think the big thing that we have to realize here, as people take this into account, or as people take this into practice, is something we talk about a lot on the podcast. The intent is almost more important than the tactics. If you put into place the tactics with the wrong intent, even buried intent, you’re not going to get your solutions.

Andy: The majority of the actions people take is going to be a function of the system that they're in. And so that includes, Mon-Chaio, what you were just talking about: what incentives do they have?

Go into it with this mindset of, everyone is doing what's right, but we're getting the wrong outcomes. So you need to think about how everything is interacting to tell people that what they've done is the right thing. And, and then you make small changes over time to see what kind of impact they have on that overall system.

Mon-Chaio: And what I would add to that, Andy, is that that is a you problem, not a they problem. Again, and I think that's true in many if not all complex systems, when a complex system isn't working, the first person that should get a microscope put on them is the leader of that complex system, not the moving parts. So for listeners listening, that's always where you should focus your own attention.

There was an outage in my group. It's not just that Tim committed without getting a code review. It's, what didn't I do that allowed him to do that? Was it because I was pressing? You know, in my, in my all hands I said, we've gotta really hunker down and deliver this thing, I know you all are under pressure, but it's important for funding. Did that message cause him to say, I'm going to skip code review this time because I really want to get it out?

So...

Andy: Is it that you’ve set up a system that only seems to work when you’re there? Yeah.

Mon-Chaio: That's a huge one. That's a huge one, right? And that's that culture thing about, um, you know, does the culture continue when you're gone, even if you're gone for a day? So we've touched on psychological safety, feedback fallacies, culture building, learning organizations. I thought we were talking about root cause analysis, Andy.

Andy: I think we are, aren’t we?

Mon-Chaio: I think we are too.

Andy: Maybe this is why people are so confused when I tell them about how I do incident analysis. And I should say that when I do an incident analysis, the, the very first steps are actually very much factually grounded. They're things like: get a timeline, get an incredibly detailed technical analysis of what exactly happened. Because only once we have that kind of, like, objective reality touch point

can we start safely going off into everyone's individual subjective reality of what happened. And in order to get a really interesting incident analysis, that's what you need to do: you need to go from, this is objectively what happened, into, this is subjectively how I understand what happened. To understand that larger system around it, what played into it?

What were the people's, uh, motivations or involvements? Or, uh, were they all afraid that they were about to be told off, and that's why no one responds? Like, that's how you get into those things, to be able to, to work out how do you get to a better system.

Mon-Chaio: And the subjective realities are a big part of that. You have to listen, especially to your contrarians, who often are the ones that we push aside the earliest. Everybody's been in an incident review where we're saying, how do we get this under control? And somebody will say, well, you know, if we moved to a microservices architecture, this simply wouldn't happen.

And everybody puts their hand on their forehead and they say, that’s like a two year thing. Like, why are you talking about this now? Or, you know, if we said we didn’t want to sell to those banks, which we’ve been talking about for a long time, then their requirements wouldn’t cause us to have this fragile system, right?

Those are your contrarians. Those are their subjective realities, but there's often truth in those.

Andy: Mm hmm.

Mon-Chaio: And I think when we ignore those realities is when we also often have challenges, uh, and then we get to that blameful part of, well, you know, they didn't take this seriously. Why are they always talking about microservices architectures?

Andy: All right. There were lots of things in here. I don’t even know how to enumerate them, Mon Chaio.

Mon-Chaio: I think I would start by saying, again, that software organizations are complex, not complicated, problem domains. And so those are generally the solutions that you need to apply, complex solution types. We talked about that in, what was it, uh, in Wicked Problems also.

Andy: Another one.

Mon-Chaio: And that while you shouldn't get rid of blame or culpability or whatever you want to call it, humans fall on that side way too often. If you want to start getting better at the sensing and reacting part, look inward first, and critically, and figure out how to make small steps, in the way that you can within your sphere of influence, to make change and embrace a little bit of chaos. The way that you tame that is through practice, so that you can get better at dealing with chaos when it comes in an urgent environment. As well as figuring out whether your culture and the way that you do things leads to more chaos, because your theory-in-use and your espoused theory maybe don't align, even in very subtle ways.

And it’s funny how small the subtleness can be and how big the chaos it leads to can be. So I think those are kind of the things in my mind.

Andy: Yeah, I think that covered it. I can’t think of anything else that we might have gone over. Our listeners will have the advantage of being able to just rewind and play again and figure out what we’ve missed here. I think living on the edge of chaos, I will leave it at that. And they can tell us what we’ve missed in our recap. So it’s been another great conversation Mon Chaio. Until next time, be kind and stay curious.

