DevOps Decrypted: Ep. 14 - Platforms; The good, bad and ugly

Summary

Welcome to Episode 14 of DevOps Decrypted, where Romy, Rasmus, Jobin and Jon talk about platforms, people, and processes.

Just what exactly is a platform, and can we even define what an IDP is? Naturally, this leads onto the subject of Backstage – Spotify’s gift to the dev world: an open source framework for building developer portals. We discuss its limits, its power, and how culture and people-driven your tech stack really is.

The talk turns to Google's Service Weaver, which Jon has some choice words for (and a little bit of praise, too!), before wrapping the show up with cloud complexity; is it really the cause of one in five cloud outages? Romy talks about some hands-on experience with increasingly complex cloud integrations and what happens when it all falls down.

Romy Greenfield:

Hello, everybody, and welcome to DevOps Decrypted!

This is episode 14. I'm your host, Romy Greenfield, back from holiday – and joining me today, I have Jobin, Jon and Rasmus. Hi, everybody!

Jobin Kuruvilla:

Hey, Romy, welcome back!

Romy Greenfield:

Thank you! So, let's start with talking about some DevOps in the news.

I know that there's a lot of talk about “platform” in the news at the moment. So what are you guys' thoughts about building platforms, and their use cases? What's good about them? What's bad about them?

Rasmus Praestholm:

It is such a meaty topic. There's so much floating around about it right now, that's for sure.

I love the idea of having a platform.

But you have two camps right now, or rather you have a spectrum of camps. There are those that say, well, it's just part of the process, you know, you can do one thing, you can do many things. And then, at the other end, it's more like… don't call it a platform, or whatnot. So, how do you even start defining it?! That's the hot part for me, I think.

Jobin Kuruvilla:

But that's the first thing that came to my mind when I said, let's talk about platforms. And then there's this article that said “don’t call it the platform”, right?

It reminds me of that discussion I had some time back – should we call it DevSecOps, SecDevOps, or SecOps – you know? So I'm confused… What's in the name? Why can't we call it a platform?

Romy Greenfield:

It is just a name – you can call it whatever you want. What's important is what it does.

Jon Mort:

Yeah, I think it's important to actually have names for the things that you talk about. So... If you're not calling it a platform, then what are you going to call it? There has to be something, some common vocabulary, to actually have a discussion around and know what you mean.

So I'm not necessarily sure I agree with “don't call it a platform”, but I think there are some interesting ideas in that post. And I think platform engineering, building internal platforms, is a complex and widely misunderstood thing, and possibly something the industry is really just grappling with, trying to understand what it is that we mean when we're talking about that.

Rasmus Praestholm:

Yup. I can sort of suggest and clarify.

I think what we are thinking about as a DevOps podcast is sort of “the next thing", which is becoming more and more about the DevOps platform or the internal developer platform, and those sorts of things.

Of course, previous platforms, like Cloud Foundry and others like deep tools, are really made to do all kinds of things. But it feels like this new initiative in platforming, uh, platform engineering is new, and it is some of the next evolution of these kinds of concepts.

Jon Mort:

Yeah, so I think one of the key themes in the platform is enabling developers and reducing the cognitive overhead. So in my mind, anything that reduces cognitive overhead and allows you to focus on solving the problem at hand – that's a platform to me, and that's the important part of it.

All of the rest of the fluff around it, you know, that's nice, but ultimately, it's about giving people tools to build and to operate things without having to worry about too many concerns all at once.

Jobin Kuruvilla:

If I remember correctly, that was also one of the major arguments in that article. They were focusing on developer experience. Right? You call it a platform. You call it something else. Ultimately it has to be about developer experience, right?

Jon Mort:

Yeah, and I agree with them on a lot. It's arguably in the Team Topologies book: the thinnest viable platform is what you should be aiming for, certainly at least to begin with. And if that's just a set of documentation saying these are the services or tools that we want to use, I think that's a brilliant place to start – and it might be where you need to end as well, and maybe not go much further. But I think this is where the context that you're in is really important.

And so is the size of your engineering team. Because if you've got hundreds of engineers, you need to make the most of them. Then it's probably worth giving them high-level abstractions to build on so that they can go faster together, ultimately –

Jobin Kuruvilla:

The biggest question is okay – what exactly do we mean by the platform? Is it like a set of tools, or are we actually specifically, talking about any internal developer portal as the platform, right? Is it a service catalogue? Is it an internal developer portal? What exactly is the platform that we are talking about?

Rasmus Praestholm:

Yup, and it's going to vary so much, almost infuriatingly so, how that is defined per company.

I think the main thing we can maybe agree on is that, you know, DevOps, when it came out on the scene, was really more about connecting, you know, Dev and Ops, naturally.

But the way that worked was slowly turning into a service-oriented thing. Where work was being thrown over the wall, Ops defined an interface to the wall, so that you can go up to the wall and request things without having to go and file tickets and go left and right and all those kinds of things.

So now, the next-gen of that, well, maybe there isn't even an interface. Maybe it is just a platform you can log into and press the button. It's still done by somebody behind the scenes, like an Ops team or a DevOps team, or a platform engineering team.

But it's even more smooth than that first pass at DevOps.

Jobin Kuruvilla:

So you're talking about a fancy self-service portal, then?

Right? Where you can go and probably request your infrastructure. Probably even start onboarding your team. A lot of those things happen through the portal, the self-service portal that I was mentioning. But does that alone constitute the platform, or is it this layer of underlying tools, which could be the CICD tools, monitoring tools, all of those contributing back into that internal developer portal? Will the platform be a combination of all of that together?

Rasmus Praestholm:

You know… I wonder if somebody could make a map of the platforms? Where, as a newcomer to the field, you're like: do we need a platform? Here's a flow chart.

Okay, let's see which one you get.

And if it is just documentation in Confluence, that's good enough.

Suppose it's an IDP like, you know, Backstage, cool. Go to that thing – if it's something even grander, something consisting of multiple independent pieces. Sure.

But probably, that map is lacking, and it's lost in all this kerfuffle in talking about platforms and other platforms, and yes, platforms and multi-platform, and so on, and so forth.

Jobin Kuruvilla:

You did mention Backstage there – obviously, you cannot have a discussion around platforms internally without talking about Backstage. There was a lot of emphasis on it, even in the AW3 event last year. So what exactly is happening in that space?

From what I understand, it is an internal developer portal; I mean, you still need the CICD tools. You still need a lot of the other tools that you would otherwise require for the functions that you need to do the DevOps functions, right? So what exactly is Backstage, and if I am the customer adopting Backstage, will that alone take me somewhere?

Rasmus Praestholm:

Right. So Backstage, at least, seems to have a sharp focus on what they are, which is a single-pane-of-glass view of your tooling and the stuff going on at your company. So that's your software catalogue and all the things in there, and it's a set of templates that get you started when you're trying to do a new thing.

Out of the box, it might not do so much for figuring out: we are already an established enterprise, we have all these existing things, and what do we do? Well, I mean, you probably already have a bunch of platforms if you're already established and you've been going for years and years. You have platforms. You might not know what they are, or where they are, but they're in there.

Backstage is a cool little thing where you can begin somewhere in a light fashion, and I believe that just recently, they also announced that they were getting ready to introduce their new back end. Because when it originally got open-sourced by Spotify, it started as… maybe a bit of a hodgepodge? Just because it was kind of like, we're going to get this out of the door, here it is. And it went V1, but it still took a lot of just, you know, messing around with code in the app itself to make it yours, which was hard to approach.

So now they are getting to where it might have more of a plugin manager like Jenkins or something, which can be nothing but a good thing in my mind. At least. So I think Backstage knows what it is, but the platform field is still much bigger than just Backstage.

Jon Mort:

Yeah. And I think that Backstage, in a way, is almost a tool to build an internal platform out of. It's not an out-of-the-box experience, because each organisation is going to have their own requirements and unique context in which they're working.

And so it's the collection of the plugins and of the templates, and the service catalogue, and of the various different things that you'll want to put together for that to be your platform.

In the same way, I think Kubernetes is a tool for building your operations platform, if you want to call it that. You have to customise it; it's not out-of-the-box. You know, you have to have opinions about those kinds of things. So I think of Backstage as very similar: it's a starting point to build from, and it's probably not the thing that you end up with. Right?

Rasmus Praestholm:

Yup.

Jobin Kuruvilla:

I remember one of the major problems that we faced internally was everybody liked Backstage, you know. But the developers themselves can say, “yeah, it's yet another tool, right?"

Another fancy user interface that I need to start using now, and once I started using it, yeah, it's great for onboarding teams. It's great for self-service. But at the same time, if I need to make simple tweaks, there is no easy way to do that in Backstage at the moment… You're probably looking at developing another plugin.

And for weeks, you need another developer on board who can actually go and do TypeScript and, you know, develop a plugin for it. And obviously, then come the maintenance issues. So Backstage itself was becoming a problem internally, and maintaining it was becoming a problem.

Jon Mort:

Yeah. And we've reduced our, I guess, ambition for it, because initially everyone was like, "yes, we're going to use all the features, we want to use all the things, all the possibility!" And actually we've reduced it down to a much smaller set of capabilities: we are just going to start here and then see where we grow with our Backstage uses.

And that's been a lot more successful in terms of getting more teams on board and more teams adopting it. But Backstage is a place to, you know, share – frankly, who's in the team? You know, it's a good place to start, as well as some documentation and things.

Yeah, it's growing for sure.

Rasmus Praestholm:

There is definitely a large cultural aspect to it. My understanding is that probably part of why it was so successful at Spotify was because they had a culture where every team that was working in Backstage was also working on Backstage. And that works well if your entire tech stack is oriented around Node and those sorts of things. It might be a big pill to swallow for a .NET shop or something else that's just completely out of that field.

So, if you fit in it culturally and tech stack-wise, you can turn Backstage into whatever you need, and it can cover all your platforming needs as long as you're connected to the right tools and integrations around it.

Whereas maybe some of the other options in the market might be a better fit tech stack-wise, but they could be trying to be everything when they're not, and they aren't as extensible as they could be. And maybe that's where some of these “don't call it a platform” things come in, because if you can't truly make it into a platform that fits you uniquely – maybe it isn't a platform. Or maybe there's no platform that you need.

Jobin Kuruvilla:

Yeah, it goes back to that “build it or buy it” argument. Right? I think Netflix, when they started the platform engineering project, looked at Backstage, and they realized that instead of investing all their resources and money into, you know, maintaining Backstage, it was probably better to create their own internal developer portal.

And… they actually built a different portal and didn't rely on Backstage.

And even at the end of it, they found that the adoption internally was a bit difficult because users thought that, again, as I mentioned earlier, it was yet another user interface. But at the same time, they found out that just consolidating everything into a single place is not enough, right?

You need to build your workflows using the tools, and for the adoption, you need to get the buy-in from the developers. Otherwise, it's all pointless.

Rasmus Praestholm:

Yeah, this also gets to where I have an issue with some of the naming of IDP as an internal developer platform, because sometimes it's suitable to make it about the developer experience, developer-first, where it's pretty much nobody but developers.

But in some cases, like with Backstage, you can make it a thing that PMs and Leads and so on rely on to keep tabs on some things, but they might not really count as developers. And it's again going to be so uniquely suited to different companies that it's hard to say how to get there. One thing I started wondering was, instead of an internal developer platform, is it more like a developer community? Because that can include everybody in the picture of developing software, whether they are a PM or they're actually in there coding on things.

Jobin Kuruvilla:

That's an interesting concept.

Let me ask you this question, right…

When we say IDP, a lot of people say it's an internal developer portal. You just mentioned it as an internal developer platform – there itself, we can see two lines of thought, right?

What exactly is it? Obviously, if I'm looking at something like Backstage, I see it as an internal developer portal because, you know, it is, at the end of the day, a portal, and it's not a platform because it's not hosting anything else.

All the tools that are running behind it are not part of the system itself, so it's nothing but an internal developer portal.

Rasmus Praestholm:

Yup, and the wording has been so mashed up that I'm using them interchangeably by accident, anyway. Yeah, I'm pretty sure Backstage is a portal – but in some cases, it's like developer or DevOps, so is it like an internal DevOps platform or internal developer portal?

I'm sure some people have a definition for it, but I don't think that's out there firmly anywhere yet. So sometimes, for me at least, I like to, well, let's call it something less overloaded, and try to make it more clear.

Jobin Kuruvilla:

Makes sense. Another discussion that I was looking at was, you know, are there enough tangible benefits to say that, yeah, it's worth exploring platform engineering? It's worth investing in it.

Because one of the recent arguments was that a lot of the benefits you can actually count or quantify come from the proper implementation of DevOps processes and tools. Right? Continuous Integration, Continuous Deployment, that actually brings benefits that you can quantify.

Whereas with platform engineering, yeah, great, you're implementing a set of standards, a set of the same processes, things like that. But you can't really quantify them. And again, you need to say that, you know, having a platform team looking at platform engineering comes with a cost of its own, right? So, looking at the cost versus the actual benefits, is it worth doing it?

Jon Mort:

But I think, in my mind, this is part of measuring and understanding where your organisation is and, you know, where are you spending time? Where are engineers spending time on things that maybe they shouldn't be? It can be that one of the problems you have is that you don't know the state of all the services out there. So the most important thing is to get a service catalogue and understand which teams are running which services, and what capabilities they provide. Because in surfacing that information, you can then see what new capabilities you can serve your customers with by combining the different services.
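As a rough illustration of the service catalogue idea Jon describes here (which teams run which services, and what capabilities they provide), a minimal Python sketch might look like the following. Everything in it is an assumption made for illustration: the `Service` record, the team and service names, and the capability labels are hypothetical and do not reflect Backstage's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One catalogue entry: who owns the service and what it can do."""
    name: str
    owner_team: str
    capabilities: frozenset


# Hypothetical entries; a real catalogue would be fed from your own systems.
CATALOGUE = [
    Service("login", "identity", frozenset({"authentication"})),
    Service("billing", "payments", frozenset({"invoicing", "payments"})),
    Service("notifier", "payments", frozenset({"email"})),
]


def services_by_team(catalogue):
    """Group service names by the team that owns them."""
    by_team = defaultdict(list)
    for service in catalogue:
        by_team[service.owner_team].append(service.name)
    return dict(by_team)


def providers_of(catalogue, capability):
    """Return the names of services that offer a given capability."""
    return [s.name for s in catalogue if capability in s.capabilities]
```

Even a structure this small answers the two questions raised above: `services_by_team(CATALOGUE)` shows who runs what, and `providers_of(CATALOGUE, "email")` shows which existing service could be reused when someone needs that capability.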

So that might be one area of problem that you have. Another area of problem that you might have is knowing what choices you have when you go and build a new thing. It's the kind of blank-sheet-of-paper problem. You'd like to have a template of something as a starter – it gets you going more quickly, and it means you're more likely to adopt the good practices and the standards that the organization might have.

And so I think that's another area. Another thing would be standardization: this is just how we do things. So actually having a set of, these are the blessed services; if you're going to build a web service, this is how we do it here, and this is how we manage it. So you might even have your own custom CLI to deploy things, and you have a single path to production.

You have a single toolchain. Organizations are going to have a combination of those three areas as problems, and there are plenty more, I would imagine, as well.

But when I think about it. That's the level of complexity that you, yeah, when you start thinking about, how are we going to get more out of our teams, like, how can we remove busy work? How can we remove frustration? It kind of gets down to… “How can we make teams' lives happier?"

So you're actually doing what you love, rather than battling some… It's like. I've got to learn a new thing, or I've got to apply the security standards to this project, and I didn't have a template, to begin with. So now it's super painful to add all the things that I did, and I got a ton of refactoring, and it's just slowing me down, getting value out.

So I think those are the areas I think about in terms of platform and enabling teams to do more and higher quality.

Rasmus Praestholm:

So that's another one of the interesting parts that at least one of these articles recently touched on. I think it was by Humanitec, which is one of the other IDP options out there.

And they got into literally trying to calculate an ROI on the different activities that your team performs.

Because Backstage looks shiny out of the box, but if you think about it like, I could press a button and I can get a Git repository that's hooked up to GitLab and will deploy from that – like, yay!

But then you start doing the math: how often do you actually spin up a new environment or a new app? It's kind of a vanishingly rare activity.

And what they came up with was saying that the most important part to really get results was to implement basic CICD. Yeah, yeah, that's right. But that's way before you would even start thinking about portals in the first place.

So maybe it is a good exercise first to go in there and calculate. Well, what are you tied up by? What are you toiling away at? What is taking too long? And maybe there's a platform or portal or something that can help you.

But if the problem is that you still haven't got automated testing, you probably should work on automated testing first.
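The "doing the math" step Rasmus mentions can be sketched in a few lines of Python. This is a back-of-envelope model only, not the calculation from the article he refers to; the function names and every number in the usage note are hypothetical.

```python
def annual_hours_saved(times_per_year, manual_minutes, automated_minutes):
    """Hours saved per year by automating one recurring activity."""
    saved_minutes = times_per_year * (manual_minutes - automated_minutes)
    return saved_minutes / 60


def payback_years(build_hours, yearly_upkeep_hours, hours_saved_per_year):
    """Years until the automation pays for itself.

    Returns None when upkeep eats all of the savings, meaning the
    investment never pays back.
    """
    net_saving = hours_saved_per_year - yearly_upkeep_hours
    if net_saving <= 0:
        return None
    return build_hours / net_saving
```

With made-up figures, scaffolding a new repository 12 times a year (240 minutes by hand, 30 automated) saves `annual_hours_saved(12, 240, 30)`, which is 42 hours, while automating 2,000 builds a year (30 minutes by hand, 3 automated) saves 900 hours. Against 200 hours of build effort and 100 hours of yearly upkeep, the first never pays back and the second pays back in a quarter of a year, which is the shape of the argument for doing basic CICD first.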

Jon Mort:

Yeah, I think that word, toil, is a really important one, and it's also about capability as well. So one of the things that the platform can do is fill in for the lack of capability in a team.

So you might not have great operational capability, but you can lean on the platform to deliver that for you. That might be something that enables differently shaped teams to run and deliver.

So yeah, I think that “toil” thing was maybe what I was talking about before. But this is the capability as well. So standing on the shoulders of giants in a way like that. That's another way of thinking about it.

Romy Greenfield:

We've actually started capturing all of the unplanned work we end up doing that maybe some tooling or automation across all of our different microservices would help with, because we realised that we were doing a lot of work that wasn't tracked in a sprint. When we needed to upgrade dependencies, or vulnerabilities came up, we weren't actually tracking that; we weren't apportioning any estimations or time value to it, when actually it was taking up quite a lot of our sprint.

So at the moment, we've just started measuring that and actually putting estimations on how long that work takes to see how big of an issue it is, and then it makes it a clearer-cut argument for, well, actually, are we doing this in the best way? Do we need to bring in some automation that's going to go and update all of the different microservices to make sure that we don't kind of context switch, do one service at a time, or have things in different states?
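A minimal sketch of the measurement Romy describes, assuming each piece of work is logged with an estimate in hours and a flag saying whether it was planned. The `WorkItem` fields and the function name are hypothetical, not a description of the team's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str
    hours: float
    planned: bool  # False for work that arrived mid-sprint


def unplanned_share(items):
    """Fraction of the sprint's logged hours spent on unplanned work."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    unplanned = sum(item.hours for item in items if not item.planned)
    return unplanned / total
```

For a sprint with 30 planned hours, a 6-hour dependency upgrade and a 4-hour vulnerability fix, the share comes out at 0.25. Tracking that number sprint over sprint is what turns "it feels like a lot" into an argument for or against investing in automation.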

Jon Mort:

Yeah. And I think that's kind of one of the things I recommend teams do as their first thing, and then you can spend the time paying down the technical debt, and then see some results, maybe. And I think that, you know, it's super important to do that measuring. It's kind of just the basics of flow metrics, you know? Where is work waiting and why, and what's holding it back?

Jobin Kuruvilla:

Yeah, I was going to touch upon that. I mean, the problem that Romy mentioned, any IDP is not going to solve that problem. So you need to realize where your problems are. As you mentioned, where the toil is, where the waste is.

So are we saying that, you know, it is probably worth doing value stream mapping early on, before you enter into doing proper platform engineering, or whatever it is? Because you need to know where the wastage is, how good the flow is, and that kind of thing before figuring out: okay, what is the problem? What's the solution to it? Right? An IDP may not be the solution.

Rasmus Praestholm:

It's also a fun catch-22 involved here because, as Romy noted, they've started capturing some new metrics. Okay, where do you put the metrics? Oh, well, we need a developer portal for that.

But we don't know yet if that will be worth it because we don't have the metrics.

Jobin Kuruvilla:

That's what, possibly, the value stream mapping will bring out, right? It will bring out some data and some focus points where you can then start investing time. Suppose it gives you some of the metrics that you are after; yeah, definitely go for it, right?

Rasmus Praestholm:

It is definitely a fun challenge.

Romy Greenfield:

So…

I've seen in the news that Google has announced something called Service Weaver. What are everybody's opinions on this new framework for writing distributed applications?

Rasmus Praestholm:

Oh, there's another fun one.

I think Jon has thoughts on this one.

Jon Mort:

I certainly do!

I'll be frank… I hate it. I hate the whole concept of hiding the fact that you're actually getting a distributed application behind synchronous APIs, or even asynchronous APIs, where sometimes a call runs at super-fast local speed, and sometimes it ends up as a remote call that could error and hit rate limits and all those sorts of things.

And it feels like one of these problems that the industry runs into, seems like every kind of 20 years, we go through this cycle of, oh, we could hide all of the remote procedure calls behind the things, and that will be easier to work with.

But it's not easier to work with when you hit production, because the real world happens, and distributed calls fail. Anyway, I think it's one of those well-meaning things which I kind of feel is really dangerous.

I'd love to be proven wrong. I’d love them to get it right, and for it not to be an issue. But yeah, it seems a dangerous paradigm to be working in. Yeah, I hope I'm wrong.

Jobin Kuruvilla:

I want to confirm, though… So from what I read, what I understood was, okay, we are going to end up with a single binary at the end of the day, so it's going to be a monolithic application, although while developing, you’re doing it, just like you develop microservices.

So if it is a single binary – again, there were a lot of advantages to microservices in the way that you have the services deployed in different places. You can scale them differently. Like, if I have a login application, a billing application and something else, my billing and login are probably used a lot more than the other applications, and you know you need to scale them appropriately.

None of that is going to happen with a single binary. I mean, you still are going back to the Old World Order where you have to, you know, figure out how to scale this application with a single binary.

Jon Mort:

Well, I think that's one of the things that this actually does quite well: it divides things into modules, and those modules are units of deployment that can be deployed independently. But you can also combine some modules together. So it gives you that level of control.

So in a sense, I think that's a really positive part of the framework, so not to be, you know, completely hating on it. I mean, what it gives you is the option to be more creative at runtime, you know. And as you scale, you can go – actually, yeah, this, you know, the billing part, let’s separate that out and run it in its own set of services.

Jobin Kuruvilla:

So I should have known Google did it. So they must have done something right!

Jon Mort:

Yeah. I think what I'd acknowledge is that a lot of smart people are behind this as well. So, you know, I know I'm hating on it, but I do want it to succeed, because if they're right, it would simplify writing a whole load of things. It's much simpler to think about things as a contained unit of deployment: if I want to make one change, I can make an atomic change across things.

That simplicity is super attractive.

Rasmus Praestholm:

I don't know if I've mentioned this on the podcast before, but I moonlight as an indie game developer in open-source land. So I'm very familiar with a problem that is common to game development, and I see this Service Weaver thing, and I go: aha!

I could totally use this for making games out of it.

But that's probably not what they had in mind.

But it does. It brings me to this concept, where in video game development, you typically deal with a monolith, at least if you're doing something like a single-player, because it's a full-screen application running on somebody's computer. So yeah, everything is in one binary, and that's how it works, and it works really fast.

Then, when you mix in something like multiplayer, which is where you get your remote calls and so on, it gets complicated. And sometimes, you fix that by having a completely separate set of components for when you want multiplayer. And they're the ones that have all the network communication things hooked in.

But then you have this weird difference where you have the monolith, which works beautifully, and the multiplayer thing that's modular, and you get different bugs between the 2 of them.

Sometimes you fix that by cheating and actually making the local binary itself run a multiplayer server, and you connect to it, which adds a lot of overhead.

But at least it makes them similar. So me seeing this framework makes me think… Wow! So I could just write one thing? And then, just if we say, okay, now, we want to run the game server, but in Kubernetes, and it just pulls out the right pieces, and this makes it magically work?

That would be wonderful; I would love that!

But I also wonder that that's probably not the right use case, and I might be a little disappointed when some of the magic turns out to be like people hiding behind curtains with little long rods, and they're like shifting the doors around.

Jobin Kuruvilla:

So Jon, my concern then is, you know, what can Google do differently to make this more attractive to you?

Jon Mort:

I think it's that the things that could be remote calls need to be really explicitly remote calls, and to have that kind of constraint around them. So if you're in Rasmus’s game developer platform, there's the difference in performance between everything running locally and part of it being, you know, in the cloud somewhere, running on some server that you then need to maintain and synchronize state with. The number of calls and the amount of chattiness on a protocol like that is really important to get right, so that your client part has got enough of the state, and that it matches with the server, and those sorts of things.

So I think bringing in that explicitness of the remote calls, the error handling and that kind of thing, would be really helpful. One of the things that I think they have got right, or are looking to get right, is the instrumentation and the monitoring around that.

So you will know when your remote calls are failing. It just looks like the APIs aren't ready. It's written in Go, so you can handle the error. But a lot of the time, when you're writing distributed systems, you want to handle the error and do some back-off and retries and propagate failure, and use distributed-systems error-handling methods.
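To make the "back-off and retries" point concrete, here is a minimal sketch of retrying an explicitly remote call with exponential back-off. It is written in Python for brevity and is not Service Weaver's API (which is Go); `RemoteCallError`, `call_with_retries` and the default values are all hypothetical names chosen for this example.

```python
import time


class RemoteCallError(Exception):
    """Raised by a call that crossed the network and failed (hypothetical)."""


def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on RemoteCallError with exponential back-off.

    The delay doubles after each failure: base_delay, 2x, 4x, and so on.
    Once the final attempt fails, the error is propagated to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except RemoteCallError:
            if attempt == attempts - 1:
                raise  # out of retries: propagate the failure
            sleep(base_delay * 2 ** attempt)
```

The caller has to opt in by wrapping the call, which is the explicitness Jon is asking for: the code that might fail over the network looks different from code that cannot. The `sleep` parameter exists only so the waiting can be swapped out when testing.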

And it feels like that's too hidden from you. And sometimes you don't care. Sometimes you just want to write a library, and if something fails, that's okay.

But you need to have it right at your fingertips to do that when it matters.

And so, I haven't seen it in this, but maybe something that could be added, or someone's going to tell me it's already there, is like an injection of errors. So to be able to, you know, throw some noise in there and, when you're developing locally, make it look like you're actually in a distributed system that's going to require a retry somewhere, or, you know, a transaction failing halfway through, and then it retries it against another copy of the service.
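The error injection Jon wishes for can be approximated locally with a small wrapper. This is a sketch of the general idea in Python, not a feature of Service Weaver; `InjectedFault`, `with_fault_injection` and `lookup_price` are hypothetical names invented for the example.

```python
import random


class InjectedFault(Exception):
    """Simulated failure, standing in for a remote call that went wrong."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap `func` so that a fraction of calls fail before reaching it.

    failure_rate is a probability between 0 and 1. Passing a seeded
    random.Random as `rng` makes a test run reproducible.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure calling {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_price(item):
    """Hypothetical local function that might later become a remote call."""
    return {"coffee": 3, "tea": 2}[item]
```

Wrapping `lookup_price` with, say, `with_fault_injection(lookup_price, 0.2)` during local development forces the calling code to deal with failures long before the function is actually moved to another machine.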

Rasmus Praestholm:

So I'll throw in another thing that I would love to see, and I don't know if it's there now or if Google would want to do it. But this really makes me think of how the framework, or whatever the platform, can be the best thing ever, but if you're not mature in the different areas, it may not matter.

So in this case, you know, the concern is that it might be too slow if something is just transparently, suddenly remote.

A way to get around that, I would think, would be just extremely high coverage, automated testing, and making it really, really easy to get to it so that everything you write in there will automatically have tests generated that would run with different levels of latency.

Almost like a built-in Chaos Monkey. That is going to get you that, like… Okay, we're going to see how this beautiful thing that works great on your machine over here works if we throw all the components in the cloud, and then also throw a monkey wrench in there and see how it does. If you get that somehow out of the box…

Then… if you can kind of, like, fail fast and learn immediately, like, oh, this thing, when it runs remote, completely screws up everything. But we got told by the system almost immediately, because it came with all these handy helper functions to make automated testing a reality and make it easier. Because, honestly, that's one of the usual things that go missing.

You just don't have thorough test suites.

If you do, that helps so many things, but it's not too good. I want you to focus on that.
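To picture tests that run the same code at different levels of latency, here is a small Go sketch. Everything in it is an assumption made for illustration: `remoteStore`, `sumChatty`, `sumBatched` and `withinBudget` are hypothetical, and latency is added to a virtual clock instead of actually sleeping so the check stays fast and repeatable.

```go
package main

import (
	"fmt"
	"time"
)

// remoteStore simulates a component that might be local or remote.
// Each call adds latency to a virtual clock instead of sleeping.
type remoteStore struct {
	latency time.Duration
	elapsed time.Duration
	data    map[string]int
}

func newStore(latency time.Duration) *remoteStore {
	return &remoteStore{latency: latency, data: map[string]int{"a": 1, "b": 2, "c": 3}}
}

// Get costs one round trip per key.
func (s *remoteStore) Get(key string) int {
	s.elapsed += s.latency
	return s.data[key]
}

// GetMany costs one round trip for the whole batch.
func (s *remoteStore) GetMany(keys []string) []int {
	s.elapsed += s.latency
	out := make([]int, len(keys))
	for i, k := range keys {
		out[i] = s.data[k]
	}
	return out
}

// sumChatty makes one call per key.
func sumChatty(s *remoteStore, keys []string) int {
	total := 0
	for _, k := range keys {
		total += s.Get(k)
	}
	return total
}

// sumBatched makes a single call for all keys.
func sumBatched(s *remoteStore, keys []string) int {
	total := 0
	for _, v := range s.GetMany(keys) {
		total += v
	}
	return total
}

// withinBudget runs work at the given latency and reports whether the
// virtual time spent stayed within budget.
func withinBudget(latency, budget time.Duration, work func(*remoteStore, []string) int) bool {
	s := newStore(latency)
	work(s, []string{"a", "b", "c"})
	return s.elapsed <= budget
}

func main() {
	budget := 100 * time.Millisecond
	for _, l := range []time.Duration{0, 5 * time.Millisecond, 50 * time.Millisecond} {
		fmt.Printf("latency=%v chatty ok=%v batched ok=%v\n",
			l, withinBudget(l, budget, sumChatty), withinBudget(l, budget, sumBatched))
	}
}
```

With three keys and a 100ms budget, the chatty version is fine locally but spends 150ms of virtual time at 50ms per call, while the batched version spends 50ms. That is the "works great on your machine, falls over when remote" surprise showing up in a test instead of in production.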

Jobin Kuruvilla:

Interesting, yeah. Looks like, I mean, that's some of the frameworks out there that developers can now go and take a look at, you know… Like we don't have enough frameworks.

Rasmus Praestholm:

They always need more frameworks!

Jon Mort:

This discussion reminds me… There was a post a couple of weeks ago about increasing complexity causing cloud outages. And I don't necessarily think it's the increasing complexity that's the problem. It's when it exceeds people's capacity to hold that complexity in their head.

And if you're trying to hide complexity in places where you actually need it to be explicit, I think that's the problem. So yeah, this links the conversation with platform engineering. If you've got a platform which can reduce your cognitive overhead, I think that's a way to get a more stable, more reliable system. So I don't know what you guys think about that. It's one of the things I think about.

Jobin Kuruvilla:

Yeah, I think, at the end of the day, it's all about the abstraction of, you know, how to hide that complexity, right? I mean, part of the problem is, you know, we always look at platforms, or, you know, the other frameworks, as a solution to hide the complexity. But the complexity is still out there. Right?

You talk about creating cloud-agnostic solutions, and then you have multiple clouds. You're talking to Google Cloud. You're talking to AWS. You're talking to others, all of them, and there are various technologies that you're using… I think you need a team of people across all of this, which is never the case.

And then suddenly, you have a problem at hand because something goes wrong, and you don't know who to pull in. I mean, how many different places are there to look at?

Romy Greenfield:

Yeah, when we've had some outages recently, we thought that we understood the service that we're using, and the complexity comes from the fact that, actually, there are some caveats to the service that we're using…

You've read all the documentation, you've tested it in staging, and then there's one edge case, or there's one little caveat that you didn't realize.

And then you can write your test for that; you can test for that in the future. But until you actually run that in production, and everything falls over, and you're like, “why?”, and you’re messaging AWS saying we thought we knew how this worked, what happened, and sometimes they take a long time to come back to you and explain, actually, this is how this works, and this is a limitation that maybe we haven't publicized as much.

It might not be obvious. It could be in some documentation, but it's not in what we just happened to have read when we were trying to build the service, and what we tested. So there's the complexity on, like, the actual infrastructure that we're hosting everything on. But for one developer to know all of that is highly unlikely. For a small team of developers to come across that without having had that edge case experience once in their life before, it's very unlikely.

So yeah, lots of outages were caused by these little nuances that we didn't know about until it was too late.

Jobin Kuruvilla:

Yeah. And those edge cases are always going to be different for different services of different cloud providers. And that makes it even more complex. I think there was a statistic, you know: one in five outages happens because of the cloud complexity, and outages are something that is really bothering all the infrastructure providers.

So it's quite interesting.

Rasmus Praestholm:

So I'm gonna throw in something here I'd like to make out as being clever. But maybe I'm more clever than I give myself credit for! I think all this is very cyclical, just like what Jon was talking about earlier. Every X years, you find a new thing, and you add complexity to it.

And I'm also going to be nerdy and tie this back to platforms, and again, being an indie game dev, because once upon a time, when I started with a group of people, we were worried about writing like renderers.

Really low-level stuff that's hard to do. It's complex, and so on. And then, eventually, the game engines, much like platforms, started coming out.

And they kind of just abstracted some of the complexity behind your libraries, or your engine, or your platform, or in the case of cloud – which is really just somebody else's data centre – it's still there, but now you're just building this increasingly high tower of madness, like that famous Internet graphic of what keeps the Internet running is this one guy, maintaining this little teeny, tiny, open source library.

And you make better tools.

But if you keep doing things, process and people-wise, the same, you will repeat the mistakes of the past over and over and over again. So to me, it becomes almost more like: how do you break the cycle? How do you jump out of that loop?

And I'll bring this back to saying that automated testing and chaos monkeys and those kinds of things, and having, like, a healthy overall balance between things, might help it. Because you will be less surprised at things that go wrong that you hadn't pictured; if you throw a chaos monkey in there and have tests, you'll be able to catch it more readily.

So it's easy to get stuck on that kind of like the incline of tools and the productivity, and so on.

But you only ever still advance in one thing. Oh, now we can do more things faster. Yay, let's do more things faster. And then you look at graphs of productivity over time, like, productivity is going up. But revenue is not going up. How does that work? And it all comes back down to: well, maybe we need to be more balanced.

Maybe we need to focus not just on doing things faster but also doing better and bringing people in, having them involved and knowledgeable about all these things.

But that's probably hard, especially for for-profits to do. Because it's the thing that gets, you know, short-term quarterly returns made more visible that gets the attention.

And stuff like writing good documentation or automated testing or having your value stream maps and things well documented somewhere… That's just not what gets attention.

Jon Mort:

Yeah, I think it's hard to do, but it's also, if you don't do it, you're going to die!

The mindset is that of a team looking to continuously improve and have a sort of agile, adaptive mindset. You know, it's a question of: don't think of any of our practices as being immutable or, like, best and final. This is what we're always looking for – what can we do today to improve what we do? Can we examine how we do things and do things differently?

And I think this is where, you know, DevOps, the whole thing, has come from. It's all about not standing still and accepting that this is how it has to be!

And I think, like, looking at the complexity, how can we make sure that we understand as much of the complexity as possible, so we have the observability in place where we understand that? At the same time, do we understand where our team is spending time? You know, like, where is the toil? Where is the unplanned work?

Now I'll take that observability mindset and go, let's apply that to our team as well, and maybe let's apply that to the incoming work that we've got, or maybe what we need to do is apply observability across the entire system. That is the team, the work that we're doing, and what's happening in production, and have, like, a huge holistic view of things.

These are different things we think about to try and understand different layers of complexity, so that you don't end up running right into it.

Rasmus Praestholm:

So, in other words… it's not all about the tools; it's about people and the process!

Romy Greenfield:

So that's it for today's episode – episode 14 of DevOps Decrypted!

Connect with us on social apps, @Adaptavist, and let us know what you think of the show.

But from myself and all of our speakers today, thanks for listening, and we’ll see you next time on DevOps Decrypted, which is part of the Adaptavist Live podcast network.

What are you interested in?

What future technology do you think could take over the world instead of ChatGPT?

Let us know!

Please review and comment on our podcast wherever you interact with your podcast – be it Spotify, Apple Podcasts, Google – whatever! Review, comment, and get in touch with us on social @Adaptavist, on your favourite social media platform.

For Jobin Kuruvilla, Rasmus Praestholm, and Jon Mort, I'm Ryan Spilken – and we'll see you next time on DevOps Decrypted – part of the Adaptavist Live network of shows.
