DevOps Decrypted: Ep.24 - The Money Management Episode: FinOps, People and Process

In this episode, we're talking about how we've enabled FinOps at Adaptavist with CloudZero, as a way to highlight anomalous spending through simple chatbots, and give teams a sense of cost ownership.

Vanessa Whiteley
2 April 24
Summary

Money, money, money…

We've got Chalky on as our special guest, talking to Rasmus, Matt and Laura about the inevitability of overprovisioning at the start of a new venture, controlling Kubernetes through Crossplane – and the concept of efficient scaling.

There's a lot to unpack – but while it's all about money, the heart of the matter comes down to linking people and process, through FinOps.

If you like what you hear or wish something would change about the podcast, we'd love to know! Please give us your feedback on the show at devopsdecrypted@adaptavist.com.

Transcript

Laura Larramore:

Welcome to DevOps Decrypted, where we talk all things DevOps.

I'm your host, Laura Larramore, here with our Adaptavist panel – today, we have Rasmus and Matt, and we also have a special guest named Chalky.

Quickly, of note: we'd like to ask you to leave some feedback at devopsdecrypted@adaptavist.com.

Matt, would you like to kick us off and introduce Chalky to our audience?

Matt Saunders:

I absolutely can. Yes, thank you, Laura. So yes.

I don't think his real name is actually Chalky, but we're going to go with that for now.

Chalky:

Yeah, it is! That's it by Deed Poll, just first and last name, yeah!

Matt Saunders:

First and last name? So welcome, Chalky Chalky, to DevOps Decrypted! So Chalky has worked in Adaptavist for 5 or 6 years, running lots of our cloud infrastructure. The amount of stuff that this guy runs with his team in our cloud infrastructure is, well, it's large – and I think it's safe to say this: a lot of it wouldn't actually work if it wasn't for him.

So yes, Chalky. So there's a glowing introduction from me! Did you want to say a few words about yourself?

Chalky:

No, I'll leave it at that; that's as good as it's going to get – it's all downhill from here!

Matt Saunders:

As it always is. So yeah, when we were putting together the agenda for this podcast, we ended up with lots of stuff about money.

So, we're going to talk about money a lot today.

Rasmus Praestholm:

It's all about money. Yeah, how do you perhaps prioritise it better? How do you figure out what you're doing with it? Since it is often just kind of like, you have money, you have needs, and are they connected?

Matt Saunders:

It seems simple. It is a very, very simple thing. It's like looking at it in the context of a business, perhaps a business providing cloud services, and you need to run services and infrastructure.

So you spend money on it.

And then I guess you work out if you can sell products and services for more money than what you're paying for your infrastructure, all your people, and all your other costs.

And that's how you do business, right?

And everybody's doing cloud computing, not having to run their own data centres, and not being deafened by living in data centres. And therefore, that's all nice and good.

And we have nice small cloud bills, and everybody loves them.

Chalky:

Well, yeah, you say that. But people forget that all the procurement decisions now become the software teams' because they're now buying all these IaaS services. So instead of having this, like, big old stack in a colo in a data centre somewhere, and that's, you know, you've got a finite amount of hardware until you procure more.

You could, in theory, infinitely procure several services and infinitely scale them.

Matt Saunders:

But this is the dream, isn't it? It's like, yes, you divide what you need. You don't have to buy a data centre, generators, three-phase electrical power and high voltage and air conditioning, and all that stuff… and yet, reading the news this week, many people seem to be getting this wrong – or if not so much getting it wrong, there's lots of stuff going on.

We've looked at, for example, CNCF, which did a survey, which I'm just reading here. The people running Kubernetes are overprovisioning things, so they're spending loads and loads more money than they thought they should be.

This isn't the dream – this is not the dream we were promised with this infinitely scalable IaaS, is it?

Chalky:

Well, that's not just a Kubernetes problem, is it? Any compute we provision, we probably overscale it, you know. We put a finger in the air. See which way the winds blow. Go, "Yeah, that'll do". When you start running it in the wild, you get some idea about how things will behave.

And then you can, you know, scale inwards or purchase savings plans and stuff. Like, you always start by guessing, don't you? Or overprovisioning. And then you just get more efficient. But to get more efficient, you need observability tools, and you know, that means financially – but your financials now become observable. They become metrics, you know?

You want to be looking at not just your standard metrics or logs you'd use in an application but, actually, what you're spending.

And that's easily forgotten, I think.

Rasmus Praestholm:

Yep, and I think that there's also another point that I keep harping on about on these calls, and that's:

Tools, people, process.

You can count paying for your infrastructure and such in the tools bucket here, which is so easy: click a button, and here are your tools. Here's your hosting, your queries, and so on.

But the culture and process change in getting people to use Kubernetes, right? That is a whole other ball game.

And I have examples of, you know, having seen that play out. And it's just like, yeah, we could do this better. Why aren't we? Well, as it turns out, it's hard to get people to do things more efficiently if they have other things keeping them busy.

Matt Saunders:

Well, I'm of the view that people shouldn't actually have to do things efficiently. Should they, really? Looking at it from a developer's perspective or maybe from a product owner's perspective. It's like, well, surely I should have this platform that kind of spins up what I need, and it's kind of right size? How many years' worth of experience do you have doing this sort of thing? And it seems to me that a lot of overspending comes through. Is it…

I mean, I'm not trying to beat up on people here for overprovisioning things. But some of these needs have got so complicated and so abstracted that it's hard to see how much capacity we need and how much things need to be demarcated.

I mean, Rasmus, I don't know if you can talk about this a little bit, because I know that with the Venue stuff, you know, these decisions as to, like, how much of a box do you put each of your individual customers in? How much do you have to overprovision to ensure that none of those customers bump into each other and have room to grow, right?

Rasmus Praestholm:

And that's exactly it. Kubernetes is good at scaling.

But if your business logic gets in the way of that by having you want to kind of like isolate different customers from each other, you might not – you might end up in a situation, and I've seen this as a consultant with tons of different companies, too, that each team gets their own cluster.

Which is functionally easy but not cost-efficient.

It's even to the point where – of course, since I'm a nerd, I have my own Kubernetes cluster for hobby projects and such – but I'm having a hard time optimising it because I'm using one node.

And every once in a while, I use two nodes, and I was like, that sounds efficient. But then I realised, oh, the second node is just not spinning down because there's some quirk in that pod that's stuck there… Urgh!

So, I'm manually scaling my cluster down to one node. Which is just goofy. That's not how you're supposed to use Kubernetes. So with something like Venue, for instance, we try to do the whole manage things and spin up resources for you, and I remember one of my early pitches was that, okay… How about we just host this giant cluster to optimise cost, and each client gets their own namespaces or set of namespaces?

And then you're going to say – well, but then, are you going to be on the hook, like, what if this thing over here doesn't work, or one customer ends up impacting a different customer?

And you get into all these knotty issues and the whole, like: well, yeah, this is cost-efficient, but is it going to be problematic at that people and process level?

Chalky:

Yeah, well, what did you compromise for that efficiency, right?

Rasmus Praestholm:

Yeah. Yeah. Right now, we are still growing, so I don't think we really have the good background set of infrastructure yet to really optimise it. It's still, you know, not too much to worry about, but one day it will be, and hopefully, that day we will be able to either do Mega clusters with lots of customers or, otherwise, just be able to efficiently use Kubernetes.

Not just, well, it's there, we made it convenient. Good luck!

Matt Saunders:

I think it plays into the fact that Kubernetes is designed obviously to scale and to run workloads of a certain type, doesn't it? The things that work well in Kubernetes are stateless, the things that start up quickly and shut down quickly and don't linger. But I think a lot of organisations aren't necessarily in a position to run workloads that are ideal for Kubernetes all day.

And you get this weird thing where people are – Rasmus, you talked about business priorities earlier on, and I always stop a little bit when I hear like, "Oh, well, we need to make sure that Kubernetes fits into our business priorities or our tech stack fits into our business priorities and the way we want to do things" – And I wonder if, in a lot of cases, it still doesn't quite hit there.

Things like, yeah, the scaling. So I'm remembering… So, Adaptavist is a big Atlassian partner; we've got some clusters that run development and test instances of Jira, Confluence and Bitbucket – all of the self-hosted products. And we've got that running on Kubernetes, and it's the dream! We scale up the cluster when we need more, and then we scale down… Then we, oh, hang on a minute.

You've got these big monoliths running inside the cluster, and if you want to rebalance your cluster by removing nodes, you have to bump some of these instances to other nodes, but they take 5 minutes to start up.

And so, yeah, we've kind of overprovisioned that and can't really shut down nodes as fast as we want to.

Chalky:

Also, you're bundling services on that, right? Because you're trying to roll up Postgres as well. You don't wanna, you know, because they're in testing, you don't want to be provisioning RDS clusters for every single instance of those products being run. So, you know, you're running a sidecar. Which means you're going to want, you know, a persistent volume for it.

Well, you will for Jira anyway, right for the artefacts to get uploaded. And then there's all the other persistence shenanigans you must also go through when dealing with that. And again, that comes back to cost again as well. For example, how much do you want to run? How much do you want to overprovision? You know. How easily do you want to be able to move things from one place to another? Generally, that comes with a cost because you're locking yourself into an architecture that might not necessarily adjust for different use cases.

Actually, Kubernetes is a good example of that because, generally, you're building a cluster to provide a platform that has some opinions about what you want to run on it as well, like… You tend to talk about mega clusters; for example, you're going to have an absurd amount of controllers and operators and such to do everything that the business wants to do. But then you'll find that it's going to be really hard to move away from it now because everyone needs every single feature that cluster's provided, rather than giving someone a cluster that meets the needs of what they're trying to deliver. And I think there's a balancing act there because you don't want to be on either end of that spectrum. What you want's probably somewhere more in the middle.

Rasmus Praestholm:

I almost wonder if one could throw in envy as a factor in here.

It's normally hype. But when you think about where Kubernetes came from, coming out of Borg that ran Google, which is like, Google scale stuff. Then, Google releases it as open source, like, here you go. You can be Google, too, now!

And these companies are like, ooh, we can be Google. Yes, let us go for that! And then they go into the whole, yeah, we're going to scale to Google scale and do all these awesome things, because now we're using the Google tool – without realising that, well, wait a minute.

We're not shaped like Google.

And that's where another topic we've talked about repeatedly, digital transformation, really comes in for the whole like, if you have a bunch of monoliths, this might not work so well for you yet until you can refactor and improve it.

And if you have a lot of stuff that's tied to specific hardware and fun things. I've had issues in the past where teams had trouble getting on Kubernetes because they used legacy software that was licensed to a specific CPU core, on like old, you know, bare metal stuff, and they didn't know how they could translate it into Kubernetes.

Chalky:

And that one still exists, by the way. It's not like gone…

Rasmus Praestholm:

And on a similar note there, Chalky, you mentioned RDS instead of Postgres, which is like, I love kind of pure Kubernetes, and this is where my purist open source nerd comes out because – it's easy out-of-the-box. But you can so complicate it.

And that's where you get into the whole. Well, yeah, we need all these controllers. We need all this Terraform to spin up, like load balancers and RDS.

And then you're locked in, like. Now you're stuck on AWS.

Even though, if it was like pure Kubernetes, you could, in theory, just move some… and spin it up over here…

Chalky:

But that's not completely true because the lock-in is dependent on the abstraction you provide. Right? So say, for example, you have a CRD, and that is just that I want a Postgres cluster right with some arguments.

You could abstract it in such a way that the developer doesn't know whether it's RDS or another IaaS's flavour of Postgres or MySQL – like it's not… But that's an abstraction. If you're going to give them Crossplane or Terraform, right… then you've locked them in, right?

But you can abstract that away and–
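
As a rough illustration of the abstraction Chalky describes, here is a minimal Python sketch, not Adaptavist's actual setup. The developer fills in a cloud-neutral claim, and the platform team decides which backend renders it. The names `PostgresClaim`, `render_rds`, `render_in_cluster` and every output field are hypothetical and do not correspond to any real Crossplane or AWS schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PostgresClaim:
    """What the developer asks for: no cloud-specific fields at all."""
    name: str
    storage_gb: int
    version: str = "15"


def render_rds(claim: PostgresClaim) -> dict:
    # Hypothetical shape of a managed-database request.
    return {
        "backend": "rds",
        "identifier": claim.name,
        "allocated_storage": claim.storage_gb,
        "engine_version": claim.version,
    }


def render_in_cluster(claim: PostgresClaim) -> dict:
    # Hypothetical shape of a Postgres instance run inside the cluster.
    return {
        "backend": "in-cluster",
        "workload": f"{claim.name}-postgres",
        "volume_size": f"{claim.storage_gb}Gi",
        "image": f"postgres:{claim.version}",
    }


# The platform team owns this mapping; developers never see it.
BACKENDS: Dict[str, Callable[[PostgresClaim], dict]] = {
    "rds": render_rds,
    "in-cluster": render_in_cluster,
}


def provision(claim: PostgresClaim, backend: str) -> dict:
    """Translate a cloud-neutral claim into a backend-specific request."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return BACKENDS[backend](claim)
```

Swapping the backend changes what gets provisioned without changing what the developer wrote, which is the sense in which the lock-in depends on the abstraction.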

Matt Saunders:

Rewind a bit there, Chalky. Sorry, we're talking about CRDs, Crossplane, and all this sort of stuff. Yeah, yeah. And we came. We came back in on this in terms of Rasmus and his pure Kubernetes. So, just as a bit of expansion…

No, you can do this since you invoked it, Chalky. Do you want to explain what Crossplane is?

Chalky:

Yeah, Crossplane is like an Infrastructure as Code controller for Kubernetes. So it gives you – you could just write infrastructure as if you were writing Kubernetes manifests effectively, and it's, I would argue, so big it's a monolith! It's quite big, but it is native to Kubernetes.

Matt Saunders:

When you're talking Crossplane, we're talking about provisioning things like, yeah, and other random AWS slash GCP slash Azure infrastructure from within your Kubernetes cluster, right?

Chalky:

Yeah. And that feeds into the cost stuff as well, right? Because if you're providing CRDs to produce that stuff, that's now another cost. It's not just the compute cost of the cluster, but if you're using RDS, do you have good opinions about how to scale those clusters? Are you running serverless Postgres? Are you scaling it manually? Have you got your own alarms in there? What's your opinion? So those sorts of things still should be considered.

And that's where platform teams like mine exist, right? The idea is that the development team shouldn't have to understand how to do that. They're not. They're not DBAs. They might know how to migrate. Yeah, make a migration, but, like, you know, write good queries, set up indexes properly and stuff.

But they don't want to manage the cluster or the database, right? So you want it as abstracted away as possible, and chances are you do want to use something managed like RDS to offload that concern.

So you know, when you buy into it – again, you're making a procurement decision in advance for the developer. And so actually having a good handle on it is quite important.

And that's holding. And yeah, you don't want to. I think it provides the right abstraction as well. So you're not baked into the cloud platform or actually fully married to it, which is actually quite an important one to do.

However, at the same time, if you're getting value from it… is it? You know, does it really hurt you at the end of the day if you've locked yourself in?

It's a good problem to have, is kind of what I'm saying, I suppose!

Rasmus Praestholm:

And anybody watching this as a video might have caught me on the whole. Aha! I was waiting for the Crossplane reference, and I love that. I love the tool. I love the idea because, again, it is kind of like it, almost like it builds Kubernetes on top of Kubernetes.

But it all looks pure and pretty, works beautifully, and is great.

But another topic we've hit on this podcast in the past was the complexity of Kubernetes. Can you actually simplify it? Or are you just moving it around?

So, at first, when I saw Crossplane, like, yes, this looks awesome. We should do this, and then we did it, and guess what?

We only ever used one cloud – and had no plans to go to a different cloud. Kind of like… But we put all this effort into this cool, awesome thing that lets us spread to multi-cloud.

Do we actually have a use case for going multi-cloud at some point?

And that pulled me back to the ground like, huh!

This is more of a people, process, and architecture issue than a tools thing.

Matt Saunders:

So I've got all this way in and not mentioned Melvin Conway yet. Yeah, it totally is because yes. So you put Crossplane in, and it massively simplifies a whole lot of stuff.

But the flip side is that there's a whole load of complexity hidden away.

And I would say that the arguments for using a tool like that are twofold – number one is yes, in theory, it makes it easier to go multi-cloud. TL;DR, I don't believe in designing for multi-cloud at the start, and I don't think any of us do – but the other point is that it lets you use these native resources that probably your developers are quite keen on.

I mean, we come from a world where developers used to have to ask server admins very nicely to install something, and then all of a sudden, they can spin up their own databases. They don't even have to worry about installing packages or anything like that, let alone operating systems.

And so yeah, we get to this wonderful situation where, in theory, developers can take control and spin up things that are kind of mandated – no, sorry, not mandated, acceptable to an infrastructure team or platform engineering team. And off we go.

And yet, we still bring it back to FinOps and the cost of things. I think it's a bit of a mistake to obsess too much around costs. But we have to acknowledge that one of the reasons we invited Chalky on was to talk a little bit about the cost management, the FinOps, that we're doing here at Adaptavist.

So yeah, I think, personally, it's something that we shouldn't have to worry about too much. But we evidently do because it's big, and there are products out there that help people with FinOps. FinOps itself is a thing – you can go to FinOps meetups, conferences, etc.

And yeah, I'm interested in knowing, you know, Chalky, what sort of stuff you're doing here at Adaptavist just to help us along that way.

Chalky:

Well, I will start with that, and that is maybe we just had DevOps and stuff. I was thinking back to, like, the whole DevSecOps thing. We were saying security is everybody's problem.

And I think you could say the same for FinOps and the financial stuff. As I said, engineers are making procurement decisions: if they want to use RDS, they've committed to that service now.

Going back to that example.

And what I've been doing here at Adaptavist is that we've got quite a large estate on AWS, and we spend quite a lot of money on there, to put it mildly. And actually, we've grown fast and taken the time to look at what we spend.

And when you look at it, there's a term used with FinOps, and it's probably used in many other things. And actually, we were talking about it off the podcast a minute ago, which is like crawl, walk, run.

And in many regards, we're still very much in a crawling state. Our FinOps starts with observability; you want to understand what you're actually spending. You don't necessarily want to react to it straight away. And that's actually like getting people to understand what their cost of ownership is, right?

And this goes back to the simple things that we do like, tag our resources or add labels to our containers and pods and that sort of stuff, and have actually some sort of common matrix of like tags, labels, and such that we can reuse and then just try and make sense of things, really.
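
A shared tag matrix like the one described here is easy to check mechanically. The following is a minimal sketch under assumptions: the required keys in `REQUIRED_TAGS` are hypothetical, not Adaptavist's real tagging scheme, and the resources are plain dictionaries rather than anything fetched from a cloud API.

```python
from typing import Dict, List

# Hypothetical tag keys that every resource is expected to carry.
REQUIRED_TAGS = ("business-unit", "product", "environment")


def missing_tags(resource_tags: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or blank on one resource."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key, "").strip()]


def audit(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map resource id -> missing keys, listing only non-compliant resources."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report
```

Running a check like this regularly is one way to work on data quality before drawing any conclusions from the spend itself.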

And that's what we've been doing at the start of our journey, because actually, we've, as with all businesses, they grow quickly. These things come late, and therefore, we have a bunch of legacy we have to fix. And this is fine, right? Again, these are great problems to have. We've grown. And now we've got to, you know, grow up.

And so, we're really working on the quality of the data and getting other engineering managers like myself, getting the tools in people's hands to make the right decisions.

And what we try not to do is boil the ocean.

We want other teams to start their phase of crawling, walking, and running. So, what we want to do is just get the high-level aggregations in there.

For example, a business unit. In a business like ours, there are a bunch of business units, but we have centralised billing. So, we first want to get the general managers of their business units to care about their costs.

So we want to make the highest level label and then let them aggregate deeper, based on their business needs, because some business units will have different ways of measuring their cost of ownership than others.

And so we don't, we don't impose any opinions on them. We need to go – here's your share of the bill. Would you like our help understanding how you want to drill that down further?
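
The "here's your share of the bill" step can be sketched as a simple aggregation over tagged line items. This is an illustration only: the line-item shape (`cost` plus a `tags` dictionary) and the tag keys are assumptions, not the format of any real billing export.

```python
from collections import defaultdict
from typing import Dict, Iterable


def share_of_bill(line_items: Iterable[dict], key: str = "business-unit") -> Dict[str, float]:
    """Sum cost per value of one tag key; untagged spend goes to 'unallocated'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(key, "unallocated")
        totals[owner] += item["cost"]
    return dict(totals)
```

Calling it with `key="business-unit"` gives the top-level split, and a business unit that wants to drill down can call it again on its own items with a different key, such as a hypothetical `product` tag.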

And so again, like, we're very much crawling ourselves. It's mainly a data quality thing. But there are things where we may be walking or running.

And so over the past few years, we've been really good at looking at what our spend is on certain services. So, rather than looking at a business unit level internally, we're just looking at the service-level usage on, say, AWS, and just buying savings plans. You know, we know we're spending that much. Why not just get a savings plan?

Let's make some savings. We haven't had to make any architectural changes. We haven't had to fix anything forward. We know we're committing to that spend. So, just commit to it. Right?

Matt Saunders:

So the savings plans are basically when you just say, well, we've already spent a lot of money on this last year; therefore, we expect to spend about the same amount next year. So…

Chalky:

Yeah, I mean, you could look at the service and see how it's trended over the year and go, actually, based on that, we can make a fair guess that we're going to use this much, and then you can save quite a lot of money on making that commitment. And if you're willing to put money down as well, you can save even more. Because, like I said, if you know you're spending that, and there's money in the bank, you could actually go, do you know what, we'll commit to it and prepay as well. There are further savings to be made, and they scale quite well before you even start looking at improving the data.

For example, you've already benefited from the tools and don't have to use any third-party products for this.

Just go on the cost explorer, do it based on service, and, you know, buy a savings plan that meets that!
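
To make the reasoning concrete, here is a deliberately simplified sketch of sizing a commitment from past spend. It assumes a single flat `discount` on committed usage and treats the commitment as an hourly amount of on-demand-equivalent spend; real savings plan pricing is more involved, and the numbers and function names here are hypothetical.

```python
from typing import Sequence


def baseline_commitment(hourly_spend: Sequence[float], quantile: float = 0.1) -> float:
    """Pick a low quantile of past hourly spend as a cautious commitment level."""
    if not hourly_spend:
        raise ValueError("need at least one hour of history")
    ordered = sorted(hourly_spend)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


def estimated_cost(hourly_spend: Sequence[float], commitment: float, discount: float) -> float:
    """Cost if `commitment` per hour is billed at a discount whether it is used
    or not, and anything above the commitment is billed at the on-demand rate."""
    total = 0.0
    for spend in hourly_spend:
        committed_cost = commitment * (1 - discount)  # paid every hour regardless
        overflow = max(0.0, spend - commitment)       # remainder stays on-demand
        total += committed_cost + overflow
    return total
```

Comparing `estimated_cost(history, commitment, discount)` against `sum(history)` gives a rough idea of the saving without any architectural change, which is the point being made above.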

Rasmus Praestholm:

I can also speak to having been on the kind of the receiving end of Chalky's useful work here. Because on the Venue team, we have this cost explorer thingy now, sending reports into a Slack channel.

And that is cool, both because it proves there's data there, and I can't imagine, well, I can't explain how much I sometimes worry about how 2, 3, 4 AWS accounts ago, there was like a goofy resource somewhere that I didn't know how to delete, that's still billing hundreds of dollars after years of sitting there.

But then we went to this thing where suddenly there's a message in the channel, saying that, oh, Rasmus's individual account went up by $400 this month, like…

Wait, what? Huh? But I'm not doing anything!

Okay, that's a good reminder to clean up that stuff right now.

If only that were easier to do on AWS. That's a whole different podcast…
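
The kind of channel message Rasmus describes can be sketched as a month-over-month comparison per account. This is an illustration only: the account names, the `threshold` default and the message wording are hypothetical, and actually posting to Slack is left out.

```python
from typing import Dict, List, Tuple


def spend_changes(
    previous: Dict[str, float], current: Dict[str, float], threshold: float = 100.0
) -> List[Tuple[str, float]]:
    """Return (account, delta) pairs whose spend moved by at least `threshold`,
    largest movement first. Accounts missing from a month count as zero."""
    changes = []
    for account in set(previous) | set(current):
        delta = current.get(account, 0.0) - previous.get(account, 0.0)
        if abs(delta) >= threshold:
            changes.append((account, delta))
    return sorted(changes, key=lambda pair: abs(pair[1]), reverse=True)


def format_message(account: str, delta: float) -> str:
    """Render one change as a short, neutral chat message."""
    direction = "up" if delta > 0 else "down"
    return f"{account} went {direction} by ${abs(delta):,.2f} this month"
```

Because the message reports a change rather than a ranking, it prompts "has something changed?" instead of "who is spending the most?".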

Matt Saunders:

I think that's a good way of doing things, isn't it? Because you go into this FinOps, here's our cost explorer, our CloudZero or whatever tool we're using – and here's a big list of who's been spending the most. And that's like, if you're at the top of that list, you will inevitably feel the pressure. It's like, we've got to cut down on spending. Maybe it's not the right thing to do.

But it seems like what we've got with this tool is something that says, actually, this happened this month, which didn't happen the month before.

Is that right? Has something changed? Has AWS started charging for something that they weren't charging for before? And we need to adjust our behaviour. And I think that probably puts more power into the hands of the people running these services rather than just being relentless, like "don't spend so much!" type of pressure. Right?

Chalky:

Yeah, let's not make it a stick, right? So, I think putting it in people's hands might actually put that pressure on. I would feel that pressure if I was given this new tool – oh, by the way, this has all been allocated to me. Great! But what you want to do is go: now that I know that, can I work out, can I prove, the value it's delivering to the rest of the business or to customers, even, for that matter?

And can I segment it further to understand what the products or services that are delivering it cost? As I said, we're only doing it based on BU right now, for example.

But if we go down to products, yeah, and if we do it based on products and things like that, then you'll probably find that, yeah, some things cost more than the value they derive, but it'll give you a better idea of how much value you create, because I think it's very hard to – well, I find it very hard to do that. When people say, well, what do you offer the business in these aspects? And then, if you go and actually explore the costs, you'll realise quite quickly that, actually, you deliver a lot of value. You've just not quantified it very well.

And so, the tool helps you figure that out.

Of course, it could work the other way, and you go, damn. I'm burning a lot of cash here; I mean, that's probably still true.

Laura Larramore:

That could be useful, though, like there are some people who are like, I want to be better all the time. So if they see, "Oh, I'm burning a lot of cash", they may feel like, "Oh, I would like to do that a little better", you know, so I think that it could be useful like as a social construct to create a new way of thinking about what you're spending and where.

Rasmus Praestholm:

I love that it's an automated notification that shows up on Slack. That is a perfect example of how you can have a tool support a people thing, or a people and process thing.

I know this from past years of how I worked in an open source project online, where you are interacting with volunteer contributors, and so on, and where, if you, as a maintainer, are kind of like nagging on a ticket because somebody's not putting in all the information, or they haven't been giving updates, and so on, you come off as a nag, and it's like, that's negative. You know, it de-motivates people.

If it's a robot that says, hey, this ticket hasn't been updated for X weeks.

Are you up for updating it soon? If not, at some point later, I might close it because I'm a robot. I'm not a human. I'm not doing it because I'm mad at you. I'm a robot.
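
A robot like the one Rasmus describes needs very little logic. The sketch below is a minimal, hypothetical version: tickets are plain dictionaries with an `id` and an `updated` timestamp, the four-week default is an arbitrary choice, and nothing here talks to a real issue tracker.

```python
from datetime import datetime, timedelta
from typing import List


def stale_tickets(tickets: List[dict], now: datetime,
                  max_age: timedelta = timedelta(weeks=4)) -> List[dict]:
    """Return the tickets whose last update is older than `max_age`."""
    return [ticket for ticket in tickets if now - ticket["updated"] > max_age]


def nudge(ticket: dict, now: datetime) -> str:
    """Draft a neutral reminder; the wording follows the example in the episode."""
    weeks = (now - ticket["updated"]).days // 7
    return (
        f"{ticket['id']}: this ticket hasn't been updated for {weeks} weeks. "
        "Are you up for updating it soon? If not, I might close it later."
    )
```

The same pattern applies to costs: the bot states an observation and a next step, and nobody has to play the nag.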

So when you do that, but with costs – that's really… I get a similar feeling that if somebody called me out as a person – Hey, Rasmus, you're spending too much money!

I might be like, oh, wait a minute. What is this? Has this person got it out for me or something like that?

But if it's a robot, okay. Well, the robot said it, it must be true!

Laura Larramore:

And with something of that personal nature, you could have a tendency to be like, let me cut this down to as small as I possibly can, and then you wind up not being as innovative because you're not looking at how much value am I getting out of this? You're just trying to get that number smaller so you'll get the monkey off your back.

So yeah, the automated robot thing is a very cool feature.

Chalky:

We implemented another feature, by the way, and that's like, say, if you're an engineering team, you're changing your architecture, you change some code, and you might not have envisioned it having a cost impact.

Having those sorts of bots or anomaly detections going on means that it's in your hands as soon as the anomaly's been detected, and then you can fix it forward. Because you may not deliberately be trying to spend more money, but there was an inadvertent consequence of a change you've made.

But now you can fix it forward. You've learned its cause and effect. You can do the triage, and again, it goes back to it; it's an observability tool as well, just like if you had metrics for scaling or alerts for errors, you'd do the same thing, for your costs have changed outside of expected boundaries, so it's worth a look.
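
"Outside of expected boundaries" can be approximated in a few lines. This sketch uses a simple mean and standard deviation band over recent daily spend; it is a stand-in technique chosen for illustration, not a description of how CloudZero or any other product detects anomalies. The parameters `k` and `min_band` are hypothetical defaults.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], today: float,
                 k: float = 3.0, min_band: float = 1.0) -> bool:
    """Flag `today` if it falls more than k standard deviations from the mean
    of `history`. `min_band` avoids flagging tiny wobbles on a flat history."""
    if len(history) < 2:
        raise ValueError("need at least two past data points")
    centre = mean(history)
    band = max(k * stdev(history), min_band)
    return abs(today - centre) > band
```

Fed with per-team daily spend, a check like this is what puts the anomaly in the team's hands soon after the change that caused it.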

Rasmus Praestholm:

This might also be a good pitch for more ChatOps again.

Because I've also loved this, although I see it implemented so rarely. Because if that robot can tell you, hey, by the way, you're spending this much money, and you have 3 test environments that are still running. And you go, why are they still running? Like, hey, robot, turn off these environments, enter, done.

You're done.

Chalky:

Delete those CRDs. And then production goes down!

Rasmus Praestholm:

I'm sure!

Whoops.

Matt Saunders:

Yeah, I think this plays into something we probably don't really have time to go into in much detail, but we're seeing monitoring and observability products actually, sensibly, using AI for anomaly detection.

And it's exactly the same principle. It's like, you know, your CPU is constant at about 5 or 10%, and then suddenly goes "bzzt!" – anomaly.

The spending is constantly like a thousand dollars in this account, and it suddenly goes to 5,000. "Bzzt!" – anomaly.

And yeah, I think so. I'm always wary of robots sending messages to me. I think I find them a lot more ignorable.

But there seems to be increasing amounts of science based on observability metrics, based on real dollar spend as well, just making it a lot more valid. So yeah, maybe I'll stop ignoring those machines and fall in love with them like you have.

Chalky:

That's where they add real value, but the value is really subtle. I think some people expect this massive paradigm shift in how you work, but I don't think it's – I don't expect that. I expect things to be subtle. I expect little things to enable me to be a bit faster, but I don't expect it to remove me.

Obviously, the whole doom and gloom discussions that we get are around us being replaced, and to some extent, that may be true. Or, in the future, it'll be more true. But right now, they enable us to make better decisions faster. And that's how we should be making the most of them.

And anomaly detection has been around for ages, right? That's not a unique thing that's come out of more recent LLMs; we can do much more interesting things with them now.

Matt Saunders:

Yeah, yeah, precisely. It's an evolution, isn't it?

Chalky:

So we've spoken about what we've been doing, but I've failed to mention the product we're using and actually how we are using it if that's of interest.

Matt Saunders:

Yeah, go for it.

Chalky:

Yeah. So we've been using CloudZero – we piloted it last year, and we've not gone completely mad with it just yet. Again, we want to do the crawl-walk-run thing rather than steaming ahead. We want to understand what value we want to get from it.

And the first thing we're working on right now is cost of ownership, and it's got quite a useful… I don't know, can you call it cost as code, I suppose, or finance as code? They've got a thing called CostFormation, and it looks very inspired by CloudFormation. Very, very inspired. In fact, our custom configuration file is a thousand lines.

It's really hard to read.

And that's probably one of my few complaints about CloudZero, by the way; the rest of my feedback is quite, quite good.

And that's something that they've taken aboard, and they've acknowledged. And really, we've just written our own tooling to work around it because a monolithic YAML file is hard to work with.

But what it allows us to do is create custom dimensions, which are things we can use to aggregate our costs. And that's really the powerful tool that there is. For example, business unit – we can derive business unit from a number of sources such as Docker labels, AWS tags, AWS account names or IDs. What we can do is bundle those into that dimension.

And that gives us a nice way of grouping things together. Unfortunately, as I said, this is the real world; businesses grow quickly, and nothing is necessarily perfect. There's always going to be tech debt. And because we've got this configuration code method of managing costs, we can deal with the edge cases and make sure they get rolled up into those things.
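
The idea of a custom dimension derived from several sources, with edge cases rolled up explicitly, can be sketched as a chain of fallbacks. This is plain Python, not CostFormation syntax, and the field names (`aws_tags`, `docker_labels`, `account_name`), the `business-unit` key, the override table, and the "Platform" unit are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fallback table for accounts that predate any tagging
# convention: the "tech debt" edge cases handled in configuration.
ACCOUNT_NAME_TO_BU = {"legacy-venue-sandbox": "Venue"}

def business_unit_for(resource):
    """Resolve the business unit from the first source that provides one."""
    tags = resource.get("aws_tags", {})
    labels = resource.get("docker_labels", {})
    if "business-unit" in tags:
        return tags["business-unit"]
    if "business-unit" in labels:
        return labels["business-unit"]
    # Fall back to the account name, then to an explicit catch-all bucket.
    return ACCOUNT_NAME_TO_BU.get(resource.get("account_name"), "Unallocated")

def cost_by_business_unit(resources):
    """Aggregate resource costs under the resolved dimension value."""
    totals = defaultdict(float)
    for resource in resources:
        totals[business_unit_for(resource)] += resource["cost"]
    return dict(totals)

resources = [
    {"cost": 100.0, "aws_tags": {"business-unit": "Venue"}},
    {"cost": 50.0, "docker_labels": {"business-unit": "Platform"}},
    {"cost": 25.0, "account_name": "legacy-venue-sandbox"},
    {"cost": 10.0},
]
print(cost_by_business_unit(resources))
```

Keeping an explicit "Unallocated" bucket makes the remaining gaps visible instead of silently dropping their cost.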

One of the earliest things we actually did was work around individual accounts. So Rasmus referred to his, and seeing the insights of his costs. What we were doing there is, we were actually using our HR data, pulling out people who work inside business units and rolling them into the business unit.

So Rasmus's account falls into the Venue business unit, for example. So when the Venue costs go up, Rasmus's account ID might appear as the anomaly in this example.

But that does allow those general managers and engineering managers to see what the usage is on the individual accounts as well, because for a long time we were very blind to what people were doing on those accounts, and effectively, you're giving someone their own account to spend Adaptavist money.

And you can spend a lot by accident, right? You can run a Step Function, deploy an RDS cluster, or have an EC2 instance with a GPU attached for unknown use cases, and you could spend a lot of money very quickly. And these things can highlight it really, really well.
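
The HR-based roll-up described above, where an individual account surfaces as the source of a business unit's increase, can be sketched as follows. The record fields, employee names, account IDs, and figures are invented for illustration and do not reflect Adaptavist's actual data or CloudZero's implementation.

```python
def accounts_by_business_unit(hr_records):
    """Map each individual account ID to its owner's business unit."""
    return {r["account_id"]: r["business_unit"] for r in hr_records}

def biggest_increase(previous, current, account_to_bu, business_unit):
    """Return (account_id, increase) for the account in the given
    business unit whose cost rose the most, or None if it has no accounts."""
    deltas = {
        account: current.get(account, 0.0) - previous.get(account, 0.0)
        for account, bu in account_to_bu.items()
        if bu == business_unit
    }
    if not deltas:
        return None
    account = max(deltas, key=deltas.get)
    return account, deltas[account]

# Hypothetical HR export and two periods of per-account spend.
hr_records = [
    {"employee": "engineer-a", "business_unit": "Venue", "account_id": "111111111111"},
    {"employee": "engineer-b", "business_unit": "Venue", "account_id": "222222222222"},
    {"employee": "engineer-c", "business_unit": "Platform", "account_id": "333333333333"},
]
previous = {"111111111111": 40.0, "222222222222": 35.0, "333333333333": 20.0}
current = {"111111111111": 940.0, "222222222222": 36.0, "333333333333": 900.0}

mapping = accounts_by_business_unit(hr_records)
print(biggest_increase(previous, current, mapping, "Venue"))
```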

Matt Saunders:

I think it's quite interesting how we're now trying to put more substance around the thing where we're like: there you go, Rasmus, there you go, Matt, these are individual accounts to do… stuff in.

Because, depending on which side of the business you're on, that could be either the best thing in the world or the most horrific thing in the world.

Yeah. You know, one hat on, I'm like, oh, my God, it could just spend, like, more than the gross domestic product of some African country very, very easily – and on the other side, it's like, well, actually.

And this is the side I more passionately believe in.

That's how the magic happens. It's like giving clever people access to tools to try and work out what they can create with them.

And I'm really interested to see how this plays out where we're like, well, actually, yeah, Rasmus, we know you just created this extra cluster here.

And the temptation to go, "Well, did you need to do that? Did you need to spend that extra $1,000?" is really high.

And I think we've absolutely got to battle against letting the high definition of these numbers define us.

Rasmus Praestholm:

Yep, and in all things, it's a balancing act, and that's why I love to get more information. I love to get an easy way of acting on it, like turning off things through ChatOps.

But yeah, we're still human beings. We still make mistakes. And just for the record. I have never left an unsecured Jenkins for testing, forgotten about it, and had a crypto-mining robot find it and start mining Bitcoin…

Chalky:

It should not be used as a stick. It should be used as a carrot to help people understand the costs that they're making. I don't think we'd ever have that intention. The idea is to get better ideas because, say, you're a business unit and you're reporting on your margins, for example.

But actually, your report on your margins is wrong because someone in your BU was actually using their individual account to produce something. And actually, your profit and loss is not as accurate as you might think it was, and so, actually, that still needs to be visible.

But still, it's not there to, you know, haul someone over the coals. But it's still a literally important thing, like when we're budgeting and reporting on our costs and our usage. And the other thing is that this is just talking about the cost of ownership.

And, like, we've not even moved into the subject of unit economics, where, again, we're still very much crawling. But we run a bunch of services where there are APIs with a bunch of, like, tenants, and a tenant could be another team or a business unit. There are quite a few.

What we currently have at the moment is just our cost of ownership of the service we run and the APIs we have. If we wanted to allocate a portion of that cost to another team or business unit based on their usage of that service, we have yet to do that.

But CloudZero, which we're using, does enable that for us. So we can do, for example, what's called telemetry-based cost weighting.

So the idea is that we've got a dimension with this much cost, and we've got this number of tenants and based on their usage of our application, we could emit metrics that affect the weights of how that cost is distributed between our tenants, and in that way, you can actually figure out what your cost, your unit cost is.
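
The weighting described above comes down to splitting a shared cost in proportion to each tenant's emitted usage. A minimal sketch, assuming request counts as the usage metric; the tenant names and numbers are invented, and this is the general technique rather than CloudZero's API.

```python
def allocate_shared_cost(total_cost, usage_by_tenant):
    """Split total_cost between tenants in proportion to their usage."""
    if not usage_by_tenant:
        return {}
    total_usage = sum(usage_by_tenant.values())
    if total_usage == 0:
        # Nobody used the service: fall back to an even split (an assumption).
        share = total_cost / len(usage_by_tenant)
        return {tenant: share for tenant in usage_by_tenant}
    return {
        tenant: total_cost * usage / total_usage
        for tenant, usage in usage_by_tenant.items()
    }

def unit_cost(total_cost, usage_by_tenant):
    """Cost per unit of usage (for example, per request), or None if unused."""
    total_usage = sum(usage_by_tenant.values())
    return total_cost / total_usage if total_usage else None

usage = {"team-a": 600, "team-b": 300, "team-c": 100}  # hypothetical request counts
print(allocate_shared_cost(1200, usage))
# {'team-a': 720.0, 'team-b': 360.0, 'team-c': 120.0}
print(unit_cost(1200, usage))  # 1.2
```

Worked through by hand: total usage is 1,000 requests, so team-a carries 600/1000 of the 1,200 cost, which is 720, and each request costs 1.2.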

Matt Saunders:

So this kind of gets you closer to understanding that big amorphous blob of, like, shared costs. You're providing a service for multiple people, and you can't really tell who's using exactly which bit of it, right?

Chalky:

Yeah. And actually, also knowing what value you're giving people, 'cause if you're costing this much, but it's actively being used by a bunch of other business units and you can see the usage of it, then you obviously are providing some degree of value, let alone understanding the unit economics of it.

Laura Larramore:

Fascinating.

So, as promised. We did indeed talk about money!

Well, that's it today for DevOps Decrypted.

Connect with us on social at Adaptavist, and let us know what you think of the show. For myself and Matt and Rasmus and Chalky – we're signing off.

Like what you hear?

Why not leave us a review on your podcast platform of choice? Let us know how we're doing or highlight topics you would like us to discuss in our upcoming episodes.

We truly love to hear your feedback, and as a thank you for your ongoing support, we will be giving out some free Adaptavist swag bags!