Podcast: Break Things on Purpose | John Martinez, Director of Cloud R&D at Palo Alto Networks
In this episode Jason is joined by John Martinez, Director of Cloud R&D at Palo Alto Networks, to talk about the FinOps Foundation and the vast range of optimization opportunities to reduce spend in the cloud. John comes in with some extremely useful insights into how FinOps is laid out and its use of a “crawl, walk, run” approach. John and Jason discuss multi-cloud and go into the specifics of the costs associated with it, as well as the security changes that come with it. Curious what the future of multi-cloud might look like? Tune in to this episode for those details and more!
In this episode, we cover:
- 00:00:00 - Introduction
- 00:03:15 - FinOps Foundation and Multicloud
- 00:07:00 - Costs
- 00:10:40 - John’s History in Reliability Engineering
- 00:16:30 - The Actual Cost of Outages, Security, Etc.
- 00:21:30 - What John Measures
- 00:28:00 - What John is Up To/Latinx in Tech
- Palo Alto Networks: https://www.paloaltonetworks.com/
- FinOps Foundation: https://www.finops.org
- Techqueria.org: https://techqueria.org
- LinkedIn: https://www.linkedin.com/in/johnmartinez/
John: I would say a tip for better monitoring, uh, would be to, uh turn it on. [laugh]. [unintelligible 00:00:07] sounds, right?
Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode we chat with John Martinez, Director of Cloud R&D at Palo Alto Networks. John’s had a long career in tech, and we discuss his new focus on FinOps and how it has been influenced by his past work in security and chaos engineering.
Jason: So, John, welcome to the show. Tell us a little bit about yourself. Who are you? Where do you work? What do you do?
John: Yeah. So, John Martinez. I am a director over at Palo Alto Networks. I have been in the cloud security space for the better part of, I would say, seven, eight years or so. And currently, I am in transition in my role at Palo Alto Networks.
So, I’m heading headstrong into the FinOps world. So, turning back into the ops world to a certain degree and looking at what can we do, two things: better manage our cloud spend and gain a lot more optimization out of our usage in the cloud. So, very excited about new role.
Jason: That’s an interesting new role. I’d imagine that at Palo Alto Networks, you’ve got quite a bit of infrastructure and that’s probably a massive bill.
John: It can be. It can be. Yeah, [laugh] absolutely. We definitely have a large amount of scale, in multi-cloud, too, so that’s the added bonus to it all. FinOps is kind of a new thing for me, so as I dig back into the operations world, I’m very happy to discover that the FinOps Foundation exists, and that there are a lot of prescribed ways of looking at both FinOps and optimization (specifically in the cloud, obviously), as well as a whole framework that I can go adopt.
So, it’s not like I’m inventing the wheel, although having been in the cloud for a long time, and I haven’t talked about that part of it but a lot of times, it feels like—in my early days anyway—felt like I was inventing new wheels all the time. As being an engineer, the part that I am very excited about is looking at the optimization opportunities of it. Of course, the goal, from a finance perspective, is to either reduce our spend where we can, but also to take a look at where we’re investing in the cloud, and if it takes more of a shift as opposed to a straight-up just cut the bill kind of thing, it’s really all about making sure that we’re investing in the right places and optimizing in the right places when it comes down to it.
Jason: I think one of the interesting perspectives of adopting multi-cloud is that idea of FinOps: let’s save money. And the idea, if I wanted to run a serverless function, I could take a look at AWS Lambda, I could take a look at Azure Functions to say, “Which one’s going to be cheaper for this particular use case,” and then go with that.
John: I really liked how the FinOps Foundation has laid out the approach to the lifecycle of FinOps. So, they basically go from the crawl, walk, run approach which, in a lot of our world, is kind of like that. It’s very much about setting yourself up for success. Don’t expect to be cutting your bill by hundreds of thousands of dollars at the beginning. It’s really all about discovering not just how much we’re spending, but where we’re spending it.
I would categorize the pitting the cloud providers against each other to be more on the run side of things, and that eventually helps, especially in the enterprise space; it helps enterprises to approach the cloud providers with more of a data-driven negotiation, I would say [laugh] to your enterprise spend.
Jason: I think that’s an excellent point about the idea of that is very much a run. And I don’t know any companies within my sphere and folks that I know in the engineering space that are doing that because of that price competition. I think everybody gets into the idea of multi-cloud because of this idea of reliability, and—
Jason: One of my clouds may fail. Like, what if Amazon goes down? I’d still need to survive that.
John: That’s the promise, right? At least that’s the promise that I’ve been operating under for the 11 years or so that I’ve been in the cloud now. And obviously, in the old days, there wasn’t a GCP or an Azure—I think they were in their infancy—there was AWS… and then there was AWS, right? And so I think eventually though you’re right, you’re absolutely right. Can I increase my availability and my reliability by adopting multiple clouds?
As I talk to people, as I see how we’re adopting the multiple clouds, I think realistically though what it comes down to is you adopted cloud, or teams adopt a cloud specifically for, I wouldn’t say some of the foundational services, but mostly about those higher-level niche services that we like. For example, if you know large-scale data warehousing, a lot of people are adopting BigQuery and GCP because of that. If you like general purpose compute and you love the Lambdas, you’re adopting AWS and so on, and so forth. And that’s what I see more than anything is, I really like a cloud’s particular higher level service and we go and we adopt it, we love it, and then we build our infrastructure around it. From a practical perspective, that’s what I see.
I’m still hopeful, though, that there is a future somewhere there where we can commoditize even the cloud providers, maybe [laugh]. And really go from Cloud A to Cloud B to Cloud C, and just adopt it based on pricing I get that’s cheaper, or more performant, or whatever other dimensions that are important to me. But maybe, maybe. We’ll remain hopeful. [laugh].
Jason: Yeah, we’re still very much in that spot where, despite even the basics of if I want a virtual machine, those are still so different between all the clouds. And I mean even last week, I was working on some Terraform and the idea of building it modularly, and in my head thinking, “Well, at some point, we might want to use one of the other clouds so let’s build this module,” and thinking, “Realistically, that’s probably not going to happen.”
John: [laugh]. Right. I would say that there’s the other hidden cost about this and it’s the operational costs. I don’t think we spend a whole lot of time talking about operational costs, necessarily, but what is it going to cost to retrain my DevOps team to move from AWS to GCP, as an example? What are the underlying hidden costs that are there?
What traps am I going to fall into because of that? It seems cool; Terraform does a great job of getting that pain into the multiple clouds from an operations perspective. Kubernetes does a great job as well to take some of that visibility into the underlying—and I hate to use it this way but ‘hardware’ [laugh] virtual hardware—that’s like EC2 or Google Compute, for example. And they do great jobs, but at the end of the day we’re still spending a lot of time figuring out what the foundational services are. So, what are those hidden costs?
Anyway, long story short, as part of my journey into FinOps, I’m looking forward to not just uncovering the basics of FinOps, which is: What are we spending? Where are we spending it? What are the optimization opportunities? I also want to take a look at some of the more hidden types of costs. I’m very interested in that aspect of the FinOps world as well. So, I’m excited.
Jason: Those hidden costs are also interesting because I think, given your background in security—
Jason: —one of the challenges in multi-cloud is, if I’m an expert in AWS and suddenly we’re multi-cloud and I have to support GCP, I don’t necessarily know all of those correct settings and how to necessarily harden and build my systems. I know a model and a general framework, but I might be missing something. Talk to me a bit more about that as a security person.
Jason: What does that look like?
John: Yeah, yeah. It’s very nuanced, for sure. There are definitely some efforts within the industry to help alleviate some of that nuance and some of those hidden settings that I might not think about. For example, CIS Foundations as a community, the foundations of benchmarks that CIS produces can be pretty exhaustive—and there are benchmarks for the major clouds as well—those go a long way to try and describe at least, what are the main things I should look at from a security perspective? But obviously, there are new threats coming along every day.
So, if I was advising security teams, security operations teams specifically, it would be definitely to keep abreast of the latest threats, go take a look at what some of the exploit kits are looking for or doing, and adopt some of those hidden checks into, for example, your security operations center: what you react to, and what the incident responses are going to be to some of those emerging threats. For sure it is a challenge, and it’s a challenge that the industry faces and one that we face every day. And an exploit that might be available for EC2 may be different on Google Compute or may be different on Azure Compute.
Jason: There’s a nice similarity or parallel there to what we often talk about, especially in this podcast, is we talk about chaos engineering and reliability and that idea of let’s look at how things fail and take what we know about one system or one service, and how can we apply that to others? From your experience doing a wide breadth of cloud engineering, tell me a bit more about your experience in the reliability space and keeping—all these great companies that you’ve worked for, keeping their systems up and running.
John: I think I have one of the—fortunate to have one of the best experiences ever. So, I’ll have to dig way back to 11 years ago, or so [laugh]. My first job in the cloud was at Netflix. I was at Netflix right around the time when we were moving applications out of the data center and into AWS. Again, fortunate; large-scale, at the cusp of everything that was happening in the cloud, back in those days.
I had just helped finish—I was a systems engineer; that’s where I transitioned from, systems engineering—and just a little bit of a plug there, tomorrow is Sysadmin Day, so I still am an old school sysadmin at heart so I still celebrate Sysadmin Day. [laugh]. But I was doing that transition from systems engineering into cloud engineering at Netflix, just helped move a database application out from the data center into AWS. We were also adopting in those days, very rapidly, a lot of the new services and features that AWS was rolling out. For example, we don’t really think about it today anymore, but back then EBS-backed instances was the thing. [laugh].
Go forth and every new EC2 instance we create is going to be EBS-backed. Okay, great. March, I believe it was March 2011, one of AWS’s very first, and I believe major, EBS outages occurred. [laugh]. Yeah, lots of, lots of failure all over the place.
And I believe from that a lot of what—at least in Gremlin—a lot of that Chaos Monkey and a lot of that chaos engineering really was born out of a lot of our experiences back then at Netflix, and the early days of the cloud. And have a lot of the scars still on me. But it was a very valuable lesson that I take now every day, having lived through it. I’m sure you guys at Gremlin see a lot of this with your customers and with yourselves, right, is that the best you can do is test those failure scenarios and hope that you are as resilient as possible. Could we have foreseen that there was going to be a major EBS outage in us-east-1? Probably.
I think academically we thought about it, and we were definitely preaching the mantra of architect for failure, but it still bit us because it was a major cascading outage in one entire region in AWS. It started with one AZ and it kept rolling, and it kept rolling. And so I don’t know necessarily in that particular scenario that we could have engineered—especially with the technology of the day—we could have engineered full-on failover to another region, but it definitely taught us and me personally a lot of lessons around how to architect for failure and resiliency in the cloud, for sure.
Jason: I like that point of it’s something that we knew theoretically could maybe happen, but it always seems like the odds of the major catastrophes are so small that we often overlook them and we just think, “Well, it’s going to be so rare that it’ll never happen, so we don’t think about it.” As you’ve moved forward in your career, moving on from Netflix, how has that shaped how you approach reliability—this idea of we didn’t think EBS could ever go down and lead to this—how do you think of catastrophic failures now, and how do you go about testing for them or architecting to withstand them?
John: It’s definitely stayed with me. Every ops job that I’ve had since, it’s something that I definitely take into account in any of those roles that I have. As the opportunity came up to speak with you guys, wanted to think about reliability and chaos in terms of cloud spend, and how can I marry those two worlds together? Obviously, the security aspect of things, for sure, is there. It’s expecting the unexpected and having the right types of security monitoring in place.
And I think that, going back to an earlier comment that I made about these unexpected or hidden costs that are lying dormant in our cloud adoption, I’m thinking about the cost of security incidents and the cost of failure in the same way: what does that look like? These are answers I don’t have yet, but the explorer in me is looking forward to uncovering a lot of what that’s going to be. If we talk a year from now and I have some of that prescribed, thought of, and discovered, I think it’ll be awesome to talk about where we are. It’s an area that I definitely take seriously and have applied not just to operational roles, but also as I got into more customer-facing roles in the last 11 years, advising customers both as a sales engineer and as head of customer success at a cloud security startup that I worked for, Evident.io, and then eventually moving here to Palo Alto Networks. It’s like, how do I best advise and think about failure scenarios, reliability, and chaos engineering when I talk to customers? I owe it all to that time that I spent at Netflix and those experiences very early on, for sure.
Jason: Coming back to those hidden costs is definitely an important thing. Especially I’m sure that as you interact with folks in the FinOps world, there’s always that question of, “Why do I have so much redundancy? Why am I paying for an entire AZ’s worth of infrastructure that I’m never using?” There’s always the comment, “Well, it’s like a spare tire; you pay for an extra tire in case you have a flat.” But on the other hand, there is this notion of how much are we actually spending versus what does an outage really cost me?
John: Right. We thought about that question very early on at another company I worked at after Netflix and before the startup. I was fortunate again to work in another large-scale environment, at Adobe actually, working on the early days of their Creative Cloud implementation. Very different approach to doing the cloud than Netflix in many ways. One of the things that we definitely made a conscious effort to do, and we thought about it in terms of an insurance policy.
So, for example, S3 replication—so replicating our data from one region to another—in those days, an expensive proposition but one that we looked at, and we intentionally went in with, “Well, no, this is our customer data. How much is that customer data worth to us?” And so we definitely made the conscious decision to invest. I don’t call it ‘cost’ at that point; I call that an investment. To invest in the reliability of that data, having that insurance policy there in case something happened.
You know, catastrophic failure in one region, especially for a service as reliable and as resilient as S3 is very minuscule, I would say, and in practice, it has been, but we have to think about it in terms of investing. We definitely made the right types of choices, for sure. It’s an insurance policy. It’s there because we need it to be there because that’s our most precious commodity, our customers’ data.
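One simple way to frame the "insurance policy" reasoning above is to compare the yearly cost of the protection with the probability-weighted cost of the failure it guards against. The sketch below is only an illustration: the function names and every figure in it are hypothetical and do not come from the conversation.

```python
def expected_annual_loss(annual_probability, loss_if_it_happens):
    """Probability-weighted yearly cost of a rare failure."""
    return annual_probability * loss_if_it_happens


def insurance_pays_off(annual_premium, annual_probability, loss_if_it_happens):
    """True when protection costs less than the loss it is expected to prevent."""
    return annual_premium < expected_annual_loss(
        annual_probability, loss_if_it_happens
    )


# Hypothetical figures, chosen only to show the arithmetic.
replication_cost_per_year = 30_000         # cross-region copy of customer data
chance_of_regional_loss = 0.001            # assumed one-in-a-thousand per year
cost_of_losing_customer_data = 50_000_000  # assumed business impact

loss = expected_annual_loss(chance_of_regional_loss, cost_of_losing_customer_data)
print(f"Expected annual loss: ${loss:,.0f}")
print(
    "Replication pays off:",
    insurance_pays_off(
        replication_cost_per_year,
        chance_of_regional_loss,
        cost_of_losing_customer_data,
    ),
)
```

Expected value is not the whole story: when the loss is unrecoverable, as with customer data, a team may reasonably choose to pay for the protection even when this comparison comes out the other way.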
Jason: Excellent point about that being the most precious commodity. We often feel that our data isn’t as valuable as we think it is and that the value for our companies is derived from all of the other things, and the products, and such. But when it comes down to it, it is that data. And it makes me think we’re currently in this sort of world where ransomware has become the biggest headline, especially in the security space, and as I’ve talked with people about reliability, they often ask, “Well, what does Gremlin do security-wise?” And we’re not a security product, but it does bring that up of, if your data systems were locked and you couldn’t get at your customer information, that’s pretty similar to having a catastrophic outage of losing that data store and not having a backup.
John: I’ve thought about this, of course, in the last few weeks, obviously. A very, very public, very telling types of issues with ransomware and the underlying issues of supply chain attacks. What would we do [laugh] if something like that were to happen? Obviously, rhetorically, what would we do? And lots of companies are paying the ransom because they’re being held at gunpoint, you know, “We have your data.”
So yeah, I mean, in a situation like the example I gave before, could just the replication of, for example, my entire S3 bucket where my customer data is have thwarted a situation like that? And then you think about, kind of like, okay, let’s think about this further. If we do it in the same AWS account, as an example, and the attacker obtained my IAM credentials, then it really comes down to the same thing because, “Oh, look, there’s another bucket in that other region over there. I’m going to go and encrypt all of those objects, too. Why not, right?” [laugh].
And so, it also raises the question of design principles and decisions: well, okay, maybe do I ship it to a different account where my security context is different, my identity context is different? And so there’s a lot of areas to explore there. And it’s a very good question and one that we definitely do need to think about in terms of catastrophic failure, because that’s the way to think about it, for sure.
Jason: Yeah. So, many parallels between that security and reliability, and all comes together with that FinOps, and how much are you—how much do we pay for all of this?
John: Between the reliability and the security world, there’s a lot of parallels because your job is about thinking what are the worst-case scenarios? It’s, what could possibly go wrong? And how bad could it be? And in many cases, how bad is it? [laugh].
Especially as you uncover a lot of the bad things that do happen in the real world every day: how bad is it? How do I measure this? And so absolutely there’s a lot of parallels, and I think it’s a very interesting point you make. And so… yeah so, Jason, how can we marry the two worlds of chaos engineering and security together? I think that’s another very exciting topic, for sure.
Jason: That is, absolutely. You mentioned just briefly in that last statement, how do you measure it?
Jason: That comes up to something that we were chatting about earlier is monitoring, and what do you measure, and ensuring that you’re measuring the right things. From your experience building secure systems, talk to me about what are some of the things that you like to measure, that you like to get observability on, that maybe some folks are overlooking.
John: I think the overlooking part is an interesting angle, but I think it’s a little bit more basic than that even. I’ll go to my time in the startup—so at Evident.io—mainly because I was in customer success and my job was to talk to our customers every day—I would say that a bunch of our customers—and they varied based on maturity level, but we were working with a lot of customers that were new in the cloud world, and I would say a lot of customers were still getting tripped up by a lot of the basic types of things. For example—what do I mean by that? Some of the basic settings that were incorrect were things just, like, EC2 security groups allowing port 22 in from the world, just the simple things like that. Or publicly accessible S3 buckets.
So, I would say that a lot of our customers were still missing a lot of those steps. And I would say, in many of the cases, putting my security hat on, the first thing you go to is, well, there’s an external hacker trying to do something bad in your AWS accounts, but really, the majority of the cases were all just mistakes; they were honest. I’m an engineer setting up a dev account and it’s easier for me, instead of figuring out what my egress IP is for my company’s VPN, it’s easier for me just to set port 22 to allow all from the world. A few minutes later, there you go. [laugh]. Exploit taken, right? It’s just the simple stuff; we really as an industry do still get tripped up by the simple things.
I don’t know if this tracks with the reliability world or the chaos engineering world, but I still see that way too much. And that just tells me that even if we are a cloud-mature company or organization, there’s still going to be scenarios where that engineer at two in the morning just decides that it’s just easier to open up the firewall on EC2 than it is to do, quote-unquote, “The right thing.” Then we have an issue. So, I really do think that we can’t let go of not just monitoring the basics, but also getting better as an industry at alerting on the basics when there are misconfigurations, and shortening that time to alert. Especially in the security world, it really is critical to shorten that window between when the configuration setting is made and when the same engineer who made the misconfiguration gets alerted to the fact that it is a misconfiguration. So, I’ll go to that: it’s the basics. [laugh].
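To make the "alert on the basics" idea concrete, here is a minimal sketch of a check for the misconfiguration John mentions: port 22 open to the world. It assumes security groups are available as dictionaries shaped like the EC2 `DescribeSecurityGroups` response (`IpPermissions`, `IpRanges`, `Ipv6Ranges`). The function names, group IDs, and sample data are hypothetical, and nothing here calls AWS.

```python
# CIDR blocks that mean "anyone on the internet".
WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def rule_covers_port(rule, port):
    """Return True if an ingress rule applies to the given TCP port."""
    protocol = rule.get("IpProtocol")
    if protocol == "-1":  # EC2 uses "-1" for "all protocols, all ports"
        return True
    if protocol != "tcp":
        return False
    return rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)


def world_open_rules(group, port=22):
    """List findings where `port` is reachable from any address."""
    findings = []
    for rule in group.get("IpPermissions", []):
        if not rule_covers_port(rule, port):
            continue
        cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
        for cidr in cidrs:
            if cidr in WORLD_CIDRS:
                findings.append(
                    {"GroupId": group["GroupId"], "Port": port, "Cidr": cidr}
                )
    return findings


# Hypothetical sample data: a dev group opened to the world, and a VPN-only group.
dev_group = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
vpn_only_group = {
    "GroupId": "sg-vpn-only",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},
    ],
}

for group in (dev_group, vpn_only_group):
    for finding in world_open_rules(group):
        print(
            f"ALERT: {finding['GroupId']} allows port {finding['Port']} "
            f"from {finding['Cidr']}"
        )
```

Running a check like this as soon as a configuration change is recorded, rather than on a slow schedule, is one way to shorten the window between the misconfiguration and the alert.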
Jason: I like that idea of moving the alert forward, though. Because I think a lot of times you think of alerts as something bad has happened and so we’re waiting for the alert to happen when there’s wrongful access to a system, right? Someone breaks in, or we’re waiting for that alert to happen when a system goes down. And we’re expecting that it’s purely a response mechanism, whereas the idea of let’s alert on misconfigurations, let’s alert on things that could lead to these, or that will likely lead to these wrong outcomes. If we can alert on those, then we can head it off.
John: It’s all the way. And in the security world, we call it shifting left, shifting security all the way to the left, all the way to the developer. Lots of organizations are making a lot of the right moves in that direction for embedding security well into the development pipeline. So, for example, I’ll name two players in the Infrastructure as Code as we call it in the security space. And I’ll name the first one just because they’re part of Palo Alto Networks now, so Bridgecrew; so very strong, open-source solution in that space, as well as over on the HashiCorp side where Sentinel is another example of a great developer-forward shift-left type of tool that can help thwart a lot of the simple security misconfigurations, right from your CI/CD pipelines, as opposed to the reaction time over here on the right, where you’re chasing security misconfigurations.
So, there’s a lot of opportunity to shorten that alert window. And even, in fact, I’ve spent a lot of time in the last couple of years—I and my team have spent a lot of time in the last couple of years thinking about what can the bots do for us, as opposed to waiting for an alert to pop up on a Slack message that says, “Hey, engineer. You’ve got port 22 open to the world. You should maybe think about doing something.” The right thing to do there is for something—could be something as simple as an alert making it to a Lambda function and the Lambda function closing it up for you in the middle of the night when you’re not paying attention to Slack, and the bot telling you, “Hey, engineer. By the way, I closed the port up. That’s why it’s broken this morning for you.” [laugh]. “I broke it intentionally so that we can avoid some security problems.”
So, I think there’s the full gamut where we can definitely do a lot more. And that’s where I believe the new world, especially in the security world, the DevSecOps world, can definitely help embed some of that security mindset with the rest of the cloud and DevOps space. It’s certainly a very important function that needs to proliferate throughout our organizations, for sure.
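The bot John describes, where an alert reaches a Lambda function that closes the port and then tells the engineer, could be sketched roughly as follows. This is an assumption-laden illustration, not his team's implementation: the event shape, the `handler` signature beyond `(event, context)`, and the injected `revoke` and `notify` callables are all hypothetical. In a real deployment `revoke` might wrap the EC2 API's revoke-ingress call and `notify` might post to Slack. Here they are stand-ins so the sketch runs without any cloud access.

```python
WORLD_V4 = "0.0.0.0/0"


def build_revocation(group, port=22):
    """Return the ingress entries that expose `port` to every IPv4 address."""
    to_revoke = []
    for rule in group.get("IpPermissions", []):
        if rule.get("IpProtocol") != "tcp":
            continue
        if not (rule.get("FromPort", -1) <= port <= rule.get("ToPort", -1)):
            continue
        if any(r.get("CidrIp") == WORLD_V4 for r in rule.get("IpRanges", [])):
            # Keep the rule's own port range so the entry matches the existing rule.
            to_revoke.append({
                "IpProtocol": "tcp",
                "FromPort": rule["FromPort"],
                "ToPort": rule["ToPort"],
                "IpRanges": [{"CidrIp": WORLD_V4}],
            })
    return to_revoke


def handler(event, context=None, revoke=None, notify=print):
    """Lambda-style entry point: close the port, then tell the engineer."""
    group = event["security_group"]  # hypothetical event shape
    permissions = build_revocation(group)
    if not permissions:
        return {"remediated": False}
    if revoke is not None:
        revoke(GroupId=group["GroupId"], IpPermissions=permissions)
    notify(
        f"Hey, engineer. I closed port 22 to the world on {group['GroupId']}. "
        "That's why it's broken this morning for you."
    )
    return {"remediated": True, "revoked": permissions}


# Local dry run with fake revoke/notify functions that only record calls.
calls, messages = [], []
event = {
    "security_group": {
        "GroupId": "sg-example",
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": WORLD_V4}]},
        ],
    }
}
result = handler(
    event,
    revoke=lambda **kwargs: calls.append(kwargs),
    notify=messages.append,
)
print(result["remediated"], messages[0])
```

Passing `revoke` and `notify` in as parameters keeps the decision logic testable on its own, which matters for a bot that intentionally breaks access in the middle of the night.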
Jason: And we’re seeing a lot of that in the reliability world as well, as people shift left and developers are starting to become more responsible for the operations and the running of their services and applications, and including being on call. That does bring to mind that idea, though—back to alerting on configurations and really starting to get those alerts earlier, not just saying that, “Hey, devs, you’re on call so now you share a pain,” but actually trying to alleviate that pain even further to the left. Well, we’re coming up close to time here. So, typically at this point, one thing that I like to do is we like to ask folks if they have anything to plug. Oftentimes that’s where people can find you on social media or other things. I know that you’re connected with Ana through Latinx in Tech, I would love to share more about that, too. So.
John: For sure, yeah. So, my job in terms of my leadership role is definitely to promote a lot of diversity, inclusion, and equity, obviously, within the workspace. Personally, I do also feel very strongly that I should be not just preaching it, but also practicing it. So, I discovered in the last year—in fact, it’s going to be about a year since I joined Techqueria—so techqueria.org—and we definitely welcome anybody and everybody.
We’re very inclusive, all the way from if you’re a member of the Latinx community and in technology, definitely join us, and if you’re an ally, we definitely welcome you with open arms, as well, to join techqueria.org. It is a very active and very vibrant community on Slack that we have. And as part of that, I and a couple of people in Techqueria are running a couple of what we call cafecitos which is the Spanish word for coffees, coffee meetings.
So, it’s a social time, and I’m involved in helping lead both the cybersecurity cafecito—we call it Cafecito Cibernético, which happens every other Friday. And it’s security-focused, it’s security-minded, we go everywhere from being very social and just talking about what’s going on with people personally—so we like to celebrate personal wins, especially for those that are joining the job market or just graduating from school, et cetera, and talk about their personal wins, as well as talk about the happenings, like for example, a very popular topic of late has been supply chain attacks and ransomware attacks, so definitely very, very timely there. As well as I’m also involved—being in the cloud security space, I’m bridging, sort of, two worlds between the DevOps world and the security world; more recently, we started up the DevOps Cafecito, which is more focused on the operations side. And that’s where, you know, happy to have Ana there as part of that Cafecito and helping out there. Obviously, there, it’s a lot of the operations-type topics that we talk about; lots of Kubernetes talk, lots of looking at how the SRE and the DevOps jobs look in different places.
And I wouldn’t say I’m surprised by it, but it’s very nice to see that there is also a big difference with how different organizations think about reliability and operations. And it’s varied all over the place and I love it, I love the diversity of it. So anyway, so that’s Techqueria, so very happy to be involved with the organization. I also recently took on the role of being the chapter co-director for the San Francisco chapter, so very happy to be involved. As we come out of the pandemic, hopefully, pretty soon here [laugh] right—as we’re coming out of the pandemic, I’ll say—but looking forward to that in-person connectivity and socializing again in person, so that’s Techqueria.
So, big plug for Techqueria. As well, I would say for those that are looking at the FinOps world, definitely check out the FinOps Foundation. Very valuable in terms of the folks that are there, the team that leads it, and the resources, if you’re looking at getting into FinOps, or at least gaining more control and looking at your spend, not so much like this, but with your eyes wide open. Definitely take a look at a lot of the work that they’ve done for the FinOps community, and the cloud community in general, on how to take a look at your cloud cost management.
Jason: Awesome. Thanks for sharing those. If folks want to follow you on social media, is that something you do?
John: Absolutely. Mostly active on LinkedIn at johnmartinez on LinkedIn, so definitely hit me up on LinkedIn.
Jason: Well, it’s been a pleasure to have you on the show. Thanks for sharing all of your experiences and insight.
John: Likewise, Jason. Glad to be here.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.