Blueprint: Build the Best in Cyber Defense

Preventing Silent Failures with Nir Loya Dahan

Season 5 Episode 11

This episode is sponsored by Fig.

This episode features a conversation with Nir Loya Dahan, Co-Founder and CPO at Fig, recorded at RSAC 2026. Our discussion covers telemetry health and SOC infrastructure resilience: what breaks in a log pipeline, why silent failures are so hard to catch, and how detection teams can build more confidence in their data foundation.

Resources:

Nir's Email: nir@fig.security

Fig Website: https://www.fig.security

Contact, Courses, and More:

For feedback, reviews, guest pitches, or to get in contact with me for any other reason, head to blueprintpodcast.live!

Check out John's SOC Training Courses for SOC Analysts and Leaders:

Follow and Connect with John:  LinkedIn

SPEAKER_03

Every log that hits your SIEM is like a heartbeat. A firewall deny, an authentication event, a process spawning on an endpoint. Altogether, these events form the pulse of your environment, proof that your detection capability is alive. But here's what makes telemetry health difficult. You're not watching one heartbeat; you're potentially watching hundreds of thousands of them simultaneously. All day, every day. And oftentimes the failure you're trying to catch is incredibly subtle. It's a single beat that slightly changed shape or briefly stopped while everything else kept going normally. Your dashboards look fine, your SIEM is running, the vast majority of logs are arriving and parsing without issue. But somewhere in that volume, one source has changed. A formatting change broke a parser, or in the worst case, an attacker deliberately disabled a log source to blind you to their movement. You're not gonna get an alert for that. Rules need content to trigger on. The absence of data won't trigger a detection. All you get is a silent failure. It's a nightmare scenario. Your rules were right, your tuning was solid, the pipeline just silently stopped delivering, and you had no idea. Telemetry health sits underneath everything else in your detection program. All the logic, coverage work, and tuning that you've built depends on logs arriving correctly and landing where your rules can reach them. When that assumption breaks, and breaks silently, there's often no warning. You just start missing things. At RSA this year, I sat down with Nir Loya Dahan from Fig Security to talk about this exact problem. What breaks, why the problems are so hard to spot, and what it actually takes to keep that pulse intact across an enterprise environment. As a quick note, this episode was made possible with sponsorship by Fig Security. So thanks to Nir and the Fig team for helping bring this content to the Blueprint audience. Okay, let's get into it.
Nir, thank you so much for sitting down with me in your busy schedule during RSA to talk about my favorite topic, security operations, and how we can make it better over time. Can we start off with a quick introduction to yourself and the problem you're working on?

SPEAKER_06

Amazing. Thank you for coming over. So, to quickly introduce myself: I'm Nir, co-founder and CPO at Fig. Before Fig, I was the VP of product for Cymulate, which is a breach and attack simulation company. And before that, I worked for a few years in different product roles at Siemplify, the SOAR vendor, before it got acquired by Google. And that's it, I guess.

SPEAKER_03

And so what is the problem that you're working on now at Fig?

SPEAKER_06

Yeah, so the problem that we're working on at Fig is the fact that the security operations infrastructure and tech stack is one of the most complex ones, and for a very good reason, because it holds all the security flows in the entire organization. Yet security engineering teams today don't have control over the efficacy and the resilience of this infrastructure. And the fact is, any change that happens in this infrastructure, be it because of drift that happens upstream or any change that they're pushing into production, is breaking detection and response flows and creating blind spots all the time. And we're coming to fix that. We're finding and fixing silent failures and broken security flows that keep the SOC from detecting and responding to threats.

SPEAKER_03

Excellent. Yeah, that is a real problem that a lot of people are going to run into at some point, whether they realize it or not, right? And one of the themes I've seen just walking around RSA, the clear theme, is the agentic SOC, right? Having things run themselves, you know, working through triage and things like that. But before all of that can work, you have to make sure you're actually getting good data, right?

SPEAKER_06

100%. And you know, from what we're seeing, there are two big motions happening today around SOC modernization. One, new players are coming into this space from the data infrastructure perspective. For the first time in many years, a cheaper, better, faster SOC is within reach, right? And two, obviously, AI, agentic AI, is coming into the SOC, which is an amazing use case, but it also holds a challenge. AI is amazing. AI can be a great tool. But if it runs on a broken data infrastructure, it's only going to break things faster. Yes. So ahead of implementing AI, you need to make sure your data foundation, in our case the security data foundation, is rock solid.

SPEAKER_03

And so what does a rock solid security data foundation look like?

SPEAKER_06

Yeah, so let me maybe start by talking about how the foundation looks today. And you know, talking with organizations, you can see in more and more places that this foundation is split between different vendors, I'd say. When you come to think about it, all the security data in the organization finds its way into the data pipelines, where the data is routed, ingested, potentially transformed, and moved into the data lake or SIEM, or sometimes it's the SIEM's own ingestion tools that are flowing data into it. There the data is stored, potentially altered or modified, to be available for detection purposes. Sometimes parsing and normalizing the data happens before that; it could also happen after the data is saved. You know, some solutions run schema on read, some run schema on write, enforcing a schema ahead of ingestion, but all are configurable, obviously. And two other very important pieces that allow detection engineers to consolidate detection logic and keep it robust are the following. One is information models, very common in Splunk, obviously, in Sentinel, and in other solutions, where you don't need to write different detection rules for each of your tools. You have an information model that allows the detection engineer to create a single logic. And then however the data changes between different types of tools, the information model normalizes for that and allows you to ask those kinds of questions.
And on the other hand, you have all the enrichment pipelines that are also feeding into your detection logic, coming in natively from those solutions, but also fed by the organization, by that detection engineer: your lookups, your reference tables, your threat intelligence flows, your CMDBs, whatever additional data is being used for enrichment, either within the detection itself or within the pipeline. And so the data that becomes available to the detection engineer might be completely different, in structure and in patterns, from how it originally came out of the source. Being aware of all the changes that the data goes through across all the different phases is super crucial for a detection engineer to understand.
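
Illustration (not from the conversation): a minimal Python sketch of the information model idea described above, where one detection is written once against common field names. The source names, field names, and the `FIELD_ALIASES` table are all hypothetical; real mappings live in the SIEM's own information model.

```python
# Hypothetical per-source aliases mapping vendor field names to a common model.
FIELD_ALIASES = {
    "windows_security": {"TargetUserName": "user", "IpAddress": "src_ip"},
    "vpn_gateway": {"login": "user", "client_addr": "src_ip"},
}


def normalize(source: str, event: dict) -> dict:
    """Rename source-specific fields to common names; unknown fields pass through."""
    aliases = FIELD_ALIASES.get(source, {})
    return {aliases.get(key, key): value for key, value in event.items()}


def detect_admin_login(event: dict) -> bool:
    """A single detection written against the common field name 'user'."""
    return event.get("user", "").lower() == "administrator"


win = normalize("windows_security", {"TargetUserName": "Administrator", "IpAddress": "10.0.0.5"})
vpn = normalize("vpn_gateway", {"login": "administrator", "client_addr": "203.0.113.7"})
# An upstream rename ("login" became "username") is not covered by the aliases,
# so the detection quietly stops matching instead of raising an error.
renamed = normalize("vpn_gateway", {"username": "administrator"})

print(detect_admin_login(win), detect_admin_login(vpn), detect_admin_login(renamed))
```

The third event shows the failure mode discussed in this episode: nothing errors, the rule simply never fires.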

SPEAKER_03

Yeah, that's one thing I cover in class, right? The data you're seeing in your SIEM might have completely different fields than what you'd see if you go look at the raw data on the source, and that's if you're getting it at all, right? So it's not just a magic box where you drop an agent and everything is perfect and shows up, right? There's a lot of decisions and intentionality that a detection engineer or SIEM engineer has to think about along the way, right? Where am I deploying these agents? How are the logs getting there? What's happening in the process? Could you talk in a little more detail about the different stops, from data being generated on the endpoint all the way to where it's actually being looked at by a detection?

SPEAKER_06

Yeah, for sure. So let's take a classic example. Let's look at Windows event logs, right? There are various types of logs flowing from different machines in the organization. Sometimes they go through different aggregation spots and forwarders within the organization, and different technologies have different ways of aggregating those logs. Some organizations would rather just shoot the data straight from the source to the data lake or SIEM layer; some go through data pipelines to reduce the size, the breadth and depth, of those logs. We often see a lot of cost optimization initiatives being run by data engineers in the organization to consolidate the data and filter it down to what they consider important for detection. Now, what is important for detection? That's a big question. Oftentimes, already at this ingestion phase, and especially where detection engineering and security data engineering are not the same team, there's miscommunication in terms of: what do I ingest? How do I ingest it? What do I filter out? How do I normalize and deduplicate? And how does that correspond with what my detections are doing, what they are looking for? What does post-alert hunting look like, and where can I find the data that I need? So already at this ingestion phase there might be challenges, and we can talk later about the kinds of challenges we see. After ingestion, obviously, the data needs to be routed to a specific place where it is saved. Again, at this point there are two main approaches. One is to enforce a schema upon writing the log. Some solutions allow it but do not enforce it.
Some solutions enforce it, and sometimes that really requires a lot of parsing work already at the ingestion phase. That's an interesting part too, where we also identify a lot of challenges, obviously. That takes us all the way to data being stored. Now, once the data is stored and it's out there, the question is: how do I query it? Right. With schema on write, obviously, the data is already structured and available for querying. When the schema is applied on read, well, I just push the data in there. I can parse it before or not parse it at all, just push it there, and the entire parsing process happens at query time. I can look at the data in different states. I can look at the data as it comes into where it's saved, but when I want to query it, I look at the data as it is after the parsing process. And this is where the processes I mentioned come in: the parsing, the extraction, the aliasing, the evaluations, and the different calculations that run on top of the data before it's available to be queried. You can either use an information model to query it or query it directly, but you're still gonna see the data after it's been parsed and evaluated. Then there's the enrichment part, which adds additional data on top of the log itself. A great example would be event types and tags in Splunk, which is pretty straightforward but sometimes really affects the detection logic. And sometimes a lot of the problems could be in that area alone. It can also be enrichment that doesn't revolve around the log itself.
Like, I want to correlate whatever is in the log, within a specific field, in a specific format, to something that's in a lookup or a reference table, and it needs to match in pattern, in value, in field, whatever. And following those kinds of stops along the way, we get to the detection. When I'm saying detection, obviously it's not just the detection itself; it's also the query that you want to run against the data, et cetera. We haven't even talked about automation. You know, from my past at Siemplify, we used to see all the time that a lot of organizations are moving to use SOAR as case management that runs automations. And this is kind of like the next stop. An alert has triggered, now it moves into the SOAR, and the playbook is triggered based on that alert. That alert needs to be structured in a specific way and needs to have data in a specific format in order for the playbook to refer to the data in it, which is another stop.

SPEAKER_03

Yeah, so there's all these different things happening to a log as it goes through this entire process, right? At any stage, whether it's being collected or not, whether it's being enriched or not, there's all these fields that exist inherently, and then the ones that you might add on later. And then there's the normalization and the categorization tags you might apply, right? All of that introduces complexity and potentially unpredictability for your log. And then you have a detection that is written to look at one of those fields, hoping that it will be there, right? Yep. Where along that entire pipeline do things start to go wrong, or most commonly go wrong?

SPEAKER_06

Yeah. So it starts with very basic stuff. Every piece of that pipeline is a configurable object. You can configure ingestion, you can configure parsing, configure data model enrichment, configure storage and routing. And bad references happen all the time. You'd be surprised by the number of times there's a typo in the reference for an index or a table or a specific source type. I think I have a specific lookup and it isn't there: somebody deleted it, somebody changed the name of the lookup. Or I'm looking at data that should come from a specific parser, and it's disabled. Those are very basic things that happen all the time just because the environment is changing. And that's kind of like the basics of having the pipes ready. Okay, if you don't have a piece of the pipe, data can't flow. The second thing that we see oftentimes is the data flow. You'd expect data to be coming from a specific source that you're monitoring. This is where we see a lot of organizations do have some basic controls and monitoring over whether data is flowing or not from a specific source. A big challenge is to look not just at the source level, but at the schema level, right? A source can generate a lot of different types of logs, right? Even if data keeps on streaming, if you don't look at the different logs and log types and different schemas coming in from that source, you might have a false sense of security where data is flowing, everything is okay, but specific types of logs aren't flowing: because some configuration changed at the source level, because of some filtering at the pipeline level that happened because somebody wanted to save some money, or because parsing errors broke the pipeline over at the SIEM, or whatever.
Yeah, this is another area that can go wrong. The third area, which is I think the most error-prone, is around schema and pattern issues. I can't tell you how many times we've seen some upstream change happen in IT. Somebody changed the structure of the value within a username field coming in from Active Directory, and all of a sudden seven detections broke. Why? Because you look for this pattern and the data is flowing in another pattern, and it's not like you're gonna have an error somewhere. The detection is just not gonna work; it's not gonna trigger. The same goes for changes in the schematic structure of the log itself. That often happens. By the way, in this area we do see organizations implementing some level of enforcement to monitor schema health. We don't see it as often as we would like to, but where we do see it, this is a major area for the sense of resilience, right? I'm not just looking at the data flow; I'm checking that everything I need from that source is getting there. Everything I need for detection is getting there. So those are additional places where we see detections break. And the other area is around meta configuration. You know, I think with what I'm gonna say right now, timestamps and time frames, people at home are gonna be like, yeah. Alerts from the future, you know. Misconfigured timestamps and time frames that don't make sense are a very big source of headaches for a lot of detection engineers, oftentimes because of automatic extractions that happen in different parts of the flow. Sometimes it's data sources that weren't misconfigured; they just were not configured in the first place, and the default wasn't good enough.
And this is an area where we see a lot of issues happening. It's hard to keep track of those: keep track of latency, keep track of time frames, keep track of the right mapping. You know, when you look at a log from CrowdStrike or from any EDR or whatever, you can see seven different timestamps right there, and mapping the right ones, and knowing that they come in the right format and match your detection logic, is a big challenge because you need to do it everywhere.
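
Illustration (not from the conversation): a minimal Python sketch of monitoring flow per log type rather than per source, the second failure area described above. The event IDs, counts, and the 10% threshold are made-up assumptions for the example.

```python
from collections import Counter


def missing_event_types(baseline: Counter, current: Counter, min_ratio: float = 0.1) -> list[str]:
    """Return event types whose current volume fell below min_ratio of their baseline."""
    flagged = []
    for event_type, expected in baseline.items():
        if expected > 0 and current.get(event_type, 0) < expected * min_ratio:
            flagged.append(event_type)
    return sorted(flagged)


# Hypothetical hourly counts per Windows event ID for a single source.
baseline = Counter({"4624": 1000, "4625": 200, "4688": 5000})
current = Counter({"4624": 1100, "4625": 210, "4688": 0, "4672": 4900})

# Source-level volume looks healthy, so a per-source check would pass...
print(sum(baseline.values()), sum(current.values()))
# ...while the per-type check shows that process creation events stopped arriving.
print(missing_event_types(baseline, current))
```

Total volume barely moved in this example, which is exactly why a source-level "is data flowing" check gives a false sense of security.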

SPEAKER_03

Yeah. One of the things I'm thinking of, and I have to be careful not to give this away, is that there is a SANS class with a CTF, and in that CTF, one of the things the author did, to be very sneaky, was to break into an endpoint, set the system clock back like 20 years, and then do the bad thing. The log then goes in saying it happened 20 years ago, and the detection fails, right? Because it was relying on what the log said the timestamp was versus the actual recorded time of receiving the log, right? Which is somewhat specific to that setup, but certainly those kinds of games can be played as well, right? 100%. 100%.
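
Illustration (not from the conversation): a minimal Python sketch of comparing the timestamp a log claims with the time the pipeline received it, which would surface both the backdated-clock trick and "alerts from the future". The one-hour tolerance and the example dates are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(hours=1)  # assumed tolerance; tune per source


def timestamp_issue(event_time: datetime, ingest_time: datetime, max_skew: timedelta = MAX_SKEW):
    """Compare the log's own timestamp with the time the pipeline received it."""
    delta = ingest_time - event_time
    if delta > max_skew:
        return "event_time_far_in_past"
    if delta < -max_skew:
        return "event_time_in_future"
    return None


# Arbitrary example times: a log received now that claims to be 20 years old.
ingested = datetime(2026, 3, 24, 12, 0, tzinfo=timezone.utc)
backdated = datetime(2006, 3, 24, 12, 0, tzinfo=timezone.utc)

print(timestamp_issue(backdated, ingested))
print(timestamp_issue(ingested + timedelta(hours=3), ingested))
print(timestamp_issue(ingested - timedelta(minutes=2), ingested))
```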

SPEAKER_06

And I tell you, moreover, oftentimes it's about playing with the time frame of the detection itself, like earliest to latest, and then playing around with latency to see whether you're gonna miss stuff or not. Those are things that happen all the time. We had a few clients who actually recreated it to see if they were going to identify it, if they were getting a lower alert rate and stuff like that. And this is just one exercise, but it's a super important exercise because it's a very big source of alerting failure. Yeah. Right. And by the way, not just that. Sometimes it's misconfiguration at the ingestion level. It's about monitoring the performance of the different spots along the way: whether your forwarder is sitting on a machine that is up and running and not grinding CPU, anything that could affect the ingestion and correlation downstream. It starts from that second layer. If the first layer is the data source itself, the second layer is going to be the ingestion, which can be super complex at the infrastructure level and the performance level alone, and that could cause latency and time frame issues downstream. So this is what I also call the meta configuration itself. Just to give you an example, somebody changed an ingestion setting to limit the number of events flowing per minute, and all of a sudden there was a crazy drop in specific types of logs because it hit the maximum every time. Oh yeah. And this was a few years ago, at one of the SOCs that we were working with, and they didn't understand why data was dropping, because they hadn't monitored it.
And some engineer spent something like a week walking backwards across the pipeline to identify that this small setting, at this specific node within this very crazy graph, was just misconfigured.
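
Illustration (not from the conversation): a minimal Python sketch of how ingestion latency interacts with a detection's earliest-to-latest window. It simulates a scheduled search in plain integers (seconds); the schedule, window, and delay values are assumptions, and real SIEM schedulers differ in detail.

```python
def events_seen(events, run_times, window, delay=0):
    """Simulate a scheduled search.

    events: list of (event_time, ingest_time) pairs, in seconds.
    Each run at time t searches event_time in [t - delay - window, t - delay),
    and can only see events already ingested by t.
    """
    seen = set()
    for t in run_times:
        start, end = t - delay - window, t - delay
        for idx, (event_time, ingest_time) in enumerate(events):
            if ingest_time <= t and start <= event_time < end:
                seen.add(idx)
    return seen


# Event 0 arrives 5 seconds late; event 1 arrives 130 seconds late.
events = [(100, 105), (290, 420)]
run_times = [300, 600, 900]  # the search runs every 5 minutes

print(events_seen(events, run_times, window=300))             # event 1 is never searched
print(events_seen(events, run_times, window=300, delay=180))  # a lag allowance catches it
```

Without the delay, the late event falls between two runs: it had not arrived when its window was searched, and its timestamp is outside the next window.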

SPEAKER_03

Yeah, crazy stuff like that, right? And tracing it back becomes just a whole nightmare, let alone when attackers do it. Have you ever seen attacks that involve turning off log sources and relying on the teams not seeing it?

SPEAKER_06

So, I actually haven't. I had a very interesting conversation about it with my CTO, Roy, who has a very extensive offensive security background. We were talking about this notion of whether attackers would try to misconfigure security operations infrastructure to allow attacks to happen. And he had a saying: the good ones might be doing it; the very good ones wouldn't be caught, so they don't even need to do anything there. Yeah. So that was kind of interesting to think about. I guess, because the problems are known to all, right, I won't be surprised if somebody has used this kind of knowledge to identify how they can either bombard or alter how they're working in order to impact how the blue team's SOC is gonna be able to respond to threats. But those are kind of like two different approaches around that.

SPEAKER_03

Yeah. You had mentioned earlier the idea of looking at your schema health and things like that. For those who are listening and may not be doing really anything around that, what are some of the actual methods people use to just look at the data and say, has anything changed here, that are implementable by, you know, largely anybody?

SPEAKER_06

Yeah. So first and foremost, build mechanisms for enforcing some of the schema requirements coming in from the detections. That requires some engineering work, obviously, but even starting with that, implementing those kinds of controls in whatever SIEM you're using, whatever data lake you're using, is already a very good step towards security operations resilience. It starts by identifying what you use in your detections: what kind of data sources, what kind of fields, what kind of conditions. Then you work backwards to the fields that should be arriving, parsed, and flowing correctly into the SIEM itself. Yeah, that's a very important first step towards being sure about your detections, obviously. And we've seen that happen in different organizations. I would say that, again, because it can require some engineering work to put this in place, I would focus tomorrow morning on the critical detections, okay, and the critical sources that those detections rely on. Even that is going to be a very important step.
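
Illustration (not from the conversation): a minimal Python sketch of working backwards from detections to the fields they need, then checking a sample of parsed events for those fields, starting with critical detections only. The rule names, sources, and field names are hypothetical.

```python
# Hypothetical detection inventory: which source and fields each rule depends on.
DETECTIONS = {
    "brute_force_login": {
        "source": "windows_security",
        "fields": {"user", "src_ip", "event_id"},
        "critical": True,
    },
    "rare_process": {
        "source": "edr",
        "fields": {"process_name", "parent_process"},
        "critical": False,
    },
}


def required_fields(detections: dict, critical_only: bool = True) -> dict:
    """Work backwards from detections to the fields each source must deliver."""
    required = {}
    for rule in detections.values():
        if critical_only and not rule["critical"]:
            continue
        required.setdefault(rule["source"], set()).update(rule["fields"])
    return required


def missing_fields(required: dict, samples: dict) -> dict:
    """Return, per source, required fields that are never populated in the sample."""
    gaps = {}
    for source, fields in required.items():
        events = samples.get(source, [])
        absent = {f for f in fields if not any(e.get(f) not in (None, "") for e in events)}
        if absent:
            gaps[source] = absent
    return gaps


# A parsed sample where the IP address still carries the raw vendor field name.
samples = {"windows_security": [{"user": "alice", "event_id": 4625, "IpAddress": "10.0.0.5"}]}
print(missing_fields(required_fields(DETECTIONS), samples))
```

Running a check like this on a schedule is one way to get the "continuous check" on critical sources; the coordination cost mentioned in the conversation shows up as keeping `DETECTIONS` in sync with the real rules.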

SPEAKER_03

Yeah. So just being aware of your most critical detections, right: the source the data is coming from, what's happening to it along the way, the actual fields that are put in, and then having some kind of continuous check to ensure that none of that changes over time.

SPEAKER_06

Yeah, I would even say that in some cases, in some tools, you can actually put some controls in place to enforce schemas in different ways. It's a double-edged sword, because when you build it yourself on your SIEM today, it needs some coordination between the controls that you put in and the detections. If tomorrow morning you're gonna update your detection, you're gonna need to make sure you update those controls as well. Right. Obviously. But again, starting with the most critical ones is already a very important step.

SPEAKER_03

Yeah, starting with the priorities, making sure your biggest things at least are in place, and then working your way down the list is the approach I'm always pushing as well. On the schema side, right? How often do schemas change for security products?

SPEAKER_06

So it differs. I would say that most vendors do understand that what comes out of their product affects detection and response down the line. And the thing is that you don't necessarily know when. Right. We had this thing that we saw with a client who told us that one of the big vendors, without mentioning any names, changed a schema into a nested field within one of the logs. It was one change, a very slight one, by the way. But it broke parsing for specific flows, and seven or eight detections from their detection array that were looking at this source weren't triggering for two months. And somebody at the SOC was like, hey, it's odd, I haven't seen those rules trigger. It's funny, because those rules were generating false positives every now and then, but all of a sudden they were completely silent, and they were like, okay, let's look into it. They spent around two weeks trying to figure out what exactly was broken, because they'd seen data flowing, but the rules weren't triggering, and they identified this slight change that broke extraction and evaluation of a specific field that those specific detections were relying on. Yeah. So it's not necessarily about how often it happens, but when it happens, that's something that, in this case, really affected their ability to respond to threats. Now, in solutions that are managed internally within the organization, where the schema can change based on internal configurations, this happens a bit more often, I'd say, because it's not a vendor that's very cautious about every change, yeah, but internal stakeholders like IT. Again, that's the example with Active Directory. IT changes configurations often.
And for internal solutions, internal logging for internal products, which happens a lot at, you know, tech companies and the like, this happens way more often, because the backend engineer just changes something. They don't even know it's being utilized downstream, but they change it, and this breaks much more often. This is, by the way, where a lot of organizations are implementing some level of schema validation. And this is where we've kind of seen it.
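
Illustration (not from the conversation): a minimal Python sketch of flagging rules that used to fire and then went completely silent, like the seven or eight detections in the story above. The 14-day window, the minimum baseline, and the alert histories are assumptions for the example.

```python
def silent_rules(history: dict, recent_days: int = 14, min_baseline_hits: int = 5) -> list[str]:
    """Flag rules that used to fire but produced zero alerts in the recent window.

    history maps a rule name to its daily alert counts, oldest first.
    """
    flagged = []
    for rule, daily in history.items():
        baseline, recent = daily[:-recent_days], daily[-recent_days:]
        if sum(baseline) >= min_baseline_hits and sum(recent) == 0:
            flagged.append(rule)
    return sorted(flagged)


# Hypothetical 44-day histories.
history = {
    "noisy_rule": [1, 0, 2, 1, 0, 1] * 5 + [0] * 14,  # occasional alerts, then silence
    "rare_rule": [0] * 44,                            # never fired; nothing to compare against
    "healthy_rule": [1] * 44,
}

print(silent_rules(history))
```

A rule that never fired cannot be judged this way, which is why this check complements, rather than replaces, field-level and flow-level checks.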

SPEAKER_03

Yeah. I think of a number of different scenarios there. I've definitely seen cloud products where there's a whole bunch of different services, and I'll have a username field, but even within the platform itself, username from one service might be in a different format from username from another one. And you kind of hope your vendor is taking care of that to some extent. Maybe that's a question, right? Vendors and their information models, to some extent, should be on top of that. But do you find that there are any gaps or lag between when a vendor changes something and when the information schema, the transform for that change, catches up?

SPEAKER_06

So it differs. Again, there are some vendors who are more aware of it and try harder to solve for it. But the reality is, you know, security engineers can't be on top of everything, can't be on top of every slight change in the API documentation or additional versioning. And they definitely can't be on top of the changes that happen within the data itself, in its structure. And we often see that gap. They kind of don't have a reason to suspect a change, but when it happens, they're not aware of it, and it affects the environment. So there's definitely a gap. It varies in time frame, but it's still a gap. And that's a good point that you're raising about even within the same vendor, because those solutions, you know, you mentioned cloud providers, IT solutions, cloud solutions, are managed by either IT or DevOps or whatever, and they're very, very much configurable. I can decide what I want to use, what kind of conventions I want to use. If I change conventions in specific services on this side of the cloud versus somebody else on another side of the cloud, often there's not a lot of communication or coordination. A lot of times we talk with detection engineering teams that say, I'm not exposed to any change management log for what happened in different parts of the organization, and oftentimes they just need to handle the consequences. You know, obviously the ideal state would be open communication about change management and transparency around it. Being realistic, detection engineering and security data engineering need to have control on their own rather than hoping that somebody's gonna update them if something changes. Because at the end of the day, if an alert isn't triggering, it's on the security side.

SPEAKER_03

Yeah, and that's that's the hard thing to notice, right? It's like it's easy to tell when you write a detection rule that is bad and it fires off a million false positives and you explode your queue, you know, clear signal there. But it's when things disappear, right, that that's where you say, oh hmm, it's quiet, a little too quiet, right? Yeah, yeah.

SPEAKER_06

We have this saying: nobody knows if an alert wasn't triggered for three months because the environment was secure for three months or because something was broken. Yeah, right. And it immediately makes sense because, hey, we've been there.

SPEAKER_03

Yeah, yep. And so resilience is the thing we're getting to here, right? Every SOC needs that: what can be pushed and changed and modified, how far can the environment drift, and how much can happen before something actually breaks? What does a SOC that is doing this really, really well look at? And how do they measure and approach building that resilience?

SPEAKER_06

Yeah, so to your point about resilience, there are a lot of discussions about it these days because it's a new focus, right? The discussion about risk has shifted. Resilience has a lot to it, but oftentimes when people refer to resilience, they mean attack resilience: do I have all the controls in place to make sure an attacker can't attack me? But you can't really talk about cyber resilience without talking about operational resilience. Once an attacker has done their thing, how do I respond, contain, remediate, and recover in the most efficient way? And when you come to think about it, the security operations infrastructure is what allows detecting and responding in the fastest, most efficient way. So there are three main questions that, in our opinion, organizations should ask themselves about security operations resilience. The first is: how do I know whether our detections, the things we know how to respond to, are fully operational right now, alerting properly, and getting everything they need? The main measurement for that first question is how many of my detections are working right now out of the entire array. Some organizations measure it through the MITRE ATT&CK lens, and some look at it through the CIS lens, obviously from the defender side. Compliance is an important angle too, because you need to report to your auditor that you're monitoring properly and nothing is broken. The second question is: should something break, how fast would I identify it, and how fast would I be able to fix it?
This runs along the same lines as vulnerabilities. Vulnerabilities happen, right? It's a question of how fast I respond, how fast I patch, how fast I put compensating controls around it, and what the SLA is. So time to fix is the second very important question, because every minute the issue isn't fixed, you have a blind spot. And finally, it's about priorities: how much do I know about my detections, from the most critical ones to the least critical ones? This revolves around my resources. As I was saying earlier, if you can monitor for anything, monitor the most critical detections. Say you have a fixed amount of resources and you want to distribute it across your detection array: put the most focus on the critical ones first, then work down through the other levels of urgency or severity. We do realize that without a proper solution to help you, there's some engineering work that needs to be done. But we really believe you've got to look for it, monitor for it, and be on top of it. And if you can only do one thing, start with your critical detections. This is also how you show upper management that you have the right focus: these are the resources I have, and this is how I'm distributing them across my most critical detections. To that point, there's also the question of how continuous it is. How often do I check my critical detections? How often do I check my high-severity detections, and so on? This is also a question of resources. How urgent is it to know on a daily, weekly, or hourly basis that my high-value detections are working?
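
As a rough illustration of the three questions above (not something shown in the conversation, and not Fig's product), the measurements could be sketched in Python. The `Detection` record, its fields, the severity names, and the check intervals are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical check cadence per severity: the most critical
# detections are re-validated most often.
CHECK_INTERVAL = {
    "critical": timedelta(hours=1),
    "high": timedelta(days=1),
    "medium": timedelta(weeks=1),
}


@dataclass
class Detection:
    name: str
    severity: str           # one of the CHECK_INTERVAL keys
    healthy: bool           # outcome of the most recent validation
    last_checked: datetime  # when that validation ran


def coverage_by_severity(detections):
    """Return {severity: fraction of detections whose last validation passed}."""
    totals, passing = {}, {}
    for d in detections:
        totals[d.severity] = totals.get(d.severity, 0) + 1
        passing[d.severity] = passing.get(d.severity, 0) + (1 if d.healthy else 0)
    return {sev: passing[sev] / totals[sev] for sev in totals}


def overdue(detections, now):
    """Return names of detections not validated within their cadence,
    ordered from most to least critical."""
    order = list(CHECK_INTERVAL)
    late = [d for d in detections
            if now - d.last_checked > CHECK_INTERVAL[d.severity]]
    late.sort(key=lambda d: order.index(d.severity))
    return [d.name for d in late]


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0)
    sample = [
        Detection("credential_dumping", "critical", True, now - timedelta(minutes=30)),
        Detection("log_source_disabled", "critical", False, now - timedelta(hours=3)),
        Detection("new_admin_account", "high", True, now - timedelta(days=2)),
    ]
    print(coverage_by_severity(sample))
    print(overdue(sample, now))
```

Coverage per severity speaks to the first question, the overdue list to the second and third: it shows where a break could go unnoticed longest, starting with the critical detections.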

SPEAKER_03

Yeah. What do you find is the average team's rate of checking these things, based on the resources they have?

SPEAKER_06

For a start, for most teams it's not continuous. By not continuous, I mean less often than monthly. Oftentimes it's done on a quarterly, semi-annual, or annual basis. Some advanced teams run purple teaming exercises. Coming from Cymulate, we saw a lot of teams running them. The problem with purple teaming exercises is that you'll run an attack simulation and see that an alert isn't triggering, but that's it. Now you need to understand why. And oftentimes just having the resources to investigate every silent failure is a lot to handle. So it really depends on the amount of resources available to the security engineering team. Sometimes we see outsourced assessments, also mostly around the highly critical alerts, because it's very expensive to have somebody do it for the entire detection array. But this is what we're seeing.

SPEAKER_03

Yeah. That hits on two questions I wanted to ask based on the three things you posed there. The first one is, how do you test and how often do you test? You already hit purple teaming and checking on some of those things. Are there any other methods you would recommend for people to check these on a more frequent basis, or to make that easier?

SPEAKER_06

So first of all, there's Fig. It's one of the core challenges we set out to solve, without you having to build something on your own, which can be a challenge. Second, building point validations, meaning data validations at specific parts of the pipeline. There is some point monitoring and point validation that some of the vendors offer. Then there's being able to connect the dots: something that's failing here might be connected to a failure over there, and what do I need to do in order to fix it? Sometimes even just knowing that there is a problem is the first step toward a solution. So look more at failure logging and error logging in your existing tools. We've been seeing some cool open source implementations for point monitoring. Those are very good first steps. And you need to take the first step.
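
One simple form a point validation could take is a freshness check per log source: flag any source whose newest event is older than the gap you would normally tolerate. The sketch below is a minimal illustration under assumed inputs. The source names, the `last_seen` and `max_gap` mappings, and the thresholds are hypothetical, and a real pipeline would pull the timestamps from its own ingestion metadata.

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, max_gap, now):
    """Return the sources that have gone quiet for longer than allowed.

    last_seen: {source: datetime of newest event received}; a source that
               never delivered anything is simply absent (or None).
    max_gap:   {source: timedelta tolerated between events}
    """
    flagged = []
    for source, gap in max_gap.items():
        seen = last_seen.get(source)
        if seen is None or now - seen > gap:
            flagged.append(source)
    return sorted(flagged)


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0)
    last_seen = {
        "firewall": now - timedelta(minutes=2),
        "edr": now - timedelta(hours=6),
    }
    max_gap = {
        "firewall": timedelta(minutes=15),
        "edr": timedelta(hours=1),
        "cloud_audit": timedelta(hours=1),
    }
    print(silent_sources(last_seen, max_gap, now))
```

Iterating over the expected sources rather than the observed ones is the point of the design: a source that stopped sending entirely still shows up in the result.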

SPEAKER_03

Yeah, right. The other thing I want to ask on that is, as you are building, expanding, and changing your environment, how do you stay on top of it, and what kinds of challenges are created as you go into a new cloud platform or deliver new custom applications? Where do people get themselves into trouble there?

SPEAKER_06

Yeah, so again, mostly it's around ingesting new types of data sources, changing data streams and ingestion, and expanding source coverage. At the end of the day, these have an extensive effect on the detection array. Imagine you have a pipe that's built for a specific stream of data, now you're tripling it, and all of a sudden the pipe breaks. Or you built it for specific types of data, and now different forms of that data come in from other places through that pipe. You think your detection should be covering those sources as well, but because they arrive with different structures, even though it's the same source, the detection won't trigger on them. So there's a lot of tweaking to do in order to handle every source. Even if you've already ingested data from this vendor into that type of product, it's the specific deployment in your environment that can affect it. And again, the first step is to identify that there is a problem. What we often see at Fig is that a problem gets flagged, but understanding the root cause, and especially what fix would solve it, is already a challenge. That's what we at Fig set out to solve, to allow engineers to stay focused on security rather than the plumbing.
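
The "same source, different structure" failure described above can be illustrated with a small check that compares the fields a detection depends on against what actually arrives from each deployment. This is a sketch under assumptions: the event dictionaries, the `deployment` key, and the field names are hypothetical, and real events would come from the parsed output of your own pipeline.

```python
def missing_fields(event, required):
    """Return the required field names that are absent or empty in a parsed event."""
    return sorted(f for f in required if event.get(f) in (None, ""))


def drift_report(events, required):
    """Map each deployment to the required fields missing from any of its events."""
    report = {}
    for event in events:
        gaps = missing_fields(event, required)
        if gaps:
            key = event.get("deployment", "unknown")
            report.setdefault(key, set()).update(gaps)
    return report


if __name__ == "__main__":
    # Fields a hypothetical detection rule needs in order to match.
    required = {"user", "src_ip"}
    events = [
        {"deployment": "cloud-a", "user": "alice", "src_ip": "10.0.0.5"},
        # Same vendor, different naming convention: "userName" instead of "user".
        {"deployment": "cloud-b", "userName": "bob", "src_ip": "10.0.0.9"},
    ]
    print(drift_report(events, required))
```

A non-empty report means the rule is silently blind to that deployment even though its logs are arriving and parsing without errors.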

SPEAKER_03

One scenario that pops up a lot, especially with students, is that they might not be running their entire set of security tools. Someone else is managing their infrastructure, at least partially, for them. Any suggestions on what people in that situation could ask if they have an MDR or an MSSP? How can they make sure the provider is doing its job so that none of this happens?

SPEAKER_06

Amazing. So it returns to the questions of resilience, those three questions that you can ask internally but can also ask the MDR that manages your environment. First, how do you monitor for the different issues? How do you know whether my detections, the ones you're running for me, are working properly? Second, how fast would you be able to identify a silent failure, or any failure, and fix it? Oftentimes you look at that from the SLA perspective. And finally, how do you look at it from the detection criticality perspective? That's perfectly fine, because MDRs provide different levels of support to their clients. You need to be very aware of the level of support you're getting from your MDR and, in relation to that, what you should expect. So if you have a very high level of support in your contract with your MDR, it's important to get that level of visibility: I know that when there's an alert, you report it to me and tell me what I need to do to respond. But if there was silence, I want to know it was silence because I wasn't attacked and not because something broke. Having those assurances from the MDR, by the way, is also really helpful for the internal discussion about trusting the MDR and continuing to work with that vendor.

SPEAKER_03

Yeah, you could probably test them yourself too, to some extent, right? If you really wanted to, you could push them: turn off one of your things and see if they say anything.

SPEAKER_06

Oh, 100%. By the way, we got that a lot back at Cymulate, where clients used to run purple teaming exercises with the MDR to see that the MDR was actually responding and picking it up. It happened a few times, and the MDRs our clients worked with were actually pretty amazing. They'd identify it, pick it up, and say, "Hey, we just identified very suspicious activity in your environment." And it's like, good, great. It was me testing you, and you passed the test. Amazing. So it's important, because at the end of the day, when you're working with an MDR, you need to have confidence in its resilience. It's a monitoring and response team, right? It's just not your own, but it still watches your six.

SPEAKER_03

Yeah, you're outsourcing the technical detection, but ultimately not the responsibility. So you're putting that trust in their camp and hoping they're covering that kind of stuff for you. Anything else we didn't cover on the resilience or log and telemetry health topic that you think is really important for listeners to know?

SPEAKER_06

I think we covered everything, in my opinion. If we're talking about takeaways from our discussion, it's important to understand that the SecOps infrastructure is the foundation of the organization's ability to detect and respond to threats, right? That is the beating heart of the security organization. Making sure it's working, having the assurance that it's working, and if not, being able to identify the problem and fix it: that's not an engineering problem, that's a security problem. First of all, having that level of understanding and acceptance that this is a security problem is the first step toward solving it. Second, it's important to look at the different levels of the problem at the different points the data goes through, because at each point and at every level there could be a problem that affects detection and response. It might be a journey to start covering everything, but you need to take the first step. Look at the data flow, validate your detections, take it from an annual exercise to a semi-annual one, and move toward solutions that help you cover it fast and give you peace of mind about the resilience and efficacy of your environment. This is really the first step toward SOC modernization, because if you can trust your foundation today, you can definitely advance to the SOC of the future.

SPEAKER_03

Excellent. Yeah, I love that approach. Understanding what your most important data is, right? Priorities. Looking at your entire pipeline and knowing where that data comes from, asking what could go wrong at every single step. It's kind of a threat model of sorts, only for your data pipeline, but it's almost exactly the same process. And then making sure you have the answer to the question, "What happens, and how will I know, if something breaks here?" Very logical approach. Love it. Where can listeners find you and get in contact if they want to?

SPEAKER_06

So fig.security is our website. My email is nir@fig.security. Reach out if you have any questions, leave a note on our website, come see a demo. We'd very much like to hear from the listeners, from the practitioners. We came to solve this pain from the root, so we're very keen to hear what you're experiencing, whether you've experienced anything we talked about today, and how we can take you from doing the plumbing to really focusing on security.

SPEAKER_03

So yeah, awesome, perfect. Thank you so much for taking the time to speak with me here. We'll get resources and that information in the show notes, and if anyone's interested, they know how to reach out. So thank you so much. Thanks for joining Blueprint, and we will catch you later.

SPEAKER_03

So, what have we learned? Telemetry health sits at the ground floor of your detection program, and if the data isn't arriving correctly, none of the logic on top of it matters. The key takeaway from this conversation: treat your pipeline like a threat surface. Map it, monitor it at the schema level, prioritize your most critical detection dependencies, and make sure you can answer the question, "If something breaks here, how fast would I know?" Thanks to Nir and the Fig Security team for sponsoring this episode. Links and resources are in the show notes. I'm John, and I'll see you on the next episode of Blueprint.