Hybrid Endpoint Security: Balancing Inline and Out-of-Band Approaches to Protect Data
Through a hybrid architecture, modern endpoint security software can balance the tradeoff between security and productivity
Alex Ionescu (Advisory Board Member, Cyberhaven)
Radu Banabic (Co-founder & VP Engineering, Cyberhaven)
Alex Copot (Software Engineer, Cyberhaven)
Aristide Fattori (Security Software Engineer, Cyberhaven)
Enterprise cybersecurity software is often perceived as a necessary evil. In one form or another, companies need to enforce boundaries and defend their digital activities. Any security enforcement comes with costs: setup costs, user impact, management costs, and so on. These costs pose a challenge for decision makers, who must weigh the concrete cost of implementing a solution against the expected cost of addressing a security incident after the fact.
The decision today is between two opposite ends of a productivity/security spectrum. On one end we have extreme productivity: not having any security in place at all and relying on remediating incidents after the fact, if that is even possible. On the other end of the spectrum we have extreme security, where each and every system action is submitted to thorough scrutiny, bringing overall productivity down. Both extremes are untenable, for different reasons. Today, many companies use ad-hoc approaches to try to find a good compromise: blacklists, exceptions, partial deployments, etc.
In this post, we argue that the decision does not need to be binary, nor ad-hoc. We describe a hybrid design that combines the advantages of inline and out-of-band monitoring in order to enable a systematic approach to navigating the tradeoff between productivity and security. This hybrid design enables operators to dynamically tune the security enforcement to their business needs.
For brevity, we focus on endpoint use-cases, but the overarching points have broad applicability even in network monitoring, CASB, and more.
Brief history of endpoint security
In the early days of computing, malware first appeared in the form of experimental self-replicating software that spread from computer to computer through software vulnerabilities, removable media, and networks. Over the years, malware mutated into a profitable business for cybercrime gangs, while nation-states embraced cyberwarfare as part of their legitimate arsenal of military and economic weapons. As more and more work becomes digital, the targets of software attacks have grown more diverse: personal computers, smartphones, point-of-sale terminals, hospital machines, database servers, and more are all attacked for profit. Criminal activities are now so prevalent and profitable that entire industries have sprung up around them, including OEMs and service providers offering “Exploit-as-a-Service” or “Malware-as-a-Service”.
To counteract the cybercrime business, the cybersecurity market evolved, producing a plethora of ways to identify and block malicious software. Later, to address nation-state threats (often labeled as Advanced Persistent Threats, or APT), the market evolved even further to analyze and block non-malware-based attacks, such as fileless exploits and hands-on-keyboard activity.
Traditionally, malware analysis and detection approaches can be divided into two categories: static and dynamic. Static approaches include all techniques based on analyzing malware without running it and without analyzing its interactions with the host system. Unfortunately, many countermeasures exist against static analysis methods, ranging from basic packers to polymorphic and metamorphic engines, up to extremely complex virtual machines embedded inside the malware that execute payloads written in dedicated languages. Dynamic analysis approaches, on the other hand, are based on observing the behavior exhibited by malware during its execution — either in an analysis environment or on the protected endpoint — and looking for patterns of malicious behavior. Dynamic approaches are harder to defeat through obfuscation, so this is the approach most modern security tools take.
Over time, security tools expanded their scope from just antimalware and APT protection to also cover other threats, such as insiders, negligent users, or misconfiguration. These tools focus on protecting and monitoring sensitive data, rather than protecting the computer from infection. Their goals range from satisfying compliance requirements, to serving as a legal paper trail during proceedings, to guarding the crown jewels of a company – often its intellectual property, in digital form.
Such data security approaches suffer from the same problems as endpoint protection software built on dynamic techniques, since they need to monitor all system activity in real time, then abstract it into a high-level behavioral representation that can be evaluated against policies determining whether the behavior is allowed. Unfortunately, as we will discuss in the rest of this post, applying such a fine-grained approach to data protection poses several technical challenges. Furthermore, while OS vendors have long provided support for classical antimalware products (e.g., the Microsoft Virus Initiative, or MVI), the same cannot be said for data protection.
To better clarify the challenges of data protection, we will use the example of preventing exfiltration via a USB storage device. This is a traditional, almost deprecated use case for security software, which is gradually being replaced by other file transfer methods, such as those that leverage cloud services, chat applications, or direct transfer protocols. While the implementation is slightly different, the high-level principles are the same, so we will use the USB example for simplicity. We focus on an endpoint DLP use case, but the synchronous vs. asynchronous principles apply to other categories, such as network DLP, SASE, CASB, or XDR.
USB drive file copy blocking
USB devices are pervasive and are traditionally one of the easiest targets for exfiltrating data. It is relatively easy to completely block such devices using group policy, but completely blocking functionality can be seen as too disruptive to the business. There is always a conflict between security and business enablement, and to enable good trade-offs, security tools need to be precise. For our USB copy example, it means we need to prevent copying certain sensitive files, but not other files.
At a high level, there are two steps to this task: detection and enforcement. Detection means inspecting system events and understanding when a file copy to USB storage is taking place. It also means understanding the context of the file copy – who is copying what file from which location to where. Enforcement means taking some action when the file copy should be restricted, typically relying on these contextual clues.
Operating Systems have long provided support for security tools, in the form of programming interfaces that allow intercepting low-level system activity. This allows third party tools to monitor and interact with the way the operating system works. This is also where the first challenge comes up: system activity is very granular and does not map 1:1 to user actions. For instance, a high-level action such as copying a file to a device corresponds to many low-level system events: scanning folders, opening files, opening metadata, reading/writing chunks of data, etc.
In general, a typical copy operation corresponds to the following sequence of system calls:
1. Open the source file
2. Create the destination file
3. Read from the source
4. Write to the destination
5. Repeat from step 3 until all data is transferred
6. Close the source file
7. Close the destination file
This sequence can then be repeated for any number of files.
Identifying this pattern of events and producing a higher-level “File Copy” event can be easily done with a state machine. But even this has its challenges. For example, any failures might need to reset the state. This means that visibility into failures is also required at each step – which could include events even the operating system doesn’t expect, such as sudden device removal.
Similarly, if two events are very far apart in time, the state machine should probably be torn down. Otherwise, any occurrence of steps 1 and 2 together would initialize a state machine that waits for step 3 indefinitely, needlessly consuming memory. But what if the USB storage device has very slow write speeds? Resetting the state machine too quickly can cause false negatives.
As one further complexity, the state machine must handle events from multiple files being copied in bulk, or, to truly expose the issue, two separate file copies performed by different users logged onto the same machine, transferring from two separate hard drives to two separate USB storage devices. Now the state machine must know not to transition from step 3 (Read) to step 4 (Write) if the two events refer to different files, different users, or different USB storage devices, or else it might cause false positives or report wrong data.
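To make this concrete, here is a minimal sketch (in Python, with hypothetical event and state names) of such a state machine, keyed per (user, source file, destination device) so that interleaved events from concurrent copies cannot corrupt each other's state. A stale-entry timeout guards against the memory growth described above:

```python
import time

IDLE, OPENED, CREATED, READING, WRITING = range(5)
STALE_AFTER = 30.0  # seconds; too aggressive a value risks false negatives

class CopyTracker:
    """Tracks file-copy progress per (user, source, destination) tuple."""

    def __init__(self):
        self.state = {}  # (user, src, dst) -> (state, last_event_time)

    def _expire_stale(self, now):
        # Tear down state machines whose events stopped arriving long ago.
        stale = [k for k, (_, t) in self.state.items() if now - t > STALE_AFTER]
        for k in stale:
            del self.state[k]

    def on_event(self, user, src, dst, kind, now=None):
        """Feed one low-level event; return True once a copy is detected."""
        now = time.monotonic() if now is None else now
        self._expire_stale(now)
        key = (user, src, dst)
        state, _ = self.state.get(key, (IDLE, now))
        if kind == "open_src":
            state = OPENED
        elif kind == "create_dst" and state == OPENED:
            state = CREATED
        elif kind == "read" and state in (CREATED, WRITING):
            state = READING
        elif kind == "write" and state == READING:
            state = WRITING
        elif kind == "error":            # e.g. sudden device removal
            self.state.pop(key, None)
            return False
        self.state[key] = (state, now)
        return state == WRITING          # a read/write cycle was observed
```

Because the key includes the user and the destination device, a write event from a second user on the same machine never advances another user's state machine.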
Inline security checks
The first approach to implementing a monitoring tool is to use inline checks. Inline checks mean intercepting each system activity (for example, each system call, each network request) and pausing that activity until all the checks can be performed and the activity is deemed safe.
For USB file copy blocking, this can be done rather easily using a file system minifilter driver. Minifilters are a mechanism built into Windows that allows synchronous interception of file I/O operations. To implement the state machine described earlier, the minifilter intercepts the file open, read, write, and close operations and generates the corresponding transitions in the state machine. Additionally, for write operations, the minifilter checks the current state, and if it corresponds to a forbidden activity, it forces the write operation to fail by returning an error code to the calling application.
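Conceptually, the synchronous decision in the pre-write path looks like the sketch below. A real minifilter is a C kernel driver; this Python fragment, with hypothetical names and a toy sensitivity check, only illustrates the control flow: the write is paused until the check returns, and returning an error status makes the application see a failed write.

```python
from dataclasses import dataclass

STATUS_SUCCESS = 0
STATUS_ACCESS_DENIED = 0xC0000022  # the error the calling app would see

@dataclass
class WriteOp:
    source_path: str   # where the data being written originally came from
    dest_is_usb: bool  # whether the destination is a USB storage device

def is_sensitive(path):
    # Hypothetical stand-in for a real policy/classification lookup.
    return path.endswith(".secret")

def pre_write(op):
    """Synchronous check: the I/O is paused until this function returns."""
    if op.dest_is_usb and is_sensitive(op.source_path):
        return STATUS_ACCESS_DENIED  # block: the write fails inline
    return STATUS_SUCCESS            # allow: the I/O proceeds untouched
```

The key property is that nothing reaches the device before the check completes, which is exactly what makes the guarantees strong and the overhead high.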
This approach has strong security guarantees because no activity goes unchecked. Each write to the USB drive is analyzed and blocked, if necessary.
However, this approach also has high overhead, because the monitoring app becomes a choke point for the entire system. The filter maintains the state machine in the context of I/O operations, which can lead to significant performance degradation. For example, suppose that the state machine needs the full file path of the source. In Windows, this is trivially done by asking for the normalized file name. However, in certain cases, such as files on network shares, getting the normalized file name can take several seconds; if this is done for each file open or read operation, that operation would now be delayed by several seconds. Furthermore, given that the security app needs to intercept all reads, not just those from USB drives, this overhead is added to every read operation on the system.
The approach also has high risk. The most obvious risk is that of an operating system crash: bugs in drivers can lead to invalid state and cause a ‘blue screen of death’ or ‘kernel panic’. However, there are less obvious risks. There may be other minifilters and other drivers registered in the system, which may also get confused by the changes to the file flow. The Windows kernel is monolithic, so all drivers run in a common memory space, leading to very high state explosion and exponential complexity. Apple tried to solve this problem by pushing security products out of the kernel using the new Endpoint Security Framework, but that API is still in its infancy and lacks significant functionality compared to the Windows approach.
An even more subtle issue with the approach is that by analyzing events inline, the security product is subject to the current state of the code that is performing the operation. For example, if the activity is currently being performed while certain kernel operations are prevented, such as pre-emption being disabled (which makes waiting on objects impossible) or while paging is disabled (which makes touching user space memory impossible), then the monitoring tool must obey the same rules. As a technical example, in Windows, certain file operations can be seen at an IRQL of DISPATCH_LEVEL, which fundamentally alters what the minifilter can do.
Out-of-band security checks
The alternative is out-of-band checks: activity is audited asynchronously, on different threads from the one executing the activity. The activity is not paused, so application performance is not directly affected by the extra monitoring.
For USB file copy blocking, this can be implemented most easily using the FileIO class of events in the Event Tracing for Windows (ETW) framework. ETW is a tracing framework that allows user-space applications to subscribe to events from many different sources. The framework is asynchronous, decoupling event producers from event consumers through an internal queueing mechanism.
The detection part of the USB file copy blocking approach remains very similar to the synchronous approach: the security app subscribes to File I/O events much like those the minifilter sees and triggers state machine transitions accordingly.
Because of the asynchronous nature of the approach, slow operations in the detection code no longer directly impact the monitored application. For instance, if the security app performs the same file name normalization described in the previous section, the only effect of the normalization latency for network shares is that the state machine transition is delayed by a few seconds. That is not to say the overhead is zero: there can still be indirect performance interference. For example, if the security application performs a lot of I/O on the USB drive as files are being copied to it, the drive performance will likely degrade. Random I/O is usually significantly slower than sequential I/O, so depending on the specific access patterns, parallel I/O from the security app can lead to indirect performance degradation. There are other, less immediately observable ways in which asynchronous designs can affect performance: cache pollution, NUMA noise, etc. In most cases, however, the overall performance impact of asynchronous designs is much lower.
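The decoupling can be sketched as a simple producer/consumer pair, where a queue stands in for ETW's internal buffering (all names here are illustrative): the monitored side enqueues events and returns immediately, while any slow analysis happens on a separate thread.

```python
import queue
import threading

events = queue.Queue()   # stands in for the ETW session's internal buffer
detections = []          # what the out-of-band analysis produced

def consumer():
    """Detection thread: drains events and runs the (possibly slow) checks."""
    while True:
        ev = events.get()
        if ev is None:       # sentinel: no more events
            break
        # Expensive work (e.g. path normalization) happens here,
        # off the application's hot path.
        detections.append(ev.upper())

t = threading.Thread(target=consumer)
t.start()
for ev in ("open", "read", "write"):
    events.put(ev)           # returns immediately; the app is never paused
events.put(None)
t.join()
```

The producer's `put` calls complete regardless of how long the consumer takes, which is precisely why detection latency no longer translates into application latency.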
The potential of negative interactions between applications is also reduced. Using the asynchronous approach, the security application can run in a separate context (separate memory space), with all the data exchanges performed over well-defined programming interfaces. This reduces the complexity of interaction with both the monitored app and with other security tools running on the machine, thus decreasing risk.
The improved performance comes at the cost of weaker security guarantees and higher complexity.
Because bad activity is only detected after the fact, asynchronous security tools cannot do true prevention, only remediation. This means that there is a window of time in which the system/data is compromised. Sometimes, that compromise cannot be remediated. For instance, in our USB example, remediation means deleting the sensitive files from the USB storage device once the copy is detected. However, if the device was removed in the interim, remediation is no longer possible.
Another relevant issue of asynchronous mode is that monitored events may arrive out of order. This means that the state machine encoding the business logic needs to account for more transitions and is thus more complex. In the USB file copy scenario, because of the way ETW events are queued per CPU core, it is possible that write events are received before read events, which is unintuitive.
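One common way to cope with this, sketched below under the assumption that each event carries a monotonic sequence number (hypothetical here, e.g. derived from the event timestamp), is a small reordering buffer that restores the original order before events reach the state machine:

```python
import heapq

def reorder(events):
    """Yield (seq, kind) events in sequence order despite arrival order.

    Events are held in a min-heap until the next expected sequence
    number arrives, then released in order.
    """
    heap, next_seq = [], 0
    for seq, kind in events:
        heapq.heappush(heap, (seq, kind))
        while heap and heap[0][0] == next_seq:
            yield heapq.heappop(heap)
            next_seq += 1
```

The cost is extra buffering and latency, which is another way the asynchronous approach trades complexity for performance.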
Many other issues exist with asynchronous approaches, especially if based on delayed analysis of operations by leveraging tracing, logging, and auditing tools – all of these have much lower security boundaries and a weaker threat model than a kernel-based inline approach such as minifilters and the macOS ESF.
Hybrid security checks
Neither inline nor out-of-band security checks are ideal. Each has pros and cons: one favors security over user impact, while the other favors user impact over security.
To get the best of both worlds, it is possible to combine asynchronous detection with synchronous enforcement. An enforcement component sits inline and intercepts all activity, pausing any suspicious activity until an asynchronous detection component decides what to do.
Decoupling the two components allows the hybrid system to be highly tunable. For example, the pausing done by the enforcement component can have a configurable timeout. With a high timeout, the overall system would behave like an inline security system, while with a low timeout it would behave out-of-band. This tuning can be either done by a human operator, or can be automated, using metrics. By having a third component that monitors the responsiveness, the security system can be configured with a certain performance budget, so that it tunes itself to enforce security within acceptable user impact envelopes.
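A minimal sketch of this tunable pause, using a thread-pool future to stand in for the asynchronous detection component (names are illustrative, not a real product API):

```python
import concurrent.futures
import time

ALLOW, BLOCK = "allow", "block"

def inline_hook(async_verdict, timeout, on_timeout=ALLOW):
    """Pause the intercepted operation until the asynchronous detector
    answers, but never longer than `timeout` seconds.

    With a large timeout the system behaves like a purely inline one;
    with a timeout near zero it degenerates to out-of-band monitoring.
    """
    try:
        return async_verdict.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return on_timeout  # the fail-open vs. fail-closed choice is policy too

with concurrent.futures.ThreadPoolExecutor() as pool:
    fast = pool.submit(lambda: BLOCK)                        # quick verdict
    slow = pool.submit(lambda: (time.sleep(0.5), BLOCK)[1])  # late verdict
    fast_decision = inline_hook(fast, timeout=5)     # behaves inline
    slow_decision = inline_hook(slow, timeout=0.05)  # behaves out-of-band
```

A supervising component could adjust `timeout` (and `on_timeout`) at runtime based on observed latency metrics, which is exactly the self-tuning loop described above.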
Hybrid approaches don’t have to be as simplistic as inline blocking coupled with asynchronous processing – in some cases, a hybrid approach can combine inline-only “simple” checks, which are worth doing in the unfriendly environment of the kernel for performance reasons, with out-of-band complex checks performed only if the action isn’t immediately classifiable as unwanted. For example, if at file creation time, based on the extension of the file name alone (which does not require normalization), the policy indicates that the action should be blocked, this can be done right away.
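Such a two-tier check might look like the following sketch, with a hypothetical extension-based policy: cheap tests that are safe to run inline (no path normalization needed) decide immediately, and everything else is deferred to the out-of-band pipeline.

```python
# Example policy: these extensions are blocked outright at file creation.
# The set and the file names are purely illustrative.
BLOCKED_EXTENSIONS = {".key", ".pem"}

def classify_at_create(file_name):
    """Return 'block' for an immediate inline verdict, or 'defer'.

    Only the raw file name is inspected: this is cheap enough to do in
    the constrained kernel environment, inline with the operation.
    """
    dot = file_name.rfind(".")
    ext = file_name[dot:].lower() if dot != -1 else ""
    if ext in BLOCKED_EXTENSIONS:
        return "block"   # decided inline, no pause needed
    return "defer"       # hand off to asynchronous, deeper analysis
```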
Another technique that can be used is to make the as-yet-unclassified operation visible to the user but not yet fully committed, to allow for guaranteed remediation once the asynchronous processing has classified the operation as undesirable. In the USB file copy example, this would mean, at creation time, allowing the file to be created from the operating system’s perspective, but not actually on the USB storage device (say, in memory, or in a protected local storage location). As the writes start happening, continue allowing them, but in this alternative ephemeral location instead. Once the asynchronous policy result has arrived, if the operation is permitted, the security product can then create the true file on the USB disk, transfer the writes performed so far, and direct all future writes to the true file location (this is highly complex, but something that minifilters are able to do). If, on the other hand, the operation is not permitted, then simply fail future writes and destroy the ephemeral copy of the file (which the user doesn’t even know about and could exist only in kernel memory).
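The essence of the idea can be sketched as follows, with an in-memory buffer standing in for the protected staging location (a real implementation would live inside the minifilter, not in user space):

```python
import io

class EphemeralFile:
    """Stages writes until the asynchronous verdict arrives."""

    def __init__(self):
        self._staging = io.BytesIO()  # stand-in for protected staging storage
        self.committed = None         # bytes that actually reached the device

    def write(self, data):
        # From the application's perspective, the write succeeds normally.
        self._staging.write(data)

    def resolve(self, allowed):
        """Apply the asynchronous verdict: commit or discard the staged data."""
        if allowed:
            # Stand-in for creating the true file on the USB disk and
            # transferring the writes performed so far.
            self.committed = self._staging.getvalue()
        self._staging.close()         # either way, drop the ephemeral copy
        return self.committed
```

On a "block" verdict, nothing was ever written to the device, so remediation is guaranteed rather than best-effort.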
What if the user wants to pull out their USB disk? Even without the security product installed, Windows already caches writes on its own, so the user would’ve lost their data regardless. The proper way to remove the storage device is to “eject” it, which causes a flush operation. A minifilter could intercept that, and delay the flush until the asynchronous result finally comes in. Delaying an eject operation, which already takes a lot of time, by a few hundred milliseconds, is not the type of operation that a user would easily notice, nor care about. This touches on a last point and benefit about hybrid approaches – in many cases, the final policy decision can be delayed to a point where the user either won’t notice millisecond-level delays (process creation) and/or won’t mind them (USB device removal), all while collecting data through the inline processing.
In this post we discussed modern approaches to building non-intrusive security software. There are two conceptually different approaches to building such tools. The first is inline, or synchronous: intercept every operation on the system, analyze it, and decide whether to allow or block it. The second is out-of-band, or asynchronous: observe the system from the outside and, if it behaves in a disallowed way, remediate. We presented the advantages and disadvantages of each and described a hybrid approach that maximizes security while minimizing end-user impact.
We described how, through a hybrid architecture, modern security software can enable operators to dynamically tune the tradeoff between security and productivity. By combining granular controls with thorough metrics, security software can become a dynamic tool that adapts to its users’ evolving needs, rather than be perceived as a necessary evil.