I work on Illumio’s Performance Engineering team, which is responsible for analyzing the performance and scale of Illumio’s products. In this blog post, I’ll share some background on the Illumio Virtual Enforcement Node (VEN), and some information on how we designed and implemented our performance measurement system, which tracks the VEN’s impact on a workload.
What Is the VEN?
The Illumio VEN is installed on a managed workload (server) to enforce security policy. It resides in the guest OS, and is supported on a wide variety of operating systems, including most Linux and Windows systems. VENs report status to the Illumio Policy Compute Engine (PCE) and download the PCE-generated policy to apply to the server, using iptables and ipsets for Linux systems or Windows Filtering Platform (WFP) for Windows systems.
The VEN was designed from the start to function as a lightweight embedded system, which means it should never get in the way of the workload doing actual work. It was also designed for reliability and to work in a distributed environment.
The VEN accomplishes these design objectives by:
- Operating quickly and periodically;
- Remaining in the background most of the time; and
- Using configurable operational modes to minimize the impact to workloads.
As my colleague Mukesh put it in an earlier blog post, the VEN behaves like a guest inside the workload.
Designing for Performance Measurement
Our team, working closely with the VEN developers, identified the key performance indicators (KPIs) for monitoring the VEN’s behavior. We also identified the key operational factors that might cause a change in the VEN’s performance.
We designed and implemented a cross-platform metric collection system. Because the VEN runs on both Windows and Linux, KPIs had to be measurable on both platforms. We also developed and implemented an automated alerting system to identify substantial changes in KPIs that indicate a significant change in performance on a VEN.
At the start, we identified the full set of key operational factors that affect VEN behavior. We also studied our own internal enterprise network to create models of key operational factors and their effects on VEN performance. These models helped us identify the typical load on the VEN. Then we designed a load test system that could vary the key operational factors over a wide range, from well below the typical load to well above it.
From this initial study of both the KPIs and the key operational factors, we learned how to properly design a system for automated VEN performance measurement and alerting. Lastly, we designed a system to collect the selected KPIs and use the saved data for trend reporting as well as automated anomaly detection and alerting.
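To make the idea of automated anomaly detection on saved KPI data concrete, here is a minimal sketch in Python. It is not Psylocke's actual method, which this post does not describe. It assumes a simple z-score check of the latest nightly value against recent history, and the function name, threshold, and sample values are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits far from the recent history of a KPI.

    Assumption: a plain z-score test stands in for whatever anomaly
    detection is really used; `z_threshold` is a hypothetical default.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical nightly CPU overhead values, in percent.
recent_cpu_overhead = [2.0, 2.1, 1.9, 2.0, 2.2]
normal_night = is_anomalous(recent_cpu_overhead, 2.1)
regressed_night = is_anomalous(recent_cpu_overhead, 4.0)
```

A check like this only needs the stored time series, so it can run on the same data that feeds trend reporting.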
VEN Automated Performance Testing
Based on our design work, we developed the VEN Performance Testing System, which we call Psylocke. We’ve fully automated Psylocke to run daily tests of the VEN on multiple platforms and to collect data for trend graphing.
Tests run daily for each OS, with Linux and Windows workloads tested in parallel. Psylocke analyzes metrics and sends automated alerts if any metric exceeds its threshold. To accomplish all of this, our performance engineering team has employed a wide variety of technologies, including Jenkins, Docker containers, Graphite (a time-series database), and custom tools written in Ruby, Scala, and R. We use bash scripts on Linux and PowerShell scripts on Windows.
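As an illustration of how a collected KPI could be stored in Graphite, the sketch below formats a sample for Graphite's plaintext protocol, which expects one "path value timestamp" line per data point. The metric path and the helper name are hypothetical, not the names Psylocke uses, and the example is in Python for brevity rather than in one of the languages listed above.

```python
import time
from typing import Optional


def graphite_line(path: str, value: float,
                  timestamp: Optional[int] = None) -> str:
    """Format one sample for Graphite's plaintext protocol.

    The protocol takes "<metric path> <value> <unix timestamp>" followed
    by a newline. The metric path passed in below is a made-up example.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


# Hypothetical metric name for a nightly Linux run.
line = graphite_line("psylocke.linux.cpu_overhead_pct", 2.0, 1461086880)
```

The resulting line would then be written to the Carbon plaintext listener (port 2003 by default) over a TCP socket.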
We implemented Psylocke incrementally. Initially, the KPIs we chose to collect for the nightly automated tests were the CPU overhead, memory overhead, and network overhead. We also intentionally chose not to include some KPIs, either because they weren’t necessary for nightly tests or because they were covered by other test frameworks, such as unit tests, system integration tests, and feature/functional tests. Initially, we added only a few key operational factors, but we left room for adding more factors in the future.
Typically, with Psylocke, Linux and Windows VENs are tested every night using the latest development version of VEN code. Each OS's test run takes over seven hours. First, using a clean Docker image, we measure the baseline performance (CPU, memory, network I/O) without installing the VEN. We then subject each VEN to all possible configuration choices and run multiple load tests with different key operational factors for each configuration. For each test run against a configuration, the selected KPIs/metrics are recorded.
To be absolutely certain we understand the impact of each change, we change only one of the key operational factors from one test to the next. Load tests are selected carefully to simulate different types of load that a workload may experience in a real network. For example, an Internet-facing workload may have a connection rate of around 10K connections/second. In this case, we chose to subject the VEN to loads of 1K connections/second, 5K connections/second, 10K connections/second, and 15K connections/second.
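The one-factor-at-a-time approach can be sketched as a small test-plan generator. This is an illustrative Python sketch, not Psylocke's code. Only the connection rates come from this post. The "policy_rules" factor, its levels, and all names are hypothetical placeholders for the other key operational factors.

```python
from typing import Dict, List

# Typical load for an Internet-facing workload. "policy_rules" is a
# hypothetical second factor added purely for illustration.
TYPICAL_LOAD: Dict[str, int] = {
    "connections_per_sec": 10_000,
    "policy_rules": 500,
}

# Levels to try per factor, from well below typical to well above it.
LEVELS: Dict[str, List[int]] = {
    "connections_per_sec": [1_000, 5_000, 10_000, 15_000],
    "policy_rules": [100, 500, 2_000],
}


def one_factor_at_a_time(typical: Dict[str, int],
                         levels: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Build test cases that differ from the typical load in one factor."""
    plan = [dict(typical)]  # the typical-load case itself
    for factor, values in levels.items():
        for value in values:
            if value == typical[factor]:
                continue  # already covered by the typical-load case
            case = dict(typical)
            case[factor] = value
            plan.append(case)
    return plan


plan = one_factor_at_a_time(TYPICAL_LOAD, LEVELS)
```

Because every case differs from the typical load in at most one factor, a change in a KPI can be attributed to that factor alone.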
We compare the measured KPIs against the baseline KPIs and the difference is reported as overhead. Alerts are raised when KPIs exceed thresholds. Cumulative metrics are charted for trend analysis and reporting. When performance impacts are noticed, we have internal engineering processes to track the issue, triage it to assess the performance impact, and produce fixes.
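The overhead calculation and threshold check described above can be sketched as follows. This is a minimal Python illustration under stated assumptions. The KPI names, sample values, and thresholds are invented for the example and are not measurements or limits from Psylocke.

```python
from typing import Dict, List


def compute_overhead(baseline: Dict[str, float],
                     measured: Dict[str, float]) -> Dict[str, float]:
    """Overhead per KPI: the measurement with the VEN minus the baseline."""
    return {kpi: measured[kpi] - baseline[kpi] for kpi in baseline}


def kpis_over_threshold(overhead: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return the KPIs whose overhead exceeds its configured threshold."""
    return sorted(
        kpi for kpi, value in overhead.items()
        if kpi in thresholds and value > thresholds[kpi]
    )


# All numbers below are made up for illustration.
baseline = {"cpu_pct": 20.0, "memory_mb": 1000.0}    # clean image, no VEN
with_ven = {"cpu_pct": 22.0, "memory_mb": 1045.0}    # same load, VEN installed
thresholds = {"cpu_pct": 5.0, "memory_mb": 40.0}     # hypothetical limits

overhead = compute_overhead(baseline, with_ven)
alerts = kpis_over_threshold(overhead, thresholds)
```

With these sample numbers, the CPU overhead stays under its limit while the memory overhead crosses it, so only the memory KPI would raise an alert.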
With automated tests running nightly, automated alerting, and well-honed processes to track fixes, we have achieved a very short turn-around time from observing a performance issue to validating the fix.
Sample of Test Results from Psylocke
On Apr 19, 2016, 5:28:00 PM, a machine with Ubuntu 12.04 64-bit OS running on four Intel Xeon CPUs (E5-2670 v3 @ 2.30GHz) and 16GB memory was tested with Psylocke.
Before it was managed by Illumio, this system was subjected to an average load of 10,000 TCP network connections per second over a period of five minutes. During the test run, the CPU usage, available memory (unused memory), network throughput, and actual connections per second were collected using the Linux collectd tool. This system was then paired with a PCE and put in Illuminated mode. The same load was applied and the same data collected. The test was then repeated in Enforced mode.
In these results, for a load of around 10,000 connections per second, the CPU overhead introduced by the Illumio VEN is ~2% and the memory utilization overhead is 37–58MB.
Keeping the VEN Light
Using Psylocke, our in-house tools, and a robust analytics methodology, the Illumio Performance Team ensures the VEN remains a light, yet powerful tool to safeguard your computing environment.