A few years ago, the Illumio product team embarked on an ambitious project: reimagining Illumio’s event data so that it would be easy to use, maintain, and process. With the benefit of hindsight, I can say that we succeeded in delivering something genuinely useful and multi-purpose. We replaced the existing auditable events feature with a “Swiss Army knife” of data: useful, multi-faceted, robust, and handy. The capability evolved over time and faced many challenges, and while there are still ways it can be enhanced, it has already unlocked value for Illumio customers. In this post, I will describe the product value of redesigning Illumio’s events framework, its use cases, and the process lessons we learned along the way.
To explain the product value we created, let me start by describing events. Simply put, events are records of activity that occurred on the system. The activity can be of many types, such as changes to the system configuration or security policy. An event can also record notable activity on the workloads Illumio protects, including tampering with a workload’s firewall.
Because auditors may use this data, it is subject to auditing standards. Every recorded change includes a reference to which user or program made the change, what was changed, when it was changed (i.e., a timestamp), and where it occurred. Industry literature refers to this as who/what/when/where. Illumio events do not just meet those industry audit logging standards; they go well beyond them. In addition to the basic who/what/when/where, events also include notifications and the actual before and after values. Where applicable, events also include the status of the action, so a failed attempt to complete an action is recorded too. To meet the immutability requirements of audit data, events are read-only once recorded. Events also comply with the auditing requirements of Common Criteria (class “FAU: Security Audit”). Finally, the data can be presented in multiple formats, including JSON, LEEF, and CEF.
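To make the who/what/when/where structure concrete, here is a minimal Python sketch of what such an event record could look like. The class name `AuditEvent`, its field names, and the event type strings are hypothetical and do not reflect Illumio’s actual schema; a frozen dataclass with read-only before/after views stands in for the “read-only once recorded” property.

```python
import json
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit event: who/what/when/where plus before/after values and status."""

    who: str     # user or program that made the change
    what: str    # event type (illustrative names such as "workload.update")
    when: str    # ISO 8601 timestamp
    where: str   # identifier of the affected resource
    status: str  # "success" or "failure", so failed attempts are recorded too
    before: Optional[Mapping[str, Any]] = None  # value(s) before the change
    after: Optional[Mapping[str, Any]] = None   # value(s) after the change

    def __post_init__(self) -> None:
        # frozen=True blocks attribute reassignment; also wrap the nested
        # before/after values in read-only views (of a shallow copy).
        for name in ("before", "after"):
            value = getattr(self, name)
            if value is not None:
                object.__setattr__(self, name, MappingProxyType(dict(value)))

    def to_json(self) -> str:
        """Serialize the event as a JSON object with sorted keys."""
        return json.dumps(
            {
                "who": self.who,
                "what": self.what,
                "when": self.when,
                "where": self.where,
                "status": self.status,
                "before": dict(self.before) if self.before is not None else None,
                "after": dict(self.after) if self.after is not None else None,
            },
            sort_keys=True,
        )


event = AuditEvent(
    who="user:alice@example.com",
    what="workload.update",
    when="2024-01-01T00:00:00+00:00",
    where="workload:web-01",
    status="success",
    before={"enforcement": "enforced"},
    after={"enforcement": "non_blocking"},
)
print(event.to_json())
```

Assigning `event.who = "mallory"` raises `FrozenInstanceError`, and `event.before["enforcement"] = "x"` raises `TypeError`. This only protects the in-memory object; real audit immutability has to be enforced by the storage layer.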
The highlights of the feature are that it:
- Exceeds industry standards
- Provides complete and comprehensive coverage
  - Every API call generates an event
  - There is no duplication
- Is accessible via UI, API, or Syslog
- Is structured with a common schema
- Is available in JSON, CEF, or LEEF formats
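As an illustration of one of those output formats, the sketch below renders an event (a plain dictionary with hypothetical `who`/`what`/`where`/`status` keys) as a CEF line. The header layout and escaping rules follow the published CEF format, but the choice of extension keys (`suser`, `act`, `outcome`, `cs1`), the severity mapping, and the vendor/product values are assumptions for illustration, not Illumio’s actual mapping.

```python
from typing import Mapping


def cef_escape_header(value: str) -> str:
    # In CEF header fields, backslashes and pipes must be escaped.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def cef_escape_extension(value: str) -> str:
    # In extension values, backslashes and equals signs are escaped,
    # and line breaks are encoded as a literal \n or \r.
    return (
        value.replace("\\", "\\\\")
        .replace("=", "\\=")
        .replace("\r\n", "\\n")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_cef(event: Mapping[str, str], vendor: str, product: str, version: str) -> str:
    """Render a hypothetical event dict as a single CEF:0 line."""
    # Arbitrary illustrative severity mapping (CEF severity ranges 0-10).
    severity = "5" if event["status"] == "failure" else "1"
    # Header: vendor|product|version|event class ID|name|severity
    header_fields = [vendor, product, version, event["what"], event["what"], severity]
    header = "CEF:0|" + "|".join(cef_escape_header(f) for f in header_fields)
    extension = {
        "suser": event["who"],       # source user name
        "act": event["what"],        # device action
        "outcome": event["status"],  # success or failure
        "cs1Label": "resource",      # label for the custom string field
        "cs1": event["where"],       # affected resource
    }
    ext = " ".join(f"{k}={cef_escape_extension(v)}" for k, v in extension.items())
    return f"{header}|{ext}"


print(to_cef(
    {"who": "user:alice", "what": "workload.update", "where": "workload:web-01", "status": "success"},
    vendor="ExampleVendor", product="ExampleProduct", version="1.0",
))
```

Because the same dictionary feeds both the header and the extension, a common schema makes adding another output format mostly a matter of writing another small renderer.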
Auditors, as well as Governance, Risk & Compliance (GRC) teams, use Illumio events as evidence during an audit or review. That evidence can show, for example, that scoped resources are managed according to policy and regulatory standards. Auditors look for data that tracks a resource through creation, modification, and eventual deletion. They also need reports on user access to systems, the roles associated with those users, and changes to those roles. In certain conditions, they may also need evidence of data collection that was interrupted, unavailable, or lost in any way. The scope of an audit is generally restricted to certain workloads, for example the cardholder data environment and its connected systems in a PCI audit. Events can be filtered down to only the records related to the workloads in that scope.
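A minimal Python sketch of that scoping idea follows: it keeps only events whose affected resource is in the audit scope and orders each resource’s events chronologically, so an auditor can follow it from creation through deletion. The event keys and the `wl-*` identifiers are hypothetical, and the sketch assumes timestamps are ISO 8601 strings in one consistent format and time zone, so they sort chronologically as plain strings.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping, Set, Tuple

Entry = Tuple[str, str, str]  # (when, what, who)


def scoped_lifecycles(
    events: Iterable[Mapping[str, Any]], scope: Set[str]
) -> Dict[str, List[Entry]]:
    """Group in-scope events by resource, each group in chronological order."""
    timeline: Dict[str, List[Entry]] = defaultdict(list)
    for e in events:
        if e["where"] in scope:  # drop anything outside the audit scope
            timeline[e["where"]].append((e["when"], e["what"], e["who"]))
    for entries in timeline.values():
        entries.sort()  # tuples sort by timestamp first
    return dict(timeline)


events = [
    {"who": "bob", "what": "workload.create", "when": "2024-01-01T09:00:00+00:00", "where": "wl-1"},
    {"who": "alice", "what": "workload.update", "when": "2024-02-01T09:00:00+00:00", "where": "wl-1"},
    {"who": "carol", "what": "workload.create", "when": "2024-01-05T09:00:00+00:00", "where": "wl-9"},
]
pci_scope = {"wl-1"}  # e.g. workloads in the cardholder data environment
for resource, entries in scoped_lifecycles(events, pci_scope).items():
    for when, what, who in entries:
        print(resource, when, what, who)
```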
Security operations center (SOC) and IT operations teams use Illumio events to trace changes to resources. For example, they might look for the records leading up to an intended change that failed; the associated event could show that the user attempted an action for which they had insufficient permissions. Events enable additional SOC use cases, including:
- Change monitoring – validate that only authorized changes are executed on the system. Events record what was changed and by whom; comparing these records with the authorized changes surfaces discrepancies that can then be investigated.
- Security monitoring – changes in security posture, such as a workload being moved from an enforced state to a non-blocking state or a workload being unpaired, can be investigated and followed up with remediation. Periodic reports can also surface security posture changes and deviations from policy for investigation.
- Policy violations – a recorded failure can stem from an authentication or authorization problem. Events help identify similar failure patterns and the sources of failures, and they support forensics.
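The three SOC use cases above can each be sketched in a few lines of Python. Everything here is hypothetical: events are plain dictionaries, the `approved` set stands in for an export from a change-management system, and the `enforcement` field with its `"enforced"` value is an illustrative name rather than Illumio’s actual schema.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Set, Tuple

Event = Mapping[str, Any]


def unauthorized_changes(
    events: Iterable[Event], approved: Set[Tuple[str, str, str]]
) -> List[Event]:
    """Change monitoring: successful changes whose (who, what, where) was not approved."""
    return [
        e for e in events
        if e["status"] == "success" and (e["who"], e["what"], e["where"]) not in approved
    ]


def posture_regressions(events: Iterable[Event]) -> List[Event]:
    """Security monitoring: changes that moved a workload out of the enforced state."""
    flagged = []
    for e in events:
        before = e.get("before") or {}  # before/after values recorded on the event
        after = e.get("after") or {}
        if before.get("enforcement") == "enforced" and after.get("enforcement") not in (None, "enforced"):
            flagged.append(e)
    return flagged


def failure_patterns(events: Iterable[Event]) -> Counter:
    """Policy violations: count failed actions per (who, what) to spot repeats."""
    return Counter((e["who"], e["what"]) for e in events if e["status"] == "failure")
```

Each function consumes its input once, so pass a list rather than a generator if you run more than one check over the same events.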
Operational teams’ use cases include troubleshooting and deployment monitoring. An incorrect policy deployment is one possible reason an application gets blocked. The initial problem report might simply say that an application does not work as expected. The operations team can then use events to check whether there was a recent change to the security policies, when those policies were deployed to the impacted workloads, and whether the deployment succeeded. Events can also be used to rule out the Illumio policies as the cause, leading the team to investigate other causes such as the network or firewalls.
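That troubleshooting workflow can be sketched as a query over recent events. This is a minimal Python sketch under assumed conventions: `policy.change` and `policy.deploy` are hypothetical event types, deployment events are assumed to carry a `targets` list of workload identifiers, and timestamps are ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List, Mapping

Event = Mapping[str, Any]


def triage(
    events: Iterable[Event],
    workload: str,
    incident_time: datetime,
    lookback: timedelta = timedelta(hours=24),
) -> Dict[str, List[Event]]:
    """Collect policy changes and deployments to `workload` in the lookback window."""
    start = incident_time - lookback
    in_window = sorted(
        (e for e in events if start <= datetime.fromisoformat(e["when"]) <= incident_time),
        key=lambda e: datetime.fromisoformat(e["when"]),
    )
    changes = [e for e in in_window if e["what"] == "policy.change"]
    deploys = [
        e for e in in_window
        if e["what"] == "policy.deploy" and workload in e.get("targets", ())
    ]
    failed = [e for e in deploys if e["status"] == "failure"]
    return {"changes": changes, "deploys": deploys, "failed_deploys": failed}


def policy_ruled_out(report: Dict[str, List[Event]]) -> bool:
    """Heuristic: no policy deployment reached the workload recently, so look elsewhere."""
    return not report["deploys"]
```

A failed deployment in `failed_deploys` points back at the policy rollout; an empty `deploys` list suggests moving on to the network or other firewalls.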
Here are some process lessons learned about incrementally delivering a new foundational platform feature that replaces an existing foundational feature while providing a smooth transition for the users.
- Get early feedback from customers on the new feature using slides or a working demo.
- Ship a minimum viable product (MVP), in which the new feature exists but is hidden and disabled by default. The new feature is carefully enabled for select customers under Illumio’s guidance.
- Release an update that includes both the old and the new feature, selectable with a feature flag that defaults to the old feature. Customers can opt in to the new feature if they want. This is the beta stage.
- Ship the next release with the feature flag flipped so that the new feature is the default. The old feature is still available if needed. This release makes the new feature generally available (GA).
- Expect some issues to surface in the GA release, because this is when most customers start using the new feature. Addressing them may require shipping patches, software updates, or documentation updates.
- Finally, ship a release that removes the old feature completely.
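The staged rollout above boils down to a feature flag whose default flips between releases. Here is a minimal Python sketch of that logic; the `Stage` names, the `select_events_pipeline` function, and the `"legacy"`/`"new"` labels are hypothetical and only model the sequence described in the list.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    MVP = "mvp"          # new feature shipped but hidden and disabled by default
    BETA = "beta"        # both features shipped; flag defaults to the old one
    GA = "ga"            # flag defaults to the new one; old one still selectable
    REMOVED = "removed"  # old feature no longer ships


# Default value of the hypothetical "use new events" flag at each stage.
DEFAULT_USE_NEW = {Stage.MVP: False, Stage.BETA: False, Stage.GA: True}


def flag_visible_to_customers(stage: Stage) -> bool:
    """In the MVP stage the flag is hidden and only enabled under vendor guidance."""
    return stage in (Stage.BETA, Stage.GA)


def select_events_pipeline(stage: Stage, use_new: Optional[bool] = None) -> str:
    """Return which pipeline serves events; an explicit flag value overrides the default."""
    if stage is Stage.REMOVED:
        return "new"  # the legacy path is gone, so the flag is ignored
    choice = DEFAULT_USE_NEW[stage] if use_new is None else use_new
    return "new" if choice else "legacy"


for stage in Stage:
    print(stage.value, select_events_pipeline(stage), flag_visible_to_customers(stage))
```

Keeping the old path selectable through GA is what gives customers a fallback while GA issues are being fixed; removing it is deliberately a separate, later release.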
So that’s how we took a core capability of Illumio Core (excuse the pun) and rewrote it from the ground up to provide an event reporting platform that gives customers the fidelity they need (and then some) to support their various use cases.