How to Detect Configuration Drift
by Scott Reece, on Jul 17, 2019 9:05:00 AM
Ideally, server configurations would be uniform across every environment (Dev/QA/Staging/Prod). However, the reality of enterprise business is that, as new customer-facing and internal features are introduced, application owners constantly change hardware and software infrastructure. The result is non-uniform infrastructure across environments. Each individual difference is known as a “Configuration Gap,” and the accumulation of gaps over time is known as “Configuration Drift.”
As the Gaps between the environments grow, the risk of serious problems increases. Left unchecked, Production and Recovery configurations can become so different that major failures in high-availability (HA) or disaster recovery systems become nearly inevitable.
Enterprise businesses need a plan in place to identify Configuration Gaps and to remediate Configuration Drift on an ongoing basis.
Why Does Configuration Drift Happen?
Configuration Drift is unavoidable in a modern enterprise organization; there are always valid reasons for a server configuration to change. Best practice dictates that every change goes through an approval process and is documented, so that every change is known and approved. This is even more important for complex and dispersed systems.
Unfortunately, reality does not match the ideal: the same size and complexity that make these practices important also make it nearly impossible for organizations to follow them 100% of the time. As a result, many legitimate and unavoidable business demands create opportunities for Configuration Drift to occur.
Example Situations That Would Lead to Configuration Drift:
- Critical Package Updates - Business-critical applications are often built on a framework of smaller packages, which may be created in-house or by a third party. If a package that is relied on is found to have a security vulnerability, someone may immediately update to a different package or version without going through the standard procedure.
- Ease of Access Testing Servers - While debugging an application, a developer may manually change a configuration on a testing server to help track a bug or further document its causes. The change may help define the issue, but if the developer doesn’t change the configuration back, it will create hard-to-find errors in future tests.
- Peak Time Configurations - A team often knows when it will experience peak load and may want to add resources or make server configurations more robust during those times (Cyber Monday shopping is one example). If these adjustments aren’t part of a documented, planned change, they may stay in place past the need and cost the organization money, or they may not be implemented again when needed in the future.
Why Is Configuration Management Important?
The solution to the Configuration Drift problem is Configuration Management. Knowing what is changing and why helps to eliminate Drift, and the practices, policies, and tools involved help an organization in a number of other ways as well:
- Lower Costs - Having a clear picture of your entire IT infrastructure allows you to spot duplication or over-provisioning, so you can reduce the total amount of infrastructure needed.
- Higher Productivity - Creating consistent and known configuration clusters allows for batch infrastructure creation and management. Further, limiting unique (or snowflake) servers reduces the need to manage individual configurations manually.
- Faster Debugging - Having consistent configurations allows debugging teams to rule out configuration errors. Because they don’t need to search for configuration differences between servers, server clusters, or environments, teams can focus on other possible causes, which leads to faster resolution of tickets.
What Problems Does Ignoring Configuration Drift Cause?
Managing Configuration Drift isn’t only about gaining efficiency. On the flip side, ignoring the need to manage it properly can create a perfect-storm scenario that can paralyze an entire organization. Infrastructure is the lifeblood of enterprise organizations, and not paying attention to issues with it can (and does) lead to disasters such as:
- Security Breaches - Unknown and undocumented changes can open up internal infrastructure to the outside. Changes could be direct, like opening a port for outside access, or indirect, like installing software that is flawed or, worse, has malware or ransomware embedded.
- Poor Experiences for Your Users - Drift that goes undetected can degrade the services your users rely on. Fast detection and remediation minimize any negative effect and can be the difference between a slight inconvenience and angry users.
- Service Outages - If you ignore configuration drift, it often isn’t one thing that brings down a system; rather, it’s many things. Finding the last change may be easy, but detecting and unwinding all of the previous (undocumented) changes will take a lot of time. Even after finding all of the changes and learning why they were made, more time will be needed to decide what the new configuration should be.
- Lack of Visibility into the Impact of Changes - Changes to a system affect each other and become intertwined over time, so individual changes can affect the system in unexpected ways. Without documentation and configuration management, these changes and their interdependence aren’t noticed until something fails.
How Can You Detect Configuration Drift?
Once you understand the havoc it can wreak, it’s clear why detecting Configuration Drift should be a top priority. The first step in that process is knowing which changes to keep and why each one was introduced.
Decide What to Look for and How Often to Look
Triage your organization by defining which parts are critical to the organization as a whole and which are critical to each business unit.
This will vary from unit to unit: it may be expansive in highly regulated industries, or it may focus only on a narrow set of system-critical files and applications. Both the frequency and the rigor of monitoring will be dictated by how important the system is.
Set a Baseline for What Constitutes a “Gap”
The nature of different environments means there will always be differences between a production environment and testing stages. Defining what each stage should be and also defining acceptable types of differences creates the baseline to monitor for drift.
It may be appropriate for early testing stages to have a higher drift allowance than a User Acceptance Testing environment, and for production to allow zero drift.
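As a rough illustration of a per-stage drift allowance, the sketch below counts the settings that differ from a baseline and checks the count against a limit for each environment. The environment names, the limits, and the function names are all hypothetical; a real baseline would likely distinguish acceptable types of differences rather than just count them.

```python
from typing import Dict

# Hypothetical allowances: the maximum number of settings that may differ
# from the baseline before an environment is considered drifted.
DRIFT_ALLOWANCE: Dict[str, int] = {"dev": 5, "uat": 1, "prod": 0}

# Sentinel so that a missing key is never confused with a real value.
_MISSING = object()


def count_gaps(baseline: Dict[str, str], actual: Dict[str, str]) -> int:
    """Count settings that are missing, unexpected, or have a different value."""
    keys = set(baseline) | set(actual)
    return sum(
        1
        for key in keys
        if baseline.get(key, _MISSING) != actual.get(key, _MISSING)
    )


def exceeds_allowance(env: str, baseline: Dict[str, str], actual: Dict[str, str]) -> bool:
    """Return True when the number of gaps is above the environment's allowance."""
    return count_gaps(baseline, actual) > DRIFT_ALLOWANCE[env]
```

With these example limits, a single changed setting would be tolerated in "dev" and "uat" but would count as drift in "prod".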
Monitor Your System Accordingly
Monitoring will depend on an organization’s maturity, current systems, tooling, the total number of configurations that need checking, and how rigorous that checking needs to be. Monitoring may be unique to each unit within an organization depending on needs and compliance.
How Can You Monitor/Prevent Configuration Drift?
Once a baseline of configurations and acceptable gaps has been documented, monitoring must be implemented to ensure that infrastructure is maintained in its desired configuration. Without a monitoring plan, creating configuration plans and documentation is wasted effort.
There are several different methodologies that can be used to monitor for Configuration Drift, and many enterprises will use a combination of methodologies and tools determined by maturity and compliance needs.
Continuous Manual Monitoring
Individual machine configurations can be checked manually and compared to an expected configuration file. This process is costly in personnel hours and is still error-prone because of the human factor. It is really only appropriate on a small scale, whether for an organization with a very small infrastructure footprint or for a few unique server clusters.
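Even a manual check is easier with a small script. The following is a minimal sketch that assumes configurations are simple "key = value" text files; the file format and the function names are illustrative assumptions, not part of any particular tool.

```python
from typing import Dict, Optional, Tuple


def parse_config(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def find_gaps(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing key to (expected value, actual value).

    None on either side means the key is absent from that configuration.
    """
    gaps: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for key in sorted(set(expected) | set(actual)):
        want, have = expected.get(key), actual.get(key)
        if want != have:
            gaps[key] = (want, have)
    return gaps
```

Each entry in the result is one Configuration Gap: a setting that was changed, removed, or added relative to the expected file.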
Configuration Audits involve a team manually checking server configurations and comparing them to a predefined configuration. These audits can be costly because they require specialized knowledge to understand how a system should be configured, and then to trace any undocumented change back to find out why it was made and whether it should be kept. Finally, the audit team updates the existing configuration documents that will be used at the next audit. Because of the time and expense involved, audits are generally reserved for high-value or compliance-heavy clusters and are performed on a recurring basis, usually several times a year.
Auditing does ensure regular and repeatable server configuration on a known timeline. However, configurations will keep drifting further from the baseline until the next audit.
Real-time Automated Monitoring
Automated real-time monitoring is the most mature method of maintaining configurations in their desired state. It involves using dedicated server configuration tools to define servers, or groups of servers, along with a definition of how they are supposed to be configured. A lightweight agent then monitors how each server in a group is actually configured and compares the actual configuration to its definition.
This automated method raises an alert on drift almost immediately and generally offers several options for addressing it.
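The comparison step can be pictured as follows. This is only a sketch of the general idea, not how any specific tool implements it: it assumes each agent reports its server's configuration as a JSON-serializable dictionary, and it compares a hash of that report to a hash of the group definition. The server names and settings in the usage note are made up.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(config: Dict[str, Any]) -> str:
    """Hash a configuration in canonical form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_servers(
    definition: Dict[str, Any], reported: Dict[str, Dict[str, Any]]
) -> List[str]:
    """Return the names of servers whose reported configuration differs
    from the group definition, in sorted order."""
    want = fingerprint(definition)
    return sorted(
        name for name, config in reported.items() if fingerprint(config) != want
    )
```

For example, given a definition of `{"nginx": "1.24", "port": 443}`, a server reporting `{"port": 443, "nginx": "1.24"}` matches, while one reporting port 8080 is flagged. A fingerprint only says that something differs; identifying which setting changed still requires a field-by-field comparison.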
Using Otter to Monitor and Prevent Configuration Drift
One of the best tools for detecting, monitoring for, and ultimately preventing configuration drift is Otter. Otter is a real-time automated monitoring system that provides flexibility, scalability, and automatic drift monitoring.
When a server is defined and created in Otter, the tool will continually check that server’s configuration and automatically compare it to the definition that was created and saved in Otter. If the configuration varies from the definition, Otter will immediately take action by either:
- Notifying team members that drift has occurred. Otter will highlight which server groups or individual servers have been changed and will indicate what the change is.
- Automatically fixing the drift. Otter can re-run configuration plans on any drifted machines to bring them in line with their expected and pre-approved configurations.
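The choice between the two responses above can be thought of as a policy applied whenever drift is detected. The sketch below is a generic illustration and does not use Otter's actual API; `DriftPolicy`, `handle_drift`, and the `notify` callback are hypothetical names introduced for this example.

```python
from enum import Enum
from typing import Callable, Dict


class DriftPolicy(Enum):
    NOTIFY = "notify"        # report the drift and leave the server as it is
    REMEDIATE = "remediate"  # restore the defined configuration


def handle_drift(
    server: str,
    definition: Dict[str, str],
    actual: Dict[str, str],
    policy: DriftPolicy,
    notify: Callable[[str], None],
) -> Dict[str, str]:
    """Return the configuration the server should have after handling drift."""
    changed = sorted(
        key
        for key in set(definition) | set(actual)
        if definition.get(key) != actual.get(key)
    )
    if not changed:
        return actual  # no drift, nothing to do
    if policy is DriftPolicy.REMEDIATE:
        notify(f"{server}: remediated drift in {', '.join(changed)}")
        return dict(definition)
    notify(f"{server}: drift detected in {', '.join(changed)}")
    return actual
```

In this sketch, remediation simply returns a copy of the definition; in practice it would mean re-running the configuration plan on the drifted machine.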
Once a configuration has been created in Otter, it can be applied to any number of servers, all of which are then monitored for drift, ensuring that infrastructure is maintained in a known state. Furthermore, needed changes to configurations can be applied automatically by updating configuration definitions in a single location, rather than on each individual instance.
Each change and drift instance is logged automatically by Otter, ensuring that any auditing needed can be done as quickly as possible by going through past logs to find past changes and who implemented them.
Are you ready to take on Configuration Drift and all the other DevOps challenges holding your organization back? Inedo’s DevOps tools maximize developer time, minimize release risk, and empower stakeholders to bring their vision to life faster. All with the people and technology you have right now. To get help streamlining your CI/CD processes, contact firstname.lastname@example.org.