When I joined NewsCred over three years ago, the company had already decided on Puppet as the company-wide configuration management software. And it made sense – it was proven technology that had been around for years (going on ten). Our implementation at the time was a bit unconventional. Rather than using a “puppet master”, we opted for an agentless implementation, whereby regular deployments to our stack would run “puppet apply” remotely to apply the manifests (configurations). Using Fabric, the infrastructure team wrote Python scripts that executed the Puppet commands in parallel across our servers. It worked, and within months our 200 AWS servers were under configuration management.
About two years ago, Ansible emerged as the new player among configuration management tools, gaining widespread attention on Hacker News. Boasting an agentless architecture, it rose to popularity for its minimalism, reliability, consistency, and ease of use. It piqued interest among our growing R&D team, particularly as we adopted a DevOps culture and sought to involve the developers in the work managed by the infrastructure team. So we decided, as a team, to migrate our entire stack from Puppet to Ansible.
But how do you migrate such a complicated system from one configuration management system to another? Christian Zunker explained one approach where Ansible executes the Puppet commands as part of its run. Given all of the tools we had built in Python around Fabric and Boto, we went in a different direction, outlined below:
Retain the Interface
NewsCred has a strong DevOps culture in which all squads contribute to the main configuration management repository. Rather than a single operations team owning configurations for the entire system (a production stack spanning more than 200 cloud servers), each squad follows agile and DevOps practices and owns the configurations for its underlying servers. When we started using Puppet, the fragility of configurations written in its esoteric, Ruby-like language became apparent pretty quickly. At the time we were primarily a Python shop, and coupled with our decision to implement Puppet agentlessly, we settled on a wrapper script that rsync’d our configuration codebase, SSHed into individual servers, and ran “puppet apply” locally.
With so much going on in the background, we intentionally kept the interface simple for developers (and eventually for the Jenkins jobs that run the deployments): the user simply provides the fabfile with “hosts” and an action (e.g. “deploy”). This proved incredibly useful during our experimental and transition phases, since developers provided the same input whether the scripts ran local Puppet commands or remote Ansible commands under the hood. Even today, we continue to use Fabric to generate the Ansible commands, as it gives us the flexibility to wrap other Python logic around the Ansible runs, such as profiling.
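To make the idea concrete, here is a stdlib-only sketch of that stable interface; the real implementation used Fabric to rsync the codebase and run these commands over SSH in parallel, and all paths, playbook names, and helper names below are illustrative assumptions, not NewsCred’s actual code:

```python
# Sketch of the developer-facing wrapper: the caller always supplies
# hosts and an action, and the backend (Puppet or Ansible) is an
# implementation detail. Paths and file names are hypothetical.

def build_command(tool, role, host):
    """Build the command to configure one host with the chosen backend."""
    if tool == "puppet":
        # Agentless Puppet: the codebase is rsync'd first, then applied locally.
        return "sudo puppet apply /opt/config/manifests/{role}.pp".format(role=role)
    elif tool == "ansible":
        # Ansible runs remotely from the control machine instead.
        return ("ansible-playbook playbooks/{role}.yml "
                "--limit {host}".format(role=role, host=host))
    raise ValueError("unknown backend: %r" % tool)


def deploy(hosts, role, tool="puppet"):
    """The 'deploy' action: same inputs for developers regardless of backend.

    Returns the per-host commands that the Fabric layer would execute
    over SSH in parallel.
    """
    return {host: build_command(tool, role, host) for host in hosts}
```

Because the caller never sees which backend ran, swapping `tool="puppet"` for `tool="ansible"` during the transition required no change to how developers (or Jenkins) invoked the deployment.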
Leverage AWS Tags
Since we care about our entire production stack in Amazon Web Services being under configuration management, we didn’t want both Puppet and Ansible attempting to control the same system, each potentially undoing the other’s actions. To prevent this, we added a “configuration_management” tag to all of our servers, set to “puppet” or “ansible”, with the option to override the behavior from the command line. Depending on the value, the wrapper would run either the Puppet manifest or the Ansible playbook for the server’s designated AWS role (another tag) – not to be confused with an Ansible “role”. The assumption here is that a given server is completely managed by either Puppet or Ansible, which required us to duplicate Puppet logic in Ansible (and run the risk of inconsistency). However, our exhaustive suite of unit and application tests made it relatively easy to verify that a “puppetized” and an “ansiblized” server were equivalent for a given role. Furthermore, since both configurations were idempotent, we could also verify that an originally puppetized server had no changes made when subsequently running Ansible (and vice versa).
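The dispatch logic might look something like the following sketch. The tag structure mirrors the `{"Key": ..., "Value": ...}` dicts that EC2’s describe-instances API returns (e.g. via Boto); the tag names come from the article, but the helper itself and the “role” tag name are assumptions for illustration:

```python
def choose_tool(tags, override=None):
    """Pick the configuration backend for a server from its AWS tags.

    `tags` is a list of {"Key": ..., "Value": ...} dicts as returned by
    the EC2 API; `override` mimics the command-line override mentioned
    above. Returns (tool, aws_role) -- the AWS "role" tag, not to be
    confused with an Ansible role.
    """
    lookup = {tag["Key"]: tag["Value"] for tag in tags}
    aws_role = lookup.get("role")  # hypothetical name for the role tag
    tool = override or lookup.get("configuration_management")
    if tool not in ("puppet", "ansible"):
        # Refuse to act rather than let both tools touch the same server.
        raise ValueError("no valid configuration_management tag: %r" % tool)
    return tool, aws_role
```

The key property is that exactly one backend is ever selected per server, so Puppet and Ansible can never fight over the same machine.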
Separate the Bootstrap
With nearly 100 AWS roles managed by Puppet, we didn’t expect it to be an easy feat to migrate every server configuration from one to the other. However, we saw an opportunity to carve out and isolate the resources shared across servers (including tools, package repository references, and user management), separating them from the functional configuration of each server. We refer to this shared, bootstrapped configuration as “standard tools”, and we took care to ensure no functional configurations had dependencies on the contents or variables used in standard-tools.
Taking the time to separate the bootstrapping (and its associated dependencies) from the role-specific configuration allowed us to continue managing the tools in Puppet across all servers while we introduced new Ansible roles and playbooks. This kept the base configuration consistent without duplication between Puppet’s Ruby templates and Ansible’s Jinja2 templates. Eventually, we reached a critical mass of ansiblized servers and switched standard-tools over to being managed by Ansible. The shrinking minority of unmigrated servers then ran Puppet after Ansible, which eased us into a completely ansiblized stack.
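In Ansible terms, the split described above might be sketched as two independent playbooks – one for the shared bootstrap and one per functional role. The file and role names here are illustrative, not the actual NewsCred layout:

```yaml
# standard-tools.yml -- shared bootstrap applied to every server,
# deliberately free of role-specific variables or dependencies.
- hosts: all
  roles:
    - package_repos
    - common_tools
    - user_management

# web.yml -- a functional, role-specific playbook that assumes the
# bootstrap has already run but does not depend on its internals.
- hosts: web
  roles:
    - nginx
    - app_deploy
```

Keeping the bootstrap self-contained is what made it possible to run standard-tools under one tool while the role-specific configuration was still owned by the other.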
Over the course of two years, we migrated all of our AWS servers to Ansible while keeping up with our regular operational duties. The last few months saw a more concerted migration effort, as creating new configurations became increasingly easy with the expanding list of available Ansible roles. The migration also exposed several shortcomings of Ansible that we had to work around – a topic for another day!