Design :: Terraform Automation

Status: Implementation

On this page

Problem Statement

Infrastructure changes in our terraform repository should be deployed without any manual steps, and should be well documented (starting with this design).

The general workflow for each environment is linear, and will benefit from automation in terms of SRE time/effort, as well as consistency and accuracy of our terraform code as the Source of Truth(tm) for our infrastructure configuration. Environments will be managed within separate repositories so that we do not needlessly deploy to all environments within a monorepo, when only a subset have actually changed, nor do we run deployments in sequence where it is not necessary.

Finally, the use of a monorepo to manage code for multiple environments and reusable modules presents multiple limitations and inefficiencies. Since the code for all environments references modules in the same repository by relative path from the master branch with no version constraints, any module changes must be deployed to all environments immediately to ensure consistency and avoid breakage by configuration drift. This can result in delays or prevent multiple people from working on unrelated changes simultaneously, as it creates a global FIFO queue for ALL changes in ALL environments. Smaller, more atomic grouping for infrastructure code will enable better collaboration and faster/more flexible deployments.

Design

Modules

We will use versioned modules in a separate repository, terraform-modules. The ultimate goal would be per-environment pins, allowing more control over parallel in-flight changes. For example: SRE A makes a change to the registry module, adjusts the source/version for gstg, merges the change, and begins validation. SRE B modifies a node_count variable for the registry and plans/applies/merges for both gstg and gprd, but does not change the version tag for the module, so the changes still being validated by SRE A are not shipped to gprd early, nor are they a blocker for minor changes to existing functionality by SRE B.

If we can do this with a variable version containing the git tag for the module source repository, we can maintain the currently DRY approach to symlinked main.tf files, and/or look at moving to terraform workspaces for duplicating deployments. Eventually, we can break out individual modules into separate repositories, with pipelines to perform automated linting, testing, and deployment to a private or public registry.

This approach will enable faster iteration on smaller logical pieces of code, and allows us to setup per-module pipelines with compliance and integration tests using tools like Chef InSpec (e.g. with GCP Resource Pack) or terratest to validate resources created/configured by that module. InSpec is preferred, due to tech-stack alignment and better compliance-centric features, but the general point is to enable smaller/more reliable/testable changes in our infrastructure code.

Project & repository organization

Rather than implement continuous delivery logic that sequentially iterates through all environments for every MR, or attempts to use some logic to determine target environments based on the list of files changed within an MR, we will break out separate repositories by deployment; this also re-inforces the move to break out modules into separate version tagged repositories.

Those separate deployments should be organized by lifecycle (foundational account-level or VPC/network-level configuration that rarely changes vs more frequent iterative changes to DNS or access-control/users/roles/permissions) and project (feature/service-centric infrastructure module(s)). Each project should be automatically deployed to a pre-production environment like gstg for validation, before the same code is used to automatically deploy the changes to production, pending passing tests and approval, of course.

Generally speaking, we do not need to keep track of state, or take the time to refresh state for the entirety of the infrastructure when making changes that only impact a single, service-specific portion. We will use remote state data sources or a shared distributed key-value store like consul to publish information that downstream services may need to consume, rely on conventions (tagging, dns naming standards, etc.), or (less desirable) reference another deployment's remote state via data sources.

This results in many more git repositories and pipelines, but moves us towards a more loosely coupled, composable codebase that is more performant for deploys, more testable, and more scalable in terms of the number of contributors and parallel work. The same approach would serve us well in other areas of our infrastructure, as well (e.g. chef cookbooks, helm charts deployed to GKE, etc.)

Goals

  1. Modules will be maintained and versioned independently of the terraform files for live environments; this will allow different environments to simultaneously reference different versions of the modules, and treat them more like independent, reusable packages (as intended)

  2. We will use separate repositories and terraform workspaces for live infrastructure deployments
    1. We will use the established git workflow for our team to manage deployments to environments which share the same code base, and follow a linear deployment workflow using separate branches for each environment. E.g. dev -> tst -> uat -> prd

    2. Merge request pipelines will execute a terraform plan against the target environment, allowing reviewers to approve/request changes based on the plan output before the changes are applied.

  3. Deployment of infrastructure code changes for all environments will be automated
    1. Plans will automatically be applied to the first environment within the promotion path once the merge request is approved and the merge process started.
    2. Changes to subsequent environments will be separate merge requests for a tagged version previously deployed and validated in the preceding environment, and will follwo the same plan/apply flow

Future considerations

  1. While we will initially separate modules into another separate repository, we do not yet need the added complexity of one repository per module. Should we decide to continue investing in this area, we should look at including automated tests in much the same way we do for our chef cookbooks.

  2. Ephemeral environments may be used for testing infrastructure deployed from feature branches during merge requests. This is beyond the scope of this document, and independent of the flow described above, however.

  3. InSpec profiles (or equivalent testing tools) should be (re-)used to provide post-deployment validation for the live environments.

GitLab.com and Self-managed

This solution only applies to gitlab.com, and does not apply to self-hosted GitLab installations. We can certainly look into leveraging this work to publish "official" Terraform modules to the public module registry for customers to more easily deploy GitLab to various IaaS cloud providers, but that is beyond the scope of this design for now.

Operational Considerations

Monitoring

Monitoring will be facilitated by publishing metrics to prometheus via a push-gateway. Initial metrics should include counts for pipeline success/failure and execution time. Future improvements may include metrics based on automated testing results. Finally, existing monitoring already provides for notifications via email for pipeline failures.

Additional Considerations

The new repositories should be grouped under a top-level project, with sub-projects for modules and deployments. This project structure can also also accommodate a similar pattern for deploying to kubernetes (or other platforms) in the future. In the case of kubernetes, we may need to maintain custom some helm charts, manifest files, etc. as well as separate repositories with the configuration and automation to deploy them

Alternatives

Atlantis was initially considered to handle automating the terraform workflows, but ultimately rejected due to limitations in the security features it provides, especially when running pipelines on a public repository. We are working on relocating our infrastructure and support repositories to ops.gitlab.net, but we will still maintain public mirrors on gitlab.com for transparency and to facilitate community contributions