Epic: Infrastructure &62
Currently we manage all GitLab.com infrastructure secrets via chef-vault and
gkms. Our infrastructure at GCP all uses the gkms method, whereas older
infrastructure uses chef-vault. Updates with gkms are done via a script in
chef-repo. The script downloads the encrypted JSON file, decrypts it, allows
you to edit it, checks for valid JSON, and finally re-encrypts and uploads the
new data. Updates to chef-vault are done via the knife vault command. Both
of these methods are hard to manage and have no audit trail or rollback
capabilities. Additionally, there is little in the way of access control, which
prevents us from abiding by the principle of least privilege.
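The decrypt, edit, validate, and re-encrypt cycle described above can be sketched roughly as follows. This is not the actual chef-repo script: the function name `edit_secrets` and the injected `decrypt`, `edit`, and `encrypt` callables are hypothetical stand-ins for the real GKMS calls and editor invocation.

```python
import json
from typing import Callable


def edit_secrets(
    ciphertext: bytes,
    decrypt: Callable[[bytes], bytes],
    edit: Callable[[str], str],
    encrypt: Callable[[bytes], bytes],
) -> bytes:
    """Decrypt a secrets blob, let the caller edit it, validate it, re-encrypt it.

    The three callables are placeholders (assumptions): in practice they would
    wrap the KMS decrypt/encrypt calls and an interactive editor.
    """
    plaintext = decrypt(ciphertext).decode("utf-8")
    edited = edit(plaintext)
    # Refuse to re-encrypt anything that is not valid JSON, so that a typo
    # is caught before upload. json.loads raises json.JSONDecodeError
    # (a subclass of ValueError) on malformed input.
    json.loads(edited)
    return encrypt(edited.encode("utf-8"))
```

Note that nothing in this flow records who changed what, or keeps the previous version, which is the audit and rollback gap described above.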
HashiCorp Vault (henceforth Vault) is a solution to this problem. Vault provides a method to store our secrets and access them via an HTTP API. Vault also has robust access control policies, auditing, and a variety of authentication mechanisms.
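As a rough illustration of what access via the HTTP API looks like, the sketch below builds (but does not send) a read request. It assumes a KV version 2 secrets engine, whose read path has the form `/v1/<mount>/data/<path>`, and token authentication via the `X-Vault-Token` header. The address, mount, path, and token shown in the usage note are made up.

```python
import urllib.request


def build_read_request(
    vault_addr: str, mount: str, path: str, token: str
) -> urllib.request.Request:
    """Build a GET request for a secret stored in a KV v2 secrets engine.

    Assumption: the secret lives in a KV v2 mount; other engines use
    different URL layouts.
    """
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{path.lstrip('/')}"
    # Vault authenticates API calls with a token sent in this header.
    return urllib.request.Request(
        url, headers={"X-Vault-Token": token}, method="GET"
    )
```

For example, `build_read_request("https://vault.example.internal:8200", "secret", "gprd/postgres", token)` targets `https://vault.example.internal:8200/v1/secret/data/gprd/postgres`. Whether the read succeeds is then decided by the policies attached to the token, not by network or SSH access.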
Enabling the application to manage CI/CD secrets via Vault is an unrelated topic and we also don't aim to provide a Vault instance for all GitLab.com customers.
Vault should be the only thing running on a server and no one should have access, SSH or otherwise, to the server. We should deploy Vault in the most isolated way possible without using any resources from other environments. There is already an example of deploying Vault to GKE using Terraform. This would be relatively easy to adapt to our current environment.
We would deploy Vault in an HA configuration with Google Cloud Storage as the storage backend and use GKMS to automatically unseal Vault on startup. We would want to provision a new and heavily locked-down GCP project. This keeps the service completely isolated from other projects and services, and lets us keep access to the project itself tightly controlled. The Kubernetes master node would only allow certain IP addresses to access it, so we could restrict it to the ops CI runner(s) and prevent access from an employee workstation.
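A minimal sketch of the relevant part of such a server configuration is shown below, rendered as JSON (Vault accepts JSON as well as HCL configuration). It is a fragment only: listener, TLS, and other settings are omitted, and every bucket, project, and key name passed in is a placeholder rather than a real resource.

```python
import json


def vault_server_config(
    bucket: str, project: str, region: str, key_ring: str, crypto_key: str
) -> dict:
    """Return a partial Vault server config: GCS storage with HA, GCP KMS seal.

    All argument values are assumptions supplied by the caller; this does not
    describe an existing deployment.
    """
    return {
        # Google Cloud Storage backend; ha_enabled turns on leader election.
        "storage": {"gcs": {"bucket": bucket, "ha_enabled": "true"}},
        # Auto-unseal using a key held in Google Cloud KMS.
        "seal": {
            "gcpckms": {
                "project": project,
                "region": region,
                "key_ring": key_ring,
                "crypto_key": crypto_key,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(
        vault_server_config(
            "example-vault-storage", "example-vault-project",
            "global", "example-key-ring", "example-unseal-key",
        ),
        indent=2,
    ))
```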
The only way to access Vault would be via the API which, as mentioned, would have strict access control policies.
Sample authentication flow:
graph LR;
  subgraph gstg/gprd
    A[Servers]
  end
  B[Users] --> D
  subgraph GCP-Vault-Project
    A -->C
    E--encryption keys--- F[GKMS]
    subgraph GKE
      D[User/Password Authentication] --> E
      C[GCE Service Account Authentication] --> E[Vault]
      E
    end
    E -- storage backend--- G[Google Cloud Storage]
  end
Unfortunately, our secrets are stored on disk in plain text. Locking down Vault helps less than it should when anyone with SSH access can read the secrets on disk. Vault has many secrets engines that can generate credentials on the fly. For example, there is a PostgreSQL database secrets plugin which can be configured to dynamically generate secrets and keep track of old ones to ease the transition. By using something like this with consul-template we would be able to rotate the database password frequently. Using consul-template would also free Chef from having to write secrets into config files, as all that is needed is to deploy the consul templates. Other secrets engines offer similar functionality, such as the Google Cloud engine, which can dynamically generate IAM credentials.
Dynamic generation of secrets like this would also allow us to give employees temporary and limited access to GCP or PostgreSQL if necessary, without giving them long-lived credentials that someone will need to clean up later or otherwise keep track of.
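To make the lease idea concrete, here is a small sketch of how a client might handle a dynamically generated credential. It assumes the response shape returned by Vault's database secrets engine (`lease_id`, `lease_duration` in seconds, and `data.username` / `data.password`). The helper names and the five-minute renewal margin are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LeasedCredential:
    username: str
    password: str
    lease_id: str
    expires_at: datetime


def parse_dynamic_credential(body: str, issued_at: datetime) -> LeasedCredential:
    """Turn a dynamic-credential response body into a credential with an expiry."""
    doc = json.loads(body)
    return LeasedCredential(
        username=doc["data"]["username"],
        password=doc["data"]["password"],
        lease_id=doc["lease_id"],
        # lease_duration is the credential's time to live in seconds.
        expires_at=issued_at + timedelta(seconds=doc["lease_duration"]),
    )


def needs_renewal(
    cred: LeasedCredential,
    now: datetime,
    margin: timedelta = timedelta(minutes=5),  # assumed safety margin
) -> bool:
    """Return True once the credential is within `margin` of expiring."""
    return now >= cred.expires_at - margin
```

Because every credential carries its own expiry, temporary access lapses without anyone having to remember to revoke it.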
There is an official gem called vault-rails which is designed to interface with Vault from Rails. However, it does not seem to be designed for managing infrastructure credentials, but rather for encrypting and decrypting things like secret CI variables. Vault integration for CI/CD is tracked in this issue and is not part of our design.
gitlab-runner.toml (runner config file)
All of the above secrets are stored in various JSON files encrypted via GKMS.
The relevant files are placed on disk via
gitlab.rb is also
HashiCorp recently ported a feature to the open source product that automatically unseals Vault using GKMS. To balance security against our recovery time objective, we should use automatic unsealing in the event of an unplanned reboot but require a minimum number of unseal keys if Vault was sealed on purpose by an operator.
SREs should provide their public GPG keys, against which their individual unseal keys will be encrypted. Whenever an SRE is onboarded or offboarded, all unseal keys will need to be regenerated (which requires no downtime).
Without automatic unsealing, we would need to have enough people on hand to unseal the vault after an unplanned reboot. This could be problematic if it happened during a holiday or other time when many people might be unavailable at the same time.
However, it would be more secure to require several SREs to come together to
unseal the vault in the event of an unexpected sealing. Unsealing the vault
requires a minimum threshold of unseal keys to successfully unseal. For example,
if we had 5 unseal keys we might have a threshold of 3 such that 3 people with
keys would need to be present to unseal. In order to facilitate this, we would
have a PagerDuty service specific to Vault which would be triggered on a
status change provided by the vault_exporter mentioned below. When the seal
status becomes sealed for whatever reason, this service would page everyone with
a key. Because each Vault instance is independently sealed/unsealed, we will need
to take steps to ensure that a single sealed Vault instance would not block the
entire cluster. Consul can be used as a service discovery mechanism to ensure
that sealed Vault instances are never queried. We already have a Consul cluster
configured in each environment, but we do not actively use it for service
discovery at this time.
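The 3-of-5 threshold in the example above relies on Shamir's Secret Sharing, which is what Vault uses to split its unseal key. The toy sketch below illustrates the threshold property over a prime field. It is for illustration only and is not how Vault implements it internally; the function names and the choice of prime are assumptions.

```python
import secrets
from typing import List, Tuple

# A Mersenne prime, chosen here only because it is larger than the toy secret.
PRIME = 2**127 - 1


def split_secret(secret: int, shares: int, threshold: int) -> List[Tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: List[Tuple[int, int]]) -> int:
    """Recover the constant term by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `keys = split_secret(s, shares=5, threshold=3)`, any three entries of `keys` passed to `recover_secret` return `s`, while fewer than three reveal nothing useful about it. This is why three key holders must respond to the page before a sealed instance can be unsealed.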
We should consider building support for managing GitLab configuration secrets
via Vault into the product. If nothing else, we could include
in omnibus and provide an example configuration for how to use it with Vault
for generating and accessing dynamic secrets.
Vault would be monitored via Prometheus, like the rest of our infrastructure. There is also a Vault exporter that reports on the health of Vault.
vault-exporter provides metrics on the following
It also provides Grafana dashboards via prometheus-ksonnet.
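Independently of the exporter, Vault's own `/v1/sys/health` endpoint signals seal status through its HTTP status code, which is one way both paging and "never query a sealed instance" routing could be driven. The sketch below uses the endpoint's default status codes. The helper names and the instance names in the usage note are hypothetical, and it does not describe how vault_exporter works internally.

```python
from typing import Dict, List

# Default status codes returned by GET /v1/sys/health.
HEALTH_STATES = {
    200: "active",               # initialized, unsealed, active
    429: "standby",              # unsealed, standby
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code: int) -> str:
    """Map a sys/health HTTP status code to a coarse state name."""
    return HEALTH_STATES.get(status_code, "unknown")


def sealed_instances(statuses: Dict[str, int]) -> List[str]:
    """Instances that should trigger a page to the unseal key holders."""
    return sorted(n for n, c in statuses.items() if classify_health(c) == "sealed")


def queryable_instances(statuses: Dict[str, int]) -> List[str]:
    """Instances that are unsealed and can accept client traffic.

    Standby nodes are included because they forward or redirect requests
    to the active node.
    """
    ok = {"active", "standby"}
    return sorted(n for n, c in statuses.items() if classify_health(c) in ok)
```

Given `{"vault-0": 200, "vault-1": 429, "vault-2": 503}`, `sealed_instances` returns `["vault-2"]` and `queryable_instances` returns `["vault-0", "vault-1"]`.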
We want Vault to stay independent from anything in production for security, for reliability, and to prevent circular dependencies. That rules out using the existing Consul cluster in production. Standing up an independent Consul cluster just for Vault seems to introduce more complexity than necessary if GCS as the storage backend and K8s for cluster management are already sufficient.