Our GitLab.com core infrastructure is primarily hosted in Google Cloud Platform's (GCP) us-east1 region (see Regions and Zones).
This document does not cover servers that are not integral to the public-facing operations of GitLab.com.
This is a Controlled Document
In line with GitLab's regulatory obligations, changes to controlled documents must be approved or merged by a code owner. All contributions are welcome and encouraged.
This page gives an overview of the production architecture for GitLab.com.
The compute and network layout that runs GitLab.com
Role | Responsibility |
---|---|
Infrastructure Team | Responsible for configuration and management |
Infrastructure Management (Code Owners) | Responsible for approving significant changes and exceptions to this procedure |
Source, GitLab internal use only
Most of GitLab.com is deployed on Kubernetes using the GitLab cloud native Helm chart. The few exceptions are mainly datastore services such as PostgreSQL, Gitaly, Redis, and Elasticsearch.
GitLab.com uses four Kubernetes clusters for production, with similarly configured clusters for staging.
One cluster is a regional cluster in the us-east1 region, and the remaining three are zonal clusters that correspond to the GCP availability zones us-east1-b, us-east1-c, and us-east1-d.
The reasons for having multiple clusters are as follows:
For more information on why we chose to split traffic into multiple zonal clusters, see this issue exploring alternatives to the single regional cluster. A single regional cluster is also used for services such as Sidekiq and KAS that do not have a high bandwidth requirement, and for services that are a better fit for a regional deployment.
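The cluster layout described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the cluster identifiers and the `candidate_clusters` helper are hypothetical, and treating every service other than Sidekiq and KAS as zonal is a simplification of the text, not how traffic is actually routed.

```python
from dataclasses import dataclass

REGION = "us-east1"
ZONES = ("us-east1-b", "us-east1-c", "us-east1-d")

# Services the text names as running on the regional cluster.
REGIONAL_SERVICES = {"sidekiq", "kas"}


@dataclass(frozen=True)
class Cluster:
    name: str       # hypothetical identifier, not a real cluster name
    location: str   # GCP region or zone
    regional: bool


def production_clusters() -> list[Cluster]:
    """One regional cluster plus one zonal cluster per availability zone."""
    clusters = [Cluster(f"regional-{REGION}", REGION, True)]
    clusters += [Cluster(f"zonal-{zone}", zone, False) for zone in ZONES]
    return clusters


def candidate_clusters(service: str) -> list[Cluster]:
    """Return the clusters a service could be placed on under this sketch.

    Assumption: anything not listed in REGIONAL_SERVICES is zonal.
    """
    wants_regional = service.lower() in REGIONAL_SERVICES
    return [c for c in production_clusters() if c.regional == wants_regional]
```

For example, `candidate_clusters("sidekiq")` yields the single regional cluster, while any other service name yields the three zonal clusters.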
In keeping with GitLab's value of transparency, all of the Kubernetes cluster configuration for GitLab.com is public, including infrastructure and configuration.
The following projects are used to manage the installation:
All inbound web, Git HTTP, and Git SSH requests are received at Cloudflare, which has HAProxy as an origin. For Sidekiq, multiple pods are configured for the Sidekiq cluster to divide Sidekiq queues into different resource groups. See the chart documentation for Sidekiq for more details.
Monitoring for GitLab.com runs in the same cluster as the application. Metrics are aggregated in the ops cluster using Thanos, which has multiple components.
Prometheus is configured using the kube-prometheus-stack Helm chart in the namespace monitoring, and every cluster has its own Prometheus, which gives us some sharding for metrics.
Alerting for the cluster uses generated rules that feed up to our overall SLA for the platform.
Logging is configured using tanka, where the logs for every pod are forwarded to a unique Elasticsearch index. fluentd is deployed in the namespace logging.
There is a single namespace gitlab that is used exclusively for the GitLab application.
Chart configuration updates are made in the gitlab-com k8s-workloads project, where YAML configuration files set defaults for the GitLab.com environment with per-environment overrides.
Changes to this configuration are applied by the SRE and Delivery teams after review, using an MR review workflow.
When a change is approved on GitLab.com, the pipeline that applies the change runs on a separate operations environment to ensure that configuration updates do not depend on the availability of the production environment.
For namespaces in the cluster that belong to other services, such as logging and monitoring, a similar GitOps workflow is followed using the gitlab-helmfiles and tanka-deployments projects.
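The layering of defaults with per-environment overrides mentioned above can be pictured as a deep merge of two mappings. The sketch below is an assumption about the general idea, not the actual tooling used in the k8s-workloads project; the `deep_merge` helper and the example keys and values are hypothetical.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied; nested dicts merge key by key.

    Non-dict values (including lists) in overrides replace the default.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
defaults = {"webservice": {"minReplicas": 2, "resources": {"cpu": "1"}}}
staging_overrides = {"webservice": {"minReplicas": 1}}

staging = deep_merge(defaults, staging_overrides)
# staging == {"webservice": {"minReplicas": 1, "resources": {"cpu": "1"}}}
```

Only the keys an environment overrides need to appear in its file; everything else falls through from the defaults, and neither input mapping is modified.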
GitLab.com does not depend on itself when pulling images used in our Kubernetes clusters. Instead, we use our dev.gitlab.org container registry for CNG images. This ensures that during an incident we can still pull images and run our applications as necessary. Images that we do not build ourselves may be pulled from Docker Hub. Conveniently, these images are mirrored on Google's Container Registry product. Our GKE nodes are configured from the start with this mirror already in place, providing further redundancy in the event that Docker Hub is unavailable.
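The image-sourcing rule above amounts to an ordered list of places to pull from. A minimal sketch, assuming the mirror is tried before Docker Hub; the source labels and both helper functions are hypothetical placeholders rather than real registry hostnames or tooling.

```python
from typing import Callable, Optional

# Placeholder labels, not real registry hostnames.
DEV_REGISTRY = "dev-registry"
HUB_MIRROR = "docker-hub-mirror"
DOCKER_HUB = "docker-hub"


def pull_sources(built_by_gitlab: bool) -> list[str]:
    """Ordered pull sources for an image under this sketch.

    CNG images come from the dev registry, never from GitLab.com itself;
    third-party images try the mirror first, then Docker Hub.
    """
    if built_by_gitlab:
        return [DEV_REGISTRY]
    return [HUB_MIRROR, DOCKER_HUB]


def first_available(sources: list[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first source that is reachable, or None if none are."""
    return next((source for source in sources if is_up(source)), None)
```

With Docker Hub marked as down, `first_available(pull_sources(False), ...)` still resolves to the mirror, which is the redundancy the paragraph describes.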
GitLab.com uses several Redis shards for various use cases such as caching, rate-limiting, and Sidekiq queueing. More information on the various Redis shards, their configuration, and usage can be found in the chef-repo and GitLab. The relationship between Redis instances and GitLab deployments can be tracked via this Thanos link.
Redis Infrastructure Strategy
GitLab.com's Redis, as seen from above, is mostly Redis Sentinel deployed on VMs. There are plans to deploy Redis in Cluster mode (for horizontal scalability) in epic-823 and/or migrate from VM to Kubernetes (reduce engineering toil) in epic-618. The table below summarises the current and expected states of various Redis types:
Type | Current Setup | Expected Future Setup | Driver of State Change |
---|---|---|---|
Cache | Redis Cluster on VM | Redis Cluster on K8s | Reduce toil |
ChatCache | Redis Cluster on VM | Redis Cluster on K8s | Reduce toil |
DbLoadBalancing | Redis Sentinel on VM | Redis Cluster on K8s | Reduce toil |
FeatureFlag | Redis Cluster on VM | Redis Cluster on K8s | Reduce toil |
PubSub | Redis Sentinel on K8s | - | - |
Queues | Redis Sentinel on VM | Redis Sentinel on K8s | Reduce toil |
QueuesMeta | Redis Cluster on VM | Redis Cluster on K8s | Reduce toil |
RateLimiting | Redis Cluster on VM | Redis Cluster on K8s | Reduce toil |
Registry Cache | Redis Sentinel on K8s | - | - |
Repository Cache | Redis Sentinel on VM | Redis Cluster on K8s | CPU saturation |
Sessions | Redis Sentinel on VM | Redis Sentinel on K8s | Reduce toil |
SharedState | Redis Sentinel on VM | Redis Cluster on VM | CPU saturation |
TraceChunks | Redis Sentinel on VM | Redis Sentinel on K8s | Reduce toil |
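
To make the planned changes in the table easier to query, the rows can be transcribed into a small data structure. This is a sketch built only from the table above; the `RedisType` tuple and the helper functions are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class RedisType(NamedTuple):
    name: str
    current: str
    future: Optional[str]  # None where the table shows "-"


# Transcribed from the table above.
REDIS_TYPES = [
    RedisType("Cache", "Redis Cluster on VM", "Redis Cluster on K8s"),
    RedisType("ChatCache", "Redis Cluster on VM", "Redis Cluster on K8s"),
    RedisType("DbLoadBalancing", "Redis Sentinel on VM", "Redis Cluster on K8s"),
    RedisType("FeatureFlag", "Redis Cluster on VM", "Redis Cluster on K8s"),
    RedisType("PubSub", "Redis Sentinel on K8s", None),
    RedisType("Queues", "Redis Sentinel on VM", "Redis Sentinel on K8s"),
    RedisType("QueuesMeta", "Redis Cluster on VM", "Redis Cluster on K8s"),
    RedisType("RateLimiting", "Redis Cluster on VM", "Redis Cluster on K8s"),
    RedisType("Registry Cache", "Redis Sentinel on K8s", None),
    RedisType("Repository Cache", "Redis Sentinel on VM", "Redis Cluster on K8s"),
    RedisType("Sessions", "Redis Sentinel on VM", "Redis Sentinel on K8s"),
    RedisType("SharedState", "Redis Sentinel on VM", "Redis Cluster on VM"),
    RedisType("TraceChunks", "Redis Sentinel on VM", "Redis Sentinel on K8s"),
]


def split_setup(setup: str) -> tuple[str, str]:
    """Split 'Redis Sentinel on VM' into ('Sentinel', 'VM')."""
    mode, platform = setup.removeprefix("Redis ").split(" on ")
    return mode, platform


def moving_to_k8s() -> list[str]:
    """Types currently on VMs whose expected setup is on Kubernetes."""
    return [
        t.name for t in REDIS_TYPES
        if t.future
        and split_setup(t.current)[1] == "VM"
        and split_setup(t.future)[1] == "K8s"
    ]


def changing_mode() -> list[str]:
    """Types whose expected setup switches between Sentinel and Cluster."""
    return [
        t.name for t in REDIS_TYPES
        if t.future and split_setup(t.current)[0] != split_setup(t.future)[0]
    ]
```

Read this way, ten of the thirteen types are expected to move from VMs to Kubernetes, and three (DbLoadBalancing, Repository Cache, and SharedState) are expected to switch from Sentinel to Cluster mode.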
When needed, we also address CPU saturation by making application changes. Some of the techniques for this are discussed in this video.
Our network infrastructure consists of networks for each class of server as defined in the Current Architecture diagram. Each network contains a similar ruleset as defined above.
We currently peer our ops network. Inside of this network is most of our monitoring infrastructure where we allow InfluxDB and Prometheus data to flow in order to populate our metrics systems.
For alert management, we peer all of our networks together so that we have a cluster of alert managers, ensuring that alerts go out even if an environment fails.
No application or customer data flows through these network peers.
We host our DNS with Cloudflare (gitlab.com, gitlab.net) and Route 53 (gitlab.io and others). For more information about Cloudflare, see the runbook and the architecture overview.
When it comes to DNS names, all services providing GitLab as a service shall be in the gitlab.com domain; ancillary services in support of GitLab (i.e. Chef, ChatOps, VPN, Logging, Monitoring) shall be in the gitlab.net domain.
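The naming rule above can be expressed as a simple check. This is a minimal sketch, not an existing GitLab tool: the function names and the example hostnames are hypothetical, and whether a service "provides GitLab as a service" is passed in by the caller.

```python
def expected_domain(provides_gitlab_service: bool) -> str:
    """gitlab.com for services providing GitLab itself, gitlab.net for ancillary ones."""
    return "gitlab.com" if provides_gitlab_service else "gitlab.net"


def follows_convention(hostname: str, provides_gitlab_service: bool) -> bool:
    """Check that a hostname is the expected domain or a subdomain of it."""
    domain = expected_domain(provides_gitlab_service)
    host = hostname.lower().rstrip(".")
    # Require a dot before the domain so "notgitlab.com" does not match.
    return host == domain or host.endswith("." + domain)
```

For example, `follows_convention("example-chef.gitlab.net", False)` is true, while the same ancillary host under gitlab.com would fail the check.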
Access to production is granted, through bastion hosts, only to those who need it. Instructions for configuring access through bastions are found in the bastion runbook.
GitLab uses two different secret management approaches: GKMS for Google Cloud Platform (GCP) services, and Chef Encrypted Data Bags for all other host secrets.
For more information about secret management see the runbook for Chef secrets using GKMS, Chef vault and how we manage secrets in Kubernetes.
To see how GitLab.com is doing, and for more information on monitoring, visit the monitoring handbook.
Exceptions to this architecture policy and design will be tracked in the compliance issue tracker.