Real-Time Design Document

Introduction

This document captures the design decisions behind the current implementation of real-time features in GitLab. It is adapted from the description and discussion within issues in this epic.

The Problem

We want to implement real-time issue boards, but the current approach to real-time updates (polling with ETag caching) would not work well for them. An issue board has many things to track (lists changing, issues within lists changing, etc.), and polling for each of those isn't feasible due to the number of requests required and the resulting load on server nodes and the database. It is possible to use a single polling endpoint, but that makes the code harder to understand.

Polling is also not very real-time unless polling intervals can be dropped substantially.

Even pages that successfully use multiple polling requests, such as the MR page (for title and description, notes, widgets, and so on), would benefit from faster updates.

Objectives

The objective is to implement a real-time solution that satisfies the following criteria:

This objective is an iterative step in the long-term plan to implement real-time collaboration.

Solution Proposed

We will roll out the use of WebSockets by starting with a small, relatively low-risk feature. When we've identified and solved the problems of maintaining WebSocket connections at scale and captured the lessons of designing a feature for persistent connections, we'll produce documentation that will allow other developers to work on real-time features.

The initial feature is viewing assignees on issues in real-time and the chosen technology is Action Cable.

How it works

For the simplest deployments, enabling Action Cable enables the first feature by default.

The feature can also be toggled using two feature flags:

   
real_time_issue_sidebar: Attempts to establish a WebSocket connection when viewing an issue and responds to update signals.
broadcast_issue_updates: Broadcasts a signal when an issue is updated.

Sometimes the nodes serving Web requests aren't the same ones serving WebSocket connections (see "How to implement it on premise"), so the Web nodes don't have Action Cable enabled. In that case, the feature flags can be used to enable the feature explicitly.
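As a rough illustration of how the two flags relate, the sketch below models them as plain booleans. The `Flags` class and `sidebar_updates_live` function are hypothetical names invented for this example and are not part of GitLab's feature flag API. It assumes that an update reaches the browser only when the server broadcasts and the client listens, and it deliberately ignores the default-on behaviour when Action Cable is enabled in the same process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    """Hypothetical snapshot of the two feature flags (not GitLab's API)."""

    real_time_issue_sidebar: bool  # client opens a WebSocket and reacts to signals
    broadcast_issue_updates: bool  # server broadcasts a signal when an issue is updated


def sidebar_updates_live(flags: Flags) -> bool:
    """Assumption: a signal is only useful if one side sends it and the other listens."""
    return flags.real_time_issue_sidebar and flags.broadcast_issue_updates


if __name__ == "__main__":
    # Print the full truth table for the two flags.
    for sidebar in (False, True):
        for broadcast in (False, True):
            state = Flags(sidebar, broadcast)
            print(sidebar, broadcast, sidebar_updates_live(state))
```

Under this reading, enabling only one flag leaves the sidebar on its existing behaviour, which is what makes a staged rollout possible.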

This diagram shows the current steps involved in establishing an open WebSocket connection for bidirectional communication. This is subject to change as work progresses.

sequenceDiagram
  participant Client
  participant Workhorse
  participant Rails/AC
  Client->>Workhorse: HTTP GET `Upgrade: websocket`
  Workhorse->>Rails/AC: proxy
  Rails/AC->>Rails/AC: open connection
  Rails/AC-->>Workhorse: HTTP 101 Switching Protocols
  Workhorse-->>Client: 200 OK
  Rails/AC->>Client: websocket traffic
  Client->>Rails/AC: websocket traffic

  1. The client sends a connection upgrade request to /-/cable;
  2. Workhorse proxies this to the correct backend (set using the cableBackend option, defaulting to authBackend);
  3. The backend responds with 101 Switching Protocols and upgrades the request;
  4. The client subscribes to the channel(s) it's interested in (IssuesChannel, specifying project_path and iid);
  5. The server confirms subscription and publishes a signal when an update is made to the issue; and
  6. The client responds by requesting up-to-date state via GraphQL.

†: This step is especially subject to change as we consider using GraphQL Subscriptions instead.
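Steps 4 to 6 can be sketched at the message level. The example below builds the JSON frame a client would send to subscribe and classifies frames coming back, following the Action Cable wire format as commonly documented (a `command` plus a JSON-encoded `identifier`, and server frames carrying a `type` or a `message`). The exact identifier keys, the example project path, and the payload of the update signal are assumptions made for illustration; they are not taken from GitLab's code.

```python
import json


def subscribe_frame(project_path: str, iid: str) -> str:
    """Build the subscribe command for IssuesChannel (step 4)."""
    # Action Cable nests the identifier as a JSON string inside the JSON command.
    identifier = json.dumps(
        {"channel": "IssuesChannel", "project_path": project_path, "iid": iid}
    )
    return json.dumps({"command": "subscribe", "identifier": identifier})


def classify(raw: str) -> str:
    """Classify a frame received from the server."""
    frame = json.loads(raw)
    kind = frame.get("type")
    if kind in ("welcome", "ping", "confirm_subscription", "reject_subscription"):
        return kind  # protocol-level frames
    if "identifier" in frame and "message" in frame:
        # Step 5: an update signal. In step 6 the client would refetch state via GraphQL.
        return "signal"
    return "unknown"


if __name__ == "__main__":
    sent = subscribe_frame("group/project", "1")  # hypothetical project path and iid
    print(sent)
    print(classify('{"type": "welcome"}'))
    # Hypothetical signal payload; the real message body is not specified in this document.
    signal = json.dumps(
        {"identifier": json.loads(sent)["identifier"], "message": {"event": "updated"}}
    )
    print(classify(signal))
```

Keeping the signal itself free of issue data, and fetching state separately, means permission checks stay in the existing GraphQL path.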

How it solves the problem

Prototype model / Testing plan

The feature is currently available for internal team-members to demo on the dev.gitlab.org instance. This is a single-instance deployment of CE.

Performance testing of Action Cable with Puma found no impact on resource usage, but the tests only covered idle connections. In the absence of simulated workloads, the recommendation was to roll the feature out gradually.

An end-to-end test for real-time assignees was added in this MR.

How to implement it on premise

Instance administrators have a number of options for using Action Cable. Admins of single-instance and small cluster deployments may choose to simply serve WebSocket connections from existing nodes. In that setup, enabling Action Cable makes the first feature available immediately.

Administrators of larger deployments may wish to proxy WebSocket connections to a separate set of nodes to protect their main Web nodes from saturation. This can be done in one of two ways:

  1. Use the cableBackend option to specify a separate address (this defaults to authBackend); or
  2. Implement traffic splitting manually at the load balancer or ingress stage, typically based on path. This is the option used for gitlab.com.

In both cases, only embedded mode is supported for Action Cable. In the latter case, the separate nodes run full GitLab Web processes with Action Cable additionally enabled. See the decision to support only embedded mode here.
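The routing decision behind both options can be sketched as a small function: requests to /-/cable go to the cable backend, which falls back to the main backend when none is configured, and everything else goes to the main backend. This is a minimal sketch of the idea and not Workhorse's or any load balancer's actual routing code. The function name `pick_backend` and the example addresses are hypothetical.

```python
from typing import Optional

CABLE_PATH = "/-/cable"  # the WebSocket endpoint named in this document


def pick_backend(path: str, auth_backend: str, cable_backend: Optional[str] = None) -> str:
    """Return the upstream address for a request path.

    Mirrors the described default: the cable backend falls back to the auth backend.
    """
    route = path.split("?", 1)[0]  # ignore any query string
    if route == CABLE_PATH:
        return cable_backend or auth_backend
    return auth_backend


if __name__ == "__main__":
    web = "http://web.internal:8080"      # hypothetical address
    cable = "http://cable.internal:8080"  # hypothetical address
    print(pick_backend("/-/cable", web))         # single set of nodes
    print(pick_backend("/-/cable", web, cable))  # dedicated WebSocket nodes
    print(pick_backend("/group/project/-/issues/1", web, cable))
```

Because only the path decides the route, the nodes behind either address can run the same GitLab Web processes.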

Note that Action Cable channels (similar to controllers) can do anything that can be done in the web context, such as using models or reading from the cache, so these processes must be treated like existing web processes. They should have the same configuration and be able to connect to the database, Redis cache, shared state, Sidekiq, etc. Although the initial implementation will probably only perform permission checks, dependencies that aren't set up properly could become a source of subtle bugs in the future.

How to implement it on .com

  1. Infrastructure supporting WebSocket connections will run in Kubernetes;
  2. WebSocket traffic is split by path to a separate, independently scalable deployment; and
  3. Pods servicing WebSocket connections run ordinary webservice processes with Action Cable enabled.

Since the nodes serving Web requests do not have Action Cable enabled on gitlab.com, the feature can be controlled using the feature flags :real_time_issue_sidebar and :broadcast_issue_updates. These will be used to roll out the first feature in a controlled way.

Note: The feature has already been trialled on gitlab.com using the ACTION_CABLE_IN_APP environment variable (via the extraEnv section exposed by our Helm charts) to proxy WebSocket requests to a dedicated set of pods. This coincided with elevated memory consumption on our Workhorse nodes, and the change was subsequently rolled back.

Monitoring

Possible Costs

Based on the rollout of the first real-time feature, the current estimated cost per connection at peak is $0.02 (USD) per month.

This figure is derived by dividing total cost of WebSocket nodes by the number of concurrent connections at peak, visible on this chart (internal). It does not take into account load on downstream services, such as the primary database or Redis. It is most likely an over-estimation and expected to decrease as connections are added, as the current nodes can support more connections.
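The derivation is a single division, shown here as a short sketch. The input figures are invented for illustration, chosen only so that the result matches the $0.02 estimate; they are not GitLab's actual node costs or connection counts.

```python
def cost_per_connection(monthly_node_cost_usd: float, peak_connections: int) -> float:
    """Monthly cost of the WebSocket nodes divided by concurrent connections at peak."""
    if peak_connections <= 0:
        raise ValueError("peak_connections must be positive")
    return monthly_node_cost_usd / peak_connections


if __name__ == "__main__":
    # Hypothetical inputs, NOT real GitLab figures.
    node_cost = 500.0
    peak = 25_000
    print(round(cost_per_connection(node_cost, peak), 4))
    # Same nodes, twice the connections: the per-connection cost halves.
    print(round(cost_per_connection(node_cost, peak * 2), 4))
```

The second call shows why the estimate is expected to fall: while existing nodes have spare capacity, the numerator stays fixed as the denominator grows.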

Alternatives Considered

Action Cable was the first choice because it is included with Rails. Scalability is a known concern but if it becomes a problem Anycable implements the same API. We could switch to that in the future with minimal to no changes in the application code.

  1. Long-polling / Server-sent Events (SSE)

    Both long-polling and SSE share the problem detailed above of having to poll or request multiple endpoints. Even then, we'd have to implement custom backend logic, similar to our current ETag caching, that checks Redis or a similar store. It's not worth it when Action Cable provides the full stack.

    The message_bus gem implements multiple subscriptions in one polling endpoint. But since we're planning to do real-time collaboration which would need lower latencies and bi-directional communication, it's better to just go with websockets directly.

  2. Go / Erlang / Elixir websocket servers

    It is known that these languages are better than Ruby at concurrency but without booting our Rails app / Ruby libraries, we can't reuse the code we already have; for example, permissions checks. These are complex and very easy to get wrong so we definitely don't want to re-implement this. We could do a separate API call to our Rails backend but more on that below.

  3. Anycable

    Anycable has websocket servers in Go / Erlang and it solves the problem of not having Rails context by using gRPC. The downside is that we'd have to spin up another gRPC server which boots our Rails app. This complicates our infrastructure setup, and it would take longer to set up everything that's needed. This option was discussed in this issue.

    Since it is easy to switch to this later on if needed, we decided to defer this and start with Action Cable.

  4. Other Ruby websocket servers (Faye)

    Gitter uses this. We looked into it briefly but didn't have a strong reason to choose it over Action Cable.

Documentation
