Performance

The meta issue that tracks the various issues listed here is on the infrastructure tracker.

Definitions

To clarify what we mean when discussing performance of GitLab, we use the following metrics:

For each metric, the following modifiers can be applied:

In everything that follows, times are measured from a single geo-location (in Europe) using "Cable" connectivity for that location (5/1 Mbps).

First Byte

First Byte (sometimes referred to as time to first byte or TTFB) measures the time between making a request and receiving the first byte of information in return. As a result, First Byte encompasses everything in the backend as well as network transit. It differs from Speed Index mostly in that Speed Index also includes frontend-related work such as JavaScript loading, page rendering, and so on (for more details, see the steps of a web request below).
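
As a rough illustration of what First Byte captures, the sketch below times a single GET request with Python's standard library. It approximates First Byte as the time until the status line and headers have been received, which is slightly later than the literal first byte. The function name `time_to_first_byte` and its parameters are assumptions made for this example; they are not part of GitLab's monitoring or of SiteSpeed.

```python
import http.client
import time


def time_to_first_byte(host, path="/", port=443, use_tls=True, timeout=10.0):
    """Return the milliseconds between issuing a GET and receiving the response headers.

    The measurement includes connection setup (DNS, TCP, and TLS when enabled),
    because http.client opens the connection lazily inside request().
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)
    try:
        start = time.perf_counter()
        conn.request("GET", path)      # connects and sends the request
        response = conn.getresponse()  # returns once status line and headers are parsed
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        response.read()                # drain the body so the connection closes cleanly
        return elapsed_ms
    finally:
        conn.close()
```

For example, `time_to_first_byte("gitlab.com", "/gitlab-org/gitlab-ce/issues/4058")` would time the issue URL from the table below from wherever the script runs, so the result depends on that machine's network location.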

External

Timing history and targets for First Byte are listed in the table below (click on the tachometer icons for current timings).

Notes on the table below:

Type | URL | BoQ (ms) | Now (ms) | EoQ (ms)
Issue | https://gitlab.com/gitlab-org/gitlab-ce/issues/4058 | 3693 | | 1000
Merge request | https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/9546 | 6347 | | 1000
Pipeline | https://gitlab.com/gitlab-org/gitlab-ce/pipelines/9360254 | 2987 | | 1000
Repo | http://gitlab.com/gitlab-org/gitlab-ce/tree/master | 1080 | | 1000

Internal

To go a little deeper and measure performance of the application and infrastructure without consideration for frontend and network aspects, we look at "transaction timings" as recorded by Unicorn. These timings can be seen on the Rails Controller dashboard for each URL that is accessed.

For instance, to get the transaction timing for the merge request referenced above, first visit the merge request page, then visit the Rails Controller dashboard and scroll down to the Transaction Details table. We do not currently have time series graphs per URL, nor do we have specific targets for what this timing should be.

Speed Index

Performance of GitLab and GitLab.com is ultimately about the user experience. As also described in the product management handbook, "faster applications are better applications".

External

Since the speed of the application depends on how it is used, we've decided to use heavy use cases as the basis for measuring performance. Specifically, the URLs from GitLab.com listed in the table below form the basis for measuring performance improvements. The times indicate the time passed from web request to "the average time at which visible parts of the page are displayed" (per the definition of Speed Index). Since the User is a controlled entity in this case, it represents "Speed Index - External".
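
To make the Speed Index definition concrete, here is a minimal sketch of how it is commonly computed: as the area above the page's visual completeness curve over time. The function name `speed_index` and the sample format are assumptions made for this example, and the sample values are illustrative, not measurements of GitLab.com.

```python
def speed_index(samples):
    """Compute a Speed Index in ms from (time_ms, visual_completeness) samples.

    `samples` must be sorted by time, with completeness given as a fraction
    in [0, 1]. The page counts as 0% complete before the first sample, and
    completeness is held constant between samples (a step function).
    """
    total = 0.0
    prev_time = 0.0
    prev_complete = 0.0
    for time_ms, complete in samples:
        # Accumulate the "still missing" fraction over the elapsed interval.
        total += (1.0 - prev_complete) * (time_ms - prev_time)
        prev_time, prev_complete = time_ms, complete
    return total


# Illustrative page: half the viewport is painted at 1 s, the rest at 2 s.
example = speed_index([(1000, 0.5), (2000, 1.0)])  # 1000 * 1.0 + 1000 * 0.5 = 1500.0
```

A page that paints most of its visible content early scores lower (better) than one that paints everything at the end, even if both finish at the same time.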

Benchmark data (from 2013) is available for the Speed Index of the web's 300,000 most popular sites, measured at the same connectivity that we use for our testing (5 Mbps cable). Based on that data, we have set the target for the Speed Index to 2000 ms, so that GitLab.com URLs fall within the fastest 25% of sites.

Type | URL | BoQ (ms) | Now (ms) | EoQ (ms)
Issue | https://gitlab.com/gitlab-org/gitlab-ce/issues/4058 | 7365 | | 2000
Merge request | https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/9546 | 9034 | | 2000
Pipeline | https://gitlab.com/gitlab-org/gitlab-ce/pipelines/9360254 | 14454 | | 2000
Repo | http://gitlab.com/gitlab-org/gitlab-ce/tree/master | 3230 | | 2000
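
To put the gap between the measured values and the 2000 ms target in perspective, the following sketch computes how much faster each page would have to become. The measured values are copied from the table above; the helper name `required_reduction` is an assumption made for this example.

```python
SPEED_INDEX_TARGET_MS = 2000

# Measured Speed Index values (ms) copied from the table above.
measured_ms = {
    "Issue": 7365,
    "Merge request": 9034,
    "Pipeline": 14454,
    "Repo": 3230,
}


def required_reduction(measured, target=SPEED_INDEX_TARGET_MS):
    """Fraction by which `measured` must shrink to reach `target` (0.0 if already met)."""
    return max(0.0, 1.0 - target / measured)


for page, ms in measured_ms.items():
    print(f"{page}: {required_reduction(ms):.1%} reduction needed to reach {SPEED_INDEX_TARGET_MS} ms")
```

The same helper can be pointed at the First Byte table by passing `target=1000`.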

Steps

Web Request

All items that start with the tachometer symbol represent a step in the flow that we measure. Wherever possible, the tachometer icon links to the relevant dashboard in our monitoring. Each step in the listing below links back to its corresponding entry in the goals table.

Consider the scenario of a user opening their browser and surfing to their dashboard by typing gitlab.com/dashboard. Here is what happens:

  1. User request
    1. User enters gitlab.com/dashboard in their browser and hits enter
    2. Lookup IP in DNS (not measured)
      • Browser looks up IP address in DNS server
      • DNS request goes out and comes back (typically ~10-20 ms, [data?]; often it is already cached, in which case it is faster).
      • For more details on the steps from browser to application, enjoy reading https://github.com/alex/what-happens-when
    3. Browser to Azure LB (not measured)
      • Now that the browser knows the IP address, it sends the web request (for gitlab.com/dashboard) to Azure's load balancer (LB).
  2. Backend processes
    1. Azure LB to HAProxy (not measured)
      • Azure's load balancer determines where to route the packet (request), and sends the request to our Frontend Load Balancer(s) (also referred to as HAProxy).
    2. HAProxy SSL with browser (not measured)
      • HAProxy (load balancer) does SSL negotiation with the browser
    3. HAProxy to NGINX (not measured)
      • HAProxy forwards the request to NGINX in one of our front end workers. In this case, since we are tracking a web request, it would be the NGINX box in the "Web" box in the production-architecture diagram; but alternatively the request can come in via API or a git command from the command line, hence the API, and git "boxes" in that diagram.
      • Since all of our servers are in ONE Azure VNET, the overhead of SSL handshake and teardown between HAProxy and NGINX should be close to negligible.
    4. NGINX buffers request (not measured)
      • NGINX gathers all network packets related to the request ("request buffering"). The request may be split into multiple packets by the intervening network; for more on that, read up on MTUs.
      • In other flows, this won't be true. Specifically, request buffering is switched off for LFS.
    5. NGINX to Workhorse (not measured)
      • NGINX forwards the full request to Workhorse (in one combined request).
    6. Workhorse distributes request
      • Workhorse splits the request into parts to forward to:
        • Unicorn. Time spent waiting for Unicorn to pick up a request is HTTP queue time.
        • Gitaly [not in this scenario, but not measured in any case]
        • NFS (git clone through HTTP) [not in this scenario, but not measured in any case]
        • Redis (long polling) [not in this scenario, but not measured in any case]
    7. Unicorn calls services
      • Unicorn (often just called "Rails" or "application server") translates the request into a Rails controller request; in this case RootController#index. The round trip time it takes for a request to start in Unicorn and leave Unicorn is what we call Transaction Timings. Rails controller requests are sent to (and data is received from):
        • PostgreSQL (SQL timings),
        • NFS (git timings),
        • Redis (cache timings).
      • In this gitlab.com/dashboard example, the controller addresses all three.
      • There are usually multiple SQL (or file, or cache, etc.) calls for a given controller request. These add to the overall timing, especially since they are sequential. For example, in this scenario, there are 29 SQL calls (search for Load) when this particular user hits gitlab.com/dashboard/issues. The number of SQL calls will depend on how many projects the person has, how much may already be in cache, etc.
      • Rails tackles the steps within a controller request sequentially. In other words, if it needs to make calls out to the database and to git, it is not set up to do those in parallel but rather has to wait for the response to the first step before proceeding to the next step.
      • In the Rails stack, middleware typically adds to the number of round trips to Redis, NFS, and PostgreSQL, per controller call, in addition to the timings of Rails controllers. Middleware is used for {session state, user identity, endpoint authorization, rate limiting, logging, etc} while the controllers typically have at least one round trip for each of {retrieve settings, cache check, build model views, cache store, etc.}. Each such roundtrip is estimated to take < 10 ms.
    8. Unicorn constructs Views
      • The construction of views can take a long time (view timings). In some controllers, data is gathered first after which a view is constructed. In other controllers, data is gathered from within a View, so that the view timing in those cases includes the time it took to call NFS, PostgreSQL, Redis, etc. And in many cases, both are done.
      • A particular view in Rails will often be constructed from multiple partial views. These are pulled in by a template file, specified by the controller action, which is itself generally included within a layout template. Partials can include other partials. This is done for good code organization and reuse. As an example, when the particular user from the example above loads gitlab.com/dashboard/issues, there are 56 nested / partial views rendered (search for View::).
      • Partial views may be cached via various Rails techniques, such as Fragment Caching. In addition, GitLab has a Markdown cache stored in the database that is used to speed up the conversion of Markdown to HTML.
      • Perceived performance in terms of First Paint can be affected by how much of the content of a view is rendered by the backend vs. sending a "minimal" HTML blob to the user and relying on JavaScript / AJAX / etc. to fetch additional elements that take the page from First Paint to "Fully Loaded". See the section about the frontend for more on this.
    9. Unicorn makes HTML (not measured)
      • Once the Views are built, Unicorn completes making the "HTML blob" that is then returned to the browser.
      • Some of these blobs are expensive to compute, and are sometimes hard-coded to be sent from Unicorn to Redis (i.e. to cache) once rendered.
    10. HTML to Browser (not measured)
  3. Render Page
    1. First Byte
      • The time when the browser receives the first byte. In addition to everything in the backend, this also depends on network speed. In the dashboard linked to by the tachometer above, First Byte is measured from a Digital Ocean box in the US with relatively little network lag, thus representing an estimate of First Byte - Internal.
      • For any page, you can use your browser's "inspect" tool to look at "TTFB" (time to first byte).
      • First Byte - External is measured for a hand-selected number of URLs using SiteSpeed.
    2. Speed Index
      • Browser parses the HTML blob and sends out further requests to GitLab.com to fetch assets such as JavaScript bundles, CSS, images, and webfonts.
      • The timing of this step depends (amongst other things) on the number and the size of assets, as well as network speed. For each static asset, there is a round-trip of:
        • for cached assets: browser → NGINX → NGINX confirms cached asset is still valid → browser.
        • for non-cached or expired cached assets: browser → Workhorse → Workhorse grabs asset from local cache → browser.
        • for a page that is served through GitLab Pages: browser → pages daemon (independent service in the architecture) → browser.
      • Stylesheets can block page rendering by default, which can lead to unnecessary delays in page rendering.
      • Starting in 9.5, scripts won't block rendering anymore as they are loaded with defer="true", so they are parsed and executed in the same order as they are called, but only after the HTML and CSS have been rendered.
      • Enough meaningful content is rendered on screen to calculate the "Speed Index".
    3. Fully Loaded
      • When the scripts are loaded, the browser's JavaScript engine compiles and evaluates them within the page.
      • On some pages, we use AJAX to allow for async loading. The AJAX call can be triggered by many things, for example a frontend element (such as a button) or the DOMContentLoaded event. The new call is for a new URL; such requests are routed through either the Web or API workers, invoke their respective Rails controllers on the backend, and return the requested files (HTML, JSON, etc.). For example, the calendar and activity feeds on a username page gitlab.com/username are two separate AJAX calls, triggered by DOMContentLoaded. (The DOMContentLoaded event "marks the point when both the DOM is ready and there are no stylesheets that are blocking JavaScript execution", per an article about the critical rendering path.) The alternative to using AJAX would be to include the full Rails code to generate the calendar and activity feed within the same controller that is called by the gitlab.com/username URL, which would lead to a slower First Paint since it involves more calls to the database, etc.
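
The point made under "Unicorn calls services", that calls within a controller request run sequentially and therefore add up, can be sketched as follows. The function names and the timings are hypothetical values chosen for illustration; they are not measurements from GitLab.com.

```python
def sequential_total(call_timings_ms):
    """Wall-clock time when each call waits for the previous one to finish."""
    return sum(call_timings_ms)


def parallel_lower_bound(call_timings_ms):
    """Best case if all calls could be issued at once: the slowest call dominates."""
    return max(call_timings_ms, default=0)


# Hypothetical timings (ms) for one controller request:
# one SQL call, one git (NFS) call, one cache (Redis) call.
example_calls_ms = [120, 300, 15]

sequential_ms = sequential_total(example_calls_ms)    # 120 + 300 + 15 = 435
best_case_ms = parallel_lower_bound(example_calls_ms)  # slowest call = 300
```

With many calls per request (such as the 29 SQL calls mentioned above), the sequential total grows with every additional call, which is why reducing the number of calls matters as much as speeding up individual ones.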

Git Commit Push

First read the flow of a web request above, then pick up the thread here.

After pushing to a repository, e.g. from the web UI:

  1. In a web browser, make an edit to a repo file, type a commit message, and hit "Commit"
  2. NGINX receives the git commit and passes it to Workhorse
  3. Workhorse launches a git-receive-pack process (on the workhorse machine) to save the new commit to NFS
  4. On the workhorse machine, git-receive-pack fires a git hook to trigger GitLab Shell.
    • GitLab Shell accepts Git payloads pushed over SSH and acts upon them (e.g. by checking if you're authorized to perform the push, scheduling the data for processing, etc).
    • In this case, GitLab Shell provides the post-receive hook, and the git-receive-pack process passes along details of what was pushed to the repo to the post-receive hook. More specifically, it passes a list of three items: old revision, new revision, and ref (e.g. tag or branch) name.
  5. Workhorse then passes the post-receive hook to Redis, which holds the Sidekiq queue.
    • Workhorse is informed whether the push succeeded or failed (it could have failed due to the repo not being available, Redis being down, etc.).
  6. Sidekiq picks up the job from Redis and removes the job from the queue
  7. Sidekiq updates PostgreSQL
  8. Unicorn can now query PostgreSQL.
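
The three items handed to the post-receive hook (old revision, new revision, ref name) arrive from Git as one space-separated line per updated ref on the hook's standard input. The sketch below parses that input; the names `RefUpdate` and `parse_post_receive` are assumptions made for this example and do not reflect how GitLab Shell is implemented.

```python
from typing import List, NamedTuple


class RefUpdate(NamedTuple):
    old_rev: str  # revision the ref pointed to before the push
    new_rev: str  # revision the ref points to after the push
    ref: str      # full ref name, e.g. refs/heads/master


def parse_post_receive(stdin_text: str) -> List[RefUpdate]:
    """Parse post-receive input: one '<old> <new> <ref>' line per updated ref."""
    updates = []
    for line in stdin_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        old_rev, new_rev, ref = line.split(" ", 2)
        updates.append(RefUpdate(old_rev, new_rev, ref))
    return updates


# Illustrative input with made-up revisions; an all-zero old revision means
# the ref was newly created by this push.
sample = "0" * 40 + " " + "a" * 40 + " refs/heads/feature\n"
parsed = parse_post_receive(sample)
```

A job placed on the Sidekiq queue would need at least this information to know which refs changed and between which revisions.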

Goals

Web Request

Consider the scenario of a user opening their browser, and surfing to their favorite URL on GitLab.com. The steps are described in the section on "web request". In this table, the steps are measured and goals for improvement are set.

Guide to this table:

Step # per request p99 BoQ p99 Now p99 EoQ goal Issue links and impact
USER REQUEST          
Lookup IP in DNS 1 ~10 ? ~10 Use a second DNS provider
Browser to Azure LB 1 ~10 ? ~10  
BACKEND PROCESSES         Extend monitoring horizon
Azure LB to HAProxy 1 ~2 ? ~2  
HAProxy SSL with Browser 1 ~10 ? ~10 Speed up SSL
HAProxy to NGINX 1 ~2 ? ~2  
NGINX buffers request 1 ~10 ? ~10  
NGINX to Workhorse 1 ~2 ? ~2  
Workhorse distributes request 1       Adding monitoring to workhorse
    Workhorse to Unicorn 1 18 10 Adding Unicorns
    Workhorse to Gitaly     ?    
    Workhorse to NFS     ?    
    Workhorse to Redis     ?    
Unicorn calls services 1 2500 1000 Allow more GitLab internals monitoring
    Unicorn Postgres   250 100 Speed up slow queries
    Unicorn NFS   460 200 Move to Gitaly - sample result
    Unicorn Redis   18    
Unicorn constructs Views   1500    
Unicorn makes HTML          
HTML to Browser          
    Unicorn to Workhorse 1 ~2 ? ~2  
    Workhorse to NGINX 1 ~2 ? ~2  
    NGINX to HAProxy 1 ~2 ? ~2 Compress HTML in NGINX
    HAProxy to Azure LB 1 ~2 ? ~2  
    Azure LB to Browser 1 ~20 ? ~20  
RENDER PAGE          
FIRST BYTE (see note 1)   1080 - 6347 1000  
SPEED INDEX (see note 2)   3230 - 14454 2000 Remove inline scripts, Defer script loading when possible, Lazy load images, Set up a CDN for faster asset loading, Use image resizing in CDN
Fully Loaded (see note)   6093 - 14003 not specified Enable webpack code splitting
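
As a rough budget check, the sketch below adds up the approximate ("~") per-hop estimates from the table above and compares the total with the 1000 ms First Byte goal. Summing per-step p99 values does not yield the p99 of the whole request, so treat the result as an order-of-magnitude estimate only. The constant and function names are assumptions made for this example.

```python
# Approximate per-step estimates (ms), copied from the "~" values in the table above.
INBOUND_MS = {
    "Lookup IP in DNS": 10,
    "Browser to Azure LB": 10,
    "Azure LB to HAProxy": 2,
    "HAProxy SSL with Browser": 10,
    "HAProxy to NGINX": 2,
    "NGINX buffers request": 10,
    "NGINX to Workhorse": 2,
}
RETURN_MS = {
    "Unicorn to Workhorse": 2,
    "Workhorse to NGINX": 2,
    "NGINX to HAProxy": 2,
    "HAProxy to Azure LB": 2,
    "Azure LB to Browser": 20,
}
FIRST_BYTE_TARGET_MS = 1000


def transit_overhead_ms():
    """Sum of the estimated hops on the way in and on the way back."""
    return sum(INBOUND_MS.values()) + sum(RETURN_MS.values())


def share_of_target(overhead_ms, target_ms=FIRST_BYTE_TARGET_MS):
    """Fraction of the First Byte goal consumed by the given overhead."""
    return overhead_ms / target_ms


overhead = transit_overhead_ms()  # 46 ms inbound + 28 ms return = 74 ms
```

Under these estimates the hops outside Unicorn take up well under a tenth of the First Byte goal, which suggests that most of the improvement has to come from the Unicorn steps listed in the table.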

Notes:

Git Commit Push

Table to be built; merge requests welcome!


Availability and Performance Priority Labels

To clarify the priority of issues that relate to GitLab.com's availability and performance, consider adding an Availability and Performance Priority Label, ~AP1 through ~AP3. This is similar to what is in use in the Support and Security teams, which use ~SP and ~SL labels respectively to indicate priority.

Use the following as a guideline to determine which Availability and Performance Priority label to use for bugs and feature proposals. Consider the likelihood and urgency of the "scenario" that could result from this issue (not) being resolved.

Urgency \ Impact | I1 - High | I2 - Medium | I3 - Low
U1 - High | AP1 | AP1 | AP2
U2 - Medium | AP1 | AP2 | AP3
U3 - Low | AP2 | AP3 | AP3
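
The matrix above can also be expressed as a simple lookup, for instance when labeling issues from a script. The mapping is copied from the table; the names `PRIORITY_LABELS` and `priority_label` are assumptions made for this example.

```python
# (urgency, impact) -> priority, copied from the matrix above.
PRIORITY_LABELS = {
    ("U1", "I1"): "AP1", ("U1", "I2"): "AP1", ("U1", "I3"): "AP2",
    ("U2", "I1"): "AP1", ("U2", "I2"): "AP2", ("U2", "I3"): "AP3",
    ("U3", "I1"): "AP2", ("U3", "I2"): "AP3", ("U3", "I3"): "AP3",
}


def priority_label(urgency, impact):
    """Return the label (e.g. '~AP2') for an urgency such as 'U1' and an impact such as 'I3'."""
    return "~" + PRIORITY_LABELS[(urgency, impact)]
```

For example, an issue with high urgency but low impact, `priority_label("U1", "I3")`, maps to `~AP2`.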

Database Performance

Some general notes about parameters that affect database performance, at a very crude level.