Performance

Standards we use to measure performance

Overall

Performance of GitLab and GitLab.com is ultimately about the user experience. As also described in the product management handbook, "faster applications are better applications".

Since the speed of the application depends on how it is used, we use heavy use cases as the basis for measuring performance. Specifically, the URLs from GitLab.com listed in the table below form the basis for measuring performance improvements. The times span from web request to "Last Visual Change" and are noted in milliseconds (BoQ is the timing at the beginning of the current quarter; EoQ is the target for the end of the current quarter). Times are measured using ?? connectivity.

Type           URL                                                           BoQ    EoQ
Issue          https://gitlab.com/gitlab-org/gitlab-ce/issues/4058           16800  5000
Merge request  https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/9546  13217  5000
Pipeline       https://gitlab.com/gitlab-org/gitlab-ce/pipelines/9360254    21117  5000
Repo           http://gitlab.com/gitlab-org/gitlab-ce/tree/master           6834   5000
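To make the targets concrete, each row of the table can be read as a required speedup factor for the quarter. A minimal sketch (the dictionary keys are our own shorthand for the page types, not official GitLab terms):

```python
# BoQ ("Last Visual Change" at start of quarter) and EoQ target, in ms,
# taken from the table above.
BOQ_EOQ_MS = {
    "issue": (16800, 5000),
    "merge_request": (13217, 5000),
    "pipeline": (21117, 5000),
    "repo": (6834, 5000),
}

def required_speedup(boq_ms, eoq_ms):
    """Factor by which the page must get faster to hit its EoQ target."""
    return boq_ms / eoq_ms

for page, (boq, eoq) in BOQ_EOQ_MS.items():
    print(f"{page}: needs to be {required_speedup(boq, eoq):.2f}x faster")
```

For example, the issue page needs to become roughly 3.4 times faster, while the repo page needs only about a 1.4x improvement.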

Backend

For the teams that can work on improving backend performance, we measure and set targets for "backendTime" (click the tachometer icons to see the measurements):

Type           URL                                                           BoQ   Now   EoQ
Issue          https://gitlab.com/gitlab-org/gitlab-ce/issues/4058           3693        1000
Merge request  https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/9546  6347        1000
Pipeline       https://gitlab.com/gitlab-org/gitlab-ce/pipelines/9360254    2987        1000
Repo           http://gitlab.com/gitlab-org/gitlab-ce/tree/master           1080        1000

Flow of information in various scenarios, and its performance

All items marked with a tachometer icon represent a step in the flow that we measure. Wherever possible, the tachometer icon links to the relevant dashboard in our monitoring. Also take a look at the recently added user-perspective times, along with a breakdown of the steps involved on the frontend.

Flow of web request

Consider the scenario of a user opening their browser and surfing to their dashboard by typing gitlab.com/dashboard. Here is what happens:

  1. User request reaches backend
    1. User enters gitlab.com/dashboard in their browser and hits enter
    2. Browser looks up IP address in DNS server
      • DNS request goes out and comes back (typically ~10-20 ms, [data?]; often it is already cached, in which case it is faster).
      • For more details on the steps from browser to application, enjoy reading https://github.com/alex/what-happens-when
      • Opportunities:
      • not measured
    3. From browser to Azure load balancer
      • Now that the browser knows where to find the IP address, browser sends the web request (for gitlab.com/dashboard) to Azure.
      • not measured
  2. Backend processes request and returns to browser
    1. Azure's LB to HAProxy
      • Azure's load balancer determines where to route the packet (request), and sends the request to our Frontend Load Balancer(s) (also referred to as HAProxy).
      • not measured
    2. HAProxy (load balancer) does SSL negotiation with the browser (takes time)
    3. HAProxy forwards to NGINX in one of our front end workers
      • In this case, since we are tracking a web request, it would be the nginx box in the "Web" box in the production-architecture diagram; but alternatively the request can come in via the API or a git command from the command line, hence the "API" and "git" boxes.
      • Since all of our servers are in ONE Azure VNET, the overhead of SSL handshake and teardown between HAProxy and NGINX should be close to negligible.
      • not measured
    4. NGINX gathers all network packets related to the request ("request buffering")
      • The request may be split into multiple packets by the intervening network; for more on that, read up on MTUs.
      • In other flows, this won't be true. Specifically, request buffering is switched off for LFS.
      • not measured, and not in our control.
    5. NGINX forwards full request to workhorse (in one combined request)
      • not measured
    6. Workhorse splits the request into parts to forward to
      • Unicorn (time spent waiting for Unicorn to pick up a request = HTTP queue time).
      • [not in this scenario, but not measured in any case] Gitaly
      • [not in this scenario, but not measured in any case] NFS (git clone through HTTP)
      • [not in this scenario, but not measured in any case] Redis (long polling)
    7. Unicorn processes request and returns to Workhorse
      • Unicorn, (often just called "Rails", or "application server"), translates the request into a Rails controller request; in this case RootController#index.
      • The round trip time it takes for a request to start in Unicorn and leave Unicorn is what we call Transaction Timings.
      • RailsController requests are sent to (and data is received from):
        • PostgreSQL (SQL timings),
        • NFS (git timings),
        • Redis (cache timings).
      • In this gitlab.com/dashboard example, the controller addresses all three.
        • There are usually multiple SQL (or file, or cache, etc.) calls for a given controller request. These add to the overall timing, especially since they are sequential. For example, in this scenario, there are 29 SQL calls (search for Load) when this particular user hits gitlab.com/dashboard/issues. The number of SQL calls will depend on how many projects the person has, how much may already be in cache, etc.
        • There's generally no multi-tasking within a single Rails request. In a number of places we multi-task by serving an HTML page that uses AJAX to fill in some data; for example, on gitlab.com/username the contribution calendar and the "most recent activity" sections are loaded in parallel.
        • In the Rails stack, middleware typically adds to the number of round trips to Redis, NFS, and PostgreSQL, per controller call, in addition to the timings of Rails controllers. Middleware is used for {session state, user identity, endpoint authorization, rate limiting, logging, etc.}, while the controllers typically have at least one round trip for each of {retrieve settings, cache check, build model views, cache store, etc.}. Each such round trip is estimated to take < 10 ms.
      • Unicorn constructs the relevant html blob (view) to be served back to the user. (view timings).
        • In our gitlab.com/dashboard example, view timings p99 are in the multiple seconds with mean < 1s. This is partially in parallel with the prior steps as the views dictate what they need to call from PostgreSQL, Redis, etc.
        • A particular view in Rails will often be constructed from multiple partial views. These will be used from a template file, specified by the controller action, that is, itself, generally included within a layout template. Partials can include other partials. This is done for good code organization and reuse. As an example, when the particular user from the example above loads gitlab.com/dashboard/issues, there are 56 nested / partial views rendered (search for View::)
        • GitLab renders a lot of the views in the backend (i.e. in Unicorn) vs. frontend. To see the split, use your browser's "inspect" tool and look at TTFB (time to first byte, this is the browser waiting to hear anything back, which is due to work happening in the backend) and compare it to the download time.
      • Unicorn sends html blob back to workhorse
        • Some of these blobs are expensive to compute, and are sometimes hard-coded to be sent from Unicorn to Redis (i.e. to cache) once rendered.
    8. Workhorse sends html blob to NGINX
      • not measured
    9. NGINX sends html blob to HAProxy
      • not measured
    10. HAProxy send blob to Azure load balancer
      • not measured
    11. Azure load balancer sends blob to browser
  3. Frontend processes received HTML and renders page
    1. Browser receives first byte.
      • Depends on network speed
      • not measured
    2. Browser receives additional assets, such as javascript bundles, CSS, images, and webfonts and starts incrementally parsing.
      • Depends on the number and size of assets, as well as network speed. For each asset, there is a round trip: for cached assets, browser → NGINX → NGINX confirms cached asset is still valid → browser; for non-cached or expired cached assets, browser → Unicorn → Unicorn grabs asset from the web worker host's local cache → browser.
      • Scripts and stylesheets block page rendering by default.
      • Opportunities:
        • Prevent scripts from blocking page rendering by deferring non-essential scripts with <script defer>, moving them to the bottom of the <body>, or lazy-loading them with webpack code-splitting. Related issue: gitlab-ce#33391.
        • Move non-HTML assets to a CDN to cut out transfer time. Related issues: Setting up a CDN, infrastructure#57; Using image resizing service in CDN, gitlab-ce#34364.
      • not measured
    3. Browser compiles and evaluates Javascript within the page.
      • Opportunities
        • Embedded <script> tags both block the page rendering and prevent us from deferring any javascript that they depend on. These should be eliminated. See related MR discussion.
      • not measured
    4. Browser runs events that depend on DOMContentLoaded firing.
      • DOMContentLoaded fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.
      • Various features are not activated until this event triggers, such as enabling a button to be pressed, or a scrollbar to be scrolled.
      • not measured
    5. Browser completes rendering and page is fully interactive.
      • Total time is currently not measured!
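The sequential backend hops above can be sketched as a list of stages whose latencies add up. A minimal model (the millisecond figures reuse the rough, starred estimates from the timing table on this page; they are illustrative assumptions, not live measurements):

```python
# Illustrative model of the sequential web-request flow. Stage names and
# timings mirror the handbook's rough estimates, not real monitoring data.
STAGES_MS = [
    ("DNS lookup", 10),
    ("Browser to Azure LB", 10),
    ("Azure LB to HAProxy", 2),
    ("HAProxy SSL negotiation", 10),
    ("HAProxy to NGINX", 2),
    ("NGINX request buffering", 10),
    ("NGINX to Workhorse", 2),
    ("Workhorse to Unicorn queue", 18),
    ("Unicorn builds response", 2500),
    ("Workhorse to NGINX", 2),
    ("NGINX to HAProxy", 2),
    ("HAProxy to Azure LB", 2),
    ("Azure LB to browser", 20),
]

def total_latency_ms(stages):
    """Sum of purely sequential stages. Real totals are smaller where work
    overlaps (e.g. view rendering interleaved with SQL calls)."""
    return sum(ms for _, ms in stages)

print(total_latency_ms(STAGES_MS))
```

The sum makes the point of this section visible: the network hops are noise compared to the Unicorn stage, which is why the improvement work targets the backend.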

The flow displayed as a table with timings

All times are reported in milliseconds.

Guide to this table:

Step                                         # per request  p99 BoQ  p99 Now  p99 EoQ goal  Issue links and impact
USER REQUEST REACHES BACKEND
Lookup IP in DNS                             1              10*      ?        10*
Browser to Azure LB                          1              10*      ?        10*
BACKEND PROCESSES AND RETURNS TO BROWSER
Azure LB to HAProxy                          1              2*       ?        2*
HAProxy SSL with Browser                     1              10*      ?        10*
HAProxy forwards to NGINX                    1              2*       ?        2*
NGINX buffers request                        1              10*      ?        10*
NGINX forwards to Workhorse                  1              2*       ?        2*
Workhorse distributes request                1                                              Adding monitoring to workhorse
Workhorse sends to Unicorn                   1              18                10            Adding Unicorns
Workhorse sends to Gitaly                                            ?
Workhorse sends to NFS                                               ?
Workhorse sends to Redis                                             ?
Unicorn works and returns HTML to Workhorse  1              2500              1000
Unicorn to Postgres and back                                250               100           Speed up slow queries
Unicorn to NFS and back                                     460               200           Move to Gitaly fast - sample result
Unicorn to Redis and back                                   18
Unicorn builds views                                        1500
Unicorn sends HTML to Workhorse              1
Workhorse sends to NGINX                     1              2*       ?        2*
NGINX sends to HAProxy                       1              2*       ?        2*
HAProxy sends to Azure LB                    1              2*       ?        2*
Azure LB sends to Browser                    1              20*      ?        20*
Subtotal (see note 1)                                       3833
FRONTEND PROCESSES HTML AND RENDERS PAGE
Browser receives first byte
Browser receives all assets (see note 2)                    340                             Smarter asset loading
Browser compiles JS (see note 2)                            710
Browser runs post-DOM events (see note 2)                   630
TOTAL TIME (see note 3)

Notes:

  1. Based on mean, p95, and p99 of all non-staging URLs measured in our blackbox monitoring, between 2017-03-30 and 2017-06-28.
  2. The 340ms, 710ms, and 630ms measurements here are from a sample size of 1 for a specific merge request URL, on a specific hardware configuration. For details, see the related issue.
  3. TOTAL TIME is not equal to the sum of all of the above in the table, due to parallel tasks.
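Since the table reports p99 timings, here is a minimal sketch of how a p99 can be computed from raw request timings, using the nearest-rank method (monitoring systems may interpolate differently):

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample value that is greater
    than or equal to pct percent of all samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 100 synthetic timings: 98 fast requests and 2 slow outliers.
timings = [100] * 98 + [5000] * 2
print(percentile(timings, 99))  # the p99 surfaces the slow tail
print(percentile(timings, 50))  # the median hides it entirely
```

This is why the handbook tracks p99 rather than the mean: a handful of very slow requests dominate the user-visible tail while barely moving the average.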


Flow of git commit push

First read Flow of web request above, then pick up the thread here.

After pushing to a repository, e.g. from the web UI:

  1. In a web browser, make an edit to a repo file, type a commit message, and hit "Commit"
  2. NGINX receives the git commit and passes it to Workhorse
  3. Workhorse launches a git-receive-pack process (on the workhorse machine) to save the new commit to NFS
  4. On the workhorse machine, git-receive-pack fires a git hook to trigger GitLab Shell.
    • GitLab Shell accepts Git payloads pushed over SSH and acts upon them (e.g. by checking if you're authorized to perform the push, scheduling the data for processing, etc).
    • In this case, GitLab Shell provides the post-receive hook, and the git-receive-pack process passes along details of what was pushed to the repo to the post-receive hook. More specifically, it passes a list of three items: old revision, new revision, and ref (e.g. tag or branch) name.
  5. Workhorse then passes the post-receive hook to Redis, which is the Sidekiq queue.
    • Workhorse is informed whether the push succeeded or failed (it could have failed due to the repo not being available, Redis being down, etc.).
  6. Sidekiq picks up the job from Redis and removes the job from the queue
  7. Sidekiq updates PostgreSQL
  8. Unicorn can now query PostgreSQL.
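The post-receive handoff in step 4 can be illustrated with a sketch: git writes one line per updated ref to the hook's stdin, which the hook parses before scheduling background work. This is a simplified stand-in for GitLab Shell's hook, not its actual code; the queue list below merely mimics the Redis-backed Sidekiq queue:

```python
def parse_post_receive_line(line):
    """Parse one stdin line of a git post-receive hook:
    '<old-rev> <new-rev> <ref-name>'."""
    old_rev, new_rev, ref = line.strip().split(" ", 2)
    return {"old_rev": old_rev, "new_rev": new_rev, "ref": ref}

# Example payload for a push to master (revisions are made up).
update = parse_post_receive_line("a1b2c3d4 e5f6a7b8 refs/heads/master\n")

# Stand-in for enqueueing a Sidekiq job into Redis.
queue = []
queue.append({"class": "PostReceive", "args": [update]})
print(update["ref"])
```

The real hook does essentially this plus authorization checks, and the job it enqueues is what Sidekiq later picks up in step 6.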

Availability and Performance Priority Labels

To clarify the priority of issues that relate to GitLab.com's availability and performance, consider adding an Availability and Performance Priority label, ~AP1 through ~AP3. This is similar to the practice of the Support and Security teams, which use ~SP and ~SL labels respectively to indicate priority.

Use the following as a guideline to determine which Availability and Performance Priority label to use for bugs and feature proposals. Consider the likelihood and urgency of the "scenario" that could result from this issue (not) being resolved.

Urgency \ Impact  I1 - High  I2 - Medium  I3 - Low
U1 - High         AP1        AP1          AP2
U2 - Medium       AP1        AP2          AP3
U3 - Low          AP2        AP3          AP3
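The matrix is a pure lookup from an urgency/impact pair to a label, which a small sketch makes explicit (the helper function is hypothetical, for illustration only):

```python
# The availability/performance priority matrix from the table above.
AP_MATRIX = {
    ("U1", "I1"): "AP1", ("U1", "I2"): "AP1", ("U1", "I3"): "AP2",
    ("U2", "I1"): "AP1", ("U2", "I2"): "AP2", ("U2", "I3"): "AP3",
    ("U3", "I1"): "AP2", ("U3", "I2"): "AP3", ("U3", "I3"): "AP3",
}

def ap_label(urgency, impact):
    """Map an urgency/impact pair (e.g. 'U2', 'I1') to its ~AP label."""
    return AP_MATRIX[(urgency, impact)]

print(ap_label("U2", "I1"))
```

For example, a medium-urgency, high-impact issue lands at ~AP1, while only the low/low and low/medium combinations fall to ~AP3.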

Database Performance

Some general notes about parameters that affect database performance, at a very crude level.