Performance

Flow of information in various scenarios, and its performance

Issue that spawned this page: https://gitlab.com/gitlab-com/infrastructure/issues/1878

All items that start with the tachometer symbol represent a step in the flow that we measure. Wherever possible, the tachometer icon links to the relevant dashboard in our monitoring.

Flow of web request

Consider the scenario of a user opening their browser and navigating to their dashboard by typing gitlab.com/dashboard; here is what happens:

  1. User enters gitlab.com/dashboard in their browser and hits enter
  2. Browser looks up IP address in DNS server
    • The DNS request goes out and comes back (typically ~10-20 ms, [can use link to data]; often it is already cached, in which case it is faster).
    • We use Route53 for DNS, and will start using DynDNS soon as well.
    • For more details on the steps from browser to application, enjoy reading https://github.com/alex/what-happens-when
    • not measured
  3. From browser to load balancers
    • Now that the browser knows the IP address, it sends the web request (for gitlab.com/dashboard) to Azure; Azure determines where to route the packet (request), and sends the request to our Frontend Load Balancer(s) (also referred to as HAProxy).
    • not measured
  4. HAProxy (load balancer) does SSL negotiation with the browser (takes time)
  5. HAProxy forwards to NGINX in one of our front end workers
    • In this case, since we are tracking a web request, it would be the NGINX box in the "Web" box in the production-architecture diagram; alternatively, the request can come in via the API or a git command from the command line, hence the "API" and "Git" boxes.
    • Since all of our servers are in ONE Azure VNET, the overhead of SSL handshake and teardown between HAProxy and NGINX should be close to negligible.
    • not measured
  6. NGINX gathers all network packets related to the request ("request buffering")
    • The request may be split into multiple packets by the intervening network; for more on that, read up on MTUs.
    • In other flows this won't be the case; specifically, request buffering is switched off for LFS.
    • not measured, and not in our control.
  7. NGINX forwards full request to workhorse (in one combined request)
    • not measured
  8. Workhorse splits the request into parts to forward to:
    • Unicorn (time spent waiting for Unicorn to pick up a request = HTTP queue time).
    • [not in this scenario, but not measured in any case] Gitaly
    • [not in this scenario, but not measured in any case] NFS (git clone through HTTP)
    • [not in this scenario, but not measured in any case] Redis (long polling)
  9. Unicorn (often just called "Rails" or "application server") translates the request into a Rails controller request; in this case RootController#index. Rails controller requests are sent to:
    • PostgreSQL (SQL timings),
    • NFS (git timings),
    • Redis (cache timings).
    • In this gitlab.com/dashboard example, the controller addresses all three. Cache timings are typically ~20 ms, git timings are in the hundreds of ms (peaky), and SQL timings have a mean in the tens of ms with peaks up to 5 s.
    • There are usually multiple SQL (or file, or cache, etc.) calls for a given controller request. These add to the overall timing, especially since they run sequentially. For example, in this scenario, there are 29 SQL calls (search for Load) when this particular user hits gitlab.com/dashboard/issues. The number of SQL calls depends on how many projects the person has, how much may already be in cache, etc.
    • There's generally no multi-tasking within a single Rails request. In a number of places we multi-task by serving an HTML page that uses AJAX to fill in some data; for example, on gitlab.com/username the contribution calendar and the "most recent activity" sections are loaded in parallel.
    • In the Rails stack, middleware typically adds round trips to Redis, NFS, and PostgreSQL per controller call, on top of the timings of the Rails controllers themselves. Middleware is used for session state, user identity, endpoint authorization, rate limiting, logging, etc., while the controllers typically have at least one round trip for each of retrieving settings, checking the cache, building model views, storing to the cache, etc. Each such round trip is estimated to take < 10 ms. A back-of-the-envelope sketch of how these sequential round trips add up follows this list.
  10. Unicorn receives the information from the database, NFS, and cache
    • We have no data on the round-trip time for requesting / receiving the data.
  11. Unicorn constructs the relevant html blob (view) to be served back to the user.
    • In our gitlab.com/dashboard example, view timings have a p99 of multiple seconds and a mean < 1 s. See the View Timings.
    • A particular view in Rails will often be constructed from multiple partial views. These will be used from a template file, specified by the controller action, that is, itself, generally included within a layout template. Partials can include other partials. This is done for good code organization and reuse. As an example, when the particular user from the example above loads gitlab.com/dashboard/issues, there are 56 nested / partial views rendered (search for View::)
    • GitLab renders a lot of the views in the backend (i.e. in Unicorn) rather than in the frontend. To see the split, use your browser's "inspect" tool and look at TTFB (time to first byte: the time the browser spends waiting to hear anything back, which corresponds to work happening in the backend), and compare it to the download time. A rough sketch of separating these client-side timings from outside the browser follows this list.
    • Some of these blobs are expensive to compute, and are sometimes hard-coded to be sent from Unicorn to Redis (i.e. to cache) once rendered.
  12. Unicorn sends html blob back to workhorse
    • The round trip time it takes for a request to start in Unicorn and leave Unicorn is what we call Transaction Timings.
  13. Workhorse sends html blob to NGINX
    • not measured
  14. NGINX sends html blob to HAProxy
    • not measured
  15. HAProxy sends html blob to Azure load balancer
    • not measured
  16. Azure load balancer sends blob to browser
  17. Browser renders page.
    • not measured
    • The rendering refers to the html blob. However, the browser also needs to load JS, CSS, images, and webfonts before the user can interact with it. As the page is streamed to the browser, the browser will be incrementally parsing it, looking for additional resources that it can start fetching. If these resources are on a different hostname, the browser will need to perform further DNS lookups (see step 2). For more, see the related issue.
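
As referenced in step 9 above, here is a back-of-the-envelope sketch of why sequential round trips dominate a controller action's backend time. It is not part of our monitoring, and the call counts and per-call latencies are illustrative assumptions (loosely based on the gitlab.com/dashboard/issues example), not measurements.

```python
# Hypothetical per-call latencies in ms; counts and timings are assumptions.
round_trips_ms = {
    "redis_cache": [20] * 3,    # a few cache checks at ~20 ms each
    "postgresql":  [15] * 29,   # e.g. 29 SQL calls, tens of ms each
    "nfs_git":     [150] * 2,   # a couple of git operations (peaky)
}

total = 0
for backend, timings in round_trips_ms.items():
    subtotal = sum(timings)
    total += subtotal
    print(f"{backend:12s} {len(timings):3d} calls  {subtotal:5d} ms")

# The calls run sequentially within one Rails request, so the subtotals add up.
print(f"{'total':12s} {'':11s} {total:5d} ms")
```

Cutting the number of round trips (caching, batching SQL queries) therefore shrinks the controller timing roughly linearly.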
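
As mentioned in step 11 above, the split between DNS lookup, backend work, and download time can also be approximated from outside the browser. This is a minimal sketch, assuming the third-party Python requests library is available; it is not part of our monitoring, and the URL is just this page's example.

```python
import socket
import time

import requests  # third-party HTTP client, assumed to be installed

URL = "https://gitlab.com/dashboard"
HOST = "gitlab.com"

t0 = time.monotonic()
socket.getaddrinfo(HOST, 443)            # DNS lookup (step 2); often cached
t_dns = time.monotonic() - t0

t0 = time.monotonic()
resp = requests.get(URL, stream=True)    # returns once headers arrive, ~TTFB
t_ttfb = time.monotonic() - t0
_ = resp.content                         # drain the body to time the download
t_total = time.monotonic() - t0

print(f"DNS lookup : {t_dns * 1000:7.1f} ms")
print(f"TTFB       : {t_ttfb * 1000:7.1f} ms  (mostly backend work, steps 3-12)")
print(f"Download   : {(t_total - t_ttfb) * 1000:7.1f} ms")
```

A browser's own "inspect" network panel gives the same breakdown per resource, including the JS, CSS, images, and webfonts loaded after the initial HTML.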

Flow of git commit push

First read Flow of web request above, then pick up the thread here.

After pushing to a repository, e.g. from the web UI:

  1. In a web browser, make an edit to a repo file, type a commit message, and hit "Commit"
  2. NGINX receives the git commit and passes it to Workhorse
  3. Workhorse launches a git-receive-pack process (on the workhorse machine) to save the new commit to NFS
  4. On the workhorse machine, git-receive-pack fires a git hook to trigger GitLab Shell.
    • GitLab Shell accepts Git payloads pushed over SSH and acts upon them (e.g. by checking if you're authorized to perform the push, scheduling the data for processing, etc).
    • In this case, GitLab Shell provides the post-receive hook, and the git-receive-pack process passes along details of what was pushed to the repo to the post-receive hook. More specifically, it passes a list of three items: old revision, new revision, and ref (e.g. tag or branch) name. A minimal sketch of this hook input follows this list.
  5. Workhorse then passes the post-receive data to Redis, which serves as the Sidekiq queue.
    • Workhorse is informed whether the push succeeded or failed (it could have failed due to the repo not being available, Redis being down, etc.).
  6. Sidekiq picks up the job from Redis and removes the job from the queue
  7. Sidekiq updates PostgreSQL
  8. Unicorn can now query PostgreSQL.
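
As noted in step 4 above, a git post-receive hook receives one line per updated ref on standard input: old revision, new revision, and ref name. GitLab Shell's real hook does far more (authorization checks, scheduling work, etc.); the following is only a minimal sketch of that input format.

```python
#!/usr/bin/env python3
# Minimal illustration of the data a post-receive hook reads from stdin.
import sys

for line in sys.stdin:
    old_rev, new_rev, ref = line.split()
    # In GitLab these three values describe what was pushed; they are what
    # ends up scheduled for processing via the Redis-backed Sidekiq queue.
    print(f"{ref}: {old_rev[:8]} -> {new_rev[:8]}")
```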

Availability and Performance Priority Labels

To clarify the priority of issues that relate to GitLab.com's availability and performance, consider adding an Availability and Performance Priority Label, ~AP1 through ~AP3. This is similar to what the Support and Security teams use; they use ~SP and ~SL labels, respectively, to indicate priority.

Use the following as a guideline to determine which Availability and Performance Priority label to use for bugs and feature proposals. Consider the likelihood and urgency of the "scenario" that could result from this issue (not) being resolved.

Urgency \ Impact    I1 - High    I2 - Medium    I3 - Low
U1 - High           AP1          AP1            AP2
U2 - Medium         AP1          AP2            AP3
U3 - Low            AP2          AP3            AP3
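
For illustration only, the matrix above is a straightforward lookup from urgency and impact to a label. A hypothetical helper (not an existing tool) might look like this:

```python
# Hypothetical helper mirroring the priority matrix above.
AP_MATRIX = {
    ("U1", "I1"): "AP1", ("U1", "I2"): "AP1", ("U1", "I3"): "AP2",
    ("U2", "I1"): "AP1", ("U2", "I2"): "AP2", ("U2", "I3"): "AP3",
    ("U3", "I1"): "AP2", ("U3", "I2"): "AP3", ("U3", "I3"): "AP3",
}

def ap_label(urgency: str, impact: str) -> str:
    """Return the ~AP label for a given urgency (U1-U3) and impact (I1-I3)."""
    return AP_MATRIX[(urgency, impact)]

assert ap_label("U2", "I1") == "AP1"
```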

Database Performance

Some general notes about parameters that affect database performance, at a very crude level.