Why we switched our philosophy from Ops to Infrastructure

Pablo Carranza · Aug 12, 2016

There is Ops, Infrastructure, Performance, DevOps, and so on. The terms and titles go on, and they vary across industries, companies, and cultures. At GitLab, we focus on the philosophy, not the title. In this post, I’ll explain why and how our team shifted its approach to GitLab's performance from an Operations mindset to an Infrastructure mindset.

Operations mindset

With more and more people using GitLab to host their public and private repos, run CI tests, and deploy to a number of different environments, we started experiencing noticeable performance and scaling challenges. We’d spot a problem and then race to get it fixed. The team was incredibly reactive, working to fix this and change that. The reality is that computers will break, and as you scale, more things will fail. With this in mind, we could’ve taken the “Mongolian horde” approach and thrown more people at the problem. However, that would have been another knee-jerk reaction, and we could already see that the reactive way of doing things would never scale. So, we had to change. Our goal was to stop running behind the issues and start anticipating challenges in order to stay steps ahead of them.

The transition

Like most things, change is a process. Here are the steps we took:

Cultural shift

Making this transition really forced the company to tear down the wall between development and production and collectively focus on building a better product. It’s been very important for our infrastructure team to have a "developer mindset": we need to find simple solutions to complex problems and constantly work to code ourselves out of a job. Our team works to scale our software and our infrastructure by automating solutions, and there will always be new challenges for the team to work on next. For example, one problem we faced recently was that we were going to run out of storage on a single appliance, a show-stopper we needed to fix before it happened. Our process to get ahead of this was:

  1. Identify the problem: we are running out of storage space and performance. This raises the question: how much time do we have left?
  2. Add monitoring to understand the context and environment: monitor iostat, monitor filesystem growth, and estimate how much time we have left (see the sketch after this list).
  3. Build a hypothesis and an experiment to challenge our assumptions: by using multiple shards we can buy time, at the cost of some added complexity, to move to a better solution.
  4. Run the experiment by building a small piece of infrastructure: attach a new filesystem shard to the nodes, and set up new projects to be created there.
  5. Learn, and move to the next iteration of solving this long-running issue, leaving better tooling behind to make a better decision next time.
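
To make the monitoring step concrete, here is a minimal sketch of the kind of "boring" measurement that answers "how much time do we have left?": sample filesystem usage on a schedule and extrapolate the growth linearly. The paths, the sample log format, and the cadence are assumptions for illustration, not our actual tooling.

```python
#!/usr/bin/env python3
"""Rough sketch: estimate how long until a filesystem fills up.

Assumes a cron job runs this script periodically, appending one sample
per run to a CSV of (unix_timestamp, bytes_used). The file paths are
hypothetical placeholders.
"""
import csv
import shutil
import time

SAMPLES = "/var/log/nfs_usage.csv"   # hypothetical sample log
MOUNT = "/var/opt/gitlab/git-data"   # hypothetical NFS mount point


def record_sample():
    """Append the current usage of MOUNT to the sample log."""
    usage = shutil.disk_usage(MOUNT)
    with open(SAMPLES, "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), usage.used])


def days_until_full():
    """Linear extrapolation of the growth rate over the recorded samples."""
    with open(SAMPLES) as f:
        rows = [(int(t), int(used)) for t, used in csv.reader(f)]
    if len(rows) < 2:
        return None
    (t0, used0), (t1, used1) = rows[0], rows[-1]
    rate = (used1 - used0) / (t1 - t0)   # bytes per second
    if rate <= 0:
        return None                      # not growing, no deadline
    free = shutil.disk_usage(MOUNT).free
    return free / rate / 86400           # seconds -> days


if __name__ == "__main__":
    record_sample()
    remaining = days_until_full()
    if remaining is not None:
        print(f"~{remaining:.1f} days of storage left at the current rate")
```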

In this iteration we realized that NFS was not to blame for our git ssh access timings at all; the time was spent within ssh itself. We also learned that most of our traffic comes from new projects being imported into GitLab.com, so most of the write load moved to the new shard. This is good information that we can use to plan our infrastructure and use our resources better.

The story of a recent win: improve our ssh git access time

  1. Assumption: ssh is slow because we are doing a linear search in the authorized_keys file. The data backing this assumption is the current graphs of IO metrics on the main NFS server.
  2. Experiment: adding an authorized keys API command and using it for OpenSSH authorization will give better performance (see the first sketch after this list).
  3. Result: API access stabilized because of less filesystem access, but the API is still slow (graph: stabilized API access).
  4. Assumption: the web in general (API included) is slow because the worker nodes are being restarted too often.
  5. Experiment: adding queueing times for HTTP requests will tell us how long a request waits before being served (see the second sketch after this list).
  6. Result: we had much better information and realized that our HTTP requests were queueing for 1 second in the p99 case.
  7. Assumption: by preventing unicorn processes from being killed too often, we will avoid requests being enqueued for too long.
  8. Experiment: raising the memory threshold at which workers are killed will keep them running for longer.
  9. Result: HTTP queueing time dropped to virtually zero and transaction timings improved massively as well (graph: HTTP queueing time).
  10. New data: looking at the wider picture, our ssh access is still irregular and intermittently quite slow (graph: intermittent slow ssh access).
  11. Assumption: after deeper investigation, dbus is queueing connections because of an arbitrary max-sockets limit and bad file descriptor handling.
  12. Experiment: patching dbus in a PPA package and bouncing all the workers will remove the dbus queueing time.
  13. Result: git ssh access stabilized at ~2 seconds for push and ~5 seconds for pull (graph: stable ssh access times).
  14. Ongoing further actions: investigate how we can reduce those ssh access timings further, and contribute back to the community so everyone can benefit.
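
For the curious, here is a rough sketch of what an authorized keys lookup command can look like. OpenSSH's AuthorizedKeysCommand lets sshd call out to a program instead of scanning a flat authorized_keys file, and newer OpenSSH versions can pass the offered key via the %k token. The endpoint, token, and response shape below are assumptions for illustration and differ from GitLab's real internal API.

```python
#!/usr/bin/env python3
"""Sketch of an AuthorizedKeysCommand helper: instead of scanning a huge
authorized_keys file linearly, ask an internal API for the matching key.

The endpoint, secret token, and response format are hypothetical.
"""
import sys
import urllib.parse
import urllib.request

INTERNAL_API = "http://localhost:8080/api/internal/authorized_keys"  # hypothetical
SECRET_TOKEN = "change-me"                                            # hypothetical


def main():
    # sshd passes the offered key via %k in the AuthorizedKeysCommand line.
    if len(sys.argv) < 2:
        sys.exit(1)
    query = urllib.parse.urlencode({"key": sys.argv[1]})
    req = urllib.request.Request(
        f"{INTERNAL_API}?{query}",
        headers={"Secret-Token": SECRET_TOKEN},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            # Expected to return a single authorized_keys line for this key.
            sys.stdout.write(resp.read().decode())
    except Exception:
        # On any failure print nothing, so sshd rejects the key.
        sys.exit(1)


if __name__ == "__main__":
    main()
```

A matching sshd_config would point AuthorizedKeysCommand at this script and set AuthorizedKeysCommandUser to the git user, so key lookups become a single indexed API call instead of a linear file scan.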
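
The queueing-time measurement works by having the proxy in front of the workers stamp each request with its arrival time, and having the worker report the difference when it finally starts processing. GitLab's stack is Ruby/Rack; the WSGI middleware below just illustrates the same idea in Python, and the X-Request-Start header format is an assumption.

```python
"""Sketch: measure how long a request sat in the queue before a worker
picked it up, assuming the upstream proxy sets a header like
X-Request-Start: t=<unix seconds with millisecond precision>.
"""
import time


class QueueTimeMiddleware:
    def __init__(self, app, report=print):
        self.app = app
        self.report = report  # hook this up to a real metrics system

    def __call__(self, environ, start_response):
        header = environ.get("HTTP_X_REQUEST_START", "")  # e.g. "t=1470998400.123"
        if header.startswith("t="):
            try:
                started = float(header[2:])
                queued = max(time.time() - started, 0.0)
                self.report(f"request queued for {queued * 1000:.1f} ms")
            except ValueError:
                pass  # malformed header, skip the measurement
        return self.app(environ, start_response)
```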

Other things happened in the meantime: we added a public black-box monitoring system to make our performance improvement efforts visible. We used this monitoring to start with simple things, and over time we added more and more metrics to get better insight. For example, monitoring our ssh access times was as easy as writing a simple script and adding a cron job to probe access every minute (sketched below). Only with this boring solution did we manage to understand how GitLab.com was behaving, and to see how it evolved as we worked to build a better system.
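
As a flavor of how simple such a probe can be, here is a sketch of a once-a-minute cron job that times a git ls-remote over ssh and appends the result to a log a dashboard can graph. The repository URL and log path are placeholders, not the exact probe we run.

```python
#!/usr/bin/env python3
"""Sketch of a black-box ssh probe, meant to be run from cron every minute:
* * * * * /usr/local/bin/ssh_probe.py
"""
import subprocess
import time

REPO = "git@gitlab.com:gitlab-org/gitlab-ce.git"  # any small public repo works
LOG = "/var/log/ssh_probe.log"                    # hypothetical log location


def probe():
    start = time.time()
    result = subprocess.run(
        ["git", "ls-remote", REPO, "HEAD"],
        capture_output=True,
        timeout=60,
    )
    elapsed = time.time() - start
    status = "ok" if result.returncode == 0 else "fail"
    with open(LOG, "a") as f:
        f.write(f"{int(start)} {status} {elapsed:.2f}\n")


if __name__ == "__main__":
    probe()
```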

There were also some assumptions that proved wrong but led to better understanding. For example, we assumed that our ssh access was slow because of the TCP load balancing we do before reaching a worker node; this turned out not to be the case once we started monitoring each node individually. These kinds of experiments are extremely useful because they invalidate the assumption and make you look somewhere else: failing is an extremely important part of the process.

Our toolbox

Here is a list of the tools we use right now:
