GitLab has to be a highly available, mission-critical system. To achieve this, we must design and deploy the system so that a number of principles are met:
Below are examples of concrete items that will help improve GitLab's fault tolerance:
Note that the above list does not mention microservices as a cure-all. A
microservice architecture can help provide fault isolation, but it
does not inherently do so. For example, suppose we introduce a UserAPI
microservice that provides an API for all services to retrieve
users in the system. Now our architecture may look like:
The UserAPI microservice could still be a single point of failure (SPOF)
here; if it goes down, all the other services in the system
(e.g. Rails and Sidekiq) also stop working. We've introduced a new
service that can be owned by a single team, but in doing so we haven't
necessarily improved isolation. Can the system function without this
service? Probably not, although there may be other advantages to doing
this (e.g. making it possible to shard user data across multiple servers
or improving performance). We still have to think about how to avoid a SPOF.
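
One way a caller might soften its dependency on such a service is to fall back to recently cached data when the service is unreachable. The following is a minimal sketch under stated assumptions, not a description of how GitLab works: the `UserLookup` class, the `fetch` callable standing in for a UserAPI client, and the `max_stale_seconds` parameter are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class UserLookup:
    """Hypothetical caller-side wrapper around a UserAPI client.

    On success the result is cached. If the service is unreachable, a cached
    copy is served as long as it is not older than max_stale_seconds.
    """

    def __init__(
        self,
        fetch: Callable[[int], dict],          # assumed: calls UserAPI, may raise
        max_stale_seconds: float = 300.0,      # assumed staleness budget
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache: Dict[int, Tuple[float, dict]] = {}

    def get(self, user_id: int) -> Optional[dict]:
        try:
            user = self._fetch(user_id)
        except (ConnectionError, TimeoutError):
            # UserAPI is down: degrade to cached data instead of failing outright.
            cached = self._cache.get(user_id)
            if cached is not None and self._clock() - cached[0] <= self._max_stale:
                return cached[1]
            return None  # nothing usable; the caller must handle the outage
        self._cache[user_id] = (self._clock(), user)
        return user
```

A fallback like this only degrades reads gracefully for users that were fetched recently. It does not remove the single point of failure, so redundancy for the service itself still has to be designed in.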
In addition, GitLab is unique in that every microservice we create has to be shipped to customers, so there is overhead in managing the configuration and redundancy of these services as well.
That being said, microservices may be worth it if we can clearly define the engineering benefit towards maintainability, scalability, and reliability. For example, we've considered introducing a GitLab CI service daemon that can better handle CI queues.