For any new or existing system or large feature set, the questions in this guide help make an existing service more robust, and help a new system become fully production-ready faster.
Initially, this guide is likely to be used by Production Engineers who are embedded with other teams working on existing services and features. However, anyone working on a new service or feature set is encouraged to use this guide as well.
The goal of this guide is to help others understand how the new service or feature set may impact the rest of the (production) system; what steps need to be taken (besides deploying this new system) to ensure that it can be properly managed; and what it will take to manage the reliability of the new system, feature, or service beyond its initial deployment.
To use the guide, copy and paste this list into an issue and edit the body of the issue as you advance.

1. Are we storing data?
   - [ ] If yes, where?
   - [ ] If we use a database, is the data structure verified and vetted by the database team?
   - [ ] Does the _kind_ of data that is stored affect what is required for it in terms of availability, backup frequency, or security?
   - [ ] Does it have to be HA, specifically, does it have to have a failover mechanism?
   - [ ] Do we have an approximate growth rate of the stored data (for capacity planning)?
   - [ ] Can we age data and delete data of a certain age?
   - [ ] Do we have a backup mechanism already or do we need to set it up?
1. How can we scale this service out?
   - [ ] Can we add servers/workers on the fly?
   - [ ] Can we determine the load balancing algorithm? What would be the most boring setup to prevent a SPOF?
   - [ ] What were the limits during the benchmark? (Or if there was no benchmark, that should be made explicit.)
   - [ ] Which metrics show that we need to scale this out?
1. Interdependence
   - [ ] Which services rely on this one?
   - [ ] Are there _internal_ services which this service relies upon? (e.g. NFS, Redis, etc.)
   - [ ] Are there _external_ services which this service relies upon? (e.g. S3, bitly, etc.)
   - [ ] Can we disable it via a feature flag?
   - [ ] What is the customer impact if we turn this off?
   - [ ] What is the impact on this service if the dependencies are turned off?
1. Monitoring
   - [ ] Are all the services configured to log and forward these logs to Logstash?
   - [ ] Is the service reporting metrics to Prometheus?
   - [ ] Is there a dashboard set up on performance.gitlab.net to view the key metrics?
   - [ ] Do we have a target SLA in place for this service?
   - [ ] Do we know what the indicators (SLI) are that map to the target SLA?
   - [ ] Do we have alerts that are triggered when the SLIs (and thus the SLA) are not met?
   - [ ] Do we have troubleshooting runbooks linked to these alerts?
1. Security
   - [ ] If this is adding any hosts, do they have the chef security role applied?
   - [ ] Is the OS of this host being automatically updated? If not, who will be responsible for updating it?
1. Responsibility
   - [ ] Which individuals know the most about this feature/system?
   - [ ] Are base architecture, functions, and administrative tasks documented and published?
   - [ ] Which team or set of individuals will take responsibility for the reliability of the feature/system once it is in production?
   - [ ] Is someone from the team who built the feature on call for the launch? If not, why not?
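
The monitoring questions above ask whether an indicator (SLI) maps to the target SLA and whether an alert fires when it is missed. As a rough illustration of that mapping, the Python sketch below derives an availability SLI from request counts and compares it with a target. The names `SliWindow`, `availability`, `error_budget_remaining`, and `should_alert`, and the 99.9% target in the usage note, are assumptions made for this example; they are not part of any GitLab tooling, and a real alert would normally be expressed as a rule in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Request counts over one evaluation window (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def availability(window: SliWindow) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SliWindow, target: float) -> float:
    """Fraction of the error budget left; a negative value means it is overspent."""
    allowed_failures = (1 - target) * window.total_requests
    if allowed_failures == 0:
        # No failures are permitted, so any failure exhausts the budget.
        return 0.0 if window.failed_requests == 0 else float("-inf")
    return 1 - window.failed_requests / allowed_failures


def should_alert(window: SliWindow, target: float) -> bool:
    """True when the measured SLI falls below the target."""
    return availability(window) < target
```

For example, with an assumed target of `0.999`, `should_alert(SliWindow(10_000, 5), 0.999)` is false and about half of the error budget remains, while `should_alert(SliWindow(10_000, 20), 0.999)` is true because 20 failures is twice the allowance of roughly 10.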