The engineer who gave the unfortunate command to delete our primary database was not only on our minds but also on the minds of many others. He's known to the community as "team-member-1", since that is how we referred to him in our public communications during the incident.
After we posted the postmortem of the incident with GitLab.com, we received notes from our community asking how team-member-1 was doing. We're here to tell you that.
We are still putting all our efforts into improving GitLab.com's infrastructure as a whole, to ensure this type of incident never happens again.
We were all really touched by the love our community sent us via #HugOps:
At the end of the day, we are all team-member-1:
Who is not the "team-member-1"? #HugOps— Erkan Erol (@erkan_erol_) February 2, 2017
Support from our community
The support our engineers received from our community was fantastic, and people appreciated our transparency in working on the solution. Even teams from other companies, who had been in that situation themselves, showed their support.
Kudos got @gitlab on their level of transparency. I have a lot of respect for you...— Ben Kuhl (@bkuhlorelse) February 1, 2017
❤️ Thank you all! We're very touched by your support! ❤️
The Codefresh team brought cookies to our Experience Center in San Francisco with a card saying:
Hey GitLab, We thought you could use a chance to destress a little. Don't sweat it we're rooting for you. We've all been there before and we have a lot of faith in you to work it out. You got THIS! From Dan, and the Codefresh Team
A group from Google came to our Experience Center and gave our engineers $300 to spend on something to make them feel better after all of this was over:
Yay for you! We've all been there! :) From friends of yours at Google
❤️ Thank you Codefresh and Google! ❤️
The engineers involved have agreed that this extremely generous gift from Google will be spent on sponsoring Rails Girls events. But the cookies were obviously eaten by the one who grabbed the pack! 😛
Needless to say, some special thank-you swag is on its way to these amazing people who took the time to come to our boardroom and try to help us feel a bit better.
At GitLab we think that the people making the most mistakes frequently correlate with the people doing the most work. I certainly make a lot of mistakes every day. It is important not to double down on them but to acknowledge them and learn from them.
We make mistakes. What differs from person to person, and from organization to organization, is how we deal with them. What we value most at GitLab is:
- Transparency: "Don't be afraid to admit you made a mistake or were wrong. When something went wrong it is a great opportunity to say 'What’s the kaizen moment here?' and find a better way without hurt feelings."
- Collaboration: "Say sorry. If you made a mistake, apologize. Saying sorry is not a sign of weakness but one of strength. The people that do the most will likely make the most mistakes."
- Be truthful and honest.
- Be dependable, reliable, fair, and respectful.
- Be committed, creative, inspiring, and passionate.
How could we value transparency if our team members were afraid of owning up to their mistakes?
It is not our intent to have one of our team members implicated by the transparency. (...) We are very aware of the stress that such a mistake might cause and the rest of the team has been very supportive. (...) We recognize the risk to the company of being transparent, but your values are defined by what you do when it is hard. Sid Sijbrandij, CEO.
While we were putting the fire out, we received this comment:
The GitLab engineer (team-member-1) is one of the smartest people he's ever known - in fact, he's actually brilliant. So you can rest assured that the data loss wasn't caused by an inexperienced kid. (HN)
And this thoughtful tweet:
We heard you Rubén! :)
Yes, team-member-1 is doing very well!
Coincidentally, just before the DB incident, team-member-1 had qualified for a promotion to senior developer. The outage did not change that decision.
Promotions are to be based on meeting the criteria of the role the individual is to be promoted in to (i.e. promote based on performance) - GitLab PeopleOps Handbook
When we promote people at GitLab, or give them a bonus, we share the reasons for that with the whole company. With the permission of team-member-1, this blog post is both the internal and external announcement of that promotion.
Reasons for promoting team-member-1
Pablo Carranza (Production Lead) provided GitLab with the following reasons to promote team-member-1:
Following what is expected out of a Senior Developer in the job description:
Senior Developers are experienced developers who meet the following criteria:
- Technical Skills
- Are able to write modular, well-tested, and maintainable code - I think this is beyond question here, and we have enough samples.
- Know a domain really well and radiate that knowledge - He already got a bonus for how he shares his knowledge, he is always raising the bar here.
- Contribute to one or more complementary projects - His contributions across the GitLab ecosystem are numerous, including building the whole performance monitoring metrics system that we use to understand why GitLab is slow, and projects like allocations, which is used to track low-level Ruby metrics.
- Begins to show architectural perspective - He has been involved in Gitaly since before it had this name, an architectural paradigm change that will affect GitLab profoundly. He is also heavily involved in whatever goes near the database.
- Proposing new ideas, performing feasibility analyses and scoping the work - This is what he is doing all the time. The whole GitLab development community values his input and his opinion.
- Code quality
- Leaves code in substantially better shape than before - Agreed, he does this. Refer to the links provided at the bottom
- Fixes bugs/regressions quickly - He has even performed these operations in production itself, providing hotfixes.
- Monitors overall code quality/build failures - MR
- Creates test plans - MR
- Provides thorough and timely code feedback for peers - Refer to code quality section.
- Able to communicate clearly on technical topics - He created our performance and sidekiq style guidelines - Both these samples are pushing for improving the quality of GitLab as a whole, way beyond the scope of a single MR.
- Keeps issues up-to-date with progress - He is a bar raiser here, check any of his issues, they are up to date all the time.
- Helps guide other merge requests to completion - He does this leading to a successful merge and has the capacity to reject MRs depending on how they will impact GitLab (another sample on pushing back can be found here).
- Helps with recruiting - He is currently helping me to find a database specialist by reviewing job applications and performing the initial filter.
- Performance & Scalability
- Excellent at writing production-ready code with little assistance - He has been leading the performance effort for the last year, so I think this is beyond question. He owned the move to PostgreSQL 9.6 on his own, a massive undertaking that was performed without a single glitch.
- Able to write complex code that can scale with a significant number of users - Same thing as before, he is the one paving the path for the rest of GitLab in this area. Definitely a bar raiser.
A final sample can be found here where one of our backend leads asks him for support on how to perform a large migration without causing downtime.
These samples are just the tip of the iceberg of the work that he has been performing for quite a while already. This work got us to the situation where we can deploy the application with minimum downtime, even adding automation to detect when a migration will force us to cause downtime, removing human judgment from the equation.
His work is constantly impacting the whole company in both depth and breadth. I can find lots of samples for any of the items included in the job description, and I can even find samples that match the staff developer role. Therefore I think that he has been behaving as a Senior Developer, and I ask that this behavior gets recognized and formalized by GitLab.
We support each other
For those developers involved in the outage, we made a special T-shirt:
This T-shirt has two purposes. It reminds us what happened and motivates us not to let this happen again. It is also meant to thank the team that handled the incident, which is why only they received one.
At GitLab, we value the positive achievements and performance improvements of our team members most, instead of focusing our attention on random negative situations.
Of course, we take situations like this very seriously, but we'd rather learn from them, and put all our efforts into preventing them from happening again, than punish honest and talented people for mistakes that can happen to anyone.
What doesn't kill you makes you stronger. (HN)