Engineering
FY24 Direction
GitLab has a Three-Year Strategy, and we’re excited to see every member of the Engineering division contribute to achieving it. Whether you’re creating something new or improving something that already exists, we want you to feel empowered to bring your best ideas for influencing the product direction through improved scalability, usability, resilience, and system architectures. And when you feel like you need to expand your knowledge in a particular area, know that you’re supported in having the resources to learn and improve your skills.
Our focus in FY24 is to make sure that GitLab is enterprise grade in all its capabilities and to support the AI efforts required to successfully launch Code Suggestions and GitLab Duo to General Availability.
Making sure that GitLab is enterprise grade involves several teams collaborating to improve our disaster recovery and support offerings through ongoing work on GitLab Dedicated and the Cells infrastructure. Our goal here is improved availability and service recovery.
Engineering Culture
Engineering culture at GitLab encompasses the processes, workflows, principles, and priorities that all stem from our GitLab Values. All of these continuously strengthen our engineering craftsmanship and allow engineers to achieve engineering excellence while growing and having a significant, positive impact on the product, the people, and the company as a whole. Our engineering culture is carried and evolves primarily through knowledge sharing and collaboration. Everyone can be part of this process because at GitLab everyone can contribute.
Engineering Excellence
Engineering excellence can be defined as an intrinsic motivation to improve engineering efficiency and software quality, and to deliver better results while building software products. Engineering excellence is fueled by a strong engineering culture combined with a mission: to build better software that allows everyone to contribute.
Engineering Initiatives
Engineering is the primary advocate for the performance, availability, and security of the GitLab project. Product Management prioritizes 80% of engineering time, so everyone in the engineering function should participate in the Product Management prioritization process to ensure that our project stays ahead in these areas. Engineering prioritizes 20% of time on initiatives that improve the underlying platform and foundational technologies we use.
- Reviewing fixes from our support team. These merge requests are tagged with the Support Team Contributions label, and you can filter on open MRs (see the example after this list).
- Working on high-priority issues that result from issue triaging. This is our commitment to the community, and we need to include some capacity to review MRs or work on defects raised by the community.
- Improvements to the performance, stability, and scalability of a feature. Again, the Product team should be involved in the definition of these issues, but Engineering may lead here by clearly defining the recommended improvements.
- Improvements and upgrades to our toolchain in order to boost efficiency.
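As a minimal sketch (not an official tool), open merge requests carrying the Support Team Contributions label can be listed through the GitLab REST API. The gitlab-org/gitlab project path and the GITLAB_TOKEN environment variable below are illustrative assumptions.

```python
import os
import requests

# List open MRs labeled "Support Team Contributions" in gitlab-org/gitlab.
# Assumes a personal access token with read_api scope in GITLAB_TOKEN.
API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org%2Fgitlab"  # URL-encoded project path

resp = requests.get(
    f"{API}/projects/{PROJECT}/merge_requests",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"state": "opened", "labels": "Support Team Contributions", "per_page": 20},
    timeout=30,
)
resp.raise_for_status()

for mr in resp.json():
    print(f"!{mr['iid']}: {mr['title']} ({mr['web_url']})")
```

Filtering by the same label in the GitLab UI works just as well; the script simply makes the same query repeatable from the command line.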
Community Contributions
We have a 3-year goal of reaching 1,000 monthly contributors as a way to mature new stages, add customer-desired features that aren’t on our roadmap, and even translate our product into multiple languages.
Diversity
Diverse teams perform better. They provide a sense of belonging that leads to higher levels of trust, better decision making, and a larger talent pool. They also focus more on facts, process facts more carefully, and are more innovative. By hiring globally and increasing the numbers of women and ethnic minorities in the Engineering division, we’re helping everyone bring their best selves to work.
Growing our team
Hiring is still a top priority in FY24, and we’re excited to continue hiring people who are passionate about our product and have the skills to make it the best DevSecOps tool in the market. Our current focus areas include reducing the amount of time between offer and start dates and hiring a diverse team (see above). We’re also implementing industry-standard approaches like structured, behavioral, and situational interviewing to help ensure a consistent interview process that helps to identify the best candidate for every role. We’re excited to have a recruiting org to partner with as we balance the time that managers spend recruiting against the time they spend investing in their current team members.
Expand customer focus through depth and stability
As expected, a large part of our focus in FY24 is on improving our product.
For Enterprise customers, we’re refining our product to meet the levels of security and reliability that customers rightfully demand from SaaS platforms (SaaS Reliability). We’re also providing more robust utilization metrics to help them discover features relevant to their own DevOps transformations (Usage Reporting) and offering the ability to purchase and manage licenses without spending time contacting Sales or Support (E-Commerce and Cloud Licensing). Lastly, in response to Enterprise customer requests, we’re adding features to support Suggested Reviewers, better portfolio management through Work Items, and Audit Events that provide additional visibility into user passive actions.
For Free Users, we’re becoming more efficient with our open core offering, so that we can continue to support and give back to students, startups, educational institutions, open source projects, GitLab contributors, and nonprofits.
For Federal Agencies, we’re obtaining FedRAMP certification to strengthen confidence in the security standards required on our SaaS offering. This is a mandated prerequisite for United States federal agencies to use our product.
For Hosted Customers, we’re supporting feature parity between Self-Managed and GitLab Hosted environments through the Workspace initiative. We’re also launching GitLab Dedicated for customers who want the flexibility of cloud with the security and performance of a single-tenant environment.
For customers using CI/CD, we’re expanding the available types of Runners to include macOS, Linux/Docker, and Windows, and we’re autoscaling build agents.
Engineering Departments
There are four departments within the Engineering Division:
- Core Development Department
- Expansion Development Department
- Infrastructure & Quality Department
- Support Engineering Department
Other Related Pages
- CTO Staff
- Communication
- Database Engineering
- Development Principles
- Engineering Automation
- Engineering Metrics
- Engineering OKRs
- Engineering READMEs
- Frequently Used Projects
- GitLab Innovation Program, managed by the GitLab Legal Team
- Hiring
- Mentorship
- Pajamas Design System
- R&D Tax Credit Applications
Workflows
- Developer onboarding
- Engineering Demo Process
- Engineering Workflow
- GitLab Repositories
- Issue Triage Policies
- Contributing to Go projects
- Wider Community Merge Request Triage Policies
- Root Cause Analysis
- Critical Security Releases
- Incident Management
GitLab in Production
- Workflow Diagram
- Error Budgets
- Performance of GitLab
- Monitoring of GitLab.com
- Production Readiness Guide
People Management
- Engineering Compensation Roadmaps
- Engineering Career Development
- Engineering Career Mobility Principles
- Engineering Internship
- Engineering Secondments
- Engineering Management
- Volunteer Coaching program for URGs
- Starting New Teams
Cross-Functional Prioritization
See the Cross-Functional Prioritization page for more information.
SaaS Availability Weekly Standup
To maintain high availability, Engineering runs a weekly SaaS Availability standup to:
- Review high severity (S1/S2) public facing incidents
- Review important SaaS metrics
- Track progress of Corrective Actions
- Track progress of Feature Change Locks
Infrastructure Items
Each week the Infrastructure team reports on incidents and key metrics. Updating these items at the top of the Engineering Allocation Meeting Agenda is the responsibility of the Engineering Manager for the General Squad in Reliability.
- Incident Review
- Include any S1 incidents that have occurred since the previous meeting.
- Include any incidents that required a status page update.
- SaaS Metrics Review
- Include screenshots of the relevant SaaS metrics graphs in the agenda.
Development Items
For the Core and Expansion Development Departments, updates on the current status of:
- Error budgets
- Reliability issues (infradev)
- Security issues
Groups under Feature Change Locks should update progress synchronously or asynchronously in the weekly agenda. The intention of this meeting is to communicate progress and to evaluate and prioritize escalations from infrastructure.
Feature Change Locks progress reports should appear in the following format in the weekly agenda:
FCL xxxx - [team name]
- FCL planning issue: <issue link>
- Incident Issue: <issue link>
- Incident Review Issue: <issue link>
- Incident Timeline: <link to Timeline tab of the Incident issue>
  - e.g. time to detection, time to initiate/complete rollback (as applicable), time to mitigation
- Cause of Incident
- Mitigation
- Status of planned/completed work associated with the FCL
Feature Change Locks
A Feature Change Lock (FCL) is a process to improve the reliability and availability of GitLab.com. We will enact an FCL anytime there is an S1 or public-facing (status page) S2 incident on GitLab.com (including the License App, CustomersDot, and Versions) determined to be caused by an engineering department change. The team involved should be determined by the author, their line manager, and that manager’s other direct reports.
If the incident meets the above criteria, then the manager of the team is responsible for:
- Forming the group of engineers working under the FCL. By default, this will be the whole team, but it could be a reduced group if there is not enough work for everyone.
- Planning and executing the FCL.
- Informing their manager (e.g. Senior Manager / Director) that the team will focus efforts towards an FCL.
- Providing updates at the SaaS Availability Weekly Standup.
If the team believes there does not need to be an FCL, approval must be obtained from either the VP of Infrastructure or VP of Development.
Direct reports involved in an active borrow should be included if they were involved in the authorship or review of the change.
The purpose is to foster a sense of ownership and accountability amongst our teams, but this should not challenge our no-blame culture.
Timeline
Rough guidance on timeline is provided here to set expectations and urgency for an FCL. We want to balance moving urgently with doing thoughtful important work to improve reliability. Note that as times shift we can adjust accordingly. The DRI of an FCL should pull in the timeline where possible.
The following bulleted list provides a suggested timeline from the incident to the completion of the FCL. “Business day x” here refers to the xth business day after the incident.
- Day 0: incident occurs.
- Business day 1: the relevant Engineering Director collaborates with the VP of Development and/or the VP of Infrastructure, or their designee, to establish whether an FCL is required.
- Business day 2: confirmation that an FCL is required for this incident; planning starts.
- Business days 3-4: planning time.
- Business days 5-9 (1 week): complete planned work.
- Business days 10-11: closing ceremony, retrospective, and report back to the standup.
Activities
During the FCL, the team’s (or teams’) exclusive focus is reliability work, and any in-flight feature work has to be paused or reassigned. Maintainer duties can still be performed during this period and should keep other teams moving forward. Explicitly higher-priority work, such as security and data loss prevention, should continue as well. The team(s) must:
- Create a public Slack channel called #fcl-incident-[number], with the following members:
  - The team’s manager
  - The author and their teammates
  - The Product Manager, the stage’s Product leader, and the section’s Product leader
  - All reviewer(s)
  - All maintainer(s)
  - The Infrastructure stable counterpart
  - The chain of command from the manager to the VP (Sr. Manager, Sr. Director / Director, VP, etc.)
- Create an FCL issue in the FCL Project with the information below in the description:
  - Name the issue: [Group Name] FCL for Incident ####
  - Links to the incident, original change, and Slack channel
  - FCL Timeline
  - List of work items
- Complete the written Incident Review documentation within the Incident Issue as the first priority after the incident is resolved. The Incident Review must include completing all fields in the Incident Review section of the incident issue (see incident issue template). The incident issue should serve as the single source of truth for this information, unless a linked confidential issue is required. Completing it should create a common understanding of the problem space and set a shared direction for the work that needs to be completed.
- Check not only that all procedures were followed, but also whether improvements to those procedures could have prevented the incident.
- A work plan referencing all the Issues, Epics, and/or involved MRs must be created and used to identify the scope of work for the FCL. The work plan itself should be an Issue or Epic.
- Daily: add an update comment in your FCL issue or epic using the template:
  - Exec-level summary
  - Target End Date
  - Highlights/lowlights
- Add an agenda item in the SaaS Availability weekly standup and summarize status each week that the FCL remains open.
- Hold a synchronous closing ceremony upon completing the FCL to review the retrospectives and celebrate the learnings.
  - All FCL stakeholders and participants shall attend or participate async. Managers of the groups participating in the FCL, including Sr. EMs and Directors, should be invited.
  - The agenda includes reviewing FCL retrospective notes and sharing learnings about improving code change quality and reducing risk to availability.
  - Outcomes include handbook and GitLab Docs updates where applicable.
Scope of work during FCL
After the Incident Review is completed, the team’s (or teams’) focus is on preventing similar problems from recurring and on improving detection. This should include, but is not limited to:
- Address immediate corrective actions to prevent incident reoccurrence in the short term
- Introduce changes to reduce incident detection time (improve collected metrics, service level monitoring, and visibility into which users are impacted)
- Introduce changes to reduce mitigation time (improve rollout process through feature flags, and clean rollbacks)
- Ensure that the incident is reproducible in environments outside of production (Detect issues in staging, increase end-to-end integration test coverage)
- Improve development test coverage to detect problems (Harden unit testing, make it simpler to detect problems during reviews)
- Create issues with general process improvements or asks for other teams
Examples of this work include, but are not limited to:
- Fixing items from the Incident Review which are identified as causal or contributing to the incident.
- Improving observability
- Improving unit test coverage
- Adding integration tests
- Improving service level monitoring
- Improving symmetry of pre-production environments
- Improving the GitLab Performance Tool
- Adding mock data to tests or environments
- Making process improvements
- Populating their backlog with further reliability work
- Security work
- Improving communication and workflows with other teams or counterparts
Any work for the specific team kicked off during this period must be completed, even if it takes longer than the duration of the FCL. Any work directly related to the incident should be kicked off and completed even if the FCL is over. Work paused due to the FCL should be the priority to resume after the FCL is over. Items created for other teams or on a global level don’t affect the end of the FCL.
A stable counterpart from Infrastructure will be available to review and consult on the work plan for Development Department FCLs. Infrastructure FCLs will be evaluated by an Infrastructure Director.
Engineering Performance Indicator process
The Quality Department is the DRI for Engineering Performance Indicators. Work regarding KPI / RPI is tracked on the engineering metrics board and task process.
Manual verification
We manually verify that our code works as expected. Automated test coverage is essential, but manual verification provides a higher level of confidence that features behave as intended and bugs are fixed.
We manually verify issues when they are in the workflow::verification state.
Generally, after you have manually verified something, you can close the associated issue.
See the Product Development Flow to learn more about this issue state.
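As a rough, non-authoritative sketch, the issues currently awaiting manual verification can be listed with the same REST API label filter used elsewhere on this page; the project path and the GITLAB_TOKEN environment variable are assumptions for illustration.

```python
import os
import requests

# List open issues sitting in the workflow::verification state for a project.
# Assumes a personal access token with read_api scope in GITLAB_TOKEN.
API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org%2Fgitlab"  # URL-encoded project path; adjust as needed

resp = requests.get(
    f"{API}/projects/{PROJECT}/issues",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"state": "opened", "labels": "workflow::verification", "per_page": 20},
    timeout=30,
)
resp.raise_for_status()

for issue in resp.json():
    print(f"#{issue['iid']}: {issue['title']} ({issue['web_url']})")
```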
We manually verify in the staging environment whenever possible. In certain cases we may need to manually verify in the production environment.
If you need to test features that are built for GitLab Ultimate, you can get added to the issue-reproduce group on the production and staging environments by asking in the #development Slack channel. These groups are on an Ultimate plan.
Critical Customer Escalations
We follow the process below when an existing critical customer escalation requires immediate scheduling of bug fixes or development effort.
Requirements for critical escalation
- Customer is in critical escalation state
- The issues escalated have critical business impact to the customer, as determined by Customer Success and Support Engineering leadership
- Failure to expedite scheduling may have cascading business impact to GitLab
- Approval from a VP from Customer Success AND a Director of Support Engineering is required to expedite scheduling
- Customer Success: approval from either Sherrod Patching or David Sakamoto
- Support Engineering: approval from either Lee Matos or Lyle Kozloff or Shaun McCann or Val Parsons
Process
- The issue priority is set to ~"priority::1" regardless of severity
- The ~"critical-customer-escalation" label is applied to the issue (see the sketch after this list)
- The issue is scheduled within 1 business day
- For issues of type feature, approval from the Product DRI is needed
- The DRI or their delegate provides daily updates in the escalated customer Slack channel
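The labeling step is normally done in the GitLab UI, but as a hedged sketch it can also be scripted through the issue edit API using add_labels; the project path, issue IID, and GITLAB_TOKEN below are placeholders.

```python
import os
import requests

# Apply the escalation labels to an existing issue via the issue edit API.
# PROJECT and ISSUE_IID are placeholders; GITLAB_TOKEN needs api scope.
API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org%2Fgitlab"  # URL-encoded project path
ISSUE_IID = 12345                # hypothetical issue IID

resp = requests.put(
    f"{API}/projects/{PROJECT}/issues/{ISSUE_IID}",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    data={"add_labels": "priority::1,critical-customer-escalation"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["labels"])  # confirm both labels are now present
```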
DRI
- If the issue is of type bug, the DRI is the Director of Development
- If the issue is of type feature, the DRI is the Director of Product
- If the issue requires Infrastructure work, the DRI is the Engineering Manager in Infrastructure
The DRI can use the customer critical merge requests process to expedite code review & merge.
Pairing Engineers on priority::1/severity::1 Issues
In most cases, a single engineer and a maintainer review are adequate to handle a priority::1/severity::1 issue. However, some issues are highly difficult or complicated. Engineers should treat these issues with a high sense of urgency. For a complicated priority::1/severity::1 issue, multiple engineers should be assigned based on the level of complexity. The issue description should list each team member and their responsibilities.
| Team Member | Responsibility |
|---|---|
| Team Member 1 | Reproduce the problem |
| Team Member 2 | Audit the code base for other places where this may occur |
If we have cases where three, five, or X people are needed, Engineering Managers should feel the freedom to execute on a plan quickly.
Following this procedure will:
- Decrease the time it takes to resolve priority::1/severity::1 issues
- Allow for a smooth handover of the issue in case of OOO or End of the Work Day
- Provide support for Engineers if they are stuck on a problem
- Provide another set of eyes on highly urgent topics or security-related fixes
Canary Testing
Information on canary testing has been moved to a dedicated page covering the canary stage and how to use it.
Engineering private handbook
There are some engineering handbook topics that we cannot be publicly transparent about. These topics can be viewed by GitLab team members in the engineering section of the private handbook.
If you experience a page not found (404) error when attempting to access the internal handbook, you may need to register to use it by first browsing to the internal handbook authorization page.