According to the Nielsen Norman Group, usability or user experience benchmarking is "the process of evaluating a product or service's user experience by using metrics to gauge its relative performance against a meaningful standard." While benchmarking can require a good amount of time and effort, few alternative research methods provide both the same volume of data and the same granularity of insights. This page describes the processes and considerations that we put into usability benchmarking at GitLab.
Usability benchmarking takes a collection of related workflows, breaks them down into discrete tasks, and measures how usable they are across several dimensions. What this generates is a rich body of quantitative and qualitative data that highlights specific pain points and areas for improvement. These pain points can then be turned into actionable insights which will improve the overall usability of a product. By performing a usability benchmarking study, you give your team validated, granular recommendations for making your product better.
Benchmarking can be a useful method for generating longitudinal, quantitative comparisons (for example, tracking the time it takes to complete a task before and after a significant change to the user interface). Because of the benefits of running benchmarking studies several times, it can be useful to think of benchmarking as a research program, as opposed to a one-off activity. This is how we tend to think of it at GitLab.
Benchmarking is a thorough and time-intensive undertaking that requires a high degree of rigor from the research lead. This section lists some guidelines and best practices to ensure the quality and uniformity of our benchmarking efforts, while also reducing 'start-up costs' for new benchmarking efforts.
Every benchmarking study has a similar 'skeleton' - the basic elements one needs in order to successfully run the study. Typically, these are:
When leading a benchmarking study, you will need to coordinate with several stakeholders including the Product Manager (PM) and Product Designer (PD) to ensure that everyone agrees upon the study protocol, tasks, metrics, and timeline.
The early planning stages are crucial for setting the rest of your study up for success. In order to maximize the first few weeks of study planning, communicate with your stakeholders on how they would prefer to provide feedback. If you are getting feedback async, make sure to provide firm deadlines that are non-negotiable. In cases where those deadlines are not being met, try pivoting to sync communication such as a scheduled meeting to help stay on the research timeline.
Each protocol is tailored to the topic and specifics of the study it belongs to, but often contains similar sections. Generally, you will want to cover the following (many of these are common to all moderated studies):
The Usability Benchmarking template contains some boilerplate language to work off of for your own study.
A few guidelines when selecting tasks for your study:
Here is an example of a single task as written in the task list:
|Cut off time||5:00|
|Description||Make this branch a protected branch, where devs and maintainers can merge, push, and force push.|
|Notes||Must be 'maintainer' role or higher. Will need to find project settings. Push rules are also allowed in GitLab Enterprise Edition.|
|Related JTBD||When product improvements are identified, I want to propose changes that address them, so that I can help build a better product.|
|Happy Path||Left nav, settings -> repository. Find protected branches section, click expand. Under protect a branch section, use dropdown to select 'newbranch'. In 'allowed to merge' and 'allowed to push' dropdowns, select 'Developers + Maintainers'. Toggle 'allowed to force push' to on. Click 'protect' button.|
|Completion Criteria||• Correct branch is protected • Allowed to merge checked • Allowed to push checked • Force push toggled • Protect button clicked|
There isn't a hard and fast rule when it comes to assigning a cut-off time for your tasks. There are a few ways you might do this:
One way to assign weights is to count the steps in the completion criteria and use that number. In the example above, the completion criteria list five steps, so the assigned weight is five. This treats weight as a proxy for task complexity (which may or may not be appropriate for your study). The weight is then typically used as a multiplier when reporting summative metrics.
As a simple example:
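A minimal sketch of using weights as multipliers, assuming the summative metric is a weighted mean of per-task completion rates. The function name `weighted_mean` and the sample rates and weights are hypothetical and only illustrate the arithmetic; they are not taken from a GitLab study.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of per-task values."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: completion rate per task, and a weight per task
# equal to the number of steps in that task's completion criteria.
completion_rates = [0.8, 0.5, 1.0]
weights = [5, 3, 2]

overall = weighted_mean(completion_rates, weights)
print(f"Weighted completion rate: {overall:.0%}")
```

With these numbers, the more complex first task pulls the overall figure toward its own completion rate more than the simpler tasks do.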
The following metrics and definitions are the core of how GitLab performs benchmarking.
|Completion Rate||% of users who successfully complete the task||Measured per participant, reported per task||Completed = 1, incomplete = 0. Then sum and divide by total participants. Note: completion criteria for each task should be carefully enumerated with your stakeholders prior to running your study.||4 completions out of 5 participants for task 3 results in 80% completion rate|
|Time on task||Time spent from start to finish of task||Measured per participant, reported per task||Start timer on verbal cue, stop timer when participant signals that they have completed the task or have no path forward||Avg. time to completion for task 9 = 2:11|
|CES (Customer Effort Score)||Qualitative measure of perceived effort (1-7, 1 = extreme effort, 7 = effortless)||Measured per task, reported as average||At the end of the task, ask the participant, "On a scale of 1-7, where 1 is extremely difficult and 7 is extremely easy, how easy was it for you to complete this task?"||Avg. CES for task 13 = 5.9|
|Error Type Count||Number of different types of errors or mistakes made during task completion||Measured per task, reported as average (mean)||Errors need to be defined alongside the 'happy' or optimal path the user should take||2.6 avg. errors for task X|
|Error Rate||The number of different types of errors observed over the number of steps in the task||Per task||Take the number of observed types of errors and divide by the number of steps or actions in that task.||Task A has 5 steps. There are 10 participants in the study. Our total steps (denominator) is therefore 50. The numerator is the observed errors across all participants for task A. Suppose there are 20 errors recorded for task A. Error rate is thus 20/50, or 40%|
|Severity||Judged severity of the problem||Per task, overall||See this handbook page for details||Critical|
|Grade||A cumulative letter grade portraying the usability of the task overall||per task, overall||see 'Per task grade calculation section below'||C|
|UMUX lite||Canonically, UMUX lite is a 2-question survey that measures perceived usefulness and usability of a system or product. For benchmarking at GitLab, we tend to use it to measure usability against the specific JTBD in our study.||Collected once per JTBD at end of session.||1 question, on a 7-point Likert scale from strongly disagree to strongly agree.||On a scale of 1-7, where 1 is strongly disagree and 7 is strongly agree, how much do you agree with the statement, "This system helps me perform (insert description of JTBD here)"|
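The per-task arithmetic in the table can be sketched as follows. This is only an illustration of the definitions above: the function names are hypothetical, the completion and error figures reuse the examples from the table, and the task times are made-up values chosen to average 2:11.

```python
def completion_rate(outcomes):
    """outcomes: one entry per participant, 1 = completed, 0 = incomplete."""
    return sum(outcomes) / len(outcomes)


def error_rate(total_errors, steps_in_task, participants):
    """Observed errors divided by total steps (steps per task x participants)."""
    return total_errors / (steps_in_task * participants)


def format_mean_time(seconds_per_participant):
    """Average time on task, formatted as m:ss."""
    mean_seconds = round(sum(seconds_per_participant) / len(seconds_per_participant))
    minutes, seconds = divmod(mean_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# 4 completions out of 5 participants
print(f"Completion rate: {completion_rate([1, 1, 1, 1, 0]):.0%}")

# 20 errors observed, 5 steps in the task, 10 participants
print(f"Error rate: {error_rate(20, 5, 10):.0%}")

# Hypothetical task times in seconds
print(f"Avg. time on task: {format_mean_time([120, 142, 131])}")
```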
For completion and scoring:
For per-task metrics:
In each session, you will record (per task) the severity number that most closely represents that user's experience as defined on this handbook page. This methodology is similar to the widely known Nielsen/Norman system, but inverted (in our system, lower numbers indicate greater severity).
For each incomplete task, rate the severity as 1. For a very painful completion, rate the severity as 2. For a mildly painful completion, rate the severity as 3 (and so on). If the user doesn't encounter any usability issues, rate the severity as 5.
Average the severity score for each task, and assign the overall severity according to the following scores:
|0.0 - 1.9||Severity 1: Blocker|
|2.0 - 2.9||Severity 2: Critical|
|3.0 - 3.9||Severity 3: Major|
|4.0 - 5.0||Severity 4: Low|
For example, pretend you have 10 participants in a study. For task A, the severity scores on the individual runs are 1, 4, 1, 2, 2, 3, 4, 5, 1, 3, which sums to 26. Divide by 10, and you'll get the average of 2.6 and a severity assignment of 2: Critical for that task.
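The averaging and lookup described above can be sketched in a few lines. The function name `overall_severity` is hypothetical, and the sketch assumes the score bands are half-open intervals (for instance, an average of 1.95 falls in the first band), since the table leaves the values between bands unspecified.

```python
def overall_severity(scores):
    """Average per-session severity scores (1-5) and map to an overall severity."""
    average = sum(scores) / len(scores)
    if average < 2.0:
        label = "Severity 1: Blocker"
    elif average < 3.0:
        label = "Severity 2: Critical"
    elif average < 4.0:
        label = "Severity 3: Major"
    else:
        label = "Severity 4: Low"
    return average, label


# Scores for task A from the 10-participant example above
task_a_scores = [1, 4, 1, 2, 2, 3, 4, 5, 1, 3]
average, label = overall_severity(task_a_scores)
print(f"Average: {average:.1f} -> {label}")
```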
For each task, calculate a final grade using the criteria below. Note: This is an example. There are other potential ways to grade your study that might make more sense for your team or situation.
Here's our above example in table form:
|Number of completions (out of total # of participants)||-||15 (of 20)|
|Avg. CES for task (of 7)||add to completions||avg CES = 5, running score = 20|
|Avg. error count||subtract from total||avg. errors = 2, running score = 18|
|Running total||Divide by total possible||18 / 27 = 0.67|
|Decimal result||multiply by 100||67|
|Integer result||map to grading scale (letter grades in this case)||67 = D|
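The same calculation, written as a short sketch. Two things here are assumptions rather than part of the documented method: the "total possible" of 27 is read as the number of participants plus the maximum CES (20 + 7), and the letter scale (90 and up = A, 80s = B, 70s = C, 60s = D, below 60 = F) is a conventional scale that happens to agree with the 67 = D result in the table. The function name `task_grade` is hypothetical.

```python
def task_grade(completions, participants, avg_ces, avg_errors, ces_max=7):
    """Return (integer score, letter grade) for one task."""
    running_total = completions + avg_ces - avg_errors
    total_possible = participants + ces_max  # assumed: 20 + 7 = 27 in the example
    score = round(running_total / total_possible * 100)

    # Assumed letter scale; substitute whatever scale your team agrees on.
    if score >= 90:
        letter = "A"
    elif score >= 80:
        letter = "B"
    elif score >= 70:
        letter = "C"
    elif score >= 60:
        letter = "D"
    else:
        letter = "F"
    return score, letter


score, letter = task_grade(completions=15, participants=20, avg_ces=5, avg_errors=2)
print(f"{score} = {letter}")
```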
Preparing and conducting a benchmarking study takes time. Below is a sample timeline for starting a typical benchmarking program and running the first study.
|Clarify your goals and conduct background research.||Be clear on the why, what, who, when, and how you are going to approach the benchmarking.||1 week|
|Begin issue||Open a research issue and fill it out.||1 day|
|Conduct kickoff||Include your stakeholders in helping to refine the scope and direction of the study (focus areas, important metrics, personas, and so on)||1 week|
|Plan: Overview||Begin your study plan, record the context, reasoning, metrics, personas, and outcomes for your study.||1 week|
|Plan: Protocol||Write your introduction (exactly what you are going to say), opening questions, and the general flow for your study.||1 week|
|Plan: Tasks||Enumerate the exact tasks you will measure, the happy path, completion criteria, and weight for each task.||2 weeks|
|Plan: Test environment||Set up your test environment with any projects and sample data you will use for testing||2 weeks|
|Recruitment||Open a research recruitment issue at least a month prior to when you wish to run your sessions. A typical benchmarking study uses about 20 participants.||Opening ticket: 1 day. Recruitment itself: 1 month|
|Run Pilot(s)||The week before your sessions, run 1 or 2 pilot sessions to perfect your protocol and tasks.||1-2 days|
|Run Sessions||Benchmarking sessions typically last from 90 minutes to two hours, so for 20 participants at two sessions per day, you are looking at a solid two to four weeks of conducting sessions. Note that you will need to invite more participants than necessary to fill 20 sessions, since not everyone who qualifies will accept the research invite. In order to maximize participant attendance and avoid late cancelations, send reminder emails within 24 hours of each session.||2-4 weeks|
|Analyze the results.||Calculate metrics, extract recommendations, pull verbatim quotes, put things into Dovetail, and so on.||2 weeks|
|Prepare the report and share it.||Produce research report, slides, recordings, and so on to disseminate your findings.||2 weeks|
You should plan for a full quarter from start to finish for your first benchmarking study, but the time commitment will vary week to week. Benchmarking should not be the only research activity you plan during this time, with the exception of weeks when you are conducting sessions. Note: This timeline may be significantly reduced in subsequent benchmarking studies for the same tasks and personas.
A few resources to help reduce the start-up cost for a new benchmarking effort:
Q: How is benchmarking different from the Category Maturity Scorecard or UX Scorecard research?
A: To borrow an analogy, you might think of CMS/UX scorecards as looking at usability through a magnifying glass. Usability benchmarking is like looking at usability through a microscope. Benchmarking is more time intensive. It is not meant to be lightweight. You will speak with more participants (~20) for a longer period of time (~120 minutes), and you will run through more tasks (~25) that might cover more than one JTBD. Benchmarking reports many of the same metrics (and more!), and those metrics are more likely to be representative due to greater N. At GitLab, we report metrics using industry-standard statistics (confidence intervals at 95%, adjusted Wald calculation for completion rate, and so on). In this sense, you can use benchmarking in concert with CMS or UX scorecards to validate and update those findings. Benchmarking is also meant to be a program, rather than a one-off study. Benchmarking generates very granular recommendations for usability improvements, and once enough of those recommendations have been implemented, you can run the benchmarking again and start to see trends over time.
Q: Does conducting a usability benchmarking study mean that I do not need to run a UX Scorecard or CM Scorecard study?
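For readers unfamiliar with the adjusted Wald interval mentioned in this answer, here is a minimal sketch of its commonly described form (adding z²/2 successes and z² trials before computing a Wald interval). The function name `adjusted_wald_interval` and the 15-of-20 input are illustrative assumptions; the exact calculation used in GitLab reports may differ.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_wald_interval(successes, trials, confidence=0.95):
    """Adjusted Wald confidence interval for a completion rate."""
    # Two-sided critical value, roughly 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    adjusted_trials = trials + z ** 2
    adjusted_rate = (successes + z ** 2 / 2) / adjusted_trials
    margin = z * sqrt(adjusted_rate * (1 - adjusted_rate) / adjusted_trials)
    # Clamp to the valid range for a proportion
    return max(0.0, adjusted_rate - margin), min(1.0, adjusted_rate + margin)


low, high = adjusted_wald_interval(successes=15, trials=20)
print(f"Completion rate 75%, 95% CI: {low:.0%} to {high:.0%}")
```

The interval is wide at N = 20, which is one reason trends across repeated studies tend to be more informative than any single completion rate.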
A: No, you should still conduct UX Scorecard and CM Scorecard studies in addition to usability benchmarking studies. All three types of studies are valuable for understanding the users' experience with the product.
Q: How should I view this score vs the UX Scorecard score vs the CM Scorecard score?
A: Since all three (UX scorecard, CMS, and benchmarking) provide an overall letter grade, it is ideal to see all of the grades agree when they are focused on the same task or JTBD.
Q: How often should I conduct usability benchmarking?
A: This is variable based on need and how quickly the recommendations from a previous benchmarking study are implemented. But, generally, there is no reason to conduct benchmarking more than once or twice a year for the same tasks.
Q: Does usability benchmarking have to be conducted on the most current release?
A: Yes. Benchmarking is far too heavy-handed to perform for solution validation of upcoming features, and while you could perform benchmarking on a previous release, the results you gather may already be invalid when you collect them. Given the time commitment, this is highly discouraged.
Q: What GitLab environment should the usability benchmarking be tested on?
A: The UX Researcher on the project can set up a cloud instance of GitLab and create sample data in a project by following the instructions on the UX Cloud Sandbox page. Make sure there is enough sample data to complete all tasks in the benchmarking study when you run your pilot study. You can also ask for help on the #ux-cloud-sandbox Slack channel.
Q: How complex/realistic does my testing environment need to be?
A: It depends. Basically you need to look at your task list and go through them, ensuring that every area of the project that you go to (or might reasonably expect participants to go) has something there. Ideally, there's more than 1 item there (file, merge request, comment, environment, deployment, etc.) so that the task time accounts for scanning and searching the desired path. Depending on the areas you're focused on, this may be a lot of sample data, or not so much.
Q: Where does this fit in the product lifecycle?
A: Benchmarking lives at a space bridging the 'improve' stage of the build track and the 'collect ideas/understand the problem' parts of the validation track. Benchmarking is a great way to see how the current product is performing, so it fits well with the 'measure, learn, and improve' part of the cycle. The 'learn' and 'improve' parts are then taken as inputs into the 'collect ideas/understand the problem' parts of the validation track. So, in benchmarking, we're measuring our current product, validating existing problems we know of, and generating ideas on how to address these problems or generally improve the product. Note: Benchmarking is somewhat outside of our normal workflow. Since it is so time intensive, it's not something that could or should be a part of every feature or issue, and it's not something we should conduct more than once or twice a year for the same tasks.
Q: For efficiency, can I align a benchmarking study with an upcoming Category Maturity Scorecard study?
A: Yes, this is a great use of benchmarking! You can easily modify your benchmarking study to allow for CMS results. For example, you might run a small number of participants through the CMS protocol in combination with the benchmarking tasks. Because these are both based off of JTBD, the tasks ought to align very well.
Q: What score are we trying to achieve?
A: Rather than aiming for a specific score off the bat, what you really want to see with benchmarking is improvement over time (from one benchmarking study to the next). That being said, benchmarking is metrics heavy, and there are a lot of 'scores' to sift through. Your goal is to get a 10,000-foot view at a glance, with overall grades for each task and each JTBD (A-F, severity 1-5), and also to provide detailed metrics for those who are curious.
Q: Is the scoring dependent on the level of maturity of that category/JTBD?
A: As it stands, no. You should consider maturity during the analysis and when setting expectations, but benchmarking just measures the current experience, however mature it is at the time.
Q: How do designers process and manage the recommendations?
A: For benchmarking, the UX Researcher is responsible for much of the 'processing'. The UX Research team has an FY23 objective to better utilize actionable insights. One of the main outputs of benchmarking is a set of recommendations that will then go through the process of turning into actionable insights or informative insights. Part of this process involves categorizing actionable insights into either 'exploration needed' or 'product change' categories, each of which becomes an issue with this label.
The UX Researcher is responsible for sharing the list of recommendations, along with all other findings from the benchmarking with all relevant stakeholders (specifically Product Designers and Product Managers). The exact structure of this will likely vary between different groups and stages, but part of this process needs to involve the communication and handoff of all 'product change' actionable insights. From there, the team should work together to prioritize these issues and ensure they are completed (which leads to the next question).
Q: How can I prevent recommendations from going stale (not prioritized/implemented)?
A: This is the critical question. The research team has a KR around research prioritization for the actionable insights marked Actionable Insight::Exploration Needed (needing more research) that are generated from benchmarking. These issues go through a prioritization process for additional research. For those recommendations in the Actionable Insight::Product change category, and thus of more interest to Product Designers and Product Managers, there is a process where a severity label is applied. Generally, Product Designers will assign severity for 'product change' insights, and UX Researchers will assign severity labels for 'exploration needed' insights. For more on the process, refer to this handbook page.
Q: Does the GitLab benchmarking approach involve a competitive analysis?
A: No, not at this time. To use the analogy above, competitive usability evaluations are also like a microscope, in that UX Researchers look at specific tasks and fine-grained aspects of performance. We might use some of our benchmark tasks in a competitive usability evaluation to see how we measure up to our competitors, but that work is currently out of scope. We first want to introduce the benchmarking approach with our own product, before we look at performance across competitors. Additionally, the first iteration for a competitive usability evaluation at GitLab might be to use them as a magnifying glass for looking at what we can learn from our competition before we put those tasks under the microscope.
Q: How can my team request a benchmarking study?
A: Speak with the UX Researcher on your team to get the process started!