A Google Summer of Code project: creating a benchmarking framework for SAST
September 27, 2022

Our 2022 Google Summer of Code project helped to create a benchmarking framework for SAST.


In summer 2022, the Vulnerability Research team at GitLab launched the Google Summer of Code (GSoC) project: A benchmarking framework for SAST.

The goal of the project was to create a benchmarking framework that assesses the impact and quality of a security analyzer or configuration change before it reaches the production environment.



As a complete DevOps Platform, GitLab has a variety of integrated static application security testing (SAST) tools for different languages and frameworks. These tools help developers find vulnerabilities as early as possible in the software development lifecycle. They are constantly being updated, either by upgrading the underlying security analyzers or by applying configuration changes.

Since all the integrated SAST tools are very different in terms of implementation, and depend on different tech stacks, they are all wrapped in Docker images. The wrappers translate tool-native vulnerability reports to a generic, common report format which is made available by means of the gl-sast-report.json artifact. This generic report is GitLab's common interface between analyzers and the GitLab Rails backend.
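To make the wrapper idea concrete, here is a minimal sketch of translating a tool-native finding into a common report structure. The native finding format, the `to_common_report` function and the scanner name are hypothetical. The output field names are loosely modeled on the GitLab security report schema, but this is a simplified subset that is not guaranteed to validate against the real schema.

```python
import json

# Hypothetical tool-native findings; every real analyzer has its own format.
native_findings = [
    {"rule": "sql-injection", "cwe": 89, "path": "app/db.py", "line": 42,
     "msg": "User input reaches a SQL query"},
]


def to_common_report(findings, scanner_id="example_scanner"):
    """Translate tool-native findings into a simplified common report.

    Assumption: only a small subset of report fields is produced here.
    """
    vulnerabilities = []
    for finding in findings:
        vulnerabilities.append({
            "category": "sast",
            "name": finding["rule"],
            "message": finding["msg"],
            "scanner": {"id": scanner_id, "name": scanner_id},
            "location": {"file": finding["path"], "start_line": finding["line"]},
            "identifiers": [
                {"type": "cwe",
                 "name": f"CWE-{finding['cwe']}",
                 "value": str(finding["cwe"])},
            ],
        })
    return {"vulnerabilities": vulnerabilities}


report = to_common_report(native_findings)
print(json.dumps(report, indent=2))
```

Because every wrapper emits the same structure, downstream consumers only need to understand one format, regardless of which analyzer produced the findings.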

Benchmarking is important for assessing the efficacy of analyzers and enables data-driven decisions. For example, benchmarking is useful for QA testing (spotting regressions) and for research that tracks how the performance of GitLab security features progresses over time.

Google Summer of Code (GSoC)

Google Summer of Code (GSoC) is a 10-week program that enlists contributors to work on open source projects in collaboration with open source organizations. For GSoC 2022, GitLab offered four projects to GSoC contributors. The contributors completed each of the projects with guidance from GitLab team members, who mentored them and provided regular feedback and assistance when needed.

Terms & Notation

In this blog post, we use the terms/acronyms below to classify findings reported by security analyzers.

| Acronym | Meaning | Description |
|---------|---------|-------------|
| TP | True Positive | Analyzer correctly identifies a vulnerability. |
| FP | False Positive | Analyzer reports a vulnerability where none exists. |
| TN | True Negative | Analyzer correctly ignores a potential false positive. |
| FN | False Negative | Analyzer does not report a known vulnerability. |

For the figures in the blog post we use the following notation: processes are depicted as rounded boxes, whereas artifacts (e.g., files) are depicted as boxes; arrows denote an input/output (IO) relationship between the connected nodes.

flowchart TB;
subgraph legend[ Legend ]
   proc(Process) -->|IO relation| art[Artifact];
end


The authors of the paper How to Build a Benchmark distilled the desirable characteristics of a benchmark below:

  1. Relevance: How closely the benchmark behavior correlates to behaviors that are of interest to consumers of the results.
  2. Reproducibility: The ability to consistently produce similar results when the benchmark is run with the same test configuration.
  3. Fairness: Allowing different test configurations to compete on their merits without artificial limitations.
  4. Verifiability: Providing confidence that a benchmark result is accurate.
  5. Usability: Avoiding roadblocks for users to run the benchmark in their test environments.

There is currently no standard or de facto language-agnostic SAST benchmark that satisfies all the criteria mentioned above. Many benchmark suites focus on specific languages, ship with incomplete or missing ground truths, or are based on outdated technologies and/or frameworks. A ground truth or baseline is the set of findings a SAST tool is expected to detect.

The main objective of the GSoC project was to close this gap and start building a benchmarking framework that addresses all the desirable characteristics mentioned above in the following manner:

  1. Relevance: Include realistic applications (in terms of size, framework usage and customer demand).
  2. Reproducibility: Automate the whole benchmarking process in CI.
  3. Fairness: Make it easy to integrate new SAST tools by just tweaking the CI configuration and use the GitLab security report schema as a common standard.
  4. Verifiability: Assemble a baseline that includes all the relevant vulnerabilities and make it publicly available. The baseline is the north star that defines what vulnerabilities are actually included in a test application.
  5. Usability: Benchmark users can just integrate the benchmark as a downstream pipeline to their CI configuration.

A benchmarking framework for SAST

The benchmarking framework compares the efficacy of an analyzer against a known baseline. This is very useful for monitoring the efficacy of the analyzer that participates in the benchmarking. The baseline is the gold standard that serves as a compass to guide analyzer improvements.


To use the framework, the following requirements have to be met:

  1. The analyzer has to be dockerized.
  2. The analyzer has to produce a vulnerability report that adheres to the GitLab security report schema format, which serves as our generic intermediate representation to compare analyzer efficacy.
  3. The baseline expectations have to be provided in the GitLab security report schema format so that we can compare the analyzer output against them.

The framework is designed so that it can be easily integrated into the CI configuration of existing GitLab projects by means of a downstream pipeline. A downstream pipeline can be triggered in several ways: by source code changes applied to an analyzer, by configuration changes applied to an analyzer, or by a scheduled pipeline invocation. By using the pipeline, we can run the benchmarking framework continuously on the GitLab projects that host the source code of the integrated analyzers, immediately after code or configuration changes are applied.


The figure below depicts the benchmarking framework when comparing an analyzer against a baseline.

We assume that a baseline configuration is available. A baseline consists of a test application that includes vulnerabilities. These vulnerabilities are documented in an expectation file that adheres to the security report schema.

Note that we use the terms baseline and expectation interchangeably. As mentioned earlier, the benchmarking framework is essentially a GitLab pipeline that can be triggered downstream. The configured analyzer then takes the baseline app as input and generates a gl-sast-report.json file. The heart of the benchmarking framework is the compare step, which compares the baseline against the report generated by the analyzer, both of which adhere to the security report schema.

The compare step also computes the TP, FN and FP that have been reported by the analyzer and computes different metrics based on this information. The compare step is implemented in the evaluator tool.

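flowchart LR;

config[Configuration] --> bf;

subgraph Baseline
   bcollection[Baseline app];
   baseline[Expectation file];
end

subgraph bf [ Benchmarking Framework ]
   orig(Analyzer) --> sbx[gl-sast-report.json];
   sbx --> compare(Compare);
end

baseline --> compare;
compare --> breport[Benchmark report];
bcollection --> orig;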

Using the security report format as a common standard makes the benchmarking framework very versatile: the baseline could be provided by an automated process, by another analyzer, or manually, which happened to be the case in this GSoC project.
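Since the baseline and the analyzer output share one format, both can be reduced to comparable keys with the same code. The sketch below illustrates this under stated assumptions: the `fingerprints` helper is hypothetical, and matching findings on file, start line and CWE identifiers is our illustrative choice, not necessarily how the evaluator tool matches findings. The inline JSON documents are tiny stand-ins for an expectation file and a gl-sast-report.json file.

```python
import json


def fingerprints(report_text):
    """Reduce a report to a set of (file, start_line, CWE ids) tuples.

    Assumption: this matching key is illustrative only.
    """
    report = json.loads(report_text)
    keys = set()
    for vuln in report.get("vulnerabilities", []):
        location = vuln.get("location", {})
        cwes = tuple(sorted(
            ident["value"]
            for ident in vuln.get("identifiers", [])
            if ident.get("type") == "cwe"
        ))
        keys.add((location.get("file"), location.get("start_line"), cwes))
    return keys


# Stand-in for the expectation file of a baseline.
baseline_text = json.dumps({"vulnerabilities": [
    {"location": {"file": "app/db.py", "start_line": 42},
     "identifiers": [{"type": "cwe", "name": "CWE-89", "value": "89"}]},
]})

# Stand-in for an analyzer report; it carries an extra tool-specific identifier.
report_text = json.dumps({"vulnerabilities": [
    {"location": {"file": "app/db.py", "start_line": 42},
     "identifiers": [
         {"type": "cwe", "name": "CWE-89", "value": "89"},
         {"type": "example_rule_id", "name": "rule-1", "value": "rule-1"},
     ]},
]})

baseline_keys = fingerprints(baseline_text)
report_keys = fingerprints(report_text)
print(baseline_keys == report_keys)
```

Ignoring tool-specific identifiers in the key is a design choice in this sketch: it lets a manually written baseline match an analyzer finding even though the two sources describe the same vulnerability with different rule names.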


The main functionality of the evaluator tool is to compute the overlap/intersection and the difference between a baseline and a generated report in order to uncover true positives, false positives, and false negatives.

The relationship between TP, FP, FN, TN, baseline, and generated report can be seen in the table below; it includes three columns analyzer, baseline and classification. The column analyzer represents the findings included in the report generated by the analyzer; column baseline represents the findings included in the baseline; column classification denotes the verdict/classification that the evaluator tool attaches to the analyzer finding when performing the comparison. The X and - denote reported and non-reported findings, respectively.

| analyzer | baseline | classification |
|----------|----------|----------------|
| - | - | TN |
| - | X | FN |
| X | - | FP |
| X | X | TP |

The classification column in the table above shows that a TP is a vulnerability that exists in both the baseline and the generated report. Similarly, an FP is a vulnerability detected by an analyzer without a corresponding baseline entry, while an FN is a vulnerability that is present in the baseline but not detected by an analyzer. Note that TN is practically not relevant for our use case, since the analyzers we are looking at only report unsafe, vulnerable cases instead of safe, non-vulnerable cases.
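The classification described above maps directly onto set operations. The following sketch assumes that each finding has already been reduced to a hashable key; the `classify` function and the string keys are hypothetical and only serve as an illustration.

```python
def classify(baseline, reported):
    """Classify findings by comparing a report against a baseline."""
    return {
        "TP": baseline & reported,   # in the baseline and reported
        "FP": reported - baseline,   # reported, but not in the baseline
        "FN": baseline - reported,   # in the baseline, but not reported
    }


# Hypothetical finding keys, used only for illustration.
baseline = {"sqli@db.py:42", "xss@views.py:10", "path-traversal@files.py:7"}
reported = {"sqli@db.py:42", "xss@views.py:10", "weak-hash@auth.py:3"}

result = classify(baseline, reported)
for verdict in ("TP", "FP", "FN"):
    print(verdict, sorted(result[verdict]))
```

In this example, two findings are true positives, `weak-hash@auth.py:3` is a false positive, and `path-traversal@files.py:7` is a false negative. TN does not appear because neither set enumerates safe, non-vulnerable code.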

At the moment, the evaluator tool computes the metrics below:

  • Precision: P = TP / ( TP + FP )
  • Recall: R = TP / ( TP + FN )
  • F-Score: F = 2 * ( P * R ) / ( P + R )
  • Jaccard-Index: J = TP / ( TP + FP + FN )

A higher precision indicates that an analyzer is less noisy due to the low(er) number of FPs. Hence, a high precision leads to a reduction of auditing effort of irrelevant findings. A high recall represents an analyzer's detection capacity. F-Score is a combined measure so that precision and recall can be condensed to a single number. The Jaccard-Index is a single value to capture the similarity between analyzer and baseline.
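As a worked example, the sketch below computes the four metrics from raw counts. The function names and the sample counts are made up for illustration, and returning 0.0 when a denominator is zero is our assumption, not necessarily the behavior of the evaluator tool.

```python
def precision(tp, fp, fn):
    # P = TP / (TP + FP); 0.0 for an empty denominator (assumed convention)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fp, fn):
    # R = TP / (TP + FN)
    return tp / (tp + fn) if tp + fn else 0.0


def f_score(tp, fp, fn):
    # F = 2 * (P * R) / (P + R)
    p, r = precision(tp, fp, fn), recall(tp, fp, fn)
    return 2 * (p * r) / (p + r) if p + r else 0.0


def jaccard_index(tp, fp, fn):
    # J = TP / (TP + FP + FN)
    return tp / (tp + fp + fn) if tp + fp + fn else 0.0


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
for metric in (precision, recall, f_score, jaccard_index):
    print(metric.__name__, round(metric(tp, fp, fn), 3))
```

With these counts, precision is 0.8, recall is roughly 0.667, the F-Score is roughly 0.727 and the Jaccard-Index is roughly 0.571. The analyzer in this example is fairly precise but misses a third of the baseline vulnerabilities.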

The evaluator tool supports the addition of custom metrics via a simple call-back mechanism; this enables us to add support for more metrics in the future that help us gain new insights into the efficacy of our analyzers.
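One way such a call-back mechanism could look is sketched below. The actual interface of the evaluator tool is not shown in this post, so `register_metric`, `METRICS` and `evaluate` are hypothetical names. The false discovery rate stands in for a custom metric.

```python
from typing import Callable, Dict

# A metric call-back receives the TP, FP and FN counts and returns a number.
Metric = Callable[[int, int, int], float]

# Hypothetical registry that maps metric names to call-backs.
METRICS: Dict[str, Metric] = {}


def register_metric(name: str):
    """Decorator that registers a call-back under the given name."""
    def decorator(func: Metric) -> Metric:
        METRICS[name] = func
        return func
    return decorator


@register_metric("precision")
def precision(tp: int, fp: int, fn: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


# A custom metric added later without touching the evaluation loop.
@register_metric("false_discovery_rate")
def false_discovery_rate(tp: int, fp: int, fn: int) -> float:
    return fp / (tp + fp) if tp + fp else 0.0


def evaluate(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Run every registered metric on the same counts."""
    return {name: metric(tp, fp, fn) for name, metric in METRICS.items()}


print(evaluate(tp=8, fp=2, fn=4))
```

The benefit of this design is that the comparison logic stays untouched: a new metric only needs to accept the three counts and register itself.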

Framework Properties

In principle, the implemented benchmarking framework is language-agnostic: new analyzers and baselines can be plugged-in as long as they adhere to the security report schema.

Establishing baselines is laborious since it requires (cross-)validation, trying out attacks on the running baseline application and code auditing.

For the GSoC project, we established baselines for the applications below, covering Java (Spring) and Python (Flask), as they rank high among the most used languages and frameworks. For a benchmark application to have practical utility, it is important that the application itself is based on technology, including programming languages and frameworks, that is used in the industry.

For both of these applications, the baselines/expectations have been collected and verified, and are publicly available:

  • WebGoat. WebGoat is a deliberately insecure Web application used to teach security vulnerabilities. We chose this as baseline application because it is often used as a benchmark app in the Java world and it is based on Spring which is one of the most popular frameworks in the Java world.
  • vuln-flask-web-app. Like WebGoat, this application is deliberately insecure. vuln-flask-web-app covers both Python and Flask, one of the most popular web frameworks in the Python world.


This GSoC project was a first step towards building a FOSS benchmarking framework that helps the community to test their own tools and to build up a relevant suite of baselines covering various languages and frameworks. With the help of the community, we will continue adding more baselines to the benchmarking framework in the future to cover more languages and frameworks.

If you found the project interesting, you might want to check out the following repositories:

Cover image by Maxim Hopman on Unsplash
