Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.
We want to let operators safely run randomized downtime scenarios in pre-production environments, starting with the smallest unit (a pod) and working up to the largest (a cluster). Once operators have built and configured a sound failover plan, they should be able to run the same downtime scenarios in production environments. Relevant metrics should be provided alongside incidents.
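The escalation and gating described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Scope` enum, the intermediate `NODE` level, the environment names, and the `may_run` helper are hypothetical and do not describe an existing GitLab feature or API.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Blast radius of a downtime scenario, smallest to largest."""
    POD = 1      # minimum unit named in the text
    NODE = 2     # assumed intermediate step, not named in the text
    CLUSTER = 3  # maximum unit named in the text


def escalation_order() -> list:
    """Return scopes in the order an operator would work through them."""
    return list(Scope)


def may_run(environment: str, has_failover_plan: bool) -> bool:
    """Decide whether a downtime scenario may run in an environment.

    Assumption: pre-production is always allowed, while production
    requires that a failover plan has been built and configured.
    """
    if environment == "pre-prod":
        return True
    if environment == "production":
        return has_failover_plan
    raise ValueError(f"unknown environment: {environment!r}")
```

In this sketch, `may_run("production", has_failover_plan=False)` is rejected, while any scenario in `"pre-prod"` is allowed regardless of the plan.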
Interested in joining the conversation for this category? Please join us in our public epic where we discuss this topic and can answer any questions you may have. Your contributions are more than welcome.
GitLab has used kube-monkey as part of testing our Helm charts. kube-monkey is an implementation of Netflix's Chaos Monkey for Kubernetes clusters. It randomly deletes Kubernetes (k8s) pods in the cluster, encouraging and validating the development of failure-resilient services.
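To make the idea of random pod deletion concrete, here is a minimal simulation of one round of victim selection. It is a sketch, not kube-monkey's actual implementation: the `Pod` type, the `chaos/enabled` opt-in label, the protected-namespace list, and the `delete_pod` callback are all hypothetical, and no real Kubernetes API is called.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)


def pick_victims(pods, kill_count, rng,
                 opt_in_label="chaos/enabled",         # hypothetical label
                 protected_namespaces=("kube-system",)):
    """Randomly choose up to kill_count pods that opted in to chaos testing."""
    eligible = [
        p for p in pods
        if p.labels.get(opt_in_label) == "true"
        and p.namespace not in protected_namespaces
    ]
    # Never ask for more victims than there are eligible pods.
    return rng.sample(eligible, min(kill_count, len(eligible)))


def run_round(pods, kill_count, rng, delete_pod):
    """Run one chaos round: select victims, then delete each via the callback.

    delete_pod is a stand-in for whatever actually removes a pod; in a real
    cluster it would wrap a Kubernetes API call.
    """
    victims = pick_victims(pods, kill_count, rng)
    for pod in victims:
        delete_pod(pod)
    return victims
```

Passing a seeded `random.Random` instance as `rng` keeps a round reproducible, which helps when correlating a deletion with the metrics recorded for the resulting incident.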
Gremlin provides a framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks.
Chaos Toolkit is a project whose mission is to provide a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs.