```bash
# Inject 5s latency into 50% of scrape requests for 2 minutes
curl -X POST http://localhost:9091/inject/latency \
  -d '{"duration":"2m","percent":50,"delay":"5s"}'
```

If you run the Prometheus Operator, pair it with Chaos Mesh (a CNCF project) and a NetworkChaos experiment:
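A minimal sketch of such an experiment, assuming Chaos Mesh is already installed and your exporters live in a `monitoring` namespace with an `app: node-exporter` label (both are placeholders, adjust to your cluster):

```bash
# Sketch of a Chaos Mesh NetworkChaos experiment that delays traffic to scrape targets.
# Namespace "monitoring" and label "app: node-exporter" are placeholders for your setup.
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: scrape-latency
  namespace: monitoring
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - monitoring
    labelSelectors:
      app: node-exporter
  delay:
    latency: "5s"
  duration: "2m"
EOF
```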
Create a small proxy that intercepts /metrics endpoints:
Run this between Prometheus and your real exporters. Watch Prometheus log parse errors and mark the target as down, then verify your alerts fire correctly.
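One way to check the blast radius from the command line, assuming Prometheus is reachable at `localhost:9090` and `jq` is installed:

```bash
# List scrape targets that are currently unhealthy, with their last scrape error
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, error: .lastError}'

# Confirm the corresponding alerts actually moved to pending/firing
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'
```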
Despite its dramatic name, Prometheus Chaos Edition is not an official Prometheus release. It is a concept (and accompanying script/container) popularized by the Prometheus community and tools like kube-prometheus-stack chaos experiments.
```python
# malicious_exporter.py
from flask import Flask, Response
import random
import requests

UPSTREAM = "http://localhost:9101/metrics"  # the real exporter behind this proxy (placeholder)

app = Flask(__name__)
```
| Risk | Mitigation |
| --- | --- |
| PCE accidentally runs on production | Use namespace isolation and an explicit `--chaos.enabled=false` flag in prod. |
| Permanent data loss | Run against a replica Prometheus with `--storage.tsdb.retention.time=6h`. |
| Alert fatigue | Notify a separate “chaos channel” during experiments. |
| Control plane overload | Limit chaos duration (e.g., 5 minutes max). |
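For the data-loss mitigation, a throwaway replica can be as simple as the sketch below; the host port, config path, and container name are placeholders, only the official `prom/prometheus` image and the retention flag are fixed:

```bash
# Spin up a disposable Prometheus replica with short retention, used only for chaos runs
docker run -d --name prometheus-chaos-replica \
  -p 9094:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=6h
```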
Enter Prometheus Chaos Edition (PCE), a little-known, experimental tool designed to do the unthinkable: intentionally break your Prometheus deployment so you can fix it before a real disaster.
In short:

How to Run Prometheus Chaos Edition (Step-by-Step)
The result? A telemetry system that survives real network partitions, overloaded exporters, and misconfigured rules. And a team that actually knows how to debug their monitoring stack under pressure.
```python
def real_metrics():
    # Pass the genuine metrics through from the upstream exporter
    return requests.get(UPSTREAM, timeout=5).text

@app.route('/metrics')
def metrics():
    if random.random() < 0.2:  # 20% of the time, serve broken exposition output
        return "malformed_metric{ invalid syntax", 200
    return Response(real_metrics(), mimetype='text/plain')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9102)  # the port Prometheus scrapes instead of the real exporter
```
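To wire it in, start the proxy and retarget an existing scrape job at it instead of the real exporter; port 9102 and the job name here are just examples matching the sketch above:

```bash
# Install dependencies and start the chaos proxy
pip install flask requests
python malicious_exporter.py

# Then point a scrape job in prometheus.yml at the proxy, e.g.:
#   - job_name: "node"
#     static_configs:
#       - targets: ["localhost:9102"]   # previously the real exporter's address
# and reload Prometheus (requires it to be started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
```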
| Without PCE | With PCE |
| --- | --- |
| You assume Prometheus is always healthy. | You prove it can survive partial failures. |
| Alertmanager might be misconfigured for months. | You test silences, inhibitions, and receivers. |
| A slow scrape delays critical alerts. | You detect latency thresholds before they matter. |
| Grafana dashboards freeze, but no one notices. | You build fallback visualizations. |
Breaking Monitoring Before It Breaks You: A Hands-On Guide to Prometheus Chaos Edition
```bash
# Pull the chaos edition sidecar
docker pull quay.io/prometheuschaos/chaos-sidecar:latest

docker run -d --name prometheus-chaos \
  --network container:prometheus \
  quay.io/prometheuschaos/chaos-sidecar
```
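A quick sanity check that the sidecar came up and attached to the Prometheus container's network namespace (container names as used above):

```bash
# Confirm the sidecar is running and inspect its startup logs
docker ps --filter name=prometheus-chaos
docker logs --tail 20 prometheus-chaos

# It should report "container:<prometheus-id>", i.e. it shares Prometheus's network,
# so its injection API answers on the Prometheus container's localhost
docker inspect -f '{{.HostConfig.NetworkMode}}' prometheus-chaos
```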
We all love Prometheus. It scrapes metrics, fires alerts, and helps us sleep at night. But here’s a painful truth most engineers realize at 3 AM: Your monitoring system can fail, and you won’t know about it until the real outage happens.
Before we dive into code, let’s address the obvious question: Why would I voluntarily break my monitoring?