Operations: The 2AM Debugging Toolkit
This guide is part of a collection of findings gained from community meetups. For more real world solutions and insights you can browse by specific problem. Suffering from high operational costs? Check out the Cost Optimization community insights guide.
When Everything is Broken: Emergency Diagnostics
Community philosophy: "If something looks odd, even just slightly, something is wrong. Investigate before it gets worse."
EMERGENCY: Production Incident Queries (Copy-Paste Ready)
When your ClickHouse is down at 2AM, run these in order:
Learning: Incident Pattern Recognition
Understand the failure modes with working examples:
Memory Exhaustion Detection
Bad Data Detection
The 2AM Methodology
Follow this exact sequence when everything is broken:
Phase 1: Immediate Triage (30 seconds)
- Run
system.errors
- any non-zero = active incident - Check disk space - "It was as simple as low disk it took us from 12 to 4 AM"
- Look for replication lag > 5 minutes
Phase 2: Resource Investigation (2 minutes)
- Find memory-hungry queries in
system.processes
- Check for stuck merges running >1 hour
- Kill obviously problematic queries
Phase 3: Data Quality Check (5 minutes)
- Look for bad partitions (1998, 2050 dates)
- Check for parts explosion (>1000 parts per table)
Emergency Actions Reference
Production-tested solutions:
Problem | Detection Query | Solution |
---|---|---|
Memory OOM | SELECT * FROM system.processes WHERE memory_usage > 8GB | "Enable external aggregation-it will be a little bit slower...but it will use much less memory" |
Disk Full | SELECT sum(bytes_on_disk) FROM system.parts | Delete old partitions, expand disk |
Replication Lag | SELECT * FROM system.replicas WHERE absolute_delay > 300 | Check network, restart lagging replica |
Stuck Query | SELECT * FROM system.processes WHERE elapsed > 300 | KILL QUERY WHERE query_id = '...' |
Parts Explosion | SELECT count() FROM system.parts WHERE active=1 | Enable async_insert, increase batch sizes |
The golden rule: "Problems very rarely just pop out of nowhere there are signs... investigate it before it goes from 15 milliseconds to 30 seconds"
Video Sources
- 10 Lessons from Operating ClickHouse - Source of the disk space, memory, and bad data lessons from production operations