Skip to main content
Skip to main content

Operations: The 2AM Debugging Toolkit

This guide is part of a collection of findings gained from community meetups. For more real world solutions and insights you can browse by specific problem. Suffering from high operational costs? Check out the Cost Optimization community insights guide.

When Everything is Broken: Emergency Diagnostics

Community philosophy: "If something looks odd, even just slightly, something is wrong. Investigate before it gets worse."

EMERGENCY: Production Incident Queries (Copy-Paste Ready)

When your ClickHouse is down at 2AM, run these in order:

Learning: Incident Pattern Recognition

Understand the failure modes with working examples:

Memory Exhaustion Detection

Bad Data Detection

The 2AM Methodology

Follow this exact sequence when everything is broken:

Phase 1: Immediate Triage (30 seconds)

  1. Run system.errors - any non-zero = active incident
  2. Check disk space - "It was as simple as low disk it took us from 12 to 4 AM"
  3. Look for replication lag > 5 minutes

Phase 2: Resource Investigation (2 minutes)

  1. Find memory-hungry queries in system.processes
  2. Check for stuck merges running >1 hour
  3. Kill obviously problematic queries

Phase 3: Data Quality Check (5 minutes)

  1. Look for bad partitions (1998, 2050 dates)
  2. Check for parts explosion (>1000 parts per table)

Emergency Actions Reference

Production-tested solutions:

ProblemDetection QuerySolution
Memory OOMSELECT * FROM system.processes WHERE memory_usage > 8GB"Enable external aggregation-it will be a little bit slower...but it will use much less memory"
Disk FullSELECT sum(bytes_on_disk) FROM system.partsDelete old partitions, expand disk
Replication LagSELECT * FROM system.replicas WHERE absolute_delay > 300Check network, restart lagging replica
Stuck QuerySELECT * FROM system.processes WHERE elapsed > 300KILL QUERY WHERE query_id = '...'
Parts ExplosionSELECT count() FROM system.parts WHERE active=1Enable async_insert, increase batch sizes

The golden rule: "Problems very rarely just pop out of nowhere there are signs... investigate it before it goes from 15 milliseconds to 30 seconds"

Video Sources