Dashboard Operations Playbook
This playbook gives practical, repeatable response flows for common incidents.
Daily Health Check (5 minutes)
- Open / (overview).
- Confirm total_workers is near expected baseline.
- Check failed/retry movement on /metrics.
- Review queue backlog on /queues.
- Scan /audit for unexpected destructive actions.
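The baseline comparison in the health check can be made mechanical. The following is a minimal sketch, assuming you have already read total_workers from the overview page; the function name, the 20% tolerance, and the sample numbers are illustrative assumptions, not part of the dashboard.

```python
def workers_near_baseline(total_workers: int, baseline: int, tolerance: float = 0.2) -> bool:
    """Return True when total_workers is within +/- tolerance of baseline.

    tolerance is a fraction of the baseline (0.2 means 20%). It is an
    assumed threshold; tune it to your own deployment.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(total_workers - baseline) <= tolerance * baseline


if __name__ == "__main__":
    # Hypothetical numbers: 12 workers expected, 10 currently reported.
    print(workers_near_baseline(total_workers=10, baseline=12))
```

A relative tolerance is used here so the same check works for small and large worker pools; an absolute margin may suit a fixed-size pool better.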
Incident: Queue Backlog Rising
flowchart TD
A["Waiting jobs rising"] --> B["Open /queues"]
B --> C["Identify impacted queue"]
C --> D["Open /workers and /queues/{name}/jobs?state=active"]
D --> E{"Workers unhealthy or active jobs stuck?"}
E -->|Yes| F["Pause queue to contain impact"]
F --> G["Fix worker/task root cause"]
G --> H["Resume queue"]
E -->|No| I["Scale workers via orchestrator"]
Checklist:
- Capture queue name and backlog change rate.
- Confirm worker heartbeat recency.
- Verify active-job payload pattern (single task type vs broad).
- Pause queue only when downstream impact is unacceptable.
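The first two checklist items reduce to simple arithmetic. Here is a small sketch, assuming you note the waiting-job count at two points in time and can read a worker's last heartbeat timestamp; the function names and the 30-second freshness threshold are hypothetical choices, not dashboard settings.

```python
from datetime import datetime, timedelta


def backlog_change_rate(waiting_before: int, waiting_after: int, elapsed_seconds: float) -> float:
    """Jobs per minute added to (positive) or drained from (negative) the backlog."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (waiting_after - waiting_before) * 60.0 / elapsed_seconds


def heartbeat_is_recent(
    last_heartbeat: datetime,
    now: datetime,
    max_age: timedelta = timedelta(seconds=30),  # assumed threshold
) -> bool:
    """Return True when the last heartbeat is no older than max_age."""
    return now - last_heartbeat <= max_age
```

For example, a backlog that grows from 100 to 160 waiting jobs over two minutes is rising at 30 jobs per minute, which is the figure worth recording in the incident notes.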
Incident: Failure Spike
- Open /queues/{name}/dlq.
- Sample failed payloads and trace common error shape.
- Open /audit?queue={name}&action=job.retry to review recent retry operations.
- Fix root cause in task or dependency.
- Retry a controlled subset first.
- Watch /metrics and /queues/{name}/jobs?state=failed for regression.
Suggested phased retry
- Phase 1: retry 5 to 10 representative jobs.
- Phase 2: retry next 10%.
- Phase 3: full retry once error rate stabilizes.
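The three phases can be planned up front from a list of failed job ids. This is a rough sketch with two assumptions: the caller orders the list so that representative jobs come first, and "next 10%" is read as 10% of the whole failed set (rounded up). The function name and parameters are illustrative only.

```python
from typing import List, Sequence, Tuple


def plan_retry_phases(
    job_ids: Sequence[str],
    phase1_size: int = 10,
    phase2_percent: int = 10,
) -> Tuple[List[str], List[str], List[str]]:
    """Split failed job ids into (phase1, phase2, phase3) batches."""
    ids = list(job_ids)
    phase1 = ids[:phase1_size]
    # Integer ceiling of phase2_percent of the whole set; avoids float rounding.
    phase2_count = (len(ids) * phase2_percent + 99) // 100
    phase2 = ids[phase1_size:phase1_size + phase2_count]
    phase3 = ids[phase1_size + phase2_count:]
    return phase1, phase2, phase3


if __name__ == "__main__":
    failed = [f"job-{n}" for n in range(200)]  # hypothetical ids
    p1, p2, p3 = plan_retry_phases(failed)
    print(len(p1), len(p2), len(p3))
```

Only move from one batch to the next after checking that the error rate has stabilized; the split itself does not enforce that gate.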
Incident: Misbehaving Job Payload
- Search in /queues/{name}/jobs using job_id, task, and q.
- Cancel if currently harmful.
- Remove if payload is poison and should never rerun.
- Document action and rationale in incident notes.
- Verify action exists in /audit.
Example URL:
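A search path of this shape can be assembled from the three fields named in the steps above. The helper below is a sketch only: the queue name and filter values are placeholders, and whether the dashboard accepts all three filters in one request is an assumption.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def job_search_path(
    queue: str,
    job_id: Optional[str] = None,
    task: Optional[str] = None,
    q: Optional[str] = None,
) -> str:
    """Build a /queues/{name}/jobs path with any non-empty search filters."""
    filters = {"job_id": job_id, "task": task, "q": q}
    # Drop filters that were not supplied (None or empty string).
    params = {key: value for key, value in filters.items() if value}
    path = f"/queues/{quote(queue, safe='')}/jobs"
    return f"{path}?{urlencode(params)}" if params else path


if __name__ == "__main__":
    # Placeholder queue and filter values, for illustration only.
    print(job_search_path("emails", task="send_welcome", q="timeout"))
```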
Incident: Suspicious Operator Activity
- Open /audit.
- Filter by status=failed and target queue.
- Search by actor id (q=<actor-id>) or action (action=queue.pause).
- Cross-check with deployment timeline and on-call logs.
Metrics Triage Pattern
Use /metrics with two time horizons:
- Short horizon (live SSE): immediate reactions after a retry/pause/resume action.
- Recent horizon (/metrics/history): trend direction and whether changes are stabilizing.
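Trend direction over the recent horizon can be estimated by comparing the older half of a series with the newer half. The sketch below assumes you have extracted failed-job counts from the history view into a plain list ordered oldest first; the response format of /metrics/history is not assumed here, and the function name and flat_band parameter are hypothetical.

```python
from statistics import mean
from typing import Sequence


def trend_direction(failed_counts: Sequence[float], flat_band: float = 0.0) -> str:
    """Classify a series (oldest first) as 'rising', 'falling', or 'flat'.

    Compares the mean of the older half with the mean of the newer half;
    for odd lengths the middle sample is ignored.
    """
    if len(failed_counts) < 2:
        raise ValueError("need at least two samples")
    half = len(failed_counts) // 2
    older = failed_counts[:half]
    newer = failed_counts[-half:]
    diff = mean(newer) - mean(older)
    if diff > flat_band:
        return "rising"
    if diff < -flat_band:
        return "falling"
    return "flat"
```

A non-zero flat_band keeps small fluctuations from being reported as a change in direction.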
Change Management Example
flowchart TD
A["Deploy change"] --> B["Watch /metrics and /queues"]
B --> C{"Failures increased?"}
C -->|No| D["Continue monitoring"]
C -->|Yes| E["Pause affected queue"]
E --> F["Rollback or patch"]
F --> G["Retry DLQ subset"]
G --> H{"Stable over multiple snapshots?"}
H -->|Yes| I["Resume queue"]
H -->|No| E
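The "Stable over multiple snapshots?" decision in the flowchart can be expressed as a small predicate. This sketch assumes per-snapshot failed-job counts collected by hand or by script, oldest first; the three-snapshot window and the zero-increase threshold are illustrative defaults, not values defined by the dashboard.

```python
from typing import Sequence


def is_stable(failed_counts: Sequence[int], window: int = 3, max_increase: int = 0) -> bool:
    """Return True when the last `window` snapshots show no step increase
    larger than max_increase. Too few snapshots counts as not stable."""
    if window < 2:
        raise ValueError("window must be at least 2")
    if len(failed_counts) < window:
        return False
    recent = list(failed_counts[-window:])
    return all(later - earlier <= max_increase for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical snapshots taken after retrying a DLQ subset.
    print(is_stable([40, 41, 41, 41]))
```

Treating "not enough data" as unstable keeps the queue paused until enough snapshots have been observed, which matches the conservative loop back to the pause step.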
Practical Limits
- Dashboard actions invoke backend APIs; they do not replace process orchestration.
- Metrics history and audit data are held in memory by the dashboard process, so they do not survive a restart of that process.
- For durable long-term analytics/audit retention, export data to external observability tooling.
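One simple way to get records out of the dashboard process is to append them to a JSON Lines file that an external tool can ingest. The sketch below assumes you have already collected audit or metrics records as dictionaries; the record fields shown are hypothetical, and how the records are fetched from the dashboard is not covered.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Mapping, Union


def export_jsonl(records: Iterable[Mapping[str, Any]], path: Union[str, Path]) -> int:
    """Append each record as one JSON line to path and return the count written."""
    count = 0
    with Path(path).open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical audit record shape, for illustration only.
    sample = [{"action": "queue.pause", "queue": "emails", "status": "ok"}]
    print(export_jsonl(sample, "audit-export.jsonl"))
```

Appending rather than overwriting lets a periodic job keep adding to the same file between dashboard restarts.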