Production Operations¶
This guide focuses on how AsyncMQ behaves once it stops being a local development tool and starts carrying real traffic.
The themes are the same ones you would expect from BullMQ in production:
- queue partitioning
- worker topology
- retries and idempotency
- delayed and repeatable correctness
- incident response
- backend-specific tradeoffs
Recommended Deployment Model¶
For most teams, the cleanest production topology is:
- application processes that only produce jobs
- dedicated worker processes or containers per queue family
- one optional stalled-recovery process when stalled recovery is enabled
- dashboard and admin surfaces as separate operational services
Avoid embedding high-volume workers into the same process that handles your user-facing HTTP traffic unless you are deliberately optimizing for a small deployment footprint.
Queue Design¶
Split queues by operational profile, not by arbitrary code ownership.
Good queue boundaries:
- emails: moderate latency tolerance, external provider throttling
- webhooks: bursty I/O work, retry-heavy
- billing: low concurrency, stronger operator scrutiny
- media: CPU-heavy or long-running tasks
Bad queue boundaries:
- one queue per micro-feature
- one giant queue for unrelated workloads with different SLAs
The goal is to make concurrency, rate limiting, alerting, and incident containment predictable.
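One way to keep those boundaries honest is to write the operational profile down next to the queue name. The sketch below is illustrative only: the field names and numbers are examples, not AsyncMQ defaults.

```python
# Illustrative only: queue families mapped to the operational profile that
# justifies their boundary. Values are examples, not AsyncMQ defaults.
QUEUE_PROFILES = {
    "emails":   {"concurrency": 10, "rate_limit_per_s": 5, "sla": "minutes"},
    "webhooks": {"concurrency": 20, "rate_limit_per_s": 50, "sla": "minutes"},
    "billing":  {"concurrency": 2, "rate_limit_per_s": None, "sla": "hours"},
    "media":    {"concurrency": 2, "rate_limit_per_s": None, "sla": "hours"},
}

def profile_for(queue_name: str) -> dict:
    """Fail fast when a job targets a queue with no agreed operational profile."""
    try:
        return QUEUE_PROFILES[queue_name]
    except KeyError:
        raise ValueError(f"queue {queue_name!r} has no operational profile") from None
```

Rejecting unknown queue names at enqueue time is what prevents "one queue per micro-feature" from creeping back in.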
Backend Selection¶
Choose the backend based on operational guarantees, not familiarity alone.
Redis¶
Best default production backend when you want:
- strong shared coordination
- durable repeatable schedules
- distributed deduplication and scheduler locks
- a simple operating model
Postgres¶
Good fit when you want:
- SQL-native persistence
- advisory-lock-based coordination
- durable schedules without running Redis
Bootstrap the schema before first use.
MongoDB¶
Good fit when your platform is already document-centric, but remember:
- coordination locks are process-local
- some queue-control guarantees are intentionally weaker across multiple processes
RabbitMQ¶
Best when AMQP broker delivery is the requirement, but remember:
- AsyncMQ still needs a metadata store for job state, results, schedules, and locks
- operational quality depends partly on that metadata store
In-memory¶
Use only for:
- local development
- tests
- ephemeral demos
It is intentionally not a durable production backend.
Scaling Workers¶
Scale with a combination of:
- queue-level concurrency
- more worker processes
- queue separation by workload type
Example thought process:
- I/O-heavy email/webhook jobs: higher concurrency, higher worker count
- CPU-heavy PDF generation: lower concurrency, often isolated workers
- rate-limited downstream API: dedicated queue with moderate concurrency plus rate limiting
Remember:
- concurrency is per worker process
- rate limiting is per worker process
- backend coordination quality varies by backend
Idempotency and Retries¶
AsyncMQ gives you retries, but it does not give you exactly-once execution.
Plan for:
- worker crash before result persistence
- retry after transient failure
- manual operator replay
- stalled recovery re-enqueue
Production-safe handlers typically use:
- database upserts
- external idempotency keys
- state-machine checks before mutating downstream systems
- deterministic output paths
If the side effect matters, idempotency belongs in task logic, not only in the queue runtime.
Delayed and Repeatable Work¶
For production scheduling:
- use Queue.upsert_repeatable(...) for durable schedules
- reserve Queue.add_repeatable(...) for local code-owned schedules
- keep scan_interval low enough for acceptable latency
- verify scheduler ownership guarantees for your chosen backend
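The scan_interval guidance comes down to simple arithmetic: a job that becomes due just after a scan completes waits up to one full interval before the scanner picks it up. An illustrative sketch of that bound (not AsyncMQ internals):

```python
def worst_case_dispatch_delay(scan_interval_s: float) -> float:
    """A job becoming due just after a scan waits almost a full interval,
    so worst-case extra latency is roughly one scan_interval."""
    return scan_interval_s

def scan_interval_acceptable(scan_interval_s: float, latency_budget_s: float) -> bool:
    """Check a candidate scan_interval against the latency you can tolerate."""
    return worst_case_dispatch_delay(scan_interval_s) <= latency_budget_s

assert scan_interval_acceptable(1.0, 5.0)
assert not scan_interval_acceptable(30.0, 5.0)
```

Lower intervals buy latency at the cost of more backend polling, so pick the largest value your latency budget allows.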
For deduplicated delayed work:
- align delay and deduplication TTL windows
- use replace=True only when replacing older delayed work is actually correct
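"Aligning" the windows means the deduplication TTL must outlive the delay, otherwise the dedup entry expires before the job runs and a duplicate can slip in. A rule-of-thumb sketch (the buffer value is an illustrative choice):

```python
def dedup_ttl_for(delay_s: float, processing_buffer_s: float = 60.0) -> float:
    """A dedup window shorter than the delay expires before the job runs,
    letting duplicates in. Cover the delay plus a buffer for execution."""
    return delay_s + processing_buffer_s

# A job delayed 5 minutes needs at least a 6-minute dedup window here:
assert dedup_ttl_for(300.0) == 360.0
```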
Observability¶
At minimum, watch:
- waiting count
- active count
- delayed count
- failed count
- DLQ growth
- worker heartbeat freshness
- repeatable next-run drift
Practical surfaces for these signals include the built-in dashboard and the queue admin APIs.
If you need durable long-term analytics, export metrics and audit information to your observability stack. AsyncMQ's built-in dashboard history is aimed at operations, not long-term warehouse analytics.
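Whatever stack you export to, the alerting logic reduces to comparing a counts snapshot against thresholds. A minimal sketch; the metric names and threshold values are examples, not AsyncMQ output:

```python
# Illustrative thresholds; tune per queue and per SLA.
THRESHOLDS = {"waiting": 10_000, "failed": 100, "dlq": 1}

def alerts(snapshot: dict[str, int]) -> list[str]:
    """Return one alert line per metric at or above its threshold."""
    return [
        f"{metric} is {snapshot[metric]} (threshold {limit})"
        for metric, limit in THRESHOLDS.items()
        if snapshot.get(metric, 0) >= limit
    ]

assert alerts({"waiting": 50, "failed": 150, "dlq": 0}) == ["failed is 150 (threshold 100)"]
```

Static thresholds are a starting point; backlog *growth rate* and heartbeat staleness usually deserve their own alerts.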
Incident Playbooks¶
Backlog rising¶
- inspect waiting counts and worker heartbeats
- confirm the queue is not paused
- sample active jobs and failed jobs
- scale workers or reduce upstream production pressure
- pause the queue only when containment is worth the backlog cost
Failure spike¶
- inspect recent failures and traceback patterns
- identify whether the root cause is code, dependency, or payload-specific
- patch or roll back
- retry a controlled subset first
- expand retries only after the error rate stabilizes
Stalled jobs¶
- confirm enable_stalled_check=True
- confirm stalled_recovery_scheduler(...) is actually running
- verify long-running handlers refresh heartbeats when needed
Retention and Cleanup¶
Use queue admin APIs as part of operations hygiene:
await queue.clean_jobs(grace=3600, limit=1000, state="completed")
await queue.clean_jobs(grace=86400, limit=1000, state="failed")
await queue.drain(include_delayed=True)
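The two clean_jobs calls above compose naturally into a single retention pass that a worker supervisor or cron-style task runs on a schedule. A sketch built on the documented admin calls; the grace windows are the illustrative values from above:

```python
async def retention_cycle(queue) -> None:
    """One retention pass; schedule it periodically (e.g. hourly) from
    your worker supervisor or a repeatable job."""
    # keep completed jobs for an hour of recent-debugging context
    await queue.clean_jobs(grace=3600, limit=1000, state="completed")
    # keep failed jobs for a day of incident review
    await queue.clean_jobs(grace=86400, limit=1000, state="failed")
```

Note that drain and obliterate are deliberately left out of the routine pass: they belong to operator-initiated actions, not scheduled hygiene.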
Recommended pattern:
- keep enough completed jobs for recent debugging
- keep failed jobs long enough for incident review
- drain or obliterate only with clear operator intent
obliterate(force=True) is intentionally destructive. Treat it as an incident
or environment reset tool, not a daily maintenance command.
Safe Rollouts¶
Before deploying worker changes:
- confirm new task modules are imported by the worker runtime
- confirm payload schema changes remain backward compatible with queued jobs
- confirm retry/backoff changes are acceptable for in-flight jobs
- confirm repeatable schedules will not duplicate work after restart
If a deploy changes task semantics in a non-backward-compatible way, drain or migrate queued jobs explicitly rather than hoping the old payload shape still works.
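The backward-compatibility check often takes the shape of a normalization step at the top of the handler: new code supplies defaults for fields old producers never sent, and accepts old field names while v1 jobs drain. The field names below are hypothetical examples:

```python
# Illustrative guard: tolerate the old payload shape still sitting in the
# queue. Field names ("locale", "addr"/"address") are examples only.
def normalize_payload(payload: dict) -> dict:
    migrated = dict(payload)
    # v2 added "locale"; jobs enqueued by v1 producers lack it
    migrated.setdefault("locale", "en")
    # v2 renamed "addr" to "address"; accept both while old jobs drain
    if "address" not in migrated and "addr" in migrated:
        migrated["address"] = migrated.pop("addr")
    return migrated
```

Once monitoring confirms no pre-deploy jobs remain in waiting or delayed states, the compatibility shims can be deleted.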
Backend Caveats to Remember¶
- Redis and Postgres give the strongest built-in coordination.
- MongoDB and in-memory intentionally do not pretend to be distributed lock services.
- RabbitMQ coordination depends on the metadata store choice.
- Queue pause, repeatable ownership, and deduplication are only as strong as the backend implementation behind them.