Queues with BullMQ
The API has one job: respond to requests as fast as possible. Anything that does not need to happen right now should not happen on the request path. Sending a confirmation email, generating a PDF, indexing a document for AI search, computing a daily digest: all of these are slow, and forcing the user to wait for them is unkind.
The classic answer is a queue. The API drops a small "do this later" note into a queue; a separate worker process picks it up and runs the slow work in the background. The API returns to the user immediately.
This page explains the queue used in Dashify, the patterns it supports, and how it stays reliable even when things go wrong.
What BullMQ is
BullMQ is a Node.js queue library built on Redis. Jobs are JSON payloads stored in Redis data structures (lists, sorted sets, hashes) managed atomically by Lua scripts; a separate worker process consumes them. BullMQ adds the things you actually need on top: job priorities, delayed jobs (run this in 5 minutes), repeatable jobs (run this every day at 3am), retries with backoff, a dead-letter state for jobs that fail too many times, and, via the companion Bull Board package, a UI for inspecting it all.
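In code, the two halves look roughly like this. A minimal sketch, assuming a local Redis; the queue name, payload shape, and the sendEmail helper are illustrative, not Dashify's actual code:

```typescript
import { Queue, Worker } from "bullmq";

declare function sendEmail(data: unknown): Promise<void>; // hypothetical helper

const connection = { host: "localhost", port: 6379 };

// Producer side (runs in the API process): drop the note and move on.
const emails = new Queue("emails", { connection });
await emails.add("welcome", { to: "user@example.com", name: "Ada" });

// Consumer side (runs in a separate worker process): do the slow work.
new Worker(
  "emails",
  async (job) => {
    await sendEmail(job.data); // job.data is the JSON payload enqueued above
  },
  { connection, concurrency: 10 },
);
```

The producer and consumer share nothing but the queue name and Redis; that separation is what lets them scale and fail independently.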
Queues in Dashify
The platform has a small set of named queues, each focused:
- emails: outbound transactional emails (welcome, password reset, invite, daily digests).
- reports: generating PDFs and CSV exports that take longer than a request budget.
- indexer: embedding documents into Qdrant for AI search.
- notifications: fanout of internal notifications (creating notification rows for each recipient).
- audit: async audit-log writes (when the API does not want to wait).
Each queue has its own concurrency limit (how many jobs can run in parallel), priority, and retry policy. Tuning these is operator work; the defaults are sensible.
How a job moves
A job is born when the API adds it to a queue: the payload is written to Redis and the job sits in the waiting state. A worker picks it up (active), runs the handler, and the job ends up completed, or failed, where the retry policy takes over. The API returns its response in the same millisecond it enqueued the job. The user does not wait for the email to actually be sent.
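On the request path, enqueue-then-respond looks like this. A sketch assuming Express; the route, payload, and createUser helper are hypothetical:

```typescript
import express from "express";
import { Queue } from "bullmq";

declare function createUser(body: unknown): Promise<{ id: string; email: string }>; // hypothetical

const emails = new Queue("emails", { connection: { host: "localhost", port: 6379 } });
const app = express();

app.post("/signup", express.json(), async (req, res) => {
  const user = await createUser(req.body);
  // Enqueue and return: sending the welcome email is the worker's problem now.
  await emails.add("welcome", { userId: user.id, to: user.email });
  res.status(201).json({ id: user.id }); // returns before the email is sent
});
```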
Reliability
A queue gives reliability for free. Three guarantees:
At-least-once delivery. A job, once accepted, will be attempted at least once. If the worker crashes mid-job, BullMQ requeues it. The cost is that some jobs may run more than once if a crash happens at exactly the wrong moment, so jobs must be idempotent: repeating them must be safe.
Retry with backoff. Failed jobs retry, typically three attempts with exponential backoff (1 second, 5 seconds, 25 seconds, then dead-letter). This handles transient failures (a flaky third-party API, a brief network blip).
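The 1 second / 5 seconds / 25 seconds schedule is a factor-of-5 curve. As a sketch, it can be written as a pure function and plugged into BullMQ's custom backoff hook; the schedule numbers come from this page, the factor-5 formula is an assumption, and the wiring is shown in comments:

```typescript
// Delay in milliseconds before the Nth retry attempt (1-based).
function backoffDelay(attemptsMade: number): number {
  return 1000 * 5 ** (attemptsMade - 1); // 1000, 5000, 25000, ...
}

// Where it would plug in (BullMQ worker options, sketch):
//   new Worker("emails", handler, {
//     settings: { backoffStrategy: (n) => backoffDelay(n) },
//   });
// Jobs opt in with { attempts: 3, backoff: { type: "custom" } };
// after the third failure the job is dead-lettered.
```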
Dead-lettering. Jobs that fail every retry land in a "failed" state. They are not lost; they sit in Redis with their full input, the error, and the stack trace. An operator can inspect them, fix the underlying issue, and retry or discard them.
Repeatable jobs (cron)
Some work is not triggered by user actions but by the clock: nightly indexer, daily reports, hourly cleanups. BullMQ supports repeatable jobs, a cron-like schedule that re-emits the job at the configured interval.
Examples in Dashify:
- The indexer runs at 02:00 every day.
- A purge-deleted-accounts job runs at 03:00 every day.
- Daily digests fire at 07:00 in each tenant's timezone.
Repeatable jobs are stored in Redis and survive restarts.
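Registering the 02:00 indexer run might look like this. A sketch; the job name is illustrative and the repeat options vary slightly across BullMQ versions:

```typescript
import { Queue } from "bullmq";

const indexer = new Queue("indexer", { connection: { host: "localhost", port: 6379 } });

// Cron pattern "0 2 * * *" = every day at 02:00. The schedule is stored
// in Redis, so it survives process restarts.
await indexer.add(
  "nightly-reindex",
  {},
  { repeat: { pattern: "0 2 * * *", tz: "UTC" } },
);
```

Per-tenant schedules like the 07:00 digests would pass each tenant's timezone in the tz option instead of UTC.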
Bull Board, the UI
Bull Board is a small web UI for inspecting BullMQ queues, maintained as a companion package rather than part of BullMQ itself. It shows pending jobs, completed jobs, failed jobs (with full error and stack), and lets operators retry, remove, or promote delayed jobs.
In Dashify, Bull Board is mounted at /admin/queues and gated to SuperAdmin only. Org Admins do not see queues: they are platform infrastructure, not tenant features.
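Mounting the board behind the SuperAdmin gate might look like this. A sketch using the @bull-board/express adapter; requireSuperAdmin stands in for whatever auth middleware the platform actually uses:

```typescript
import express from "express";
import { Queue } from "bullmq";
import { createBullBoard } from "@bull-board/api";
import { BullMQAdapter } from "@bull-board/api/bullMQAdapter";
import { ExpressAdapter } from "@bull-board/express";

declare const requireSuperAdmin: express.RequestHandler; // hypothetical auth middleware

const connection = { host: "localhost", port: 6379 };

const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath("/admin/queues");

// Register each named queue with the board.
createBullBoard({
  queues: [new BullMQAdapter(new Queue("emails", { connection }))],
  serverAdapter,
});

const app = express();
app.use("/admin/queues", requireSuperAdmin, serverAdapter.getRouter());
```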
Concurrency
Each queue has a per-worker concurrency setting: how many jobs that worker will run in parallel. The email queue might run 10 in parallel; the indexer (which loads big chunks of text into RAM) might run 2; the reports queue (which can spin up significant CPU work) might run 1.
When you scale workers horizontally, the total concurrency is workerCount × perWorkerConcurrency. Tuning is an operator decision based on the bottleneck: emails are I/O bound, the indexer is RAM bound, reports are CPU bound.
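In worker code, these limits are just the concurrency option. A sketch using the example numbers above; the handler names are illustrative:

```typescript
import { Worker } from "bullmq";

declare const sendEmailJob: (job: unknown) => Promise<void>;     // hypothetical handlers
declare const indexDocumentJob: (job: unknown) => Promise<void>;
declare const buildReportJob: (job: unknown) => Promise<void>;

const connection = { host: "localhost", port: 6379 };

new Worker("emails", sendEmailJob, { connection, concurrency: 10 });     // I/O bound
new Worker("indexer", indexDocumentJob, { connection, concurrency: 2 }); // RAM bound
new Worker("reports", buildReportJob, { connection, concurrency: 1 });   // CPU bound
```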
Idempotency
Because jobs may run more than once, every job handler must be idempotent: running it twice with the same input produces the same result as running it once.
Patterns we use:
- Use a job-level natural key. Sending an email is keyed by (recipient, template, contextHash). If the same email is enqueued twice, the second send is a no-op.
- Check before write. Indexing a document checks "has this version already been indexed?" before doing the work.
- Upsert, not insert. Database writes use upserts so a duplicate run does not create a duplicate row.
Idempotency is a discipline. Skipping it leads to weird user-visible bugs that are hard to reproduce.
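The natural-key pattern can double as a deduplication key: BullMQ treats a custom jobId as unique while a job with that id exists, so deriving the id from the key makes a duplicate enqueue a no-op. A sketch; emailJobId is a hypothetical helper, not Dashify's actual code:

```typescript
import { createHash } from "node:crypto";

// Deterministic job id from the natural key (recipient, template, contextHash).
function emailJobId(recipient: string, template: string, contextHash: string): string {
  return createHash("sha256")
    .update(`${recipient}|${template}|${contextHash}`)
    .digest("hex");
}

// Usage (sketch): a second add with the same jobId is ignored while the
// first job still exists in Redis.
//   await emails.add("send", payload, {
//     jobId: emailJobId(payload.to, payload.template, payload.contextHash),
//   });
```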
What about the API path?
You might be wondering: why not just setTimeout or setImmediate after returning the response, and skip the queue entirely?
Three reasons:
- Process death loses the work. If the API process dies between returning the response and finishing the background task, the task is gone. The queue persists across restarts.
- You cannot scale a setTimeout. The work is tied to whichever instance handled the original request. The worker pool can scale independently.
- You cannot retry a setTimeout. Failure handling becomes ad-hoc.
The queue is more code, but it is the right amount of code.
Failure budget
How many failed jobs is too many? Dashify exposes Prometheus metrics for queue health: depth (how many jobs are pending), age (how long jobs are waiting), failure rate. Grafana dashboards show all three. A spike in failure rate or queue depth fires an alert.
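The depth and failure numbers come straight from BullMQ. A polling sketch; the Prometheus gauge wiring is omitted and the queue name is illustrative:

```typescript
import { Queue } from "bullmq";

const emails = new Queue("emails", { connection: { host: "localhost", port: 6379 } });

// One sample of queue health; export these as Prometheus gauges on a timer.
const counts = await emails.getJobCounts("waiting", "active", "delayed", "failed");
console.log(counts); // e.g. { waiting: 12, active: 3, delayed: 0, failed: 1 }

// Age of the oldest waiting job, from its enqueue timestamp:
const [oldest] = await emails.getWaiting(0, 0);
if (oldest) console.log(Date.now() - oldest.timestamp, "ms in queue");
```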
In production, the operator should treat repeated failures the same way they treat repeated 500 errors: as a real bug to investigate.
Key takeaways
- Slow work goes into a queue so the API can return immediately.
- Dashify uses BullMQ on top of Redis, with multiple named queues, each with its own concurrency and retry policy.
- Jobs retry with exponential backoff; jobs that fail every retry are dead-lettered, not lost.
- Repeatable jobs cover scheduled work: the nightly indexer, daily digests, account purge.
- Every job handler must be idempotent because jobs can run more than once.