Observability
When something is wrong with a running system, the operator needs to be able to ask it three questions: what is happening, what just happened, and why? Observability is the discipline of building the system so those questions have clear answers.
Observability has four pillars: logs, metrics, traces, and errors. Dashify implements all four in a single stack.
Logs, what happened
Logs are the running narrative. Every request, every job, every notable event leaves a line in the log.
Dashify uses Pino for structured logging. Each log line is JSON, not a free-text string. A typical line might look like (in plain English):
Level: info. Message: "request completed". Method: POST. Path: /api/v1/pm/work-items. Status: 200. Duration: 47ms. Request id: req_abc123. User id: 65f.... Tenant id: 63a....
That structure means logs are searchable and filterable. "Show me every error in the last hour for tenant X" is a single grep or one query in a log aggregator. Free-text logs cannot answer that.
In development, Pino logs go to stdout in human-readable form. In production they go to stdout in pure JSON, where they are picked up by the hosting platform and shipped to a log aggregator (Datadog, Loki, CloudWatch, operator's choice).
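To make the dev/prod split concrete, here is a minimal sketch of that setup, assuming the standard pino package with a pino-pretty transport in development; the field names in the example line are illustrative, not Dashify's actual schema:

```ts
import pino from "pino";

const isDev = process.env.NODE_ENV !== "production";

// In development, pipe through pino-pretty for human-readable output;
// in production, emit raw JSON to stdout for the platform's log shipper.
export const logger = pino({
  level: "info",
  transport: isDev
    ? { target: "pino-pretty", options: { colorize: true } }
    : undefined,
});

// A structured line: every field becomes queryable in the aggregator.
logger.info(
  { method: "POST", path: "/api/v1/pm/work-items", status: 200, durationMs: 47 },
  "request completed"
);
```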
Metrics, what is happening
Metrics are numbers measured over time. Dashify exposes a /metrics endpoint that Prometheus scrapes every fifteen seconds. The metrics include (a registration sketch follows the list):
- Request counters per route, status code, method.
- Request duration histograms (p50, p95, p99).
- Queue depths and ages.
- Active WebSocket connections.
- Database connection pool usage.
- AI provider call counts and latencies.
- Custom business counters (signups, password resets, AI questions answered).
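For illustration, here is how two of these might be registered with prom-client, the standard Prometheus client for Node (metric names and histogram buckets are assumptions, not Dashify's actual choices):

```ts
import { Counter, Histogram, Registry, collectDefaultMetrics } from "prom-client";

export const registry = new Registry();
collectDefaultMetrics({ register: registry });

// Request counter, labelled per route, method, and status code.
export const httpRequests = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["route", "method", "status"],
  registers: [registry],
});

// Duration histogram; Prometheus derives p50/p95/p99 from the buckets.
export const httpDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route", "method"],
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5],
  registers: [registry],
});
```

The /metrics handler would then return the string from registry.metrics() (an async call in recent prom-client versions) as plain text for Prometheus to scrape.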
Grafana sits on top of Prometheus and renders dashboards. The platform ships with default dashboards covering API health, worker health, queue health, and a synthetic SLO panel.
Alerts are configured in Grafana: slow request rate, queue depth above threshold, error rate spike. They page the operator via the configured notification channel (Slack, PagerDuty, email).
Traces, why was that slow
A trace tells the full story of one request as it travels through every service. A request that took 800 ms might break down as: 50 ms in the API, 100 ms in MongoDB, 600 ms calling out to Cloudinary, 50 ms in another MongoDB write. Without traces you would only see "800 ms" and have to guess. With traces you see exactly where the time went.
Dashify instruments traces with OpenTelemetry and exports them to Jaeger.
Each trace is a tree of "spans", a span for the HTTP request, a span for the database call inside it, a span for the external HTTP call inside that. Spans are correlated by a trace id that flows through every service via HTTP headers, so a request that touches the API, the database, and an external API produces a single coherent trace.
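A minimal sketch of that wiring, assuming the standard OpenTelemetry Node SDK and Jaeger's OTLP HTTP endpoint (the service name and URL are illustrative):

```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "dashify-api",
  // Recent Jaeger versions accept OTLP directly, on port 4318 for HTTP.
  traceExporter: new OTLPTraceExporter({
    url: "http://jaeger:4318/v1/traces",
  }),
  // Auto-instrumentation creates spans for inbound HTTP, MongoDB, and
  // outbound calls, and propagates the trace id via HTTP headers.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```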
The trace id also appears in the Pino log line for the same request. Click a slow trace in Jaeger, copy the id, search the logs for it, and you have the full story instantly.
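One way that correlation might be wired is Pino's mixin hook together with OpenTelemetry's API package; a sketch, not necessarily Dashify's exact mechanism:

```ts
import pino from "pino";
import { trace } from "@opentelemetry/api";

// The mixin runs for every log call and merges its return value into
// the line, so each entry carries the id of the span it was logged under.
export const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    return span ? { traceId: span.spanContext().traceId } : {};
  },
});
```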
Errors, what broke
Logs and metrics tell you about the system. Errors are about specific code paths that failed.
Dashify uses Sentry for error capture. Any uncaught exception, unhandled promise rejection, or explicitly-reported error lands in Sentry with the full stack trace, the request context, the user id, the tenant id, and any custom tags.
Sentry deduplicates similar errors and shows them as a single issue: the first occurrence, the latest occurrence, how many users it has affected, and a frequency chart. Operators see "this exact bug has happened 1,372 times in the last day" rather than 1,372 individual error reports.
Sentry runs only in production. Local dev does not initialise it (no DSN configured), so errors during development do not pollute the production issue list.
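A sketch of how that DSN-gated initialisation might look with @sentry/node (the helper function and tag names are assumptions):

```ts
import * as Sentry from "@sentry/node";

// No DSN, no reporting: local development stays out of the issue list.
if (process.env.SENTRY_DSN) {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
  });
}

// At request time, attach the context Sentry issues are sliced by.
// Hypothetical helper; Dashify's actual wiring may differ.
export function tagRequest(userId: string, tenantId: string) {
  Sentry.setUser({ id: userId });
  Sentry.setTag("tenant_id", tenantId);
}
```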
Together, the four lenses
The four pillars form a complete picture.
A real incident usually walks through all four. Metrics show the symptom (latency spike). Logs show the events (a flurry of 500s on one endpoint). Traces show the cause (the database call inside that endpoint went from 50 ms to 4 seconds). Errors show the specific exception and the line of code.
Performance budgets
Beyond reactive tools, Dashify enforces performance budgets: numbers that should not be exceeded.
- API requests should complete in under 200 ms p95.
- Background jobs should complete in under 30 seconds for the email queue, under 5 minutes for the indexer.
- AI questions should complete in under 30 seconds.
Grafana dashboards show actual vs budget for each. Spikes that breach the budget fire alerts.
What does not get logged
Privacy and noise are real concerns. Dashify does not log:
- Request bodies (which can contain user-submitted text, chat messages, KB articles).
- Response bodies.
- Cookies or tokens.
The log line carries the route, the user id, the tenant id, the duration, the status, and any error. The contents of what was said or stored stay in the database where they belong.
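Pino's built-in redact option can enforce this rule mechanically. A sketch, with paths that are illustrative rather than Dashify's actual request shape:

```ts
import pino from "pino";

export const logger = pino({
  redact: {
    paths: [
      "req.headers.cookie",
      "req.headers.authorization",
      "req.body",
      "res.body",
    ],
    // Drop the fields entirely rather than printing a "[Redacted]" marker.
    remove: true,
  },
});
```

In practice these redact paths would sit alongside the transport options in the single logger definition.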
Cost discipline
Observability is not free. Log ingestion can dominate a hosting bill if it is left wide open. Dashify minimises cost by:
- Using info-level logs for normal flow, debug-level for diagnostics. Production runs at info level, dropping debug entirely.
- Sampling traces. In production, only a percentage of requests are traced fully; the rest get just headline metrics.
- Aggregating Prometheus data over time so old, fine-grained samples are kept at lower resolution.
These are operator decisions. The platform supports both extremes: full fidelity for a low-volume deployment, sampling for a high-volume one.
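For the trace-sampling point above, a minimal sketch using OpenTelemetry's built-in samplers; the 10% ratio is an example value, an operator knob rather than a platform default:

```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// ParentBasedSampler keeps traces coherent: if an upstream service
// sampled the request, every child span follows suit.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // trace ~10% of requests
  }),
});

sdk.start();
```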
Key takeaways
- The four pillars: Logs (Pino), Metrics (Prometheus + Grafana), Traces (OpenTelemetry + Jaeger), Errors (Sentry).
- Logs are structured JSON, searchable by request id, user id, tenant id.
- Metrics drive dashboards and alerts; traces show where time went; Sentry shows exact stack traces.
- A trace id appears in the corresponding log line, linking the four pillars together.
- The platform never logs request/response bodies or auth tokens.