
Real-time at Scale

Real-time features change the feel of an application. A chat where messages appear immediately. A board where another user's edit lights up instantly on your screen. Notifications that arrive without a page reload. They make the platform feel like a place, not a series of pages.

Real-time is also the scariest piece to scale. A single misbehaving connection can take down a server. A naive fanout can flood the network. Cross-instance routing has to be exactly right or messages get lost. This page explains how Dashify handles all of that.

What "real time" means here

Specifically: a persistent connection between the browser and the server, over which either side can send messages at any time without polling. The protocol is WebSocket, the library is Socket.IO, and the connection is upgraded from a normal HTTP request at handshake time.

When user A sends a chat message, it travels: A's WebSocket → A's API instance → Redis Pub/Sub → every API instance → every connected user in the channel.

Why a WebSocket and not polling

Polling means sending an HTTP request every few seconds asking "anything new?" That works, but it wastes resources: most polls return nothing, server CPU and bandwidth go up, and worst-case latency grows to the full polling interval.

A WebSocket is a single open connection. The server pushes new data the moment it has it. The client pushes new data when the user does something. There is no waiting interval.

The tradeoff is that WebSockets are stateful in a way HTTP is not: the connection lives across many messages. We have to think about what happens when the connection drops, when the server restarts, and when the user changes networks.

Authentication on the WebSocket

A WebSocket connection has to be authenticated just like an HTTP request. Two challenges complicate it:

  1. WebSocket handshakes pass through some proxies awkwardly when cookies are involved.
  2. Once the connection is open, every subsequent message has to be associated with the right user without re-checking the cookie every time.

Dashify's solution is the JWT pattern covered in the JWT page. The browser asks the API for a short-lived JWT (15 minutes), opens the WebSocket with the JWT in a query parameter, and the Socket.IO middleware verifies the JWT once at handshake time. Once verified, the connection's userId and tenantId are attached and used on every subsequent message.

A small but critical detail: the JWT claim Socket.IO reads is id, not _id. Reading the wrong field yields "unauthorized: malformed token" on every handshake, a real bug we hit and fixed.
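A minimal sketch of that handshake middleware, assuming Socket.IO v4 and the jsonwebtoken package; the secret name and payload shape here are illustrative, not necessarily Dashify's exact code:

```typescript
import { Server, Socket } from "socket.io";
import jwt from "jsonwebtoken";

// Illustrative payload shape. The claim to read is `id`, not `_id`.
interface TokenPayload {
  id: string;
  tenantId: string;
}

const io = new Server(3000);

io.use((socket: Socket, next) => {
  // The short-lived JWT arrives as a query parameter at handshake time.
  const token = socket.handshake.query.token as string | undefined;
  if (!token) return next(new Error("unauthorized: missing token"));

  try {
    const payload = jwt.verify(token, process.env.JWT_SECRET!) as TokenPayload;
    // Verified once; every subsequent message on this connection reuses these.
    socket.data.userId = payload.id; // reading payload._id here broke every handshake
    socket.data.tenantId = payload.tenantId;
    next();
  } catch {
    next(new Error("unauthorized: malformed token"));
  }
});
```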

Rooms

Socket.IO has a concept called rooms, named groups of connections that messages can be broadcast to. Dashify uses rooms heavily:

  • Every user has a personal room, user:<userId>, for notifications targeted at them.
  • Every project has a room, project:<projectId>, for board updates.
  • Every chat channel has a room, chat:<channelId>, for messages in that channel.

When a user opens a project board, the client emits subscribe project:abc123; the server adds the connection to the room. When the user navigates away, the client emits unsubscribe; the server removes it. The user only receives events for what they are looking at.
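The subscribe/unsubscribe handlers are only a few lines. A sketch, assuming Socket.IO v4; the room-name guard is an illustrative detail, not necessarily Dashify's exact check:

```typescript
import { Server } from "socket.io";

const io = new Server(3000); // same server instance as the handshake sketch

io.on("connection", (socket) => {
  // Every connection joins the user's personal room for targeted notifications.
  socket.join(`user:${socket.data.userId}`);

  // The client subscribes to whatever it is currently looking at...
  socket.on("subscribe", (room: string) => {
    // Illustrative guard: only the room shapes this page describes.
    if (/^(project|chat):[\w-]+$/.test(room)) socket.join(room);
  });

  // ...and unsubscribes when the user navigates away.
  socket.on("unsubscribe", (room: string) => {
    socket.leave(room);
  });
});
```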

Cross-instance fanout

The Socket.IO Redis adapter is the magic. With it, every API instance is part of a Redis Pub/Sub mesh. When instance 1 emits to the room project:abc, the adapter publishes to a Redis channel. Every instance is subscribed; every instance forwards to its locally-connected clients in that room. The user does not know or care which instance they are connected to.

Without the Redis adapter, real time would only work for users connected to the same instance, which would mean it does not really work at all in a horizontally-scaled deployment.
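Wiring the adapter in is small. A sketch following the @socket.io/redis-adapter documentation; the Redis URL, event name, and payload are placeholders:

```typescript
import { createServer } from "http";
import { Server } from "socket.io";
import { createClient } from "redis";
import { createAdapter } from "@socket.io/redis-adapter";

const pubClient = createClient({ url: "redis://localhost:6379" });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);

const httpServer = createServer();
const io = new Server(httpServer, {
  // Every room emit is published to Redis; every subscribed instance
  // forwards it to its own locally-connected sockets.
  adapter: createAdapter(pubClient, subClient),
});

// Reaches the room's members on every instance, not just this one.
io.to("project:abc").emit("board:update", { cardId: "c1", column: "done" });

httpServer.listen(3000);
```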

Per-tab support

A user may have the platform open in three tabs. Each tab is a separate WebSocket connection. Each connection is in the same user:<userId> room. When the platform sends a notification, all three tabs receive it.

This is what makes "you have a new message" badges accurate even when the user has multiple tabs open. Implemented correctly, it is invisible. Implemented incorrectly, a notification appears on one tab while another tab stays silent, a bug we explicitly debugged and fixed.
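A hypothetical helper makes the mechanism concrete: because each tab's socket joined user:<userId> at connect time, a single room emit fans out to all of them.

```typescript
import { Server } from "socket.io";

// One emit, every tab: all of the user's sockets sit in the same room.
function notifyUser(io: Server, userId: string, payload: unknown) {
  io.to(`user:${userId}`).emit("notification", payload); // event name is illustrative
}
```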

Ordering and atomicity

In a chat, message order matters. If Alice sends "what time?" and then "actually 3pm works", and they arrive in reverse order, the conversation makes no sense.

Dashify guarantees order per channel with an atomic per-channel sequence number. When the API accepts a chat message, it atomically increments a counter in Redis for that channel and assigns the new sequence number to the message. Clients sort messages by this number, not by the timestamp (which can drift across machines).

This is more reliable than any timestamp-based ordering scheme.
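A sketch of the counter, assuming the node-redis client; the key shape chat:<channelId>:seq is illustrative:

```typescript
import { createClient } from "redis";

const redis = createClient({ url: "redis://localhost:6379" });
await redis.connect();

// INCR is atomic in Redis: two concurrent messages in the same channel
// can never be assigned the same sequence number.
async function nextSequence(channelId: string): Promise<number> {
  return redis.incr(`chat:${channelId}:seq`);
}

// Stamp the message before persisting and broadcasting it.
const seq = await nextSequence("abc123");
console.log(seq); // 1, 2, 3, ... strictly increasing per channel
```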

Reconnection

Networks drop. Phones go to sleep. The platform handles this transparently:

  • When the connection drops, the client retries with exponential backoff.
  • When the connection succeeds, it re-subscribes to the rooms it was in before.
  • It also fetches "what did I miss?": for chat, this is a since=<sequenceNumber> request, and the server returns every message after that point.

Real-time is best-effort while connected; the catch-up fetch on reconnect is what makes the experience reliable across flaky networks.
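A client-side sketch, assuming socket.io-client; the endpoint path, event names, and renderMessage hook are illustrative, and the backoff options only bound the delays Socket.IO already applies by default:

```typescript
import { io } from "socket.io-client";

declare const myJwt: string; // short-lived JWT fetched from the API
declare function renderMessage(msg: { seq: number; body: string }): void; // UI hook

const channelId = "abc123"; // illustrative
let lastSeq = 0;            // highest sequence number seen so far

// The Socket.IO client retries with exponential backoff out of the box;
// these options just bound the delays (values are illustrative).
const socket = io("https://app.example.com", {
  query: { token: myJwt },
  reconnectionDelay: 500,
  reconnectionDelayMax: 10_000,
});

// "connect" fires on the first connect and on every reconnect.
socket.on("connect", async () => {
  socket.emit("subscribe", `chat:${channelId}`);
  // Catch up on everything missed while offline.
  const res = await fetch(`/api/channels/${channelId}/messages?since=${lastSeq}`);
  for (const msg of await res.json()) {
    lastSeq = Math.max(lastSeq, msg.seq);
    renderMessage(msg);
  }
});

socket.on("chat:message", (msg: { seq: number; body: string }) => {
  lastSeq = Math.max(lastSeq, msg.seq);
  renderMessage(msg);
});
```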

Backpressure

A misbehaving client can flood a server with messages. Dashify caps incoming WebSocket messages per connection per second. Over the cap, the connection is throttled or, in extreme cases, disconnected. The cap is generous (well above any realistic typing or clicking rate) but it bounds the worst case.
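A minimal fixed-window sketch of the incoming cap, assuming Socket.IO v4; the limit of 50 messages per second is an illustrative number, and this version only shows the disconnect path, not the gentler throttling:

```typescript
import { Server } from "socket.io";

const io = new Server(3000);
const LIMIT_PER_SECOND = 50; // illustrative: well above any human input rate

io.on("connection", (socket) => {
  let count = 0;
  const windowTimer = setInterval(() => { count = 0; }, 1000);
  socket.on("disconnect", () => clearInterval(windowTimer));

  // socket.use runs for every incoming packet on this connection.
  socket.use((_packet, next) => {
    if (++count > LIMIT_PER_SECOND) {
      socket.disconnect(true); // extreme case: drop the flooding connection
      return;
    }
    next();
  });
});
```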

On the server-to-client side, slow consumers (clients that cannot keep up with what the server sends) are eventually disconnected. The connection is dropped politely; the client reconnects when it is ready.

Memory pressure

Open connections cost memory. Each WebSocket carries a small amount of state: buffered frames, the room list, identity. With many thousands of concurrent connections, this adds up. The platform is comfortable with several thousand concurrent connections per instance; beyond that, scaling out (more instances) is the right move.

Prometheus exposes the connection count as a metric. Grafana shows it. Alerts fire if a single instance is hosting too many.
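A sketch of that gauge, assuming the prom-client package; the metric name is illustrative:

```typescript
import client from "prom-client";
import { Server } from "socket.io";

const io = new Server(3000);

// Goes up on connect, down on disconnect; Prometheus scrapes the value
// per instance, and Grafana and the alerts watch it from there.
const wsConnections = new client.Gauge({
  name: "websocket_open_connections", // illustrative metric name
  help: "Open WebSocket connections on this instance",
});

io.on("connection", (socket) => {
  wsConnections.inc();
  socket.on("disconnect", () => wsConnections.dec());
});
```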

What a real chat message looks like in transit

Alice sends "what time?" in a channel. Step by step:

  1. Her browser emits the message over its WebSocket to whichever API instance holds the connection.
  2. That instance already knows her userId and tenantId from the handshake; it validates the message and atomically increments the channel's Redis counter to stamp a sequence number.
  3. The instance emits to the room chat:<channelId>, and the Redis adapter publishes the event to Redis Pub/Sub.
  4. Every API instance receives the publish and forwards the message to its locally-connected clients in that room.
  5. Each client, in every open tab, inserts the message in sequence-number order.

That whole dance, end to end, takes well under 100 milliseconds in normal conditions.

Key takeaways

  • Real-time uses Socket.IO over WebSockets, authenticated by a short-lived JWT verified at handshake time.
  • The Redis adapter is what makes real time work across multiple API instances.
  • Rooms scope events: users only receive what they are looking at.
  • Per-channel atomic sequence numbers guarantee message order.
  • Clients reconnect transparently, with a since= catch-up to handle network blips.
  • Connection counts are bounded by per-instance memory; scaling means more instances.