Replication lag is the delay between a write operation on the primary and its application on a secondary node, with the top three causes being insufficient secondary resources, network limitations, and oplog issues.
Replication lag in MongoDB is the time difference between a write operation occurring on the primary node and that same operation being applied on a secondary node. It's calculated by comparing the timestamp of the last operation applied on the secondary with the timestamp of the latest operation on the primary. In a Node.js-backed system, this metric is critical because high lag can lead to stale reads from secondaries, increased failover times, and in extreme cases, secondaries falling so far behind that they require a full resync. Understanding the root causes helps maintain application performance and data consistency.
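As a sketch, lag per secondary can be derived in Node.js from the replSetGetStatus admin command, whose member entries each report an optimeDate (getReplicationLag is an illustrative helper, not a driver API):

```javascript
// Replication lag per secondary, computed from a replSetGetStatus-style
// document. Each member reports optimeDate: the wall-clock time of the
// last operation it has applied.
function getReplicationLag(status) {
  const primary = status.members.find((m) => m.stateStr === 'PRIMARY');
  if (!primary) return null; // no visible primary, e.g. mid-election
  return status.members
    .filter((m) => m.stateStr === 'SECONDARY')
    .map((m) => ({
      host: m.name,
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// With the official driver, the status document comes from:
//   const status = await client.db('admin').command({ replSetGetStatus: 1 });
```

Polling this on an interval is the basis for the alerting discussed later in this section.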
The most common cause of replication lag is secondary nodes lacking the hardware resources to keep pace with the primary's write workload. Secondaries must apply every write operation that the primary processes, which requires comparable CPU, memory, and disk I/O capacity. When secondaries are under-provisioned, work queues build up, creating lag. Disk speed is particularly critical—using SSDs instead of HDDs can dramatically improve write performance and reduce lag. This issue is especially prevalent in read-scaling architectures where secondaries handle application read traffic while simultaneously trying to replicate writes.
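One way to spot an under-resourced secondary from Node.js is to watch its oplog apply buffer in the serverStatus output; a buffer that keeps growing means writes are arriving faster than the node can apply them. The field paths below follow the serverStatus shape on MongoDB 4.x and may differ between versions (replBufferBacklog is an illustrative helper):

```javascript
// A persistently growing oplog apply buffer on a secondary suggests it
// cannot keep up with the primary's write rate. Field paths follow the
// serverStatus output on MongoDB 4.x and may differ between versions.
function replBufferBacklog(serverStatus) {
  const buffer =
    serverStatus.metrics && serverStatus.metrics.repl
      ? serverStatus.metrics.repl.buffer
      : undefined;
  if (!buffer) return null;
  return { queuedOps: buffer.count, queuedBytes: buffer.sizeBytes };
}

// Run against the secondary you suspect is under-provisioned:
//   const stats = await secondaryClient.db('admin').command({ serverStatus: 1 });
//   console.log(replBufferBacklog(stats));
```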
Network limitations between primary and secondary nodes directly impact replication speed. In geographically distributed deployments where nodes span different data centers or cloud regions, physical distance introduces latency that slows oplog transmission. Bandwidth constraints become critical when large write operations or batch inserts must be transmitted across limited network pipes. This problem can cascade when multiple lagging secondaries simultaneously attempt to sync from the primary, overwhelming available bandwidth and creating a feedback loop of increasing lag.
The oplog (operations log) is a capped collection that records all write operations. If the oplog is too small, it can wrap around and overwrite entries before secondaries have had a chance to replicate them. This forces secondaries into a state where they must perform a full resync, causing significant downtime. The oplog window—the time span covered by the oplog—should ideally accommodate at least 24-72 hours of operation to allow recovery from temporary outages. On busy systems with high write throughput, default oplog sizes may be inadequate, requiring adjustment.
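A sketch of estimating the oplog window and growing an undersized oplog: oplogWindowHours is an illustrative helper, while replSetResizeOplog is the real admin command (MongoDB 3.6+, size in megabytes). The getHighBits() calls assume the driver's BSON Timestamp, whose high 32 bits hold seconds since the epoch:

```javascript
// Oplog window in hours, given the wall-clock seconds of the oldest and
// newest oplog entries (the high 32 bits of a BSON Timestamp).
function oplogWindowHours(oldestSeconds, newestSeconds) {
  return (newestSeconds - oldestSeconds) / 3600;
}

// Fetching the endpoints with the Node.js driver (sketch):
//   const oplog = client.db('local').collection('oplog.rs');
//   const [oldest] = await oplog.find().sort({ $natural: 1 }).limit(1).toArray();
//   const [newest] = await oplog.find().sort({ $natural: -1 }).limit(1).toArray();
//   const hours = oplogWindowHours(oldest.ts.getHighBits(), newest.ts.getHighBits());

// Growing an undersized oplog without downtime (size in MB, MongoDB 3.6+):
//   await client.db('admin').command({ replSetResizeOplog: 1, size: 16384 });
```

Alerting when the window drops below the 24-hour floor gives time to resize before a secondary falls off the oplog entirely.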
Several less common factors can also contribute to lag:

- Long-running operations or transactions on the primary that block oplog application on secondaries
- Unique indexes on secondaries that require additional validation time during replication
- Chained replication, where secondaries sync from other secondaries rather than directly from the primary
- Inefficient sync source selection causing nodes to replicate from suboptimal sources
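When chained replication or poor sync source selection is the culprit, chaining can be turned off through the replica set configuration. disableChaining below is an illustrative helper wrapped around the real replSetGetConfig/replSetReconfig commands:

```javascript
// Build a new replica set config with chained replication disabled,
// forcing secondaries to sync directly from the primary. A reconfig
// requires the version field to be bumped.
function disableChaining(config) {
  return {
    ...config,
    version: config.version + 1,
    settings: { ...(config.settings || {}), chainingAllowed: false },
  };
}

// Applying it with the Node.js driver (sketch):
//   const { config } = await client.db('admin').command({ replSetGetConfig: 1 });
//   await client.db('admin').command({ replSetReconfig: disableChaining(config) });
```

Note the trade-off: direct syncing removes chaining-induced lag but concentrates replication traffic on the primary, which can worsen the bandwidth cascade described above.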
Node.js applications can both contribute to and be affected by replication lag. Connection pool configurations in Mongoose or the native driver need proper timeout settings to handle replica set failover scenarios. Applications that read from secondaries using read preferences like secondary or secondaryPreferred must tolerate potentially stale data and should implement appropriate error handling for when secondaries fall behind. Monitoring replication lag and setting up alerts allows Node.js applications to degrade gracefully or temporarily route reads to the primary during high-lag periods.
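A minimal sketch of that graceful-degradation idea: route reads by measured lag in application code, or let the driver enforce a staleness bound itself via maxStalenessSeconds. The chooseReadPreference helper and its 30-second threshold are illustrative choices, not driver features:

```javascript
// Route reads to the primary when measured lag exceeds an
// application-chosen threshold; otherwise allow secondary reads.
function chooseReadPreference(lagSeconds, maxLagSeconds = 30) {
  return lagSeconds > maxLagSeconds ? 'primary' : 'secondaryPreferred';
}

// Sketch of use with the Node.js driver, given a lag measurement
// taken from replSetGetStatus:
//   const pref = chooseReadPreference(currentLagSeconds);
//   await db.collection('orders').find({}, { readPreference: pref }).toArray();

// Alternatively, let the driver exclude overly stale secondaries via
// the connection string (the minimum allowed bound is 90 seconds):
//   mongodb://db1,db2/app?readPreference=secondaryPreferred&maxStalenessSeconds=90
```

The application-level check reacts faster than maxStalenessSeconds (which is coarse by design), while the connection-string option needs no monitoring code; the two approaches can be combined.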