MongoDB replication uses a Raft-like consensus protocol to manage leader elections and maintain data consistency across replica sets, ensuring high availability and fault tolerance.
MongoDB's replication protocol is based on the Raft consensus algorithm, which provides the foundation for leader election, log replication, and safety guarantees in distributed systems. The protocol ensures that all nodes in a replica set maintain a consistent view of the data by electing a single primary node (leader) that handles all write operations, while secondary nodes (followers) replicate the primary's oplog to maintain data consistency [citation:3][citation:5]. When the primary becomes unavailable, the remaining secondaries automatically trigger a new election to select a new primary, ensuring continued availability [citation:4][citation:9].
The election process begins when a secondary node detects that the primary is unresponsive, typically after the configurable election timeout (`settings.electionTimeoutMillis`, 10 seconds by default) elapses without a heartbeat. The detecting node increments its term counter, transitions to candidate status, and sends RequestVote messages to all other voting members [citation:4][citation:9]. Each node votes for at most one candidate per term and refuses any candidate whose oplog is less up-to-date than its own, which ensures the new primary holds all majority-committed writes. The candidate that receives votes from a majority of voting members becomes the new primary [citation:4].
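The voting rule above can be sketched as follows. This is a toy model of a Raft-style election round, not MongoDB's actual implementation; the `Node` fields and function names are invented for illustration, and the oplog position is reduced to a single integer:

```python
# Toy sketch of one Raft-style election round. A voter grants at most one
# vote per term, and only to a candidate whose oplog is at least as
# up-to-date as its own.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    term: int                # highest term this node has seen
    last_op: int             # position of the newest oplog entry (simplified)
    voted_in_term: int = -1  # term in which this node last granted a vote

def request_vote(voter: Node, candidate_term: int, candidate_last_op: int) -> bool:
    """Grant the vote only if the term is new to this voter and the
    candidate's oplog is at least as up-to-date as the voter's."""
    if candidate_term < voter.term or voter.voted_in_term >= candidate_term:
        return False
    if candidate_last_op < voter.last_op:
        return False                      # stale candidate: refuse
    voter.term = candidate_term
    voter.voted_in_term = candidate_term
    return True

def run_election(candidate: Node, peers: list[Node]) -> bool:
    candidate.term += 1                   # 1. increment term
    candidate.voted_in_term = candidate.term
    votes = 1                             # 2. vote for itself
    for peer in peers:                    # 3. solicit votes
        if request_vote(peer, candidate.term, candidate.last_op):
            votes += 1
    majority = (len(peers) + 1) // 2 + 1
    return votes >= majority              # 4. majority wins

nodes = [Node("a", term=3, last_op=17), Node("b", term=3, last_op=17),
         Node("c", term=3, last_op=15)]
print(run_election(nodes[0], nodes[1:]))  # node "a" is up to date -> True
```

Running the same election with the stale node `c` as candidate fails, because `a` and `b` refuse a candidate whose oplog trails their own.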
A key safety property of the Raft-based protocol is ElectionSafety, which guarantees that there can never be two primaries in the same term [citation:1]. MongoDB extends this with LeaderCompleteness, ensuring that any new leader contains all log entries committed in earlier terms [citation:1]. These properties are formally verified using TLA+ specifications, which have helped identify and fix subtle bugs in the protocol's design, particularly around dynamic reconfiguration scenarios [citation:1][citation:8].
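To make ElectionSafety concrete, here is a toy property check over a recorded election history, in the spirit of what a TLA+ model checker verifies exhaustively over all reachable states (the history format is invented for the sketch):

```python
# Illustrative check of ElectionSafety: in a history of (term, primary)
# pairs, no term may ever have two distinct primaries.
from collections import defaultdict

def election_safety(history: list[tuple[int, str]]) -> bool:
    """True iff every term elected at most one distinct primary."""
    primaries = defaultdict(set)
    for term, primary in history:
        primaries[term].add(primary)
    return all(len(p) == 1 for p in primaries.values())

print(election_safety([(1, "a"), (2, "b"), (2, "b")]))  # True
print(election_safety([(1, "a"), (1, "b")]))            # False: two primaries in term 1
```

A checker like this only examines one observed history; the value of the TLA+ specifications is that they explore every possible interleaving of messages and failures.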
MongoDB implements several optimizations over standard Raft. It uses a pull-based replication mechanism in which secondaries fetch new oplog entries from the primary, rather than Raft's push-based AppendEntries approach [citation:3]. Replication is asynchronous: the primary applies writes locally without waiting for secondaries, and under a `w: "majority"` write concern it acknowledges a write once a majority of members have replicated it, while secondaries apply operations at their own pace [citation:7]. Configurable write concerns thus let applications balance write latency against durability guarantees [citation:9].
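The pull-based flow and the majority-acknowledgment rule can be illustrated with a minimal sketch. All class and method names here are invented; real MongoDB tracks replication progress via optimes reported in heartbeats, not explicit callbacks:

```python
# Sketch of pull-based replication with a majority acknowledgment rule:
# secondaries fetch oplog entries at their own pace, and a write counts as
# majority-committed once more than half the members have a copy.
class Primary:
    def __init__(self, n_members: int):
        self.oplog: list[str] = []
        self.n_members = n_members
        self.acked: dict[int, int] = {}   # oplog index -> replica count

    def write(self, op: str) -> int:
        self.oplog.append(op)
        idx = len(self.oplog) - 1
        self.acked[idx] = 1               # counts the primary's own copy
        return idx

    def fetch(self, from_index: int) -> list[str]:
        """Pull-based: a secondary asks for entries it doesn't have yet."""
        return self.oplog[from_index:]

    def report_applied(self, idx: int) -> None:
        self.acked[idx] += 1

    def majority_committed(self, idx: int) -> bool:
        return self.acked[idx] > self.n_members // 2

class Secondary:
    def __init__(self):
        self.oplog: list[str] = []

    def sync(self, primary: Primary) -> None:
        for op in primary.fetch(len(self.oplog)):
            idx = len(self.oplog)
            self.oplog.append(op)         # apply asynchronously
            primary.report_applied(idx)

p = Primary(n_members=3)
s1 = Secondary()
idx = p.write("insert {_id: 1}")
print(p.majority_committed(idx))  # False: only the primary has the entry
s1.sync(p)
print(p.majority_committed(idx))  # True: 2 of 3 members have it
```

Note how the primary never pushes anything: commitment advances only as secondaries pull and report progress, which is the inversion of Raft's AppendEntries flow described above.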
For dynamic reconfiguration—adding or removing replica set members—MongoDB implements a specialized logless protocol that extends Raft's single-node reconfiguration approach. This decouples configuration changes from the main operation log, allowing reconfigurations to proceed even when the main log is stalled, and enables quick recovery from node failures [citation:1][citation:8]. The protocol was formally modeled and verified using TLA+ to ensure safety properties like quorum intersection across configuration changes [citation:1][citation:10].
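The quorum-intersection property mentioned above can be checked mechanically for a pair of member sets. This brute-force toy checker is for intuition only; the actual verification in the TLA+ model covers arbitrary sequences of reconfigurations:

```python
# Toy check of quorum intersection across a reconfiguration: every majority
# of the old member set must overlap every majority of the new one.
# Single-node changes (add or remove one member) always satisfy this.
from itertools import combinations

def majorities(members: set[str]) -> list[set[str]]:
    q = len(members) // 2 + 1
    return [set(c) for c in combinations(sorted(members), q)]

def quorums_intersect(old: set[str], new: set[str]) -> bool:
    return all(a & b for a in majorities(old) for b in majorities(new))

# Adding one member keeps all quorums overlapping:
print(quorums_intersect({"a", "b", "c"}, {"a", "b", "c", "d"}))  # True
# Swapping two members at once can produce disjoint majorities:
print(quorums_intersect({"a", "b", "c"}, {"a", "d", "e"}))       # False
```

The second case shows why reconfiguration is restricted to single-member steps: `{b, c}` is a majority of the old set and `{d, e}` a majority of the new one, yet they share no node, so two primaries could be elected concurrently.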
MongoDB's implementation also includes mechanisms to handle edge cases like network partitions and stale reads. Read preferences allow applications to read from secondary nodes, potentially seeing slightly stale data in exchange for reduced load on the primary [citation:9]. The maxStalenessSeconds setting prevents reading from secondaries whose replication lag exceeds the configured bound [citation:9]. During failover, writes on the old primary that were never replicated to a majority are rolled back when it rejoins the replica set, returning the system to a consistent state [citation:3].
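A staleness-bounded server selection can be sketched as a simple filter. This is a deliberate simplification of the drivers' actual algorithm (which estimates staleness from heartbeat timing rather than comparing raw timestamps), and the member dictionary fields are invented for the example:

```python
# Simplified illustration of a maxStalenessSeconds-style filter: only
# secondaries whose estimated lag behind the primary's last write is within
# the bound are eligible to serve reads.
def eligible_secondaries(members: list[dict], max_staleness_s: float) -> list[str]:
    primary_write = max(m["last_write_ts"] for m in members if m["is_primary"])
    return [m["name"] for m in members
            if not m["is_primary"]
            and primary_write - m["last_write_ts"] <= max_staleness_s]

members = [
    {"name": "p",  "is_primary": True,  "last_write_ts": 1000.0},
    {"name": "s1", "is_primary": False, "last_write_ts": 995.0},  # 5 s behind
    {"name": "s2", "is_primary": False, "last_write_ts": 880.0},  # 120 s behind
]
print(eligible_secondaries(members, max_staleness_s=90))  # ['s1']
```

With a 90-second bound, only `s1` qualifies; `s2` is excluded until its replication catches up.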