ZooKeeper server theory of operation

ZooKeeperServer is designed to work in standalone mode and also be extensible so that it can be used to implement the quorum based version of ZooKeeper.

ZooKeeper maintains an order when processing requests:

All requests will be processed in order.
All responses will return in order.
All watches will be sent in the order that the update takes place.

We will explain the three aspects of ZooKeeperServer: request processing, data structure maintenance, and session tracking.

Request processing

Requests are received by the ServerCnxn. Demarshalling of a request is done by ClientRequestHandler. After a request has been demarshalled, ClientRequestHandler invokes the relevant method in ZooKeeper and marshals the result.

If the request is just a query, it will be processed by ZooKeeper and returned. Otherwise, the request will be validated and a transaction will be generated and logged. The request will then wait until the transaction has been logged before processing continues.

Requests are logged as a group. Transactions are queued up and the SyncThread processes them at predefined intervals (currently 20ms). The SyncThread interacts with ZooKeeperServer through the txnQueue. Transactions are added to the txnQueue of SyncThread via queueItem. When a transaction has been synced to disk, its callback is invoked, which causes the request processing to be completed.
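The group logging described above can be sketched roughly as follows. This is a minimal illustration, not the actual ZooKeeper code: the class GroupCommitSketch, the Item type, and the syncOnce and syncCount methods are hypothetical names, and the disk write is replaced by a counter. Only the names txnQueue and queueItem come from the text. A real server would presumably call syncOnce from a dedicated thread at each interval.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of group commit; not the actual ZooKeeper classes. */
class GroupCommitSketch {
    /** A queued transaction plus the callback that completes its request. */
    static final class Item {
        final long zxid;
        final Runnable onSynced;
        Item(long zxid, Runnable onSynced) {
            this.zxid = zxid;
            this.onSynced = onSynced;
        }
    }

    private final Queue<Item> txnQueue = new ArrayDeque<>();
    private int syncCount = 0;

    /** Request path: enqueue the transaction; the request completes in its callback. */
    synchronized void queueItem(long zxid, Runnable onSynced) {
        txnQueue.add(new Item(zxid, onSynced));
    }

    /** One pass of the sync loop: drain the queue, sync once, then run callbacks. */
    int syncOnce() {
        List<Item> batch = new ArrayList<>();
        synchronized (this) {
            while (!txnQueue.isEmpty()) {
                batch.add(txnQueue.poll());
            }
        }
        if (batch.isEmpty()) {
            return 0;
        }
        syncCount++; // stands in for a single write + flush covering the whole batch
        for (Item item : batch) {
            item.onSynced.run(); // callbacks run in queue order
        }
        return batch.size();
    }

    int syncCount() {
        return syncCount;
    }

    public static void main(String[] args) {
        GroupCommitSketch sync = new GroupCommitSketch();
        List<Long> completed = new ArrayList<>();
        for (long zxid = 1; zxid <= 3; zxid++) {
            final long id = zxid;
            sync.queueItem(id, () -> completed.add(id));
        }
        sync.syncOnce();
        System.out.println("completed=" + completed + " syncs=" + sync.syncCount());
    }
}
```

In this sketch three queued transactions are covered by one simulated sync, and their callbacks fire in the order the transactions were queued.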

Data structure maintenance

ZooKeeper data is stored in-memory. Each znode is stored in a DataNode object. This object is accessed through a hash table that maps paths to DataNodes. DataNodes also organize themselves into a tree. This tree is only used for serializing nodes.
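As a rough illustration of the layout just described, the sketch below keeps a path-to-node hash table for lookups and lets each node record its children so the tree can be walked for serialization. The names DataTreeSketch, Node, create, getData, and walk are hypothetical and chosen for this example; the real DataNode and DataTree classes carry more state (for example ACLs and stat information).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: hash table for lookups, child links for tree walks. */
class DataTreeSketch {
    static final class Node {
        byte[] data;
        final Set<String> children = new TreeSet<>(); // child names, sorted
        Node(byte[] data) {
            this.data = data;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    DataTreeSketch() {
        nodes.put("/", new Node(new byte[0])); // root always exists
    }

    /** Creates a node; returns false if the path is invalid, exists, or has no parent. */
    boolean create(String path, byte[] data) {
        if (!path.startsWith("/")) {
            return false;
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash == 0 ? "/" : path.substring(0, slash);
        String name = path.substring(slash + 1);
        Node parent = nodes.get(parentPath);
        if (parent == null || name.isEmpty() || nodes.containsKey(path)) {
            return false;
        }
        nodes.put(path, new Node(data));
        parent.children.add(name);
        return true;
    }

    /** Direct hash lookup by full path; no tree traversal needed. */
    byte[] getData(String path) {
        Node n = nodes.get(path);
        return n == null ? null : n.data;
    }

    /** Depth-first walk over child links, the order a serializer might use. */
    void walk(String path, List<String> out) {
        Node n = nodes.get(path);
        out.add(path);
        for (String child : n.children) {
            walk(path.equals("/") ? "/" + child : path + "/" + child, out);
        }
    }

    public static void main(String[] args) {
        DataTreeSketch tree = new DataTreeSketch();
        tree.create("/a", new byte[] {1});
        tree.create("/a/b", new byte[] {2});
        tree.create("/c", new byte[] {3});
        List<String> order = new ArrayList<>();
        tree.walk("/", order);
        System.out.println(order);
    }
}
```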

We guarantee that changes to nodes are stored to non-volatile media before responding to a client. We do this quickly by writing changes as a sequence of transactions in a log file. Even though we flush transactions as a group, we need to avoid seeks as much as possible. Also, since the server can fail at any point, we need to be careful of partial records.

We address the above problems by

Pre-allocating 1M chunks of file space. This allows us to append to the file without causing seeks to update file size. It also means that we need to check for the end of the log by looking for a zero length transaction rather than simply end of file.
Writing a signature at the end of each transaction. When processing transactions, we only use transactions that have a valid signature at the end.
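The two techniques above can be sketched with an in-memory buffer standing in for the pre-allocated file. The record layout used here (a 4-byte length, the payload, then a one-byte end-of-record marker) and the marker value are assumptions made for this example, as are the names TxnLogSketch, append, and readAll; the actual on-disk format may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of reading a zero-padded log with per-record signatures. */
class TxnLogSketch {
    static final byte END_OF_RECORD = 0x42; // assumed signature byte for this sketch

    /** Appends one record: length, payload, signature. */
    static void append(ByteBuffer log, String txn) {
        byte[] payload = txn.getBytes(StandardCharsets.UTF_8);
        if (payload.length == 0) {
            throw new IllegalArgumentException("zero length is reserved for end of log");
        }
        log.putInt(payload.length).put(payload).put(END_OF_RECORD);
    }

    /** Reads records from the start until a zero length or an invalid record. */
    static List<String> readAll(ByteBuffer log) {
        ByteBuffer in = log.duplicate();
        in.clear(); // read from position 0 up to the full pre-allocated capacity
        List<String> txns = new ArrayList<>();
        while (in.remaining() >= Integer.BYTES) {
            int len = in.getInt();
            if (len <= 0 || in.remaining() - 1 < len) {
                break; // zero-length record marks the end of the log
            }
            byte[] payload = new byte[len];
            in.get(payload);
            if (in.get() != END_OF_RECORD) {
                break; // missing signature: partial record, ignore it and stop
            }
            txns.add(new String(payload, StandardCharsets.UTF_8));
        }
        return txns;
    }

    public static void main(String[] args) {
        ByteBuffer log = ByteBuffer.allocate(1024); // zero-filled, like pre-allocated space
        append(log, "create /a");
        append(log, "setData /a");
        // Simulate a crash mid-write: length says 5 bytes, only 3 were written.
        log.putInt(5).put("del".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(log));
    }
}
```

Because the unused space is zero-filled, the reader in this sketch never relies on end of file: it stops at the first zero length, and it discards the torn third record because the byte where the signature should be is still zero.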

As the server runs, the log file will grow quite large. To avoid long startup times we periodically take a snapshot of the tree of DataNodes. We cannot take the snapshot synchronously as the data takes a while to write out, so instead we asynchronously write out the tree. This means that we end up with a "corrupt" snapshot of the data tree.

More formally, if we define T to be the real snapshot of the tree at the time we begin taking the snapshot and l as the sequence of transactions that are applied to the tree between the time the snapshot begins and the time the snapshot completes, we write to disk T+l' where l' is a subset of the transactions in l. While we do not have a way of figuring out which transactions make up l', it doesn't really matter: T+l'+l = T+l since the transactions we log are idempotent (applying the transaction multiple times has the same result as applying the transaction once).

So when we restore the snapshot we also play all transactions in the log that occur after the snapshot was begun. We can easily figure out where to start the replay because we start a new log file when we start a snapshot. Both the snapshot file and the log file have a numeric suffix that represents the transaction id that created the respective file.
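The identity T+l'+l = T+l can be checked on a toy model. The sketch below is an assumption-laden simplification: the tree is a flat map from path to value, and each transaction either sets a path to a fixed value or deletes it, which is what makes re-application harmless here. The names FuzzySnapshotSketch, Txn, and replay are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: replaying the full log over a fuzzy snapshot. */
class FuzzySnapshotSketch {
    /** An idempotent transaction: set a path to a value, or delete it. */
    static final class Txn {
        final String path;
        final String value; // null means "delete this path"
        Txn(String path, String value) {
            this.path = path;
            this.value = value;
        }
        void applyTo(Map<String, String> tree) {
            if (value == null) {
                tree.remove(path);
            } else {
                tree.put(path, value);
            }
        }
    }

    /** Returns a copy of start with every transaction in log applied in order. */
    static Map<String, String> replay(Map<String, String> start, List<Txn> log) {
        Map<String, String> tree = new HashMap<>(start);
        for (Txn t : log) {
            t.applyTo(tree);
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, String> t = new HashMap<>();
        t.put("/a", "1");
        t.put("/b", "1");
        List<Txn> l = Arrays.asList(
                new Txn("/a", "2"), new Txn("/b", null), new Txn("/c", "3"));
        List<Txn> lPrime = Arrays.asList(l.get(1)); // the subset the snapshot happened to see

        Map<String, String> expected = replay(t, l);      // T + l
        Map<String, String> fuzzy = replay(t, lPrime);    // T + l' (what is on disk)
        Map<String, String> restored = replay(fuzzy, l);  // T + l' + l
        System.out.println(restored.equals(expected));    // prints true
    }
}
```

A transaction such as "increment the counter" would break this property, since replaying it a second time changes the result; the argument depends on transactions recording outcomes rather than relative changes.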

Session tracking

Rather than tracking sessions exactly, we track them in batches that are processed at fixed intervals. This is easier to implement than exact session tracking and is more efficient. It also provides a small grace period for session renewal.
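One way to picture batched tracking is to round each session's deadline up to the next interval boundary and keep one bucket of session ids per boundary. The sketch below is an illustration under that assumption; SessionBucketSketch, touch, and expire are hypothetical names, not the SessionTracker interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of tracking session expiry in fixed-interval buckets. */
class SessionBucketSketch {
    private final long intervalMs;
    private final Map<Long, Set<Long>> buckets = new HashMap<>(); // boundary -> session ids
    private final Map<Long, Long> bucketOf = new HashMap<>();     // session id -> boundary

    SessionBucketSketch(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Rounds a time up to the next multiple of the interval. */
    long roundToNextInterval(long timeMs) {
        return (timeMs / intervalMs + 1) * intervalMs;
    }

    /** Records activity: moves the session to the bucket for its new deadline. */
    void touch(long sessionId, long nowMs, long timeoutMs) {
        long newBucket = roundToNextInterval(nowMs + timeoutMs);
        Long old = bucketOf.get(sessionId);
        if (old != null && old >= newBucket) {
            return; // already in this bucket or a later one
        }
        bucketOf.put(sessionId, newBucket);
        if (old != null) {
            Set<Long> previous = buckets.get(old);
            previous.remove(sessionId);
            if (previous.isEmpty()) {
                buckets.remove(old);
            }
        }
        buckets.computeIfAbsent(newBucket, k -> new HashSet<>()).add(sessionId);
    }

    /** Expires the whole batch for one boundary in a single step. */
    Set<Long> expire(long boundaryMs) {
        Set<Long> expired = buckets.remove(boundaryMs);
        if (expired == null) {
            return new HashSet<>();
        }
        for (Long id : expired) {
            bucketOf.remove(id);
        }
        return expired;
    }

    public static void main(String[] args) {
        SessionBucketSketch tracker = new SessionBucketSketch(2000);
        tracker.touch(1L, 500, 3000);  // deadline 3500, bucket 4000
        tracker.touch(2L, 900, 3000);  // deadline 3900, bucket 4000
        tracker.touch(1L, 2500, 3000); // renewed: deadline 5500, bucket 6000
        System.out.println(tracker.expire(4000)); // only session 2 remains in this bucket
        System.out.println(tracker.expire(6000));
    }
}
```

The rounding is where the grace period comes from in this sketch: a session whose exact deadline is 3500 is not expired until the 4000 boundary is processed.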

Interface Summary
Interface	Description
ConnectionMXBean	This MBean represents a client connection.
DataTreeMXBean	ZooKeeper data tree MBean.
NodeHashMap	The interface defined to manage the hash based on the entries in the nodes map.
RequestProcessor	RequestProcessors are chained together to process transactions.
RequestRecord
ServerStats.Provider
SessionTracker	This is the basic interface that ZooKeeperServer uses to track sessions.
SessionTracker.Session
SessionTracker.SessionExpirer
ZooKeeperServerListener	Listener for the critical resource events.
ZooKeeperServerMXBean	ZooKeeper server MBean.

Class Summary
Class	Description
AuthenticationHelper	Contains helper methods to enforce authentication
BlueThrottle	Implements a token-bucket based rate limiting mechanism with optional probabilistic dropping inspired by the BLUE queue management algorithm [1].
ByteBufferInputStream
ByteBufferOutputStream
ByteBufferRequestRecord
ConnectionBean	Implementation of connection MBean interface.
ContainerManager	Manages cleanup of container ZNodes.
DatadirCleanupManager	This class manages the cleanup of snapshots and corresponding transaction logs by scheduling the auto purge task with the specified 'autopurge.purgeInterval'.
DataNode	This class contains the data for a node in the data tree.
DataTree	This class maintains the tree data structure.
DataTree.ProcessTxnResult
DataTreeBean	This class implements the data tree MBean.
DigestCalculator	Defines how to calculate the digest for a given node.
DumbWatcher	An empty watcher implementation used in bench and unit test.
ExpiryQueue<E>	ExpiryQueue tracks elements in time sorted fixed duration buckets.
FinalRequestProcessor	This Request processor actually applies any transaction associated with a request and services any queries.
NettyServerCnxn
NettyServerCnxnFactory
NIOServerCnxn	This class handles communication with clients using NIO.
NIOServerCnxnFactory	NIOServerCnxnFactory implements a multi-threaded ServerCnxnFactory using NIO non-blocking socket calls.
NodeHashMapImpl	A simple wrapper to ConcurrentHashMap that recalculates a digest after each mutation.
ObserverBean	ObserverBean
PrepRequestProcessor	This request processor is generally at the start of a RequestProcessor chain.
PurgeTxnLog	This class is used to clean up the snapshot and data log dirs.
RateLogger	This logs the message once in the beginning and once every LOG_INTERVAL.
ReferenceCountedACLCache
Request	This is the structure that represents a request moving through a chain of RequestProcessors.
RequestThrottler	When enabled, the RequestThrottler limits the number of outstanding requests currently submitted to the request processor pipeline.
ResponseCache
ServerCnxn	Interface to a Server connection - represents a connection from a client to the server.
ServerCnxnFactory
ServerCnxnHelper
ServerConfig	Server configuration storage.
ServerMetrics
ServerStats	Basic Server Statistics
SessionTrackerImpl	This is a full featured SessionTracker.
SessionTrackerImpl.SessionImpl
SimpleRequestRecord
SnapshotComparer	SnapshotComparer is a tool that loads and compares two snapshots with configurable threshold and various filters, and outputs information about the delta.
SnapshotFormatter	Dump a snapshot file to stdout.
SnapshotRecursiveSummary	Recursively processes a snapshot file collecting child node count and summarizes the data size below each node.
SyncRequestProcessor	This RequestProcessor logs requests to disk.
TraceFormatter
TxnLogEntry	A helper class to represent the txn entry.
TxnLogProposalIterator	This class provides an iterator interface to access Proposal deserialized from on-disk txnlog.
UnimplementedRequestProcessor	Manages the unknown requests (i.e.
WorkerService	WorkerService is a worker thread pool for running tasks and is implemented using one or more ExecutorServices.
WorkerService.WorkRequest	Callers should implement a class extending WorkRequest in order to schedule work with the service.
ZKDatabase	This class maintains the in memory database of zookeeper server states that includes the sessions, datatree and the committed logs.
ZooKeeperCriticalThread	Represents critical thread.
ZooKeeperSaslServer
ZooKeeperServer	This class implements a simple standalone ZooKeeperServer.
ZooKeeperServerBean	This class implements the ZooKeeper server MBean interface.
ZooKeeperServerConf	Configuration data for a `ZooKeeperServer`.
ZooKeeperServerMain	This class starts and runs a standalone ZooKeeperServer.
ZooKeeperServerShutdownHandler	ZooKeeper server shutdown handler which will be used to handle ERROR or SHUTDOWN server state transitions, which in turn releases the associated shutdown latch.
ZooKeeperThread	This is the main class for catching all the uncaught exceptions thrown by the threads.
ZooTrace	This class encapsulates and centralizes tracing for the ZooKeeper server.

Enum Summary
Enum	Description
DatadirCleanupManager.PurgeTaskStatus	Status of the dataDir purge task
EphemeralType	Abstraction that interprets the `ephemeralOwner` field of a ZNode.
EphemeralTypeEmulate353	See https://issues.apache.org/jira/browse/ZOOKEEPER-2901 version 3.5.3 introduced bugs associated with how TTL nodes were implemented.
ExitCode	Exit code used to exit server
NettyServerCnxn.HandshakeState
PrepRequestProcessor.DigestOpCode
ServerCnxn.DisconnectReason
ZooKeeperServer.State

Exception Summary
Exception	Description
ClientCnxnLimitException	Indicates that the number of client connections has exceeded some limit.
RequestProcessor.RequestProcessorException
ServerCnxn.CloseRequestException
ServerCnxn.EndOfStreamException
ZooKeeperServer.MissingSessionException

Package org.apache.zookeeper.server
