ZooKeeper Monitor Guide
New Metrics System
The feature:New Metrics System
has been available since 3.6.0 which provides the abundant metrics to help users monitor the ZooKeeper on the topic: znode, network, disk, quorum, leader election, client, security, failures, watch/session, requestProcessor, and so forth.
Metrics
All the metrics are included in the ServerMetrics.java
.
Prometheus
- Running a Prometheus monitoring service is the easiest way to ingest and record ZooKeeper's metrics.
- Pre-requisites:
- enable the
Prometheus MetricsProvider
by settingmetricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
in the zoo.cfg. - the Port is also configurable by setting
metricsProvider.httpPort
(the default value:7000) - Install Prometheus: Go to the official website download page, download the latest release.
-
Set Prometheus's scraper to target the ZooKeeper cluster endpoints:
cat > /tmp/test-zk.yaml <<EOF global: scrape_interval: 10s scrape_configs: - job_name: test-zk static_configs: - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000'] EOF cat /tmp/test-zk.yaml
-
Set up the Prometheus handler:
nohup /tmp/prometheus \ --config.file /tmp/test-zk.yaml \ --web.listen-address ":9090" \ --storage.tsdb.path "/tmp/test-zk.data" >> /tmp/test-zk.log 2>&1 &
-
Now Prometheus will scrape zk metrics every 10 seconds.
Alerting with Prometheus
-
We recommend that you read Prometheus Official Alerting Page to explore some principles of alerting
-
We recommend that you use Prometheus Alertmanager which can help users to receive alerting email or instant message(by webhook) in a more convenient way
-
We provide an alerting example where these metrics should be taken a special attention. Note: this is for your reference only, and you need to adjust them according to your actual situation and resource environment
use ./promtool check rules rules/zk.yml to check the correctness of the config file cat rules/zk.yml groups: - name: zk-alert-example rules: - alert: ZooKeeper server is down expr: up == 0 for: 1m labels: severity: critical annotations: summary: "Instance {{ $labels.instance }} ZooKeeper server is down" description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]." - alert: create too many znodes expr: znode_count > 1000000 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} create too many znodes" description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]." - alert: create too many connections expr: num_alive_connections > 50 # suppose we use the default maxClientCnxns: 60 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} create too many connections" description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]." - alert: znode total occupied memory is too big expr: approximate_data_size /1024 /1024 > 1 * 1024 # more than 1024 MB(1 GB) for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} znode total occupied memory is too big" description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB." - alert: set too many watch expr: watch_count > 10000 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} set too many watch" description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]." - alert: a leader election happens expr: increase(election_time_count[5m]) > 0 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} a leader election happens" description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]." - alert: open too many files expr: open_file_descriptor_count > 300 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} open too many files" description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]." - alert: fsync time is too long expr: rate(fsynctime_sum[1m]) > 100 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} fsync time is too long" description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]." - alert: take snapshot time is too long expr: rate(snapshottime_sum[5m]) > 100 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} take snapshot time is too long" description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]." - alert: avg latency is too high expr: avg_latency > 100 for: 1m labels: severity: warning annotations: summary: "Instance {{ $labels.instance }} avg latency is too high" description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]." - alert: JvmMemoryFillingUp expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8 for: 5m labels: severity: warning annotations: summary: "JVM memory filling up (instance {{ $labels.instance }})" description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }} value = {{ $value }}\n"
Grafana
- Grafana has built-in Prometheus support; just add a Prometheus data source:
Name: test-zk Type: Prometheus Url: http://localhost:9090 Access: proxy
- Then download and import the default ZooKeeper dashboard template and customize.
- Users can ask for Grafana dashboard account if having any good improvements by writing a email to dev@zookeeper.apache.org.
InfluxDB
InfluxDB is an open source time series data that is often used to store metrics from Zookeeper. You can download the open source version or create a free account on InfluxDB Cloud. In either case, configure the Apache Zookeeper Telegraf plugin to start collecting and storing metrics from your Zookeeper clusters into your InfluxDB instance. There is also an Apache Zookeeper InfluxDB template that includes the Telegraf configurations and a dashboard to get you set up right away.
JMX
More details can be found in here
Four letter words
More details can be found in here