Network Heartbeat Configuration

Aerospike's heartbeat protocols are responsible for maintaining cluster integrity. There are two supported heartbeat modes:

  • Multicast (UDP)
  • Mesh (TCP)

Cloud Considerations

  1. Lack of Multicast Support: Cloud providers such as Amazon and Google Compute Engine do not support multicast networking. For these providers, we offer mesh heartbeats, which use point-to-point TCP connections.

  2. Network Variability: Network latency on cloud platforms is often inconsistent over time, which can delay heartbeat packet delivery. For these providers, we recommend setting the heartbeat interval to 150 (milliseconds) and the heartbeat timeout to 20 (intervals).

  3. Instance Pauses: Your cloud provider may occasionally pause an instance for a short time. For example, Google Compute Engine (GCE) uses live migration, which can briefly pause an instance during maintenance or software updates. Such pauses might cause the other instances in the cluster to consider the paused instance "dead". We recommend upgrading to server version 3.13 or later to help your cluster recover quickly after a network disruption or cluster change. Refer to paxos-recovery-policy. This policy was introduced in Aerospike server version 3.7.0.1, but it requires explicit configuration to auto-reset-master until version 3.8.1.
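
The effect of these interval and timeout values can be worked out with simple arithmetic. The Python sketch below assumes, as the configuration examples on this page state, that the interval is in milliseconds and the timeout is a count of intervals. The function name is hypothetical and is not part of any Aerospike tool.

```python
def detection_window_ms(interval_ms, timeout_intervals):
    """Approximate time (ms) before a silent node is considered missing.

    Assumes the window is simply interval x timeout, as described on this page.
    """
    return interval_ms * timeout_intervals


# Settings used in the configuration examples on this page
print(detection_window_ms(150, 10))  # 1500 ms, i.e. 1.5 seconds
# Cloud recommendation from item 2 above
print(detection_window_ms(150, 20))  # 3000 ms, i.e. 3 seconds
```

Under these assumptions, the larger timeout trades slower failure detection for more tolerance of latency spikes and short instance pauses.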

Multicast Heartbeat

We recommend using the multicast heartbeat protocol when available. For various reasons your network may not support multicast. See our troubleshooting guide for information on how to validate multicast in your environment.

note

See Upgrade-Network to 3.10 for more details if upgrading from versions prior to 3.10.

Configuration Steps

In the heartbeat sub-stanza:

  1. Set mode to multicast.
  2. Set multicast-group to a valid multicast address (239.0.0.0-239.255.255.255).
  3. (Optional) Set address to the IP of the interface intended for intra-cluster communication. This setting also controls the interface that fabric will use. It is needed when isolating intra-cluster traffic to a particular network interface.
  4. Set interval and timeout:
    • interval (recommended: 150) controls how often, in milliseconds, a heartbeat packet is sent.
    • timeout (recommended: 10) controls the number of intervals after which the rest of the nodes in the cluster consider a node missing if they have not received a heartbeat from it.
    • With these settings (150 ms × 10 intervals), a node will be aware of another node leaving the cluster within 1.5 seconds.
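
Step 2 above requires multicast-group to fall within 239.0.0.0-239.255.255.255. As a rough pre-deployment check, the range test can be sketched in Python with the standard ipaddress module. The function name is hypothetical; this is only an illustration of the range check, not something the Aerospike server exposes.

```python
import ipaddress

# 239.0.0.0/8 covers exactly 239.0.0.0 through 239.255.255.255
ALLOWED_RANGE = ipaddress.ip_network("239.0.0.0/8")


def is_valid_multicast_group(addr):
    """Return True if addr parses as an IP address inside 239.0.0.0/8."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a syntactically valid IP address
        return False
    return ip in ALLOWED_RANGE


print(is_valid_multicast_group("239.1.99.2"))     # True
print(is_valid_multicast_group("192.168.1.100"))  # False
```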

Example

...
heartbeat {
    mode multicast              # Send heartbeats using Multicast
    multicast-group 239.1.99.2  # multicast address
    port 9918                   # multicast port
    address 192.168.1.100       # (Optional) (Default: any) IP of the NIC to
                                # use to send out heartbeat and bind
                                # fabric ports
    interval 150                # Number of milliseconds between heartbeats
    timeout 10                  # Number of heartbeat intervals to wait
                                # before timing out a node
}
...

Mesh (Unicast) Heartbeat

Mesh uses point-to-point TCP connections for heartbeats. Each node in the cluster maintains a heartbeat connection to every other node, so the number of connections grows quickly with cluster size. For this reason, we recommend using the multicast heartbeat protocol when available.
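
To get a feel for how connection counts scale, the following Python sketch counts peers per node and distinct node pairs in a full mesh. It is simple combinatorics under the assumption that every node talks to every other node; how many TCP sockets the server actually opens per pair is not stated here, and the function name is hypothetical.

```python
def mesh_peer_links(node_count):
    """Return (peers per node, distinct node pairs) for a full mesh."""
    per_node = max(node_count - 1, 0)
    # Each pair of nodes is counted once: n * (n - 1) / 2
    total_pairs = node_count * per_node // 2
    return per_node, total_pairs


for n in (4, 8, 32):
    print(n, mesh_peer_links(n))
```

For example, 4 nodes give 3 peers per node and 6 pairs, while 32 nodes give 31 peers per node and 496 pairs.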

note

See Upgrade-Network to 3.10 for more details if upgrading from versions prior to 3.10.

Configuration Steps

In the heartbeat sub-stanza:

  1. Set mode to mesh.
  2. (Optional) Set address to the IP of the local interface intended for intra-cluster communication. This setting also controls the interface that fabric will use. It is needed when isolating intra-cluster traffic to a particular network interface.
  3. Set mesh-seed-address-port to the IP address (or fully qualified DNS name, as of version 3.10) and heartbeat port of a node in the cluster.
  4. Set interval and timeout:
    • interval (recommended: 150) controls how often, in milliseconds, a heartbeat packet is sent.
    • timeout (recommended: 10) controls the number of intervals after which the rest of the nodes in the cluster consider a node missing if they have not received a heartbeat from it.
    • With the recommended settings (150 ms × 10 intervals), a node will be aware of another node leaving the cluster within 1.5 seconds.

caution

When using fully qualified names in versions 4.3.1 and earlier, names that fail to resolve in DNS could cause clusters to split if the DNS server slows down and name resolution takes longer to fail. A successful DNS resolution replaces the name with the IP address until the subsequent restart.

Example

...
heartbeat {
    mode mesh                                  # Send heartbeats using Mesh (Unicast) protocol
    address 192.168.1.100                      # (Optional) (Default: any) IP of the NIC on
                                               # which this node is listening to heartbeat
    port 3002                                  # port on which this node is listening to
                                               # heartbeat
    mesh-seed-address-port 192.168.1.100 3002  # IP address for seed node in the cluster
                                               # This IP happens to be the local node
    mesh-seed-address-port 192.168.1.101 3002  # IP address for seed node in the cluster
    mesh-seed-address-port 192.168.1.102 3002  # IP address for seed node in the cluster
    mesh-seed-address-port 192.168.1.103 3002  # IP address for seed node in the cluster

    interval 150                               # Number of milliseconds between heartbeats
    timeout 10                                 # Number of heartbeat intervals to wait before
                                               # timing out a node
}
...
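
When the seed list is long, keeping the stanza consistent across nodes can be easier if it is generated. The Python sketch below renders a mesh heartbeat stanza from a list of seed hosts, using only the directives shown in the example above. The function name and its defaults are hypothetical, and it assumes every seed uses the same heartbeat port.

```python
def render_mesh_heartbeat(seeds, port=3002, interval=150, timeout=10, address=None):
    """Render a mesh heartbeat stanza as text (illustrative helper only)."""
    lines = ["heartbeat {", "    mode mesh"]
    if address is not None:
        # Optional: bind heartbeat (and fabric) to a specific interface
        lines.append(f"    address {address}")
    lines.append(f"    port {port}")
    for host in seeds:
        # One seed line per host; assumes all seeds share the same port
        lines.append(f"    mesh-seed-address-port {host} {port}")
    lines.append(f"    interval {interval}")
    lines.append(f"    timeout {timeout}")
    lines.append("}")
    return "\n".join(lines)


print(render_mesh_heartbeat(
    ["192.168.1.101", "192.168.1.102", "192.168.1.103"],
    address="192.168.1.100",
))
```

The output is meant to be pasted into the network section of the configuration file and reviewed by hand before use.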

Where to Next?