Delaying "Fill" migrations
When a node is removed from a cluster, other nodes will stand in to hold replicas (masters or proles) that the missing node held.
The stand-in replicas will be filled using migrations from the other replica(s) remaining in the cluster. Write transactions will also replicate to the stand-in replicas. While these write replications are essential to maintaining consistency, filling the stand-in replicas via migration is not.
The only reason to do such "fill" migrations is to prepare the cluster for a (long-term) future in which the missing node remains out of the cluster, and the stand-in replicas become "normal" replicas. If the missing node returns to the cluster, the stand-in replicas will ultimately be dropped.
Therefore, in situations where a node will only be out of the cluster for a short time, such as during rolling upgrades, it may be better not to do these "fill" migrations at all -- avoiding them saves background processing, device I/O (when data is not in memory), and data movement. It may also save temporary memory and storage consumption, which can otherwise be hazardous in certain scenarios, such as rack-aware clusters with very few racks.
The (dynamic) configuration item `migrate-fill-delay` (Enterprise Edition only) can be used to suppress "fill" migrations.
Defining "Fill" migrations
A "fill" migration is one which is filling a node that is not normally a replica.
In SC (`strong-consistency`) enabled namespaces, this is easy to identify -- if the node is not a "roster replica" for the partition in question, any migration to that node is deemed a "fill" migration. The roster allows us to know which nodes are intended to be in the cluster, and (per partition) which nodes hold the "normal" replicas.
In AP (non-SC) clusters, we have no such guidance as to which nodes are supposed to be in the cluster. Instead, we use the heuristic that a migration to any node that is empty (or more exactly, started out empty and has not yet been filled) is a "fill" migration.
This means that there are ordinary scenarios in AP clusters where "fill" migrations should not be suppressed. For example, if a new empty node is added and becomes a replica for a given partition, we would want the node to be filled.
The same situation in an SC cluster would be fine with "fill" migrations suppressed: the new node would have been added to the roster and identified as a "roster replica", so the migration would not be deemed a "fill" migration, even though it fills the new node.
Operators must be aware of which situation applies in order to decide how to configure "fill" migrations.
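For an SC namespace, the roster can be inspected to see which nodes are the intended "roster replicas". A minimal sketch using asinfo (the namespace name is illustrative):

```bash
# Show the current roster, pending roster, and observed nodes for an SC namespace:
asinfo -v 'roster:namespace=memory-sc'
```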
Configuring `migrate-fill-delay`
`migrate-fill-delay` is a dynamic service context configuration item with units of seconds. It can also be specified with an 'm', 'h', or 'd' suffix for minutes, hours, or days -- for example, `1h` instead of `3600`. The default is 0, meaning "fill" migrations are not delayed. A non-zero value does the obvious -- it delays the start of "fill" migrations for the specified amount of time, measured from the rebalance that scheduled them.
For versions 4.5.0.2 and earlier, using time units ('m', 'h' or 'd') does not work when setting this configuration parameter dynamically.
Changing the configuration value takes effect within a second. For example, if it had been set to 5 minutes, and only one minute has elapsed since rebalance, changing it to 0 (or anything less than a minute) will cause the delayed "fill" migrations to start right away.
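Since the item is dynamic, it can be changed at runtime through the info protocol. A minimal sketch using asinfo (the 1-hour value is illustrative):

```bash
# Delay "fill" migrations by 1 hour, measured from each rebalance:
asinfo -v 'set-config:context=service;migrate-fill-delay=3600'

# Verify the currently configured value:
asinfo -v 'get-config:context=service' | tr ';' '\n' | grep migrate-fill-delay
```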
Possible Scenario
For a cluster where nodes can be upgraded and fast restarted in approximately 2 minutes, configure `migrate-fill-delay` to be 5 minutes.
Rolling upgrades should then see no fill migrations at all -- when a node is removed, the fill migrations will not start, and the node should be back before they would have started. All migrations on the node's return will be "delta" migrations -- not "fill" migrations -- and will proceed immediately.
For an SC cluster, one may consider leaving `migrate-fill-delay` configured this way permanently, changing it only in an unusual scenario where "fill" migrations are desired immediately. Even if a node goes down unexpectedly, stand-in nodes will still be filled after 5 minutes; then, if the failure is later diagnosed and the roster is changed to permanently remove the downed node, a fully filled new roster replica is already in place.
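To keep the setting in place across restarts, it can also be set statically. A hypothetical aerospike.conf fragment (the 5-minute value matches the rolling-upgrade example above):

```
service {
    # Delay "fill" migrations for 5 minutes after each rebalance
    migrate-fill-delay 5m
}
```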
For a full cluster restart, where hosts will be rebooted and cold restarted in under 1 hour, configure `migrate-fill-delay` to be 1 hour. The `migrate-fill-delay` timer is reset on any cluster change (i.e. whenever `cluster_size` changes). Therefore, as long as at least 1 node re-joins the cluster every hour, "fill" migrations will not happen. The `cluster-stable` command can be used to check that nodes are re-joining the cluster, and `migrate-fill-delay` can be dynamically updated if some nodes take longer to restart.
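A sketch of such a check, assuming an expected cluster size of 5 (the size is illustrative):

```bash
# Returns the cluster key if exactly 5 nodes form a stable cluster;
# otherwise returns an error. Migrations are ignored here since "fill"
# migrations are deliberately delayed:
asinfo -v 'cluster-stable:size=5;ignore-migrations=yes'
```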
The `migrate-fill-delay` is observed by the nodes that would be sending out data to fill other nodes. Let's consider the following scenario:
- 2-rack cluster: racks A and B.
- `migrate-fill-delay` kept at the default of 0.
- 3 namespaces:
  - Namespace `persisted` with a persisted storage configuration (data preserved across restarts), running in AP mode.
  - Namespace `memory` that is not persisted (`storage-engine memory`), running in AP mode.
  - Namespace `memory-sc` that is not persisted (same as above, in memory only) but running in SC mode (`strong-consistency` enabled). Note: such a namespace configuration is for illustration purposes only and not likely for production use.
- `replication-factor` set to 2 on all three namespaces.
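A simplified sketch of the namespace stanzas assumed above (the device path is illustrative; sizing and rack-id settings are omitted for brevity):

```
namespace persisted {
    replication-factor 2
    storage-engine device {
        device /dev/sdb            # data survives a restart
    }
}

namespace memory {
    replication-factor 2
    storage-engine memory          # data lost on restart
}

namespace memory-sc {
    replication-factor 2
    strong-consistency true        # SC mode; requires a roster
    storage-engine memory          # data lost on restart
}
```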
For a planned maintenance, the following steps can be considered:
- Dynamically set `migrate-fill-delay` to 3600 seconds (or enough time to perform maintenance on a rack).
- Quiesce rack A, wait for transactions to stop going through the nodes in that rack, and proceed to shut it down (see the command sketch after this list).
- At this point, rack B is taking all the traffic but will not replicate existing data until 1 hour has elapsed. Incoming write transactions still replicate within rack B.
- Bring rack A back online. At this point, the nodes in rack A have had their `migrate-fill-delay` reset to the default from the configuration file (0 in this case), but "fill" migrations will not start for another 1 hour. Assuming namespace `persisted` had its storage preserved during maintenance:
  - Delta migrations will immediately start for the `persisted` namespace so that nodes in rack A can quickly resume masterhood for the partitions they owned as master. Delta migrations only transfer records that changed while rack A was down.
  - For the namespace `memory`, as the nodes come back without data, migrations will not start until the 1 hour has elapsed. `migrate-fill-delay` takes effect on the nodes that would be migrating data out, so dynamically setting it to 0 on the nodes in rack B takes effect immediately and triggers the migrations to fill the `memory` namespace back up.
  - For the namespace `memory-sc`, migrations will immediately start, since the roster identifies which nodes own which partitions. In SC mode, `migrate-fill-delay` only affects migrations to nodes that are not roster replicas.
- Once migrations have completed, repeat all the steps for rack B, starting with setting `migrate-fill-delay` to the desired value again (3600 seconds in this example).
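A minimal command sketch for the quiesce and fill-delay steps above, using asinfo (the hostnames are hypothetical):

```bash
# Step 1: delay "fill" migrations long enough to cover the maintenance
# window (run against each node, or cluster-wide via your tooling):
asinfo -v 'set-config:context=service;migrate-fill-delay=3600'

# Step 2: quiesce every node in rack A, then trigger a rebalance so that
# traffic drains away from the quiesced nodes:
asinfo -h nodeA1 -v 'quiesce:'
asinfo -h nodeA2 -v 'quiesce:'
asinfo -v 'recluster:'

# If needed later: force "fill" migrations to start immediately (e.g. for
# the in-memory AP namespace) by zeroing the delay on the rack B nodes:
asinfo -h nodeB1 -v 'set-config:context=service;migrate-fill-delay=0'
```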