Troubleshooting startup problems
This page describes startup problems, and how to handle them.
Overview​
The asd process (daemon) is the main Aerospike process that gets started. In System V Linux variants, this is done using /etc/init.d/aerospike start. In systemd Linux variants, this is done through systemctl. See Aerospike Daemon Management.
The asd daemon will not start​
Server aborts because it could not allocate enough shared memory​
ISSUE
In Aerospike Database 7.0 or later, if you're using an in-memory namespace, the following error means that your operating system's kernel.shmmax or kernel.shmall is set too low to pre-allocate the in-memory data storage for this namespace.
May 09 2024 22:31:59 GMT: CRITICAL (drv-mem): (drv_mem_ee.c:1078) {test} could not allocate 1342177280-byte shmem stripe
May 09 2024 22:31:59 GMT: WARNING (as): (signal.c:259) SIGUSR1 received, aborting Aerospike Enterprise Edition build 7.0.0.8 os ubuntu20.0>
SOLUTION
kernel.shmmax should be at least 1/8th of the data-size if you're using in-memory without storage-backed persistence. Otherwise, set it to at least the filesize or the device size of your storage-backed persistence.
sysctl kernel.shmmax
kernel.shmmax = 2147483648
# If data-size is 64GiB, 2GiB shmmax is too small for the 8GiB stripes
sysctl -w kernel.shmmax=17592186044416
sysctl kernel.shmall
kernel.shmall = 2097152
getconf PAGE_SIZE
4096
# 8GiB (2097152 * 4096) shmall is too small for a 64GiB data-size
sysctl -w kernel.shmall=4294967296
See configuring namespace data storage for more details.
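The sizing arithmetic in the comments above can be sketched as two small shell helpers. This is a minimal sketch, assuming the in-memory data is split into 8 stripes (which is what the 1/8th guidance implies) and that kernel.shmall is counted in pages. The function names are hypothetical and not part of any Aerospike tooling.

```shell
# Hypothetical helpers; the names are illustrative only.

# Smallest kernel.shmmax (in bytes) that fits one stripe, assuming the
# data-size is divided into 8 stripes. Argument: <data-size-bytes>.
min_shmmax_bytes() {
    echo $(( ($1 + 7) / 8 ))
}

# Smallest kernel.shmall (in pages) that covers the whole data-size.
# Arguments: <data-size-bytes> <page-size-bytes>.
min_shmall_pages() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

data_size=$(( 64 * 1024 * 1024 * 1024 ))   # 64 GiB, as in the example above

min_shmmax_bytes "$data_size"        # prints 8589934592 (8 GiB)
min_shmall_pages "$data_size" 4096   # prints 16777216
```

Any value at or above these minimums should satisfy the allocation, which is why the example above can simply set both limits much higher than strictly needed.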
The header has not been zeroized​
ISSUE
When you start Aerospike Database 6.0 or later, the following error means that one of the devices configured as namespace storage is not recognized as an Aerospike device. This check is a safeguard: a misconfiguration should not result in the server writing over the wrong device, such as the root partition.
Apr 13 2022 05:42:46 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2216) /dev/nvme1n1p1: not an Aerospike device but not erased - check config or erase device
SOLUTION
Verify that the devices in the namespace configuration are actually supposed to be used by the server.
Verify that the devices have been properly initialized.
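To review which devices the server will try to open, one option is to list the device lines from the configuration file. This is a minimal sketch, assuming the configuration lives at /etc/aerospike/aerospike.conf and that each device is declared on its own line as "device <path>". The function name is hypothetical.

```shell
# Hypothetical helper: print the first path from every line whose first
# word is "device". Commented-out lines are skipped because their first
# word starts with "#". Argument: <path-to-config-file>.
list_configured_devices() {
    awk '$1 == "device" { print $2 }' "$1"
}

conf=/etc/aerospike/aerospike.conf   # assumed location of the config file
if [ -r "$conf" ]; then
    list_configured_devices "$conf"
fi
```

Comparing this list against the output of lsblk is one way to confirm that each path points at the intended disk or partition. If a device line names more than one path, this sketch prints only the first.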
Permission denied for /var/lock/subsys/aerospike​
ISSUE
You try to start asd and get the following error:
touch: cannot touch `/var/lock/subsys/aerospike': Permission denied
SOLUTION
Confirm that you are starting the process as a user with the appropriate permissions. Starting the daemon requires root privileges, so either log in as the root user or run the start command with sudo.
Failed to get the feature-key​
ISSUE
If you try to start asd and get the following error, you must provide a feature-key file:
Apr 09 2021 06:35:12 GMT: CRITICAL (config): (features_ee.c:142) failed to get feature key /etc/aerospike/features.conf
Starting with Database 6.1, a simple feature-key file is included. This feature-key file only allows deployment of a single-node cluster.
SOLUTION
For more information, see Configuring the Feature-Key File.
Problem with network interface​
ISSUE
The server won't start because it cannot determine the physical address of a network interface, and you see the following messages in the log file:
Jun 22 2014 02:34:10 GMT: WARNING (cf:misc): (id.c::249) Tried eth,bond,wlan and list of all available interfaces on device.Failed to retrieve physical address with errno 19 No such device
Jun 22 2014 02:34:10 GMT: CRITICAL (config): (cfg.c:3363) could not get unique id and/or ip address
Jun 22 2014 02:34:10 GMT: WARNING (as): (signal.c::120) SIGINT received, shutting down
Jun 22 2014 02:34:10 GMT: WARNING (as): (signal.c::123) startup was not complete, exiting immediately
SOLUTION
- Check the name of your network interface:
ifconfig -a
- Specify the node-id-interface in the configuration.
The interface name in the following example is p2p1:
service {
...
node-id-interface p2p1
...
}
...
For more information on the configuration, see Configuration Reference.
Not enough file descriptors error in log​
ISSUE
The server aborts at startup and the Aerospike log contains the following messages:
Aug 24 2012 16:43:10 GMT: INFO (as): (base/as.c:172) File descriptor limit is : 1024 and proto-fd-max is : 2048
Aug 24 2012 16:43:10 GMT: CRITICAL GLOBAL (as): (base/as.c:174) Not enough file descriptors, Starting with 1024 and needs 2048
critical error: backtrace: frame 0 /usr/bin/asd() [0x460cef]
critical error: backtrace: frame 1 /usr/bin/asd() [0x404b59]
critical error: backtrace: frame 2 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f620b53030d]
critical error: backtrace: frame 3 /usr/bin/asd() [0x403e79]
At Aerospike server start, the value of proto-fd-max must not exceed the system's file descriptor limit.
SOLUTION
To avoid a startup problem, there are two alternatives:
- Decrease the value of proto-fd-max in your Aerospike configuration file.
- Increase the process file descriptor limit (Linux system setting).
Prior to Aerospike Database 4.9, for a dynamic change, this limit was enforced only if the new value was lower than the system setting.
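The comparison behind this error can be sketched in shell as follows. This is a minimal sketch, not the server's own check: the function name is hypothetical, and ulimit -n reports the soft limit of the current shell, which may differ from the limit the asd service actually runs under.

```shell
# Hypothetical helper: succeeds when proto-fd-max fits within the file
# descriptor limit. Arguments: <fd-limit> <proto-fd-max>.
fd_limit_ok() {
    # A limit of "unlimited" can never be exceeded.
    [ "$1" = "unlimited" ] && return 0
    [ "$2" -le "$1" ]
}

current_limit=$(ulimit -n)   # soft limit of this shell, not necessarily of asd
proto_fd_max=2048            # value taken from the log excerpt above

if fd_limit_ok "$current_limit" "$proto_fd_max"; then
    echo "limit $current_limit is enough for proto-fd-max $proto_fd_max"
else
    echo "limit $current_limit is too low for proto-fd-max $proto_fd_max"
fi
```

With the values from the log excerpt (a limit of 1024 and a proto-fd-max of 2048), the check fails, which corresponds to the startup abort shown above.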
Stuck in defrag loop at startup​
When Aerospike starts, it requires a minimum percentage of free storage before it joins the cluster. We recommend keeping SSD utilization at or below 50% to allow efficient defragmentation.
The minimum percentage of storage that is required at startup depends on the value of defrag-startup-minimum.
In Database 5.7 and later, this value is 0 by default, meaning the server will never get stuck in a defrag loop at startup. If nonzero, a typical value might be 10, which is the default in server versions prior to 5.7:
defrag-startup-minimum 10
A node that is stuck in this loop does not start up for a long time, and the log file at /var/log/aerospike/aerospike.log shows messages like the following:
Aug 22 2012 21:16:38 GMT: INFO (as): (base/as.c:265) waiting for defrag: namespace devices percent 0 waiting for 10
Apr 24 2013 17:19:59 GMT: INFO (drv_ssd): (storage/drv_ssd.c:1544) read_bin: could not read first
Apr 24 2013 17:20:00 GMT: WARNING (drv_ssd): (storage/drv_ssd.c:1390) **** ssd_read: record f8de4ae1d039c87 has no block associated, fail
Apr 24 2013 17:20:00 GMT: WARNING (drv_ssd): (storage/drv_ssd.c:1390) **** ssd_read: record f8de4ae1d039c87 has no block associated, fail
At startup, the server defragments storage and does not finish starting until the available space reaches the startup minimum. If SSD utilization is too high, the node may not be able to free enough contiguous space to start up.
To get out of the defrag loop:
- For an in-memory database with persistence, in general, the filesize should be 8x the amount of memory (e.g., if the memory for data is 20 GB, then the filesize should be 160 GB). Increase the filesize to the 8x limit or higher. If your database has very high traffic, you may require higher than 8x.
- To resolve this for nodes with SSDs, lower the high-water-memory-pct in the Aerospike configuration file so that the node starts evicting objects and frees some space. If the node still does not start up or does not evict, lower the value further so that more records are evicted and enough free space becomes available for startup.
Eviction is typically a back-stopping strategy: data should expire before you reach the high-water marks that trigger eviction. This problem is a symptom of a larger problem that you need to address:
- If your storage is insufficient, you need to re-evaluate your storage/capacity strategy, as described in Managing Storage Capacity.
- Contact Aerospike for the capacity planning spreadsheet to help you re-configure your cluster capacity.
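The 8x filesize guideline mentioned above is simple arithmetic and can be sketched as a shell helper. This is only an illustration: the function name is hypothetical, and the multiplier is a parameter because the text notes that high-traffic databases may need more than 8x.

```shell
# Hypothetical helper. Arguments: <memory-for-data-in-GB> [multiplier].
# The multiplier defaults to 8, the general guideline quoted above.
min_filesize_gb() {
    echo $(( $1 * ${2:-8} ))
}

min_filesize_gb 20       # prints 160, matching the 20 GB example above
min_filesize_gb 20 10    # prints 200, assuming a 10x multiplier is chosen
```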
With a non-standard network device, the server won't start​
ISSUE
On some distributions and with some unusual network devices, the server won't start and the log shows the following error. This means the network interface name on the node is something other than “eth”, “bond”, or “wlan”.
Aug 22 2012 06:34:18 GMT: WARNING (cf:misc): (id.c:163) can't get physical address, tried eth, bond, wlan. fatal: 19 No such device
SOLUTION
Add the network-interface-name parameter to your configuration file. In the example below, the device is named vlan708:
network {
service {
address 10.0.2.131
port 3000
network-interface-name vlan708
}
}
If you have multiple network devices, specify which one to use; otherwise Aerospike chooses one for you. For example, if you have eth0 and eth1, Aerospike picks one of them. A common situation is a node that has two Ethernet ports, one for the internet and one for internal traffic. In this case, Aerospike needs to use the internet port, but it may choose the wrong one, resulting in a node that does not see traffic correctly.
To select a specific network device, add the access-address parameter to your configuration file.
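Whether an interface name would be found by the default search described in this section comes down to a prefix match on eth, bond, or wlan. This is a minimal sketch of that match, not the server's actual lookup code; the function name is hypothetical and the interface names in the loop are examples only.

```shell
# Hypothetical helper: succeeds if the interface name starts with one of
# the prefixes listed in the log message (eth, bond, wlan).
has_default_prefix() {
    case "$1" in
        eth*|bond*|wlan*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example names only; substitute the names reported on your own node.
for name in eth0 bond0 vlan708 p2p1; do
    if has_default_prefix "$name"; then
        echo "$name: matches a default prefix"
    else
        echo "$name: must be named explicitly in the configuration"
    fi
done
```

Names such as vlan708 and p2p1 fall into the second group, which is the situation that produces the error shown above.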
Warning messages on starting Aerospike​
If you start Aerospike and get the following warnings, they are related to memory management in the Linux kernel.
- Warning about SHMMAX:
SHMMAX is the maximum size of a single shared memory segment. If you see the following output when you start a server node, it means that the system was configured with a shared memory maximum block size that's less than the 1GB required by Aerospike. The start script dynamically raises the limit to 1GB.
sudo service aerospike start
kernel.shmmax too low, setting to 1GB
kernel.shmmax = 1073741824
Starting aerospike: [OK]
Unless the machine is rebooted, you should only see this happen on the first start.
- Warning about SHMALL:
SHMALL is the system-wide limit on the total amount of shared memory, counted in pages. If you see the following output when you start a server node, it means the start script dynamically raised the maximum number of shared memory pages. Unless the machine is rebooted, you should only see this happen on the first start. It's possible to see both limits raised during a single (first) start.
sudo service aerospike start
kernel.shmall too low, setting to 4G pages
kernel.shmall = 4294967296
Starting aerospike: [OK]
Other problems​
For other problems, check for consistent settings between all nodes in the cluster, including service and network settings. Make sure that the namespaces are configured the same on each node.
Verify that no firewall is interfering with communication between nodes (ports 3000 through 3004).