Tales From A Lazy Fat DBA

Its all about Databases & their performance, troubleshooting & much more …. ¯\_(ツ)_/¯

Redis cluster node with status DISCONNECTED – [ERR] Not all 16384 slots are covered by nodes, and a crude method to deal with that …

Posted by FatDBA on November 8, 2021

Hi All,

Today’s post doesn’t follow the usual problem-solution approach; it is about a bit of trial and error that fixed my problem on Redis. I ran into a strange problem while doing a POC: after a reboot, one of the cluster nodes stopped responding and showed up with status ‘disconnected‘. This was a multi-master configuration (4 master nodes) with no slaves, and the Redis version was 6.2.6 on RHEL7. For simulation purposes, I started four different Redis server instances on the same node/machine using separate ports. The one running on port 6391 was not responding.

Note: This is just a test I did on a sandbox environment, and it might not match your problem. I recommend not trying these steps in production, as they are just part of the trial and error that I did.

-- Four different Redis Server Instances running on same host.
[root@fatdba redis-cluster]# ps -ef|grep redis
root       3337   3177  0 23:14 pts/1    00:00:03 redis-server 127.0.0.1:6391 [cluster]
root       3763   3177  0 23:20 pts/1    00:00:01 redis-server 127.0.0.1:6392 [cluster]
root       3814   3177  0 23:21 pts/1    00:00:01 redis-server 127.0.0.1:6394 [cluster]
root       4001   3177  0 23:24 pts/1    00:00:01 redis-server 127.0.0.1:6390 [cluster]



-- The one on port 6391 shows status 'disconnected'.
[root@fatdba node0]# redis-cli -a secret -p 6390 cluster nodes
bdcc3381941e141c89110bbcffe66c73bc927054 :0@0 master,noaddr - 1635737041203 1635737041203 2 disconnected 4096-8191
8751bde72e08fd1b08cc34c82cd5868475320a6a 127.0.0.1:6392@16392 master - 0 1635737634979 3 connected 8192-12287
a2b89ff5be2eafeba1ae3aea78f871d0ccc16283 127.0.0.1:6390@16390 myself,master - 0 1635737633000 1 connected 0-4095
c548fe11041b636d627cafffd23f8ad642ac4085 127.0.0.1:6394@16394 master - 0 1635737634000 4 connected 12288-16383
[root@fatdba node0]#
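Spotting the bad node by eye is easy with four instances, but it can also be scripted. The following is a minimal sketch, assuming the whitespace-separated field layout of the 'cluster nodes' output shown above (node ID, address, flags, master, ping sent, pong received, config epoch, link state, slots). The names ClusterNode, parse_cluster_nodes and unreachable_nodes are made up for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterNode:
    node_id: str
    address: str
    flags: List[str]
    link_state: str
    slots: List[str]


def parse_cluster_nodes(text: str) -> List[ClusterNode]:
    """Parse the text printed by `cluster nodes` into ClusterNode records."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # A node line has at least 8 fields and starts with a 40-character
        # node ID; anything else (blank lines, shell prompts) is skipped.
        if len(fields) < 8 or len(fields[0]) != 40:
            continue
        nodes.append(ClusterNode(
            node_id=fields[0],
            address=fields[1],
            flags=fields[2].split(","),
            link_state=fields[7],
            slots=fields[8:],
        ))
    return nodes


def unreachable_nodes(nodes: List[ClusterNode]) -> List[ClusterNode]:
    """Return nodes whose link is not 'connected' or that carry the noaddr flag."""
    return [n for n in nodes
            if n.link_state != "connected" or "noaddr" in n.flags]


# Output copied from the session above.
SAMPLE = """\
bdcc3381941e141c89110bbcffe66c73bc927054 :0@0 master,noaddr - 1635737041203 1635737041203 2 disconnected 4096-8191
8751bde72e08fd1b08cc34c82cd5868475320a6a 127.0.0.1:6392@16392 master - 0 1635737634979 3 connected 8192-12287
a2b89ff5be2eafeba1ae3aea78f871d0ccc16283 127.0.0.1:6390@16390 myself,master - 0 1635737633000 1 connected 0-4095
c548fe11041b636d627cafffd23f8ad642ac4085 127.0.0.1:6394@16394 master - 0 1635737634000 4 connected 12288-16383
"""

if __name__ == "__main__":
    for node in unreachable_nodes(parse_cluster_nodes(SAMPLE)):
        print(node.node_id, node.flags, node.link_state, node.slots)
```

For the sample, the only node reported is the one holding slots 4096-8191, which matches what the session shows.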

The cluster check command shows details only for the three healthy masters and ends with the error '[ERR] Not all 16384 slots are covered by nodes'. I tried to remove the problematic node using CLUSTER FORGET and re-add it to the cluster using redis-cli --cluster add-node, but even that didn’t work. Ideally it should have, but the node still came up with status ‘disconnected‘.

Meanwhile, the 'cluster info' command reports 4 known nodes and a cluster state of ok.

[root@fatdba redis-cluster]# redis-cli -a secret --cluster check 127.0.0.1:6394
127.0.0.1:6394 (c548fe11...) -> 0 keys | 4096 slots | 0 slaves.
127.0.0.1:6390 (a2b89ff5...) -> 0 keys | 4096 slots | 0 slaves.
127.0.0.1:6392 (8751bde7...) -> 0 keys | 4096 slots | 0 slaves.
[OK] 0 keys in 3 masters.
0.00 keys per slot on average.
>>> Performing Cluster Check (using node 127.0.0.1:6394)
M: c548fe11041b636d627cafffd23f8ad642ac4085 127.0.0.1:6394
   slots:[12288-16383] (4096 slots) master
M: a2b89ff5be2eafeba1ae3aea78f871d0ccc16283 127.0.0.1:6390
   slots:[0-4095] (4096 slots) master
M: 8751bde72e08fd1b08cc34c82cd5868475320a6a 127.0.0.1:6392
   slots:[8192-12287] (4096 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
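The coverage error does not say which slots are missing. A small sketch like the one below can work that out from the slot ranges the check printed for the reachable masters. It is only an illustration of the arithmetic, not how redis-cli itself performs the check, and the function name missing_slot_ranges is hypothetical.

```python
from typing import List, Tuple

TOTAL_SLOTS = 16384  # a Redis cluster always has hash slots 0..16383


def missing_slot_ranges(owned: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the inclusive slot ranges not covered by any (first, last) pair."""
    covered = [False] * TOTAL_SLOTS
    for first, last in owned:
        for slot in range(first, last + 1):
            covered[slot] = True

    gaps = []
    gap_start = None
    for slot, is_covered in enumerate(covered):
        if not is_covered and gap_start is None:
            gap_start = slot                      # a gap opens here
        elif is_covered and gap_start is not None:
            gaps.append((gap_start, slot - 1))    # the gap closed on the previous slot
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, TOTAL_SLOTS - 1))
    return gaps


if __name__ == "__main__":
    # Ranges reported by the check above for 6394, 6390 and 6392.
    reachable = [(12288, 16383), (0, 4095), (8192, 12287)]
    print(missing_slot_ranges(reachable))
```

With the three ranges from the check output, the uncovered range is 4096-8191, exactly the slots assigned to the disconnected node.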



127.0.0.1:6391> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:4
cluster_size:4
cluster_current_epoch:4
cluster_my_epoch:1
cluster_stats_messages_ping_sent:3342
cluster_stats_messages_pong_sent:3732
cluster_stats_messages_sent:7074
cluster_stats_messages_ping_received:3732
cluster_stats_messages_pong_received:3339
cluster_stats_messages_received:7071
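Since 'cluster info' only reflects the view of the node it is run on, it can help to compare it with what the cluster check actually reached. Below is a minimal sketch, assuming the key:value layout shown above. The helper names parse_cluster_info and nodes_not_reached are invented for this example, and the count of reachable masters (3) is taken from the check output earlier in the post.

```python
from typing import Dict, Union


def parse_cluster_info(text: str) -> Dict[str, Union[int, str]]:
    """Turn `cluster info` output into a dict, converting numeric values to int."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Keep only the cluster_* fields; this also skips prompts and blank lines.
        if not sep or not key.startswith("cluster_"):
            continue
        info[key] = int(value) if value.isdigit() else value
    return info


def nodes_not_reached(info: Dict[str, Union[int, str]], reachable: int) -> int:
    """How many known nodes were not reached by the cluster check."""
    return int(info["cluster_known_nodes"]) - reachable


# First lines of the output shown above.
SAMPLE_INFO = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:4
cluster_size:4
cluster_current_epoch:4
cluster_my_epoch:1
"""

if __name__ == "__main__":
    info = parse_cluster_info(SAMPLE_INFO)
    print(info["cluster_state"], nodes_not_reached(info, reachable=3))
```

Here the node reports state ok with 4 known nodes while the check reached only 3 masters, which is the mismatch described in this post.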

So, I tried everything I could think of, but the problem persisted …

Now, if you have looked carefully at the commands above, I connect to every node with the -a option: a password was set, so I have to supply it on each connection. Just as a trial, I removed the password (the requirepass parameter in the Redis config file) and recreated the cluster, and oddly that trick worked. Of course, that is something you would never do in a production setup, but you can try it while learning the database, or on a non-production system where you have the luxury to play around with it. It is just a crude method to deal with the issue 🙂

I still can’t explain how removing the password fixed the issue 😦 Maybe I was making a mistake somewhere. I am perplexed and will be troubleshooting it further …
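The cause is not established in this post, so the following is only something to rule out, not an explanation. Before dropping the password altogether, one cheap check is whether every instance was started with the same password setting. The sketch below assumes the password is set through the requirepass directive in each instance's config file. The config snippets and the helper names are hypothetical, since the actual files are not shown here.

```python
from typing import Dict, Optional


def get_directive(conf_text: str, name: str) -> Optional[str]:
    """Return the value of a config directive, or None if it is absent.

    Comment lines are ignored; if the directive appears more than once,
    this sketch keeps the last occurrence.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if parts[0].lower() == name.lower():
            value = parts[1].strip().strip("\"'") if len(parts) > 1 else ""
    return value


def requirepass_by_node(confs: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each node label to its requirepass value (None when not set)."""
    return {node: get_directive(text, "requirepass") for node, text in confs.items()}


def passwords_consistent(confs: Dict[str, str]) -> bool:
    """True when all nodes share the same requirepass value (or none set it)."""
    return len(set(requirepass_by_node(confs).values())) <= 1


# Hypothetical config contents, for illustration only.
HYPOTHETICAL_CONFS = {
    "6390": "port 6390\ncluster-enabled yes\nrequirepass secret\n",
    "6391": "port 6391\ncluster-enabled yes\n# requirepass secret\n",
}

if __name__ == "__main__":
    print(requirepass_by_node(HYPOTHETICAL_CONFS))
    print(passwords_consistent(HYPOTHETICAL_CONFS))
```

In this made-up input the directive is commented out for 6391, so the two nodes disagree. Whether anything like that happened in the sandbox above is unknown.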

[root@fatdba node0]# redis-cli --cluster create 127.0.0.1:6390 127.0.0.1:6391 127.0.0.1:6392 127.0.0.1:6394 --cluster-replicas 0
>>> Performing hash slots allocation on 4 nodes...
Master[0] -> Slots 0 - 4095
Master[1] -> Slots 4096 - 8191
Master[2] -> Slots 8192 - 12287
Master[3] -> Slots 12288 - 16383
M: 4c9785e37002cf340842ee176d21244497898d4a 127.0.0.1:6390
   slots:[0-4095] (4096 slots) master
M: ce3c377b47d78f45cf824355926a4763eb4eb6fa 127.0.0.1:6391
   slots:[4096-8191] (4096 slots) master
M: 08f772eab05b03b5559b15af0848822a383093d9 127.0.0.1:6392
   slots:[8192-12287] (4096 slots) master
M: 923df80e2201c6c35de10560dc04420b5742ea58 127.0.0.1:6394
   slots:[12288-16383] (4096 slots) master
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
...
>>> Performing Cluster Check (using node 127.0.0.1:6390)
M: 4c9785e37002cf340842ee176d21244497898d4a 127.0.0.1:6390
   slots:[0-4095] (4096 slots) master
M: 923df80e2201c6c35de10560dc04420b5742ea58 127.0.0.1:6394
   slots:[12288-16383] (4096 slots) master
M: ce3c377b47d78f45cf824355926a4763eb4eb6fa 127.0.0.1:6391
   slots:[4096-8191] (4096 slots) master
M: 08f772eab05b03b5559b15af0848822a383093d9 127.0.0.1:6392
   slots:[8192-12287] (4096 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
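The allocation printed above is simply the 16384 slots divided evenly across the four masters. A minimal sketch of such an even split follows. It is not redis-cli's own code, and for master counts that do not divide 16384 evenly the boundaries redis-cli picks may differ by a slot. The function name allocate_slots is made up.

```python
from typing import List, Tuple


def allocate_slots(master_count: int, total_slots: int = 16384) -> List[Tuple[int, int]]:
    """Split slots 0..total_slots-1 into contiguous, near-equal inclusive ranges."""
    if master_count < 1:
        raise ValueError("at least one master is required")
    ranges = []
    first = 0
    for i in range(master_count):
        # Integer arithmetic keeps the ranges contiguous and ends exactly
        # on the last slot for the final master.
        last = (i + 1) * total_slots // master_count - 1
        ranges.append((first, last))
        first = last + 1
    return ranges


if __name__ == "__main__":
    for index, (first, last) in enumerate(allocate_slots(4)):
        print(f"Master[{index}] -> Slots {first} - {last}")
```

For four masters this reproduces the ranges in the output above: 0-4095, 4096-8191, 8192-12287 and 12288-16383.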

-- Let's check once again the cluster status
127.0.0.1:6391> cluster nodes
4c9785e37002cf340842ee176d21244497898d4a 127.0.0.1:6390@16390 master - 0 1635742000349 1 connected 0-4095
923df80e2201c6c35de10560dc04420b5742ea58 127.0.0.1:6394@16394 master - 0 1635741998338 4 connected 12288-16383
ce3c377b47d78f45cf824355926a4763eb4eb6fa 127.0.0.1:6391@16391 myself,master - 0 1635741998000 2 connected 4096-8191
08f772eab05b03b5559b15af0848822a383093d9 127.0.0.1:6392@16392 master - 0 1635741999344 3 connected 8192-12287
127.0.0.1:6391>


127.0.0.1:6391> cluster keyslot prashant
(integer) 6942

127.0.0.1:6391> keys *
(empty array)

-- No keys, let me try to create one string type key.
127.0.0.1:6391> set prashant dixit
OK

127.0.0.1:6391> get prashant
"dixit"
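The slot returned by 'cluster keyslot' above (6942) falls in the range 4096-8191, which belongs to the node on port 6391, so the SET could be served by the node I was connected to. Redis Cluster maps a key to a slot with CRC16 (the XMODEM variant) modulo 16384, hashing only the part between { and } when the key contains a non-empty hash tag. The sketch below is an illustrative reimplementation of that mapping, not the server's code; key_slot and slot_owner are names chosen for this example.

```python
from typing import Dict, Tuple


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty {tag} restricts what gets hashed.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


def slot_owner(slot: int, ranges: Dict[int, Tuple[int, int]]) -> int:
    """Return the port whose inclusive slot range contains the slot."""
    for port, (first, last) in ranges.items():
        if first <= slot <= last:
            return port
    raise LookupError(f"slot {slot} is not covered")


# Slot ranges per port, as shown by 'cluster nodes' after the rebuild.
RANGES = {6390: (0, 4095), 6391: (4096, 8191), 6392: (8192, 12287), 6394: (12288, 16383)}

if __name__ == "__main__":
    slot = key_slot("prashant")
    print(slot, slot_owner(slot, RANGES))
```

A key such as {prashant}:1 hashes only the tag, so it lands in the same slot as prashant.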

Hope It Helped
Prashant Dixit


2 Responses to “Redis cluster node with status DISCONNECTED – [ERR] Not all 16384 slots are covered by nodes, and a crude method to deal with that …”

  1. Saibal Ghosh said

    This is a very instructive piece and provides a lot of very valuable information that is not readily available! Good job, Prashant!
