Kubernetes nodes not working, runtime network not ready
I messed up and deleted my nodes with kubectl delete node, thinking that was how to restart them, but now they aren't working. I reduced the number of nodes in the pool and increased it again to create new nodes, but the new nodes are all NotReady when I check their status. Has anyone seen this and been able to fix it in Linode's managed Kubernetes environment?
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 19 Aug 2020 17:45:16 -0400 Wed, 19 Aug 2020 17:24:42 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 19 Aug 2020 17:45:16 -0400 Wed, 19 Aug 2020 17:24:42 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 19 Aug 2020 17:45:16 -0400 Wed, 19 Aug 2020 17:24:42 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Wed, 19 Aug 2020 17:45:16 -0400 Wed, 19 Aug 2020 17:24:42 -0400 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Hey there! Kubernetes can be pretty dense on the best of days, and punishingly so on the worst, so I'd be more than happy to provide a bit of context around what's happening with your LKE cluster.
Though some Kubernetes resources are gracefully removed when deleted through kubectl (such as pods and services), this is not the case for nodes -- when kubectl delete node <nodeName> is run, that node is removed ungracefully, meaning it's deleted regardless of the workloads it's running. This, as you've noticed, can leave your cluster in a pretty wonky state.
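For future reference, the safer way to take a node out of service is to drain it first so its workloads are rescheduled elsewhere before the node goes away. A minimal sketch (the node name here is a placeholder for one of your own):

```shell
# Mark the node unschedulable so no new pods land on it
kubectl cordon <nodeName>

# Gracefully evict the node's pods; DaemonSet pods can't be evicted,
# so they're skipped explicitly (pods using emptyDir volumes may also
# need --delete-emptydir-data, which discards that local data)
kubectl drain <nodeName> --ignore-daemonsets

# The node can now be safely recycled (for example, from the Linode
# Cloud Manager); once it's back, allow scheduling on it again
kubectl uncordon <nodeName>
```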
In this particular instance, it's likely that your cluster's webhook server (responsible for intercepting requests to your cluster's kube-apiserver and ensuring that the proper services can be deployed) was deleted along with its parent node, which deadlocked your cluster. The last line in the output you shared refers to your cluster's inability to start its Calico pods, which manage your cluster's internal networking.
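If you'd like to confirm that the CNI is what's stuck, you can inspect the Calico pods directly. This assumes Calico's default namespace and labels, which is what LKE uses:

```shell
# Calico runs one calico-node pod per node in kube-system;
# on a healthy cluster each should show Running and Ready
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide

# The Events section of a stuck pod usually spells out why
# the network plugin can't initialize
kubectl -n kube-system describe pods -l k8s-app=calico-node
```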
Part of your cluster's deadlock condition lies in that aforementioned webhook server -- this being the case, you should be able to resolve the deadlock by deleting its webhook configurations. You can do so by first saving your webhooks with kubectl get mutatingwebhookconfigurations -o yaml > mutatingwebhooks.txt (in case they're needed later), and then deleting them with kubectl delete mutatingwebhookconfigurations <NAME>. After doing so, you should be able to add new nodes to your cluster without issue. If you happen to notice otherwise, though, feel free to follow up here with the output from kubectl get events to get some assistance from the rest of the Community!
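Put together, the recovery steps above look something like this (the configuration name is a placeholder -- substitute whatever the get command lists for your cluster):

```shell
# Back up the current webhook configurations in case they're needed later
kubectl get mutatingwebhookconfigurations -o yaml > mutatingwebhooks.txt

# List the configurations, then delete the stuck one by name
kubectl get mutatingwebhookconfigurations
kubectl delete mutatingwebhookconfigurations <NAME>

# Watch the new nodes come up; they should transition to Ready
# once the Calico pods are able to start
kubectl get nodes -w
```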