NodeNotReady Kubernetes: Shocking Fixes DevOps Must Know 2025

It’s Monday morning, and you’re preparing to deploy a critical application update to your production Kubernetes cluster. You run your usual pre-deployment checks, starting with kubectl get nodes, expecting to see all nodes in the familiar “Ready” state. Instead, your heart skips a beat as you see this:

$ kubectl get nodes
NAME            STATUS     ROLES           AGE   VERSION
master-node     Ready      control-plane   45d   v1.28.2
worker-node-1   Ready      <none>          45d   v1.28.2
worker-node-2   NotReady   <none>          45d   v1.28.2
worker-node-3   Ready      <none>          45d   v1.28.2

One of your worker nodes is stuck in NotReady status, and you need to understand what’s happening before proceeding with the deployment. Sound familiar? If you’re a DevOps engineer working with Kubernetes, you’ve likely encountered this scenario or will soon enough.

In this guide, we’ll dig into NotReady nodes in Kubernetes, from what the status means to prevention strategies that will save you from 3 AM troubleshooting sessions.

What Does NodeNotReady Mean in Kubernetes?

When kubectl get nodes shows a node as NotReady, the node’s Ready condition is either False or Unknown. False means the kubelet is still reporting to the API server but considers the node unhealthy (for example, because the container runtime or network plugin is not working). Unknown means the control plane has stopped receiving status updates from the kubelet. In both cases the scheduler stops placing new pods on the node, and existing pods on it are at risk of eviction.

Node health is tracked through heartbeats: the kubelet periodically updates the node’s status and renews a Lease object in the kube-node-lease namespace. When the kubelet reports a problem, or when heartbeats stop arriving for longer than the node controller’s grace period, the node transitions to NotReady. In the second case, kubectl describe node shows the Ready condition as Unknown, because the control plane has lost contact with the node entirely.
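
To see which of the two cases applies, you can read the Ready condition directly instead of relying on the STATUS column. The following is a minimal sketch, not a definitive tool: the function name explain_ready_status is hypothetical, and the wording of the explanations is ours rather than anything Kubernetes prints.

```bash
# Map the status of a node's Ready condition to a short explanation.
# explain_ready_status is a hypothetical helper name used for illustration.
explain_ready_status() {
  case "$1" in
    True)    echo "Ready: kubelet is healthy and posting status" ;;
    False)   echo "NotReady: kubelet is reporting that the node is unhealthy" ;;
    Unknown) echo "NotReady: control plane has stopped hearing from the kubelet" ;;
    *)       echo "Unrecognized status: $1" ;;
  esac
}

# Usage against a live cluster (requires kubectl access; node name is an example):
#   status=$(kubectl get node worker-node-2 \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
#   explain_ready_status "$status"
```

A False result points you toward the kubelet and container runtime on the node, while Unknown points toward networking or the machine itself.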

Figure: Kubernetes node health flow (thedevopstooling.com)

Step-by-Step Debugging Process

When a node reports NotReady, follow this systematic approach to identify and resolve the problem:

Step 1: Confirm the Node Status

First, verify which nodes are affected and gather basic information:

# Check all nodes status
kubectl get nodes

# Get detailed node information with labels
kubectl get nodes -o wide --show-labels

Step 2: Examine Node Conditions and Events

Use kubectl describe node to get detailed information about the problematic node:

# Replace 'worker-node-2' with your actual node name
kubectl describe node worker-node-2

Pay attention to the Conditions section. On a healthy node, Ready is True and every other condition is False:

  • Ready: Whether the node is healthy and ready to accept pods
  • MemoryPressure: True if the node is running low on memory
  • DiskPressure: True if the node is running low on disk space
  • PIDPressure: True if too many processes are running on the node
  • NetworkUnavailable: True if the node’s network is not correctly configured
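
When you are checking several nodes, it can help to print only the conditions that deviate from the healthy values. This is a small sketch under stated assumptions: the function name abnormal_conditions is hypothetical, and it expects one TYPE=STATUS pair per line on stdin, as produced by the kubectl jsonpath expression shown in the usage comment.

```bash
# Print only the conditions that are not in their healthy state.
# Healthy means Ready=True and every other condition =False.
# abnormal_conditions is a hypothetical helper name.
abnormal_conditions() {
  local type status
  while IFS='=' read -r type status; do
    [ -z "$type" ] && continue
    if [ "$type" = "Ready" ]; then
      [ "$status" = "True" ] || echo "$type=$status"
    else
      [ "$status" = "False" ] || echo "$type=$status"
    fi
  done
}

# Usage (requires kubectl access; node name is an example):
#   kubectl get node worker-node-2 \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | abnormal_conditions
```

An empty result means every condition is at its healthy value.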

Step 3: Check Kubelet Logs

SSH into the problematic node and examine kubelet logs:

# Check kubelet status
sudo systemctl status kubelet

# View real-time kubelet logs
sudo journalctl -u kubelet -f

# Check recent kubelet logs
sudo journalctl -u kubelet --since "1 hour ago"
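
Kubelet logs can be noisy, so it is often useful to narrow them to errors first. The sketch below assumes the kubelet uses its default text log format, where each message starts with a severity letter (I, W, E or F) followed by a four-digit month and day. The function name kubelet_error_lines is hypothetical, and it will not match anything if the kubelet is configured for JSON logging.

```bash
# Keep only error (E) and fatal (F) lines from klog-formatted text on stdin.
# kubelet_error_lines is a hypothetical helper name.
kubelet_error_lines() {
  # "|| true" keeps the exit status at 0 when no line matches
  grep -E '^[EF][0-9]{4} ' || true
}

# Usage on the node ("-o cat" prints only the message, without the journal prefix):
#   sudo journalctl -u kubelet --since "1 hour ago" -o cat | kubelet_error_lines
```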

Step 4: Identify Affected Pods

Check which pods are running on the NotReady node:

# List all pods on the affected node, across all namespaces
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-node-2

# Check for pods stuck in Pending state (for example, replacements that cannot be scheduled)
kubectl get pods --all-namespaces --field-selector status.phase=Pending
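
To judge the blast radius quickly, you can summarize the affected pods per namespace. The following is a sketch rather than a finished tool: count_by_namespace is a hypothetical helper, and it assumes tab-separated namespace and pod name pairs on stdin, which the jsonpath expression in the usage comment produces.

```bash
# Count pods per namespace from "namespace<TAB>pod-name" lines on stdin.
# count_by_namespace is a hypothetical helper name.
count_by_namespace() {
  awk -F'\t' 'NF >= 2 { count[$1]++ } END { for (ns in count) print ns "\t" count[ns] }' | sort
}

# Usage (requires kubectl access; node name is an example):
#   kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-node-2 \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}' \
#     | count_by_namespace
```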

See also: Node Problem Detector on GitHub (Kubernetes project)

Common Causes of NodeNotReady Issues

Understanding the root causes helps you troubleshoot more effectively. Here are the most frequent causes of a NotReady node:

1. Network Connectivity Issues

The most common cause is when the node loses network connectivity to the Kubernetes API server. This can happen due to:

  • Network configuration changes
  • Firewall rule modifications
  • DNS resolution problems
  • Load balancer issues in multi-master setups

2. Kubelet Service Problems

Kubelet-related NotReady states often result from:

  • Kubelet service crashed or stopped
  • Incorrect kubelet configuration
  • Missing or corrupted kubelet certificates
  • Resource constraints preventing kubelet from functioning

3. Resource Pressure

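Resource exhaustion sets a pressure condition on the node and triggers pod evictions. Severe exhaustion can also starve the kubelet or container runtime and push the node to NotReady:

  • Disk Pressure: Insufficient disk space (the kubelet’s default hard eviction threshold is less than 10% available on the node filesystem)
  • Memory Pressure: Low available memory (the default hard eviction threshold is less than 100Mi available)
  • PID Pressure: Too many processes running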
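
You can approximate the disk check from the node itself. This sketch assumes the kubelet’s default hard eviction threshold of 10% available on the node filesystem and uses /var/lib/kubelet as the path to measure. The function names free_percent and under_disk_threshold are hypothetical, and df rounds its percentages, so treat the result as an estimate rather than the kubelet’s exact calculation.

```bash
# Compute the free percentage from `df -P <path>` output read on stdin.
# Column 5 of the data line is the used percentage, for example "91%".
# free_percent and under_disk_threshold are hypothetical helper names.
free_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Succeed (exit 0) when the free percentage is below the threshold (default 10).
under_disk_threshold() {
  local free="$1" threshold="${2:-10}"
  [ "$free" -lt "$threshold" ]
}

# Usage on the node:
#   free=$(df -P /var/lib/kubelet | free_percent)
#   if under_disk_threshold "$free"; then
#     echo "node filesystem below threshold: ${free}% free"
#   fi
```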

4. Cloud Provider Issues

In cloud environments, NodeNotReady can result from:

  • EC2 instance stopped or terminated (AWS)
  • Compute Engine VM preempted (GCP)
  • Virtual Machine deallocated (Azure)
  • Instance networking or security group changes

5. Certificate or Configuration Expiration

Expired certificates or configuration issues can cause:

  • Kubelet unable to authenticate with API server
  • Container runtime (containerd, CRI-O, or Docker via cri-dockerd) connectivity problems
  • Incorrect cluster CA certificates

Practical Fixes with Examples

Based on the root cause identified, apply the appropriate fix:

Fix 1: Restart Kubelet Service

For kubelet-related issues:

# SSH to the problematic node
ssh user@worker-node-2

# Restart kubelet service
sudo systemctl restart kubelet

# Verify kubelet is running
sudo systemctl status kubelet

# Check if node becomes Ready (run from a machine with cluster access)
kubectl get nodes

Fix 2: Resolve Network Connectivity

Ensure the node can reach the API server:

# Test API server connectivity (replace with your API server endpoint)
curl -k https://your-api-server:6443/version

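# Check that the node can resolve the API server hostname
# (cluster-internal names such as kubernetes.default.svc are not resolvable from the host)
nslookup your-api-server

# Verify the API server port is reachable
nc -zv your-api-server 6443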
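
If you are unsure which endpoint the kubelet actually talks to, you can read it from the kubelet’s kubeconfig and split it into host and port. This is a sketch under assumptions: /etc/kubernetes/kubelet.conf is the path used on kubeadm-provisioned nodes and may differ on yours, split_server_url is a hypothetical helper, and it does not handle IPv6 literal addresses.

```bash
# Split an API server URL into "host port"; the port defaults to 443.
# split_server_url is a hypothetical helper name. IPv6 literals are not handled.
split_server_url() {
  local url="${1#*://}"   # strip the scheme
  url="${url%%/*}"        # strip any path
  local host="${url%%:*}"
  local port="443"
  case "$url" in
    *:*) port="${url##*:}" ;;
  esac
  echo "$host $port"
}

# Usage on the node (bash; the kubeconfig path is an assumption for kubeadm setups):
#   url=$(sudo awk '$1 == "server:" { print $2 }' /etc/kubernetes/kubelet.conf)
#   read -r host port <<< "$(split_server_url "$url")"
#   curl -k "https://${host}:${port}/version"
```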

Fix 3: Address Resource Pressure

For disk or memory pressure:

# Check disk usage
df -h

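# Remove unused container images (containerd/CRI-O nodes)
sudo crictl rmi --prune

# On nodes that still run Docker, prune unused images and containers instead
docker system prune -a

# Truncate container logs if they're too large (this discards the log contents)
sudo sh -c 'truncate -s 0 /var/log/pods/*/*/*.log'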

# For memory issues, identify high-memory processes
top -o %MEM

Fix 4: Replace Unhealthy Nodes

When nodes are permanently damaged:

# Cordon the node to prevent new scheduling
kubectl cordon worker-node-2

# Drain the node (drain also cordons it; on an unreachable node, consider adding --timeout)
kubectl drain worker-node-2 --ignore-daemonsets --force --delete-emptydir-data

# Remove the node from the cluster
kubectl delete node worker-node-2

# Launch a replacement node and join it to the cluster

Fix 5: Cloud-Specific Solutions

AWS EC2:

# Check instance status
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0

# Start stopped instance
aws ec2 start-instances --instance-ids i-1234567890abcdef0

Google GCP:

# Check VM status
gcloud compute instances describe worker-node-2 --zone=us-central1-a

# Start stopped VM
gcloud compute instances start worker-node-2 --zone=us-central1-a

Azure:

# Check VM status
az vm get-instance-view --name worker-node-2 --resource-group myResourceGroup

# Start stopped VM
az vm start --name worker-node-2 --resource-group myResourceGroup

Troubleshooting Checklist

Use this table to quickly identify and fix common NodeNotReady scenarios:

Error/Symptom | Likely Cause | Quick Fix
Node shows NotReady after reboot | Kubelet service not auto-starting | sudo systemctl enable kubelet && sudo systemctl start kubelet
“connection refused” in kubelet logs | API server unreachable | Check network connectivity and firewall rules
“certificate signed by unknown authority” | Wrong or mismatched CA certificate | Restore the correct cluster CA and regenerate kubelet certificates
DiskPressure condition true | Low disk space | Clean up logs and unused images: sudo crictl rmi --prune (or docker system prune -a on Docker nodes)
MemoryPressure condition true | High memory usage | Identify memory-heavy workloads; restart the node or increase memory if needed
Kubelet logs show “failed to sync node lease” | API server unreachable, or clock skew breaking TLS | Check connectivity, then verify time sync with timedatectl status and resync with your NTP client
Pods stuck in “Terminating” | Node completely unreachable | Force delete: kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
Cloud VM stopped unexpectedly | Instance preempted/stopped | Restart instance via cloud console/CLI

Best Practices for Prevention

Implement these strategies to minimize NotReady incidents:

1. Comprehensive Monitoring

Set up monitoring for node health:

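# Prometheus alert for NodeNotReady (metric comes from kube-state-metrics)
# Testing status="true" == 0 also catches nodes whose Ready condition is Unknown
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0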
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is not ready"

Create Grafana dashboards to visualize:

  • Node resource utilization
  • Kubelet status and errors
  • Network connectivity metrics
  • Node condition changes over time

2. Automated Node Management

Deploy Node Problem Detector to automatically identify node issues:

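apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector  # must match spec.selector
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13
        resources:
          limits:
            cpu: 10m
            memory: 80Mi
          requests:
            cpu: 10m
            memory: 80Mi
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
      # Host directory backing the "log" mount above.
      # See the upstream manifest for the full set of mounts and permissions.
      volumes:
      - name: log
        hostPath:
          path: /var/log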

3. Cluster Autoscaler Configuration

Configure Cluster Autoscaler to add capacity when pods evicted from an unhealthy node cannot be scheduled elsewhere, and to remove nodes that stay unready and unneeded:

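apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler  # must match spec.selector
    spec:
      # A ServiceAccount with the autoscaler's RBAC rules is also required (omitted here)
      containers:
      # Use the Cluster Autoscaler release that matches your cluster's minor version
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws  # or gce, azure
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        # The last tag segment is the cluster name ("kubernetes" in this example)
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kubernetes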

4. Regular Health Checks

Implement automated health checking:

#!/bin/bash
# Daily node health check script
for node in $(kubectl get nodes -o name); do
    node_name=$(basename "$node")
    # Read the Ready condition directly; grepping for "Ready" would also match "NotReady"
    ready=$(kubectl get "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    if [ "$ready" != "True" ]; then
        echo "ALERT: Node $node_name is not Ready"
        # Send notification to Slack/PagerDuty
        curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"Node ${node_name} is NotReady in Kubernetes cluster\"}" \
        YOUR_SLACK_WEBHOOK_URL
    fi
done
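
If you prefer to work from the plain kubectl get nodes output, compare the STATUS column as a whole field, because a substring match on "Ready" also matches "NotReady". The sketch below is one way to do that. The function name not_ready_nodes is hypothetical, and it treats suffixed values such as Ready,SchedulingDisabled as ready.

```bash
# Print the names of nodes whose STATUS column is not Ready.
# Expects `kubectl get nodes --no-headers` output on stdin.
# not_ready_nodes is a hypothetical helper name.
not_ready_nodes() {
  # Column 2 is STATUS; it may carry suffixes such as "Ready,SchedulingDisabled"
  awk '$2 != "Ready" && $2 !~ /^Ready,/ { print $1 }'
}

# Usage (requires kubectl access):
#   kubectl get nodes --no-headers | not_ready_nodes
```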

FAQ: NodeNotReady Kubernetes Issues

What does NodeNotReady mean in Kubernetes?

NotReady means the node’s Ready condition is False or Unknown: either the kubelet is reporting the node as unhealthy, or the control plane has stopped hearing from it. The scheduler stops placing new pods on the node until it recovers.

How do I fix a NodeNotReady node?

To troubleshoot NodeNotReady issues:

1. Run kubectl describe node <node-name> to check conditions
2. SSH to the node and check journalctl -u kubelet -f
3. Restart kubelet: sudo systemctl restart kubelet
4. Verify network connectivity to API server
5. Check for resource pressure (disk/memory)
6. Consider replacing the node if issues persist

Can pods run on a NodeNotReady node?

No, the scheduler does not place new pods on a NotReady node. Existing pods may continue running temporarily, but if the NotReady state persists beyond their toleration period (300 seconds by default), they are evicted and their controllers create replacements on healthy nodes. DaemonSet pods are the exception and are not evicted.

How do I prevent NodeNotReady issues in production?

Prevent NotReady incidents by:

1. Implementing comprehensive monitoring (Prometheus + Grafana)
2. Using Node Problem Detector for early issue detection
3. Configuring Cluster Autoscaler for automatic node replacement
4. Setting up proper resource limits and monitoring
5. Regularly updating and maintaining node configurations

What’s the difference between NodeNotReady and Unknown status?

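In kubectl get nodes, both cases appear as NotReady. The difference shows in the Ready condition under kubectl describe node: False means the kubelet is running but reporting unhealthy conditions, while Unknown indicates complete loss of communication between the node and control plane, often due to network issues or node crashes.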

Conclusion: Mastering NodeNotReady Kubernetes Troubleshooting

NotReady nodes are among the most common challenges DevOps engineers face, but they don’t have to derail your deployments or cause extended downtime. By understanding the root causes, following systematic troubleshooting approaches, and implementing proper monitoring and automation, you can minimize both the frequency and impact of these issues.

Remember that prevention is always better than reaction. Investing time in setting up comprehensive monitoring, automated health checks, and proper cluster autoscaling will pay dividends in reduced operational overhead and improved system reliability.

The next time kubectl get nodes shows NotReady in your terminal, you’ll have the knowledge and tools to quickly diagnose, fix, and prevent similar issues in the future. Keep this guide handy, and consider automating the most common fixes to reduce your mean time to recovery (MTTR).

Ready to take your Kubernetes operations to the next level? Start implementing these monitoring and automation strategies today, and transform NodeNotReady from a crisis into a manageable operational event.

