abhilashthale.tech

# Kubernetes CIDR Overlap Issue on RHEL 9: Debugging and Fix

## Kubernetes Pod-to-ClusterIP Connectivity Failure

```
failed to initialize datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: i/o timeout
```

    by abhilash - March 28, 2026

## Issue Summary

The calico-kube-controllers pod was stuck in CrashLoopBackOff, and pods could not reach the Kubernetes API server via its ClusterIP, 10.96.0.1:443.


## Environment

| Component | Detail |
| --- | --- |
| OS | RHEL 9.6 |
| Kubernetes | v1.29.15 |
| CNI | Calico v3.27.0 |
| Container runtime | containerd 2.2.2 |
| Node IPs | 192.168.241.140/141/142 |
| Pod CIDR (configured) | 192.168.0.0/16 |
| Service CIDR | 10.96.0.0/12 |

## Timeline of Symptoms

1. calico-kube-controllers pod stuck in CrashLoopBackOff.
2. Error: `dial tcp 10.96.0.1:443: i/o timeout`.
3. A test busybox pod on workernode1 also could not reach 10.96.0.1.
4. However, all three nodes could reach 10.96.0.1:443 directly from the host (they got 403 Forbidden, meaning host-level connectivity was fine).
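The key observation is that the same endpoint behaves differently from two vantage points. As a sketch, a minimal TCP probe in Python (standard library only; run it once from a node and again from inside a pod — 10.96.0.1:443 is the ClusterIP from this incident):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers i/o timeouts and connection refusals
        return False

# From the host this returned True (connect succeeded, API answered 403);
# from inside a pod it returned False (i/o timeout):
# print(can_connect("10.96.0.1", 443))
```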

## Investigation Steps

### Step 1 — Ruled out common suspects

- Firewalld — disabled on all nodes ✅
- SELinux — disabled on all nodes ✅
- ip_forward — enabled (`net.ipv4.ip_forward = 1`) on all nodes ✅
- kube-proxy — running on all nodes ✅
- iptables KUBE-SERVICES chain — rules for 10.96.0.1 existed on all nodes ✅
- Nodes reaching 10.96.0.1:443 directly — working on all nodes ✅

### Step 2 — Identified pod-specific failure

Host-to-ClusterIP traffic worked but pod-to-ClusterIP traffic timed out. This pointed to a problem specifically with how pod traffic was being NAT'd through the ClusterIP rules.

### Step 3 — Found the smoking gun

Running this command on workernode1:

```
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -v -n
```

revealed this rule:

```
KUBE-MARK-MASQ  tcp  --  *  *  !192.168.0.0/16  10.96.0.1  tcp dpt:443
```
    
    The `!192.168.0.0/16` means — **only masquerade (SNAT) traffic coming from OUTSIDE 192.168.0.0/16**. Traffic from inside that range is excluded from masquerading.
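That match is plain prefix membership, and it is easy to see why it misfires here. A quick check with Python's standard `ipaddress` module (the IPs are taken from the environment table above):

```python
import ipaddress

pod_cidr = ipaddress.ip_network("192.168.0.0/16")   # Calico's default pod CIDR
pod_ip   = ipaddress.ip_address("192.168.212.4")    # a pod on workernode1
node_ip  = ipaddress.ip_address("192.168.241.140")  # the API server's node

# The KUBE-MARK-MASQ rule only matches sources OUTSIDE the pod CIDR
# (! 192.168.0.0/16), so anything inside it is never SNAT'd:
print(pod_ip in pod_cidr)   # True -> pod traffic is NOT masqueraded
print(node_ip in pod_cidr)  # True -> even the node IPs fall inside the "pod" range
```

Both addresses land inside the same /16, which is the overlap at the heart of this incident.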
    
    ---
    
    ## Root Cause
    
    **Pod CIDR `192.168.0.0/16` overlapped with Node IP range `192.168.241.x`.**
    
    This caused a chain reaction:
    ```
    Pod IP: 192.168.212.4
            ↓
    Sends packet to 10.96.0.1:443
            ↓
    kube-proxy KUBE-SERVICES chain matches → forwards to KUBE-SVC-NPX46M4PTMTKRN6Y
            ↓
    KUBE-MARK-MASQ rule checks source IP:
    192.168.212.4 is INSIDE 192.168.0.0/16
            ↓
    MASQUERADE is SKIPPED ← problem here
            ↓
    Packet reaches API server (192.168.241.140:6443)
    with source IP 192.168.212.4 (pod IP)
            ↓
    API server tries to reply to 192.168.212.4
    but has no route back to that pod IP
            ↓
Connection times out
```

kube-proxy intentionally excludes the pod CIDR from masquerading to avoid unnecessary NAT for pod-to-pod traffic. But when the pod CIDR overlaps with the node network, this optimization breaks pod-to-ClusterIP communication.


## Why Nodes Could Reach 10.96.0.1 But Pods Could Not

| Source | Source IP | In 192.168.0.0/16? | Masqueraded? | Works? |
| --- | --- | --- | --- | --- |
| Node (host) | 192.168.241.x | Yes | No | ✅ Yes — node IP is routable |
| Pod | 192.168.212.4 | Yes | No | ❌ No — pod IP not directly routable to API server |

    Nodes have real routable IPs so replies come back fine even without masquerading. Pods do not — they need SNAT so the reply goes back to the node, which then forwards it to the pod.


## Temporary Fix Applied

Set `masqueradeAll: true` in the kube-proxy ConfigMap:

```yaml
iptables:
  masqueradeAll: true
```

This forces SNAT on all pod-to-ClusterIP traffic regardless of source IP, bypassing the overlap problem. This worked but adds NAT overhead on every pod connection.
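What `masqueradeAll` changes can be modelled as a tiny decision function. This is a sketch of the rule semantics, not kube-proxy's actual code:

```python
import ipaddress

def should_masquerade(src_ip: str, cluster_cidr: str,
                      masquerade_all: bool = False) -> bool:
    """Model kube-proxy's SNAT decision for traffic hitting a ClusterIP."""
    if masquerade_all:
        return True  # SNAT everything, regardless of source
    # Default behavior: only SNAT sources from OUTSIDE the cluster (pod) CIDR
    return ipaddress.ip_address(src_ip) not in ipaddress.ip_network(cluster_cidr)

# The pod from this incident, before and after the workaround:
print(should_masquerade("192.168.212.4", "192.168.0.0/16"))                       # False -> reply is lost
print(should_masquerade("192.168.212.4", "192.168.0.0/16", masquerade_all=True))  # True  -> reply returns via the node
```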

     


## Permanent Fix — Reinstall with Non-Overlapping CIDRs

| Network | Old (broken) | New (correct) |
| --- | --- | --- |
| Pod CIDR | 192.168.0.0/16 | 172.16.0.0/16 |
| Service CIDR | 10.96.0.0/12 | 10.96.0.0/12 |
| Node IPs | 192.168.241.x | 192.168.241.x |

Reinstall command:

```bash
kubeadm init \
  --pod-network-cidr=172.16.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=192.168.241.140
```

With Calico configured to match:

```yaml
- name: CALICO_IPV4POOL_CIDR
  value: "172.16.0.0/16"
```

## Key Lessons Learned

    1. Always ensure pod CIDR, service CIDR, and node IP ranges are non-overlapping before installing Kubernetes. This is the most common mistake in home lab setups.
    2. RHEL 9 uses nf_tables backend for iptables — be aware that some manual iptables commands and older kube-proxy behaviors may not work as expected. Plan for this when setting up Kubernetes on RHEL 9.
    3. Host-to-ClusterIP working does not mean pod-to-ClusterIP works — always test connectivity from inside a pod, not just from the node.
    4. Calico's default 192.168.0.0/16 pod CIDR is just a default, not a requirement — it can and should be changed if your node network uses the same range.
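Lesson 1 can be enforced up front with a preflight check built on `ipaddress.IPv4Network.overlaps`. A sketch — the /24 for the node subnet is an assumption based on the 192.168.241.x addresses above:

```python
import ipaddress
from itertools import combinations

def check_cidrs(**named_cidrs: str) -> list[tuple[str, str]]:
    """Return every pair of named CIDRs that overlap (empty list = safe to install)."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [(a, b) for (a, na), (b, nb) in combinations(nets.items(), 2)
            if na.overlaps(nb)]

# Old (broken) layout: the pod CIDR swallows the node subnet
print(check_cidrs(pod="192.168.0.0/16", svc="10.96.0.0/12", nodes="192.168.241.0/24"))
# -> [('pod', 'nodes')]

# New (correct) layout: no overlaps
print(check_cidrs(pod="172.16.0.0/16", svc="10.96.0.0/12", nodes="192.168.241.0/24"))
# -> []
```

Running something like this before `kubeadm init` would have caught the problem before the first pod ever started.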
