The calico-kube-controllers pod was stuck in CrashLoopBackOff, and pods could not reach the Kubernetes API server via its ClusterIP, 10.96.0.1:443.
| Component | Detail |
|---|---|
| OS | RHEL 9.6 |
| Kubernetes | v1.29.15 |
| CNI | Calico v3.27.0 |
| Container Runtime | containerd 2.2.2 |
| Node IPs | 192.168.241.140/141/142 |
| Pod CIDR (configured) | 192.168.0.0/16 |
| Service CIDR | 10.96.0.0/12 |
## Symptoms

- The calico-kube-controllers pod was stuck in CrashLoopBackOff, its logs showing `dial tcp 10.96.0.1:443: i/o timeout` when connecting to the API server ClusterIP `10.96.0.1`.
- Tested `10.96.0.1:443` directly from the host: got `403 Forbidden`, meaning host-level connectivity was fine ✅
- IP forwarding enabled (`net.ipv4.ip_forward = 1`) on all nodes ✅
- The rules for `10.96.0.1` existed on all nodes ✅
- Connecting to `10.96.0.1:443` directly: working on all nodes ✅

Host-to-ClusterIP worked but pod-to-ClusterIP timed out. This pointed to a problem specifically with how pod traffic was being NAT'd through the ClusterIP rules.
Running this command on workernode1:

```
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -v -n
```

Revealed this rule:

```
KUBE-MARK-MASQ  tcp  --  *  *  !192.168.0.0/16  10.96.0.1  tcp dpt:443
```
The `!192.168.0.0/16` means: **only masquerade (SNAT) traffic coming from outside `192.168.0.0/16`**. Traffic originating inside that range is excluded from masquerading.
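To make the negation concrete, here is a small sketch using only Python's standard-library `ipaddress` module; it mirrors the iptables match to show which source addresses are excluded from SNAT. The pod and node IPs are the ones from this cluster.

```python
import ipaddress

# The CIDR that the KUBE-MARK-MASQ rule negates ("!" means "not from here")
cluster_cidr = ipaddress.ip_network("192.168.0.0/16")

def masquerade_applies(src: str) -> bool:
    """Mirror the iptables match: SNAT only if the source is OUTSIDE the pod CIDR."""
    return ipaddress.ip_address(src) not in cluster_cidr

# A pod IP from this cluster: inside 192.168.0.0/16, so SNAT is skipped
print(masquerade_applies("192.168.212.4"))    # False -> no masquerade

# A node IP: ALSO inside 192.168.0.0/16, because the ranges overlap
print(masquerade_applies("192.168.241.140"))  # False -> no masquerade

# An address outside the range would have been masqueraded
print(masquerade_applies("10.1.2.3"))         # True
```

The key observation: because of the overlap, the node IPs fall inside the pod CIDR too, so the rule cannot distinguish node traffic (which doesn't need SNAT) from pod traffic (which does).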
---
## Root Cause
**Pod CIDR `192.168.0.0/16` overlapped with Node IP range `192.168.241.x`.**
This caused a chain reaction:
```
Pod IP: 192.168.212.4
↓
Sends packet to 10.96.0.1:443
↓
kube-proxy KUBE-SERVICES chain matches → forwards to KUBE-SVC-NPX46M4PTMTKRN6Y
↓
KUBE-MARK-MASQ rule checks source IP:
192.168.212.4 is INSIDE 192.168.0.0/16
↓
MASQUERADE is SKIPPED ← problem here
↓
Packet reaches API server (192.168.241.140:6443)
with source IP 192.168.212.4 (pod IP)
↓
API server tries to reply to 192.168.212.4
but has no route back to that pod IP
↓
Connection times out
```

kube-proxy intentionally excludes the pod CIDR from masquerading to avoid unnecessary NAT for pod-to-pod traffic. But when the pod CIDR overlaps with the node network, this optimization breaks pod-to-ClusterIP communication.
| Source | Source IP | In 192.168.0.0/16? | Masqueraded? | Works? |
|---|---|---|---|---|
| Node (host) | 192.168.241.x | Yes | No | ✅ Yes — node IP is routable |
| Pod | 192.168.212.4 | Yes | No | ❌ No — pod IP not directly routable to API server |
Nodes have real routable IPs so replies come back fine even without masquerading. Pods do not — they need SNAT so the reply goes back to the node, which then forwards it to the pod.
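The table's logic can be sketched as a toy model (the helper names are hypothetical, not kube-proxy code, and the `/24` node subnet is assumed from the `192.168.241.x` addresses): a connection works only if the source was either masqueraded to a node IP or is itself routable from the API server's host.

```python
import ipaddress

CLUSTER_CIDR = ipaddress.ip_network("192.168.0.0/16")
NODE_NETWORK = ipaddress.ip_network("192.168.241.0/24")  # assumed node LAN

def is_masqueraded(src: str) -> bool:
    # kube-proxy's rule: SNAT only sources outside the pod CIDR
    return ipaddress.ip_address(src) not in CLUSTER_CIDR

def is_routable(src: str) -> bool:
    # The API server's host can reply directly only to addresses on the node LAN
    return ipaddress.ip_address(src) in NODE_NETWORK

def connection_works(src: str) -> bool:
    # Reply gets back either via the SNAT'd node address or a directly routable source
    return is_masqueraded(src) or is_routable(src)

print(connection_works("192.168.241.141"))  # node: not masqueraded, but routable -> True
print(connection_works("192.168.212.4"))    # pod: neither masqueraded nor routable -> False
```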
Set `masqueradeAll: true` in the kube-proxy ConfigMap:

```yaml
iptables:
  masqueradeAll: true
```

Then restart the kube-proxy pods so the change takes effect. This forces SNAT on all pod-to-ClusterIP traffic regardless of source IP, bypassing the overlap problem. It worked, but it adds NAT overhead on every pod connection.
| Network | Old (broken) | New (correct) |
|---|---|---|
| Pod CIDR | 192.168.0.0/16 | 172.16.0.0/16 |
| Service CIDR | 10.96.0.0/12 | 10.96.0.0/12 |
| Node IPs | 192.168.241.x | 192.168.241.x |
Reinstall command:

```bash
kubeadm init \
  --pod-network-cidr=172.16.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=192.168.241.140
```
With Calico configured to match:

```yaml
- name: CALICO_IPV4POOL_CIDR
  value: "172.16.0.0/16"
```
The `192.168.0.0/16` pod CIDR is just Calico's default, not a requirement; it can and should be changed if your node network uses the same range.
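A quick way to sanity-check a re-IP plan before running `kubeadm init` is to verify that the candidate pod CIDR does not overlap the node network. A minimal sketch with Python's stdlib `ipaddress` (the `/24` node subnet is assumed from the `192.168.241.x` addresses):

```python
import ipaddress

node_net = ipaddress.ip_network("192.168.241.0/24")  # assumed node LAN

old_pod_cidr = ipaddress.ip_network("192.168.0.0/16")
new_pod_cidr = ipaddress.ip_network("172.16.0.0/16")

print(old_pod_cidr.overlaps(node_net))  # True  -> the broken configuration
print(new_pod_cidr.overlaps(node_net))  # False -> safe to use
```

Running the same check against the Service CIDR (`10.96.0.0/12`) is worthwhile too, since a pod-CIDR/service-CIDR overlap causes equally confusing failures.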