abhilashthale.tech

# Kubernetes CIDR Overlap Issue on RHEL 9: Debugging and Fix

## Kubernetes Pod-to-ClusterIP Connectivity Failure

```
failed to initialize datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: i/o timeout
```

    by abhilash - March 28, 2026

## Issue Summary

The calico-kube-controllers pod was stuck in CrashLoopBackOff, and pods could not reach the Kubernetes API server via its ClusterIP, 10.96.0.1:443.


## Environment

| Component | Detail |
| --- | --- |
| OS | RHEL 9.6 |
| Kubernetes | v1.29.15 |
| CNI | Calico v3.27.0 |
| Container runtime | containerd 2.2.2 |
| Node IPs | 192.168.241.140/141/142 |
| Pod CIDR (configured) | 192.168.0.0/16 |
| Service CIDR | 10.96.0.0/12 |

## Timeline of Symptoms

1. calico-kube-controllers pod stuck in CrashLoopBackOff.
2. Error: `dial tcp 10.96.0.1:443: i/o timeout`.
3. A test busybox pod on workernode1 also could not reach 10.96.0.1.
4. However, all three nodes could reach 10.96.0.1:443 directly from the host (they got 403 Forbidden, meaning host-level connectivity was fine).
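The key observation is that the same endpoint behaves differently from two vantage points. As a sketch, a minimal TCP probe in Python (standard library only; run it once from a node and again from inside a pod — 10.96.0.1:443 is the ClusterIP from this incident):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers i/o timeouts and connection refusals
        return False

# From the host this returned True (connect succeeded, API answered 403);
# from inside a pod it returned False (i/o timeout):
# print(can_connect("10.96.0.1", 443))
```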

## Investigation Steps

### Step 1 — Ruled out common suspects

- Firewalld — disabled on all nodes ✅
- SELinux — disabled on all nodes ✅
- ip_forward — enabled (`net.ipv4.ip_forward = 1`) on all nodes ✅
- kube-proxy — running on all nodes ✅
- iptables KUBE-SERVICES chain — rules for 10.96.0.1 existed on all nodes ✅
- Nodes reaching 10.96.0.1:443 directly — working on all nodes ✅

### Step 2 — Identified pod-specific failure

Host-to-ClusterIP traffic worked but pod-to-ClusterIP traffic timed out. This pointed to a problem specifically with how pod traffic was being NAT'd through the ClusterIP rules.

### Step 3 — Found the smoking gun

Running this command on workernode1:

```
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -v -n
```

revealed this rule:

```
KUBE-MARK-MASQ  tcp  --  *  *  !192.168.0.0/16  10.96.0.1  tcp dpt:443
```
    
    The `!192.168.0.0/16` means — **only masquerade (SNAT) traffic coming from OUTSIDE 192.168.0.0/16**. Traffic from inside that range is excluded from masquerading.
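That match is plain prefix membership, and it is easy to see why it misfires here. A quick check with Python's standard `ipaddress` module (the IPs are taken from the environment table above):

```python
import ipaddress

pod_cidr = ipaddress.ip_network("192.168.0.0/16")   # Calico's default pod CIDR
pod_ip   = ipaddress.ip_address("192.168.212.4")    # a pod on workernode1
node_ip  = ipaddress.ip_address("192.168.241.140")  # the API server's node

# The KUBE-MARK-MASQ rule only matches sources OUTSIDE the pod CIDR
# (! 192.168.0.0/16), so anything inside it is never SNAT'd:
print(pod_ip in pod_cidr)   # True -> pod traffic is NOT masqueraded
print(node_ip in pod_cidr)  # True -> even the node IPs fall inside the "pod" range
```

Both addresses land inside the same /16, which is the overlap at the heart of this incident.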
    
    ---
    
    ## Root Cause
    
    **Pod CIDR `192.168.0.0/16` overlapped with Node IP range `192.168.241.x`.**
    
    This caused a chain reaction:
    ```
    Pod IP: 192.168.212.4
            ↓
    Sends packet to 10.96.0.1:443
            ↓
    kube-proxy KUBE-SERVICES chain matches → forwards to KUBE-SVC-NPX46M4PTMTKRN6Y
            ↓
    KUBE-MARK-MASQ rule checks source IP:
    192.168.212.4 is INSIDE 192.168.0.0/16
            ↓
    MASQUERADE is SKIPPED ← problem here
            ↓
    Packet reaches API server (192.168.241.140:6443)
    with source IP 192.168.212.4 (pod IP)
            ↓
    API server tries to reply to 192.168.212.4
    but has no route back to that pod IP
            ↓
Connection times out
```

kube-proxy intentionally excludes the pod CIDR from masquerading to avoid unnecessary NAT for pod-to-pod traffic. But when the pod CIDR overlaps with the node network, this optimization breaks pod-to-ClusterIP communication.


## Why Nodes Could Reach 10.96.0.1 But Pods Could Not

| Source | Source IP | In 192.168.0.0/16? | Masqueraded? | Works? |
| --- | --- | --- | --- | --- |
| Node (host) | 192.168.241.x | Yes | No | ✅ Yes — node IP is routable |
| Pod | 192.168.212.4 | Yes | No | ❌ No — pod IP not directly routable to API server |

    Nodes have real routable IPs so replies come back fine even without masquerading. Pods do not — they need SNAT so the reply goes back to the node, which then forwards it to the pod.


## Temporary Fix Applied

Set `masqueradeAll: true` in the kube-proxy ConfigMap:

```yaml
iptables:
  masqueradeAll: true
```

This forces SNAT on all pod-to-ClusterIP traffic regardless of source IP, bypassing the overlap problem. This worked but adds NAT overhead on every pod connection.
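What `masqueradeAll` changes can be modelled as a tiny decision function. This is a sketch of the rule semantics, not kube-proxy's actual code:

```python
import ipaddress

def should_masquerade(src_ip: str, cluster_cidr: str,
                      masquerade_all: bool = False) -> bool:
    """Model kube-proxy's SNAT decision for traffic hitting a ClusterIP."""
    if masquerade_all:
        return True  # SNAT everything, regardless of source
    # Default behavior: only SNAT sources from OUTSIDE the cluster (pod) CIDR
    return ipaddress.ip_address(src_ip) not in ipaddress.ip_network(cluster_cidr)

# The pod from this incident, before and after the workaround:
print(should_masquerade("192.168.212.4", "192.168.0.0/16"))                       # False -> reply is lost
print(should_masquerade("192.168.212.4", "192.168.0.0/16", masquerade_all=True))  # True  -> reply returns via the node
```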

     


## Permanent Fix — Reinstall with Non-Overlapping CIDRs

| Network | Old (broken) | New (correct) |
| --- | --- | --- |
| Pod CIDR | 192.168.0.0/16 | 172.16.0.0/16 |
| Service CIDR | 10.96.0.0/12 | 10.96.0.0/12 |
| Node IPs | 192.168.241.x | 192.168.241.x |

Reinstall command:

```bash
kubeadm init \
  --pod-network-cidr=172.16.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=192.168.241.140
```

With Calico configured to match:

```yaml
- name: CALICO_IPV4POOL_CIDR
  value: "172.16.0.0/16"
```

## Key Lessons Learned

    1. Always ensure pod CIDR, service CIDR, and node IP ranges are non-overlapping before installing Kubernetes. This is the most common mistake in home lab setups.
    2. RHEL 9 uses nf_tables backend for iptables — be aware that some manual iptables commands and older kube-proxy behaviors may not work as expected. Plan for this when setting up Kubernetes on RHEL 9.
    3. Host-to-ClusterIP working does not mean pod-to-ClusterIP works — always test connectivity from inside a pod, not just from the node.
    4. Calico's default 192.168.0.0/16 pod CIDR is just a default, not a requirement — it can and should be changed if your node network uses the same range.
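Lesson 1 can be enforced up front with a preflight check built on `ipaddress.IPv4Network.overlaps`. A sketch — the /24 for the node subnet is an assumption based on the 192.168.241.x addresses above:

```python
import ipaddress
from itertools import combinations

def check_cidrs(**named_cidrs: str) -> list[tuple[str, str]]:
    """Return every pair of named CIDRs that overlap (empty list = safe to install)."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [(a, b) for (a, na), (b, nb) in combinations(nets.items(), 2)
            if na.overlaps(nb)]

# Old (broken) layout: the pod CIDR swallows the node subnet
print(check_cidrs(pod="192.168.0.0/16", svc="10.96.0.0/12", nodes="192.168.241.0/24"))
# -> [('pod', 'nodes')]

# New (correct) layout: no overlaps
print(check_cidrs(pod="172.16.0.0/16", svc="10.96.0.0/12", nodes="192.168.241.0/24"))
# -> []
```

Running something like this before `kubeadm init` would have caught the problem before the first pod ever started.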
