Cilium sometimes ends up in a failed state unable to contact K8s API Server #592

playworker · 2024-08-08T10:24:18Z

Summary

Sometimes, when things go wrong with the k8s cluster, Cilium ends up in a failed state and I don't know how to recover it.

The Cilium failure looks a lot like this: cilium/cilium#20679

The Cilium Operator and the DaemonSet pods are trying to contact the K8s API Server but can't. I don't believe the IP address they're trying is correct; it's a 10.x.x.x address.

I've not been able to recover from this error, but I haven't really tried much. Disabling the network using the k8s CLI tool doesn't have any effect.

What Should Happen Instead?

I'm not sure what the underlying issue is. If it is an issue with the API Server IP address being wrong, then I guess that needs to be set correctly somehow.

Reproduction Steps

The most recent time this happened, I had set the containerd_custom_registries setting to a bad value. It included a semicolon in the middle of the string:

juju config k8s containerd_custom_registries='[{"url": "https://hostname";, "host": "host:4567", "username": "user", "password": "pass"}]'

I corrected the setting, but the k8s cluster in Juju ended up in an errored state, and the Cilium Operator and pods ended up in the situation described above. I managed to recover the k8s units in Juju by downgrading the release and then bumping it back up again, but I am unable to recover the Cilium installation to a working state.
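As a side note on the trigger: the stray semicolon makes the value invalid JSON, so a value like this can be checked locally before it is passed to juju config. The following is only a minimal sketch, assuming the setting is meant to hold a JSON list of objects as in the command above. The helper name validate_registries is hypothetical and is not part of Juju or the k8s charm, and the check says nothing about which keys the charm actually accepts.

```python
import json


def validate_registries(raw: str) -> list:
    """Parse a candidate containerd_custom_registries value.

    Hypothetical helper: returns the parsed list, or raises ValueError
    describing where the JSON is malformed.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Report the position so the offending character is easy to find.
        raise ValueError(
            f"not valid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"
        ) from exc
    # Assumption: the value is a list of objects, as in the example in this issue.
    if not isinstance(parsed, list) or not all(isinstance(i, dict) for i in parsed):
        raise ValueError("expected a JSON list of objects")
    return parsed


# The bad value from the reproduction steps (note the ';' after the URL).
bad = '[{"url": "https://hostname";, "host": "host:4567", "username": "user", "password": "pass"}]'
# The same value with the semicolon removed.
good = '[{"url": "https://hostname", "host": "host:4567", "username": "user", "password": "pass"}]'

for label, value in (("bad", bad), ("good", good)):
    try:
        validate_registries(value)
        print(f"{label}: parses as a list of objects")
    except ValueError as err:
        print(f"{label}: rejected ({err})")
```

This only catches syntax problems up front; it does not explain or fix the Cilium failure that followed.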

System information

inspection-report-20240808_102202.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

No response


mateoflorido · 2024-09-02T13:25:34Z

Hello @playworker,
We are aware of this issue and are currently working on a fix. In the meantime, here are a couple of workarounds we've tested that temporarily resolve it:

Run the following command:

/opt/cni/bin/cilium-dbg cleanup --all-state --force

Restart the affected node.

marcofranssen · 2024-09-10T12:08:10Z

I'm experiencing this issue as well on my EKS cluster.

I configured Cilium 1.16.1 as follows:

helm upgrade --install --create-namespace --namespace kube-system cilium cilium/cilium \
        --values cilium-bootstrap-values.yaml \
        --set cluster.id=1 \
        --set cluster.name="$cluster_name" \
        --set eni.iamRole="$cilium_role_arn" \
        --set "serviceAccounts.operator.annotations.eks\.amazonaws\.com/role-arn"="$cilium_role_arn" \
        --set k8sServiceHost="$cluster_api_endpoint"

The important part is k8sServiceHost, which I pointed at my EKS cluster API endpoint.

In my case this results in nodes being continuously destroyed and recreated by Karpenter, which is our autoscaler.

playworker changed the title ~~Cilium sometimes ends up in a failed state~~ Cilium sometimes ends up in a failed state unable to contact K8s API Server Aug 8, 2024

amc94 mentioned this issue Aug 30, 2024

Restart of snap.k8s.kube-apiserver.service causes deployment of pods to fail #642

Open
