ip-reconciler issue with StatefullSet cleanup #182
Comments
maiqueb added a commit to maiqueb/whereabouts that referenced this issue on Mar 10, 2022:
Issue [0] reports an error when a pod associated with a `StatefulSet` whose IPPool is already full is deleted. According to the report, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - so the "new" pod is stuck in the `creating` phase.

[0] - k8snetworkplumbingwg#182

Signed-off-by: Miguel Duarte Barroso <[email protected]>
Merged
maiqueb added a commit to maiqueb/whereabouts that referenced this issue on Mar 11, 2022, with the same commit message.

maiqueb added a commit to maiqueb/whereabouts that referenced this issue on Mar 25, 2022, with the same commit message.

maiqueb added a commit to maiqueb/whereabouts that referenced this issue on Apr 4, 2022, with the same commit message.

maiqueb added a commit to maiqueb/whereabouts that referenced this issue on Apr 12, 2022, with the same commit message.
dougbtv pushed a commit that referenced this issue on Apr 13, 2022:
* build: generate ip pool clientSet/informers/listers

* vendor: update vendor stuff

* build: vendor net-attach-def-client types

* config: look for the whereabouts config file in multiple places

  The reconciler controller will have access to the whereabouts configuration via a mount point. As such, we need a way to specify its path.

* reconcile-loop: requires the IP ranges in normalized format

  The IP reconcile loop also requires the IP ranges in a normalized format; as such, we export it into a function, which will be used in a follow-up commit.

* config: allow IPAM config parsing from a NetConfList

  Currently whereabouts is only able to parse network configurations in the strict [0] format - i.e. it **does not accept** a plugin list [1]. The `ip-control-loop` must recover the full plugin configuration, which may be in the network configuration list format. This commit allows whereabouts to understand both formats. Furthermore, the current CNI release - v1.0.Z - removed support for [0], meaning that only the configuration list format is now supported [2].

  [0] - https://github.com/containernetworking/cni/blob/v0.8.1/SPEC.md#network-configuration
  [1] - https://github.com/containernetworking/cni/blob/v0.8.1/SPEC.md#network-configuration-lists
  [2] - https://github.com/containernetworking/cni/blob/master/SPEC.md#released-versions

* reconcile-loop: add a controller

  Listen for pod deletions, and for every deleted pod, assure its IPs are gone. The rough algorithm goes like this:
  - for every network-status in the pod's annotations:
    - read the associated net-attach-def from the k8s API
    - extract the range from the net-attach-def
    - find the corresponding IP pool
    - look for allocations belonging to the deleted pod
    - delete them using `IPManagement(..., types.Deallocate, ...)`

  All the API reads go through the informer cache, which is kept updated whenever the objects are updated on the API. The dockerfiles are also updated to ship this new binary.

* e2e tests: remove manual cluster reconciliation

  This leaves the `ip-control-loop` as the reconciliation tool.

* unit tests: assure stale IPAllocation cleanup

  This commit adds a unit test checking that pod deletion leads to the cleanup of a stale IP address. It also features automatic provisioning of the controller informer cache with the data present in the fake clientset tracker (the "fake" datastore). This way, users can just create the client with provisioned data, and that will trickle down to the informer cache of the pod controller. Because the `network-attachment-definitions` resource name features dashes, the heuristic function that guesses - yes, guesses; very deterministic... - the name of the resource can't be used [0]. As such, an alternate `newFakeNetAttachDefClient` was needed where it is possible to specify the correct resource name.

  [0] - https://github.com/k8snetworkplumbingwg/network-attachment-definition-client/blob/2fd7267afcc4d48dfe6a8cd756b5a08bd04c2c97/vendor/k8s.io/client-go/testing/fixture.go#L331

* unit tests: move helper funcs to other files

  The helper files are tagged with the `test` build tag, to prevent them from being shipped in the production binary.

* control loop, queueing: use a rate-limiting queue

  Using a queue allows us to re-queue errors.

* control loop: add IPAllocation cleanup related events

  Adds two new events related to garbage collection of whereabouts IP addresses:
  - when an IP address is garbage collected
  - when a cleanup operation fails and is not re-queued

  The former event looks like:
  ```
  116s  Normal   IPAddressGarbageCollected        pod/macvlan1-worker1  successful cleanup of IP address [192.168.2.1] from network whereabouts-conf
  ```

  The latter event looks like:
  ```
  10s   Warning  IPAddressGarbageCollectionFailed  failed to garbage collect addresses for pod default/macvlan1-worker1
  ```

* e2e tests: check out statefulset scenarios

* e2e tests: test different scale up/down order and instance deltas

* ci: test e2e bash scripts last

  These ugly tests do not clean up after themselves; this way, the golang-based tests (which **do** clean up after themselves) will not be impacted by these left-overs.

* ip control loop, unit tests: test negative scenarios

  Check the event thrown when a request is dropped from the queue, and assure that reconciling an allocation is impossible without access to the attachment configuration data.

* e2e tests: test fix for issue #182

  Issue [0] reports an error when a pod associated with a `StatefulSet` whose IPPool is already full is deleted. According to the report, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - so the "new" pod is stuck in the `creating` phase.

  [0] - #182

* ip-control-loop: strip pod before queueing it

  The IP reconcile loop only requires the pod metadata and its network-status annotations to garbage collect the stale IP addresses. As such, we remove the status and spec parameters from the pod before queueing it.

* reconcile-loop: focus on networks w/ whereabouts IPAM type

Signed-off-by: Miguel Duarte Barroso <[email protected]>
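The garbage-collection flow described in the `reconcile-loop: add a controller` message above boils down to: parse the deleted pod's network-status annotation, resolve each network to its IPPool, and release the allocations whose pod reference matches the deleted pod. Below is a minimal, self-contained sketch of that matching step (illustrative Go; the `networkStatus`/`allocation` types and the sample data are assumptions, not the actual whereabouts types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// networkStatus mirrors (a subset of) one entry of the pod's
// k8s.v1.cni.cncf.io/network-status annotation.
type networkStatus struct {
	Name string   `json:"name"`
	IPs  []string `json:"ips,omitempty"`
}

// allocation mirrors (a subset of) one IPPool reservation: the IP, the
// container ID recorded at CNI ADD time, and the "<namespace>/<name>" pod reference.
type allocation struct {
	IP          string
	ContainerID string
	PodRef      string
}

// staleAllocations returns the pool entries that belonged to the deleted pod.
// In the real controller these are then released via
// IPManagement(..., types.Deallocate, ...), as the commit message describes.
func staleAllocations(pool []allocation, podRef string) []allocation {
	var stale []allocation
	for _, a := range pool {
		if a.PodRef == podRef {
			stale = append(stale, a)
		}
	}
	return stale
}

func main() {
	// Abridged network-status annotation of a deleted pod.
	raw := `[{"name":"default/whereabouts-conf","ips":["192.168.2.1"]}]`
	var statuses []networkStatus
	if err := json.Unmarshal([]byte(raw), &statuses); err != nil {
		panic(err)
	}

	pool := []allocation{
		{IP: "192.168.2.1", ContainerID: "abc123", PodRef: "default/macvlan1-worker1"},
	}
	for _, s := range statuses {
		// For each network the pod was attached to, collect its stale reservations.
		fmt.Printf("network %s: stale allocations %v\n", s.Name, staleAllocations(pool, "default/macvlan1-worker1"))
	}
}
```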
nicklesimba pushed a commit to nicklesimba/whereabouts that referenced this issue on Apr 20, 2022:
This is the same squashed commit message as the Apr 13, 2022 commit above, followed by:

Replaced bash e2e test with Golang e2e test (k8snetworkplumbingwg#181)

Fixed code errors causing e2e tests to not compile
Signed-off-by: nicklesimba <[email protected]>

e2e tests: remove manual ip pool reconciliation
Signed-off-by: Miguel Duarte Barroso <[email protected]>

e2e tests: use the default namespace
Signed-off-by: Miguel Duarte Barroso <[email protected]>

e2e tests: read the IPAllocations from the correct namespace

Furthermore, harden the `isIPPoolAllocationsEmpty` function - currently, if the IPPool does not exist, it will be created without allocations. When that happens, the pool is indeed empty.
Signed-off-by: Miguel Duarte Barroso <[email protected]>

Changed variable name for readability (envVars *envVars changed to testConfig *envVars)
Signed-off-by: nicklesimba <[email protected]>
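The `isIPPoolAllocationsEmpty` hardening mentioned above amounts to treating a missing IPPool as an empty one instead of (re)creating it. A rough sketch of that behaviour, under the assumption that the helper is handed some way to fetch the pool's allocations (the `poolGetter` indirection here is purely illustrative, not the real test helper's signature):

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// poolGetter abstracts "fetch the IPPool and return its allocations"; in the
// real test it would be backed by the whereabouts clientset.
type poolGetter func(name string) (map[string]string, error)

// isIPPoolAllocationsEmptySketch treats a missing pool as empty instead of
// creating it without allocations, which is the hardening described above.
func isIPPoolAllocationsEmptySketch(get poolGetter, poolName string) (bool, error) {
	allocations, err := get(poolName)
	if err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // no IPPool object yet means nothing is allocated
		}
		return false, err
	}
	return len(allocations) == 0, nil
}

func main() {
	// Simulate the "pool does not exist" case with a NotFound error.
	notFound := func(name string) (map[string]string, error) {
		return nil, apierrors.NewNotFound(
			schema.GroupResource{Group: "whereabouts.cni.cncf.io", Resource: "ippools"}, name)
	}
	empty, err := isIPPoolAllocationsEmptySketch(notFound, "10.10.0.0-16")
	fmt.Println(empty, err) // true <nil>
}
```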
@maiqueb @dougbtv There is a larger problem with StatefulSets and container IDs. My current PR (use the container ID to match when deleting) only fixes one issue: #180
A short while back we changed isPodAlive() to skip pods in the Pending phase: 55be906
This was because, if the ip-reconciler ran while a pod was being deployed, it cleaned up the reservation: kubelet takes some time before adding the network-status annotation and IPs to the Pod resource. So the Pod resource is temporarily missing the IP, and the reservation would be erroneously marked for cleanup.
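A minimal sketch of that kind of liveness check (not the actual whereabouts isPodAlive(); this only illustrates the "treat Pending pods as alive" rule introduced in 55be906, and the phase handling is simplified):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isPodAliveSketch treats a Pending pod as alive: kubelet may not yet have
// written the network-status annotation/IPs, so reclaiming its reservation
// at this point would be premature.
func isPodAliveSketch(pod *corev1.Pod) bool {
	if pod == nil {
		return false
	}
	switch pod.Status.Phase {
	case corev1.PodPending, corev1.PodRunning:
		return true // still coming up, or actually running: keep its reservation
	default:
		return false // Succeeded/Failed/Unknown: reservation may be reclaimed
	}
}

func main() {
	pending := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}
	fmt.Println(isPodAliveSketch(pending)) // true - cleanup skips this pod's reservation
}
```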
Now there is a new issue when the IP range is full. Suppose there are 3 IPs in the range and 3 Pods exist. Now suppose an STS Pod gets recreated after a node goes down. It hits the ContainerCreating state and stays there because there is no IP left in the range.
When the ip-reconciler runs, it only checks by podRef, and that podRef will match the stuck Pod because it's a StatefulSet Pod recreated with the same name. The Pod that is stuck in ContainerCreating will also be in the Pending phase, as kubelet is still retrying/awaiting the CNI. Therefore the orphaned IP won't get cleaned up, and the new Pod can never come up.
How can we match by the containerID in the IPPool? The problem is that the containerID stored in the reservation is the pause container ID, and that container ID is nowhere to be found in the k8s Pod resource. I couldn't find an easy way to get the mapping (without going onto the actual host, grepping the output of docker ps, etc.). So I'm not sure how the ip-reconciler could achieve this, meaning we are stuck using the podRef... which isn't unique for StatefulSets, since the recreated Pod reuses the same name.
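To make the collision concrete, here is a small sketch of why podRef matching cannot distinguish the stale reservation from the recreated StatefulSet Pod (the `reservation` type and values are illustrative; the only assumption taken from the discussion above is that the reservation records the pause container ID and a namespace/name pod reference):

```go
package main

import "fmt"

// reservation is an illustrative stand-in for an IPPool allocation entry:
// whereabouts records the sandbox (pause) container ID and the pod reference.
type reservation struct {
	IP          string
	ContainerID string // pause container ID: not exposed anywhere in the Pod resource
	PodRef      string // "<namespace>/<name>"
}

func main() {
	// Stale entry left behind by the old StatefulSet pod that was deleted.
	stale := reservation{IP: "192.168.2.3", ContainerID: "pause-of-old-pod", PodRef: "default/web-0"}

	// The StatefulSet recreates the pod under exactly the same name, so the new
	// (Pending) pod's podRef is identical to the stale reservation's podRef.
	newPodRef := "default/web-0"
	fmt.Println("podRef matches the new pod:", stale.PodRef == newPodRef) // true

	// Matching on ContainerID would disambiguate the two, but the reconciler only
	// sees the Pod resource, which never carries the pause container's ID.
}
```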
If we remove the PodPending check, we are back to issue #162.
But this seems less likely to occur than the issue described in this ticket.