get_interfaces_by_mac_on_linux: RuntimeError: duplicate mac found (driver: mlx5_core) #5794

Open
jeremy-oracle opened this issue Oct 7, 2024 · 7 comments
Labels
bug Something isn't working correctly

Comments

@jeremy-oracle

jeremy-oracle commented Oct 7, 2024

Bug report

We are intentionally creating an instance with multiple network interfaces that share the same MAC address, because they are part of the same SR-IOV bond, but cloud-init raises an exception.

Steps to reproduce the problem

Create an instance connected with 2 or more Mellanox CX5 or CX6 SR-IOV virtual functions with the same MAC address. The driver is mlx5_core.

Environment details

  • Cloud-init version: 23.4-7.0.1.el9_4.3
  • Operating System Distribution: Oracle Linux 9.4
  • Cloud provider, platform or installer type: Oracle Cloud

cloud-init logs

[   27.864643] cloud-init[1535]: Cloud-init v. 23.4-7.0.1.el9_4.3 running 'init-local' at Fri, 04 Oct 2024 21:57:35 +0000. Up 27.84 seconds.
[   27.932817] cloud-init[1535]: 2024-10-04 21:57:35,819 - util.py[WARNING]: Failed to parse IMDS network configuration!
[   27.937519] cloud-init[1535]: 2024-10-04 21:57:35,824 - util.py[WARNING]: failed stage init-local
[   27.941049] cloud-init[1535]: failed run of stage init-local
[   27.941601] cloud-init[1535]: ------------------------------------------------------------
[   27.943182] cloud-init[1535]: Traceback (most recent call last):
[   27.943841] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
[   27.945303] cloud-init[1535]:     ret = functor(name, args)
[   27.948436] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 442, in main_init
[   27.949544] cloud-init[1535]:     init.apply_network_config(bring_up=bring_up_interfaces)
[   27.950535] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 997, in apply_network_config
[   27.951721] cloud-init[1535]:     netcfg, src = self._find_networking_config()
[   27.952608] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 936, in _find_networking_config
[   27.954105] cloud-init[1535]:     if self.datasource and hasattr(self.datasource, "network_config"):
[   27.955260] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 273, in network_config
[   27.956327] cloud-init[1535]:     _ensure_netfailover_safe(self._network_config)
[   27.956994] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 96, in _ensure_netfailover_safe
[   27.958135] cloud-init[1535]:     mac_to_name = get_interfaces_by_mac()
[   27.958751] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 897, in get_interfaces_by_mac
[   27.959788] cloud-init[1535]:     return get_interfaces_by_mac_on_linux()
[   27.960406] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 996, in get_interfaces_by_mac_on_linux
[   27.961503] cloud-init[1535]:     raise RuntimeError(msg)
[   27.962022] cloud-init[1535]: RuntimeError: duplicate mac found! both 'ens6' and 'ens5' have mac '00:13:97:6f:3d:9f'.
[   27.962936] cloud-init[1535]: ------------------------------------------------------------

We want cloud-init to ignore the ens5 and ens6 Mellanox SR-IOV virtual function interfaces, because a custom script will configure bonding. What would be the best solution for this within the cloud-init framework?

The current workaround is to attach those SR-IOV interfaces only after the first boot; the problem occurs when they are attached at first boot.
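
For readers unfamiliar with the check, here is a minimal sketch of how a duplicate-MAC guard of this kind behaves. It is a simplified illustration, not cloud-init's actual code: interfaces_by_mac, make_fake_sysfs, and demo are hypothetical helpers, the MAC values are copied from the logs in this thread, and a temporary directory stands in for /sys/class/net so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Dict


def interfaces_by_mac(sys_class_net: Path) -> Dict[str, str]:
    """Map MAC address -> interface name, failing on the first duplicate.

    Hypothetical helper: it only mimics the shape of the check seen in the
    traceback, not cloud-init's real implementation.
    """
    mac_to_name: Dict[str, str] = {}
    for iface_dir in sorted(sys_class_net.iterdir()):
        address_file = iface_dir / "address"
        if not address_file.is_file():
            continue
        mac = address_file.read_text().strip().lower()
        if mac in mac_to_name:
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'."
                % (iface_dir.name, mac_to_name[mac], mac)
            )
        mac_to_name[mac] = iface_dir.name
    return mac_to_name


def make_fake_sysfs(root: Path, interfaces: Dict[str, str]) -> Path:
    """Create <root>/<iface>/address files as a stand-in for sysfs."""
    for name, mac in interfaces.items():
        (root / name).mkdir()
        (root / name / "address").write_text(mac + "\n")
    return root


def demo() -> str:
    """Return the error raised when two interfaces share one MAC."""
    with tempfile.TemporaryDirectory() as tmp:
        root = make_fake_sysfs(
            Path(tmp),
            {
                "ens3": "00:13:97:0e:a9:93",
                "ens5": "00:13:97:6f:3d:9f",
                "ens6": "00:13:97:6f:3d:9f",
            },
        )
        try:
            interfaces_by_mac(root)
        except RuntimeError as err:
            return str(err)
    return "no duplicates"
```

Calling demo() returns the same kind of message as the last line of the traceback, which is why the stage aborts whenever both virtual functions are present during init-local.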

Also, I couldn't run collect-logs because I couldn't log in to the instance since the cloud-init process was stopped by this problem.

Thank you,
Jeremy

@jeremy-oracle added the bug (Something isn't working correctly) and new (An issue that still needs triage) labels Oct 7, 2024
@TheRealFalcon
Member

TheRealFalcon commented Oct 7, 2024

> Create an instance connected with 2 or more Mellanox CX5 or CX6 SR-IOV virtual functions with the same MAC address. The driver is mlx5_core.

Can you provide more information on how to do this? Is there a way I can specify an SR-IOV bonded device at launch time? Is it inherent to a certain instance shape? If you use specific CLI launch args or options in the web interface, that would be helpful.

> what would be the best solution for this within the cloud-init framework?

We currently work around these types of devices in cloud-init on other platforms. We would need to adapt similar code for Oracle's platform.

> I couldn't run collect-logs because I couldn't log in to the instance since the cloud-init process was stopped by this problem.

It'd be very helpful to get access to /var/log/cloud-init.log. Is the serial console an option?

@TheRealFalcon added the incomplete (Action required by submitter) label and removed the new (An issue that still needs triage) label Oct 7, 2024
@jeremy-oracle
Author

As for reproducing this on your side: it is running on a PCA (Private Cloud Appliance) used for development, so it is not yet broadly available. PCA is basically a mini OCI in a rack that customers can purchase to run on-premises with an OCI-compatible API.

Looking at the serial console, we see this during init-local:

[   45.708569] cloud-init[1544]: 2024-10-07 21:49:10,625 - util.py[WARNING]: Failed to parse IMDS network configuration!
[   45.711427] cloud-init[1544]: 2024-10-07 21:49:10,628 - util.py[WARNING]: failed stage init-local
[   45.712641] cloud-init[1544]: failed run of stage init-local
[   45.713359] cloud-init[1544]: ------------------------------------------------------------
[   45.714347] cloud-init[1544]: Traceback (most recent call last):
[   45.715086] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
[   45.716336] cloud-init[1544]:     ret = functor(name, args)
[   45.717017] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 442, in main_init
[   45.718226] cloud-init[1544]:     init.apply_network_config(bring_up=bring_up_interfaces)
[   45.719183] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 997, in apply_network_config
[   45.720464] cloud-init[1544]:     netcfg, src = self._find_networking_config()
[   45.721320] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 936, in _find_networking_config
[   45.722633] cloud-init[1544]:     if self.datasource and hasattr(self.datasource, "network_config"):
[   45.723712] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 273, in network_config
[   45.725106] cloud-init[1544]:     _ensure_netfailover_safe(self._network_config)
[   45.725982] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 96, in _ensure_netfailover_safe
[   45.727467] cloud-init[1544]:     mac_to_name = get_interfaces_by_mac()
[   45.728246] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 897, in get_interfaces_by_mac
[   45.729576] cloud-init[1544]:     return get_interfaces_by_mac_on_linux()
[   45.730364] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 996, in get_interfaces_by_mac_on_linux
[   45.731812] cloud-init[1544]:     raise RuntimeError(msg)
[   45.732467] cloud-init[1544]: RuntimeError: duplicate mac found! both 'ens7' and 'ens8' have mac '00:13:97:87:a1:47'.
[   45.733673] cloud-init[1544]: ------------------------------------------------------------

and later this:

[   76.141036] cloud-init[2164]: Cloud-init v. 23.4-7.0.1.el9_4.3 running 'init' at Mon, 07 Oct 2024 21:49:41 +0000. Up 76.12 seconds.
[   76.155407] cloud-init[2164]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
[   76.156448] cloud-init[2164]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[   76.157451] cloud-init[2164]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
[   76.158443] cloud-init[2164]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[   76.159424] cloud-init[2164]: ci-info: |  ens3  | True |         192.168.0.3         | 255.255.255.0 | global | 00:13:97:0e:a9:93 |
[   76.160404] cloud-init[2164]: ci-info: |  ens3  | True | fe80::213:97ff:fe0e:a993/64 |       .       |  link  | 00:13:97:0e:a9:93 |
[   76.161389] cloud-init[2164]: ci-info: |  ens5  | True |              .              |       .       |   .    | 00:13:97:44:d5:fd |
[   76.162373] cloud-init[2164]: ci-info: |  ens6  | True |              .              |       .       |   .    | 00:13:97:44:d5:fd |
[   76.163374] cloud-init[2164]: ci-info: |  ens7  | True |              .              |       .       |   .    | 00:13:97:87:a1:47 |
[   76.164361] cloud-init[2164]: ci-info: |  ens8  | True |              .              |       .       |   .    | 00:13:97:87:a1:47 |
[   76.165343] cloud-init[2164]: ci-info: |   lo   | True |          127.0.0.1          |   255.0.0.0   |  host  |         .         |
[   76.166333] cloud-init[2164]: ci-info: |   lo   | True |           ::1/128           |       .       |  host  |         .         |
[   76.167313] cloud-init[2164]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[   76.168295] cloud-init[2164]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
[   76.169156] cloud-init[2164]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   76.170016] cloud-init[2164]: ci-info: | Route | Destination |   Gateway   |    Genmask    | Interface | Flags |
[   76.170871] cloud-init[2164]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   76.171723] cloud-init[2164]: ci-info: |   0   |   0.0.0.0   | 192.168.0.1 |    0.0.0.0    |    ens3   |   UG  |
[   76.172581] cloud-init[2164]: ci-info: |   1   | 192.168.0.0 |   0.0.0.0   | 255.255.255.0 |    ens3   |   U   |
[   76.173438] cloud-init[2164]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   76.174290] cloud-init[2164]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   76.175009] cloud-init[2164]: ci-info: +-------+-------------+---------+-----------+-------+
[   76.175743] cloud-init[2164]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   76.176467] cloud-init[2164]: ci-info: +-------+-------------+---------+-----------+-------+
[   76.177189] cloud-init[2164]: ci-info: |   1   |  fe80::/64  |    ::   |    ens3   |   U   |
[   76.177899] cloud-init[2164]: ci-info: |   3   |    local    |    ::   |    ens3   |   U   |
[   76.178618] cloud-init[2164]: ci-info: |   4   |  multicast  |    ::   |    ens3   |   U   |
[   76.179342] cloud-init[2164]: ci-info: |   5   |  multicast  |    ::   |    ens5   |   U   |
[   76.180060] cloud-init[2164]: ci-info: |   6   |  multicast  |    ::   |    ens6   |   U   |
[   76.180789] cloud-init[2164]: ci-info: |   7   |  multicast  |    ::   |    ens7   |   U   |
[   76.181506] cloud-init[2164]: ci-info: |   8   |  multicast  |    ::   |    ens8   |   U   |
[   76.182228] cloud-init[2164]: ci-info: +-------+-------------+---------+-----------+-------+
[   76.224201] cloud-init[2164]: 2024-10-07 21:49:41,141 - util.py[WARNING]: Failed to parse IMDS network configuration!
[   76.227326] cloud-init[2164]: 2024-10-07 21:49:41,144 - util.py[WARNING]: failed stage init
[   76.228539] cloud-init[2164]: failed run of stage init
[   76.229014] cloud-init[2164]: ------------------------------------------------------------
[   76.229726] cloud-init[2164]: Traceback (most recent call last):
[   76.230284] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
[   76.231208] cloud-init[2164]:     ret = functor(name, args)
[   76.231680] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 442, in main_init
[   76.232578] cloud-init[2164]:     init.apply_network_config(bring_up=bring_up_interfaces)
[   76.233276] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 997, in apply_network_config
[   76.234230] cloud-init[2164]:     netcfg, src = self._find_networking_config()
[   76.234843] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 936, in _find_networking_config
[   76.235810] cloud-init[2164]:     if self.datasource and hasattr(self.datasource, "network_config"):
[   76.236573] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 273, in network_config
[   76.237595] cloud-init[2164]:     _ensure_netfailover_safe(self._network_config)
[   76.238226] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 96, in _ensure_netfailover_safe
[   76.239319] cloud-init[2164]:     mac_to_name = get_interfaces_by_mac()
[   76.239873] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 897, in get_interfaces_by_mac
[   76.240866] cloud-init[2164]:     return get_interfaces_by_mac_on_linux()
[   76.241438] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 996, in get_interfaces_by_mac_on_linux
[   76.242507] cloud-init[2164]:     raise RuntimeError(msg)
[   76.242970] cloud-init[2164]: RuntimeError: duplicate mac found! both 'ens7' and 'ens8' have mac '00:13:97:87:a1:47'.
[   76.243868] cloud-init[2164]: ------------------------------------------------------------

Then I disconnected the SR-IOV interfaces, rebooted the instance successfully, and ran sudo cloud-init collect-logs.
See attached: issue-5794_collect-logs_cloud-init.tar.gz

Thank you 🙂

@jeremy-oracle
Author

jeremy-oracle commented Oct 7, 2024

Also, it seems like adding mlx5_core to the tuple here stops the traceback, but doesn't stop cloud-init from configuring IPs on the base interfaces. In this case, those IPs should not be configured, because the interfaces should be members of a bond interface. The bond carries the IP address, not its member interfaces. I am not sure whether cloud-init has a mechanism to designate certain interfaces to be bonded with a certain policy. For that reason I wrote a script to do this, but I want to avoid conflicts with the existing cloud-init automation. 🙂
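
To make the distinction between bond members and standalone interfaces concrete, here is a minimal sketch that groups interfaces by shared MAC. It is not an existing cloud-init mechanism: split_bond_candidates is a hypothetical helper, and treating a shared MAC as a sign of intended bond membership is an assumption specific to this setup. The interface names and MACs are taken from the ci-info table above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_bond_candidates(
    name_to_mac: Dict[str, str],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split interfaces into standalone ones and groups sharing a MAC.

    Hypothetical helper: interfaces that share a MAC are assumed to be
    members of one bond and should not receive their own IP configuration.
    """
    by_mac: Dict[str, List[str]] = defaultdict(list)
    for name, mac in sorted(name_to_mac.items()):
        by_mac[mac.lower()].append(name)

    standalone = [names[0] for names in by_mac.values() if len(names) == 1]
    bond_groups = {mac: names for mac, names in by_mac.items() if len(names) > 1}
    return standalone, bond_groups


# Values copied from the "Net device info" table in the serial console log.
OBSERVED = {
    "ens3": "00:13:97:0e:a9:93",
    "ens5": "00:13:97:44:d5:fd",
    "ens6": "00:13:97:44:d5:fd",
    "ens7": "00:13:97:87:a1:47",
    "ens8": "00:13:97:87:a1:47",
}

standalone, bond_groups = split_bond_candidates(OBSERVED)
```

With this input, only ens3 is left as a standalone interface, while ens5/ens6 and ens7/ens8 form two groups that a bonding script would own.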

@jeremy-oracle
Author

jeremy-oracle commented Oct 7, 2024

A patch like this might resolve this issue.
I did a few reboot tests after doing sudo cloud-init clean --configs network --machine-id and it seemed to be working.

jeremy@jeremy-lx:~/dev/cloud-init$ git diff e10b09be321b81f82f1a2cb3b3724deedfefe9ff
diff --git a/cloudinit/net/__init__.py b/cloudinit/net/__init__.py
index 78b15a47b..dfd02f087 100644
--- a/cloudinit/net/__init__.py
+++ b/cloudinit/net/__init__.py
@@ -971,7 +971,7 @@ def get_interfaces_by_mac_on_linux() -> dict:
             # cloud-init happens to enumerate network interfaces before drivers
             # have fully initialized the leader/subordinate relationships for
             # those devices or switches.
-            if driver in ("fsl_enetc", "mscc_felix", "qmi_wwan"):
+            if driver in ("fsl_enetc", "mscc_felix", "qmi_wwan", "mlx5_core"):
                 LOG.debug(
                     "Ignoring duplicate macs from '%s' and '%s' due to "
                     "driver '%s'.",
diff --git a/tests/unittests/test_net.py b/tests/unittests/test_net.py
index 590061e03..9924a296e 100644
--- a/tests/unittests/test_net.py
+++ b/tests/unittests/test_net.py
@@ -5249,7 +5249,8 @@ class TestGetInterfacesByMac:
         assert expected == result
 
 
-@pytest.mark.parametrize("driver", ("mscc_felix", "fsl_enetc", "qmi_wwan"))
+@pytest.mark.parametrize("driver", ("mscc_felix", "fsl_enetc", "qmi_wwan",
+                                    "mlx5_core"))
 @mock.patch("cloudinit.net.get_sys_class_path")
 @mock.patch("cloudinit.util.system_info", return_value={"variant": "ubuntu"})
 class TestDuplicateMac:

I couldn't push my branch to origin, it seems like I am not allowed 🙂

@TheRealFalcon
Member

> Also, it seems like adding mlx5_core to the tuple here stops the traceback, but doesn't stop cloud-init from configuring IPs on the base interfaces.

Yes, your patch essentially ignores one of the duplicates but configures the other, which is not ideal, as you mention.
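
The effect described here can be sketched as follows. This is a simplified model, not the real get_interfaces_by_mac_on_linux: map_macs is a hypothetical helper, the virtio_net driver for ens3 is a placeholder, and which of the two duplicates ends up in the map (here, the last one seen) is an assumption. The driver tuple is the one from the diff above.

```python
from typing import Dict, Iterable, Tuple

# Driver names from the proposed patch; the surrounding logic is simplified.
TOLERATED_DUPLICATE_DRIVERS = ("fsl_enetc", "mscc_felix", "qmi_wwan", "mlx5_core")


def map_macs(interfaces: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Build a MAC -> name map from (name, mac, driver) tuples.

    Duplicate MACs raise unless the driver is in the tolerated tuple.
    A tolerated duplicate overwrites the earlier entry (assumption), so the
    map still holds exactly one interface per MAC.
    """
    mac_to_name: Dict[str, str] = {}
    for name, mac, driver in interfaces:
        if mac in mac_to_name and driver not in TOLERATED_DUPLICATE_DRIVERS:
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'."
                % (name, mac_to_name[mac], mac)
            )
        mac_to_name[mac] = name
    return mac_to_name


result = map_macs(
    [
        ("ens3", "00:13:97:0e:a9:93", "virtio_net"),  # placeholder driver
        ("ens5", "00:13:97:44:d5:fd", "mlx5_core"),
        ("ens6", "00:13:97:44:d5:fd", "mlx5_core"),
    ]
)
```

The traceback is gone, but the shared MAC still maps to one of the two virtual functions, so that interface remains a candidate for IP configuration instead of being left to the bond.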

We dealt with a similar issue on Azure, where 'mlx5_core' was initially ignored in a similar way, but the handling eventually evolved into this: #2153. That solution doesn't work for you as-is because it targets a different hypervisor, but I'd think a fix could look similar, using the driver name as surfaced in your cloud.

> I couldn't push my branch to origin, it seems like I am not allowed

Correct. If you're looking to submit a PR, you need to fork the repo, push a branch to your remote, and then create a PR against the Canonical main branch.

@jeremy-oracle
Author

Thank you, I will have a look at #2153.

Also, is this bug still incomplete? I still see the incomplete label and couldn't find a way to remove it. My understanding is that no information is missing now 🙂

@TheRealFalcon removed the incomplete (Action required by submitter) label Oct 11, 2024
@TheRealFalcon
Member

Sorry, removed the incomplete label.
