Field note: NSX Edge CPU lockups and Enhanced Datapath on Minisforums MS-A2 MiniPCs

This is a short field note from my VCF lab after a run of odd NSX Edge instability. The symptom pattern was confusing at first: the Edges did not have much tenant traffic on them, but they would still show very high CPU symptoms, management instability, missing tunnel endpoint state, and occasional datapath alarms.

The physical lab hosts in my lab are Minisforum MS-A2 Mini PCs. Across the 11x ESX 9.0.2 hosts, the 10GbE NSX uplinks are Intel Ethernet Controller X710 for 10GbE SFP+ adapters using the i40en driver. Each host also presents an onboard Intel I226-V interface, but the 10GbE X710 adapters are the relevant NICs for the NSX datapath and TEP traffic discussed here.

The useful lesson was that low traffic does not necessarily mean the Edge is healthy. In my case, the better signal was the relationship between host CPU pressure, the Edge datapath, and the physical NIC path underneath it.

The symptom

The NSX Edges would periodically look unhealthy even though the lab was not pushing much real traffic. I saw combinations of:

  • MPA or management connectivity down on an Edge
  • missing VTEP IPs during Edge redeploy or reset attempts
  • high Edge CPU with no obvious workload explanation
  • receive-side packet loss on the Edge fastpath NIC

The most useful NSX Manager alarm was this one:

Edge NIC fp-eth0 receive ring buffer has overflowed by 20.560213% on Edge node 3eb1a3fa-acee-46f3-a21b-ddb9ec129c8d.
The missed packet count is 2158 and processed packet count is 8338.

Many Red Herrings!

My first instinct was to look at the obvious underlay items: DHCP for the host and Edge TEP VLANs, UniFi VLAN behaviour, host placement, and whether some ESXi hosts were worse than others. Those were worth checking, but alas they weren’t the answer.

I tried moving Edges around, rebuilding Edges MANY times, extending DHCP lease timers, moving the management TEPs to static assignment, and adding temporary DRS rules to narrow down the issue. Some of those changes made the environment easier are more conservative design decisions but they were worth trying.

The Fix (…maybe?)

It’s still only been a short time since I regained stability so I can’t say there was a single root cause, because I made several changes in a short space of time. But the Edge behaviour improved materially after this group of fixes:

  • Reduced host CPU contention. I started applying conservative CPU limits to ordinary lab VMs so they could not crowd out the platform appliances. For this lab it runs on AMD Ryzen 9 9955HX 16-Core Processors. Their native clock speed is 2.5Ghz but can burst to twice that amount on a single core. So I was conservative and applied a CPU Limit of around 2 GHz per vCPU on each non-VCF VM, and lower per-vCPU limits for VMs that only need to boot and run light scripts.
  • Enabled Enhanced Datapath for the NSX host and Edge path. The aim was to reduce overhead in the forwarding path and avoid making the Edge fastpath compete unnecessarily with other host work.
  • Updated the Intel i40en driver through the ESXi cluster image. I tested the image change on one host first, then applied it across the remaining management-cluster hosts once the Edge/RX checks stayed clean. The drive I used was the i40en-2.11.1.0 version from Intel.
  • Moved host and Edge TEPs away from DHCP dependence. Static IP pools for TEPs removed one variable from Edge redeploy and transport-node preparation.

Current result

After the changes, the Edges are behaving much better. The latest health checks show all six Edges up across the management and both workload domains, with management, MPA, and VTEP state all healthy. The ESXi RX-buffer log checks are also clean across the monitored hosts for far longer than they’ve ever been!

That does not prove Enhanced Datapath alone fixed the issue. The more defensible conclusion is that the Edges were highly sensitive to host CPU pressure and receive-path overhead, and the combination of CPU control, Enhanced Datapath, NIC driver updates, and static TEP assignment has stopped the failure pattern from recurring so far.

Takeaway

For a small VCF lab, the NSX Edges need to be treated like a crown jewel. They need protected CPU headroom where simple CPU guarantees just aren’t enough. If the Edge receive ring starts overflowing, I would check host CPU pressure, pNIC driver/firmware, Enhanced Datapath support, and RX drop counters before spending too long on tenant traffic volume.