Hello,
I have had a major issue for a long time. The versions mentioned below are the current ones, but it also happened with older vSphere 6 and NSX 6.x versions.
Often (but not always), when I put a host from the "normal cluster" (with NSX) into maintenance mode, DRS vMotions all VMs to other hosts, and one of the NSX Logical Router VMs is on that host and is vMotioned along with the rest, the entire network goes down for 5 to 10 minutes.
From my physical admin workstation, I lose ping to everything. And I mean everything: core switch, distribution switches, edge switches, WiFi access points, all the ESXi servers and all the VMs. Really **everything** becomes unreachable.
After 5 to 10 minutes, everything comes back.
From the tsunami of Alert-emails that flood my mailbox (once my PC's email client re-established connection to the mailserver), it becomes clear that "everybody lost everybody" during the outage.
Environment:
NSX 6.3.2 running in Unicast mode.
vCenter Appliance 6.5 U1
1 x "Management Cluster" without NSX. This is where the NSX Manager and the three NSX Controller VMs live.
1 x "normal cluster". All ESXi 6.5 U1 hosts run NSX, and the two (HA pair) Logical Router VMs run on this cluster.
No NSX Edge appliances other than the LRs.
Anti-affinity rules to ensure the LR VMs don't end up on the same host.
NSX shows no errors before the outage. Everything humming along fine.
Firewalls are non-VMware virtualized firewall appliances (the NSX distributed firewall is not in use at this site).
I spent a lot of time going through logs on all devices, and what it is starting to look like is that spanning tree (RSTP) becomes unstable, effectively killing layer 2, until everything settles down and gradually starts coming back. (I keep a screen open with dozens and dozens of pings running to all kinds of systems and network components.)
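To compare the outage against switch and spanning-tree logs, it may help to turn those running pings into timestamped outage windows per device. Below is a minimal Python sketch of that idea. It assumes the ping results have already been collected as `(timestamp, host, ok)` tuples; the function name `outage_windows`, the host names, and the sample data are all made up for illustration and are not from my environment.

```python
from collections import defaultdict


def outage_windows(samples):
    """Collapse (timestamp, host, ok) ping samples into per-host outage windows.

    Returns {host: [(first_failed_ts, first_recovered_ts_or_None), ...]}.
    """
    windows = defaultdict(list)
    down_since = {}  # host -> timestamp of the first failed ping of the current outage
    for ts, host, ok in sorted(samples):
        if not ok and host not in down_since:
            down_since[host] = ts
        elif ok and host in down_since:
            windows[host].append((down_since.pop(host), ts))
    # Outages still open at the end of the log have no recovery time yet.
    for host, ts in down_since.items():
        windows[host].append((ts, None))
    return dict(windows)


# Hypothetical sample data: timestamps in seconds, invented host names.
samples = [
    (0, "core-switch", True), (0, "esxi-01", True),
    (5, "core-switch", False), (5, "esxi-01", False),
    (10, "core-switch", False), (10, "esxi-01", False),
    (15, "core-switch", True), (15, "esxi-01", False),
    (20, "core-switch", True), (20, "esxi-01", True),
]
print(outage_windows(samples))
```

The start and end of each window can then be lined up against topology-change entries in the switch logs to see which devices drop first and which recover last.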
The pattern I have started noticing is that when it goes wrong (again, not always), it is always when the LR VM that is vMotioned was part of a mass vMotion, when a host is evacuated by DRS because it is entering maintenance mode.
What it looks like: in vCenter, I see all of that ESXi server's VMs in a vMotioning state at various percentages, some already done but most not yet, with the LR VM among the ones that are not done yet. Then the web browser freezes, and on my other monitor, where all those pings are running, the pings to all devices and hosts are dead within a couple of seconds.
Question: Am I doing something wrong that causes such total network outages when an LR VM is vMotioned? It happened twice this evening (we were patching the ESXi hosts), and the LR VMs were vMotioned around a few times. During two of those vMotions, the entire network went down as described.
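To confirm that pattern instead of relying on what I happen to see on screen, the moment the pings die could be checked against the vMotion tasks that were running at that instant. Here is a rough Python sketch of that check. It assumes the vMotion tasks have been exported by hand from vCenter as `(vm_name, start, end)` tuples, with `end` set to `None` for tasks that had not finished; the function name `vmotions_in_flight` and all VM names and times are hypothetical.

```python
def vmotions_in_flight(events, t):
    """Return the VMs whose vMotion task was running at time t.

    events: iterable of (vm_name, start, end); end is None if the task
    had not completed when the data was exported.
    """
    return [
        vm
        for vm, start, end in events
        if start <= t and (end is None or t <= end)
    ]


# Hypothetical export: times in seconds since the host entered maintenance mode.
events = [
    ("app-01", 100, 130),
    ("lr-0", 110, 170),   # stand-in name for a Logical Router VM
    ("db-02", 120, None),
]

outage_start = 150  # e.g. the first failed ping on the monitoring screen
print(vmotions_in_flight(events, outage_start))
```

If the LR VM shows up in that list for every outage, and never for the uneventful host evacuations, that would back up the suspicion that its vMotion is the trigger.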
Does anyone else have such experiences?