Troubleshooting an orphaned NSX-T transport node: “Transport node with IP already exists”

During a recent vSphere upgrade in my staging environment, an ESXi host hit an unrecoverable error that completely destroyed the existing OS installation.

While reinstalling was easy, I couldn’t add the host to NSX again because a metadata entry for it still existed in NSX’s policy set. The installation of the NSX bits failed with the message “Transport Node with ID / IP already exists”, which contained the old ID of the node.

I also got alarms reading “Management Channel to Transport Node down” and “Control Channel to Transport Node down”, both containing the orphaned node’s ID in the description.

When I searched for troubleshooting steps, I ran into a lot of articles and blog posts telling me to “force remove” the node in NSX. That failed, because the node entry visible in the Fabric view in NSX’s settings was already the reinstalled node.


Previous Troubleshooting

In my specific failure scenario I resolved this with these steps:

  • Check whether the node is really orphaned
    • Get the orphaned node’s ID
    • Check via a GET API call to https://nsx-fqdn/api/v1/transport-nodes/ whether the node still exists in the full list of all transport nodes (full-text search the response for the ESXi management IP or the former FQDN). This returned zero results.
    • Check on the ESXi host that all NSX components are fully uninstalled. You can do this by connecting to the ESXi host via SSH and invoking nsxcli. If nsxcli itself fails to start, the NSX bits have probably been removed properly. If it starts, remove them with the command del nsx.
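
The full-text search over the transport node list could be scripted roughly as follows. This is only a minimal sketch: it assumes the list endpoint returns its entries under a top-level "results" key and that the manager accepts HTTP basic authentication. The function names, credentials, and the sample data are hypothetical and only there for illustration.

```python
import base64
import json
import urllib.request


def fetch_transport_nodes(manager, user, password):
    """Fetch the transport node list as parsed JSON (assumes basic auth works)."""
    url = f"https://{manager}/api/v1/transport-nodes/"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_transport_nodes(listing, needles):
    """Return every entry whose serialized JSON contains one of the needles.

    This mimics a plain full-text search, so it does not depend on knowing
    which field holds the IP address or the FQDN.
    """
    needles = [needle.lower() for needle in needles if needle]
    matches = []
    # Assumption: the entries live under the "results" key of the response.
    for node in listing.get("results", []):
        text = json.dumps(node).lower()
        if any(needle in text for needle in needles):
            matches.append(node)
    return matches


# Hypothetical sample data standing in for a real API response.
sample_listing = {
    "results": [
        {"id": "aaaa-1111", "display_name": "esx01.lab.example",
         "node_deployment_info": {"ip_addresses": ["192.0.2.10"]}},
        {"id": "bbbb-2222", "display_name": "esx02.lab.example",
         "node_deployment_info": {"ip_addresses": ["192.0.2.11"]}},
    ]
}

hits = find_transport_nodes(sample_listing, ["192.0.2.10", "esx01.lab.example"])
print([node["id"] for node in hits])
```

Against a live manager you would pass the result of fetch_transport_nodes(...) instead of the sample data. An empty list of hits corresponds to the "zero results" I got above.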

With that I was sure that the node definitely didn’t participate in the datapath, so I went ahead and sent a DELETE API call to https://nsx-fqdn/api/v1/transport-nodes/node_id?force=true&unprepare_host=false

Unfortunately, this returned an HTTP status code of 500, and the error was still visible in the NSX Manager UI.


Resolution

What most articles don’t tell you is that the ID mentioned in the alarm messages is not the same as the one used in the NSX policy set.

To get the “real” ID, the easiest solution is to use the Global Search feature in the NSX UI. I searched for the former hostname of the node and found an entry for a transport node with the exact same hostname. When I clicked on the search result, the UI led me to a page where the node wasn’t present. Expanding the search result, however, shows the ID.

I then checked the transport node list again via a GET API call to https://nsx-fqdn/api/v1/transport-nodes/. The newly discovered ID wasn’t in it, so I was sure that this was the right entry now.

So I again went ahead and sent a DELETE API call to https://nsx-fqdn/api/v1/transport-nodes/search_result_node_id?force=true&unprepare_host=false, which this time returned an empty result with an HTTP status code of 200.
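
For reference, the forced delete can be built like this in a script. It is a sketch under assumptions, not the only way to do it: it assumes HTTP basic authentication, and the helper name, the credentials, and the placeholder node ID are hypothetical. The request is only constructed here, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_force_delete(manager, node_id, user, password):
    """Build (but do not send) the forced DELETE request for a transport node."""
    # Same query parameters as in the URL above.
    query = urllib.parse.urlencode({"force": "true", "unprepare_host": "false"})
    path_id = urllib.parse.quote(node_id, safe="")
    url = f"https://{manager}/api/v1/transport-nodes/{path_id}?{query}"
    # Assumption: the manager accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url, method="DELETE", headers={"Authorization": f"Basic {token}"}
    )


# Placeholder values; replace them with the manager FQDN and the ID from Global Search.
request = build_force_delete("nsx-fqdn", "search_result_node_id", "admin", "secret")
print(request.get_method(), request.full_url)

# To actually send it (urlopen raises urllib.error.HTTPError on a 4xx/5xx status):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping the request construction separate from sending it makes it easy to double-check the target ID before anything is deleted.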

Right after I removed the old entry, the installation of the NSX bits on the freshly installed ESXi host went through and succeeded.