Connection collision and IP re-usage in case of NSMgr and NSC failover

**Describe the bug**
Track an NSM issue.

This issue occurs when a TAPA process (an NSC) is force-killed (SIGKILL) while its local NSMgr Pod is being deleted at the same time. In this scenario, neither the newly started TAPA process nor the new NSMgr instance retains any knowledge of the previous NSM connection associated with the killed TAPA.

Consequently, TAPA requests a new NSM connection, reusing the same Connection_ID and Interface_Name. The local vpp-forwarder handles this new request as a separate connection.

The following sequence of events occurs:
- The new connection acquires the same IP addresses from the IPAM.
- The collocated Proxy (NSE) will then present two NSM interfaces sharing the same IP addresses.
- Within TAPA, the new connection effectively replaces the old connection's interface by overwriting the existing interface with the same name.
- Approximately 10 minutes after the last successful refresh of the old NSM connection, the vpp-forwarder closes the stale entry. As part of this close procedure, the Proxy releases the IPs that are still actively used by the new connection. This premature release marks the IPs as free, leaving the system vulnerable to duplicate IP assignments until the new connection's next successful refresh. At the same time, TAPA's policy routing for the old connection is removed, breaking further communication.
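
The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not NSM or Meridio code: `ToyIpam`, `simulate`, the connection IDs and the addresses are all hypothetical, and the IPAM is assumed to key allocations by connection ID alone, which is one way the described behavior could arise.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIpam:
    """Hypothetical IPAM that keys allocations by connection ID only."""
    pool: list
    by_conn_id: dict = field(default_factory=dict)

    def allocate(self, conn_id):
        # A repeated request with the same connection ID gets the same address.
        if conn_id not in self.by_conn_id:
            self.by_conn_id[conn_id] = self.pool.pop(0)
        return self.by_conn_id[conn_id]

    def release(self, conn_id):
        # The release cannot tell a stale connection from its live
        # replacement, because both share the connection ID.
        ip = self.by_conn_id.pop(conn_id, None)
        if ip is not None:
            self.pool.insert(0, ip)
        return ip


def simulate():
    ipam = ToyIpam(pool=["172.16.0.1", "172.16.0.2"])
    forwarder_entries = []  # assumed: one entry per request, not per ID

    # Old connection, established before the failover.
    old_ip = ipam.allocate("tapa-conn")
    forwarder_entries.append({"id": "tapa-conn", "ip": old_ip, "stale": False})

    # TAPA is killed and NSMgr is deleted; no Close is sent for the old entry.
    forwarder_entries[0]["stale"] = True
    new_ip = ipam.allocate("tapa-conn")
    forwarder_entries.append({"id": "tapa-conn", "ip": new_ip, "stale": False})

    # The stale entry expires and its close releases the shared address.
    released = ipam.release("tapa-conn")
    forwarder_entries = [e for e in forwarder_entries if not e["stale"]]

    # Another client can now be handed the address still in use.
    other_ip = ipam.allocate("other-conn")
    return old_ip, new_ip, released, other_ip
```

In this model the old and new connections receive the same address, the expiry of the stale entry releases it, and an unrelated connection is then given the address that the new connection is still using.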

**To Reproduce**
1. Deploy a Trench with one Conduit, Stream, Flow, VIP and an Attractor with a single replica.
2. Deploy the example-target deployment with a single replica and open the Stream.
3. Force-kill the TAPA and, at the same time, delete the local NSMgr Pod.
4. Make sure to open the Stream again once the TAPA process has restarted.

**Expected behavior**
By design, NSM keeps track of connections and thus should not allow two interfering connections to co-exist.
Cleaning up when the stale connection expires is not a solution, because of the IP allocation issues described above.
Instead, the vpp-forwarder would probably need to look up and close the stale connection when the request for the new connection is received. The triplet of NSC path ID, Network Service and NSE could be used to identify connections for this feature; the mechanisms could possibly be considered as well.
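
A minimal sketch of that lookup, assuming a connection can be identified by the (NSC path ID, Network Service, NSE) triplet. Nothing here is an existing vpp-forwarder API: `ForwarderTable`, `Connection`, the `token` field used to tell a refresh from a new request, and the `close` callback are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Assumed identity of a connection: (NSC path ID, Network Service, NSE name).
Key = Tuple[str, str, str]


@dataclass(frozen=True)
class Connection:
    key: Key
    token: str  # hypothetical per-request identity; a refresh keeps its token


class ForwarderTable:
    """Toy connection table that closes a stale entry on a colliding request."""

    def __init__(self, close: Callable[[Connection], None]):
        self._active: Dict[Key, Connection] = {}
        self._close = close

    def request(self, conn: Connection) -> Optional[Connection]:
        previous = self._active.get(conn.key)
        closed = None
        if previous is not None and previous.token != conn.token:
            # Same triplet but a different request: treat the old entry as
            # stale and close it before the new connection is registered.
            self._close(previous)
            closed = previous
        self._active[conn.key] = conn
        return closed
```

In this toy model the stale entry is closed before the new one is registered, so any release of the old addresses would happen before the new request is processed rather than about 10 minutes later.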

**Context**
 - Network Service Mesh: v1.14.5
 - Meridio: v1.2.2



Connection collision and IP re-usage in case of NSMgr and NSC failover #584
