-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Revert l3vni optimization #19344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Revert l3vni optimization #19344
Conversation
raja-rajasekar
commented
Aug 5, 2025
This reverts commit bd32706.
2ee2705
to
d72e275
Compare
…sing" This reverts commit 0f2cb27. There is an issue in this L3vni optimization design. With the current optimization, we have two issues. 1) Overwriting the states (ADD/Withdraw) leading to inconsistencies between Zebra and BGP. 2) Post processing handling in the delete - When BGP receives a L3VNI add * Pre-processes (update RDs, tunnel ip etc..) * Schedule the remote route processing for global walk * No post processing - However, in case of L3VNI del * No pre-processing * Schedules the remote route processing for global walk * Post-processing: handle the delete/memset portion of code Imagine the below situation T0: L3VNI Up in zebra T1: L3VNI added in BGP via Zebra T2: BGP imports type-5 route in the tenant VRF for this L3VNI based on import RTs T3: VRF routes runs best path and programmed in zebra T4: Before Zebra gets a chance to process this routes L3VNI is oper down (due to SVI down or some other reason) T5: When the route does get processed in zebra, nexthop_active marks this as in-active based on the DVNI/SVD check T6: L3VNI oper down goes down to BGP T7: L3VNI comes up, this trigger masks the l3vni down in BGP T8: BGP does not unimport/import the routes again and zebra is in stale state The right fix for this is 1) DO NOT overwrite the updates (ADD/DELETE) 2) Queue the entire handling i.e. pre-processing + remote-route + post-processing for later and not chunks of it. Note: L2VNI optimization still remains intact (zebra cleansup the state + no post processing). This issues is specific to L3vni optimization only because of post-processing in the DELETE handling. Signed-off-by: Rajasekar Raja <[email protected]>
d72e275
to
04ecfcb
Compare
@raja-rajasekar how is this critical, because I see it's already in 10.4, and 10.3. So we need urgent bugfix releases? |
@Mergifyio backport stable/10.4 stable/10.3 |
🟠 Waiting for conditions to match
|
In our internal testing, we did a quick flap of an L3vni where this was exposed(Meaning that the ADD-DEL-ADD is in queue).(kind of timing/race condition). Assuming it had been in upstream for almost a year kind of hints that no one has tried such quick flaps. The right way is to queue the entire pre/remote/post processing (altogether) for later processing and not chunks of it in add and delete.. |