Encourage parallel replace_string_in_file calls #2770
Unanswered · yemohyleyemohyle asked this question in Extension Development QnA
Here we try to encourage the Sonnet 4 model to call the replace_string_in_file tool in parallel when multiple changes are needed at the same time, for example when the same file needs to be changed in multiple locations, or when new functionality is added along with a unit test for it, among others.
Roughly a third of all replace_string edits are affected, based on a swebench run. Below we look at the counts of consecutive replace_string calls by length (we group consecutive replace_string calls together and measure the length of these blocks in terms of the number of distinct turns).
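The grouping described above can be sketched as follows. This is an illustrative Python snippet, not the actual evaluation code; the representation of a trajectory as a list of per-turn tool-name sets is an assumption:

```python
from collections import Counter

def consecutive_rs_block_lengths(turns):
    """Given a trajectory as a list of turns, where each turn is the set of
    tool names called in that model response, count the lengths of maximal
    runs of consecutive turns that include a replace_string_in_file call."""
    lengths = []
    run = 0
    for tools in turns:
        if "replace_string_in_file" in tools:
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return Counter(lengths)

# Example: three consecutive editing turns, one read turn, two more edits.
turns = [
    {"replace_string_in_file"},
    {"replace_string_in_file"},
    {"replace_string_in_file", "read_file"},
    {"read_file"},
    {"replace_string_in_file"},
    {"replace_string_in_file"},
]
print(consecutive_rs_block_lengths(turns))  # Counter({3: 1, 2: 1})
```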
We consider and compare four different solutions:
1. Add encouraging clauses to the system message, the initial user message, and the tool description (parallel_rs_no_tool_reminder).
2. In addition to 1, append a just-in-time reminder to the last tool call before a possible replace_string_in_file. To predict a possible replace_string_in_file call, we add a next_tool_prediction parameter to each tool, asking the model to provide a short list of possible follow-up tools. When the list contains replace_string_in_file, we insert a reminder at the end of the tool output (parallel_rs_tool_reminder).
https://github.com/yemohyleyemohyle/vscode-copilot-chat/tree/yemohyle/parallel_replace_string
3. Same as 2, except that we add the reminder as a user message instead of attaching it to the last tool output (parallel_rs_user_reminder).
https://github.com/yemohyleyemohyle/vscode-copilot-chat/tree/yemohyle/parallel_replace_string_user_reminder
4. Introduce a new multi_replace_string tool that accepts an array of edits, each shaped like the replace_string_in_file parameters, instead of a single edit. The hope is that if the model is reluctant to call more than one tool at a time, having a batch tool will free it to parallelize more (parallel_rs_multi_rs).
https://github.com/yemohyleyemohyle/vscode-copilot-chat/tree/yemohyle/multi_replace_string
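The just-in-time mechanism in strategy 2 can be sketched as follows. The function name, the reminder wording, and the exact wiring are illustrative assumptions, not code from the linked branches:

```python
REMINDER = (
    "Reminder: if you are about to make several independent edits, "
    "call replace_string_in_file multiple times in parallel."
)

def attach_reminder(tool_output: str, next_tool_prediction: list[str]) -> str:
    """Strategy 2 (parallel_rs_tool_reminder): every tool carries an extra
    next_tool_prediction parameter listing likely follow-up tools. When the
    model predicts an upcoming replace_string_in_file call, we append a
    just-in-time reminder to the tool output it is about to read."""
    if "replace_string_in_file" in next_tool_prediction:
        return tool_output + "\n\n" + REMINDER
    return tool_output

# Only predicted-edit turns get the reminder; other turns pass through.
out = attach_reminder("<file contents>", ["replace_string_in_file", "run_tests"])
print(out.endswith(REMINDER))  # True
```

Strategy 3 differs only in delivery: the same reminder text is inserted as a separate user message rather than appended to the tool output.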
We see the following results regarding replace_string parallelization when we try these four approaches (based on swebench runs):
In the table above, the parallel_rs_turns column refers to distinct model responses containing multiple rs calls, and parallel_rs_calls refers to distinct rs calls that had rs siblings.
In terms of pass rate of the trajectories we see the following:

*The pass rate is strongly affected by the model endpoint error rate, which varies greatly across these four runs, so the rates should not be compared directly (on average, the longer a trajectory is allowed to run, the higher its pass rate, and model endpoint errors terminate trajectories prematurely):

In approaches 2 and 3 we ask the model to predict whether it plans to edit next, and if it does, we insert an instruction to edit in parallel. We can therefore measure the instruction-following rate of such an approach as the recall of cases where the model predicted replace_string use and followed up with consecutive replace_string calls:
parallel_rs_tool_reminder
qualified for parallel replace_string count: 274
followed instruction proportion: 0.0620
parallel_rs_user_reminder
qualified for parallel replace_string count: 366
followed instruction proportion: 0.5546
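The followed-instruction proportion above can be computed roughly as below. This is a sketch under assumptions: "qualified" means a reminder was inserted because the model predicted a replace_string_in_file call, and "followed" means the next response contained more than one rs call:

```python
def followed_instruction_proportion(cases):
    """`cases` is one int per qualified reminder: the number of
    replace_string_in_file calls in the model response that followed it.
    Returns (qualified count, proportion of cases with parallel calls)."""
    qualified = len(cases)
    followed = sum(1 for n in cases if n > 1)
    return qualified, (followed / qualified if qualified else 0.0)

# Example: five reminders, two of which were followed by parallel edits.
qualified, prop = followed_instruction_proportion([1, 3, 1, 2, 1])
print(qualified, round(prop, 4))  # 5 0.4
```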
The largest effect was achieved by the most invasive strategy, 3, which uses a just-in-time reminder formulated as a user message. The other strategies have a very marginal effect.
Now, if we substitute the replace_string-specific parallel reminder with a general parallel reminder, we see a similar effect on parallel replace_string usage, along with an overall increase in tool parallelization:
And an even higher rate of following the (more general, in this case) parallelize-replace_string instruction:
parallel_rs_user_reminder
qualified for parallel replace_string count: 366
followed instruction proportion: 0.5546
parallel_general_user_reminder
qualified for parallel replace_string count: 309
followed instruction proportion: 0.6375
It seems that general parallelization may be the more effective approach, complemented by a multiedit tool for cases where multiple small edits need to be applied to a file (adding a parameter to a frequently called function, ...), which we do not see much of in swebench.
Now, if we look at the refactorbench benchmark, we see much higher use of parallelization across all of the strategies; the multi_replace_string tool performs roughly half of the edits:

Introducing a special parallel execution tool increases the parallelization rate significantly even without a user request, but still less than a user request does. And the more intrusive the (just-in-time) user request is, the higher the parallelization rate.
Now there are two options for the multi_replace_string implementation. One is to restrict it to a single file, so that the parameters look like
{"file": ..., "edits": [{"old_str": ..., "new_str": ...}, {"old_str": ..., "new_str": ...}, ...]}
The other is to not restrict it to a single file, so that the parameters look like
{"edits": [{"file": ..., "old_str": ..., "new_str": ...}, {"file": ..., "old_str": ..., "new_str": ...}, ...]}
There are examples that justify the second approach (not restricting to a single file), where related edits are scattered across multiple files:
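A minimal sketch of how the unrestricted (multi-file) variant could apply a batch of edits. The parameter names mirror the JSON shapes above, but the implementation itself is hypothetical, including the exactly-once matching rule assumed to mirror replace_string_in_file semantics:

```python
def apply_multi_replace_string(files: dict[str, str], edits: list[dict]) -> dict[str, str]:
    """Apply a batch of {file, old_str, new_str} edits (the second schema
    above) to an in-memory map of file path -> contents. Each old_str must
    occur exactly once in its target file."""
    result = dict(files)
    for edit in edits:
        path, old, new = edit["file"], edit["old_str"], edit["new_str"]
        if result[path].count(old) != 1:
            raise ValueError(f"old_str must match exactly once in {path}")
        result[path] = result[path].replace(old, new)
    return result

# Related edits scattered across two files: change a signature and its call site.
files = {"a.py": "def f(x): return x", "b.py": "f(1)"}
edits = [
    {"file": "a.py", "old_str": "f(x)", "new_str": "f(x, y=0)"},
    {"file": "b.py", "old_str": "f(1)", "new_str": "f(1, y=2)"},
]
print(apply_multi_replace_string(files, edits))
```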
Introducing two more strategies to the mix: one is just a general (not only replace_string) request to parallelize, the other is multiedit plus a general request to parallelize:

It seems that gentle parallelization plus multi-task tools provide better efficiency (trajectory cost) than the other approaches by reducing the number of agent turns (compared to no parallelization, main) while also reducing the overall number of tool calls compared to aggressive parallelization (parallel_rs_user_reminder).
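The efficiency comparison above boils down to two counts per trajectory, which can be computed like so (illustrative sketch; the trajectory shape is an assumption):

```python
def trajectory_cost(turns: list[list[str]]) -> tuple[int, int]:
    """A trajectory is a list of model turns, each a list of tool calls.
    Parallelization packs calls into fewer turns (lowering turn count),
    while overly aggressive parallelization can inflate total call count."""
    n_turns = len(turns)
    n_calls = sum(len(t) for t in turns)
    return n_turns, n_calls

# Same four edits, sequential vs. parallelized into two turns.
sequential = [["rs"], ["rs"], ["rs"], ["rs"]]
parallel = [["rs", "rs"], ["rs", "rs"]]
print(trajectory_cost(sequential), trajectory_cost(parallel))  # (4, 4) (2, 4)
```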