Hi! If I understand ART correctly, an LLM evaluates multi-agent trajectories and the reward values are derived from those evaluations.
So I'd like to suggest a paper I saw recently that seems related to this.
"Black-Box Prompt Optimization: Aligning Large Language Models without Model Training"
The paper's method requires no training of the target model and achieves its improvements purely by optimizing the prompts it receives.
"Black-Box Prompt Optimization" Summary
This method improves prompts without training the target model, using only interactions with an LLM:

1. Given a user prompt, the LLM generates two responses, and the user selects the better one.
2. The LLM is asked to explain why the worse answer is bad and to rewrite the prompt to fix that issue.
3. The resulting (original_prompt, optimized_prompt) pairs are collected to train a prompt preference optimizer.

🔧 Loss: maximize the log-probability of the optimized prompt tokens given the original prompt.
→ This enables alignment without fine-tuning the target model, similar in spirit to ART's RULER approach (rough sketch below).
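For concreteness, here is a minimal sketch of that pipeline. It is not the paper's actual code: `llm()` and `better_of()` are hypothetical placeholders for the generation call and the preference signal, and the loss function assumes a Hugging Face-style causal LM whose `labels` ignore positions set to `-100`.

```python
import torch


def llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for whatever chat/completion API is used (hypothetical)."""
    raise NotImplementedError


def better_of(prompt: str, a: str, b: str) -> str:
    """Placeholder for the preference signal, e.g. a user picking the better response."""
    raise NotImplementedError


def collect_pair(user_prompt: str) -> tuple[str, str]:
    # 1. Generate two responses to the same prompt.
    resp_a = llm(user_prompt)
    resp_b = llm(user_prompt)

    # 2. Keep the preferred response and the rejected one.
    chosen = better_of(user_prompt, resp_a, resp_b)
    rejected = resp_b if chosen == resp_a else resp_a

    # 3. Ask the LLM why the rejected answer is worse, then rewrite the prompt
    #    so it steers the model toward the chosen answer.
    critique = llm(
        f"Prompt: {user_prompt}\nBetter answer: {chosen}\nWorse answer: {rejected}\n"
        "Briefly explain what the worse answer is missing."
    )
    optimized_prompt = llm(
        f"Original prompt: {user_prompt}\nCritique of a bad answer: {critique}\n"
        "Rewrite the prompt so the model avoids this issue. Return only the new prompt."
    )

    # One (original_prompt, optimized_prompt) pair for training the prompt optimizer.
    return user_prompt, optimized_prompt


def prompt_optimizer_loss(model, tokenizer, original_prompt: str, optimized_prompt: str):
    """Maximize log p(optimized_prompt | original_prompt), i.e. minimize the NLL
    of the optimized prompt tokens while masking out the conditioning tokens."""
    prompt_ids = tokenizer(original_prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(
        optimized_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # only optimized-prompt tokens count in the loss
    return model(input_ids=input_ids, labels=labels).loss
```

The `-100` masking is just the standard way to restrict the cross-entropy loss to the target span, which matches the "maximize log-probability of the optimized prompt tokens" objective above.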
In my view, this approach could help produce more precise reward scores.