fastchat/llm_judge/README.md (2 additions, 2 deletions)
@@ -10,7 +10,7 @@ To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges
 - [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
 - [MT-Bench](#mt-bench)
 - [Agreement Computation](#agreement-computation)
-- [Dataset](#dataset)
+- [Datasets](#datasets)
 - [Citation](#citation)
 
 ## Install
@@ -133,7 +133,7 @@ We released 3.3K human annotations for model responses generated by 6 models in
 
 This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing) shows how to compute the agreement between humans and GPT-4 judge with the dataset. Our results show that humans and GPT-4 judge achieve over 80\% agreement, the same level of agreement as between humans.
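
For context on the agreement metric referenced in that hunk: agreement here is the fraction of shared questions on which two judges pick the same verdict. The snippet below is a minimal sketch of that idea, not the notebook's actual code; the dict-of-verdicts format and the names `human` and `gpt4` are assumptions made for illustration.

```python
def agreement_rate(judgments_a: dict, judgments_b: dict) -> float:
    """Fraction of shared questions where two judges give the same verdict.

    Each argument maps a question id to a verdict string such as
    "model_a", "model_b", or "tie" (hypothetical format, for illustration).
    """
    shared = judgments_a.keys() & judgments_b.keys()
    if not shared:
        return 0.0
    matches = sum(judgments_a[q] == judgments_b[q] for q in shared)
    return matches / len(shared)

# Toy example: the two judges agree on 2 of 3 shared questions.
human = {"q1": "model_a", "q2": "tie", "q3": "model_b"}
gpt4 = {"q1": "model_a", "q2": "model_a", "q3": "model_b"}
print(f"agreement: {agreement_rate(human, gpt4):.2f}")  # 0.67
```

The 80\% figure the README reports is this kind of rate, computed over the released MT-bench human annotations rather than toy data.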