Commit c4407f3

Merge pull request #126 from e06084/dev
feat: add hallucination detection with GPT and HHEM-2.1-Open
2 parents 5991ead + 444954a commit c4407f3

28 files changed (+1973, -79 lines)

README.md

Lines changed: 13 additions & 2 deletions
@@ -215,21 +215,29 @@ input_data = {

 You can customize these prompts to focus on specific quality dimensions or to adapt to particular domain requirements. When combined with appropriate LLM models, these prompts enable comprehensive evaluation of data quality across multiple dimensions.

+### Hallucination Detection & RAG System Evaluation
+
+For detailed guidance on using Dingo's hallucination detection capabilities, including HHEM-2.1-Open local inference and LLM-based evaluation:
+
+📖 **[View Hallucination Detection Guide →](docs/hallucination_guide.md)**
+
 # Rule Groups

 Dingo provides pre-configured rule groups for different types of datasets:

 | Group | Use Case | Example Rules |
 |-------|----------|---------------|
 | `default` | General text quality | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`, etc. |
-| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleLineStartWithBulletpoint` |
+| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleHallucinationHHEM` for hallucination detection |
+| `rag` | RAG system evaluation | `RuleHallucinationHHEM`, `PromptHallucination` for response consistency |
+| `hallucination` | Hallucination detection | `PromptHallucination` with LLM-based evaluation |
 | `pretrain` | Pre-training datasets | Comprehensive set of 20+ rules including `RuleAlphaWords`, `RuleCapitalWords`, etc. |

 To use a specific rule group:

 ```python
 input_data = {
-    "eval_group": "sft", # Use "default", "sft", or "pretrain"
+    "eval_group": "sft", # Use "default", "sft", "rag", "hallucination", or "pretrain"
     # other parameters...
 }
 ```
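As an illustration of the new groups, a configuration for RAG evaluation might look like the sketch below. Only `eval_group` and the `column_*` option names come from this commit. The column values (`question`, `answer`, `retrieved_docs`) are hypothetical field names for an example dataset, and the other parameters a run needs are omitted.

```python
# Hypothetical configuration sketch: selects the "rag" rule group and names
# the dataset fields that hold the prompt, the response, and the context.
input_data = {
    "eval_group": "rag",                 # one of the groups listed in the table
    "column_prompt": "question",         # hypothetical field name
    "column_content": "answer",          # hypothetical field name
    "column_context": "retrieved_docs",  # hypothetical field name
    # other parameters...
}

# Group names taken from the README comment in this hunk.
VALID_GROUPS = {"default", "sft", "rag", "hallucination", "pretrain"}
assert input_data["eval_group"] in VALID_GROUPS
```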
@@ -246,6 +254,8 @@ input_data = {

 - **Built-in Rules**: 20+ general heuristic evaluation rules
 - **LLM Integration**: OpenAI, Kimi, and local models (e.g., Llama3)
+- **Hallucination Detection**: HHEM-2.1-Open local model and GPT-based evaluation
+- **RAG System Evaluation**: Response consistency and context alignment assessment
 - **Custom Rules**: Easily extend with your own rules and models
 - **Security Evaluation**: Perspective API integration

@@ -390,6 +400,7 @@ The current built-in detection rules and model methods focus on common data qual

 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
+- [deepeval](https://github.com/confident-ai/deepeval)

 # Contribution

README_ja.md

Lines changed: 13 additions & 2 deletions
@@ -213,21 +213,29 @@ input_data = {

 You can customize these prompts to focus on specific quality dimensions or to adapt to particular domain requirements. When combined with appropriate LLM models, these prompts enable comprehensive data quality evaluation across multiple dimensions.

+### Hallucination Detection and RAG System Evaluation
+
+Detailed guidance on using Dingo's hallucination detection capabilities, including HHEM-2.1-Open local inference and LLM-based evaluation:
+
+📖 **[View the Hallucination Detection Guide →](docs/hallucination_guide.md)**
+
 # Rule Groups

 Dingo provides pre-configured rule groups for different types of datasets:

 | Group | Use Case | Example Rules |
 |----------|--------|----------|
 | `default` | General text quality | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`, etc. |
-| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleLineStartWithBulletpoint` |
+| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleHallucinationHHEM` for hallucination detection |
+| `rag` | RAG system evaluation | `RuleHallucinationHHEM`, `PromptHallucination` for response consistency detection |
+| `hallucination` | Hallucination detection | `PromptHallucination` with LLM-based evaluation |
 | `pretrain` | Pre-training datasets | Comprehensive set of 20+ rules including `RuleAlphaWords`, `RuleCapitalWords`, etc. |

 To use a specific rule group:

 ```python
 input_data = {
-    "eval_group": "sft", # Use "default", "sft", or "pretrain"
+    "eval_group": "sft", # Use "default", "sft", "rag", "hallucination", or "pretrain"
     # other parameters...
 }
 ```
@@ -245,6 +253,8 @@ input_data = {
 The evaluation system includes:
 - **Text Quality Assessment Metrics**: Quality evaluation of pre-training data using the DataMan methodology and enhanced multi-dimensional evaluation
 - **SFT Data Assessment Metrics**: Honest, helpful, and harmless evaluation of supervised fine-tuning data
+- **Hallucination Detection**: HHEM-2.1-Open local model and GPT-based evaluation
+- **RAG System Evaluation**: Response consistency and context alignment assessment
 - **Classification Metrics**: Topic classification and content classification
 - **Multimodal Assessment Metrics**: Image classification and relevance evaluation
 - **Rule-Based Quality Metrics**: Automated quality checks using heuristic rules for effectiveness and similarity detection
@@ -390,6 +400,7 @@ result = executor.execute()

 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
+- [deepeval](https://github.com/confident-ai/deepeval)

 # Contribution

README_zh-CN.md

Lines changed: 13 additions & 2 deletions
@@ -212,21 +212,29 @@ input_data = {

 You can customize these prompts to focus on specific quality dimensions or to adapt to particular domain requirements. When combined with appropriate LLM models, these prompts enable comprehensive evaluation of data quality across multiple dimensions.

+### Hallucination Detection and RAG System Evaluation
+
+For detailed guidance on using Dingo's hallucination detection features, including HHEM-2.1-Open local inference and LLM-based evaluation:
+
+📖 **[View the Hallucination Detection Guide →](docs/hallucination_guide.md)**
+
 # Rule Groups

 Dingo provides pre-configured rule groups for different types of datasets:

 | Group | Use Case | Example Rules |
 |-------|----------|---------------|
 | `default` | General text quality | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat` |
-| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleLineStartWithBulletpoint` |
+| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleHallucinationHHEM` for hallucination detection |
+| `rag` | RAG system evaluation | `RuleHallucinationHHEM`, `PromptHallucination` for response consistency detection |
+| `hallucination` | Hallucination detection | `PromptHallucination` with LLM-based evaluation |
 | `pretrain` | Pre-training datasets | Comprehensive set of 20+ rules including `RuleAlphaWords`, `RuleCapitalWords`, etc. |

 To use a specific rule group:

 ```python
 input_data = {
-    "eval_group": "sft", # Use "default", "sft", or "pretrain"
+    "eval_group": "sft", # Use "default", "sft", "rag", "hallucination", or "pretrain"
     # other parameters...
 }
 ```
@@ -243,6 +251,8 @@ input_data = {

 - **Built-in Rules**: 20+ general heuristic evaluation rules
 - **LLM Integration**: OpenAI, Kimi, and local models (e.g., Llama3)
+- **Hallucination Detection**: HHEM-2.1-Open local model and GPT-based evaluation
+- **RAG System Evaluation**: Response consistency and context alignment assessment
 - **Custom Rules**: Easily extend with your own rules and models
 - **Security Evaluation**: Perspective API integration

@@ -387,6 +397,7 @@ result = executor.execute()

 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
+- [deepeval](https://github.com/confident-ai/deepeval)

 # Contribution

dingo/data/converter/base.py

Lines changed: 5 additions & 0 deletions
@@ -261,6 +261,11 @@ def _convert(raw: Union[str, Dict]):
                 if input_args.column_content != ""
                 else ""
             ),
+            "context": (
+                cls.find_levels_data(j, input_args.column_context)
+                if input_args.column_context != ""
+                else j.get("context", None)  # Fallback to 'context' key if column_context not specified
+            ),
             "raw_data": j,
         }
     )
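The fallback added in this hunk can be sketched outside Dingo as follows. This is a minimal illustration, not the converter itself: `lookup_path` is a hypothetical stand-in for `cls.find_levels_data`, and it assumes a dotted-path lookup, which the diff does not confirm.

```python
from typing import Any, Dict


def lookup_path(record: Dict[str, Any], path: str) -> Any:
    """Hypothetical stand-in for find_levels_data: follow a dotted key path."""
    current: Any = record
    for key in path.split("."):
        current = current[key]
    return current


def resolve_context(record: Dict[str, Any], column_context: str = "") -> Any:
    """Use the configured column if one is set, else fall back to 'context'."""
    if column_context != "":
        return lookup_path(record, column_context)
    # No column configured: use the record's own "context" key, or None.
    return record.get("context", None)


record = {"context": ["Paris is the capital of France."], "meta": {"refs": "doc-1"}}
from_default_key = resolve_context(record)            # falls back to record["context"]
from_column = resolve_context(record, "meta.refs")    # uses the configured column
missing = resolve_context({"content": "no context"})  # nothing to fall back to
```

With this shape, datasets that already store references under a `context` key need no extra configuration, while other layouts can point `column_context` at the right field.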

dingo/io/input/Data.py

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-from typing import Dict, List, Optional
+from typing import Dict, List, Optional, Union

 from pydantic import BaseModel

@@ -12,4 +12,5 @@ class Data(BaseModel):
     prompt: str = None
     content: str = None
     image: Optional[List] = None
+    context: Optional[Union[str, List[str]]] = None  # Added for hallucination detection
     raw_data: Dict = {}
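Because `context` may arrive as a string or a list, consumers need to normalize it before use. The sketch below mirrors the normalization that `LLMHallucination.build_messages` performs in this commit, using only the standard library. The helper name `normalize_context` is hypothetical and does not exist in Dingo.

```python
import json
from typing import List, Optional, Union


def normalize_context(context: Optional[Union[str, List[str]]]) -> List[str]:
    """Return the context as a list (hypothetical helper)."""
    if not context:
        return []                      # None, "" or [] all mean "no context"
    if isinstance(context, list):
        return context                 # already a list of passages
    try:
        parsed = json.loads(context)   # a JSON-encoded list is unpacked
    except (json.JSONDecodeError, ValueError):
        return [str(context)]          # plain text becomes a single passage
    # Valid JSON that is not a list (e.g. an object) is kept as raw text.
    return parsed if isinstance(parsed, list) else [str(context)]


plain = normalize_context("The Eiffel Tower is in Paris.")
encoded = normalize_context('["passage one", "passage two"]')
not_a_list = normalize_context('{"k": 1}')
```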

dingo/io/input/InputArgs.py

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ class InputArgs(BaseModel):
     column_id: str = ""
     column_prompt: str = ""
     column_content: str = ""
+    column_context: str = ""
     column_image: str = ""

     custom_config: Optional[str | dict] = None

dingo/model/llm/llm_hallucination.py

Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
+import json
+from typing import List, Union
+
+from dingo.io import Data
+from dingo.model import Model
+from dingo.model.llm.base_openai import BaseOpenAI
+from dingo.model.modelres import ModelRes
+from dingo.model.prompt.prompt_hallucination import PromptHallucination
+from dingo.model.response.response_hallucination import HallucinationScoreReason, HallucinationVerdict, HallucinationVerdicts
+from dingo.utils import log
+from dingo.utils.exception import ConvertJsonError
+
+
+@Model.llm_register("LLMHallucination")
+class LLMHallucination(BaseOpenAI):
+    """
+    Hallucination detection LLM based on DeepEval's HallucinationMetric.
+    Evaluates whether LLM outputs contain factual contradictions against provided contexts.
+
+    This implementation adapts DeepEval's verdict-based approach to Dingo's architecture:
+    1. Generates verdicts for each context against the actual output
+    2. Calculates hallucination score based on contradiction ratio
+    3. Returns standardized ModelRes with error_status based on threshold
+    """
+
+    prompt = PromptHallucination
+    threshold = 0.5  # Default threshold for hallucination detection
+
+    @classmethod
+    def build_messages(cls, input_data: Data) -> List:
+        """
+        Build messages for hallucination detection.
+        Expects input_data to have:
+        - prompt: The question/prompt
+        - content: The actual response to evaluate
+        - context: List of reference contexts (can be string or list)
+        """
+        question = input_data.prompt or ""
+        response = input_data.content
+
+        # Handle context - can be string or list
+        if hasattr(input_data, 'context') and input_data.context:
+            if isinstance(input_data.context, list):
+                contexts = input_data.context
+            else:
+                # Try to parse as JSON list, fallback to single context
+                try:
+                    contexts = json.loads(input_data.context)
+                    if not isinstance(contexts, list):
+                        contexts = [str(input_data.context)]
+                except (json.JSONDecodeError, ValueError):
+                    contexts = [str(input_data.context)]
+        else:
+            # No context provided - cannot evaluate hallucination
+            log.warning("No context provided for hallucination detection")
+            contexts = []
+
+        # Format contexts for display
+        contexts_str = json.dumps(contexts, ensure_ascii=False, indent=2)
+
+        prompt_content = cls.prompt.content % (question, response, contexts_str)
+
+        messages = [{"role": "user", "content": prompt_content}]
+        return messages
+
+    @classmethod
+    def process_response(cls, response: str) -> ModelRes:
+        """
+        Process LLM response to calculate hallucination score.
+        Follows DeepEval's approach:
+        1. Parse verdicts from LLM response
+        2. Calculate hallucination score = (num_contradictions / total_verdicts)
+        3. Set error_status based on threshold
+        """
+        log.info(f"Raw LLM response: {response}")
+
+        response = response.strip()  # Clean response format: trim whitespace so the fence checks below match
+        if response.startswith("```json"):
+            response = response[7:]
+        if response.startswith("```"):
+            response = response[3:]
+        if response.endswith("```"):
+            response = response[:-3]
+
+        try:
+            response_json = json.loads(response)
+        except json.JSONDecodeError:
+            raise ConvertJsonError(f"Convert to JSON format failed: {response}")
+
+        try:
+            verdicts_data = HallucinationVerdicts(**response_json)
+            verdicts = verdicts_data.verdicts
+        except Exception as e:
+            raise ConvertJsonError(f"Failed to parse verdicts: {e}")
+
+        # Calculate hallucination score (like DeepEval)
+        score = cls._calculate_hallucination_score(verdicts)
+
+        # Generate detailed reason
+        reason = cls._generate_reason(verdicts, score)
+
+        result = ModelRes()
+
+        # Set error_status based on threshold
+        if score > cls.threshold:
+            result.error_status = True
+            result.type = "QUALITY_BAD_HALLUCINATION"
+            result.name = "HALLUCINATION_DETECTED"
+        else:
+            result.type = "QUALITY_GOOD"
+            result.name = "NO_HALLUCINATION"
+
+        result.reason = [reason]
+
+        # Store additional metadata
+        result.score = score
+        result.verdict_details = [
+            f"{v.verdict}: {v.reason}" for v in verdicts
+        ]
+
+        log.info(f"Hallucination score: {score:.3f}, threshold: {cls.threshold}")
+
+        return result
+
+    @classmethod
+    def _calculate_hallucination_score(cls, verdicts: List[HallucinationVerdict]) -> float:
+        """
+        Calculate hallucination score following DeepEval's approach.
+        Score = number_of_contradictions / total_verdicts
+        Higher score = more hallucinations (worse)
+        """
+        if not verdicts:
+            return 0.0
+
+        hallucination_count = 0
+        for verdict in verdicts:
+            if verdict.verdict.strip().lower() == "no":
+                hallucination_count += 1
+
+        score = hallucination_count / len(verdicts)
+        return score
+
+    @classmethod
+    def _generate_reason(cls, verdicts: List[HallucinationVerdict], score: float) -> str:
+        """Generate human-readable reason for the hallucination assessment"""
+
+        contradictions = []
+        alignments = []
+
+        for verdict in verdicts:
+            if verdict.verdict.strip().lower() == "no":
+                contradictions.append(verdict.reason)
+            else:
+                alignments.append(verdict.reason)
+
+        reason_parts = [
+            f"Hallucination score: {score:.3f} (threshold: {cls.threshold})"
+        ]
+
+        if contradictions:
+            reason_parts.append(f"Found {len(contradictions)} contradictions:")
+            for i, contradiction in enumerate(contradictions, 1):
+                reason_parts.append(f" {i}. {contradiction}")
+
+        if alignments:
+            reason_parts.append(f"Found {len(alignments)} factual alignments:")
+            for i, alignment in enumerate(alignments, 1):
+                reason_parts.append(f" {i}. {alignment}")
+
+        if score > cls.threshold:
+            reason_parts.append("❌ HALLUCINATION DETECTED: Response contains factual contradictions")
+        else:
+            reason_parts.append("✅ NO HALLUCINATION: Response aligns with provided contexts")
+
+        return "\n".join(reason_parts)
+
+    @classmethod
+    def eval(cls, input_data: Data) -> ModelRes:
+        """
+        Override eval to add context validation
+        """
+        # Validate that context is provided
+        if not hasattr(input_data, 'context') or not input_data.context:
+            return ModelRes(
+                error_status=True,
+                type="QUALITY_BAD",
+                name="MISSING_CONTEXT",
+                reason=["Context is required for hallucination detection but was not provided"]
+            )
+
+        # Call parent eval method
+        return super().eval(input_data)
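To make the scoring concrete, here is a standalone sketch of the parse-and-score path in `process_response`, written without any Dingo imports. It makes several assumptions: the verdict JSON shape (`{"verdicts": [{"verdict": ..., "reason": ...}]}`) is inferred from how the class reads `HallucinationVerdicts`, the helper names are hypothetical, and plain dicts stand in for the pydantic models.

```python
import json
from typing import Dict, List

THRESHOLD = 0.5   # same default as LLMHallucination.threshold
FENCE = "`" * 3   # Markdown code fence, built here to keep this example readable


def strip_code_fence(text: str) -> str:
    """Remove an optional Markdown JSON fence around an LLM reply (hypothetical helper)."""
    text = text.strip()
    if text.startswith(FENCE + "json"):
        text = text[len(FENCE + "json"):]
    elif text.startswith(FENCE):
        text = text[len(FENCE):]
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    return text.strip()


def hallucination_score(verdicts: List[Dict[str, str]]) -> float:
    """Score = contradictions / total verdicts; a "no" verdict marks a contradiction."""
    if not verdicts:
        return 0.0
    contradictions = sum(1 for v in verdicts if v["verdict"].strip().lower() == "no")
    return contradictions / len(verdicts)


# A made-up LLM reply: one context agrees, two are contradicted.
raw = FENCE + "json\n" + json.dumps({
    "verdicts": [
        {"verdict": "yes", "reason": "Agrees with context 1."},
        {"verdict": "no", "reason": "Contradicts context 2."},
        {"verdict": "no", "reason": "Contradicts context 3."},
    ]
}) + "\n" + FENCE

verdicts = json.loads(strip_code_fence(raw))["verdicts"]
score = hallucination_score(verdicts)   # 2 of 3 contexts contradicted
flagged = score > THRESHOLD             # strict comparison, as in the class
print(f"score={score:.3f} flagged={flagged}")
```

Because the comparison is strict, a response that contradicts exactly half of its contexts scores 0.5 and is not flagged under the default threshold.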

dingo/model/modelres.py

Lines changed: 9 additions & 1 deletion
@@ -1,4 +1,4 @@
-from typing import List
+from typing import Any, List, Optional

 from pydantic import BaseModel

@@ -8,3 +8,11 @@ class ModelRes(BaseModel):
     type: str = "QUALITY_GOOD"
     name: str = "Data"
     reason: List[str] = []
+
+    # Optional fields for enhanced functionality (e.g., hallucination detection)
+    score: Optional[float] = None
+    verdict_details: Optional[List[str]] = None
+
+    class Config:
+        # Allow extra attributes to be set dynamically
+        extra = "allow"

dingo/model/prompt/prompt_classify_qr.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ class PromptClassifyQR(BasePrompt):
     # Metadata for documentation generation
     _metric_info = {
         "category": "Multimodality Assessment Metrics",
-        "metric_name": "Image Classification",
+        "metric_name": "PromptClassifyQR",
         "description": "Identifies images as CAPTCHA, QR code, or normal images",
         "evaluation_results": ""
     }
