MigoXLab
diff --git a/README.md b/README.md
Lines changed: 13 additions & 2 deletions
diff --git a/README_ja.md b/README_ja.md
Lines changed: 13 additions & 2 deletions
diff --git a/README_zh-CN.md b/README_zh-CN.md
Lines changed: 13 additions & 2 deletions
diff --git a/dingo/data/converter/base.py b/dingo/data/converter/base.py
Lines changed: 5 additions & 0 deletions
diff --git a/dingo/io/input/Data.py b/dingo/io/input/Data.py
Lines changed: 2 additions & 1 deletion
diff --git a/dingo/io/input/InputArgs.py b/dingo/io/input/InputArgs.py
Lines changed: 1 addition & 0 deletions
diff --git a/dingo/model/llm/llm_hallucination.py b/dingo/model/llm/llm_hallucination.py
Lines changed: 192 additions & 0 deletions
diff --git a/dingo/model/modelres.py b/dingo/model/modelres.py
Lines changed: 9 additions & 1 deletion
diff --git a/dingo/model/prompt/prompt_classify_qr.py b/dingo/model/prompt/prompt_classify_qr.py
Lines changed: 1 addition & 1 deletion
@@ -215,21 +215,29 @@ input_data = {
 
 You can customize these prompts to focus on specific quality dimensions or to adapt to particular domain requirements. When combined with appropriate LLM models, these prompts enable comprehensive evaluation of data quality across multiple dimensions.
 
+### Hallucination Detection & RAG System Evaluation
+
+For detailed guidance on using Dingo's hallucination detection capabilities, including HHEM-2.1-Open local inference and LLM-based evaluation:
+
+📖 **[View Hallucination Detection Guide →](docs/hallucination_guide.md)**
+
 # Rule Groups
 
 Dingo provides pre-configured rule groups for different types of datasets:
 
 | Group | Use Case | Example Rules |
 |-------|----------|---------------|
 | `default` | General text quality | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`, etc. |
-| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleLineStartWithBulletpoint` |
+| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleHallucinationHHEM` for hallucination detection |
+| `rag` | RAG system evaluation | `RuleHallucinationHHEM`, `PromptHallucination` for response consistency |
+| `hallucination` | Hallucination detection | `PromptHallucination` with LLM-based evaluation |
 | `pretrain` | Pre-training datasets | Comprehensive set of 20+ rules including `RuleAlphaWords`, `RuleCapitalWords`, etc. |
 
 To use a specific rule group:
 
 ```python
 input_data = {
-    "eval_group": "sft",  # Use "default", "sft", or "pretrain"
+    "eval_group": "sft",  # Use "default", "sft", "rag", "hallucination", or "pretrain"
     # other parameters...
 }
 ```
@@ -246,6 +254,8 @@ input_data = {
 
 - **Built-in Rules**: 20+ general heuristic evaluation rules
 - **LLM Integration**: OpenAI, Kimi, and local models (e.g., Llama3)
+- **Hallucination Detection**: HHEM-2.1-Open local model and GPT-based evaluation
+- **RAG System Evaluation**: Response consistency and context alignment assessment
 - **Custom Rules**: Easily extend with your own rules and models
 - **Security Evaluation**: Perspective API integration
 
@@ -390,6 +400,7 @@ The current built-in detection rules and model methods focus on common data qual
 
 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
+- [deepeval](https://github.com/confident-ai/deepeval)
 
 # Contribution
 
 
@@ -213,21 +213,29 @@ input_data = {
 
 これらのプロンプトは、特定の品質次元に焦点を当てたり、特定のドメイン要件に適応させるためにカスタマイズできます。適切なLLMモデルと組み合わせることで、これらのプロンプトは複数の次元にわたる包括的なデータ品質評価を可能にします。
 
+### 幻覚検出とRAGシステム評価
+
+HHEM-2.1-Openローカル推論とLLMベース評価を含む、Dingoの幻覚検出機能の使用に関する詳細なガイダンス：
+
+📖 **[幻覚検出ガイドを見る →](docs/hallucination_guide.md)**
+
 # ルールグループ
 
 Dingoは異なるタイプのデータセット用に事前設定されたルールグループを提供します：
 
 | グループ | 使用例 | ルール例 |
 |----------|--------|----------|
 | `default` | 一般的なテキスト品質 | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`など |
-| `sft` | ファインチューニングデータセット | `default`のルールに加えて`RuleLineStartWithBulletpoint` |
+| `sft` | ファインチューニングデータセット | `default`のルールに加えて幻覚検出用の`RuleHallucinationHHEM` |
+| `rag` | RAGシステム評価 | 応答一貫性検出用の`RuleHallucinationHHEM`, `PromptHallucination` |
+| `hallucination` | 幻覚検出 | LLMベース評価の`PromptHallucination` |
 | `pretrain` | 事前学習データセット | `RuleAlphaWords`, `RuleCapitalWords`などを含む20以上のルールの包括的セット |
 
 特定のルールグループを使用するには：
 
 ```python
 input_data = {
-    "eval_group": "sft",  # "default", "sft", または "pretrain"を使用
+    "eval_group": "sft",  # "default", "sft", "rag", "hallucination", または "pretrain"を使用
     # その他のパラメータ...
 }
 ```
@@ -245,6 +253,8 @@ input_data = {
 評価システムには以下が含まれます：
 - **テキスト品質評価メトリクス**: DataMan手法と拡張された多次元評価を使用した事前学習データの品質評価
 - **SFTデータ評価メトリクス**: 教師ありファインチューニングデータの正直、有用、無害評価
+- **幻覚検出**: HHEM-2.1-OpenローカルモデルとGPTベースの評価
+- **RAGシステム評価**: 応答一貫性とコンテキスト整合性評価
 - **分類メトリクス**: トピック分類とコンテンツ分類
 - **マルチモーダル評価メトリクス**: 画像分類と関連性評価
 - **ルールベース品質メトリクス**: ヒューリスティックルールによる効果性と類似性検出を用いた自動品質チェック
@@ -390,6 +400,7 @@ result = executor.execute()
 
 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
+- [deepeval](https://github.com/confident-ai/deepeval)
 
 # 貢献
 
 
@@ -212,21 +212,29 @@ input_data = {
 
 您可以自定义这些prompt，以关注特定的质量维度或适应特定的领域需求。当与适当的LLM模型结合时，这些prompt能够在多个维度上对数据质量进行全面评估。
 
+### 幻觉检测和RAG系统评估
+
+有关使用Dingo幻觉检测功能的详细指导，包括HHEM-2.1-Open本地推理和基于LLM的评估：
+
+📖 **[查看幻觉检测指南 →](docs/hallucination_guide.md)**
+
 # 规则组
 
 Dingo为不同类型的数据集提供预配置的规则组：
 
 | 组名 | 用例 | 示例规则 |
 |-------|----------|---------------|
 | `default` | 通用文本质量 | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`等 |
-| `sft` | 微调数据集 | `default`中的规则加上`RuleLineStartWithBulletpoint` |
+| `sft` | 微调数据集 | `default`中的规则加上用于幻觉检测的`RuleHallucinationHHEM` |
+| `rag` | RAG系统评估 | 用于响应一致性检测的`RuleHallucinationHHEM`, `PromptHallucination` |
+| `hallucination` | 幻觉检测 | 基于LLM评估的`PromptHallucination` |
 | `pretrain` | 预训练数据集 | 包括`RuleAlphaWords`, `RuleCapitalWords`等20多条规则的全面集合 |
 
 使用特定规则组：
 
 ```python
 input_data = {
-    "eval_group": "sft",  # 使用"default"、"sft"或"pretrain"
+    "eval_group": "sft",  # 使用"default"、"sft"、"rag"、"hallucination"或"pretrain"
     # 其他参数...
 }
 ```
@@ -243,6 +251,8 @@ input_data = {
 
 - **内置规则**：20多种通用启发式评估规则
 - **LLM集成**：OpenAI、Kimi和本地模型（如Llama3）
+- **幻觉检测**：HHEM-2.1-Open本地模型和基于GPT的评估
+- **RAG系统评估**：响应一致性和上下文对齐评估
 - **自定义规则**：轻松扩展自己的规则和模型
 - **安全评估**：Perspective API集成
 
@@ -387,6 +397,7 @@ result = executor.execute()
 
 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
+- [deepeval](https://github.com/confident-ai/deepeval)
 
 # 贡献
 
 
@@ -261,6 +261,11 @@ def _convert(raw: Union[str, Dict]):
                         if input_args.column_content != ""
                         else ""
                     ),
+                    "context": (
+                        cls.find_levels_data(j, input_args.column_context)
+                        if input_args.column_context != ""
+                        else j.get("context", None)  # Fallback to 'context' key if column_context not specified
+                    ),
                     "raw_data": j,
                 }
             )
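
The fallback added in this hunk can be illustrated in isolation. The following is a minimal sketch, not the converter's actual code: `resolve_context` and `find_by_path` are hypothetical names, and the dotted-path lookup is only an assumption standing in for `cls.find_levels_data`, whose behaviour is not shown in this diff.

```python
from typing import Any, Dict, Optional


def find_by_path(record: Dict[str, Any], path: str) -> Any:
    """Hypothetical stand-in for find_levels_data: follow a dotted path such as 'meta.ctx'."""
    current: Any = record
    for key in path.split("."):
        current = current[key]
    return current


def resolve_context(record: Dict[str, Any], column_context: str = "") -> Optional[Any]:
    """Use the configured column if one is given, otherwise fall back to the 'context' key."""
    if column_context != "":
        return find_by_path(record, column_context)
    return record.get("context", None)
```

With no `column_context` configured, a record that lacks a `context` key resolves to `None`, which `LLMHallucination.eval` later reports as `MISSING_CONTEXT`.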
 
@@ -1,4 +1,4 @@
-from typing import Dict, List, Optional
+from typing import Dict, List, Optional, Union
 
 from pydantic import BaseModel
 
@@ -12,4 +12,5 @@ class Data(BaseModel):
     prompt: str = None
     content: str = None
     image: Optional[List] = None
+    context: Optional[Union[str, List[str]]] = None  # Added for hallucination detection
     raw_data: Dict = {}
@@ -41,6 +41,7 @@ class InputArgs(BaseModel):
     column_id: str = ""
     column_prompt: str = ""
     column_content: str = ""
+    column_context: str = ""
     column_image: str = ""
 
     custom_config: Optional[str | dict] = None
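
A configuration that uses the new field might look like the sketch below. The column values (`question`, `response`, `retrieved_docs`) are hypothetical dataset keys chosen for illustration, and other required parameters are omitted; only `eval_group` and the `column_*` names come from this diff.

```python
input_data = {
    "eval_group": "rag",                 # group name taken from the README table
    "column_prompt": "question",         # hypothetical key holding the user question
    "column_content": "response",        # hypothetical key holding the model answer
    "column_context": "retrieved_docs",  # hypothetical key holding the reference context(s)
    # other parameters...
}
```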
 
@@ -0,0 +1,192 @@
+import json
+from typing import List
+
+from dingo.io import Data
+from dingo.model import Model
+from dingo.model.llm.base_openai import BaseOpenAI
+from dingo.model.modelres import ModelRes
+from dingo.model.prompt.prompt_hallucination import PromptHallucination
+from dingo.model.response.response_hallucination import HallucinationVerdict, HallucinationVerdicts
+from dingo.utils import log
+from dingo.utils.exception import ConvertJsonError
+
+
+@Model.llm_register("LLMHallucination")
+class LLMHallucination(BaseOpenAI):
+    """
+    Hallucination detection LLM based on DeepEval's HallucinationMetric.
+    Evaluates whether LLM outputs contain factual contradictions against provided contexts.
+
+    This implementation adapts DeepEval's verdict-based approach to Dingo's architecture:
+    1. Generates verdicts for each context against the actual output
+    2. Calculates hallucination score based on contradiction ratio
+    3. Returns standardized ModelRes with error_status based on threshold
+    """
+
+    prompt = PromptHallucination
+    threshold = 0.5  # Default threshold for hallucination detection
+
+    @classmethod
+    def build_messages(cls, input_data: Data) -> List:
+        """
+        Build messages for hallucination detection.
+        Expects input_data to have:
+        - prompt: The question/prompt
+        - content: The actual response to evaluate
+        - context: Reference context(s): a list of strings, a JSON-encoded list, or a single string
+        """
+        question = input_data.prompt or ""
+        response = input_data.content
+
+        # Handle context - can be string or list
+        if hasattr(input_data, 'context') and input_data.context:
+            if isinstance(input_data.context, list):
+                contexts = input_data.context
+            else:
+                # Try to parse as JSON list, fallback to single context
+                try:
+                    contexts = json.loads(input_data.context)
+                    if not isinstance(contexts, list):
+                        contexts = [str(input_data.context)]
+                except (json.JSONDecodeError, ValueError):
+                    contexts = [str(input_data.context)]
+        else:
+            # No context provided - cannot evaluate hallucination
+            log.warning("No context provided for hallucination detection")
+            contexts = []
+
+        # Format contexts for display
+        contexts_str = json.dumps(contexts, ensure_ascii=False, indent=2)
+
+        prompt_content = cls.prompt.content % (question, response, contexts_str)
+
+        messages = [{"role": "user", "content": prompt_content}]
+        return messages
+
+    @classmethod
+    def process_response(cls, response: str) -> ModelRes:
+        """
+        Process LLM response to calculate hallucination score.
+        Follows DeepEval's approach:
+        1. Parse verdicts from LLM response
+        2. Calculate hallucination score = (num_contradictions / total_verdicts)
+        3. Set error_status based on threshold
+        """
+        log.info(f"Raw LLM response: {response}")
+
+        # Clean response format: trim whitespace and any Markdown code fence
+        response = response.strip()
+        if response.startswith("```json"):
+            response = response[7:]
+        elif response.startswith("```"):
+            response = response[3:]
+        response = response.removesuffix("```").strip()
+
+        try:
+            response_json = json.loads(response)
+        except json.JSONDecodeError:
+            raise ConvertJsonError(f"Convert to JSON format failed: {response}")
+
+        try:
+            verdicts_data = HallucinationVerdicts(**response_json)
+            verdicts = verdicts_data.verdicts
+        except Exception as e:
+            raise ConvertJsonError(f"Failed to parse verdicts: {e}")
+
+        # Calculate hallucination score (like DeepEval)
+        score = cls._calculate_hallucination_score(verdicts)
+
+        # Generate detailed reason
+        reason = cls._generate_reason(verdicts, score)
+
+        result = ModelRes()
+
+        # Set error_status based on threshold
+        if score > cls.threshold:
+            result.error_status = True
+            result.type = "QUALITY_BAD_HALLUCINATION"
+            result.name = "HALLUCINATION_DETECTED"
+        else:
+            result.type = "QUALITY_GOOD"
+            result.name = "NO_HALLUCINATION"
+
+        result.reason = [reason]
+
+        # Store additional metadata
+        result.score = score
+        result.verdict_details = [
+            f"{v.verdict}: {v.reason}" for v in verdicts
+        ]
+
+        log.info(f"Hallucination score: {score:.3f}, threshold: {cls.threshold}")
+
+        return result
+
+    @classmethod
+    def _calculate_hallucination_score(cls, verdicts: List[HallucinationVerdict]) -> float:
+        """
+        Calculate hallucination score following DeepEval's approach.
+        Score = number_of_contradictions / total_verdicts
+        Higher score = more hallucinations (worse)
+        """
+        if not verdicts:
+            return 0.0
+
+        hallucination_count = 0
+        for verdict in verdicts:
+            if verdict.verdict.strip().lower() == "no":
+                hallucination_count += 1
+
+        score = hallucination_count / len(verdicts)
+        return score
+
+    @classmethod
+    def _generate_reason(cls, verdicts: List[HallucinationVerdict], score: float) -> str:
+        """Generate human-readable reason for the hallucination assessment"""
+
+        contradictions = []
+        alignments = []
+
+        for verdict in verdicts:
+            if verdict.verdict.strip().lower() == "no":
+                contradictions.append(verdict.reason)
+            else:
+                alignments.append(verdict.reason)
+
+        reason_parts = [
+            f"Hallucination score: {score:.3f} (threshold: {cls.threshold})"
+        ]
+
+        if contradictions:
+            reason_parts.append(f"Found {len(contradictions)} contradictions:")
+            for i, contradiction in enumerate(contradictions, 1):
+                reason_parts.append(f"  {i}. {contradiction}")
+
+        if alignments:
+            reason_parts.append(f"Found {len(alignments)} factual alignments:")
+            for i, alignment in enumerate(alignments, 1):
+                reason_parts.append(f"  {i}. {alignment}")
+
+        if score > cls.threshold:
+            reason_parts.append("❌ HALLUCINATION DETECTED: Response contains factual contradictions")
+        else:
+            reason_parts.append("✅ NO HALLUCINATION: Response aligns with provided contexts")
+
+        return "\n".join(reason_parts)
+
+    @classmethod
+    def eval(cls, input_data: Data) -> ModelRes:
+        """
+        Override eval to add context validation
+        """
+        # Validate that context is provided
+        if not hasattr(input_data, 'context') or not input_data.context:
+            return ModelRes(
+                error_status=True,
+                type="QUALITY_BAD",
+                name="MISSING_CONTEXT",
+                reason=["Context is required for hallucination detection but was not provided"]
+            )
+
+        # Call parent eval method
+        return super().eval(input_data)
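
The context handling and scoring in this file can be tried without an LLM call. The following is a minimal standalone sketch under stated assumptions: `Verdict` is a simplified stand-in for `HallucinationVerdict` (only the `verdict` and `reason` fields used above), and `normalize_contexts` and `hallucination_score` are hypothetical names that mirror `build_messages` and `_calculate_hallucination_score`.

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Verdict:
    """Simplified stand-in for HallucinationVerdict (assumed fields: verdict, reason)."""
    verdict: str
    reason: str


def normalize_contexts(context: Union[str, List[str], None]) -> List[str]:
    """Turn a list, a JSON-encoded list, or a plain string into a list of contexts."""
    if not context:
        return []
    if isinstance(context, list):
        return context
    try:
        parsed = json.loads(context)
    except (json.JSONDecodeError, ValueError):
        return [str(context)]
    # Valid JSON that is not a list (e.g. a number) is kept as one raw string.
    return parsed if isinstance(parsed, list) else [str(context)]


def hallucination_score(verdicts: List[Verdict]) -> float:
    """Fraction of verdicts equal to 'no' (a contradiction); 0.0 when there are none."""
    if not verdicts:
        return 0.0
    contradictions = sum(1 for v in verdicts if v.verdict.strip().lower() == "no")
    return contradictions / len(verdicts)


verdicts = [
    Verdict("yes", "Agrees with context 1"),
    Verdict("no", "Contradicts context 2"),
    Verdict("No ", "Contradicts context 3"),
]
score = hallucination_score(verdicts)
flagged = score > 0.5  # same strict comparison as the class-level threshold
```

Because the comparison is strict, a score exactly equal to the threshold (for example one contradiction out of two verdicts at 0.5) is not flagged.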
@@ -1,4 +1,4 @@
-from typing import List
+from typing import Any, List, Optional
 
 from pydantic import BaseModel
 
@@ -8,3 +8,11 @@ class ModelRes(BaseModel):
     type: str = "QUALITY_GOOD"
     name: str = "Data"
     reason: List[str] = []
+
+    # Optional fields for enhanced functionality (e.g., hallucination detection)
+    score: Optional[float] = None
+    verdict_details: Optional[List[str]] = None
+
+    class Config:
+        # Allow extra attributes to be set dynamically
+        extra = "allow"
@@ -8,7 +8,7 @@ class PromptClassifyQR(BasePrompt):
     # Metadata for documentation generation
     _metric_info = {
         "category": "Multimodality Assessment Metrics",
-        "metric_name": "Image Classification",
+        "metric_name": "PromptClassifyQR",
         "description": "Identifies images as CAPTCHA, QR code, or normal images",
         "evaluation_results": ""
     }