feat(llm): support TextRank #32
base: main
Conversation
Fix apache#224; update the UI to support changing the keyword extraction method.
Fix the pylint check failure.
Walkthrough: Adds a multilingual TextRank implementation and switches keyword extraction to three modes (llm/textrank/hybrid), with results returned as keyword→score mappings. Updates the extract_keywords signature and its call sites in the RAG flow, revises prompt templates and configuration options, adds scipy and python-igraph dependencies, strengthens NLTK resource checks, and adjusts several imports and .gitignore.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant RAG as RAGPipeline.extract_keywords
    participant KE as KeywordExtract.run
    participant LLM as LLM Backend
    participant TR as MultiLingualTextRank
    User->>RAG: submit text for extraction
    RAG->>KE: call KeywordExtract.run()
    alt mode == "llm"
        KE->>LLM: send prompt, await response
        LLM-->>KE: KEYWORDS-formatted response
        KE->>KE: parse into {keyword: score}
    else mode == "textrank"
        KE->>TR: extract_keywords(text, lang)
        TR-->>KE: {keyword: score}
    else mode == "hybrid"
        par LLM path
            KE->>LLM: request keyword scores
            LLM-->>KE: {keyword: score}
        and TextRank path
            KE->>TR: extract_keywords(text, lang)
            TR-->>KE: {keyword: score}
        end
        KE->>KE: fuse with hybrid_llm_weights and sort
    end
    KE-->>RAG: keyword-to-score mapping
    RAG-->>User: return results
```
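The hybrid branch above fuses two keyword→score maps. As a minimal sketch of one plausible merge, assuming `hybrid_llm_weights` reduces to a single LLM-side weight in [0, 1] (the PR's actual parameter shape may differ):

```python
def fuse_scores(llm_scores: dict, textrank_scores: dict, llm_weight: float = 0.6) -> dict:
    """Weighted merge of two keyword -> score maps, sorted by fused score (descending)."""
    keywords = set(llm_scores) | set(textrank_scores)
    fused = {
        kw: llm_weight * llm_scores.get(kw, 0.0)
        + (1.0 - llm_weight) * textrank_scores.get(kw, 0.0)
        for kw in keywords
    }
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

print(fuse_scores({"graph": 0.9, "rag": 0.4}, {"graph": 0.5, "textrank": 0.8}))
```

Keywords missing from one side simply score 0.0 there, so a keyword found by both methods is naturally boosted.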
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
📜 Recent review details: Configuration used: CodeRabbit UI, Review profile: CHILL, Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
@codecov-ai-reviewer review
Actionable comments posted: 4
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 257-259: Consider refining the window-size check. A meaningful co-occurrence graph can still be built when the number of unique words is smaller than the window size; adjust the condition accordingly:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:
     return
```

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)
Lines 297-302: Consider improving the hint text for the mask_words input box. The current hint is long; simplify it and provide examples:

```diff
 mask_words_input = gr.Textbox(
     label="TextRank mask words",
-    info="""Enter any words you want to protect from being split during Chinese word segmentation(e.g., C++, website URLs). Separate each entry with a comma.""",
+    info="Words protected from being split during Chinese word segmentation. Separate with commas. Example: C++,GitHub,ChatGPT",
     show_copy_button=True,
     lines=7,
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
hugegraph-llm/pyproject.toml (1 hunks)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (11 hunks)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (5 hunks)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
hugegraph-llm/pyproject.toml (1)
Learnt from: cgwer
PR: hugegraph/hugegraph-ai#10
File: hugegraph-python-client/pyproject.toml:0-0
Timestamp: 2025-05-27T06:55:13.779Z
Learning: The hugegraph-python-client is a component within the hugegraph-ai project repository (apache/incubator-hugegraph-ai), not a standalone repository. When reviewing project URLs in pyproject.toml files within this project, they should point to the main hugegraph-ai repository.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (4)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4): RAGPipeline (38-267), extract_keywords (68-100), keywords_to_vid (106-130), import_schema (102-104)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/answer_synthesize.py (1): AnswerSynthesize (36-259)
- hugegraph-llm/src/hugegraph_llm/utils/decorators.py (1): with_task_id (109-123)
- hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1): import_schema (46-55)
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
Lines 68-100: The implementation is correct. The new TextRank-related parameters are well designed, and the documentation is clear. Normalizing the language code from "english" to "en" is a good improvement.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 281-301: The main TextRank method is well implemented. The complete keyword-extraction flow is sound and includes appropriate boundary checks.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (2)
Lines 83-90: Parameter passing is correct. The new TextRank-related parameters are properly forwarded to extract_keywords.
Lines 431-460: The batch-processing function is updated correctly; the new parameters are passed through consistently.
```python
if self._extract_method == "TextRank":
    # extract keywords with TextRank
    keywords = self._extract_with_textrank()
else:
    # extract keywords with the LLM
    keywords = self._extract_with_llm()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])

# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
```
🛠️ Refactor suggestion
Add error handling to the LLM extraction path as well. Currently only the TextRank path is wrapped in try/except; giving the LLM path the same treatment would improve the system's robustness:

```diff
 if self._extract_method == "TextRank":
     # extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if self._extract_method == "TextRank":
    # extract keywords with TextRank
    keywords = self._extract_with_textrank()
else:
    # extract keywords with the LLM
    try:
        keywords = self._extract_with_llm()
    except Exception as e:
        log.error("LLM Keyword extraction error: %s", e)
        keywords = set()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
```
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 73 to 85, the LLM keyword extraction method lacks error handling unlike
the TextRank method. Add a try-except block around the call to
self._extract_with_llm() to catch and handle any exceptions, ensuring the system
remains robust. Log the error appropriately and decide on a fallback or empty
keyword set if an exception occurs.
The self.llm.generate method is also called elsewhere in the project and is already sufficiently robust, so it is not necessary to add error handling.
@Gfreely, understood! Thank you for the clarification. You have better context about the project's LLM implementation and its robustness across the codebase. I'll defer to your technical judgment on this matter.
🧠 Learnings used
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
Actionable comments posted: 2
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 73-85: The LLM path still lacks error handling. Per the earlier review, the TextRank path now has try/except error handling, but the LLM path still has no equivalent:

```diff
 if self._extract_method == "TextRank":
     # extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
```
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 263-264: The graph-node-count check may be too strict. Returning immediately when the number of unique_words is smaller than window_size skips texts that could still yield a useful graph:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:
     return
```
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
Lines 153-164: Stopword-file loading now has proper error handling. Per the earlier review, try/except has been added and a missing file is handled gracefully.
Lines 199-200: The regular-expression pattern is fixed. The syntax error flagged earlier is resolved; (?![a-zA-Z0-9]) is now correctly used as a negative lookahead.
Lines 292-294: The post-preprocessing check is sound. If the token list is empty after preprocessing, an empty list is returned immediately, avoiding downstream errors.
Lines 300-301: The graph-state check is solid. Verifying that the graph exists and has nodes before running PageRank prevents algorithm failures.
Actionable comments posted: 1
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Lines 139-142: The constructor parameters lack default values. Per the earlier review, the constructor parameters should provide defaults to improve robustness. Note: this was flagged in a previous review but is still present in the code.
Line 168: The mask_words parameter needs a None check. Per the earlier review, calling split() on a None mask_words raises an exception:

```diff
-self.mask_words = list(filter(None, mask_words.split(',')))
+self.mask_words = list(filter(None, (mask_words or "").split(',')))
```

Note: this was flagged in a previous review but is still present in the code.
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
Lines 170-253: The text-preprocessing logic is overly complex; consider refactoring. The _preprocess method mixes elaborate Chinese and English handling; splitting it into smaller dedicated methods would improve readability and maintainability. Suggested structure:

```python
def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    if lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    ...

def _preprocess_english(self, text):
    # English preprocessing logic
    ...
```

Lines 217-218: Dynamically extending the dictionary may hurt performance. Adding placeholders to the jieba dictionary on every call can be costly, especially when processing large volumes of text. Consider adding the common placeholder patterns once in the constructor, or using jieba's temporary-dictionary facilities:

```python
# pre-initialize once in the constructor
jieba.initialize()
```

Lines 263-264: The graph-node-count check may be inaccurate. Returning when the number of unique words is smaller than the sliding window may be suboptimal, since a meaningful graph can still be built from few words. Suggested condition:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:  # at least two words are needed to build a graph
     return
```
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Lines 287-307: The main keyword-extraction function is clear. extract_keywords handles boundary conditions correctly, including empty token lists and empty graphs; error handling and flow control are sound.
Lines 98-109: The TextRank extraction method is well implemented, with appropriate error handling, performance monitoring, and logging. The caught exception types cover common runtime errors, and the return type matches the LLM method.
Actionable comments posted: 0
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 48-57: The textrank_kwargs parameter still needs a None check. Even though all current instantiation sites set defaults, using **textrank_kwargs directly in the constructor is risky: if textrank_kwargs is None, it fails at runtime. Add a guard for robustness:

```diff
-self._textrank_model = MultiLingualTextRank(**textrank_kwargs)  # TextRank parameters
+self._textrank_model = MultiLingualTextRank(**(textrank_kwargs or {}))  # TextRank parameters
```
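The `or`-fallback guard suggested here is the same pattern recommended earlier for mask_words; a standalone sketch of how both behave with None (function names are illustrative, not from the PR):

```python
def parse_mask_words(mask_words):
    # `or ""` turns None into a safe split target; filter(None, ...) drops empty fragments
    return list(filter(None, (mask_words or "").split(',')))

def make_kwargs(textrank_kwargs):
    # `or {}` lets **-unpacking succeed when the caller passes None
    return dict(**(textrank_kwargs or {}))

print(parse_mask_words(None))           # []
print(parse_mask_words("C++,,GitHub"))  # ['C++', 'GitHub']
print(make_kwargs(None))                # {}
```

Without the guards, both `None.split(',')` and `MultiLingualTextRank(**None)` raise a TypeError.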
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 169-252: The preprocessing logic is complex but comprehensive. The Chinese/English pipeline (mask-word protection, regex cleanup, tokenization, POS filtering) is well designed, though the complex regular-expression patterns may affect performance.
For processing large volumes of text, consider the following optimizations:
- precompile frequently used regex patterns
- consider a more efficient text-processing library
- add a length limit or chunked processing for very long texts
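The precompilation point can be illustrated in isolation; the pattern below is a stand-in, not the one used in the PR:

```python
import re

# compiled once at module level instead of on every preprocessing call
_NON_WORD = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]+")

def clean(text: str) -> str:
    # reusing one compiled pattern avoids repeated compilation overhead per call
    return _NON_WORD.sub(" ", text).strip()

print(clean("Hello, 世界!!"))  # Hello 世界
```

Module-level `re.compile` moves the pattern-compilation cost out of the hot path, which matters when `_preprocess` runs once per document.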
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (6)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
Lines 18-37: The imports and constant definitions look reasonable. The new imports are all needed for TextRank, including jieba for Chinese segmentation, networkx for graph algorithms, and nltk for English processing; the EXTRACT_STOPWORDS constant is appropriate.
Lines 73-85: The dispatch logic is implemented correctly. The conditional routes between TextRank and the LLM properly, error handling is in place, and keyword post-processing and context updates look sound.
Lines 98-108: The TextRank extraction method is solid. Error handling covers multiple exception types, logging aids debugging, and returning an empty set as a fallback is a reasonable design.
Lines 139-168: The MultiLingualTextRank constructor is complete: sensible defaults, reasonable POS filters, stopword loading with appropriate error handling, and the mask_words None check is implemented.
Lines 254-285: Graph construction and PageRank are implemented correctly. The co-occurrence graph construction is sound, self-loops are avoided, and edge weights are computed correctly. PageRank uses standard parameters, and the empty-graph check ensures robustness.
Lines 286-306: The main extraction flow is complete, following the standard TextRank pipeline: preprocess → build graph → PageRank → top-k. Boundary conditions, including empty token lists and empty graphs, are handled well.
Pull Request Overview
This PR adds support for TextRank-based keyword extraction in the RAG pipeline, updates the pipeline operator and demo UI to expose TextRank parameters, and introduces new dependencies for graph-based ranking.
- Added a MultiLingualTextRank class and branched keyword extraction logic between LLM and TextRank.
- Extended RAGPipeline.extract_keywords to accept method, window size, mask words, and language settings.
- Updated the Gradio demo (rag_block.py) to include UI controls for extraction method, language, mask words, keyword count, and sliding window.
- Added networkx and scipy to project dependencies.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
File | Description
---|---
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py | Implemented TextRank extraction, added MultiLingualTextRank, and method branching
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py | Propagated new extract_method and textrank_kwargs through pipeline
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py | Added Gradio inputs for extraction method, language, mask words, max keywords, and window size
hugegraph-llm/pyproject.toml | Added networkx and scipy dependencies
@codecov-ai-reviewer review
Actionable comments posted: 0
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 179-262: Consider splitting the complex preprocessing method. The _preprocess method currently takes on too many responsibilities, including masking, cleanup, tokenization, and tagging; splitting it into smaller methods would improve maintainability, e.g.:

```python
def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    if lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    ...

def _preprocess_english(self, text):
    # English preprocessing logic
    ...
```
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/ollama.py (1): generate (37-60)
- hugegraph-llm/src/hugegraph_llm/models/llms/openai.py (1): generate (57-85)
- hugegraph-llm/src/hugegraph_llm/models/llms/qianfan.py (1): generate (38-53)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-100)
- hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-80), stopwords (36-53)
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (11)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (11)
Lines 18-37: The imports and constants are reasonable; the new imports support the TextRank implementation and the constant definitions are clear.
Lines 41-57: The constructor parameters are well designed; the new extract_method and textrank_kwargs parameters support TextRank, and the parameter passing is correct.
Lines 73-85: The method dispatch is clear; extraction is routed to the right implementation, and keyword handling and context management stay consistent.
Lines 87-96: The LLM extraction method remains backward compatible, preserving the original logic with complete timing and logging.
Lines 98-114: The TextRank extraction method is complete, with proper exception handling, timing consistent with the LLM method, and a sensible error-handling strategy.
Lines 116-141: The response-parsing method is stable; its logic matches the original implementation, the formatting adjustments are reasonable, and stopword handling is correct.
Lines 144-162: The TextRank class constructor is well designed: sensible defaults, POS filtering for both Chinese and English, a mask_words None check, and appropriate lazy loading.
Lines 164-177: Stopword loading is improved, with file-existence checks and error handling, a lazy-loading pattern to avoid repeated reads, and good logging.
Lines 264-285: Graph construction is correct: the co-occurrence logic is sound, the insufficient-window edge case is handled, self-loops are avoided, and the sliding-window approach is appropriate.
Lines 287-294: PageRank is implemented correctly, with an empty-graph check, standard PageRank parameters, and proper weight handling.
Lines 296-319: The main extraction method is complete, orchestrating the full keyword-extraction pipeline with empty-result handling at each step and correct top-k selection.
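The flow these comments approve (preprocess → build co-occurrence graph → PageRank → top-k) can be sketched end to end. The sketch below is self-contained: it hand-rolls the power iteration instead of calling networkx as the PR does, and takes a pre-tokenized word list in place of MultiLingualTextRank's real preprocessing:

```python
import itertools
from collections import defaultdict

def build_cooccurrence(tokens, window=3):
    """Sliding-window co-occurrence weights for an undirected graph (no self-loops)."""
    weights = defaultdict(float)
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + window]:
            if a != b:
                weights[frozenset((a, b))] += 1.0
    return weights

def pagerank(weights, alpha=0.85, iters=50):
    """Plain power-iteration PageRank on the weighted co-occurrence graph."""
    nodes = set(itertools.chain.from_iterable(weights))
    if not nodes:
        return {}
    neighbors = defaultdict(dict)
    for pair, w in weights.items():
        a, b = tuple(pair)
        neighbors[a][b] = w
        neighbors[b][a] = w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - alpha) / len(nodes)
            + alpha * sum(rank[m] * w / sum(neighbors[m].values())
                          for m, w in neighbors[n].items())
            for n in nodes
        }
    return rank

def textrank_keywords(tokens, window=3, top_k=3):
    scores = pagerank(build_cooccurrence(tokens, window))
    return [kw for kw, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]]

print(textrank_keywords(["graph", "database", "keyword", "graph", "rank", "keyword", "graph"]))
```

Words that co-occur with many neighbors accumulate rank, so the repeated, well-connected token ends up first.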
@codecov-ai-reviewer review
Actionable comments posted: 0
🧹 Nitpick comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
Lines 78-84: Optional: set-based deduplication makes the output order unstable; stabilizing it helps UI/test reproducibility. Converting to a set and back to a list yields nondeterministic order, which can affect log comparison, cache hits, and the UI experience. If order preservation is not required, a stable sort alone guarantees consistency.
Suggested minimal change (keeps deduplication, provides a stable order):

```diff
-keywords = {k.replace("'", "") for k in keywords}
-context["keywords"] = list(keywords)[:self._max_keywords]
+normalized = [k.replace("'", "") for k in keywords if k]
+# not order-preserving, but stable: sort case-insensitively to avoid set nondeterminism
+context["keywords"] = sorted(set(normalized), key=str.lower)[:self._max_keywords]
```
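The stabilized ordering suggested here can be exercised on its own; this sketch mirrors the suggestion outside the class:

```python
def stable_keywords(raw, max_keywords=3):
    # strip quotes, drop empties, then sort case-insensitively for a deterministic order
    normalized = [k.replace("'", "") for k in raw if k]
    return sorted(set(normalized), key=str.lower)[:max_keywords]

print(stable_keywords({"Zebra", "apple'", "", "Mango"}))  # ['apple', 'Mango', 'Zebra']
```

Calling it repeatedly on the same set always yields the same list, which a bare `list(set(...))` does not guarantee.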
Lines 97-114: Consider re-raising MemoryError rather than swallowing it, to avoid masking serious resource problems (optional). Silently catching MemoryError can hide OOM conditions or abnormally large inputs; treat it as fatal and re-raise:

```diff
 except MemoryError as e:
     log.critical("TextRank memory error (text too large?): %s", e)
-    keywords = []
+    raise
```
Lines 153-179: Regex safety: when matching start_token, use re.escape so special characters cannot break the pattern's meaning. A comment already notes this point, but the code does not actually escape. The current "KEYWORDS:" token is safe, but defensive escaping keeps future start_token changes safe:

```diff
-matches = re.findall(rf'{start_token}[^\n]+\n?', response)
+matches = re.findall(rf'{re.escape(start_token)}[^\n]+\n?', response)
```
80-81
: Optional: avoid logging the raw user query at info level, to reduce PII exposure. Downgrade to debug, or truncate/redact the query.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s\nKeywords: %s", self._query, context["keywords"])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely chose to simplify the UI settings, removing the dynamic window_size adjustment and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.959Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely applied a dedup strategy to ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is raised at runtime, TextRank still produces only the initial number of candidates. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T14:45:20.742Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is raised at runtime, TextRank still produces only the initial number of candidates. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.316Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.316Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.316Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.316Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.742Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely chose to simplify the UI settings, removing the dynamic window_size adjustment and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.959Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.959Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely applied a dedup strategy to ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.186Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.186Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1): stopwords (36-53)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (56-240), extract_keywords (218-240)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
34-49
: LGTM: the constructor and TextRank integration align with the PR goal.
- The new extract_method/mask_words parameters are reasonable, and the "Hybrid" default fits this feature.
- MultiLingualTextRank is initialized with keyword_num and mask_words, matching the documented implementation.
61-64
: The language-code normalization fix is in place ('en'/'zh' → 'english'/'chinese'). This change ensures NLTKHelper().stopwords(lang) matches the expected language keys so stopword filtering does not silently fail, consistent with what we previously learned about NLTKHelper's constraints.
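The normalization described here can be captured in a tiny helper; the alias table below is illustrative and can be extended as needed:

```python
# Map common short codes to the full names NLTK's stopwords corpus expects.
# The alias table is an example; add entries for other languages as needed.
LANG_ALIASES = {"en": "english", "eng": "english", "zh": "chinese", "cn": "chinese"}

def normalize_lang(lang: str) -> str:
    """Return an NLTK-compatible language name for a possibly short code."""
    code = (lang or "").strip().lower()
    return LANG_ALIASES.get(code, code)
```

Codes that are already full names (or unknown) pass through unchanged, so callers can apply the helper unconditionally.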
45-49
: Confirmed: TextRank's top_k does not grow when max_keywords increases in run() (by design). Per earlier learnings, top_k is only passed at initialization; even if max_keywords is raised at runtime, TextRank still returns at most the initial count. This is an intentional trade-off. If runtime resizing should take effect later, sync self._textrank_model.top_k inside run().
I can submit a minimal sync patch (without introducing window support) if needed -- want me to?
Also applies to: 65-66
86-96
: The LLM keyword extraction is concise, and the timing logs are sensible.
- The prompt formatting and call chain are clear; the timing stats and debug logs help locate bottlenecks.
115-151
: The Hybrid logic fixes are in place: case alignment and the generator misuse are gone.
- Using tr_lower/used_tr_lower for case-consistent matching and dedup is correct.
- Taking the intersection first and then appending the remaining LLM and TextRank candidates is clear, and the earlier extend/append-on-generator bug is resolved.
Actionable comments posted: 3
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
101-114
: Exception handling: consider escalating MemoryError to fatal and optionally re-raising. MemoryError usually signals process-level resource exhaustion; swallowing it and continuing can leave the system in an unpredictable state. At minimum, mark it as unrecoverable in the log, or re-raise so upper layers can handle it by policy.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-97), stopwords (36-53)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (30-190), extract_keywords (173-190)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
63-66
: Language-code normalization is handled well. Mapping "en"/"zh" to "english"/"chinese" prevents runtime failures, and silently ineffective stopword filtering, caused by mismatched language codes in NLTKHelper.stopwords().
69-81
: Branch dispatch is clear, and the TextRank sorting logic is correct. Converting the ranks dict into a score-descending keyword sequence on the TextRank path is correct, and raising ValueError on the error branch is explicit and traceable.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-97), stopwords (36-53)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (30-189), extract_keywords (172-189)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (9)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (9)
20-22
: Import statements are well organized and their purpose is clear. The new jieba.posseg import is used for Chinese segmentation and works with the new multilingual TextRank feature, as expected.
28-28
: The TextRank module import is correct. Importing the MultiLingualTextRank class from the textrank_word_extract module matches the PR goal of supporting TextRank keyword extraction.
35-50
: The constructor parameter changes are reasonable. The language parameter is removed and extract_method and mask_words are added, in line with the PR goal. Defaults are sensible: extract_method defaults to "Hybrid" mode and mask_words defaults to an empty string.
63-66
: The language mapping is implemented correctly. Following earlier feedback, "en"/"zh" are correctly mapped to "english"/"chinese", keeping compatibility with NLTKHelper's language-code requirements.
69-88
: The keyword-extraction dispatch logic is clear. It routes between three extraction modes:
- LLM mode: calls _extract_with_llm()
- TextRank mode: calls _extract_with_textrank() and sorts by score
- Hybrid mode: calls _extract_with_hybrid()
The logic is clean and error handling is appropriate. The final keyword processing (quote stripping, truncation, logging) is implemented correctly.
90-99
: The LLM extraction method is implemented correctly. The original keyword-extraction logic is wrapped in a dedicated method with full timing and logging. The return type is now List[str], consistent with the new design.
101-114
: The TextRank extraction method has solid exception handling:
- TypeError, ValueError: parameter errors
- MemoryError: memory errors (for large texts)
Timing and logging are thorough.
116-163
: The Hybrid fusion algorithm is well implemented. It combines the strengths of LLM and TextRank:
- LLM keywords get a source score (1.0 for the intersection, 0.8 for LLM-only)
- TextRank-only keywords get a lower source score (0.5)
- The TextRank score and a word-length score are blended for the final evaluation
The algorithm logic is sound and implements the intended priority scheme.
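The priority scheme summarized above (1.0 for keywords in both sources, 0.8 for LLM-only, 0.5 for TextRank-only, blended with the TextRank score) can be sketched as follows. The function name and the exact blending are assumptions based on this summary, not the project's implementation, which additionally mixes in a word-length score:

```python
def fuse_keywords(llm_keywords, textrank_scores, top_k=10):
    """Blend LLM and TextRank candidates using source-priority scores.

    llm_keywords: ordered list of keywords from the LLM.
    textrank_scores: {keyword: normalized TextRank score}.
    """
    tr_lower = {k.lower(): s for k, s in textrank_scores.items()}
    fused = {}
    for kw in llm_keywords:
        # 1.0 when both sources agree, 0.8 when only the LLM proposed it.
        source = 1.0 if kw.lower() in tr_lower else 0.8
        fused[kw] = source + tr_lower.get(kw.lower(), 0.0)
    used = {k.lower() for k in fused}
    for kw, s in textrank_scores.items():
        if kw.lower() not in used:
            # TextRank-only candidates get the lowest source score (0.5).
            fused[kw] = 0.5 + s
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [kw for kw, _ in ranked[:top_k]]
```

Matching is done case-insensitively, mirroring the tr_lower handling the review mentions, so "Graph" from the LLM and "graph" from TextRank count as the same keyword.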
165-170
: The method signature update is correct. _extract_keywords_from_response now returns List[str] instead of Set[str], consistent with the new order-preserving design.
Actionable comments posted: 1
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
60-91
: check_nltk_data has two problems: the second verification pass can throw, and the "all four required" AND semantics is too strict; fall back to OR semantics over punkt/punkt_tab and the _eng/generic tagger.
- Lines 87-89 call nltk.data.find(...) directly for re-verification without catching LookupError, so a missing resource throws again.
- All four packages are currently treated as required, while only one of each pair is actually needed: tokenizers need punkt or punkt_tab; taggers need averaged_perceptron_tagger_eng or averaged_perceptron_tagger.
- The download logic should try the primary name first and fall back only on failure, avoiding redundant downloads and failures caused by version differences.
Suggested minimal refactor of the method body, fixing the exception and adopting OR semantics:
 def check_nltk_data(self):
-    cache_dir = self.get_cache_dir()
-    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
-    nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
-
-    if nltk_data_dir not in nltk.data.path:
-        nltk.data.path.append(nltk_data_dir)
-
-    required_packages = {
-        'punkt': False,
-        'punkt_tab': False,
-        'averaged_perceptron_tagger': False,
-        "averaged_perceptron_tagger_eng": False}
-    for package in required_packages:
-        try:
-            if package in ['punkt', 'punkt_tab']:
-                nltk.data.find(f'tokenizers/{package}')
-            else:
-                nltk.data.find(f'taggers/{package}')
-            required_packages[package] = True
-        except LookupError:
-            nltk.download(package, download_dir=nltk_data_dir)
-
-    check_flag = all(required_packages.values())
-    if not check_flag:
-        for package in required_packages:
-            if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                required_packages[package] = True
-
-    check_flag = all(required_packages.values())
-    return check_flag
+    cache_dir = self.get_cache_dir()
+    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
+    local_nltk = os.path.join(resource_path, "nltk_data")
+    if local_nltk not in nltk.data.path:
+        nltk.data.path.append(local_nltk)
+    if nltk_data_dir not in nltk.data.path:
+        nltk.data.path.append(nltk_data_dir)
+
+    def _ensure_any(kind: str, primary: str, fallback: str) -> bool:
+        # kind: "tokenizers" | "taggers"
+        try:
+            nltk.data.find(f"{kind}/{primary}")
+            return True
+        except LookupError:
+            try:
+                nltk.data.find(f"{kind}/{fallback}")
+                return True
+            except LookupError:
+                # Prefer downloading the primary name; fall back on failure
+                if not nltk.download(primary, download_dir=nltk_data_dir):
+                    return bool(nltk.download(fallback, download_dir=nltk_data_dir))
+                return True
+
+    tokenizer_ok = _ensure_any("tokenizers", "punkt", "punkt_tab")
+    tagger_ok = _ensure_any("taggers", "averaged_perceptron_tagger_eng", "averaged_perceptron_tagger")
+    return tokenizer_ok and tagger_ok
If needed, I can add a minimal unit test, or cache the download directory in CI to avoid repeated downloads.
🧹 Nitpick comments (8)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
36-59
: stopwords() should support en/zh aliases and avoid appending the path repeatedly.
- Callers are likely to pass "en"/"zh"; only "english"/"chinese" are accepted today, so normalizing language codes reduces misuse.
- Every call unconditionally appends resource_path/nltk_data, producing duplicate entries; check before appending.
A minimal in-scope change:
 def stopwords(self, lang: str = "chinese") -> List[str]:
     """Get stopwords."""
-    nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
-    if self._stopwords.get(lang) is None:
+    local_nltk = os.path.join(resource_path, "nltk_data")
+    if local_nltk not in nltk.data.path:
+        nltk.data.path.append(local_nltk)
+
+    # Accept common language-code aliases
+    lang = ({'en': 'english', 'eng': 'english', 'zh': 'chinese', 'cn': 'chinese'}
+            .get((lang or '').lower(), (lang or '').lower()))
+
+    if self._stopwords.get(lang) is None:
         cache_dir = self.get_cache_dir()
         nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
         # update nltk path for nltk so that it finds the data
         if nltk_data_dir not in nltk.data.path:
             nltk.data.path.append(nltk_data_dir)
         try:
             nltk.data.find("corpora/stopwords")
         except LookupError:
             nltk.download("stopwords", download_dir=nltk_data_dir)
-        self._stopwords[lang] = stopwords.words(lang)
+        try:
+            self._stopwords[lang] = stopwords.words(lang)
+        except (OSError, LookupError):
+            # Fall back to an empty list so callers don't crash
+            self._stopwords[lang] = []
 @@
-    final_stopwords = self._stopwords[lang]
+    final_stopwords = self._stopwords.get(lang)
def stopwords(self, lang: str = "chinese") -> List[str]: """Get stopwords.""" - nltk.data.path.append(os.path.join(resource_path, "nltk_data")) - if self._stopwords.get(lang) is None: + local_nltk = os.path.join(resource_path, "nltk_data") + if local_nltk not in nltk.data.path: + nltk.data.path.append(local_nltk) + + # 兼容常见语言码别名 + lang = ({'en': 'english', 'eng': 'english', 'zh': 'chinese', 'cn': 'chinese'} + .get((lang or '').lower(), (lang or '').lower())) + + if self._stopwords.get(lang) is None: cache_dir = self.get_cache_dir() nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir) # update nltk path for nltk so that it finds the data if nltk_data_dir not in nltk.data.path: nltk.data.path.append(nltk_data_dir) try: nltk.data.find("corpora/stopwords") except LookupError: nltk.download("stopwords", download_dir=nltk_data_dir) - self._stopwords[lang] = stopwords.words(lang) + try: + self._stopwords[lang] = stopwords.words(lang) + except (OSError, LookupError): + # 兜底为空集合,避免调用方崩溃 + self._stopwords[lang] = [] @@ - final_stopwords = self._stopwords[lang] + final_stopwords = self._stopwords.get(lang)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (7)
31-36
: The top_k parameter is unused, so extract_keywords() currently returns the full scored vocabulary. If the caller does not slice again, this conflicts with the UI's expected keyword count. Either truncate to self.top_k before returning, or drop the parameter to avoid confusion.
Add truncation at the return site (see the patch in the 183-191 region):
 @@
-    ranks = self._rank_nodes()
-    return ranks
+    ranks = self._rank_nodes()
+    # Truncate to top_k (only when it is configured as a positive integer)
+    if isinstance(self.top_k, int) and self.top_k > 0 and ranks:
+        ranks = dict(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[: self.top_k])
+    return ranks
126-131
: The Chinese-character regex is recompiled on every loop iteration, and its character range should be unified to \u9fff. re.compile('[\u4e00-\u9fa5]') is compiled per iteration and is inconsistent with the \u4e00-\u9fff range used by the retention pattern above. Precompile it and unify the range.
 class MultiLingualTextRank:
     def __init__(self, keyword_num: int = 5, window_size: int = 2, mask_words: str = ""):
         @@
         self.max_len = 100
+        self._ch_char_re = re.compile(r'[\u4e00-\u9fff]')
 @@
-    if re.compile('[\u4e00-\u9fa5]').search(word):
+    if self._ch_char_re.search(word):
         ch_tokens.append(word)
133-141
: Only the first occurrence of each Chinese token is re-segmented; later occurrences of the same token remain unsegmented. The code dedups with set() and then replaces at words.index(ch_token), i.e. only the first hit. If every occurrence should be segmented, rebuild the list in a single pass so all positions are covered; if this is a deliberate performance/quality trade-off, document the decision in a nearby comment.
A one-pass rebuild (minimal change, covers all occurrences):
-    ch_tokens = list(set(ch_tokens))
-    for ch_token in ch_tokens:
-        idx = words.index(ch_token)
-        ch_words = []
-        jieba_tokens = pseg.cut(ch_token)
-        for word, flag in jieba_tokens:
-            if len(word) >= 1 and flag in self.pos_filter['chinese'] and word not in ch_stop_words:
-                ch_words.append(word)
-        words = words[:idx] + ch_words + words[idx+1:]
+    ch_tokens = set(ch_tokens)
+    new_words = []
+    for tok in words:
+        if tok in ch_tokens:
+            buf = []
+            for w, flag in pseg.cut(tok):
+                if len(w) >= 1 and flag in self.pos_filter['chinese'] and w not in ch_stop_words:
+                    buf.append(w)
+            new_words.extend(buf or [tok])
+        else:
+            new_words.append(tok)
+    words = new_words
145-161
: Deduplicating graph vertices with set() gives a nondeterministic order; use order-preserving dedup for reproducibility. list(set(words)) makes the vertex order unpredictable; correctness is unaffected, but debugging and result reproduction suffer. Use dict.fromkeys.
-    unique_words = list(set(words))
+    unique_words = list(dict.fromkeys(words))
     name_to_idx = {word: idx for idx, word in enumerate(unique_words)}
167-170
: PageRank normalization can avoid repeated computation and be more robust against extreme small values. max(...) is currently recomputed inside the list comprehension; take the maximum once and add a tiny floor to avoid potential division by zero and the O(n^2) scan.
-    pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
-    pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
+    pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
+    m = max(pagerank_scores) or 1e-12
+    pagerank_scores = [s / m for s in pagerank_scores]
172-177
: Returning an empty dict when NLTK resources are unavailable deserves a log line and a degradation hint. When check_nltk_data() returns False, returning {} directly makes the problem hard to locate upstream. Log a warning here, and optionally degrade to a simple English-only tokenization fallback.
I can add a minimal degradation branch, or emit a one-time hint in the call chain (e.g. KeywordExtract) to help operators locate the issue.
87-90
: End-to-end verification suggestions. To reduce regressions, add three kinds of test cases:
- placeholder substitution and restoration are correct when mask_words mixes literals and regexes (including overlapping matches);
- segmentation behaves as intended (per your design trade-off) for mixed Chinese/English text containing repeated Chinese tokens;
- readiness detection for punkt/punkt_tab and the *_eng/generic tagger across environments.
I can draft a pytest test skeleton if needed.
Also applies to: 94-105, 172-191
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: 在hugegraph-llm的TextRank实现中,用户Gfreely选择简化UI设置,移除了窗口大小(window_size)的动态调整功能,只保留top_k(最大关键词数)的配置。
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: 在hugegraph-llm的TextRank实现中,用户Gfreely使用了ch_tokens去重策略(list(set(ch_tokens)))来避免重复中文token的处理,这既解决了words.index()可能的ValueError问题,又提高了批量替换的效率。
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-21T12:34:10.370Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.370Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3)
- NLTKHelper (30-118)
- stopwords (36-58)
- check_nltk_data (60-91)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
- extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
94-105: Confirm placeholder stability under NLTK tokenization
NLTK is not installed in the current sandbox, so the automated verification script could not run. Install NLTK locally and run the test below to verify that __shieldword_\d+__ survives nltk.word_tokenize as a single token, so the later placeholder_map lookup does not mismatch:
#!/bin/bash
python - <<'PY'
import re
from nltk import word_tokenize
text = "Keep __shieldword_1__ and __shieldword_2__ intact."
# The Chinese character class is included to cover full-text splitting as well
pattern = r'__shieldword_\d+__|\b\w+\b|[\u4e00-\u9fff]+'
tokens = re.findall(pattern, text)
ptb = word_tokenize(' '.join(tokens))
print("Tokens after NLTK:", ptb)
print("Placeholders intact:", all(t.startswith("__shieldword_") and t.endswith("__") for t in ptb if "__shieldword_" in t))
PY
Location
- File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
- Lines: 94-105
Suggestions
- Run the verification above locally to ensure placeholders are not split a second time.
- Add unit tests covering the common placeholder cases so that placeholder_map keeps mapping reliably.
if mask_patterns:
    special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
    text = special_regex.sub(_create_placeholder, text)
User-defined regex substitution lacks a timeout; ReDoS risk
Pattern length is limited and patterns are validated at pre-compile time, but the combined pattern can still hang during the substitution phase. Add a timeout to sub().
- if mask_patterns:
- special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
- text = special_regex.sub(_create_placeholder, text)
+ if mask_patterns:
+ special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
+ # 1s timeout to prevent malicious/pathological regexes from blocking
+ text = special_regex.sub(_create_placeholder, text, timeout=1.0)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if mask_patterns:
    special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
    text = special_regex.sub(_create_placeholder, text)
if mask_patterns:
    special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
    # 1s timeout to prevent malicious/pathological regexes from blocking
    text = special_regex.sub(_create_placeholder, text, timeout=1.0)
🤖 Prompt for AI Agents
In
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
around lines 87 to 90, the call to special_regex.sub(_create_placeholder, text)
lacks a timeout and can hang (ReDoS); add a timeout argument to sub (e.g.,
timeout=0.1 or a configurable small value) and wrap the call in a try/except
catching regex.TimeoutError to handle timeouts gracefully (log/warn and skip or
fall back without performing the replacement). Ensure the timeout value is
sensible and documented/configurable and that the exception path leaves text in
a safe state.
@coderabbitai generate sequence diagram
Sequence Diagram for PR
sequenceDiagram
autonumber
participant U as User
participant UI as Gradio UI
participant API as RAG API
participant Cfg as PromptConfig
participant Pipe as RAGPipeline
participant KE as KeywordExtract
participant TR as MultiLingualTextRank
participant LLM as LLM
participant NLTK as NLTKHelper
U->>UI: Set text, extract_method, language(en/zh), window_size, max_keywords, mask_words
UI->>API: POST /rag {text, extract_method, language, window_size, max_keywords, mask_words}
API->>Cfg: Get prompt.maskword_input_text
API->>API: mask_words := req.mask_words or Cfg.maskword_input_text
API->>Pipe: rag_answer(..., keywords_extract_method, mask_words, language, window_size, max_keywords)
Pipe->>KE: run(text, method, mask_words, language, window_size, max_keywords)
alt method == "TextRank"
KE->>TR: extract(text, language, window_size, mask_words)
TR->>NLTK: check_nltk_data(language)
NLTK-->>TR: ensure stopwords available
TR->>TR: tokenize + POS filter + mask + co-occur graph + PageRank
TR-->>KE: keywords
else method == "LLM"
KE->>LLM: prompt(text)
LLM-->>KE: keywords
else method == "Hybrid"
KE->>LLM: prompt(text)
LLM-->>KE: kw_llm
KE->>TR: extract(text, language, window_size, mask_words)
TR-->>KE: kw_tr
KE->>KE: merge/prioritize/dedup to max_keywords
end
KE-->>Pipe: keywords (dedup/trim)
Pipe-->>API: answer + keywords
API-->>UI: response payload
UI-->>U: Render answer and extracted keywords
✅ Actions performed
Sequence diagram generated.
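The hybrid branch in the diagram merges the two keyword→score maps using hybrid_llm_weights. A standalone sketch of such a weighted fusion (fuse_scores and its behavior are illustrative, not the PR's actual implementation):

```python
def fuse_scores(llm_scores, textrank_scores, llm_weight=0.5):
    """Weighted fusion of keyword->score maps from the two extractors.

    llm_weight must lie in [0, 1]; keywords found by only one method
    contribute a 0.0 score from the other side.
    """
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError(f"hybrid_llm_weights out of range: {llm_weight}")
    keywords = set(llm_scores) | set(textrank_scores)
    fused = {
        kw: llm_weight * llm_scores.get(kw, 0.0)
        + (1.0 - llm_weight) * textrank_scores.get(kw, 0.0)
        for kw in keywords
    }
    # Descending by fused score, matching the hybrid mode described above
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))


print(fuse_scores({"graph": 1.0, "rank": 0.5}, {"rank": 1.0}, llm_weight=0.5))
```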
Actionable comments posted: 0
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (2)
106-106: Show/enable this input only when TextRank is selected and language=zh, to reduce UI cognitive load
This input mainly supplies mask phrases for Chinese segmentation. Make it conditional: visible and editable only when extract_method == "TextRank" and language == "zh"; hide or disable it otherwise (LLM, English, or non-Chinese Hybrid), and state the expected format in the placeholder (e.g. a comma-separated list of phrases). This reduces user confusion and avoids writing useless config in invalid scenarios.
Also applies to: 156-156
106-106: Naming consistency: maskword_input_text ↔ textrank_mask_words ↔ mask_words
The UI variable is currently textbox_textrank_mask_words, the prompt field is maskword_input_text, and the API/pipeline semantics are mask_words. Unify on the "mask_words" root; backward compatibility can be kept at first, and the shared root helps recognition and search long term. If you accept the change, this file can reference prompt.mask_words_input_text (or prompt.textrank_mask_words), with the field and YAML key renamed in sync at the config layer. Minimal change in this file (requires the matching config-layer change):
- prompt.maskword_input_text,
+ prompt.mask_words_input_text,
Also sync in the following places (examples, not this file):
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py / prompt_config.py: field name and default value
- migration of the corresponding YAML/JSON config key, with backward-compatible reads
- construction of the RAG-side parameters that depend on this field (make sure the key finally sent to the server is still mask_words)
Also applies to: 135-135, 156-156
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (4)
106-107: TextRank mask-word input wired correctly (destructuring matches the return values)
create_rag_block() now returns 5 items; the new textbox_textrank_mask_words is destructured in the right position and has a matching entry in ui.load(outputs=...), so the wiring has no side effects.
156-156: New Gradio output is positioned correctly (index aligned with the returned tuple)
textbox_textrank_mask_words is inserted as the 13th output, matching the 13th return value of refresh_ui_config_prompt() one-to-one and avoiding misalignment on load.
19-19: Blank-line change only
Whitespace-only change; no further action needed.
135-135: Awaiting the return statement of create_rag_block to confirm the number and order of its outputs; the review comment will be updated based on the result.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
52-71: Pass TextRank/Hybrid keyword-extraction parameters through the RAG API
- The RAGRequest model lacks the extract_method, language, window_size, max_keywords, and mask_words fields, so the API cannot receive this user configuration
- rag_api.py does not forward these fields when calling rag_answer_func
# hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
class RAGRequest(BaseModel):
+    extract_method: Optional[Literal["llm", "textrank", "hybrid"]] = Query(None, description="Keyword extraction method")
+    language: Optional[str] = Query(None, description="Keyword extraction language")
+    window_size: Optional[int] = Query(None, description="TextRank window size")
+    max_keywords: Optional[int] = Query(None, description="Maximum number of keywords")
+    mask_words: Optional[List[str]] = Query(None, description="List of words to exclude")
# hugegraph-llm/src/hugegraph_llm/api/rag_api.py
result = rag_answer_func(
    text=req.query,
    …
    topk_per_keyword=req.topk_per_keyword,
+    extract_method=req.extract_method,
+    language=req.language,
+    window_size=req.window_size,
+    max_keywords=req.max_keywords,
+    mask_words=req.mask_words,
    custom_related_information=req.custom_priority_info,
♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
103-113: Only the first occurrence of a duplicated Chinese token is re-segmented; other positions are missed. Rebuild the list in one pass.
The dedup strategy was discussed earlier and accepted, but for robustness and consistency a single traversal that rebuilds the list is still recommended, so every occurrence is covered.
- if len(ch_tokens) > 0:
-     ch_tokens = list(set(ch_tokens))
-     for ch_token in ch_tokens:
-         idx = words.index(ch_token)
-         ch_words = []
-         jieba_tokens = pseg.cut(ch_token)
-         for word, flag in jieba_tokens:
-             if len(word) >= 1 and flag in self.pos_filter['chinese'] and word not in ch_stop_words:
-                 ch_words.append(word)
-         words = words[:idx] + ch_words + words[idx+1:]
+ if len(ch_tokens) > 0:
+     new_words = []
+     for tok in words:
+         if regex.search(r'[\u4e00-\u9fff]', tok):
+             ch_words = []
+             for w, flag in pseg.cut(tok):
+                 if len(w) >= 1 and flag in self.pos_filter['chinese'] and w not in ch_stop_words:
+                     ch_words.append(w)
+             new_words.extend(ch_words or [tok])
+         else:
+             new_words.append(tok)
+     words = new_words
58-61: Regex substitution lacks timeout protection; ReDoS risk.
- text = special_regex.sub(_create_placeholder, text)
+ try:
+     text = special_regex.sub(_create_placeholder, text, timeout=1.0)
+ except regex.TimeoutError:
+     # Timeout fallback: skip masking and return the original text
+     return text, {}
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
87-96: LLM exception handling (already discussed)
This was discussed earlier: generate already handles robustness at the lower layer, so no extra catch is needed here.
98-112: TextRank exception handling and the policy for severe resource errors
Additionally handle FileNotFoundError (missing resources) and re-raise MemoryError directly.
- try:
+ try:
      ranks = self._textrank_model.extract_keywords(self._query)
- except (TypeError, ValueError) as e:
+ except FileNotFoundError as e:
+     log.error("TextRank resource file not found: %s", e)
+ except (TypeError, ValueError) as e:
      log.error("TextRank parameter error: %s", e)
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
🧹 Nitpick comments (6)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
117-121: Building the vertex list with set() yields non-deterministic order, hurting reproducibility.
- unique_words = list(set(words))
+ unique_words = list(dict.fromkeys(words))
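A minimal illustration of the difference: dict.fromkeys deduplicates while preserving first-seen order, so vertex indices stay stable across runs, whereas set() iteration order can vary between processes:

```python
words = ["graph", "rank", "graph", "text", "rank"]

# Order-preserving dedup: stable vertex indices across runs
unique_words = list(dict.fromkeys(words))
print(unique_words)  # ['graph', 'rank', 'text']
```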
30-35: Unused fields (top_k, max_len).
Neither field currently takes part in any computation; delete them or wire them into output truncation / length limiting to avoid confusion.
Do you plan to truncate results by top_k inside this class? If the upstream KeywordExtract truncates uniformly, add a comment here saying so.
hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1)
33-35: Hybrid weight lacks range validation.
Constrain it to [0, 1]; fall back to the default or raise a configuration error when out of range.
  hybrid_llm_weights: Optional[float] = 0.5
+ def __init__(self, **data):
+     super().__init__(**data)
+     if self.hybrid_llm_weights is not None:
+         w = self.hybrid_llm_weights
+         if not (0.0 <= w <= 1.0):
+             raise ValueError(f"hybrid_llm_weights out of range: {w}")
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
68-86: The docstring still documents a removed parameter (max_keywords).
Remove it, or describe the config-driven behavior instead (llm_settings.keyword_extract_type / window_size, etc.).
  """
  Add a keyword extraction operator to the pipeline.
- :param text: Text to extract keywords from.
- :param max_keywords: Maximum number of keywords to extract.
- :param extract_template: Template for keyword extraction.
+ :param text: Text to extract keywords from.
+ :param extract_template: Template for keyword extraction.
  :return: Self-instance for chaining.
  """
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
60-63: Language-code mapping is incomplete (variants like en-US/zh-CN are not covered)
Handle region suffixes so that zh-CN/en-US inputs are still mapped.
-lang = context.get("language", self._language).lower()
-if lang in ("en", "zh"):
-    lang = "english" if lang == "en" else "chinese"
+self._language = (context.get("language", self._language) or "").lower().strip()
+if self._language.startswith(("en", "zh")):
+    self._language = "english" if self._language.startswith("en") else "chinese"
-    self._language = lang
166-166: Regex safety: apply re.escape to start_token
Improves robustness and avoids unexpected matches from special characters.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
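A quick demonstration of the escaped pattern in isolation (the KEYWORDS: marker and the response text are made up for illustration):

```python
import re

response = "Some preamble\nKEYWORDS: graph(0.9), rank(0.7)\n"
start_token = "KEYWORDS:"

# Escaping keeps the marker literal even if it ever contains
# metacharacters such as '(', '*' or '?'
matches = re.findall(rf"{re.escape(start_token)}([^\n]+\n?)", response)
print(matches)  # [' graph(0.9), rank(0.7)\n']
```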
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
- hugegraph-llm/config.md (1 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/config/llm_config.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/config.md
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (3)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3)
- NLTKHelper (30-118)
- stopwords (36-58)
- check_nltk_data (60-91)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
- BaseLLM (22-74)
- generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
- LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
- MultiLingualTextRank (29-161)
- extract_keywords (143-161)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
hugegraph-llm/src/hugegraph_llm/utils/decorators.py (3)
- log_operator_time (64-78)
- log_time (38-61)
- record_rpm (81-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
32-32: Import-order adjustment only; no functional change. LGTM.
hugegraph-llm/config.md (1)
29-40: Change the documented default to 3 to keep it consistent with the code's windows_size
-| `WINDOWS_SIZE` | Optional[Integer] | 2 | TextRank sliding window size |
+| `WINDOWS_SIZE` | Optional[Integer] | 3 | TextRank sliding window size |
Likely an incorrect or invalid review comment.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
19-22: Import reordering and additions have no functional impact. LGTM.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
41-47: Confirmed the windows_size field already exists — it is defined as windows_size in the llm_settings config, consistent with the code reference; no change needed.
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
windows_size: Optional[int] = 3
hybrid_llm_weights: Optional[float] = 0.5
💡 Verification agent
🧩 Analysis chain
Field naming is inconsistent with the docs/PR parameters, and the default value conflicts with the docs.
- The code uses windows_size (plural); the PR/API/common convention is window_size (singular).
- The documented default is 2; the code uses 3.
Unify on the singular form, consistent with the docs:
- windows_size: Optional[int] = 3
+ window_size: Optional[int] = 2
Then replace all references globally (including the .env mapping and UI):
🏁 Script executed:
#!/bin/bash
rg -nP '\bwindows_size\b|WINDOWS_SIZE\b'
Length of output: 422
Unify naming and defaults: rename windows_size to window_size and set the default to 2
- Change hugegraph-llm/src/hugegraph_llm/config/llm_config.py:34
- Change hugegraph-llm/config.md:38
- Change hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:47
--- a/hugegraph-llm/src/hugegraph_llm/config/llm_config.py
- windows_size: Optional[int] = 3
+ window_size: Optional[int] = 2
And in keyword_extract.py change
- window_size=llm_settings.windows_size
+ window_size=llm_settings.window_size
and globally replace windows_size/WINDOWS_SIZE → window_size in the docs, the .env mapping, and the UI config.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
windows_size: Optional[int] = 3
hybrid_llm_weights: Optional[float] = 0.5
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
window_size: Optional[int] = 2
hybrid_llm_weights: Optional[float] = 0.5
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/config/llm_config.py around lines 33-35
change the field name windows_size to window_size and set its default to 2
(i.e., rename the attribute and update the default value), then update
hugegraph-llm/config.md line ~38 to reflect the new parameter name and default,
and update hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
line ~47 to use window_size instead of windows_size; finally perform a global
replacement of windows_size / WINDOWS_SIZE → window_size in all .env, UI config,
and documentation files to keep names consistent.
1. Extract, evaluate, and rank keywords from the text:
- Minimum 0, maximum MAX_KEYWORDS keywords.
- Keywords should be complete semantic words or phrases, ensuring information completeness.
- Keywords should be complete semantic words or phrases, ensuring information completeness, without any changes to the English capitalization.
- Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text.
- The final list of keywords must be sorted in descending order based on their importance score.
2. Identify keywords that need rewriting:
🛠️ Refactor suggestion
English rule conflict: "phrases" are allowed yet spaces are forbidden, which is semantically inconsistent
"Complete semantic words or phrases" contradicts "keywords must not contain spaces". Allow spaces, but forbid separators that would break parsing (commas, colons).
- - Keywords should be complete semantic words or phrases, ensuring information completeness, without any changes to the English capitalization.
- - Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text.
+ - Keywords should be complete semantic words or phrases (spaces are allowed), ensuring information completeness, without any changes to the English capitalization.
+ - Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text.
+ - Do not include commas or colons in keywords (they are used as separators).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
1. Extract, evaluate, and rank keywords from the text: | |
- Minimum 0, maximum MAX_KEYWORDS keywords. | |
- Keywords should be complete semantic words or phrases, ensuring information completeness. | |
- Keywords should be complete semantic words or phrases, ensuring information completeness, without any changes to the English capitalization. | |
- Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text. | |
- The final list of keywords must be sorted in descending order based on their importance score. | |
2. Identify keywords that need rewriting: | |
1. Extract, evaluate, and rank keywords from the text: | |
- Minimum 0, maximum MAX_KEYWORDS keywords. | |
- Keywords should be complete semantic words or phrases (spaces are allowed), ensuring information completeness, without any changes to the English capitalization. | |
- Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text. | |
- Do not include commas or colons in keywords (they are used as separators). | |
- The final list of keywords must be sorted in descending order based on their importance score. | |
2. Identify keywords that need rewriting: |
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/config/prompt_config.py around lines 161 to
166, the prompt text currently conflicts by allowing "phrases" but then
forbidding spaces; change the rule to permit spaces inside keywords (so
multi-word phrases are allowed) while explicitly disallowing delimiter
characters that would break parsing (e.g., commas, colons, semicolons, pipes).
Update the human-readable rule text accordingly and adjust any associated
validation logic to accept spaces but reject those delimiter characters (and
trim surrounding whitespace) so keywords can be multi-word phrases without
parsing ambiguity.
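The rule described above, allowing multi-word phrases while rejecting delimiter characters, can be sketched as a small validator. This is a hypothetical helper for illustration, not code from the PR:

```python
def is_valid_keyword(keyword: str) -> bool:
    """Accept multi-word phrases; reject characters used as output separators."""
    keyword = keyword.strip()
    if not keyword:
        return False
    # Commas/colons/semicolons/pipes would break the KEYWORDS: line format.
    forbidden = set(",:;|")
    return not any(ch in forbidden for ch in keyword)
```

A validator like this would run after parsing, so multi-word phrases such as "graph database" survive while ambiguous entries are dropped.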
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (outdated, resolved)
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))
🛠️ Refactor suggestion
PageRank normalization lacks a zero-value guard; in extreme cases this can divide by zero.
- pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
+ max_pr = max(pagerank_scores) if pagerank_scores else 0.0
+ pagerank_scores = [s / max_pr if max_pr > 0 else 0.0 for s in pagerank_scores]
📝 Committable suggestion
Before:
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))
After:
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
# Protect against zero or empty scores when normalizing
max_pr = max(pagerank_scores) if pagerank_scores else 0.0
pagerank_scores = [s / max_pr if max_pr > 0 else 0.0 for s in pagerank_scores]
node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))
🤖 Prompt for AI Agents
In
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
around lines 138 to 141, the PageRank normalization divides by
max(pagerank_scores) without guarding against a zero max which may cause a
divide-by-zero; compute max_score = max(pagerank_scores) (or 0 if list empty),
if max_score == 0 then set normalized scores to a list of zeros (or skip
division) otherwise divide each score by max_score, then zip node_names with the
normalized scores and return the dict.
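The guard suggested above can be exercised without igraph, since the normalization itself is plain list arithmetic. A minimal sketch, assuming `scores` is the raw PageRank score list:

```python
def normalize_scores(scores):
    """Divide by the max score, guarding against empty input and an all-zero max."""
    max_pr = max(scores) if scores else 0.0
    return [s / max_pr if max_pr > 0 else 0.0 for s in scores]
```

An empty graph or an all-zero score vector now yields zeros instead of raising ZeroDivisionError.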
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (outdated, resolved)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (outdated, resolved)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (outdated, resolved)
Actionable comments posted: 2
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
135-152: Hybrid weights only the intersection, which can return fewer items than max_keywords (already discussed in an earlier comment). When union > max and intersection < max, this path does not backfill, so the result count may fall short of max_keywords. If the current design keeps this intentionally, ignore.
To fix it, weight and sort over the union, then truncate.
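The "weight and sort over the union, then truncate" strategy could look like the sketch below. Names and the merge formula are illustrative assumptions, not the PR's actual code; a missing keyword is treated as score 0 on the side that did not produce it:

```python
def merge_hybrid(llm_scores, tr_scores, llm_weight=0.5, max_keywords=5):
    """Blend LLM and TextRank scores over the union of keywords, keep the top-N."""
    union = set(llm_scores) | set(tr_scores)
    merged = {
        k: llm_weight * llm_scores.get(k, 0.0)
           + (1 - llm_weight) * tr_scores.get(k, 0.0)
        for k in union
    }
    ranked = sorted(merged, key=merged.get, reverse=True)
    return {k: merged[k] for k in ranked[:max_keywords]}
```

Because the union is ranked before truncation, the result always contains min(len(union), max_keywords) items, avoiding the shortfall described above.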
🧹 Nitpick comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
129-130: Downgrade the intermediate keyword-set logs from INFO to DEBUG to avoid log noise and potential leakage of sensitive information.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
87-87: Refine the return-type annotations for readability and static checking. These methods return a word-to-score mapping and should be annotated Dict[str, float].
- def _extract_with_llm(self) -> Dict:
+ def _extract_with_llm(self) -> Dict[str, float]:
@@
- def _extract_with_textrank(self) -> Dict:
+ def _extract_with_textrank(self) -> Dict[str, float]:
@@
- def _extract_with_hybrid(self) -> Dict:
+ def _extract_with_hybrid(self) -> Dict[str, float]:
@@
- ) -> Dict:
+ ) -> Dict[str, float]:
Also applies to: 98-98, 113-113, 156-161
166-166: Escape start_token inside the regex to avoid accidental matches; this is more robust and lets the pattern be reused with other start tokens.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
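Why re.escape matters: if start_token ever contains regex metacharacters, the raw f-string pattern silently changes meaning. A minimal sketch of the hardened parse; `KEYWORDS:` matches the prompt format used in this PR, while the helper name is illustrative:

```python
import re

def parse_keyword_lines(response: str, start_token: str = "KEYWORDS:"):
    """Return the raw payload after each start_token occurrence, metachar-safe."""
    pattern = rf'{re.escape(start_token)}([^\n]+)'
    return [m.strip() for m in re.findall(pattern, response)]
```

With re.escape, even a token like "K(1):" is matched literally instead of being parsed as a capture group.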
104-111: For MemoryError, prefer re-raising instead of "continuing to run" under extreme memory pressure. The current code only logs a critical message and continues with an empty result, which can mask a resource problem.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
117-121: Clamp hybrid_llm_weights at the boundaries; it is simpler and more direct, and prevents an illegal weight from being used downstream.
- llm_weights = llm_settings.hybrid_llm_weights
- ranks = {}
- if llm_weights < 0 or llm_weights > 1:
-     llm_weights = 0.5
+ llm_weights = min(1.0, max(0.0, float(llm_settings.hybrid_llm_weights)))
+ ranks = {}
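The clamp expression above behaves as follows. This sketch also folds in a fallback for unparseable input, an assumption beyond the one-line suggestion:

```python
def clamp_weight(value, default=0.5):
    """Coerce to float and clamp into [0.0, 1.0]; fall back to default on bad input."""
    try:
        w = float(value)
    except (TypeError, ValueError):
        return default
    return min(1.0, max(0.0, w))
```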
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of window_size and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization; if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.902Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, rather than each method doing its own preprocessing. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely uses a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves batch-replacement efficiency.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, the valid sub-words after processing are appended back to the same sub_tokens list, and finally they are added to results in one pass. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (29-162), extract_keywords (144-162)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
168-185: The LLM parsing logic is concise and robust, and covers common malformed input: empty items are skipped, score parsing is guarded, and letter casing is controlled, as expected.
45-47: No change needed: the config name windows_size matches the code reference.
The config file llm_config.py defines windows_size at line 34, and the code reads it correctly via llm_settings.windows_size, so there is no spelling mismatch. Likely an incorrect or invalid review comment.
64-65: TextRank's mask_words is not wired through, so the UI/API parameter has no effect. run() never reads context["mask_words"], and MultiLingualTextRank will always use the default empty list.
A minimal change is to sync mask_words right after reading max_keywords:
  self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mask_words = context.get("mask_words", None)
+ if mask_words is not None:
+     try:
+         self._textrank_model.mask_words = list(filter(None, (mask_words or "").split(",")))
+     except Exception as e:
+         log.warning("Invalid mask_words: %s", e)
⛔ Skipped due to learnings
Learnt from: Gfreely PR: hugegraph/hugegraph-ai#32 File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41 Timestamp: 2025-08-18T14:45:20.756Z Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization; if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely PR: hugegraph/hugegraph-ai#32 File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41 Timestamp: 2025-08-18T14:45:20.756Z Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of window_size and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely PR: hugegraph/hugegraph-ai#32 File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0 Timestamp: 2025-08-18T14:42:31.998Z Learning: In hugegraph-llm's TextRank implementation, user Gfreely uses a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves batch-replacement efficiency.
self._extract_method = llm_settings.keyword_extract_type
self._textrank_model = MultiLingualTextRank(
    keyword_num=max_keywords,
    window_size=llm_settings.windows_size)
Normalize the case of extract_method and support overriding it from context; otherwise front-end parameters will not take effect.
The method is currently read only from the configuration and is not lower()-ed, which mismatches the branch checks; nor does it accept an override value from context.
Suggested patch:
- self._extract_method = llm_settings.keyword_extract_type
+ self._extract_method = (llm_settings.keyword_extract_type or "hybrid").lower()
@@
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
# extract keywords with the LLM
- ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ ranks = self._extract_with_llm()
+ elif method == "textrank":
# extract keywords with TextRank
- ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ ranks = self._extract_with_textrank()
+ elif method == "hybrid":
# extract keywords with the hybrid method
- ranks = self._extract_with_hybrid()
+ ranks = self._extract_with_hybrid()
else:
- raise ValueError(f"Invalid extract_method: {self._extract_method}")
+ raise ValueError(f"Invalid extract_method: {method}")
Also applies to: 66-76
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 44-47 (and similarly for lines 66-76), the extract_method is set only from
configuration and not normalized or overridable by context; update the logic to
first check for a context-provided extract_method and fall back to
llm_settings.keyword_extract_type, then normalize the method string (e.g., call
lower()) before any branch comparisons so case differences don't break matching;
ensure any downstream branches use the same normalized value.
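The normalization described in the prompt can be condensed into a small helper. This is a sketch following the review's suggested names, not code from the repository:

```python
def resolve_extract_method(context, configured,
                           valid=("llm", "textrank", "hybrid")):
    """Prefer a context override, fall back to config, normalize case."""
    method = (context.get("extract_method") or configured or "hybrid").lower()
    if method not in valid:
        raise ValueError(f"Invalid extract_method: {method}")
    return method
```

Branching on the returned value keeps all case handling in one place, so "LLM", "llm", and a context override all reach the same branch.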
keywords = [] if not ranks else sorted(ranks, key=ranks.get, reverse=True)
keywords = [k.replace("'", "") for k in keywords]
context["keywords"] = keywords[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
🛠️ Refactor suggestion
Avoid printing the user query in INFO logs to reduce privacy risk.
Downgrade the query and keyword output to DEBUG, or mask them.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
📝 Committable suggestion
Before:
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
After:
log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
line 81, the code currently logs the full user query and extracted keywords at
INFO level, which raises privacy concerns; change the log level to DEBUG
(log.debug) so queries are not logged in INFO, and optionally sanitize the query
before logging (e.g., mask or truncate the query, or log only its length/first N
chars) to avoid storing full user input in logs; ensure the log message
preserves context by mentioning it's masked/truncated when sanitizing.
Actionable comments posted: 2
♻️ Duplicate comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
137-154: Hybrid may return fewer keywords than expected when the union is too large and the intersection too small (confirm the requirement). The current branch returns only the weighted intersection when union > max and intersection != max, so the count can fall below max_keywords. Given the project preference (each method returns a Dict; run() sorts and truncates uniformly), is this intentional, or should the result be topped up to the limit? Please confirm the product expectation.
(We respect the "interface consistency" preference; if this is confirmed as expected behavior, ignore this item. The constraints recorded in Learnings were taken into account.)
45-47: A note on window_size coming only from configuration (docs/consistency). It is injected here only from llm_settings.window_size, with no runtime override via context. If that conflicts with the PR goal of letting the UI pass window_size, state in the docs that "window size does not support dynamic adjustment" (per the existing trade-off) to avoid misleading front-end/API consumers.
98-112: Handle MemoryError carefully: re-raise it rather than continuing in an unstable state. Out-of-memory is a fatal error; logging it and returning an empty result may mask the problem.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
66-76: Support overriding extract_method from context, and unify its case. Currently only the config value is read, so an extract_method passed by the front end or upstream has no effect, and case normalization is not done once up front. Normalize the method in run() before branching:
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
      # extract keywords with the LLM
      ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ elif method == "textrank":
      # extract keywords with TextRank
      ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ elif method == "hybrid":
      # extract keywords with the hybrid method
      ranks = self._extract_with_hybrid()
  else:
-     raise ValueError(f"Invalid extract_method: {self._extract_method}")
+     raise ValueError(f"Invalid extract_method: {method}")
78-81: Deduplicate while preserving order, and reduce sensitive information in logs.
- The code currently only replaces quotes without deduplicating, so duplicate keywords can appear.
- Printing the full user query at INFO level is a privacy risk; downgrading to DEBUG is safer.
- keywords = [k.replace("'", "") for k in keywords]
- context["keywords"] = keywords[:self._max_keywords]
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ seen, ordered = set(), []
+ for k in keywords:
+     kk = k.replace("'", "")
+     if kk and kk not in seen:
+         seen.add(kk)
+         ordered.append(kk)
+ context["keywords"] = ordered[:self._max_keywords]
+ log.debug("User Query(len=%d): %.200s | Keywords: %s",
+           len(self._query or ""), (self._query or "")[:200], context["keywords"])
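The order-preserving dedup suggested above can be isolated into a helper. A sketch with an illustrative name, assuming the input list is already sorted by rank:

```python
def clean_keywords(keywords, max_keywords):
    """Strip quotes, drop empties and duplicates, preserve ranking order."""
    seen, ordered = set(), []
    for k in keywords:
        kk = k.replace("'", "")
        if kk and kk not in seen:
            seen.add(kk)
            ordered.append(kk)
    return ordered[:max_keywords]
```

Because a set only tracks membership while the list keeps insertion order, the highest-ranked occurrence of each keyword survives.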
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
119-123: More robust parsing and normalization of hybrid_llm_weights. The config value may be a string (e.g. "0.6"); the current float-type check would fall back to 0.5 in that case.
- if isinstance(llm_settings.hybrid_llm_weights, float):
-     llm_weights = min(1.0, max(0.0, float(llm_settings.hybrid_llm_weights)))
- else:
-     llm_weights = 0.5
+ try:
+     llm_weights = float(llm_settings.hybrid_llm_weights)
+ except Exception:
+     llm_weights = 0.5
+ llm_weights = min(1.0, max(0.0, llm_weights))
168-169: Use re.escape on start_token and enable multiline matching for more robust parsing. This prevents a start_token containing special characters from breaking the regex, and allows KEYWORDS: to appear at the start of a line.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ pattern = re.compile(rf'{re.escape(start_token)}([^\n]+)', flags=re.MULTILINE)
+ matches = pattern.findall(response)
41-41: The _language field is currently unused within this class (clean it up or wire it through). Neither the initialization nor the mapping is consumed by later logic (TextRank does not take a lang; the LLM parsing does not use it either). Remove the dead field or make its purpose explicit to avoid confusion.
- self._language = "english"
  ...
- lang = context.get("language", self._language).lower()
- if lang in ("en", "zh"):
-     lang = "english" if lang == "en" else "chinese"
- self._language = lang
+ # If language handling is needed downstream, confirm and wire it through here; otherwise remove it.
(Per the Learnings constraints: extraction methods return a Dict and run() handles post-processing uniformly; we avoid changing that convention.)
Also applies to: 60-63
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- hugegraph-llm/config.md (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
- hugegraph-llm/src/hugegraph_llm/config/llm_config.py
- hugegraph-llm/config.md
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (30-163), extract_keywords (145-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
self._max_keywords = context.get("max_keywords", self._max_keywords)
🛠️ Refactor suggestion
Validate the type and bounds of max_keywords to avoid slicing errors
The context may supply a string or a non-positive number; using it directly for slicing is risky.
- self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mk = context.get("max_keywords", self._max_keywords)
+ try:
+ mk = int(mk)
+ except (TypeError, ValueError):
+ mk = self._max_keywords
+ self._max_keywords = max(1, mk)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Validate and sanitize max_keywords from context to avoid slicing errors
mk = context.get("max_keywords", self._max_keywords)
try:
    mk = int(mk)
except (TypeError, ValueError):
    mk = self._max_keywords
# Ensure at least 1 keyword is requested
self._max_keywords = max(1, mk)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 64-65, the assignment self._max_keywords = context.get("max_keywords",
self._max_keywords) can receive non-integer or non-positive values which will
cause slicing errors; validate and normalize this value by: read raw =
context.get("max_keywords", self._max_keywords), if raw is None keep existing;
if it's not an int try to parse int(raw) inside a try/except and on failure
fallback to the default; ensure the resulting integer is >= 0 (or >=1 if you
require at least one keyword) and clamp it to a sensible upper bound if needed;
then assign the validated integer to self._max_keywords so downstream slicing is
safe.
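The validation described in the prompt above can be sketched as a small helper (the function name `sanitize_max_keywords` is hypothetical; the actual fix inlines this logic in `run()`):

```python
def sanitize_max_keywords(raw, default=5):
    """Coerce a context-supplied value into a safe positive int for slicing."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # None, dicts, or non-numeric strings fall back to the default
        value = default
    # Clamp to at least 1 so slicing always requests some keywords
    return max(1, value)
```

With this, `keywords[:sanitize_max_keywords(context.get("max_keywords"))]` never raises, whatever the UI sends.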
log.info("LLM extract Keywords: %s", lr_set)
log.info("TextRank extract Keywords: %s", tr_set)
🛠️ Refactor suggestion
Downgrade the keyword-set log level to avoid leaking user-input characteristics at INFO level
DEBUG is recommended.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
📝 Committable suggestion
log.debug("LLM extract Keywords: %s", lr_set)
log.debug("TextRank extract Keywords: %s", tr_set)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 131 to 133, the logging of the extracted keyword sets currently uses INFO
which may leak user input features; change those log.info calls to log.debug so
sensitive keyword data is only emitted at DEBUG level (keep the same message
text and formatting), and run a quick search to ensure no other user-feature
logs remain at INFO in this file.
Actionable comments posted: 0
♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
44-47: Pass the frontend extract_method through and provide a robust default
The value is currently read only from config, with no context override; if llm_settings.keyword_extract_type is None, an AttributeError is raised. Normalize the value, allow a context override, and default to "hybrid".
- self._extract_method = llm_settings.keyword_extract_type.lower()
+ self._extract_method = (llm_settings.keyword_extract_type or "hybrid").lower()
@@
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
      # Extract keywords with the LLM
      ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ elif method == "textrank":
      # Extract keywords with TextRank
      ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ elif method == "hybrid":
      # Extract keywords with the hybrid method
      ranks = self._extract_with_hybrid()
  else:
-     raise ValueError(f"Invalid extract_method: {self._extract_method}")
+     raise ValueError(f"Invalid extract_method: {method}")
Also applies to: 66-76

64-65: Validate the type and bounds of max_keywords
context may supply a string or a non-positive number; slicing with it directly raises errors or behaves unexpectedly.
- self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mk = context.get("max_keywords", self._max_keywords)
+ try:
+     mk = int(mk)
+ except (TypeError, ValueError):
+     mk = self._max_keywords
+ self._max_keywords = max(1, mk)

81-81: Logging the user query is a privacy risk; downgrade to DEBUG or redact
Avoid recording raw user input and keywords at INFO level.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])

131-133: Lower the keyword-set log level to reduce information exposure
The set contents can carry sensitive user semantics; use DEBUG.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
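For context, the hybrid mode that produces lr_set and tr_set combines two keyword→score maps; a sketch of such a weighted merge (the 0.6 weight and the helper name are assumptions, not the PR's actual constants):

```python
def merge_hybrid_scores(llm_scores, tr_scores, llm_weight=0.6):
    """Weighted union of LLM and TextRank keyword->score maps."""
    merged = {}
    for kw in set(llm_scores) | set(tr_scores):
        merged[kw] = (llm_weight * llm_scores.get(kw, 0.0)
                      + (1.0 - llm_weight) * tr_scores.get(kw, 0.0))
    # Highest combined score first, matching run()'s final sort-and-slice step
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))
```

Keeping both inputs as Dict[str, float] matches the interface-consistency preference recorded in the learnings above.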
🧹 Nitpick comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
78-80: Do not strip single quotes indiscriminately; only strip paired surrounding quotes so legitimate forms like O'Reilly survive
The current replace breaks English proper nouns and contractions.
- keywords = [k.replace("'", "") for k in keywords]
+ keywords = [
+     (k[1:-1] if len(k) >= 2 and k[0] == k[-1] and k[0] in ("'", '"') else k)
+     for k in keywords
+ ]

87-89: Remove the redundant f-string wrapper
The outer f"" is redundant here.
- prompt_run = f"{self._extract_template.format(question=self._query, max_keywords=self._max_keywords)}"
+ prompt_run = self._extract_template.format(question=self._query, max_keywords=self._max_keywords)

104-108: Do not swallow MemoryError
Out-of-memory is usually unrecoverable; log it and re-raise so the outer layer can apply circuit breaking or load shedding.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise

172-174: Regex-escape the start token before matching
Avoid mis-matches or regex errors when start_token contains special characters.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ token = re.escape(start_token) if start_token else start_token
+ matches = re.findall(rf'{token}([^\n]+\n?)', response)

128-130: Minor optimization: build the sets directly from the dict views
Equivalent but more concise.
- lr_set = set(k for k in llm_scores)
- tr_set = set(k for k in tr_scores)
+ lr_set = set(llm_scores)
+ tr_set = set(tr_scores)
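The regex-escaping nitpick can be demonstrated with a token that contains metacharacters (the helper name is for illustration only):

```python
import re

def find_after_token(response, start_token):
    # Escape the token so metacharacters such as '(' or '?' match literally
    token = re.escape(start_token)
    return re.findall(rf"{token}([^\n]+)", response)
```

With `start_token="KEYWORDS:"` both versions behave the same, but a token like `"K(1):"` silently matches nothing without the escape, because `(1)` is treated as a capture group.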
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, the valid sub-words are appended back into the same sub_tokens list, and the list is finally added to results as a whole. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM (22-74)
generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank (30-163)
extract_keywords (145-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
60-63: Language-code normalization mapping is in place
The mapping for "en"/"zh" is reasonable and aligns with the "english"/"chinese" values used downstream.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
57-66: Fix extract_word ignoring the language parameter
- The method signature declares language, but the implementation always uses llm_settings.language, so the interface and the implementation disagree.
- Prefer the caller-supplied language and fall back to llm_settings.language when it is absent, for example:
- self._operators.append(WordExtract(text=text, language=llm_settings.language))
+ lang = language or llm_settings.language
+ self._operators.append(WordExtract(text=text, language=lang))
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
65-75: extract_method cannot be overridden at runtime and its casing is inconsistent (contrary to UI expectations)
Per the PR goals, the UI/API can pass extract_method. Allow a context override and branch only after lower(); this matches earlier review suggestions.
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
      # Extract keywords with the LLM
      ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ elif method == "textrank":
      # Extract keywords with TextRank
      ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ elif method == "hybrid":
      # Extract keywords with the hybrid method
      ranks = self._extract_with_hybrid()
  else:
-     raise ValueError(f"Invalid extract_method: {self._extract_method}")
+     raise ValueError(f"Invalid extract_method: {method}")
Also applies to: 66-76

63-63: max_keywords lacks type and bounds validation
context["max_keywords"] coming from the UI may be a string or a non-positive number; slicing with it directly is risky.
- self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mk = context.get("max_keywords", self._max_keywords)
+ try:
+     mk = int(mk)
+ except (TypeError, ValueError):
+     mk = self._max_keywords
+ self._max_keywords = max(1, mk)
🧹 Nitpick comments (4)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
58-59: Reduce log sensitivity and noise
Keywords only need to be inspected while debugging; use DEBUG to avoid emitting user-related information at INFO level.
- log.info("KEYWORDS: %s", context['keywords'])
+ log.debug("KEYWORDS: %s", context['keywords'])
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
68-86: Decoupling from UI/config: let extract_keywords support a runtime language override
Currently only llm_settings.language is passed in, so an upstream caller (UI/API) that wants to switch languages temporarily via context cannot make it take effect. KeywordExtract.run() could read context["language"] and normalize it (an implementation sketch was already given in a comment on another file).
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
77-81: Avoid recording the user query and keywords at INFO level
To reduce privacy risk and log noise, downgrade to DEBUG.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
118-122: Downgrade sensitive-content logs to DEBUG
Hybrid mode logs both the LLM and TextRank keyword sets; lower the log level.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
Also applies to: 130-131
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1 hunks)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (3 hunks)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, the valid sub-words are appended back into the same sub_tokens list, and the list is finally added to results as a whole. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (2)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (3)
hugegraph-llm/src/hugegraph_llm/utils/decorators.py (3)
log_operator_time (64-78)
log_time (38-61)
record_rpm (81-106)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
WordExtract (29-81)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
KeywordExtract (31-191)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM (22-74)
generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank (30-163)
extract_keywords (145-163)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
extract_keywords (68-87)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
86-95: LLM output parsing is more robust now, and the logic is clear
The timing in _extract_with_llm and the robust parsing in _extract_keywords_from_response (separator validation, exception handling) are well implemented and work as expected.
Also applies to: 169-191
if self._language in ("EN", "CN"):
    self._language = "english" if self._language == "EN" else "chinese"
🛠️ Refactor suggestion
The language-normalization condition can never be true, so the stopword language may be wrong
__init__ already applies language.lower(), yet this code checks the uppercase "EN"/"CN", so the branch never triggers. Normalize consistently and accept common abbreviations (en/zh/cn).
- if self._language in ("EN", "CN"):
-     self._language = "english" if self._language == "EN" else "chinese"
+ lang = (self._language or "").strip().lower()
+ if lang in ("en", "zh", "cn"):
+     self._language = "english" if lang == "en" else "chinese"
+ else:
+     self._language = lang
📝 Committable suggestion
lang = (self._language or "").strip().lower()
if lang in ("en", "zh", "cn"):
    self._language = "english" if lang == "en" else "chinese"
else:
    self._language = lang
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py around
lines 51-52, the language normalization checks for uppercase "EN"/"CN" but
__init__ already lowercases language, so the condition never triggers; update
the condition to check the normalized lowercase forms (e.g., 'en','zh','cn') and
map them to the expected stopword labels ('english' for en, 'chinese' for zh or
cn). Ensure the code handles both 'zh' and 'cn' as Chinese and uses
self._language = 'english' or 'chinese' accordingly.
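A minimal sketch of the normalization the reviews keep asking for (the mapping table is an assumption covering the codes mentioned above):

```python
def normalize_language(raw, default="english"):
    """Map assorted language codes to NLTK's expected 'english'/'chinese'."""
    lang = (raw or "").strip().lower()
    if lang in ("en", "english"):
        return "english"
    if lang in ("zh", "zh-cn", "zh_cn", "cn", "chinese"):
        return "chinese"
    return default
```

Because the result is always one of NLTK's expected names, downstream calls like `nltk.corpus.stopwords.words(lang)` no longer depend on the caller's casing.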
self._language = language
self._max_keywords = max_keywords
self._extract_template = extract_template or KEYWORDS_EXTRACT_TPL
self._extract_method = llm_settings.keyword_extract_type.lower()
self._textrank_model = MultiLingualTextRank(
🛠️ Refactor suggestion
Language casing/abbreviations are not unified, which may affect downstream stopword and tokenization strategies
self._language is not normalized and is later checked only against "EN"/"CN". Normalize it at the start of run(), accept en/zh/cn, and keep the value as english/chinese.
- self._language = language
+ self._language = (language or "").strip()
Committable suggestion skipped: line range outside the PR's diff.
if self._language in ("EN", "CN"):
    self._language = "english" if self._language == "EN" else "chinese"
self._max_keywords = context.get("max_keywords", self._max_keywords)
🛠️ Refactor suggestion
The language-normalization condition is wrong (same as above)
Checking only "EN"/"CN" misses the common en/zh/cn. Suggested fix:
- if self._language in ("EN", "CN"):
-     self._language = "english" if self._language == "EN" else "chinese"
+ lang = (self._language or "").strip().lower()
+ if lang in ("en", "zh", "cn"):
+     self._language = "english" if lang == "en" else "chinese"
+ else:
+     self._language = lang
📝 Committable suggestion
# Normalize language codes (handle upper/lower case and common aliases)
lang = (self._language or "").strip().lower()
if lang in ("en", "zh", "cn"):
    self._language = "english" if lang == "en" else "chinese"
else:
    self._language = lang
self._max_keywords = context.get("max_keywords", self._max_keywords)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 61 to 63, the language normalization only checks for "EN"/"CN" and misses
common variants like "en", "zh", "cn" etc.; change the logic to normalize the
input (e.g., language = (self._language or "").strip().lower()) and map common
codes to target names (map "en" -> "english"; treat "zh", "zh-cn", "zh_cn", "cn"
-> "chinese"), then set self._language based on that mapping before using it
further.
Actionable comments posted: 0
♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
76-79
: 在 INFO 级别记录原始查询与关键词,存在隐私合规风险;建议降级到 DEBUG(重复建议)- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"]) + log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
129-130
: 在 INFO 级别输出关键词集合(重复建议)关键词集合同样属于用户输入特征,建议降级为 DEBUG。
- log.info("LLM extract Keywords: %s", lr_set) - log.info("TextRank extract Keywords: %s", tr_set) + log.debug("LLM extract Keywords: %s", lr_set) + log.debug("TextRank extract Keywords: %s", tr_set)
64-74
: extract_method 未支持 context 覆盖,前端选择不会生效分支判断仅依赖构造时从配置读取的值,忽略了 API/UI 传入的 extract_method,违背了本 PR “UI/API 可选 llm/textrank/hybrid”的目标。
建议在 run 中使用 context 覆盖并统一小写:
- if self._extract_method == "llm": + method = (context.get("extract_method", self._extract_method) or "hybrid").strip().lower() + if method == "llm": # 使用 LLM 提取关键词 ranks = self._extract_with_llm() - elif self._extract_method == "textrank": + elif method == "textrank": # 使用 TextRank 提取关键词 ranks = self._extract_with_textrank() - elif self._extract_method == "hybrid": + elif method == "hybrid": # 使用 混合方法 提取关键词 ranks = self._extract_with_hybrid() else: - raise ValueError(f"Invalid extract_method: {self._extract_method}") + raise ValueError(f"Invalid extract_method: {method}")
60-63: Language normalization and context overrides are missing, so language/max_keywords passed from the UI are ignored or misclassified. Currently only "cn" maps to "chinese" and everything else falls back to "english"; context["language"] is neither applied nor normalized, and max_keywords gets no type/bounds checking. Given the PR goal (UI/API accept language and max_keywords), this should tolerate inputs like "en/zh/cn/zh-cn" and handle them robustly.

Suggested minimal patch (normalize language + context override + validate max_keywords):

```diff
-        # 未传入值或者其他值,默认使用英文
-        self._language = "chinese" if self._language == "cn" else "english"
-        self._max_keywords = context.get("max_keywords", self._max_keywords)
+        # 语言归一化与上下文覆盖(支持 en/zh/cn/zh-cn 等)
+        lang_raw = (context.get("language", self._language) or "").strip().lower()
+        if lang_raw in ("en", "english"):
+            self._language = "english"
+        elif lang_raw in ("zh", "zh-cn", "zh_cn", "cn", "chinese"):
+            self._language = "chinese"
+        else:
+            self._language = "english"
+
+        # max_keywords 类型与边界校验
+        mk = context.get("max_keywords", self._max_keywords)
+        try:
+            mk = int(mk)
+        except (TypeError, ValueError):
+            mk = self._max_keywords
+        self._max_keywords = max(1, mk)
```
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
170-172: The regex must escape start_token, or tokens containing special characters will mis-match. A code comment already notes that re.escape(start_token) is needed, but the implementation never applies it; escape it for robustness.
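The failure mode is easy to reproduce; "KEYWORDS(v2):" is a made-up token used only to show what happens when the token contains regex metacharacters:

```python
import re

# With a benign token, escaped and unescaped patterns agree:
token, resp = "KEYWORDS:", "KEYWORDS:graph, rank\n"
assert re.findall(rf'{re.escape(token)}([^\n]+\n?)', resp) == ["graph, rank\n"]

# A token containing metacharacters silently breaks the unescaped version:
token, resp = "KEYWORDS(v2):", "KEYWORDS(v2):graph\n"
assert re.findall(rf'{re.escape(token)}([^\n]+\n?)', resp) == ["graph\n"]
# Unescaped, '(v2)' becomes a capture group matching the literal text 'v2',
# so the pattern actually searches for 'KEYWORDSv2:' and finds nothing:
assert re.findall(rf'{token}([^\n]+\n?)', resp) == []
```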
```diff
-        matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+        matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
```
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
57-66: The language parameter of extract_word is unused, leaving the interface semantics unclear. The parameter is currently ignored, which can mislead callers; either remove it or document it as deprecated to avoid false expectations.

Optional minimal change (drop the parameter and the related doc line):

```diff
-    def extract_word(self, text: Optional[str] = None, language: str = "english"):
+    def extract_word(self, text: Optional[str] = None):
@@
-        :param language: Language of the text.
```

If UI-driven overrides should remain supported, consider passing language through the context of run(**kwargs) and letting WordExtract.run prefer the context value.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
(3 hunks)hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
(3 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to share a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], with the main run() method handling sorting and truncation instead of pre-processing inside each method. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) setting.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing repeated Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to share a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], with the main run() method handling sorting and truncation instead of pre-processing inside each method. This is their architectural preference.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) setting.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing repeated Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: each loop iteration reassigns sub_tokens to the current token's segmentation result, the processed valid sub-words are appended back into the same sub_tokens list, and everything is finally added to results. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (2)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
WordExtract
(30-81)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
KeywordExtract
(31-190)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM
(22-74)generate
(26-31)hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs
(25-98)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank
(30-163)extract_keywords
(145-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (3.10)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
96-109: Please confirm the MemoryError handling strategy. Merely logging a MemoryError and returning an empty result may mask a serious resource problem; should this exception fail fast (or be reported) so it can drive alerting/throttling? Decide based on the runtime environment and SLOs.
I can offer two patch variants: 1) keep the current behavior but raise an alert tag and trigger degradation; 2) re-raise MemoryError and let an upstream circuit breaker handle it uniformly. Want a side-by-side patch?
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
80-85: Not forwarding UI parameters here is fine, but confirm the context path is wired through. This method no longer explicitly passes max_keywords/language/extract_method/window_size/mask_words. If the UI injects them via the API as kwargs into pipeline.run(**kwargs), make sure KeywordExtract.run reads these keys from context (especially extract_method and language). I provided a patch for KeywordExtract.run in another comment.
Want an end-to-end verification script that checks whether the keys make it from the front end to the operator (using rg/ast-grep to trace context assignments and usages)?
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
239-243: Potential None argument causing logic problems: max_graph_items may be None. kwargs.get('max_graph_items') defaults to None when absent, which overrides query_graphdb()'s default and hands GraphRAGQuery a None (prone to type/logic errors). Fall back to the configured default here.

```diff
-        self.extract_keywords().query_graphdb(
-            max_graph_items=kwargs.get('max_graph_items')
-        ).synthesize_answer()
+        self.extract_keywords().query_graphdb(
+            max_graph_items=kwargs.get('max_graph_items', huge_settings.max_graph_items)
+        ).synthesize_answer()
```
🧹 Nitpick comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4)
51-55: Minor naming typo: _text2gqlt_llm → _text2gql_llm. Keep naming consistent for easier search and maintenance.

```diff
-        self._text2gqlt_llm = llm or LLMs().get_text2gql_llm()
+        self._text2gql_llm = llm or LLMs().get_text2gql_llm()
```
68-85: Confirm request-level overrides: how do extract_method/language/window_size/mask_words take effect within a single request? This method only passes text/extract_template; KeywordExtract currently reads only max_keywords from context and takes the rest from llm_settings. If the UI/API should override these per request, please confirm:
- whether llm_settings is updated before entering pipeline.run (with thread/request isolation);
- or whether KeywordExtract reads extract_method/language/window_size/mask_words from context.
I can follow up with the context-reading logic for KeywordExtract (without changing this file's interface).
117-126: Avoid binding dynamic config (prompt/max_graph_items) in default parameters at definition time. Python evaluates default parameters when the function is defined: if prompt.gremlin_generate_prompt and huge_settings.max_graph_items can change at runtime (language/config switches), the current code freezes them to their import-time values. Falling back inside the function body is safer.

```diff
 def query_graphdb(
     self,
     max_deep: int = 2,
-    max_graph_items: int = huge_settings.max_graph_items,
+    max_graph_items: Optional[int] = None,
     max_v_prop_len: int = 2048,
     max_e_prop_len: int = 256,
     prop_to_match: Optional[str] = None,
     num_gremlin_generate_example: Optional[int] = -1,
-    gremlin_prompt: Optional[str] = prompt.gremlin_generate_prompt,
+    gremlin_prompt: Optional[str] = None,
 ):
@@
     GraphRAGQuery(
         max_deep=max_deep,
-        max_graph_items=max_graph_items,
+        max_graph_items=max_graph_items or huge_settings.max_graph_items,
         max_v_prop_len=max_v_prop_len,
         max_e_prop_len=max_e_prop_len,
         prop_to_match=prop_to_match,
         num_gremlin_generate_example=num_gremlin_generate_example,
-        gremlin_prompt=gremlin_prompt,
+        gremlin_prompt=gremlin_prompt or prompt.gremlin_generate_prompt,
     )
```

Also applies to: 139-149
91-97: Likewise, avoid binding huge_settings.* directly as default parameters in the other methods. If these settings can change at runtime (UI-adjustable), default them to None and fall back inside the function body so config changes take effect immediately.

```diff
 def keywords_to_vid(
     self,
     by: Literal["query", "keywords"] = "keywords",
-    topk_per_keyword: int = huge_settings.topk_per_keyword,
+    topk_per_keyword: Optional[int] = None,
     topk_per_query: int = 10,
-    vector_dis_threshold: float = huge_settings.vector_dis_threshold,
+    vector_dis_threshold: Optional[float] = None,
 ):
@@
     SemanticIdQuery(
         embedding=self._embedding,
         by=by,
-        topk_per_keyword=topk_per_keyword,
+        topk_per_keyword=topk_per_keyword or huge_settings.topk_per_keyword,
         topk_per_query=topk_per_query,
-        vector_dis_threshold=vector_dis_threshold,
+        vector_dis_threshold=vector_dis_threshold or huge_settings.vector_dis_threshold,
     )
```

```diff
 def merge_dedup_rerank(
     self,
     graph_ratio: float = 0.5,
     rerank_method: Literal["bleu", "reranker"] = "bleu",
     near_neighbor_first: bool = False,
     custom_related_information: str = "",
-    topk_return_results: int = huge_settings.topk_return_results,
+    topk_return_results: Optional[int] = None,
 ):
@@
     MergeDedupRerank(
         embedding=self._embedding,
         graph_ratio=graph_ratio,
         method=rerank_method,
         near_neighbor_first=near_neighbor_first,
         custom_related_information=custom_related_information,
-        topk_return_results=topk_return_results
+        topk_return_results=topk_return_results or huge_settings.topk_return_results
     )
```

Also applies to: 106-114, 167-174, 181-189
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
(3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) setting.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to share a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], with the main run() method handling sorting and truncation instead of pre-processing inside each method. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing repeated Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
WordExtract
(30-81)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
KeywordExtract
(31-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4)
19-19: typing import-order adjustment is OK. No functional impact; fine to keep.
21-21: The new huge_settings/prompt imports match their later usage. They are consistent with the default parameters and call sites below; no issue.
35-35: Decorator import-order adjustment does not affect behavior. Compatible with existing decorator usage.
57-65: Dropping the language parameter matches WordExtract's current signature. The language is derived from llm_settings, so it no longer needs to be passed here.
Support TextRank, and update the UI to allow changing the keyword extraction method.
Main changes:
Added options to the RAG interface for selecting the keyword extraction method, along with several TextRank settings: the number of keywords to extract, the sliding-window size, and the processing language ('en' for English, 'zh' for Chinese).
A 'TextRank mask words' setting has also been added: users can manually enter specific phrases composed of letters and symbols to keep them from being split during Chinese word segmentation.
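The PR's MultiLingualTextRank class is not reproduced in this thread, but the underlying algorithm can be sketched in a few lines: build a co-occurrence graph over a sliding window, then rank words by power-iteration PageRank. Tokenization, stopword filtering, and the real class/parameter names are omitted; this toy function only mirrors the keyword→score return contract:

```python
from collections import defaultdict

def textrank_keywords(words, window_size=3, top_k=5, d=0.85, iters=50):
    """Toy TextRank: undirected co-occurrence graph + power-iteration PageRank."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        # link each word to neighbors inside the sliding window
        for j in range(i + 1, min(i + window_size, len(words))):
            if w != words[j]:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        # standard PageRank update with damping factor d
        score = {
            w: (1 - d) + d * sum(score[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])  # keyword -> score, matching the PR's contract

print(textrank_keywords("graph rank graph keyword rank graph".split(), top_k=2))
```

The real implementation additionally handles Chinese segmentation, stopwords, and the mask-words list described above.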
Summary by CodeRabbit
New features
Behavior changes
Documentation
Chores