feat(llm): support TextRank #32
base: main
Conversation
Fix apache#224; update the UI to support changing the keyword extraction method.
Fix the pylint check failure.
Walkthrough: Adds a multilingual TextRank implementation and switches keyword extraction to three modes (llm/textrank/hybrid), with results returned as keyword→score mappings. Updates the extract_keywords signature and its call sites in the RAG flow, revises prompt templates and configuration options, adds scipy and python-igraph dependencies, strengthens NLTK resource checks, and adjusts several imports and .gitignore.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant RAG as RAGPipeline.extract_keywords
    participant KE as KeywordExtract.run
    participant LLM as LLM Backend
    participant TR as MultiLingualTextRank
    User->>RAG: submit text for extraction
    RAG->>KE: call KeywordExtract.run()
    alt mode == "llm"
        KE->>LLM: send prompt, await response
        LLM-->>KE: KEYWORDS-formatted response
        KE->>KE: parse into {keyword: score}
    else mode == "textrank"
        KE->>TR: extract_keywords(text, lang)
        TR-->>KE: {keyword: score}
    else mode == "hybrid"
        par LLM path
            KE->>LLM: request keyword scores
            LLM-->>KE: {keyword: score}
        and TextRank path
            KE->>TR: extract_keywords(text, lang)
            TR-->>KE: {keyword: score}
        end
        KE->>KE: fuse with hybrid_llm_weights and sort
    end
    KE-->>RAG: keyword-to-score mapping
    RAG-->>User: return results
```
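The hybrid branch above fuses two keyword→score maps. As a minimal sketch of one plausible merge, assuming `hybrid_llm_weights` reduces to a single LLM-side weight in [0, 1] (the PR's actual parameter shape may differ):

```python
def fuse_scores(llm_scores: dict, textrank_scores: dict, llm_weight: float = 0.6) -> dict:
    """Weighted merge of two keyword -> score maps, sorted by fused score (descending)."""
    keywords = set(llm_scores) | set(textrank_scores)
    fused = {
        kw: llm_weight * llm_scores.get(kw, 0.0)
        + (1.0 - llm_weight) * textrank_scores.get(kw, 0.0)
        for kw in keywords
    }
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

print(fuse_scores({"graph": 0.9, "rag": 0.4}, {"graph": 0.5, "textrank": 0.8}))
```

Keywords missing from one side simply score 0.0 there, so a keyword found by both methods is naturally boosted.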
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
📜 Recent review details: Configuration used: CodeRabbit UI, Review profile: CHILL, Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
@codecov-ai-reviewer review
Actionable comments posted: 4
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 257-259: Consider refining the window-size check. A meaningful co-occurrence graph can still be built when the number of unique words is smaller than the window size; adjust the condition accordingly:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:
     return
```

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)
Lines 297-302: Consider improving the hint text for the mask_words input box. The current hint is long; simplify it and provide examples:

```diff
 mask_words_input = gr.Textbox(
     label="TextRank mask words",
-    info="""Enter any words you want to protect from being split during Chinese word segmentation(e.g., C++, website URLs). Separate each entry with a comma.""",
+    info="Words protected from being split during Chinese word segmentation. Separate with commas. Example: C++,GitHub,ChatGPT",
     show_copy_button=True,
     lines=7,
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
hugegraph-llm/pyproject.toml (1 hunks)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (11 hunks)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (5 hunks)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
hugegraph-llm/pyproject.toml (1)
Learnt from: cgwer
PR: hugegraph/hugegraph-ai#10
File: hugegraph-python-client/pyproject.toml:0-0
Timestamp: 2025-05-27T06:55:13.779Z
Learning: The hugegraph-python-client is a component within the hugegraph-ai project repository (apache/incubator-hugegraph-ai), not a standalone repository. When reviewing project URLs in pyproject.toml files within this project, they should point to the main hugegraph-ai repository.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (4)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4): RAGPipeline (38-267), extract_keywords (68-100), keywords_to_vid (106-130), import_schema (102-104)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/answer_synthesize.py (1): AnswerSynthesize (36-259)
- hugegraph-llm/src/hugegraph_llm/utils/decorators.py (1): with_task_id (109-123)
- hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1): import_schema (46-55)
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
Lines 68-100: The implementation is correct. The new TextRank-related parameters are well designed, and the documentation is clear. Normalizing the language code from "english" to "en" is a good improvement.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 281-301: The main TextRank method is well implemented. The complete keyword-extraction flow is sound and includes appropriate boundary checks.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (2)
Lines 83-90: Parameter passing is correct. The new TextRank-related parameters are properly forwarded to extract_keywords.
Lines 431-460: The batch-processing function is updated correctly; the new parameters are passed through consistently.
```python
if self._extract_method == "TextRank":
    # extract keywords with TextRank
    keywords = self._extract_with_textrank()
else:
    # extract keywords with the LLM
    keywords = self._extract_with_llm()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])

# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
```
🛠️ Refactor suggestion
Add error handling to the LLM extraction path as well. Currently only the TextRank path is wrapped in try/except; giving the LLM path the same treatment would improve the system's robustness:

```diff
 if self._extract_method == "TextRank":
     # extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if self._extract_method == "TextRank":
    # extract keywords with TextRank
    keywords = self._extract_with_textrank()
else:
    # extract keywords with the LLM
    try:
        keywords = self._extract_with_llm()
    except Exception as e:
        log.error("LLM Keyword extraction error: %s", e)
        keywords = set()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
```
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 73 to 85, the LLM keyword extraction method lacks error handling unlike
the TextRank method. Add a try-except block around the call to
self._extract_with_llm() to catch and handle any exceptions, ensuring the system
remains robust. Log the error appropriately and decide on a fallback or empty
keyword set if an exception occurs.
The self.llm.generate method is also called elsewhere in the project and is already sufficiently robust, so it is not necessary to add error handling.
@Gfreely, understood! Thank you for the clarification. You have better context about the project's LLM implementation and its robustness across the codebase. I'll defer to your technical judgment on this matter.
🧠 Learnings used
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
Actionable comments posted: 2
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 73-85: The LLM path still lacks error handling. Per the earlier review, the TextRank path now has try/except error handling, but the LLM path still has no equivalent:

```diff
 if self._extract_method == "TextRank":
     # extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
```
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 263-264: The graph-node-count check may be too strict. Returning immediately when the number of unique_words is smaller than window_size skips texts that could still yield a useful graph:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:
     return
```
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
Lines 153-164: Stopword-file loading now has proper error handling. Per the earlier review, try/except has been added and a missing file is handled gracefully.
Lines 199-200: The regular-expression pattern is fixed. The syntax error flagged earlier is resolved; (?![a-zA-Z0-9]) is now correctly used as a negative lookahead.
Lines 292-294: The post-preprocessing check is sound. If the token list is empty after preprocessing, an empty list is returned immediately, avoiding downstream errors.
Lines 300-301: The graph-state check is solid. Verifying that the graph exists and has nodes before running PageRank prevents algorithm failures.
Actionable comments posted: 1
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Lines 139-142: The constructor parameters lack default values. Per the earlier review, the constructor parameters should provide defaults to improve robustness. Note: this was flagged in a previous review but is still present in the code.
Line 168: The mask_words parameter needs a None check. Per the earlier review, calling split() on a None mask_words raises an exception:

```diff
-self.mask_words = list(filter(None, mask_words.split(',')))
+self.mask_words = list(filter(None, (mask_words or "").split(',')))
```

Note: this was flagged in a previous review but is still present in the code.
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
Lines 170-253: The text-preprocessing logic is overly complex; consider refactoring. The _preprocess method mixes elaborate Chinese and English handling; splitting it into smaller dedicated methods would improve readability and maintainability. Suggested structure:

```python
def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    if lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    ...

def _preprocess_english(self, text):
    # English preprocessing logic
    ...
```

Lines 217-218: Dynamically extending the dictionary may hurt performance. Adding placeholders to the jieba dictionary on every call can be costly, especially when processing large volumes of text. Consider adding the common placeholder patterns once in the constructor, or using jieba's temporary-dictionary facilities:

```python
# pre-initialize once in the constructor
jieba.initialize()
```

Lines 263-264: The graph-node-count check may be inaccurate. Returning when the number of unique words is smaller than the sliding window may be suboptimal, since a meaningful graph can still be built from few words. Suggested condition:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:  # at least two words are needed to build a graph
     return
```
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Lines 287-307: The main keyword-extraction function is clear. extract_keywords handles boundary conditions correctly, including empty token lists and empty graphs; error handling and flow control are sound.
Lines 98-109: The TextRank extraction method is well implemented, with appropriate error handling, performance monitoring, and logging. The caught exception types cover common runtime errors, and the return type matches the LLM method.
Actionable comments posted: 0
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 48-57: The textrank_kwargs parameter still needs a None check. Even though all current instantiation sites set defaults, using **textrank_kwargs directly in the constructor is risky: if textrank_kwargs is None, it fails at runtime. Add a guard for robustness:

```diff
-self._textrank_model = MultiLingualTextRank(**textrank_kwargs)  # TextRank parameters
+self._textrank_model = MultiLingualTextRank(**(textrank_kwargs or {}))  # TextRank parameters
```
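The `or`-fallback guard suggested here is the same pattern recommended earlier for mask_words; a standalone sketch of how both behave with None (function names are illustrative, not from the PR):

```python
def parse_mask_words(mask_words):
    # `or ""` turns None into a safe split target; filter(None, ...) drops empty fragments
    return list(filter(None, (mask_words or "").split(',')))

def make_kwargs(textrank_kwargs):
    # `or {}` lets **-unpacking succeed when the caller passes None
    return dict(**(textrank_kwargs or {}))

print(parse_mask_words(None))           # []
print(parse_mask_words("C++,,GitHub"))  # ['C++', 'GitHub']
print(make_kwargs(None))                # {}
```

Without the guards, both `None.split(',')` and `MultiLingualTextRank(**None)` raise a TypeError.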
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 169-252: The preprocessing logic is complex but comprehensive. The Chinese/English pipeline (mask-word protection, regex cleanup, tokenization, POS filtering) is well designed, though the complex regular-expression patterns may affect performance.
For processing large volumes of text, consider the following optimizations:
- precompile frequently used regex patterns
- consider a more efficient text-processing library
- add a length limit or chunked processing for very long texts
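The precompilation point can be illustrated in isolation; the pattern below is a stand-in, not the one used in the PR:

```python
import re

# compiled once at module level instead of on every preprocessing call
_NON_WORD = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]+")

def clean(text: str) -> str:
    # reusing one compiled pattern avoids repeated compilation overhead per call
    return _NON_WORD.sub(" ", text).strip()

print(clean("Hello, 世界!!"))  # Hello 世界
```

Module-level `re.compile` moves the pattern-compilation cost out of the hot path, which matters when `_preprocess` runs once per document.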
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes learning as quoted above.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (6)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
Lines 18-37: The imports and constant definitions look reasonable. The new imports are all needed for TextRank, including jieba for Chinese segmentation, networkx for graph algorithms, and nltk for English processing; the EXTRACT_STOPWORDS constant is appropriate.
Lines 73-85: The dispatch logic is implemented correctly. The conditional routes between TextRank and the LLM properly, error handling is in place, and keyword post-processing and context updates look sound.
Lines 98-108: The TextRank extraction method is solid. Error handling covers multiple exception types, logging aids debugging, and returning an empty set as a fallback is a reasonable design.
Lines 139-168: The MultiLingualTextRank constructor is complete: sensible defaults, reasonable POS filters, stopword loading with appropriate error handling, and the mask_words None check is implemented.
Lines 254-285: Graph construction and PageRank are implemented correctly. The co-occurrence graph construction is sound, self-loops are avoided, and edge weights are computed correctly. PageRank uses standard parameters, and the empty-graph check ensures robustness.
Lines 286-306: The main extraction flow is complete, following the standard TextRank pipeline: preprocess → build graph → PageRank → top-k. Boundary conditions, including empty token lists and empty graphs, are handled well.
Pull Request Overview
This PR adds support for TextRank-based keyword extraction in the RAG pipeline, updates the pipeline operator and demo UI to expose TextRank parameters, and introduces new dependencies for graph-based ranking.
- Added a MultiLingualTextRank class and branched keyword extraction logic between LLM and TextRank.
- Extended RAGPipeline.extract_keywords to accept method, window size, mask words, and language settings.
- Updated the Gradio demo (rag_block.py) to include UI controls for extraction method, language, mask words, keyword count, and sliding window.
- Added networkx and scipy to project dependencies.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
File | Description
---|---
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py | Implemented TextRank extraction, added MultiLingualTextRank, and method branching
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py | Propagated new extract_method and textrank_kwargs through pipeline
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py | Added Gradio inputs for extraction method, language, mask words, max keywords, and window size
hugegraph-llm/pyproject.toml | Added networkx and scipy dependencies
@codecov-ai-reviewer review
Actionable comments posted: 0
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
Lines 179-262: Consider splitting the complex preprocessing method. The _preprocess method currently takes on too many responsibilities, including masking, cleanup, tokenization, and tagging; splitting it into smaller methods would improve maintainability, e.g.:

```python
def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    if lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    ...

def _preprocess_english(self, text):
    # English preprocessing logic
    ...
```
📜 Review details
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
Learnt from: day0n (hugegraph/hugegraph-ai#16): same prompt-attributes and llm_settings learnings as quoted above.
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/ollama.py (1): generate (37-60)
- hugegraph-llm/src/hugegraph_llm/models/llms/openai.py (1): generate (57-85)
- hugegraph-llm/src/hugegraph_llm/models/llms/qianfan.py (1): generate (38-53)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-100)
- hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-80), stopwords (36-53)
⏰ Context from checks skipped due to timeout of 90000ms. (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (11)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (11)
Lines 18-37: The imports and constants are reasonable; the new imports support the TextRank implementation and the constant definitions are clear.
Lines 41-57: The constructor parameters are well designed; the new extract_method and textrank_kwargs parameters support TextRank, and the parameter passing is correct.
Lines 73-85: The method dispatch is clear; extraction is routed to the right implementation, and keyword handling and context management stay consistent.
Lines 87-96: The LLM extraction method remains backward compatible, preserving the original logic with complete timing and logging.
Lines 98-114: The TextRank extraction method is complete, with proper exception handling, timing consistent with the LLM method, and a sensible error-handling strategy.
Lines 116-141: The response-parsing method is stable; its logic matches the original implementation, the formatting adjustments are reasonable, and stopword handling is correct.
Lines 144-162: The TextRank class constructor is well designed: sensible defaults, POS filtering for both Chinese and English, a mask_words None check, and appropriate lazy loading.
Lines 164-177: Stopword loading is improved, with file-existence checks and error handling, a lazy-loading pattern to avoid repeated reads, and good logging.
Lines 264-285: Graph construction is correct: the co-occurrence logic is sound, the insufficient-window edge case is handled, self-loops are avoided, and the sliding-window approach is appropriate.
Lines 287-294: PageRank is implemented correctly, with an empty-graph check, standard PageRank parameters, and proper weight handling.
Lines 296-319: The main extraction method is complete, orchestrating the full keyword-extraction pipeline with empty-result handling at each step and correct top-k selection.
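The flow these comments approve (preprocess → build co-occurrence graph → PageRank → top-k) can be sketched end to end. The sketch below is self-contained: it hand-rolls the power iteration instead of calling networkx as the PR does, and takes a pre-tokenized word list in place of MultiLingualTextRank's real preprocessing:

```python
import itertools
from collections import defaultdict

def build_cooccurrence(tokens, window=3):
    """Sliding-window co-occurrence weights for an undirected graph (no self-loops)."""
    weights = defaultdict(float)
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + window]:
            if a != b:
                weights[frozenset((a, b))] += 1.0
    return weights

def pagerank(weights, alpha=0.85, iters=50):
    """Plain power-iteration PageRank on the weighted co-occurrence graph."""
    nodes = set(itertools.chain.from_iterable(weights))
    if not nodes:
        return {}
    neighbors = defaultdict(dict)
    for pair, w in weights.items():
        a, b = tuple(pair)
        neighbors[a][b] = w
        neighbors[b][a] = w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - alpha) / len(nodes)
            + alpha * sum(rank[m] * w / sum(neighbors[m].values())
                          for m, w in neighbors[n].items())
            for n in nodes
        }
    return rank

def textrank_keywords(tokens, window=3, top_k=3):
    scores = pagerank(build_cooccurrence(tokens, window))
    return [kw for kw, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]]

print(textrank_keywords(["graph", "database", "keyword", "graph", "rank", "keyword", "graph"]))
```

Words that co-occur with many neighbors accumulate rank, so the repeated, well-connected token ends up first.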
@codecov-ai-reviewer review
Actionable comments posted: 0
🧹 Nitpick comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
Lines 78-84: Optional: set-based deduplication makes the output order unstable; stabilizing it helps UI/test reproducibility. Converting to a set and back to a list yields nondeterministic order, which can affect log comparison, cache hits, and the UI experience. If order preservation is not required, a stable sort alone guarantees consistency.
Suggested minimal change (keeps deduplication, provides a stable order):

```diff
-keywords = {k.replace("'", "") for k in keywords}
-context["keywords"] = list(keywords)[:self._max_keywords]
+normalized = [k.replace("'", "") for k in keywords if k]
+# not order-preserving, but stable: sort case-insensitively to avoid set nondeterminism
+context["keywords"] = sorted(set(normalized), key=str.lower)[:self._max_keywords]
```
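The stabilized ordering suggested here can be exercised on its own; this sketch mirrors the suggestion outside the class:

```python
def stable_keywords(raw, max_keywords=3):
    # strip quotes, drop empties, then sort case-insensitively for a deterministic order
    normalized = [k.replace("'", "") for k in raw if k]
    return sorted(set(normalized), key=str.lower)[:max_keywords]

print(stable_keywords({"Zebra", "apple'", "", "Mango"}))  # ['apple', 'Mango', 'Zebra']
```

Calling it repeatedly on the same set always yields the same list, which a bare `list(set(...))` does not guarantee.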
Lines 97-114: Consider re-raising MemoryError rather than swallowing it, to avoid masking serious resource problems (optional). Silently catching MemoryError can hide OOM conditions or abnormally large inputs; treat it as fatal and re-raise:

```diff
 except MemoryError as e:
     log.critical("TextRank memory error (text too large?): %s", e)
-    keywords = []
+    raise
```
Lines 153-179: Regex safety: when matching start_token, use re.escape so special characters cannot break the pattern's meaning. A comment already notes this point, but the code does not actually escape. The current "KEYWORDS:" token is safe, but defensive escaping keeps future start_token changes safe:

```diff
-matches = re.findall(rf'{start_token}[^\n]+\n?', response)
+matches = re.findall(rf'{re.escape(start_token)}[^\n]+\n?', response)
```
80-81
: Optional: avoid logging the raw user query at info level, to reduce PII exposure. Downgrade to debug, or truncate/redact the query.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s\nKeywords: %s", self._query, context["keywords"])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely chose to simplify the UI settings, removing the dynamic window_size adjustment and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.959Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely applied a dedup strategy to ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is raised at runtime, TextRank still produces only the initial number of candidates. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T14:45:20.742Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is raised at runtime, TextRank still produces only the initial number of candidates. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.316Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.316Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.316Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.316Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.742Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.742Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely chose to simplify the UI settings, removing the dynamic window_size adjustment and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.959Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.959Z
Learning: In the hugegraph-llm TextRank implementation, user Gfreely applied a dedup strategy to ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.186Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.186Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1): stopwords (36-53)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (56-240), extract_keywords (218-240)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
34-49
: LGTM: the constructor and TextRank integration align with the PR goal.
- The new extract_method/mask_words parameters are reasonable, and the "Hybrid" default fits this feature.
- MultiLingualTextRank is initialized with keyword_num and mask_words, matching the documented implementation.
61-64
: The language-code normalization fix is in place ('en'/'zh' → 'english'/'chinese'). This change ensures NLTKHelper().stopwords(lang) matches the expected language keys so stopword filtering does not silently fail, consistent with what we previously learned about NLTKHelper's constraints.
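The normalization described here can be captured in a tiny helper; the alias table below is illustrative and can be extended as needed:

```python
# Map common short codes to the full names NLTK's stopwords corpus expects.
# The alias table is an example; add entries for other languages as needed.
LANG_ALIASES = {"en": "english", "eng": "english", "zh": "chinese", "cn": "chinese"}

def normalize_lang(lang: str) -> str:
    """Return an NLTK-compatible language name for a possibly short code."""
    code = (lang or "").strip().lower()
    return LANG_ALIASES.get(code, code)
```

Codes that are already full names (or unknown) pass through unchanged, so callers can apply the helper unconditionally.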
45-49
: Confirmed: TextRank's top_k does not grow when max_keywords increases in run() (by design). Per earlier learnings, top_k is only passed at initialization; even if max_keywords is raised at runtime, TextRank still returns at most the initial count. This is an intentional trade-off. If runtime resizing should take effect later, sync self._textrank_model.top_k inside run().
I can submit a minimal sync patch (without introducing window support) if needed -- want me to?
Also applies to: 65-66
86-96
: The LLM keyword extraction is concise, and the timing logs are sensible.
- The prompt formatting and call chain are clear; the timing stats and debug logs help locate bottlenecks.
115-151
: The Hybrid logic fixes are in place: case alignment and the generator misuse are gone.
- Using tr_lower/used_tr_lower for case-consistent matching and dedup is correct.
- Taking the intersection first and then appending the remaining LLM and TextRank candidates is clear, and the earlier extend/append-on-generator bug is resolved.
Actionable comments posted: 3
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
101-114
: Exception handling: consider escalating MemoryError to fatal and optionally re-raising. MemoryError usually signals process-level resource exhaustion; swallowing it and continuing can leave the system in an unpredictable state. At minimum, mark it as unrecoverable in the log, or re-raise so upper layers can handle it by policy.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-97), stopwords (36-53)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (30-190), extract_keywords (173-190)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
63-66
: Language-code normalization is handled well. Mapping "en"/"zh" to "english"/"chinese" prevents runtime failures, and silently ineffective stopword filtering, caused by mismatched language codes in NLTKHelper.stopwords().
69-81
: Branch dispatch is clear, and the TextRank sorting logic is correct. Converting the ranks dict into a score-descending keyword sequence on the TextRank path is correct, and raising ValueError on the error branch is explicit and traceable.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-97), stopwords (36-53)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (30-189), extract_keywords (172-189)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (9)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (9)
20-22
: Import statements are well organized and their purpose is clear. The new jieba.posseg import is used for Chinese segmentation and works with the new multilingual TextRank feature, as expected.
28-28
: The TextRank module import is correct. Importing the MultiLingualTextRank class from the textrank_word_extract module matches the PR goal of supporting TextRank keyword extraction.
35-50
: The constructor parameter changes are reasonable. The language parameter is removed and extract_method and mask_words are added, in line with the PR goal. Defaults are sensible: extract_method defaults to "Hybrid" mode and mask_words defaults to an empty string.
63-66
: The language mapping is implemented correctly. Following earlier feedback, "en"/"zh" are correctly mapped to "english"/"chinese", keeping compatibility with NLTKHelper's language-code requirements.
69-88
: The keyword-extraction dispatch logic is clear. It routes between three extraction modes:
- LLM mode: calls _extract_with_llm()
- TextRank mode: calls _extract_with_textrank() and sorts by score
- Hybrid mode: calls _extract_with_hybrid()
The logic is clean and error handling is appropriate. The final keyword processing (quote stripping, truncation, logging) is implemented correctly.
90-99
: The LLM extraction method is implemented correctly. The original keyword-extraction logic is wrapped in a dedicated method with full timing and logging. The return type is now List[str], consistent with the new design.
101-114
: The TextRank extraction method has solid exception handling:
- TypeError, ValueError: parameter errors
- MemoryError: memory errors (for large texts)
Timing and logging are thorough.
116-163
: The Hybrid fusion algorithm is well implemented. It combines the strengths of LLM and TextRank:
- LLM keywords get a source score (1.0 for the intersection, 0.8 for LLM-only)
- TextRank-only keywords get a lower source score (0.5)
- The TextRank score and a word-length score are blended for the final evaluation
The algorithm logic is sound and implements the intended priority scheme.
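The priority scheme summarized above (1.0 for keywords in both sources, 0.8 for LLM-only, 0.5 for TextRank-only, blended with the TextRank score) can be sketched as follows. The function name and the exact blending are assumptions based on this summary, not the project's implementation, which additionally mixes in a word-length score:

```python
def fuse_keywords(llm_keywords, textrank_scores, top_k=10):
    """Blend LLM and TextRank candidates using source-priority scores.

    llm_keywords: ordered list of keywords from the LLM.
    textrank_scores: {keyword: normalized TextRank score}.
    """
    tr_lower = {k.lower(): s for k, s in textrank_scores.items()}
    fused = {}
    for kw in llm_keywords:
        # 1.0 when both sources agree, 0.8 when only the LLM proposed it.
        source = 1.0 if kw.lower() in tr_lower else 0.8
        fused[kw] = source + tr_lower.get(kw.lower(), 0.0)
    used = {k.lower() for k in fused}
    for kw, s in textrank_scores.items():
        if kw.lower() not in used:
            # TextRank-only candidates get the lowest source score (0.5).
            fused[kw] = 0.5 + s
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [kw for kw, _ in ranked[:top_k]]
```

Matching is done case-insensitively, mirroring the tr_lower handling the review mentions, so "Graph" from the LLM and "graph" from TextRank count as the same keyword.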
165-170
: The method signature update is correct. _extract_keywords_from_response now returns List[str] instead of Set[str], consistent with the new order-preserving design.
Actionable comments posted: 1
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
60-91
: check_nltk_data has two problems: the second verification pass can throw, and the "all four required" AND semantics is too strict; fall back to OR semantics over punkt/punkt_tab and the _eng/generic tagger.
- Lines 87-89 call nltk.data.find(...) directly for re-verification without catching LookupError, so a missing resource throws again.
- All four packages are currently treated as required, while only one of each pair is actually needed: tokenizers need punkt or punkt_tab; taggers need averaged_perceptron_tagger_eng or averaged_perceptron_tagger.
- The download logic should try the primary name first and fall back only on failure, avoiding redundant downloads and failures caused by version differences.
Suggested minimal refactor of the method body, fixing the exception and adopting OR semantics:
 def check_nltk_data(self):
-    cache_dir = self.get_cache_dir()
-    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
-    nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
-
-    if nltk_data_dir not in nltk.data.path:
-        nltk.data.path.append(nltk_data_dir)
-
-    required_packages = {
-        'punkt': False,
-        'punkt_tab': False,
-        'averaged_perceptron_tagger': False,
-        "averaged_perceptron_tagger_eng": False}
-    for package in required_packages:
-        try:
-            if package in ['punkt', 'punkt_tab']:
-                nltk.data.find(f'tokenizers/{package}')
-            else:
-                nltk.data.find(f'taggers/{package}')
-            required_packages[package] = True
-        except LookupError:
-            nltk.download(package, download_dir=nltk_data_dir)
-
-    check_flag = all(required_packages.values())
-    if not check_flag:
-        for package in required_packages:
-            if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                required_packages[package] = True
-
-    check_flag = all(required_packages.values())
-    return check_flag
+    cache_dir = self.get_cache_dir()
+    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
+    local_nltk = os.path.join(resource_path, "nltk_data")
+    if local_nltk not in nltk.data.path:
+        nltk.data.path.append(local_nltk)
+    if nltk_data_dir not in nltk.data.path:
+        nltk.data.path.append(nltk_data_dir)
+
+    def _ensure_any(kind: str, primary: str, fallback: str) -> bool:
+        # kind: "tokenizers" | "taggers"
+        try:
+            nltk.data.find(f"{kind}/{primary}")
+            return True
+        except LookupError:
+            try:
+                nltk.data.find(f"{kind}/{fallback}")
+                return True
+            except LookupError:
+                # Prefer downloading the primary name; fall back on failure
+                if not nltk.download(primary, download_dir=nltk_data_dir):
+                    return bool(nltk.download(fallback, download_dir=nltk_data_dir))
+                return True
+
+    tokenizer_ok = _ensure_any("tokenizers", "punkt", "punkt_tab")
+    tagger_ok = _ensure_any("taggers", "averaged_perceptron_tagger_eng", "averaged_perceptron_tagger")
+    return tokenizer_ok and tagger_ok
If needed, I can add a minimal unit test, or cache the download directory in CI to avoid repeated downloads.
🧹 Nitpick comments (8)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
36-59
: stopwords() should support en/zh aliases and avoid appending the path repeatedly.
- Callers are likely to pass "en"/"zh"; only "english"/"chinese" are accepted today, so normalizing language codes reduces misuse.
- Every call unconditionally appends resource_path/nltk_data, producing duplicate entries; check before appending.
A minimal in-scope change:
 def stopwords(self, lang: str = "chinese") -> List[str]:
     """Get stopwords."""
-    nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
-    if self._stopwords.get(lang) is None:
+    local_nltk = os.path.join(resource_path, "nltk_data")
+    if local_nltk not in nltk.data.path:
+        nltk.data.path.append(local_nltk)
+
+    # Accept common language-code aliases
+    lang = ({'en': 'english', 'eng': 'english', 'zh': 'chinese', 'cn': 'chinese'}
+            .get((lang or '').lower(), (lang or '').lower()))
+
+    if self._stopwords.get(lang) is None:
         cache_dir = self.get_cache_dir()
         nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
         # update nltk path for nltk so that it finds the data
         if nltk_data_dir not in nltk.data.path:
             nltk.data.path.append(nltk_data_dir)
         try:
             nltk.data.find("corpora/stopwords")
         except LookupError:
             nltk.download("stopwords", download_dir=nltk_data_dir)
-        self._stopwords[lang] = stopwords.words(lang)
+        try:
+            self._stopwords[lang] = stopwords.words(lang)
+        except (OSError, LookupError):
+            # Fall back to an empty list so callers don't crash
+            self._stopwords[lang] = []
 @@
-    final_stopwords = self._stopwords[lang]
+    final_stopwords = self._stopwords.get(lang)
def stopwords(self, lang: str = "chinese") -> List[str]: """Get stopwords.""" - nltk.data.path.append(os.path.join(resource_path, "nltk_data")) - if self._stopwords.get(lang) is None: + local_nltk = os.path.join(resource_path, "nltk_data") + if local_nltk not in nltk.data.path: + nltk.data.path.append(local_nltk) + + # 兼容常见语言码别名 + lang = ({'en': 'english', 'eng': 'english', 'zh': 'chinese', 'cn': 'chinese'} + .get((lang or '').lower(), (lang or '').lower())) + + if self._stopwords.get(lang) is None: cache_dir = self.get_cache_dir() nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir) # update nltk path for nltk so that it finds the data if nltk_data_dir not in nltk.data.path: nltk.data.path.append(nltk_data_dir) try: nltk.data.find("corpora/stopwords") except LookupError: nltk.download("stopwords", download_dir=nltk_data_dir) - self._stopwords[lang] = stopwords.words(lang) + try: + self._stopwords[lang] = stopwords.words(lang) + except (OSError, LookupError): + # 兜底为空集合,避免调用方崩溃 + self._stopwords[lang] = [] @@ - final_stopwords = self._stopwords[lang] + final_stopwords = self._stopwords.get(lang)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (7)
31-36
: The top_k parameter is unused, so extract_keywords() currently returns the full scored vocabulary. If the caller does not slice again, this conflicts with the UI's expected keyword count. Either truncate to self.top_k before returning, or drop the parameter to avoid confusion.
Add truncation at the return site (see the patch in the 183-191 region):
 @@
-    ranks = self._rank_nodes()
-    return ranks
+    ranks = self._rank_nodes()
+    # Truncate to top_k (only when it is configured as a positive integer)
+    if isinstance(self.top_k, int) and self.top_k > 0 and ranks:
+        ranks = dict(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[: self.top_k])
+    return ranks
126-131
: The Chinese-character regex is recompiled on every loop iteration, and its character range should be unified to \u9fff. re.compile('[\u4e00-\u9fa5]') is compiled per iteration and is inconsistent with the \u4e00-\u9fff range used by the retention pattern above. Precompile it and unify the range.
 class MultiLingualTextRank:
     def __init__(self, keyword_num: int = 5, window_size: int = 2, mask_words: str = ""):
         @@
         self.max_len = 100
+        self._ch_char_re = re.compile(r'[\u4e00-\u9fff]')
 @@
-    if re.compile('[\u4e00-\u9fa5]').search(word):
+    if self._ch_char_re.search(word):
         ch_tokens.append(word)
133-141
: Only the first occurrence of each Chinese token is re-segmented; later occurrences of the same token remain unsegmented. The code dedups with set() and then replaces at words.index(ch_token), i.e. only the first hit. If every occurrence should be segmented, rebuild the list in a single pass so all positions are covered; if this is a deliberate performance/quality trade-off, document the decision in a nearby comment.
A one-pass rebuild (minimal change, covers all occurrences):
-    ch_tokens = list(set(ch_tokens))
-    for ch_token in ch_tokens:
-        idx = words.index(ch_token)
-        ch_words = []
-        jieba_tokens = pseg.cut(ch_token)
-        for word, flag in jieba_tokens:
-            if len(word) >= 1 and flag in self.pos_filter['chinese'] and word not in ch_stop_words:
-                ch_words.append(word)
-        words = words[:idx] + ch_words + words[idx+1:]
+    ch_tokens = set(ch_tokens)
+    new_words = []
+    for tok in words:
+        if tok in ch_tokens:
+            buf = []
+            for w, flag in pseg.cut(tok):
+                if len(w) >= 1 and flag in self.pos_filter['chinese'] and w not in ch_stop_words:
+                    buf.append(w)
+            new_words.extend(buf or [tok])
+        else:
+            new_words.append(tok)
+    words = new_words
145-161
: Deduplicating graph vertices with set() gives a nondeterministic order; use order-preserving dedup for reproducibility. list(set(words)) makes the vertex order unpredictable; correctness is unaffected, but debugging and result reproduction suffer. Use dict.fromkeys.
-    unique_words = list(set(words))
+    unique_words = list(dict.fromkeys(words))
     name_to_idx = {word: idx for idx, word in enumerate(unique_words)}
167-170
: PageRank normalization can avoid repeated computation and be more robust against extreme small values. max(...) is currently recomputed inside the list comprehension; take the maximum once and add a tiny floor to avoid potential division by zero and the O(n^2) scan.
-    pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
-    pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
+    pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
+    m = max(pagerank_scores) or 1e-12
+    pagerank_scores = [s / m for s in pagerank_scores]
172-177
: Returning an empty dict when NLTK resources are unavailable deserves a log line and a degradation hint. When check_nltk_data() returns False, returning {} directly makes the problem hard to locate upstream. Log a warning here, and optionally degrade to a simple English-only tokenization fallback.
I can add a minimal degradation branch, or emit a one-time hint in the call chain (e.g. KeywordExtract) to help operators locate the issue.
87-90
: End-to-end verification suggestions. To reduce regressions, add three kinds of test cases:
- placeholder substitution and restoration are correct when mask_words mixes literals and regexes (including overlapping matches);
- segmentation behaves as intended (per your design trade-off) for mixed Chinese/English text containing repeated Chinese tokens;
- readiness detection for punkt/punkt_tab and the *_eng/generic tagger across environments.
I can draft a pytest test skeleton if needed.
Also applies to: 94-105, 172-191
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: 在hugegraph-llm的TextRank实现中,用户Gfreely选择简化UI设置,移除了窗口大小(window_size)的动态调整功能,只保留top_k(最大关键词数)的配置。
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: 在hugegraph-llm的TextRank实现中,用户Gfreely使用了ch_tokens去重策略(list(set(ch_tokens)))来避免重复中文token的处理,这既解决了words.index()可能的ValueError问题,又提高了批量替换的效率。
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-21T12:34:10.370Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.370Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3)
- NLTKHelper (30-118)
- stopwords (36-58)
- check_nltk_data (60-91)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
- extract_keywords (68-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
94-105: Confirm placeholder stability under NLTK tokenization
NLTK is not installed in the current sandbox, so the automated verification script could not run. Install NLTK locally and run the test below to verify that __shieldword_\d+__ survives nltk.word_tokenize as a single token, so the later placeholder_map lookup does not mismatch:
#!/bin/bash
python - <<'PY'
import re
from nltk import word_tokenize
text = "Keep __shieldword_1__ and __shieldword_2__ intact."
# The Chinese character class is included to cover full-text splitting as well
pattern = r'__shieldword_\d+__|\b\w+\b|[\u4e00-\u9fff]+'
tokens = re.findall(pattern, text)
ptb = word_tokenize(' '.join(tokens))
print("Tokens after NLTK:", ptb)
print("Placeholders intact:", all(t.startswith("__shieldword_") and t.endswith("__") for t in ptb if "__shieldword_" in t))
PY
Location
- File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
- Lines: 94-105
Suggestions
- Run the verification above locally to ensure placeholders are not split a second time.
- Add unit tests covering the common placeholder cases so that placeholder_map keeps mapping reliably.
if mask_patterns:
    special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
    text = special_regex.sub(_create_placeholder, text)
User-defined regex substitution lacks a timeout; ReDoS risk
Pattern length is limited and patterns are validated at pre-compile time, but the combined pattern can still hang during the substitution phase. Add a timeout to sub().
- if mask_patterns:
- special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
- text = special_regex.sub(_create_placeholder, text)
+ if mask_patterns:
+ special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
+ # 1s timeout to prevent malicious/pathological regexes from blocking
+ text = special_regex.sub(_create_placeholder, text, timeout=1.0)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if mask_patterns:
    special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
    text = special_regex.sub(_create_placeholder, text)
if mask_patterns:
    special_regex = regex.compile('|'.join(mask_patterns), regex.V1)
    # 1s timeout to prevent malicious/pathological regexes from blocking
    text = special_regex.sub(_create_placeholder, text, timeout=1.0)
🤖 Prompt for AI Agents
In
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
around lines 87 to 90, the call to special_regex.sub(_create_placeholder, text)
lacks a timeout and can hang (ReDoS); add a timeout argument to sub (e.g.,
timeout=0.1 or a configurable small value) and wrap the call in a try/except
catching regex.TimeoutError to handle timeouts gracefully (log/warn and skip or
fall back without performing the replacement). Ensure the timeout value is
sensible and documented/configurable and that the exception path leaves text in
a safe state.
@coderabbitai generate sequence diagram
Sequence Diagram for PR
sequenceDiagram
autonumber
participant U as User
participant UI as Gradio UI
participant API as RAG API
participant Cfg as PromptConfig
participant Pipe as RAGPipeline
participant KE as KeywordExtract
participant TR as MultiLingualTextRank
participant LLM as LLM
participant NLTK as NLTKHelper
U->>UI: Set text, extract_method, language(en/zh), window_size, max_keywords, mask_words
UI->>API: POST /rag {text, extract_method, language, window_size, max_keywords, mask_words}
API->>Cfg: Get prompt.maskword_input_text
API->>API: mask_words := req.mask_words or Cfg.maskword_input_text
API->>Pipe: rag_answer(..., keywords_extract_method, mask_words, language, window_size, max_keywords)
Pipe->>KE: run(text, method, mask_words, language, window_size, max_keywords)
alt method == "TextRank"
KE->>TR: extract(text, language, window_size, mask_words)
TR->>NLTK: check_nltk_data(language)
NLTK-->>TR: ensure stopwords available
TR->>TR: tokenize + POS filter + mask + co-occur graph + PageRank
TR-->>KE: keywords
else method == "LLM"
KE->>LLM: prompt(text)
LLM-->>KE: keywords
else method == "Hybrid"
KE->>LLM: prompt(text)
LLM-->>KE: kw_llm
KE->>TR: extract(text, language, window_size, mask_words)
TR-->>KE: kw_tr
KE->>KE: merge/prioritize/dedup to max_keywords
end
KE-->>Pipe: keywords (dedup/trim)
Pipe-->>API: answer + keywords
API-->>UI: response payload
UI-->>U: Render answer and extracted keywords
✅ Actions performed
Sequence diagram generated.
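The hybrid branch in the diagram merges the two keyword→score maps using hybrid_llm_weights. A standalone sketch of such a weighted fusion (fuse_scores and its behavior are illustrative, not the PR's actual implementation):

```python
def fuse_scores(llm_scores, textrank_scores, llm_weight=0.5):
    """Weighted fusion of keyword->score maps from the two extractors.

    llm_weight must lie in [0, 1]; keywords found by only one method
    contribute a 0.0 score from the other side.
    """
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError(f"hybrid_llm_weights out of range: {llm_weight}")
    keywords = set(llm_scores) | set(textrank_scores)
    fused = {
        kw: llm_weight * llm_scores.get(kw, 0.0)
        + (1.0 - llm_weight) * textrank_scores.get(kw, 0.0)
        for kw in keywords
    }
    # Descending by fused score, matching the hybrid mode described above
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))


print(fuse_scores({"graph": 1.0, "rank": 0.5}, {"rank": 1.0}, llm_weight=0.5))
```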
Actionable comments posted: 0
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (2)
106-106: Show/enable this input only when TextRank is selected and language=zh, to reduce UI cognitive load
This input mainly supplies mask phrases for Chinese segmentation. Make it conditional: visible and editable only when extract_method == "TextRank" and language == "zh"; hide or disable it otherwise (LLM, English, or non-Chinese Hybrid), and state the expected format in the placeholder (e.g. a comma-separated list of phrases). This reduces user confusion and avoids writing useless config in invalid scenarios.
Also applies to: 156-156
106-106: Naming consistency: maskword_input_text ↔ textrank_mask_words ↔ mask_words
The UI variable is currently textbox_textrank_mask_words, the prompt field is maskword_input_text, and the API/pipeline semantics are mask_words. Unify on the "mask_words" root; backward compatibility can be kept at first, and the shared root helps recognition and search long term. If you accept the change, this file can reference prompt.mask_words_input_text (or prompt.textrank_mask_words), with the field and YAML key renamed in sync at the config layer. Minimal change in this file (requires the matching config-layer change):
- prompt.maskword_input_text,
+ prompt.mask_words_input_text,
Also sync in the following places (examples, not this file):
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py / prompt_config.py: field name and default value
- migration of the corresponding YAML/JSON config key, with backward-compatible reads
- construction of the RAG-side parameters that depend on this field (make sure the key finally sent to the server is still mask_words)
Also applies to: 135-135, 156-156
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (4)
106-107: TextRank mask-word input wired correctly (destructuring matches the return values)
create_rag_block() now returns 5 items; the new textbox_textrank_mask_words is destructured in the right position and has a matching entry in ui.load(outputs=...), so the wiring has no side effects.
156-156: New Gradio output is positioned correctly (index aligned with the returned tuple)
textbox_textrank_mask_words is inserted as the 13th output, matching the 13th return value of refresh_ui_config_prompt() one-to-one and avoiding misalignment on load.
19-19: Blank-line change only
Whitespace-only change; no further action needed.
135-135: Awaiting the return statement of create_rag_block to confirm the number and order of its outputs; the review comment will be updated based on the result.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
52-71: Pass TextRank/Hybrid keyword-extraction parameters through the RAG API
- The RAGRequest model lacks the extract_method, language, window_size, max_keywords, and mask_words fields, so the API cannot receive this user configuration
- rag_api.py does not forward these fields when calling rag_answer_func
# hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
class RAGRequest(BaseModel):
+    extract_method: Optional[Literal["llm", "textrank", "hybrid"]] = Query(None, description="Keyword extraction method")
+    language: Optional[str] = Query(None, description="Keyword extraction language")
+    window_size: Optional[int] = Query(None, description="TextRank window size")
+    max_keywords: Optional[int] = Query(None, description="Maximum number of keywords")
+    mask_words: Optional[List[str]] = Query(None, description="List of words to exclude")
# hugegraph-llm/src/hugegraph_llm/api/rag_api.py
result = rag_answer_func(
    text=req.query,
    …
    topk_per_keyword=req.topk_per_keyword,
+    extract_method=req.extract_method,
+    language=req.language,
+    window_size=req.window_size,
+    max_keywords=req.max_keywords,
+    mask_words=req.mask_words,
    custom_related_information=req.custom_priority_info,
♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
103-113: Only the first occurrence of a duplicated Chinese token is re-segmented; other positions are missed. Rebuild the list in one pass.
The dedup strategy was discussed earlier and accepted, but for robustness and consistency a single traversal that rebuilds the list is still recommended, so every occurrence is covered.
- if len(ch_tokens) > 0:
-     ch_tokens = list(set(ch_tokens))
-     for ch_token in ch_tokens:
-         idx = words.index(ch_token)
-         ch_words = []
-         jieba_tokens = pseg.cut(ch_token)
-         for word, flag in jieba_tokens:
-             if len(word) >= 1 and flag in self.pos_filter['chinese'] and word not in ch_stop_words:
-                 ch_words.append(word)
-         words = words[:idx] + ch_words + words[idx+1:]
+ if len(ch_tokens) > 0:
+     new_words = []
+     for tok in words:
+         if regex.search(r'[\u4e00-\u9fff]', tok):
+             ch_words = []
+             for w, flag in pseg.cut(tok):
+                 if len(w) >= 1 and flag in self.pos_filter['chinese'] and w not in ch_stop_words:
+                     ch_words.append(w)
+             new_words.extend(ch_words or [tok])
+         else:
+             new_words.append(tok)
+     words = new_words
58-61: Regex substitution lacks timeout protection; ReDoS risk.
- text = special_regex.sub(_create_placeholder, text)
+ try:
+     text = special_regex.sub(_create_placeholder, text, timeout=1.0)
+ except regex.TimeoutError:
+     # Timeout fallback: skip masking and return the original text
+     return text, {}
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
87-96: LLM exception handling (already discussed)
This was discussed earlier: generate already handles robustness at the lower layer, so no extra catch is needed here.
98-112: TextRank exception handling and the policy for severe resource errors
Additionally handle FileNotFoundError (missing resources) and re-raise MemoryError directly.
- try:
+ try:
      ranks = self._textrank_model.extract_keywords(self._query)
- except (TypeError, ValueError) as e:
+ except FileNotFoundError as e:
+     log.error("TextRank resource file not found: %s", e)
+ except (TypeError, ValueError) as e:
      log.error("TextRank parameter error: %s", e)
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
🧹 Nitpick comments (6)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
117-121: Building the vertex list with set() yields non-deterministic order, hurting reproducibility.
- unique_words = list(set(words))
+ unique_words = list(dict.fromkeys(words))
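A minimal illustration of the difference: dict.fromkeys deduplicates while preserving first-seen order, so vertex indices stay stable across runs, whereas set() iteration order can vary between processes:

```python
words = ["graph", "rank", "graph", "text", "rank"]

# Order-preserving dedup: stable vertex indices across runs
unique_words = list(dict.fromkeys(words))
print(unique_words)  # ['graph', 'rank', 'text']
```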
30-35: Unused fields (top_k, max_len).
Neither field currently takes part in any computation; delete them or wire them into output truncation / length limiting to avoid confusion.
Do you plan to truncate results by top_k inside this class? If the upstream KeywordExtract truncates uniformly, add a comment here saying so.
hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1)
33-35: Hybrid weight lacks range validation.
Constrain it to [0, 1]; fall back to the default or raise a configuration error when out of range.
  hybrid_llm_weights: Optional[float] = 0.5
+ def __init__(self, **data):
+     super().__init__(**data)
+     if self.hybrid_llm_weights is not None:
+         w = self.hybrid_llm_weights
+         if not (0.0 <= w <= 1.0):
+             raise ValueError(f"hybrid_llm_weights out of range: {w}")
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
68-86: The docstring still documents a removed parameter (max_keywords).
Remove it, or describe the config-driven behavior instead (llm_settings.keyword_extract_type / window_size, etc.).
  """
  Add a keyword extraction operator to the pipeline.
- :param text: Text to extract keywords from.
- :param max_keywords: Maximum number of keywords to extract.
- :param extract_template: Template for keyword extraction.
+ :param text: Text to extract keywords from.
+ :param extract_template: Template for keyword extraction.
  :return: Self-instance for chaining.
  """
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
60-63: Language-code mapping is incomplete (variants like en-US/zh-CN are not covered)
Handle region suffixes so that zh-CN/en-US inputs are still mapped.
-lang = context.get("language", self._language).lower()
-if lang in ("en", "zh"):
-    lang = "english" if lang == "en" else "chinese"
+self._language = (context.get("language", self._language) or "").lower().strip()
+if self._language.startswith(("en", "zh")):
+    self._language = "english" if self._language.startswith("en") else "chinese"
-    self._language = lang
166-166: Regex safety: apply re.escape to start_token
Improves robustness and avoids unexpected matches from special characters.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
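A quick demonstration of the escaped pattern in isolation (the KEYWORDS: marker and the response text are made up for illustration):

```python
import re

response = "Some preamble\nKEYWORDS: graph(0.9), rank(0.7)\n"
start_token = "KEYWORDS:"

# Escaping keeps the marker literal even if it ever contains
# metacharacters such as '(', '*' or '?'
matches = re.findall(rf"{re.escape(start_token)}([^\n]+\n?)", response)
print(matches)  # [' graph(0.9), rank(0.7)\n']
```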
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
- hugegraph-llm/config.md (1 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/config/llm_config.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/config.md
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords is increased at runtime TextRank still only generates the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens; this both resolves the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (3)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3)
- NLTKHelper (30-118)
- stopwords (36-58)
- check_nltk_data (60-91)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
- BaseLLM (22-74)
- generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
- LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
- MultiLingualTextRank (29-161)
- extract_keywords (143-161)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
hugegraph-llm/src/hugegraph_llm/utils/decorators.py (3)
- log_operator_time (64-78)
- log_time (38-61)
- record_rpm (81-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
32-32: Import-order adjustment only; no functional change. LGTM.
hugegraph-llm/config.md (1)
29-40: Change the documented default to 3 to keep it consistent with the code's windows_size
-| `WINDOWS_SIZE` | Optional[Integer] | 2 | TextRank sliding window size |
+| `WINDOWS_SIZE` | Optional[Integer] | 3 | TextRank sliding window size |
Likely an incorrect or invalid review comment.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
19-22: Import reordering and additions have no functional impact. LGTM.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
41-47: Confirmed the windows_size field already exists — it is defined as windows_size in the llm_settings config, consistent with the code reference; no change needed.
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
windows_size: Optional[int] = 3
hybrid_llm_weights: Optional[float] = 0.5
💡 Verification agent
🧩 Analysis chain
Field naming is inconsistent with the docs/PR parameters, and the default value conflicts with the docs.
- The code uses windows_size (plural); the PR/API/common convention is window_size (singular).
- The documented default is 2; the code uses 3.
Unify on the singular form, consistent with the docs:
- windows_size: Optional[int] = 3
+ window_size: Optional[int] = 2
Then replace all references globally (including the .env mapping and UI):
🏁 Script executed:
#!/bin/bash
rg -nP '\bwindows_size\b|WINDOWS_SIZE\b'
Length of output: 422
Unify naming and defaults: rename windows_size to window_size and set the default to 2
- Change hugegraph-llm/src/hugegraph_llm/config/llm_config.py:34
- Change hugegraph-llm/config.md:38
- Change hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:47
--- a/hugegraph-llm/src/hugegraph_llm/config/llm_config.py
- windows_size: Optional[int] = 3
+ window_size: Optional[int] = 2
And in keyword_extract.py change
- window_size=llm_settings.windows_size
+ window_size=llm_settings.window_size
and globally replace windows_size/WINDOWS_SIZE → window_size in the docs, the .env mapping, and the UI config.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
windows_size: Optional[int] = 3
hybrid_llm_weights: Optional[float] = 0.5
keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
window_size: Optional[int] = 2
hybrid_llm_weights: Optional[float] = 0.5
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/config/llm_config.py around lines 33-35
change the field name windows_size to window_size and set its default to 2
(i.e., rename the attribute and update the default value), then update
hugegraph-llm/config.md line ~38 to reflect the new parameter name and default,
and update hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
line ~47 to use window_size instead of windows_size; finally perform a global
replacement of windows_size / WINDOWS_SIZE → window_size in all .env, UI config,
and documentation files to keep names consistent.
1. Extract, evaluate, and rank keywords from the text:
- Minimum 0, maximum MAX_KEYWORDS keywords.
- Keywords should be complete semantic words or phrases, ensuring information completeness.
- Keywords should be complete semantic words or phrases, ensuring information completeness, without any changes to the English capitalization.
- Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text.
- The final list of keywords must be sorted in descending order based on their importance score.
2. Identify keywords that need rewriting:
🛠️ Refactor suggestion
English rule conflict: "phrases" are allowed yet spaces are forbidden, which is semantically inconsistent
"Complete semantic words or phrases" contradicts "keywords must not contain spaces". Allow spaces, but forbid separators that would break parsing (commas, colons).
- - Keywords should be complete semantic words or phrases, ensuring information completeness, without any changes to the English capitalization.
- - Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text.
+ - Keywords should be complete semantic words or phrases (spaces are allowed), ensuring information completeness, without any changes to the English capitalization.
+ - Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text.
+ - Do not include commas or colons in keywords (they are used as separators).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
1. Extract, evaluate, and rank keywords from the text: | |
- Minimum 0, maximum MAX_KEYWORDS keywords. | |
- Keywords should be complete semantic words or phrases, ensuring information completeness. | |
- Keywords should be complete semantic words or phrases, ensuring information completeness, without any changes to the English capitalization. | |
- Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text. | |
- The final list of keywords must be sorted in descending order based on their importance score. | |
2. Identify keywords that need rewriting: | |
1. Extract, evaluate, and rank keywords from the text: | |
- Minimum 0, maximum MAX_KEYWORDS keywords. | |
- Keywords should be complete semantic words or phrases (spaces are allowed), ensuring information completeness, without any changes to the English capitalization. | |
- Assign an importance score to each keyword, as a float between 0.0 and 1.0. A higher score indicates a greater contribution to the core idea of the text. | |
- Do not include commas or colons in keywords (they are used as separators). | |
- The final list of keywords must be sorted in descending order based on their importance score. | |
2. Identify keywords that need rewriting: |
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/config/prompt_config.py around lines 161 to
166, the prompt text currently conflicts by allowing "phrases" but then
forbidding spaces; change the rule to permit spaces inside keywords (so
multi-word phrases are allowed) while explicitly disallowing delimiter
characters that would break parsing (e.g., commas, colons, semicolons, pipes).
Update the human-readable rule text accordingly and adjust any associated
validation logic to accept spaces but reject those delimiter characters (and
trim surrounding whitespace) so keywords can be multi-word phrases without
parsing ambiguity.
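The rule described above, allowing multi-word phrases while rejecting delimiter characters, can be sketched as a small validator. This is a hypothetical helper for illustration, not code from the PR:

```python
def is_valid_keyword(keyword: str) -> bool:
    """Accept multi-word phrases; reject characters used as output separators."""
    keyword = keyword.strip()
    if not keyword:
        return False
    # Commas/colons/semicolons/pipes would break the KEYWORDS: line format.
    forbidden = set(",:;|")
    return not any(ch in forbidden for ch in keyword)
```

A validator like this would run after parsing, so multi-word phrases such as "graph database" survive while ambiguous entries are dropped.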
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (outdated, resolved)
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))
🛠️ Refactor suggestion
PageRank normalization lacks a zero-value guard; in extreme cases this can divide by zero.
- pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
+ max_pr = max(pagerank_scores) if pagerank_scores else 0.0
+ pagerank_scores = [s / max_pr if max_pr > 0 else 0.0 for s in pagerank_scores]
📝 Committable suggestion
Before:
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
pagerank_scores = [scores/max(pagerank_scores) for scores in pagerank_scores]
node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))
After:
pagerank_scores = self.graph.pagerank(directed=False, damping=0.85, weights='weight')
# Protect against zero or empty scores when normalizing
max_pr = max(pagerank_scores) if pagerank_scores else 0.0
pagerank_scores = [s / max_pr if max_pr > 0 else 0.0 for s in pagerank_scores]
node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))
🤖 Prompt for AI Agents
In
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
around lines 138 to 141, the PageRank normalization divides by
max(pagerank_scores) without guarding against a zero max which may cause a
divide-by-zero; compute max_score = max(pagerank_scores) (or 0 if list empty),
if max_score == 0 then set normalized scores to a list of zeros (or skip
division) otherwise divide each score by max_score, then zip node_names with the
normalized scores and return the dict.
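The guard suggested above can be exercised without igraph, since the normalization itself is plain list arithmetic. A minimal sketch, assuming `scores` is the raw PageRank score list:

```python
def normalize_scores(scores):
    """Divide by the max score, guarding against empty input and an all-zero max."""
    max_pr = max(scores) if scores else 0.0
    return [s / max_pr if max_pr > 0 else 0.0 for s in scores]
```

An empty graph or an all-zero score vector now yields zeros instead of raising ZeroDivisionError.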
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (outdated, resolved)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (outdated, resolved)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (outdated, resolved)
Actionable comments posted: 2
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
135-152: Hybrid weights only the intersection, which can return fewer items than max_keywords (already discussed in an earlier comment). When union > max and intersection < max, this path does not backfill, so the result count may fall short of max_keywords. If the current design keeps this intentionally, ignore.
To fix it, weight and sort over the union, then truncate.
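The "weight and sort over the union, then truncate" strategy could look like the sketch below. Names and the merge formula are illustrative assumptions, not the PR's actual code; a missing keyword is treated as score 0 on the side that did not produce it:

```python
def merge_hybrid(llm_scores, tr_scores, llm_weight=0.5, max_keywords=5):
    """Blend LLM and TextRank scores over the union of keywords, keep the top-N."""
    union = set(llm_scores) | set(tr_scores)
    merged = {
        k: llm_weight * llm_scores.get(k, 0.0)
           + (1 - llm_weight) * tr_scores.get(k, 0.0)
        for k in union
    }
    ranked = sorted(merged, key=merged.get, reverse=True)
    return {k: merged[k] for k in ranked[:max_keywords]}
```

Because the union is ranked before truncation, the result always contains min(len(union), max_keywords) items, avoiding the shortfall described above.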
🧹 Nitpick comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
129-130: Downgrade the intermediate keyword-set logs from INFO to DEBUG to avoid log noise and potential leakage of sensitive information.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
87-87: Refine the return-type annotations for readability and static checking. These methods return a word-to-score mapping and should be annotated Dict[str, float].
- def _extract_with_llm(self) -> Dict:
+ def _extract_with_llm(self) -> Dict[str, float]:
@@
- def _extract_with_textrank(self) -> Dict:
+ def _extract_with_textrank(self) -> Dict[str, float]:
@@
- def _extract_with_hybrid(self) -> Dict:
+ def _extract_with_hybrid(self) -> Dict[str, float]:
@@
- ) -> Dict:
+ ) -> Dict[str, float]:
Also applies to: 98-98, 113-113, 156-161
166-166: Escape start_token inside the regex to avoid accidental matches; this is more robust and lets the pattern be reused with other start tokens.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
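Why re.escape matters: if start_token ever contains regex metacharacters, the raw f-string pattern silently changes meaning. A minimal sketch of the hardened parse; `KEYWORDS:` matches the prompt format used in this PR, while the helper name is illustrative:

```python
import re

def parse_keyword_lines(response: str, start_token: str = "KEYWORDS:"):
    """Return the raw payload after each start_token occurrence, metachar-safe."""
    pattern = rf'{re.escape(start_token)}([^\n]+)'
    return [m.strip() for m in re.findall(pattern, response)]
```

With re.escape, even a token like "K(1):" is matched literally instead of being parsed as a capture group.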
104-111: For MemoryError, prefer re-raising instead of "continuing to run" under extreme memory pressure. The current code only logs a critical message and continues with an empty result, which can mask a resource problem.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
117-121: Clamp hybrid_llm_weights at the boundaries; it is simpler and more direct, and prevents an illegal weight from being used downstream.
- llm_weights = llm_settings.hybrid_llm_weights
- ranks = {}
- if llm_weights < 0 or llm_weights > 1:
-     llm_weights = 0.5
+ llm_weights = min(1.0, max(0.0, float(llm_settings.hybrid_llm_weights)))
+ ranks = {}
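The clamp expression above behaves as follows. This sketch also folds in a fallback for unparseable input, an assumption beyond the one-line suggestion:

```python
def clamp_weight(value, default=0.5):
    """Coerce to float and clamp into [0.0, 1.0]; fall back to default on bad input."""
    try:
        w = float(value)
    except (TypeError, ValueError):
        return default
    return min(1.0, max(0.0, w))
```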
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of window_size and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization; if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.902Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, rather than each method doing its own preprocessing. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely uses a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves batch-replacement efficiency.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, the valid sub-words after processing are appended back to the same sub_tokens list, and finally they are added to results in one pass. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (29-162), extract_keywords (144-162)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
168-185: The LLM parsing logic is concise and robust, and covers common malformed input: empty items are skipped, score parsing is guarded, and letter casing is controlled, as expected.
45-47: No change needed: the config name windows_size matches the code reference.
The config file llm_config.py defines windows_size at line 34, and the code reads it correctly via llm_settings.windows_size, so there is no spelling mismatch. Likely an incorrect or invalid review comment.
64-65: TextRank's mask_words is not wired through, so the UI/API parameter has no effect. run() never reads context["mask_words"], and MultiLingualTextRank will always use the default empty list.
A minimal change is to sync mask_words right after reading max_keywords:
  self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mask_words = context.get("mask_words", None)
+ if mask_words is not None:
+     try:
+         self._textrank_model.mask_words = list(filter(None, (mask_words or "").split(",")))
+     except Exception as e:
+         log.warning("Invalid mask_words: %s", e)
⛔ Skipped due to learnings
Learnt from: Gfreely PR: hugegraph/hugegraph-ai#32 File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41 Timestamp: 2025-08-18T14:45:20.756Z Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization; if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely PR: hugegraph/hugegraph-ai#32 File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41 Timestamp: 2025-08-18T14:45:20.756Z Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of window_size and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely PR: hugegraph/hugegraph-ai#32 File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0 Timestamp: 2025-08-18T14:42:31.998Z Learning: In hugegraph-llm's TextRank implementation, user Gfreely uses a ch_tokens dedup strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves batch-replacement efficiency.
self._extract_method = llm_settings.keyword_extract_type
self._textrank_model = MultiLingualTextRank(
    keyword_num=max_keywords,
    window_size=llm_settings.windows_size)
Normalize the case of extract_method and support overriding it from context; otherwise front-end parameters will not take effect.
The method is currently read only from the configuration and is not lower()-ed, which mismatches the branch checks; nor does it accept an override value from context.
Suggested patch:
- self._extract_method = llm_settings.keyword_extract_type
+ self._extract_method = (llm_settings.keyword_extract_type or "hybrid").lower()
@@
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
# extract keywords with the LLM
- ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ ranks = self._extract_with_llm()
+ elif method == "textrank":
# extract keywords with TextRank
- ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ ranks = self._extract_with_textrank()
+ elif method == "hybrid":
# extract keywords with the hybrid method
- ranks = self._extract_with_hybrid()
+ ranks = self._extract_with_hybrid()
else:
- raise ValueError(f"Invalid extract_method: {self._extract_method}")
+ raise ValueError(f"Invalid extract_method: {method}")
Also applies to: 66-76
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 44-47 (and similarly for lines 66-76), the extract_method is set only from
configuration and not normalized or overridable by context; update the logic to
first check for a context-provided extract_method and fall back to
llm_settings.keyword_extract_type, then normalize the method string (e.g., call
lower()) before any branch comparisons so case differences don't break matching;
ensure any downstream branches use the same normalized value.
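The normalization described in the prompt can be condensed into a small helper. This is a sketch following the review's suggested names, not code from the repository:

```python
def resolve_extract_method(context, configured,
                           valid=("llm", "textrank", "hybrid")):
    """Prefer a context override, fall back to config, normalize case."""
    method = (context.get("extract_method") or configured or "hybrid").lower()
    if method not in valid:
        raise ValueError(f"Invalid extract_method: {method}")
    return method
```

Branching on the returned value keeps all case handling in one place, so "LLM", "llm", and a context override all reach the same branch.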
keywords = [] if not ranks else sorted(ranks, key=ranks.get, reverse=True)
keywords = [k.replace("'", "") for k in keywords]
context["keywords"] = keywords[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
🛠️ Refactor suggestion
Avoid printing the user query in INFO logs to reduce privacy risk.
Downgrade the query and keyword output to DEBUG, or mask them.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
📝 Committable suggestion
Before:
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
After:
log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
line 81, the code currently logs the full user query and extracted keywords at
INFO level, which raises privacy concerns; change the log level to DEBUG
(log.debug) so queries are not logged in INFO, and optionally sanitize the query
before logging (e.g., mask or truncate the query, or log only its length/first N
chars) to avoid storing full user input in logs; ensure the log message
preserves context by mentioning it's masked/truncated when sanitizing.
Actionable comments posted: 2
♻️ Duplicate comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
137-154: Hybrid may return fewer keywords than expected when the union is too large and the intersection too small (confirm the requirement). The current branch returns only the weighted intersection when union > max and intersection != max, so the count can fall below max_keywords. Given the project preference (each method returns a Dict; run() sorts and truncates uniformly), is this intentional, or should the result be topped up to the limit? Please confirm the product expectation.
(We respect the "interface consistency" preference; if this is confirmed as expected behavior, ignore this item. The constraints recorded in Learnings were taken into account.)
45-47: A note on window_size coming only from configuration (docs/consistency). It is injected here only from llm_settings.window_size, with no runtime override via context. If that conflicts with the PR goal of letting the UI pass window_size, state in the docs that "window size does not support dynamic adjustment" (per the existing trade-off) to avoid misleading front-end/API consumers.
98-112: Handle MemoryError carefully: re-raise it rather than continuing in an unstable state. Out-of-memory is a fatal error; logging it and returning an empty result may mask the problem.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise
66-76: Support overriding extract_method from context, and unify its case. Currently only the config value is read, so an extract_method passed by the front end or upstream has no effect, and case normalization is not done once up front. Normalize the method in run() before branching:
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
      # extract keywords with the LLM
      ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ elif method == "textrank":
      # extract keywords with TextRank
      ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ elif method == "hybrid":
      # extract keywords with the hybrid method
      ranks = self._extract_with_hybrid()
  else:
-     raise ValueError(f"Invalid extract_method: {self._extract_method}")
+     raise ValueError(f"Invalid extract_method: {method}")
78-81: Deduplicate while preserving order, and reduce sensitive information in logs.
- The code currently only replaces quotes without deduplicating, so duplicate keywords can appear.
- Printing the full user query at INFO level is a privacy risk; downgrading to DEBUG is safer.
- keywords = [k.replace("'", "") for k in keywords]
- context["keywords"] = keywords[:self._max_keywords]
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ seen, ordered = set(), []
+ for k in keywords:
+     kk = k.replace("'", "")
+     if kk and kk not in seen:
+         seen.add(kk)
+         ordered.append(kk)
+ context["keywords"] = ordered[:self._max_keywords]
+ log.debug("User Query(len=%d): %.200s | Keywords: %s",
+           len(self._query or ""), (self._query or "")[:200], context["keywords"])
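The order-preserving dedup suggested above can be isolated into a helper. A sketch with an illustrative name, assuming the input list is already sorted by rank:

```python
def clean_keywords(keywords, max_keywords):
    """Strip quotes, drop empties and duplicates, preserve ranking order."""
    seen, ordered = set(), []
    for k in keywords:
        kk = k.replace("'", "")
        if kk and kk not in seen:
            seen.add(kk)
            ordered.append(kk)
    return ordered[:max_keywords]
```

Because a set only tracks membership while the list keeps insertion order, the highest-ranked occurrence of each keyword survives.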
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
119-123: More robust parsing and normalization of hybrid_llm_weights. The config value may be a string (e.g. "0.6"); the current float-type check would fall back to 0.5 in that case.
- if isinstance(llm_settings.hybrid_llm_weights, float):
-     llm_weights = min(1.0, max(0.0, float(llm_settings.hybrid_llm_weights)))
- else:
-     llm_weights = 0.5
+ try:
+     llm_weights = float(llm_settings.hybrid_llm_weights)
+ except Exception:
+     llm_weights = 0.5
+ llm_weights = min(1.0, max(0.0, llm_weights))
168-169: Use re.escape on start_token and enable multiline matching for more robust parsing. This prevents a start_token containing special characters from breaking the regex, and allows KEYWORDS: to appear at the start of a line.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ pattern = re.compile(rf'{re.escape(start_token)}([^\n]+)', flags=re.MULTILINE)
+ matches = pattern.findall(response)
41-41: The _language field is currently unused within this class (clean it up or wire it through). Neither the initialization nor the mapping is consumed by later logic (TextRank does not take a lang; the LLM parsing does not use it either). Remove the dead field or make its purpose explicit to avoid confusion.
- self._language = "english"
  ...
- lang = context.get("language", self._language).lower()
- if lang in ("en", "zh"):
-     lang = "english" if lang == "en" else "chinese"
- self._language = lang
+ # If language handling is needed downstream, confirm and wire it through here; otherwise remove it.
(Per the Learnings constraints: extraction methods return a Dict and run() handles post-processing uniformly; we avoid changing that convention.)
Also applies to: 60-63
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- hugegraph-llm/config.md (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
- hugegraph-llm/src/hugegraph_llm/config/llm_config.py
- hugegraph-llm/config.md
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (30-163), extract_keywords (145-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
self._max_keywords = context.get("max_keywords", self._max_keywords)
🛠️ Refactor suggestion
Validate the type and bounds of max_keywords to avoid slicing errors
The context may supply a string or a non-positive number; using it directly for slicing is risky.
- self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mk = context.get("max_keywords", self._max_keywords)
+ try:
+ mk = int(mk)
+ except (TypeError, ValueError):
+ mk = self._max_keywords
+ self._max_keywords = max(1, mk)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Validate and sanitize max_keywords from context to avoid slicing errors
mk = context.get("max_keywords", self._max_keywords)
try:
    mk = int(mk)
except (TypeError, ValueError):
    mk = self._max_keywords
# Ensure at least 1 keyword is requested
self._max_keywords = max(1, mk)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 64-65, the assignment self._max_keywords = context.get("max_keywords",
self._max_keywords) can receive non-integer or non-positive values which will
cause slicing errors; validate and normalize this value by: read raw =
context.get("max_keywords", self._max_keywords), if raw is None keep existing;
if it's not an int try to parse int(raw) inside a try/except and on failure
fallback to the default; ensure the resulting integer is >= 0 (or >=1 if you
require at least one keyword) and clamp it to a sensible upper bound if needed;
then assign the validated integer to self._max_keywords so downstream slicing is
safe.
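The validation described in the prompt above can be sketched as a small helper (the function name `sanitize_max_keywords` is hypothetical; the actual fix inlines this logic in `run()`):

```python
def sanitize_max_keywords(raw, default=5):
    """Coerce a context-supplied value into a safe positive int for slicing."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # None, dicts, or non-numeric strings fall back to the default
        value = default
    # Clamp to at least 1 so slicing always requests some keywords
    return max(1, value)
```

With this, `keywords[:sanitize_max_keywords(context.get("max_keywords"))]` never raises, whatever the UI sends.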
log.info("LLM extract Keywords: %s", lr_set)
log.info("TextRank extract Keywords: %s", tr_set)
🛠️ Refactor suggestion
Downgrade the keyword-set log level to avoid leaking user-input characteristics at INFO level
DEBUG is recommended.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
📝 Committable suggestion
log.debug("LLM extract Keywords: %s", lr_set)
log.debug("TextRank extract Keywords: %s", tr_set)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 131 to 133, the logging of the extracted keyword sets currently uses INFO
which may leak user input features; change those log.info calls to log.debug so
sensitive keyword data is only emitted at DEBUG level (keep the same message
text and formatting), and run a quick search to ensure no other user-feature
logs remain at INFO in this file.
Actionable comments posted: 0
♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
44-47: Pass the frontend extract_method through and provide a robust default
The value is currently read only from config, with no context override; if llm_settings.keyword_extract_type is None, an AttributeError is raised. Normalize the value, allow a context override, and default to "hybrid".
- self._extract_method = llm_settings.keyword_extract_type.lower()
+ self._extract_method = (llm_settings.keyword_extract_type or "hybrid").lower()
@@
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
      # Extract keywords with the LLM
      ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ elif method == "textrank":
      # Extract keywords with TextRank
      ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ elif method == "hybrid":
      # Extract keywords with the hybrid method
      ranks = self._extract_with_hybrid()
  else:
-     raise ValueError(f"Invalid extract_method: {self._extract_method}")
+     raise ValueError(f"Invalid extract_method: {method}")
Also applies to: 66-76

64-65: Validate the type and bounds of max_keywords
context may supply a string or a non-positive number; slicing with it directly raises errors or behaves unexpectedly.
- self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mk = context.get("max_keywords", self._max_keywords)
+ try:
+     mk = int(mk)
+ except (TypeError, ValueError):
+     mk = self._max_keywords
+ self._max_keywords = max(1, mk)

81-81: Logging the user query is a privacy risk; downgrade to DEBUG or redact
Avoid recording raw user input and keywords at INFO level.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])

131-133: Lower the keyword-set log level to reduce information exposure
The set contents can carry sensitive user semantics; use DEBUG.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
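For context, the hybrid mode that produces lr_set and tr_set combines two keyword→score maps; a sketch of such a weighted merge (the 0.6 weight and the helper name are assumptions, not the PR's actual constants):

```python
def merge_hybrid_scores(llm_scores, tr_scores, llm_weight=0.6):
    """Weighted union of LLM and TextRank keyword->score maps."""
    merged = {}
    for kw in set(llm_scores) | set(tr_scores):
        merged[kw] = (llm_weight * llm_scores.get(kw, 0.0)
                      + (1.0 - llm_weight) * tr_scores.get(kw, 0.0))
    # Highest combined score first, matching run()'s final sort-and-slice step
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))
```

Keeping both inputs as Dict[str, float] matches the interface-consistency preference recorded in the learnings above.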
🧹 Nitpick comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (5)
78-80: Do not strip single quotes indiscriminately; only strip paired surrounding quotes so legitimate forms like O'Reilly survive
The current replace breaks English proper nouns and contractions.
- keywords = [k.replace("'", "") for k in keywords]
+ keywords = [
+     (k[1:-1] if len(k) >= 2 and k[0] == k[-1] and k[0] in ("'", '"') else k)
+     for k in keywords
+ ]

87-89: Remove the redundant f-string wrapper
The outer f"" is redundant here.
- prompt_run = f"{self._extract_template.format(question=self._query, max_keywords=self._max_keywords)}"
+ prompt_run = self._extract_template.format(question=self._query, max_keywords=self._max_keywords)

104-108: Do not swallow MemoryError
Out-of-memory is usually unrecoverable; log it and re-raise so the outer layer can apply circuit breaking or load shedding.
- except MemoryError as e:
-     log.critical("TextRank memory error (text too large?): %s", e)
+ except MemoryError as e:
+     log.critical("TextRank memory error (text too large?): %s", e)
+     raise

172-174: Regex-escape the start token before matching
Avoid mis-matches or regex errors when start_token contains special characters.
- matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+ token = re.escape(start_token) if start_token else start_token
+ matches = re.findall(rf'{token}([^\n]+\n?)', response)

128-130: Minor optimization: build the sets directly from the dict views
Equivalent but more concise.
- lr_set = set(k for k in llm_scores)
- tr_set = set(k for k in tr_scores)
+ lr_set = set(llm_scores)
+ tr_set = set(tr_scores)
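The regex-escaping nitpick can be demonstrated with a token that contains metacharacters (the helper name is for illustration only):

```python
import re

def find_after_token(response, start_token):
    # Escape the token so metacharacters such as '(' or '?' match literally
    token = re.escape(start_token)
    return re.findall(rf"{token}([^\n]+)", response)
```

With `start_token="KEYWORDS:"` both versions behave the same, but a token like `"K(1):"` silently matches nothing without the escape, because `(1)` is treated as a capture group.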
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, the valid sub-words are appended back into the same sub_tokens list, and the list is finally added to results as a whole. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM (22-74)
generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank (30-163)
extract_keywords (145-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
60-63: Language-code normalization mapping is in place
The mapping for "en"/"zh" is reasonable and aligns with the "english"/"chinese" values used downstream.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
57-66: Fix extract_word ignoring the language parameter
- The method signature declares language, but the implementation always uses llm_settings.language, so the interface and the implementation disagree.
- Prefer the caller-supplied language and fall back to llm_settings.language when it is absent, for example:
- self._operators.append(WordExtract(text=text, language=llm_settings.language))
+ lang = language or llm_settings.language
+ self._operators.append(WordExtract(text=text, language=lang))
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
65-75: extract_method cannot be overridden at runtime and its casing is inconsistent (contrary to UI expectations)
Per the PR goals, the UI/API can pass extract_method. Allow a context override and branch only after lower(); this matches earlier review suggestions.
- if self._extract_method == "llm":
+ method = (context.get("extract_method", self._extract_method) or "hybrid").lower()
+ if method == "llm":
      # Extract keywords with the LLM
      ranks = self._extract_with_llm()
- elif self._extract_method == "textrank":
+ elif method == "textrank":
      # Extract keywords with TextRank
      ranks = self._extract_with_textrank()
- elif self._extract_method == "hybrid":
+ elif method == "hybrid":
      # Extract keywords with the hybrid method
      ranks = self._extract_with_hybrid()
  else:
-     raise ValueError(f"Invalid extract_method: {self._extract_method}")
+     raise ValueError(f"Invalid extract_method: {method}")
Also applies to: 66-76

63-63: max_keywords lacks type and bounds validation
context["max_keywords"] coming from the UI may be a string or a non-positive number; slicing with it directly is risky.
- self._max_keywords = context.get("max_keywords", self._max_keywords)
+ mk = context.get("max_keywords", self._max_keywords)
+ try:
+     mk = int(mk)
+ except (TypeError, ValueError):
+     mk = self._max_keywords
+ self._max_keywords = max(1, mk)
🧹 Nitpick comments (4)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
58-59: Reduce log sensitivity and noise
Keywords only need to be inspected while debugging; use DEBUG to avoid emitting user-related information at INFO level.
- log.info("KEYWORDS: %s", context['keywords'])
+ log.debug("KEYWORDS: %s", context['keywords'])
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
68-86: Decoupling from UI/config: let extract_keywords support a runtime language override
Currently only llm_settings.language is passed in, so an upstream caller (UI/API) that wants to switch languages temporarily via context cannot make it take effect. KeywordExtract.run() could read context["language"] and normalize it (an implementation sketch was already given in a comment on another file).
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
77-81: Avoid recording the user query and keywords at INFO level
To reduce privacy risk and log noise, downgrade to DEBUG.
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
118-122: Downgrade sensitive-content logs to DEBUG
Hybrid mode logs both the LLM and TextRank keyword sets; lower the log level.
- log.info("LLM extract Keywords: %s", lr_set)
- log.info("TextRank extract Keywords: %s", tr_set)
+ log.debug("LLM extract Keywords: %s", lr_set)
+ log.debug("TextRank extract Keywords: %s", tr_set)
Also applies to: 130-131
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1 hunks)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (3 hunks)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to keep a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each extraction method pre-processing its own results. This is their preferred architecture.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, the valid sub-words are appended back into the same sub_tokens list, and the list is finally added to results as a whole. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (2)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (3)
hugegraph-llm/src/hugegraph_llm/utils/decorators.py (3)
log_operator_time (64-78)
log_time (38-61)
record_rpm (81-106)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
WordExtract (29-81)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
KeywordExtract (31-191)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM (22-74)
generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank (30-163)
extract_keywords (145-163)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
extract_keywords (68-87)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
86-95: LLM output parsing is more robust now, and the logic is clear
The timing in _extract_with_llm and the robust parsing in _extract_keywords_from_response (separator validation, exception handling) are well implemented and work as expected.
Also applies to: 169-191
if self._language in ("EN", "CN"):
    self._language = "english" if self._language == "EN" else "chinese"
🛠️ Refactor suggestion
The language-normalization condition can never be true, so the stopword language may be wrong
__init__ already applies language.lower(), yet this code checks the uppercase "EN"/"CN", so the branch never triggers. Normalize consistently and accept common abbreviations (en/zh/cn).
- if self._language in ("EN", "CN"):
-     self._language = "english" if self._language == "EN" else "chinese"
+ lang = (self._language or "").strip().lower()
+ if lang in ("en", "zh", "cn"):
+     self._language = "english" if lang == "en" else "chinese"
+ else:
+     self._language = lang
📝 Committable suggestion
lang = (self._language or "").strip().lower()
if lang in ("en", "zh", "cn"):
    self._language = "english" if lang == "en" else "chinese"
else:
    self._language = lang
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py around
lines 51-52, the language normalization checks for uppercase "EN"/"CN" but
__init__ already lowercases language, so the condition never triggers; update
the condition to check the normalized lowercase forms (e.g., 'en','zh','cn') and
map them to the expected stopword labels ('english' for en, 'chinese' for zh or
cn). Ensure the code handles both 'zh' and 'cn' as Chinese and uses
self._language = 'english' or 'chinese' accordingly.
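A minimal sketch of the normalization the reviews keep asking for (the mapping table is an assumption covering the codes mentioned above):

```python
def normalize_language(raw, default="english"):
    """Map assorted language codes to NLTK's expected 'english'/'chinese'."""
    lang = (raw or "").strip().lower()
    if lang in ("en", "english"):
        return "english"
    if lang in ("zh", "zh-cn", "zh_cn", "cn", "chinese"):
        return "chinese"
    return default
```

Because the result is always one of NLTK's expected names, downstream calls like `nltk.corpus.stopwords.words(lang)` no longer depend on the caller's casing.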
self._language = language
self._max_keywords = max_keywords
self._extract_template = extract_template or KEYWORDS_EXTRACT_TPL
self._extract_method = llm_settings.keyword_extract_type.lower()
self._textrank_model = MultiLingualTextRank(
🛠️ Refactor suggestion
Language casing/abbreviations are not unified, which may affect downstream stopword and tokenization strategies
self._language is not normalized and is later checked only against "EN"/"CN". Normalize it at the start of run(), accept en/zh/cn, and keep the value as english/chinese.
- self._language = language
+ self._language = (language or "").strip()
Committable suggestion skipped: line range outside the PR's diff.
if self._language in ("EN", "CN"):
    self._language = "english" if self._language == "EN" else "chinese"
self._max_keywords = context.get("max_keywords", self._max_keywords)
🛠️ Refactor suggestion
The language-normalization condition is wrong (same as above)
Checking only "EN"/"CN" misses the common en/zh/cn. Suggested fix:
- if self._language in ("EN", "CN"):
-     self._language = "english" if self._language == "EN" else "chinese"
+ lang = (self._language or "").strip().lower()
+ if lang in ("en", "zh", "cn"):
+     self._language = "english" if lang == "en" else "chinese"
+ else:
+     self._language = lang
📝 Committable suggestion
# Normalize language codes (handle upper/lower case and common aliases)
lang = (self._language or "").strip().lower()
if lang in ("en", "zh", "cn"):
    self._language = "english" if lang == "en" else "chinese"
else:
    self._language = lang
self._max_keywords = context.get("max_keywords", self._max_keywords)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 61 to 63, the language normalization only checks for "EN"/"CN" and misses
common variants like "en", "zh", "cn" etc.; change the logic to normalize the
input (e.g., language = (self._language or "").strip().lower()) and map common
codes to target names (map "en" -> "english"; treat "zh", "zh-cn", "zh_cn", "cn"
-> "chinese"), then set self._language based on that mapping before using it
further.
Actionable comments posted: 0
♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
76-79
: 在 INFO 级别记录原始查询与关键词,存在隐私合规风险;建议降级到 DEBUG(重复建议)- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"]) + log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
129-130
: 在 INFO 级别输出关键词集合(重复建议)关键词集合同样属于用户输入特征,建议降级为 DEBUG。
- log.info("LLM extract Keywords: %s", lr_set) - log.info("TextRank extract Keywords: %s", tr_set) + log.debug("LLM extract Keywords: %s", lr_set) + log.debug("TextRank extract Keywords: %s", tr_set)
64-74
: extract_method 未支持 context 覆盖,前端选择不会生效分支判断仅依赖构造时从配置读取的值,忽略了 API/UI 传入的 extract_method,违背了本 PR “UI/API 可选 llm/textrank/hybrid”的目标。
建议在 run 中使用 context 覆盖并统一小写:
- if self._extract_method == "llm": + method = (context.get("extract_method", self._extract_method) or "hybrid").strip().lower() + if method == "llm": # 使用 LLM 提取关键词 ranks = self._extract_with_llm() - elif self._extract_method == "textrank": + elif method == "textrank": # 使用 TextRank 提取关键词 ranks = self._extract_with_textrank() - elif self._extract_method == "hybrid": + elif method == "hybrid": # 使用 混合方法 提取关键词 ranks = self._extract_with_hybrid() else: - raise ValueError(f"Invalid extract_method: {self._extract_method}") + raise ValueError(f"Invalid extract_method: {method}")
60-63: Language normalization and context overrides are missing, so language/max_keywords passed from the UI are ignored or misclassified. Currently only "cn" maps to "chinese" and everything else falls back to "english"; context["language"] is neither applied nor normalized, and max_keywords gets no type/bounds checking. Given the PR goal (UI/API accept language and max_keywords), this should tolerate inputs like "en/zh/cn/zh-cn" and handle them robustly.

Suggested minimal patch (normalize language + context override + validate max_keywords):

```diff
-        # 未传入值或者其他值,默认使用英文
-        self._language = "chinese" if self._language == "cn" else "english"
-        self._max_keywords = context.get("max_keywords", self._max_keywords)
+        # 语言归一化与上下文覆盖(支持 en/zh/cn/zh-cn 等)
+        lang_raw = (context.get("language", self._language) or "").strip().lower()
+        if lang_raw in ("en", "english"):
+            self._language = "english"
+        elif lang_raw in ("zh", "zh-cn", "zh_cn", "cn", "chinese"):
+            self._language = "chinese"
+        else:
+            self._language = "english"
+
+        # max_keywords 类型与边界校验
+        mk = context.get("max_keywords", self._max_keywords)
+        try:
+            mk = int(mk)
+        except (TypeError, ValueError):
+            mk = self._max_keywords
+        self._max_keywords = max(1, mk)
```
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
170-172: The regex must escape start_token, or tokens containing special characters will mis-match. A code comment already notes that re.escape(start_token) is needed, but the implementation never applies it; escape it for robustness.
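The failure mode is easy to reproduce; "KEYWORDS(v2):" is a made-up token used only to show what happens when the token contains regex metacharacters:

```python
import re

# With a benign token, escaped and unescaped patterns agree:
token, resp = "KEYWORDS:", "KEYWORDS:graph, rank\n"
assert re.findall(rf'{re.escape(token)}([^\n]+\n?)', resp) == ["graph, rank\n"]

# A token containing metacharacters silently breaks the unescaped version:
token, resp = "KEYWORDS(v2):", "KEYWORDS(v2):graph\n"
assert re.findall(rf'{re.escape(token)}([^\n]+\n?)', resp) == ["graph\n"]
# Unescaped, '(v2)' becomes a capture group matching the literal text 'v2',
# so the pattern actually searches for 'KEYWORDSv2:' and finds nothing:
assert re.findall(rf'{token}([^\n]+\n?)', resp) == []
```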
```diff
-        matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+        matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
```
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
57-66: The language parameter of extract_word is unused, leaving the interface semantics unclear. The parameter is currently ignored, which can mislead callers; either remove it or document it as deprecated to avoid false expectations.

Optional minimal change (drop the parameter and the related doc line):

```diff
-    def extract_word(self, text: Optional[str] = None, language: str = "english"):
+    def extract_word(self, text: Optional[str] = None):
@@
-        :param language: Language of the text.
```

If UI-driven overrides should remain supported, consider passing language through the context of run(**kwargs) and letting WordExtract.run prefer the context value.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
(3 hunks)hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
(3 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to share a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], with the main run() method handling sorting and truncation instead of pre-processing inside each method. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) setting.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing repeated Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to share a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], with the main run() method handling sorting and truncation instead of pre-processing inside each method. This is their architectural preference.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) setting.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing repeated Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: each loop iteration reassigns sub_tokens to the current token's segmentation result, the processed valid sub-words are appended back into the same sub_tokens list, and everything is finally added to results. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (2)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
WordExtract
(30-81)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
KeywordExtract
(31-190)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM
(22-74)generate
(26-31)hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs
(25-98)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank
(30-163)extract_keywords
(145-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (3.10)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
96-109: Please confirm the MemoryError handling strategy. Merely logging a MemoryError and returning an empty result may mask a serious resource problem; should this exception fail fast (or be reported) so it can drive alerting/throttling? Decide based on the runtime environment and SLOs.
I can offer two patch variants: 1) keep the current behavior but raise an alert tag and trigger degradation; 2) re-raise MemoryError and let an upstream circuit breaker handle it uniformly. Want a side-by-side patch?
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
80-85: Not forwarding UI parameters here is fine, but confirm the context path is wired through. This method no longer explicitly passes max_keywords/language/extract_method/window_size/mask_words. If the UI injects them via the API as kwargs into pipeline.run(**kwargs), make sure KeywordExtract.run reads these keys from context (especially extract_method and language). I provided a patch for KeywordExtract.run in another comment.
Want an end-to-end verification script that checks whether the keys make it from the front end to the operator (using rg/ast-grep to trace context assignments and usages)?
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
239-243: Potential None argument causing logic problems: max_graph_items may be None. kwargs.get('max_graph_items') defaults to None when absent, which overrides query_graphdb()'s default and hands GraphRAGQuery a None (prone to type/logic errors). Fall back to the configured default here.

```diff
-        self.extract_keywords().query_graphdb(
-            max_graph_items=kwargs.get('max_graph_items')
-        ).synthesize_answer()
+        self.extract_keywords().query_graphdb(
+            max_graph_items=kwargs.get('max_graph_items', huge_settings.max_graph_items)
+        ).synthesize_answer()
```
🧹 Nitpick comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4)
51-55: Minor naming typo: _text2gqlt_llm → _text2gql_llm. Keep naming consistent for easier search and maintenance.

```diff
-        self._text2gqlt_llm = llm or LLMs().get_text2gql_llm()
+        self._text2gql_llm = llm or LLMs().get_text2gql_llm()
```
68-85: Confirm request-level overrides: how do extract_method/language/window_size/mask_words take effect within a single request? This method only passes text/extract_template; KeywordExtract currently reads only max_keywords from context and takes the rest from llm_settings. If the UI/API should override these per request, please confirm:
- whether llm_settings is updated before entering pipeline.run (with thread/request isolation);
- or whether KeywordExtract reads extract_method/language/window_size/mask_words from context.
I can follow up with the context-reading logic for KeywordExtract (without changing this file's interface).
117-126: Avoid binding dynamic config (prompt/max_graph_items) in default parameters at definition time. Python evaluates default parameters when the function is defined: if prompt.gremlin_generate_prompt and huge_settings.max_graph_items can change at runtime (language/config switches), the current code freezes them to their import-time values. Falling back inside the function body is safer.

```diff
 def query_graphdb(
     self,
     max_deep: int = 2,
-    max_graph_items: int = huge_settings.max_graph_items,
+    max_graph_items: Optional[int] = None,
     max_v_prop_len: int = 2048,
     max_e_prop_len: int = 256,
     prop_to_match: Optional[str] = None,
     num_gremlin_generate_example: Optional[int] = -1,
-    gremlin_prompt: Optional[str] = prompt.gremlin_generate_prompt,
+    gremlin_prompt: Optional[str] = None,
 ):
@@
     GraphRAGQuery(
         max_deep=max_deep,
-        max_graph_items=max_graph_items,
+        max_graph_items=max_graph_items or huge_settings.max_graph_items,
         max_v_prop_len=max_v_prop_len,
         max_e_prop_len=max_e_prop_len,
         prop_to_match=prop_to_match,
         num_gremlin_generate_example=num_gremlin_generate_example,
-        gremlin_prompt=gremlin_prompt,
+        gremlin_prompt=gremlin_prompt or prompt.gremlin_generate_prompt,
     )
```

Also applies to: 139-149
91-97: Likewise, avoid binding huge_settings.* directly as default parameters in the other methods. If these settings can change at runtime (UI-adjustable), default them to None and fall back inside the function body so config changes take effect immediately.

```diff
 def keywords_to_vid(
     self,
     by: Literal["query", "keywords"] = "keywords",
-    topk_per_keyword: int = huge_settings.topk_per_keyword,
+    topk_per_keyword: Optional[int] = None,
     topk_per_query: int = 10,
-    vector_dis_threshold: float = huge_settings.vector_dis_threshold,
+    vector_dis_threshold: Optional[float] = None,
 ):
@@
     SemanticIdQuery(
         embedding=self._embedding,
         by=by,
-        topk_per_keyword=topk_per_keyword,
+        topk_per_keyword=topk_per_keyword or huge_settings.topk_per_keyword,
         topk_per_query=topk_per_query,
-        vector_dis_threshold=vector_dis_threshold,
+        vector_dis_threshold=vector_dis_threshold or huge_settings.vector_dis_threshold,
     )
```

```diff
 def merge_dedup_rerank(
     self,
     graph_ratio: float = 0.5,
     rerank_method: Literal["bleu", "reranker"] = "bleu",
     near_neighbor_first: bool = False,
     custom_related_information: str = "",
-    topk_return_results: int = huge_settings.topk_return_results,
+    topk_return_results: Optional[int] = None,
 ):
@@
     MergeDedupRerank(
         embedding=self._embedding,
         graph_ratio=graph_ratio,
         method=rerank_method,
         near_neighbor_first=near_neighbor_first,
         custom_related_information=custom_related_information,
-        topk_return_results=topk_return_results
+        topk_return_results=topk_return_results or huge_settings.topk_return_results
     )
```

Also applies to: 106-114, 167-174, 181-189
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
(3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) setting.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants the extraction methods to share a consistent interface: all of them (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], with the main run() method handling sorting and truncation instead of pre-processing inside each method. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing repeated Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py (1)
WordExtract
(30-81)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
KeywordExtract
(31-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4)
19-19: typing import-order adjustment is OK. No functional impact; fine to keep.
21-21: The new huge_settings/prompt imports match their later usage. They are consistent with the default parameters and call sites below; no issue.
35-35: Decorator import-order adjustment does not affect behavior. Compatible with existing decorator usage.
57-65: Dropping the language parameter matches WordExtract's current signature. The language is derived from llm_settings, so it no longer needs to be passed here.
Support TextRank, and update the UI to allow changing the keyword extraction method.
Main changes:
Added options to the RAG interface for selecting the keyword extraction method, along with several TextRank settings: the number of keywords to extract, the sliding-window size, and the processing language ('en' for English, 'zh' for Chinese).
A 'TextRank mask words' setting has also been added: users can manually enter specific phrases composed of letters and symbols to keep them from being split during Chinese word segmentation.
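The PR's MultiLingualTextRank class is not reproduced in this thread, but the underlying algorithm can be sketched in a few lines: build a co-occurrence graph over a sliding window, then rank words by power-iteration PageRank. Tokenization, stopword filtering, and the real class/parameter names are omitted; this toy function only mirrors the keyword→score return contract:

```python
from collections import defaultdict

def textrank_keywords(words, window_size=3, top_k=5, d=0.85, iters=50):
    """Toy TextRank: undirected co-occurrence graph + power-iteration PageRank."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        # link each word to neighbors inside the sliding window
        for j in range(i + 1, min(i + window_size, len(words))):
            if w != words[j]:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        # standard PageRank update with damping factor d
        score = {
            w: (1 - d) + d * sum(score[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])  # keyword -> score, matching the PR's contract

print(textrank_keywords("graph rank graph keyword rank graph".split(), top_k=2))
```

The real implementation additionally handles Chinese segmentation, stopwords, and the mask-words list described above.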
Summary by CodeRabbit
New features
Behavior changes
Documentation
Chores