Add multi-language support for prompt templates #414
…ers; update OCR default prompt handling
…nglish prompt templates for deepsearch
a438fda
to
542f54c
Compare
else:
    logger.debug("Prompt overrides directory does not exist: %s", self.overrides_dir)
search_paths.append(self.internal_dir)
self.env = Environment(loader=FileSystemLoader(search_paths), autoescape=False, trim_blocks=True, lstrip_blocks=True)
Check warning
Code scanning / CodeQL
Jinja2 templating with autoescape=False Medium
Copilot Autofix
AI 6 days ago
To fix this vulnerability, autoescaping should be enabled when constructing the Jinja2 `Environment`. The recommended and most robust solution is to use `select_autoescape`, which enables escaping for templates whose names end in the listed extensions (by default `.html`, `.htm` and `.xml`) and leaves others, such as plain-text templates, unescaped. If all of your templates are HTML or XML, you can use `autoescape=True` instead. Because the templates here use the `.j2` extension, replacing `autoescape=False` with `autoescape=select_autoescape(['j2', 'html', 'xml'])` ensures that they are escaped as well. Note that this does change rendered output: any value containing `<`, `>`, `&` or quote characters is HTML-escaped in `.j2` templates, which may be undesirable if the templates produce plain-text prompts rather than markup.
The only changes needed are:
- Import `select_autoescape` from `jinja2`.
- Replace the `autoescape=False` argument in the `Environment` instantiation with `autoescape=select_autoescape(['j2', 'html', 'xml'])`.
These changes need to be made in `api/utils/prompt_loader.py`, specifically in the import statement and the environment construction.
@@ -3,7 +3,7 @@
 from functools import lru_cache
 from typing import Any, Iterator, List, Optional

-from jinja2 import Environment, FileSystemLoader, TemplateNotFound
+from jinja2 import Environment, FileSystemLoader, TemplateNotFound, select_autoescape

 from api.utils.configuration import get_configuration

@@ -53,7 +53,12 @@
         else:
             logger.debug("Prompt overrides directory does not exist: %s", self.overrides_dir)
         search_paths.append(self.internal_dir)
-        self.env = Environment(loader=FileSystemLoader(search_paths), autoescape=False, trim_blocks=True, lstrip_blocks=True)
+        self.env = Environment(
+            loader=FileSystemLoader(search_paths),
+            autoescape=select_autoescape(['j2', 'html', 'xml']),
+            trim_blocks=True,
+            lstrip_blocks=True,
+        )

         # cache of resolved template name lists per module key (ordered)
         self._module_template_cache: dict[Optional[str], List[str]] = {}
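Whether this autofix leaves prompt output unchanged depends on what the templates render. The following is a minimal sketch, using made-up in-memory templates through `DictLoader` rather than the project's files, of how `select_autoescape` decides per template name. The template names and the `text` variable are assumptions for illustration only.

```python
from jinja2 import DictLoader, Environment, select_autoescape

# Hypothetical in-memory templates; the real project loads files from disk.
templates = {
    "en.j2": "Summarize: {{ text }}",
    "en.txt": "Summarize: {{ text }}",
}

env = Environment(
    loader=DictLoader(templates),
    # Same extension list as in the suggested fix above.
    autoescape=select_autoescape(["j2", "html", "xml"]),
)

value = "a < b & c"

# "en.j2" matches the extension list, so the value is HTML-escaped.
escaped = env.get_template("en.j2").render(text=value)

# "en.txt" does not match, so the value is rendered verbatim.
unescaped = env.get_template("en.txt").render(text=value)

print(escaped)
print(unescaped)
```

If the prompts are sent to a model as plain text, the escaped form is probably not what is wanted, so the extension list deserves a deliberate choice rather than a blanket fix.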
if not base or not os.path.isdir(base):
    continue
path = os.path.join(base, cand)
if os.path.isfile(path) and (base, cand) not in found:
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
Copilot Autofix
AI 6 days ago
To mitigate path traversal, the user-controlled `language` value (and any input derived from it) must be validated before candidate filenames are constructed. Since filenames are expected to follow the format `<lang>.j2` or `<module>.<lang>.j2`, the best fix is to allow only valid language codes in the filename position: either two-letter lowercase alphabetic strings or an explicit allow-list.
The recommended approach:
- Validate `language` strictly in the `PromptRenderer` initializer. Accept only language codes that match a regex such as `^[a-z]{2}$` or that belong to a set of known codes.
- If the value does not match, either fall back to `DEFAULT_LANGUAGE` or raise an exception.
- In `_candidate_templates`, ensure that candidate filenames are built only from the validated language string.
- Encapsulate the validation in a helper function and run it immediately on initialization.
Files to edit: `api/utils/prompt_loader.py` (the constructor and the associated candidate-template logic).
Dependencies: Python's built-in `re` module is sufficient.
@@ -1,5 +1,6 @@
 import logging
 import os
+import re
 from functools import lru_cache
 from typing import Any, Iterator, List, Optional

@@ -42,7 +43,11 @@
         # read dynamic configuration if available
         self.overrides_dir = overrides_dir or getattr(settings, "prompts_dir", "/prompts")
         # prompts_lang is stored as a language code (eg 'en', 'fr') without the .j2 extension
-        self.language = language or getattr(settings, "prompts_lang", DEFAULT_LANGUAGE)
+        selected_lang = language or getattr(settings, "prompts_lang", DEFAULT_LANGUAGE)
+        if not re.fullmatch(r"^[a-z]{2}$", str(selected_lang)):
+            # %r escapes control characters, so the rejected value cannot forge log lines
+            logger.warning("Invalid language code %r provided, falling back to default: %s", selected_lang, DEFAULT_LANGUAGE)
+            selected_lang = DEFAULT_LANGUAGE
+        self.language = selected_lang

         self.internal_dir = DEFAULT_PROMPTS_RELATIVE_DIR

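The validation described above can be sketched as a standalone helper, paired with a containment check as a second line of defence. This is an illustrative sketch only: `normalize_language` and `is_within` are hypothetical names that do not appear in the PR, and the `"en"` default is an assumption.

```python
import os
import re

DEFAULT_LANGUAGE = "en"  # assumed default; the PR reads it from its own constant

# Two lowercase ASCII letters, matching the format suggested in the autofix.
_LANG_RE = re.compile(r"[a-z]{2}")


def normalize_language(value, default=DEFAULT_LANGUAGE):
    """Return value if it is a valid language code, otherwise the default."""
    text = "" if value is None else str(value)
    return text if _LANG_RE.fullmatch(text) else default


def is_within(base, candidate):
    """Check that joining candidate onto base does not escape base."""
    base_real = os.path.realpath(base)
    target = os.path.realpath(os.path.join(base_real, candidate))
    return os.path.commonpath([base_real, target]) == base_real


print(normalize_language("fr"))
print(normalize_language("../../etc/passwd"))
print(is_within("/prompts", "fr.j2"))
print(is_within("/prompts", "../etc/passwd"))
```

Validating the language code removes the traversal vector at its source; the containment check is optional hardening for any other filename component that might later become user-influenced.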
# if no exact candidate matched, try to use any internal templates matching the language
logger.debug(
    "No exact candidate template found; searching internal dir for any *.%s.j2 files",
    self.language,
Check failure
Code scanning / CodeQL
Log Injection High
user-provided value
Copilot Autofix
AI 6 days ago
To mitigate log injection, user-provided values that are included in log messages should be sanitized. For plain-text logs, the most important step is to strip or replace newline and carriage-return characters in the language value before logging. The simplest fix is to pass `self.language` through a helper that removes problematic characters (for example, replacing `\n` and `\r` with an empty string) in every logging call where a user-provided language code could reach the log output. Only the logging call on line 106 needs to change, as subsequent uses do not log the value directly.
The fix is confined to the code shown in `api/utils/prompt_loader.py`. A good approach is to build a local variable (e.g., `safe_language`) by sanitizing `self.language` just before logging, and to use that variable in place of `self.language` in the log statement.
No new imports are required, as native string replacement (`str.replace`) is all that is needed.
@@ -101,9 +101,10 @@

         if not found:
             # if no exact candidate matched, try to use any internal templates matching the language
+            safe_language = str(self.language).replace("\n", "").replace("\r", "")
             logger.debug(
                 "No exact candidate template found; searching internal dir for any *.%s.j2 files",
-                self.language,
+                safe_language,
             )
             try:
                 internal_files = [f for f in os.listdir(self.internal_dir) if f.endswith(f".{self.language}.j2")]
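To see what the sanitization buys, here is a small self-contained sketch that logs a hostile value to an in-memory stream. The logger name, the helper name `sanitize_for_log`, and the hostile string are all made up for the demonstration.

```python
import io
import logging


def sanitize_for_log(value):
    """Strip CR and LF so one value cannot span several log lines."""
    return str(value).replace("\r", "").replace("\n", "")


# Capture log output in memory instead of writing to a real log file.
stream = io.StringIO()
logger = logging.getLogger("prompt_loader_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

# A value that tries to smuggle a forged second log entry.
hostile = "fr\nERROR forged entry"

logger.debug(
    "searching internal dir for any *.%s.j2 files",
    sanitize_for_log(hostile),
)

logged = stream.getvalue()
print(repr(logged))
```

Without the helper the same call would emit two lines, the second of which looks like a genuine error entry. Once the language code is validated at construction time this sanitization is largely redundant, but it stays cheap insurance for other logged values.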
    found = [(self.internal_dir, f) for f in internal_files]
else:
    # final fallback to default language file name (will be looked up in env paths)
    logger.error("No prompt template found; tried: %s", ", ".join(candidates))
Check failure
Code scanning / CodeQL
Log Injection High
user-provided value
Copilot Autofix
AI 6 days ago
The best way to fix this problem is to sanitize any user-controlled input that is written to log files. In this case, each candidate filename should be sanitized before the list is joined and logged, removing potentially malicious characters such as carriage returns and line feeds (`\r`, `\n`). This can be done by mapping a sanitize function over the candidate filenames before joining them.
The only file that requires editing is `api/utils/prompt_loader.py`, specifically the `_resolve_template_for_module` method, where the log call uses `", ".join(candidates)`. To implement this:
- Define a helper function (locally or as a static/class method) that sanitizes user-supplied or user-influenced strings before logging.
- Use this function to clean every candidate string before joining and logging them.
No external dependencies are needed; built-in Python string operations are sufficient.
@@ -85,6 +85,12 @@
             yield from self._module_template_pairs_cache[module]
             return

+        def _sanitize_for_log(s: str) -> str:
+            # Remove CR and LF characters so a value cannot forge extra log lines
+            sanitized = s.replace('\r', '').replace('\n', '')
+            return sanitized
+
+
         search_paths = [self.overrides_dir, self.internal_dir]
         candidates = self._candidate_templates(module)
         found: list[tuple[str, str]] = []
@@ -114,7 +120,7 @@
                 found = [(self.internal_dir, f) for f in internal_files]
             else:
                 # final fallback to default language file name (will be looked up in env paths)
-                logger.error("No prompt template found; tried: %s", ", ".join(candidates))
+                logger.error("No prompt template found; tried: %s", ", ".join(_sanitize_for_log(c) for c in candidates))
                 found = [(self.internal_dir, f"{DEFAULT_LANGUAGE}.j2")]

         # cache relative template names for introspection but keep pairs for loading
# Load template from a specific base directory to allow loading internal
# templates even when the same filename exists in an overrides dir.
try:
    env = Environment(loader=FileSystemLoader([base]), autoescape=False, trim_blocks=True, lstrip_blocks=True)
Check warning
Code scanning / CodeQL
Jinja2 templating with autoescape=False Medium
Copilot Autofix
AI 6 days ago
To fix this issue, instantiate the Jinja2 `Environment` with `autoescape` enabled in a way that suits what is being templated. If the templates might generate HTML or XML, use `select_autoescape(['html', 'xml'])`, which is Jinja2's best practice. It enables autoescaping for HTML and XML templates but not for others (such as plain `.txt` files). Note that templates whose names end in `.j2` are not matched by this extension list, so they continue to render unescaped. If all templates are `.j2` files that exclusively render non-HTML output, autoescaping can instead be enabled or disabled explicitly based on downstream needs. The safest general fix is to replace `autoescape=False` with `autoescape=select_autoescape(['html', 'xml'])`.
Therefore, in `api/utils/prompt_loader.py`, replace the instantiation:
    env = Environment(loader=FileSystemLoader([base]), autoescape=False, trim_blocks=True, lstrip_blocks=True)
with
    env = Environment(
        loader=FileSystemLoader([base]),
        autoescape=select_autoescape(['html', 'xml']),
        trim_blocks=True,
        lstrip_blocks=True,
    )
and add an import for `select_autoescape` from `jinja2` at the top of the file.
@@ -3,7 +3,7 @@
 from functools import lru_cache
 from typing import Any, Iterator, List, Optional

-from jinja2 import Environment, FileSystemLoader, TemplateNotFound
+from jinja2 import Environment, FileSystemLoader, TemplateNotFound, select_autoescape

 from api.utils.configuration import get_configuration

@@ -128,7 +128,12 @@
         # Load template from a specific base directory to allow loading internal
         # templates even when the same filename exists in an overrides dir.
         try:
-            env = Environment(loader=FileSystemLoader([base]), autoescape=False, trim_blocks=True, lstrip_blocks=True)
+            env = Environment(
+                loader=FileSystemLoader([base]),
+                autoescape=select_autoescape(['html', 'xml']),
+                trim_blocks=True,
+                lstrip_blocks=True
+            )
             return env.get_template(template_name)
         except TemplateNotFound as e:
             raise FileNotFoundError(f"Prompt template file '{template_name}' not found in base '{base}'.") from e
…st parameters accordingly
# revision identifiers, used by Alembic.
revision: str = "095feb42bc54"
Check notice
Code scanning / CodeQL
Unused global variable Note
Copilot Autofix
AI 5 days ago
Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.
# revision identifiers, used by Alembic.
revision: str = "095feb42bc54"
down_revision: Union[str, None] = "479aeeae940b"
Check notice
Code scanning / CodeQL
Unused global variable Note
Copilot Autofix
AI 5 days ago
Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.
# revision identifiers, used by Alembic.
revision: str = "095feb42bc54"
down_revision: Union[str, None] = "479aeeae940b"
branch_labels: Union[str, Sequence[str], None] = None
Check notice
Code scanning / CodeQL
Unused global variable Note
Copilot Autofix
AI 5 days ago
To fix the issue, delete the assignment to the unused global variable `branch_labels` in `api/alembic/versions/2025_09_10_1143-095feb42bc54_user_name_optional.py`. Right-hand-side code with side effects must not be removed, but since the assigned value is simply `None`, it is safe to remove the line entirely. No other changes are necessary, because Alembic does not require this variable to be present unless the migration is part of a branch.
@@ -15,7 +15,6 @@
 # revision identifiers, used by Alembic.
 revision: str = "095feb42bc54"
 down_revision: Union[str, None] = "479aeeae940b"
-branch_labels: Union[str, Sequence[str], None] = None
 depends_on: Union[str, Sequence[str], None] = None


revision: str = "095feb42bc54"
down_revision: Union[str, None] = "479aeeae940b"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
Check notice
Code scanning / CodeQL
Unused global variable Note
Copilot Autofix
AI 5 days ago
To fix the problem, remove the definition of the unused global variable `depends_on` on line 19 of `api/alembic/versions/2025_09_10_1143-095feb42bc54_user_name_optional.py`. Its value is simply `None` and the assignment has no side effects, so the line can be deleted without affecting functionality or documentation. No imports or other changes are necessary.
@@ -16,9 +16,9 @@
 revision: str = "095feb42bc54"
 down_revision: Union[str, None] = "479aeeae940b"
 branch_labels: Union[str, Sequence[str], None] = None
 depends_on: Union[str, Sequence[str], None] = None


 def upgrade() -> None:
     """Upgrade schema."""
     op.alter_column("user", "name", existing_type=sa.VARCHAR(), nullable=True)
Introduce a centralized Jinja2 prompt system with multi-language capabilities, including English and French templates. Enhance configuration options for prompts directory and language selection, and update documentation to reflect these changes. Refactor existing prompt handling to utilize the new template-based approach.
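As a rough illustration of the template-based approach, the sketch below resolves a prompt file by language and optional module, checking an overrides directory before the internal one. It follows the `<module>.<lang>.j2` and `<lang>.j2` naming mentioned in the review comments, but the function names, the lookup order, and the sample files are assumptions and may differ from what `api/utils/prompt_loader.py` actually does.

```python
import os
import tempfile


def candidate_templates(module, language):
    """Hypothetical ordering: the most specific filename comes first."""
    names = []
    if module:
        names.append(f"{module}.{language}.j2")
    names.append(f"{language}.j2")
    return names


def resolve(search_paths, candidates):
    """Return (base, name) for the first candidate found, or None."""
    for name in candidates:
        for base in search_paths:
            if os.path.isfile(os.path.join(base, name)):
                return base, name
    return None


# Build throwaway directories standing in for the overrides and internal dirs.
with tempfile.TemporaryDirectory() as overrides, tempfile.TemporaryDirectory() as internal:
    sample_files = [(overrides, "fr.j2"), (internal, "fr.j2"), (internal, "ocr.fr.j2")]
    for base, name in sample_files:
        with open(os.path.join(base, name), "w", encoding="utf-8") as fh:
            fh.write("placeholder")

    paths = [overrides, internal]  # overrides take precedence
    module_hit = resolve(paths, candidate_templates("ocr", "fr"))
    generic_hit = resolve(paths, candidate_templates(None, "fr"))

    # The module-specific file exists only in the internal dir.
    module_from_internal = module_hit == (internal, "ocr.fr.j2")
    # The generic file exists in both, so the override wins.
    generic_from_overrides = generic_hit == (overrides, "fr.j2")

print(module_from_internal, generic_from_overrides)
```

Listing the overrides directory first lets a deployment replace individual prompts without touching the packaged templates.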