Improve pgn parsing #13
Conversation
Chess usernames are not case sensitive.
Reviewer's Guide

Refactors PGN fetching and parsing by improving archive caching logic, normalizing username comparisons to lowercase, generalizing lexing of move text to accept any digit, and adding a default cache timeout.

Class diagram for updated PGN archive fetching and caching

```mermaid
classDiagram
class Cache {
+get(key)
+set(key, value)
}
class fetch_archive {
+archive_url: str
+session: aiohttp.ClientSession
+semaphore: asyncio.Semaphore
+returns: str
}
class get_chess_dot_com_games {
+username: str
+returns: str
}
Cache <.. fetch_archive : uses
fetch_archive <.. get_chess_dot_com_games : used by
class aiohttp.ClientSession
class asyncio.Semaphore
class chessdotcomClient {
+get_player_game_archives(username)
}
get_chess_dot_com_games --> chessdotcomClient : calls
get_chess_dot_com_games --> fetch_archive : calls
fetch_archive --> Cache : set/get
```

Class diagram for username normalization in game analysis

```mermaid
classDiagram
class get_games_analysis {
+username: str
+pgn_games: list
+returns: dict
-names: set[str] (now lowercased)
}
get_games_analysis : +names = {n.strip().lower() for n in username.split("||")}
get_games_analysis : +is_white = white.lower() in names
get_games_analysis : +is_black = black.lower() in names
```
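A minimal sketch of the normalization the diagram above describes, assuming `username` may carry several names joined by "||"; the player names below are hypothetical example values, not project data:

```python
# Hypothetical inputs for illustration only.
username = "PlayerOne || playertwo"

# Lowercase every supplied name once...
names = {n.strip().lower() for n in username.split("||")}

# ...then compare the PGN tag values case-insensitively.
white, black = "PLAYERONE", "SomeoneElse"
is_white = white.lower() in names  # True
is_black = black.lower() in names  # False
```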
Class diagram for PGN lexer movetext detection update

```mermaid
classDiagram
class Lexer {
+lex()
+lex_movetext()
-peek()
-read()
-_cr_pos()
}
Lexer : +lex() now triggers lex_movetext() on any digit, not just '1'
```

Class diagram for Django cache settings update

```mermaid
classDiagram
class DjangoCacheSettings {
BACKEND: str
LOCATION: str
TIMEOUT: int | None
}
DjangoCacheSettings : +TIMEOUT = None (added)
```
Hey @CalvoM - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `style_predictor/pgn_parser/file_processing/lexer.py:86` </location>
<code_context>

```diff
 case "\n":
     _ = self.read()
     self._cr_pos()
-    if self.peek() == "1":
+    if self.peek() and self.peek().isdigit():
         self.lex_movetext()
     # break
```

</code_context>
<issue_to_address>
The new condition is more general but may match unintended digits.
If the goal is to support all move numbers, this change is appropriate. Otherwise, consider whether lines starting with a digit could cause incorrect matches.
</issue_to_address>
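If unintended matches are a concern, one option is to require the full PGN move-number shape (digits followed by a period) rather than any leading digit. A standalone sketch of that stricter predicate, independent of the project's `Lexer` class — the regex approach here is an illustration, not the repository's code:

```python
import re

# PGN movetext starts with a move number such as "1." or "23...",
# so demand digits immediately followed by a period.
MOVE_NUMBER = re.compile(r"^\d+\.")

def looks_like_movetext(line: str) -> bool:
    """True when the line begins with a PGN move number."""
    return bool(MOVE_NUMBER.match(line.lstrip()))

assert looks_like_movetext("1. e4 e5 2. Nf3 Nc6")
assert not looks_like_movetext("1990 Candidates, game 3")  # digit-initial, but not movetext
```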
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main      #13      +/-   ##
==========================================
- Coverage   36.19%   35.75%   -0.44%
==========================================
  Files          42       42
  Lines        1061     1074      +13
  Branches      100      101       +1
==========================================
  Hits          384      384
- Misses        661      674      +13
  Partials       16       16
```

☔ View full report in Codecov by Sentry.
@sourcery-ai review
Hey @CalvoM - I've reviewed your changes - here's some feedback:
- Consider adding a TTL to your cache entries instead of using TIMEOUT=None to prevent unbounded stale data accumulation.
- Switch to timezone-aware or UTC timestamps when comparing archive dates to avoid mismatches in different server locales.
- Replace the manual archive_url.split(...) parsing with a more robust URL parsing method (e.g. urllib.parse) to handle unexpected URL formats.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding a TTL to your cache entries instead of using TIMEOUT=None to prevent unbounded stale data accumulation.
- Switch to timezone-aware or UTC timestamps when comparing archive dates to avoid mismatches in different server locales (see the sketch after this list).
- Replace the manual archive_url.split(...) parsing with a more robust URL parsing method (e.g. urllib.parse) to handle unexpected URL formats.
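A minimal sketch of the UTC-based comparison the second bullet suggests, assuming archive URLs end in `/YYYY/MM`:

```python
from datetime import datetime, timezone

def is_current_archive(archive_url: str) -> bool:
    """Compare the archive's year/month against UTC now, so the result
    does not depend on the server's local timezone."""
    year, month = archive_url.rstrip("/").split("/")[-2:]
    now = datetime.now(timezone.utc)
    return int(year) == now.year and int(month) == now.month
```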
## Individual Comments
### Comment 1
<location> `style_predictor/apis/pgn/utils.py:124` </location>
<code_context>

```diff
 elif resp.status == 200:
-    return await resp.text()
+    resp = await resp.text()
+    archive_year, archive_month = archive_url.split("/")[-2:]
+    if this_month == int(archive_month) and this_year == int(
+        archive_year
+    ):
```

</code_context>
<issue_to_address>
Add error handling for archive_url parsing.
If archive_url does not have the expected format, this code may raise a ValueError. Add error handling to prevent unhandled exceptions.
</issue_to_address>
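One way to make the parsing defensive, sketched with `urllib.parse` as the overall comments suggest; the helper name `parse_archive_date` is illustrative, not an existing project function:

```python
from urllib.parse import urlparse

def parse_archive_date(archive_url: str) -> tuple[int, int] | None:
    """Return (year, month) from a .../YYYY/MM archive URL, or None
    when the URL does not match that shape."""
    parts = urlparse(archive_url).path.rstrip("/").split("/")
    try:
        return int(parts[-2]), int(parts[-1])
    except (IndexError, ValueError):
        return None

# Callers can then skip caching when the date cannot be determined,
# instead of letting a ValueError propagate.
```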
### Comment 2
<location> `my_chess_style/settings/base.py:157` </location>
<code_context>

```diff
 "default": {
     "BACKEND": "django.core.cache.backends.redis.RedisCache",
     "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
+    "TIMEOUT": None,
 }
 }
```

</code_context>
<issue_to_address>
Setting cache TIMEOUT to None disables expiration.
Ensure unbounded cache growth is acceptable or implement measures to manage cache size if expiration is disabled.
</issue_to_address>
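If expiration is preferred over `TIMEOUT: None`, a sketch with a finite default TTL — the 30-day figure is an arbitrary illustration, on the assumption that completed monthly archives can simply be re-fetched after expiry:

```python
import os

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
        # Evict entries after 30 days instead of never expiring them.
        "TIMEOUT": 60 * 60 * 24 * 30,
    }
}
```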
### Comment 3
<location> `style_predictor/apis/pgn/utils.py:106` </location>
<code_context>

```python
async def fetch_archive(
    archive_url: str, session: aiohttp.ClientSession, semaphore: asyncio.Semaphore
):
```

</code_context>
<issue_to_address>
Consider extracting the cacheability check into a helper function and using comprehensions to simplify cached and uncached archive handling.
Consider pulling the “is it cacheable?” logic out into its own helper, and replace the manual loops in get_chess_dot_com_games with simple comprehensions. For example:
```python
from datetime import datetime

def should_cache_archive(url: str, today: datetime | None = None) -> bool:
    today = today or datetime.now()
    year, month = url.rstrip("/").split("/")[-2:]
    return not (int(year) == today.year and int(month) == today.month)
```
Then in fetch_archive you can shrink the conditional:
```python
async def fetch_archive(archive_url, session, semaphore):
    async with semaphore:
        while True:
            async with session.get(f"{archive_url}/pgn") as resp:
                if resp.status == 429:
                    # ...
                elif resp.status == 200:
                    text = await resp.text()
                    if should_cache_archive(archive_url):
                        cache.set(archive_url, text)
                    return text
                # ...
```
And in get_chess_dot_com_games use one pass to split cached vs uncached:
```python
async def get_chess_dot_com_games(username: str) -> str:
    archives = chessdotcomClient.get_player_game_archives(username).json.get("archives", [])
    # collect cached results and list out uncached URLs
    lookup = {url: cache.get(url) for url in archives}
    cached = [pgn for pgn in lookup.values() if pgn]
    to_fetch = [url for url, pgn in lookup.items() if not pgn]
    conn = aiohttp.TCPConnector(limit=15)
    async with aiohttp.ClientSession(connector=conn) as session:
        fetched = await asyncio.gather(
            *(fetch_archive(url, session, asyncio.Semaphore(15)) for url in to_fetch)
        )
    return "\n\n".join([*cached, *fetched])
```
This keeps all the new caching behavior but isolates the date logic and collapses the manual loops into clear comprehensions.
</issue_to_address>
style_predictor/apis/pgn/utils.py (Outdated)

```python
archive_year, archive_month = archive_url.split("/")[-2:]
if this_month == int(archive_month) and this_year == int(
```
issue (bug_risk): Add error handling for archive_url parsing.
If archive_url does not have the expected format, this code may raise a ValueError. Add error handling to prevent unhandled exceptions.

```python
"default": {
    "BACKEND": "django.core.cache.backends.redis.RedisCache",
    "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
    "TIMEOUT": None,
```
question (bug_risk): Setting cache TIMEOUT to None disables expiration.
Ensure unbounded cache growth is acceptable or implement measures to manage cache size if expiration is disabled.
```python
    return is_present


async def fetch_archive(
```
issue (complexity): Consider extracting the cacheability check into a helper function and using comprehensions to simplify cached and uncached archive handling.
style_predictor/apis/pgn/utils.py (Outdated)

```python
if this_month == int(archive_month) and this_year == int(
    archive_year
):
    # Do not cache the latest archive since it will change.
    pass
else:
    cache.set(archive_url, resp)
```
issue (code-quality): Swap if/else to remove empty if body (remove-pass-body)
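Applied to the snippet above, the swap would read roughly like this sketch, not a final patch:

```python
# Negate the condition so the real work sits in the `if` body
# and the empty `pass` branch disappears.
if not (this_month == int(archive_month) and this_year == int(archive_year)):
    cache.set(archive_url, resp)
```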
style_predictor/apis/pgn/utils.py (Outdated)

```python
saved_res = cache.get(archive)
if not saved_res:
    archives_not_saved.append(archive)
else:
    all_pgns.append(saved_res)
```
issue (code-quality): We've found these issues:
- Use named expression to simplify assignment and conditional (use-named-expression)
- Swap if/else branches (swap-if-else-branches)
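Both suggestions applied to the snippet above would look roughly like this sketch:

```python
# The walrus assignment folds the cache lookup into the condition,
# and swapping the branches puts the positive case first.
if saved_res := cache.get(archive):
    all_pgns.append(saved_res)
else:
    archives_not_saved.append(archive)
```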
Since the archives, apart from the latest one, will not change, caching will really help with performance.
05fb3af to fbc96a5
Summary by Sourcery
Improve PGN fetching and parsing by adding selective caching of archives, normalizing usernames for analysis, fixing lexer digit detection, and configuring indefinite cache timeout.