Improve pgn parsing #13
Conversation
Chess usernames are not case sensitive.
Reviewer's Guide

Refactors PGN fetching and parsing by improving archive caching logic, normalizing username comparisons to lowercase, generalizing lexing of move text to accept any digit, and adding a default cache timeout.

Class diagram for updated PGN archive fetching and caching

```mermaid
classDiagram
class Cache {
+get(key)
+set(key, value)
}
class fetch_archive {
+archive_url: str
+session: aiohttp.ClientSession
+semaphore: asyncio.Semaphore
+returns: str
}
class get_chess_dot_com_games {
+username: str
+returns: str
}
Cache <.. fetch_archive : uses
fetch_archive <.. get_chess_dot_com_games : used by
class aiohttp.ClientSession
class asyncio.Semaphore
class chessdotcomClient {
+get_player_game_archives(username)
}
get_chess_dot_com_games --> chessdotcomClient : calls
get_chess_dot_com_games --> fetch_archive : calls
fetch_archive --> Cache : set/get
```

Class diagram for username normalization in game analysis

```mermaid
classDiagram
class get_games_analysis {
+username: str
+pgn_games: list
+returns: dict
-names: set[str] (now lowercased)
}
get_games_analysis : +names = {n.strip().lower() for n in username.split("||")}
get_games_analysis : +is_white = white.lower() in names
get_games_analysis : +is_black = black.lower() in names
```
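A minimal sketch of the normalization the diagram above describes, assuming `username` may carry several names joined by "||"; the player names below are hypothetical example values, not project data:

```python
# Hypothetical inputs for illustration only.
username = "PlayerOne || playertwo"

# Lowercase every supplied name once...
names = {n.strip().lower() for n in username.split("||")}

# ...then compare the PGN tag values case-insensitively.
white, black = "PLAYERONE", "SomeoneElse"
is_white = white.lower() in names  # True
is_black = black.lower() in names  # False
```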
Class diagram for PGN lexer movetext detection update

```mermaid
classDiagram
class Lexer {
+lex()
+lex_movetext()
-peek()
-read()
-_cr_pos()
}
Lexer : +lex() now triggers lex_movetext() on any digit, not just '1'
```

Class diagram for Django cache settings update

```mermaid
classDiagram
class DjangoCacheSettings {
BACKEND: str
LOCATION: str
TIMEOUT: int | None
}
DjangoCacheSettings : +TIMEOUT = None (added)
```
Hey @CalvoM - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `style_predictor/pgn_parser/file_processing/lexer.py:86` </location>
<code_context>

```diff
 case "\n":
     _ = self.read()
     self._cr_pos()
-    if self.peek() == "1":
+    if self.peek() and self.peek().isdigit():
         self.lex_movetext()
     # break
```

</code_context>
<issue_to_address>
The new condition is more general but may match unintended digits.
If the goal is to support all move numbers, this change is appropriate. Otherwise, consider whether lines starting with a digit could cause incorrect matches.
</issue_to_address>
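If unintended matches are a concern, one option is to require the full PGN move-number shape (digits followed by a period) rather than any leading digit. A standalone sketch of that stricter predicate, independent of the project's `Lexer` class — the regex approach here is an illustration, not the repository's code:

```python
import re

# PGN movetext starts with a move number such as "1." or "23...",
# so demand digits immediately followed by a period.
MOVE_NUMBER = re.compile(r"^\d+\.")

def looks_like_movetext(line: str) -> bool:
    """True when the line begins with a PGN move number."""
    return bool(MOVE_NUMBER.match(line.lstrip()))

assert looks_like_movetext("1. e4 e5 2. Nf3 Nc6")
assert not looks_like_movetext("1990 Candidates, game 3")  # digit-initial, but not movetext
```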
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main      #13      +/-   ##
==========================================
- Coverage   36.19%   35.75%   -0.44%
==========================================
  Files          42       42
  Lines        1061     1074      +13
  Branches      100      101       +1
==========================================
  Hits          384      384
- Misses        661      674      +13
  Partials       16       16
```

☔ View full report in Codecov by Sentry.
@sourcery-ai review
Hey @CalvoM - I've reviewed your changes - here's some feedback:
- Consider adding a TTL to your cache entries instead of using TIMEOUT=None to prevent unbounded stale data accumulation.
- Switch to timezone-aware or UTC timestamps when comparing archive dates to avoid mismatches in different server locales.
- Replace the manual archive_url.split(...) parsing with a more robust URL parsing method (e.g. urllib.parse) to handle unexpected URL formats.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding a TTL to your cache entries instead of using TIMEOUT=None to prevent unbounded stale data accumulation.
- Switch to timezone-aware or UTC timestamps when comparing archive dates to avoid mismatches in different server locales (see the sketch after this list).
- Replace the manual archive_url.split(...) parsing with a more robust URL parsing method (e.g. urllib.parse) to handle unexpected URL formats.
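A minimal sketch of the UTC-based comparison the second bullet suggests, assuming archive URLs end in `/YYYY/MM`:

```python
from datetime import datetime, timezone

def is_current_archive(archive_url: str) -> bool:
    """Compare the archive's year/month against UTC now, so the result
    does not depend on the server's local timezone."""
    year, month = archive_url.rstrip("/").split("/")[-2:]
    now = datetime.now(timezone.utc)
    return int(year) == now.year and int(month) == now.month
```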
## Individual Comments
### Comment 1
<location> `style_predictor/apis/pgn/utils.py:124` </location>
<code_context>

```diff
 elif resp.status == 200:
-    return await resp.text()
+    resp = await resp.text()
+    archive_year, archive_month = archive_url.split("/")[-2:]
+    if this_month == int(archive_month) and this_year == int(
+        archive_year
+    ):
```

</code_context>
<issue_to_address>
Add error handling for archive_url parsing.
If archive_url does not have the expected format, this code may raise a ValueError. Add error handling to prevent unhandled exceptions.
</issue_to_address>
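One way to make the parsing defensive, sketched with `urllib.parse` as the overall comments suggest; the helper name `parse_archive_date` is illustrative, not an existing project function:

```python
from urllib.parse import urlparse

def parse_archive_date(archive_url: str) -> tuple[int, int] | None:
    """Return (year, month) from a .../YYYY/MM archive URL, or None
    when the URL does not match that shape."""
    parts = urlparse(archive_url).path.rstrip("/").split("/")
    try:
        return int(parts[-2]), int(parts[-1])
    except (IndexError, ValueError):
        return None

# Callers can then skip caching when the date cannot be determined,
# instead of letting a ValueError propagate.
```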
### Comment 2
<location> `my_chess_style/settings/base.py:157` </location>
<code_context>

```diff
 "default": {
     "BACKEND": "django.core.cache.backends.redis.RedisCache",
     "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
+    "TIMEOUT": None,
 }
 }
```

</code_context>
<issue_to_address>
Setting cache TIMEOUT to None disables expiration.
Ensure unbounded cache growth is acceptable or implement measures to manage cache size if expiration is disabled.
</issue_to_address>
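If expiration is preferred over `TIMEOUT: None`, a sketch with a finite default TTL — the 30-day figure is an arbitrary illustration, on the assumption that completed monthly archives can simply be re-fetched after expiry:

```python
import os

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
        # Evict entries after 30 days instead of never expiring them.
        "TIMEOUT": 60 * 60 * 24 * 30,
    }
}
```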
### Comment 3
<location> `style_predictor/apis/pgn/utils.py:106` </location>
<code_context>

```python
async def fetch_archive(
    archive_url: str, session: aiohttp.ClientSession, semaphore: asyncio.Semaphore
):
```

</code_context>
<issue_to_address>
Consider extracting the cacheability check into a helper function and using comprehensions to simplify cached and uncached archive handling.
Consider pulling the “is it cacheable?” logic out into its own helper, and replace the manual loops in get_chess_dot_com_games with simple comprehensions. For example:
```python
from datetime import datetime

def should_cache_archive(url: str, today: datetime | None = None) -> bool:
    today = today or datetime.now()
    year, month = url.rstrip("/").split("/")[-2:]
    return not (int(year) == today.year and int(month) == today.month)
```
Then in fetch_archive you can shrink the conditional:
```python
async def fetch_archive(archive_url, session, semaphore):
    async with semaphore:
        while True:
            async with session.get(f"{archive_url}/pgn") as resp:
                if resp.status == 429:
                    # ...
                elif resp.status == 200:
                    text = await resp.text()
                    if should_cache_archive(archive_url):
                        cache.set(archive_url, text)
                    return text
                # ...
```
And in get_chess_dot_com_games use one pass to split cached vs uncached:
```python
async def get_chess_dot_com_games(username: str) -> str:
    archives = chessdotcomClient.get_player_game_archives(username).json.get("archives", [])
    # collect cached results and list out uncached URLs
    lookup = {url: cache.get(url) for url in archives}
    cached = [pgn for pgn in lookup.values() if pgn]
    to_fetch = [url for url, pgn in lookup.items() if not pgn]
    conn = aiohttp.TCPConnector(limit=15)
    async with aiohttp.ClientSession(connector=conn) as session:
        fetched = await asyncio.gather(
            *(fetch_archive(url, session, asyncio.Semaphore(15)) for url in to_fetch)
        )
    return "\n\n".join([*cached, *fetched])
```
This keeps all the new caching behavior but isolates the date logic and collapses the manual loops into clear comprehensions.
</issue_to_address>
style_predictor/apis/pgn/utils.py (Outdated)

```python
archive_year, archive_month = archive_url.split("/")[-2:]
if this_month == int(archive_month) and this_year == int(
```
issue (bug_risk): Add error handling for archive_url parsing.
If archive_url does not have the expected format, this code may raise a ValueError. Add error handling to prevent unhandled exceptions.

```python
"default": {
    "BACKEND": "django.core.cache.backends.redis.RedisCache",
    "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
    "TIMEOUT": None,
```
question (bug_risk): Setting cache TIMEOUT to None disables expiration.
Ensure unbounded cache growth is acceptable or implement measures to manage cache size if expiration is disabled.
```python
    return is_present


async def fetch_archive(
```
issue (complexity): Consider extracting the cacheability check into a helper function and using comprehensions to simplify cached and uncached archive handling.
style_predictor/apis/pgn/utils.py (Outdated)

```python
if this_month == int(archive_month) and this_year == int(
    archive_year
):
    # Do not cache the latest archive since it will change.
    pass
else:
    cache.set(archive_url, resp)
```
issue (code-quality): Swap if/else to remove empty if body (remove-pass-body)
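Applied to the snippet above, the swap would read roughly like this sketch, not a final patch:

```python
# Negate the condition so the real work sits in the `if` body
# and the empty `pass` branch disappears.
if not (this_month == int(archive_month) and this_year == int(archive_year)):
    cache.set(archive_url, resp)
```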
style_predictor/apis/pgn/utils.py (Outdated)

```python
saved_res = cache.get(archive)
if not saved_res:
    archives_not_saved.append(archive)
else:
    all_pgns.append(saved_res)
```
issue (code-quality): We've found these issues:
- Use named expression to simplify assignment and conditional (use-named-expression)
- Swap if/else branches (swap-if-else-branches)
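Both suggestions applied to the snippet above would look roughly like this sketch:

```python
# The walrus assignment folds the cache lookup into the condition,
# and swapping the branches puts the positive case first.
if saved_res := cache.get(archive):
    all_pgns.append(saved_res)
else:
    archives_not_saved.append(archive)
```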
Since the archives, apart from the latest one, will not change, caching will really help with performance.
05fb3af to fbc96a5
Summary by Sourcery
Improve PGN fetching and parsing by adding selective caching of archives, normalizing usernames for analysis, fixing lexer digit detection, and configuring indefinite cache timeout.