
Conversation

CalvoM
Owner

@CalvoM CalvoM commented Aug 2, 2025

Summary by Sourcery

Improve PGN fetching and parsing by adding selective caching of archives, normalizing usernames for analysis, fixing lexer digit detection, and configuring indefinite cache timeout.

Bug Fixes:

  • Relax PGN lexer break detection to trigger move text lexing on any digit, not just '1'.

Enhancements:

  • Cache fetched PGN archives and reuse cached data to minimize network requests.
  • Exclude the current month’s archive from caching to prevent stale results.
  • Normalize player usernames to lowercase for consistent game analysis matching.
  • Set Django Redis cache timeout to never expire by default.


sourcery-ai bot commented Aug 2, 2025

Reviewer's Guide

Refactors PGN fetching and parsing by improving archive caching logic, normalizing username comparisons to lowercase, generalizing lexing of move text for any digit, and adding a default cache timeout.

Class diagram for updated PGN archive fetching and caching

```mermaid
classDiagram
    class Cache {
        +get(key)
        +set(key, value)
    }
    class fetch_archive {
        +archive_url: str
        +session: aiohttp.ClientSession
        +semaphore: asyncio.Semaphore
        +returns: str
    }
    class get_chess_dot_com_games {
        +username: str
        +returns: str
    }
    Cache <.. fetch_archive : uses
    fetch_archive <.. get_chess_dot_com_games : used by

    class aiohttp.ClientSession
    class asyncio.Semaphore
    class chessdotcomClient {
        +get_player_game_archives(username)
    }
    get_chess_dot_com_games --> chessdotcomClient : calls
    get_chess_dot_com_games --> fetch_archive : calls
    fetch_archive --> Cache : set/get
```

Class diagram for username normalization in game analysis

```mermaid
classDiagram
    class get_games_analysis {
        +username: str
        +pgn_games: list
        +returns: dict
        -names: set[str] (now lowercased)
    }
    get_games_analysis : +names = {n.strip().lower() for n in username.split("||")}
    get_games_analysis : +is_white = white.lower() in names
    get_games_analysis : +is_black = black.lower() in names
```
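The normalization shown in the diagram can be sketched as a small standalone function. The function name and signature here are illustrative, not the PR's actual API; the set-building and lowercase comparisons mirror the expressions in the diagram.

```python
def classify_player(username: str, white: str, black: str) -> tuple[bool, bool]:
    """Return (is_white, is_black) for a game, matching case-insensitively.

    `username` may be a single name or several names joined with '||'.
    """
    # Normalize the candidate names once, up front.
    names = {n.strip().lower() for n in username.split("||")}
    # Compare the game's player tags in lowercase against the normalized set.
    is_white = white.lower() in names
    is_black = black.lower() in names
    return is_white, is_black
```

Lowercasing both sides avoids missed matches when chess.com reports a username with different capitalization than the one the user typed.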

Class diagram for PGN lexer movetext detection update

```mermaid
classDiagram
    class Lexer {
        +lex()
        +lex_movetext()
        -peek()
        -read()
        -_cr_pos()
    }
    Lexer : +lex() now triggers lex_movetext() on any digit, not just '1'
```
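The lexer change boils down to one predicate. The toy class below is a sketch of that predicate only, not the project's real `Lexer`; the empty-string guard matters because `peek()` can return nothing at end of input.

```python
class MiniLexer:
    """Toy sketch: after a newline, movetext lexing now starts on any
    leading digit, not only '1'."""

    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def peek(self) -> str:
        # Return the next character without consuming it, or "" at EOF.
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def starts_movetext(self) -> bool:
        # Old behavior: self.peek() == "1"  (missed games resuming at move 2+)
        return bool(self.peek()) and self.peek().isdigit()
```

This matters for PGN fragments that begin mid-game (e.g. a line starting `24. Qxd5`), which the old `== "1"` check would have skipped.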

Class diagram for Django cache settings update

```mermaid
classDiagram
    class DjangoCacheSettings {
        BACKEND: str
        LOCATION: str
        TIMEOUT: int | None
    }
    DjangoCacheSettings : +TIMEOUT = None (added)
```

File-Level Changes

Enhanced archive fetching and caching logic (style_predictor/apis/pgn/utils.py)
  • Introduce current date variables for conditional caching
  • Skip caching for archives matching the current month and year
  • Build a list of unsaved archives to avoid refetching cached data
  • Modify task generation to fetch only missing archives and append results

Normalize username matching to be case-insensitive (style_predictor/tasks.py)
  • Convert split usernames to lowercase
  • Compare white and black player names in lowercase against the normalized set

Generalize PGN lexer movetext detection to any digit (style_predictor/pgn_parser/file_processing/lexer.py)
  • Replace the check for a specific '1' peek with a generic isdigit() call

Set default cache timeout to None (my_chess_style/settings/base.py)
  • Add TIMEOUT: None to the Redis cache backend configuration


@sourcery-ai sourcery-ai bot left a comment


Hey @CalvoM - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `style_predictor/pgn_parser/file_processing/lexer.py:86` </location>
<code_context>
                 case "\n":
                     _ = self.read()
                     self._cr_pos()
-                    if self.peek() == "1":
+                    if self.peek() and self.peek().isdigit():
                         self.lex_movetext()
                         # break
</code_context>

<issue_to_address>
The new condition is more general but may match unintended digits.

If the goal is to support all move numbers, this change is appropriate. Otherwise, consider whether lines starting with a digit could cause incorrect matches.
</issue_to_address>
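One way to address this concern, if tightening is ever needed, is to require a full move-number token (digits followed by a dot) instead of any leading digit. This is a suggestion sketch, not code from the PR; the regex and function name are hypothetical.

```python
import re

# A PGN move number is one or more digits followed by a dot, e.g. "1." or "24."
MOVE_NUMBER = re.compile(r"\d+\.")

def looks_like_movetext(line: str) -> bool:
    """Stricter check than isdigit(): only match a real move-number token
    at the start of the line, so stray digit-led lines are not treated
    as movetext."""
    return bool(MOVE_NUMBER.match(line))
```

Under this check, `"24. Qxd5"` is movetext but a line such as `"2024 was busy"` is not, which is the distinction the reviewer is asking about.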

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@codecov-commenter

codecov-commenter commented Aug 2, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 4.34783% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.75%. Comparing base (6efa186) to head (fbc96a5).

Files with missing lines Patch % Lines
style_predictor/apis/pgn/utils.py 0.00% 19 Missing ⚠️
style_predictor/tasks.py 0.00% 3 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #13      +/-   ##
==========================================
- Coverage   36.19%   35.75%   -0.44%     
==========================================
  Files          42       42              
  Lines        1061     1074      +13     
  Branches      100      101       +1     
==========================================
  Hits          384      384              
- Misses        661      674      +13     
  Partials       16       16              

☔ View full report in Codecov by Sentry.


@CalvoM
Owner Author

CalvoM commented Aug 2, 2025

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey @CalvoM - I've reviewed your changes - here's some feedback:

  • Consider adding a TTL to your cache entries instead of using TIMEOUT=None to prevent unbounded stale data accumulation.
  • Switch to timezone-aware or UTC timestamps when comparing archive dates to avoid mismatches in different server locales.
  • Replace the manual archive_url.split(...) parsing with a more robust URL parsing method (e.g. urllib.parse) to handle unexpected URL formats.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding a TTL to your cache entries instead of using TIMEOUT=None to prevent unbounded stale data accumulation.
- Switch to timezone-aware or UTC timestamps when comparing archive dates to avoid mismatches in different server locales.
- Replace the manual archive_url.split(...) parsing with a more robust URL parsing method (e.g. urllib.parse) to handle unexpected URL formats.

## Individual Comments

### Comment 1
<location> `style_predictor/apis/pgn/utils.py:124` </location>
<code_context>
                 elif resp.status == 200:
-                    return await resp.text()
+                    resp = await resp.text()
+                    archive_year, archive_month = archive_url.split("/")[-2:]
+                    if this_month == int(archive_month) and this_year == int(
+                        archive_year
+                    ):
</code_context>

<issue_to_address>
Add error handling for archive_url parsing.

If archive_url does not have the expected format, this code may raise a ValueError. Add error handling to prevent unhandled exceptions.
</issue_to_address>

### Comment 2
<location> `my_chess_style/settings/base.py:157` </location>
<code_context>
     "default": {
         "BACKEND": "django.core.cache.backends.redis.RedisCache",
         "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
+        "TIMEOUT": None,
     }
 }
</code_context>

<issue_to_address>
Setting cache TIMEOUT to None disables expiration.

Ensure unbounded cache growth is acceptable or implement measures to manage cache size if expiration is disabled.
</issue_to_address>
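One way to act on this comment, assuming the settings fragment shown above, is to keep a finite default TTL and exempt immutable entries per call (Django's `cache.set` accepts an explicit `timeout` argument, where `timeout=None` means never expire). The 30-day value below is illustrative.

```python
import os

# my_chess_style/settings/base.py (sketch; TTL value is an assumption)
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": f"redis://default:{os.getenv('CACHE_PASSWORD')}@{os.getenv('DB_HOST')}:6379/1",
        # A finite default TTL bounds cache growth; immutable archives can
        # still opt out per call with cache.set(key, value, timeout=None).
        "TIMEOUT": 60 * 60 * 24 * 30,  # 30 days, in seconds
    }
}
```

This keeps the "never refetch old archives" behavior for entries explicitly stored with `timeout=None`, while anything cached with the default eventually expires.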

### Comment 3
<location> `style_predictor/apis/pgn/utils.py:106` </location>
<code_context>
 async def fetch_archive(
     archive_url: str, session: aiohttp.ClientSession, semaphore: asyncio.Semaphore
 ):
</code_context>

<issue_to_address>
Consider extracting the cacheability check into a helper function and using comprehensions to simplify cached and uncached archive handling.

Consider pulling the “is it cacheable?” logic out into its own helper, and replace the manual loops in get_chess_dot_com_games with simple comprehensions. For example:

```python
from datetime import datetime

def should_cache_archive(url: str, today: datetime | None = None) -> bool:
    today = today or datetime.now()
    year, month = url.rstrip("/").split("/")[-2:]
    return not (int(year) == today.year and int(month) == today.month)
```

Then in fetch_archive you can shrink the conditional:

```python
async def fetch_archive(archive_url, session, semaphore):
    async with semaphore:
        while True:
            async with session.get(f"{archive_url}/pgn") as resp:
                if resp.status == 429:
                    # ...
                elif resp.status == 200:
                    text = await resp.text()
                    if should_cache_archive(archive_url):
                        cache.set(archive_url, text)
                    return text
                # ...
```

And in get_chess_dot_com_games use one pass to split cached vs uncached:

```python
async def get_chess_dot_com_games(username: str) -> str:
    archives = chessdotcomClient.get_player_game_archives(username).json.get("archives", [])
    # collect cached results and list out uncached URLs
    lookup = {url: cache.get(url) for url in archives}
    cached = [pgn for pgn in lookup.values() if pgn]
    to_fetch = [url for url, pgn in lookup.items() if not pgn]

    conn = aiohttp.TCPConnector(limit=15)
    async with aiohttp.ClientSession(connector=conn) as session:
        fetched = await asyncio.gather(
            *(fetch_archive(url, session, asyncio.Semaphore(15)) for url in to_fetch)
        )

    return "\n\n".join([*cached, *fetched])
```

This keeps all the new caching behavior but isolates the date logic and collapses the manual loops into clear comprehensions.
</issue_to_address>



Comment on lines 125 to 131

```python
if this_month == int(archive_month) and this_year == int(
    archive_year
):
    # Do not cache the latest archive since it will change.
    pass
else:
    cache.set(archive_url, resp)
```

issue (code-quality): Swap if/else to remove empty if body (remove-pass-body)

Comment on lines 155 to 159
saved_res = cache.get(archive)
if not saved_res:
archives_not_saved.append(archive)
else:
all_pgns.append(saved_res)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Since the archives will not change, apart from the latest one,
then caching will really help with performance.
@CalvoM CalvoM force-pushed the improve_pgn_parsing branch from 05fb3af to fbc96a5 on August 3, 2025 at 00:21
@CalvoM CalvoM merged commit 8e99919 into main Aug 3, 2025
2 checks passed