Fix excessive tokens used with data format #1463

ahuang11 · 2025-10-23T23:53:12Z

Serializing a dataframe to YAML caused excessive token bloat:

Data Overview:
- input_tokens: 80
  frequency: 1
- input_tokens: 231
  frequency: 1
- input_tokens: 1207
  frequency: 1
- input_tokens: 1324
  frequency: 1
- input_tokens: 1353
  frequency: 1
- input_tokens: 1452
  frequency: 1
- input_tokens: 1539
  frequency: 1
- input_tokens: 1687
  frequency: 1
- input_tokens: 1745
  frequency: 1
- input_tokens: 1895
  frequency: 1
- input_tokens: 1993
  frequency: 1
- input_tokens: 2066
  frequency: 1
- input_tokens: 2115
  frequency: 1
- input_tokens: 2155
  frequency: 1
- input_tokens: 2302
  frequency: 1
- input_tokens: 2316
  frequency: 1
- input_tokens: 2541
  frequency: 1
- input_tokens: 2553
  frequency: 1
- input_tokens: 3342
  frequency: 1
- input_tokens: 3534
  frequency: 1
- input_tokens: 5086
  frequency: 1

Then, sometimes I get:


    return yaml.dump(df.to_dict('index'), default_flow_style=False, allow_unicode=True, sort_keys=False)
                     ^^^^^^^^^^^^^^^^^^^
  File "/Users/ahuang/miniconda3/envs/lumen/lib/python3.12/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahuang/miniconda3/envs/lumen/lib/python3.12/site-packages/pandas/core/frame.py", line 2183, in to_dict
    return to_dict(self, orient, into=into, index=index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahuang/miniconda3/envs/lumen/lib/python3.12/site-packages/pandas/core/methods/to_dict.py", line 242, in to_dict
    raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.

I think `to_markdown` might be the best option.
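To illustrate why a table format should be cheaper, here is a minimal stdlib-only sketch. The helpers `to_yaml_like` and `to_markdown_like` are hypothetical stand-ins that only mimic the shape of `yaml.dump` and `DataFrame.to_markdown` output; they are not the code in `lumen/ai/utils.py`, and character count is used as a rough proxy for token count. The last lines show why a mapping keyed by index cannot represent duplicate index values, which is presumably why pandas rejects `orient='index'` in that case.

```python
# Hypothetical helpers for illustration only; real code would call
# yaml.dump(...) and DataFrame.to_markdown(...) instead.

def to_yaml_like(records):
    """Render each record as a YAML list item, repeating every key per row."""
    lines = []
    for record in records:
        for i, (key, value) in enumerate(record.items()):
            prefix = "- " if i == 0 else "  "
            lines.append(f"{prefix}{key}: {value}")
    return "\n".join(lines)


def to_markdown_like(records):
    """Render records as a pipe table, stating each column name only once."""
    columns = list(records[0])
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = ["| " + " | ".join(str(r[c]) for c in columns) + " |" for r in records]
    return "\n".join([header, divider, *rows])


# A few rows taken from the overview above.
records = [
    {"input_tokens": 80, "frequency": 1},
    {"input_tokens": 231, "frequency": 1},
    {"input_tokens": 1207, "frequency": 1},
    {"input_tokens": 1324, "frequency": 1},
]

yaml_text = to_yaml_like(records)
markdown_text = to_markdown_like(records)

# Character counts as a rough proxy for tokens: the table avoids
# repeating "input_tokens" and "frequency" on every row.
print(len(yaml_text), len(markdown_text))

# A dict keyed by index keeps only one row per key, so duplicate
# index values would silently drop data if they were allowed.
duplicate_index = [0, 0]
by_index = dict(zip(duplicate_index, records[:2]))
print(len(by_index))
```

The saving grows with the number of rows, since the YAML-style output repeats every column name per row while the table pays for the header once.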

codecov · 2025-10-23T23:57:23Z

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 47.91%. Comparing base (77e4af2) to head (55307db).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lumen/ai/utils.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1463      +/-   ##
==========================================
+ Coverage   47.90%   47.91%   +0.01%     
==========================================
  Files         122      122              
  Lines       20799    20791       -8     
==========================================
  Hits         9963     9963              
+ Misses      10836    10828       -8


philippjfr · 2025-10-24T12:31:39Z

pyproject.toml

 ai = [
    'griffe', 'nbformat', 'duckdb >= 1.2.0', 'pyarrow', 'instructor >=1.6.4', 'pydantic >=2.8.0', 'pydantic-extra-types', 'panel-graphic-walker[kernel] >=0.6.4',
-    'markitdown', 'semchunk', 'tiktoken', 'chardet', "panel-material-ui >=0.4.0"
+    'markitdown', 'semchunk', 'tiktoken', 'chardet', "panel-material-ui >=0.4.0", "tabulate"


I don't see tabulate used anywhere.

ahuang11 added 3 commits October 23, 2025 16:24

fix_data_format (38b0f8e)
remove unused (263f5ce)
add tabulate (55307db)

ahuang11 requested a review from philippjfr October 23, 2025 23:55

philippjfr reviewed Oct 24, 2025

philippjfr approved these changes Oct 27, 2025

philippjfr merged commit b35cab8 into main Oct 27, 2025
12 checks passed

philippjfr deleted the fix_data_display branch October 27, 2025 16:14
