
Conversation


@Vangmay commented Nov 18, 2025

Fixes #14

Enables training directly on raw text files without requiring structured datasets. Adds a RawTextDataLoader class with token-aware chunking, support for multiple formats (.txt, .md, .json, .jsonl, .csv), and CLI integration via a --raw_text_file flag.

Usage: python unsloth-cli.py --raw_text_file book.txt --chunk_size 1024

Test: python tests/test_raw_text.py

@gemini-code-assist

Summary of Changes

Hello @Vangmay, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data input capabilities of the training pipeline by enabling direct training on raw text files. It introduces a robust RawTextDataLoader that intelligently handles various file formats and performs token-aware chunking, along with a TextPreprocessor for data hygiene and validation. This change allows users to leverage unstructured text data more flexibly, streamlining the process of preparing diverse datasets for model training.

Highlights

  • Raw Text Training: Introduces the ability to train directly on raw text files, removing the dependency on structured datasets.
  • RawTextDataLoader: Adds a new class for loading and processing raw text, supporting intelligent token-aware chunking with configurable chunk_size and stride.
  • Multi-format Support: The loader can automatically detect and process .txt, .md, .json, .jsonl, and .csv file formats.
  • CLI Integration: New command-line arguments (--raw_text_file, --chunk_size, --stride, --training_mode) are added to unsloth-cli.py for easy access to this feature.
  • Text Preprocessing & Validation: Includes a TextPreprocessor class with utilities for cleaning text, extracting structured sections, adding special tokens, and validating dataset quality.
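
For illustration, a minimal programmatic sketch of how the loader might be used. The constructor and load_from_file calls follow the diff excerpts quoted in the review below; the tokenizer checkpoint and the chunk_size/stride values here are placeholders, not part of this PR.

from transformers import AutoTokenizer
from unsloth import RawTextDataLoader  # assumes the class is re-exported at the package top level

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")  # any HF tokenizer

loader = RawTextDataLoader(tokenizer, 1024, 128)  # tokenizer, chunk_size, stride (positional, as in the diff)
dataset = loader.load_from_file("book.txt")       # .txt, .md, .json, .jsonl, or .csv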


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for training on raw text files by adding a RawTextDataLoader. The implementation is well-structured, supporting various file formats and integrating with the CLI. However, I've identified a few critical issues that prevent the feature from working as intended. The most significant problem is that the new data loading logic in unsloth-cli.py is defined but never actually called, and the new RawTextDataLoader class is not properly exported, which will lead to an import error. Additionally, there's a new CLI argument that is unused and a notable performance inefficiency in the text chunking process. My review includes detailed feedback and suggestions to address these points.

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import logging
from unsloth import RawTextDataLoader

critical

This import will likely fail because RawTextDataLoader is not exposed in the top-level unsloth package's __init__.py. You need to export the new symbols from unsloth.dataprep and then from unsloth itself. For example, by adding from .raw_text import * to unsloth/dataprep/__init__.py and from .dataprep import * to unsloth/__init__.py.
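
A minimal sketch of the suggested exports, assuming the new loader lives in unsloth/dataprep/raw_text.py as the example implies:

# unsloth/dataprep/__init__.py
from .raw_text import *

# unsloth/__init__.py
from .dataprep import *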

Comment on lines +102 to +115
def load_dataset_smart(args):
    if args.raw_text_file:
        # Use raw text loader
        loader = RawTextDataLoader(tokenizer, args.chunk_size, args.stride)
        dataset = loader.load_from_file(args.raw_text_file)
    elif args.dataset.endswith(('.txt', '.md', '.json', '.jsonl')):
        # Auto-detect local raw text files
        loader = RawTextDataLoader(tokenizer)
        dataset = loader.load_from_file(args.dataset)
    else:
        # Existing HuggingFace dataset logic
        dataset = load_dataset(args.dataset, split="train")
        dataset = dataset.map(formatting_prompts_func, batched=True)
    return dataset

critical

This new load_dataset_smart function is a great way to abstract the dataset loading logic. However, it is defined but never called within the run function. The existing dataset loading logic remains, so this new functionality for raw text files is never triggered. You should replace the existing dataset loading blocks with a single call to dataset = load_dataset_smart(args). You might also want to move the modelscope logic inside this function to keep all data loading logic in one place.
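
For example, the loading blocks in run could collapse to a single call (sketch only; exact placement depends on the surrounding CLI code):

# inside run(), replacing the existing HuggingFace/ModelScope loading blocks
dataset = load_dataset_smart(args)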

chunk_tokens = tokens[start_idx:end_idx]

# Decode back to text
chunk_text = self.tokenizer.decode(chunk_tokens, skip_special_tokens=True)

high

The tokenized chunks are decoded back to text here. The trainer will then have to re-tokenize this text, which creates an inefficient decode-and-re-encode cycle. To improve performance, the data loader should produce tokenized chunks directly (e.g., input_ids, attention_mask) instead of text. This avoids redundant processing, especially for large datasets.
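
A hedged sketch of the suggested change, assuming tokens is a flat list of token IDs and the loader accumulates chunks into a list that later backs the returned dataset:

chunk_tokens = tokens[start_idx:end_idx]

# Keep the token IDs instead of decoding back to text, so the trainer
# does not have to re-tokenize every chunk.
chunks.append({
    "input_ids": chunk_tokens,
    "attention_mask": [1] * len(chunk_tokens),
})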

Comment on lines +432 to +438
parser.add_argument(
    "--training_mode",
    type=str,
    default="instruction",
    choices=list(TRAINING_MODES.keys()),
    help="Training mode for the model"
)

medium

The --training_mode argument is added, but its value (args.training_mode) is never used in the run function. This can be confusing for users who might expect it to change the training behavior. If this argument is not yet used, it might be better to remove it until its functionality is implemented to avoid confusion.

Comment on lines +81 to +83
# First pass: tokenize the entire text to get accurate token counts
tokenized = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
tokens = tokenized["input_ids"]

medium

The current implementation reads and tokenizes the entire file at once. This approach can lead to very high memory consumption for large files (e.g., several gigabytes), potentially causing out-of-memory errors. For better scalability, consider implementing a streaming approach where the file is read and processed in smaller chunks instead of loading everything into memory.
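
One possible streaming variant (illustrative only, not part of this PR): read the file in fixed-size text blocks, tokenize each block as it arrives, and emit a chunk whenever enough tokens have accumulated. Tokenizing at arbitrary block boundaries can split a word across two reads, so a production version would also want a small text overlap between blocks.

def iter_token_chunks(path, tokenizer, chunk_size=1024, read_chars=1_000_000):
    """Yield lists of token IDs without loading the whole file into memory."""
    buffer = []
    with open(path, "r", encoding="utf-8") as f:
        while True:
            block = f.read(read_chars)
            if not block:
                break
            buffer.extend(tokenizer(block, add_special_tokens=False)["input_ids"])
            while len(buffer) >= chunk_size:
                yield buffer[:chunk_size]
                buffer = buffer[chunk_size:]
    if buffer:
        yield buffer  # final partial chunk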

Comment on lines +127 to +134
def tokenize_and_chunk(self, text):
    """
    Tokenize first, then chunk by token count:
    1. More precise length control
    2. Avoids mid-token splits
    3. Handles different languages better
    """


medium

This tokenize_and_chunk method is defined but has no implementation, and its docstring describes a different chunking strategy. This is confusing for anyone reading the code. If this method is not intended for use, it should be removed to improve code clarity.

@danielhanchen

@Vangmay Thanks for the PR and appreciate it! Would it be possible for you to address some of Gemini's comments? Also @djsaunde, could you see if this impacts your CLI changes as well?
