
Conversation


@Vangmay commented Nov 18, 2025

Fixes #14

Enables training directly on raw text files without requiring structured datasets. Adds a RawTextDataLoader class with token-aware chunking, support for multiple formats (.txt, .md, .json, .jsonl, .csv), and CLI integration via a --raw_text_file flag.

Usage: python unsloth-cli.py --raw_text_file book.txt --chunk_size 1024

Test: python tests/test_raw_text.py

@gemini-code-assist

Summary of Changes

Hello @Vangmay, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data input capabilities of the training pipeline by enabling direct training on raw text files. It introduces a robust RawTextDataLoader that intelligently handles various file formats and performs token-aware chunking, along with a TextPreprocessor for data hygiene and validation. This change allows users to leverage unstructured text data more flexibly, streamlining the process of preparing diverse datasets for model training.

Highlights

  • Raw Text Training: Introduces the ability to train directly on raw text files, removing the dependency on structured datasets.
  • RawTextDataLoader: Adds a new class for loading and processing raw text, supporting intelligent token-aware chunking with configurable chunk_size and stride.
  • Multi-format Support: The loader can automatically detect and process .txt, .md, .json, .jsonl, and .csv file formats.
  • CLI Integration: New command-line arguments (--raw_text_file, --chunk_size, --stride, --training_mode) are added to unsloth-cli.py for easy access to this feature.
  • Text Preprocessing & Validation: Includes a TextPreprocessor class with utilities for cleaning text, extracting structured sections, adding special tokens, and validating dataset quality.
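
For illustration, a minimal programmatic sketch of how the loader might be used. The constructor and load_from_file calls follow the diff excerpts quoted in the review below; the tokenizer checkpoint and the chunk_size/stride values here are placeholders, not part of this PR.

from transformers import AutoTokenizer
from unsloth import RawTextDataLoader  # assumes the class is re-exported at the package top level

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")  # any HF tokenizer

loader = RawTextDataLoader(tokenizer, 1024, 128)  # tokenizer, chunk_size, stride (positional, as in the diff)
dataset = loader.load_from_file("book.txt")       # .txt, .md, .json, .jsonl, or .csv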


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for training on raw text files by adding a RawTextDataLoader. The implementation is well-structured, supporting various file formats and integrating with the CLI. However, I've identified a few critical issues that prevent the feature from working as intended. The most significant problem is that the new data loading logic in unsloth-cli.py is defined but never actually called, and the new RawTextDataLoader class is not properly exported, which will lead to an import error. Additionally, there's a new CLI argument that is unused and a notable performance inefficiency in the text chunking process. My review includes detailed feedback and suggestions to address these points.

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import logging
from unsloth import RawTextDataLoader

critical

This import will likely fail because RawTextDataLoader is not exposed in the top-level unsloth package's __init__.py. You need to export the new symbols from unsloth.dataprep and then from unsloth itself. For example, by adding from .raw_text import * to unsloth/dataprep/__init__.py and from .dataprep import * to unsloth/__init__.py.
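
A minimal sketch of the suggested exports, assuming the new loader lives in unsloth/dataprep/raw_text.py as the example implies:

# unsloth/dataprep/__init__.py
from .raw_text import *

# unsloth/__init__.py
from .dataprep import *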

Comment on lines +102 to +115
def load_dataset_smart(args):
    if args.raw_text_file:
        # Use raw text loader
        loader = RawTextDataLoader(tokenizer, args.chunk_size, args.stride)
        dataset = loader.load_from_file(args.raw_text_file)
    elif args.dataset.endswith(('.txt', '.md', '.json', '.jsonl')):
        # Auto-detect local raw text files
        loader = RawTextDataLoader(tokenizer)
        dataset = loader.load_from_file(args.dataset)
    else:
        # Existing HuggingFace dataset logic
        dataset = load_dataset(args.dataset, split="train")
        dataset = dataset.map(formatting_prompts_func, batched=True)
    return dataset

critical

This new load_dataset_smart function is a great way to abstract the dataset loading logic. However, it is defined but never called within the run function. The existing dataset loading logic remains, so this new functionality for raw text files is never triggered. You should replace the existing dataset loading blocks with a single call to dataset = load_dataset_smart(args). You might also want to move the modelscope logic inside this function to keep all data loading logic in one place.
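
For example, the loading blocks in run could collapse to a single call (sketch only; exact placement depends on the surrounding CLI code):

# inside run(), replacing the existing HuggingFace/ModelScope loading blocks
dataset = load_dataset_smart(args)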

chunk_tokens = tokens[start_idx:end_idx]

# Decode back to text
chunk_text = self.tokenizer.decode(chunk_tokens, skip_special_tokens=True)

high

The tokenized chunks are decoded back to text here. The trainer will then have to re-tokenize this text, which creates an inefficient decode-and-re-encode cycle. To improve performance, the data loader should produce tokenized chunks directly (e.g., input_ids, attention_mask) instead of text. This avoids redundant processing, especially for large datasets.
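
A hedged sketch of the suggested change, assuming tokens is a flat list of token IDs and the loader accumulates chunks into a list that later backs the returned dataset:

chunk_tokens = tokens[start_idx:end_idx]

# Keep the token IDs instead of decoding back to text, so the trainer
# does not have to re-tokenize every chunk.
chunks.append({
    "input_ids": chunk_tokens,
    "attention_mask": [1] * len(chunk_tokens),
})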

Comment on lines +432 to +438
parser.add_argument(
    "--training_mode",
    type=str,
    default="instruction",
    choices=list(TRAINING_MODES.keys()),
    help="Training mode for the model"
)

medium

The --training_mode argument is added, but its value (args.training_mode) is never used in the run function. This can be confusing for users who might expect it to change the training behavior. If this argument is not yet used, it might be better to remove it until its functionality is implemented to avoid confusion.

Comment on lines +81 to +83
# First pass: tokenize the entire text to get accurate token counts
tokenized = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
tokens = tokenized["input_ids"]

medium

The current implementation reads and tokenizes the entire file at once. This approach can lead to very high memory consumption for large files (e.g., several gigabytes), potentially causing out-of-memory errors. For better scalability, consider implementing a streaming approach where the file is read and processed in smaller chunks instead of loading everything into memory.
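
One possible streaming variant (illustrative only, not part of this PR): read the file in fixed-size text blocks, tokenize each block as it arrives, and emit a chunk whenever enough tokens have accumulated. Tokenizing at arbitrary block boundaries can split a word across two reads, so a production version would also want a small text overlap between blocks.

def iter_token_chunks(path, tokenizer, chunk_size=1024, read_chars=1_000_000):
    """Yield lists of token IDs without loading the whole file into memory."""
    buffer = []
    with open(path, "r", encoding="utf-8") as f:
        while True:
            block = f.read(read_chars)
            if not block:
                break
            buffer.extend(tokenizer(block, add_special_tokens=False)["input_ids"])
            while len(buffer) >= chunk_size:
                yield buffer[:chunk_size]
                buffer = buffer[chunk_size:]
    if buffer:
        yield buffer  # final partial chunk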

Comment on lines +127 to +134
def tokenize_and_chunk(self, text):
    """
    Tokenize first, then chunk by token count:
    1. More precise length control
    2. Avoids mid-token splits
    3. Handles different languages better
    """


medium

This tokenize_and_chunk method is defined but has no implementation, and its docstring describes a different chunking strategy. This is confusing for anyone reading the code. If this method is not intended for use, it should be removed to improve code clarity.

@danielhanchen

@Vangmay Thanks for the PR and appreciate it! Would it be possible for you to address some of Gemini's comments? Also @djsaunde, could you see if this impacts your CLI changes as well?
