This pull request introduces a comprehensive set of updates and
improvements to the RAFT project, enhancing robustness, logging,
progress monitoring, checkpointing, multi-threading, Llama support,
Azure authentication, and evaluation processes.
**Note**: These updates were developed largely to prepare the MS Build 2024
talk [Practicalities of Fine-Tuning Llama 2 with AI
Studio](https://aka.ms/build24-ft-practical) with @ShishirPatil and Bala
Venkataraman.
Key updates include:
### RAFT Script Improvements:
This PR introduces significant updates to the `raft.py` script,
expanding its functionality, improving its configurability, and removing
deprecated options. Below is a summary of the key changes:
- **Logging Enhancements:** Improved logging configuration, including
more granular logging for various operations.
- **Checkpointing Overhaul:** Significant refactoring of the checkpointing
logic in `raft.py`, including the introduction of multi-threading,
better directory handling, and optimized chunk processing. The
`--fast` mode, which disabled checkpointing, was removed in favor of
a more efficient implementation that keeps checkpointing enabled at all
times (see the example after this list).
- **Multi-Worker Support:** Added a `--workers` parameter to enable
parallel processing, improving efficiency and reliability during various
operations.
- **Llama Instruction Support:** Added support for Llama instructions in
addition to GPT instructions, enhancing the versatility of the script
for different model types.
- **Dataset Processing:** Added more robust handling and filtering of
datasets, including support for customized field names, empty row
filtering, and threshold-based early stopping.
- **Authentication Updates:** Added support for Azure OpenAI Keyless and
Managed Identity authentication, along with related environment variable
handling.
- **Content Safety Handling:** Updated the content generation process to
skip chunks that fail content safety compliance checks, allowing the
process to continue without interruption.
- **Progress Logging Enhancements:** Improved progress logging with
`tqdm`, including enhanced stats support in `client_utils.py`, providing
better insights into the process flow.
- **Bug Fixes and Cleanup:** Fixed various bugs across the project,
cleaned up help messages, and removed outdated or redundant components.
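Because checkpointing is now always active, an interrupted generation run can simply be relaunched with the same arguments and it will pick up from the saved checkpoints. A minimal sketch, assuming the script is invoked directly with `python raft.py` (the paths, doctype, and worker count are illustrative):

```
# Generate a dataset using 4 worker threads; checkpoints are written automatically.
python raft.py \
  --datapath docs/guide.pdf \
  --doctype pdf \
  --output ./output/raft-dataset \
  --workers 4

# If the run is interrupted, re-running the exact same command resumes from the
# saved checkpoints instead of regenerating everything from scratch.
```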
#### New Features and Options
1. **Output Format Expansion:**
- Added a new output format option: `eval`. This format is intended for
evaluation purposes, providing an additional way to format datasets.
2. **Enhanced Output Configuration:**
- Introduced `--output-completion-prompt-column` and
`--output-completion-completion-column` options to allow users to
specify custom column names for prompts and completions when using the
`completion` format.
3. **System Prompt Customization:**
- Added the `--system-prompt-key` option to allow users to select
between different system prompt keys (`gpt` or `llama`) based on the
model they intend to use for dataset generation.
4. **Worker Thread Management:**
- Introduced the `--workers` option to allow parallel processing by
specifying the number of worker threads, improving the script’s
efficiency in handling large datasets.
5. **Checkpoint Management:**
- Added the `--auto-clean-checkpoints` option, giving users the ability
to automatically clean up checkpoints after dataset generation, reducing
the need for manual intervention.
6. **Question/Answer Sample Threshold:**
- Introduced the `--qa-threshold` option, which allows users to specify
a threshold for the number of Question/Answer samples to generate before
stopping. This provides more control over the dataset generation
process, particularly in large-scale operations.
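As an illustration, several of the new options can be combined in a single run. The values below are placeholders, and the exact value syntax for `--auto-clean-checkpoints` is assumed rather than taken from the script:

```
# Completion-format dataset from a folder of PDF documents, with custom column
# names, Llama-style system prompts, 8 worker threads, early stopping after
# 2000 Q/A samples, and automatic checkpoint cleanup once generation finishes.
python raft.py \
  --datapath docs/ \
  --doctype pdf \
  --output ./output/llama-completion-dataset \
  --output-format completion \
  --output-completion-prompt-column instruction \
  --output-completion-completion-column response \
  --system-prompt-key llama \
  --workers 8 \
  --qa-threshold 2000 \
  --auto-clean-checkpoints true
```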
#### Removed Options
1. **`--fast`:**
- The `--fast` option has been removed. It was previously used to run the
script in a fast mode without checkpoint recovery. The script has since been
optimized to perform well with checkpointing always enabled, rendering a
separate fast mode obsolete.
#### Default Value Updates
- Several options now have default values set, including
`--output-type`, `--output-format`, `--doctype`, `--embedding-model`,
`--completion-model`, `--workers`, and more. These defaults aim to make
the script more user-friendly by reducing the need for extensive
configuration.
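A minimal sketch of what this enables, assuming `--datapath` and `--output` are the only arguments that still need to be provided explicitly (credentials are typically supplied separately, e.g. through environment variables):

```
# Relies on the defaults for output type/format, doctype, models, and worker count.
python raft.py --datapath docs/guide.pdf --output ./output/raft-dataset
```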
---
### Evaluation Script Improvements:
- **Stop Keyword:** Added a stop keyword functionality to allow
controlled early termination of evaluation processes when specific
conditions are met.
- **Retry Mechanism:** Introduced a retry mechanism for failed tasks,
improving reliability during evaluations.
- **Improved Robustness:** Enhanced the script’s robustness,
particularly in handling errors and edge cases, ensuring a smoother
evaluation process.
- **Logging Retry Statistics:** Implemented logging for retry attempts,
providing detailed insights and transparency into the evaluation
process.
- **Main Thread Exception Handling:** Fixed an issue where exceptions in
the main thread could cause silent failures, ensuring that all errors
are properly reported and handled.
- **Support for Chat and Completion Models:** Extended the script to
support both chat and completion models, increasing its versatility
across different use cases.
- **Environment Prefix Handling:** Enabled the script to accept an
environment prefix as a parameter, enhancing its adaptability to
different deployment environments.
- **Progress Monitoring:** Integrated progress monitoring with `tqdm`,
allowing for real-time tracking of the evaluation process.
- **Configurable Workers:** Made the number of workers configurable
using the `--workers` option, allowing for fine-tuned parallel
processing during evaluations.
#### Enhanced CLI Options for `eval.py`
This PR introduces several new command-line options to the `eval.py`
script, providing enhanced functionality and flexibility for model
evaluation. The following changes have been made:
- **`--model MODEL`**: Added support for specifying the model to be
evaluated.
- **`--mode MODE`**: Introduced a new option to select the API mode,
either 'chat' or 'completion'. The default mode is set to 'chat'.
- **`--input-prompt-key INPUT_PROMPT_KEY`**: Added the ability to define
which column in the dataset should be used as the input prompt.
- **`--output-answer-key OUTPUT_ANSWER_KEY`**: Added the ability to
define which column in the dataset should be used as the output answer.
- **`--workers WORKERS`**: Introduced multi-threading support, allowing
users to specify the number of worker threads for evaluating the
dataset, improving processing efficiency.
- **`--env-prefix ENV_PREFIX`**: Added an option to customize the prefix
for environment variables used for API keys and base URLs. The default
prefix is set to `EVAL`.
These enhancements provide greater control over the evaluation process,
allowing for more customized and efficient use of the `eval.py` script.
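For example (the model name and column names are placeholders, and arguments unrelated to this PR, such as the input dataset path, are omitted):

```
# Evaluate a completion-style deployment with 8 worker threads, reading prompts
# from the "instruction" column and writing answers to the "model_answer" column.
# API credentials are resolved from environment variables using the EVAL prefix
# unless --env-prefix overrides it.
python eval.py \
  --model my-finetuned-llama \
  --mode completion \
  --input-prompt-key instruction \
  --output-answer-key model_answer \
  --workers 8 \
  --env-prefix EVAL
```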
## Testing
```
pytest
```
### README Updates:
The arguments documentation has been updated to reflect the new and removed options:

- `--datapath` - if a file, the path at which the document is located. If a folder, the path at which to load all documents
- `--output` - the path at which to save the dataset
- `--output-format` - the format of the output dataset. Defaults to `hf` for HuggingFace. Can be one of `hf`, `completion`, `chat`, `eval`.
- `--output-type` - the type of the output dataset file. Defaults to `jsonl`. Can be one of `jsonl`, `parquet`.
- `--output-chat-system-prompt` - The system prompt to use when the output format is `chat`. Optional.
- `--output-completion-prompt-column` - The column (json field name) for the `prompt` / `instruction` when using the `completion` output format. Defaults to `prompt`.
- `--output-completion-completion-column` - The column (json field name) for the `completion` when using the `completion` output format. Defaults to `completion`.
- `--distractors` - the number of distractor documents to include per data point / triplet
- `--doctype` - the type of the document, must be one of the accepted doctypes
  - currently accepted doctypes: `pdf`, `txt`, `json`, `api`
- `--openai_key` - your OpenAI key used to make queries to GPT-3.5 or GPT-4
- `--embedding-model` - The embedding model to use to encode document chunks. Defaults to `text-embedding-ada-002`.
- `--completion-model` - The model to use to generate questions and answers. Defaults to `gpt-4`.
- `--system-prompt-key` - The system prompt key to use to generate the dataset. Defaults to `gpt`. Can be one of `gpt`, `llama`.
- `--workers` - The number of worker threads to use to generate the dataset. Defaults to 2.
- `--auto-clean-checkpoints` - Whether to auto clean the checkpoints after the dataset is generated. Defaults to `false`.

*Note*: The `--fast` mode flag has been removed; checkpointing is now always active.
The embedded help output also documents the completion-format options, including the completion column name (default: `completion`) and the new `--output-completion-stop` option:

```
  --output-completion-stop OUTPUT_COMPLETION_STOP
                        The stop keyword to use for the completion format (default: <STOP>)
```
**Note**: If fine tuning a chat model, then you need to use `--output-format chat` and optionally add the `--output-chat-system-prompt` parameter to configure the system prompt included in the dataset.
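For example, a chat-format run might look like this (the path and system prompt are placeholders):

```
# Produce a chat-format dataset and embed a custom system prompt in it.
python raft.py \
  --datapath docs/guide.pdf \
  --doctype pdf \
  --output ./output/chat-dataset \
  --output-format chat \
  --output-chat-system-prompt "You are a helpful assistant that answers questions about the provided documents."
```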