Commit a216a3c

chore(readme): update instructions (#93)
1 parent 59b3c65 commit a216a3c

2 files changed: +17 -37 lines changed

Makefile

Lines changed: 0 additions & 14 deletions
This file was deleted.

README.md

Lines changed: 17 additions & 23 deletions
@@ -3,53 +3,47 @@
 
 # Transformer Reinforcement Learning X
 
-`trlx` allows you to fine-tune 🤗 Hugging Face supported language models (`gpt2`, `gpt-j`, `gpt-neo` and `gpt-neox` based) up to 20B parameters using reinforcement learning via either a provided reward function or reward-labeled dataset. Proximal Policy Optimization ([PPO](https://arxiv.org/pdf/1909.08593.pdf)) and Implicit Language Q-Learning ([ILQL](https://sea-snell.github.io/ILQL_site/)) are implemented.
+TRLX allows you to fine-tune 🤗 Hugging Face supported language models (`gpt2`, `gpt-j`, `gpt-neo` and `gpt-neox` based) up to 20B parameters using reinforcement learning via either a provided reward function or reward-labeled dataset. Proximal Policy Optimization ([PPO](https://arxiv.org/pdf/1909.08593.pdf)) and Implicit Language Q-Learning ([ILQL](https://sea-snell.github.io/ILQL_site/)) are implemented.
 
-You can read more about trlX in our [documentation](https://trlX.readthedocs.io).
+You can read more about TRLX in our [documentation](https://trlX.readthedocs.io).
 
 ## Installation
-### From Source
 ```bash
 git clone https://github.com/CarperAI/trlx.git
 cd trlx
-pip install torch --extra-index-url https://download.pytorch.org/whl/cu113 # for cuda
+pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 # for cuda
 pip install -e .
 ```
 
 ## How to Train
-You can train your model using a reward function or a reward-labeled dataset.
+You can train a model using a reward function or a reward-labeled dataset.
 
-### Using a reward function
+#### Using a reward function
 ```python
-import trlx
-
-# optimize some reward function
 model = trlx.train('gpt2', reward_fn=lambda samples: [sample.count('cats') for sample in samples])
-
-# model is a wrapper with some logit preprocessing
-model.generate(**tokenizer('Q: Who rules the world? A:', return_tensors='pt'), do_sample=True)
 ```
-
-### Using a reward-labeled dataset
-
+#### Using a reward-labeled dataset
 ```python
-import trlx
-
-# Steer a model with a collection of rated samples
 model = trlx.train('EleutherAI/gpt-j-6B', dataset=[('dolphins', 'geese'), (1.0, 100.0)])
+```
 
-# model is a wrapper with some logit preprocessing
+#### Trained model is a wrapper over a given autoregressive model
+```python
 model.generate(**tokenizer('Q: Who rules the world? A:', return_tensors='pt'), do_sample=True)
 ```
 
-### Using 🤗 Accelerate to speed up the training
-Launch distributed training with 🤗 Accelerate (only DeepSpeed integration is tested)
+#### Use 🤗 Accelerate to launch distributed training
 
 ```bash
-accelerate config
+accelerate config # choose DeepSpeed option
 accelerate launch examples/simulacra.py
 ```
 
+#### Use Ray Tune to launch hyperparameter sweep
+```bash
+python train_sweep.py --config configs/ray_tune_configs/ppo_config.yml --example-name ppo_sentiments
+```
+
 For more usage see [examples](./examples)
 
 ## Contributing
@@ -59,4 +53,4 @@ and also read our [docs](https://trlX.readthedocs.io)
 
 ## Acknowledgements
 
-Thanks Leandro for starting the original [trl](https://github.com/lvwerra/trl/)
+Many thanks to Leandro von Werra for hacking on the [trl](https://github.com/lvwerra/trl/)
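For reference, the reward-function snippet in the updated README no longer carries its `import trlx` line or tokenizer setup (both are removed in this diff). A minimal self-contained sketch of that path, assembled from the snippets above, could look like the following; the `AutoTokenizer` setup is an illustrative assumption rather than something the README or trlx prescribes.

```python
# Minimal sketch of the reward-function workflow described in the README diff above.
# Assumes trlx has been installed per the Installation section; the tokenizer setup
# is an illustrative assumption, not part of the trlx API as documented here.
from transformers import AutoTokenizer

import trlx

# Toy reward from the README: score each sampled string by how often it mentions 'cats'.
model = trlx.train(
    'gpt2',
    reward_fn=lambda samples: [sample.count('cats') for sample in samples],
)

# The trained model wraps the underlying autoregressive model, so generation is
# invoked exactly as the README shows it.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model.generate(**tokenizer('Q: Who rules the world? A:', return_tensors='pt'), do_sample=True)
```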

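Likewise, a sketch of the offline, reward-labeled (ILQL) path: the dataset literal is copied from the README example above, and reading its two entries as samples and reward labels is an interpretation made here for clarity, not a documented signature.

```python
# Sketch of the reward-labeled (ILQL) workflow from the README diff above. The dataset
# value mirrors the README example; naming its entries as samples and reward labels is
# an assumption made here for readability.
import trlx

rated_samples = ('dolphins', 'geese')  # candidate texts to steer the model with
reward_labels = (1.0, 100.0)           # 'geese' is rated far higher than 'dolphins'

model = trlx.train('EleutherAI/gpt-j-6B', dataset=[rated_samples, reward_labels])
```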