
Conversation

maxreciprocate
Collaborator

WIP on #7 #15

@LouisCastricato
Contributor

Make sure that you update your code to follow @shahbuland's method of documentation. We also need to update the Read the Docs documentation after this merge.

V = vs[:, 1:].squeeze() * terminal_mask
Q_ = rewards + self.gamma * V

if self.two_qs:
Contributor

We still need comments explaining what this is

Collaborator

I agree more comments would be useful
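For reference, a commented version of this block might look roughly like the following (a sketch only; what happens inside the two_qs branch is an assumption in the spirit of clipped double Q-learning, not taken from the diff):

    V = vs[:, 1:].squeeze() * terminal_mask  # next-state value estimate, zeroed at terminal steps
    Q_ = rewards + self.gamma * V            # one-step TD target for the Q-head(s)

    if self.two_qs:
        # With two Q-heads, the targets/losses would typically use both estimates
        # (e.g. their element-wise minimum) to reduce overestimation bias.
        ...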

from trlx.pipeline.offline_pipeline import OfflinePipeline


def train(
Contributor

This seems to exclusively assume offline...? No?

Collaborator Author

soon I'll add online as well

)

model.learn()
trlx.train(walks, lengths, eval_prompts=eval_prompts, metric_fn=metric_fn, config=config, logit_mask=logit_mask)
Collaborator

Should trlx.train return the model?

Collaborator Author

yes, that's the plan
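A sketch of the usage this would enable, assuming trlx.train returns the trained model as planned (the final save call and its path are illustrative, not part of the diff):

    # Hypothetical usage once trlx.train returns the trained model:
    model = trlx.train(
        walks,
        lengths,
        eval_prompts=eval_prompts,
        metric_fn=metric_fn,
        config=config,
        logit_mask=logit_mask,
    )
    model.save("checkpoints/randomwalks")  # illustrative follow-up use of the returned model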



if __name__ == "__main__":
walks, logit_mask, metric_fn = generate_random_walks(seed=1000)
Collaborator

I'd like to somehow move the code for generating graph data outside the run file. Perhaps this belongs in some pipeline?

Collaborator Author

For usage examples, I think it's better to go without any subclasses.

eval_dataloader = self.eval_pipeline.create_loader(
    self.config.train.batch_size, shuffle=False
)
train_dataloader = self.train_store.create_loader(self.config.train.batch_size)
Collaborator

Where is train_store defined?

Collaborator Author

I haven't decided on a proper way yet, so for now they are defined dynamically.
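A minimal sketch of the "defined dynamically" pattern being described, using attribute names that appear elsewhere in this diff (the OfflineRolloutStorage constructor arguments are elided since they are not shown here):

    # Attributes are attached to the model from the run script rather than in __init__:
    model.train_store = OfflineRolloutStorage(...)  # constructor args elided
    model.eval_pipeline = OfflinePipeline(model.tokenizer, eval_prompts)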

for prompts in eval_dataloader:
    with torch.no_grad():
        samples, _ = self.model.sample(
for beta in self.config.method.betas:
Collaborator

Why do we have multiple betas?

Collaborator Author

To compare against plain finetuning (beta=0).
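A small, self-contained illustration of why beta = 0 serves as the finetune baseline (this mirrors the ILQL idea of shifting the language-model logits by beta times the advantage; it is not the trlx implementation, and the tensors are dummy data):

    import torch

    torch.manual_seed(0)
    logits = torch.randn(1, 50257)     # LM logits for one decoding step (dummy values)
    advantage = torch.randn(1, 50257)  # Q(s, a) - V(s) from the value heads (dummy values)

    for beta in [0.0, 1.0, 4.0]:       # as with config.method.betas; beta = 0 is the finetune baseline
        adjusted = logits + beta * advantage
        probs = torch.softmax(adjusted, dim=-1)
        print(beta, probs.argmax().item())  # the preferred token shifts as beta grows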

)

self.model.train()
generate_time = time() - generate_time
Collaborator

Do you think we can try to use the Clock object @shahbuland made?

Collaborator Author

it doesn't support granular measurements like these

self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)
self.store.clear_history()
Collaborator

It turns out that when passing a dataloader into DeepSpeed, it is required to be non-empty. I had a hack that loaded a dummy prompt and then cleared it once things were loaded. Is this still in the code? I cannot seem to find it.

Collaborator Author

Are you talking about the clear_history call?
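In pseudocode, the workaround being described would look roughly like this (a sketch based on the comment above and the excerpt; push() and the dummy element are illustrative names, not necessarily what is in the code):

    # DeepSpeed requires the dataloader passed to accelerator.prepare to be non-empty,
    # so push a single dummy element first, prepare, then clear it out.
    self.store.push(dummy_element)
    rollout_loader = self.store.create_loader(self.config.train.batch_size)
    self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
        self.model, self.opt, self.scheduler, rollout_loader
    )
    self.store.clear_history()  # drop the dummy element once everything is wrapped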

self.scheduler.step()
self.iter_count += 1

if self.iter_count % self.config.train.checkpoint_interval == 0:
Collaborator

Probably this can be put in the accelerate base model? My hope is that all models inheriting from AccelerateRLModel can use the default training loop, with any changes made via the post_batch and post_epoch callback functions.

Collaborator Author

agree
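A structural sketch of the callback idea described above (post_epoch_callback appears elsewhere in this diff; the loop body and the placement of post_batch_callback here are assumptions):

    class AccelerateRLModel:
        def learn(self):
            for epoch in range(self.config.train.epochs):
                for batch in self.train_dataloader:
                    loss = self.loss(batch)
                    # ... backward pass, optimizer/scheduler step, logging, checkpointing ...
                    self.post_batch_callback()
                self.post_epoch_callback()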

* terminal_mask
).sum() / n_nonterminal

loss = loss_q + loss_v + self.cql_scale * loss_cql + self.awac_scale * loss_awac
Collaborator

Do you think it makes more sense to put the loss in the nn.Module classes or the accelerator trainer classes? I thought we would want the loss defined in the accelerator classes, but I'm open to something different if you have a strong opinion.

Collaborator Author

I don't have a strong opinion, but forward and generate are also specific here and would have to be decoupled, and I'm not sure there is a reason for that just yet.
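For reference, an annotated version of the total loss might read as follows (a sketch; the per-term descriptions are assumptions based on standard ILQL/CQL/AWAC formulations rather than comments taken from the diff):

    loss = (
        loss_q                         # TD error of the Q-head(s) against rewards + gamma * V
        + loss_v                       # value-head loss toward the Q targets
        + self.cql_scale * loss_cql    # conservative (CQL) regularizer on out-of-data actions
        + self.awac_scale * loss_awac  # advantage-weighted behavior-cloning term on the LM logits
    )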

return DataLoader(self, batch_size=batch_size, collate_fn=collate_fn)


class OfflineRolloutStorage(BaseRolloutStore):
Collaborator

Perhaps it makes more sense to construct and load our reward-labeled datasets in the rollout storage init? I am unsure, but I do think the datasets should be separated from the run scripts.

)
model.eval_pipeline = OfflinePipeline(model.tokenizer, eval_prompts)

model.learn()
Collaborator

I think we should return the model

Collaborator

I like these changes! Having one simple trlx trainer is a good direction.

The README will need to be updated when this PR is finished.

Before we commit to master, make sure we've tested both the PPO and ILQL pipelines on the sentiment task.

@LouisCastricato
Contributor

@dmarx do you mind checking out the architecture choices of this PR?

* Had to add py_modules=trlx to setup.

* Added a save strategy.

* Cleaned up a few things.

* Added save_steps to ilql_config.yaml and a save-steps strategy to accelerate_ilql_model.py for consistency. The save_steps parameter must be set now because of how TrainConfig.from_dict operates. If no save_steps parameter is given in the configs, it throws an error.

* Adding minimal changes to enable a step-based save strategy in configs/ppo_config.yml, trlx/data/configs.py, and trlx/model_accelerate_ppo_model.py

* Some problems crept in despite merge check. This fixes them.

* Realized I am merging into stage-api, not main, so fixed an issue with ilql_config.yml

count -= 1
nrooms.append(count)

return {'nrooms': nrooms}
Collaborator

this could be { 'nrooms': [-sample.count(':') for sample in samples] }

@cat-state
Collaborator

cat-state commented Oct 18, 2022

Could you make this black formatted?

@cat-state cat-state mentioned this pull request Oct 18, 2022
@maxreciprocate maxreciprocate requested a review from Dahoas October 19, 2022 21:39
@maxreciprocate maxreciprocate marked this pull request as ready for review October 19, 2022 23:48
@maxreciprocate
Collaborator Author

This PR mainly addresses

@Dahoas
Collaborator

Dahoas commented Oct 20, 2022

Can you attach wandb runs verifying PPO and ILQL still perform appropriately? (I know you have them, just want to make this a standard process.)

)
return components

def save(self, directory=None):
Contributor

all of these need doc strings and comments


if self.reward_fn:
    rewards = torch.as_tensor(self.reward_fn(samples), dtype=torch.float)
    mean_reward = rewards.mean()
Contributor

This needs comments


@abstractmethod
def get_arch(config: TRLConfig):
def get_arch(self, config: TRLConfig):
Contributor

doc strings

if self.iter_count % self.config.method.steps_for_target_q_sync == 0:
    self.accelerator.unwrap_model(self.model).sync_target_q_heads()

def loss(self, batch):
Contributor

comments

"""
Additional exploration can happen here
"""
def post_epoch_callback(self):
Contributor

End in new line

def loss(
    self, query_tensors, response_tensors, all_logprobs, all_values, all_rewards
):
def loss(self, batch):
Contributor

Comments


@torch.inference_mode()
def sample(
def generate(
Contributor

Doc string

return outputs.logits
return outputs

def forward(
Contributor

Doc string

self.model = model
self.split_token = split_token

def make_experience(self, samples, rewards):
Contributor

Doc string and comments


@register_orchestrator
class PPOOrchestrator(Orchestrator):
def __init__(
Contributor

Doc string

init_kl_coef: 0.2 # init kl coefficient
target: 6 # target kl coefficient, set None for fixed kl coef
horizon: 10000 # PPO horizon
gamma: 0.99 # PPO discount
Collaborator

Why the change?

input_ids: TensorType["query_size"]
attention_mask: TensorType["query_size"]
rewards: TensorType["reward_size"]
states_ixs: TensorType["states_size"]
Collaborator

What are these for?

query_tensors = data.tokens.to(
    self.accelerator.device
)  # [B, N] #TODO(dahoas): This may need to be changed
def generate(self, input_ids, attention_mask=None, **kwargs):
Collaborator

I noticed we are no longer loading the model into accelerate at init time. If we just want to do large-model (>20B) inference, do we still need to load with accelerate?

@@ -1,180 +1,179 @@
import os
from typing import Dict, Iterable
from typing import Iterable, Union
Collaborator

Unfortunately I really cannot review ILQL, so I trust things are fine here.

Perhaps it would be a good idea to write some unit tests for the ILQL implementation moving forward.

)
)

def learn(self, log_fn=None, save_fn=None, eval_fn=None):
Collaborator

Does the PPO implementation still have multiple ppo_epochs per batch? I see a new variable defining this (n_updated_per_batch), but since we are relying on the base_model's training loop, I am not seeing where it gets used.

Perhaps, if we do not want to override base_model's learn method, we should write something in the post_backward callback.

torch.stack([elem.rewards for elem in elems]),
return PPORLBatch(
# Left padding of already left-padded queries
pad_sequence(
Collaborator

This is a bit funny but I suppose necessary

batch_first=True,
).flip(1),
# Right pad the rest, to have a single horizontal query/response split
pad_sequence(
Collaborator

OK, seeing this now I assume the padded values are handled in the loss computation by the attention mask?
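A self-contained demonstration of the general flip trick for left padding with pad_sequence, which only right-pads (whether the per-sequence reversal happens exactly like this in the diff is not visible from the excerpt, so treat it as illustrative; the tensors below are dummy data):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    queries = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
    left_padded = pad_sequence(
        [q.flip(0) for q in queries],  # reverse each query
        batch_first=True,
        padding_value=0,
    ).flip(1)                          # flip back so the padding ends up on the left
    print(left_padded)                 # tensor([[5, 6, 7], [0, 8, 9]])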

@LouisCastricato
Contributor

The example in the README is weird... what is it supposed to do? The simulacra example is also kind of odd; there's no explanation of what it is supposed to do.

@LouisCastricato
Contributor

Looks good to me... Ready to merge

@LouisCastricato LouisCastricato merged commit 06cd30f into master Oct 21, 2022
@maxreciprocate maxreciprocate deleted the stage-api branch October 21, 2022 22:26