
Something isn't right with actions (and chance) encoding #413


Description

@Firerozes

Hello,

Something isn't right with actions/chance encoding. I only checked for muzero and stochastic muzero (lzero/model), but these 4 follow the same overall process - and I guess other algos too. This one is taken from lzero/model/muzero_model_mlp :

        # discrete action space
        if self.discrete_action_encoding_type == 'one_hot':
            # Stack latent_state with the one hot encoded action
            if len(action.shape) == 1:
                # (batch_size, ) -> (batch_size, 1)
                # e.g.,  torch.Size([8]) ->  torch.Size([8, 1])
                action = action.unsqueeze(-1)

            # transform action to one-hot encoding.
            # action_one_hot shape: (batch_size, action_space_size), e.g., (8, 4)
            action_one_hot = torch.zeros(action.shape[0], self.action_space_size, device=action.device)
            # transform action to torch.int64
            action = action.long()
            action_one_hot.scatter_(1, action, 1)
            action_encoding = action_one_hot
        elif self.discrete_action_encoding_type == 'not_one_hot':
            action_encoding = action / self.action_space_size
            if len(action_encoding.shape) == 1:
                # (batch_size, ) -> (batch_size, 1)
                # e.g.,  torch.Size([8]) ->  torch.Size([8, 1])
                action_encoding = action_encoding.unsqueeze(-1)

        action_encoding = action_encoding.to(latent_state.device).float()
        # state_action_encoding shape: (batch_size, latent_state[1] + action_dim]) or
        # (batch_size, latent_state[1] + action_space_size]) depending on the discrete_action_encoding_type.
        state_action_encoding = torch.cat((latent_state, action_encoding), dim=1)

        next_latent_state, reward = self.dynamics_network(state_action_encoding)

        if not self.state_norm:
            return next_latent_state, reward
        else:
            next_latent_state_normalized = renormalize(next_latent_state)
            return next_latent_state_normalized, reward
  • If self.discrete_action_encoding_type == 'one_hot', there is some encoding, but it doesn't follow the ideas of DeepMind's Zero algorithms. That is, several 1s end up spread across either the first or the second column of a tensor. This doesn't match e.g. AlphaZero's approach of putting a single 1 at the played position for Go (see the first sketch after this list).
  • If self.discrete_action_encoding_type != 'one_hot', there is some kind of substitute, namely action / self.action_space_size. This is, I guess, acceptable for action spaces of low cardinality. But for large ones? E.g., for Go's 362 actions, the difference between two adjacent encodings is ~0.003 (see the second sketch after this list). Isn't that too small for the model to truly distinguish actions?
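
To make the comparison concrete, here is a minimal sketch (my own code, not LightZero's) of the AlphaZero-style board-plane encoding I mean for Go: a single 1 at the played intersection. The 19x19 board size, the pass convention, and the go_action_plane helper name are illustrative assumptions:

    import torch

    def go_action_plane(action_index: int, board_size: int = 19) -> torch.Tensor:
        """Hypothetical AlphaZero-style spatial encoding: one (board_size, board_size)
        plane with a single 1 at the played intersection; pass is left as an all-zero
        plane here (one possible convention, not LightZero's)."""
        plane = torch.zeros(board_size, board_size)
        if action_index < board_size * board_size:  # a board move
            row, col = divmod(action_index, board_size)
            plane[row, col] = 1.0
        # else: pass move -> plane stays empty in this sketch
        return plane

    plane = go_action_plane(42)   # move at row 2, col 4
    print(plane.sum())            # tensor(1.) -> exactly one 1, at the played position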
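
And a quick numeric check of how close the 'not_one_hot' scalar encodings get for a Go-sized action space (plain PyTorch, nothing taken from LightZero):

    import torch

    action_space_size = 362                        # Go: 361 intersections + pass
    actions = torch.arange(action_space_size)
    encoding = actions / action_space_size         # what the 'not_one_hot' branch computes
    print((encoding[1] - encoding[0]).item())      # ~0.00276: gap between two adjacent actions
    print(encoding.min().item(), encoding.max().item())  # 0.0 ... ~0.9972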

Labels: config (New or improved configuration), discussion (Discussion of a typical issue or concept)