Hello,
Something isn't right with the action/chance encoding. I only checked MuZero and Stochastic MuZero (`lzero/model`), but these four model files follow the same overall process, and I guess other algorithms do too. The snippet below is taken from `lzero/model/muzero_model_mlp`:
```python
# discrete action space
if self.discrete_action_encoding_type == 'one_hot':
    # Stack latent_state with the one-hot encoded action.
    if len(action.shape) == 1:
        # (batch_size, ) -> (batch_size, 1)
        # e.g., torch.Size([8]) -> torch.Size([8, 1])
        action = action.unsqueeze(-1)

    # transform action to one-hot encoding.
    # action_one_hot shape: (batch_size, action_space_size), e.g., (8, 4)
    action_one_hot = torch.zeros(action.shape[0], self.action_space_size, device=action.device)
    # transform action to torch.int64
    action = action.long()
    action_one_hot.scatter_(1, action, 1)
    action_encoding = action_one_hot
elif self.discrete_action_encoding_type == 'not_one_hot':
    action_encoding = action / self.action_space_size
    if len(action_encoding.shape) == 1:
        # (batch_size, ) -> (batch_size, 1)
        # e.g., torch.Size([8]) -> torch.Size([8, 1])
        action_encoding = action_encoding.unsqueeze(-1)

action_encoding = action_encoding.to(latent_state.device).float()
# state_action_encoding shape: (batch_size, latent_state.shape[1] + action_dim) or
# (batch_size, latent_state.shape[1] + action_space_size), depending on discrete_action_encoding_type.
state_action_encoding = torch.cat((latent_state, action_encoding), dim=1)

next_latent_state, reward = self.dynamics_network(state_action_encoding)

if not self.state_norm:
    return next_latent_state, reward
else:
    next_latent_state_normalized = renormalize(next_latent_state)
    return next_latent_state_normalized, reward
```
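For reference, here is a minimal standalone sketch of the two branches (plain PyTorch, with hypothetical toy values for the batch and for `action_space_size`), showing what each encoding actually produces:

```python
import torch

action_space_size = 4                  # hypothetical toy action space
action = torch.tensor([2, 0, 3])       # (batch_size,) = (3,)

# --- 'one_hot' branch ---
a = action.unsqueeze(-1)               # (3,) -> (3, 1)
action_one_hot = torch.zeros(a.shape[0], action_space_size)
action_one_hot.scatter_(1, a.long(), 1)
print(action_one_hot)
# tensor([[0., 0., 1., 0.],
#         [1., 0., 0., 0.],
#         [0., 0., 0., 1.]])

# --- 'not_one_hot' branch ---
action_encoding = (action / action_space_size).unsqueeze(-1).float()
print(action_encoding)
# tensor([[0.5000],
#         [0.0000],
#         [0.7500]])
```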
- If `self.discrete_action_encoding_type == 'one_hot'`, there is some encoding, but it doesn't follow the ideas of DeepMind's Zero algorithms. That is, several 1s end up spread across either the first or the second column of a tensor. This doesn't match, e.g., AlphaZero's philosophy of putting a single 1 at the played position for Go.
- If `self.discrete_action_encoding_type != 'one_hot'`, there is some kind of substitute, namely `action / self.action_space_size`. This is, I guess, okay-ish for action spaces of low cardinality, but what about high ones? E.g., for Go's 362 actions, the difference between two adjacent encoded actions is ~0.003 (see the quick check after this list). Isn't that too small for the model to truly distinguish actions?
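To make the magnitude concrete, here is a quick check of the `not_one_hot` spacing for a Go-sized action space (362 is an assumption: 19×19 board points + pass):

```python
import torch

# Hypothetical Go-sized discrete action space: 19*19 board points + pass.
action_space_size = 362
actions = torch.tensor([100, 101])   # two adjacent action indices

# 'not_one_hot' encoding: a single scalar per action.
encodings = actions / action_space_size
print(encodings)               # tensor([0.2762, 0.2790])
print(1 / action_space_size)   # ~0.00276: gap between adjacent encoded actions
```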