Gradient calc in deterministic OAC #38

Hi Quan,

I came across your paper and found it interesting. I have a question about the implementation of the [optimistic policies](https://github.com/microsoft/oac-explore/blob/master/optimistic_exploration.py#L42). Why are you computing the gradient of the upper bound w.r.t. the pre-tanh output of the policy? As per the paper, isn't it supposed to be taken w.r.t. the deterministic action (the output of the tanh policy)?
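To make the distinction concrete, here is a minimal toy sketch of the two gradients I am comparing. It is not the repository's code: `q_ub` is a hypothetical scalar stand-in for the upper bound, and all names are made up for illustration. By the chain rule, the two gradients differ by a factor of `1 - tanh(u)^2`, where `u` is the pre-tanh value.

```python
import math

def q_ub(action):
    # Hypothetical stand-in for the critic's upper bound (an assumption,
    # not the OAC critic): a concave function peaked at action = 0.5.
    return -(action - 0.5) ** 2

def dq_daction(action):
    # Analytic derivative of q_ub with respect to the action.
    return -2.0 * (action - 0.5)

def grad_wrt_action(pre_tanh):
    # Gradient taken w.r.t. the squashed action a = tanh(u).
    action = math.tanh(pre_tanh)
    return dq_daction(action)

def grad_wrt_pre_tanh(pre_tanh):
    # Gradient taken w.r.t. the pre-tanh value u. The chain rule adds
    # the factor d tanh(u) / du = 1 - tanh(u)^2.
    action = math.tanh(pre_tanh)
    return dq_daction(action) * (1.0 - action * action)

pre_tanh_mean = 1.2  # arbitrary illustrative value
g_action = grad_wrt_action(pre_tanh_mean)
g_pre = grad_wrt_pre_tanh(pre_tanh_mean)
print("gradient w.r.t. action:  ", g_action)
print("gradient w.r.t. pre-tanh:", g_pre)
```

In this toy case the pre-tanh gradient is smaller in magnitude, and it shrinks further as the action saturates towards the tanh bounds, so the two choices are not interchangeable.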

Regards,
Kartik
