I've been looking at using evidential PPO (EPPO) for a project I've been working on, and, to that end, I've tried to replicate the comparison results published in the paper. While the results I obtained with evidential PPO are in line with those in the paper, the results I get with standard, non-evidential PPO are noticeably higher than the published ones. In particular, looking at the HalfCheetah environment with the front-one paralysis strategy:
The results I obtained were, for EPPO:
AULC: 3833.0048828125
Final Return: 4082.907470703125
And, for standard PPO:
AULC: 3481.70068359375
Final Return: 3648.4755859375
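Since part of the gap could come from how the metrics themselves are computed, here is a minimal sketch of one common way to compute these two numbers from an evaluation curve. It assumes AULC is the trapezoidal area under the return curve normalized by the step span, and that the final return is the mean of the last few evaluations. The function names `aulc` and `final_return`, the normalization, and the `last_k` parameter are all assumptions for illustration and may not match what the repository does.

```python
def aulc(steps, returns):
    """Area under the learning curve, normalized by the step span.

    Assumption: trapezoidal rule over (step, return) pairs, divided by
    (last step - first step) so the value is on the same scale as a return.
    """
    if len(steps) != len(returns) or len(steps) < 2:
        raise ValueError("need at least two (step, return) pairs of equal length")
    area = 0.0
    for i in range(1, len(steps)):
        width = steps[i] - steps[i - 1]
        # Average of the two neighbouring returns times the step width.
        area += 0.5 * (returns[i] + returns[i - 1]) * width
    return area / (steps[-1] - steps[0])


def final_return(returns, last_k=1):
    """Mean of the last `last_k` evaluation returns (hypothetical choice)."""
    if last_k < 1 or last_k > len(returns):
        raise ValueError("last_k must be between 1 and len(returns)")
    tail = returns[-last_k:]
    return sum(tail) / len(tail)
```

If the paper uses a different normalization or averaging window, the absolute numbers above would shift for both methods, so it is worth confirming the definition before comparing against the published table.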
Loss curves:
EPPO does consistently outperform PPO, but the margin is substantially smaller than in the paper's results. The hyperparameters I used are identical to the defaults provided in this repository. The code for my vanilla PPO implementation is here, and I'm reasonably confident that no trace of the evidential critic remains:
Is there a difference between my implementation and yours that would explain the gap?