Replication of results for standard PPO

I've been looking at using evidential PPO (EPPO) for a project I'm working on, and to that end I've tried to replicate the comparison results published in the paper. While the results I obtained with evidential PPO are in line with those in the paper, the results I get with standard, non-evidential PPO are noticeably higher than the published ones. In particular, looking at the `HalfCheetah` environment with the front-one paralysis strategy:

<img width="746" height="634" alt="Image" src="https://github.com/user-attachments/assets/e86adbb6-8081-4c40-ae75-f18ab1ad7b3c" />

The results I obtained were, for EPPO:

```
AULC: 3833.0048828125
Final Return: 4082.907470703125
```

And, for standard PPO:

```
AULC: 3481.70068359375
Final Return: 3648.4755859375
```
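In case part of the gap comes from how the two metrics are computed, here is a minimal sketch of one common way to define them. Everything in it is an assumption rather than the repository's actual code: the helper names `aulc` and `final_return`, the choice to normalise the trapezoidal area by the step span, and the `last_k` averaging window are all hypothetical.

```python
import numpy as np


def aulc(steps, returns):
    """Area under the learning curve (trapezoidal rule), divided by the
    step span so the value is on the same scale as a return.
    ASSUMPTION: the paper may normalise differently or not at all."""
    steps = np.asarray(steps, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # Trapezoid rule: mean of adjacent returns times the step width.
    area = np.sum(0.5 * (returns[1:] + returns[:-1]) * np.diff(steps))
    return float(area / (steps[-1] - steps[0]))


def final_return(returns, last_k=1):
    """Mean of the last `last_k` evaluation returns.
    ASSUMPTION: `last_k` is a hypothetical smoothing window."""
    returns = np.asarray(returns, dtype=float)
    return float(returns[-last_k:].mean())
```

If the published numbers use a different normalisation or evaluation interval, the absolute values would not be directly comparable, although that alone would not explain a change in the gap between the two methods.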

Loss curves:

<img width="863" height="551" alt="Image" src="https://github.com/user-attachments/assets/f0df5af9-b6df-4659-9f35-513c061aea86" />

EPPO does consistently outperform PPO, but the margin is substantially smaller than in the paper's results. The hyperparameters are identical to the defaults provided in this repository. The code for my vanilla PPO implementation is attached below, and I'm reasonably confident that no trace of the evidential critic remains:

[evidential_ppo_paper.ipynb](https://github.com/user-attachments/files/24671923/evidential_ppo_paper.ipynb)

Is there a difference between my implementation and yours?
