Foundation models across modalities have improved substantially in generalization over the past few years. These gains are attributed to several factors, including web-scale data, larger parameter counts, and training techniques such as instruction tuning. Reinforcement learning methods, by contrast, have not yet achieved comparable generalization across multiple environments without fine-tuning. In this project, we explore whether the ability of large vision-language foundation models to generalize beyond their training data can be extended to reinforcement learning. Specifically, we test whether such models, given a state and a goal, can act as agents, reward functions, or reward function code generators in unseen environments.
XiangshengGu/ActionVLM