The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning

NeurIPS, 2024, Moritz Schneider, Robert Krug, Narunas Vaskevicius, Luigi Palmieri, Joschka Boedecker

TL;DR

Our results challenge some common assumptions about the benefits of PVRs in MBRL and highlight the importance of data diversity and reward prediction accuracy.

Abstract

Visual Reinforcement Learning (RL) methods often require extensive amounts of data. In contrast to model-free RL, model-based RL (MBRL) offers a potential solution through efficient data utilization via planning. Additionally, RL agents often lack the generalization capabilities needed for real-world tasks. Prior work has shown that incorporating pre-trained visual representations (PVRs) enhances sample efficiency and generalization. While PVRs have been extensively studied in the context of model-free RL, their potential in MBRL remains largely unexplored. In this paper, we benchmark a set of PVRs on challenging control tasks in a model-based RL setting. We investigate the data efficiency, generalization capabilities, and the impact of different properties of PVRs on the performance of model-based agents. Our results, perhaps surprisingly, reveal that for MBRL current PVRs are not more sample efficient than learning representations from scratch, and that they do not generalize better to out-of-distribution (OOD) settings. To explain this, we analyze the quality of the trained dynamics model. Furthermore, we show that data diversity and network architecture are the most important contributors to OOD generalization performance.

Method

Our study uses the DreamerV3 and TD-MPC2 algorithms, both known for their state-of-the-art performance. We integrate PVRs into these algorithms by replacing the learned encoder with a frozen PVR followed by a linear mapping; the rest of each MBRL algorithm remains unchanged. Using this setup, we evaluate various PVRs, including popular models like CLIP and less common ones like mid-level representations. All models are open-source and trained with self-supervised objectives using Vision Transformers (ViT) or ResNets. For comparison, we also include custom autoencoders pre-trained on task-specific data.
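As a rough illustration of this integration, the following PyTorch-style sketch (module and attribute names are ours, not taken from the paper's code) wraps a frozen PVR and adds the trainable linear mapping that feeds the world model:

```python
import torch
import torch.nn as nn

class FrozenPVREncoder(nn.Module):
    """Frozen pre-trained visual representation followed by a trainable linear map."""

    def __init__(self, pvr: nn.Module, pvr_dim: int, latent_dim: int):
        super().__init__()
        self.pvr = pvr
        for p in self.pvr.parameters():
            p.requires_grad = False              # the PVR weights stay frozen
        self.proj = nn.Linear(pvr_dim, latent_dim)  # only this mapping is trained

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                    # no gradients flow into the PVR
            feats = self.pvr(obs)                # e.g. ViT or ResNet features
        return self.proj(feats)
```

The world model then consumes the projected features exactly as it would consume the output of its original learned encoder.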

Components of our PVR-based DreamerV3 (left) and TD-MPC2 (right) architectures. In DreamerV3, the output of the frozen pre-trained vision module is given to the encoder, which maps its input to a discrete latent variable. In TD-MPC2, a stack of the last 3 PVR embeddings is given to the encoder, which maps its inputs to fixed-dimensional simplices. The encoder of DreamerV3 additionally requires the recurrent state as input. The rest of both algorithms remains unchanged.
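For the TD-MPC2 variant, the stack of the three most recent PVR embeddings mentioned in the caption could be maintained along the following lines (a minimal sketch; the buffer logic and names are illustrative, not the paper's implementation):

```python
from collections import deque
import torch

class EmbeddingStack:
    """Keeps the last k PVR embeddings and concatenates them into one encoder input."""

    def __init__(self, k: int = 3):
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self, first_embedding: torch.Tensor) -> torch.Tensor:
        # At the start of an episode the stack is filled with copies of the first embedding.
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(first_embedding)
        return torch.cat(list(self.buffer), dim=-1)

    def push(self, embedding: torch.Tensor) -> torch.Tensor:
        self.buffer.append(embedding)
        return torch.cat(list(self.buffer), dim=-1)  # shape: (k * embedding_dim,)
```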

The evaluation spans 10 control tasks from three domains: DeepMind Control Suite (DMC), ManiSkill2, and Miniworld. All tasks use 256x256 RGB images as observations. The agents are trained under a distribution of visual changes in the environment (which we refer to as in-distribution, ID) and are later evaluated under a different distribution of unseen changes (OOD). ID training and OOD evaluation are implemented by randomizing visual attributes of the environments and splitting all possible randomizations into an ID training set and an OOD evaluation set. We focus exclusively on the setting of visual distribution shifts.
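Conceptually, this protocol amounts to partitioning each attribute's set of possible randomizations; a minimal sketch of such a split (the attribute values and split ratio are illustrative, not the benchmark's actual configuration):

```python
import random

def split_randomizations(values, ood_fraction=0.3, seed=0):
    """Partition one visual attribute's randomization values (e.g. floor colors or
    textures) into an ID set used for training and a held-out OOD set for evaluation."""
    values = list(values)
    random.Random(seed).shuffle(values)
    n_ood = max(1, int(len(values) * ood_fraction))
    return values[n_ood:], values[:n_ood]  # (ID values, OOD values)

# Illustrative usage: the OOD colors are never seen during training.
id_colors, ood_colors = split_randomizations(["red", "green", "blue", "gray", "cyan"])
```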

Results

Data Efficiency

Our study investigates the sample efficiency of PVR-based MBRL agents compared to agents using visual representations learned from scratch. Contrary to expectations, representations learned from scratch are often as data-efficient as, or more data-efficient than, PVRs, possibly due to an objective mismatch in MBRL. This challenges the belief that PVRs accelerate MBRL training.

ID performance and data-efficiency comparison on DMC, ManiSkill2, and Miniworld environments between the different representations. Each line represents the mean over all runs with a given representation; the shaded area represents the corresponding standard deviation. Solid lines represent DreamerV3 runs, whereas dashed lines indicate TD-MPC2 experiments. Especially in the DMC experiments, representations trained from scratch also outperform all PVRs in terms of data efficiency.

Generalization to OOD Settings

We also evaluate the out-of-distribution (OOD) performance of PVRs. The results show that, except for VC-1, PVRs do not perform well in OOD domains compared to representations learned from scratch. This is surprising given that some PVRs are trained on diverse datasets.

Average performance on DMC, ManiSkill2 and Miniworld tasks in the OOD setting. The baseline representation learned from scratch outperforms all PVRs, even in the OOD settings. Thin black lines denote the standard error.

Properties of PVRs for Generalization

Our study examines which properties of PVRs are relevant for OOD generalization. We find that language conditioning is not necessary for good OOD generalization, while data diversity is generally important. Vision Transformer (ViT) architectures perform well, but data diversity appears more crucial than the network architecture of the encoder.

IQM return of the different categorizations. Each marker represents the interquartile-mean performance of an individual group. The x-axis shows the ID performance and the y-axis the OOD performance. In particular, ViT-based representations and representations trained on diverse data perform well in the OOD setting. Sequential training data seems to help in ManiSkill2 and Miniworld but not in DMC.
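For reference, the interquartile mean (IQM) aggregates runs by discarding the lowest and highest 25% of scores before averaging; a small sketch of this statistic (not tied to the paper's evaluation code):

```python
import numpy as np
from scipy.stats import trim_mean

def interquartile_mean(scores: np.ndarray) -> float:
    """IQM: the mean of the middle 50% of scores, i.e. the bottom and top
    quartiles of runs are discarded before averaging."""
    return float(trim_mean(scores, proportiontocut=0.25))
```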

World Model Differences

Based on our results so far, we finally investigate how different visual representations affect the quality of the world model in MBRL. We analyze both the dynamics prediction error and the reward prediction error, and find that reward prediction accuracy is the more important factor for performance. PVRs may not provide enough information to predict rewards as accurately as representations learned from scratch.

Average Accumulated Reward Errors on the Pendulum Swingup task for 200 trajectories. The error is calculated as the absolute difference between true and predicted reward. Thin black lines denote the standard error.
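Under the assumption that the true and predicted rewards of a trajectory are available as arrays, the accumulated error described in the caption could be computed as in the sketch below (function names are ours):

```python
import numpy as np

def accumulated_reward_error(true_rewards: np.ndarray,
                             predicted_rewards: np.ndarray) -> float:
    """Sum of absolute differences between true and predicted rewards along one trajectory."""
    return float(np.abs(true_rewards - predicted_rewards).sum())

def average_accumulated_error(trajectories):
    """Mean and standard error of the accumulated error over a set of trajectories."""
    errors = [accumulated_reward_error(r, r_hat) for r, r_hat in trajectories]
    return float(np.mean(errors)), float(np.std(errors) / np.sqrt(len(errors)))
```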

PVRs are trained to compress information as bottlenecks, often excluding reward-relevant data since reward information is not part of their training objectives. In contrast, MBRL methods like DreamerV3 and TD-MPC2 prioritize reward information. Benchmarks of PVRs typically focus on imitation learning, where reward information is irrelevant.

To investigate how reward-related information is captured in the representations, we analyze UMAP projections of the latent state space in the Pendulum Swingup task. Representations learned from scratch and similarly performing PVRs such as VC-1 clearly encode reward information, with states of similar reward clustered together. This lack of reward embedding in many PVRs may explain their underperformance compared to representations learned from scratch. Accurate reward extraction is essential for learning effective policies and is hindered if reward information is not consistently embedded.

UMAP projections of DreamerV3 (top row) and TD-MPC2 (bottom row) encodings using different representations as input. The points are color coded by the real perceived reward. Each point represents a visited state in the Pendulum Swingup environment of DMC. The representations learned from scratch better disentangle low and high reward states whereas the embeddings of the PVRs are more entangled.
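Such a projection can be produced with the umap-learn package; the sketch below assumes the encoder outputs and the corresponding true rewards have been logged as arrays (variable and function names are ours):

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_reward_structure(latents: np.ndarray, rewards: np.ndarray) -> None:
    """Project latent states to 2D with UMAP and color each point by its true reward.

    latents: (N, D) array of encoder outputs for visited states.
    rewards: (N,) array of rewards received in those states.
    """
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(latents)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=rewards, cmap="viridis", s=5)
    plt.colorbar(label="reward")
    plt.show()
```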