Learning to Play Minecraft with Video PreTraining (VPT)

The internet contains an enormous amount of video we could learn from. You can watch a person give a gorgeous presentation, a digital artist draw a beautiful sunset, or a Minecraft player build an intricate house. These videos, however, only record what happened, not exactly how it was done: the precise sequence of mouse movements and key presses is never shown. If we want to build large-scale foundation models in these domains, as has been done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where the "action labels" are simply the next words in a sentence.

To make use of the wealth of unlabeled video data on the internet, we introduce Video PreTraining (VPT), a simple yet effective semi-supervised imitation learning method. We begin by gathering a small dataset from contractors, recording not only their video but also the actions they take, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM) that predicts the action taken at each step in a video. Importantly, the IDM can use both past and future information to predict each action. This task is much easier, and therefore requires far less data, than behavioral cloning, which must predict actions from past frames alone and thus has to infer what the person intends to do and how they will accomplish it. The trained IDM can then be used to label a much larger corpus of online videos, from which an agent learns to act via behavioral cloning.

VPT Zero-Shot Results

We validated our method in Minecraft because it is a popular video game with a large amount of freely available video data, and because it is open-ended, with a wide variety of activities similar to real-world applications such as computer usage. Unlike prior work in Minecraft that uses simplified action spaces aimed at easing exploration, our AI uses the much more generally applicable, though also much more difficult, native human interface: a 20Hz framerate with the mouse and keyboard.

Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the "VPT foundation model") accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch.

Crafting

It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds, or 1,000 consecutive game actions. Additionally, the model performs other complex skills humans commonly exhibit in the game, such as swimming, hunting animals for food, and eating that food. It also learned "pillar jumping", a common Minecraft behavior of repeatedly jumping and placing a block underneath yourself to gain height.
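To make the pipeline above concrete, here is a minimal sketch of its three stages: train an IDM on the small labeled contractor dataset, pseudo-label unlabeled internet video with it, and behaviorally clone a causal policy on those pseudo-labels (fine-tuning on a narrower dataset, such as the house-building data discussed below, is then just more of the same supervised training on different data). The architectures, tensor shapes, and the 128-way discretized action space are placeholder assumptions chosen for brevity, not the models from the paper.

```python
# Minimal sketch of the VPT pipeline (hypothetical shapes and names, not the released code).
# 1) Train an inverse dynamics model (IDM) on a small contractor dataset where actions are known.
#    The IDM sees past AND future frames, so its task is easier than causal behavioral cloning.
# 2) Use the IDM to pseudo-label a large corpus of unlabeled video.
# 3) Behaviorally clone a causal policy on the pseudo-labeled corpus.
import torch
import torch.nn as nn

N_ACTIONS = 128   # assumed size of a discretized keyboard+mouse action space
WINDOW = 16       # frames of context per training example

class InverseDynamicsModel(nn.Module):
    """Predicts the action at a frame from a window of past and future frames."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(WINDOW * 3 * 64 * 64, 512), nn.ReLU())
        self.head = nn.Linear(512, N_ACTIONS)

    def forward(self, frames):                  # frames: (B, WINDOW, 3, 64, 64)
        return self.head(self.encoder(frames))  # logits over actions

class CausalPolicy(nn.Module):
    """Behavioral-cloning policy: predicts the next action from past frames only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(WINDOW * 3 * 64 * 64, 512), nn.ReLU())
        self.head = nn.Linear(512, N_ACTIONS)

    def forward(self, past_frames):
        return self.head(self.encoder(past_frames))

def train_step(model, frames, actions, opt):
    """One cross-entropy step; used for IDM training, BC pretraining, and BC fine-tuning alike."""
    logits = model(frames)
    loss = nn.functional.cross_entropy(logits, actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for contractor data and web video.
idm, policy = InverseDynamicsModel(), CausalPolicy()
idm_opt = torch.optim.Adam(idm.parameters(), lr=1e-4)
bc_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

contractor_frames = torch.randn(8, WINDOW, 3, 64, 64)            # labeled contractor video
contractor_actions = torch.randint(0, N_ACTIONS, (8,))
train_step(idm, contractor_frames, contractor_actions, idm_opt)   # 1) train the IDM

web_frames = torch.randn(8, WINDOW, 3, 64, 64)                    # unlabeled internet video
with torch.no_grad():
    pseudo_actions = idm(web_frames).argmax(dim=-1)               # 2) pseudo-label it
train_step(policy, web_frames, pseudo_actions, bc_opt)            # 3) behavioral cloning on pseudo-labels
```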
Fine-Tuning with Behavioral Cloning

Foundation models are designed to have a broad behavior profile and to be generally capable across a wide range of tasks. To incorporate new knowledge or to let them specialize on a narrower task distribution, it is common to fine-tune these models on smaller, more specific datasets. As a case study in how well the VPT foundation model adapts to downstream datasets, we asked our contractors to play for 10 minutes in fresh Minecraft worlds and build a house from basic Minecraft materials, hoping this would amplify the foundation model's reliability at "early game" skills such as building crafting tables. Fine-tuning on this dataset yields a massive improvement in how reliably the model executes those early-game skills. The fine-tuned model also progresses further down the technology tree and crafts both wooden and stone tools. Sometimes we even see it build rudimentary shelters, search villages, and raid chests.

BC fine-tuning leads to improved early game behavior

Data Scaling

A central finding of our work is that using labeled contractor data to train an IDM (as part of the VPT pipeline) is far more effective than directly training a BC foundation model from that same small contractor dataset. To validate this hypothesis, we train foundation models on increasing amounts of data, from 1 to 70,000 hours. Models trained on under 2,000 hours of data use the contractor data with its ground-truth labels, which was originally collected to train the IDM; models trained on over 2,000 hours use internet data labeled by our IDM. Each foundation model is then fine-tuned on the house-building dataset described above.

Effect of foundation model training data on fine-tuning

As the amount of foundation model data grows, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.

Fine-Tuning with Reinforcement Learning

When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods address these with random exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL, because emulating human behavior is likely far more helpful than taking random actions. We gave our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft made all the more difficult by the native human interface.

Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. To make the task tractable, we reward agents for each item in this sequence. We found that an RL policy trained from a random initialization (the standard RL method) barely achieves any reward, never learning to collect logs and only rarely collecting sticks. In stark contrast, fine-tuning from the VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, a task that takes humans over 20 minutes (24,000 actions) on average.

Reward over episodes
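The shaped reward described above can be illustrated with a small sketch: the agent receives a one-time bonus the first time it obtains each item on the path to a diamond pickaxe, and RL fine-tuning starts from the VPT foundation model rather than a random initialization. The milestone list, reward values, and inventory format below are illustrative assumptions, not the exact schedule used in our experiments.

```python
# Illustrative reward shaping for the diamond-pickaxe task (assumed item list and values,
# not the exact schedule from the paper). The agent is rewarded the first time it obtains
# each item in the sequence.

# Milestones on the path to a diamond pickaxe, in rough tech-tree order, with assumed rewards.
MILESTONES = [
    ("log", 1.0),
    ("planks", 1.0),
    ("stick", 1.0),
    ("crafting_table", 1.0),
    ("wooden_pickaxe", 2.0),
    ("cobblestone", 2.0),
    ("stone_pickaxe", 4.0),
    ("furnace", 4.0),
    ("iron_ore", 4.0),
    ("iron_ingot", 8.0),
    ("iron_pickaxe", 8.0),
    ("diamond", 16.0),
    ("diamond_pickaxe", 32.0),
]

class MilestoneReward:
    """Gives a one-time bonus the first time each milestone item appears in the inventory."""

    def __init__(self):
        self.collected = set()

    def __call__(self, inventory: dict) -> float:
        reward = 0.0
        for item, value in MILESTONES:
            if item not in self.collected and inventory.get(item, 0) > 0:
                self.collected.add(item)
                reward += value
        return reward

# Toy usage: within an episode, the shaped reward fires once per milestone.
shaper = MilestoneReward()
print(shaper({"log": 1}))               # 1.0 -> first log collected
print(shaper({"log": 3, "planks": 4}))  # 1.0 -> planks are new, logs already counted
print(shaper({"log": 3, "planks": 4}))  # 0.0 -> nothing new this step
```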
Conclusion

VPT paves the way toward allowing agents to learn to act by watching the vast number of videos on the internet. Unlike contrastive methods and generative video modelling, VPT offers the possibility of directly learning large-scale behavioral priors in domains beyond language. Although we only experiment in Minecraft, the game is very open-ended and its native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, such as computer usage. See our paper for more details.

We are also open-sourcing our contractor data, Minecraft environment, model code, and model weights, which we hope will aid future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS competition this year: contestants can use and fine-tune our models to solve many challenging tasks in Minecraft. Those interested can check out the competition webpage and compete for a blue-sky prize of $100,000, in addition to a regular prize pool worth $20,000.