RT-1 (Robotics Transformer) is a general-purpose transformer model for imitation learning. The encoder takes in observation images, encodes them with a language-conditioned EfficientNet, tokenizes the resulting feature map, and passes the tokens through a TokenLearner module that substantially reduces the number of tokens passed on. The transformer decoder then takes in the reduced tokens and predicts discretized variables for the robot's action.
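
To make "discretized variables" concrete, here is a minimal sketch of uniform action binning: each continuous action dimension is clipped to a range and mapped to an integer bin index, which is the kind of target a decoder can predict as a class. The bin count of 256, the `[-1, 1]` range, and the function names `discretize` and `undiscretize` are assumptions for illustration, not details taken from the text above.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count per action dimension (illustrative)


def discretize(action, low, high, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer bins in [0, num_bins - 1]."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)           # now in [0, 1]
    bins = (scaled * num_bins).astype(np.int64)      # floor to a bin index
    return np.minimum(bins, num_bins - 1)            # fold the upper edge into the last bin


def undiscretize(bins, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the center of each bin."""
    return low + (bins + 0.5) / num_bins * (high - low)


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, each dimension normalized to [-1, 1].
    action = np.array([-1.0, 0.0, 1.0])
    tokens = discretize(action, low=-1.0, high=1.0)
    print(tokens)                                    # [  0 128 255]
    print(undiscretize(tokens, low=-1.0, high=1.0))  # bin centers near the inputs
```

Decoding to bin centers bounds the round-trip error at half a bin width, so more bins give finer control at the cost of a larger output vocabulary per action dimension.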