Sim2Real – Using Simulation to Train Real-Life Grasping Robots

Grasping real-world objects is considered one of the more iconic examples of the current limits of machine intelligence. While humans can easily grasp and pick up objects they’ve never seen before, even the most advanced robotic arms can’t manipulate objects that they weren’t trained to handle. Recent developments in reinforcement learning (RL) have allowed for the creation of robots with better manipulation skills, but even state-of-the-art technology leaves much to be desired. A key challenge in research is the scarcity of real-world training data, as even the biggest research institutions don’t have more than dozens of robots.

RCAN, a new paper from X (previously Google X), Google, and DeepMind, presents a novel technique which allows robots to learn grasping from simulation and apply the learned skills to real-world situations. The paper vastly improves on domain randomization, a popular technique for training in simulation, by using pix2pix, a GAN-based technique to convert images to different styles, and combining it with QT-Opt, a state-of-the-art RL method first presented in 2018. The result is a grasping robot with state-of-the-art capabilities – the RCAN robot can pick up 70% of objects in a tray without any real-world experience and reaches top results (91%) using 99% less real-world data than equivalent algorithms.


RCAN is relatively simple and elegant but builds on several previous breakthroughs in RL and GANs. We’ll briefly review these past achievements so that readers can fully understand how RCAN works.

Domain randomization

Domain randomization is the idea of generating a simulated environment whose properties (textures, lighting, object poses, and so on) are varied at random, running experiments in that environment (for instance with a robot), and finally applying the lessons to real-world tasks. While the concept itself is simple, the implementation challenge is often immense: the simulated environment must both emulate the real world faithfully and allow the actor in the simulation (the robot) to gain experience that carries over to the real world.

While the concept of domain randomization has been around for a while and has been applied in physics and other fields, it was only popularized in the context of robotic machine learning by OpenAI in a 2017 paper. In the OpenAI study, the researchers laid objects on a tray with randomized texture, varied their pose, randomized the relative position between the objects, and applied diverse lighting conditions and different camera angles. They then trained a simulated robot to pick up the objects in the various conditions and were able to achieve impressive results in real-life object fetching experiments.
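To make the randomization described above concrete, here is a minimal sketch of sampling randomized scene configurations. It is not the OpenAI or RCAN code: the `SceneConfig` fields, the texture names, and all numeric ranges are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical texture names; a real simulator would load image files.
TEXTURES = ["wood", "checker", "noise", "marble"]


@dataclass
class SceneConfig:
    tray_texture: str
    object_poses: list        # one (x, y, yaw_degrees) tuple per object
    light_intensity: float
    camera_angle_deg: float


def sample_scene(num_objects: int, rng: random.Random) -> SceneConfig:
    """Draw one randomized scene; every range below is an assumption."""
    poses = [
        (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), rng.uniform(0.0, 360.0))
        for _ in range(num_objects)
    ]
    return SceneConfig(
        tray_texture=rng.choice(TEXTURES),
        object_poses=poses,
        light_intensity=rng.uniform(0.3, 1.5),
        camera_angle_deg=rng.uniform(30.0, 60.0),
    )


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
scenes = [sample_scene(num_objects=3, rng=rng) for _ in range(1000)]
```

Each sampled configuration would then be handed to the simulator's renderer, so that the policy never sees the same texture, lighting, and pose combination twice.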


QT-Opt

QT-Opt is a reinforcement learning algorithm which allows robots to improve their grasping capability after watching hundreds of thousands of real-world grasping examples. At its heart is a large CNN (1.2M parameters) which represents the robot’s grasping logic (its Q-function).

RL algorithms are often divided into two categories:

  • Open-loop systems execute a policy while ignoring the environmental consequences of the agent’s activity. In robotics, an example of an open-loop system would be an algorithm which attempts to grasp an object by finding an ideal position and pose for the grasping arm and then aiming for that location, regardless of possible interruptions along the way.
  • In closed-loop systems, the policy adapts itself based on the real-time performance of the arm, incorporating the results of its actions into the algorithm’s logic. In robotics, an example of a closed-loop system would be an algorithm that recalculates the grasping arm’s movement if the arm is blocked or suffers any other setback in the grasping process.

Closed-loop systems are more robust and can achieve better results in practice, but also tend to be harder to train. QT-Opt successfully trains a neural network in a closed-loop system, allowing the robot to learn useful techniques like regrasping in case of a failed grasp, pre-grasp manipulation of objects, grasping in clutter, and handling dynamic objects like a rolling ball. (For more details on the QT-Opt implementation, see the paper.)
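The difference between the two categories can be illustrated with a toy one-dimensional example. This is only a sketch under invented assumptions (a gripper moving along a line, an object that gets bumped at step 3); it has nothing to do with the actual QT-Opt controller.

```python
STEP_SIZE = 2.0   # maximum gripper movement per step (assumed)
MAX_STEPS = 10


def object_position(step: int) -> float:
    """Toy world: the object sits at 10.0 and is bumped to 14.0 at step 3."""
    return 10.0 if step < 3 else 14.0


def open_loop() -> float:
    """Observe the object once, then execute the plan without looking again."""
    target = object_position(0)
    gripper = 0.0
    for _ in range(MAX_STEPS):
        if gripper >= target:
            break
        gripper += min(STEP_SIZE, target - gripper)
    return gripper


def closed_loop() -> float:
    """Re-observe the object at every step and correct the movement."""
    gripper = 0.0
    for step in range(MAX_STEPS):
        delta = object_position(step) - gripper
        if abs(delta) < 1e-9:
            break
        gripper += max(-STEP_SIZE, min(STEP_SIZE, delta))
    return gripper


print(open_loop())    # 10.0, the object has moved to 14.0, so the grasp misses
print(closed_loop())  # 14.0, the gripper followed the object
```

The open-loop controller ends up where the object used to be, while the closed-loop controller absorbs the disturbance because it keeps feeding new observations into its decisions.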

The QT-Opt training setup (Image: Kalashnikov et al.)

In its vanilla form, QT-Opt is an off-policy algorithm: the robot’s policy is not updated from grasps performed with its own current policy, but only from previously collected grasping attempts. The off-policy form allows it to achieve a state-of-the-art grasp success rate of 87% in a common bin-emptying challenge, after training on 580,000 real-world examples. When the researchers added on-policy learning, QT-Opt’s success rate initially decreased to 85% after 5,000 on-policy real-world grasps, but eventually reached 96% after 28,000 on-policy real-world grasps.

                   | Off-Policy Training (580k samples) | Off-Policy + 5,000 on-policy samples | Off-Policy + 28,000 on-policy samples
Grasp Success Rate | 87%                                | 85%                                  | 96%
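As a rough illustration of what learning off-policy means, the sketch below runs tabular Q-learning over a fixed log of transitions that were collected by some other policy. It is a deliberately tiny stand-in: QT-Opt uses a deep CNN as its Q-function and works with camera images, whereas the states, actions, rewards, and hyperparameters here are all hypothetical.

```python
from collections import defaultdict

ACTIONS = ["approach", "close_gripper"]
GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.5   # learning rate (assumed)

# Previously collected attempts: (state, action, reward, next_state, done).
LOGGED_TRANSITIONS = [
    ("far", "close_gripper", 0.0, "far", True),    # closed on thin air
    ("far", "approach", 0.0, "near", False),
    ("near", "approach", 0.0, "near", False),
    ("near", "close_gripper", 1.0, "near", True),  # successful grasp
]


def train_off_policy(buffer, sweeps=200):
    """Q-learning updates computed only from logged data, never from new rollouts."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for state, action, reward, next_state, done in buffer:
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] += ALPHA * (target - q[(state, action)])
    return q


def greedy_action(q, state):
    """Pick the action with the highest learned Q-value."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q_values = train_off_policy(LOGGED_TRANSITIONS)
print(greedy_action(q_values, "far"))   # approach
print(greedy_action(q_values, "near"))  # close_gripper
```

Adding on-policy learning would mean appending fresh transitions, gathered by acting greedily with the current Q-values, to the buffer between sweeps.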


GANs

Generative Adversarial Networks (GANs) are systems which consist of two neural networks – a generator and a discriminator. The generator learns to create a fake but believable output, and the discriminator learns to discern which outputs are fake and which are real. The most common use case of GANs is in image generation, wherein the generator aims to create an image of a certain style or with certain characteristics.

In common image GANs, the generator learns by receiving a noise vector as input and trying to turn the noise into a believable image. If the discriminator accepts the image, the generator neural network receives positive feedback, whereas if the discriminator rejects the image then the generator neural network receives negative feedback. The discriminator trains with real images (true samples) and the outputs of the generator network (false samples).

Standard GAN with z as the noise vector
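The positive and negative feedback described above is usually expressed as a pair of loss functions. The sketch below shows one common formulation (binary cross-entropy, with the non-saturating generator loss); it assumes the discriminator outputs a probability between 0 and 1 that a sample is real, and the example scores are made up.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Low when real samples score near 1 and generated samples score near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake: float) -> float:
    """Low when the discriminator scores a generated sample as real."""
    return -math.log(d_fake)


# The discriminator rejects the generated image: strong negative feedback.
rejected = generator_loss(d_fake=0.1)
# The discriminator accepts the generated image: the generator is rewarded.
accepted = generator_loss(d_fake=0.9)
print(rejected > accepted)  # True
```

Training alternates between lowering `discriminator_loss` with respect to the discriminator's weights and lowering `generator_loss` with respect to the generator's weights.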

Image GANs have been shown to produce believable and intriguing results, as exemplified in several recent papers.

In 2014, Mirza and Osindero expanded on the concept of GANs with cGANs (conditional GANs). In cGANs, the system receives as input not only noise and real images but also a third kind of input – a label – which is a condition on which the network is trained. The label can be any auxiliary information, such as a class label or an image of a certain style, and it steers the network toward generating outputs that match the condition. In cGANs, the generator receives as input both a noise vector and the label, and the discriminator receives as input both the label and a true/false sample.

cGAN with y as the label
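A minimal sketch of how the condition enters both networks follows. It assumes, for simplicity, that the label y is a class index encoded as a one-hot vector and that conditioning is done by concatenation; the dimensions and function names are hypothetical.

```python
import random

NUM_CLASSES = 3   # assumed number of possible labels
NOISE_DIM = 4     # assumed length of the noise vector z


def one_hot(label: int, num_classes: int = NUM_CLASSES) -> list:
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def generator_input(label: int, rng: random.Random) -> list:
    """The generator sees the noise vector z concatenated with the label y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return z + one_hot(label)


def discriminator_input(sample: list, label: int) -> list:
    """The discriminator sees a (real or generated) sample plus the same label y."""
    return list(sample) + one_hot(label)


rng = random.Random(0)
g_in = generator_input(label=1, rng=rng)
d_in = discriminator_input(sample=[0.5, 0.2], label=2)
print(len(g_in))  # 7
print(d_in)       # [0.5, 0.2, 0.0, 0.0, 1.0]
```

Because the discriminator also receives y, it can reject a realistic-looking sample that does not match its label, which is what forces the generator to respect the condition.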


pix2pix

In 2016, a team from UC Berkeley presented pix2pix, a technique which uses cGANs for image-to-image translation. The objective of pix2pix is to translate between two given image structures, for instance to receive a sketch and turn it into a realistic-looking image. To do so, the generator in the cGAN receives a new task: in addition to fooling the discriminator, which remains unchanged, the generator also needs to generate an image which looks similar to the target image (i.e. minimize a pixel-wise loss between the two images). In pix2pix, the generator does this by implementing a U-Net neural network, which receives as input a full-resolution image and outputs a manipulated full-resolution image (see detailed U-Net explanation in previous blog post). When implemented correctly, the result of pix2pix image translation can be quite impressive:

Source: pix2pix
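The generator's double task can be written as a single objective: an adversarial term plus a weighted pixel-wise distance to the target image. The sketch below treats images as flat lists of pixel values and uses an L1 distance; the weight of 100 is an assumption for illustration and should be checked against the pix2pix paper before being relied on.

```python
import math

L1_WEIGHT = 100.0  # assumed weighting between the two terms


def l1_distance(generated: list, target: list) -> float:
    """Mean absolute pixel difference between two equally sized images."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


def pix2pix_generator_loss(d_fake: float, generated: list, target: list,
                           l1_weight: float = L1_WEIGHT) -> float:
    """Fool the discriminator AND stay close to the target image."""
    adversarial = -math.log(d_fake)
    return adversarial + l1_weight * l1_distance(generated, target)


target = [0.0, 0.5, 1.0]
close_output = [0.0, 0.5, 0.9]
far_output = [1.0, 0.0, 0.0]

# With the same discriminator score, the output nearer the target wins.
loss_close = pix2pix_generator_loss(0.5, close_output, target)
loss_far = pix2pix_generator_loss(0.5, far_output, target)
print(loss_close < loss_far)  # True
```

The reconstruction term is what ties the output to the specific input image instead of to any plausible image in the target style.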

How RCAN works

Now that we’ve established the previous research which RCAN builds upon, we can describe its mechanism.

The key insight in RCAN (Randomized-to-Canonical Adaptation Networks) is that despite Kalashnikov et al.’s success with QT-Opt, there seems to be a limit to the effectiveness of training grasping robots with full-resolution images. The infinite variety of possible lighting styles and poses results in subtle changes which confuse the CNN, and training a CNN to directly notice these distinctions seems to require too much training data.

Therefore, RCAN divides both the training and policy execution (inference) process into two stages:

  1. The image observed by the robot is translated to a specific image style, known as a canonical style. The canonical style was designed to show clear distinctions between objects and object components by presenting them in different colors (see image).
  2. The robot attempts to grasp objects by looking at the canonical version of its environment, thus gaining experience at grasping objects with a canonical-style view of the world.
Source: RCAN
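The two stages above amount to composing an image translator with a grasping policy. The sketch below shows only that composition; `toy_translate` and `toy_policy` are hypothetical stand-ins (a brightness threshold and a trivial rule), not the RCAN generator or the QT-Opt policy.

```python
from typing import Callable, List

Image = List[List[float]]   # toy grayscale image: rows of pixel intensities
Action = str


def make_two_stage_policy(translate: Callable[[Image], Image],
                          policy: Callable[[Image], Action]) -> Callable[[Image], Action]:
    """Stage 1 translates the raw image; stage 2 acts on the canonical image only."""
    def act(raw_image: Image) -> Action:
        canonical = translate(raw_image)
        return policy(canonical)
    return act


def toy_translate(image: Image) -> Image:
    """Stand-in translator: thresholding removes lighting differences."""
    return [[1.0 if pixel > 0.5 else 0.0 for pixel in row] for row in image]


def toy_policy(canonical: Image) -> Action:
    """Stand-in policy: grasp if any object pixel is visible."""
    visible = any(pixel == 1.0 for row in canonical for pixel in row)
    return "grasp" if visible else "search"


act = make_two_stage_policy(toy_translate, toy_policy)
bright_scene = [[0.9, 0.2]]
dim_scene = [[0.6, 0.1]]
print(toy_translate(bright_scene) == toy_translate(dim_scene))  # True
print(act(bright_scene), act(dim_scene))                        # grasp grasp
```

Since the policy only ever sees canonical images, two differently lit views of the same scene look identical to it, which is the property the canonical style is meant to provide.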

In total, the RCAN team created a system with four distinct stages:

  1. Generating training data for image translation – The RCAN team applies domain randomization, generating a wide variety of robot grasping scenarios. They then translate each image into a canonical version of itself, a process which doesn’t require a specialized algorithm because the simulator already knows the position of every object in the scene. Naturally, this knowledge doesn’t exist when the robot operates in the real world, and therefore it’s necessary to generate data to train an image translation module for the robot.
  2. Image translation – The translation from simulated images to the canonical style is done with a pix2pix cGAN which receives as input the simulated images (“label” in cGAN terms) and their canonical version (“real image” in cGAN terms) and learns to generate a canonical version of a given image.
Source: RCAN
  3. QT-Opt training in simulation – The robot trains with QT-Opt on simulated images which vary in lighting, pose, etc. Unlike in the original QT-Opt setup, the robot doesn’t learn its grasping technique on the raw images but on their canonical versions, which are created via the pix2pix image translator.
  4. Grasping in the real world – After training, the robot attempts to grasp objects in the real world by first translating the real-world raw image to a canonical version, and then running its learned policy on the canonical version.
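The first stage works because the simulator knows which scene element every pixel belongs to. The following sketch shows how (randomized, canonical) training pairs could be produced from such ground-truth labels. The label names, the canonical colors, and the per-class random coloring are all hypothetical simplifications of what a real renderer does.

```python
import random

# Hypothetical canonical colors, one fixed RGB value per scene element.
CANONICAL_COLORS = {
    "background": (0, 0, 0),
    "tray": (128, 128, 128),
    "arm": (255, 255, 255),
    "object": (0, 255, 0),
}


def render_canonical(label_grid: list) -> list:
    """Ground-truth labels from the simulator map directly to canonical colors."""
    return [[CANONICAL_COLORS[label] for label in row] for row in label_grid]


def render_randomized(label_grid: list, rng: random.Random) -> list:
    """Stand-in for domain randomization: every class gets a random color."""
    colors = {
        label: tuple(rng.randrange(256) for _ in range(3))
        for label in CANONICAL_COLORS
    }
    return [[colors[label] for label in row] for row in label_grid]


def make_training_pairs(label_grid: list, num_variants: int, rng: random.Random) -> list:
    """Many randomized inputs, all paired with the same canonical target."""
    canonical = render_canonical(label_grid)
    return [(render_randomized(label_grid, rng), canonical) for _ in range(num_variants)]


scene_labels = [
    ["background", "tray", "tray"],
    ["arm", "object", "tray"],
]
pairs = make_training_pairs(scene_labels, num_variants=5, rng=random.Random(0))
print(len(pairs))  # 5
```

Pairs of this kind are what the image translator in stage 2 would be trained on: the randomized rendering is the input and the canonical rendering is the target.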

To further improve performance, the RCAN team added two additional types of images to the canonical simulation – a translation of images to a mask version, which clearly differentiates between different objects in the scene, and a translation of images to a depth analysis version. These are used as input to the policy CNN both in the training phase and in the policy execution (inference) phase.

The different image translations (Source: RCAN)


Results

RCAN achieves a 70% success rate in grasping real-world objects without any real-world training and reaches a 91% success rate after 5,000 real-world grasps, surpassing the previous state-of-the-art result of 85%. Its benefits are more limited after 28,000 real-world grasps, where it reaches 94%, trailing the state-of-the-art result of 96% achieved by Kalashnikov et al. with QT-Opt.

                   | 0 Grasps       | 5,000 Grasps | 28,000 Grasps
RCAN               | 70%            | 91%          | 94%
Kalashnikov et al. | Not Applicable | 85%          | 96%
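The claim that RCAN needs 99% less real-world data can be checked with simple arithmetic. The calculation below assumes the comparison is between RCAN's 5,000 real-world grasps and the 580,000 real-world grasps used for QT-Opt's off-policy training, both figures taken from the text above.

```python
QT_OPT_REAL_GRASPS = 580_000   # off-policy real-world grasps used by QT-Opt
RCAN_REAL_GRASPS = 5_000       # real-world grasps used by RCAN to reach 91%

reduction = 1 - RCAN_REAL_GRASPS / QT_OPT_REAL_GRASPS
print(f"{reduction:.1%}")  # 99.1%
```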

Therefore, it appears that the key value of RCAN is in cases where real-world training data is scarce, a major accomplishment given how difficult it generally is in ML to learn from little training data.

Compute & Equipment

The real-world experiments were performed using an unspecified number of Kuka IIWA grasping robots. The simulator uses the Bullet physics engine.

Implementation Details

The team has not provided data on the implementation of RCAN and has not indicated that it will open-source the code.


Conclusion

RCAN is a tour de force of applying training in simulation to solve real-world problems, achieving excellent results while relying on very little training data. For robots to be useful in previously unknown environments they’ll have to learn based on only small amounts of training data, and with these results, RCAN may provide a hint on how to bring robots closer to practicality.
