One of the most well-known challenges in robotics is ‘picking’, i.e. using a robotic claw to lift a single object, usually from a cluttered 3-dimensional pile of objects. Picking an object from a pile and moving it to a destination is a useful capability in many real-world situations, but human experience shows that it is sometimes more efficient to stay where you are and simply throw the object to a target bin. One example is clearing debris from roads, where it may be enough to throw the debris to the side of the road.
TossingBot, a new paper published by researchers from Princeton, Google, Columbia, and MIT, presents a new kind of throwing robot that picks arbitrary objects from a pile and tosses them into relatively small bins (25×15 cm, 10”x6”) at high rates of success, surpassing the throwing skills of the researchers themselves.
TossingBot applies end-to-end deep learning to learn how to grasp items and throw them into a target bin, training first in simulation and then completing the training on a real-life robot. It uses a physics (kinematics) baseline to calculate where an object will land when thrown, and then learns how to adapt the calculated landing position (add a ‘residual’ to the calculation) based on the characteristics of the grasped item and the form of the grasp. For instance, a ball moves rather predictably in the air and generally adheres to the basic kinematics formula, while a marker pen can fly rather erratically and its movement is harder to predict.
How TossingBot Works
TossingBot receives RGB images as input and transforms them into RGB-D (color + depth) images by combining the views of two distinct cameras. It then uses three separate neural networks to analyze the image and decide how to act:
- The perception network, a 7-layer fully convolutional ResNet with two 2×2 max-pooling layers added. Its input is the RGB-D image and its output is a 45x35x512 representation of the image.
- The grasping network, a 7-layer fully convolutional ResNet with two upsampling layers added. Its input is the 45x35x512 perception representation and its output is an image giving the location (x) and orientation (θ) at which the robot should attempt to grasp an object. The researchers assume that there are no obstacles in the robot’s path, so only the end pose matters.
- The throwing network, a 7-layer fully convolutional ResNet with two upsampling layers added. Its input is the 45x35x512 perception representation together with the kinematics data from a physics engine, and its output is an image of the environment after the object has been thrown. The predicted location of the object after the throw indicates which bin it is expected to land in and, with the release angle fixed at 45°, yields a recommended release velocity for the object.
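To make the physics baseline and the learned ‘residual’ more concrete, here is a minimal Python sketch of the underlying projectile calculation. It is an illustration rather than the researchers’ code: the function names, the drag-free model, and the idea of adding the residual directly to the release speed are all assumptions made for this example.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def ballistic_release_speed(distance, height_diff=0.0, angle_deg=45.0):
    """Release speed (m/s) that lands a drag-free projectile on a target.

    distance    -- horizontal distance to the target in metres
    height_diff -- target height minus release height in metres
    angle_deg   -- release angle above the horizontal (45 by default)
    """
    theta = math.radians(angle_deg)
    # From d = v*cos(theta)*t and dz = v*sin(theta)*t - 0.5*G*t^2.
    reach = distance * math.tan(theta) - height_diff
    if distance <= 0 or reach <= 0:
        raise ValueError("target is not reachable at this release angle")
    return math.sqrt(G * distance ** 2 / (2 * math.cos(theta) ** 2 * reach))


def corrected_release_speed(distance, residual=0.0, height_diff=0.0):
    """Physics baseline plus a correction term (hypothetical helper).

    In this sketch the residual is simply a number passed in by the caller;
    in a learned system it would be predicted from the image and the grasp.
    """
    return ballistic_release_speed(distance, height_diff) + residual


# A target 1 m away at the same height needs about 3.13 m/s at 45 degrees.
baseline = ballistic_release_speed(1.0)
# A made-up residual of +0.2 m/s for an object that tends to fall short.
adjusted = corrected_release_speed(1.0, residual=0.2)
```

For a ball, the residual would be expected to stay close to zero; for an awkward object such as a marker pen, a larger correction would be needed.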
In TossingBot, grasping and throwing are learned together in an end-to-end approach, meaning that the robot is more likely to grasp objects in a way that later allows it to throw them accurately. The loss function is L = Lg + yi * Lt, combining the grasp success loss Lg and the throw distance loss Lt (distance from the middle of the expected bin). Here yi is the binary ground-truth grasp success label, so the throw loss counts only when the grasp was successful.
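The gating effect of yi can be seen in a few lines of Python. This is only a sketch of the formula above; the function name and the example loss values are made up for illustration, and the individual losses Lg and Lt are taken as given numbers.

```python
def combined_loss(grasp_loss, throw_loss, grasp_succeeded):
    """L = Lg + yi * Lt, where yi is 1 for a successful grasp and 0 otherwise."""
    y = 1.0 if grasp_succeeded else 0.0
    return grasp_loss + y * throw_loss


# Successful grasp: both terms contribute.
loss_success = combined_loss(0.25, 0.5, grasp_succeeded=True)   # 0.75
# Failed grasp: nothing was thrown, so only the grasp term remains.
loss_failure = combined_loss(0.25, 0.5, grasp_succeeded=False)  # 0.25
```

Dropping the throw term after a failed grasp avoids penalizing the throwing prediction for a throw that never happened.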
The system is first trained in a PyBullet simulation where the laws of kinematics apply but no aerodynamic drag exists. The objects used in the simulation are a ball, a cube, a rod, and a hammer.
The system is then trained for 15,000 steps with a real-life UR5 arm with an RG2 gripper, grasping and throwing a collection of 80+ toy blocks, fake fruit, decorative items, and office objects. Two real-life cameras capture 640×480 color images, and 12 target boxes are used, located outside the range of the robotic arm and tilted at a 15° angle. Each time the pile becomes empty, i.e. all the items have been thrown, the boxes are automatically lifted and the objects slide back into the pile (see GIF), allowing the robot to train continuously for hours without human intervention.
The trained TossingBot robot was successful in 84.7% of its throws at a speed of 608 picks per hour, allowing for 514 mean picks per hour (MPPH). The result is significantly higher than Dex-Net 4.0, a top-of-the-line grasping robot which achieves 312 MPPH. That said, the two aren’t directly comparable: Dex-Net has an error rate close to zero, while TossingBot’s 15.3% error rate would not be acceptable in many scenarios. In addition, some items obviously cannot be thrown safely.
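The relationship between these numbers can be checked with a short calculation. The sketch below assumes that MPPH is simply the attempt rate multiplied by the success rate; the function name is hypothetical.

```python
def mean_picks_per_hour(picks_per_hour, success_rate):
    """Attempted picks per hour scaled by the fraction that succeed."""
    return picks_per_hour * success_rate


tossingbot_mpph = mean_picks_per_hour(608, 0.847)  # roughly 515
```

The product comes out just under 515, so the reported 514 presumably reflects unrounded inputs or truncation rather than a different formula.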
The robotics community does not have a common benchmark for the accuracy of throwing robots, but the robot was able to surpass the throwing results of the human research team, whose average accuracy was 80.1%.
Interestingly, over time the robot discovered the best ways to hold different objects, using its experience to take each object’s center of mass into consideration (see image).
TossingBot offers a unique example of a robot which learns not only to grasp objects but also to grasp them in a way that they can be thrown accurately, an additional level of complexity added to standard grasping robots. While TossingBot doesn’t currently have obvious real-world uses, similar techniques are likely to be applied in real-life robots in the upcoming years.