Robotic Pick-and-Place of Novel Objects in Clutter
with Multi-Affordance Grasping and Cross-Domain Image Matching

This paper presents a robotic pick-and-place system that is capable of grasping and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select among four different grasping primitive behaviors. It then recognizes picked objects with a cross-domain image classification framework that matches observed images to product images. Since product images are readily available for a wide range of objects (e.g., from the web), the system works out-of-the-box for novel objects without requiring any additional training data. Exhaustive experimental results demonstrate that our multi-affordance grasping achieves high success rates for a wide variety of objects in clutter, and our recognition algorithm achieves high accuracy for both known and novel grasped objects. The approach was part of the MIT-Princeton Team system that took 1st place in the stowing task at the 2017 Amazon Robotics Challenge.
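As a rough illustration, the selection step described above can be pictured as scoring the scene densely for each grasping primitive and executing the best-scoring (primitive, location) pair. The Python snippet below is only a conceptual sketch, not the actual system code: predict_affordance is a hypothetical placeholder for the learned affordance predictors, and the primitive names follow the paper (suction-down, suction-side, grasp-down, flush-grasp).

```python
import numpy as np

# Primitive names as used in the paper; the predictors themselves are not shown here.
PRIMITIVES = ["suction-down", "suction-side", "grasp-down", "flush-grasp"]

def select_grasp(rgbd, predict_affordance):
    """Return the (primitive, pixel location, score) with the highest predicted affordance."""
    best = (None, None, -np.inf)
    for primitive in PRIMITIVES:
        # predict_affordance is a hypothetical stand-in returning a dense 2D map of scores in [0, 1]
        affordance_map = predict_affordance(rgbd, primitive)
        v, u = np.unravel_index(np.argmax(affordance_map), affordance_map.shape)
        if affordance_map[v, u] > best[2]:
            best = (primitive, (v, u), affordance_map[v, u])
    return best
```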


Latest version (3 Oct 2017): arXiv:1710.01330 [cs.RO]
★ Best Systems Paper Award, Amazon Robotics ★
To appear at IEEE International Conference on Robotics and Automation (ICRA) 2018


Code

Our code is available on GitHub, including:
  •  Training/testing code (with Torch/Lua)
  •  Pre-trained models
  •  Baseline algorithms (with Matlab)
  •  Evaluation code (with Matlab)

Bibtex

If you find our research useful in your work, please consider citing:
@inproceedings{zeng2018robotic,
    title={Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching},
    author={Zeng, Andy and Song, Shuran and Yu, Kuan-Ting and Donlon, Elliott and Hogan, Francois Robert and Bauza, Maria and Ma, Daolin and Taylor, Orion and Liu, Melody and Romo, Eudald and Fazeli, Nima and Alet, Ferran and Dafle, Nikhil Chavan and Holladay, Rachel and Morona, Isabella and Nair, Prem Qu and Green, Druck and Taylor, Ian and Liu, Weber and Funkhouser, Thomas and Rodriguez, Alberto},
    booktitle={Proceedings of the IEEE International Conference on Robotics and Automation},
    year={2018}
}

Our Robot In Action

A few selected clips to showcase our robot in action during our official runs at the Amazon Robotics Challenge.

Meet our robot! Fondly referred to as the "Beast from the East".

It has learned to pick up new objects by directly predicting affordances.

It recognizes new objects by matching them to their product images.

After failing once to grasp the object with suction-down, it quickly switches to a different motion primitive behavior (suction-side) to successfully pick up the object from an angle.

It can detect grasps for irregularly shaped novel objects (such as this paint roller). It is fast and optimized to have very little downtime between grasp attempts.

From its training data, it has learned that hang tags are good surfaces to grasp with suction. Here it picks up an unsuctionable bath sponge by its hang tag.

Some surfaces like the cover of this red notebook are not suctionable, which it finds out the hard way ...

... It then switches to attempt a flush-grasp, and (luckily) slides one of its fingers into the blue tag of the book to successfully pick it up.

Here it executes a flush-grasp to pick up a small soap box lying on the side of a bin, holds it up to recognize it, and gently places it into the correct shipment box.

It makes use of both color and 3D data to detect grasps, which helps it handle a wide variety of objects, like this black metal mesh cup that cannot be seen with only 3D sensors.

Here it executes a flush-grasp to pick up an unsuctionable can opener.

Here it executes suction-side to pick up a thin flat book lying on the side of the bin.

 

Datasets

Grasping Dataset

Download:
  •  suction-based-grasping-dataset.zip (1.6 GB)
  •  parallel-jaw-grasping-dataset.zip (711.8 MB)

A small, simple dataset of RGB-D images and heightmaps of various objects in a bin, with manually annotated suctionable regions and parallel-jaw grasps (each denoted as a line parallel to the jaw motion).

Image Matching Dataset

Download: image-matching-dataset.zip (4.6 GB)

A collection of RGB-D images featuring grasped objects held up against a green screen, as well as their corresponding representative product images from Amazon.
 
Both datasets were captured using an Intel® RealSense™ SR300 RGB-D Camera. Color images are saved as 24-bit RGB PNG. Depth images and heightmaps are saved as 16-bit PNG, where depth values are saved in deci-millimeters (10⁻⁴ m). Invalid depth is set to 0. Depth images are aligned to their corresponding color images.
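As a quick-start illustration, here is a minimal Python sketch (not part of our release) for reading one aligned RGB-D pair into NumPy, converting depth from deci-millimeters to meters and masking out invalid (zero) depth. The file names are hypothetical placeholders.

```python
import numpy as np
from PIL import Image

def load_rgbd(color_path, depth_path):
    """Load an aligned RGB-D pair; return color (H,W,3 uint8) and depth in meters."""
    color = np.asarray(Image.open(color_path).convert("RGB"), dtype=np.uint8)
    depth_dmm = np.asarray(Image.open(depth_path), dtype=np.float32)  # 16-bit PNG, deci-millimeters
    depth_m = depth_dmm * 1e-4            # 1 unit = 10^-4 m
    depth_m[depth_dmm == 0] = np.nan      # 0 marks invalid/missing depth
    return color, depth_m

# Hypothetical file names; adjust to the actual frame names in the download.
color, depth = load_rgbd("color-input/000000.png", "depth-input/000000.png")
```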

The Image Matching Dataset contains observed-in-grasp RGB-D images and product images of 61 different objects: 41 are known and used for both training and testing, while 20 are novel and used only for testing. Observed-in-grasp images are found in the directories "train-imgs"/"test-imgs", and product images are found in the directories "train-item-data"/"test-item-data". Images of the same object are placed into directories with the same name. Each product image has been augmented with flips and 90-degree rotations.
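For convenience, here is a short sketch (again, not part of our release) of how observed-in-grasp images can be paired with their product images by matching directory names, as described above; the exact file extensions inside each directory are an assumption.

```python
from pathlib import Path

def pair_images(observed_root="train-imgs", product_root="train-item-data"):
    """Return {object_name: (observed image paths, product image paths)}."""
    pairs = {}
    for obj_dir in sorted(Path(observed_root).iterdir()):
        if not obj_dir.is_dir():
            continue
        product_dir = Path(product_root) / obj_dir.name  # same directory name = same object
        observed = sorted(obj_dir.glob("*.png"))         # extension assumed
        product = sorted(product_dir.glob("*"))          # product images, including augmented copies
        pairs[obj_dir.name] = (observed, product)
    return pairs
```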

The Grasping Dataset contains two sub-datasets: one for suction and one for parallel-jaw grasping.
The suction sub-dataset contains the following directories:
color-input/depth-input - RGB-D images of the input scene
color-background/depth-background - RGB-D images of the background scene
camera-intrinsics - 3x3 camera intrinsics
camera-pose - 4x4 camera poses in camera-to-world coordinates
label - 8-bit grayscale PNG image labels with value 128 for suctionable, 0 for unsuctionable, and 255 for unlabeled
train-split.txt/test-split.txt - 4:1 training/testing split of images and labels
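The following sketch (not part of our release) shows one way to load a suction label together with its depth image, intrinsics, and camera pose, and to deproject the suctionable pixels into world coordinates. It assumes the intrinsics and pose are stored as plain-text matrices and uses a hypothetical 000000.* file-name pattern.

```python
import numpy as np
from PIL import Image

# Label encoding: 128 = suctionable, 0 = unsuctionable, 255 = unlabeled.
label = np.asarray(Image.open("label/000000.png"))
suctionable   = label == 128
unsuctionable = label == 0
unlabeled     = label == 255

depth = np.asarray(Image.open("depth-input/000000.png"), dtype=np.float32) * 1e-4  # meters
K     = np.loadtxt("camera-intrinsics/000000.txt").reshape(3, 3)   # assumed plain-text 3x3
T_cw  = np.loadtxt("camera-pose/000000.txt").reshape(4, 4)         # camera-to-world, assumed plain-text 4x4

# Deproject suctionable pixels into world coordinates with the pinhole model.
v, u = np.nonzero(suctionable & (depth > 0))
z = depth[v, u]
x = (u - K[0, 2]) * z / K[0, 0]
y = (v - K[1, 2]) * z / K[1, 1]
pts_cam   = np.stack([x, y, z, np.ones_like(z)], axis=1)   # Nx4 homogeneous points, camera frame
pts_world = (T_cw @ pts_cam.T).T[:, :3]                    # Nx3 points, world frame
```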

The parallel-jaw grasping sub-dataset is similar in structure to the suction sub-dataset, but includes additional pre-computed heightmaps. The parallel-jaw grasp labels are defined with respect to the pixel coordinates of these heightmaps. It contains the following directories:
color-input/depth-input - RGB-D images of the input scene
color-background/depth-background - RGB-D images of the background scene
camera-intrinsics - 3x3 camera intrinsics
camera-pose - 4x4 camera poses in camera-to-world coordinates
train-split.txt/test-split.txt - 4:1 training/testing split of images and labels
heightmap-color/heightmap-depth - RGB-D heightmaps re-projected orthographically from the 3D point clouds of the original RGB-D images. The width of each pixel is 0.002 m in world coordinates. These heightmaps have been merged from two RGB-D views and post-processed with background subtraction, hole-filling, denoising, and zero-padding. For details on exactly how these heightmaps were generated, see the Matlab script getHeightmaps.m included in the download
label - text files with manually labeled good and bad grasps, denoted as lines parallel to the gripper's jaw motion (endpoints are finger locations). Each row in a text file contains pixel coordinates (x1,y1,x2,y2) for a grasp label in the corresponding heightmap. x,y values are 1-indexed from the top-left image corner
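Here is a short parsing sketch (not part of our evaluation code): each labeled grasp can be converted into a center, an orientation, and a jaw opening in meters using the 0.002 m heightmap pixel size described above. The comma delimiter and file name are assumptions; adjust them to match the actual label files.

```python
import numpy as np

PIXEL_SIZE = 0.002  # meters per heightmap pixel (see heightmap description above)

def parse_grasps(label_path):
    """Return (center_px, angle_rad, jaw_opening_m) for each labeled grasp."""
    rows = np.loadtxt(label_path, delimiter=",", ndmin=2)  # each row: x1,y1,x2,y2 (1-indexed pixels)
    grasps = []
    for x1, y1, x2, y2 in rows:
        p1 = np.array([x1, y1]) - 1.0                     # convert 1-indexed -> 0-indexed
        p2 = np.array([x2, y2]) - 1.0
        center = (p1 + p2) / 2.0                          # grasp center in heightmap pixels
        angle = np.arctan2(p2[1] - p1[1], p2[0] - p1[0])  # direction of jaw motion
        opening = np.linalg.norm(p2 - p1) * PIXEL_SIZE    # finger separation in meters
        grasps.append((center, angle, opening))
    return grasps
```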

Code for training and evaluating models on these datasets is available in our GitHub repository.


Contact

If you have any questions, please feel free to contact Andy Zeng at andyz[at]princeton[dot]edu



Page last updated: 23-Mar-2018
Posted by: Andy Zeng