InstaGAN – Instance-aware Image-to-image Translation – Using GANs for Object Transfiguration

Generative Adversarial Networks (GANs) have been used for many image processing tasks, among them generating images from scratch (style-based GANs) and applying new styles to existing images. A new paper, named InstaGAN, presents an innovative use of GANs – transfiguring instances of a given object in an image into another object while preserving the rest of the image as is, and even some characteristics of the original object (color, pose, etc.). For example, transforming pants to skirts or sheep to giraffes.

The proposed model is built on top of CycleGAN, a popular image-to-image model that can translate patterns and styles from one domain to another, e.g. Van Gogh’s painting ⇔ photograph, but cannot transfigure specific objects. By training InstaGAN with segmentation masks that mark the relevant instances, and with a new loss function that encourages it to transfigure only the instances (while keeping the rest as is), the model achieves impressive results. Though the quality isn’t perfect, it opens the door to interesting applications such as outfit demonstration, home design, or even tweaking images for fun.

Background – CycleGAN

Generative Adversarial Networks (GANs) are models that consist of two neural networks – a generator and a discriminator. The generator learns to create a fake but believable output, and the discriminator learns to discern which outputs are fake and which are real.

A common use of GANs is to generate images, and in the case of CycleGAN transform them from one domain to another, such as photographs to paintings, zebras to horses, etc. CycleGAN offers a new architecture that includes two generators, one to translate from domain X to domain Y and the other from Y to X, and two discriminators, one for each domain. By translating an image twice, to the other domain and back, the model becomes more “supervised”, as it can now minimize cycle-consistency loss – the difference between the original and the twice-translated one. Therefore, CycleGAN optimizes two functions:

  1. The “regular” GAN loss that forces the translated images to be similar to the target domain (to deceive its discriminator).
  2. A cycle-consistency loss that forces the translated image to stay as similar as possible to the original one and only change what’s necessary for the target domain, in order to make the translation back more accurate.   
CycleGAN overview (a) and its new cycle-consistency loss (b). Source: CycleGAN
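The two losses above can be sketched in a few lines. This is a minimal, hypothetical illustration (the generator callables `g_xy` and `g_yx` are assumptions, not CycleGAN's actual modules), showing how the cycle-consistency term compares each image with its twice-translated reconstruction:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(real_x, real_y, g_xy, g_yx):
    """Cycle-consistency term (sketch). g_xy and g_yx are hypothetical
    generator callables for the X->Y and Y->X directions."""
    # translate each image to the other domain and back
    recon_x = g_yx(g_xy(real_x))
    recon_y = g_xy(g_yx(real_y))
    # L1 distance between the originals and the twice-translated reconstructions
    return F.l1_loss(recon_x, real_x) + F.l1_loss(recon_y, real_y)
```

The full objective adds the adversarial (GAN) loss for each discriminator to this term, weighted by a hyperparameter.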

The CycleGAN generators are identical and consist of an encoder with convolution layers and residual layers (the latter sometimes called a transformer), and a decoder with deconvolution layers. The discriminator uses the PatchGAN architecture, which includes several convolution layers and a classifier. A chart of CycleGAN’s architecture can be found in Appendix A.

CycleGAN showed impressive results in several domain translations as can be seen in the examples below, but failed in domains that require object transfiguration, such as cats to dogs.  

CycleGAN examples in several domains. Source: CycleGAN

How it works

As mentioned, InstaGAN’s purpose is to transfigure instances of one object into another object. The challenge differs from CycleGAN’s, as the objects have different shapes and sizes (e.g. sheep and giraffe). InstaGAN is based on CycleGAN but includes a few additional parts to support this transfiguration.

The first part is adding information about these instances, which comes in the form of segmentation masks that mark the area of each instance. The masks can be found in many image segmentation datasets or can be extracted with deep learning algorithms such as Mask R-CNN.

The model expands the CycleGAN architecture to incorporate the segmentation masks and generate two types of output – a transfigured image and transfigured masks. Each generator (from domain X to Y and from Y to X) is expanded to include two encoders and two decoders, one encoder-decoder pair for the image and one pair for the masks. After each input is first encoded separately, the decoders use the encoded data in the following way (see image below):

  1. The encoded masks are summed, and the summation is concatenated with the encoded image and fed into the image decoder. Simply put, it “tells” the image decoder which areas to transfigure.
  2. Each encoded mask is also concatenated with the encoded image and the summation of all encoded masks. The result is fed into the mask decoder to generate the transfigured mask. These masks are also used to train the model, as described below.
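The two fusion steps above can be sketched as a small helper. Names and shapes here are assumptions for illustration (`h_x` for the encoded image, `h_masks` for the per-instance encoded masks), not the paper's actual identifiers:

```python
import torch

def fuse_encodings(h_x, h_masks):
    """Feature fusion inside one expanded generator (sketch).
    Assumed shapes: h_x is (N, C, H, W); h_masks is a list of
    per-instance encodings, each also (N, C, H, W)."""
    h_a_sum = torch.stack(h_masks, dim=0).sum(dim=0)  # summation of encoded masks
    # 1. image decoder input: encoded image || summed encoded masks
    image_dec_in = torch.cat([h_x, h_a_sum], dim=1)
    # 2. mask decoder input, one per instance: own encoding || image || summation
    mask_dec_ins = [torch.cat([h_a, h_x, h_a_sum], dim=1) for h_a in h_masks]
    return image_dec_in, mask_dec_ins
```

Because the mask encodings are summed before concatenation, the design is permutation-invariant in the instances and can handle a variable number of them.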

Similarly, the discriminator is “duplicated” to classify the extended input, after dropping its last convolution layer and softmax. The image is processed by one (PatchGAN) network and the segmentation masks are processed (separately) by another. The summation of the mask outputs is concatenated with the image output, and classified as fake or real using the last convolution layer and softmax that were dropped earlier.

A single expanded generator with two encoders (fGX for images and fGA for masks) and two decoders (gGX and gGA, respectively), and an expanded discriminator with a PatchGAN network for images (fDX) and a similar one for masks (fDA). Source: InstaGAN
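The expanded discriminator can be sketched in the same spirit. Layer widths and depths below are placeholders (a real PatchGAN stack is deeper); the point is the structure: separate feature networks for the image and the masks, mask features summed, then a shared final classification convolution:

```python
import torch
import torch.nn as nn

class InstaDiscriminator(nn.Module):
    """Sketch of the extended discriminator (sizes are assumptions).
    One PatchGAN-style feature network for the image, another for the
    masks; mask features are summed, concatenated with the image
    features, then classified by the final convolution."""
    def __init__(self, ch=64):
        super().__init__()
        self.f_img = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.f_mask = nn.Sequential(
            nn.Conv2d(1, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.classify = nn.Conv2d(2 * ch, 1, 4, padding=1)  # per-patch scores

    def forward(self, image, masks):
        h_img = self.f_img(image)
        h_mask = sum(self.f_mask(m) for m in masks)  # summation over instances
        return self.classify(torch.cat([h_img, h_mask], dim=1))
```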

Iterative training

Because processing several instances simultaneously is memory-inefficient, the authors suggest iterative training: each time, a single mask is processed with the image to transfigure a single instance. The output image (and the running masks summation) is then processed similarly with the next mask, until all instances have been transfigured. Not only is this technique more memory-efficient, it also provides data augmentation by using the intermediate images and masks as additional training samples.
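The loop itself is simple. In this sketch, `translate_step` is a hypothetical stand-in for one pass through the generator with a single mask:

```python
def translate_iteratively(image, masks, translate_step):
    """Sketch of the iterative scheme. translate_step is a hypothetical
    function mapping (current_image, one_mask) -> (new_image, new_mask);
    instances are transfigured one at a time to save memory."""
    translated_masks = []
    current = image
    for mask in masks:
        current, new_mask = translate_step(current, mask)
        translated_masks.append(new_mask)
        # every intermediate (current, translated_masks) pair can also be
        # reused as an extra training sample (built-in data augmentation)
    return current, translated_masks
```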

Context preserving loss

Finally, to incentivize the model to transfigure the instances but not the rest of the image, a new context-preserving loss is added to the CycleGAN loss function. This loss penalizes the model for changing background pixels, by summing the differences between the original and translated images, but only over pixels that lie outside both masks (original and translated). The paper shows that the new loss preserves the background better and also improves the quality of the transfigured instance.
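A minimal version of that term, assuming binary masks and averaging rather than the paper's exact weighting, could look like this:

```python
import torch

def context_preserving_loss(x, y_fake, mask_x, mask_y):
    """Sketch of the context-preserving term. Assumed shapes:
    images (N, 3, H, W), binary masks (N, 1, H, W). A pixel counts as
    background only if neither the original nor the translated mask
    covers it."""
    background = (1 - mask_x) * (1 - mask_y)          # 1 only outside both masks
    return (background * (x - y_fake).abs()).mean()   # penalize background changes
```

Inside the masks the weight is zero, so the generator is free to reshape the instance while the background is pulled back toward the original image.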


The model was trained on pairs of objects, such as animals and clothes, extracted from multiple datasets (COCO, CCP and MHP). The images were resized to a maximum size of 300×200, though it’s unclear whether this is due to compute or model limitations. The image below shows InstaGAN’s transfiguration results in comparison to CycleGAN’s, and it’s easy to notice that InstaGAN’s results are significantly better (though CycleGAN wasn’t developed for that task).

Comparison of InstaGAN and CycleGAN results for different pairs of objects. Each example includes the original image and segmentation masks (combined into one mask for visualization), the CycleGAN output, and the InstaGAN output (image + combined masks). Source: InstaGAN

Another way to evaluate the quality of the transfigurations was object classification. A pre-trained model (VGG-16) was fine-tuned for binary classification (e.g. jeans vs. skirt) and then used to predict the object type in the transfigured images. Though InstaGAN’s results are much better than CycleGAN’s, they are still less “recognizable” than real images of the same objects.

Comparison between the classification results (accuracy) of CycleGAN and InstaGAN. Source: InstaGAN

Compute & Implementation

The model was implemented in PyTorch and trained with 4 GPUs for 200-400 epochs (the training duration isn’t specified). The open-source code is available here and includes pre-trained models.


InstaGAN shows another interesting use of GANs and their potential in multi-instance transfiguration. It gives the ability to tweak existing images in a new and deeper way, by transforming their content realistically. Though its results are impressive, this field is still developing and many areas remain to be explored – increasing image resolution, handling multiple types of objects simultaneously (zebras to dogs, horses, cats, etc.), video transfiguration (the paper shows a demo), and more. The paper suggests these are feasible, perhaps even in the near future.


Appendix A

CycleGAN architecture (Source: Sarah Wolf’s blog post)
