CutMix: A new strategy for Data Augmentation

Get to know CutMix data augmentation strategy along with an overview of some other augmentations.

Sarthak khandelwal
6 min read · Jul 15, 2020
source: arXiv:1905.04899

To improve the performance of an ML model, we usually apply some preprocessing steps to the data before training. One of these steps is data augmentation. Over recent years, data augmentation has substantially improved model performance, and as a result efforts have also been made to improve the augmentation techniques themselves. One such recently introduced technique is CutMix, which we are going to discuss in this article.

Note: this article is a study of the original paper that introduced the CutMix augmentation, and some definitions and phrases are taken from it.

Table of Contents —

  1. Need for CutMix
  2. CutMix and other augmentations
  3. Algorithm
  4. Visualizing CAMs
  5. Models’ Performance
  6. Conclusion
  7. References

Need for CutMix

Before CutMix was introduced, regional dropout strategies were used as a data augmentation step to enhance the performance of CNNs. These augmentations remove informative pixels from training images by overlaying them with a patch of either black pixels or random noise. On the one hand, this makes the model focus on less discriminative parts of the object, which improves generalization; on the other hand, it causes information loss, which makes training less efficient. Hence a strategy was needed that retains the regularization effect of regional dropout while avoiding the information loss, and CutMix was introduced to fill this gap. An example of a regional dropout strategy looks like this —

Image by author

CutMix and other augmentations

Let’s have a short description of CutMix and some other augmentation techniques.

CutMix

In CutMix augmentation we cut and paste random patches between the training images. The ground truth labels are mixed in proportion to the areas of the patches in the images. CutMix improves localization ability by making the model focus on less discriminative parts of the object being classified, and hence it is also well suited for tasks like object detection.

Mixup

In Mixup augmentation, two samples are mixed together by linear interpolation of their images and labels. Mixup samples tend to look unrealistic and have ambiguous labels, so Mixup does not perform well on tasks like image localization and object detection.
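The interpolation can be sketched in a few lines of NumPy (the shapes and variable names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x_a, x_b = np.zeros((4, 4, 3)), np.ones((4, 4, 3))     # two dummy images
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # their one-hot labels

lam = rng.beta(1.0, 1.0)              # mixing ratio drawn from a Beta distribution
x_mix = lam * x_a + (1 - lam) * x_b   # interpolate the pixels
y_mix = lam * y_a + (1 - lam) * y_b   # interpolate the labels
```

Note that every pixel of `x_mix` is a blend of both images, which is exactly why the result looks unrealistic compared to a CutMix sample.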

Cutout

Cutout augmentation is a kind of regional dropout strategy in which a random patch of an image is zeroed out (replaced with black pixels). Cutout samples suffer from a loss of information, which limits their regularization capability.
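A minimal Cutout sketch on a dummy image (the patch size and names are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.ones((8, 8, 3))             # dummy all-white image
size = 4                             # side length of the square patch to drop
cy, cx = rng.integers(0, 8, size=2)  # random patch centre
y1, y2 = max(0, cy - size // 2), min(8, cy + size // 2)
x1, x2 = max(0, cx - size // 2), min(8, cx + size // 2)
img[y1:y2, x1:x2, :] = 0.0           # replace the patch with black pixels
```

The zeroed pixels carry no information during training, which is the information loss CutMix avoids.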

source: arXiv:1905.04899

All three augmentations (CutMix, Mixup and Cutout) improve the results of the vanilla ResNet-50 model on the ImageNet classification task, but Mixup and Cutout tend to decrease the score on ImageNet localization and object detection tasks. CutMix, however, still improves the score on these tasks and can hence be considered a good choice for data augmentation.

Algorithm

Let’s discuss the algorithm behind this augmentation and the code to implement it.

Let x be an image of shape W×H×C, where W, H and C are the width, height and number of channels respectively, and let y be the ground truth label. We combine two samples (x_a, y_a) and (x_b, y_b) to produce a new sample (x_c, y_c). The equation is given as —

Image by author

Here M ∈ {0,1}^(W×H) is a binary mask showing where to drop out x_a and fill in from x_b, 1 is the matrix of ones, and λ is the combination ratio sampled from the beta distribution.
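Written out in the paper's notation (with ⊙ denoting element-wise multiplication), the combining operation is:

```latex
x_c = M \odot x_a + (\mathbf{1} - M) \odot x_b, \qquad
y_c = \lambda\, y_a + (1 - \lambda)\, y_b
```

So λ is the fraction of the image kept from x_a, and the label is mixed with the same ratio.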

The binary mask M is sampled by drawing bounding box coordinates B that indicate the region to be cropped in both images. The region B in x_a is removed and filled in with the patch cropped from the region B in x_b.

bbox coordinates (source: arXiv:1905.04899)

The aspect ratio of mask M is proportional to that of the original image. The coordinates of bounding box B are uniformly sampled according to —

Unif(a,b) is uniform distribution from a to b (source: arXiv:1905.04899)

Now the cropped area ratio becomes —

source: arXiv:1905.04899
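In symbols, the sampling shown in the figures above amounts to drawing a box centre uniformly and sizing the box so that its area matches 1 − λ:

```latex
r_x \sim \mathrm{Unif}(0, W), \qquad r_w = W\sqrt{1 - \lambda}
r_y \sim \mathrm{Unif}(0, H), \qquad r_h = H\sqrt{1 - \lambda}
\frac{r_w \, r_h}{W H} = 1 - \lambda
```

The square root appears because the box shrinks in both width and height, so the area scales with its square.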

In every iteration a CutMix-ed sample is formed by randomly combining two images in the mini-batch. CutMix adds some computational cost to the input pipeline, so a GPU/TPU is the recommended hardware to use with it.

The following Python code for implementing CutMix augmentation is taken from a Kaggle kernel by Chris Deotte.

import tensorflow as tf

def cutmix(image, label, PROBABILITY = 1.0):
    # input: image - a batch of images of shape [AUG_BATCH, DIM, DIM, 3],
    #        not a single image of shape [DIM, DIM, 3]
    # output: a batch of images with CutMix applied, plus the mixed labels
    DIM = IMAGE_SIZE[0]  # IMAGE_SIZE and AUG_BATCH are globals defined in the kernel
    CLASSES = 104

    imgs = []; labs = []
    for j in range(AUG_BATCH):
        # DO CUTMIX WITH PROBABILITY DEFINED ABOVE
        P = tf.cast(tf.random.uniform([], 0, 1) <= PROBABILITY, tf.int32)

        # CHOOSE RANDOM IMAGE TO CUTMIX WITH
        k = tf.cast(tf.random.uniform([], 0, AUG_BATCH), tf.int32)

        # CHOOSE RANDOM LOCATION
        x = tf.cast(tf.random.uniform([], 0, DIM), tf.int32)
        y = tf.cast(tf.random.uniform([], 0, DIM), tf.int32)
        b = tf.random.uniform([], 0, 1)  # Beta(1, 1) equals the uniform distribution
        WIDTH = tf.cast(DIM * tf.math.sqrt(1 - b), tf.int32) * P
        ya = tf.math.maximum(0, y - WIDTH // 2)
        yb = tf.math.minimum(DIM, y + WIDTH // 2)
        xa = tf.math.maximum(0, x - WIDTH // 2)
        xb = tf.math.minimum(DIM, x + WIDTH // 2)

        # MAKE CUTMIX IMAGE
        one = image[j, ya:yb, 0:xa, :]       # left strip of image j
        two = image[k, ya:yb, xa:xb, :]      # patch cut from image k
        three = image[j, ya:yb, xb:DIM, :]   # right strip of image j
        middle = tf.concat([one, two, three], axis=1)
        img = tf.concat([image[j, 0:ya, :, :], middle,
                         image[j, yb:DIM, :, :]], axis=0)
        imgs.append(img)

        # MAKE CUTMIX LABEL, mixed in proportion to the patch area
        a = tf.cast(WIDTH * WIDTH / DIM / DIM, tf.float32)
        if len(label.shape) == 1:
            lab1 = tf.one_hot(label[j], CLASSES)
            lab2 = tf.one_hot(label[k], CLASSES)
        else:
            lab1 = label[j,]
            lab2 = label[k,]
        labs.append((1 - a) * lab1 + a * lab2)

    image2 = tf.reshape(tf.stack(imgs), (AUG_BATCH, DIM, DIM, 3))
    label2 = tf.reshape(tf.stack(labs), (AUG_BATCH, CLASSES))
    return image2, label2
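The TensorFlow function above depends on kernel-specific globals (IMAGE_SIZE, AUG_BATCH). To see the core mechanics in isolation, here is a minimal, self-contained NumPy sketch for a single pair of images (the function name, shapes and parameters are my own illustrative choices):

```python
import numpy as np

def cutmix_pair(img_a, img_b, label_a, label_b, lam, rng):
    # img_a, img_b: arrays of shape (H, W, C); label_a, label_b: one-hot vectors
    H, W = img_a.shape[:2]
    # patch size follows r_w = W*sqrt(1-lam), r_h = H*sqrt(1-lam)
    rw, rh = int(W * np.sqrt(1 - lam)), int(H * np.sqrt(1 - lam))
    rx, ry = rng.integers(0, W), rng.integers(0, H)  # random patch centre
    x1, x2 = max(0, rx - rw // 2), min(W, rx + rw // 2)
    y1, y2 = max(0, ry - rh // 2), min(H, ry + rh // 2)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2, :] = img_b[y1:y2, x1:x2, :]  # paste the patch
    # adjust lambda to the actual (clipped) patch area
    lam_adj = 1 - (x2 - x1) * (y2 - y1) / (W * H)
    mixed_label = lam_adj * label_a + (1 - lam_adj) * label_b
    return mixed, mixed_label

rng = np.random.default_rng(0)
a, b = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
la, lb = np.array([1.0, 0.0]), np.array([0.0, 1.0])
img, lab = cutmix_pair(a, b, la, lb, lam=0.5, rng=rng)
```

Note the re-computation of λ from the clipped box: when the patch falls partly outside the image, the label mixing must follow the actual pasted area, just as the `a = WIDTH*WIDTH/DIM/DIM` step approximates in the kernel above.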

Visualizing CAMs

CutMix augmentation helps the model classify two objects from their partial views in a single image. To visualize this, let us look at the class activation maps (CAMs) of two images under the CutMix, Mixup and Cutout augmentations.

source: arXiv:1905.04899

The CAM for Mixup shows that the model gets somewhat confused when choosing cues for recognition. Cutout, on the other hand, succeeds in making the model focus on less discriminative parts of the object, but it loses accuracy because of the unused pixels. Finally, the CAM for CutMix shows that it makes use of the complete image while still making the model focus on non-discriminative parts of the object.

Models’ Performance

Finally, let’s look at the top-1 validation error plots for a PyramidNet-200 model trained on CIFAR-100 classification and a ResNet-50 model trained on ImageNet classification, with and without CutMix.

The top-1 score is obtained by checking whether the class predicted with the highest probability matches the target label; the top-1 error is the fraction of samples where it does not.
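As a quick illustration of the metric (the helper name and toy numbers are my own):

```python
import numpy as np

def top1_error(probs, targets):
    # probs: (N, num_classes) predicted probabilities; targets: (N,) class ids
    preds = probs.argmax(axis=1)             # class with the highest probability
    return float((preds != targets).mean())  # fraction of wrong top-1 predictions

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.5, 0.4, 0.1]])
targets = np.array([0, 1, 2])  # the third prediction is wrong, so error is 1/3
```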

Top-1 test error plot for CIFAR100 (left) and ImageNet (right) classification. Cutmix achieves lower test errors than the baseline at the end of training. (source: arXiv:1905.04899)

Conclusion

In the end, we can say that even though CutMix achieved better results than Mixup and Cutout, we cannot blindly use it for every computer vision task. We should compare the performance of models trained with the different augmentation techniques and choose the best performer.

Hope you enjoyed this explanation, ~Happy Learning~.

References

  1. Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo. “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.” arXiv:1905.04899, 2019.
  2. Chris Deotte’s CutMix implementation kernel on Kaggle.
