CutMix: A new strategy for Data Augmentation
Get to know CutMix data augmentation strategy along with an overview of some other augmentations.
To improve the performance of an ML model, we usually apply some preprocessing steps to the data before actually training the model, and data augmentation is one of those steps. In recent years data augmentation has substantially improved model performance, and this success has motivated efforts to design better augmentation techniques. One such recently introduced technique is CutMix, which we will discuss in this article.
Note: this article is based on a study of the original paper that introduced CutMix augmentation, and some definitions and phrases are taken from it.
Table of Contents —
- Need for CutMix
- CutMix and other augmentations
- Algorithm
- Visualizing CAMs
- Models’ Performance
- Conclusion
- References
Need for CutMix
Before CutMix was introduced, regional dropout strategies were used as a data augmentation step to enhance the performance of CNNs. These augmentations remove informative pixels from training images by overlaying them with a patch of either black pixels or random noise. While this makes the model focus on less discriminative parts of the object, it also discards information and hence reduces the efficiency of training. A strategy was therefore needed that keeps the regularization effect of regional dropout without throwing pixels away, and CutMix was introduced to fill this gap. (An example of a regional dropout strategy is Cutout, discussed below.)
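To make the idea concrete, here is a minimal sketch of a regional dropout step in NumPy; the function name, patch shape (a square patch) and zero fill value are illustrative choices, not part of any specific library.

```python
import numpy as np

def regional_dropout(image, patch_size, rng=None):
    """Zero out a randomly placed square patch of an image of shape (H, W, C)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Sample the top-left corner so the patch fits fully inside the image.
    top = rng.integers(0, h - patch_size + 1)
    left = rng.integers(0, w - patch_size + 1)
    out = image.copy()
    out[top:top + patch_size, left:left + patch_size, :] = 0  # black patch
    return out
```

The zeroed patch carries no information, which is exactly the loss CutMix is designed to avoid.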
CutMix and other augmentations
Let’s have a short description of CutMix and some other augmentation techniques.
CutMix
In CutMix augmentation we cut and paste random patches between the training images. The ground truth labels are mixed in proportion to the area of the patches in the images. CutMix increases localization ability by making the model focus on less discriminative parts of the object being classified, and it is hence also well suited for tasks like object detection.
Mixup
In Mixup augmentation, two samples are mixed together by linear interpolation of both their images and their labels. Mixup samples tend to look unrealistic and to be locally ambiguous, and hence Mixup does not perform well on tasks like image localization and object detection.
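The linear interpolation described above can be sketched in a few lines of NumPy; the function name is illustrative, and the mixing coefficient is drawn from a Beta(α, α) distribution as in the Mixup paper.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """Blend two (image, one-hot label) pairs by linear interpolation."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)       # mixing ratio lambda ~ Beta(alpha, alpha)
    x = lam * x_a + (1.0 - lam) * x_b  # images are interpolated pixel-wise ...
    y = lam * y_a + (1.0 - lam) * y_b  # ... and labels with the same ratio
    return x, y
```

Because every pixel is a blend of both images, the result is globally unrealistic, which is the weakness CutMix addresses by pasting a coherent patch instead.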
Cutout
Cutout augmentation is a kind of regional dropout strategy in which a random patch of an image is zeroed out (replaced with black pixels). Cutout samples therefore suffer from a loss of information, which limits the regularization benefit.
All three augmentations (CutMix, Mixup and Cutout) improve on the vanilla ResNet-50 model for the ImageNet classification task, but Mixup and Cutout tend to decrease the score on ImageNet localization and object detection tasks. CutMix improves the score on these tasks as well, and can hence be considered a good choice for data augmentation.
Algorithm
Let’s discuss the algorithm behind this augmentation and the code to implement it.
Let x be an image of shape W×H×C, where W, H and C are the width, height and number of channels respectively, and let y be the ground truth label. We combine two samples (x_a, y_a) and (x_b, y_b) to produce a new sample (x_c, y_c) as follows —

x_c = M ⊙ x_a + (1 − M) ⊙ x_b
y_c = λ y_a + (1 − λ) y_b

Here M ∈ {0,1}^(W×H) is a binary mask showing where to drop out from x_a and fill in from x_b, ⊙ denotes element-wise multiplication, 1 is the matrix of ones, and λ is the combination ratio derived from the β-distribution.
The binary mask M is sampled by drawing bounding box coordinates B, which indicate the region to be cropped in both images. The region B in x_a is removed and filled in with the patch cropped from region B in x_b.
The aspect ratio of mask M is proportional to that of the original image. The bounding box coordinates B = (r_x, r_y, r_w, r_h) are uniformly sampled according to —

r_x ~ Unif(0, W), r_w = W √(1 − λ)
r_y ~ Unif(0, H), r_h = H √(1 − λ)

With this choice the cropped area ratio becomes r_w r_h / (W H) = 1 − λ.
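The sampling and combination steps above can be sketched in NumPy as follows; the function names are illustrative, and, following common practice, λ is re-computed from the actual patch area after the box is clipped to the image borders.

```python
import numpy as np

def rand_bbox(width, height, lam, rng):
    """Sample a CutMix bounding box (x1, y1, x2, y2) for a given lambda."""
    cut_w = int(width * np.sqrt(1.0 - lam))   # r_w = W * sqrt(1 - lambda)
    cut_h = int(height * np.sqrt(1.0 - lam))  # r_h = H * sqrt(1 - lambda)
    cx = rng.integers(0, width)               # r_x ~ Unif(0, W)
    cy = rng.integers(0, height)              # r_y ~ Unif(0, H)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, width)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, height)
    return x1, y1, x2, y2

def cutmix_pair(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """Combine two (image, one-hot label) samples per the equations above."""
    rng = rng or np.random.default_rng()
    h, w = x_a.shape[:2]
    lam = rng.beta(alpha, alpha)
    x1, y1, x2, y2 = rand_bbox(w, h, lam, rng)
    x_c = x_a.copy()
    x_c[y1:y2, x1:x2, :] = x_b[y1:y2, x1:x2, :]   # paste the patch from x_b
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (w * h)   # adjust lambda to clipped area
    y_c = lam * y_a + (1.0 - lam) * y_b
    return x_c, y_c
```

Note that the label weights match the pixel proportions exactly: the fraction of x_c coming from x_b equals the weight given to y_b.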
In every iteration, a CutMix-ed sample is formed by randomly combining two images within the mini-batch. CutMix adds only a small computational overhead on top of normal training, though, as with most image training pipelines, a GPU/TPU is the recommended hardware.
The following Python code implementing CutMix augmentation is taken from a Kaggle kernel by Chris Deotte.
import tensorflow as tf

# IMAGE_SIZE (a list [dim, dim]) and AUG_BATCH (the size of the augmentation
# batch) are defined elsewhere in the kernel.

def cutmix(image, label, PROBABILITY = 1.0):
    # input image - a batch of images of size [n,dim,dim,3], not a single image of [dim,dim,3]
    # output - a batch of images with cutmix applied
    DIM = IMAGE_SIZE[0]
    CLASSES = 104
    imgs = []; labs = []
    for j in range(AUG_BATCH):
        # DO CUTMIX WITH PROBABILITY DEFINED ABOVE
        P = tf.cast(tf.random.uniform([], 0, 1) <= PROBABILITY, tf.int32)
        # CHOOSE RANDOM IMAGE TO CUTMIX WITH
        k = tf.cast(tf.random.uniform([], 0, AUG_BATCH), tf.int32)
        # CHOOSE RANDOM LOCATION
        x = tf.cast(tf.random.uniform([], 0, DIM), tf.int32)
        y = tf.cast(tf.random.uniform([], 0, DIM), tf.int32)
        b = tf.random.uniform([], 0, 1)  # this is beta dist with alpha=1.0
        WIDTH = tf.cast(DIM * tf.math.sqrt(1 - b), tf.int32) * P
        ya = tf.math.maximum(0, y - WIDTH // 2)
        yb = tf.math.minimum(DIM, y + WIDTH // 2)
        xa = tf.math.maximum(0, x - WIDTH // 2)
        xb = tf.math.minimum(DIM, x + WIDTH // 2)
        # MAKE CUTMIX IMAGE
        one = image[j, ya:yb, 0:xa, :]
        two = image[k, ya:yb, xa:xb, :]
        three = image[j, ya:yb, xb:DIM, :]
        middle = tf.concat([one, two, three], axis=1)
        img = tf.concat([image[j, 0:ya, :, :], middle,
                         image[j, yb:DIM, :, :]], axis=0)
        imgs.append(img)
        # MAKE CUTMIX LABEL
        a = tf.cast(WIDTH * WIDTH / DIM / DIM, tf.float32)
        if len(label.shape) == 1:
            lab1 = tf.one_hot(label[j], CLASSES)
            lab2 = tf.one_hot(label[k], CLASSES)
        else:
            lab1 = label[j,]
            lab2 = label[k,]
        labs.append((1 - a) * lab1 + a * lab2)
    image2 = tf.reshape(tf.stack(imgs), (AUG_BATCH, DIM, DIM, 3))
    label2 = tf.reshape(tf.stack(labs), (AUG_BATCH, CLASSES))
    return image2, label2
Visualizing CAMs
CutMix augmentation helps the model classify two objects from their partial views in the same image. To visualize this, let us look at the class activation maps (CAMs) of two images under CutMix, Mixup and Cutout augmentation.
It is evident from the CAM for Mixup that the model gets somewhat confused about which cues to use for recognition. Cutout, on the other hand, successfully makes the model focus on less discriminative parts of the object, but the unused pixels diminish its efficiency. Finally, the CAM for CutMix shows that it makes use of the complete set of image pixels while still making the model focus on non-discriminative parts of the object.
Models’ Performance
Finally, let’s look at the top-1 validation error plots for PyramidNet-200 and ResNet-50 models trained on CIFAR-100 and ImageNet classification respectively, with and without CutMix.
The top-1 score is obtained by checking whether the class predicted with the highest probability is the same as the target label.
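The metric is straightforward to compute; here is a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def top1_accuracy(probs, targets):
    """Fraction of samples whose highest-probability class matches the target."""
    preds = np.argmax(probs, axis=1)   # index of the most probable class per row
    return float(np.mean(preds == targets))

# Top-1 error, as plotted in the paper, is simply 1 - top-1 accuracy.
```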
Conclusion
In the end, we can say that even though CutMix achieved better results than Mixup and Cutout, we cannot blindly use it for every computer vision task. We should compare the performance of models trained with the different augmentation techniques and choose the best performer.
Hope you enjoyed this explanation, ~Happy Learning~.
References
- CutMix paper: https://arxiv.org/pdf/1905.04899.pdf
- Mixup paper: https://arxiv.org/pdf/1710.09412.pdf
- Top-1 and top-5 error rates: https://stats.stackexchange.com/questions/156471/imagenet-what-is-top-1-and-top-5-error-rate
- Class activation maps: https://towardsdatascience.com/demystifying-convolutional-neural-networks-using-class-activation-maps-fe94eda4cef1
- Beta distribution: https://towardsdatascience.com/beta-distribution-intuition-examples-and-derivation-cf00f4db57af