YOLOX Explanation — SimOTA For Dynamic Label Assignment

Gabriel Mongaras
13 min read · May 19, 2022

This article is the third in the series where I thoroughly explain how the YOLOX (You Only Look Once X) model works. If you are interested in the code, you can find a link to it below:

This series has 4 parts to fully go over the YOLOX algorithm:

Label Assignment in Object Detection

Label assignment is a critical task in object detection as it determines what bounding boxes go with what ground truth object during training. As mentioned in the previous article, label assignment splits anchors into positive and negative groups.

Positive grouped anchors are thought of as good predictions that bound an object while negative grouped anchors are thought of as bad predictions that don’t bound an object.

For example, take a look at the image below:

Bear Box

Notice there are three bounding boxes labeled A, B, and C. If a human were to label these boxes as positive or negative, they would likely say that B is positive as it completely bounds the bear's head, while A and C are negative as they don't bound the bear's head. If the bear's head were the ground truth object, they might also say that B goes with that ground truth object while A and C go with no ground truth at all, or just with the background.

There are many methods that have come about to label bounding boxes as positive or negative and I will go over how YOLOX solves this problem using SimOTA.

Why Use Label Assignment During Training?

SimOTA requires ground truth objects to assign labels. So, it is only used during training and not during inference time.

Label assignment helps the model stay more stable as it trains. Instead of optimizing all predictions (in YOLOX that is about 1344 predictions for a 256×256 input image: 1024 + 256 + 64 anchors across the FPN strides of 8, 16, and 32), we can use label assignment to pick out the best predictions. Then, we can optimize those best predictions to make them even better.

Remember that the model optimizes the regression and class targets only for positive predictions, but optimizes the objectness target for both positive and negative predictions. This way, the model learns to make the predictions that are already good even better while not worrying about the other predictions. Pruning out the bad predictions leads the final model to make much better predictions.

The reason training is more stable is that the model has to deal with fewer gradient signals. Instead of handling thousands of gradient updates for the regression and class losses, it only has to handle a few per image. Since the updates involve fewer outputs, the optimization problem is much easier.
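As a rough illustration of that masking (a minimal sketch of my own, not the exact YOLOX loss code; reg_loss, cls_loss, obj_loss, and pos_mask are hypothetical per-anchor arrays):

import numpy as np

def maskedLoss(reg_loss, cls_loss, obj_loss, pos_mask):
    # reg_loss, cls_loss, obj_loss: per-anchor loss values, shape (n,)
    # pos_mask: boolean array, True where the anchor was labeled positive
    # Regression and class losses only count for positive anchors;
    # the objectness loss counts for every anchor (positive and negative)
    return reg_loss[pos_mask].sum() + cls_loss[pos_mask].sum() + obj_loss.sum()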

Label Assignment Before SimOTA

Label assignment can be done in many ways. One of the most common is to compute the IoU between a ground truth bounding box and every predicted bounding box. The predictions with the highest IoU are assigned to that ground truth object.
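A rough sketch of that scheme (my own illustration with a hypothetical IoU threshold, not code from YOLOX):

import numpy as np

def maxIoUAssign(IoU, iou_thresh=0.5):
    # IoU: (m, n) matrix of IoU values between each gt (row)
    #      and each predicted box (column)
    best_gt = IoU.argmax(axis=0)       # index of the best-matching gt per prediction
    best_iou = IoU.max(axis=0)         # the corresponding IoU value
    pos_mask = best_iou >= iou_thresh  # positive only if the overlap is high enough
    return best_gt, pos_mask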

Other similar methods may be used by other algorithms, but the authors of SimOTA claim that “independently assigning pos/neg samples for each gt (ground truth) without context could be sub-optimal, just like the lack of context may lead to improper prediction.” To deal with the problem of sub-optimal label assignment, the authors suggest using global context in the image to assign labels rather than local context which old label assignment algorithms used.

OTA

OTA is the proposed method to deal with global context label assignment. OTA treats the label assignment problem as an optimal transport (OT) problem.

The authors of OTA (who are also the authors of YOLOX) define the OT problem as one which has "m suppliers and n demanders in a certain area. The i-th supplier holds sᵢ units of goods while the j-th demander needs dⱼ units of goods. Transporting cost for each unit of good from supplier i to demander j is denoted as cᵢⱼ. The goal of the problem is to find a transportation plan 𝝅* according to which all goods from suppliers can be transported to demanders at a minimal transportation cost." (page 3)

Essentially, OT has suppliers and demanders and the goal is to find the best plan to transport the supplies to the demanders at a minimal cost.
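In symbols (my own restatement of the standard OT objective using the notation from the quote above, not an equation copied from the paper), the plan 𝝅 minimizes the total transport cost while matching every supply and demand:

\min_{\pi \ge 0} \; \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\,\pi_{ij}
\qquad \text{subject to} \qquad
\sum_{j=1}^{n} \pi_{ij} = s_i \;\; \forall i,
\qquad
\sum_{i=1}^{m} \pi_{ij} = d_j \;\; \forall j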

OTA formulates label assignment as this OT problem where the m suppliers are the gt targets and the n demanders are the predictions, or the anchor locations on the image. Remember that each prediction is assigned to an anchor on the image, so the two can be used synonymously. The gt targets supply positive labels to the demanding anchors, and the goal is to form the best plan to deliver the positive labels to the anchors. In other words, the goal is to find the optimal way to label anchors/predictions as pos/neg by solving the OT problem.

Along with the gts, there is one more supplier: the background. This supplier provides all the remaining (negative) labels, and its supply is set in step 4 of the algorithm.

Before explaining the OTA algorithm, first let’s define some notation:

  • I - The input image
  • A - The set of anchors on the input image
  • G - The ground truth bounding boxes in image I
  • m - The number of ground truth targets
  • n - The number of anchors
  • k - The number of positive labels each gt can supply
  • sᵢ - The supply of the i-th gt (sᵢ = kᵢ)
  • dⱼ - The demand of the j-th anchor
  • cᵢⱼ - The cost to transport one positive label from gt i to anchor j
  • Ø - The background class, usually denoted as 0
  • α - The regression balancing coefficient (usually greater than or equal to 1, so the regression loss is weighted more heavily)
  • T - The number of iterations to run Sinkhorn-Knopp Iteration

The algorithm to obtain the optimal label assignments is as follows.

  1. Set m to the number of ground truths and n to the number of anchors.
  2. Get the class predictions Pᶜˡˢ and the regression predictions Pᵇᵒˣ by sending image I through the model.
  3. Create the supplying vector, s, which has m + 1 values. Use dynamic k estimation to get the supply of each gt and store it in the vector.
  4. s[m+1] = n - sum(s): the background supply, stored at index m + 1 of the supplying vector, equals the number of anchors minus the total gt supply.
  5. Initialize the demanding vector, d, as a vector of size n filled with ones.
  6. Get the pairwise cls loss between each jth prediction and its corresponding ith ground truth label. cᶜˡˢ = FocalLoss(Pᶜˡˢ, Gᶜˡˢ)
  7. Get the pairwise reg loss between each jth prediction and its corresponding ith ground truth label. cʳᵉᵍ = IoULoss(Pᵇᵒˣ, Gᵇᵒˣ)
  8. Get the pairwise center prior between each jth anchor and its corresponding ith gt. cᶜᵖ = CenterPrior(A, Gᵇᵒˣ)
  9. Get the background class cost: cᵇᵍ = FocalLoss(Pᶜˡˢ, Ø)
  10. Get the foreground cost: cᶠᵍ = cᶜˡˢ + αcʳᵉᵍ + cᶜᵖ
  11. Compute the final cost matrix c by concatenating cᵇᵍ to cᶠᵍ to form a final matrix of shape (m+1, n)
  12. Initialize the scaling vectors u and v to ones.
  13. Solve for u and v by running Sinkhorn-Knopp iteration for T steps on the cost matrix c with marginals s and d.
  14. Compute the optimal assignment plan 𝝅* from u, v, and c according to Eq. 11 in the original paper.
  15. Return 𝝅*.

The OTA paper lays out these same steps in its algorithm listing: https://arxiv.org/abs/2103.14259
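For intuition, here is a minimal NumPy sketch of the Sinkhorn-Knopp iteration used in steps 12 through 14 (my own simplified version with a hypothetical regularization parameter eps; the actual OTA implementation differs in its details):

import numpy as np

def sinkhorn(c, s, d, eps=0.1, T=50):
    # c: (m+1, n) cost matrix, s: (m+1,) supply vector, d: (n,) demand vector
    K = np.exp(-c / eps)      # Gibbs kernel of the cost matrix
    u = np.ones(c.shape[0])
    v = np.ones(c.shape[1])
    for _ in range(T):
        u = s / (K @ v)       # rescale rows so they sum to the supply
        v = d / (K.T @ u)     # rescale columns so they sum to the demand
    # Optimal (entropically regularized) plan: pi* = diag(u) K diag(v)
    return u[:, None] * K * v[None, :]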

SimOTA

The algorithm looks like a lot to deal with, but it’s not that bad. SimOTA makes the algorithm much simpler and faster.

The YOLOX authors found that even though OTA improves the model's performance, it also makes training about 25% slower, which is a lot of time when a model trains for roughly 300 epochs. They realized they could make the OT-based assignment much faster, while keeping most of the performance boost, by removing the Sinkhorn-Knopp iteration and approximating the optimal assignment plan instead.

Instead of using Sinkhorn iteration, SimOTA selects the top kᵢ (or sᵢ) predictions with the lowest cost as the positive samples for the i-th ground truth object. With this method, a single pass over all gt objects approximates the assignment instead of running an optimization algorithm to find the truly optimal one.

The SimOTA algorithm looks like the following:

  1. Set m to the number of ground truths and n to the number of anchors.
  2. Get the class predictions Pᶜˡˢ and the regression predictions Pᵇᵒˣ by sending image I through the model.
  3. Create the supplying vector, s, which has m + 1 values. Use dynamic k estimation to get the supply of each gt and store it in the vector.
  4. s[m+1] = n - sum(s): the background supply, stored at index m + 1 of the supplying vector, equals the number of anchors minus the total gt supply.
  5. Initialize the demanding vector, d, as a vector of size n filled with ones.
  6. Get the pairwise cls loss between each jth prediction and its corresponding ith ground truth label. cᶜˡˢ = FocalLoss(Pᶜˡˢ, Gᶜˡˢ)
  7. Get the pairwise reg loss between each jth prediction and its corresponding ith ground truth label. cʳᵉᵍ = IoULoss(Pᵇᵒˣ, Gᵇᵒˣ)
  8. Get the pairwise center prior between each jth anchor and its corresponding ith gt. cᶜᵖ = CenterPrior(A, Gᵇᵒˣ)
  9. Get the background class cost: cᵇᵍ = FocalLoss(Pᶜˡˢ, Ø)
  10. Get the foreground cost: cᶠᵍ = cᶜˡˢ + αcʳᵉᵍ + cᶜᵖ
  11. Compute the final cost matrix c by concatenating cᵇᵍ to cᶠᵍ to form a final matrix of shape (m+1, n)
  12. Iterate over each supply sᵢ in s and take the sᵢ predictions with the lowest cost for that gt. The resulting array has m entries, where the i-th entry holds the (at most sᵢ) predictions assigned to the i-th gt.
  13. Return the resulting array

After running SimOTA, the output is an array of size m where the i-th element holds the positively labeled anchors/predictions assigned to the i-th ground truth Gᵢ. Every prediction not in the resulting array is considered a negatively labeled prediction without a gt assignment.
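A minimal sketch of that selection step (my own simplification; the actual YOLOX implementation additionally handles anchors claimed by more than one gt, keeping the lowest-cost match):

import numpy as np

def simotaAssign(c_fg, s):
    # c_fg: (m, n) foreground cost matrix (background row excluded)
    # s:    (m,) supply (dynamic k) of each gt
    assignments = []
    for i in range(c_fg.shape[0]):
        k_i = int(s[i])
        # The k_i anchors with the lowest cost become positives for gt i
        top_idx = np.argsort(c_fg[i])[:k_i]
        assignments.append(top_idx)
    return assignments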

Dynamic k Estimation

k is the supply of each gt object and there are two ways to calculate it. The naive way of calculating k is making it a constant across all gt objects. The problem with this way of assigning supply to each gt is that not all ground truths should have the same number of anchors assigned to them.

The second proposed way of calculating k is by looking at each gt separately. The authors of OTA propose using Dynamic k Estimation which approximates the supply of each gt. To approximate the supply for each gt, we can look at all predictions and select the top q predictions according to the IoU values between each prediction and the gt. Then, we sum up the top q IoU values and use that as the k value for that gt.

Using this method, we estimate the supply, or number of positive labels, for each gt by looking at how accurately the predictions already bound that gt. This way, a gt that many predictions bound well gets more positive labels assigned to it by the SimOTA algorithm. The authors of OTA state that the intuition behind this is that "the appropriate number of positive anchors for a certain gt should be positively correlated with the number of anchors that well-regress this gt."

Note: Although k is no longer a parameter we must change, q is now a parameter that must be tuned. In my code, I use 20 as the q value which seemed to work fine.

Below is my code for dynamic k estimation:

import numpy as np

# The supplying vector (one slot per gt plus one for the background)
s_i = np.ones(m+1, dtype=np.int16)
# The sum of all k values
k_sum = 0
# Iterate over all ground truth boxes (i = gt_i)
for i in range(0, m):
    # Get the ith ground truth box (x, y, w, h)
    gt = G_reg[i]
    # Get the (x, y) coordinates of the intersections
    xA = np.maximum(gt[0], P_reg[:, 0])
    yA = np.maximum(gt[1], P_reg[:, 1])
    xB = np.minimum(gt[0]+gt[2], P_reg[:, 0]+P_reg[:, 2])
    yB = np.minimum(gt[1]+gt[3], P_reg[:, 1]+P_reg[:, 3])
    # Get the area of the intersections
    intersectionArea = np.maximum(0, xB - xA + 1) * np.maximum(0, yB - yA + 1)
    # Compute the area of both rectangles
    areaA = (gt[2]+1)*(gt[3]+1)
    areaB = (P_reg[:, 2]+1)*(P_reg[:, 3]+1)
    # Get the union of the rectangles
    union = areaA + areaB - intersectionArea
    # Compute the intersection over union for all anchors
    IoU = intersectionArea/union
    # Get the top q IoU values (the top q predictions)
    # and sum them up to get the k for this gt
    k = np.sort(IoU)[-q:].sum()
    # Add the k value to the total k sum
    k_sum += k
    # Save the k value to the supplying vector
    # as an integer
    s_i[i] = int(round(k))
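The block above stops before the background supply is filled in. A minimal sketch of that remaining line (step 4 of the algorithm, assuming n holds the number of anchors as in the notation list):

# Background supply (step 4): whatever the gts don't supply
# goes to the background slot at the end of s_i
s_i[m] = n - s_i[:m].sum()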

Center Prior

One problem that SimOTA runs into when assigning labels is that a positive label from a gt can be assigned to any anchor prediction. Sometimes a gt doesn't have very good options to choose from, but since it has to hand out k positive labels, it ends up assigning positive labels to bad predictions. For example, look at the image below:

Notice how none of the bounding boxes in the image cover the bear's face well. If the bear's face is a gt object with a supply (k) value of, say, 2, then it has to pick two predictions as positive labels. Clearly no two predictions are good, so the gt object ends up marking two bad predictions as positive.

In the early stages of training, the bad prediction problem is very prominent. So, the issue of gts not having a very good prediction to pick from is a big problem.

The solution is to define a radius r and select the r² closest anchors to the center of each gt; those anchors are not penalized, while all other anchors outside this region are penalized by adding an additional cost to them. As shown in step 8 of the SimOTA algorithm, a center prior cost is calculated and added to the total cost of all predictions.

The intuition behind giving the predictions inside the r² region a higher chance of being selected as positive anchors for a gt is that anchors close to the gt center are easier to optimize, while anchors farther away are harder to optimize.

To get the r² closest predictions, we can use the good old distance formula to get the distance between all points and the center of the gt.

To start, we can find the center of the gt, G, by adding half the width to the x coordinate and half the height to the y coordinate as follows:

center = (G[0]+G[2]//2, G[1]+G[3]//2)

Then, we can use the distance formula between all anchors A and gt G:

diff = A-center
dist = np.sqrt(np.sum(diff**2, axis=-1))

Note: center is a pair of two coordinates. If A holds the (x, y) anchor locations, then subtracting center from A gives one difference pair per anchor, and the distance formula computes the distance over these 2-D coordinate pairs. The length of dist equals the number of anchors in A.

Below is a function I wrote which can be found in my code to calculate the center prior between a single ground truth and all anchors:

# Get the center prior between a gt and the anchor locations
# on the image
# Inputs:
#   A - All anchors for a single image
#   G_box - A ground truth box for this image
#   r - radius used to select anchors in this function
#   extraCost - The extra cost to add to those anchors not in
#               the r**2 radius
# Output:
#   Array with the same number of values as the number of anchors
#   where each value is the center prior value of that anchor
def centerPrior(A, G_box, r, extraCost):
    ## Center Prior selects the closest anchors according to the
    ## center distance between the anchors and gts. Those anchors
    ## that are in the radius are not subject to any extra cost, but those
    ## anchors outside the radius are assigned extra cost to avoid
    ## having them be labelled as positive anchors for this gt.

    # Get the center location of the ground truth bounding box
    center = (G_box[0]+(G_box[2]//2), G_box[1]+(G_box[3]//2))

    # Get the difference between the center locations in A and the
    # center location of the gt bounding box
    diff = A-center

    # Use the distance formula to get the distance from the
    # gt center location for each anchor
    dist = np.sqrt(np.sum(diff**2, axis=-1))

    # Get the indices of the distances which are greater
    # than r**2, meaning the anchor is outside the radius
    idx_neg = np.where(dist > r**2)

    # Array of zeros corresponding to the center prior of each anchor
    c_cp = np.zeros(A.shape[0])

    # Those values not in the r**2 radius are subject to a constant cost
    c_cp[idx_neg] = extraCost

    return c_cp

Remember that each FPN level has a different number of anchors. For a 256×256 input image, there are 64, 256, and 1024 anchors on the FPN levels with strides of 32, 16, and 8 respectively. If the radius were fairly large and the bear's head were the gt, the grid intersection points inside the blue circle would be the anchor points not subject to an extra cost.
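Those counts come straight from the grid size at each stride. A quick check for a 256×256 input:

# Anchor (grid cell) counts per FPN level for a 256x256 input
img_size = 256
strides = [32, 16, 8]
counts = [(img_size // s) ** 2 for s in strides]   # [64, 256, 1024]
total = sum(counts)                                # 1344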

The authors of OTA note that picking the r² closest anchors stabilizes training, especially in its early stages. They also claim that this extra stability leads to better final performance.

YOLOX actually prunes out the anchors outside of the r² radius. Instead of giving predictions outside the radius an extra cost, it removes them from consideration entirely. Sometimes this leaves a gt with no candidate predictions, but removing the out-of-radius predictions maintains the center prior and helps training stability so the gradients don't blow up due to a few bad predictions. I actually found that completely removing the predictions outside the radius hurt performance, but that may be because I only trained on 1000 images.
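Here is a rough sketch of that pruning variant (my own illustration, not the official YOLOX code; cost_fg is a hypothetical array holding this gt's foreground cost for every anchor):

import numpy as np

def pruneByCenterPrior(A, G_box, r, cost_fg):
    # Distance from each anchor to the gt center, as in centerPrior above
    center = (G_box[0]+(G_box[2]//2), G_box[1]+(G_box[3]//2))
    dist = np.sqrt(np.sum((A-center)**2, axis=-1))
    # Keep only the anchors inside the r**2 radius
    keep = dist <= r**2
    # Return the indices of the surviving anchors and their costs;
    # only these anchors compete for this gt's positive labels
    return np.where(keep)[0], cost_fg[keep]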

That wraps up SimOTA. All that’s left to explain is the data augmentations that YOLOX uses which will be explained in the next article.

References

OTA: https://arxiv.org/abs/2103.14259

YOLOX: https://arxiv.org/abs/2107.08430
