Python – Deep Learning YOLO Object Detection: How to iterate over cells in a grid defined over the image

I’m trying to implement the YOLOv2 object detection algorithm myself, just to understand how it works. Of course, I would use pre-trained weights to make things faster. I’m using the keras-yolo2 code repository as the basis for my own code, but I have questions about how the code relates to the underlying YOLO algorithm.

As I understand it, from a high level, YOLO (You Only Look Once) will:

  1. Break the image up into an SxS grid.
  2. For each cell in the grid, classify and assign a probability to each potential label.
  3. Prune the boxes/classes based on whether the box/class confidence exceeds a certain threshold.

A lot of other things happen after this point, including non-maximum suppression and so on.
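To make step 3 and the non-maximum suppression step concrete, here is a minimal pure-Python sketch. The helper names `iou` and `filter_and_nms` are hypothetical and not taken from keras-yolo2; the default thresholds simply reuse the 0.3 values of OBJ_THRESHOLD and NMS_THRESHOLD from the code below, and the sketch ignores class labels, whereas real implementations usually suppress per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, obj_threshold=0.3, nms_threshold=0.3):
    """Drop low-confidence boxes, then greedily suppress overlapping ones.

    Returns the indices of the boxes that survive, best score first.
    """
    candidates = sorted(
        (i for i, s in enumerate(scores) if s > obj_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.1]
print(filter_and_nms(boxes, scores))  # [0, 2]
```

Box 1 is dropped because it overlaps box 0 (IoU of about 0.68), and box 3 is dropped because its score is below the confidence threshold.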

I looked at some code in the above repository to try to figure out how the author actually breaks the image down into an SxS grid to perform object classification within each cell. Can anyone see where this happens in the code below? It may be that I lack knowledge of TensorFlow, but I can’t tell where it is implemented. It seems that cell_x = tf.to_float(tf.reshape(tf.tile(tf.range(GRID_W), [GRID_H]), (1, GRID_H, GRID_W, 1, 1))) is what breaks the image up into cells, but I don’t understand how this would work without iterating over each grid cell. I also don’t understand how tf.reshape, tf.tile and tf.range work together to break the image up into cells.

Any help would be appreciated.

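import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.to_float, tf.Print, tf.assign_add)

# LABELS (the list of class names) is defined elsewhere and not shown in this excerpt
IMAGE_H, IMAGE_W = 416, 416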
GRID_H,  GRID_W  = 13 , 13
BOX              = 5
CLASS            = len(LABELS)
CLASS_WEIGHTS    = np.ones(CLASS, dtype='float32')
OBJ_THRESHOLD    = 0.3#0.5
NMS_THRESHOLD    = 0.3#0.45
ANCHORS          = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]

NO_OBJECT_SCALE  = 1.0
OBJECT_SCALE     = 5.0
COORD_SCALE      = 1.0
CLASS_SCALE      = 1.0

BATCH_SIZE       = 16
WARM_UP_BATCHES  = 0
TRUE_BOX_BUFFER  = 50

def custom_loss(y_true, y_pred):
    mask_shape = tf.shape(y_true)[:4]

    # cell_x[0, row, col, 0, 0] == col: the column index of every grid cell
    cell_x = tf.to_float(tf.reshape(tf.tile(tf.range(GRID_W), [GRID_H]), (1, GRID_H, GRID_W, 1, 1)))
    # swapping the two grid axes gives the row index (works because GRID_H == GRID_W)
    cell_y = tf.transpose(cell_x, (0,2,1,3,4))

    # (x, y) index of each cell, repeated for every image in the batch and every anchor box
    cell_grid = tf.tile(tf.concat([cell_x,cell_y], -1), [BATCH_SIZE, 1, 1, 5, 1])

    coord_mask = tf.zeros(mask_shape)
    conf_mask  = tf.zeros(mask_shape)
    class_mask = tf.zeros(mask_shape)

    seen = tf.Variable(0.)
    total_recall = tf.Variable(0.)

    """
    Adjust prediction
    """
    ### adjust x and y
    pred_box_xy = tf.sigmoid(y_pred[..., :2]) + cell_grid

    ### adjust w and h
    pred_box_wh = tf.exp(y_pred[..., 2:4]) * np.reshape(ANCHORS, [1,1,1,BOX,2])

    ### adjust confidence
    pred_box_conf = tf.sigmoid(y_pred[..., 4])

    ### adjust class probabilities
    pred_box_class = y_pred[..., 5:]

    """
    Adjust ground truth
    """
    ### adjust x and y
    true_box_xy = y_true[..., 0:2] # relative position to the containing cell

    ### adjust w and h
    true_box_wh = y_true[..., 2:4] # number of cells across, horizontally and vertically

    ### adjust confidence
    true_wh_half = true_box_wh / 2.
    true_mins    = true_box_xy - true_wh_half
    true_maxes   = true_box_xy + true_wh_half

    pred_wh_half = pred_box_wh / 2.
    pred_mins    = pred_box_xy - pred_wh_half
    pred_maxes   = pred_box_xy + pred_wh_half

    intersect_mins  = tf.maximum(pred_mins,  true_mins)
    intersect_maxes = tf.minimum(pred_maxes, true_maxes)
    intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    true_areas = true_box_wh[..., 0] * true_box_wh[..., 1]
    pred_areas = pred_box_wh[..., 0] * pred_box_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores  = tf.truediv(intersect_areas, union_areas)

    true_box_conf = iou_scores * y_true[..., 4]

    ### adjust class probabilities
    true_box_class = tf.argmax(y_true[..., 5:], -1)

    """
    Determine the masks
    """
    ### coordinate mask: simply the position of the ground truth boxes (the predictors)
    coord_mask = tf.expand_dims(y_true[..., 4], axis=-1) * COORD_SCALE

    ### confidence mask: penalize predictors + penalize boxes with low IOU
    # penalize the confidence of the boxes, which have IOU with some ground truth box < 0.6
    # NOTE: true_boxes is not defined in this excerpt; it has to come from the enclosing scope
    true_xy = true_boxes[..., 0:2]
    true_wh = true_boxes[..., 2:4]

    true_wh_half = true_wh / 2.
    true_mins    = true_xy - true_wh_half
    true_maxes   = true_xy + true_wh_half

    pred_xy = tf.expand_dims(pred_box_xy, 4)
    pred_wh = tf.expand_dims(pred_box_wh, 4)

    pred_wh_half = pred_wh / 2.
    pred_mins    = pred_xy - pred_wh_half
    pred_maxes   = pred_xy + pred_wh_half

    intersect_mins  = tf.maximum(pred_mins,  true_mins)
    intersect_maxes = tf.minimum(pred_maxes, true_maxes)
    intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    true_areas = true_wh[..., 0] * true_wh[..., 1]
    pred_areas = pred_wh[..., 0] * pred_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores  = tf.truediv(intersect_areas, union_areas)

    best_ious = tf.reduce_max(iou_scores, axis=4)
    conf_mask = conf_mask + tf.to_float(best_ious < 0.6) * (1 - y_true[..., 4]) * NO_OBJECT_SCALE

    # penalize the confidence of the boxes, which are responsible for corresponding ground truth box
    conf_mask = conf_mask + y_true[..., 4] * OBJECT_SCALE

    ### class mask: simply the position of the ground truth boxes (the predictors)
    class_mask = y_true[..., 4] * tf.gather(CLASS_WEIGHTS, true_box_class) * CLASS_SCALE

    """
    Warm-up training
    """
    no_boxes_mask = tf.to_float(coord_mask < COORD_SCALE/2.)
    seen = tf.assign_add(seen, 1.)

    true_box_xy, true_box_wh, coord_mask = tf.cond(tf.less(seen, WARM_UP_BATCHES),
                          lambda: [true_box_xy + (0.5 + cell_grid) * no_boxes_mask,
                                   true_box_wh + tf.ones_like(true_box_wh) * np.reshape(ANCHORS, [1,1,1,BOX,2]) * no_boxes_mask,
                                   tf.ones_like(coord_mask)],
                          lambda: [true_box_xy,
                                   true_box_wh,
                                   coord_mask])

    """
    Finalize the loss
    """
    nb_coord_box = tf.reduce_sum(tf.to_float(coord_mask > 0.0))
    nb_conf_box  = tf.reduce_sum(tf.to_float(conf_mask  > 0.0))
    nb_class_box = tf.reduce_sum(tf.to_float(class_mask > 0.0))

    loss_xy    = tf.reduce_sum(tf.square(true_box_xy-pred_box_xy)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
    loss_wh    = tf.reduce_sum(tf.square(true_box_wh-pred_box_wh)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
    loss_conf  = tf.reduce_sum(tf.square(true_box_conf-pred_box_conf) * conf_mask)  / (nb_conf_box  + 1e-6) / 2.
    loss_class = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class)
    loss_class = tf.reduce_sum(loss_class * class_mask) / (nb_class_box + 1e-6)

    loss = loss_xy + loss_wh + loss_conf + loss_class

    nb_true_box = tf.reduce_sum(y_true[..., 4])
    nb_pred_box = tf.reduce_sum(tf.to_float(true_box_conf > 0.5) * tf.to_float(pred_box_conf > 0.3))

    """
    Debugging code
    """
    current_recall = nb_pred_box/(nb_true_box + 1e-6)
    total_recall = tf.assign_add(total_recall, current_recall)

    loss = tf.Print(loss, [tf.zeros((1))], message='Dummy Line \t', summarize=1000)
    loss = tf.Print(loss, [loss_xy], message='Loss XY \t', summarize=1000)
    loss = tf.Print(loss, [loss_wh], message='Loss WH \t', summarize=1000)
    loss = tf.Print(loss, [loss_conf], message='Loss Conf \t', summarize=1000)
    loss = tf.Print(loss, [loss_class], message='Loss Class \t', summarize=1000)
    loss = tf.Print(loss, [loss], message='Total Loss \t', summarize=1000)
    loss = tf.Print(loss, [current_recall], message='Current Recall \t', summarize=1000)
    loss = tf.Print(loss, [total_recall/seen], message='Average Recall \t', summarize=1000)

    return loss

Solution

YOLOv2 does not actually cut the image into a 13x13 grid of separate patches. The network processes the whole image at once and makes its predictions at the grid level instead of the pixel level.

The network takes an input image of size 416×416 and outputs a 13×13 grid of predictions. Each prediction is a vector of length 425 containing box coordinates, confidences and class probabilities, so the actual output size is 13x13x425. With the 5 anchor boxes in the code above, 425 = 5 × (4 + 1 + 80), which implies 80 classes. Each output cell is therefore treated as a prediction for one 32×32 area of the input image (416 / 13 = 32). For example, index [2,3] of the output corresponds to the prediction (a 425-length vector) for the input image area (64,96,96,128), that is, rows 64 to 96 and columns 96 to 128.
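The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: `cell_to_region` is a hypothetical helper, not a function from keras-yolo2, and the class count of 80 is inferred from the 425-length vector rather than stated in the code.

```python
IMAGE_H, IMAGE_W = 416, 416
GRID_H, GRID_W = 13, 13


def cell_to_region(row, col):
    """Return (y_min, x_min, y_max, x_max) of the input area for output cell [row, col]."""
    stride_y = IMAGE_H // GRID_H  # 32 input pixels per grid cell, vertically
    stride_x = IMAGE_W // GRID_W  # 32 input pixels per grid cell, horizontally
    return (row * stride_y, col * stride_x, (row + 1) * stride_y, (col + 1) * stride_x)


region = cell_to_region(2, 3)
print(region)  # (64, 96, 96, 128)

# Length of each per-cell vector: boxes * (4 coordinates + 1 confidence + classes)
BOX = 5
NUM_CLASSES = 80  # assumption: inferred from 425 / 5 - 5
vector_length = BOX * (4 + 1 + NUM_CLASSES)
print(vector_length)  # 425
```

No loop over cells is needed in the network itself: the convolutional layers produce all 13×13 vectors in one forward pass, and a helper like this is only useful for interpreting the output.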

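The box coordinates that are part of the 425-length vector are encoded relative to cell_grid: the predicted x and y are offsets within a cell, and adding cell_grid turns them into positions on the 13x13 grid.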
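As a rough illustration of that encoding, the sketch below decodes one box by hand, following the `pred_box_xy` and `pred_box_wh` lines of the loss function in the question. `decode_box` is a hypothetical helper written for this example; the anchor values are copied from the ANCHORS list in the question and read as (width, height) pairs in grid-cell units.

```python
import math

ANCHORS = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434,
           7.88282, 3.52778, 9.77052, 9.16828]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_x, t_y, t_w, t_h, row, col, anchor_index):
    """Turn raw network outputs into (center_x, center_y, width, height) in grid-cell units."""
    anchor_w = ANCHORS[2 * anchor_index]
    anchor_h = ANCHORS[2 * anchor_index + 1]
    center_x = sigmoid(t_x) + col     # offset inside the cell plus the cell's column index
    center_y = sigmoid(t_y) + row     # offset inside the cell plus the cell's row index
    width = math.exp(t_w) * anchor_w  # anchor width scaled by the prediction
    height = math.exp(t_h) * anchor_h
    return center_x, center_y, width, height


box = decode_box(0.0, 0.0, 0.0, 0.0, row=2, col=3, anchor_index=1)
print(box)  # (3.5, 2.5, 1.87446, 2.06253)

# Multiply by the cell size (416 / 13 = 32 pixels) to get pixel coordinates.
center_x_px = box[0] * 32
print(center_x_px)  # 112.0
```

With all raw outputs at zero, the box sits in the middle of cell [2,3] and has exactly the size of the chosen anchor.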


The cell_grid in the code is just a mesh grid of the 13x13 cell indices, tiled over the entire batch and the 5 anchor boxes. It is added to the predicted offsets to obtain the actual coordinates, and that’s it.

cell_grid = tf.tile(tf.concat([cell_x,cell_y], -1), [BATCH_SIZE, 1, 1, 5, 1])
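To see what the range, tile, reshape and transpose calls produce, the same construction can be reproduced in NumPy, whose `np.arange`, `np.tile`, `np.reshape` and `np.transpose` behave like their TensorFlow counterparts here. This is a sketch with a small 3x3 grid and a batch of 2 so the values are easy to inspect; the sizes are chosen for illustration and are not the ones from the post.

```python
import numpy as np

# Small square grid for readability; the post uses GRID_H = GRID_W = 13, BOX = 5, BATCH_SIZE = 16.
GRID_H, GRID_W, BOX, BATCH_SIZE = 3, 3, 5, 2

# arange gives one row of column indices; tile repeats that row once per grid row.
row_of_columns = np.arange(GRID_W)      # [0, 1, 2]
flat = np.tile(row_of_columns, GRID_H)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
cell_x = flat.reshape(1, GRID_H, GRID_W, 1, 1).astype(np.float32)

# Swapping the two grid axes turns column indices into row indices.
# This only lines up because the grid is square (GRID_H == GRID_W).
cell_y = np.transpose(cell_x, (0, 2, 1, 3, 4))

# The last axis holds (x, y) = (column, row); tile it over the batch and the anchor boxes.
cell_grid = np.tile(np.concatenate([cell_x, cell_y], axis=-1), [BATCH_SIZE, 1, 1, BOX, 1])

print(cell_grid.shape)        # (2, 3, 3, 5, 2)
print(cell_grid[0, 1, 2, 0])  # [2. 1.] -> the cell in row 1, column 2
```

Every cell ends up holding its own (column, row) index, so a single broadcasted addition offsets all predictions at once and no explicit loop over grid cells is needed.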
