retinanet_target_assign

paddle.fluid.layers.detection.retinanet_target_assign(bbox_pred, cls_logits, anchor_box, anchor_var, gt_boxes, gt_labels, is_crowd, im_info, num_classes=1, positive_overlap=0.5, negative_overlap=0.4)

Target Assign Layer for the detector RetinaNet.

This OP selects positive and negative samples from all anchors for training the RetinaNet detector, and assigns to each sample a target label for classification and a target location for regression. It then extracts the entries belonging to those samples from the category prediction (cls_logits) and the location prediction (bbox_pred), which cover all anchors.

The rules for selecting positive and negative samples are as follows:

1. An anchor is assigned to a ground-truth box when it has the highest IoU overlap with that ground-truth box.

2. An anchor is assigned to a ground-truth box when its IoU overlap with any ground-truth box is higher than positive_overlap.

3. An anchor is assigned to background when its IoU overlap with every ground-truth box is lower than negative_overlap.

4. Anchors which do not meet the above conditions do not participate in the training process.
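The four rules above can be illustrated with a minimal pure-Python sketch. It is not the OP's implementation: the helper names `iou` and `assign_anchors` are hypothetical, boxes are treated as continuous `[xmin, ymin, xmax, ymax]` rectangles, and an overlap exactly equal to positive_overlap is assumed to count as positive. The OP may use a different pixel convention or tie handling.

```python
def iou(a, b):
    """IoU of two boxes given as [xmin, ymin, xmax, ymax]."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def assign_anchors(anchors, gt_boxes, positive_overlap=0.5, negative_overlap=0.4):
    """Return one entry per anchor: the index of the matched ground-truth
    box (positive), -1 (background), or None (ignored during training)."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    result = []
    for row in overlaps:
        best = max(row)
        if best >= positive_overlap:      # rule 2 (tie handling is an assumption)
            result.append(row.index(best))
        elif best < negative_overlap:     # rule 3
            result.append(-1)
        else:                             # rule 4
            result.append(None)
    # Rule 1: the anchor(s) with the highest IoU for each ground-truth box
    # become positive, even if that IoU is below positive_overlap.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        best = max(column)
        if best > 0:
            for i, value in enumerate(column):
                if value == best:
                    result[i] = j
    return result


anchors = [[0, 0, 10, 10], [0, 0, 9, 5], [20, 20, 30, 30],
           [0, 0, 10, 2], [100, 100, 103, 110]]
gt_boxes = [[0, 0, 10, 10], [100, 100, 110, 110]]
assignment = assign_anchors(anchors, gt_boxes)
```

In this toy input the second anchor has an IoU of 0.45, which lies between the two thresholds, so it is ignored. The last anchor has an IoU of only 0.3 but is still positive because it is the best match for the second ground-truth box.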

RetinaNet predicts a \(C\)-vector for classification and a 4-vector for box regression for each anchor. Hence the target label for each positive (or negative) sample is a \(C\)-vector, and the target location for each positive sample is a 4-vector. For a positive sample, if the category of its assigned ground-truth box is class \(i\), the corresponding entry in its length-\(C\) label vector is set to 1 and all other entries are set to 0; its box regression targets are computed as the offset between itself and its assigned ground-truth box. For a negative sample, all entries in its length-\(C\) label vector are set to 0, and box regression targets are omitted because negative samples do not participate in the training of location regression.
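As a rough illustration of the targets described above, the sketch below builds the length-\(C\) label vector and one possible box offset. The helper names `one_hot_label` and `box_offsets` are hypothetical. The center-size parameterization is a common choice in anchor-based detectors and is only an assumption here: this page does not state the exact formula the OP uses, nor how anchor_var enters it.

```python
import math


def one_hot_label(gt_class, num_classes):
    """Length-C label vector. gt_class is in [1, C] for a positive
    sample and 0 for a negative sample (all entries stay 0)."""
    label = [0] * num_classes
    if gt_class > 0:
        label[gt_class - 1] = 1
    return label


def box_offsets(anchor, gt):
    """Assumed center-size encoding of gt relative to anchor.
    Both boxes are [xmin, ymin, xmax, ymax]."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    acx, acy = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    gcx, gcy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    return [(gcx - acx) / aw, (gcy - acy) / ah,
            math.log(gw / aw), math.log(gh / ah)]


positive_label = one_hot_label(3, 5)   # class 3 out of C = 5
negative_label = one_hot_label(0, 5)   # background: all zeros
offsets = box_offsets([0, 0, 10, 10], [5, 0, 15, 10])
```

Here the ground-truth box has the same size as the anchor and is shifted right by half the anchor width, so only the first offset is non-zero.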

After the assignment, the part belonging to positive and negative samples is taken out from the category prediction (cls_logits), and the part belonging to positive samples is taken out from the location prediction (bbox_pred).
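The extraction step can be pictured as a row gather over the flattened predictions. The sketch below uses plain Python lists and the hypothetical helpers `flatten_anchors` and `gather`. The index lists are made-up toy values, and placing positives before negatives in the output is an assumption rather than documented behavior.

```python
def flatten_anchors(pred):
    """Turn nested [N][M][K] lists into N*M rows of length K."""
    return [row for image in pred for row in image]


def gather(rows, indices):
    """Select rows by index, in the order given."""
    return [rows[i] for i in indices]


# Toy predictions: N = 1 image, M = 4 anchors, C = 2 categories.
cls_logits = [[[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]]
bbox_pred = [[[0, 0, 1, 1], [1, 1, 2, 2], [2, 2, 3, 3], [3, 3, 4, 4]]]

fg_index = [0]       # hypothetical positive anchors (F = 1)
bg_index = [2, 3]    # hypothetical negative anchors (B = 2)

# Scores are kept for positives and negatives: shape [F + B, C].
predict_scores = gather(flatten_anchors(cls_logits), fg_index + bg_index)
# Locations are kept for positives only: shape [F, 4].
predict_location = gather(flatten_anchors(bbox_pred), fg_index)
```

Anchor 1 appears in neither index list, which corresponds to an anchor that does not participate in training.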

Parameters
  • bbox_pred (Variable) – A 3-D Tensor with shape \([N, M, 4]\) that represents the predicted locations of all anchors. \(N\) is the batch size (the number of images in a mini-batch), \(M\) is the number of all anchors of one image, and each anchor has 4 coordinate values. The data type of bbox_pred is float32 or float64.

  • cls_logits (Variable) – A 3-D Tensor with shape \([N, M, C]\) represents the predicted categories of all anchors. \(N\) is the batch size, \(M\) is the number of all anchors of one image, and \(C\) is the number of categories (Notice: excluding background). The data type of cls_logits is float32 or float64.

  • anchor_box (Variable) – A 2-D Tensor with shape \([M, 4]\) that represents the locations of all anchors. \(M\) is the number of all anchors of one image. Each anchor is represented as \([xmin, ymin, xmax, ymax]\), where \([xmin, ymin]\) is the top-left coordinate of the anchor box and \([xmax, ymax]\) is the bottom-right coordinate. The data type of anchor_box is float32 or float64. Please refer to the OP api_fluid_layers_anchor_generator for the generation of anchor_box.

  • anchor_var (Variable) – A 2-D Tensor with shape \([M, 4]\) that represents the expanded factors of anchor locations used in the loss function. \(M\) is the number of all anchors of one image, and each anchor has a 4-vector expanded factor. The data type of anchor_var is float32 or float64. Please refer to the OP api_fluid_layers_anchor_generator for the generation of anchor_var.

  • gt_boxes (Variable) – A 1-level 2-D LoDTensor with shape \([G, 4]\) represents locations of all ground-truth boxes. \(G\) is the total number of all ground-truth boxes in a mini-batch, and each ground-truth box has 4 coordinate values. The data type of gt_boxes is float32 or float64.

  • gt_labels (Variable) – A 1-level 2-D LoDTensor with shape \([G, 1]\) that represents the categories of all ground-truth boxes, with values in the range \([1, C]\). \(G\) is the total number of all ground-truth boxes in a mini-batch, and each ground-truth box has one category. The data type of gt_labels is int32.

  • is_crowd (Variable) – A 1-level 1-D LoDTensor with shape \([G]\) that indicates whether each ground-truth box is a crowd. If the value is 1, the corresponding box is a crowd and is ignored during training. \(G\) is the total number of all ground-truth boxes in a mini-batch. The data type of is_crowd is int32.

  • im_info (Variable) – A 2-D Tensor with shape \([N, 3]\) that represents the size information of the input images. \(N\) is the batch size. The size information of each image is a 3-vector holding the height and width of the network input together with the factor that scales the original image to the network input. The data type of im_info is float32.

  • num_classes (int32) – The number of categories for classification. The default value is 1.

  • positive_overlap (float32) – Minimum overlap required between an anchor and a ground-truth box for the anchor to be a positive sample. The default value is 0.5.

  • negative_overlap (float32) – Maximum overlap allowed between an anchor and a ground-truth box for the anchor to be a negative sample. The default value is 0.4. negative_overlap should be less than or equal to positive_overlap; if it is not, negative_overlap is used as the actual value of positive_overlap.

Returns

predict_scores (Variable): A 2-D Tensor with shape \([F+B, C]\) represents category prediction belonging to positive and negative samples. \(F\) is the number of positive samples in a mini-batch, \(B\) is the number of negative samples, and \(C\) is the number of categories (Notice: excluding background). The data type of predict_scores is float32 or float64.

predict_location (Variable): A 2-D Tensor with shape \([F, 4]\) represents location prediction belonging to positive samples. \(F\) is the number of positive samples, and each sample has 4 coordinate values. The data type of predict_location is float32 or float64.

target_label (Variable): A 2-D Tensor with shape \([F+B, 1]\) represents target labels for classification belonging to positive and negative samples. \(F\) is the number of positive samples, \(B\) is the number of negative samples, and each sample has one target category. The data type of target_label is int32.

target_bbox (Variable): A 2-D Tensor with shape \([F, 4]\) represents target locations for box regression belonging to positive samples. \(F\) is the number of positive samples, and each sample has 4 coordinate values. The data type of target_bbox is float32 or float64.

bbox_inside_weight (Variable): A 2-D Tensor with shape \([F, 4]\) that indicates whether each positive sample is a false positive. If a positive sample is a false positive, the corresponding entries in bbox_inside_weight are set to 0; otherwise they are set to 1. \(F\) is the total number of positive samples in a mini-batch, and each sample has 4 coordinate values. The data type of bbox_inside_weight is float32 or float64.

fg_num (Variable): A 2-D Tensor with shape \([N, 1]\) represents the number of positive samples. \(N\) is the batch size. Notice: the number of positive samples is used as the denominator of the later loss function. To avoid a zero denominator, this OP adds 1 to the actual number of positive samples of each image. The data type of fg_num is int32.
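The role of fg_num as a denominator can be sketched as follows. This is only an illustration: the helper `normalized_loss` is hypothetical, and summing fg_num over the batch before dividing is an assumption about how a training script might use it, not something this page specifies.

```python
def normalized_loss(per_sample_losses, fg_num):
    """Divide the summed loss by the total foreground count.
    fg_num has shape [N, 1] and already includes the +1 per image."""
    total_fg = sum(row[0] for row in fg_num)
    return sum(per_sample_losses) / total_fg


# Image 0 has 2 positives (stored as 3); image 1 has none (stored as 1).
fg_num = [[3], [1]]
loss = normalized_loss([0.5, 1.5, 2.0], fg_num)
```

Because every image contributes at least 1, the denominator stays non-zero even when no anchor in the batch is positive.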

Return type

A tuple with 6 Variables

Examples

import paddle.fluid as fluid

# Predictions for all anchors: N=1 image, M=100 anchors, C=10 categories.
bbox_pred = fluid.data(name='bbox_pred', shape=[1, 100, 4],
                       dtype='float32')
cls_logits = fluid.data(name='cls_logits', shape=[1, 100, 10],
                        dtype='float32')
# Anchor locations and their expanded factors.
anchor_box = fluid.data(name='anchor_box', shape=[100, 4],
                        dtype='float32')
anchor_var = fluid.data(name='anchor_var', shape=[100, 4],
                        dtype='float32')
# Ground-truth boxes, categories, and crowd flags.
gt_boxes = fluid.data(name='gt_boxes', shape=[10, 4],
                      dtype='float32')
gt_labels = fluid.data(name='gt_labels', shape=[10, 1],
                       dtype='int32')
is_crowd = fluid.data(name='is_crowd', shape=[1],
                      dtype='int32')
# Size information of each image: [height, width, scale].
im_info = fluid.data(name='im_info', shape=[1, 3],
                     dtype='float32')
# Assign targets; num_classes matches C in cls_logits.
(score_pred, loc_pred, score_target, loc_target,
 bbox_inside_weight, fg_num) = fluid.layers.retinanet_target_assign(
    bbox_pred, cls_logits, anchor_box, anchor_var,
    gt_boxes, gt_labels, is_crowd, im_info, num_classes=10)