Possible Evaluation Error in val.py #4251

@Johnathan-Xie

Description

I believe there is a small error in the current validation code that slightly lowers the mAP at IoU=0.5:0.95, so fixing it should give a small boost to that main metric.

Proof of error
After cloning the official repository and running "python val.py --data coco.yaml --img 640 --conf 0.001 --iou 0.65 --weights yolov5s.pt", I added a print statement that prints the AP at each of the 10 IoU thresholds, which gives:
[ 0.54585 0.51816 0.49107 0.45745 0.42016 0.3753 0.31443 0.23312 0.13052 0.021785]

Now if I run the exact same command but change the IoU evaluation thresholds to only 6 points, 0.7-0.95, I get this output:
[ 0.44055 0.38887 0.32283 0.23714 0.13167 0.021903]
If the code were correct, the last 6 values of the 10-point run and the values of the 6-point run should match; instead, the 6-point run is slightly higher.
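To make the comparison concrete, here is a quick check using only the numbers printed above (plain NumPy arithmetic; nothing here is taken from the repository):

    import numpy as np

    # Per-threshold AP from the two runs above
    ap10 = np.array([0.54585, 0.51816, 0.49107, 0.45745, 0.42016,
                     0.3753, 0.31443, 0.23312, 0.13052, 0.021785])  # IoU 0.50:0.95
    ap6 = np.array([0.44055, 0.38887, 0.32283, 0.23714, 0.13167, 0.021903])  # IoU 0.70:0.95

    # If matching did not depend on the lowest threshold, the last six entries
    # of the 10-point run would equal the 6-point run.
    print(ap6 - ap10[-6:])  # every entry comes out positive (6-point run is higher)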

Explanation of Error
Lines 63-72 of val.py show:

        ious, i = box_iou(predictions[pi, 0:4], labels[ti, 1:5]).max(1)  # best ious, indices
        detected_set = set()
        for j in (ious > iouv[0]).nonzero():
            d = ti[i[j]]  # detected label
            if d.item() not in detected_set:
                detected_set.add(d.item())
                detected.append(d)  # append detections
                correct[pi[j]] = ious[j] > iouv  # iou_thres is 1xn
                if len(detected) == nl:  # all labels already located in image
                    break

The code selects all predictions whose best IoU with a target of the same class exceeds the lowest threshold (iouv[0]) and then iterates through them, recording which target each one matches. While this gives an accurate measurement for AP@0.5, the predictions are visited in an effectively arbitrary order with respect to IoU. This means that if two predictions with IoU > 0.5, say 0.6 and 0.7, both match the same target, the 0.6 prediction could be chosen first; because that target is then in detected_set, the 0.7 prediction never replaces it. When evaluating mAP@0.5:0.95, the result can therefore be slightly lower than it should be, since a target is occasionally credited with a lower-IoU match even though a higher-IoU prediction for that target exists.
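To illustrate the kind of matching I have in mind, here is a rough, untested sketch of a standalone helper that ranks candidate (prediction, label) pairs by IoU before assigning them, so each label is claimed by its highest-IoU prediction. The function name and signature are hypothetical and not part of the existing val.py; box_iou is assumed to be the same pairwise IoU utility used in the snippet above:

    import torch

    def match_by_best_iou(pred_boxes, label_boxes, iouv, box_iou):
        # Hypothetical helper, not the existing val.py code. pred_boxes (N, 4) and
        # label_boxes (M, 4) hold boxes of a single class; iouv is the 1-D tensor of
        # IoU thresholds; box_iou returns an (N, M) tensor of pairwise IoUs.
        correct = torch.zeros(pred_boxes.shape[0], iouv.numel(), dtype=torch.bool)
        ious = box_iou(pred_boxes, label_boxes)    # (N, M) pairwise IoUs
        candidates = (ious > iouv[0]).nonzero()    # all pairs above the lowest threshold
        if candidates.numel():
            # Visit candidate pairs in descending IoU order so each label is
            # claimed by its best-matching prediction, not an arbitrary one.
            order = ious[candidates[:, 0], candidates[:, 1]].argsort(descending=True)
            used_preds, used_labels = set(), set()
            for pi, ti in candidates[order].tolist():
                if pi in used_preds or ti in used_labels:
                    continue
                used_preds.add(pi)
                used_labels.add(ti)
                correct[pi] = ious[pi, ti] > iouv  # per-threshold correctness row
        return correct

Assigning matches in descending IoU order guarantees that, at every threshold, a target is never credited with a lower-IoU prediction when a higher-IoU one for the same target exists.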

I am not certain what the best way to integrate a fix like this would be, as I'm not too familiar with the rest of the testing code. I'm also not sure whether this affects the official results, since those seem to be evaluated with the COCO tools.

Metadata

Labels

TODO (High priority items), bug (Something isn't working)
