CHAIR is a rule-based metric for evaluating object hallucination in caption generation.

This is a modified version of the original implementation, which can be found at https://github.com/LisaAnne/Hallucination/blob/master/utils/chair.py. I did NOT change its calculation, to keep consistency with the original. In particular, I added a new metric, Recall, which measures the percentage of recalled ground-truth objects over all ground-truth objects. Known issues and other modifications are listed below.
Known issues:

- The `hallucination_idxs` field of the CHAIR evaluator output is NOT correct when the caption contains COCO double words (LisaAnne/Hallucination#4). This problem may also exist in the original CHAIR repo and does not affect the calculation of CHAIR itself. If this field is necessary for your use case, please consider fixing it.
- There are two extra newlines at the beginning and the end of the `synonyms_txt` variable in `chair.py` that should not be present. They introduce an empty-string element in both `inverse_synonym_dict` and `mscoco_objects`. This should not affect the calculation of CHAIR, because `nltk.word_tokenize` results and `id_to_name[annotation['category_id']]` should not produce empty strings. If this issue affects your use case, you can fix it and rebuild the cache (see the sketch after this list).
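A minimal sketch of one possible fix, assuming `synonyms_txt` is split into comma-separated synonym groups, one group per line (the parsing here is illustrative, not the exact code in `chair.py`):

```python
# Toy value reproducing the issue: note the leading and trailing newlines.
synonyms_txt = "\nbicycle, bike, bicycles, bikes\ncar, cars\n"

mscoco_objects = []        # every word that can refer to a COCO object
inverse_synonym_dict = {}  # synonym word -> canonical node word

# .strip() drops the spurious surrounding newlines, so no empty-string
# entries reach mscoco_objects or inverse_synonym_dict.
for line in synonyms_txt.strip().split('\n'):
    synonyms = [word.strip() for word in line.split(',')]
    mscoco_objects.extend(synonyms)
    for word in synonyms:
        inverse_synonym_dict[word] = synonyms[0]
```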
Other modifications:

- Adapted the calculation of CHAIRi and CHAIRs for Python 3; supports both JSON and JSONL input files.
- Integrated `synonyms.txt` into the script to make it standalone.
- Removed the text-overlap metrics BLEU-n, CIDEr, and ROUGE.
- Added the new metric Recall, which measures overall coverage of node words (i.e., lemmas of ground-truth objects); see the sketch after this list.
- Added a pickle cache mechanism to speed up repeated evaluations.
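For reference, a minimal sketch of how the three reported numbers relate; the function name and set-based inputs are illustrative, not the actual internals of `chair.py`:

```python
def chair_and_recall(samples):
    """Compute dataset-level CHAIRi, CHAIRs and Recall.

    samples: list of (mentioned, gt) pairs, where `mentioned` is the set of
    COCO node words found in a caption and `gt` is the set of ground-truth
    node words of the corresponding image.
    """
    num_mentions = num_hallucinated = 0   # for CHAIRi
    num_caps = num_caps_hallucinated = 0  # for CHAIRs
    num_gt = num_gt_recalled = 0          # for Recall
    for mentioned, gt in samples:
        hallucinated = mentioned - gt     # mentioned objects absent from ground truth
        num_mentions += len(mentioned)
        num_hallucinated += len(hallucinated)
        num_caps += 1
        num_caps_hallucinated += bool(hallucinated)
        num_gt += len(gt)
        num_gt_recalled += len(gt & mentioned)
    chair_i = num_hallucinated / max(num_mentions, 1)
    chair_s = num_caps_hallucinated / max(num_caps, 1)
    recall = num_gt_recalled / max(num_gt, 1)
    return chair_i, chair_s, recall


# Toy usage: one clean caption, one caption hallucinating "dog".
print(chair_and_recall([
    ({"person", "bicycle"}, {"person", "bicycle", "car"}),
    ({"person", "dog"}, {"person"}),
]))  # -> (0.25, 0.5, 0.75)
```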
## Requirements

- pattern
- nltk
- tqdm
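These can be installed via pip, for example:

```bash
pip install pattern nltk tqdm
```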
I have already serialized the initialized CHAIR evaluator object for COCO into a pickle; you can use it directly by setting `--cache`, see [Example Run](#example-run).
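If you would rather load the cached evaluator in your own code than through the CLI, a minimal sketch (assuming the pickle stores the evaluator object directly; `chair.pkl` as in the example run):

```python
import pickle

# Note: chair.py must be importable (e.g., run this from the repo root)
# so that pickle can resolve the evaluator's class when loading.
with open('chair.pkl', 'rb') as f:
    evaluator = pickle.load(f)  # pre-initialized CHAIR evaluator for COCO
```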
Alternatively, if you want to build the CHAIR evaluator from scratch, follow these steps.
First, download http://images.cocodataset.org/annotations/annotations_trainval2014.zip and put these files into the `coco_annotations` dir:

- captions_train2014.json
- captions_val2014.json
- instances_train2014.json
- instances_val2014.json
Then run `python chair.py --cache <new_cache_path>`. The script will complain about the missing caption inputs, but that is okay: the cache will still be built.
## Example Run

```bash
python chair.py \
    --cap_file example_inputs.jsonl \
    --image_id_key image_id \
    --caption_key caption \
    --cache chair.pkl \
    --save_path outputs.json
```

Outputs:

```
CHAIRs : 0.0
CHAIRi : 0.0
Recall : 85.7
```
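The `--cap_file` input (`example_inputs.jsonl` above) is either a JSON file holding a list of dicts or a JSONL file with one dict per line; the key names are configurable via `--image_id_key` and `--caption_key`. A hypothetical line:

```json
{"image_id": 42, "caption": "a man riding a bicycle down the street"}
```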
## Arguments

```python
parser.add_argument("--cap_file", type=str, default='',
                    help="path to a json or jsonl file storing image ids and their captions as a list of dicts.")
parser.add_argument("--image_id_key", type=str, default="image_id",
                    help="in each dict of cap_file, the key that stores the COCO image id.")
parser.add_argument("--caption_key", type=str, default="caption",
                    help="in each dict of cap_file, the key that stores the caption of the image.")
parser.add_argument("--cache", type=str, default="chair.pkl",
                    help="pre-initialized CHAIR evaluator object, for fast loading.")
parser.add_argument("--coco_path", type=str, default='coco_annotations',
                    help="only used when regenerating the CHAIR evaluator object; ignored when a cached evaluator is used.")
parser.add_argument("--save_path", type=str, default="",
                    help="save CHAIR evaluation results to json, useful for debugging the caption model.")
```

Since the original implementation is written in Python 2 and needs intermediate results to run, I have not tested it myself.
I've tried my best to keep consistency, so it should reproduce the CHAIR results of the original code, but there is no warranty.
This repo is not actively maintained.