📘 Conference Paper (To Appear) • 📝 arXiv Paper • 📦 Download Datasets • 📄 LVVO Dataset arXiv Paper
This repository contains the source code for the research paper:
"Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment"
by Dipayan Biswas, Shishir Shah, and Jaspal Subhlok (University of Houston)
This work presents a deep learning framework for detecting visual objects, such as tables, charts, images, and illustrations, in lecture video frames using transfer learning and dataset enrichment. Six object detection models were fine-tuned and evaluated on three datasets (LDD, LPM, and the newly introduced LVVO), with YOLOv11 achieving the best performance. The YOLOv11 model was further improved through cross-dataset training and a semi-supervised auto-labeling pipeline, demonstrating that transfer learning and dataset enrichment substantially improve detection accuracy when labeled data is limited.
- Introduced the LVVO dataset comprising 4,000 annotated lecture video frames.
- Benchmarked six state-of-the-art object detection models across LVVO, LDD, and LPM datasets.
- Addressed the challenge of generalization through cross-dataset training and analysis on diverse educational video sources.
- Boosted model accuracy with limited labeled data using a semi-supervised auto-labeling pipeline (a generic sketch of the idea is shown below).
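The auto-labeling step can be pictured as a standard pseudo-labeling loop: a detector trained on the labeled subset predicts boxes on unlabeled frames, and only high-confidence detections are kept as additional training labels. The snippet below is a generic sketch of that idea using the Ultralytics API, not the exact pipeline from the paper; the checkpoint path, frame directory, and the 0.6 threshold are assumptions.

```python
# Generic pseudo-labeling sketch (illustrative only, not the paper's exact pipeline)
from ultralytics import YOLO

# Teacher model: YOLOv11 fine-tuned on the available labeled frames (path is hypothetical)
teacher = YOLO("experiments/train_yolo_LVVO1k/weights/best.pt")

# Predict on unlabeled lecture frames, keeping only confident detections and saving
# them as YOLO-format label files that can be merged into a second training round.
teacher.predict(
    source="data/unlabeled_frames",  # hypothetical directory of unlabeled frames
    conf=0.6,                        # confidence threshold for accepting pseudo-labels
    save_txt=True,                   # write one .txt label file per frame
    save_conf=True,                  # include confidence scores in the labels
)
```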
We utilize three annotated datasets: LVVO (4,000 frames, introduced in this work), LDD, and LPM.
🔗 See the LVVO Dataset Repository for details and downloads.
Clone the repository:

```bash
git clone https://github.com/dipayan1109033/edu-video-visual-detection.git
cd edu-video-visual-detection
```

Clone the object detection metrics utility into `src/utils`:

```bash
cd src/utils
git clone https://github.com/dipayan1109033/calculate_ODmetrics.git
```
Set up a virtual environment and install the dependencies from `requirements.txt`.
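A minimal setup, assuming Python 3 and a Unix-like shell (adjust for your environment):

```bash
python3 -m venv .venv            # create an isolated environment
source .venv/bin/activate        # activate it
pip install -r requirements.txt  # install the project's dependencies
```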
Download the datasets from the LVVO Dataset Repository, place the zip files in `data/processed/`, and unzip them:

```bash
cd data/processed
unzip dataset_name.zip
```
This project supports training both YOLOv11 and torchvision-based models (e.g., Faster R-CNN, RetinaNet, FCOS) using either manual split ratios or predefined dataset splits.
Predefined splits can be generated using the following script:
```bash
python src/prepare/setup_experiment.py
```
<details>
<summary>Click to expand key training arguments</summary>

- `model.identifier`: Model name (`yolo`, `rcnn`, `maskrcnn`, `retinanet`, `fcos`, `ssd`)
- `model.pretrained_model`: Path or name of pretrained weights (for YOLOv11)
- `model.code`: Two-digit code for torchvision models, specifying the backbone and number of frozen layers. See `src/models/torchvision_models.py` for details.
- `exp.mode`: Training mode (`"train"` or `"crossval"`)
- `exp.name`: User-given experiment name (used to save logs and checkpoints)
- `data.folder`: Dataset directory name (used with `split_ratios`)
- `data.split_ratios`: Train/val/test ratios, e.g., `[0.8, 0.2, 0.0]`
- `data.split_code`: Identifier for a custom dataset split created using `src/prepare/setup_experiment.py` and saved in `experiments/input/custom_splits/`
- `data.num_folds`: Number of folds for cross-validation (e.g., `5`)
- `train.lr`: Learning rate (e.g., `0.001`)
- `train.epoch`: Number of training epochs

</details>
➡️ For additional arguments and full configuration options, refer to `configs/experiment.yaml`.
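The dotted override keys above suggest a Hydra-style config grouped by section. The sketch below shows only this assumed shape, with values taken from the examples in this README; the actual defaults and any additional fields live in `configs/experiment.yaml`:

```yaml
# Assumed structure only; refer to configs/experiment.yaml for the real defaults.
model:
  identifier: yolo              # yolo | rcnn | maskrcnn | retinanet | fcos | ssd
  pretrained_model: yolo11m.pt  # pretrained weights for YOLOv11
  code: 33                      # torchvision backbone / frozen-layer code
exp:
  mode: train                   # train | crossval
  name: train_yolo_LVVO1k
data:
  folder: LVVO_1k
  split_ratios: [0.8, 0.2, 0.0]
  split_code: null              # e.g., LVVO_4k_val200_seed42
  num_folds: 5
train:
  lr: 0.001
  epoch: 30
```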
✅ YOLOv11 Training with split ratios
```bash
python src/main.py model.identifier="yolo" model.pretrained_model="yolo11m.pt" exp.mode="train" exp.name="train_yolo_LVVO1k" data.folder="LVVO_1k" data.split_ratios="[0.8,0.2,0.0]" train.lr=0.001 train.epoch=30
```
✅ YOLOv11 Training with split code
```bash
python src/main.py model.identifier="yolo" model.pretrained_model="yolo11m.pt" exp.mode="train" exp.name="train_yolo_csplitLVVO4k" data.split_code="LVVO_4k_val200_seed42" train.lr=0.001 train.epoch=30
```
✅ Torchvision Model Cross-validation (e.g., Faster R-CNN)
```bash
python src/main.py model.identifier="rcnn" model.code=33 exp.mode="crossval" exp.name="crossval_rcnn_LVVO1k" data.folder="LVVO_1k" data.num_folds=5 train.lr=0.001 train.epoch=30
```
✅ YOLOv11 Cross-validation with split code
```bash
python src/main.py model.identifier="yolo" model.pretrained_model="yolo11m.pt" exp.mode="crossval" exp.name="crossval_yolo_csplitLVVO4k" data.split_code="LVVO_4k_val200_cv5_seed42" train.lr=0.001 train.epoch=30
```
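After training, the resulting YOLOv11 checkpoint can be loaded with the Ultralytics API for inference on new lecture frames. A minimal sketch; the checkpoint path below is an assumption (the actual location depends on `exp.name` and the experiment output layout):

```python
from ultralytics import YOLO

# Illustrative path; substitute the checkpoint written by your training run
model = YOLO("experiments/train_yolo_LVVO1k/weights/best.pt")

# Detect tables, charts, images, and illustrations in a single lecture frame
results = model.predict("lecture_frame.jpg", conf=0.25)
for r in results:
    for box, cls, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        print(f"class={int(cls)}  conf={conf:.2f}  box={box.tolist()}")
```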
📊 Table 1: AP50 (%) Comparison of Object Detection Models Across Datasets (80%:20% Train-Validation Split)

| Model | LVVO_1k | LDD | LPM |
|---|---|---|---|
| SSD | 83.81 | 87.79 | 85.73 |
| RetinaNet | 78.34 | 88.82 | 86.92 |
| FCOS | 83.46 | 89.12 | 87.58 |
| Faster-RCNN | 85.38 | 88.72 | 87.40 |
| Mask-RCNN | 85.74 | 89.31 | 86.74 |
| YOLOv11 | 89.45 | 94.29 | 92.08 |
Note: Table 1 reports the numerical results visualized in Figure 2 of the paper.
📊 Table 2: Precision, Recall, and F1 Score Comparison of Logiform and YOLOv11

| Model | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|
| Logiform | 64.33 ± 2.73 | 62.88 ± 3.29 | 63.57 ± 2.67 |
| YOLOv11 | 86.76 ± 1.87 | 83.60 ± 1.56 | 85.14 ± 1.25 |
Note: Table 2 reports the numerical results visualized in Figure 3 of the paper.
📊 Table 3: Cross-Dataset Evaluation (Trained on One Dataset, Tested on All Three)

**AP50 (%)**

| Training Dataset | Test on LVVO_1k | Test on LDD | Test on LPM |
|---|---|---|---|
| LVVO_1k | 90.95 ± 1.12 | 69.69 ± 2.53 | 74.34 ± 2.50 |
| LDD | 75.92 ± 1.67 | 93.56 ± 0.77 | 68.83 ± 3.68 |
| LPM | 80.05 ± 1.62 | 58.66 ± 3.11 | 92.65 ± 0.66 |

**AP (%)**

| Training Dataset | Test on LVVO_1k | Test on LDD | Test on LPM |
|---|---|---|---|
| LVVO_1k | 77.93 ± 1.38 | 50.09 ± 2.60 | 50.10 ± 2.31 |
| LDD | 59.57 ± 1.72 | 87.74 ± 0.58 | 40.95 ± 3.17 |
| LPM | 55.35 ± 1.30 | 44.37 ± 2.81 | 77.49 ± 0.86 |
Note: Table 3 reports the numerical results visualized in Figure 4 of the paper.
📊 Table 4: Baseline vs. Comprehensive and Progressive Fine-Tuning (FT)

| Model | AP50 (%) | AP75 (%) | AP (%) | F1 Score (%) |
|---|---|---|---|---|
| Baseline | 90.75 ± 1.25 | 83.91 ± 1.86 | 77.60 ± 0.74 | 85.14 ± 1.25 |
| Comprehensive FT | 94.67 ± 0.74 | 90.15 ± 1.63 | 83.89 ± 1.05 | 89.44 ± 1.82 |
| Progressive FT | 95.32 ± 1.27 | 90.48 ± 2.06 | 84.19 ± 1.37 | 89.93 ± 1.52 |
Note: Table 4 provides detailed results corresponding to Table II in the paper.
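The tables above report COCO-style average precision (AP50, AP75, and AP averaged over IoU thresholds) alongside F1 score. Evaluation in this repository uses the bundled calculate_ODmetrics utility; as a generic alternative for sanity-checking results, the same quantities can be computed with torchmetrics, as sketched below (the boxes shown are dummy values):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# One dummy prediction/ground-truth pair, boxes in (x1, y1, x2, y2) format
preds = [{
    "boxes": torch.tensor([[50.0, 60.0, 200.0, 180.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[55.0, 65.0, 205.0, 175.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision()   # COCO-style IoU thresholds 0.50:0.95
metric.update(preds, targets)
scores = metric.compute()
print(scores["map_50"], scores["map_75"], scores["map"])  # AP50, AP75, AP
```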
If you use this code or dataset, please cite:
```bibtex
@inproceedings{biswas2025visualcontent,
  title     = {Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment},
  author    = {Biswas, Dipayan and Shah, Shishir and Subhlok, Jaspal},
  booktitle = {Proceedings of the IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR)},
  year      = {2025},
  note      = {To appear}
}
```
This project is licensed under the MIT License.
See the LICENSE file for details.