# 475_VSDLM
[DOI: 10.5281/zenodo.17494543](https://doi.org/10.5281/zenodo.17494543) [DeepWiki](https://deepwiki.com/PINTO0309/vsdlm)

Visual-only speech detection driven by lip movements.

There are countless situations where you can't hear the audio, and it's really frustrating.

https://github.com/user-attachments/assets/e204662f-dd54-4c19-8d9f-5a1fd8f4fab8

https://github.com/user-attachments/assets/9d68a0f0-b769-473d-8eeb-43ac7447c499

|Variant|Size|F1|CPU<br>inference<br>latency|ONNX|
|:-:|:-:|:-:|:-:|:-:|
|P|112 KB|0.9502|0.18 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_p.onnx)|
|N|176 KB|0.9586|0.31 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_n.onnx)|
|S|494 KB|0.9696|0.50 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_s.onnx)|
|C|875 KB|0.9777|0.60 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_c.onnx)|
|M|1.7 MB|0.9801|0.70 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_m.onnx)|
|L|6.4 MB|0.9891|0.91 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_l.onnx)|

## Setup

```bash
# Clone the repository and enter it
git clone https://github.com/PINTO0309/VSDLM.git && cd VSDLM
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create the virtual environment and install the project dependencies
uv sync
# Activate the virtual environment
source .venv/bin/activate
```

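The inference commands in the next section pass `vsdlm_l.onnx` to `-vm`, so the chosen variant has to be on disk first. A minimal sketch of fetching it, assuming the release URL pattern used by the Download links in the table above; `vsdlm_url` and `download_vsdlm` are hypothetical helper names, not part of the repository, and the detection model passed to `-m` is not covered here.

```bash
# Hypothetical helpers (not part of the repository).

# Print the release URL for a variant letter (p, n, s, c, m, l),
# following the pattern of the Download links in the table.
vsdlm_url() {
  printf 'https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_%s.onnx\n' "$1"
}

# Download one variant into the current directory as vsdlm_<variant>.onnx.
# -f fails on HTTP errors, -L follows the release redirect.
download_vsdlm() {
  curl -fL -o "vsdlm_$1.onnx" "$(vsdlm_url "$1")"
}
```

With these defined, `download_vsdlm l` would save `vsdlm_l.onnx`, the file name the inference commands expect.
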
## Inference

```bash
# Run the demo with the CUDA execution provider
uv run demo_vsdlm.py \
-v 0 \
-m deimv2_dinov3_s_wholebody34_1750query_n_batch_640x640.onnx \
-vm vsdlm_l.onnx \
-ep cuda

# Run the same demo with the TensorRT execution provider
uv run demo_vsdlm.py \
-v 0 \
-m deimv2_dinov3_s_wholebody34_1750query_n_batch_640x640.onnx \
-vm vsdlm_l.onnx \
-ep tensorrt
```

## Arch

<img width="300" alt="vsdlm_p" src="https://github.com/user-attachments/assets/1616215b-99f0-4c28-a1fa-b3dc647adf11" />

## Citation

If you find this project useful, please consider citing:

```bibtex
@software{hyodo2025vsdlm,
  author = {Katsuya Hyodo},
  title = {PINTO0309/VSDLM},
  month = {10},
  year = {2025},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17494543},
  url = {https://github.com/PINTO0309/vsdlm},
  abstract = {Visual only speech detection by lip movement.},
}
```

## Acknowledgements

1. https://zenodo.org/records/3625687 - CC BY 4.0 License
2. https://spandh.dcs.shef.ac.uk/avlombard - CC BY 4.0 License
3. https://github.com/hhj1897/face_alignment - MIT License
4. https://github.com/hhj1897/face_detection - MIT License
5. https://github.com/PINTO0309/Face_Mask_Augmentation - MIT License
6. https://github.com/PINTO0309/PINTO_model_zoo/tree/main/472_DEIMv2-Wholebody34 - Apache 2.0 License
7. https://github.com/PINTO0309/VSDLM - MIT License