Commit 2b6a671

Merge pull request #464 from PINTO0309/475_VSDLM
475_VSDLM
2 parents 562f47d + 3ccf496 commit 2b6a671

6 files changed

+2037
-0
lines changed

475_VSDLM/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Katsuya Hyodo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

475_VSDLM/README.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# 475_VSDLM
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17494543.svg)](https://doi.org/10.5281/zenodo.17494543) ![GitHub License](https://img.shields.io/github/license/pinto0309/vsdlm) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/PINTO0309/vsdlm)

Visual-only speech detection driven by lip movements.

There are countless situations where you can't hear the audio, and it's really frustrating.

https://github.com/user-attachments/assets/e204662f-dd54-4c19-8d9f-5a1fd8f4fab8

https://github.com/user-attachments/assets/9d68a0f0-b769-473d-8eeb-43ac7447c499

|Variant|Size|F1|CPU<br>inference<br>latency|ONNX|
|:-:|:-:|:-:|:-:|:-:|
|P|112 KB|0.9502|0.18 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_p.onnx)|
|N|176 KB|0.9586|0.31 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_n.onnx)|
|S|494 KB|0.9696|0.50 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_s.onnx)|
|C|875 KB|0.9777|0.60 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_c.onnx)|
|M|1.7 MB|0.9801|0.70 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_m.onnx)|
|L|6.4 MB|0.9891|0.91 ms|[Download](https://github.com/PINTO0309/VSDLM/releases/download/onnx/vsdlm_l.onnx)|

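The table trades size and latency against F1, so choosing a variant usually means fixing a budget first. The snippet below is a minimal sketch of that selection, not part of the repository: the `Variant` class and `pick_variant` helper are hypothetical names, the figures are copied from the table, and the MB sizes are converted assuming 1 MB = 1000 KB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    """One row of the variant table (hypothetical helper type)."""
    name: str
    size_kb: float
    f1: float
    cpu_latency_ms: float

    @property
    def url(self) -> str:
        # URL pattern taken from the Download links in the table.
        return (
            "https://github.com/PINTO0309/VSDLM/releases/download/onnx/"
            f"vsdlm_{self.name.lower()}.onnx"
        )


# Figures copied from the table; 1.7 MB and 6.4 MB are converted
# under the assumption that 1 MB = 1000 KB.
VARIANTS = [
    Variant("P", 112, 0.9502, 0.18),
    Variant("N", 176, 0.9586, 0.31),
    Variant("S", 494, 0.9696, 0.50),
    Variant("C", 875, 0.9777, 0.60),
    Variant("M", 1700, 0.9801, 0.70),
    Variant("L", 6400, 0.9891, 0.91),
]


def pick_variant(
    max_latency_ms: Optional[float] = None,
    max_size_kb: Optional[float] = None,
) -> Optional[Variant]:
    """Return the highest-F1 variant within both budgets, or None."""
    candidates = [
        v for v in VARIANTS
        if (max_latency_ms is None or v.cpu_latency_ms <= max_latency_ms)
        and (max_size_kb is None or v.size_kb <= max_size_kb)
    ]
    return max(candidates, key=lambda v: v.f1, default=None)


if __name__ == "__main__":
    choice = pick_variant(max_latency_ms=0.5)
    if choice is not None:
        print(choice.name, choice.url)
```

The latency column was measured on the author's CPU, so treat the budget as a relative guide and re-measure on your own hardware.
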
## Setup

```bash
git clone https://github.com/PINTO0309/VSDLM.git && cd VSDLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
```

## Inference

```bash
uv run demo_vsdlm.py \
  -v 0 \
  -m deimv2_dinov3_s_wholebody34_1750query_n_batch_640x640.onnx \
  -vm vsdlm_l.onnx \
  -ep cuda

uv run demo_vsdlm.py \
  -v 0 \
  -m deimv2_dinov3_s_wholebody34_1750query_n_batch_640x640.onnx \
  -vm vsdlm_l.onnx \
  -ep tensorrt
```

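This README does not state the tensor shapes the VSDLM models expect, so it can help to inspect a downloaded file before wiring it into your own pipeline. The following is a rough sketch under stated assumptions: `providers_for` and `describe_model` are hypothetical helpers, the mapping from `-ep` values to ONNX Runtime provider names is a guess at what the demo does internally, and the `cpu` key is added for illustration only (the commands above show just `cuda` and `tensorrt`).

```python
from typing import List

# Assumed mapping from the demo's -ep values to ONNX Runtime provider names.
# The provider names are real ONNX Runtime identifiers; the mapping itself
# is not confirmed by this README.
EP_TO_PROVIDER = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "tensorrt": "TensorrtExecutionProvider",
}


def providers_for(ep: str) -> List[str]:
    """Return an ordered provider list, always ending with the CPU fallback."""
    key = ep.lower()
    if key not in EP_TO_PROVIDER:
        raise ValueError(f"unknown execution provider: {ep!r}")
    chosen = EP_TO_PROVIDER[key]
    if chosen == "CPUExecutionProvider":
        return [chosen]
    return [chosen, "CPUExecutionProvider"]


def describe_model(model_path: str, ep: str = "cpu") -> None:
    """Print the input and output tensors of an ONNX model."""
    # Imported lazily so the mapping above can be used without onnxruntime.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=providers_for(ep))
    for tensor in session.get_inputs():
        print("input ", tensor.name, tensor.shape, tensor.type)
    for tensor in session.get_outputs():
        print("output", tensor.name, tensor.shape, tensor.type)
```

Calling `describe_model("vsdlm_l.onnx")` after downloading the model lists the names, shapes, and element types to feed and read back.
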
## Arch

<img width="300" alt="vsdlm_p" src="https://github.com/user-attachments/assets/1616215b-99f0-4c28-a1fa-b3dc647adf11" />

## Citation

If you find this project useful, please consider citing:

```bibtex
@software{hyodo2025vsdlm,
  author = {Katsuya Hyodo},
  title = {PINTO0309/VSDLM},
  month = {10},
  year = {2025},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17494543},
  url = {https://github.com/PINTO0309/vsdlm},
  abstract = {Visual only speech detection by lip movement.},
}
```

## Acknowledgements

1. https://zenodo.org/records/3625687 - CC BY 4.0 License
2. https://spandh.dcs.shef.ac.uk/avlombard - CC BY 4.0 License
3. https://github.com/hhj1897/face_alignment - MIT License
4. https://github.com/hhj1897/face_detection - MIT License
5. https://github.com/PINTO0309/Face_Mask_Augmentation - MIT License
6. https://github.com/PINTO0309/PINTO_model_zoo/tree/main/472_DEIMv2-Wholebody34 - Apache 2.0
7. https://github.com/PINTO0309/VSDLM - MIT License
