
Commit d9b370c

Merge pull request #379 from Huanshere/i18n

I18n

2 parents 43033b9 + 74e6379


43 files changed: +2021 additions, -1045 deletions

README.md

Lines changed: 17 additions & 15 deletions
@@ -4,13 +4,11 @@

 # Connect the World, Frame by Frame

-[Website](https://videolingo.io) | [Documentation](https://docs.videolingo.io/docs/start)
-
-[**English**](/README.md)[**中文**](/i18n/README.zh.md)
+[**English**](/README.md)[**简体中文**](/translations/README.zh.md)[**繁體中文**](/translations/README.zh-TW.md)[**日本語**](/translations/README.ja.md)[**Español**](/translations/README.es.md)[**Русский**](/translations/README.ru.md)[**Français**](/translations/README.fr.md)

 </div>

-## 🌟 Overview ([Try VideoLingo For Free!](https://videolingo.io))
+## 🌟 Overview ([Try VL Now!](https://videolingo.io))

 VideoLingo is an all-in-one video translation, localization, and dubbing tool aimed at generating Netflix-quality subtitles. It eliminates stiff machine translations and multi-line subtitles while adding high-quality dubbing, enabling global knowledge sharing across language barriers.

@@ -31,6 +29,8 @@ Key features:

 - 🚀 One-click startup and processing in Streamlit

+- 🌍 Multi-language support in Streamlit UI
+
 - 📝 Detailed logging with progress resumption

 Difference from similar projects: **Single-line subtitles only, superior translation quality, seamless dubbing experience**
@@ -68,7 +68,9 @@ https://github.com/user-attachments/assets/47d965b2-b4ab-4a0b-9d08-b49a7bf3508c

 ## Installation

-> **Note:** To use NVIDIA GPU acceleration on Windows, please complete the following steps first:
+You don't have to read the whole documentation; [**here**](https://share.fastgpt.in/chat/share?shareId=066w11n3r9aq6879r4z0v9rh) is an online AI agent to help you.
+
+> **Note:** For Windows users with an NVIDIA GPU, follow these steps before installation:
 > 1. Install [CUDA Toolkit 12.6](https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.76_windows.exe)
 > 2. Install [CUDNN 9.3.0](https://developer.download.nvidia.com/compute/cudnn/9.3.0/local_installers/cudnn_9.3.0_windows.exe)
 > 3. Add `C:\Program Files\NVIDIA\CUDNN\v9.3\bin\12.6` to your system PATH

@@ -77,7 +79,7 @@ https://github.com/user-attachments/assets/47d965b2-b4ab-4a0b-9d08-b49a7bf3508c
 > **Note:** FFmpeg is required. Please install it via package managers:
 > - Windows: ```choco install ffmpeg``` (via [Chocolatey](https://chocolatey.org/))
 > - macOS: ```brew install ffmpeg``` (via [Homebrew](https://brew.sh/))
-> - Linux: ```sudo apt install ffmpeg``` (Debian/Ubuntu) or ```sudo dnf install ffmpeg``` (Fedora)
+> - Linux: ```sudo apt install ffmpeg``` (Debian/Ubuntu)

 1. Clone the repository

@@ -108,12 +110,13 @@ docker build -t videolingo .
 docker run -d -p 8501:8501 --gpus all videolingo
 ```

-## API
-VideoLingo supports the OpenAI-like API format and various dubbing interfaces:
-- `claude-3-5-sonnet-20240620`, **`gemini-2.0-flash-exp`**, `gpt-4o`, `deepseek-coder`, ... (sorted by performance)
-- `azure-tts`, `openai-tts`, `siliconflow-fishtts`, **`fish-tts`**, `GPT-SoVITS`, `edge-tts`, `*custom-tts` (ask GPT to help you define it in custom_tts.py)
+## APIs
+VideoLingo supports the OpenAI-like API format and various TTS interfaces:
+- LLM: `claude-3-5-sonnet-20240620`, `deepseek-chat(v3)`, `gemini-2.0-flash-exp`, `gpt-4o`, ... (sorted by performance)
+- WhisperX: run whisperX locally or use the 302.ai API
+- TTS: `azure-tts`, `openai-tts`, `siliconflow-fishtts`, **`fish-tts`**, `GPT-SoVITS`, `edge-tts`, `*custom-tts` (you can add your own TTS in custom_tts.py!)

-> **Note:** VideoLingo is now integrated with [302.ai](https://gpt302.saaslink.net/C2oHR9), **one API KEY** for both LLM and TTS! Also supports fully local deployment using Ollama for LLM and Edge-TTS for dubbing, no cloud API required!
+> **Note:** VideoLingo works with **[302.ai](https://gpt302.saaslink.net/C2oHR9)** - one API key for all services (LLM, WhisperX, TTS). Or run locally with Ollama and Edge-TTS for free, no API needed!

 For detailed installation, API configuration, and batch mode instructions, please refer to the documentation: [English](/docs/pages/docs/start.en-US.md) | [中文](/docs/pages/docs/start.zh-CN.md)
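The `*custom-tts` entry above delegates to custom_tts.py, whose interface is not shown in this commit. A minimal sketch of the shape such a hook plausibly takes; the function name, signature, and engine call below are all assumptions, not the repo's actual API:

```python
# Hypothetical custom TTS hook for custom_tts.py. The real signature in the
# repo may differ; this only illustrates the expected contract: synthesize
# `text` and write a playable audio file to `save_path`.
from pathlib import Path

def custom_tts(text: str, save_path: str) -> None:
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    # audio_bytes = my_engine.synthesize(text)  # your own engine here (assumption)
    # Path(save_path).write_bytes(audio_bytes)
    raise NotImplementedError("plug in your own TTS engine here")
```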

@@ -135,11 +138,10 @@ This project is licensed under the Apache 2.0 License. Special thanks to the fol

 [whisperX](https://github.com/m-bain/whisperX), [yt-dlp](https://github.com/yt-dlp/yt-dlp), [json_repair](https://github.com/mangiucugna/json_repair), [BELLE](https://github.com/LianjiaTech/BELLE)

-## 📬 Contact Us
+## 📬 Contact Me

-- Join our Discord: https://discord.gg/9F2G92CWPp
 - Submit [Issues](https://github.com/Huanshere/VideoLingo/issues) or [Pull Requests](https://github.com/Huanshere/VideoLingo/pulls) on GitHub
-- Follow me on Twitter: [@Huanshere](https://twitter.com/Huanshere)
+- DM me on Twitter: [@Huanshere](https://twitter.com/Huanshere)
 - Email me at: [email protected]

 ## ⭐ Star History

@@ -148,4 +150,4 @@ This project is licensed under the Apache 2.0 License. Special thanks to the fol

 ---

-<p align="center">If you find VideoLingo helpful, please give us a ⭐️!</p>
+<p align="center">If you find VideoLingo helpful, please give me a ⭐️!</p>

config.yaml

Lines changed: 13 additions & 6 deletions
@@ -1,27 +1,33 @@
 # * Settings marked with * are advanced settings that won't appear in the Streamlit page and can only be modified manually in config.py
-version: "2.1.2"
+version: "2.2.0"
 ## ======================== Basic Settings ======================== ##
+display_language: "zh-CN"
+
 # API settings
 api:
   key: 'YOUR_API_KEY'
   base_url: 'https://api.302.ai'
-  model: 'gemini-2.0-flash-exp'
+  model: 'deepseek-chat'

 # Language settings, written into the prompt, can be described in natural language
 target_language: '简体中文'

 # Whether to use Demucs for vocal separation before transcription
-demucs: false
+demucs: true

 whisper:
   # ["medium", "large-v3", "large-v3-turbo"]. Note: for zh, the model is forced to Belle/large-v3
   model: 'large-v3'
   # Whisper specified recognition language [en, zh, ...]
   language: 'en'
   detected_language: 'en'
+  # Whisper running mode ["local", "cloud"]. Specifies where to run; cloud uses the 302.ai API
+  runtime: 'cloud'
+  # 302.ai API key
+  whisperX_302_api_key: 'YOUR_302_API_KEY'

-# Video resolution [0x0, 640x360, 1920x1080]; 0x0 will generate a 0-second black video placeholder
-resolution: '1920x1080'
+# Whether to burn subtitles into the video
+burn_subtitles: true

 ## ======================== Advanced Settings ======================== ##
 # *Default resolution for downloading YouTube videos [360, 1080, best]
@@ -33,7 +39,7 @@ subtitle:
 # *Translated subtitles are slightly larger than source subtitles, affecting the reference length for subtitle splitting
 target_multiplier: 1.2

-# * Summary length, set low to 2k if using local LLM
+# *Summary length, set low to 2k if using local LLM
 summary_length: 8000

 # *Number of LLM multi-threaded accesses, set to 1 if using local LLM
@@ -135,6 +141,7 @@ llm_support_json:
 - 'gpt-4o-mini'
 - 'gemini-2.0-flash-exp'
 - 'deepseek-coder'
+- 'deepseek-chat'

 # have problems
 # - 'Qwen/Qwen2.5-72B-Instruct'
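The new `whisper.runtime` key switches transcription between a local whisperX run and the 302.ai cloud endpoint. A minimal sketch of how that switch could be consumed, assuming the repo's `load_key` helper and the two transcribe functions added later in this commit; the module paths are guesses, since the new files' names are not visible in this extract:

```python
# Sketch only: the whisperX_302 / whisperX_local module names are assumptions.
from core.config_utils import load_key

def transcribe(audio_path: str, start: float, end: float) -> dict:
    if load_key("whisper.runtime") == "cloud":
        # Hosted whisperX via 302.ai; requires whisper.whisperX_302_api_key
        from core.all_whisper_methods.whisperX_302 import transcribe_audio_302
        return transcribe_audio_302(audio_path, start, end)
    # Local whisperX; benefits from a CUDA GPU
    from core.all_whisper_methods.whisperX_local import transcribe_audio
    return transcribe_audio(audio_path, start, end)
```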

core/all_tts_functions/siliconflow_fish_tts.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
 from core.config_utils import load_key, update_key
 from core.step1_ytdlp import find_video_files
-from core.all_whisper_methods.whisperX_utils import get_audio_duration
+from core.all_whisper_methods.audio_preprocess import get_audio_duration
 import hashlib
 from rich import print as rprint
 from pydub import AudioSegment

core/all_tts_functions/tts_main.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@

 sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
 from core.config_utils import load_key
-from core.all_whisper_methods.whisperX_utils import get_audio_duration
+from core.all_whisper_methods.audio_preprocess import get_audio_duration
 from core.all_tts_functions.gpt_sovits_tts import gpt_sovits_tts_for_videolingo
 from core.all_tts_functions.siliconflow_fish_tts import siliconflow_fish_tts_for_videolingo
 from core.all_tts_functions.openai_tts import openai_tts
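Several files in this commit repoint `get_audio_duration` from `whisperX_utils` to the new `audio_preprocess` module; its body is not part of the diff. A plausible pydub-based sketch, as an assumption only (the real implementation may use ffprobe or librosa instead):

```python
# Assumed implementation of audio_preprocess.get_audio_duration; not shown
# in this commit. pydub is already imported by the files that use it.
from pydub import AudioSegment

def get_audio_duration(audio_file: str) -> float:
    """Return the duration of an audio file in seconds."""
    return len(AudioSegment.from_file(audio_file)) / 1000.0
```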
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+import requests
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+from core.config_utils import load_key
+from rich import print as rprint
+import time
+import json
+import tempfile
+import subprocess
+
+OUTPUT_LOG_DIR = "output/log"
+def transcribe_audio_302(audio_path: str, start: float = None, end: float = None):
+    os.makedirs(OUTPUT_LOG_DIR, exist_ok=True)
+    LOG_FILE = f"{OUTPUT_LOG_DIR}/whisperx302.json"
+    if os.path.exists(LOG_FILE):
+        with open(LOG_FILE, "r", encoding="utf-8") as f:
+            return json.load(f)
+
+    WHISPER_LANGUAGE = load_key("whisper.language")
+    url = "https://api.302.ai/302/whisperx"
+
+    # If start and end times are specified, create a temporary audio segment
+    if start is not None and end is not None:
+        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_audio:
+            temp_audio_path = temp_audio.name
+
+        # Cut the audio segment with ffmpeg
+        ffmpeg_cmd = f'ffmpeg -y -i "{audio_path}" -ss {start} -t {end-start} -vn -ar 32000 -ac 1 "{temp_audio_path}"'
+        subprocess.run(ffmpeg_cmd, shell=True, check=True, capture_output=True)
+        audio_path = temp_audio_path
+
+    payload = {
+        "processing_type": "align",
+        "language": WHISPER_LANGUAGE,
+        "output": "raw"
+    }
+
+    start_time = time.time()
+    rprint(f"[cyan]🎤 Transcribing audio with language: <{WHISPER_LANGUAGE}> ...[/cyan]")
+    files = [
+        ('audio_input', (
+            os.path.basename(audio_path),
+            open(audio_path, 'rb'),
+            'application/octet-stream'
+        ))
+    ]
+
+    headers = {
+        'Authorization': f'Bearer {load_key("whisper.whisperX_302_api_key")}'
+    }
+
+    response = requests.request("POST", url, headers=headers, data=payload, files=files)
+
+    # Clean up the temporary file
+    if start is not None and end is not None:
+        if os.path.exists(temp_audio_path):
+            os.unlink(temp_audio_path)
+
+    with open(LOG_FILE, "w", encoding="utf-8") as f:
+        json.dump(response.json(), f, indent=4, ensure_ascii=False)
+
+    # Shift timestamps back to full-audio time
+    if start is not None:
+        result = response.json()
+        for segment in result['segments']:
+            segment['start'] += start
+            segment['end'] += start
+            for word in segment.get('words', []):
+                if 'start' in word:
+                    word['start'] += start
+                if 'end' in word:
+                    word['end'] += start
+        response._content = json.dumps(result).encode()  # overwrite the body so response.json() returns shifted times
+
+    elapsed_time = time.time() - start_time
+    rprint(f"[green]✓ Transcription completed in {elapsed_time:.2f} seconds[/green]")
+    return response.json()
+
+if __name__ == "__main__":
+    # Usage example:
+    result = transcribe_audio_302("output/audio/raw.mp3")
+    rprint(result)
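A hedged usage note: results are cached in output/log/whisperx302.json and returned verbatim on re-runs, so a fresh segment-level call would look roughly like this (illustrative only; the `text` field assumes whisperX's usual segment schema):

```python
import os

# Remove the cache so the API is actually called again (the function
# short-circuits to the saved JSON otherwise).
if os.path.exists("output/log/whisperx302.json"):
    os.remove("output/log/whisperx302.json")

# Transcribe 30s-90s; returned timestamps are shifted to full-audio time.
result = transcribe_audio_302("output/audio/raw.mp3", start=30.0, end=90.0)
for seg in result["segments"]:
    print(f"{seg['start']:.2f}-{seg['end']:.2f}: {seg.get('text', '')}")
```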
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
+import os, sys
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+import warnings
+warnings.filterwarnings("ignore")
+
+import whisperx
+import torch
+import time
+import subprocess
+from typing import Dict
+from rich import print as rprint
+import librosa
+import tempfile
+from core.config_utils import load_key
+from core.all_whisper_methods.audio_preprocess import save_language
+
+MODEL_DIR = load_key("model_dir")
+
+def check_hf_mirror() -> str:
+    """Check and return the fastest HF mirror"""
+    mirrors = {
+        'Official': 'huggingface.co',
+        'Mirror': 'hf-mirror.com'
+    }
+    fastest_url = f"https://{mirrors['Official']}"
+    best_time = float('inf')
+    rprint("[cyan]🔍 Checking HuggingFace mirrors...[/cyan]")
+    for name, domain in mirrors.items():
+        try:
+            if os.name == 'nt':
+                cmd = ['ping', '-n', '1', '-w', '3000', domain]
+            else:
+                cmd = ['ping', '-c', '1', '-W', '3', domain]
+            start = time.time()
+            result = subprocess.run(cmd, capture_output=True, text=True)
+            response_time = time.time() - start
+            if result.returncode == 0:
+                if response_time < best_time:
+                    best_time = response_time
+                    fastest_url = f"https://{domain}"
+                rprint(f"[green]✓ {name}:[/green] {response_time:.2f}s")
+        except:
+            rprint(f"[red]✗ {name}:[/red] Failed to connect")
+    if best_time == float('inf'):
+        rprint("[yellow]⚠️ All mirrors failed, using default[/yellow]")
+    rprint(f"[cyan]🚀 Selected mirror:[/cyan] {fastest_url} ({best_time:.2f}s)")
+    return fastest_url
+
+def transcribe_audio(audio_file: str, start: float, end: float) -> Dict:
+    os.environ['HF_ENDPOINT'] = check_hf_mirror()  #? don't know if it's working...
+    WHISPER_LANGUAGE = load_key("whisper.language")
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    rprint(f"🚀 Starting WhisperX using device: {device} ...")
+
+    if device == "cuda":
+        gpu_mem = torch.cuda.get_device_properties(0).total_memory / (1024**3)
+        batch_size = 16 if gpu_mem > 8 else 2
+        compute_type = "float16" if torch.cuda.is_bf16_supported() else "int8"
+        rprint(f"[cyan]🎮 GPU memory:[/cyan] {gpu_mem:.2f} GB, [cyan]📦 Batch size:[/cyan] {batch_size}, [cyan]⚙️ Compute type:[/cyan] {compute_type}")
+    else:
+        batch_size = 1
+        compute_type = "int8"
+        rprint(f"[cyan]📦 Batch size:[/cyan] {batch_size}, [cyan]⚙️ Compute type:[/cyan] {compute_type}")
+    rprint(f"[green]▶️ Starting WhisperX for segment {start:.2f}s to {end:.2f}s...[/green]")
+
+    try:
+        if WHISPER_LANGUAGE == 'zh':
+            model_name = "Huan69/Belle-whisper-large-v3-zh-punct-fasterwhisper"
+            local_model = os.path.join(MODEL_DIR, "Belle-whisper-large-v3-zh-punct-fasterwhisper")
+        else:
+            model_name = load_key("whisper.model")
+            local_model = os.path.join(MODEL_DIR, model_name)
+
+        if os.path.exists(local_model):
+            rprint(f"[green]📥 Loading local WHISPER model:[/green] {local_model} ...")
+            model_name = local_model
+        else:
+            rprint(f"[green]📥 Using WHISPER model from HuggingFace:[/green] {model_name} ...")
+
+        vad_options = {"vad_onset": 0.500, "vad_offset": 0.363}
+        asr_options = {"temperatures": [0], "initial_prompt": ""}
+        whisper_language = None if 'auto' in WHISPER_LANGUAGE else WHISPER_LANGUAGE
+        rprint("[bold yellow]**You can ignore the warning `Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118...`**[/bold yellow]")
+        model = whisperx.load_model(model_name, device, compute_type=compute_type, language=whisper_language, vad_options=vad_options, asr_options=asr_options, download_root=MODEL_DIR)
+
+        # Create a temp file in wav format for better compatibility
+        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_audio:
+            temp_audio_path = temp_audio.name
+
+        # Extract the audio segment using ffmpeg
+        ffmpeg_cmd = f'ffmpeg -y -i "{audio_file}" -ss {start} -t {end-start} -vn -ar 32000 -ac 1 "{temp_audio_path}"'
+        subprocess.run(ffmpeg_cmd, shell=True, check=True, capture_output=True)
+
+        try:
+            # Load the audio segment with librosa
+            audio_segment, sample_rate = librosa.load(temp_audio_path, sr=16000)
+        finally:
+            # Clean up the temp file
+            if os.path.exists(temp_audio_path):
+                os.unlink(temp_audio_path)
+
+        rprint("[bold green]Note: you will see a progress bar if this is working correctly[/bold green]")
+        result = model.transcribe(audio_segment, batch_size=batch_size, print_progress=True)
+
+        # Free GPU resources
+        del model
+        torch.cuda.empty_cache()
+
+        # Save the detected language
+        save_language(result['language'])
+        if result['language'] == 'zh' and WHISPER_LANGUAGE != 'zh':
+            raise ValueError("Please specify the transcription language as zh and try again!")
+
+        # Align whisper output
+        model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
+        result = whisperx.align(result["segments"], model_a, metadata, audio_segment, device, return_char_alignments=False)
+
+        # Free GPU resources again
+        torch.cuda.empty_cache()
+        del model_a
+
+        # Adjust timestamps back to full-audio time
+        for segment in result['segments']:
+            segment['start'] += start
+            segment['end'] += start
+            for word in segment['words']:
+                if 'start' in word:
+                    word['start'] += start
+                if 'end' in word:
+                    word['end'] += start
+        return result
+    except Exception as e:
+        rprint(f"[red]WhisperX processing error:[/red] {e}")
+        raise
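For reference, a hypothetical driver for this local path; the word-level fields follow whisperX's usual aligned-output schema, and the audio path is illustrative:

```python
# Illustrative call: transcribe the first two minutes locally and print
# word-level timings from the aligned result.
result = transcribe_audio("output/audio/raw.mp3", start=0.0, end=120.0)
for seg in result["segments"]:
    for word in seg["words"]:
        if "start" in word and "end" in word:
            print(f"{word['start']:7.2f} {word['end']:7.2f}  {word['word']}")
```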

core/step10_gen_audio.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@

 sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 from core.config_utils import load_key
-from core.all_whisper_methods.whisperX_utils import get_audio_duration
+from core.all_whisper_methods.audio_preprocess import get_audio_duration
 from core.all_tts_functions.tts_main import tts_main

 console = Console()
