training

#2
by ABDALLALSWAITI - opened

I am preparing a training run, and I think the results you obtained are excellent. Could you please share the training parameters or any other information that might help? I would like to use the dataset https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 and I expect the same settings you used would work.

How long would training take on an H100 GPU with 80 GB?

Hi! I recommend starting with the default parameters. I used these:

src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl train_vibevoice.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir outputs \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 1 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --ddpm_batch_mul 4 \
  --ce_loss_weight 0.0 \
  --diffusion_loss_weight 1.0 \
  --do_train \
  --train_diffusion_head \
  --gradient_clipping \
  --gradient_checkpointing True \
  --lora_target_modules NONE \
  --voice_prompt_drop_rate 0.2

I trained it for 3.5 epochs on an RTX 6000 PRO. It took around 6 hours, if I recall correctly.
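
For anyone adapting these settings, here is a rough sketch of how the batch flags combine. It assumes a single GPU and a hypothetical dataset size of 10,000 examples (neither is stated in this thread), and the real trainer may round step counts slightly differently.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def steps_per_epoch(num_examples: int, batch: int) -> int:
    """Optimizer steps needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch)


# Values taken from the command above.
PER_DEVICE = 8    # --per_device_train_batch_size
GRAD_ACCUM = 16   # --gradient_accumulation_steps
NUM_EPOCHS = 5    # --num_train_epochs
SAVE_STEPS = 100  # --save_steps

# ASSUMPTION: hypothetical dataset size, not a number from this thread.
NUM_EXAMPLES = 10_000

batch = effective_batch_size(PER_DEVICE, GRAD_ACCUM)
per_epoch = steps_per_epoch(NUM_EXAMPLES, batch)
total_steps = per_epoch * NUM_EPOCHS

print(f"effective batch size: {batch}")
print(f"optimizer steps per epoch: {per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"checkpoints written: {total_steps // SAVE_STEPS}")
```

With a small dataset the total step count can be lower than expected, so it may be worth checking that --save_steps is not larger than the whole run.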

I ran the training, but the result was very poor. I think the issue may be with the train_vibevoice.jsonl file I created.
Can you share your JSONL, or the script you used to extract it from https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/
Here is my model. I tested it using ComfyUI:
https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-v1

My JSONL:
{"text": "Speaker 0: ูˆู…ุง ูƒุงู† ุฑุจูƒ ู„ูŠู‡ู„ูƒ ุงู„ู‚ุฑู‰ ุจุธู„ู… ูˆุฃู‡ู„ู‡ุง ู…ุตู„ุญูˆู†", "audio": "vibevoice_data/train_audio/audio_000000.wav"}
{"text": "Speaker 0: ุฃู‚ุฏุฑ ุชุนุงูˆู†ูƒ ู…ุนู†ุง.", "audio": "vibevoice_data/train_audio/audio_000001.wav"}
{"text": "Speaker 0: ูˆูŽู„ููˆุทู‹ุง ุฅูุฐู’ ู‚ูŽุงู„ูŽ ู„ูู‚ูŽูˆู’ู…ูู‡ู ุฃูŽุชูŽุฃู’ุชููˆู†ูŽ ุงู„ู’ููŽุงุญูุดูŽุฉูŽ ู…ูŽุง ุณูŽุจูŽู‚ูŽูƒูู…ู’ ุจูู‡ูŽุง ู…ูู†ู’ ุฃูŽุญูŽุฏู ู…ูู†ูŽ ุงู„ู’ุนูŽุงู„ูŽู…ููŠู†ูŽ", "audio": "vibevoice_data/train_audio/audio_000002.wav"}
{"text": "Speaker 0: ูู„ู…ุง ุฑุฃูˆุง ุจุฃุณู†ุง ู‚ุงู„ูˆุง ุขู…ู†ุง ุจุงู„ู„ู‡ ูˆุญุฏู‡ ูˆูƒูุฑู†ุง ุจู…ุง ูƒู†ุง ุจู‡ ู…ุดุฑูƒูŠู†", "audio": "vibevoice_data/train_audio/audio_000003.wav"}
{"text": "Speaker 0: ู„ู…ู† ู‡ุฐุง ุงู„ูƒุชุงุจุŸ", "audio": "vibevoice_data/train_audio/audio_000004.wav"}
{"text": "Speaker 0: ุฅู†ู‡ ุฎุทูŠุฑ ู„ู„ุบุงูŠุฉ.", "audio": "vibevoice_data/train_audio/audio_000005.wav"}
{"text": "Speaker 0: ุจู„ ู…ุชุนุช ู‡ุคู„ุงุก ูˆุขุจุงุกู‡ู… ุญุชู‰ ุฌุงุกู‡ู… ุงู„ุญู‚ ูˆุฑุณูˆู„ ู…ุจูŠู†", "audio": "vibevoice_data/train_audio/audio_000006.wav"}
{"text": "Speaker 0: ููŽูŠูŽุตููŠุฑู ุจูุงู„ุฑู‘ูŽู…ู’ุฒู ุณูŽุงุฆูุฑู‹ุง ูˆูŽูููŠ ุงู„ุตู‘ูุญููู ู…ูุฎูŽู„ู‘ูŽุฏู‹ุง", "audio": "vibevoice_data/train_audio/audio_000007.wav"}
{"text": "Speaker 0: ุฃู†ุง ุดุฑูŠูƒูƒ.", "audio": "vibevoice_data/train_audio/audio_000008.wav"}
{"text": "Speaker 0: ู„ูƒู† ุงู„ุฑุณูˆู„ ูˆุงู„ุฐูŠู† ุขู…ู†ูˆุง ู…ุนู‡ ุฌุงู‡ุฏูˆุง ุจุฃู…ูˆุงู„ู‡ู… ูˆุฃู†ูุณู‡ู… ูˆุฃูˆู„ุฆูƒ ู„ู‡ู… ุงู„ุฎูŠุฑุงุช ูˆุฃูˆู„ุฆูƒ ู‡ู… ุงู„ู…ูู„ุญูˆู†", "audio": "vibevoice_data/train_audio/audio_000009.wav"}
{"text": "Speaker 0: ุซูู…ู‘ูŽ ุฅูู†ู‘ูŽ ู…ูŽุฑู’ุฌูุนูŽู‡ูู…ู’ ู„ูŽุฅูู„ูŽู‰ ุงู„ู’ุฌูŽุญููŠู…ู", "audio": "vibevoice_data/train_audio/audio_000010.wav"}
{"text": "Speaker 0: ู„ู† ู†ู‚ุชุฑุจ ู…ู† ุงู„ุนุฏูˆ.", "audio": "vibevoice_data/train_audio/audio_000011.wav"}
{"text": "Speaker 0: ูˆูŽุฃูŽู†ู‘ูŽู‡ูู…ู’ ูŠูŽู‚ููˆู„ููˆู†ูŽ ู…ูŽุง ู„ูŽุง ูŠูŽูู’ุนูŽู„ููˆู†ูŽ", "audio": "vibevoice_data/train_audio/audio_000012.wav"}
{"text": "Speaker 0: ู…ุง ู‡ุฐุง ุงู„ุฐูŠ ุชูุนู„ู‡ ูŠุง ูุชู‰ุŸ", "audio": "vibevoice_data/train_audio/audio_000013.wav"}
{"text": "Speaker 0: ูˆูŽุงู„ุดู‘ูŽู‚ููŠู‘ู ู…ูŽู†ู’ ุฌูŽู…ูŽุนูŽ ู„ูุบูŽูŠู’ุฑูู‡ู ูˆูŽุจูŽุฎูู„ูŽ ุนูŽู„ูŽู‰ ู†ูŽูู’ุณูู‡ู", "audio": "vibevoice_data/train_audio/audio_000014.wav"}
{"text": "Speaker 0: ุฃูŠุทู…ุน ูƒู„ ุงู…ุฑุฆ ู…ู†ู‡ู… ุฃู† ูŠุฏุฎู„ ุฌู†ุฉ ู†ุนูŠู…", "audio": "vibevoice_data/train_audio/audio_000015.wav"}
{"text": "Speaker 0: ูˆูŽู„ูŽุง ูŠูุธู’ู‡ูุฑู ู„ูŽู‡ู ุงู„ูุงุณู’ุชููƒู’ููŽุงุกูŽ ู…ูู†ู’ู‡ู ูˆูŽุงู„ูุงุณู’ุชูุบู’ู†ูŽุงุกูŽ ุนูŽู†ู’ู‡ู", "audio": "vibevoice_data/train_audio/audio_000016.wav"}
{"text": "Speaker 0: ูˆูŽุฐูŽู„ููƒูŽ ู„ูŽุง ูŠููˆุฌูŽุฏู ู…ูู†ู’ู‡ู ุฅู„ู‘ูŽุง ุนูู†ู’ุฏูŽ ูƒูŽู…ูŽุงู„ู ุนูŽู‚ู’ู„ูู‡ู", "audio": "vibevoice_data/train_audio/audio_000017.wav"}
{"text": "Speaker 0: ู„ู‚ุฏ ุนูˆู‚ุจ ุนู„ู‰ ุฌุฑุงุฆู…ู‡.", "audio": "vibevoice_data/train_audio/audio_000018.wav"}
{"text": "Speaker 0: ุงู†ุชุธุฑ ุญุชู‰ ูŠุชูˆู‚ู ุงู„ู…ุทุฑ ุนู† ุงู„ู‡ุทูˆู„.", "audio": "vibevoice_data/train_audio/audio_000019.wav"}
{"text": "Speaker 0: ููŽุฏูŽุนูŽุง ุฑูŽุจู‘ูŽู‡ู ุฃูŽู†ู‘ููŠ ู…ูŽุบู’ู„ููˆุจูŒ ููŽุงู†ุชูŽุตูุฑู’", "audio": "vibevoice_data/train_audio/audio_000020.wav"}
{"text": "Speaker 0: ูˆูŽุฅูุฐูŽุง ู‚ููŠู„ูŽ ู„ูŽู‡ูู…ู ุงุชู‘ูŽู‚ููˆุง ู…ูŽุง ุจูŽูŠู’ู†ูŽ ุฃูŽูŠู’ุฏููŠูƒูู…ู’ ูˆูŽู…ูŽุง ุฎูŽู„ู’ููŽูƒูู…ู’ ู„ูŽุนูŽู„ู‘ูŽูƒูู…ู’ ุชูุฑู’ุญูŽู…ููˆู†ูŽ", "audio": "vibevoice_data/train_audio/audio_000021.wav"}
{"text": "Speaker 0: ูˆู‚ู‡ู… ุงู„ุณูŠุฆุงุช ูˆู…ู† ุชู‚ ุงู„ุณูŠุฆุงุช ูŠูˆู…ุฆุฐ ูู‚ุฏ ุฑุญู…ุชู‡ ูˆุฐู„ูƒ ู‡ูˆ ุงู„ููˆุฒ ุงู„ุนุธูŠู…", "audio": "vibevoice_data/train_audio/audio_000022.wav"}
{"text": "Speaker 0: ุฃูŽูˆู’ ูŠูŽูƒููˆู†ูŽ ู†ูŽุชููŠุฌูŽุฉู‹ ู…ูู†ู’ ุบูŽูŠู’ุฑูู‡ู", "audio": "vibevoice_data/train_audio/audio_000023.wav"}
{"text": "Speaker 0: ูˆุฅุฐุง ูƒุงู„ูˆู‡ู… ุฃูˆ ูˆุฒู†ูˆู‡ู… ูŠุฎุณุฑูˆู†", "audio": "vibevoice_data/train_audio/audio_000024.wav"}
{"text": "Speaker 0: ุญูู…ูŽุงุฉูŽ ุงู„ู’ุญูู…ูŽู‰ ูŠูŽุง ุญูู…ูŽุงุฉูŽ ุงู„ู’ุญูู…ูŽู‰ ู‡ูŽู„ูู…ูู‘ูˆุง ู‡ูŽู„ูู…ูู‘ูˆุง ู„ูู…ูŽุฌู’ุฏู ุงู„ุฒูŽู‘ู…ูŽู†ู’", "audio": "vibevoice_data/train_audio/audio_000025.wav"}
{"text": "Speaker 0: ุจุฏุฃ ุงู„ุชูƒุณุงุณูŠูˆู† ุจุชู†ุธูŠู… ุฌูŠูˆุดู‡ู….", "audio": "vibevoice_data/train_audio/audio_000026.wav"}
{"text": "Speaker 0: ู„ู…ุงุฐุง ุตุนุฏุช ุฅู„ู‰ ุณู‚ู ุจูŠุชู‡ุงุŸ", "audio": "vibevoice_data/train_audio/audio_000027.wav"}
{"text": "Speaker 0: ุงู„ู„ู‡ ูŠุจุฏุฃ ุงู„ุฎู„ู‚ ุซู… ูŠุนูŠุฏู‡ ุซู… ุฅู„ูŠู‡ ุชุฑุฌุนูˆู†", "audio": "vibevoice_data/train_audio/audio_000028.wav"}
{"text": "Speaker 0: ุฅู† ุนุฐุงุจ ุฑุจู‡ู… ุบูŠุฑ ู…ุฃู…ูˆู†", "audio": "vibevoice_data/train_audio/audio_000029.wav"}
{"text": "Speaker 0: ูˆู„ู‚ุฏ ุตุฑูู†ุงู‡ ุจูŠู†ู‡ู… ู„ูŠุฐูƒุฑูˆุง ูุฃุจู‰ ุฃูƒุซุฑ ุงู„ู†ุงุณ ุฅู„ุง ูƒููˆุฑุง", "audio": "vibevoice_data/train_audio/audio_000030.wav"}
{"text": "Speaker 0: ู‚ุงู„ ุชูˆู… ุฃู†ู‘ู‡ ู„ู… ูŠููุตู„ ุฃุญุฏ.", "audio": "vibevoice_data/train_audio/audio_000031.wav"}
{"text": "Speaker 0: ุฃุณูƒู† ููŠ ู‚ุทุฑ.", "audio": "vibevoice_data/train_audio/audio_000032.wav"}
{"text": "Speaker 0: ู„ู‚ุฏ ุญู„ ูุตู„ ุงู„ุฑุจูŠุน ุŒ ุงู„ุญุฑุงุฑุฉ ุชุฒุฏุงุฏ ูŠูˆู…ุง ุจุนุฏ ูŠูˆู….", "audio": "vibevoice_data/train_audio/audio_000033.wav"}
{"text": "Speaker 0: ูˆูŽู…ูŽู‡ู’ู…ูŽุง ุชูŽูƒูู†ู’ ุนูู†ู’ุฏูŽ ุงู…ู’ุฑูุฆู ู…ูู†ู’ ุฎูŽู„ููŠู‚ูŽุฉู ูˆูŽุฅูู†ู’ ุฎูŽุงู„ูŽู‡ูŽุง ุชูŽุฎู’ููŽู‰ ุนูŽู„ูŽู‰ ุงู„ู†ู‘ูŽุงุณู ุชูุนู’ู„ูŽู…ู’", "audio": "vibevoice_data/train_audio/audio_000034.wav"}
{"text": "Speaker 0: ุงู„ูุฑุณ ู‡ูŠ ุฃู†ุซู‰ ุงู„ุญุตุงู†", "audio": "vibevoice_data/train_audio/audio_000035.wav"}
{"text": "Speaker 0: ู„ููŠูŽูƒู’ููุฑููˆุง ุจูู…ูŽุง ุขุชูŽูŠู’ู†ูŽุงู‡ูู…ู’ ูˆูŽู„ููŠูŽุชูŽู…ูŽุชู‘ูŽุนููˆุง ููŽุณูŽูˆู’ููŽ ูŠูŽุนู’ู„ูŽู…ููˆู†ูŽ", "audio": "vibevoice_data/train_audio/audio_000036.wav"}
{"text": "Speaker 0: ู„ู„ุทูŠูˆุฑ ุงุนุดุงุด ุŒ ูˆู„ู„ุนู†ุงูƒุจ ุดุจุงูƒ ุŒ ูˆู„ู„ู†ุงุณ ุงู„ุตุฏุงู‚ุงุช.", "audio": "vibevoice_data/train_audio/audio_000037.wav"}
{"text": "Speaker 0: ุฃุฎุถุฑู‡ุง ู…ู† ุงู„ุฃุฑุถุŒ ูˆุฃุฒุฑู‚ู‡ุง ู…ู† ุงู„ุณู…ุงุก", "audio": "vibevoice_data/train_audio/audio_000038.wav"}
{"text": "Speaker 0: ูˆูŽุนูู„ู’ู…ูู‡ู ู…ูŽุญู’ู‚ููˆุฑูŒ", "audio": "vibevoice_data/train_audio/audio_000039.wav"}
{"text": "Speaker 0: ู…ุง ุงู„ุฐูŠ ูŠุฑุงู‡ู ููŠู‡ุงุŸ", "audio": "vibevoice_data/train_audio/audio_000040.wav"}
{"text": "Speaker 0: ู„ูŽู‚ูŽุฏู’ ุฃูŽุถูŽู„ู‘ูŽู†ููŠ ุนูŽู†ู ุงู„ุฐู‘ููƒู’ุฑู ุจูŽุนู’ุฏูŽ ุฅูุฐู’ ุฌูŽุงุกูŽู†ููŠ ูˆูŽูƒูŽุงู†ูŽ ุงู„ุดู‘ูŽูŠู’ุทูŽุงู†ู ู„ูู„ู’ุฅูู†ู’ุณูŽุงู†ู ุฎูŽุฐููˆู„ู‹ุง", "audio": "vibevoice_data/train_audio/audio_000041.wav"}
{"text": "Speaker 0: ู„ูŽู‚ูŽุฏู’ ุฎูŽู„ูŽู‚ู’ู†ูŽุง ุงู„ู’ุฅูู†ู’ุณูŽุงู†ูŽ ูููŠ ุฃูŽุญู’ุณูŽู†ู ุชูŽู‚ู’ูˆููŠู…ู", "audio": "vibevoice_data/train_audio/audio_000042.wav"}
{"text": "Speaker 0: ูƒุชุจ ุนู„ูŠู‡ ุฃู†ู‡ ู…ู† ุชูˆู„ุงู‡ ูุฃู†ู‡ ูŠุถู„ู‡ ูˆูŠู‡ุฏูŠู‡ ุฅู„ู‰ ุนุฐุงุจ ุงู„ุณุนูŠุฑ", "audio": "vibevoice_data/train_audio/audio_000043.wav"}
{"text": "Speaker 0: ูุณูˆู ูŠุญุงุณุจ ุญุณุงุจุง ูŠุณูŠุฑุง", "audio": "vibevoice_data/train_audio/audio_000044.wav"}
{"text": "Speaker 0: ูˆูŽู„ูŽูŠู’ุณูŽ ุจูŽุนู’ุฏูŽ ุงู„ู’ู…ูŽูˆู’ุชู ุดูŽูŠู’ุกูŒ ุฅู„ู‘ูŽุง ุงู„ู’ู…ูŽูˆู’ุชู ุฃูŽูŠู’ุณูŽุฑู ู…ูู†ู’ู‡ู", "audio": "vibevoice_data/train_audio/audio_000045.wav"}
{"text": "Speaker 0: ุฃุฑุงุฏ ุณุงู…ูŠ ุฃู† ูŠู‚ุชู„ ู„ูŠู„ู‰ ูƒูŠ ูŠุณุชูˆู„ูŠ ุนู„ู‰ ู…ู„ูƒูŠุชู‡ุง ุงู„ุนู‚ู‘ุงุฑูŠู‘ุฉ.", "audio": "vibevoice_data/train_audio/audio_000046.wav"}
{"text": "Speaker 0: ู‚ูŽุงู„ูŽ ุงู„ู‘ูŽุฐููŠู†ูŽ ุงุณู’ุชูŽูƒู’ุจูŽุฑููˆุง ุฅูู†ู‘ูŽุง ุจูุงู„ู‘ูŽุฐููŠ ุขู…ูŽู†ู’ุชูู…ู’ ุจูู‡ู ูƒูŽุงููุฑููˆู†ูŽ", "audio": "vibevoice_data/train_audio/audio_000047.wav"}
{"text": "Speaker 0: ูˆูŽู†ูŽุจู‘ูุฆู’ู‡ูู…ู’ ุนูŽู†ู’ ุถูŽูŠู’ูู ุฅูุจู’ุฑูŽุงู‡ููŠู…ูŽ", "audio": "vibevoice_data/train_audio/audio_000048.wav"}
{"text": "Speaker 0: ู†ุฑูŠุฏ ุงู„ุณู„ุงู… ููŠ ุงู„ุนุงู„ู….", "audio": "vibevoice_data/train_audio/audio_000049.wav"}
{"text": "Speaker 0: ุญู„ู…ูŠ ู‡ูˆ ุฃู† ุฃุตุจุญ ู…ุบู†ู‘ูŠุง.", "audio": "vibevoice_data/train_audio/audio_000050.wav"}
{"text": "Speaker 0: ู‡ู„ ุชุฑูŠุฏู†ูŠ ุญู‚ุงู‹ ุฃู† ุฃุฎุจุฑ ุชูˆู… ุจุดุฃู†ูƒ ุฃู†ุชูŽ ูˆ ู…ุงุฑูŠ ุŸ", "audio": "vibevoice_data/train_audio/audio_000051.wav"}
{"text": "Speaker 0: ูˆุฌู‡ู‡ุง ู…ุถุญููƒ.", "audio": "vibevoice_data/train_audio/audio_000052.wav"}
{"text": "Speaker 0: ูƒุงู† ุดูˆุจ ุŒ ููุชุญุช ุงู„ุดุจุงูƒ", "audio": "vibevoice_data/train_audio/audio_000053.wav"}
{"text": "Speaker 0: ู‡ูŠ ุชุดุฑุจ ุงู„ู…ุงุก ูู‚ุท.", "audio": "vibevoice_data/train_audio/audio_000054.wav"}
{"text": "Speaker 0: ุงู†ุง ุงู†ุชุธุฑู‡ ู…ู†ุฐ ู‡ุฐุง ุงู„ุตุจุงุญ ุงู„ุจุงูƒุฑ.", "audio": "vibevoice_data/train_audio/audio_000055.wav"}
{"text": "Speaker 0: ุงู„ุญุงู†ูˆุช ูŠุจูŠุน ุงู„ุฌุฑุงุฆุฏ ูˆ ุงู„ู…ุฌู„ุงุช.", "audio": "vibevoice_data/train_audio/audio_000056.wav"}
{"text": "Speaker 0: ุงู‚ู’ุชูู„ููˆุง ูŠููˆุณูููŽ ุฃูŽูˆู ุงุทู’ุฑูŽุญููˆู‡ู ุฃูŽุฑู’ุถู‹ุง ูŠูŽุฎู’ู„ู ู„ูŽูƒูู…ู’ ูˆูŽุฌู’ู‡ู ุฃูŽุจููŠูƒูู…ู’ ูˆูŽุชูŽูƒููˆู†ููˆุง ู…ูู†ู’ ุจูŽุนู’ุฏูู‡ู ู‚ูŽูˆู’ู…ู‹ุง ุตูŽุงู„ูุญููŠู†ูŽ", "audio": "vibevoice_data/train_audio/audio_000057.wav"}
{"text": "Speaker 0: ุจูŠู†ู…ุง ูƒู†ุช ุฐุงู‡ุจุง ุฅู„ู‰ ุงู„ุนู…ู„ ุŒ ุงู„ุชู‚ูŠุช ุจุนู…ูŠ.", "audio": "vibevoice_data/train_audio/audio_000058.wav"}
{"text": "Speaker 0: ู‚ู„ ู‡ูˆ ุงู„ุฐูŠ ุฐุฑุฃูƒู… ููŠ ุงู„ุฃุฑุถ ูˆุฅู„ูŠู‡ ุชุญุดุฑูˆู†", "audio": "vibevoice_data/train_audio/audio_000059.wav"}
{"text": "Speaker 0: ูˆูŽู„ูŽูˆู’ ุดูŽุงุกูŽ ุงู„ู„ู‘ูŽู‡ู ู…ูŽุง ุฃูŽุดู’ุฑูŽูƒููˆุง ูˆูŽู…ูŽุง ุฌูŽุนูŽู„ู’ู†ูŽุงูƒูŽ ุนูŽู„ูŽูŠู’ู‡ูู…ู’ ุญูŽูููŠุธู‹ุง ูˆูŽู…ูŽุง ุฃูŽู†ู’ุชูŽ ุนูŽู„ูŽูŠู’ู‡ูู…ู’ ุจููˆูŽูƒููŠู„ู", "audio": "vibevoice_data/train_audio/audio_000060.wav"}
{"text": "Speaker 0: ู„ู‚ุฏ ูˆุนุฏุชู†ูŠ.", "audio": "vibevoice_data/train_audio/audio_000061.wav"}
{"text": "Speaker 0: ูƒูŽู„ู‘ูŽุง ุณูŽูŠูŽูƒู’ููุฑููˆู†ูŽ ุจูุนูุจูŽุงุฏูŽุชูู‡ูู…ู’ ูˆูŽูŠูŽูƒููˆู†ููˆู†ูŽ ุนูŽู„ูŽูŠู’ู‡ูู…ู’ ุถูุฏู‘ู‹ุง", "audio": "vibevoice_data/train_audio/audio_000062.wav"}
{"text": "Speaker 0: ูˆูŽุฃูŽู…ู‘ูŽุง ู…ูŽุง ูŠูŽุตู’ู„ูุญู ุจูู‡ู ุญูŽุงู„ู ุงู„ู’ุฅูู†ู’ุณูŽุงู†ู ูููŠู‡ูŽุง ููŽุซูŽู„ูŽุงุซูŽุฉู ุฃูŽุดู’ูŠูŽุงุกูŽ", "audio": "vibevoice_data/train_audio/audio_000063.wav"}
{"text": "Speaker 0: ูƒุงู† ุนู…ุฑ ุงู„ูู‚ูŠุฏ ุซู…ุงู†ูŠู† ุณู†ุฉ.", "audio": "vibevoice_data/train_audio/audio_000064.wav"}
{"text": "Speaker 0: ุงุดุชุฑูŠุช ุณูŠุงุฑุฉ ุฌุฏูŠุฏุฉ ุงู„ุงุณุจูˆุน ุงู„ู…ุงุถูŠ.", "audio": "vibevoice_data/train_audio/audio_000065.wav"}
{"text": "Speaker 0: ูˆูŽู…ูู†ู’ู‡ูŽุง ูƒูŽุซู’ุฑูŽุฉู ุงุดู’ุชูุบูŽุงู„ูู‡ู ูˆูŽุชูŽุฑูŽุงุฏููู ุญูŽุงู„ูŽุงุชูู‡ู ุญูŽุชู‘ูŽู‰ ุฃูŽู†ู‘ูŽู‡ูŽุง ุชูŽุณู’ุชูŽูˆู’ุนูุจู ุฒูŽู…ูŽุงู†ูŽู‡ู ูˆูŽุชูŽุณู’ุชูŽู†ู’ููุฏู ุฃูŽูŠู‘ูŽุงู…ูŽู‡ู", "audio": "vibevoice_data/train_audio/audio_000066.wav"}
{"text": "Speaker 0: ู„ุง ุฃุตุฏู‚ ุฃู† ุชูˆู… ู‡ูˆ ุงู„ู‚ุงุชู„.", "audio": "vibevoice_data/train_audio/audio_000067.wav"}
{"text": "Speaker 0: ู‡ุฐุง ุงู„ู…ูƒุชุจ ู„ูŠ.", "audio": "vibevoice_data/train_audio/audio_000068.wav"}
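
Since the JSONL file is the suspected problem, a small checker can rule out formatting mistakes before retraining. The sketch below is only inferred from the sample above: it assumes each line is a JSON object with a "text" field starting with "Speaker N: " and an "audio" field holding a file path. The function names are hypothetical, and whether the training script strictly requires the speaker prefix is not confirmed in this thread.

```python
import json
import re
from pathlib import Path

# ASSUMPTION: the "Speaker N: " prefix is inferred from the sample records above.
SPEAKER_PREFIX = re.compile(r"^Speaker \d+: \S")


def make_record(transcript: str, audio_path: str, speaker: int = 0) -> str:
    """Build one JSONL line; ensure_ascii=False keeps Arabic text readable."""
    text = f"Speaker {speaker}: {transcript.strip()}"
    return json.dumps({"text": text, "audio": audio_path}, ensure_ascii=False)


def check_line(line: str, check_audio: bool = False) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["record is not a JSON object"]

    problems = []
    text = rec.get("text")
    audio = rec.get("audio")
    if not isinstance(text, str) or not SPEAKER_PREFIX.match(text):
        problems.append("text missing 'Speaker N: ' prefix or transcript")
    if not isinstance(audio, str) or not audio:
        problems.append("audio path missing")
    elif check_audio and not Path(audio).is_file():
        problems.append(f"audio file not found: {audio}")
    return problems


def check_file(path: str, check_audio: bool = True) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = check_line(line, check_audio=check_audio)
                if problems:
                    report[number] = problems
    return report
```

Running check_file("train_vibevoice.jsonl") from the directory the training command is launched in also verifies that the relative audio paths resolve. An empty result means the format is consistent, so a poor result would more likely come from the audio itself or the training settings.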

I trained using this command, and the result was a full model: https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-v2

The script saves all modules, but it only trains the diffusion head. Although the checkpoint contains everything, only the diffusion head has updated weights.
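
One way to confirm this is to compare the base and fine-tuned weights and list which modules actually differ. The sketch below uses plain Python lists so it runs anywhere. In practice each value would be a flattened tensor loaded from the two checkpoints. The parameter names in the example are hypothetical and are not the real VibeVoice module names.

```python
from collections import defaultdict
from typing import Mapping, Sequence


def changed_modules(
    base: Mapping[str, Sequence[float]],
    tuned: Mapping[str, Sequence[float]],
    depth: int = 2,
    tol: float = 0.0,
) -> dict[str, int]:
    """Count changed parameters per module, grouping names by their first `depth` parts."""
    counts: dict[str, int] = defaultdict(int)
    for name, base_values in base.items():
        tuned_values = tuned.get(name)
        if tuned_values is None:
            continue  # parameter absent from the fine-tuned checkpoint
        # Largest absolute element-wise difference between the two versions.
        max_diff = max(
            (abs(a - b) for a, b in zip(base_values, tuned_values)),
            default=0.0,
        )
        if max_diff > tol:
            module = ".".join(name.split(".")[:depth])
            counts[module] += 1
    return dict(counts)


# Toy example with HYPOTHETICAL parameter names.
base = {
    "model.language_model.layer0.weight": [0.1, 0.2],
    "model.prediction_head.proj.weight": [0.5, 0.5],
    "model.prediction_head.proj.bias": [0.0],
}
tuned = {
    "model.language_model.layer0.weight": [0.1, 0.2],   # unchanged
    "model.prediction_head.proj.weight": [0.5, 0.75],   # updated
    "model.prediction_head.proj.bias": [0.25],          # updated
}

print(changed_modules(base, tuned))
```

If only the diffusion head was trained, every changed parameter should fall under that one module, even though the saved checkpoint is full size.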
