training
I am preparing to train, and I think the results you obtained are excellent. Could you please share the training parameters or any other information that might help? I would like to use this dataset, and I think the same settings you used would work: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
How long would training take on an H100 GPU with 80 GB?
Hi! I recommend starting with the default parameters. I used these: src.finetune_vibevoice_lora --model_name_or_path aoi-ot/VibeVoice-Large --processor_name_or_path src/vibevoice/processor --train_jsonl train_vibevoice.jsonl --text_column_name text --audio_column_name audio --output_dir outputs --per_device_train_batch_size 8 --gradient_accumulation_steps 16 --learning_rate 2.5e-5 --num_train_epochs 5 --logging_steps 1 --save_steps 100 --report_to wandb --remove_unused_columns False --bf16 True --ddpm_batch_mul 4 --ce_loss_weight 0.0 --diffusion_loss_weight 1.0 --do_train --train_diffusion_head --gradient_clipping --gradient_checkpointing True --lora_target_modules NONE --voice_prompt_drop_rate 0.2
I trained it for 3.5 epochs on an RTX 6000 PRO; it took around 6 hours, if I recall correctly.
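For scale, here is a rough sketch of what those flags imply. It assumes a single GPU and that the script follows the usual Hugging Face Trainer convention, where `--save_steps` counts optimizer steps; neither is confirmed in this thread. The dataset size is a made-up placeholder, so substitute your own.

```python
import math

# Values taken from the command above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 16
save_steps = 100

# Assumptions, not from the thread: one GPU and a hypothetical dataset size.
num_gpus = 1
num_training_samples = 20_000

# Samples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
# Approximate number of optimizer updates in one pass over the data.
steps_per_epoch = math.ceil(num_training_samples / effective_batch_size)
# Samples seen between two saved checkpoints.
samples_per_checkpoint = save_steps * effective_batch_size

print(f"effective batch size: {effective_batch_size}")
print(f"approx. optimizer steps per epoch: {steps_per_epoch}")
print(f"samples between checkpoints: {samples_per_checkpoint}")
```

With a small dataset, one epoch can be fewer than 100 optimizer steps, in which case a lower `--save_steps` may be worth considering.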
I ran the training, but the results were very poor. I think the issue may be with the train_vibevoice.jsonl I created.
Can you share your JSONL, or the script you used to extract it from https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/ ?
Here's mine; I tested it using ComfyUI:
https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-v1
My JSONL (69 records, audio_000000.wav through audio_000068.wav, all in this format; the Arabic transcripts were corrupted by an encoding error and are shown as placeholders):
{"text": "Speaker 0: [Arabic transcript]", "audio": "vibevoice_data/train_audio/audio_000000.wav"}
{"text": "Speaker 0: [Arabic transcript]", "audio": "vibevoice_data/train_audio/audio_000001.wav"}
...
{"text": "Speaker 0: [Arabic transcript]", "audio": "vibevoice_data/train_audio/audio_000068.wav"}
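For comparison, a minimal sketch of writing records in the same shape as the sample above (`text` with a `Speaker 0: ` prefix, `audio` as a path). The function name and its arguments are hypothetical; pulling the sentence and audio file out of Common Voice is not shown. Whether the JSONL is actually what caused the poor results is not established here; the sketch only guards against empty transcripts, stray newlines, and non-UTF-8 output.

```python
import json
from pathlib import Path


def write_vibevoice_jsonl(samples, jsonl_path, speaker_prefix="Speaker 0: "):
    """Write (transcript, audio_path) pairs as one JSON object per line.

    Returns the number of records written.
    """
    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for transcript, audio_path in samples:
            # Collapse newlines and repeated spaces into single spaces.
            transcript = " ".join(transcript.split())
            if not transcript:
                continue  # skip records with an empty transcript
            record = {
                "text": speaker_prefix + transcript,
                "audio": Path(audio_path).as_posix(),
            }
            # ensure_ascii=False keeps Arabic text readable in the file
            # instead of \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Opening the resulting file in a UTF-8 aware editor is a quick way to confirm the transcripts are intact before training.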
I trained using this command, and the result was a full model! https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-v2
The script saves all modules, but it only trains the diffusion head. Even though the checkpoint contains everything, only the diffusion head has updated weights.
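One way to check this yourself is to compare the saved checkpoint against the base model and see which parameter groups differ. The sketch below works on plain dicts that map parameter names to flat lists of floats. Loading real checkpoints and flattening their tensors is left out, and the parameter names are hypothetical stand-ins, not the actual VibeVoice names.

```python
from collections import Counter


def changed_parameter_groups(base, tuned, depth=2, tol=0.0):
    """Count parameters whose values differ, grouped by name prefix.

    `base` and `tuned` map parameter names to flat sequences of floats.
    `depth` is how many dot-separated name parts form the group label.
    """
    changed = Counter()
    for name, base_values in base.items():
        tuned_values = tuned.get(name)
        if tuned_values is None:
            continue  # parameter missing from the fine-tuned checkpoint
        differs = len(base_values) != len(tuned_values) or any(
            abs(a - b) > tol for a, b in zip(base_values, tuned_values)
        )
        if differs:
            prefix = ".".join(name.split(".")[:depth])
            changed[prefix] += 1
    return dict(changed)


# Toy stand-ins for two checkpoints; the parameter names are hypothetical.
base = {
    "model.language_model.layers.0.weight": [0.1, 0.2],
    "model.prediction_head.proj.weight": [0.3, 0.4],
}
tuned = {
    "model.language_model.layers.0.weight": [0.1, 0.2],
    "model.prediction_head.proj.weight": [0.31, 0.38],
}
print(changed_parameter_groups(base, tuned))
```

If only the diffusion head was trained, every changed group should fall under that module's prefix.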