VoiceTuner

Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation

Abstract. Voice large language models (LLMs) cast voice synthesis as a language modeling task in a discrete space and have demonstrated significant progress to date. Despite this success, the development of voice LLMs for low-resource applications is hampered by data scarcity and high computational cost. In this work, we propose VoiceTuner, a self-supervised pre-training and efficient fine-tuning approach for low-resource voice generation. Specifically, 1) to mitigate data scarcity, we leverage a large-scale unlabeled dataset and pre-train VoiceTuner-SSL without pre-defined applications, so that it can be fine-tuned on downstream tasks; 2) to further reduce the high training cost of full fine-tuning, we introduce a multiscale adapter, a plug-and-play module that updates only around 1% of the parameters; and 3) to alleviate the challenges of modeling long audio token sequences inherited from the inefficient attention mechanism, we introduce VoiceTuner-Mamba, which uses multiscale state space models in place of transformers. Experimental results demonstrate that VoiceTuner-SSL produces strong acoustic continuations, and that VoiceTuner exhibits superior quality and style similarity in three low-resource (1h, 10h, 30h) generation tasks. Audio samples are available at https://VoiceTuner.github.io .
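
To give a feel for the parameter budget mentioned in the abstract, the sketch below counts how many weights are trained when a backbone is frozen and only small adapter modules are updated. It is a back-of-the-envelope illustration, not the VoiceTuner implementation: it assumes a generic bottleneck adapter (one down-projection and one up-projection per layer) rather than the multiscale adapter, and the layer count, hidden sizes and bottleneck width are hypothetical values chosen for illustration.

```python
# Illustrative arithmetic only: a generic bottleneck adapter on a frozen
# transformer backbone. All sizes below are hypothetical assumptions and are
# not taken from the VoiceTuner paper.

def transformer_block_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one standard transformer block."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, each with scale and shift
    return attention + mlp + layer_norms


def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Weights and biases of a down-projection followed by an up-projection."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def trainable_fraction(n_layers: int, d_model: int, d_ff: int, bottleneck: int) -> float:
    """Share of all parameters that is updated when only the adapters are trained."""
    frozen = n_layers * transformer_block_params(d_model, d_ff)
    trainable = n_layers * bottleneck_adapter_params(d_model, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    # Hypothetical configuration: 12 layers, width 768, MLP width 3072, bottleneck 48.
    fraction = trainable_fraction(n_layers=12, d_model=768, d_ff=3072, bottleneck=48)
    print(f"trainable share: {fraction:.2%}")
```

With these assumed sizes the adapters account for roughly 1% of the total parameters. Embeddings and output layers are ignored here, so the share in a complete model would be somewhat lower.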

Model Overview

Audio Continuity of VoiceTuner-SSL

In this section, we provide audio continuation samples generated by VoiceTuner-SSL, illustrating its ability to preserve the speaking speed, style, emotion, and noise conditions of the prefix.

Speed

Speed	Prefix	Base	Medium	Large
slow
slow
normal
fast
fast

Style

Style	Prefix	Base	Medium	Large
whispering
angry
excited
newscast formal
shouting

Emotion

Emotion	Prefix	Base	Medium	Large
angry
sad
surprise

Noise Condition

Prefix	Base	Medium	Large

Low-Resource Finetuning

In this section, we provide the results of VoiceTuner on low-resource fine-tuning tasks, including instruction-guided TTS, zero-shot TTS, and singing voice synthesis.

Instruction-Guided TTS

Text	Instruction	GT	GT Voc.	30h full-p	30h adapter	30h full-s	10h full-p	10h adapter	1h full-p	1h adapter
As soon as we left Liverpool Street she produced a bag of shrimps, grasped each individual in turn firmly by the head and tail, and ate him.	Please generate a slow speaking female voice.
If you had taken another step, Peter Rabbit, you would have stepped right on our eggs.	Please help me generate a male voice with high tone and fast speaking speed.
In international affairs we are a short sighted people.	High pitched, normal speaking speed, low volume, man.
Now he wanted to join forces with his own kind, whether those men were potential enemies or not.	Talk with her unique bass.
Wonder where Jill was-Jill Bayne, Fayne, Sayne-what the devil-neck hurts, darned uncomfortable seat.	This woman whispered, her voice was shrilly.

Zero-Shot TTS

Text	GT	GT Voc.	30h full-p	30h adapter	10h full-p	10h adapter	1h full-p	1h adapter
No candle was burning in the caravan, probably from the economy of want. The hut was lighted only by a red tinge, arising from the opening at the top of the stove, in which sparkled a peat fire.
Having lost in his old age the intellect he had never possessed in his youth, was not responsible for the calamities of his reign.
He approached the great mansion.
Behind, projecting hinges indicated a door, and in the centre of this door a square opening showed a light inside the caravan.
In those two houses neither candle nor lamp was to be seen; nor in the whole street; nor in the whole town, so far as eye could reach.