Introducing Dia: A New Open Source Text-to-Speech Model Taking on Industry Leaders

Overview of Dia
Nari Labs, a two-person startup, has launched Dia, a 1.6-billion-parameter text-to-speech (TTS) model designed to generate realistic dialogue from text prompts. According to co-creator Toby Kim, Dia's output quality exceeds that of established competitors such as ElevenLabs and Google's NotebookLM AI podcast feature.
The Background of Dia
Dia began with a desire for better control over TTS output. Kim has noted in social media posts that neither he nor his collaborator started out as AI experts; their interest was sparked by the capabilities of NotebookLM's podcast feature. After testing every TTS API they could find and remaining unsatisfied with the realism of existing offerings, they decided to build their own model.
Kim credited Google for access to Tensor Processing Units (TPUs) through its TPU Research Cloud program, which the pair used to train Dia. The model's code and weights (its trained parameters) are freely available for download from Hugging Face and GitHub, and users can try Dia directly by generating speech in a Hugging Face Space.
Key Features of Dia
Customization and Control
Dia goes beyond traditional TTS capabilities with advanced features that enhance user experience:
- Emotional Tone: Ability to convey different emotional undertones in dialogue.
- Speaker Tagging: Users can label different speakers using tags like [S1] and [S2].
- Nonverbal Cues: Scripts can include nonverbal expressions such as (laughs) or (coughs) directly in the text, which Dia interprets to produce more engaging dialogue.
Currently, Dia supports English only. Generated voices vary from run to run unless users fix a generation seed or supply an audio clip for conditioning, which gives them control over voice tone and likeness.
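The script conventions above can be sketched with a small helper. This is an illustrative snippet, not part of Dia's API: it only assembles a prompt string using the speaker tags ([S1], [S2]) and parenthesized nonverbal cues the model documents.

```python
# Illustrative helper (hypothetical, not from Dia's codebase) that builds
# a dialogue script in the format Dia expects: each turn prefixed with a
# speaker tag, with nonverbal cues written inline in parentheses.

def build_script(turns):
    """turns: list of (speaker_number, text) pairs; returns one prompt string."""
    return " ".join(f"[S{speaker}] {text}" for speaker, text in turns)

script = build_script([
    (1, "Dia is an open source text-to-speech model."),
    (2, "Wait, it handles nonverbal cues too? (laughs)"),
])
print(script)
# → [S1] Dia is an open source text-to-speech model. [S2] Wait, it handles nonverbal cues too? (laughs)
```

The resulting string would then be passed to the model as a single text prompt; Dia assigns a distinct voice to each tagged speaker.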
Competitive Analysis: Dia Versus ElevenLabs and Sesame
Nari Labs has shared audio examples comparing Dia to competitors such as ElevenLabs Studio and Sesame, a recent startup co-founded by Oculus co-founder Brendan Iribe. These comparisons suggest that:
- Natural Dialogue: Dia manages to deliver better timing and incorporates nonverbal expressions effectively. For instance, when a script includes a laugh, Dia produces natural laughter rather than a text substitute like "haha."
- Emotional Range: In multi-turn conversations featuring emotional content, Dia smoothly transitions in tone better than its rivals, especially in high-intensity scenarios.
- Handling Nonverbal Scripts: Dia can adeptly process scripts that consist solely of nonverbal cues, something its competitors often overlook.
In testing, Dia also handled rhythmically complex content such as rap lyrics, delivering fluid, engaging speech that preserves the intended tempo.
Accessing and Using Dia
Developers interested in working with Dia can find it on Nari Labs' GitHub and Hugging Face pages. The model runs on PyTorch 2.0+ with CUDA 12.6 and requires roughly 10GB of VRAM. On a capable GPU such as the NVIDIA A4000, it reaches about 40 tokens per second during inference.
Dia currently requires a GPU, but Nari Labs plans CPU support and a quantized version to improve accessibility. The company provides both a Python library and a command-line interface (CLI) tool to ease deployment.
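The throughput figure above allows a rough back-of-the-envelope estimate of generation time. The helper below is illustrative only; the 40 tokens/second rate is the source's A4000 figure, while the token count is a placeholder, since the source does not state how many tokens correspond to a second of audio.

```python
# Rough generation-time estimate at the ~40 tokens/sec the source
# reports for an NVIDIA A4000. The token count passed in is a
# hypothetical example, not a figure from the source.

def estimated_generation_seconds(num_tokens, tokens_per_sec=40):
    """Return the wall-clock seconds needed to emit num_tokens."""
    return num_tokens / tokens_per_sec

print(estimated_generation_seconds(2000))  # → 50.0
```

A slower GPU simply lowers `tokens_per_sec`, lengthening the estimate proportionally.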
Open Source Commitment
Dia is released under an Apache 2.0 license, which permits commercial use. Nari Labs has, however, outlined restrictions against unethical applications such as impersonation and misinformation, and the creators advocate responsible experimentation, reflecting a commitment to ethical practice in AI development.
The development of Dia was supported by Google’s TPU Research Cloud and contributions from platforms like Hugging Face. Nari Labs operates with a small team but actively invites community collaboration via Discord and GitHub.
With Dia, Nari Labs aims to provide a more human-like interaction in TTS, expanding possibilities in fields ranging from content creation to assistive technologies and beyond.