The DirectML runtime for KokoroSharp: an inference engine for Kokoro TTS with ONNX runtime, enabling fast and flexible local text-to-speech (full-precision or quantized) purely via C#. It features segment streaming, voice mixing, linear job scheduling, and optional playback.
$ dotnet add package KokoroSharp.DirectML
https://github.com/user-attachments/assets/82a32382-2e9b-4233-a66f-987b2802717e
KokoroSharp is a fully-featured inference engine for Kokoro TTS, built entirely in C# with ONNX runtime. It enables developers to perform flexible and fast text-to-speech synthesis utilizing multiple speakers and languages.
Supports languages/accents: [American English, British English, Mandarin Chinese, Japanese, Hindi, Spanish, French, Italian, Brazilian/Portuguese].
(phonemes -> tokens) conversion.
KokoroTTS tts = KokoroTTS.LoadModel(); // Load or download the model (~320MB for full precision)
KokoroVoice heartVoice = KokoroVoiceManager.GetVoice("af_heart"); // Grab a voice of your liking,
while (true) { tts.SpeakFast(Console.ReadLine(), heartVoice); } // .. and have it speak your text!
// Note: Language detection is automated based on what the loaded voice supports.
The snippet above is the simplest, highest-level way to get started. For more control, check out the example Program, which covers more advanced features such as job scheduling, voice mixing, and long-term, speaker-agnostic playback queuing.
The model can be loaded from a local path with KokoroTTS.LoadModel("path/to/model"), or downloaded automatically with KokoroTTS.LoadModel(). Check out the various overloads of KokoroTTS.LoadModel for background loading.
KokoroSharp prioritizes a smooth developer experience by logging potential misuse instead of throwing exceptions. Wherever possible, the library attempts to automatically resolve issues to minimize disruptions.
All communication with the AI model and playback devices happens on background threads, letting the main thread focus on rendering the UI in peace. The library is carefully designed with thread-safety in mind.
The voices folder is automatically copied to your build path when you build, so the voices are ready to be accessed. The same applies to the espeak backends. Developers may opt to remove them when shipping their apps.
Note that LoadVoicesFromPath exists as an option, in case developers want to implement their own voice-loading logic when shipping a project that uses KokoroSharp for text-to-speech synthesis.
In addition, the built-in tokenization (text -> tokens) is NOT mandatory and can be bypassed on platforms like Android/iOS, provided developers supply pre-phonemized input from their phonemization solution of choice.
For Phoneme Literals, you can use the following syntax: "[tomato](/təmeɪtoʊ/) [tomato](/təmɑːtoʊ/).".
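To make the literal syntax above concrete, here is a minimal sketch that pulls the (word, phonemes) pairs out of such a string. It is not KokoroSharp's own parser: the PhonemeLiteralSketch class and its Extract method are hypothetical names introduced purely for illustration, and the pattern assumes the phoneme part contains no "/" character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, NOT part of KokoroSharp. It only illustrates the
// "[word](/phonemes/)" literal syntax by extracting (word, phonemes) pairs.
public static class PhonemeLiteralSketch
{
    // Matches "[display text](/phonemes/)".
    // Assumption: the phoneme part itself contains no '/' character.
    private static readonly Regex Literal =
        new Regex(@"\[(?<word>[^\]]+)\]\(/(?<phonemes>[^/]+)/\)");

    public static List<(string Word, string Phonemes)> Extract(string text)
    {
        var result = new List<(string Word, string Phonemes)>();
        foreach (Match m in Literal.Matches(text))
        {
            // Named groups keep the word and its pronunciation separate.
            result.Add((m.Groups["word"].Value, m.Groups["phonemes"].Value));
        }
        return result;
    }
}
```

Run on the example string above, Extract yields two entries that share the written word "tomato" but carry different phoneme strings, which is the point of the syntax: the pronunciation is pinned per occurrence rather than per word.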