Second version of the ESPER library for speech and singing parametrization, modification, compression and recovery
ESPER stands for "enhanced separate processing of excitation and residuals" and is, at its core, an algorithm that splits audio into a voiced and an unvoiced part and describes both parts in a specialized format. The distinct statistical properties of the two parts can then be used to efficiently modify speech parameters that are hard or impossible to change in a waveform or mel representation. Additionally, audio in the ESPER format can be compressed to ≈1/10th of its original size with minimal loss, and the format is well-suited for processing with AI/ML techniques.
A typical workflow using ESPER consists of three steps:
Additionally, several other data paths are available, which enable libESPER-V2 to be used in a variety of more complex applications:
The main features of the library are the "forward" and "backward" ESPER transforms, which convert an audio waveform to the ESPER format and back, respectively. For both directions, an exact method and a faster, approximate method are available. Additionally, the time step size and the number of channels used to describe the voiced and unvoiced parts can be chosen freely.
Part of the forward transform is a pitch detection step, which can also be run on its own. It uses a custom, graph-based algorithm that outputs both pitch values with arbitrary time step resolution, and annotations for the position of individual pitch periods. If an approximate pitch is known, either as a single value for the whole sample or as a (possibly incomplete) array, the algorithm can optionally use it as guidance.
One of the most interesting capabilities of libESPER-V2 is producing natural, artifact-free pitch-shifted versions of spoken audio. The shift does not need to be uniform: audio can be shifted from any source pitch curve to any target pitch curve. Completely unvoiced sections with no determinable pitch can be handled as well.
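To illustrate what a non-uniform shift between two pitch curves means, the following minimal sketch builds a target pitch curve from a source curve and a per-frame offset in semitones, using the standard equal-temperament relation (a shift of n semitones multiplies the frequency by 2^(n/12)). It does not call libESPER-V2. The helper name ShiftPitchCurve and the convention that a pitch of 0 marks an unvoiced frame are assumptions made for this example, not part of the library's documented API.

```csharp
using System;

// Hypothetical helper, NOT part of libESPER-V2.
// Builds a target pitch curve (Hz) from a source curve (Hz) and a per-frame
// shift in semitones. Frames with pitch <= 0 are treated as unvoiced
// (an assumption for this sketch) and are passed through unchanged.
static double[] ShiftPitchCurve(double[] sourceHz, double[] semitones)
{
    if (sourceHz.Length != semitones.Length)
        throw new ArgumentException("Both curves must have the same length.");

    var targetHz = new double[sourceHz.Length];
    for (int i = 0; i < sourceHz.Length; i++)
    {
        // n semitones correspond to a frequency ratio of 2^(n/12).
        targetHz[i] = sourceHz[i] > 0.0
            ? sourceHz[i] * Math.Pow(2.0, semitones[i] / 12.0)
            : sourceHz[i];
    }
    return targetHz;
}

// Example: one octave up, unchanged, an unvoiced frame, one octave down.
double[] source = { 220.0, 220.0, 0.0, 110.0 };
double[] shift = { 12.0, 0.0, 12.0, -12.0 };
double[] target = ShiftPitchCurve(source, shift);
Console.WriteLine(string.Join(", ", target));
```

A source/target pair of this kind is the sort of input a non-uniform pitch shift works with; how the curves are actually passed to the library is described in the wiki.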
libESPER-V2 can stretch or compress samples in time, speeding up or slowing down the audio. This can be done by arbitrary factors, without producing artifacts.
libESPER-V2 also includes a collection of vocal effects, useful for a range of speech modification tasks:
The provided compression functions discard the phase information of the voiced part (which carries negligible information in most cases), compress the unvoiced part using a mel scale with configurable resolution, and apply downsampling in time by a configurable factor. As such, the compression is NOT lossless, but in practice the losses are minimal compared to the compression ratio.
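As a rough illustration of why a mel scale reduces the amount of data needed for the unvoiced part, the sketch below computes band edges that are equally spaced on the mel scale. It uses the widely used HTK-style formula mel = 2595 * log10(1 + f / 700). Whether libESPER-V2 uses this exact variant is an assumption, and the function names and the chosen band count and frequency limit are hypothetical.

```csharp
using System;

// HTK-style mel conversion. Assumed for illustration; the variant used
// inside libESPER-V2 may differ.
static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);
static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

// Hypothetical helper: returns bandCount + 1 band edges in Hz, equally
// spaced on the mel scale between 0 Hz and maxHz.
static double[] MelBandEdges(int bandCount, double maxHz)
{
    if (bandCount < 1)
        throw new ArgumentOutOfRangeException(nameof(bandCount));

    double maxMel = HzToMel(maxHz);
    var edges = new double[bandCount + 1];
    for (int i = 0; i <= bandCount; i++)
        edges[i] = MelToHz(maxMel * i / bandCount);
    return edges;
}

// Example: 8 bands covering 0 Hz to 22050 Hz.
double[] edges = MelBandEdges(8, 22050.0);
for (int i = 0; i < edges.Length; i++)
    Console.WriteLine($"edge {i}: {edges[i]:F1} Hz");
```

The bands are narrow at low frequencies and wide at high frequencies, so a small number of mel bands can stand in for a much larger number of linear frequency bins. A higher band count keeps more detail at the cost of a lower compression ratio.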
The library also includes functions for serializing and deserializing audio data in the ESPER format to and from binary strings. These strings include all metadata required for decoding and for ensuring version compatibility. This setup was chosen over a file parser implementation to allow developers to save several ESPER-format audio samples as part of a single, custom file if so desired.
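The following sketch shows one possible way to store several such binary strings in a single custom container, using a simple length-prefixed layout. It treats each serialized sample as an opaque byte array and does not call libESPER-V2. PackBlobs, UnpackBlobs and the layout itself are hypothetical choices made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical container layout, NOT defined by libESPER-V2:
// [int32 count] then, per blob, [int32 length][length bytes].
static byte[] PackBlobs(IReadOnlyList<byte[]> blobs)
{
    using var stream = new MemoryStream();
    using var writer = new BinaryWriter(stream);
    writer.Write(blobs.Count);
    foreach (byte[] blob in blobs)
    {
        writer.Write(blob.Length); // length prefix
        writer.Write(blob);        // raw payload
    }
    writer.Flush();
    return stream.ToArray();
}

static List<byte[]> UnpackBlobs(byte[] packed)
{
    using var reader = new BinaryReader(new MemoryStream(packed));
    int count = reader.ReadInt32();
    var blobs = new List<byte[]>();
    for (int i = 0; i < count; i++)
    {
        int length = reader.ReadInt32();
        byte[] blob = reader.ReadBytes(length);
        if (blob.Length != length)
            throw new EndOfStreamException("Container is truncated.");
        blobs.Add(blob);
    }
    return blobs;
}

// Placeholder byte arrays stand in for serialized ESPER samples.
var samples = new List<byte[]> { new byte[] { 1, 2, 3 }, new byte[] { 4, 5 } };
byte[] packed = PackBlobs(samples);
List<byte[]> restored = UnpackBlobs(packed);
Console.WriteLine($"{restored.Count} blobs, {packed.Length} bytes");
```

Because each binary string already carries its own metadata, the container only needs to record where one sample ends and the next begins.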
Please see the wiki at https://github.com/CdrSonan/libESPER-V2/wiki for a comprehensive overview of available classes, functions and best practices.
libESPER-V2 is available on NuGet!
https://www.nuget.org/packages/libESPER-V2
To add it to an existing .NET 8 (or later) project, open a terminal in the project directory and run the following .NET CLI command:
dotnet add package libESPER-V2
For other build and dependency management systems, follow the instructions on the NuGet website.
In addition to NuGet, a ready-to-use binary package is provided with every release. It can be downloaded from the Releases page. After downloading, move the files to an appropriate location in the file structure of your target project. Open your target project in Visual Studio, then right-click on the project (not the solution!) and choose
Add > Reference...,
then select the .DLL file. Afterward, you can access the contents of the libESPER-V2 package by putting
using libESPER-V2
at the top of your file(s).
The main entry point for editing and building the project is the provided .sln file. It is compatible with Microsoft Visual Studio 2022 or later and JetBrains Rider 2024 or later. (Compatibility with older versions and other IDEs has not been tested.)
After loading the .sln file, you will notice it contains two projects:
Your IDE should automatically detect the latter as the testing project for the library. Use the standard build, debug/release config, and test functions of the libESPER-V2 project to build and test it in different configurations. When building, three items will be generated in your build directory:
The .dll and its metadata files can then be used in the same way as the binaries downloadable from the Releases page.