.NET wrapper of HuggingFace Tokenizers library
$ dotnet add package Tokenizers.DotNet
- Load a tokenizer file (.json) from local storage

1. Install the Tokenizers.DotNet package.
2. Install the matching Tokenizers.DotNet.runtime.<OS>-<ARCH> package too (e.g. win-x64 or linux-arm64; check the NuGet package list above).

Check the following example code:
using Tokenizers.DotNet;
// Download skt/kogpt2-base-v2/tokenizer.json from the hub
var hubName = "skt/kogpt2-base-v2";
var filePath = "tokenizer.json";
var fileFullPath = await HuggingFace.GetFileFromHub(hubName, filePath, "deps");
Console.WriteLine($"Downloaded {fileFullPath}");
// Create a tokenizer instance
Tokenizer tokenizer;
try
{
    tokenizer = new Tokenizer(vocabPath: fileFullPath);
}
catch (TokenizerException e)
{
    Console.WriteLine(e.Message);
    return;
}
// Encode the text to tokens, then decode the tokens back to text
try
{
    var text = "음, 이제 식사도 해볼까요";
    Console.WriteLine($"Input text: {text}");
    var tokens = tokenizer.Encode(text);
    Console.WriteLine($"Encoded: {string.Join(", ", tokens)}");
    var decoded = tokenizer.Decode(tokens);
    Console.WriteLine($"Decoded: {decoded}");
}
catch (TokenizerException e)
{
    Console.WriteLine(e.Message);
    return;
}
Console.WriteLine($"Version of Tokenizers.DotNet.runtime.win: {tokenizer.GetVersion()}");
Console.WriteLine("--------------------------------------------------");
// Download openai-community/gpt2/tokenizer.json from the hub
hubName = "openai-community/gpt2";
filePath = "tokenizer.json";
fileFullPath = await HuggingFace.GetFileFromHub(hubName, filePath, "deps");
// Create a second tokenizer instance
Tokenizer tokenizer2;
try
{
    tokenizer2 = new Tokenizer(vocabPath: fileFullPath);
}
catch (TokenizerException e)
{
    Console.WriteLine(e.Message);
    return;
}
// Encode the text to tokens, then decode the tokens back to text
try
{
    var text2 = "i was nervous before the exam, and i had a fever.";
    Console.WriteLine($"Input text: {text2}");
    var tokens2 = tokenizer2.Encode(text2);
    Console.WriteLine($"Encoded: {string.Join(", ", tokens2)}");
    var decoded2 = tokenizer2.Decode(tokens2);
    Console.WriteLine($"Decoded: {decoded2}");
}
catch (TokenizerException e)
{
    Console.WriteLine(e.Message);
    return;
}
Console.WriteLine($"Version of Tokenizers.DotNet.runtime.win: {tokenizer2.GetVersion()}");
Console.ReadKey();
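
The runtime package to install alongside Tokenizers.DotNet depends on the target OS and CPU architecture. As a rough sketch, a shell helper could derive the package ID from `uname`. The function name `runtime_package` is hypothetical, and only the four runtime packages named in this README (win-x64, win-arm64, linux-x64, linux-arm64) are mapped; anything else is reported as unsupported.

```shell
#!/bin/sh
# Hypothetical helper: map an OS name and CPU architecture (as reported by
# uname) to the matching Tokenizers.DotNet runtime package ID.
# Usage: runtime_package [OS] [ARCH]   (defaults: uname -s, uname -m)
runtime_package() {
    os="${1:-$(uname -s)}"
    arch="${2:-$(uname -m)}"

    case "$os" in
        Linux) os_id="linux" ;;
        # Typical uname -s values in Git Bash / MSYS2 / Cygwin on Windows
        MINGW*|MSYS*|CYGWIN*|Windows_NT) os_id="win" ;;
        *) echo "unsupported OS: $os" >&2; return 1 ;;
    esac

    case "$arch" in
        x86_64|amd64) arch_id="x64" ;;
        aarch64|arm64) arch_id="arm64" ;;
        *) echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac

    echo "Tokenizers.DotNet.runtime.${os_id}-${arch_id}"
}
```

With the function loaded, `dotnet add package "$(runtime_package)"` would install the runtime package for the current machine, after `dotnet add package Tokenizers.DotNet` for the main package.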
You can use Docker to compile this library for Windows x64/arm64 and Linux x64/arm64.
Run update_version.ps1 before running Docker to update the package version.
Windows:
PS > docker build -f Dockerfile -t ghcr.io/sappho192/tokenizers.dotnet:latest .
PS > docker run -v .\nuget:/out --rm ghcr.io/sappho192/tokenizers.dotnet:latest
Linux/macOS:
$ docker build -f Dockerfile -t ghcr.io/sappho192/tokenizers.dotnet:latest .
$ docker run -v ./nuget:/out --rm ghcr.io/sappho192/tokenizers.dotnet:latest
Built packages will be in the nuget folder.
(Note that this has been confirmed only on a Windows machine.)
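1. Prepare the following:
   - Rust build system (cargo)
   - .NET build system (dotnet 6.0 or above)
   - PowerShell (7.4.2 or above)
2. Set the version in NATIVE_LIB_VERSION.txt.
3. Run build_all_clean.ps1.

To build Tokenizers.DotNet.runtime.<OS> only, run build_rust.ps1.
To build Tokenizers.DotNet only, run build_dotnet.ps1.
All build artifacts are placed in the nuget directory.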
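
Both build routes leave their artifacts in the nuget directory, so a quick check after a build is to look for one .nupkg per package. The sketch below assumes a full build produces the main package plus the four runtime packages named in this README, and relies on NuGet's `<id>.<version>.nupkg` file naming. The function name `missing_packages` is hypothetical and not part of the repository's scripts.

```shell
#!/bin/sh
# Hypothetical helper: list the package IDs that have no .nupkg file in the
# given directory (default: ./nuget). Returns non-zero if any is missing.
missing_packages() {
    dir="${1:-nuget}"
    status=0
    for id in \
        Tokenizers.DotNet \
        Tokenizers.DotNet.runtime.win-x64 \
        Tokenizers.DotNet.runtime.win-arm64 \
        Tokenizers.DotNet.runtime.linux-x64 \
        Tokenizers.DotNet.runtime.linux-arm64
    do
        # Package files are named <id>.<version>.nupkg. Requiring a digit
        # right after the ID keeps "Tokenizers.DotNet" from matching the
        # runtime packages.
        found=0
        for f in "$dir/$id".[0-9]*.nupkg; do
            if [ -e "$f" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: $id"
            status=1
        fi
    done
    return "$status"
}
```

Running `missing_packages ./nuget` prints one `missing: <id>` line per absent package and nothing when all five are present. A partial build (for example build_rust.ps1 alone) would be expected to report the packages it does not produce.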
1. Run update_version.ps1 to update the version number.
2. Use a version suffix (-RC0, -RC1, ...) until the final check.
3. BE SURE to check that the following files are committed:
   - NATIVE_LIB_VERSION.txt
   - Cargo.toml
   - Tokenizers.DotNet.nuspec
   - Tokenizers.DotNet.runtime.win-x64.nuspec
   - Tokenizers.DotNet.runtime.win-arm64.nuspec
   - Tokenizers.DotNet.runtime.linux-x64.nuspec
   - Tokenizers.DotNet.runtime.linux-arm64.nuspec
4. After the final check, drop the suffix (-RC0, ...).
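
To illustrate the -RC0, -RC1, ... suffix scheme mentioned above, here is a minimal sketch of two string helpers. Both function names are hypothetical and not part of the repository's scripts, the version numbers in the comments are made up, and stripping the suffix for the final version is an assumption about the release flow; update_version.ps1 remains the actual way to change the version.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the -RC<N> suffix scheme.

# next_rc VERSION
#   "1.0.0"     -> "1.0.0-RC0"  (first release candidate)
#   "1.0.0-RC0" -> "1.0.0-RC1"  (next release candidate)
next_rc() {
    case "$1" in
        *-RC[0-9]*)
            base="${1%-RC*}"    # text before the -RC<N> suffix
            n="${1##*-RC}"      # the number N after -RC
            echo "${base}-RC$((n + 1))"
            ;;
        *)
            echo "$1-RC0"
            ;;
    esac
}

# final_version VERSION: drop a trailing -RC<N> suffix, if there is one.
final_version() {
    echo "${1%-RC*}"
}
```

For example, `next_rc "$(cat NATIVE_LIB_VERSION.txt)"` would print the next candidate version, assuming that file holds a single version string.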