A text translation library based on LLMs, especially HuggingFace's EncoderDecoderModel.
$ dotnet add package EDMTranslator
| Package | Repo | Description |
|---|---|---|
| EDMTranslator | | Main library |
- `tohoku-nlp/bert-base-japanese-v2` and `openai-community/gpt2`, fine-tuned with the JESC dataset
- `tohoku-nlp/bert-base-japanese-v2` and `skt/kogpt2-base-v2`, fine-tuned with the FF14 dataset
- `tohoku-nlp/bert-base-japanese-v2` and `skt/kogpt2-base-v2`, fine-tuned with the AIHub dataset

The following guide assumes that you are using `JESCJaEnTranslator`, the JESC-based translator listed above.
1. Install the `EDMTranslator` package.
2. Install the `Tokenizers.DotNet.runtime.win` package too.
3. Download `unidic-mecab-2.1.2_bin.zip` from https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/ and unzip the archive somewhere.
4. Download the model archive (`onnx_jesc-ja-en.7z`) and unzip the archive somewhere.

Write code like the example below and you are good to go 🫡
Note that you need to set `encoderDictDir` and `modelDir` to the paths where you unzipped the archives.
```csharp
// Console application which translates a Japanese sentence to English with JESCJaEnTranslator
using EDMTranslator.Tokenization;
using EDMTranslator.Translation;

// Prepare the tokenizer
var encoderVocabPath = await BertJapaneseTokenizer.HuggingFace.GetVocabFromHub("tohoku-nlp/bert-base-japanese-v2");
var hubName = "openai-community/gpt2";
var decoderVocabFilename = "tokenizer.json";
var decoderVocabPath = await Tokenizers.DotNet.HuggingFace.GetFileFromHub(hubName, decoderVocabFilename, "deps");
string encoderDictDir = @"D:\DATASET\unidic-mecab-2.1.2_bin";
var tokenizer = new BertJa2GPTTokenizer(
    encoderDictDir: encoderDictDir, encoderVocabPath: encoderVocabPath,
    decoderVocabPath: decoderVocabPath);

void TestTokenizer(ITokenizer tokenizer)
{
    Console.WriteLine("--Tokenizer test--");
    Console.WriteLine("[Encode]");
    var sentenceJa = "打ち合わせが終わった後にご飯を食べましょう。";
    Console.WriteLine($"Input: {sentenceJa}");
    var (embeddingsJa, attentionMask) = tokenizer.Encode(sentenceJa);
    Console.WriteLine($"Encoded: {string.Join(", ", embeddingsJa)}");

    Console.WriteLine("[Decode]");
    // Tokens of "i was nervous before the exam, and i had a fever."
    var tokens = new uint[] { 72, 373, 10927, 878, 262, 2814, 11, 290, 1312, 550, 257, 17372, 13 };
    Console.WriteLine($"Input: {string.Join(", ", tokens)}");
    var decoded = tokenizer.Decode(tokens);
    Console.WriteLine($"Decoded: {decoded}");
}
TestTokenizer(tokenizer);

// Prepare the translator
string modelDir = @"D:\MODEL\jesc-ja-en-translator\onnx"; // The folder should contain encoder_model.onnx and decoder_model_merged.onnx
var translator = new JESCJaEnTranslator(tokenizer, modelDir);

void TestTranslator(JESCJaEnTranslator translator)
{
    Console.WriteLine("--Translator test--");
    Translate(translator, "打ち合わせが終わった後にご飯を食べましょう。");
    Translate(translator, "試験前に緊張したあまり、熱がでてしまった。");
    Translate(translator, "山田は英語にかけてはクラスの誰にも負けない。");
    Translate(translator, "この本によれば、最初の人工橋梁は新石器時代にさかのぼるという。");
}
TestTranslator(translator);

static void Translate(JESCJaEnTranslator translator, string sentence)
{
    Console.WriteLine($"SourceText: {sentence}");
    string translated = translator.Translate(sentence);
    Console.WriteLine($"Translated: {translated}");
}
```
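Since the sample depends on two hand-edited paths, it can help to check them before constructing the tokenizer and translator. The following is a minimal standalone sketch, not part of EDMTranslator: `FindMissingModelFiles` is a hypothetical helper, and it only assumes the two file names given in the `modelDir` comment of the sample. It uses nothing beyond `System.IO`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not an EDMTranslator API): returns the names of the
// expected ONNX files that are not present in modelDir.
static List<string> FindMissingModelFiles(string modelDir)
{
    // File names taken from the comment next to modelDir in the sample.
    string[] expected = { "encoder_model.onnx", "decoder_model_merged.onnx" };
    return expected
        .Where(name => !File.Exists(Path.Combine(modelDir, name)))
        .ToList();
}

// Same example paths as in the sample; adjust them to your environment.
string encoderDictDir = @"D:\DATASET\unidic-mecab-2.1.2_bin";
string modelDir = @"D:\MODEL\jesc-ja-en-translator\onnx";

// Report a missing dictionary directory.
if (!Directory.Exists(encoderDictDir))
{
    Console.Error.WriteLine($"Dictionary directory not found: {encoderDictDir}");
}

// Report each expected model file that is missing.
foreach (var name in FindMissingModelFiles(modelDir))
{
    Console.Error.WriteLine($"Missing model file: {Path.Combine(modelDir, name)}");
}
```

If you merge this into the sample, reuse its existing `encoderDictDir` and `modelDir` variables instead of declaring them again.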
- dotnet (6.0, 7.0, 8.0)
- … (7.4.2 or above)
- Run `cbuild.ps1`

The build artifact will be saved in the `nuget` directory.