Training

Lindera Node.js supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-nodejs with the train feature, which is enabled by default:

npm run build -- --features train

Training a Model

Use train() to train a CRF model from a seed lexicon and annotated corpus:

const { train } = require("lindera-nodejs");

train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
  lambda: 0.01,
  maxIter: 100,
  maxThreads: 4,
});

Training Parameters

Parameter    Type                 Default     Description
seed         string               required    Path to the seed lexicon file (CSV format)
corpus       string               required    Path to the annotated training corpus
charDef      string               required    Path to the character definition file (char.def)
unkDef       string               required    Path to the unknown word definition file (unk.def)
featureDef   string               required    Path to the feature definition file (feature.def)
rewriteDef   string               required    Path to the rewrite rule definition file (rewrite.def)
output       string               required    Output path for the trained model file
lambda       number               0.01        L1 regularization cost (0.0–1.0)
maxIter      number               100         Maximum number of training iterations
maxThreads   number | undefined   undefined   Number of threads (undefined = auto-detect CPU cores)
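
Since lambda, maxIter, and maxThreads have defaults, a minimal call can pass only the required paths. A minimal sketch, assuming the optional fields fall back to the defaults in the table above when omitted:

const { train } = require("lindera-nodejs");

// Only required parameters; lambda, maxIter, and maxThreads are assumed
// to default to 0.01, 100, and CPU auto-detect when omitted.
train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
});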

Exporting a Trained Model

After training, export the model to dictionary source files using exportModel():

const { exportModel } = require("lindera-nodejs");

exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
  metadata: "resources/training/metadata.json",
});

Export Parameters

Parameter   Type                 Default     Description
model       string               required    Path to the trained model file (.dat)
output      string               required    Output directory for dictionary source files
metadata    string | undefined   undefined   Path to a base metadata.json file
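
Since metadata is optional, an export can pass only the required paths; as noted in the file list below, metadata.json is only written when the metadata parameter is provided. A minimal sketch:

const { exportModel } = require("lindera-nodejs");

// Minimal export without a base metadata file; no metadata.json is
// expected in the output directory in this case.
exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
});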

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (when metadata parameter is provided)
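
A quick sanity check after export can confirm these files were written. This sketch uses only Node's built-in fs and path modules and is illustrative, not part of the lindera-nodejs API:

const fs = require("fs");
const path = require("path");

// Verify that the expected dictionary source files exist.
const expected = ["lex.csv", "matrix.def", "unk.def", "char.def"];
for (const name of expected) {
  const file = path.join("/tmp/dictionary_source", name);
  console.log(`${name}: ${fs.existsSync(file) ? "ok" : "missing"}`);
}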

Complete Workflow

The full workflow for training and using a custom dictionary:

const {
  train,
  exportModel,
  buildDictionary,
  Metadata,
  TokenizerBuilder,
} = require("lindera-nodejs");

// Step 1: Train the CRF model
train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
  lambda: 0.01,
  maxIter: 100,
});

// Step 2: Export to dictionary source files
exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
  metadata: "resources/training/metadata.json",
});

// Step 3: Build the dictionary from exported source files
const metadata = Metadata.fromJsonFile("/tmp/dictionary_source/metadata.json");
buildDictionary("/tmp/dictionary_source", "/tmp/dictionary", metadata);

// Step 4: Use the trained dictionary
const tokenizer = new TokenizerBuilder()
  .setDictionary("/tmp/dictionary")
  .setMode("normal")
  .build();

const tokens = tokenizer.tokenize("形態素解析のテスト");
for (const token of tokens) {
  console.log(`${token.surface}\t${token.details.join(",")}`);
}
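
Training and export are long-running, file-heavy operations, so in practice the workflow benefits from error handling. A defensive sketch, assuming (as is typical for native Node.js bindings) that these functions throw an Error on failure:

const { train, exportModel } = require("lindera-nodejs");

try {
  // Assumption: train() and exportModel() throw on failure
  // (missing input files, I/O errors) rather than returning error codes.
  train({
    seed: "resources/training/seed.csv",
    corpus: "resources/training/corpus.txt",
    charDef: "resources/training/char.def",
    unkDef: "resources/training/unk.def",
    featureDef: "resources/training/feature.def",
    rewriteDef: "resources/training/rewrite.def",
    output: "/tmp/model.dat",
  });
  exportModel({
    model: "/tmp/model.dat",
    output: "/tmp/dictionary_source",
    metadata: "resources/training/metadata.json",
  });
} catch (err) {
  // Abort on the first failing step; partial outputs under /tmp may need
  // cleanup before retrying.
  console.error(`workflow failed: ${err.message}`);
  process.exit(1);
}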