Training
Lindera Node.js supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.
Prerequisites
The train feature is enabled by default. To enable it explicitly when building lindera-nodejs:
npm run build -- --features train
Training a Model
Use train() to train a CRF model from a seed lexicon and annotated corpus:
const { train } = require("lindera-nodejs");
train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
  lambda: 0.01,
  maxIter: 100,
  maxThreads: 4,
});
Training Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| seed | string | required | Path to the seed lexicon file (CSV format) |
| corpus | string | required | Path to the annotated training corpus |
| charDef | string | required | Path to the character definition file (char.def) |
| unkDef | string | required | Path to the unknown word definition file (unk.def) |
| featureDef | string | required | Path to the feature definition file (feature.def) |
| rewriteDef | string | required | Path to the rewrite rule definition file (rewrite.def) |
| output | string | required | Output path for the trained model file |
| lambda | number | 0.01 | L1 regularization cost (0.0 to 1.0) |
| maxIter | number | 100 | Maximum number of training iterations |
| maxThreads | number \| undefined | undefined | Number of threads (undefined = auto-detect CPU cores) |
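The table above can be turned into a small pre-flight check that runs before a long training job starts. The sketch below is illustrative only: `validateTrainOptions` is a hypothetical helper, not part of lindera-nodejs, and the rules that `maxIter` and `maxThreads` must be positive integers are assumptions rather than documented constraints. Whether `train()` performs similar checks itself is not stated here.

```javascript
// Hypothetical helper (not part of lindera-nodejs): checks a train() options
// object against the parameter table and returns a list of problems found.
const REQUIRED_PATHS = [
  "seed",
  "corpus",
  "charDef",
  "unkDef",
  "featureDef",
  "rewriteDef",
  "output",
];

function validateTrainOptions(options) {
  const problems = [];

  // Every required parameter must be a non-empty string path.
  for (const key of REQUIRED_PATHS) {
    if (typeof options[key] !== "string" || options[key].length === 0) {
      problems.push(`${key} is required and must be a non-empty string`);
    }
  }

  const { lambda, maxIter, maxThreads } = options;

  // lambda is optional; the table documents a range of 0.0 to 1.0.
  if (
    lambda !== undefined &&
    !(typeof lambda === "number" && lambda >= 0 && lambda <= 1)
  ) {
    problems.push("lambda must be a number between 0.0 and 1.0");
  }

  // Assumption: an iteration limit only makes sense as a positive integer.
  if (maxIter !== undefined && !(Number.isInteger(maxIter) && maxIter > 0)) {
    problems.push("maxIter must be a positive integer");
  }

  // Assumption: undefined means auto-detect, otherwise a positive integer.
  if (
    maxThreads !== undefined &&
    !(Number.isInteger(maxThreads) && maxThreads > 0)
  ) {
    problems.push("maxThreads must be a positive integer or undefined");
  }

  return problems;
}
```

An empty array means the options passed every check, so a caller could log the returned messages and skip the call to `train()` when the array is non-empty.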
Exporting a Trained Model
After training, export the model to dictionary source files using exportModel():
const { exportModel } = require("lindera-nodejs");
exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
  metadata: "resources/training/metadata.json",
});
Export Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Path to the trained model file (.dat) |
| output | string | required | Output directory for dictionary source files |
| metadata | string \| undefined | undefined | Path to a base metadata.json file |
The export creates the following files in the output directory:
- lex.csv: Lexicon entries with trained costs
- matrix.def: Connection cost matrix
- unk.def: Unknown word definitions
- char.def: Character category definitions
- metadata.json: Updated metadata (when the metadata parameter is provided)
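Before building a dictionary from the export, it can help to confirm that the listed files are actually present. The following is a minimal sketch using only Node.js built-ins. `missingExportFiles` is a hypothetical helper, not a lindera-nodejs function, and it only checks that the files exist, not that their contents are valid.

```javascript
const fs = require("fs");
const path = require("path");

// File names taken from the list above. metadata.json is only expected
// when the metadata parameter was passed to exportModel().
const EXPORTED_FILES = ["lex.csv", "matrix.def", "unk.def", "char.def"];

// Hypothetical helper: returns the names of expected files that are
// missing from outputDir (an empty array means everything is present).
function missingExportFiles(outputDir, expectMetadata = false) {
  const expected = expectMetadata
    ? [...EXPORTED_FILES, "metadata.json"]
    : EXPORTED_FILES;
  return expected.filter(
    (name) => !fs.existsSync(path.join(outputDir, name))
  );
}
```

For example, `missingExportFiles("/tmp/dictionary_source", true)` would list whichever of the five files are absent after the export shown earlier.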
Complete Workflow
The full workflow for training and using a custom dictionary:
const {
  train,
  exportModel,
  buildDictionary,
  Metadata,
  TokenizerBuilder,
} = require("lindera-nodejs");

// Step 1: Train the CRF model
train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
  lambda: 0.01,
  maxIter: 100,
});

// Step 2: Export to dictionary source files
exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
  metadata: "resources/training/metadata.json",
});

// Step 3: Build the dictionary from exported source files
const metadata = Metadata.fromJsonFile("/tmp/dictionary_source/metadata.json");
buildDictionary("/tmp/dictionary_source", "/tmp/dictionary", metadata);

// Step 4: Use the trained dictionary
const tokenizer = new TokenizerBuilder()
  .setDictionary("/tmp/dictionary")
  .setMode("normal")
  .build();

const tokens = tokenizer.tokenize("形態素解析のテスト");
for (const token of tokens) {
  console.log(`${token.surface}\t${token.details.join(",")}`);
}
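The workflow above repeats the same intermediate paths across several steps, each output feeding the next call. One way to keep them consistent is to derive them all from a single working directory. This is only a sketch: `workflowPaths` is a hypothetical helper, and the file and directory names simply mirror the ones used in the example rather than anything lindera-nodejs requires.

```javascript
const path = require("path");

// Hypothetical helper: derives the intermediate locations used by the
// four workflow steps from one working directory.
function workflowPaths(workDir) {
  const source = path.join(workDir, "dictionary_source");
  return {
    model: path.join(workDir, "model.dat"), // train() output, exportModel() input
    source, // exportModel() output, buildDictionary() input
    sourceMetadata: path.join(source, "metadata.json"), // read in step 3
    dictionary: path.join(workDir, "dictionary"), // buildDictionary() output
  };
}
```

With `const paths = workflowPaths("/tmp")`, the steps could pass `paths.model`, `paths.source`, `paths.sourceMetadata`, and `paths.dictionary` in place of the string literals, so changing the working directory touches one line.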