Training Pipeline
Lindera provides CRF-based dictionary training for creating custom morphological analysis models. This functionality is gated behind the train feature flag.
Overview
The training pipeline follows three stages:
lindera train --> model.dat --> lindera export --> dictionary files --> lindera build --> compiled dictionary
- Train: Learn CRF weights from an annotated corpus and seed dictionary, producing a binary model file.
- Export: Convert the trained model into Lindera dictionary source files.
- Build: Compile the source files into a binary dictionary that Lindera can load at runtime.
Required Input Files
1. Seed Lexicon (seed.csv)
Base vocabulary dictionary in MeCab CSV format.
外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
参政,0,0,0,名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
Each line contains: surface,left_id,right_id,cost,pos,pos_detail1,pos_detail2,pos_detail3,inflection_type,inflection_form,base_form,reading,pronunciation
The left_id, right_id, and cost fields are set to 0 in the seed dictionary; the trainer computes appropriate values from the CRF model.
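To make the layout concrete, here is a minimal sketch (standard library only; parse_seed_line is a hypothetical helper, and it assumes no surface form contains an embedded comma) that splits a seed line into its 13 fields:
fn parse_seed_line(line: &str) -> Option<(&str, Vec<&str>)> {
    // 1 surface + 3 numeric fields (left_id, right_id, cost) + 9 feature fields = 13
    let fields: Vec<&str> = line.split(',').collect();
    if fields.len() != 13 {
        return None;
    }
    // fields[1..4] are left_id, right_id, and cost: all 0 in a seed dictionary
    Some((fields[0], fields[4..].to_vec()))
}

fn main() {
    let line = "外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク";
    let (surface, features) = parse_seed_line(line).expect("expected 13 fields");
    assert_eq!(surface, "外国");
    assert_eq!(features[0], "名詞"); // pos
    assert_eq!(features[6], "外国"); // base_form
}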
2. Training Corpus (corpus.txt)
Annotated text data in tab-separated format. Each line is surface<TAB>pos_info, and sentences are separated by EOS.
外国 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人 名詞,接尾,一般,*,*,*,人,ジン,ジン
参政 名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権 名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS
これ 連体詞,*,*,*,*,*,これ,コレ,コレ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト 名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
EOS
Training quality depends heavily on the quantity and quality of this corpus.
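To illustrate the format only (Lindera's own parser is Corpus::from_reader, shown under API Usage below), here is a sketch that groups surface/feature pairs into sentences at EOS markers:
fn split_sentences(corpus: &str) -> Vec<Vec<(&str, &str)>> {
    let mut sentences = Vec::new();
    let mut current = Vec::new();
    for line in corpus.lines() {
        if line == "EOS" {
            // EOS closes the current sentence
            sentences.push(std::mem::take(&mut current));
        } else if let Some((surface, pos_info)) = line.split_once('\t') {
            current.push((surface, pos_info));
        }
    }
    sentences
}

fn main() {
    let corpus = "外国\t名詞,一般,*,*,*,*,外国,ガイコク,ガイコク\nEOS\n";
    let sentences = split_sentences(corpus);
    assert_eq!(sentences.len(), 1);
    assert_eq!(sentences[0][0].0, "外国");
}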
3. Character Definition (char.def)
Defines character type categories and Unicode code point ranges.
# Category definition: category_name invoke group length
DEFAULT 0 1 0
HIRAGANA 1 1 0
KATAKANA 1 1 0
KANJI 0 0 2
ALPHA 1 1 0
NUMERIC 1 1 0
# Character range mapping
0x3041..0x3096 HIRAGANA # Hiragana
0x30A1..0x30F6 KATAKANA # Katakana
0x4E00..0x9FAF KANJI # Kanji
0x0030..0x0039 NUMERIC # Numbers
0x0041..0x005A ALPHA # Uppercase letters
0x0061..0x007A ALPHA # Lowercase letters
Each category line has three numeric parameters that control how unknown words of that character type are segmented: an invoke flag (1 = run unknown-word processing even when a dictionary entry matches the input), a group flag (1 = treat a run of same-category characters as a single candidate token), and a length value (generate unknown-word candidates of 1 to n characters; 0 disables fixed-length candidates). For example, KANJI 0 0 2 hypothesizes kanji unknown words only where no dictionary entry matches, does not group kanji runs, and proposes candidates of one or two characters.
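The range table can be read as a straightforward code-point lookup. Here is a sketch with the ranges above hard-coded (Lindera itself compiles char.def into a character-type table rather than matching ranges like this):
fn category(c: char) -> &'static str {
    match c as u32 {
        0x3041..=0x3096 => "HIRAGANA",
        0x30A1..=0x30F6 => "KATAKANA",
        0x4E00..=0x9FAF => "KANJI",
        0x0030..=0x0039 => "NUMERIC",
        0x0041..=0x005A | 0x0061..=0x007A => "ALPHA",
        _ => "DEFAULT",
    }
}

fn main() {
    assert_eq!(category('あ'), "HIRAGANA");
    assert_eq!(category('漢'), "KANJI");
    assert_eq!(category('7'), "NUMERIC");
}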
4. Unknown Word Definition (unk.def)
Defines how out-of-vocabulary words are handled by character type.
DEFAULT,0,0,0,名詞,一般,*,*,*,*,*,*,*
HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KATAKANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KANJI,0,0,0,名詞,一般,*,*,*,*,*,*,*
ALPHA,0,0,0,名詞,固有名詞,一般,*,*,*,*,*,*
NUMERIC,0,0,0,名詞,数,*,*,*,*,*,*,*
5. Feature Template (feature.def)
MeCab-compatible feature extraction patterns that define what information the CRF model uses for learning.
# Unigram features (word-level)
UNIGRAM U00:%F[0] # POS
UNIGRAM U01:%F[0],%F?[1] # POS + POS detail (%F?[n] = optional, skipped if *)
UNIGRAM U02:%F[6] # Base form
UNIGRAM U03:%w # Surface form
# Bigram features (context combination)
BIGRAM B00:%L[0]/%R[0] # Left POS / Right POS
BIGRAM B01:%L[0],%L[1]/%R[0],%R[1] # Left POS detail / Right POS detail
Template variables:
| Variable | Description |
|---|---|
| %F[n] / %F?[n] | Feature field at index n (? = optional; skipped if the value is *) |
| %L[n] | Left context feature field (from the rewrite.def left-context section) |
| %R[n] | Right context feature field (from the rewrite.def right-context section) |
| %w | Surface form of the word |
| %u | Unigram rewritten feature string |
| %l | Left rewritten feature string |
| %r | Right rewritten feature string |
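As a hand-worked example, applying the unigram templates above to the seed entry 外国, whose feature fields are 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク, gives:
U00:名詞
U01:名詞,一般
U02:外国
U03:外国
(Had field 1 been *, the optional %F?[1] reference would be skipped rather than emitting a literal *.)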
6. Feature Rewrite Rules (rewrite.def)
Feature normalization rules in MeCab-compatible 3-section format. Sections are separated by blank lines.
# Section 1: Unigram rewrite rules
名詞,固有名詞,* 名詞,固有名詞
助動詞,*,*,*,特殊・デス 助動詞
* *
# Section 2: Left context rewrite rules
名詞,固有名詞,* 名詞,固有名詞
助詞,* 助詞
* *
# Section 3: Right context rewrite rules
名詞,固有名詞,* 名詞,固有名詞
助詞,* 助詞
* *
Each line is pattern<TAB>replacement. Patterns use * as a wildcard and are matched by prefix. The first matching rule in each section is applied. Different rules can be applied to unigram, left context, and right context independently, enabling fine-grained feature normalization to reduce sparsity.
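A minimal sketch of this first-match, prefix-style lookup (rewrite is a hypothetical helper; treating a bare * replacement as pass-through is an assumption about the catch-all lines above):
fn rewrite<'a>(feature: &'a str, rules: &[(&'a str, &'a str)]) -> &'a str {
    for &(pattern, replacement) in rules {
        let p: Vec<&str> = pattern.split(',').collect();
        let f: Vec<&str> = feature.split(',').collect();
        // Prefix match: each pattern field must equal the corresponding
        // feature field, with * matching any single field.
        let matched = p.len() <= f.len()
            && p.iter().zip(f.iter()).all(|(pp, ff)| *pp == "*" || pp == ff);
        if matched {
            // Assumption: the catch-all "* *" rule passes the feature through.
            return if replacement == "*" { feature } else { replacement };
        }
    }
    feature
}

fn main() {
    let rules = [("名詞,固有名詞,*", "名詞,固有名詞"), ("*", "*")];
    assert_eq!(rewrite("名詞,固有名詞,人名,姓", &rules), "名詞,固有名詞");
    assert_eq!(rewrite("動詞,自立", &rules), "動詞,自立"); // falls through to * *
}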
Training Parameters
| Parameter | Description | Default |
|---|---|---|
| lambda | L1 regularization coefficient (controls overfitting) | 0.01 |
| max-iterations | Maximum number of training iterations | 100 |
| max-threads | Number of parallel processing threads | 1 |
CLI Usage
Train
lindera train \
--seed seed.csv \
--corpus corpus.txt \
--char-def char.def \
--unk-def unk.def \
--feature-def feature.def \
--rewrite-def rewrite.def \
--lambda 0.01 \
--max-iter 100 \
--max-threads 4 \
--output model.dat
Export
Convert the trained model into dictionary source files:
lindera export --model model.dat --output-dir ./dict-source
This produces the following files:
| File | Description |
|---|---|
| lex.csv | Lexicon with trained costs |
| matrix.def | Connection cost matrix |
| unk.def | Unknown word definition |
| char.def | Character definition |
| feature.def | Feature template |
| rewrite.def | Feature rewrite rules |
| left-id.def | Left context ID mapping |
| right-id.def | Right context ID mapping |
| metadata.json | Dictionary metadata |
Build
Compile the exported source files into a binary dictionary:
lindera build --input-dir ./dict-source --output-dir ./dict-compiled
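The compiled dictionary can then be loaded at runtime. A minimal sketch, assuming a recent lindera crate that exposes load_dictionary_from_path, Segmenter, and Tokenizer (loader names have varied across Lindera versions, so check the API of the version you use):
use std::error::Error;
use std::path::Path;

use lindera::dictionary::load_dictionary_from_path;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;

fn main() -> Result<(), Box<dyn Error>> {
    // Load the dictionary compiled by lindera build
    let dictionary = load_dictionary_from_path(Path::new("./dict-compiled"))?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None); // no user dictionary
    let tokenizer = Tokenizer::new(segmenter);

    let mut tokens = tokenizer.tokenize("外国人参政権")?;
    for token in tokens.iter_mut() {
        println!("{}\t{}", token.text, token.details().join(","));
    }
    Ok(())
}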
Output Model Format
The trained model is serialized in rkyv binary format for fast loading. It contains:
- Feature weights learned by the CRF
- Label set (vocabulary entries)
- Part-of-speech information
- Feature templates
- Training metadata (regularization, iterations, feature/label counts)
API Usage
use std::fs::File;

use lindera_dictionary::trainer::{Corpus, Trainer, TrainerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load configuration from files
    let seed_file = File::open("resources/training/seed.csv")?;
    let char_file = File::open("resources/training/char.def")?;
    let unk_file = File::open("resources/training/unk.def")?;
    let feature_file = File::open("resources/training/feature.def")?;
    let rewrite_file = File::open("resources/training/rewrite.def")?;

    let config = TrainerConfig::from_readers(
        seed_file,
        char_file,
        unk_file,
        feature_file,
        rewrite_file,
    )?;

    // Initialize and configure the trainer
    let trainer = Trainer::new(config)?
        .regularization_cost(0.01)
        .max_iter(100)
        .num_threads(4);

    // Load the corpus
    let corpus_file = File::open("resources/training/corpus.txt")?;
    let corpus = Corpus::from_reader(corpus_file)?;

    // Execute training
    let model = trainer.train(corpus)?;

    // Save the model (binary format)
    let mut output = File::create("trained_model.dat")?;
    model.write_model(&mut output)?;

    // Output in Lindera dictionary format
    let mut lex_out = File::create("output_lex.csv")?;
    let mut conn_out = File::create("output_conn.dat")?;
    let mut unk_out = File::create("output_unk.def")?;
    let mut user_out = File::create("output_user.csv")?;
    model.write_dictionary(&mut lex_out, &mut conn_out, &mut unk_out, &mut user_out)?;

    Ok(())
}
Recommended Corpus Specifications
For generating effective dictionaries for real applications:
Corpus Size
| Level | Sentences | Use Case |
|---|---|---|
| Minimum | 100+ | Basic operation verification |
| Recommended | 1,000+ | Practical applications |
| Ideal | 10,000+ | Commercial quality |
Quality Guidelines
- Vocabulary diversity: Balanced distribution of different parts of speech, coverage of inflections and suffixes, appropriate inclusion of technical terms and proper nouns.
- Consistency: Apply analysis criteria consistently across the corpus.
- Verification: Manually verify the morphological analysis annotations and keep the annotation error rate below 5%.