Training

Lindera PHP supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-php with the train feature enabled:

cargo build -p lindera-php --features train,embed-ipadic
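
Before calling the trainer, it can help to confirm that the loaded extension actually exposes it. The following is a minimal sketch, assuming that a build without the train feature does not register the Lindera\Trainer class at all; this page does not say how a missing feature surfaces, so treat that as an assumption. The helper name trainerAvailable is hypothetical and uses only built-in PHP functions.

```php
<?php

/**
 * Hypothetical helper (not part of lindera-php): reports whether a class
 * and all of the given methods are available in the current PHP process.
 */
function trainerAvailable(string $class = 'Lindera\\Trainer', array $methods = ['train', 'export']): bool
{
    // class_exists() is false when the extension is not loaded
    // or when the class was not compiled in.
    if (!class_exists($class)) {
        return false;
    }
    foreach ($methods as $method) {
        if (!method_exists($class, $method)) {
            return false;
        }
    }
    return true;
}

if (!trainerAvailable()) {
    echo "Lindera\\Trainer is not available; rebuild lindera-php with the train feature\n";
}
```

Checking up front gives a clearer message than the "class not found" error PHP would otherwise raise at the first Trainer call.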

Training a Model

Use Lindera\Trainer::train() to train a CRF model from a seed lexicon and annotated corpus:

<?php

Lindera\Trainer::train(
    seed: 'resources/training/seed.csv',
    corpus: 'resources/training/corpus.txt',
    char_def: 'resources/training/char.def',
    unk_def: 'resources/training/unk.def',
    feature_def: 'resources/training/feature.def',
    rewrite_def: 'resources/training/rewrite.def',
    output: '/tmp/model.dat',
    lambda: 0.01,
    max_iter: 100,
    max_threads: null,
);

Training Parameters

Parameter      Type       Default    Description
-------------  ---------  ---------  ---------------------------------------------------------
$seed          string     required   Path to the seed lexicon file (CSV format)
$corpus        string     required   Path to the annotated training corpus
$char_def      string     required   Path to the character definition file (char.def)
$unk_def       string     required   Path to the unknown word definition file (unk.def)
$feature_def   string     required   Path to the feature definition file (feature.def)
$rewrite_def   string     required   Path to the rewrite rule definition file (rewrite.def)
$output        string     required   Output path for the trained model file
$lambda        float      0.01       L1 regularization cost (0.0--1.0)
$max_iter      int        100        Maximum number of training iterations
$max_threads   int|null   null       Number of threads (null = auto-detect CPU cores)
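
Because training takes six input files and an output path, a pre-flight check can catch a mistyped path before a long run starts. The sketch below is one possible way to do that with built-in PHP functions only. The helper name checkTrainingArgs is hypothetical, the lambda range is taken from the parameter description above (whether the library itself rejects values outside it is not stated here), and the lower bound on max_iter is an assumption.

```php
<?php

/**
 * Hypothetical pre-flight check (not part of lindera-php).
 * Returns a list of human-readable problems; an empty list means the
 * arguments look usable.
 */
function checkTrainingArgs(array $inputPaths, string $output, float $lambda = 0.01, int $maxIter = 100): array
{
    $problems = [];

    // Every input must be an existing, readable regular file.
    foreach ($inputPaths as $name => $path) {
        if (!is_file($path) || !is_readable($path)) {
            $problems[] = "$name: cannot read '$path'";
        }
    }

    // The model is written to $output, so its parent directory must be writable.
    $dir = dirname($output);
    if (!is_dir($dir) || !is_writable($dir)) {
        $problems[] = "output: directory '$dir' is not writable";
    }

    // Range as documented for $lambda (0.0--1.0).
    if ($lambda < 0.0 || $lambda > 1.0) {
        $problems[] = "lambda: $lambda is outside 0.0--1.0";
    }

    // Assumption: fewer than one iteration is never intended.
    if ($maxIter < 1) {
        $problems[] = "max_iter: $maxIter must be at least 1";
    }

    return $problems;
}

$problems = checkTrainingArgs(
    [
        'seed' => 'resources/training/seed.csv',
        'corpus' => 'resources/training/corpus.txt',
        'char_def' => 'resources/training/char.def',
        'unk_def' => 'resources/training/unk.def',
        'feature_def' => 'resources/training/feature.def',
        'rewrite_def' => 'resources/training/rewrite.def',
    ],
    '/tmp/model.dat'
);
foreach ($problems as $problem) {
    echo $problem, "\n";
}
```

If the returned list is empty, the same paths can be passed on to Lindera\Trainer::train().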

Exporting a Trained Model

After training, export the model to dictionary source files using Lindera\Trainer::export():

<?php

Lindera\Trainer::export(
    model: '/tmp/model.dat',
    output: '/tmp/dictionary_source',
    metadata: 'resources/training/metadata.json',
);

Export Parameters

Parameter   Type          Default    Description
----------  ------------  ---------  ---------------------------------------------
$model      string        required   Path to the trained model file (.dat)
$output     string        required   Output directory for dictionary source files
$metadata   string|null   null       Path to a base metadata.json file

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (only when the $metadata parameter is provided)
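
The file list above can be turned into a quick sanity check after export. This is a sketch only: the helper name missingExportFiles is hypothetical, and it tests for the presence of each file, not its contents.

```php
<?php

/**
 * Hypothetical helper (not part of lindera-php): returns the expected
 * export files that are missing from $dir. metadata.json is only expected
 * when a base metadata file was passed to export().
 */
function missingExportFiles(string $dir, bool $withMetadata = true): array
{
    $expected = ['lex.csv', 'matrix.def', 'unk.def', 'char.def'];
    if ($withMetadata) {
        $expected[] = 'metadata.json';
    }

    $missing = [];
    foreach ($expected as $file) {
        if (!is_file($dir . DIRECTORY_SEPARATOR . $file)) {
            $missing[] = $file;
        }
    }
    return $missing;
}

$missing = missingExportFiles('/tmp/dictionary_source');
if ($missing !== []) {
    echo 'Missing export files: ', implode(', ', $missing), "\n";
}
```

Pass false as the second argument when export() was called without a $metadata path, since metadata.json is not written in that case.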

Complete Workflow

The full workflow for training and using a custom dictionary:

<?php

// Step 1: Train the CRF model
Lindera\Trainer::train(
    seed: 'resources/training/seed.csv',
    corpus: 'resources/training/corpus.txt',
    char_def: 'resources/training/char.def',
    unk_def: 'resources/training/unk.def',
    feature_def: 'resources/training/feature.def',
    rewrite_def: 'resources/training/rewrite.def',
    output: '/tmp/model.dat',
    lambda: 0.01,
    max_iter: 100,
);

// Step 2: Export to dictionary source files
Lindera\Trainer::export(
    model: '/tmp/model.dat',
    output: '/tmp/dictionary_source',
    metadata: 'resources/training/metadata.json',
);

// Step 3: Build the dictionary from exported source files.
// The exported metadata.json exists because a base metadata file was passed in Step 2.
$metadata = Lindera\Metadata::fromJsonFile('/tmp/dictionary_source/metadata.json');
Lindera\Dictionary::build('/tmp/dictionary_source', '/tmp/dictionary', $metadata);

// Step 4: Use the trained dictionary
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('/tmp/dictionary')
    ->setMode('normal')
    ->build();

$tokens = $tokenizer->tokenize('形態素解析のテスト');
foreach ($tokens as $token) {
    echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}