Training

Lindera Ruby supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-ruby with the train feature enabled:

LINDERA_FEATURES="embed-ipadic,train" bundle exec rake compile
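Whether the loaded extension actually includes training support can be probed at runtime. The sketch below is an assumption-laden check, not part of the lindera-ruby API: it assumes the Lindera::Trainer constant is only defined when the extension was compiled with the train feature, and treats a LoadError (gem not installed) as "not available".

```ruby
# Sketch: detect training support at runtime. Assumes Lindera::Trainer
# is only defined when the extension was compiled with the train
# feature; returns false if the gem itself cannot be loaded.
def trainer_available?
  require 'lindera'
  !defined?(Lindera::Trainer).nil?
rescue LoadError
  false
end
```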

Training a Model

Use Lindera::Trainer.train to train a CRF model from a seed lexicon and annotated corpus:

require 'lindera'

Lindera::Trainer.train(
  'resources/training/seed.csv',
  'resources/training/corpus.txt',
  'resources/training/char.def',
  'resources/training/unk.def',
  'resources/training/feature.def',
  'resources/training/rewrite.def',
  '/tmp/model.dat',
  0.01,  # lambda (L1 regularization)
  100,   # max_iter
  nil    # max_threads (nil = auto-detect CPU cores)
)

Training Parameters

Parameters are passed as positional arguments in the following order:

Position  Name         Type            Description
1         seed         String          Path to the seed lexicon file (CSV format)
2         corpus       String          Path to the annotated training corpus
3         char_def     String          Path to the character definition file (char.def)
4         unk_def      String          Path to the unknown word definition file (unk.def)
5         feature_def  String          Path to the feature definition file (feature.def)
6         rewrite_def  String          Path to the rewrite rule definition file (rewrite.def)
7         output       String          Output path for the trained model file
8         lambda       Float           L1 regularization cost (0.0–1.0)
9         max_iter     Integer         Maximum number of training iterations
10        max_threads  Integer or nil  Number of threads (nil = auto-detect CPU cores)

Exporting a Trained Model

After training, export the model to dictionary source files using Lindera::Trainer.export:

require 'lindera'

Lindera::Trainer.export(
  '/tmp/model.dat',
  '/tmp/dictionary_source',
  'resources/training/metadata.json'
)

Export Parameters

Position  Name      Type           Description
1         model     String         Path to the trained model file (.dat)
2         output    String         Output directory for dictionary source files
3         metadata  String or nil  Path to a base metadata.json file

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (when metadata parameter is provided)

Complete Workflow

The full workflow for training and using a custom dictionary:

require 'lindera'

# Step 1: Train the CRF model
Lindera::Trainer.train(
  'resources/training/seed.csv',
  'resources/training/corpus.txt',
  'resources/training/char.def',
  'resources/training/unk.def',
  'resources/training/feature.def',
  'resources/training/rewrite.def',
  '/tmp/model.dat',
  0.01,  # lambda
  100,   # max_iter
  nil    # max_threads
)

# Step 2: Export to dictionary source files
Lindera::Trainer.export(
  '/tmp/model.dat',
  '/tmp/dictionary_source',
  'resources/training/metadata.json'
)

# Step 3: Build the dictionary from exported source files
metadata = Lindera::Metadata.from_json_file('/tmp/dictionary_source/metadata.json')
Lindera.build_dictionary('/tmp/dictionary_source', '/tmp/dictionary', metadata)

# Step 4: Use the trained dictionary
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/tmp/dictionary')
builder.set_mode('normal')
tokenizer = builder.build

tokens = tokenizer.tokenize('形態素解析のテスト')
tokens.each do |token|
  puts "#{token.surface}\t#{token.details.join(',')}"
end