Training
Lindera Ruby supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.
Prerequisites
Build lindera-ruby with the train feature enabled:
LINDERA_FEATURES="embed-ipadic,train" bundle exec rake compile
Training a Model
Use Lindera::Trainer.train to train a CRF model from a seed lexicon and annotated corpus:
require 'lindera'
Lindera::Trainer.train(
'resources/training/seed.csv',
'resources/training/corpus.txt',
'resources/training/char.def',
'resources/training/unk.def',
'resources/training/feature.def',
'resources/training/rewrite.def',
'/tmp/model.dat',
0.01, # lambda (L1 regularization)
100, # max_iter
nil # max_threads (nil = auto-detect CPU cores)
)
Training Parameters
Parameters are passed as positional arguments in the following order:
| Position | Name | Type | Description |
|---|---|---|---|
| 1 | seed | String | Path to the seed lexicon file (CSV format) |
| 2 | corpus | String | Path to the annotated training corpus |
| 3 | char_def | String | Path to the character definition file (char.def) |
| 4 | unk_def | String | Path to the unknown word definition file (unk.def) |
| 5 | feature_def | String | Path to the feature definition file (feature.def) |
| 6 | rewrite_def | String | Path to the rewrite rule definition file (rewrite.def) |
| 7 | output | String | Output path for the trained model file |
| 8 | lambda | Float | L1 regularization cost (0.0--1.0) |
| 9 | max_iter | Integer | Maximum number of training iterations |
| 10 | max_threads | Integer or nil | Number of threads (nil = auto-detect CPU cores) |
Exporting a Trained Model
After training, export the model to dictionary source files using Lindera::Trainer.export:
require 'lindera'
Lindera::Trainer.export(
'/tmp/model.dat',
'/tmp/dictionary_source',
'resources/training/metadata.json'
)
Export Parameters
| Position | Name | Type | Description |
|---|---|---|---|
| 1 | model | String | Path to the trained model file (.dat) |
| 2 | output | String | Output directory for dictionary source files |
| 3 | metadata | String or nil | Path to a base metadata.json file |
The export creates the following files in the output directory:
lex.csv-- Lexicon entries with trained costsmatrix.def-- Connection cost matrixunk.def-- Unknown word definitionschar.def-- Character category definitionsmetadata.json-- Updated metadata (whenmetadataparameter is provided)
Complete Workflow
The full workflow for training and using a custom dictionary:
require 'lindera'
# Step 1: Train the CRF model
Lindera::Trainer.train(
'resources/training/seed.csv',
'resources/training/corpus.txt',
'resources/training/char.def',
'resources/training/unk.def',
'resources/training/feature.def',
'resources/training/rewrite.def',
'/tmp/model.dat',
0.01, # lambda
100, # max_iter
nil # max_threads
)
# Step 2: Export to dictionary source files
Lindera::Trainer.export(
'/tmp/model.dat',
'/tmp/dictionary_source',
'resources/training/metadata.json'
)
# Step 3: Build the dictionary from exported source files
metadata = Lindera::Metadata.from_json_file('/tmp/dictionary_source/metadata.json')
Lindera.build_dictionary('/tmp/dictionary_source', '/tmp/dictionary', metadata)
# Step 4: Use the trained dictionary
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/tmp/dictionary')
builder.set_mode('normal')
tokenizer = builder.build
tokens = tokenizer.tokenize('形態素解析のテスト')
tokens.each do |token|
puts "#{token.surface}\t#{token.details.join(',')}"
end