Lindera
A morphological analysis library in Rust. This project is forked from kuromoji-rs.
Lindera aims to be easy to install and to provide concise APIs for various Rust applications.
Installation
Put the following in Cargo.toml:
[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }
Environment Variables
LINDERA_CACHE
The LINDERA_CACHE environment variable specifies a directory for caching dictionary source files. This enables:
- Offline builds: Once downloaded, dictionary source files are preserved for future builds
- Faster builds: Subsequent builds skip downloading if valid cached files exist
- Reproducible builds: Ensures consistent dictionary versions across builds
Usage:
export LINDERA_CACHE=/path/to/cache
cargo build --features=ipadic
When set, dictionary source files are stored in $LINDERA_CACHE/<version>/, where <version> is the lindera-dictionary crate version. The cache validates files using MD5 checksums; invalid files are automatically re-downloaded.
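For illustration, assuming the cache directory exported above and, hypothetically, lindera-dictionary version 1.2.0, a downloaded IPADIC source archive would be cached at a path such as:
/path/to/cache/1.2.0/mecab-ipadic-2.7.0-20250920.tar.gz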
LINDERA_CONFIG_PATH
The LINDERA_CONFIG_PATH environment variable specifies the path to a YAML configuration file for the tokenizer. This allows you to configure tokenizer behavior without modifying Rust code.
export LINDERA_CONFIG_PATH=./resources/config/lindera.yml
See the Configuration section for details on the configuration format.
DOCS_RS
The DOCS_RS environment variable is automatically set by docs.rs when building documentation. When this variable is detected, Lindera creates dummy dictionary files instead of downloading actual dictionary data, allowing documentation to be built without network access or large file downloads.
This is primarily used internally by docs.rs and typically doesn't need to be set by users.
LINDERA_WORKDIR
The LINDERA_WORKDIR environment variable is automatically set during the build process by the lindera-dictionary crate. It points to the directory containing the built dictionary data files and is used internally by dictionary crates to locate their data files.
This variable is set automatically and should not be modified by users.
Quick Start
This example covers the basic usage of Lindera.
It will:
- Create a tokenizer in normal mode
- Tokenize the input text
- Output the tokens
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;

    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
The above example can be run as follows:
% cargo run --features=embedded-ipadic --example=tokenize
You can see the result as follows:
text: 関西国際空港限定トートバッグ
token: 関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token: 限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token: トートバッグ UNK
Dictionaries
Lindera supports various dictionaries. This section describes the format of each dictionary and the format for user dictionaries.
- IPADIC - The most common dictionary for Japanese.
- IPADIC NEologd - IPADIC with neologisms (new words).
- UniDic - A dictionary with uniform word unit definitions.
- ko-dic - A dictionary for Korean.
- CC-CEDICT - A dictionary for Chinese.
Lindera IPADIC
Dictionary version
This repository contains mecab-ipadic.
Dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation |
User dictionary format (CSV)
Simple version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Part-of-speech | |
| 2 | 読み | Reading |
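For example, a single entry in the simple format looks like this (the same entry appears in the user dictionary example later in this document):
東京スカイツリー,カスタム名詞,トウキョウスカイツリー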
Detailed version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13 | - | - | Fields from index 13 onward can be added freely. |
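A detailed-format entry uses the same columns as an IPADIC lexicon row. The following single line is a hypothetical example; the context IDs and cost are made-up values:
東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー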
API reference
The API reference is available. Please see the following URL:
Lindera IPADIC NEologd
Dictionary version
This repository contains mecab-ipadic-neologd.
Dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation |
User dictionary format (CSV)
Simple version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Part-of-speech | |
| 2 | 読み | Reading |
Detailed version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13 | - | - | Fields from index 13 onward can be added freely. |
API reference
The API reference is available. Please see the following URL:
Lindera UniDic
Dictionary version
This repository contains unidic-mecab.
Dictionary format
Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Part-of-speech | |
| 5 | 品詞中分類 | Part-of-speech subcategory 1 | |
| 6 | 品詞小分類 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類 | Part-of-speech subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthographic surface form | |
| 13 | 発音形出現形 | Phonological surface form | |
| 14 | 書字形基本形 | Orthographic base form | |
| 15 | 発音形基本形 | Phonological base form | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Initial mutation type | |
| 18 | 語頭変化形 | Initial mutation form | |
| 19 | 語末変化型 | Final mutation type | |
| 20 | 語末変化形 | Final mutation form |
User dictionary format (CSV)
Simple version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞大分類 | Part-of-speech | |
| 2 | 語彙素読み | Reading |
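For example, a hypothetical simple-format entry (surface, part-of-speech, lexeme reading) might look like:
東京スカイツリー,カスタム名詞,トウキョウスカイツリー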
Detailed version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Part-of-speech | |
| 5 | 品詞中分類 | Part-of-speech subcategory 1 | |
| 6 | 品詞小分類 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類 | Part-of-speech subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthographic surface form | |
| 13 | 発音形出現形 | Phonological surface form | |
| 14 | 書字形基本形 | Orthographic base form | |
| 15 | 発音形基本形 | Phonological base form | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Initial mutation type | |
| 18 | 語頭変化形 | Initial mutation form | |
| 19 | 語末変化型 | Final mutation type | |
| 20 | 語末変化形 | Final mutation form | |
| 21 | - | - | Fields from index 21 onward can be added freely. |
API reference
The API reference is available. Please see the following URL:
Lindera ko-dic
Dictionary version
This repository contains mecab-ko-dic.
Dictionary format
Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in a Google Spreadsheet, linked from mecab-ko-dic's repository README.
Note that ko-dic has one fewer feature column than NAIST JDIC and carries an altogether different set of information (e.g. it does not provide the "original form" of the word).
The tags are a slight modification of those specified by 세종 (Sejong). The mappings from Sejong tags to mecab-ko-dic's tag names are given in the 태그 v2.0 tab of the above-linked spreadsheet.
The dictionary format is specified fully (in Korean) in the 사전 형식 v2.0 tab of the spreadsheet. Any blank values default to *.
| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Semantic class | |
| 6 | 종성 유무 | Presence or absence of a final consonant (jongseong) | T for true; F for false; else * |
| 7 | 읽기 | Reading | usually matches surface, but may differ for foreign words e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP |
| 11 | 표현 | Expression | Describes how inflected forms (활용), compound nouns (복합명사), and pre-analyzed entries (기분석) are composed |
User dictionary format (CSV)
Simple version
| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
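For example, a hypothetical simple-format entry for a proper noun (the tag comes from the 태그 v2.0 tab) might look like:
하네다공항,NNP,하네다공항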
Detailed version
| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Semantic class | |
| 6 | 종성 유무 | Presence or absence of a final consonant (jongseong) | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP |
| 11 | 표현 | Expression | Describes how inflected forms (활용), compound nouns (복합명사), and pre-analyzed entries (기분석) are composed |
| 12 | - | - | Fields from index 12 onward can be added freely. |
API reference
The API reference is available. Please see the following URL:
Lindera CC-CEDICT
Dictionary version
This repository contains CC-CEDICT-MeCab.
Dictionary format
Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 词类1 | Part-of-speech subcategory 1 | |
| 6 | 词类2 | Part-of-speech subcategory 2 | |
| 7 | 词类3 | Part-of-speech subcategory 3 | |
| 8 | 併音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 簡体字 | Simplified | |
| 11 | 定义 | Definition |
User dictionary format (CSV)
Simple version
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 词类 | Part-of-speech | |
| 2 | 併音 | Pinyin |
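For example, a hypothetical simple-format entry might look like this (the pinyin field uses the tone-number style shown in the tokenization output later in this document):
羽田机场,*,yu3 tian2 ji1 chang3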
Detailed version
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 词类1 | Part-of-speech subcategory 1 | |
| 6 | 词类2 | Part-of-speech subcategory 2 | |
| 7 | 词类3 | Part-of-speech subcategory 3 | |
| 8 | 併音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 簡体字 | Simplified | |
| 11 | 定义 | Definition | |
| 12 | - | - | Fields from index 12 onward can be added freely. |
API reference
The API reference is available. Please see the following URL:
Configuration
Lindera can read configuration files in YAML format.
Set the LINDERA_CONFIG_PATH environment variable to the path of a file like the one below. This lets you configure the tokenizer's behavior without writing any Rust code.
segmenter:
mode: "normal"
dictionary:
kind: "ipadic"
user_dictionary:
path: "./resources/user_dict/ipadic_simple.csv"
kind: "ipadic"
character_filters:
- kind: "unicode_normalize"
args:
kind: "nfkc"
- kind: "japanese_iteration_mark"
args:
normalize_kanji: true
normalize_kana: true
- kind: mapping
args:
mapping:
リンデラ: Lindera
token_filters:
- kind: "japanese_compound_word"
args:
tags:
- "名詞,数"
- "名詞,接尾,助数詞"
new_tag: "名詞,数"
- kind: "japanese_number"
args:
tags:
- "名詞,数"
- kind: "japanese_stop_tags"
args:
tags:
- "接続詞"
- "助詞"
- "助詞,格助詞"
- "助詞,格助詞,一般"
- "助詞,格助詞,引用"
- "助詞,格助詞,連語"
- "助詞,係助詞"
- "助詞,副助詞"
- "助詞,間投助詞"
- "助詞,並立助詞"
- "助詞,終助詞"
- "助詞,副助詞/並立助詞/終助詞"
- "助詞,連体化"
- "助詞,副詞化"
- "助詞,特殊"
- "助動詞"
- "記号"
- "記号,一般"
- "記号,読点"
- "記号,句点"
- "記号,空白"
- "記号,括弧閉"
- "その他,間投"
- "フィラー"
- "非言語音"
- kind: "japanese_katakana_stem"
args:
min: 3
- kind: "remove_diacritical_mark"
args:
japanese: false
% export LINDERA_CONFIG_PATH=./resources/config/lindera.yml
use std::path::PathBuf;

use lindera::tokenizer::TokenizerBuilder;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Load tokenizer configuration from file
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("config")
        .join("lindera.yml");
    let builder = TokenizerBuilder::from_file(&path)?;
    let tokenizer = builder.build()?;

    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {text}");

    let tokens = tokenizer.tokenize(&text)?;
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}
Advanced Usage
Tokenization with user dictionary
You can provide user dictionary entries alongside the default system dictionary. The user dictionary should be a CSV file with the following format.
<surface>,<part_of_speech>,<reading>
Put the following in Cargo.toml:
[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }
For example:
% cat ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
With a user dictionary, the tokenizer is created as follows:
use std::fs::File;
use std::path::PathBuf;

use lindera::dictionary::{Metadata, load_dictionary, load_user_dictionary};
use lindera::error::LinderaErrorKind;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let user_dict_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("user_dict")
        .join("ipadic_simple_userdic.csv");

    let metadata_file = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../lindera-ipadic")
        .join("metadata.json");
    let metadata: Metadata = serde_json::from_reader(
        File::open(metadata_file)
            .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
            .unwrap(),
    )
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
    .unwrap();

    let dictionary = load_dictionary("embedded://ipadic")?;
    let user_dictionary = load_user_dictionary(user_dict_path.to_str().unwrap(), &metadata)?;

    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        Some(user_dictionary), // Using the loaded user dictionary
    );

    // Create a tokenizer.
    let tokenizer = Tokenizer::new(segmenter);

    // Tokenize a text.
    let text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";
    let mut tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
The above example can be run with cargo run --example as follows:
% cargo run --features=embedded-ipadic --example=tokenize_with_user_dict
text: 東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です
token: 東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
token: の 助詞,連体化,*,*,*,*,の,ノ,ノ
token: 最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
token: は 助詞,係助詞,*,*,*,*,は,ハ,ワ
token: とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
token: です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
Tokenize with filters
Put the following in Cargo.toml:
[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }
This example covers the basic usage of the Lindera analysis framework.
It will:
- Apply a character filter for Unicode normalization (NFKC)
- Tokenize the input text with IPADIC
- Apply token filters such as stop-tag (part-of-speech) removal and Japanese katakana stemming
use lindera::character_filter::BoxCharacterFilter;
use lindera::character_filter::japanese_iteration_mark::JapaneseIterationMarkCharacterFilter;
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::token_filter::BoxTokenFilter;
use lindera::token_filter::japanese_compound_word::JapaneseCompoundWordTokenFilter;
use lindera::token_filter::japanese_number::JapaneseNumberTokenFilter;
use lindera::token_filter::japanese_stop_tags::JapaneseStopTagsTokenFilter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        None, // No user dictionary for this example
    );

    let unicode_normalize_char_filter =
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC);

    let japanese_iteration_mark_char_filter =
        JapaneseIterationMarkCharacterFilter::new(true, true);

    let japanese_compound_word_token_filter = JapaneseCompoundWordTokenFilter::new(
        vec!["名詞,数".to_string(), "名詞,接尾,助数詞".to_string()]
            .into_iter()
            .collect(),
        Some("複合語".to_string()),
    );

    let japanese_number_token_filter =
        JapaneseNumberTokenFilter::new(Some(vec!["名詞,数".to_string()].into_iter().collect()));

    let japanese_stop_tags_token_filter = JapaneseStopTagsTokenFilter::new(
        vec![
            "接続詞".to_string(),
            "助詞".to_string(),
            "助詞,格助詞".to_string(),
            "助詞,格助詞,一般".to_string(),
            "助詞,格助詞,引用".to_string(),
            "助詞,格助詞,連語".to_string(),
            "助詞,係助詞".to_string(),
            "助詞,副助詞".to_string(),
            "助詞,間投助詞".to_string(),
            "助詞,並立助詞".to_string(),
            "助詞,終助詞".to_string(),
            "助詞,副助詞/並立助詞/終助詞".to_string(),
            "助詞,連体化".to_string(),
            "助詞,副詞化".to_string(),
            "助詞,特殊".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
            "記号,一般".to_string(),
            "記号,読点".to_string(),
            "記号,句点".to_string(),
            "記号,空白".to_string(),
            "記号,括弧閉".to_string(),
            "その他,間投".to_string(),
            "フィラー".to_string(),
            "非言語音".to_string(),
        ]
        .into_iter()
        .collect(),
    );

    // Create a tokenizer.
    let mut tokenizer = Tokenizer::new(segmenter);
    tokenizer
        .append_character_filter(BoxCharacterFilter::from(unicode_normalize_char_filter))
        .append_character_filter(BoxCharacterFilter::from(
            japanese_iteration_mark_char_filter,
        ))
        .append_token_filter(BoxTokenFilter::from(japanese_compound_word_token_filter))
        .append_token_filter(BoxTokenFilter::from(japanese_number_token_filter))
        .append_token_filter(BoxTokenFilter::from(japanese_stop_tags_token_filter));

    // Tokenize a text.
    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。";
    let tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text: {}", text);
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}
The above example can be run as follows:
% cargo run --features=embedded-ipadic --example=tokenize_with_filters
You can see the result as follows:
text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: "Lindera", start: 0, end: 21, details: Some(["UNK"])
token: "形態素", start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: "解析", start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: "エンジン", start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: "ユーザー", start: 63, end: 75, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: "辞書", start: 75, end: 81, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: "利用", start: 84, end: 90, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: "可能", start: 90, end: 96, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])
Dictionary Training (Experimental)
Lindera provides CRF-based dictionary training functionality for creating custom morphological analysis models.
Overview
Lindera Trainer is a Conditional Random Field (CRF) based morphological analyzer training system with the following advanced features:
- CRF-based statistical learning: Efficient implementation using the rucrf crate
- L1 regularization: Prevents overfitting
- Multi-threaded training: Parallel processing for faster training
- Comprehensive Unicode support: Full CJK extension support
- Advanced unknown word handling: Intelligent mixed character type classification
- Multi-stage weight optimization: Advanced normalization system for trained weights
- Lindera dictionary compatibility: Full compatibility with existing dictionary formats
CLI Usage
For detailed CLI command usage, see lindera-cli/README.md.
Required File Format Specifications
1. Vocabulary Dictionary (seed.csv)
Role: Base vocabulary dictionary
Format: MeCab-format CSV
外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
参政,0,0,0,名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
- Purpose: Define basic words and their part-of-speech information for training
- Structure:
surface,left_id,right_id,cost,pos,pos_detail1,pos_detail2,pos_detail3,inflection_type,inflection_form,base_form,reading,pronunciation
2. Unknown Word Definition (unk.def)
Role: Unknown word processing definition
Format: Unknown word parameters by character type
DEFAULT,0,0,0,名詞,一般,*,*,*,*,*,*,*
HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KATAKANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KANJI,0,0,0,名詞,一般,*,*,*,*,*,*,*
ALPHA,0,0,0,名詞,固有名詞,一般,*,*,*,*,*,*
NUMERIC,0,0,0,名詞,数,*,*,*,*,*,*,*
- Purpose: Define processing methods for out-of-vocabulary words by character type
- Note: These labels are for internal processing and are not output in the final dictionary file
3. Training Corpus (corpus.txt)
Role: Training data (annotated corpus)
Format: Tab-separated tokenized text
外国 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人 名詞,接尾,一般,*,*,*,人,ジン,ジン
参政 名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権 名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS
これ 連体詞,*,*,*,*,*,これ,コレ,コレ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト 名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
EOS
- Purpose: Sentences and their correct analysis results for training
- Format: Each line is surface\tpos_info; sentences end with EOS
- Important: Training quality heavily depends on the quantity and quality of this corpus
4. Character Type Definition (char.def)
Role: Character type definition
Format: Character categories and character code ranges
# Character category definition (category_name compatibility_flag continuity_flag length)
DEFAULT 0 1 0
HIRAGANA 1 1 0
KATAKANA 1 1 0
KANJI 0 0 2
ALPHA 1 1 0
NUMERIC 1 1 0
# Character range mapping
0x3041..0x3096 HIRAGANA # Hiragana
0x30A1..0x30F6 KATAKANA # Katakana
0x4E00..0x9FAF KANJI # Kanji
0x0030..0x0039 NUMERIC # Numbers
0x0041..0x005A ALPHA # Uppercase letters
0x0061..0x007A ALPHA # Lowercase letters
- Purpose: Define which characters belong to which category
- Parameters: Settings for compatibility, continuity, default length, etc.
5. Feature Template (feature.def)
Role: Feature template definition
Format: Feature extraction patterns
# Unigram features (word-level features)
UNIGRAM:%F[0] # POS (feature element 0)
UNIGRAM:%F[1] # POS detail 1
UNIGRAM:%F[6] # Base form
UNIGRAM:%F[7] # Reading (Katakana)
# Left context features
LEFT:%L[0] # POS of left word
LEFT:%L[1] # POS detail of left word
# Right context features
RIGHT:%R[0] # POS of right word
RIGHT:%R[1] # POS detail of right word
# Bigram features (combination features)
UNIGRAM:%F[0]/%F[1] # POS + POS detail
UNIGRAM:%F[0]/%F[6] # POS + base form
- Purpose: Define which information to extract features from
- Templates: %F[n] (feature), %L[n] (left context), %R[n] (right context)
6. Feature Normalization Rules (rewrite.def)
Role: Feature normalization rules
Format: Replacement rules (tab-separated)
# Normalize numeric expressions
数 NUM
* UNK
# Normalize proper nouns
名詞,固有名詞 名詞,一般
# Simplify auxiliary verbs
助動詞,*,*,*,特殊・デス 助動詞
助動詞,*,*,*,特殊・ダ 助動詞
- Purpose: Normalize features to improve training efficiency
- Format: original_pattern\treplacement_pattern
- Effect: Generalize rare features to reduce sparsity problems
7. Output Model Format
Role: Output model file
Format: Binary (rkyv) format is standard; JSON format is also supported
The model contains the following information:
{
"feature_weights": [0.0, 0.084, 0.091, ...],
"labels": ["外国", "人", "参政", "権", ...],
"pos_info": ["名詞,一般,*,*,*,*,*,*,*", "名詞,接尾,一般,*,*,*,*,*,*", ...],
"feature_templates": ["UNIGRAM:%F[0]", ...],
"metadata": {
"version": "1.0.0",
"regularization": 0.01,
"iterations": 100,
"feature_count": 13,
"label_count": 19
}
}
- Purpose: Save training results for later dictionary generation
Training Parameter Specifications
- Regularization coefficient (lambda): Controls L1 regularization strength (default: 0.01)
- Maximum iterations (iter): Maximum number of training iterations (default: 100)
- Parallel threads (threads): Number of parallel processing threads (default: 1)
API Usage Example
use std::fs::File;

use lindera_dictionary::trainer::{Corpus, Trainer, TrainerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load configuration from files
    let seed_file = File::open("resources/training/seed.csv")?;
    let char_file = File::open("resources/training/char.def")?;
    let unk_file = File::open("resources/training/unk.def")?;
    let feature_file = File::open("resources/training/feature.def")?;
    let rewrite_file = File::open("resources/training/rewrite.def")?;

    let config = TrainerConfig::from_readers(
        seed_file,
        char_file,
        unk_file,
        feature_file,
        rewrite_file,
    )?;

    // Initialize and configure trainer
    let trainer = Trainer::new(config)?
        .regularization_cost(0.01)
        .max_iter(100)
        .num_threads(4);

    // Load corpus
    let corpus_file = File::open("resources/training/corpus.txt")?;
    let corpus = Corpus::from_reader(corpus_file)?;

    // Execute training
    let model = trainer.train(corpus)?;

    // Save model (binary format)
    let mut output = File::create("trained_model.dat")?;
    model.write_model(&mut output)?;

    // Output in Lindera dictionary format
    let mut lex_out = File::create("output_lex.csv")?;
    let mut conn_out = File::create("output_conn.dat")?;
    let mut unk_out = File::create("output_unk.def")?;
    let mut user_out = File::create("output_user.csv")?;
    model.write_dictionary(&mut lex_out, &mut conn_out, &mut unk_out, &mut user_out)?;

    Ok(())
}
Implementation Status
Completed Features
Core Features
- Core architecture: Complete trainer module structure
- CRF training: Conditional Random Field training via rucrf integration
- CLI integration: lindera train command with full parameter support
- Corpus processing: Full MeCab format corpus support
- Dictionary integration: Dictionary construction from seed.csv, char.def, unk.def
- Feature extraction: Extraction and transformation of unigram/bigram features
- Model saving: Output trained models in JSON/rkyv format
- Dictionary output: Generate Lindera format dictionary files
Advanced Unknown Word Processing
- Comprehensive Unicode support: Full support for CJK extensions, Katakana extensions, Hiragana extensions
- Category-specific POS assignment: Automatic assignment of appropriate POS information by character type
- DEFAULT: 名詞,一般 (unknown character type)
- HIRAGANA/KATAKANA/KANJI: 名詞,一般 (Japanese characters)
- ALPHA: 名詞,固有名詞 (alphabetic characters)
- NUMERIC: 名詞,数 (numeric characters)
- Surface form analysis: Feature generation based on character patterns, length, and position information
- Dynamic cost calculation: Adaptive cost considering character type and context
Refactored Implementation (September 2024 Latest)
- Constant management: Magic number elimination via cost_constants module
- Method splitting: Improved readability by splitting large methods
  - train() → build_lattices_from_corpus(), extract_labels(), train_crf_model(), create_final_model()
- Unified cost calculation: Improved maintainability by unifying duplicate code
  - calculate_known_word_cost(): Known word cost calculation
  - calculate_unknown_word_cost(): Unknown word cost calculation
- Organized debug output: Structured logging via log_debug! macro
- Enhanced error handling: Comprehensive error handling and documentation
Architecture
lindera-dictionary/src/trainer.rs # Main Trainer struct
lindera-dictionary/src/trainer/
├── config.rs # Configuration management
├── corpus.rs # Corpus processing
├── feature_extractor.rs # Feature extraction
├── feature_rewriter.rs # Feature rewriting
└── model.rs # Trained model
Advanced Unknown Word Processing System
Comprehensive Unicode Character Type Detection
The latest implementation significantly extends the basic Unicode ranges and fully supports the character sets listed under Comprehensive Unicode support above (CJK extensions, Katakana extensions, Hiragana extensions). See the category-specific POS assignment details in the Advanced Unknown Word Processing section above.
Feature Weight Optimization
Cost Calculation Constants
mod cost_constants {
    // Known word cost calculation
    pub const KNOWN_WORD_BASE_COST: i16 = 1000;
    pub const KNOWN_WORD_COST_MULTIPLIER: f64 = 500.0;
    pub const KNOWN_WORD_COST_MIN: i16 = 500;
    pub const KNOWN_WORD_COST_MAX: i16 = 3000;
    pub const KNOWN_WORD_DEFAULT_COST: i16 = 1500;

    // Unknown word cost calculation
    pub const UNK_BASE_COST: i32 = 3000;
    pub const UNK_COST_MULTIPLIER: f64 = 500.0;
    pub const UNK_COST_MIN: i32 = 2500;
    pub const UNK_COST_MAX: i32 = 4500;

    // Category-specific adjustments
    pub const UNK_DEFAULT_ADJUSTMENT: i32 = 0; // DEFAULT
    pub const UNK_HIRAGANA_ADJUSTMENT: i32 = 200; // HIRAGANA - minor penalty
    pub const UNK_KATAKANA_ADJUSTMENT: i32 = 0; // KATAKANA - medium
    pub const UNK_KANJI_ADJUSTMENT: i32 = 400; // KANJI - high penalty
    pub const UNK_ALPHA_ADJUSTMENT: i32 = 100; // ALPHA - mild penalty
    pub const UNK_NUMERIC_ADJUSTMENT: i32 = -100; // NUMERIC - bonus (regular)
}
Unified Cost Calculation
// Known word cost calculation
fn calculate_known_word_cost(&self, feature_weight: f64) -> i16 {
    let scaled_weight = (feature_weight * cost_constants::KNOWN_WORD_COST_MULTIPLIER) as i32;
    let final_cost = cost_constants::KNOWN_WORD_BASE_COST as i32 + scaled_weight;
    final_cost.clamp(
        cost_constants::KNOWN_WORD_COST_MIN as i32,
        cost_constants::KNOWN_WORD_COST_MAX as i32,
    ) as i16
}

// Unknown word cost calculation
fn calculate_unknown_word_cost(&self, feature_weight: f64, category: usize) -> i32 {
    let base_cost = cost_constants::UNK_BASE_COST;
    let category_adjustment = match category {
        0 => cost_constants::UNK_DEFAULT_ADJUSTMENT,
        1 => cost_constants::UNK_HIRAGANA_ADJUSTMENT,
        2 => cost_constants::UNK_KATAKANA_ADJUSTMENT,
        3 => cost_constants::UNK_KANJI_ADJUSTMENT,
        4 => cost_constants::UNK_ALPHA_ADJUSTMENT,
        5 => cost_constants::UNK_NUMERIC_ADJUSTMENT,
        _ => 0,
    };
    let scaled_weight = (feature_weight * cost_constants::UNK_COST_MULTIPLIER) as i32;
    let final_cost = base_cost + category_adjustment + scaled_weight;
    final_cost.clamp(cost_constants::UNK_COST_MIN, cost_constants::UNK_COST_MAX)
}
Performance Optimization
Memory Efficiency
- Lazy evaluation: Create merged_model only when needed
- Unused feature removal: Automatic deletion of unnecessary features after training
- Efficient binary format: Fast serialization using rkyv
Parallel Processing Support
let trainer = rucrf::Trainer::new()
    .regularization(rucrf::Regularization::L1, regularization_cost)?
    .max_iter(max_iter)?
    .n_threads(self.num_threads)?; // Multi-threaded training
Practical Training Data Requirements
Recommended Corpus Specifications
Recommendations for generating effective dictionaries for real applications:
- Corpus Size
  - Minimum: 100 sentences (for basic operation verification)
  - Recommended: 1,000+ sentences (practical level)
  - Ideal: 10,000+ sentences (commercial quality)
- Vocabulary Diversity
  - Balanced distribution of different parts of speech
  - Coverage of inflections and suffixes
  - Appropriate inclusion of technical terms and proper nouns
- Quality Control
  - Manual verification of morphological analysis results
  - Consistent application of analysis criteria
  - Maintain error rate below 5%
Lindera CLI
A morphological analysis command-line interface for Lindera.
Install
You can install the binary via cargo as follows:
% cargo install lindera-cli
Alternatively, you can download a binary from the following release page:
Build
Build with IPADIC (Japanese dictionary)
The "ipadic" feature flag allows Lindera to include IPADIC.
% cargo build --release --features=embedded-ipadic
Build with UniDic (Japanese dictionary)
The "unidic" feature flag allows Lindera to include UniDic.
% cargo build --release --features=embedded-unidic
Build with ko-dic (Korean dictionary)
The "ko-dic" feature flag allows Lindera to include ko-dic.
% cargo build --release --features=embedded-ko-dic
Build with CC-CEDICT (Chinese dictionary)
The "cc-cedict" feature flag allows Lindera to include CC-CEDICT.
% cargo build --release --features=embedded-cc-cedict
Build without dictionaries
To reduce Lindera's binary size, omit the dictionary feature flags. This results in a binary containing only the tokenizer and trainer, since no dictionary is embedded.
% cargo build --release
Build with all features
% cargo build --release --all-features
Build dictionary
Build (compile) a morphological analysis dictionary from source CSV files for use with Lindera.
Basic build usage
# Build a system dictionary
lindera build \
--src /path/to/dictionary/csv \
--dest /path/to/output/dictionary \
--metadata ./lindera-ipadic/metadata.json
# Build a user dictionary
lindera build \
--src ./user_dict.csv \
--dest ./user_dictionary \
--metadata ./lindera-ipadic/metadata.json \
--user
Build parameters
- --src/-s: Source directory containing dictionary CSV files (or a single CSV file for a user dictionary)
- --dest/-d: Destination directory for compiled dictionary output
- --metadata/-m: Metadata configuration file (metadata.json) that defines the dictionary structure
- --user/-u: Build a user dictionary instead of a system dictionary (optional flag)
Dictionary types
System dictionary
A full morphological analysis dictionary containing:
- Lexicon entries (word definitions)
- Connection cost matrix
- Unknown word handling rules
- Character type definitions
User dictionary
A supplementary dictionary for custom words that works alongside a system dictionary.
Examples
Build IPADIC (Japanese dictionary)
# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://Lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/mecab-ipadic-2.7.0-20250920 \
--dest /tmp/lindera-ipadic-2.7.0-20250920 \
--metadata ./lindera-ipadic/metadata.json
% ls -al /tmp/lindera-ipadic-2.7.0-20250920
% (cd /tmp && zip -r lindera-ipadic-2.7.0-20250920.zip lindera-ipadic-2.7.0-20250920/)
% tar -czf /tmp/lindera-ipadic-2.7.0-20250920.tar.gz -C /tmp lindera-ipadic-2.7.0-20250920
Build IPADIC NEologd (Japanese dictionary)
# Download and extract IPADIC NEologd source files
% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
--dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
--metadata ./lindera-ipadic-neologd/metadata.json
% ls -al /tmp/lindera-ipadic-neologd-0.0.7-20200820
% (cd /tmp && zip -r lindera-ipadic-neologd-0.0.7-20200820.zip lindera-ipadic-neologd-0.0.7-20200820/)
% tar -czf /tmp/lindera-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp lindera-ipadic-neologd-0.0.7-20200820
Build UniDic (Japanese dictionary)
# Download and extract UniDic source files
% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://Lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/unidic-mecab-2.1.2 \
--dest /tmp/lindera-unidic-2.1.2 \
--metadata ./lindera-unidic/metadata.json
% ls -al /tmp/lindera-unidic-2.1.2
% (cd /tmp && zip -r lindera-unidic-2.1.2.zip lindera-unidic-2.1.2/)
% tar -czf /tmp/lindera-unidic-2.1.2.tar.gz -C /tmp lindera-unidic-2.1.2
Build CC-CEDICT (Chinese dictionary)
# Download and extract CC-CEDICT source files
% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
--dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
--metadata ./lindera-cc-cedict/metadata.json
% ls -al /tmp/lindera-cc-cedict-0.1.0-20200409
% (cd /tmp && zip -r lindera-cc-cedict-0.1.0-20200409.zip lindera-cc-cedict-0.1.0-20200409/)
% tar -czf /tmp/lindera-cc-cedict-0.1.0-20200409.tar.gz -C /tmp lindera-cc-cedict-0.1.0-20200409
Build ko-dic (Korean dictionary)
# Download and extract ko-dic source files
% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://Lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/mecab-ko-dic-2.1.1-20180720 \
--dest /tmp/lindera-ko-dic-2.1.1-20180720 \
--metadata ./lindera-ko-dic/metadata.json
% ls -al /tmp/lindera-ko-dic-2.1.1-20180720
% (cd /tmp && zip -r lindera-ko-dic-2.1.1-20180720.zip lindera-ko-dic-2.1.1-20180720/)
% tar -czf /tmp/lindera-ko-dic-2.1.1-20180720.tar.gz -C /tmp lindera-ko-dic-2.1.1-20180720
Build user dictionary
Build IPADIC user dictionary (Japanese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/ipadic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ipadic/metadata.json \
--user
Build UniDic user dictionary (Japanese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/unidic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-unidic/metadata.json \
--user
Build CC-CEDICT user dictionary (Chinese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/cc-cedict_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-cc-cedict/metadata.json \
--user
Build ko-dic user dictionary (Korean)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/ko-dic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ko-dic/metadata.json \
--user
Tokenize text
Perform morphological analysis (tokenization) on Japanese, Chinese, or Korean text using various dictionaries.
Basic tokenization usage
# Tokenize text using a dictionary directory
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /path/to/dictionary
# Tokenize text using embedded dictionary
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic
# Tokenize with different output format
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic \
--output json
# Tokenize text from file
lindera tokenize \
--dict /path/to/dictionary \
--output wakati \
input.txt
Tokenization parameters
- --dict/-d: Dictionary path or URI (required)
  - File path: /path/to/dictionary
  - Embedded: embedded://ipadic, embedded://unidic, etc.
- --output/-o: Output format (default: mecab)
  - mecab: MeCab-compatible format with part-of-speech info
  - wakati: Space-separated tokens only
  - json: Detailed JSON format with all token information
- --user-dict/-u: User dictionary path (optional)
- --mode/-m: Tokenization mode (default: normal)
  - normal: Standard tokenization
  - decompose: Decompose compound words
- --char-filter/-c: Character filter configuration (JSON)
- --token-filter/-t: Token filter configuration (JSON)
- Input file: Optional file path (default: stdin)
Examples with external dictionaries
Tokenize with external IPADIC (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-ipadic-2.7.0-20250920
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Tokenize with external IPADIC Neologd (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析 名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Tokenize with external UniDic (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-unidic-2.1.2
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語 名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態 名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素 接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析 名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う 動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと 名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき 動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
Tokenize with external ko-dic (Korean dictionary)
% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
--dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어 NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의 JKG,*,F,의,*,*,*,*
형태 NNG,*,F,형태,*,*,*,*
해석 NNG,행위,T,해석,*,*,*,*
을 JKO,*,T,을,*,*,*,*
실시 NNG,행위,F,실시,*,*,*,*
할 VV+ETM,*,T,할,Inflect,VV,ETM,하/VV/*+ᆯ/ETM/*
수 NNG,*,F,수,*,*,*,*
있 VX,*,T,있,*,*,*,*
습니다 EF,*,F,습니다,*,*,*,*
. UNK
EOS
Tokenize with external CC-CEDICT (Chinese dictionary)
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以 *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行 *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文 *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学 *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析 *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。 UNK
EOS
Examples with embedded dictionaries
Lindera can include dictionaries directly in the binary when built with specific feature flags. This allows tokenization without external dictionary files.
Tokenize with embedded IPADIC (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
NOTE: To include the IPADIC dictionary in the binary, you must build with the --features=embedded-ipadic option.
Tokenize with embedded UniDic (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://unidic
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語 名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態 名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素 接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析 名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う 動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと 名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき 動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
NOTE: To include the UniDic dictionary in the binary, you must build with the --features=embedded-unidic option.
Tokenize with embedded IPADIC NEologd (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic-neologd
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析 名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
NOTE: To include the IPADIC NEologd dictionary in the binary, you must build with the --features=embedded-ipadic-neologd option.
Tokenize with embedded ko-dic (Korean dictionary)
% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
--dict embedded://ko-dic
한국어 NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의 JKG,*,F,의,*,*,*,*
형태 NNG,*,F,형태,*,*,*,*
해석 NNG,행위,T,해석,*,*,*,*
을 JKO,*,T,을,*,*,*,*
실시 NNG,행위,F,실시,*,*,*,*
할 VV+ETM,*,T,할,Inflect,VV,ETM,하/VV/*+ᆯ/ETM/*
수 NNG,*,F,수,*,*,*,*
있 VX,*,T,있,*,*,*,*
습니다 EF,*,F,습니다,*,*,*,*
. UNK
EOS
NOTE: To include the ko-dic dictionary in the binary, you must build with the --features=embedded-ko-dic option.
Tokenize with embedded CC-CEDICT (Chinese dictionary)
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict embedded://cc-cedict
可以 *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行 *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文 *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学 *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析 *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。 UNK
EOS
NOTE: To include the CC-CEDICT dictionary in the binary, you must build with the --features=embedded-cc-cedict option.
User dictionary examples
Lindera supports user dictionaries to add custom words alongside system dictionaries. User dictionaries can be in CSV or binary format.
Use user dictionary (CSV format)
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict embedded://ipadic \
--user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
Use user dictionary (Binary format)
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict /tmp/lindera-ipadic-2.7.0-20250920 \
--user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
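For reference, a compiled user dictionary such as the .bin file used above can be produced from the CSV with the build command's --user flag described in the Build user dictionary section, for example:
% lindera build \
--src ./resources/user_dict/ipadic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ipadic/metadata.json \
--user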
Tokenization modes
Lindera provides two tokenization modes: normal and decompose.
Normal mode (default)
Tokenizes faithfully based on words registered in the dictionary:
% echo "関西国際空港限定トートバッグ" | lindera tokenize \
--dict embedded://ipadic \
--mode normal
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK,*,*,*,*,*,*,*,*
EOS
Decompose mode
Additionally decomposes compound nouns into their component words:
% echo "関西国際空港限定トートバッグ" | lindera tokenize \
--dict embedded://ipadic \
--mode decompose
関西 名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際 名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港 名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK,*,*,*,*,*,*,*,*
EOS
Output formats
Lindera provides three output formats: mecab, wakati, and json.
MeCab format (default)
Outputs results in MeCab-compatible format with part-of-speech information:
% echo "お待ちしております。" | lindera tokenize \
--dict embedded://ipadic \
--output mecab
お待ち 名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
おり 動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Wakati format
Outputs only the token text separated by spaces:
% echo "お待ちしております。" | lindera tokenize \
--dict embedded://ipadic \
--output wakati
お待ち し て おり ます 。
JSON format
Outputs detailed token information in JSON format:
% echo "お待ちしております。" | lindera tokenize \
--dict embedded://ipadic \
--output json
[
{
"base_form": "お待ち",
"byte_end": 9,
"byte_start": 0,
"conjugation_form": "*",
"conjugation_type": "*",
"part_of_speech": "名詞",
"part_of_speech_subcategory_1": "サ変接続",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "オマチ",
"reading": "オマチ",
"surface": "お待ち",
"word_id": 14698
},
{
"base_form": "する",
"byte_end": 12,
"byte_start": 9,
"conjugation_form": "サ変・スル",
"conjugation_type": "連用形",
"part_of_speech": "動詞",
"part_of_speech_subcategory_1": "自立",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "シ",
"reading": "シ",
"surface": "し",
"word_id": 30763
},
{
"base_form": "て",
"byte_end": 15,
"byte_start": 12,
"conjugation_form": "*",
"conjugation_type": "*",
"part_of_speech": "助詞",
"part_of_speech_subcategory_1": "接続助詞",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "テ",
"reading": "テ",
"surface": "て",
"word_id": 46603
},
{
"base_form": "おる",
"byte_end": 21,
"byte_start": 15,
"conjugation_form": "五段・ラ行",
"conjugation_type": "連用形",
"part_of_speech": "動詞",
"part_of_speech_subcategory_1": "非自立",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "オリ",
"reading": "オリ",
"surface": "おり",
"word_id": 14239
},
{
"base_form": "ます",
"byte_end": 27,
"byte_start": 21,
"conjugation_form": "特殊・マス",
"conjugation_type": "基本形",
"part_of_speech": "助動詞",
"part_of_speech_subcategory_1": "*",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "マス",
"reading": "マス",
"surface": "ます",
"word_id": 68733
},
{
"base_form": "。",
"byte_end": 30,
"byte_start": 27,
"conjugation_form": "*",
"conjugation_type": "*",
"part_of_speech": "記号",
"part_of_speech_subcategory_1": "句点",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "。",
"reading": "。",
"surface": "。",
"word_id": 101
}
]
Advanced tokenization
Lindera provides an analytical framework that combines character filters, tokenizers, and token filters for advanced text processing. Filters are configured using JSON.
Tokenize with character and token filters
% echo "すもももももももものうち" | lindera tokenize \
--dict embedded://ipadic \
--char-filter 'unicode_normalize:{"kind":"nfkc"}' \
--token-filter 'japanese_keep_tags:{"tags":["名詞,一般"]}'
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
EOS
Dictionary Training (Experimental)
Train a new morphological analysis model from annotated corpus data. To use this feature, you must build with the train feature flag enabled. (The train feature flag is enabled by default.)
Training parameters
- --seed/-s: Seed lexicon file (CSV format) to be weighted
- --corpus/-c: Training corpus (annotated text)
- --char-def/-C: Character definition file (char.def)
- --unk-def/-u: Unknown word definition file (unk.def) to be weighted
- --feature-def/-f: Feature definition file (feature.def)
- --rewrite-def/-r: Rewrite rule definition file (rewrite.def)
- --output/-o: Output model file
- --lambda/-l: L1 regularization (0.0-1.0) (default: 0.01)
- --max-iterations/-i: Maximum number of iterations for training (default: 100)
- --max-threads/-t: Maximum number of threads (defaults to CPU core count, auto-adjusted based on dataset size)
Basic workflow
1. Prepare training files
Seed lexicon file (seed.csv):
The seed lexicon file contains initial dictionary entries used for training the CRF model. Each line represents a word entry with comma-separated fields. The specific field structure varies depending on the dictionary format:
- Surface
- Left context ID
- Right context ID
- Word cost
- Part-of-speech tags (multiple fields)
- Base form
- Reading (katakana)
- Pronunciation
Note: The exact field definitions differ between dictionary formats (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.
外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
Training corpus (corpus.txt):
The training corpus file contains annotated text data used to train the CRF model. Each line consists of:
- A surface form (word) followed by a tab character
- Comma-separated morphological features (part-of-speech tags, base form, reading, pronunciation)
- Sentences are separated by "EOS" (End Of Sentence) markers
Note: The morphological feature format varies depending on the dictionary (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.
外国 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人 名詞,接尾,一般,*,*,*,人,ジン,ジン
参政 名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権 名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS
これ 連体詞,*,*,*,*,*,これ,コレ,コレ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト 名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。 記号,句点,*,*,*,*,。,。,。
EOS
形態 名詞,一般,*,*,*,*,形態,ケイタイ,ケイタイ
素 名詞,接尾,一般,*,*,*,素,ソ,ソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
EOS
For detailed information about file formats and advanced features, see TRAINER_README.md.
2. Train model
lindera train \
--seed ./resources/training/seed.csv \
--corpus ./resources/training/corpus.txt \
--unk-def ./resources/training/unk.def \
--char-def ./resources/training/char.def \
--feature-def ./resources/training/feature.def \
--rewrite-def ./resources/training/rewrite.def \
--output /tmp/lindera/training/model.dat \
--lambda 0.01 \
--max-iterations 100
3. Training results
The trained model will contain:
- Existing words: All seed dictionary records with newly learned weights
- New words: Words from the corpus not in the seed dictionary, added with appropriate weights
Export trained model to dictionary
Export a trained model file to Lindera dictionary format files. This feature requires building with the train feature flag enabled.
Basic export usage
# Export trained model to dictionary files
lindera export \
--model /tmp/lindera/training/model.dat \
--metadata ./resources/training/metadata.json \
--output /tmp/lindera/training/dictionary
Export parameters
- --model/-m: Path to the trained model file (.dat format)
- --output/-o: Directory to output the dictionary files
- --metadata: Optional metadata.json file to update with trained model information
Output files
The export command creates the following dictionary files in the output directory:
- lex.csv: Lexicon file with learned weights
- matrix.def: Connection cost matrix
- unk.def: Unknown word definitions
- char.def: Character type definitions
- metadata.json: Updated metadata file (if the --metadata option is provided)
Complete workflow example
1. Train model
lindera train \
--seed ./resources/training/seed.csv \
--corpus ./resources/training/corpus.txt \
--unk-def ./resources/training/unk.def \
--char-def ./resources/training/char.def \
--feature-def ./resources/training/feature.def \
--rewrite-def ./resources/training/rewrite.def \
--output /tmp/lindera/training/model.dat \
--lambda 0.01 \
--max-iterations 100
2. Export to dictionary format
lindera export \
--model /tmp/lindera/training/model.dat \
--metadata ./resources/training/metadata.json \
--output /tmp/lindera/training/dictionary
3. Build dictionary
lindera build \
--src /tmp/lindera/training/dictionary \
--dest /tmp/lindera/training/compiled_dictionary \
--metadata /tmp/lindera/training/dictionary/metadata.json
4. Use trained dictionary
echo "これは外国人参政権です。" | lindera tokenize \
-d /tmp/lindera/training/compiled_dictionary
Metadata update feature
When the --metadata option is provided, the export command will:
- Read the base metadata.json file to preserve existing configuration
- Update specific fields with values from the trained model:
  - default_left_context_id: Maximum left context ID from the trained model
  - default_right_context_id: Maximum right context ID from the trained model
  - default_word_cost: Calculated from the median feature weight
  - model_info: Training statistics including:
    - feature_count: Number of features in the model
    - label_count: Number of labels in the model
    - max_left_context_id: Maximum left context ID
    - max_right_context_id: Maximum right context ID
    - connection_matrix_size: Size of the connection cost matrix
    - training_iterations: Number of training iterations performed
    - regularization: L1 regularization parameter used
    - version: Model version
    - updated_at: Timestamp of when the model was exported
- Preserve existing settings such as:
  - Dictionary name
  - Character encoding settings
  - Schema definitions
  - Other user-defined configuration
This allows you to maintain your base dictionary configuration while incorporating the optimized parameters learned during training.
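As an illustration only (all values below are hypothetical), the model_info block written into metadata.json might look like this:
"model_info": {
  "feature_count": 13,
  "label_count": 19,
  "max_left_context_id": 18,
  "max_right_context_id": 18,
  "connection_matrix_size": 361,
  "training_iterations": 100,
  "regularization": 0.01,
  "version": "1.0.0",
  "updated_at": "2025-01-01T00:00:00Z"
}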
API reference
The API reference is available. Please see the following URL:
API Reference
The API reference is available. Please see the following URL:
Contributing
(Content for Contributing goes here)