Lindera

License: MIT

A morphological analysis library in Rust. This project is forked from kuromoji-rs.

Lindera aims to be an easy-to-install library that provides concise APIs for various Rust applications.

Installation

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }

Environment Variables

LINDERA_CACHE

The LINDERA_CACHE environment variable specifies a directory for caching dictionary source files. This enables:

  • Offline builds: Once downloaded, dictionary source files are preserved for future builds
  • Faster builds: Subsequent builds skip downloading if valid cached files exist
  • Reproducible builds: Ensures consistent dictionary versions across builds

Usage:

export LINDERA_CACHE=/path/to/cache
cargo build --features=ipadic

When set, dictionary source files are stored in $LINDERA_CACHE/<version>/ where <version> is the lindera-dictionary crate version. The cache validates files using MD5 checksums - invalid files are automatically re-downloaded.

LINDERA_CONFIG_PATH

The LINDERA_CONFIG_PATH environment variable specifies the path to a YAML configuration file for the tokenizer. This allows you to configure tokenizer behavior without modifying Rust code.

export LINDERA_CONFIG_PATH=./resources/config/lindera.yml

See the Configuration section for details on the configuration format.

DOCS_RS

The DOCS_RS environment variable is automatically set by docs.rs when building documentation. When this variable is detected, Lindera creates dummy dictionary files instead of downloading actual dictionary data, allowing documentation to be built without network access or large file downloads.

This is primarily used internally by docs.rs and typically doesn't need to be set by users.

LINDERA_WORKDIR

The LINDERA_WORKDIR environment variable is automatically set during the build process by the lindera-dictionary crate. It points to the directory containing the built dictionary data files and is used internally by dictionary crates to locate their data files.

This variable is set automatically and should not be modified by users.

Quick Start

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=embedded-ipadic --example=tokenize

You can see the result as follows:

text:   関西国際空港限定トートバッグ
token:  関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token:  限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token:  トートバッグ    UNK

Dictionaries

Lindera supports various dictionaries. This section describes the format of each dictionary and the format for user dictionaries.

  • IPADIC - The most common dictionary for Japanese.
  • IPADIC NEologd - IPADIC with neologisms (new words).
  • UniDic - A dictionary with uniform word unit definitions.
  • ko-dic - A dictionary for Korean.
  • CC-CEDICT - A dictionary for Chinese.

Lindera IPADIC

Dictionary version

This repository contains mecab-ipadic.

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 品詞 | Part-of-speech |
2 | 読み | Reading |
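
For example, a simple-format entry consists of a surface form, a part-of-speech, and a reading, as in the user dictionary example later in this README:

東京スカイツリー,カスタム名詞,トウキョウスカイツリー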

Detailed version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |
13 | - | - | After 13, it can be freely expanded.
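
A hypothetical detailed-format entry might look like the following; the context IDs and cost shown here are illustrative only, and real values depend on the dictionary:

東京スカイツリー,1288,1288,-1000,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー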

API reference

The API reference is available. Please see the following URL:

Lindera IPADIC NEologd

Dictionary version

This repository contains mecab-ipadic-neologd.

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 品詞 | Part-of-speech |
2 | 読み | Reading |

Detailed version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |
13 | - | - | After 13, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Lindera UniDic

Dictionary version

This repository contains unidic-mecab.

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞大分類 | Part-of-speech |
5 | 品詞中分類 | Part-of-speech subcategory 1 |
6 | 品詞小分類 | Part-of-speech subcategory 2 |
7 | 品詞細分類 | Part-of-speech subcategory 3 |
8 | 活用型 | Conjugation type |
9 | 活用形 | Conjugation form |
10 | 語彙素読み | Reading |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme |
12 | 書字形出現形 | Orthographic surface form |
13 | 発音形出現形 | Phonological surface form |
14 | 書字形基本形 | Orthographic base form |
15 | 発音形基本形 | Phonological base form |
16 | 語種 | Word type |
17 | 語頭変化型 | Initial mutation type |
18 | 語頭変化形 | Initial mutation form |
19 | 語末変化型 | Final mutation type |
20 | 語末変化形 | Final mutation form |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 品詞大分類 | Part-of-speech |
2 | 語彙素読み | Reading |
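
As with IPADIC, an illustrative simple-format entry is a surface form, a part-of-speech, and a reading; the values below are hypothetical:

東京スカイツリー,カスタム名詞,トウキョウスカイツリー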

Detailed version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞大分類 | Part-of-speech |
5 | 品詞中分類 | Part-of-speech subcategory 1 |
6 | 品詞小分類 | Part-of-speech subcategory 2 |
7 | 品詞細分類 | Part-of-speech subcategory 3 |
8 | 活用型 | Conjugation type |
9 | 活用形 | Conjugation form |
10 | 語彙素読み | Reading |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme |
12 | 書字形出現形 | Orthographic surface form |
13 | 発音形出現形 | Phonological surface form |
14 | 書字形基本形 | Orthographic base form |
15 | 発音形基本形 | Phonological base form |
16 | 語種 | Word type |
17 | 語頭変化型 | Initial mutation type |
18 | 語頭変化形 | Initial mutation form |
19 | 語末変化型 | Final mutation type |
20 | 語末変化形 | Final mutation form |
21 | - | - | After 21, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Lindera ko-dic

Dictionary version

This repository contains mecab-ko-dic.

Dictionary format

Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in this Google Spreadsheet, linked from mecab-ko-dic's repository readme.

Note how ko-dic has one less feature column than NAIST JDIC, and has an altogether different set of information (e.g. doesn't provide the "original form" of the word).

The tags are a slight modification of those specified by 세종 (Sejong). The mappings from Sejong to mecab-ko-dic's tag names are given in the 태그 v2.0 tab of the spreadsheet linked above.

The dictionary format is specified fully (in Korean) in tab 사전 형식 v2.0 of the spreadsheet. Any blank values default to *.

Index | Name (Korean) | Name (English) | Notes
------|---------------|----------------|------
0 | 표면 | Surface |
1 | 왼쪽 문맥 ID | Left context ID |
2 | 오른쪽 문맥 ID | Right context ID |
3 | 비용 | Cost |
4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab on the spreadsheet
5 | 의미 부류 | Meaning | (too few examples for me to be sure)
6 | 종성 유무 | Presence or absence | T for true; F for false; else *
7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words
8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석)
9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV
10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP
11 | 표현 | Expression | Field describing how inflected forms (활용), compound nouns (복합명사), and pre-analyzed entries (기분석) are composed

User dictionary format (CSV)

Simple version

Index | Name (Korean) | Name (English) | Notes
------|---------------|----------------|------
0 | 표면 | Surface |
1 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab on the spreadsheet
2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words
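
A hypothetical simple-format entry (surface, part-of-speech tag, reading); the word and tag below are illustrative only:

하네다공항,NNG,하네다공항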

Detailed version

Index | Name (Korean) | Name (English) | Notes
------|---------------|----------------|------
0 | 표면 | Surface |
1 | 왼쪽 문맥 ID | Left context ID |
2 | 오른쪽 문맥 ID | Right context ID |
3 | 비용 | Cost |
4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab on the spreadsheet
5 | 의미 부류 | Meaning | (too few examples for me to be sure)
6 | 종성 유무 | Presence or absence | T for true; F for false; else *
7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words
8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석)
9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV
10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP
11 | 표현 | Expression | Field describing how inflected forms (활용), compound nouns (복합명사), and pre-analyzed entries (기분석) are composed
12 | - | - | After 12, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Lindera CC-CEDICT

Dictionary version

This repository contains CC-CEDICT-MeCab.

Dictionary format

Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.

Index | Name (Chinese) | Name (English) | Notes
------|----------------|----------------|------
0 | 表面形式 | Surface |
1 | 左语境ID | Left context ID |
2 | 右语境ID | Right context ID |
3 | 成本 | Cost |
4 | 词类 | Part-of-speech |
5 | 词类1 | Part-of-speech subcategory 1 |
6 | 词类2 | Part-of-speech subcategory 2 |
7 | 词类3 | Part-of-speech subcategory 3 |
8 | 併音 | Pinyin |
9 | 繁体字 | Traditional |
10 | 簡体字 | Simplified |
11 | 定义 | Definition |

User dictionary format (CSV)

Simple version

Index | Name (Chinese) | Name (English) | Notes
------|----------------|----------------|------
0 | 表面形式 | Surface |
1 | 词类 | Part-of-speech |
2 | 併音 | Pinyin |
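
A hypothetical simple-format entry (surface, part-of-speech, pinyin); the word and pinyin below are illustrative only:

东京晴空塔,名词,dong1 jing1 qing2 kong1 ta3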

Detailed version

Index | Name (Chinese) | Name (English) | Notes
------|----------------|----------------|------
0 | 表面形式 | Surface |
1 | 左语境ID | Left context ID |
2 | 右语境ID | Right context ID |
3 | 成本 | Cost |
4 | 词类 | Part-of-speech |
5 | 词类1 | Part-of-speech subcategory 1 |
6 | 词类2 | Part-of-speech subcategory 2 |
7 | 词类3 | Part-of-speech subcategory 3 |
8 | 併音 | Pinyin |
9 | 繁体字 | Traditional |
10 | 簡体字 | Simplified |
11 | 定义 | Definition |
12 | - | - | After 12, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Configuration

Lindera can read configuration files in YAML format. Specify the path to a file like the following in the LINDERA_CONFIG_PATH environment variable. This lets you configure the tokenizer's behavior without writing Rust code.

segmenter:
  mode: "normal"
  dictionary:
    kind: "ipadic"
  user_dictionary:
    path: "./resources/user_dict/ipadic_simple.csv"
    kind: "ipadic"

character_filters:
  - kind: "unicode_normalize"
    args:
      kind: "nfkc"
  - kind: "japanese_iteration_mark"
    args:
      normalize_kanji: true
      normalize_kana: true
  - kind: mapping
    args:
       mapping:
         リンデラ: Lindera

token_filters:
  - kind: "japanese_compound_word"
    args:
      tags:
        - "名詞,数"
        - "名詞,接尾,助数詞"
      new_tag: "名詞,数"
  - kind: "japanese_number"
    args:
      tags:
        - "名詞,数"
  - kind: "japanese_stop_tags"
    args:
      tags:
        - "接続詞"
        - "助詞"
        - "助詞,格助詞"
        - "助詞,格助詞,一般"
        - "助詞,格助詞,引用"
        - "助詞,格助詞,連語"
        - "助詞,係助詞"
        - "助詞,副助詞"
        - "助詞,間投助詞"
        - "助詞,並立助詞"
        - "助詞,終助詞"
        - "助詞,副助詞/並立助詞/終助詞"
        - "助詞,連体化"
        - "助詞,副詞化"
        - "助詞,特殊"
        - "助動詞"
        - "記号"
        - "記号,一般"
        - "記号,読点"
        - "記号,句点"
        - "記号,空白"
        - "記号,括弧閉"
        - "その他,間投"
        - "フィラー"
        - "非言語音"
  - kind: "japanese_katakana_stem"
    args:
      min: 3
  - kind: "remove_diacritical_mark"
    args:
      japanese: false

% export LINDERA_CONFIG_PATH=./resources/config/lindera.yml

The following example builds a tokenizer from the configuration file:

use std::path::PathBuf;

use lindera::tokenizer::TokenizerBuilder;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Load tokenizer configuration from file
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("config")
        .join("lindera.yml");

    let builder = TokenizerBuilder::from_file(&path)?;

    let tokenizer = builder.build()?;

    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {text}");

    let tokens = tokenizer.tokenize(&text)?;

    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

Advanced Usage

Tokenization with user dictionary

You can provide user dictionary entries along with the default system dictionary. The user dictionary should be a CSV file with the following format.

<surface>,<part_of_speech>,<reading>

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }

For example:

% cat ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ

With a user dictionary, the Tokenizer is created as follows:

use std::fs::File;
use std::path::PathBuf;

use lindera::dictionary::{Metadata, load_dictionary, load_user_dictionary};
use lindera::error::LinderaErrorKind;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let user_dict_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("user_dict")
        .join("ipadic_simple_userdic.csv");

    let metadata_file = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../lindera-ipadic")
        .join("metadata.json");
    let metadata: Metadata = serde_json::from_reader(
        File::open(metadata_file)
            .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
            .unwrap(),
    )
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
    .unwrap();

    let dictionary = load_dictionary("embedded://ipadic")?;
    let user_dictionary = load_user_dictionary(user_dict_path.to_str().unwrap(), &metadata)?;
    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        Some(user_dictionary), // Using the loaded user dictionary
    );

    // Create a tokenizer.
    let tokenizer = Tokenizer::new(segmenter);

    // Tokenize a text.
    let text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";
    let mut tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

The above example can be run by cargo run --example:

% cargo run --features=embedded-ipadic --example=tokenize_with_user_dict
text:   東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です
token:  東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
token:  の      助詞,連体化,*,*,*,*,の,ノ,ノ
token:  最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
token:  は      助詞,係助詞,*,*,*,*,は,ハ,ワ
token:  とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
token:  です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス

Tokenize with filters

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }

This example covers the basic usage of the Lindera analysis framework.

It will:

  • Apply character filter for Unicode normalization (NFKC)
  • Tokenize the input text with IPADIC
  • Apply token filters for removing stop tags (part-of-speech) and a Japanese katakana stem filter

use lindera::character_filter::BoxCharacterFilter;
use lindera::character_filter::japanese_iteration_mark::JapaneseIterationMarkCharacterFilter;
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::token_filter::BoxTokenFilter;
use lindera::token_filter::japanese_compound_word::JapaneseCompoundWordTokenFilter;
use lindera::token_filter::japanese_number::JapaneseNumberTokenFilter;
use lindera::token_filter::japanese_stop_tags::JapaneseStopTagsTokenFilter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        None, // No user dictionary for this example
    );

    let unicode_normalize_char_filter =
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC);

    let japanese_iteration_mark_char_filter =
        JapaneseIterationMarkCharacterFilter::new(true, true);

    let japanese_compound_word_token_filter = JapaneseCompoundWordTokenFilter::new(
        vec!["名詞,数".to_string(), "名詞,接尾,助数詞".to_string()]
            .into_iter()
            .collect(),
        Some("複合語".to_string()),
    );

    let japanese_number_token_filter =
        JapaneseNumberTokenFilter::new(Some(vec!["名詞,数".to_string()].into_iter().collect()));

    let japanese_stop_tags_token_filter = JapaneseStopTagsTokenFilter::new(
        vec![
            "接続詞".to_string(),
            "助詞".to_string(),
            "助詞,格助詞".to_string(),
            "助詞,格助詞,一般".to_string(),
            "助詞,格助詞,引用".to_string(),
            "助詞,格助詞,連語".to_string(),
            "助詞,係助詞".to_string(),
            "助詞,副助詞".to_string(),
            "助詞,間投助詞".to_string(),
            "助詞,並立助詞".to_string(),
            "助詞,終助詞".to_string(),
            "助詞,副助詞/並立助詞/終助詞".to_string(),
            "助詞,連体化".to_string(),
            "助詞,副詞化".to_string(),
            "助詞,特殊".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
            "記号,一般".to_string(),
            "記号,読点".to_string(),
            "記号,句点".to_string(),
            "記号,空白".to_string(),
            "記号,括弧閉".to_string(),
            "その他,間投".to_string(),
            "フィラー".to_string(),
            "非言語音".to_string(),
        ]
        .into_iter()
        .collect(),
    );

    // Create a tokenizer.
    let mut tokenizer = Tokenizer::new(segmenter);

    tokenizer
        .append_character_filter(BoxCharacterFilter::from(unicode_normalize_char_filter))
        .append_character_filter(BoxCharacterFilter::from(
            japanese_iteration_mark_char_filter,
        ))
        .append_token_filter(BoxTokenFilter::from(japanese_compound_word_token_filter))
        .append_token_filter(BoxTokenFilter::from(japanese_number_token_filter))
        .append_token_filter(BoxTokenFilter::from(japanese_stop_tags_token_filter));

    // Tokenize a text.
    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。";
    let tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text: {}", text);
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=embedded-ipadic --example=tokenize_with_filters

You can see the result as follows:

text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: "Lindera", start: 0, end: 21, details: Some(["UNK"])
token: "形態素", start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: "解析", start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: "エンジン", start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: "ユーザー", start: 63, end: 75, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: "辞書", start: 75, end: 81, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: "利用", start: 84, end: 90, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: "可能", start: 90, end: 96, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])

Dictionary Training (Experimental)

Lindera provides CRF-based dictionary training functionality for creating custom morphological analysis models.

Overview

Lindera Trainer is a Conditional Random Field (CRF) based morphological analyzer training system with the following advanced features:

  • CRF-based statistical learning: Efficient implementation using the rucrf crate
  • L1 regularization: Prevents overfitting
  • Multi-threaded training: Parallel processing for faster training
  • Comprehensive Unicode support: Full CJK extension support
  • Advanced unknown word handling: Intelligent mixed character type classification
  • Multi-stage weight optimization: Advanced normalization system for trained weights
  • Lindera dictionary compatibility: Full compatibility with existing dictionary formats

CLI Usage

For detailed CLI command usage, see lindera-cli/README.md.

Required File Format Specifications

1. Vocabulary Dictionary (seed.csv)

Role: Base vocabulary dictionary
Format: MeCab format CSV

外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
参政,0,0,0,名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
  • Purpose: Define basic words and their part-of-speech information for training
  • Structure: surface,left_id,right_id,cost,pos,pos_detail1,pos_detail2,pos_detail3,inflection_type,inflection_form,base_form,reading,pronunciation

2. Unknown Word Definition (unk.def)

Role: Unknown word processing definition
Format: Unknown word parameters by character type

DEFAULT,0,0,0,名詞,一般,*,*,*,*,*,*,*
HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KATAKANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KANJI,0,0,0,名詞,一般,*,*,*,*,*,*,*
ALPHA,0,0,0,名詞,固有名詞,一般,*,*,*,*,*,*
NUMERIC,0,0,0,名詞,数,*,*,*,*,*,*,*
  • Purpose: Define processing methods for out-of-vocabulary words by character type
  • Note: These labels are for internal processing and are not output in the final dictionary file

3. Training Corpus (corpus.txt)

Role: Training data (annotated corpus)
Format: Tab-separated tokenized text

外国	名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人	名詞,接尾,一般,*,*,*,人,ジン,ジン
参政	名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権	名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS

これ	連体詞,*,*,*,*,*,これ,コレ,コレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト	名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
EOS
  • Purpose: Sentences and their correct analysis results for training
  • Format: Each line is surface\tpos_info, sentences end with EOS
  • Important: Training quality heavily depends on the quantity and quality of this corpus

4. Character Type Definition (char.def)

Role: Character type definition
Format: Character categories and character code ranges

# Character category definition (category_name compatibility_flag continuity_flag length)
DEFAULT 0 1 0
HIRAGANA 1 1 0
KATAKANA 1 1 0
KANJI 0 0 2
ALPHA 1 1 0
NUMERIC 1 1 0

# Character range mapping
0x3041..0x3096 HIRAGANA  # Hiragana
0x30A1..0x30F6 KATAKANA  # Katakana
0x4E00..0x9FAF KANJI     # Kanji
0x0030..0x0039 NUMERIC   # Numbers
0x0041..0x005A ALPHA     # Uppercase letters
0x0061..0x007A ALPHA     # Lowercase letters
  • Purpose: Define which characters belong to which category
  • Parameters: Settings for compatibility, continuity, default length, etc.

5. Feature Template (feature.def)

Role: Feature template definition
Format: Feature extraction patterns

# Unigram features (word-level features)
UNIGRAM:%F[0]         # POS (feature element 0)
UNIGRAM:%F[1]         # POS detail 1
UNIGRAM:%F[6]         # Base form
UNIGRAM:%F[7]         # Reading (Katakana)

# Left context features
LEFT:%L[0]            # POS of left word
LEFT:%L[1]            # POS detail of left word

# Right context features
RIGHT:%R[0]           # POS of right word
RIGHT:%R[1]           # POS detail of right word

# Bigram features (combination features)
UNIGRAM:%F[0]/%F[1]   # POS + POS detail
UNIGRAM:%F[0]/%F[6]   # POS + base form
  • Purpose: Define which information to extract features from
  • Templates: %F[n] (feature), %L[n] (left context), %R[n] (right context)

6. Feature Normalization Rules (rewrite.def)

Role: Feature normalization rules
Format: Replacement rules (tab-separated)

# Normalize numeric expressions
数	NUM
*	UNK

# Normalize proper nouns
名詞,固有名詞	名詞,一般

# Simplify auxiliary verbs
助動詞,*,*,*,特殊・デス	助動詞
助動詞,*,*,*,特殊・ダ	助動詞
  • Purpose: Normalize features to improve training efficiency
  • Format: original_pattern\treplacement_pattern
  • Effect: Generalize rare features to reduce sparsity problems

7. Output Model Format

Role: Output model file
Format: Binary (rkyv) format is standard, JSON format also supported

The model contains the following information:

{
  "feature_weights": [0.0, 0.084, 0.091, ...],
  "labels": ["外国", "人", "参政", "権", ...],
  "pos_info": ["名詞,一般,*,*,*,*,*,*,*", "名詞,接尾,一般,*,*,*,*,*,*", ...],
  "feature_templates": ["UNIGRAM:%F[0]", ...],
  "metadata": {
    "version": "1.0.0",
    "regularization": 0.01,
    "iterations": 100,
    "feature_count": 13,
    "label_count": 19
  }
}
  • Purpose: Save training results for later dictionary generation

Training Parameter Specifications

  • Regularization coefficient (lambda): Controls L1 regularization strength (default: 0.01)
  • Maximum iterations (iter): Maximum number of training iterations (default: 100)
  • Parallel threads (threads): Number of parallel processing threads (default: 1)

API Usage Example

#![allow(unused)]
fn main() {
use std::fs::File;
use lindera_dictionary::trainer::{Corpus, Trainer, TrainerConfig};

// Load configuration from files
let seed_file = File::open("resources/training/seed.csv")?;
let char_file = File::open("resources/training/char.def")?;
let unk_file = File::open("resources/training/unk.def")?;
let feature_file = File::open("resources/training/feature.def")?;
let rewrite_file = File::open("resources/training/rewrite.def")?;

let config = TrainerConfig::from_readers(
    seed_file,
    char_file,
    unk_file,
    feature_file,
    rewrite_file
)?;

// Initialize and configure trainer
let trainer = Trainer::new(config)?
    .regularization_cost(0.01)
    .max_iter(100)
    .num_threads(4);

// Load corpus
let corpus_file = File::open("resources/training/corpus.txt")?;
let corpus = Corpus::from_reader(corpus_file)?;

// Execute training
let model = trainer.train(corpus)?;

// Save model (binary format)
let mut output = File::create("trained_model.dat")?;
model.write_model(&mut output)?;

// Output in Lindera dictionary format
let mut lex_out = File::create("output_lex.csv")?;
let mut conn_out = File::create("output_conn.dat")?;
let mut unk_out = File::create("output_unk.def")?;
let mut user_out = File::create("output_user.csv")?;
model.write_dictionary(&mut lex_out, &mut conn_out, &mut unk_out, &mut user_out)?;

Ok::<(), Box<dyn std::error::Error>>(())
}

Implementation Status

Completed Features

Core Features
  • Core architecture: Complete trainer module structure
  • CRF training: Conditional Random Field training via rucrf integration
  • CLI integration: lindera train command with full parameter support
  • Corpus processing: Full MeCab format corpus support
  • Dictionary integration: Dictionary construction from seed.csv, char.def, unk.def
  • Feature extraction: Extraction and transformation of unigram/bigram features
  • Model saving: Output trained models in JSON/rkyv format
  • Dictionary output: Generate Lindera format dictionary files
Advanced Unknown Word Processing
  • Comprehensive Unicode support: Full support for CJK extensions, Katakana extensions, Hiragana extensions
  • Category-specific POS assignment: Automatic assignment of appropriate POS information by character type
    • DEFAULT: 名詞,一般 (unknown character type)
    • HIRAGANA/KATAKANA/KANJI: 名詞,一般 (Japanese characters)
    • ALPHA: 名詞,固有名詞 (alphabetic characters)
    • NUMERIC: 名詞,数 (numeric characters)
  • Surface form analysis: Feature generation based on character patterns, length, and position information
  • Dynamic cost calculation: Adaptive cost considering character type and context
Refactored Implementation (as of September 2024)
  • Constant management: Magic number elimination via cost_constants module
  • Method splitting: Improved readability by splitting large methods
    • train() → build_lattices_from_corpus(), extract_labels(), train_crf_model(), create_final_model()
  • Unified cost calculation: Improved maintainability by unifying duplicate code
    • calculate_known_word_cost(): Known word cost calculation
    • calculate_unknown_word_cost(): Unknown word cost calculation
  • Organized debug output: Structured logging via log_debug! macro
  • Enhanced error handling: Comprehensive error handling and documentation

Architecture

lindera-dictionary/src/trainer.rs  # Main Trainer struct
lindera-dictionary/src/trainer/
├── config.rs           # Configuration management
├── corpus.rs           # Corpus processing
├── feature_extractor.rs # Feature extraction
├── feature_rewriter.rs  # Feature rewriting
└── model.rs            # Trained model

Advanced Unknown Word Processing System

Comprehensive Unicode Character Type Detection

The latest implementation significantly extends the basic Unicode ranges and fully supports additional character sets (see the category-specific POS assignment details in the Advanced Unknown Word Processing section above).

Feature Weight Optimization

Cost Calculation Constants
#![allow(unused)]
fn main() {
mod cost_constants {
    // Known word cost calculation
    pub const KNOWN_WORD_BASE_COST: i16 = 1000;
    pub const KNOWN_WORD_COST_MULTIPLIER: f64 = 500.0;
    pub const KNOWN_WORD_COST_MIN: i16 = 500;
    pub const KNOWN_WORD_COST_MAX: i16 = 3000;
    pub const KNOWN_WORD_DEFAULT_COST: i16 = 1500;

    // Unknown word cost calculation
    pub const UNK_BASE_COST: i32 = 3000;
    pub const UNK_COST_MULTIPLIER: f64 = 500.0;
    pub const UNK_COST_MIN: i32 = 2500;
    pub const UNK_COST_MAX: i32 = 4500;

    // Category-specific adjustments
    pub const UNK_DEFAULT_ADJUSTMENT: i32 = 0;     // DEFAULT
    pub const UNK_HIRAGANA_ADJUSTMENT: i32 = 200;  // HIRAGANA - minor penalty
    pub const UNK_KATAKANA_ADJUSTMENT: i32 = 0;    // KATAKANA - medium
    pub const UNK_KANJI_ADJUSTMENT: i32 = 400;     // KANJI - high penalty
    pub const UNK_ALPHA_ADJUSTMENT: i32 = 100;     // ALPHA - mild penalty
    pub const UNK_NUMERIC_ADJUSTMENT: i32 = -100;  // NUMERIC - bonus (regular)
}
}
Unified Cost Calculation
#![allow(unused)]
fn main() {
// Known word cost calculation
fn calculate_known_word_cost(&self, feature_weight: f64) -> i16 {
    let scaled_weight = (feature_weight * cost_constants::KNOWN_WORD_COST_MULTIPLIER) as i32;
    let final_cost = cost_constants::KNOWN_WORD_BASE_COST as i32 + scaled_weight;
    final_cost.clamp(
        cost_constants::KNOWN_WORD_COST_MIN as i32,
        cost_constants::KNOWN_WORD_COST_MAX as i32
    ) as i16
}

// Unknown word cost calculation
fn calculate_unknown_word_cost(&self, feature_weight: f64, category: usize) -> i32 {
    let base_cost = cost_constants::UNK_BASE_COST;
    let category_adjustment = match category {
        0 => cost_constants::UNK_DEFAULT_ADJUSTMENT,
        1 => cost_constants::UNK_HIRAGANA_ADJUSTMENT,
        2 => cost_constants::UNK_KATAKANA_ADJUSTMENT,
        3 => cost_constants::UNK_KANJI_ADJUSTMENT,
        4 => cost_constants::UNK_ALPHA_ADJUSTMENT,
        5 => cost_constants::UNK_NUMERIC_ADJUSTMENT,
        _ => 0,
    };
    let scaled_weight = (feature_weight * cost_constants::UNK_COST_MULTIPLIER) as i32;
    let final_cost = base_cost + category_adjustment + scaled_weight;
    final_cost.clamp(
        cost_constants::UNK_COST_MIN,
        cost_constants::UNK_COST_MAX
    )
}
}

Performance Optimization

Memory Efficiency

  • Lazy evaluation: Create merged_model only when needed
  • Unused feature removal: Automatic deletion of unnecessary features after training
  • Efficient binary format: Fast serialization using rkyv

Parallel Processing Support

#![allow(unused)]
fn main() {
let trainer = rucrf::Trainer::new()
    .regularization(rucrf::Regularization::L1, regularization_cost)?
    .max_iter(max_iter)?
    .n_threads(self.num_threads)?;  // Multi-threaded training
}

Practical Training Data Requirements

Recommendations for generating effective dictionaries for real applications:

  1. Corpus Size

    • Minimum: 100 sentences (for basic operation verification)
    • Recommended: 1,000+ sentences (practical level)
    • Ideal: 10,000+ sentences (commercial quality)
  2. Vocabulary Diversity

    • Balanced distribution of different parts of speech
    • Coverage of inflections and suffixes
    • Appropriate inclusion of technical terms and proper nouns
  3. Quality Control

    • Manual verification of morphological analysis results
    • Consistent application of analysis criteria
    • Maintain error rate below 5%

Lindera CLI

A morphological analysis command-line interface for Lindera.

Install

You can install the binary via cargo as follows:

% cargo install lindera-cli

Alternatively, you can download a binary from the following release page:

Build

Build with IPADIC (Japanese dictionary)

The "ipadic" feature flag allows Lindera to include IPADIC.

% cargo build --release --features=embedded-ipadic

Build with UniDic (Japanese dictionary)

The "unidic" feature flag allows Lindera to include UniDic.

% cargo build --release --features=embedded-unidic

Build with ko-dic (Korean dictionary)

The "ko-dic" feature flag allows Lindera to include ko-dic.

% cargo build --release --features=embedded-ko-dic

Build with CC-CEDICT (Chinese dictionary)

The "cc-cedict" feature flag allows Lindera to include CC-CEDICT.

% cargo build --release --features=embedded-cc-cedict

Build without dictionaries

To reduce Lindera's binary size, omit the feature flag. This results in a binary containing only the tokenizer and trainer, as it no longer includes the dictionary.

% cargo build --release

Build with all features

% cargo build --release --all-features

Build dictionary

Build (compile) a morphological analysis dictionary from source CSV files for use with Lindera.

Basic build usage

# Build a system dictionary
lindera build \
  --src /path/to/dictionary/csv \
  --dest /path/to/output/dictionary \
  --metadata ./lindera-ipadic/metadata.json

# Build a user dictionary
lindera build \
  --src ./user_dict.csv \
  --dest ./user_dictionary \
  --metadata ./lindera-ipadic/metadata.json \
  --user

Build parameters

  • --src / -s: Source directory containing dictionary CSV files (or single CSV file for user dictionary)
  • --dest / -d: Destination directory for compiled dictionary output
  • --metadata / -m: Metadata configuration file (metadata.json) that defines dictionary structure
  • --user / -u: Build user dictionary instead of system dictionary (optional flag)

Dictionary types

System dictionary

A full morphological analysis dictionary containing:

  • Lexicon entries (word definitions)
  • Connection cost matrix
  • Unknown word handling rules
  • Character type definitions

User dictionary

A supplementary dictionary for custom words that works alongside a system dictionary.

Examples

Build IPADIC (Japanese dictionary)

# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ipadic-2.7.0-20250920 \
  --dest /tmp/lindera-ipadic-2.7.0-20250920 \
  --metadata ./lindera-ipadic/metadata.json

% ls -al /tmp/lindera-ipadic-2.7.0-20250920
% (cd /tmp && zip -r lindera-ipadic-2.7.0-20250920.zip lindera-ipadic-2.7.0-20250920/)
% tar -czf /tmp/lindera-ipadic-2.7.0-20250920.tar.gz -C /tmp lindera-ipadic-2.7.0-20250920

Build IPADIC NEologd (Japanese dictionary)

# Download and extract IPADIC NEologd source files
% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
  --dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
  --metadata ./lindera-ipadic-neologd/metadata.json

% ls -al /tmp/lindera-ipadic-neologd-0.0.7-20200820
% (cd /tmp && zip -r lindera-ipadic-neologd-0.0.7-20200820.zip lindera-ipadic-neologd-0.0.7-20200820/)
% tar -czf /tmp/lindera-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp lindera-ipadic-neologd-0.0.7-20200820

Build UniDic (Japanese dictionary)

# Download and extract UniDic source files
% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/unidic-mecab-2.1.2 \
  --dest /tmp/lindera-unidic-2.1.2 \
  --metadata ./lindera-unidic/metadata.json

% ls -al /tmp/lindera-unidic-2.1.2
% (cd /tmp && zip -r lindera-unidic-2.1.2.zip lindera-unidic-2.1.2/)
% tar -czf /tmp/lindera-unidic-2.1.2.tar.gz -C /tmp lindera-unidic-2.1.2

Build CC-CEDICT (Chinese dictionary)

# Download and extract CC-CEDICT source files
% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
  --dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
  --metadata ./lindera-cc-cedict/metadata.json

% ls -al /tmp/lindera-cc-cedict-0.1.0-20200409
% (cd /tmp && zip -r lindera-cc-cedict-0.1.0-20200409.zip lindera-cc-cedict-0.1.0-20200409/)
% tar -czf /tmp/lindera-cc-cedict-0.1.0-20200409.tar.gz -C /tmp lindera-cc-cedict-0.1.0-20200409

Build ko-dic (Korean dictionary)

# Download and extract ko-dic source files
% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ko-dic-2.1.1-20180720 \
  --dest /tmp/lindera-ko-dic-2.1.1-20180720 \
  --metadata ./lindera-ko-dic/metadata.json

% ls -al /tmp/lindera-ko-dic-2.1.1-20180720
% (cd /tmp && zip -r lindera-ko-dic-2.1.1-20180720.zip lindera-ko-dic-2.1.1-20180720/)
% tar -czf /tmp/lindera-ko-dic-2.1.1-20180720.tar.gz -C /tmp lindera-ko-dic-2.1.1-20180720

Build user dictionary

Build IPADIC user dictionary (Japanese)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/ipadic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ipadic/metadata.json \
  --user

Build UniDic user dictionary (Japanese)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/unidic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-unidic/metadata.json \
  --user

Build CC-CEDICT user dictionary (Chinese)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/cc-cedict_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-cc-cedict/metadata.json \
  --user

Build ko-dic user dictionary (Korean)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/ko-dic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ko-dic/metadata.json \
  --user

Tokenize text

Perform morphological analysis (tokenization) on Japanese, Chinese, or Korean text using various dictionaries.

Basic tokenization usage

# Tokenize text using a dictionary directory
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /path/to/dictionary

# Tokenize text using embedded dictionary
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic

# Tokenize with different output format
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json

# Tokenize text from file
lindera tokenize \
  --dict /path/to/dictionary \
  --output wakati \
  input.txt

Tokenization parameters

  • --dict / -d: Dictionary path or URI (required)
    • File path: /path/to/dictionary
    • Embedded: embedded://ipadic, embedded://unidic, etc.
  • --output / -o: Output format (default: mecab)
    • mecab: MeCab-compatible format with part-of-speech info
    • wakati: Space-separated tokens only
    • json: Detailed JSON format with all token information
  • --user-dict / -u: User dictionary path (optional)
  • --mode / -m: Tokenization mode (default: normal)
    • normal: Standard tokenization
    • decompose: Decompose compound words
  • --char-filter / -c: Character filter configuration (JSON)
  • --token-filter / -t: Token filter configuration (JSON)
  • Input file: Optional file path (default: stdin)

Examples with external dictionaries

Tokenize with external IPADIC (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with external IPADIC Neologd (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with external UniDic (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-unidic-2.1.2
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Tokenize with external ko-dic (Korean dictionary)

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      VV+ETM,*,T,할,Inflect,VV,ETM,하/VV/*+ᆯ/ETM/*
수      NNG,*,F,수,*,*,*,*
있      VX,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       UNK
EOS

Tokenize with external CC-CEDICT (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      UNK
EOS

Examples with embedded dictionaries

Lindera can include dictionaries directly in the binary when built with specific feature flags. This allows tokenization without external dictionary files.

Tokenize with embedded IPADIC (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC dictionary in the binary, you must build with the --features=embedded-ipadic option.

Tokenize with embedded UniDic (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://unidic
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

NOTE: To include UniDic dictionary in the binary, you must build with the --features=embedded-unidic option.

Tokenize with embedded IPADIC NEologd (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic-neologd
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC NEologd dictionary in the binary, you must build with the --features=embedded-ipadic-neologd option.

Tokenize with embedded ko-dic (Korean dictionary)

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict embedded://ko-dic
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      VV+ETM,*,T,할,Inflect,VV,ETM,하/VV/*+ᆯ/ETM/*
수      NNG,*,F,수,*,*,*,*
있      VX,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       UNK
EOS

NOTE: To include ko-dic dictionary in the binary, you must build with the --features=embedded-ko-dic option.

Tokenize with embedded CC-CEDICT (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict embedded://cc-cedict
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      UNK
EOS

NOTE: To include CC-CEDICT dictionary in the binary, you must build with the --features=embedded-cc-cedict option.

User dictionary examples

Lindera supports user dictionaries to add custom words alongside system dictionaries. User dictionaries can be in CSV or binary format.

Use user dictionary (CSV format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict embedded://ipadic \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Use user dictionary (Binary format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920 \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Tokenization modes

Lindera provides two tokenization modes: normal and decompose.

Normal mode (default)

Tokenizes faithfully based on words registered in the dictionary:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode normal
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Decompose mode

Additionally decomposes compound nouns:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode decompose
関西    名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港    名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Output formats

Lindera provides three output formats: mecab, wakati and json.

MeCab format (default)

Outputs results in MeCab-compatible format with part-of-speech information:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output mecab
お待ち  名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し  動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て  助詞,接続助詞,*,*,*,*,て,テ,テ
おり  動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます  助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。  記号,句点,*,*,*,*,。,。,。
EOS

Wakati format

Outputs only the token text separated by spaces:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output wakati
お待ち し て おり ます 。

JSON format

Outputs detailed token information in JSON format:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json
[
  {
    "base_form": "お待ち",
    "byte_end": 9,
    "byte_start": 0,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "名詞",
    "part_of_speech_subcategory_1": "サ変接続",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "オマチ",
    "reading": "オマチ",
    "surface": "お待ち",
    "word_id": 14698
  },
  {
    "base_form": "する",
    "byte_end": 12,
    "byte_start": 9,
    "conjugation_form": "サ変・スル",
    "conjugation_type": "連用形",
    "part_of_speech": "動詞",
    "part_of_speech_subcategory_1": "自立",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "シ",
    "reading": "シ",
    "surface": "し",
    "word_id": 30763
  },
  {
    "base_form": "て",
    "byte_end": 15,
    "byte_start": 12,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "助詞",
    "part_of_speech_subcategory_1": "接続助詞",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "テ",
    "reading": "テ",
    "surface": "て",
    "word_id": 46603
  },
  {
    "base_form": "おる",
    "byte_end": 21,
    "byte_start": 15,
    "conjugation_form": "五段・ラ行",
    "conjugation_type": "連用形",
    "part_of_speech": "動詞",
    "part_of_speech_subcategory_1": "非自立",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "オリ",
    "reading": "オリ",
    "surface": "おり",
    "word_id": 14239
  },
  {
    "base_form": "ます",
    "byte_end": 27,
    "byte_start": 21,
    "conjugation_form": "特殊・マス",
    "conjugation_type": "基本形",
    "part_of_speech": "助動詞",
    "part_of_speech_subcategory_1": "*",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "マス",
    "reading": "マス",
    "surface": "ます",
    "word_id": 68733
  },
  {
    "base_form": "。",
    "byte_end": 30,
    "byte_start": 27,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "記号",
    "part_of_speech_subcategory_1": "句点",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "。",
    "reading": "。",
    "surface": "。",
    "word_id": 101
  }
]

Advanced tokenization

Lindera provides an analytical framework that combines character filters, tokenizers, and token filters for advanced text processing. Filters are configured using JSON.

Tokenize with character and token filters

% echo "すもももももももものうち" | lindera tokenize \
  --dict embedded://ipadic \
  --char-filter 'unicode_normalize:{"kind":"nfkc"}' \
  --token-filter 'japanese_keep_tags:{"tags":["名詞,一般"]}'
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
EOS

Dictionary Training (Experimental)

Train a new morphological analysis model from annotated corpus data. To use this feature, you must build with the train feature flag enabled. (The train feature flag is enabled by default.)

Training parameters

  • --seed / -s: Seed lexicon file (CSV format) to be weighted
  • --corpus / -c: Training corpus (annotated text)
  • --char-def / -C: Character definition file (char.def)
  • --unk-def / -u: Unknown word definition file (unk.def) to be weighted
  • --feature-def / -f: Feature definition file (feature.def)
  • --rewrite-def / -r: Rewrite rule definition file (rewrite.def)
  • --output / -o: Output model file
  • --lambda / -l: L1 regularization (0.0-1.0) (default: 0.01)
  • --max-iterations / -i: Maximum number of iterations for training (default: 100)
  • --max-threads / -t: Maximum number of threads (defaults to CPU core count, auto-adjusted based on dataset size)

Basic workflow

1. Prepare training files

Seed lexicon file (seed.csv):

The seed lexicon file contains initial dictionary entries used for training the CRF model. Each line represents a word entry with comma-separated fields. The specific field structure varies depending on the dictionary format:

  • Surface
  • Left context ID
  • Right context ID
  • Word cost
  • Part-of-speech tags (multiple fields)
  • Base form
  • Reading (katakana)
  • Pronunciation

Note: The exact field definitions differ between dictionary formats (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.

外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン

Training corpus (corpus.txt):

The training corpus file contains annotated text data used to train the CRF model. Each line consists of:

  • A surface form (word) followed by a tab character
  • Comma-separated morphological features (part-of-speech tags, base form, reading, pronunciation)
  • Sentences are separated by "EOS" (End Of Sentence) markers

Note: The morphological feature format varies depending on the dictionary (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.

外国	名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人	名詞,接尾,一般,*,*,*,人,ジン,ジン
参政	名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権	名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS

これ	連体詞,*,*,*,*,*,これ,コレ,コレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト	名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS

形態	名詞,一般,*,*,*,*,形態,ケイタイ,ケイタイ
素	名詞,接尾,一般,*,*,*,素,ソ,ソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う	動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
EOS

For detailed information about file formats and advanced features, see TRAINER_README.md.

2. Train model

lindera train \
  --seed ./resources/training/seed.csv \
  --corpus ./resources/training/corpus.txt \
  --unk-def ./resources/training/unk.def \
  --char-def ./resources/training/char.def \
  --feature-def ./resources/training/feature.def \
  --rewrite-def ./resources/training/rewrite.def \
  --output /tmp/lindera/training/model.dat \
  --lambda 0.01 \
  --max-iterations 100

3. Training results

The trained model will contain:

  • Existing words: All seed dictionary records with newly learned weights
  • New words: Words from the corpus not in the seed dictionary, added with appropriate weights

Export trained model to dictionary

Export a trained model file to Lindera dictionary format files. This feature requires building with the train feature flag enabled.

Basic export usage

# Export trained model to dictionary files
lindera export \
  --model /tmp/lindera/training/model.dat \
  --metadata ./resources/training/metadata.json \
  --output /tmp/lindera/training/dictionary

Export parameters

  • --model / -m: Path to the trained model file (.dat format)
  • --output / -o: Directory to output the dictionary files
  • --metadata: Optional metadata.json file to update with trained model information

Output files

The export command creates the following dictionary files in the output directory:

  • lex.csv: Lexicon file with learned weights
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character type definitions
  • metadata.json: Updated metadata file (if --metadata option is provided)
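
For example, after a successful export the output directory should contain the files listed above; the listing below is illustrative, and metadata.json appears only when the --metadata option is provided:

% ls /tmp/lindera/training/dictionary
char.def  lex.csv  matrix.def  metadata.json  unk.def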

Complete workflow example

1. Train model

lindera train \
  --seed ./resources/training/seed.csv \
  --corpus ./resources/training/corpus.txt \
  --unk-def ./resources/training/unk.def \
  --char-def ./resources/training/char.def \
  --feature-def ./resources/training/feature.def \
  --rewrite-def ./resources/training/rewrite.def \
  --output /tmp/lindera/training/model.dat \
  --lambda 0.01 \
  --max-iterations 100

2. Export to dictionary format

lindera export \
  --model /tmp/lindera/training/model.dat \
  --metadata ./resources/training/metadata.json \
  --output /tmp/lindera/training/dictionary

3. Build dictionary

lindera build \
  --src /tmp/lindera/training/dictionary \
  --dest /tmp/lindera/training/compiled_dictionary \
  --metadata /tmp/lindera/training/dictionary/metadata.json

4. Use trained dictionary

echo "これは外国人参政権です。" | lindera tokenize \
  -d /tmp/lindera/training/compiled_dictionary

Metadata update feature

When the --metadata option is provided, the export command will:

  1. Read the base metadata.json file to preserve existing configuration

  2. Update specific fields with values from the trained model:

    • default_left_context_id: Maximum left context ID from trained model
    • default_right_context_id: Maximum right context ID from trained model
    • default_word_cost: Calculated from feature weight median
    • model_info: Training statistics including:
      • feature_count: Number of features in the model
      • label_count: Number of labels in the model
      • max_left_context_id: Maximum left context ID
      • max_right_context_id: Maximum right context ID
      • connection_matrix_size: Size of connection cost matrix
      • training_iterations: Number of training iterations performed
      • regularization: L1 regularization parameter used
      • version: Model version
      • updated_at: Timestamp of when the model was exported
  3. Preserve existing settings such as:

    • Dictionary name
    • Character encoding settings
    • Schema definitions
    • Other user-defined configuration

This allows you to maintain your base dictionary configuration while incorporating the optimized parameters learned during training.
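
As an illustration only (all values below are hypothetical), an updated metadata.json might contain the trained fields and a model_info block like this:

{
  "default_left_context_id": 1316,
  "default_right_context_id": 1316,
  "default_word_cost": 3000,
  "model_info": {
    "feature_count": 13,
    "label_count": 19,
    "max_left_context_id": 1316,
    "max_right_context_id": 1316,
    "connection_matrix_size": 1734489,
    "training_iterations": 100,
    "regularization": 0.01,
    "version": "1.0.0",
    "updated_at": "2025-01-01T00:00:00Z"
  }
}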

API reference

The API reference is available. Please see the following URL:

API Reference

The API reference is available. Please see the following URL:

Contributing

(Content for Contributing goes here)