Lindera

License: MIT

A morphological analysis library in Rust. This project is forked from kuromoji-rs.

Lindera aims to be an easy-to-install library that provides concise APIs for various Rust applications.

Installation

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }

Environment Variables

LINDERA_CACHE

The LINDERA_CACHE environment variable specifies a directory for caching dictionary source files. This enables:

  • Offline builds: Once downloaded, dictionary source files are preserved for future builds
  • Faster builds: Subsequent builds skip downloading if valid cached files exist
  • Reproducible builds: Ensures consistent dictionary versions across builds

Usage:

export LINDERA_CACHE=/path/to/cache
cargo build --features=ipadic

When set, dictionary source files are stored in $LINDERA_CACHE/<version>/ where <version> is the lindera-dictionary crate version. The cache validates files using MD5 checksums - invalid files are automatically re-downloaded.

LINDERA_CONFIG_PATH

The LINDERA_CONFIG_PATH environment variable specifies the path to a YAML configuration file for the tokenizer. This allows you to configure tokenizer behavior without modifying Rust code.

export LINDERA_CONFIG_PATH=./resources/config/lindera.yml

See the Configuration section for details on the configuration format.

DOCS_RS

The DOCS_RS environment variable is automatically set by docs.rs when building documentation. When this variable is detected, Lindera creates dummy dictionary files instead of downloading actual dictionary data, allowing documentation to be built without network access or large file downloads.

This is primarily used internally by docs.rs and typically doesn't need to be set by users.

LINDERA_WORKDIR

The LINDERA_WORKDIR environment variable is automatically set during the build process by the lindera-dictionary crate. It points to the directory containing the built dictionary data files and is used internally by dictionary crates to locate their data files.

This variable is set automatically and should not be modified by users.

Quick Start

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=embedded-ipadic --example=tokenize

You can see the result as follows:

text:   関西国際空港限定トートバッグ
token:  関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token:  限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token:  トートバッグ    UNK

Dictionaries

Lindera supports various dictionaries. This section describes the format of each dictionary and the format for user dictionaries.

  • IPADIC - The most common dictionary for Japanese.
  • IPADIC NEologd - IPADIC with neologisms (new words).
  • UniDic - A dictionary with uniform word unit definitions.
  • ko-dic - A dictionary for Korean.
  • CC-CEDICT - A dictionary for Chinese.

Lindera IPADIC

Dictionary version

This repository contains mecab-ipadic.

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 品詞 | Part-of-speech |
2 | 読み | Reading |
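
For example, a simple-format entry consists of a surface form, a part-of-speech, and a reading, as in the user dictionary example later in this README:

東京スカイツリー,カスタム名詞,トウキョウスカイツリー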

Detailed version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |
13 | - | - | After 13, it can be freely expanded.
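
A hypothetical detailed-format entry might look like the following; the context IDs and cost shown here are illustrative only, and real values depend on the dictionary:

東京スカイツリー,1288,1288,-1000,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー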

API reference

The API reference is available. Please see the following URL:

Lindera IPADIC NEologd

Dictionary version

This repository contains mecab-ipadic-neologd.

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 品詞 | Part-of-speech |
2 | 読み | Reading |

Detailed version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞 | Part-of-speech |
5 | 品詞細分類1 | Part-of-speech subcategory 1 |
6 | 品詞細分類2 | Part-of-speech subcategory 2 |
7 | 品詞細分類3 | Part-of-speech subcategory 3 |
8 | 活用形 | Conjugation form |
9 | 活用型 | Conjugation type |
10 | 原形 | Base form |
11 | 読み | Reading |
12 | 発音 | Pronunciation |
13 | - | - | After 13, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Lindera UniDic

Dictionary version

This repository contains unidic-mecab.

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞大分類 | Part-of-speech |
5 | 品詞中分類 | Part-of-speech subcategory 1 |
6 | 品詞小分類 | Part-of-speech subcategory 2 |
7 | 品詞細分類 | Part-of-speech subcategory 3 |
8 | 活用型 | Conjugation type |
9 | 活用形 | Conjugation form |
10 | 語彙素読み | Reading |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme |
12 | 書字形出現形 | Orthographic surface form |
13 | 発音形出現形 | Phonological surface form |
14 | 書字形基本形 | Orthographic base form |
15 | 発音形基本形 | Phonological base form |
16 | 語種 | Word type |
17 | 語頭変化型 | Initial mutation type |
18 | 語頭変化形 | Initial mutation form |
19 | 語末変化型 | Final mutation type |
20 | 語末変化形 | Final mutation form |

User dictionary format (CSV)

Simple version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 品詞大分類 | Part-of-speech |
2 | 語彙素読み | Reading |
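
As with IPADIC, an illustrative simple-format entry is a surface form, a part-of-speech, and a reading; the values below are hypothetical:

東京スカイツリー,カスタム名詞,トウキョウスカイツリー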

Detailed version

Index | Name (Japanese) | Name (English) | Notes
------|-----------------|----------------|------
0 | 表層形 | Surface |
1 | 左文脈ID | Left context ID |
2 | 右文脈ID | Right context ID |
3 | コスト | Cost |
4 | 品詞大分類 | Part-of-speech |
5 | 品詞中分類 | Part-of-speech subcategory 1 |
6 | 品詞小分類 | Part-of-speech subcategory 2 |
7 | 品詞細分類 | Part-of-speech subcategory 3 |
8 | 活用型 | Conjugation type |
9 | 活用形 | Conjugation form |
10 | 語彙素読み | Reading |
11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme |
12 | 書字形出現形 | Orthographic surface form |
13 | 発音形出現形 | Phonological surface form |
14 | 書字形基本形 | Orthographic base form |
15 | 発音形基本形 | Phonological base form |
16 | 語種 | Word type |
17 | 語頭変化型 | Initial mutation type |
18 | 語頭変化形 | Initial mutation form |
19 | 語末変化型 | Final mutation type |
20 | 語末変化形 | Final mutation form |
21 | - | - | After 21, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Lindera ko-dic

Dictionary version

This repository contains mecab-ko-dic.

Dictionary format

Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in this Google Spreadsheet, linked from mecab-ko-dic's repository readme.

Note how ko-dic has one less feature column than NAIST JDIC, and has an altogether different set of information (e.g. doesn't provide the "original form" of the word).

The tags are a slight modification of those specified by 세종 (Sejong). The mappings from Sejong to mecab-ko-dic's tag names are given in the 태그 v2.0 tab of the spreadsheet linked above.

The dictionary format is specified fully (in Korean) in tab 사전 형식 v2.0 of the spreadsheet. Any blank values default to *.

Index | Name (Korean) | Name (English) | Notes
------|---------------|----------------|------
0 | 표면 | Surface |
1 | 왼쪽 문맥 ID | Left context ID |
2 | 오른쪽 문맥 ID | Right context ID |
3 | 비용 | Cost |
4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab on the spreadsheet
5 | 의미 부류 | Meaning | (too few examples for me to be sure)
6 | 종성 유무 | Presence or absence | T for true; F for false; else *
7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words
8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석)
9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV
10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP
11 | 표현 | Expression | Field describing how inflected forms (활용), compound nouns (복합명사), and pre-analyzed entries (기분석) are composed

User dictionary format (CSV)

Simple version

Index | Name (Korean) | Name (English) | Notes
------|---------------|----------------|------
0 | 표면 | Surface |
1 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab on the spreadsheet
2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words
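
A hypothetical simple-format entry (surface, part-of-speech tag, reading); the word and tag below are illustrative only:

하네다공항,NNG,하네다공항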

Detailed version

Index | Name (Korean) | Name (English) | Notes
------|---------------|----------------|------
0 | 표면 | Surface |
1 | 왼쪽 문맥 ID | Left context ID |
2 | 오른쪽 문맥 ID | Right context ID |
3 | 비용 | Cost |
4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab on the spreadsheet
5 | 의미 부류 | Meaning | (too few examples for me to be sure)
6 | 종성 유무 | Presence or absence | T for true; F for false; else *
7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese character words
8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석)
9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV
10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP
11 | 표현 | Expression | Field describing how inflected forms (활용), compound nouns (복합명사), and pre-analyzed entries (기분석) are composed
12 | - | - | After 12, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Lindera CC-CEDICT

Dictionary version

This repository contains CC-CEDICT-MeCab.

Dictionary format

Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.

Index | Name (Chinese) | Name (English) | Notes
------|----------------|----------------|------
0 | 表面形式 | Surface |
1 | 左语境ID | Left context ID |
2 | 右语境ID | Right context ID |
3 | 成本 | Cost |
4 | 词类 | Part-of-speech |
5 | 词类1 | Part-of-speech subcategory 1 |
6 | 词类2 | Part-of-speech subcategory 2 |
7 | 词类3 | Part-of-speech subcategory 3 |
8 | 併音 | Pinyin |
9 | 繁体字 | Traditional |
10 | 簡体字 | Simplified |
11 | 定义 | Definition |

User dictionary format (CSV)

Simple version

Index | Name (Chinese) | Name (English) | Notes
------|----------------|----------------|------
0 | 表面形式 | Surface |
1 | 词类 | Part-of-speech |
2 | 併音 | Pinyin |
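
A hypothetical simple-format entry (surface, part-of-speech, pinyin); the word and pinyin below are illustrative only:

东京晴空塔,名词,dong1 jing1 qing2 kong1 ta3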

Detailed version

Index | Name (Chinese) | Name (English) | Notes
------|----------------|----------------|------
0 | 表面形式 | Surface |
1 | 左语境ID | Left context ID |
2 | 右语境ID | Right context ID |
3 | 成本 | Cost |
4 | 词类 | Part-of-speech |
5 | 词类1 | Part-of-speech subcategory 1 |
6 | 词类2 | Part-of-speech subcategory 2 |
7 | 词类3 | Part-of-speech subcategory 3 |
8 | 併音 | Pinyin |
9 | 繁体字 | Traditional |
10 | 簡体字 | Simplified |
11 | 定义 | Definition |
12 | - | - | After 12, it can be freely expanded.

API reference

The API reference is available. Please see the following URL:

Configuration

Lindera can read configuration files in YAML format. Specify the path to a file like the following in the LINDERA_CONFIG_PATH environment variable. This lets you configure the tokenizer's behavior without writing Rust code.

segmenter:
  mode: "normal"
  dictionary:
    kind: "ipadic"
  user_dictionary:
    path: "./resources/user_dict/ipadic_simple.csv"
    kind: "ipadic"

character_filters:
  - kind: "unicode_normalize"
    args:
      kind: "nfkc"
  - kind: "japanese_iteration_mark"
    args:
      normalize_kanji: true
      normalize_kana: true
  - kind: mapping
    args:
       mapping:
         リンデラ: Lindera

token_filters:
  - kind: "japanese_compound_word"
    args:
      tags:
        - "名詞,数"
        - "名詞,接尾,助数詞"
      new_tag: "名詞,数"
  - kind: "japanese_number"
    args:
      tags:
        - "名詞,数"
  - kind: "japanese_stop_tags"
    args:
      tags:
        - "接続詞"
        - "助詞"
        - "助詞,格助詞"
        - "助詞,格助詞,一般"
        - "助詞,格助詞,引用"
        - "助詞,格助詞,連語"
        - "助詞,係助詞"
        - "助詞,副助詞"
        - "助詞,間投助詞"
        - "助詞,並立助詞"
        - "助詞,終助詞"
        - "助詞,副助詞/並立助詞/終助詞"
        - "助詞,連体化"
        - "助詞,副詞化"
        - "助詞,特殊"
        - "助動詞"
        - "記号"
        - "記号,一般"
        - "記号,読点"
        - "記号,句点"
        - "記号,空白"
        - "記号,括弧閉"
        - "その他,間投"
        - "フィラー"
        - "非言語音"
  - kind: "japanese_katakana_stem"
    args:
      min: 3
  - kind: "remove_diacritical_mark"
    args:
      japanese: false

% export LINDERA_CONFIG_PATH=./resources/config/lindera.yml

The following example builds a tokenizer from the configuration file:

use std::path::PathBuf;

use lindera::tokenizer::TokenizerBuilder;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Load tokenizer configuration from file
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("config")
        .join("lindera.yml");

    let builder = TokenizerBuilder::from_file(&path)?;

    let tokenizer = builder.build()?;

    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {text}");

    let tokens = tokenizer.tokenize(&text)?;

    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

Advanced Usage

Tokenization with user dictionary

You can provide user dictionary entries along with the default system dictionary. The user dictionary should be a CSV file with the following format.

<surface>,<part_of_speech>,<reading>

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }

For example:

% cat ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ

With a user dictionary, the Tokenizer is created as follows:

use std::fs::File;
use std::path::PathBuf;

use lindera::dictionary::{Metadata, load_dictionary, load_user_dictionary};
use lindera::error::LinderaErrorKind;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let user_dict_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("user_dict")
        .join("ipadic_simple_userdic.csv");

    let metadata_file = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../lindera-ipadic")
        .join("metadata.json");
    let metadata: Metadata = serde_json::from_reader(
        File::open(metadata_file)
            .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
            .unwrap(),
    )
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
    .unwrap();

    let dictionary = load_dictionary("embedded://ipadic")?;
    let user_dictionary = load_user_dictionary(user_dict_path.to_str().unwrap(), &metadata)?;
    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        Some(user_dictionary), // Using the loaded user dictionary
    );

    // Create a tokenizer.
    let tokenizer = Tokenizer::new(segmenter);

    // Tokenize a text.
    let text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";
    let mut tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

The above example can be run by cargo run --example:

% cargo run --features=embedded-ipadic --example=tokenize_with_user_dict
text:   東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です
token:  東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
token:  の      助詞,連体化,*,*,*,*,の,ノ,ノ
token:  最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
token:  は      助詞,係助詞,*,*,*,*,は,ハ,ワ
token:  とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
token:  です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス

Tokenize with filters

Put the following in Cargo.toml:

[dependencies]
lindera = { version = "1.2.0", features = ["embedded-ipadic"] }

This example covers the basic usage of the Lindera analysis framework.

It will:

  • Apply character filter for Unicode normalization (NFKC)
  • Tokenize the input text with IPADIC
  • Apply token filters for removing stop tags (part-of-speech) and a Japanese katakana stem filter

use lindera::character_filter::BoxCharacterFilter;
use lindera::character_filter::japanese_iteration_mark::JapaneseIterationMarkCharacterFilter;
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::token_filter::BoxTokenFilter;
use lindera::token_filter::japanese_compound_word::JapaneseCompoundWordTokenFilter;
use lindera::token_filter::japanese_number::JapaneseNumberTokenFilter;
use lindera::token_filter::japanese_stop_tags::JapaneseStopTagsTokenFilter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        None, // No user dictionary for this example
    );

    let unicode_normalize_char_filter =
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC);

    let japanese_iteration_mark_char_filter =
        JapaneseIterationMarkCharacterFilter::new(true, true);

    let japanese_compound_word_token_filter = JapaneseCompoundWordTokenFilter::new(
        vec!["名詞,数".to_string(), "名詞,接尾,助数詞".to_string()]
            .into_iter()
            .collect(),
        Some("複合語".to_string()),
    );

    let japanese_number_token_filter =
        JapaneseNumberTokenFilter::new(Some(vec!["名詞,数".to_string()].into_iter().collect()));

    let japanese_stop_tags_token_filter = JapaneseStopTagsTokenFilter::new(
        vec![
            "接続詞".to_string(),
            "助詞".to_string(),
            "助詞,格助詞".to_string(),
            "助詞,格助詞,一般".to_string(),
            "助詞,格助詞,引用".to_string(),
            "助詞,格助詞,連語".to_string(),
            "助詞,係助詞".to_string(),
            "助詞,副助詞".to_string(),
            "助詞,間投助詞".to_string(),
            "助詞,並立助詞".to_string(),
            "助詞,終助詞".to_string(),
            "助詞,副助詞/並立助詞/終助詞".to_string(),
            "助詞,連体化".to_string(),
            "助詞,副詞化".to_string(),
            "助詞,特殊".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
            "記号,一般".to_string(),
            "記号,読点".to_string(),
            "記号,句点".to_string(),
            "記号,空白".to_string(),
            "記号,括弧閉".to_string(),
            "その他,間投".to_string(),
            "フィラー".to_string(),
            "非言語音".to_string(),
        ]
        .into_iter()
        .collect(),
    );

    // Create a tokenizer.
    let mut tokenizer = Tokenizer::new(segmenter);

    tokenizer
        .append_character_filter(BoxCharacterFilter::from(unicode_normalize_char_filter))
        .append_character_filter(BoxCharacterFilter::from(
            japanese_iteration_mark_char_filter,
        ))
        .append_token_filter(BoxTokenFilter::from(japanese_compound_word_token_filter))
        .append_token_filter(BoxTokenFilter::from(japanese_number_token_filter))
        .append_token_filter(BoxTokenFilter::from(japanese_stop_tags_token_filter));

    // Tokenize a text.
    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。";
    let tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text: {}", text);
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --features=embedded-ipadic --example=tokenize_with_filters

You can see the result as follows:

text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: "Lindera", start: 0, end: 21, details: Some(["UNK"])
token: "形態素", start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: "解析", start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: "エンジン", start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: "ユーザー", start: 63, end: 75, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: "辞書", start: 75, end: 81, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: "利用", start: 84, end: 90, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: "可能", start: 90, end: 96, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])

Dictionary Training (Experimental)

Lindera provides CRF-based dictionary training functionality for creating custom morphological analysis models.

Overview

Lindera Trainer is a Conditional Random Field (CRF) based morphological analyzer training system with the following advanced features:

  • CRF-based statistical learning: Efficient implementation using the rucrf crate
  • L1 regularization: Prevents overfitting
  • Multi-threaded training: Parallel processing for faster training
  • Comprehensive Unicode support: Full CJK extension support
  • Advanced unknown word handling: Intelligent mixed character type classification
  • Multi-stage weight optimization: Advanced normalization system for trained weights
  • Lindera dictionary compatibility: Full compatibility with existing dictionary formats

CLI Usage

For detailed CLI command usage, see lindera-cli/README.md.

Required File Format Specifications

1. Vocabulary Dictionary (seed.csv)

Role: Base vocabulary dictionary
Format: MeCab format CSV

外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
参政,0,0,0,名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
  • Purpose: Define basic words and their part-of-speech information for training
  • Structure: surface,left_id,right_id,cost,pos,pos_detail1,pos_detail2,pos_detail3,inflection_type,inflection_form,base_form,reading,pronunciation

2. Unknown Word Definition (unk.def)

Role: Unknown word processing definition
Format: Unknown word parameters by character type

DEFAULT,0,0,0,名詞,一般,*,*,*,*,*,*,*
HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KATAKANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KANJI,0,0,0,名詞,一般,*,*,*,*,*,*,*
ALPHA,0,0,0,名詞,固有名詞,一般,*,*,*,*,*,*
NUMERIC,0,0,0,名詞,数,*,*,*,*,*,*,*
  • Purpose: Define processing methods for out-of-vocabulary words by character type
  • Note: These labels are for internal processing and are not output in the final dictionary file

3. Training Corpus (corpus.txt)

Role: Training data (annotated corpus)
Format: Tab-separated tokenized text

外国	名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人	名詞,接尾,一般,*,*,*,人,ジン,ジン
参政	名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権	名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS

これ	連体詞,*,*,*,*,*,これ,コレ,コレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト	名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
EOS
  • Purpose: Sentences and their correct analysis results for training
  • Format: Each line is surface\tpos_info, sentences end with EOS
  • Important: Training quality heavily depends on the quantity and quality of this corpus

4. Character Type Definition (char.def)

Role: Character type definition
Format: Character categories and character code ranges

# Character category definition (category_name compatibility_flag continuity_flag length)
DEFAULT 0 1 0
HIRAGANA 1 1 0
KATAKANA 1 1 0
KANJI 0 0 2
ALPHA 1 1 0
NUMERIC 1 1 0

# Character range mapping
0x3041..0x3096 HIRAGANA  # Hiragana
0x30A1..0x30F6 KATAKANA  # Katakana
0x4E00..0x9FAF KANJI     # Kanji
0x0030..0x0039 NUMERIC   # Numbers
0x0041..0x005A ALPHA     # Uppercase letters
0x0061..0x007A ALPHA     # Lowercase letters
  • Purpose: Define which characters belong to which category
  • Parameters: Settings for compatibility, continuity, default length, etc.

5. Feature Template (feature.def)

Role: Feature template definition
Format: Feature extraction patterns

# Unigram features (word-level features)
UNIGRAM:%F[0]         # POS (feature element 0)
UNIGRAM:%F[1]         # POS detail 1
UNIGRAM:%F[6]         # Base form
UNIGRAM:%F[7]         # Reading (Katakana)

# Left context features
LEFT:%L[0]            # POS of left word
LEFT:%L[1]            # POS detail of left word

# Right context features
RIGHT:%R[0]           # POS of right word
RIGHT:%R[1]           # POS detail of right word

# Bigram features (combination features)
UNIGRAM:%F[0]/%F[1]   # POS + POS detail
UNIGRAM:%F[0]/%F[6]   # POS + base form
  • Purpose: Define which information to extract features from
  • Templates: %F[n] (feature), %L[n] (left context), %R[n] (right context)

6. Feature Normalization Rules (rewrite.def)

Role: Feature normalization rules
Format: Replacement rules (tab-separated)

# Normalize numeric expressions
数	NUM
*	UNK

# Normalize proper nouns
名詞,固有名詞	名詞,一般

# Simplify auxiliary verbs
助動詞,*,*,*,特殊・デス	助動詞
助動詞,*,*,*,特殊・ダ	助動詞
  • Purpose: Normalize features to improve training efficiency
  • Format: original_pattern\treplacement_pattern
  • Effect: Generalize rare features to reduce sparsity problems

7. Output Model Format

Role: Output model file
Format: Binary (rkyv) format is standard, JSON format also supported

The model contains the following information:

{
  "feature_weights": [0.0, 0.084, 0.091, ...],
  "labels": ["外国", "人", "参政", "権", ...],
  "pos_info": ["名詞,一般,*,*,*,*,*,*,*", "名詞,接尾,一般,*,*,*,*,*,*", ...],
  "feature_templates": ["UNIGRAM:%F[0]", ...],
  "metadata": {
    "version": "1.0.0",
    "regularization": 0.01,
    "iterations": 100,
    "feature_count": 13,
    "label_count": 19
  }
}
  • Purpose: Save training results for later dictionary generation

Training Parameter Specifications

  • Regularization coefficient (lambda): Controls L1 regularization strength (default: 0.01)
  • Maximum iterations (iter): Maximum number of training iterations (default: 100)
  • Parallel threads (threads): Number of parallel processing threads (default: 1)

API Usage Example

#![allow(unused)]
fn main() {
use std::fs::File;
use lindera_dictionary::trainer::{Corpus, Trainer, TrainerConfig};

// Load configuration from files
let seed_file = File::open("resources/training/seed.csv")?;
let char_file = File::open("resources/training/char.def")?;
let unk_file = File::open("resources/training/unk.def")?;
let feature_file = File::open("resources/training/feature.def")?;
let rewrite_file = File::open("resources/training/rewrite.def")?;

let config = TrainerConfig::from_readers(
    seed_file,
    char_file,
    unk_file,
    feature_file,
    rewrite_file
)?;

// Initialize and configure trainer
let trainer = Trainer::new(config)?
    .regularization_cost(0.01)
    .max_iter(100)
    .num_threads(4);

// Load corpus
let corpus_file = File::open("resources/training/corpus.txt")?;
let corpus = Corpus::from_reader(corpus_file)?;

// Execute training
let model = trainer.train(corpus)?;

// Save model (binary format)
let mut output = File::create("trained_model.dat")?;
model.write_model(&mut output)?;

// Output in Lindera dictionary format
let mut lex_out = File::create("output_lex.csv")?;
let mut conn_out = File::create("output_conn.dat")?;
let mut unk_out = File::create("output_unk.def")?;
let mut user_out = File::create("output_user.csv")?;
model.write_dictionary(&mut lex_out, &mut conn_out, &mut unk_out, &mut user_out)?;

Ok::<(), Box<dyn std::error::Error>>(())
}

Implementation Status

Completed Features

Core Features
  • Core architecture: Complete trainer module structure
  • CRF training: Conditional Random Field training via rucrf integration
  • CLI integration: lindera train command with full parameter support
  • Corpus processing: Full MeCab format corpus support
  • Dictionary integration: Dictionary construction from seed.csv, char.def, unk.def
  • Feature extraction: Extraction and transformation of unigram/bigram features
  • Model saving: Output trained models in JSON/rkyv format
  • Dictionary output: Generate Lindera format dictionary files
Advanced Unknown Word Processing
  • Comprehensive Unicode support: Full support for CJK extensions, Katakana extensions, Hiragana extensions
  • Category-specific POS assignment: Automatic assignment of appropriate POS information by character type
    • DEFAULT: 名詞,一般 (unknown character type)
    • HIRAGANA/KATAKANA/KANJI: 名詞,一般 (Japanese characters)
    • ALPHA: 名詞,固有名詞 (alphabetic characters)
    • NUMERIC: 名詞,数 (numeric characters)
  • Surface form analysis: Feature generation based on character patterns, length, and position information
  • Dynamic cost calculation: Adaptive cost considering character type and context
Refactored Implementation (as of September 2024)
  • Constant management: Magic number elimination via cost_constants module
  • Method splitting: Improved readability by splitting large methods
    • train() → build_lattices_from_corpus(), extract_labels(), train_crf_model(), create_final_model()
  • Unified cost calculation: Improved maintainability by unifying duplicate code
    • calculate_known_word_cost(): Known word cost calculation
    • calculate_unknown_word_cost(): Unknown word cost calculation
  • Organized debug output: Structured logging via log_debug! macro
  • Enhanced error handling: Comprehensive error handling and documentation

Architecture

lindera-dictionary/src/trainer.rs  # Main Trainer struct
lindera-dictionary/src/trainer/
├── config.rs           # Configuration management
├── corpus.rs           # Corpus processing
├── feature_extractor.rs # Feature extraction
├── feature_rewriter.rs  # Feature rewriting
└── model.rs            # Trained model

Advanced Unknown Word Processing System

Comprehensive Unicode Character Type Detection

The latest implementation significantly extends the basic Unicode ranges and fully supports additional character sets (see the category-specific POS assignment details in the Advanced Unknown Word Processing section above).

Feature Weight Optimization

Cost Calculation Constants
#![allow(unused)]
fn main() {
mod cost_constants {
    // Known word cost calculation
    pub const KNOWN_WORD_BASE_COST: i16 = 1000;
    pub const KNOWN_WORD_COST_MULTIPLIER: f64 = 500.0;
    pub const KNOWN_WORD_COST_MIN: i16 = 500;
    pub const KNOWN_WORD_COST_MAX: i16 = 3000;
    pub const KNOWN_WORD_DEFAULT_COST: i16 = 1500;

    // Unknown word cost calculation
    pub const UNK_BASE_COST: i32 = 3000;
    pub const UNK_COST_MULTIPLIER: f64 = 500.0;
    pub const UNK_COST_MIN: i32 = 2500;
    pub const UNK_COST_MAX: i32 = 4500;

    // Category-specific adjustments
    pub const UNK_DEFAULT_ADJUSTMENT: i32 = 0;     // DEFAULT
    pub const UNK_HIRAGANA_ADJUSTMENT: i32 = 200;  // HIRAGANA - minor penalty
    pub const UNK_KATAKANA_ADJUSTMENT: i32 = 0;    // KATAKANA - medium
    pub const UNK_KANJI_ADJUSTMENT: i32 = 400;     // KANJI - high penalty
    pub const UNK_ALPHA_ADJUSTMENT: i32 = 100;     // ALPHA - mild penalty
    pub const UNK_NUMERIC_ADJUSTMENT: i32 = -100;  // NUMERIC - bonus (regular)
}
}
Unified Cost Calculation
#![allow(unused)]
fn main() {
// Known word cost calculation
fn calculate_known_word_cost(&self, feature_weight: f64) -> i16 {
    let scaled_weight = (feature_weight * cost_constants::KNOWN_WORD_COST_MULTIPLIER) as i32;
    let final_cost = cost_constants::KNOWN_WORD_BASE_COST as i32 + scaled_weight;
    final_cost.clamp(
        cost_constants::KNOWN_WORD_COST_MIN as i32,
        cost_constants::KNOWN_WORD_COST_MAX as i32
    ) as i16
}

// Unknown word cost calculation
fn calculate_unknown_word_cost(&self, feature_weight: f64, category: usize) -> i32 {
    let base_cost = cost_constants::UNK_BASE_COST;
    let category_adjustment = match category {
        0 => cost_constants::UNK_DEFAULT_ADJUSTMENT,
        1 => cost_constants::UNK_HIRAGANA_ADJUSTMENT,
        2 => cost_constants::UNK_KATAKANA_ADJUSTMENT,
        3 => cost_constants::UNK_KANJI_ADJUSTMENT,
        4 => cost_constants::UNK_ALPHA_ADJUSTMENT,
        5 => cost_constants::UNK_NUMERIC_ADJUSTMENT,
        _ => 0,
    };
    let scaled_weight = (feature_weight * cost_constants::UNK_COST_MULTIPLIER) as i32;
    let final_cost = base_cost + category_adjustment + scaled_weight;
    final_cost.clamp(
        cost_constants::UNK_COST_MIN,
        cost_constants::UNK_COST_MAX
    )
}
}

Performance Optimization

Memory Efficiency

  • Lazy evaluation: Create merged_model only when needed
  • Unused feature removal: Automatic deletion of unnecessary features after training
  • Efficient binary format: Fast serialization using rkyv

Parallel Processing Support

#![allow(unused)]
fn main() {
let trainer = rucrf::Trainer::new()
    .regularization(rucrf::Regularization::L1, regularization_cost)?
    .max_iter(max_iter)?
    .n_threads(self.num_threads)?;  // Multi-threaded training
}

Practical Training Data Requirements

Recommendations for generating effective dictionaries for real applications:

  1. Corpus Size

    • Minimum: 100 sentences (for basic operation verification)
    • Recommended: 1,000+ sentences (practical level)
    • Ideal: 10,000+ sentences (commercial quality)
  2. Vocabulary Diversity

    • Balanced distribution of different parts of speech
    • Coverage of inflections and suffixes
    • Appropriate inclusion of technical terms and proper nouns
  3. Quality Control

    • Manual verification of morphological analysis results
    • Consistent application of analysis criteria
    • Maintain error rate below 5%

Lindera CLI

A morphological analysis command-line interface for Lindera.

Install

You can install the binary via cargo as follows:

% cargo install lindera-cli

Alternatively, you can download a binary from the following release page:

Build

Build with IPADIC (Japanese dictionary)

The "ipadic" feature flag allows Lindera to include IPADIC.

% cargo build --release --features=embedded-ipadic

Build with UniDic (Japanese dictionary)

The "unidic" feature flag allows Lindera to include UniDic.

% cargo build --release --features=embedded-unidic

Build with ko-dic (Korean dictionary)

The "ko-dic" feature flag allows Lindera to include ko-dic.

% cargo build --release --features=embedded-ko-dic

Build with CC-CEDICT (Chinese dictionary)

The "cc-cedict" feature flag allows Lindera to include CC-CEDICT.

% cargo build --release --features=embedded-cc-cedict

Build without dictionaries

To reduce Lindera's binary size, omit the feature flag. This results in a binary containing only the tokenizer and trainer, as it no longer includes the dictionary.

% cargo build --release

Build with all features

% cargo build --release --all-features

Build dictionary

Build (compile) a morphological analysis dictionary from source CSV files for use with Lindera.

Basic build usage

# Build a system dictionary
lindera build \
  --src /path/to/dictionary/csv \
  --dest /path/to/output/dictionary \
  --metadata ./lindera-ipadic/metadata.json

# Build a user dictionary
lindera build \
  --src ./user_dict.csv \
  --dest ./user_dictionary \
  --metadata ./lindera-ipadic/metadata.json \
  --user

Build parameters

  • --src / -s: Source directory containing dictionary CSV files (or single CSV file for user dictionary)
  • --dest / -d: Destination directory for compiled dictionary output
  • --metadata / -m: Metadata configuration file (metadata.json) that defines dictionary structure
  • --user / -u: Build user dictionary instead of system dictionary (optional flag)

Dictionary types

System dictionary

A full morphological analysis dictionary containing:

  • Lexicon entries (word definitions)
  • Connection cost matrix
  • Unknown word handling rules
  • Character type definitions

User dictionary

A supplementary dictionary for custom words that works alongside a system dictionary.

Examples

Build IPADIC (Japanese dictionary)

# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ipadic-2.7.0-20250920 \
  --dest /tmp/lindera-ipadic-2.7.0-20250920 \
  --metadata ./lindera-ipadic/metadata.json

% ls -al /tmp/lindera-ipadic-2.7.0-20250920
% (cd /tmp && zip -r lindera-ipadic-2.7.0-20250920.zip lindera-ipadic-2.7.0-20250920/)
% tar -czf /tmp/lindera-ipadic-2.7.0-20250920.tar.gz -C /tmp lindera-ipadic-2.7.0-20250920

Build IPADIC NEologd (Japanese dictionary)

# Download and extract IPADIC NEologd source files
% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
  --dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
  --metadata ./lindera-ipadic-neologd/metadata.json

% ls -al /tmp/lindera-ipadic-neologd-0.0.7-20200820
% (cd /tmp && zip -r lindera-ipadic-neologd-0.0.7-20200820.zip lindera-ipadic-neologd-0.0.7-20200820/)
% tar -czf /tmp/lindera-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp lindera-ipadic-neologd-0.0.7-20200820

Build UniDic (Japanese dictionary)

# Download and extract UniDic source files
% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/unidic-mecab-2.1.2 \
  --dest /tmp/lindera-unidic-2.1.2 \
  --metadata ./lindera-unidic/metadata.json

% ls -al /tmp/lindera-unidic-2.1.2
% (cd /tmp && zip -r lindera-unidic-2.1.2.zip lindera-unidic-2.1.2/)
% tar -czf /tmp/lindera-unidic-2.1.2.tar.gz -C /tmp lindera-unidic-2.1.2

Build CC-CEDICT (Chinese dictionary)

# Download and extract CC-CEDICT source files
% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
  --dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
  --metadata ./lindera-cc-cedict/metadata.json

% ls -al /tmp/lindera-cc-cedict-0.1.0-20200409
% (cd /tmp && zip -r lindera-cc-cedict-0.1.0-20200409.zip lindera-cc-cedict-0.1.0-20200409/)
% tar -czf /tmp/lindera-cc-cedict-0.1.0-20200409.tar.gz -C /tmp lindera-cc-cedict-0.1.0-20200409

Build ko-dic (Korean dictionary)

# Download and extract ko-dic source files
% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ko-dic-2.1.1-20180720 \
  --dest /tmp/lindera-ko-dic-2.1.1-20180720 \
  --metadata ./lindera-ko-dic/metadata.json

% ls -al /tmp/lindera-ko-dic-2.1.1-20180720
% (cd /tmp && zip -r lindera-ko-dic-2.1.1-20180720.zip lindera-ko-dic-2.1.1-20180720/)
% tar -czf /tmp/lindera-ko-dic-2.1.1-20180720.tar.gz -C /tmp lindera-ko-dic-2.1.1-20180720

Build user dictionary

Build IPADIC user dictionary (Japanese)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/ipadic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ipadic/metadata.json \
  --user

Build UniDic user dictionary (Japanese)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/unidic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-unidic/metadata.json \
  --user

Build CC-CEDICT user dictionary (Chinese)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/cc-cedict_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-cc-cedict/metadata.json \
  --user

Build ko-dic user dictionary (Korean)

For more details about the user dictionary format, please refer to the following URL:

% lindera build \
  --src ./resources/user_dict/ko-dic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ko-dic/metadata.json \
  --user

Tokenize text

Perform morphological analysis (tokenization) on Japanese, Chinese, or Korean text using various dictionaries.

Basic tokenization usage

# Tokenize text using a dictionary directory
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /path/to/dictionary

# Tokenize text using embedded dictionary
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic

# Tokenize with different output format
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json

# Tokenize text from file
lindera tokenize \
  --dict /path/to/dictionary \
  --output wakati \
  input.txt

Tokenization parameters

  • --dict / -d: Dictionary path or URI (required)
    • File path: /path/to/dictionary
    • Embedded: embedded://ipadic, embedded://unidic, etc.
  • --output / -o: Output format (default: mecab)
    • mecab: MeCab-compatible format with part-of-speech info
    • wakati: Space-separated tokens only
    • json: Detailed JSON format with all token information
  • --user-dict / -u: User dictionary path (optional)
  • --mode / -m: Tokenization mode (default: normal)
    • normal: Standard tokenization
    • decompose: Decompose compound words
  • --char-filter / -c: Character filter configuration (JSON)
  • --token-filter / -t: Token filter configuration (JSON)
  • Input file: Optional file path (default: stdin)

Examples with external dictionaries

Tokenize with external IPADIC (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with external IPADIC Neologd (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with external UniDic (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-unidic-2.1.2
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Tokenize with external ko-dic (Korean dictionary)

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      VV+ETM,*,T,할,Inflect,VV,ETM,하/VV/*+ᆯ/ETM/*
수      NNG,*,F,수,*,*,*,*
있      VX,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       UNK
EOS

Tokenize with external CC-CEDICT (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      UNK
EOS

Examples with embedded dictionaries

Lindera can include dictionaries directly in the binary when built with specific feature flags. This allows tokenization without external dictionary files.

Tokenize with embedded IPADIC (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC dictionary in the binary, you must build with the --features=embedded-ipadic option.

Tokenize with embedded UniDic (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://unidic
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

NOTE: To include UniDic dictionary in the binary, you must build with the --features=embedded-unidic option.

Tokenize with embedded IPADIC NEologd (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic-neologd
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC NEologd dictionary in the binary, you must build with the --features=embedded-ipadic-neologd option.

Tokenize with embedded ko-dic (Korean dictionary)

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict embedded://ko-dic
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      VV+ETM,*,T,할,Inflect,VV,ETM,하/VV/*+ᆯ/ETM/*
수      NNG,*,F,수,*,*,*,*
있      VX,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       UNK
EOS

NOTE: To include ko-dic dictionary in the binary, you must build with the --features=embedded-ko-dic option.

Tokenize with embedded CC-CEDICT (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict embedded://cc-cedict
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      UNK
EOS

NOTE: To include CC-CEDICT dictionary in the binary, you must build with the --features=embedded-cc-cedict option.

User dictionary examples

Lindera supports user dictionaries to add custom words alongside system dictionaries. User dictionaries can be in CSV or binary format.

Use user dictionary (CSV format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict embedded://ipadic \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Use user dictionary (Binary format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920 \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Tokenization modes

Lindera provides two tokenization modes: normal and decompose.

Normal mode (default)

Tokenizes faithfully based on words registered in the dictionary:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode normal
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Decompose mode

Additionally decomposes compound nouns:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode decompose
関西    名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港    名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK,*,*,*,*,*,*,*,*
EOS

Output formats

Lindera provides three output formats: mecab, wakati and json.

MeCab format (default)

Outputs results in MeCab-compatible format with part-of-speech information:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output mecab
お待ち  名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し  動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て  助詞,接続助詞,*,*,*,*,て,テ,テ
おり  動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます  助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。  記号,句点,*,*,*,*,。,。,。
EOS

Wakati format

Outputs only the token text separated by spaces:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output wakati
お待ち し て おり ます 。

JSON format

Outputs detailed token information in JSON format:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json
[
  {
    "base_form": "お待ち",
    "byte_end": 9,
    "byte_start": 0,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "名詞",
    "part_of_speech_subcategory_1": "サ変接続",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "オマチ",
    "reading": "オマチ",
    "surface": "お待ち",
    "word_id": 14698
  },
  {
    "base_form": "する",
    "byte_end": 12,
    "byte_start": 9,
    "conjugation_form": "サ変・スル",
    "conjugation_type": "連用形",
    "part_of_speech": "動詞",
    "part_of_speech_subcategory_1": "自立",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "シ",
    "reading": "シ",
    "surface": "し",
    "word_id": 30763
  },
  {
    "base_form": "て",
    "byte_end": 15,
    "byte_start": 12,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "助詞",
    "part_of_speech_subcategory_1": "接続助詞",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "テ",
    "reading": "テ",
    "surface": "て",
    "word_id": 46603
  },
  {
    "base_form": "おる",
    "byte_end": 21,
    "byte_start": 15,
    "conjugation_form": "五段・ラ行",
    "conjugation_type": "連用形",
    "part_of_speech": "動詞",
    "part_of_speech_subcategory_1": "非自立",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "オリ",
    "reading": "オリ",
    "surface": "おり",
    "word_id": 14239
  },
  {
    "base_form": "ます",
    "byte_end": 27,
    "byte_start": 21,
    "conjugation_form": "特殊・マス",
    "conjugation_type": "基本形",
    "part_of_speech": "助動詞",
    "part_of_speech_subcategory_1": "*",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "マス",
    "reading": "マス",
    "surface": "ます",
    "word_id": 68733
  },
  {
    "base_form": "。",
    "byte_end": 30,
    "byte_start": 27,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "記号",
    "part_of_speech_subcategory_1": "句点",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "。",
    "reading": "。",
    "surface": "。",
    "word_id": 101
  }
]

Advanced tokenization

Lindera provides an analytical framework that combines character filters, tokenizers, and token filters for advanced text processing. Filters are configured using JSON.

Tokenize with character and token filters

% echo "すもももももももものうち" | lindera tokenize \
  --dict embedded://ipadic \
  --char-filter 'unicode_normalize:{"kind":"nfkc"}' \
  --token-filter 'japanese_keep_tags:{"tags":["名詞,一般"]}'
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
EOS

Dictionary Training (Experimental)

Train a new morphological analysis model from annotated corpus data. To use this feature, you must build with the train feature flag enabled. (The train feature flag is enabled by default.)

Training parameters

  • --seed / -s: Seed lexicon file (CSV format) to be weighted
  • --corpus / -c: Training corpus (annotated text)
  • --char-def / -C: Character definition file (char.def)
  • --unk-def / -u: Unknown word definition file (unk.def) to be weighted
  • --feature-def / -f: Feature definition file (feature.def)
  • --rewrite-def / -r: Rewrite rule definition file (rewrite.def)
  • --output / -o: Output model file
  • --lambda / -l: L1 regularization (0.0-1.0) (default: 0.01)
  • --max-iterations / -i: Maximum number of iterations for training (default: 100)
  • --max-threads / -t: Maximum number of threads (defaults to CPU core count, auto-adjusted based on dataset size)

Basic workflow

1. Prepare training files

Seed lexicon file (seed.csv):

The seed lexicon file contains initial dictionary entries used for training the CRF model. Each line represents a word entry with comma-separated fields. The specific field structure varies depending on the dictionary format:

  • Surface
  • Left context ID
  • Right context ID
  • Word cost
  • Part-of-speech tags (multiple fields)
  • Base form
  • Reading (katakana)
  • Pronunciation

Note: The exact field definitions differ between dictionary formats (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.

外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン

Training corpus (corpus.txt):

The training corpus file contains annotated text data used to train the CRF model. Each line consists of:

  • A surface form (word) followed by a tab character
  • Comma-separated morphological features (part-of-speech tags, base form, reading, pronunciation)
  • Sentences are separated by "EOS" (End Of Sentence) markers

Note: The morphological feature format varies depending on the dictionary (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.

外国	名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人	名詞,接尾,一般,*,*,*,人,ジン,ジン
参政	名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権	名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS

これ	連体詞,*,*,*,*,*,これ,コレ,コレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト	名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS

形態	名詞,一般,*,*,*,*,形態,ケイタイ,ケイタイ
素	名詞,接尾,一般,*,*,*,素,ソ,ソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う	動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
EOS

For detailed information about file formats and advanced features, see TRAINER_README.md.

2. Train model

lindera train \
  --seed ./resources/training/seed.csv \
  --corpus ./resources/training/corpus.txt \
  --unk-def ./resources/training/unk.def \
  --char-def ./resources/training/char.def \
  --feature-def ./resources/training/feature.def \
  --rewrite-def ./resources/training/rewrite.def \
  --output /tmp/lindera/training/model.dat \
  --lambda 0.01 \
  --max-iterations 100

3. Training results

The trained model will contain:

  • Existing words: All seed dictionary records with newly learned weights
  • New words: Words from the corpus not in the seed dictionary, added with appropriate weights

Export trained model to dictionary

Export a trained model file to Lindera dictionary format files. This feature requires building with the train feature flag enabled.

Basic export usage

# Export trained model to dictionary files
lindera export \
  --model /tmp/lindera/training/model.dat \
  --metadata ./resources/training/metadata.json \
  --output /tmp/lindera/training/dictionary

Export parameters

  • --model / -m: Path to the trained model file (.dat format)
  • --output / -o: Directory to output the dictionary files
  • --metadata: Optional metadata.json file to update with trained model information

Output files

The export command creates the following dictionary files in the output directory:

  • lex.csv: Lexicon file with learned weights
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character type definitions
  • metadata.json: Updated metadata file (if --metadata option is provided)
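
For example, after a successful export the output directory should contain the files listed above; the listing below is illustrative, and metadata.json appears only when the --metadata option is provided:

% ls /tmp/lindera/training/dictionary
char.def  lex.csv  matrix.def  metadata.json  unk.def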

Complete workflow example

1. Train model

lindera train \
  --seed ./resources/training/seed.csv \
  --corpus ./resources/training/corpus.txt \
  --unk-def ./resources/training/unk.def \
  --char-def ./resources/training/char.def \
  --feature-def ./resources/training/feature.def \
  --rewrite-def ./resources/training/rewrite.def \
  --output /tmp/lindera/training/model.dat \
  --lambda 0.01 \
  --max-iterations 100

2. Export to dictionary format

lindera export \
  --model /tmp/lindera/training/model.dat \
  --metadata ./resources/training/metadata.json \
  --output /tmp/lindera/training/dictionary

3. Build dictionary

lindera build \
  --src /tmp/lindera/training/dictionary \
  --dest /tmp/lindera/training/compiled_dictionary \
  --metadata /tmp/lindera/training/dictionary/metadata.json

4. Use trained dictionary

echo "これは外国人参政権です。" | lindera tokenize \
  -d /tmp/lindera/training/compiled_dictionary

Metadata update feature

When the --metadata option is provided, the export command will:

  1. Read the base metadata.json file to preserve existing configuration

  2. Update specific fields with values from the trained model:

    • default_left_context_id: Maximum left context ID from trained model
    • default_right_context_id: Maximum right context ID from trained model
    • default_word_cost: Calculated from feature weight median
    • model_info: Training statistics including:
      • feature_count: Number of features in the model
      • label_count: Number of labels in the model
      • max_left_context_id: Maximum left context ID
      • max_right_context_id: Maximum right context ID
      • connection_matrix_size: Size of connection cost matrix
      • training_iterations: Number of training iterations performed
      • regularization: L1 regularization parameter used
      • version: Model version
      • updated_at: Timestamp of when the model was exported
  3. Preserve existing settings such as:

    • Dictionary name
    • Character encoding settings
    • Schema definitions
    • Other user-defined configuration

This allows you to maintain your base dictionary configuration while incorporating the optimized parameters learned during training.
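
As an illustration only (all values below are hypothetical), an updated metadata.json might contain the trained fields and a model_info block like this:

{
  "default_left_context_id": 1316,
  "default_right_context_id": 1316,
  "default_word_cost": 3000,
  "model_info": {
    "feature_count": 13,
    "label_count": 19,
    "max_left_context_id": 1316,
    "max_right_context_id": 1316,
    "connection_matrix_size": 1734489,
    "training_iterations": 100,
    "regularization": 0.01,
    "version": "1.0.0",
    "updated_at": "2025-01-01T00:00:00Z"
  }
}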

API reference

The API reference is available. Please see the following URL:

API Reference

The API reference is available. Please see the following URL:

Contributing

(Content for Contributing goes here)