Lindera

A morphological analysis library in Rust. Lindera is forked from kuromoji-rs and aims to provide easy installation and concise APIs for tokenizing text in multiple languages.

Key Features

| Feature | Description |
|---|---|
| Morphological Analysis | Viterbi-based segmentation and part-of-speech tagging |
| Multi-language Support | Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba) |
| Dictionary System | Pre-built dictionaries, user dictionaries, and custom dictionary training |
| Text Processing Pipeline | Composable character filters and token filters for flexible text normalization |
| CRF Training | Train custom CRF models for dictionary cost estimation |
| Python Bindings | Use Lindera from Python via PyO3 |
| WebAssembly | Run Lindera in the browser via wasm-bindgen |
| Pure Rust | No C/C++ dependencies; works on any platform Rust supports |

Tokenization Flow

graph LR
    subgraph Your Application
        T["Text"]
    end
    subgraph Lindera
        CF["Character Filters"]
        SEG["Segmenter\n(Dictionary + Viterbi)"]
        TF["Token Filters"]
    end
    T --> CF --> SEG --> TF --> R["Tokens"]

Document Map

| Section | Description |
|---|---|
| Getting Started | Installation, quick start, and examples |
| Dictionaries | Available dictionaries and how to use them |
| Configuration | YAML-based tokenizer configuration |
| Advanced Usage | User dictionaries, filters, and CRF training |
| CLI | Command-line interface reference |
| Architecture | Crate structure and design overview |
| API Reference | Rust API documentation |
| Contributing | How to contribute to Lindera |

Quick Example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

Run the example:

cargo run --features=embed-ipadic --example=tokenize

Output:

text:   関西国際空港限定トートバッグ
token:  関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token:  限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token:  トートバッグ    名詞,一般,*,*,*,*,*,*,*

License

Lindera is released under the MIT License.

Architecture

Lindera is organized as a Cargo workspace comprising multiple crates. Each crate has a focused responsibility, from low-level CRF computation to high-level CLI and language bindings.

Crate Dependency Graph

graph TB
    CRF["lindera-crf\n(CRF Engine)"]
    DICT["lindera-dictionary\n(Dictionary Base)"]
    IPADIC["lindera-ipadic"]
    UNIDIC["lindera-unidic"]
    KODIC["lindera-ko-dic"]
    CCCEDICT["lindera-cc-cedict"]
    JIEBA["lindera-jieba"]
    NEOLOGD["lindera-ipadic-neologd"]
    LIB["lindera\n(Core Library)"]
    CLI["lindera-cli\n(CLI)"]
    PY["lindera-python\n(Python)"]
    WASM["lindera-wasm\n(WebAssembly)"]

    CRF --> DICT
    DICT --> IPADIC
    DICT --> UNIDIC
    DICT --> KODIC
    DICT --> CCCEDICT
    DICT --> JIEBA
    DICT --> NEOLOGD
    DICT --> LIB
    IPADIC --> LIB
    UNIDIC --> LIB
    KODIC --> LIB
    CCCEDICT --> LIB
    JIEBA --> LIB
    NEOLOGD --> LIB
    LIB --> CLI
    LIB --> PY
    LIB --> WASM

Crate Overview

| Crate | Type | Description |
|---|---|---|
| lindera-crf | Core | Pure Rust CRF (Conditional Random Field) implementation. Supports no_std. Uses rkyv for serialization. |
| lindera-dictionary | Core | Dictionary base library. Provides dictionary loading, building, and training (with the train feature). |
| lindera | Core | Main morphological analysis library. Integrates dictionaries, segmenter, character filters, and token filters. |
| lindera-cli | Application | Command-line interface for tokenization, dictionary building, and CRF training. |
| lindera-ipadic | Dictionary | Japanese dictionary based on IPADIC. |
| lindera-ipadic-neologd | Dictionary | Japanese dictionary based on IPADIC NEologd (includes neologisms). |
| lindera-unidic | Dictionary | Japanese dictionary based on UniDic. |
| lindera-ko-dic | Dictionary | Korean dictionary based on ko-dic. |
| lindera-cc-cedict | Dictionary | Chinese dictionary based on CC-CEDICT. |
| lindera-jieba | Dictionary | Chinese dictionary based on Jieba. |
| lindera-python | Binding | Python bindings via PyO3. |
| lindera-wasm | Binding | WebAssembly bindings via wasm-bindgen. |

Tokenization Pipeline

Lindera processes text through a multi-stage pipeline:

Input Text
  |
  v
Character Filters    -- Normalize characters (e.g., Unicode normalization, mapping)
  |
  v
Segmenter            -- Segment text into tokens using a dictionary and the Viterbi algorithm
  |
  v
Token Filters        -- Transform tokens (e.g., POS filtering, stop words, stemming)
  |
  v
Output Tokens

The Segmenter is the core component. It builds a lattice of candidate tokens from the dictionary, then applies the Viterbi algorithm to find the lowest-cost path, producing the most likely segmentation.

Feature Flags

| Feature | Description | Default |
|---|---|---|
| mmap | Memory-mapped file support for dictionary loading | Enabled |
| train | CRF-based dictionary training functionality (depends on lindera-crf) | CLI only |
| embed-ipadic | Embed the IPADIC dictionary into the binary | Disabled |
| embed-cjk | Embed IPADIC + ko-dic + Jieba dictionaries | Disabled |
| embed-cjk2 | Embed UniDic + ko-dic + Jieba dictionaries | Disabled |
| embed-cjk3 | Embed IPADIC NEologd + ko-dic + Jieba dictionaries | Disabled |

Getting Started

This section will guide you through installing Lindera and running your first morphological analysis.

  • Installation -- Add Lindera to your project and configure environment variables
  • Quick Start -- Tokenize your first text in just a few lines of code
  • Examples -- Explore example programs for common use cases

Installation

Add the following to your Cargo.toml:

[dependencies]
lindera = "3.0.0"

Dictionary Setup

Lindera requires a pre-built dictionary at runtime. Download a dictionary from GitHub Releases and specify its path when loading:

#![allow(unused)]
fn main() {
let dictionary = load_dictionary("/path/to/ipadic")?;
}

[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding embed-* feature flag and load it using the embedded:// scheme:

#![allow(unused)]
fn main() {
// Cargo.toml: lindera = { version = "3.0.0", features = ["embed-ipadic"] }
let dictionary = load_dictionary("embedded://ipadic")?;
}

See Feature Flags for details.

Environment Variables

LINDERA_DICTIONARIES_PATH

The LINDERA_DICTIONARIES_PATH environment variable specifies a directory for caching dictionary source files. This enables:

  • Offline builds: Once downloaded, dictionary source files are preserved for future builds
  • Faster builds: Subsequent builds skip downloading if valid cached files exist
  • Reproducible builds: Ensures consistent dictionary versions across builds

Usage:

export LINDERA_DICTIONARIES_PATH=/path/to/dicts
cargo build --features=ipadic

When set, dictionary source files are stored in $LINDERA_DICTIONARIES_PATH/<version>/, where <version> is the lindera-dictionary crate version. The cache validates files using MD5 checksums; invalid files are automatically re-downloaded.

[!NOTE] LINDERA_CACHE is deprecated but still supported for backward compatibility. It will be used if LINDERA_DICTIONARIES_PATH is not set.

LINDERA_CONFIG_PATH

The LINDERA_CONFIG_PATH environment variable specifies the path to a YAML configuration file for the tokenizer. This allows you to configure tokenizer behavior without modifying Rust code.

export LINDERA_CONFIG_PATH=./resources/config/lindera.yml

See the Configuration section for details on the configuration format.

DOCS_RS

The DOCS_RS environment variable is automatically set by docs.rs when building documentation. When this variable is detected, Lindera creates dummy dictionary files instead of downloading actual dictionary data, allowing documentation to be built without network access or large file downloads.

This is primarily used internally by docs.rs and typically doesn't need to be set by users.

LINDERA_WORKDIR

The LINDERA_WORKDIR environment variable is automatically set during the build process by the lindera-dictionary crate. It points to the directory containing the built dictionary data files and is used internally by dictionary crates to locate their data files.

This variable is set automatically and should not be modified by users.

Quick Start

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

First, download a pre-built IPADIC dictionary from GitHub Releases and extract it to a local directory (e.g., /path/to/ipadic).

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("/path/to/ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --example=tokenize

[!TIP] If you embed the dictionary into the binary using the embed-ipadic feature (advanced usage), you can use load_dictionary("embedded://ipadic") instead of specifying a file path. See Feature Flags for details.

You can see the result as follows:

text:   関西国際空港限定トートバッグ
token:  関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token:  限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token:  トートバッグ    名詞,一般,*,*,*,*,*,*,*

Examples

Lindera includes several example programs that demonstrate common use cases. The source code is available in the examples directory on GitHub.

Before running the examples, download a pre-built IPADIC dictionary from GitHub Releases and extract it to a local directory.

Available Examples

tokenize

Basic tokenization using an external IPADIC dictionary. Segments input text and prints each token with its part-of-speech details.

cargo run --example=tokenize

tokenize_with_user_dict

Tokenization with a user dictionary. Shows how to supplement the dictionary with custom entries for domain-specific terms.

cargo run --example=tokenize_with_user_dict

tokenize_with_filters

Tokenization with character filters and token filters. Demonstrates the text processing pipeline, including Unicode normalization, part-of-speech filtering, and other transformations.

cargo run --example=tokenize_with_filters

tokenize_with_config

Tokenization using a YAML configuration file. Shows how to configure the tokenizer declaratively instead of programmatically.

cargo run --example=tokenize_with_config

Core Concepts

This section explains the fundamental concepts behind Lindera's morphological analysis system.

Morphological Analysis

What is morphological analysis?

Morphological analysis is the process of breaking down text into its smallest meaningful units (morphemes) and identifying their grammatical properties. For languages like Japanese, Chinese, and Korean -- where words are not separated by spaces -- morphological analysis is an essential first step for natural language processing tasks such as search indexing, text classification, and machine translation.

How Lindera works

Lindera is a dictionary-based morphological analyzer. It uses a pre-compiled system dictionary containing known words along with their costs, and applies the Viterbi algorithm to find the optimal segmentation of input text.

The analysis process works as follows:

  1. Lattice construction: Lindera scans the input text and looks up all possible words in the dictionary at every position, building a directed acyclic graph (lattice) of candidate segmentations.
  2. Cost assignment: Each candidate word has an associated word cost (from the dictionary), and each pair of adjacent words has a connection cost (from the connection cost matrix).
  3. Optimal path search: The Viterbi algorithm finds the path through the lattice with the minimum total cost, producing the best segmentation.

Key terminology

| Term | Description |
|---|---|
| Surface form | The actual text as it appears in the input (e.g., "食べ"). |
| Part-of-speech (POS) | The grammatical category of a word (e.g., noun, verb, particle). Lindera dictionaries provide hierarchical POS tags with up to four levels of subcategories. |
| Reading | The pronunciation of a word, typically in Katakana for Japanese dictionaries. |
| Base form | The uninflected (dictionary) form of a word (e.g., "食べる" for the surface "食べ"). |
| Conjugation | Inflection information for words that conjugate, consisting of a conjugation type and a conjugation form. |
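
As a concrete example, an IPADIC detail line such as 名詞,一般,*,*,*,*,すもも,スモモ,スモモ breaks down as follows; the field names match the JSON output format shown in the CLI section:

| Field | Content | Value in this example |
|---|---|---|
| 1 | Part-of-speech | 名詞 (noun) |
| 2-4 | POS subcategories 1-3 | 一般 (general), *, * |
| 5 | Conjugation type | * (does not conjugate) |
| 6 | Conjugation form | * |
| 7 | Base form | すもも |
| 8 | Reading | スモモ |
| 9 | Pronunciation | スモモ |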

Cost-based segmentation

The Viterbi algorithm selects the segmentation path with the minimum total cost. The total cost of a path is the sum of:

  • Word costs: Each word in the dictionary has an associated cost. Lower cost means the word is more likely to appear. Common words tend to have lower costs, while rare words have higher costs.
  • Connection costs: The cost of connecting two adjacent words, determined by the right context ID of the left word and the left context ID of the right word.

The algorithm computes:

Total cost = sum of word costs + sum of connection costs

By minimizing this total cost, Lindera finds the most natural segmentation of the input text.
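
To make the arithmetic concrete, here is a minimal sketch with made-up word and connection costs; real values come from the compiled dictionary:

// Hypothetical costs for a three-token path (BOS/EOS transitions are
// included in the connection costs). The numbers are illustrative only.
fn main() {
    let word_costs: [i64; 3] = [3003, 4816, 9461];
    let connection_costs: [i64; 4] = [-283, 62, 157, -368];
    let total: i64 = word_costs.iter().sum::<i64>() + connection_costs.iter().sum::<i64>();
    println!("total path cost = {total}"); // the Viterbi path minimizes this value
}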

Connection cost matrix

The connection cost matrix stores the cost of transitioning from one word to another. It is a two-dimensional matrix indexed by:

  • The right context ID of the preceding word
  • The left context ID of the following word

These context IDs encode grammatical information about word boundaries. For example, the connection cost between a noun and a particle is typically low (natural sequence), while the connection cost between two verbs in base form might be high (unnatural sequence).

The connection cost matrix is compiled into binary format as part of the dictionary build process and is loaded at runtime for efficient lookup.
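
Conceptually, the lookup is a single index into a flat array. A simplified sketch, not the actual on-disk layout used by lindera-dictionary:

// Simplified connection cost lookup; costs are stored row-major with
// `num_left` columns (one per left context ID).
struct ConnectionMatrix {
    num_left: usize,
    costs: Vec<i16>,
}

impl ConnectionMatrix {
    fn cost(&self, right_id_of_prev: usize, left_id_of_next: usize) -> i16 {
        self.costs[right_id_of_prev * self.num_left + left_id_of_next]
    }
}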

Dictionaries

Lindera supports various dictionaries for Japanese, Korean, and Chinese morphological analysis. Each dictionary is provided as a separate crate.

| Dictionary | Language | Crate | Description |
|---|---|---|---|
| IPADIC | Japanese | lindera-ipadic | The most common dictionary for Japanese |
| IPADIC NEologd | Japanese | lindera-ipadic-neologd | IPADIC with neologisms (new words) |
| UniDic | Japanese | lindera-unidic | Uniform word unit definitions |
| ko-dic | Korean | lindera-ko-dic | Korean morphological analysis |
| CC-CEDICT | Chinese | lindera-cc-cedict | Chinese-English dictionary |
| Jieba | Chinese | lindera-jieba | Jieba-based Chinese dictionary |

Obtaining Dictionaries

Pre-built dictionaries are available for download from GitHub Releases. Download the dictionary archive for your target language and extract it to a local directory.

#![allow(unused)]
fn main() {
// Load an external dictionary from a local path
let dictionary = load_dictionary("/path/to/ipadic")?;
}

[!TIP] If you need a self-contained binary without external dictionary files, you can embed dictionaries using the embed-* feature flags and load them using the embedded:// scheme:

#![allow(unused)]
fn main() {
let dictionary = load_dictionary("embedded://ipadic")?;
}

See Feature Flags for details.

See each dictionary crate's documentation for format details, build instructions, and usage examples.

Tokenization

Lindera provides multiple tokenization modes and supports N-Best analysis for enumerating alternative segmentation candidates.

Tokenization modes

Normal mode

Normal mode performs standard tokenization based on dictionary entries. Compound words that exist as single entries in the dictionary are kept as-is.

Example -- tokenizing "関西国際空港限定トートバッグ" in Normal mode:

関西国際空港 | 限定 | トートバッグ

The compound noun "関西国際空港" (Kansai International Airport) is preserved as a single token because it exists as one entry in the dictionary.

Decompose mode

Decompose mode further breaks down compound nouns into their constituent parts, even when the compound exists as a dictionary entry.

Example -- tokenizing "関西国際空港限定トートバッグ" in Decompose mode:

関西 | 国際 | 空港 | 限定 | トートバッグ

The compound "関西国際空港" is decomposed into "関西", "国際", and "空港".

Selecting a mode

In Rust, specify the mode when creating a Segmenter:

#![allow(unused)]
fn main() {
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::dictionary::load_dictionary;

let dictionary = load_dictionary("embedded://ipadic")?;

// Normal mode
let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

// Decompose mode
let segmenter = Segmenter::new(Mode::Decompose(Default::default()), dictionary, None);
}

With the CLI, use the --mode flag:

echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic --mode normal
echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic --mode decompose

N-Best tokenization

N-Best tokenization enumerates the top N tokenization candidates ordered by total path cost (lower cost = better segmentation). This is useful when the best result is ambiguous, or when you want to explore alternative interpretations of the input text.

Algorithm

N-Best tokenization is based on the Forward-DP Backward-A* algorithm, which is compatible with MeCab's N-Best implementation. The forward pass computes optimal costs using dynamic programming, and the backward pass uses A* search to enumerate paths in order of increasing total cost.

Parameters

The tokenize_nbest method accepts the following parameters:

| Parameter | Type | Description |
|---|---|---|
| text | &str | The text to tokenize. |
| n | usize | Number of N-best results to return. |
| unique | bool | When true, deduplicates results that produce the same word boundary positions. |
| cost_threshold | Option<i64> | When Some(threshold), only returns paths with cost within best_cost + threshold. |

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "すもももももももものうち";

    // Get top 3 tokenization results
    let results = tokenizer.tokenize_nbest(text, 3, false, None)?;

    for (rank, (tokens, cost)) in results.iter().enumerate() {
        println!("--- NBEST {} (cost={}) ---", rank + 1, cost);
        for token in tokens {
            let details = token.details().join(",");
            println!("{}\t{}", token.surface.as_ref(), details);
        }
    }

    Ok(())
}

Output:

--- NBEST 1 (cost=7546) ---
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
--- NBEST 2 (cost=7914) ---
...

CLI example

echo "すもももももももものうち" | lindera tokenize --dict embedded://ipadic -N 3

Lattice reuse

For repeated tokenization, you can reuse a Lattice to reduce memory allocations:

#![allow(unused)]
fn main() {
use lindera_dictionary::viterbi::Lattice;

let mut lattice = Lattice::default();
let results = tokenizer.tokenize_nbest_with_lattice(text, &mut lattice, 3, false, None)?;
}

User Dictionary

A user dictionary is a supplementary dictionary that allows you to register custom words alongside the system dictionary. This is useful for domain-specific terms, brand names, proper nouns, or any words that are not in the default system dictionary.

CSV format

The simplest user dictionary format is a CSV file with three columns:

<surface>,<part_of_speech>,<reading>

Example CSV content

東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ

Each dictionary type (IPADIC, UniDic, ko-dic, etc.) also supports a detailed CSV format with full control over context IDs, costs, and all feature fields. See the Dictionaries section for the detailed format of each dictionary type.
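
For illustration, a detailed IPADIC-style row follows the MeCab column order of surface, left context ID, right context ID, word cost, and then the feature fields. The context IDs and cost below are made-up values:

東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー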

Rust API example

use std::fs::File;
use std::path::PathBuf;

use lindera::dictionary::{Metadata, load_dictionary, load_user_dictionary};
use lindera::error::LinderaErrorKind;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let user_dict_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("user_dict")
        .join("ipadic_simple_userdic.csv");

    let metadata_file = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../lindera-ipadic")
        .join("metadata.json");
    let metadata: Metadata = serde_json::from_reader(
        File::open(metadata_file)
            .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
            .unwrap(),
    )
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))
    .unwrap();

    let dictionary = load_dictionary("embedded://ipadic")?;
    let user_dictionary = load_user_dictionary(user_dict_path.to_str().unwrap(), &metadata)?;
    let segmenter = Segmenter::new(
        Mode::Normal,
        dictionary,
        Some(user_dictionary), // Using the loaded user dictionary
    );

    // Create a tokenizer.
    let tokenizer = Tokenizer::new(segmenter);

    // Tokenize a text.
    let text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";
    let mut tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

Output:

text:   東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です
token:  東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
token:  の      助詞,連体化,*,*,*,*,の,ノ,ノ
token:  最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
token:  は      助詞,係助詞,*,*,*,*,は,ハ,ワ
token:  とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
token:  です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス

Building a user dictionary with CLI

You can build a user dictionary from CSV to binary format using the CLI:

lindera build --src <source_dir> --dest <dest_dir> --metadata <metadata.json> --user

Binary vs CSV user dictionary

  • CSV format: Loaded and parsed at runtime. Convenient for development and small dictionaries.
  • Binary format: Pre-compiled for faster loading. Recommended for production use with large user dictionaries.

Both formats can be specified when creating a Segmenter. The binary format skips the CSV parsing step, resulting in faster startup times.

Character Filters

Character filters are pre-processing steps applied to the input text before tokenization. They normalize or transform characters to improve tokenization quality and consistency.

Available character filters

unicode_normalize

Applies Unicode normalization to the input text. This is useful for normalizing full-width characters to half-width, or for canonicalizing equivalent Unicode representations.

Supported normalization forms:

| Form | Description |
|---|---|
| NFKC | Compatibility decomposition followed by canonical composition. Converts full-width alphanumeric characters to half-width and normalizes Katakana variants. |
| NFC | Canonical decomposition followed by canonical composition. |
| NFD | Canonical decomposition. |
| NFKD | Compatibility decomposition. |

japanese_iteration_mark

Normalizes Japanese iteration marks into their expanded forms. Iteration marks are special characters that indicate the repetition of the preceding character.

| Mark | Name | Example |
|---|---|---|
| 々 | Kanji iteration mark | 人々 (hitobito) |
| ゝ / ゞ | Hiragana iteration marks | いすゞ (isuzu) |
| ヽ / ヾ | Katakana iteration marks | バナナヽ |

The filter accepts two boolean parameters: whether to normalize Hiragana iteration marks and whether to normalize Katakana iteration marks.

mapping

Performs character-level string replacement based on a user-defined mapping table. This can be used for custom normalization rules.

For example, mapping "リンデラ" to "Lindera".

YAML configuration example

When using Lindera with a YAML configuration file, character filters can be specified in the character_filters section:

segmenter:
  mode: normal
  dictionary: "embedded://ipadic"

character_filters:
  - kind: unicode_normalize
    args:
      kind: nfkc
  - kind: japanese_iteration_mark
    args:
      normalize_kanji: true
      normalize_kana: true
  - kind: mapping
    args:
      mapping:
        リンデラ: Lindera

Rust API example

Character filters can be created and appended to a Tokenizer programmatically:

use lindera::character_filter::BoxCharacterFilter;
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::character_filter::japanese_iteration_mark::JapaneseIterationMarkCharacterFilter;
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // Create character filters.
    let unicode_normalize_char_filter =
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC);

    let japanese_iteration_mark_char_filter =
        JapaneseIterationMarkCharacterFilter::new(true, true);

    // Create a tokenizer and append character filters.
    let mut tokenizer = Tokenizer::new(segmenter);

    tokenizer
        .append_character_filter(BoxCharacterFilter::from(unicode_normalize_char_filter))
        .append_character_filter(BoxCharacterFilter::from(
            japanese_iteration_mark_char_filter,
        ));

    // Tokenize text -- the full-width "Ｌｉｎｄｅｒａ" will be normalized to "Lindera".
    let text = "Ｌｉｎｄｅｒａは形態素解析エンジンです。";
    let tokens = tokenizer.tokenize(text)?;

    for token in tokens {
        println!(
            "token: {:?}, details: {:?}",
            token.surface, token.details
        );
    }

    Ok(())
}

Output (with NFKC normalization applied):

token: "Lindera", details: Some(["名詞", "固有名詞", "組織", "*", "*", "*", "*", "*", "*"])
token: "は", details: Some(["助詞", "係助詞", "*", "*", "*", "*", "は", "ハ", "ワ"])
token: "形態素", details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: "解析", details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: "エンジン", details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: "です", details: Some(["助動詞", "*", "*", "*", "特殊・デス", "基本形", "です", "デス", "デス"])
token: "。", details: Some(["記号", "句点", "*", "*", "*", "*", "。", "。", "。"])

Lindera CRF

Lindera CRF is a pure Rust implementation of Conditional Random Fields (CRFs), forked from rucrf. It provides a trainer and an estimator for CRFs with support for lattice structures.

Key Features

  • Lattices with variable length edges
  • L1, L2, and Elastic Net regularization
  • Multi-threaded training
  • Zero-copy deserialization with rkyv
  • no_std support (without the train feature) -- see the Cargo sketch below
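
For example, a no_std consumer would disable the default train feature and enable alloc. A sketch of the dependency declaration (feature names from the Feature Flags table below; substitute a concrete version):

[dependencies]
lindera-crf = { version = "*", default-features = false, features = ["alloc"] }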

Changes from rucrf

  • Serialization backend: Switched from bincode to rkyv for zero-copy deserialization
  • Elastic Net regularization: Added Regularization::ElasticNet combining L1 and L2 penalties
  • Rust 2024 edition: Updated to Rust 2024 edition
  • Dependency updates: Updated argmin, argmin-math, hashbrown, etc.

Architecture

Module Structure

lindera-crf/src/
├── lib.rs                # Public API re-exports
├── feature.rs            # FeatureSet, FeatureProvider
├── lattice.rs            # Edge, Node, Lattice
├── model.rs              # RawModel, MergedModel, Model trait
├── trainer.rs            # Trainer, Regularization enum
├── errors.rs             # Error types
├── forward_backward.rs   # Forward-backward algorithm
├── math.rs               # Mathematical utilities (logsumexp)
├── optimizers/
│   └── lbfgs.rs          # L-BFGS optimization
└── utils.rs              # Utility traits

Key Components

FeatureProvider / FeatureSet

Manage per-label feature sets. Each FeatureSet holds unigram features and left/right bigram features for a given label. FeatureProvider aggregates FeatureSet instances and maps feature IDs to weights.
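
The shapes below are illustrative only, not the crate's actual definitions, but they make the relationship concrete:

// Hypothetical sketch: one FeatureSet per label, aggregated by a provider
// that maps feature IDs to learned weights.
struct FeatureSet {
    unigram: Vec<u32>,      // unigram feature IDs for this label
    bigram_left: Vec<u32>,  // bigram feature IDs shared with the left neighbor
    bigram_right: Vec<u32>, // bigram feature IDs shared with the right neighbor
}

struct FeatureProvider {
    sets: Vec<FeatureSet>, // indexed by label
    weights: Vec<f64>,     // feature ID -> weight
}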

Lattice / Edge / Node

Lattice structure with variable-length edges for sequence labeling. Edge represents a candidate span with a label, while Node aggregates edges at a given position. The Lattice is constructed from input data and used by the model to find the best path.

Trainer

Trains a CRF model using L-BFGS optimization with configurable regularization. The trainer accepts labeled lattice examples, computes gradients via the forward-backward algorithm, and iteratively updates model weights.

Regularization

Configurable regularization strategies:

  • L1: Sparse models via L1 penalty
  • L2: Smooth models via L2 penalty
  • ElasticNet: Combines L1 and L2 with a configurable l1_ratio (see the sketch below)
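
A sketch of the penalty the ElasticNet option describes, where lambda scales the overall strength and l1_ratio interpolates between the two norms (illustrative; not the crate's API):

// penalty = lambda * (l1_ratio * ||w||_1 + (1 - l1_ratio) * ||w||_2^2 / 2)
fn elastic_net_penalty(weights: &[f64], lambda: f64, l1_ratio: f64) -> f64 {
    let l1: f64 = weights.iter().map(|w| w.abs()).sum();
    let l2: f64 = weights.iter().map(|w| w * w).sum::<f64>() / 2.0;
    lambda * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
}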

Model (trait)

Interface for searching the best path through a lattice. Two implementations are provided:

  • RawModel: Stores weights in a flat vector indexed by feature ID
  • MergedModel: Optimized for inference; merges feature weights into a compact representation serializable with rkyv

Forward-backward Algorithm

Computes alpha (forward) and beta (backward) values over the lattice. Used during training to calculate expected feature counts and gradients.
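
The quantities involved are products of many small probabilities, so the computation runs in log space; this is what the logsumexp utility in math.rs is for. A standard, numerically stable formulation (a sketch, not the crate's exact code):

// log(sum(exp(x_i))), computed stably by factoring out the maximum.
fn logsumexp(xs: &[f64]) -> f64 {
    let max = xs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if !max.is_finite() {
        return max; // empty input, all -inf, or an infinite element
    }
    max + xs.iter().map(|x| (x - max).exp()).sum::<f64>().ln()
}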

Feature Flags

| Feature | Description | Default |
|---|---|---|
| alloc | Alloc support for no_std | No |
| std | Standard library support (implies alloc) | No |
| train | Training functionality (L-BFGS, multi-threading, logging) | Yes |

API Reference

The API reference is available; please see the following URL:

Lindera Dictionary

Lindera Dictionary is the base library for morphological analysis dictionaries. It provides dictionary loading, building, Viterbi-based segmentation, and CRF-based training functionality.

Key Features

  • Dictionary loading from filesystem or embedded data
  • Dictionary building from MeCab-format CSV source files
  • Viterbi algorithm for optimal segmentation
  • N-best path generation (Forward-DP Backward-A*)
  • Memory-mapped file support
  • CRF-based dictionary training (with train feature)

Architecture

Module Structure

lindera-dictionary/src/
├── lib.rs               # Public API
├── dictionary.rs        # Dictionary, UserDictionary
├── builder.rs           # DictionaryBuilder
├── loader.rs            # DictionaryLoader trait, FSDictionaryLoader
├── viterbi.rs           # Lattice, Edge, Viterbi segmentation
├── nbest.rs             # NBestGenerator (Forward-DP Backward-A*)
├── mode.rs              # Mode (Normal/Decompose), Penalty
├── error.rs             # LinderaError, LinderaErrorKind
├── assets.rs            # Download and file management
├── dictionary/
│   ├── character_definition.rs    # Character type definitions
│   ├── connection_cost_matrix.rs  # Connection cost matrix
│   ├── prefix_dictionary.rs       # Double-array trie dictionary
│   ├── unknown_dictionary.rs      # Unknown word handling
│   ├── metadata.rs                # Dictionary metadata
│   └── schema.rs                  # Schema definitions
└── trainer/             # (train feature)
    ├── config.rs        # TrainerConfig
    ├── corpus.rs        # Corpus, Example, Word
    ├── feature_extractor.rs  # Feature template parsing
    ├── feature_rewriter.rs   # MeCab-compatible rewrite rules
    └── model.rs         # Trained model, to_cost()

Key Components

Dictionary / UserDictionary

Main data structures holding the compiled dictionary data. A Dictionary contains the character definitions, connection cost matrix, prefix dictionary (double-array trie), and unknown word dictionary. UserDictionary allows users to add custom vocabulary on top of the system dictionary.

DictionaryBuilder

Fluent API for building dictionaries from source CSV files. It compiles MeCab-format dictionary sources into the binary format used at runtime.

DictionaryLoader / FSDictionaryLoader

DictionaryLoader is a trait for loading compiled dictionaries. FSDictionaryLoader is the filesystem-based implementation that reads dictionary files from a directory, with optional memory-mapped file support.

Viterbi (Lattice, Edge)

Builds a lattice of candidate tokens from the input text and finds the optimal segmentation path using the Viterbi algorithm. Each Edge in the lattice represents a candidate token with associated costs (word cost + connection cost).
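
A simplified sketch of the forward pass over such a lattice, using hypothetical types (the real implementation in viterbi.rs also tracks context IDs, connection costs, unknown words, and mode penalties):

// edges[i] holds candidate tokens that start at byte offset i.
struct Edge {
    len: usize,     // token length in bytes
    word_cost: i64, // cost from the dictionary
}

// Minimum-cost path from offset 0 to the end of the text.
// Connection costs are omitted for brevity; the real DP keys its state
// on (position, context ID) so they can be added per transition.
fn best_path_cost(edges: &[Vec<Edge>], text_len: usize) -> i64 {
    let mut best = vec![i64::MAX; text_len + 1];
    best[0] = 0;
    for i in 0..text_len {
        if best[i] == i64::MAX {
            continue;
        }
        for e in &edges[i] {
            let j = i + e.len;
            let cost = best[i] + e.word_cost;
            if j <= text_len && cost < best[j] {
                best[j] = cost;
            }
        }
    }
    best[text_len]
}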

NBestGenerator

Generates N-best segmentation paths using the Forward-DP Backward-A* algorithm. This enables applications to consider alternative segmentations beyond the single best path.

Mode

Controls tokenization behavior:

  • Normal: Standard tokenization using the optimal Viterbi path
  • Decompose: Further splits compound nouns based on configurable Penalty thresholds

Trainer (train feature)

CRF-based dictionary training pipeline using lindera-crf. The training workflow includes:

  1. TrainerConfig: Parses seed dictionary, char.def, feature.def, and rewrite.def
  2. Corpus: Manages training data as labeled examples
  3. FeatureExtractor: Parses feature templates and assigns feature IDs
  4. DictionaryRewriter: Applies MeCab-compatible 3-section rewrite rules
  5. Model: Holds training results and exports dictionary files with cost conversion via to_cost(weight, cost_factor)

Feature Flags

| Feature | Description | Default |
|---|---|---|
| mmap | Memory-mapped file support | Yes |
| build_rs | HTTP download for dictionary sources | No |
| train | CRF-based training (depends on lindera-crf) | No |

API Reference

The API reference is available; please see the following URL:

Lindera Library

The lindera crate is the core morphological analysis library. This section covers configuration, segmentation, token filters, error handling, and API reference.

Configuration

Lindera can read configuration files in YAML format. Set the LINDERA_CONFIG_PATH environment variable to the path of a file like the one below, and the tokenizer's behavior can be configured without writing any Rust code.

segmenter:
  mode: "normal"
  dictionary: "embedded://ipadic"
  # user_dictionary: "./resources/user_dict/ipadic_simple_userdic.csv"

character_filters:
  - kind: "unicode_normalize"
    args:
      kind: "nfkc"
  - kind: "japanese_iteration_mark"
    args:
      normalize_kanji: true
      normalize_kana: true
  - kind: mapping
    args:
       mapping:
         リンデラ: Lindera

token_filters:
  - kind: "japanese_compound_word"
    args:
      tags:
        - "名詞,数"
        - "名詞,接尾,助数詞"
      new_tag: "名詞,数"
  - kind: "japanese_number"
    args:
      tags:
        - "名詞,数"
  - kind: "japanese_stop_tags"
    args:
      tags:
        - "接続詞"
        - "助詞"
        - "助詞,格助詞"
        - "助詞,格助詞,一般"
        - "助詞,格助詞,引用"
        - "助詞,格助詞,連語"
        - "助詞,係助詞"
        - "助詞,副助詞"
        - "助詞,間投助詞"
        - "助詞,並立助詞"
        - "助詞,終助詞"
        - "助詞,副助詞/並立助詞/終助詞"
        - "助詞,連体化"
        - "助詞,副詞化"
        - "助詞,特殊"
        - "助動詞"
        - "記号"
        - "記号,一般"
        - "記号,読点"
        - "記号,句点"
        - "記号,空白"
        - "記号,括弧閉"
        - "その他,間投"
        - "フィラー"
        - "非言語音"
  - kind: "japanese_katakana_stem"
    args:
      min: 3
  - kind: "remove_diacritical_mark"
    args:
      japanese: false

% export LINDERA_CONFIG_PATH=./resources/config/lindera.yml

A configuration file can also be loaded explicitly with TokenizerBuilder::from_file:

use std::path::PathBuf;

use lindera::tokenizer::TokenizerBuilder;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Load tokenizer configuration from file
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("config")
        .join("lindera.yml");

    let builder = TokenizerBuilder::from_file(&path)?;

    let tokenizer = builder.build()?;

    let text = "Ｌｉｎｄｅｒａは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {text}");

    let tokens = tokenizer.tokenize(&text)?;

    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}

Segmenter

The Segmenter is the core component that performs morphological analysis. It uses the Viterbi algorithm to find the optimal segmentation of input text based on a dictionary and cost model.

Creating a Segmenter

A Segmenter requires three components:

  • Mode - the tokenization strategy (Normal or Decompose)
  • Dictionary - a system dictionary for morphological analysis
  • UserDictionary (optional) - a supplementary dictionary for custom words

#![allow(unused)]
fn main() {
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;

let dictionary = load_dictionary("embedded://ipadic")?;
let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
}

Tokenization Modes

Mode::Normal

Standard tokenization based on the dictionary entries. Words are segmented faithfully according to what is registered in the dictionary.

#![allow(unused)]
fn main() {
use lindera::mode::Mode;

let mode = Mode::Normal;
}

Mode::Decompose

Decomposes compound nouns into their constituent parts. This mode applies a configurable penalty to long compound words, encouraging the segmenter to split them into shorter components.

For example, with Mode::Normal, the compound word "関西国際空港" remains as a single token, while with Mode::Decompose, it is split into "関西", "国際", and "空港".

#![allow(unused)]
fn main() {
use lindera::mode::Mode;

let mode = Mode::Decompose(Default::default());
}

Dictionary Loading

Lindera provides the load_dictionary function to load dictionaries from various sources.

Embedded Dictionaries

When built with the appropriate feature flag (e.g., embed-ipadic), dictionaries can be loaded directly from the binary:

#![allow(unused)]
fn main() {
use lindera::dictionary::load_dictionary;

let dictionary = load_dictionary("embedded://ipadic")?;
}

Available embedded dictionary URIs:

  • embedded://ipadic - IPADIC (Japanese)
  • embedded://ipadic-neologd - IPADIC NEologd (Japanese)
  • embedded://unidic - UniDic (Japanese)
  • embedded://ko-dic - ko-dic (Korean)
  • embedded://cc-cedict - CC-CEDICT (Chinese)
  • embedded://jieba - Jieba (Chinese)

External Dictionaries

Pre-built dictionary directories can be loaded from the filesystem:

#![allow(unused)]
fn main() {
use lindera::dictionary::load_dictionary;

let dictionary = load_dictionary("/path/to/dictionary")?;
}

Using with Tokenizer

The Segmenter is typically used through the Tokenizer, which adds support for character filters and token filters:

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let tokens = tokenizer.tokenize(text)?;

    for token in tokens {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}

Token Filters

Token filters are post-processing components applied to tokens after segmentation. They can modify, remove, or transform tokens to suit specific use cases such as search indexing, text normalization, or linguistic analysis.

Available Token Filters

Japanese

| Filter | Description |
|---|---|
| japanese_compound_word | Combines consecutive tokens matching specified part-of-speech tags into compound words |
| japanese_number | Normalizes Japanese number representations (e.g., converts Kanji numerals) |
| japanese_stop_tags | Removes tokens with specified part-of-speech tags |
| japanese_katakana_stem | Stems Katakana words by removing trailing prolonged sound marks |
| japanese_base_form | Normalizes tokens to their base (dictionary) form |
| japanese_keep_tags | Keeps only tokens matching specified part-of-speech tags, removing all others |
| japanese_reading_form | Converts token text to its reading form (Katakana) |
| japanese_kana | Converts between Hiragana and Katakana |

Korean

| Filter | Description |
|---|---|
| korean_stop_tags | Removes Korean tokens with specified part-of-speech tags |
| korean_keep_tags | Keeps only Korean tokens matching specified part-of-speech tags |
| korean_reading_form | Converts Korean tokens to their reading form |

General

| Filter | Description |
|---|---|
| lowercase | Converts token text to lowercase |
| uppercase | Converts token text to uppercase |
| mapping | Maps token text according to a user-defined mapping table |
| length | Filters tokens by text length (minimum and/or maximum) |
| stop_words | Removes tokens matching a list of stop words |
| keep_words | Keeps only tokens matching a list of specified words |
| remove_diacritical_mark | Removes diacritical marks (accent marks) from token text |

YAML Configuration

Token filters can be configured in the YAML configuration file under the token_filters key:

token_filters:
  - kind: "japanese_stop_tags"
    args:
      tags:
        - "助詞"
        - "助動詞"
        - "記号"
  - kind: "japanese_katakana_stem"
    args:
      min: 3
  - kind: "lowercase"
  - kind: "length"
    args:
      min: 2

Rust API

Token filters can also be created and applied programmatically:

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::token_filter::BoxTokenFilter;
use lindera::token_filter::japanese_stop_tags::JapaneseStopTagsTokenFilter;
use lindera::token_filter::japanese_katakana_stem::JapaneseKatakanaStemTokenFilter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    let mut tokenizer = Tokenizer::new(segmenter);

    // Add token filters
    let stop_tags_filter = JapaneseStopTagsTokenFilter::new(
        vec![
            "助詞".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
        ]
        .into_iter()
        .collect(),
    );
    tokenizer.append_token_filter(BoxTokenFilter::from(stop_tags_filter));

    let katakana_stem_filter = JapaneseKatakanaStemTokenFilter::new(3);
    tokenizer.append_token_filter(BoxTokenFilter::from(katakana_stem_filter));

    // Tokenize with filters applied
    let tokens = tokenizer.tokenize("Linderaは形態素解析エンジンです。")?;

    for token in tokens {
        println!(
            "token: {:?}, details: {:?}",
            token.surface, token.details
        );
    }

    Ok(())
}

The append_token_filter method adds filters in order. Filters are applied sequentially to the token list after segmentation.

Error Handling

Lindera uses a structured error system based on anyhow and thiserror for ergonomic error handling throughout the library.

LinderaResult

The LinderaResult<T> type alias is the standard return type for fallible operations in Lindera:

#![allow(unused)]
fn main() {
pub type LinderaResult<T> = Result<T, LinderaError>;
}

LinderaError

LinderaError is the main error type, containing an error kind and a source error with full context:

#![allow(unused)]
fn main() {
pub struct LinderaError {
    pub kind: LinderaErrorKind,
    source: anyhow::Error,
}
}

The add_context method allows attaching additional context to an error:

#![allow(unused)]
fn main() {
let error = error.add_context("failed to load dictionary from /path/to/dict");
}

LinderaErrorKind

LinderaErrorKind is an enum that categorizes errors:

| Kind | Description |
|---|---|
| Io | I/O errors (file read/write, network) |
| Parse | Parsing errors (invalid input format) |
| Serialize | Serialization errors |
| Deserialize | Deserialization errors |
| Content | Invalid content or data errors |
| Args | Invalid argument errors |
| Decode | Decoding errors |
| NotFound | Resource not found (e.g., dictionary file missing) |
| Build | Dictionary build errors |
| Dictionary | Dictionary-related errors |
| Mode | Invalid tokenization mode errors |
| Algorithm | Algorithm errors (e.g., Viterbi failure) |
| FeatureDisabled | Attempted to use a feature that is not enabled |

Creating Errors

Use LinderaErrorKind::with_error to create an error from a kind and a source:

#![allow(unused)]
fn main() {
use lindera::error::LinderaErrorKind;

let error = LinderaErrorKind::Io.with_error(anyhow::anyhow!("file not found: config.yml"));
}

Using the ? Operator

Since Lindera functions return LinderaResult, the ? operator can propagate errors naturally:

#![allow(unused)]
fn main() {
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn analyze(text: &str) -> LinderaResult<Vec<String>> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let tokens = tokenizer.tokenize(text)?;
    Ok(tokens.iter().map(|t| t.surface.as_ref().to_string()).collect())
}
}

Error Handling Patterns

Matching on Error Kind

#![allow(unused)]
fn main() {
use lindera::dictionary::load_dictionary;
use lindera::error::LinderaErrorKind;

match load_dictionary("/path/to/dictionary") {
    Ok(dict) => { /* use dictionary */ }
    Err(e) if e.kind() == LinderaErrorKind::NotFound => {
        eprintln!("Dictionary not found: {}", e);
    }
    Err(e) if e.kind() == LinderaErrorKind::Io => {
        eprintln!("I/O error loading dictionary: {}", e);
    }
    Err(e) => {
        eprintln!("Unexpected error: {}", e);
    }
}
}

Converting from External Errors

#![allow(unused)]
fn main() {
use lindera::error::LinderaErrorKind;

let content = std::fs::read_to_string("config.yml")
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))?;
}

API Reference

The API reference is available; please see the following URL:

Lindera CLI

A morphological analysis command-line interface for Lindera.

  • Installation - Install or build the CLI
  • Commands - Command reference for tokenize, build, train, and export
  • Tutorial - Step-by-step guide to get started

Installation

Install via Cargo

You can install the binary via cargo:

% cargo install lindera-cli

Download from GitHub Releases

Alternatively, you can download a pre-built binary from the release page.

Obtaining Dictionaries

Lindera does not bundle dictionaries with the binary. You need to download a pre-built dictionary separately from the GitHub Releases page:

# Example: download and extract the IPADIC dictionary
% curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
% unzip lindera-ipadic-<version>.zip -d /path/to/ipadic

Then specify the dictionary path when using the CLI:

% echo "関西国際空港限定トートバッグ" | lindera tokenize --dict /path/to/ipadic

Build from Source

Build without dictionaries (default)

Build a binary containing only the tokenizer and trainer without embedded dictionaries:

% cargo build --release

Build with all features

% cargo build --release --all-features

Build with Embedded Dictionaries (Advanced)

For advanced users who want to embed dictionaries directly into the binary, use the embed-* feature flags. This eliminates the need for external dictionary files at runtime but increases the binary size.

IPADIC (Japanese dictionary)

% cargo build --release --features=embed-ipadic

IPADIC NEologd (Japanese dictionary)

% cargo build --release --features=embed-ipadic-neologd

UniDic (Japanese dictionary)

% cargo build --release --features=embed-unidic

ko-dic (Korean dictionary)

% cargo build --release --features=embed-ko-dic

CC-CEDICT (Chinese dictionary)

% cargo build --release --features=embed-cc-cedict

Jieba (Chinese dictionary)

% cargo build --release --features=embed-jieba

[!TIP] After building with an embed-* feature flag, use the embedded:// scheme to load the embedded dictionary:

% echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic

See Feature Flags for details.

Commands

The Lindera CLI provides four main commands:

  • tokenize - Perform morphological analysis on text
  • build - Build a dictionary from source CSV files
  • train - Train a CRF model from annotated corpus data
  • export - Export a trained model to dictionary format

tokenize

Perform morphological analysis (tokenization) on Japanese, Chinese, or Korean text using various dictionaries.

Parameters

  • --dict / -d: Dictionary path or URI (required)
    • File path: /path/to/dictionary
    • Embedded: embedded://ipadic, embedded://unidic, etc.
  • --output / -o: Output format (default: mecab)
    • mecab: MeCab-compatible format with part-of-speech info
    • wakati: Space-separated tokens only
    • json: Detailed JSON format with all token information
  • --user-dict / -u: User dictionary path (optional)
  • --mode / -m: Tokenization mode (default: normal)
    • normal: Standard tokenization
    • decompose: Decompose compound words
  • --char-filter / -c: Character filter configuration (JSON)
  • --token-filter / -t: Token filter configuration (JSON)
  • --nbest / -N: Number of N-best results to return (default: 1). When set to 2 or more, N-best output is enabled.
  • --nbest-unique: Deduplicate N-best results by removing paths that produce the same segmentation.
  • --nbest-cost-threshold: Maximum cost difference from the best path. Only paths with cost within best_cost + threshold are returned.
  • Input file: Optional file path (default: stdin)
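
The N-best flags can be combined; for example, using the flags listed above:

echo "すもももももももものうち" | lindera tokenize \
  --dict embedded://ipadic \
  -N 5 --nbest-unique --nbest-cost-threshold 1000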

Basic usage

# Tokenize text using a dictionary directory
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /path/to/dictionary

# Tokenize text using embedded dictionary
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic

# Tokenize with different output format
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json

# Tokenize text from file
lindera tokenize \
  --dict /path/to/dictionary \
  --output wakati \
  input.txt

Examples with external dictionaries

Tokenize with external IPADIC (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with external IPADIC Neologd (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with external UniDic (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-unidic-2.1.2
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Tokenize with external ko-dic (Korean dictionary)

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수      NNB,*,F,수,*,*,*,*
있      VV,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       SF,*,*,*,*,*,*,*
EOS

Tokenize with external CC-CEDICT (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      *,*,*,*,*,*,*,*
EOS

Tokenize with external Jieba (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict /tmp/lindera-jieba-0.1.1

Examples with embedded dictionaries

Lindera can include dictionaries directly in the binary when built with specific feature flags. This allows tokenization without external dictionary files.

Tokenize with embedded IPADIC (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC dictionary in the binary, you must build with the --features=embed-ipadic option.

Tokenize with embedded UniDic (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://unidic
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

NOTE: To include UniDic dictionary in the binary, you must build with the --features=embed-unidic option.

Tokenize with embedded IPADIC NEologd (Japanese dictionary)

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic-neologd
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC NEologd dictionary in the binary, you must build with the --features=embed-ipadic-neologd option.

Tokenize with embedded ko-dic (Korean dictionary)

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict embedded://ko-dic
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수      NNB,*,F,수,*,*,*,*
있      VV,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       SF,*,*,*,*,*,*,*
EOS

NOTE: To include ko-dic dictionary in the binary, you must build with the --features=embed-ko-dic option.

Tokenize with embedded CC-CEDICT (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict embedded://cc-cedict
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      *,*,*,*,*,*,*,*
EOS

NOTE: To include CC-CEDICT dictionary in the binary, you must build with the --features=embed-cc-cedict option.

Tokenize with embedded Jieba (Chinese dictionary)

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict embedded://jieba

NOTE: To include the Jieba dictionary in the binary, you must build with the --features=embed-jieba option.

User dictionary examples

Lindera supports user dictionaries to add custom words alongside system dictionaries. User dictionaries can be in CSV or binary format.
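
For example, the simple CSV format has three fields per line: surface, part-of-speech, and reading. The two custom entries used by the examples below look like this (matching the bundled ipadic_simple_userdic.csv):

東京スカイツリー,カスタム名詞,トウキョウスカイツリー
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ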

Use user dictionary (CSV format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict embedded://ipadic \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Use user dictionary (Binary format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920 \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Tokenization modes

Lindera provides two tokenization modes: normal and decompose.

Normal mode (default)

Tokenizes faithfully based on words registered in the dictionary:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode normal
関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    名詞,一般,*,*,*,*,*,*,*
EOS

Decompose mode

Tokenizes compound noun words additionally:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode decompose
関西    名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港    名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    名詞,一般,*,*,*,*,*,*,*
EOS

Output formats

Lindera provides three output formats: mecab, wakati and json.

MeCab format (default)

Outputs results in MeCab-compatible format with part-of-speech information:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output mecab
お待ち  名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し  動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て  助詞,接続助詞,*,*,*,*,て,テ,テ
おり  動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます  助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。  記号,句点,*,*,*,*,。,。,。
EOS

Wakati format

Outputs only the token text separated by spaces:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output wakati
お待ち し て おり ます 。

JSON format

Outputs detailed token information in JSON format:

% echo "お待ちしております。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json
[
  {
    "base_form": "お待ち",
    "byte_end": 9,
    "byte_start": 0,
    "conjugation_form": "*",
    "conjugation_type": "*",
    "part_of_speech": "名詞",
    "part_of_speech_subcategory_1": "サ変接続",
    "part_of_speech_subcategory_2": "*",
    "part_of_speech_subcategory_3": "*",
    "pronunciation": "オマチ",
    "reading": "オマチ",
    "surface": "お待ち",
    "word_id": 14698
  },
  ...
]

N-Best tokenization

Lindera supports N-Best tokenization, which returns the top N tokenization candidates ordered by cost (lower cost = better). This is based on the Forward-DP Backward-A* algorithm, compatible with MeCab's N-Best implementation.

Basic N-Best example

% echo "すもももももももものうち" | lindera tokenize \
  --dict embedded://ipadic \
  -N 3
NBEST 1 (cost=7546)
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
NBEST 2 (cost=7914)
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
NBEST 3 (cost=10060)
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
も      助詞,係助詞,*,*,*,*,も,モ,モ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

N-Best with unique results

When the same segmentation appears in multiple paths (differing only in internal Viterbi states), use --nbest-unique to deduplicate:

% echo "営業部長谷川です" | lindera tokenize \
  --dict embedded://ipadic \
  -N 5 --nbest-unique -o wakati
NBEST 1 (cost=15760)
営業 部長 谷川 です
NBEST 2 (cost=17758)
営業 部長 谷 川 です
NBEST 3 (cost=18816)
営業 部 長谷川 です
NBEST 4 (cost=19320)
営業 部長 谷川 で す
NBEST 5 (cost=20814)
営業 部 長谷 川 です

N-Best with cost threshold

Use --nbest-cost-threshold to limit results to paths within a certain cost range of the best path:

% echo "営業部長谷川です" | lindera tokenize \
  --dict embedded://ipadic \
  -N 10 --nbest-unique --nbest-cost-threshold 5000 -o wakati
NBEST 1 (cost=15760)
営業 部長 谷川 です
NBEST 2 (cost=17758)
営業 部長 谷 川 です
NBEST 3 (cost=18816)
営業 部 長谷川 です

Only 3 results are returned because the remaining candidates exceed 15760 + 5000 = 20760.

Advanced tokenization with filters

Lindera provides an analytical framework that combines character filters, tokenizers, and token filters for advanced text processing. Filters are configured using JSON.

% echo "すもももももももものうち" | lindera tokenize \
  --dict embedded://ipadic \
  --char-filter 'unicode_normalize:{"kind":"nfkc"}' \
  --token-filter 'japanese_keep_tags:{"tags":["名詞,一般"]}'
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
EOS

build

Build (compile) a morphological analysis dictionary from source CSV files for use with Lindera.

Build parameters

  • --src / -s: Source directory containing dictionary CSV files (or single CSV file for user dictionary)
  • --dest / -d: Destination directory for compiled dictionary output
  • --metadata / -m: Metadata configuration file (metadata.json) that defines dictionary structure
  • --user / -u: Build user dictionary instead of system dictionary (optional flag)

Dictionary types

System dictionary

A full morphological analysis dictionary containing:

  • Lexicon entries (word definitions)
  • Connection cost matrix
  • Unknown word handling rules
  • Character type definitions

User dictionary

A supplementary dictionary for custom words that works alongside a system dictionary.

Examples

Build IPADIC (Japanese dictionary)

# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ipadic-2.7.0-20250920 \
  --dest /tmp/lindera-ipadic-2.7.0-20250920 \
  --metadata ./lindera-ipadic/metadata.json

Build IPADIC NEologd (Japanese dictionary)

% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp

% lindera build \
  --src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
  --dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
  --metadata ./lindera-ipadic-neologd/metadata.json

Build UniDic (Japanese dictionary)

% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp

% lindera build \
  --src /tmp/unidic-mecab-2.1.2 \
  --dest /tmp/lindera-unidic-2.1.2 \
  --metadata ./lindera-unidic/metadata.json

Build CC-CEDICT (Chinese dictionary)

% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp

% lindera build \
  --src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
  --dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
  --metadata ./lindera-cc-cedict/metadata.json

Build Jieba (Chinese dictionary)

% curl -L -o /tmp/mecab-jieba-0.1.1.tar.gz "https://lindera.dev/mecab-jieba-0.1.1.tar.gz"
% tar zxvf /tmp/mecab-jieba-0.1.1.tar.gz -C /tmp

% lindera build \
  --src /tmp/mecab-jieba-0.1.1/dict-src \
  --dest /tmp/lindera-jieba-0.1.1 \
  --metadata ./lindera-jieba/metadata.json

Build ko-dic (Korean dictionary)

% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp

% lindera build \
  --src /tmp/mecab-ko-dic-2.1.1-20180720 \
  --dest /tmp/lindera-ko-dic-2.1.1-20180720 \
  --metadata ./lindera-ko-dic/metadata.json

Build user dictionaries

Build IPADIC user dictionary (Japanese)

For more details about the user dictionary format, see the user dictionary documentation.

% lindera build \
  --src ./resources/user_dict/ipadic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ipadic/metadata.json \
  --user

Build UniDic user dictionary (Japanese)

For more details about the user dictionary format, see the user dictionary documentation.

% lindera build \
  --src ./resources/user_dict/unidic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-unidic/metadata.json \
  --user

Build CC-CEDICT user dictionary (Chinese)

For more details about the user dictionary format, see the user dictionary documentation.

% lindera build \
  --src ./resources/user_dict/cc-cedict_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-cc-cedict/metadata.json \
  --user

Build Jieba user dictionary (Chinese)

For more details about the user dictionary format, see the user dictionary documentation.

% lindera build \
  --src ./resources/user_dict/jieba_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-jieba/metadata.json \
  --user

Build ko-dic user dictionary (Korean)

For more details about the user dictionary format, see the user dictionary documentation.

% lindera build \
  --src ./resources/user_dict/ko-dic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ko-dic/metadata.json \
  --user

train

Train a new morphological analysis model from annotated corpus data. This command requires the train feature flag, which is enabled by default.

Train parameters

  • --seed / -s: Seed lexicon file (CSV format) to be weighted
  • --corpus / -c: Training corpus (annotated text)
  • --char-def / -C: Character definition file (char.def)
  • --unk-def / -u: Unknown word definition file (unk.def) to be weighted
  • --feature-def / -f: Feature definition file (feature.def)
  • --rewrite-def / -r: Rewrite rule definition file (rewrite.def)
  • --output / -o: Output model file
  • --lambda / -l: L1 regularization (0.0-1.0) (default: 0.01)
  • --max-iterations / -i: Maximum number of iterations for training (default: 100)
  • --max-threads / -t: Maximum number of threads (defaults to CPU core count, auto-adjusted based on dataset size)

Basic workflow

1. Prepare training files

Seed lexicon file (seed.csv):

The seed lexicon file contains initial dictionary entries used for training the CRF model. Each line represents a word entry with comma-separated fields:

  • Surface
  • Left context ID
  • Right context ID
  • Word cost
  • Part-of-speech tags (multiple fields)
  • Base form
  • Reading (katakana)
  • Pronunciation

Note: The exact field definitions differ between dictionary formats (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.

外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン

Training corpus (corpus.txt):

The training corpus file contains annotated text data used to train the CRF model. Each line consists of:

  • A surface form (word) followed by a tab character
  • Comma-separated morphological features (part-of-speech tags, base form, reading, pronunciation)
  • Sentences are separated by "EOS" (End Of Sentence) markers

外国	名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人	名詞,接尾,一般,*,*,*,人,ジン,ジン
参政	名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権	名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS

For detailed information about file formats and advanced features, see TRAINER_README.md.

2. Train model

lindera train \
  --seed ./resources/training/seed.csv \
  --corpus ./resources/training/corpus.txt \
  --unk-def ./resources/training/unk.def \
  --char-def ./resources/training/char.def \
  --feature-def ./resources/training/feature.def \
  --rewrite-def ./resources/training/rewrite.def \
  --output /tmp/lindera/training/model.dat \
  --lambda 0.01 \
  --max-iterations 100

3. Training results

The trained model will contain:

  • Existing words: All seed dictionary records with newly learned weights
  • New words: Words from the corpus not in the seed dictionary, added with appropriate weights

export

Export a trained model file to Lindera dictionary format files. This feature requires building with the train feature flag enabled.

Export parameters

  • --model / -m: Path to the trained model file (.dat format)
  • --output / -o: Directory to output the dictionary files
  • --metadata: Optional metadata.json file to update with trained model information
  • --cost-factor: Override cost factor for weight-to-cost conversion (default: value from trained model, typically 700)

Output files

The export command creates the following dictionary files in the output directory:

  • lex.csv: Lexicon file with learned weights converted to MeCab-compatible costs
  • matrix.def: Dense connection cost matrix covering all (right_id, left_id) pairs
  • unk.def: Unknown word definitions
  • char.def: Character type definitions
  • feature.def: Feature template definitions (copied from trained model)
  • rewrite.def: Feature rewrite rules (copied from trained model)
  • left-id.def: Left context ID to feature string mapping
  • right-id.def: Right context ID to feature string mapping
  • metadata.json: Updated metadata file (if --metadata option is provided)

Complete workflow example

1. Train model

lindera train \
  --seed ./resources/training/seed.csv \
  --corpus ./resources/training/corpus.txt \
  --unk-def ./resources/training/unk.def \
  --char-def ./resources/training/char.def \
  --feature-def ./resources/training/feature.def \
  --rewrite-def ./resources/training/rewrite.def \
  --output /tmp/lindera/training/model.dat \
  --lambda 0.01 \
  --max-iterations 100

2. Export to dictionary format

lindera export \
  --model /tmp/lindera/training/model.dat \
  --metadata ./resources/training/metadata.json \
  --output /tmp/lindera/training/dictionary

3. Build dictionary

lindera build \
  --src /tmp/lindera/training/dictionary \
  --dest /tmp/lindera/training/compiled_dictionary \
  --metadata /tmp/lindera/training/dictionary/metadata.json

4. Use trained dictionary

echo "これは外国人参政権です。" | lindera tokenize \
  -d /tmp/lindera/training/compiled_dictionary

Metadata update feature

When the --metadata option is provided, the export command will:

  1. Read the base metadata.json file to preserve existing configuration
  2. Update specific fields with values from the trained model:
    • default_left_context_id: Maximum left context ID from trained model
    • default_right_context_id: Maximum right context ID from trained model
    • default_word_cost: Calculated from feature weight median
    • model_info: Training statistics including feature count, label count, matrix size, iterations, regularization, version, and timestamp
  3. Preserve existing settings such as dictionary name, character encoding, schema definitions, and other user-defined configuration
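
For illustration, the fields updated by the export step might look like the following sketch. All values and the exact key names inside model_info are hypothetical; the real metadata.json also retains the full dictionary configuration:

{
  "default_left_context_id": 1316,
  "default_right_context_id": 1316,
  "default_word_cost": -3000,
  "model_info": {
    "feature_count": 12345,
    "label_count": 42,
    "matrix_size": 1317,
    "iterations": 100,
    "regularization": 0.01,
    "version": "0.0.0",
    "timestamp": "2025-01-01T00:00:00Z"
  }
}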

Tutorial

This tutorial walks you through the basic usage of the Lindera CLI, from installation to advanced text processing.

1. Install the CLI

Install Lindera CLI with the embedded IPADIC dictionary:

% cargo install lindera-cli --features=embed-ipadic

Verify the installation:

% lindera --help

2. Basic tokenization with embedded dictionary

Tokenize Japanese text using the embedded IPADIC dictionary:

% echo "東京は日本の首都です。" | lindera tokenize \
  --dict embedded://ipadic

Expected output:

東京    名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
日本    名詞,固有名詞,地域,国,*,*,日本,ニホン,ニホン
の      助詞,連体化,*,*,*,*,の,ノ,ノ
首都    名詞,一般,*,*,*,*,首都,シュト,シュト
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。      記号,句点,*,*,*,*,。,。,。
EOS

3. Try different output formats

Wakati format (word segmentation only)

% echo "東京は日本の首都です。" | lindera tokenize \
  --dict embedded://ipadic \
  --output wakati

Expected output:

東京 は 日本 の 首都 です 。

JSON format (detailed information)

% echo "東京は日本の首都です。" | lindera tokenize \
  --dict embedded://ipadic \
  --output json

This produces a JSON array with detailed token information including byte offsets, part-of-speech tags, readings, and more.

4. Use decompose mode

Decompose mode splits compound nouns into their constituent parts:

% echo "関西国際空港限定トートバッグ" | lindera tokenize \
  --dict embedded://ipadic \
  --mode decompose

Expected output:

関西    名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際    名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港    名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    名詞,一般,*,*,*,*,*,*,*
EOS

Compare with normal mode, where "関西国際空港" remains as a single token.

5. Apply character and token filters

Use Unicode normalization and keep only common nouns:

% echo "Linderaは形態素解析エンジンです。" | lindera tokenize \
  --dict embedded://ipadic \
  --char-filter 'unicode_normalize:{"kind":"nfkc"}' \
  --token-filter 'japanese_keep_tags:{"tags":["名詞,一般","名詞,固有名詞,組織"]}'

Expected output:

Lindera 名詞,固有名詞,組織,*,*,*,*,*,*
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
エンジン        名詞,一般,*,*,*,*,エンジン,エンジン,エンジン
EOS

The Unicode normalization converts full-width characters to half-width, and the token filter keeps only tokens matching the specified part-of-speech tags.

You can also remove unwanted tokens by part-of-speech; here japanese_stop_tags drops several particle tags:

% echo "すもももももももものうち" | lindera tokenize \
  --dict embedded://ipadic \
  --token-filter 'japanese_stop_tags:{"tags":["助詞","助詞,係助詞","助詞,連体化"]}'

6. Use user dictionary

Create a CSV file with custom word entries (e.g., my_dict.csv):

東京スカイツリー,カスタム名詞,トウキョウスカイツリー

Tokenize with the user dictionary:

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict embedded://ipadic \
  --user-dict ./my_dict.csv

Without the user dictionary, "東京スカイツリー" would be split into multiple tokens. With the user dictionary, it is recognized as a single token.

Alternatively, use the sample user dictionary bundled in the repository:

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict embedded://ipadic \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.csv

Expected output:

東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Lindera Python

Lindera Python provides Python bindings for the Lindera morphological analysis engine, built with PyO3. It brings Lindera's high-performance tokenization capabilities to the Python ecosystem with support for Python 3.10 and later.

Features

  • Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
  • Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
  • CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires train feature)
  • Multiple tokenization modes: Normal and decompose modes for different analysis granularity
  • N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
  • User dictionaries: Extend system dictionaries with custom vocabulary

Documentation

Installation

Installing from PyPI

Pre-built wheels are available on PyPI:

pip install lindera-python

[!NOTE] The PyPI package does not include dictionaries. See Obtaining Dictionaries below.

Obtaining Dictionaries

Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.

Download from GitHub Releases

Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:

# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic

Building from Source

If you need to build from source (e.g., to enable specific feature flags), the following prerequisites are required:

  • Python 3.10 or later (up to 3.14)
  • Rust toolchain -- Install via rustup
  • maturin -- Python package for building Rust-based Python extensions

Install maturin with pip:

pip install maturin

Development Build

Build and install lindera-python in development mode:

cd lindera-python
maturin develop

Or use the project Makefile:

make python-develop

Build with Training Support

The train feature enables CRF-based dictionary training functionality. It is enabled by default:

maturin develop --features train

Feature Flags

  • train: CRF training functionality (default: enabled)
  • embed-ipadic: Embed Japanese dictionary (IPADIC) into the binary (default: disabled)
  • embed-unidic: Embed Japanese dictionary (UniDic) into the binary (default: disabled)
  • embed-ipadic-neologd: Embed Japanese dictionary (IPADIC NEologd) into the binary (default: disabled)
  • embed-ko-dic: Embed Korean dictionary (ko-dic) into the binary (default: disabled)
  • embed-cc-cedict: Embed Chinese dictionary (CC-CEDICT) into the binary (default: disabled)
  • embed-jieba: Embed Chinese dictionary (Jieba) into the binary (default: disabled)
  • embed-cjk: Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary (default: disabled)

Multiple features can be combined:

maturin develop --features "train,embed-ipadic,embed-ko-dic"

[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding embed-* feature flag and load it using the embedded:// scheme:

dictionary = load_dictionary("embedded://ipadic")

See Feature Flags for details.

Verifying the Installation

After installation, verify that lindera is available in Python:

import lindera

print(lindera.version())

Quick Start

This guide shows how to tokenize text using lindera-python.

Basic Tokenization

The recommended way to create a tokenizer is through TokenizerBuilder:

from lindera import TokenizerBuilder

builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")
tokenizer = builder.build()

tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")
for token in tokens:
    print(f"{token.surface}\t{','.join(token.details)}")

Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.

Expected output:

関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK

Method Chaining

TokenizerBuilder supports method chaining for concise configuration:

from lindera import TokenizerBuilder

tokenizer = (
    TokenizerBuilder()
    .set_mode("normal")
    .set_dictionary("/path/to/ipadic")
    .build()
)

tokens = tokenizer.tokenize("すもももももももものうち")
for token in tokens:
    print(f"{token.surface}\t{token.get_detail(0)}")

Accessing Token Properties

Each token exposes the following properties:

from lindera import TokenizerBuilder

tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()
tokens = tokenizer.tokenize("東京タワー")

for token in tokens:
    print(f"Surface: {token.surface}")
    print(f"Byte range: {token.byte_start}..{token.byte_end}")
    print(f"Position: {token.position}")
    print(f"Word ID: {token.word_id}")
    print(f"Unknown: {token.is_unknown}")
    print(f"Details: {token.details}")
    print()

N-best Tokenization

Retrieve multiple tokenization candidates ranked by cost:

from lindera import TokenizerBuilder

tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()
results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)

for tokens, cost in results:
    surfaces = [t.surface for t in tokens]
    print(f"Cost {cost}: {' / '.join(surfaces)}")

Tokenizer API

TokenizerBuilder

TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

TokenizerBuilder()

Creates a new builder with default configuration.

from lindera import TokenizerBuilder

builder = TokenizerBuilder()

TokenizerBuilder().from_file(file_path)

Loads configuration from a JSON file and returns a new builder.

builder = TokenizerBuilder().from_file("config.json")

Configuration Methods

All setter methods return self for method chaining.

set_mode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units
builder.set_mode("normal")

set_dictionary(path)

Sets the system dictionary path or URI.

# Use an embedded dictionary
builder.set_dictionary("embedded://ipadic")

# Use an external dictionary
builder.set_dictionary("/path/to/dictionary")

set_user_dictionary(uri)

Sets the user dictionary URI.

builder.set_user_dictionary("/path/to/user_dictionary")

set_keep_whitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.set_keep_whitespace(True)

append_character_filter(kind, args=None)

Appends a character filter to the preprocessing pipeline.

builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

append_token_filter(kind, args=None)

Appends a token filter to the postprocessing pipeline.

builder.append_token_filter("lowercase", {})

Build

build()

Builds and returns a Tokenizer with the configured settings.

tokenizer = builder.build()

Tokenizer

Tokenizer performs morphological analysis on text.

Creating a Tokenizer

Tokenizer(dictionary, mode="normal", user_dictionary=None)

Creates a tokenizer directly from a loaded dictionary.

from lindera import Tokenizer, load_dictionary

dictionary = load_dictionary("embedded://ipadic")
tokenizer = Tokenizer(dictionary, mode="normal")

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns a list of Token objects.

tokens = tokenizer.tokenize("形態素解析")

Parameters:

  • text (str): Text to tokenize

Returns: list[Token]

tokenize_nbest(text, n, unique=False, cost_threshold=None)

Returns the N-best tokenization results, each paired with its total path cost.

results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)
for tokens, cost in results:
    print(cost, [t.surface for t in tokens])

Parameters:

  • text (str): Text to tokenize
  • n (int): Number of results to return
  • unique (bool): Deduplicate results (default: False)
  • cost_threshold (int or None): Maximum cost difference from the best path (default: None)

Returns: list[tuple[list[Token], int]]

Token

Token represents a single morphological token.

Properties

  • surface (str): Surface form of the token
  • byte_start (int): Start byte position in the original text
  • byte_end (int): End byte position in the original text
  • position (int): Token position index
  • word_id (int): Dictionary word ID
  • is_unknown (bool): True if the word is not in the dictionary
  • details (list[str] or None): Morphological details (part of speech, reading, etc.)

Token Methods

get_detail(index)

Returns the detail string at the specified index, or None if the index is out of range.

token = tokenizer.tokenize("東京")[0]
pos = token.get_detail(0)        # e.g., "名詞"
subpos = token.get_detail(1)     # e.g., "固有名詞"
reading = token.get_detail(7)    # e.g., "トウキョウ"

Parameters:

  • index (int): Zero-based index into the details list

Returns: str or None

The structure of details depends on the dictionary:

  • IPADIC: [品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
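
Since the layout of details differs per dictionary, it is safer to check is_unknown before indexing into it. A minimal sketch, assuming an IPADIC tokenizer built as in the examples above:

from lindera import TokenizerBuilder

tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()

for token in tokenizer.tokenize("関西国際空港限定トートバッグ"):
    if token.is_unknown:
        # Unknown words carry no dictionary details
        print(token.surface, "UNK")
    else:
        # IPADIC layout: index 0 is the part of speech (品詞), index 7 the reading (読み)
        print(token.surface, token.get_detail(0), token.get_detail(7))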

Dictionary Management

Lindera Python provides functions for loading, building, and managing dictionaries used in morphological analysis.

Loading Dictionaries

System Dictionaries

Use load_dictionary(uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:

from lindera import load_dictionary

dictionary = load_dictionary("/path/to/ipadic")

Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:

dictionary = load_dictionary("embedded://ipadic")

User Dictionaries

User dictionaries add custom vocabulary on top of a system dictionary.

from lindera import load_user_dictionary, Metadata

metadata = Metadata()
user_dict = load_user_dictionary("/path/to/user_dictionary", metadata)

Pass the user dictionary when building a tokenizer:

from lindera import Tokenizer, load_dictionary, load_user_dictionary, Metadata

dictionary = load_dictionary("/path/to/ipadic")
metadata = Metadata()
user_dict = load_user_dictionary("/path/to/user_dictionary", metadata)

tokenizer = Tokenizer(dictionary, mode="normal", user_dictionary=user_dict)

Or via the builder:

from lindera import TokenizerBuilder

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("/path/to/ipadic")
    .set_user_dictionary("/path/to/user_dictionary")
    .build()
)

Building Dictionaries

System Dictionary

Build a system dictionary from source files:

from lindera import build_dictionary, Metadata

metadata = Metadata(name="custom", encoding="UTF-8")
build_dictionary("/path/to/input_dir", "/path/to/output_dir", metadata)

The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).
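
For example, a typical input directory contains the same files that the export command produces (see the CLI section):

/path/to/input_dir
  lex.csv       # lexicon entries
  matrix.def    # connection cost matrix
  unk.def       # unknown word definitions
  char.def      # character type definitions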

User Dictionary

Build a user dictionary from a CSV file:

from lindera import build_user_dictionary, Metadata

metadata = Metadata()
build_user_dictionary("ipadic", "user_words.csv", "/path/to/output_dir", metadata)

The metadata parameter is optional. When omitted, default metadata values are used:

build_user_dictionary("ipadic", "user_words.csv", "/path/to/output_dir")

Metadata

The Metadata class configures dictionary parameters.

Creating Metadata

from lindera import Metadata

# Default metadata
metadata = Metadata()

# Custom metadata
metadata = Metadata(
    name="my_dictionary",
    encoding="UTF-8",
    default_word_cost=-10000,
)

Loading from JSON

metadata = Metadata.from_json_file("metadata.json")

Properties

  • name (str, default: "default"): Dictionary name
  • encoding (str, default: "UTF-8"): Character encoding
  • default_word_cost (int, default: -10000): Default cost for unknown words
  • default_left_context_id (int, default: 1288): Default left context ID
  • default_right_context_id (int, default: 1288): Default right context ID
  • default_field_value (str, default: "*"): Default value for missing fields
  • flexible_csv (bool, default: False): Allow flexible CSV parsing
  • skip_invalid_cost_or_id (bool, default: False): Skip entries with invalid cost or ID
  • normalize_details (bool, default: False): Normalize morphological details
  • dictionary_schema (Schema, default: IPADIC schema): Schema for the main dictionary
  • user_dictionary_schema (Schema, default: minimal schema): Schema for user dictionaries

All properties support both getting and setting:

metadata = Metadata()
metadata.name = "custom_dict"
metadata.encoding = "EUC-JP"
print(metadata.name)  # "custom_dict"

to_dict()

Returns a dictionary representation of the metadata:

metadata = Metadata(name="test")
print(metadata.to_dict())

Text Processing Pipeline

Lindera Python supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.

Input Text
  --> Character Filters (preprocessing)
  --> Tokenization
  --> Token Filters (postprocessing)
  --> Output Tokens

Character Filters

Character filters transform the input text before tokenization.

unicode_normalize

Applies Unicode normalization to the input text.

from lindera import TokenizerBuilder

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_character_filter("unicode_normalize", {"kind": "nfkc"})
    .build()
)

Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".

mapping

Replaces characters or strings according to a mapping table.

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_character_filter("mapping", {
        "mapping": {
            "\u30fc": "-",
            "\uff5e": "~",
        }
    })
    .build()
)

japanese_iteration_mark

Resolves Japanese iteration marks (odoriji) into their full forms.

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_character_filter("japanese_iteration_mark", {
        "normalize_kanji": True,
        "normalize_kana": True,
    })
    .build()
)

Token Filters

Token filters transform or remove tokens after tokenization.

lowercase

Converts token surface forms to lowercase.

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_token_filter("lowercase", {})
    .build()
)

japanese_base_form

Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_token_filter("japanese_base_form", {})
    .build()
)

japanese_stop_tags

Removes tokens whose part-of-speech matches any of the specified tags.

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_token_filter("japanese_stop_tags", {
        "tags": ["助詞", "助動詞"],
    })
    .build()
)

japanese_keep_tags

Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.

tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_token_filter("japanese_keep_tags", {
        "tags": ["名詞"],
    })
    .build()
)

Complete Pipeline Example

The following example combines multiple character filters and token filters into a single pipeline:

from lindera import TokenizerBuilder

tokenizer = (
    TokenizerBuilder()
    .set_mode("normal")
    .set_dictionary("embedded://ipadic")
    # Preprocessing
    .append_character_filter("unicode_normalize", {"kind": "nfkc"})
    .append_character_filter("japanese_iteration_mark", {
        "normalize_kanji": True,
        "normalize_kana": True,
    })
    # Postprocessing
    .append_token_filter("japanese_base_form", {})
    .append_token_filter("japanese_stop_tags", {
        "tags": ["助詞", "助動詞", "記号"],
    })
    .append_token_filter("lowercase", {})
    .build()
)

tokens = tokenizer.tokenize("Linderaは形態素解析を行うライブラリです。")
for token in tokens:
    print(f"{token.surface}\t{','.join(token.details)}")

In this pipeline:

  1. unicode_normalize converts full-width characters to half-width (NFKC normalization)
  2. japanese_iteration_mark resolves iteration marks
  3. japanese_base_form converts inflected tokens to base form
  4. japanese_stop_tags removes particles, auxiliary verbs, and symbols
  5. lowercase normalizes alphabetic characters to lowercase

Training

Lindera Python supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-python with the train feature enabled (enabled by default):

maturin develop --features train

Training a Model

Use lindera.train() to train a CRF model from a seed lexicon and annotated corpus:

import lindera

lindera.train(
    seed="resources/training/seed.csv",
    corpus="resources/training/corpus.txt",
    char_def="resources/training/char.def",
    unk_def="resources/training/unk.def",
    feature_def="resources/training/feature.def",
    rewrite_def="resources/training/rewrite.def",
    output="/tmp/model.dat",
    lambda_=0.01,
    max_iter=100,
    max_threads=4,
)

Training Parameters

  • seed (str, required): Path to the seed lexicon file (CSV format)
  • corpus (str, required): Path to the annotated training corpus
  • char_def (str, required): Path to the character definition file (char.def)
  • unk_def (str, required): Path to the unknown word definition file (unk.def)
  • feature_def (str, required): Path to the feature definition file (feature.def)
  • rewrite_def (str, required): Path to the rewrite rule definition file (rewrite.def)
  • output (str, required): Output path for the trained model file
  • lambda_ (float, default: 0.01): L1 regularization cost (0.0-1.0)
  • max_iter (int, default: 100): Maximum number of training iterations
  • max_threads (int or None, default: None): Number of threads (None = auto-detect CPU cores)

Exporting a Trained Model

After training, export the model to dictionary source files using lindera.export():

import lindera

lindera.export(
    model="/tmp/model.dat",
    output="/tmp/dictionary_source",
    metadata="resources/training/metadata.json",
)

Export Parameters

  • model (str, required): Path to the trained model file (.dat)
  • output (str, required): Output directory for dictionary source files
  • metadata (str or None, default: None): Path to a base metadata.json file

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (when metadata parameter is provided)

Complete Workflow

The full workflow for training and using a custom dictionary:

import lindera

# Step 1: Train the CRF model
lindera.train(
    seed="resources/training/seed.csv",
    corpus="resources/training/corpus.txt",
    char_def="resources/training/char.def",
    unk_def="resources/training/unk.def",
    feature_def="resources/training/feature.def",
    rewrite_def="resources/training/rewrite.def",
    output="/tmp/model.dat",
    lambda_=0.01,
    max_iter=100,
)

# Step 2: Export to dictionary source files
lindera.export(
    model="/tmp/model.dat",
    output="/tmp/dictionary_source",
    metadata="resources/training/metadata.json",
)

# Step 3: Build the dictionary from exported source files
metadata = lindera.Metadata.from_json_file("/tmp/dictionary_source/metadata.json")
lindera.build_dictionary("/tmp/dictionary_source", "/tmp/dictionary", metadata)

# Step 4: Use the trained dictionary
tokenizer = (
    lindera.TokenizerBuilder()
    .set_dictionary("/tmp/dictionary")
    .set_mode("normal")
    .build()
)

tokens = tokenizer.tokenize("形態素解析のテスト")
for token in tokens:
    print(f"{token.surface}\t{','.join(token.details)}")

Lindera Node.js

Lindera Node.js provides Node.js bindings for the Lindera morphological analysis engine, built with NAPI-RS. It brings Lindera's high-performance tokenization capabilities to the Node.js ecosystem with support for Node.js 18 and later.

Features

  • Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
  • Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
  • CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires train feature)
  • Multiple tokenization modes: Normal and decompose modes for different analysis granularity
  • N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
  • User dictionaries: Extend system dictionaries with custom vocabulary
  • TypeScript support: Full type definitions included out of the box

Documentation

Installation

Installing from npm

Pre-built packages will be available on npm:

npm install lindera-nodejs

[!NOTE] The npm package does not include dictionaries. See Obtaining Dictionaries below. For browser/WASM usage, see lindera-wasm.

Building from Source

Prerequisites

  • Node.js 18 or later (LTS versions recommended)
  • Rust toolchain -- Install via rustup
  • NAPI-RS CLI -- CLI tool for building native Node.js addons in Rust

Install the NAPI-RS CLI globally:

npm install -g @napi-rs/cli

Obtaining Dictionaries

Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.

Download from GitHub Releases

Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:

# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic

Development Build

Build lindera-nodejs in development mode:

cd lindera-nodejs
npm install
npm run build

Or use the project Makefile:

make nodejs-develop

Build with Training Support

The train feature enables CRF-based dictionary training functionality. It is enabled by default:

npm run build -- --features train

Feature Flags

  • train: CRF training functionality (default: enabled)
  • embed-ipadic: Embed Japanese dictionary (IPADIC) into the binary (default: disabled)
  • embed-unidic: Embed Japanese dictionary (UniDic) into the binary (default: disabled)
  • embed-ipadic-neologd: Embed Japanese dictionary (IPADIC NEologd) into the binary (default: disabled)
  • embed-ko-dic: Embed Korean dictionary (ko-dic) into the binary (default: disabled)
  • embed-cc-cedict: Embed Chinese dictionary (CC-CEDICT) into the binary (default: disabled)
  • embed-jieba: Embed Chinese dictionary (Jieba) into the binary (default: disabled)
  • embed-cjk: Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary (default: disabled)

Multiple features can be combined:

npm run build -- --features "train,embed-ipadic,embed-ko-dic"

[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding embed-* feature flag and load it using the embedded:// scheme:

const dictionary = loadDictionary("embedded://ipadic");

See Feature Flags for details.

Verifying the Installation

After installation, verify that lindera is available in Node.js:

const lindera = require("lindera-nodejs");

console.log(lindera.version());

Or with ES modules:

import { version } from "lindera-nodejs";

console.log(version());

Quick Start

This guide shows how to tokenize text using lindera-nodejs.

Basic Tokenization

The recommended way to create a tokenizer is through TokenizerBuilder:

const { TokenizerBuilder } = require("lindera-nodejs");

const builder = new TokenizerBuilder();
builder.setMode("normal");
builder.setDictionary("/path/to/ipadic");
const tokenizer = builder.build();

const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
for (const token of tokens) {
  console.log(`${token.surface}\t${token.details.join(",")}`);
}

Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.

Expected output:

関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK

Method Chaining

TokenizerBuilder supports method chaining for concise configuration:

const { TokenizerBuilder } = require("lindera-nodejs");

const tokenizer = new TokenizerBuilder()
  .setMode("normal")
  .setDictionary("/path/to/ipadic")
  .build();

const tokens = tokenizer.tokenize("すもももももももものうち");
for (const token of tokens) {
  console.log(`${token.surface}\t${token.getDetail(0)}`);
}

Accessing Token Properties

Each token exposes the following properties:

const { TokenizerBuilder } = require("lindera-nodejs");

const tokenizer = new TokenizerBuilder()
  .setDictionary("/path/to/ipadic")
  .build();

const tokens = tokenizer.tokenize("東京タワー");
for (const token of tokens) {
  console.log(`Surface: ${token.surface}`);
  console.log(`Byte range: ${token.byteStart}..${token.byteEnd}`);
  console.log(`Position: ${token.position}`);
  console.log(`Word ID: ${token.wordId}`);
  console.log(`Unknown: ${token.isUnknown}`);
  console.log(`Details: ${token.details}`);
  console.log();
}

N-best Tokenization

Retrieve multiple tokenization candidates ranked by cost:

const { TokenizerBuilder } = require("lindera-nodejs");

const tokenizer = new TokenizerBuilder()
  .setDictionary("/path/to/ipadic")
  .build();

const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
  const surfaces = tokens.map((t) => t.surface);
  console.log(`Cost ${cost}: ${surfaces.join(" / ")}`);
}

TypeScript

Lindera Node.js includes TypeScript type definitions. All classes and functions are fully typed:

import { TokenizerBuilder, Token } from "lindera-nodejs";

const tokenizer = new TokenizerBuilder()
  .setMode("normal")
  .setDictionary("/path/to/ipadic")
  .build();

const tokens: Token[] = tokenizer.tokenize("形態素解析");
for (const token of tokens) {
  console.log(`${token.surface}: ${token.details?.join(",")}`);
}

Tokenizer API

TokenizerBuilder

TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

new TokenizerBuilder()

Creates a new builder with default configuration.

const { TokenizerBuilder } = require("lindera-nodejs");

const builder = new TokenizerBuilder();

new TokenizerBuilder().fromFile(filePath)

Loads configuration from a JSON file and returns a new builder.

const builder = new TokenizerBuilder().fromFile("config.json");

Configuration Methods

All setter methods return this for method chaining.

setMode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units
builder.setMode("normal");

setDictionary(path)

Sets the system dictionary path or URI.

// Use an embedded dictionary
builder.setDictionary("embedded://ipadic");

// Use an external dictionary
builder.setDictionary("/path/to/dictionary");

setUserDictionary(uri)

Sets the user dictionary URI.

builder.setUserDictionary("/path/to/user_dictionary");

setKeepWhitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.setKeepWhitespace(true);

appendCharacterFilter(kind, args?)

Appends a character filter to the preprocessing pipeline.

builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

appendTokenFilter(kind, args?)

Appends a token filter to the postprocessing pipeline.

builder.appendTokenFilter("lowercase", {});

Build

build()

Builds and returns a Tokenizer with the configured settings.

const tokenizer = builder.build();

Tokenizer

Tokenizer performs morphological analysis on text.

Creating a Tokenizer

new Tokenizer(dictionary, mode?, userDictionary?)

Creates a tokenizer directly from a loaded dictionary.

const { Tokenizer, loadDictionary } = require("lindera-nodejs");

const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns an array of Token objects.

const tokens = tokenizer.tokenize("形態素解析");

Parameters:

  • text (string): Text to tokenize

Returns: Token[]

tokenizeNbest(text, n, unique?, costThreshold?)

Returns the N-best tokenization results, each containing tokens and total path cost.

const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
  console.log(cost, tokens.map((t) => t.surface));
}

Parameters:

  • text (string): Text to tokenize
  • n (number): Number of results to return
  • unique (boolean): Deduplicate results (default: false)
  • costThreshold (number | undefined): Maximum cost difference from the best path (default: undefined)

Returns: Array<{ tokens: Token[], cost: number }>

Token

Token represents a single morphological token.

Properties

  • surface (string): Surface form of the token
  • byteStart (number): Start byte position in the original text
  • byteEnd (number): End byte position in the original text
  • position (number): Token position index
  • wordId (number): Dictionary word ID
  • isUnknown (boolean): true if the word is not in the dictionary
  • details (string[] | null): Morphological details (part of speech, reading, etc.)

Token Methods

getDetail(index)

Returns the detail string at the specified index, or null if the index is out of range.

const token = tokenizer.tokenize("東京")[0];
const pos = token.getDetail(0);      // e.g., "名詞"
const subpos = token.getDetail(1);   // e.g., "固有名詞"
const reading = token.getDetail(7);  // e.g., "トウキョウ"

Parameters:

  • index (number): Zero-based index into the details array

Returns: string | null

The structure of details depends on the dictionary:

  • IPADIC: [品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats

Dictionary Management

Lindera Node.js provides functions for loading, building, and managing dictionaries used in morphological analysis.

Loading Dictionaries

System Dictionaries

Use loadDictionary(uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:

const { loadDictionary } = require("lindera-nodejs");

const dictionary = loadDictionary("/path/to/ipadic");

Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:

const dictionary = loadDictionary("embedded://ipadic");

User Dictionaries

User dictionaries add custom vocabulary on top of a system dictionary.

const { loadUserDictionary, Metadata } = require("lindera-nodejs");

const metadata = new Metadata();
const userDict = loadUserDictionary("/path/to/user_dictionary", metadata);

Pass the user dictionary when building a tokenizer:

const { Tokenizer, loadDictionary, loadUserDictionary, Metadata } = require("lindera-nodejs");

const dictionary = loadDictionary("/path/to/ipadic");
const metadata = new Metadata();
const userDict = loadUserDictionary("/path/to/user_dictionary", metadata);

const tokenizer = new Tokenizer(dictionary, "normal", userDict);

Or via the builder:

const { TokenizerBuilder } = require("lindera-nodejs");

const tokenizer = new TokenizerBuilder()
  .setDictionary("/path/to/ipadic")
  .setUserDictionary("/path/to/user_dictionary")
  .build();

Building Dictionaries

System Dictionary

Build a system dictionary from source files:

const { buildDictionary, Metadata } = require("lindera-nodejs");

const metadata = new Metadata({ name: "custom", encoding: "UTF-8" });
buildDictionary("/path/to/input_dir", "/path/to/output_dir", metadata);

The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).

User Dictionary

Build a user dictionary from a CSV file:

const { buildUserDictionary, Metadata } = require("lindera-nodejs");

const metadata = new Metadata();
buildUserDictionary("ipadic", "user_words.csv", "/path/to/output_dir", metadata);

The metadata parameter is optional. When omitted, default metadata values are used:

buildUserDictionary("ipadic", "user_words.csv", "/path/to/output_dir");

Metadata

The Metadata class configures dictionary parameters.

Creating Metadata

const { Metadata } = require("lindera-nodejs");

// Default metadata
const metadata = new Metadata();

// Custom metadata
const metadata = new Metadata({
  name: "my_dictionary",
  encoding: "UTF-8",
  defaultWordCost: -10000,
});

Loading from JSON

const metadata = Metadata.fromJsonFile("metadata.json");

Properties

  • name (string, default: "default"): Dictionary name
  • encoding (string, default: "UTF-8"): Character encoding
  • defaultWordCost (number, default: -10000): Default cost for unknown words
  • defaultLeftContextId (number, default: 1288): Default left context ID
  • defaultRightContextId (number, default: 1288): Default right context ID
  • defaultFieldValue (string, default: "*"): Default value for missing fields
  • flexibleCsv (boolean, default: false): Allow flexible CSV parsing
  • skipInvalidCostOrId (boolean, default: false): Skip entries with invalid cost or ID
  • normalizeDetails (boolean, default: false): Normalize morphological details
  • dictionarySchema (Schema, default: IPADIC schema): Schema for the main dictionary
  • userDictionarySchema (Schema, default: minimal schema): Schema for user dictionaries

All properties support both getting and setting:

const metadata = new Metadata();
metadata.name = "custom_dict";
metadata.encoding = "EUC-JP";
console.log(metadata.name); // "custom_dict"

toObject()

Returns a plain object representation of the metadata:

const metadata = new Metadata({ name: "test" });
console.log(metadata.toObject());

Schema

The Schema class defines the field structure of dictionary entries.

Creating a Schema

const { Schema } = require("lindera-nodejs");

// Default IPADIC-compatible schema
const schema = Schema.createDefault();

// Custom schema
const custom = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);

Schema Methods

  • getFieldIndex(name) -> number | null: Get field index by name
  • fieldCount() -> number: Total number of fields
  • getFieldName(index) -> string | null: Get field name by index
  • getCustomFields() -> string[]: Fields beyond index 4 (morphological features)
  • getAllFields() -> string[]: All field names
  • getFieldByName(name) -> FieldDefinition | null: Get full field definition
  • validateRecord(record) -> void: Validate a CSV record against the schema

const schema = Schema.createDefault();

console.log(schema.fieldCount());           // 13 (IPADIC format)
console.log(schema.getFieldIndex("pos1"));  // e.g., 4
console.log(schema.getAllFields());          // ["surface", "left_id", ...]
console.log(schema.getCustomFields());      // Fields after index 4

FieldDefinition

  • index (number): Field position index
  • name (string): Field name
  • fieldType (FieldType): Field type enum
  • description (string | undefined): Optional description

FieldType

  • FieldType.Surface: Word surface text
  • FieldType.LeftContextId: Left context ID
  • FieldType.RightContextId: Right context ID
  • FieldType.Cost: Word cost
  • FieldType.Custom: Morphological feature field

Text Processing Pipeline

Lindera Node.js supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.

Input Text
  --> Character Filters (preprocessing)
  --> Tokenization
  --> Token Filters (postprocessing)
  --> Output Tokens

Character Filters

Character filters transform the input text before tokenization.

unicode_normalize

Applies Unicode normalization to the input text.

const { TokenizerBuilder } = require("lindera-nodejs");

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendCharacterFilter("unicode_normalize", { kind: "nfkc" })
  .build();

Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".

mapping

Replaces characters or strings according to a mapping table.

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendCharacterFilter("mapping", {
    mapping: {
      "\u30fc": "-",
      "\uff5e": "~",
    },
  })
  .build();

japanese_iteration_mark

Resolves Japanese iteration marks (odoriji) into their full forms.

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendCharacterFilter("japanese_iteration_mark", {
    normalize_kanji: true,
    normalize_kana: true,
  })
  .build();

Token Filters

Token filters transform or remove tokens after tokenization.

lowercase

Converts token surface forms to lowercase.

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendTokenFilter("lowercase", {})
  .build();

japanese_base_form

Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendTokenFilter("japanese_base_form", {})
  .build();

japanese_stop_tags

Removes tokens whose part-of-speech matches any of the specified tags.

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendTokenFilter("japanese_stop_tags", {
    tags: ["助詞", "助動詞"],
  })
  .build();

japanese_keep_tags

Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.

const tokenizer = new TokenizerBuilder()
  .setDictionary("embedded://ipadic")
  .appendTokenFilter("japanese_keep_tags", {
    tags: ["名詞"],
  })
  .build();

Complete Pipeline Example

The following example combines multiple character filters and token filters into a single pipeline:

const { TokenizerBuilder } = require("lindera-nodejs");

const tokenizer = new TokenizerBuilder()
  .setMode("normal")
  .setDictionary("embedded://ipadic")
  // Preprocessing
  .appendCharacterFilter("unicode_normalize", { kind: "nfkc" })
  .appendCharacterFilter("japanese_iteration_mark", {
    normalize_kanji: true,
    normalize_kana: true,
  })
  // Postprocessing
  .appendTokenFilter("japanese_base_form", {})
  .appendTokenFilter("japanese_stop_tags", {
    tags: ["助詞", "助動詞", "記号"],
  })
  .appendTokenFilter("lowercase", {})
  .build();

const tokens = tokenizer.tokenize("Linderaは形態素解析を行うライブラリです。");
for (const token of tokens) {
  console.log(`${token.surface}\t${token.details.join(",")}`);
}

In this pipeline:

  1. unicode_normalize converts full-width characters to half-width (NFKC normalization)
  2. japanese_iteration_mark resolves iteration marks
  3. japanese_base_form converts inflected tokens to base form
  4. japanese_stop_tags removes particles, auxiliary verbs, and symbols
  5. lowercase normalizes alphabetic characters to lowercase

Training

Lindera Node.js supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-nodejs with the train feature enabled (it is on by default):

npm run build -- --features train

Training a Model

Use train() to train a CRF model from a seed lexicon and annotated corpus:

const { train } = require("lindera-nodejs");

train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
  lambda: 0.01,
  maxIter: 100,
  maxThreads: 4,
});

Training Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| seed | string | required | Path to the seed lexicon file (CSV format) |
| corpus | string | required | Path to the annotated training corpus |
| charDef | string | required | Path to the character definition file (char.def) |
| unkDef | string | required | Path to the unknown word definition file (unk.def) |
| featureDef | string | required | Path to the feature definition file (feature.def) |
| rewriteDef | string | required | Path to the rewrite rule definition file (rewrite.def) |
| output | string | required | Output path for the trained model file |
| lambda | number | 0.01 | L1 regularization cost (0.0--1.0) |
| maxIter | number | 100 | Maximum number of training iterations |
| maxThreads | number \| undefined | undefined | Number of threads (undefined = auto-detect CPU cores) |

Exporting a Trained Model

After training, export the model to dictionary source files using exportModel():

const { exportModel } = require("lindera-nodejs");

exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
  metadata: "resources/training/metadata.json",
});

Export Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | Path to the trained model file (.dat) |
| output | string | required | Output directory for dictionary source files |
| metadata | string \| undefined | undefined | Path to a base metadata.json file |

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (when metadata parameter is provided)

Complete Workflow

The full workflow for training and using a custom dictionary:

const {
  train,
  exportModel,
  buildDictionary,
  Metadata,
  TokenizerBuilder,
} = require("lindera-nodejs");

// Step 1: Train the CRF model
train({
  seed: "resources/training/seed.csv",
  corpus: "resources/training/corpus.txt",
  charDef: "resources/training/char.def",
  unkDef: "resources/training/unk.def",
  featureDef: "resources/training/feature.def",
  rewriteDef: "resources/training/rewrite.def",
  output: "/tmp/model.dat",
  lambda: 0.01,
  maxIter: 100,
});

// Step 2: Export to dictionary source files
exportModel({
  model: "/tmp/model.dat",
  output: "/tmp/dictionary_source",
  metadata: "resources/training/metadata.json",
});

// Step 3: Build the dictionary from exported source files
const metadata = Metadata.fromJsonFile("/tmp/dictionary_source/metadata.json");
buildDictionary("/tmp/dictionary_source", "/tmp/dictionary", metadata);

// Step 4: Use the trained dictionary
const tokenizer = new TokenizerBuilder()
  .setDictionary("/tmp/dictionary")
  .setMode("normal")
  .build();

const tokens = tokenizer.tokenize("形態素解析のテスト");
for (const token of tokens) {
  console.log(`${token.surface}\t${token.details.join(",")}`);
}

Lindera Ruby

Lindera Ruby provides Ruby bindings for the Lindera morphological analysis engine, built with Magnus and rb-sys. It brings Lindera's high-performance tokenization capabilities to the Ruby ecosystem with support for Ruby 3.1 and later.

Features

  • Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
  • Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
  • CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires train feature)
  • Multiple tokenization modes: Normal and decompose modes for different analysis granularity
  • N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
  • User dictionaries: Extend system dictionaries with custom vocabulary

Installation

[!NOTE] lindera-ruby is not yet published to RubyGems. You need to build from source.

Prerequisites

  • Ruby 3.1 or later
  • Rust toolchain -- Install via rustup
  • Bundler -- Ruby dependency manager (gem install bundler)

Obtaining Dictionaries

Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.

Download from GitHub Releases

Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:

# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic

Development Build

Build and install lindera-ruby in development mode:

cd lindera-ruby
bundle install
bundle exec rake compile

Or use the project Makefile:

make ruby-develop

Build with Training Support

The train feature enables CRF-based dictionary training functionality:

LINDERA_FEATURES="train" bundle exec rake compile

Feature Flags

Features are specified through the LINDERA_FEATURES environment variable as a comma-separated list.

| Feature | Description | Default |
|---------|-------------|---------|
| train | CRF training functionality | Disabled |
| embed-ipadic | Embed Japanese dictionary (IPADIC) into the binary | Disabled |
| embed-unidic | Embed Japanese dictionary (UniDic) into the binary | Disabled |
| embed-ipadic-neologd | Embed Japanese dictionary (IPADIC NEologd) into the binary | Disabled |
| embed-ko-dic | Embed Korean dictionary (ko-dic) into the binary | Disabled |
| embed-cc-cedict | Embed Chinese dictionary (CC-CEDICT) into the binary | Disabled |
| embed-jieba | Embed Chinese dictionary (Jieba) into the binary | Disabled |
| embed-cjk | Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary | Disabled |

Multiple features can be combined:

LINDERA_FEATURES="train,embed-ipadic,embed-ko-dic" bundle exec rake compile

[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding embed-* feature flag and load it using the embedded:// scheme:

dictionary = Lindera.load_dictionary("embedded://ipadic")

See Feature Flags for details.

Verifying the Installation

After installation, verify that lindera is available in Ruby:

require 'lindera'

puts Lindera.version

Quick Start

This guide shows how to tokenize text using lindera-ruby.

Basic Tokenization

The recommended way to create a tokenizer is through Lindera::TokenizerBuilder:

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build

tokens = tokenizer.tokenize('関西国際空港限定トートバッグ')
tokens.each do |token|
  puts "#{token.surface}\t#{token.details.join(',')}"
end

Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.

Expected output:

関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK

Sequential Configuration

TokenizerBuilder is configured through sequential method calls:

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build

tokens = tokenizer.tokenize('すもももももももものうち')
tokens.each do |token|
  puts "#{token.surface}\t#{token.get_detail(0)}"
end

Accessing Token Properties

Each token exposes the following properties:

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build

tokens = tokenizer.tokenize('東京タワー')

tokens.each do |token|
  puts "Surface: #{token.surface}"
  puts "Byte range: #{token.byte_start}..#{token.byte_end}"
  puts "Position: #{token.position}"
  puts "Word ID: #{token.word_id}"
  puts "Unknown: #{token.is_unknown}"
  puts "Details: #{token.details}"
  puts
end

N-best Tokenization

Retrieve multiple tokenization candidates ranked by cost:

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build

results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)

results.each do |tokens, cost|
  surfaces = tokens.map(&:surface)
  puts "Cost #{cost}: #{surfaces.join(' / ')}"
end

Tokenizer API

TokenizerBuilder

Lindera::TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

Lindera::TokenizerBuilder.new

Creates a new builder with default configuration.

require 'lindera'

builder = Lindera::TokenizerBuilder.new

Lindera::TokenizerBuilder.new.from_file(file_path)

Loads configuration from a JSON file and returns a new builder.

builder = Lindera::TokenizerBuilder.new.from_file('config.json')

Configuration Methods

set_mode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units

builder.set_mode('normal')

set_dictionary(path)

Sets the system dictionary path or URI.

# Use an embedded dictionary
builder.set_dictionary('embedded://ipadic')

# Use an external dictionary
builder.set_dictionary('/path/to/dictionary')

set_user_dictionary(uri)

Sets the user dictionary URI.

builder.set_user_dictionary('/path/to/user_dictionary')

set_keep_whitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.set_keep_whitespace(true)

append_character_filter(kind, args)

Appends a character filter to the preprocessing pipeline. The args parameter is a hash with string keys.

builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })

append_token_filter(kind, args)

Appends a token filter to the postprocessing pipeline. The args parameter is a hash with string keys, or nil if the filter requires no arguments.

builder.append_token_filter('lowercase', nil)

Build

build

Builds and returns a Tokenizer with the configured settings.

tokenizer = builder.build

Tokenizer

Lindera::Tokenizer performs morphological analysis on text.

Creating a Tokenizer

Lindera::Tokenizer.new(dictionary, mode, user_dictionary)

Creates a tokenizer directly from a loaded dictionary.

require 'lindera'

dictionary = Lindera.load_dictionary('embedded://ipadic')
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', nil)

With a user dictionary:

dictionary = Lindera.load_dictionary('embedded://ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', user_dict)

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns an array of Token objects.

tokens = tokenizer.tokenize('形態素解析')

Parameters:

| Name | Type | Description |
|------|------|-------------|
| text | String | Text to tokenize |

Returns: Array<Token>

tokenize_nbest(text, n, unique, cost_threshold)

Returns the N-best tokenization results, each paired with its total path cost.

results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)
results.each do |tokens, cost|
  puts "#{cost}: #{tokens.map(&:surface).inspect}"
end

Parameters:

| Name | Type | Description |
|------|------|-------------|
| text | String | Text to tokenize |
| n | Integer | Number of results to return |
| unique | Boolean or nil | Deduplicate results (default: false) |
| cost_threshold | Integer or nil | Maximum cost difference from the best path (default: nil) |

Returns: Array<Array(Array<Token>, Integer)>
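
When deduplication or a cost window is needed, pass the optional arguments explicitly. A small sketch (the threshold value here is illustrative):

# Deduplicate identical segmentations and keep only paths whose cost is
# within 2000 of the best path.
results = tokenizer.tokenize_nbest('すもももももももものうち', 5, true, 2000)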

Token

Token represents a single morphological token.

Properties

| Property | Type | Description |
|----------|------|-------------|
| surface | String | Surface form of the token |
| byte_start | Integer | Start byte position in the original text |
| byte_end | Integer | End byte position in the original text |
| position | Integer | Token position index |
| word_id | Integer | Dictionary word ID |
| is_unknown | Boolean | true if the word is not in the dictionary |
| details | Array<String> or nil | Morphological details (part of speech, reading, etc.) |

Token Methods

get_detail(index)

Returns the detail string at the specified index, or nil if the index is out of range.

token = tokenizer.tokenize('東京')[0]
pos = token.get_detail(0)        # e.g., "名詞"
subpos = token.get_detail(1)     # e.g., "固有名詞"
reading = token.get_detail(7)    # e.g., "トウキョウ"

Parameters:

| Name | Type | Description |
|------|------|-------------|
| index | Integer | Zero-based index into the details array |

Returns: String or nil

The structure of details depends on the dictionary:

  • IPADIC: [品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
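
For IPADIC, a short sketch that pairs each detail with its positional label (labels taken from the layout above; assumes a tokenizer built with an IPADIC dictionary as in the earlier examples):

# Sketch: print IPADIC detail fields with their positional labels.
labels = %w[品詞 品詞細分類1 品詞細分類2 品詞細分類3 活用型 活用形 原形 読み 発音]
token = tokenizer.tokenize('東京')[0]
labels.each_with_index do |label, i|
  puts "#{label}: #{token.get_detail(i)}"
end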

Dictionary Management

Lindera Ruby provides functions for loading, building, and managing dictionaries used in morphological analysis.

Loading Dictionaries

System Dictionaries

Use Lindera.load_dictionary(uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:

require 'lindera'

dictionary = Lindera.load_dictionary('/path/to/ipadic')

Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:

dictionary = Lindera.load_dictionary('embedded://ipadic')

User Dictionaries

User dictionaries add custom vocabulary on top of a system dictionary.

require 'lindera'

dictionary = Lindera.load_dictionary('/path/to/ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)

Pass the user dictionary when building a tokenizer:

require 'lindera'

dictionary = Lindera.load_dictionary('/path/to/ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)

tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', user_dict)

Or via the builder:

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
builder.set_user_dictionary('/path/to/user_dictionary')
tokenizer = builder.build

Building Dictionaries

System Dictionary

Build a system dictionary from source files:

require 'lindera'

metadata = Lindera::Metadata.from_json_file('metadata.json')
Lindera.build_dictionary('/path/to/input_dir', '/path/to/output_dir', metadata)

The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).

User Dictionary

Build a user dictionary from a CSV file:

require 'lindera'

metadata = Lindera::Metadata.from_json_file('metadata.json')
Lindera.build_user_dictionary('ipadic', 'user_words.csv', '/path/to/output_dir', metadata)

The metadata parameter is optional. When omitted, default metadata values are used:

Lindera.build_user_dictionary('ipadic', 'user_words.csv', '/path/to/output_dir', nil)

Metadata

The Lindera::Metadata class configures dictionary parameters.

Creating Metadata

require 'lindera'

# Default metadata
metadata = Lindera::Metadata.new

# Create default metadata with standard settings
metadata = Lindera::Metadata.create_default

Loading from JSON

metadata = Lindera::Metadata.from_json_file('metadata.json')

Properties

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| name | String | "default" | Dictionary name |
| encoding | String | "UTF-8" | Character encoding |
| default_word_cost | Integer | -10000 | Default cost for unknown words |
| default_left_context_id | Integer | 1288 | Default left context ID |
| default_right_context_id | Integer | 1288 | Default right context ID |
| default_field_value | String | "*" | Default value for missing fields |
| flexible_csv | Boolean | false | Allow flexible CSV parsing |
| skip_invalid_cost_or_id | Boolean | false | Skip entries with invalid cost or ID |
| normalize_details | Boolean | false | Normalize morphological details |

Text Processing Pipeline

Lindera Ruby supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.

Input Text
  --> Character Filters (preprocessing)
  --> Tokenization
  --> Token Filters (postprocessing)
  --> Output Tokens

Character Filters

Character filters transform the input text before tokenization.

unicode_normalize

Applies Unicode normalization to the input text.

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })
tokenizer = builder.build

Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".

mapping

Replaces characters or strings according to a mapping table.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_character_filter('mapping', {
  'mapping' => {
    "\u30fc" => '-',
    "\uff5e" => '~'
  }
})
tokenizer = builder.build

japanese_iteration_mark

Resolves Japanese iteration marks (odoriji) into their full forms.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_character_filter('japanese_iteration_mark', {
  'normalize_kanji' => 'true',
  'normalize_kana' => 'true'
})
tokenizer = builder.build

Token Filters

Token filters transform or remove tokens after tokenization.

lowercase

Converts token surface forms to lowercase.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('lowercase', nil)
tokenizer = builder.build

japanese_base_form

Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_base_form', nil)
tokenizer = builder.build

japanese_stop_tags

Removes tokens whose part-of-speech matches any of the specified tags.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_stop_tags', {
  'tags' => ['助詞', '助動詞']
})
tokenizer = builder.build

japanese_keep_tags

Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_keep_tags', {
  'tags' => ['名詞']
})
tokenizer = builder.build

japanese_katakana_stem

Removes trailing prolonged sound marks from katakana tokens that exceed a minimum length.

builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_katakana_stem', { 'min' => 3 })
tokenizer = builder.build

Complete Pipeline Example

The following example combines multiple character filters and token filters into a single pipeline:

require 'lindera'

builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('embedded://ipadic')

# Preprocessing
builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })
builder.append_character_filter('japanese_iteration_mark', {
  'normalize_kanji' => 'true',
  'normalize_kana' => 'true'
})

# Postprocessing
builder.append_token_filter('japanese_base_form', nil)
builder.append_token_filter('japanese_stop_tags', {
  'tags' => ['助詞', '助動詞', '記号']
})
builder.append_token_filter('lowercase', nil)

tokenizer = builder.build

tokens = tokenizer.tokenize('Linderaは形態素解析を行うライブラリです。')
tokens.each do |token|
  puts "#{token.surface}\t#{token.details.join(',')}"
end

In this pipeline:

  1. unicode_normalize converts full-width characters to half-width (NFKC normalization)
  2. japanese_iteration_mark resolves iteration marks
  3. japanese_base_form converts inflected tokens to base form
  4. japanese_stop_tags removes particles, auxiliary verbs, and symbols
  5. lowercase normalizes alphabetic characters to lowercase

Training

Lindera Ruby supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-ruby with the train feature enabled:

LINDERA_FEATURES="embed-ipadic,train" bundle exec rake compile

Training a Model

Use Lindera::Trainer.train to train a CRF model from a seed lexicon and annotated corpus:

require 'lindera'

Lindera::Trainer.train(
  'resources/training/seed.csv',
  'resources/training/corpus.txt',
  'resources/training/char.def',
  'resources/training/unk.def',
  'resources/training/feature.def',
  'resources/training/rewrite.def',
  '/tmp/model.dat',
  0.01,  # lambda (L1 regularization)
  100,   # max_iter
  nil    # max_threads (nil = auto-detect CPU cores)
)

Training Parameters

Parameters are passed as positional arguments in the following order:

| Position | Name | Type | Description |
|----------|------|------|-------------|
| 1 | seed | String | Path to the seed lexicon file (CSV format) |
| 2 | corpus | String | Path to the annotated training corpus |
| 3 | char_def | String | Path to the character definition file (char.def) |
| 4 | unk_def | String | Path to the unknown word definition file (unk.def) |
| 5 | feature_def | String | Path to the feature definition file (feature.def) |
| 6 | rewrite_def | String | Path to the rewrite rule definition file (rewrite.def) |
| 7 | output | String | Output path for the trained model file |
| 8 | lambda | Float | L1 regularization cost (0.0--1.0) |
| 9 | max_iter | Integer | Maximum number of training iterations |
| 10 | max_threads | Integer or nil | Number of threads (nil = auto-detect CPU cores) |

Exporting a Trained Model

After training, export the model to dictionary source files using Lindera::Trainer.export:

require 'lindera'

Lindera::Trainer.export(
  '/tmp/model.dat',
  '/tmp/dictionary_source',
  'resources/training/metadata.json'
)

Export Parameters

| Position | Name | Type | Description |
|----------|------|------|-------------|
| 1 | model | String | Path to the trained model file (.dat) |
| 2 | output | String | Output directory for dictionary source files |
| 3 | metadata | String or nil | Path to a base metadata.json file |

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (when metadata parameter is provided)

Complete Workflow

The full workflow for training and using a custom dictionary:

require 'lindera'

# Step 1: Train the CRF model
Lindera::Trainer.train(
  'resources/training/seed.csv',
  'resources/training/corpus.txt',
  'resources/training/char.def',
  'resources/training/unk.def',
  'resources/training/feature.def',
  'resources/training/rewrite.def',
  '/tmp/model.dat',
  0.01,  # lambda
  100,   # max_iter
  nil    # max_threads
)

# Step 2: Export to dictionary source files
Lindera::Trainer.export(
  '/tmp/model.dat',
  '/tmp/dictionary_source',
  'resources/training/metadata.json'
)

# Step 3: Build the dictionary from exported source files
metadata = Lindera::Metadata.from_json_file('/tmp/dictionary_source/metadata.json')
Lindera.build_dictionary('/tmp/dictionary_source', '/tmp/dictionary', metadata)

# Step 4: Use the trained dictionary
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/tmp/dictionary')
builder.set_mode('normal')
tokenizer = builder.build

tokens = tokenizer.tokenize('形態素解析のテスト')
tokens.each do |token|
  puts "#{token.surface}\t#{token.details.join(',')}"
end

Lindera PHP

Lindera PHP provides PHP bindings for the Lindera morphological analysis engine, built with ext-php-rs. It brings Lindera's high-performance tokenization capabilities to the PHP ecosystem with support for PHP 8.1 and later.

Features

  • Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
  • Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
  • CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires train feature)
  • Multiple tokenization modes: Normal and decompose modes for different analysis granularity
  • N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
  • User dictionaries: Extend system dictionaries with custom vocabulary

Installation

[!NOTE] lindera-php is not yet published to Packagist. You need to build from source.

Prerequisites

  • PHP 8.1 or later
  • Rust toolchain -- Install via rustup
  • Composer -- PHP dependency manager (optional, for running tests)

Obtaining Dictionaries

Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.

Download from GitHub Releases

Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:

# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic

Development Build

Build the lindera-php extension from the project root:

cargo build -p lindera-php

Or use the project Makefile:

make php-build

Build with Training Support

The train feature enables CRF-based dictionary training functionality:

cargo build -p lindera-php --features train

Feature Flags

| Feature | Description | Default |
|---------|-------------|---------|
| train | CRF training functionality | Disabled |
| embed-ipadic | Embed Japanese dictionary (IPADIC) into the binary | Disabled |
| embed-unidic | Embed Japanese dictionary (UniDic) into the binary | Disabled |
| embed-ipadic-neologd | Embed Japanese dictionary (IPADIC NEologd) into the binary | Disabled |
| embed-ko-dic | Embed Korean dictionary (ko-dic) into the binary | Disabled |
| embed-cc-cedict | Embed Chinese dictionary (CC-CEDICT) into the binary | Disabled |
| embed-jieba | Embed Chinese dictionary (Jieba) into the binary | Disabled |
| embed-cjk | Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary | Disabled |

Multiple features can be combined:

cargo build -p lindera-php --features "train,embed-ipadic,embed-ko-dic"

[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding embed-* feature flag and load it using the embedded:// scheme:

$dictionary = Lindera\Dictionary::load('embedded://ipadic');

See Feature Flags for details.

Loading the Extension

Load the compiled shared library when running PHP:

php -d extension=target/debug/liblindera_php.so script.php

For release builds:

cargo build -p lindera-php --release
php -d extension=target/release/liblindera_php.so script.php

Alternatively, add the extension to your php.ini:

extension=/absolute/path/to/liblindera_php.so

Verifying the Installation

After building, verify that lindera is available in PHP:

php -d extension=target/debug/liblindera_php.so -r "echo Lindera\Dictionary::version() . PHP_EOL;"

Quick Start

This guide shows how to tokenize text using lindera-php.

Basic Tokenization

The recommended way to create a tokenizer is through TokenizerBuilder:

<?php

$builder = new Lindera\TokenizerBuilder();
$builder->setMode('normal');
$builder->setDictionary('/path/to/ipadic');
$tokenizer = $builder->build();

$tokens = $tokenizer->tokenize('関西国際空港限定トートバッグ');
foreach ($tokens as $token) {
    echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}

Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.

Expected output:

関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    UNK

Method Chaining

TokenizerBuilder supports method chaining for concise configuration:

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setMode('normal')
    ->setDictionary('/path/to/ipadic')
    ->build();

$tokens = $tokenizer->tokenize('すもももももももものうち');
foreach ($tokens as $token) {
    echo $token->surface . "\t" . $token->getDetail(0) . "\n";
}

Accessing Token Properties

Each token exposes the following properties:

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder->setDictionary('/path/to/ipadic')->build();
$tokens = $tokenizer->tokenize('東京タワー');

foreach ($tokens as $token) {
    echo "Surface: {$token->surface}\n";
    echo "Byte range: {$token->byte_start}..{$token->byte_end}\n";
    echo "Position: {$token->position}\n";
    echo "Word ID: {$token->word_id}\n";
    echo "Unknown: " . ($token->is_unknown ? 'true' : 'false') . "\n";
    echo "Details: " . implode(',', $token->details) . "\n";
    echo "\n";
}

N-best Tokenization

Retrieve multiple tokenization candidates ranked by cost:

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder->setDictionary('/path/to/ipadic')->build();
$results = $tokenizer->tokenizeNbest('すもももももももものうち', 3);

foreach ($results as $result) {
    $surfaces = array_map(fn($t) => $t->surface, $result->tokens);
    echo "Cost {$result->cost}: " . implode(' / ', $surfaces) . "\n";
}

Tokenizer API

TokenizerBuilder

Lindera\TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

new Lindera\TokenizerBuilder()

Creates a new builder with default configuration.

<?php

$builder = new Lindera\TokenizerBuilder();

$builder->fromFile($filePath)

Loads configuration from a JSON file.

<?php

$builder = new Lindera\TokenizerBuilder();
$builder->fromFile('config.json');

Configuration Methods

All setter methods return $this for method chaining.

setMode($mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units

<?php

$builder->setMode('normal');

setDictionary($path)

Sets the system dictionary path or URI.

<?php

// Use an embedded dictionary
$builder->setDictionary('embedded://ipadic');

// Use an external dictionary
$builder->setDictionary('/path/to/dictionary');

setUserDictionary($uri)

Sets the user dictionary URI.

<?php

$builder->setUserDictionary('/path/to/user_dictionary');

setKeepWhitespace($keep)

Controls whether whitespace tokens appear in the output.

<?php

$builder->setKeepWhitespace(true);

appendCharacterFilter($kind, $args)

Appends a character filter to the preprocessing pipeline.

<?php

$builder->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc']);

appendTokenFilter($kind, $args)

Appends a token filter to the postprocessing pipeline.

<?php

$builder->appendTokenFilter('lowercase');

Build

build()

Builds and returns a Tokenizer with the configured settings.

<?php

$tokenizer = $builder->build();

Tokenizer

Lindera\Tokenizer performs morphological analysis on text.

Creating a Tokenizer

new Lindera\Tokenizer($dictionary, $mode, $userDictionary)

Creates a tokenizer directly from a loaded dictionary.

<?php

$dictionary = Lindera\Dictionary::load('embedded://ipadic');
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal');

With a user dictionary:

<?php

$dictionary = Lindera\Dictionary::load('embedded://ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal', $userDict);

Tokenizer Methods

tokenize($text)

Tokenizes the input text and returns an array of Token objects.

<?php

$tokens = $tokenizer->tokenize('形態素解析');

Parameters:

| Name | Type | Description |
|------|------|-------------|
| $text | string | Text to tokenize |

Returns: array<Token>

tokenizeNbest($text, $n, $unique, $costThreshold)

Returns the N-best tokenization results as an array of NbestResult objects.

<?php

$results = $tokenizer->tokenizeNbest('すもももももももものうち', 3);
foreach ($results as $result) {
    echo "Cost: {$result->cost}\n";
    foreach ($result->tokens as $token) {
        echo "  {$token->surface}\n";
    }
}

Parameters:

| Name | Type | Description |
|------|------|-------------|
| $text | string | Text to tokenize |
| $n | int | Number of results to return |
| $unique | bool\|null | Deduplicate results (default: false) |
| $costThreshold | int\|null | Maximum cost difference from the best path (default: null) |

Returns: array<NbestResult>
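
When deduplication or a cost window is needed, pass the optional arguments explicitly. A small sketch (the threshold value here is illustrative):

<?php

// Deduplicate identical segmentations and keep only paths whose cost is
// within 2000 of the best path.
$results = $tokenizer->tokenizeNbest('すもももももももものうち', 5, true, 2000);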

NbestResult

Lindera\NbestResult represents a single N-best tokenization result.

NbestResult Properties

| Property | Type | Description |
|----------|------|-------------|
| $tokens | array<Token> | The tokens in this result |
| $cost | int | The total cost of this segmentation |

Token

Lindera\Token represents a single morphological token.

Token Properties

| Property | Type | Description |
|----------|------|-------------|
| $surface | string | Surface form of the token |
| $byte_start | int | Start byte position in the original text |
| $byte_end | int | End byte position in the original text |
| $position | int | Token position index |
| $word_id | int | Dictionary word ID |
| $is_unknown | bool | true if the word is not in the dictionary |
| $details | array<string> | Morphological details (part of speech, reading, etc.) |

Token Methods

getDetail($index)

Returns the detail string at the specified index, or null if the index is out of range.

<?php

$token = $tokenizer->tokenize('東京')[0];
$pos = $token->getDetail(0);        // e.g., "名詞"
$subpos = $token->getDetail(1);     // e.g., "固有名詞"
$reading = $token->getDetail(7);    // e.g., "トウキョウ"

Parameters:

| Name | Type | Description |
|------|------|-------------|
| $index | int | Zero-based index into the details array |

Returns: string|null

The structure of details depends on the dictionary:

  • IPADIC: [品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
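
For IPADIC, a short sketch that pairs each detail with its positional label (labels taken from the layout above; assumes a tokenizer built with an IPADIC dictionary as in the earlier examples):

<?php

// Sketch: print IPADIC detail fields with their positional labels.
$labels = ['品詞', '品詞細分類1', '品詞細分類2', '品詞細分類3', '活用型', '活用形', '原形', '読み', '発音'];
$token = $tokenizer->tokenize('東京')[0];
foreach ($labels as $i => $label) {
    echo "{$label}: {$token->getDetail($i)}\n";
}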

Dictionary Management

Lindera PHP provides static methods on the Lindera\Dictionary class for loading, building, and managing dictionaries used in morphological analysis.

Loading Dictionaries

System Dictionaries

Use Lindera\Dictionary::load($uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:

<?php

$dictionary = Lindera\Dictionary::load('/path/to/ipadic');

Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:

<?php

$dictionary = Lindera\Dictionary::load('embedded://ipadic');

User Dictionaries

User dictionaries add custom vocabulary on top of a system dictionary.

<?php

$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);

Pass the user dictionary when creating a tokenizer directly:

<?php

$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);

$tokenizer = new Lindera\Tokenizer($dictionary, 'normal', $userDict);

Or via the builder:

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('/path/to/ipadic')
    ->setUserDictionary('/path/to/user_dictionary')
    ->build();

Building Dictionaries

System Dictionary

Build a system dictionary from source files:

<?php

$metadata = Lindera\Metadata::fromJsonFile('/path/to/metadata.json');
Lindera\Dictionary::build('/path/to/input_dir', '/path/to/output_dir', $metadata);

The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).

User Dictionary

Build a user dictionary from a CSV file:

<?php

$metadata = new Lindera\Metadata();
Lindera\Dictionary::buildUser('ipadic', 'user_words.csv', '/path/to/output_dir', $metadata);

Metadata

The Lindera\Metadata class configures dictionary parameters.

Creating Metadata

<?php

// Default metadata
$metadata = new Lindera\Metadata();

// Custom metadata
$metadata = new Lindera\Metadata(
    name: 'my_dictionary',
    encoding: 'UTF-8',
    default_word_cost: -10000,
);

// Create with all defaults explicitly
$metadata = Lindera\Metadata::createDefault();

Loading from JSON

<?php

$metadata = Lindera\Metadata::fromJsonFile('metadata.json');

Properties

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| name | string | "default" | Dictionary name |
| encoding | string | "UTF-8" | Character encoding |
| default_word_cost | int | -10000 | Default cost for unknown words |
| default_left_context_id | int | 1288 | Default left context ID |
| default_right_context_id | int | 1288 | Default right context ID |
| default_field_value | string | "*" | Default value for missing fields |
| flexible_csv | bool | false | Allow flexible CSV parsing |
| skip_invalid_cost_or_id | bool | false | Skip entries with invalid cost or ID |
| normalize_details | bool | false | Normalize morphological details |
| dictionary_schema_fields | array<string> | IPADIC schema | Schema fields for the main dictionary |
| user_dictionary_schema_fields | array<string> | Minimal schema | Schema fields for user dictionaries |

All properties are read-only via getter methods:

<?php

$metadata = new Lindera\Metadata(name: 'custom_dict', encoding: 'EUC-JP');
echo $metadata->name;      // "custom_dict"
echo $metadata->encoding;  // "EUC-JP"

toArray()

Returns an associative array representation of the metadata:

<?php

$metadata = new Lindera\Metadata(name: 'test');
print_r($metadata->toArray());

Dictionary Info

The Lindera\Dictionary object provides metadata accessors:

<?php

$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
echo $dictionary->metadataName();      // Dictionary name
echo $dictionary->metadataEncoding();  // Dictionary encoding
$metadata = $dictionary->metadata();   // Full Metadata object

Version

Retrieve the Lindera library version:

<?php

echo Lindera\Dictionary::version();

Text Processing Pipeline

Lindera PHP supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.

Input Text
  --> Character Filters (preprocessing)
  --> Tokenization
  --> Token Filters (postprocessing)
  --> Output Tokens

Character Filters

Character filters transform the input text before tokenization.

unicode_normalize

Applies Unicode normalization to the input text.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc'])
    ->build();

Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".

mapping

Replaces characters or strings according to a mapping table.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendCharacterFilter('mapping', [
        'mapping' => [
            "\u{30FC}" => '-',
            "\u{FF5E}" => '~',
        ],
    ])
    ->build();

japanese_iteration_mark

Resolves Japanese iteration marks (odoriji) into their full forms.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendCharacterFilter('japanese_iteration_mark', [
        'normalize_kanji' => 'true',
        'normalize_kana' => 'true',
    ])
    ->build();

Token Filters

Token filters transform or remove tokens after tokenization.

lowercase

Converts token surface forms to lowercase.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendTokenFilter('lowercase')
    ->build();

japanese_base_form

Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendTokenFilter('japanese_base_form', [])
    ->build();

japanese_stop_tags

Removes tokens whose part-of-speech matches any of the specified tags.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendTokenFilter('japanese_stop_tags', [
        'tags' => ['助詞', '助動詞'],
    ])
    ->build();

japanese_keep_tags

Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('embedded://ipadic')
    ->appendTokenFilter('japanese_keep_tags', [
        'tags' => ['名詞'],
    ])
    ->build();

Complete Pipeline Example

The following example combines multiple character filters and token filters into a single pipeline:

<?php

$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setMode('normal')
    ->setDictionary('embedded://ipadic')
    // Preprocessing
    ->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc'])
    ->appendCharacterFilter('japanese_iteration_mark', [
        'normalize_kanji' => 'true',
        'normalize_kana' => 'true',
    ])
    // Postprocessing
    ->appendTokenFilter('japanese_base_form', [])
    ->appendTokenFilter('japanese_stop_tags', [
        'tags' => ['助詞', '助動詞', '記号'],
    ])
    ->appendTokenFilter('lowercase')
    ->build();

$tokens = $tokenizer->tokenize('Linderaは形態素解析を行うライブラリです。');
foreach ($tokens as $token) {
    echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}

In this pipeline:

  1. unicode_normalize converts full-width characters to half-width (NFKC normalization)
  2. japanese_iteration_mark resolves iteration marks
  3. japanese_base_form converts inflected tokens to base form
  4. japanese_stop_tags removes particles, auxiliary verbs, and symbols
  5. lowercase normalizes alphabetic characters to lowercase

Training

Lindera PHP supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.

Prerequisites

Build lindera-php with the train feature enabled:

cargo build -p lindera-php --features train,embed-ipadic

Training a Model

Use Lindera\Trainer::train() to train a CRF model from a seed lexicon and annotated corpus:

<?php

Lindera\Trainer::train(
    seed: 'resources/training/seed.csv',
    corpus: 'resources/training/corpus.txt',
    char_def: 'resources/training/char.def',
    unk_def: 'resources/training/unk.def',
    feature_def: 'resources/training/feature.def',
    rewrite_def: 'resources/training/rewrite.def',
    output: '/tmp/model.dat',
    lambda: 0.01,
    max_iter: 100,
    max_threads: null,
);

Training Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| $seed | string | required | Path to the seed lexicon file (CSV format) |
| $corpus | string | required | Path to the annotated training corpus |
| $char_def | string | required | Path to the character definition file (char.def) |
| $unk_def | string | required | Path to the unknown word definition file (unk.def) |
| $feature_def | string | required | Path to the feature definition file (feature.def) |
| $rewrite_def | string | required | Path to the rewrite rule definition file (rewrite.def) |
| $output | string | required | Output path for the trained model file |
| $lambda | float | 0.01 | L1 regularization cost (0.0--1.0) |
| $max_iter | int | 100 | Maximum number of training iterations |
| $max_threads | int\|null | null | Number of threads (null = auto-detect CPU cores) |

Exporting a Trained Model

After training, export the model to dictionary source files using Lindera\Trainer::export():

<?php

Lindera\Trainer::export(
    model: '/tmp/model.dat',
    output: '/tmp/dictionary_source',
    metadata: 'resources/training/metadata.json',
);

Export Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| $model | string | required | Path to the trained model file (.dat) |
| $output | string | required | Output directory for dictionary source files |
| $metadata | string\|null | null | Path to a base metadata.json file |

The export creates the following files in the output directory:

  • lex.csv -- Lexicon entries with trained costs
  • matrix.def -- Connection cost matrix
  • unk.def -- Unknown word definitions
  • char.def -- Character category definitions
  • metadata.json -- Updated metadata (when $metadata parameter is provided)

Complete Workflow

The full workflow for training and using a custom dictionary:

<?php

// Step 1: Train the CRF model
Lindera\Trainer::train(
    seed: 'resources/training/seed.csv',
    corpus: 'resources/training/corpus.txt',
    char_def: 'resources/training/char.def',
    unk_def: 'resources/training/unk.def',
    feature_def: 'resources/training/feature.def',
    rewrite_def: 'resources/training/rewrite.def',
    output: '/tmp/model.dat',
    lambda: 0.01,
    max_iter: 100,
);

// Step 2: Export to dictionary source files
Lindera\Trainer::export(
    model: '/tmp/model.dat',
    output: '/tmp/dictionary_source',
    metadata: 'resources/training/metadata.json',
);

// Step 3: Build the dictionary from exported source files
$metadata = Lindera\Metadata::fromJsonFile('/tmp/dictionary_source/metadata.json');
Lindera\Dictionary::build('/tmp/dictionary_source', '/tmp/dictionary', $metadata);

// Step 4: Use the trained dictionary
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
    ->setDictionary('/tmp/dictionary')
    ->setMode('normal')
    ->build();

$tokens = $tokenizer->tokenize('形態素解析のテスト');
foreach ($tokens as $token) {
    echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}

Lindera WASM

Lindera WASM provides WebAssembly bindings for Lindera's morphological analysis engine, built with wasm-bindgen. It enables Japanese, Korean, and Chinese text tokenization directly in web browsers, Node.js, and bundler environments.

Distribution Formats

Lindera WASM supports multiple distribution formats via wasm-pack:

| Target | Use Case | Module System |
|--------|----------|---------------|
| web | Browser ESM | ES Modules |
| bundler | Webpack, Vite, Rollup | ES Modules (bundler-resolved) |
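
In practice the difference shows up at module initialization: the web target exposes an init function that must be awaited before any API call, while bundler output is initialized by the bundler's generated glue. A sketch (the bundler package name is hypothetical here, following the lindera-wasm-{target} naming convention described under Installation):

// web target: explicit init before first use
import __wbg_init, { TokenizerBuilder } from 'lindera-wasm-web';
await __wbg_init();

// bundler target: initialization is wired up by the bundler
// (hypothetical package name following the naming convention)
import { TokenizerBuilder as BundlerTokenizerBuilder } from 'lindera-wasm-bundler';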

Dictionary Packages

Each package variant can embed a specific dictionary for offline use:

| Feature Flag | Dictionary | Language |
|--------------|------------|----------|
| (none) | No embedded dictionary | -- |
| embed-ipadic | IPADIC | Japanese |
| embed-unidic | UniDic | Japanese |
| embed-ko-dic | ko-dic | Korean |
| embed-cc-cedict | CC-CEDICT | Chinese |
| embed-jieba | Jieba | Chinese |
| embed-cjk | IPADIC + ko-dic + Jieba | CJK |

Installation

Prerequisites

  • Rust toolchain -- Install via rustup
  • wasm-pack -- WebAssembly build tool (cargo install wasm-pack)

Obtaining Dictionaries

Lindera WASM does not bundle dictionaries by default. The recommended approach for browser environments is to download dictionaries at runtime using the OPFS (Origin Private File System) API.

Download from GitHub Releases

Pre-built dictionaries are available on the GitHub Releases page. In browser environments, use the OPFS helpers to download and cache dictionaries:

import { downloadDictionary, hasDictionary } from 'lindera-wasm-web/opfs';

if (!await hasDictionary("ipadic")) {
    await downloadDictionary(
        "https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
        "ipadic",
    );
}

See OPFS Dictionary Storage for the full workflow.

Building with wasm-pack

Build the WASM package for your target environment:

Web (ES Modules for browsers)

wasm-pack build --target web

Bundler (Webpack, Vite, Rollup)

wasm-pack build --target bundler

The output is written to the pkg/ directory inside the lindera-wasm crate.
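
For a web build of the lindera-wasm crate, pkg/ typically contains the JS glue, the TypeScript definitions, the .wasm binary, and a generated package.json (file names derive from the crate name; the listing below is indicative, not exhaustive):

pkg/
├── package.json
├── lindera_wasm.js
├── lindera_wasm.d.ts
└── lindera_wasm_bg.wasm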

Available Feature Flags (Advanced)

For advanced users who want to embed dictionaries directly into the WASM binary, the following feature flags are available. This increases the binary size significantly but eliminates the need to download dictionaries at runtime.

| Feature | Dictionary | Language |
|---------|------------|----------|
| embed-ipadic | IPADIC | Japanese |
| embed-unidic | UniDic | Japanese |
| embed-ko-dic | ko-dic | Korean |
| embed-cc-cedict | CC-CEDICT | Chinese |
| embed-jieba | Jieba | Chinese |
| embed-cjk | IPADIC + ko-dic + Jieba | CJK (all) |

You can combine multiple dictionaries by enabling multiple feature flags:

wasm-pack build --target web --features embed-ipadic,embed-ko-dic

NPM Package Naming Convention

When publishing to npm, the recommended naming convention is:

lindera-wasm-{target}
lindera-wasm-{target}-{dict}

Examples:

  • lindera-wasm-web
  • lindera-wasm-web-ipadic
  • lindera-wasm-bundler-unidic
  • lindera-wasm-web-cjk

To set the package name before publishing, edit the name field in the generated pkg/package.json.
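
One way to do this without editing the file by hand is npm pkg set (available in npm 7.24 and later), run inside the generated package directory:

cd pkg
npm pkg set name=lindera-wasm-web-ipadic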

Installing from npm

Pre-built packages are available on npm:

npm install lindera-wasm-web

Or with yarn:

yarn add lindera-wasm-web

[!NOTE] The npm package does not include dictionaries. Use the OPFS helpers to download dictionaries at runtime. See OPFS Dictionary Storage.

Quick Start

Web (Browser) -- OPFS Dictionary Loading

The recommended approach is to download dictionaries at runtime using the OPFS helpers:

import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from 'lindera-wasm-web/opfs';

async function main() {
    await __wbg_init();

    // Download dictionary if not cached
    if (!await hasDictionary("ipadic")) {
        await downloadDictionary(
            "https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
            "ipadic",
        );
    }

    // Load dictionary from OPFS
    const files = await loadDictionaryFiles("ipadic");
    const dictionary = loadDictionaryFromBytes(
        files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
        files.dictWords, files.matrixMtx, files.charDef, files.unk,
    );

    // Build tokenizer
    const builder = new TokenizerBuilder();
    builder.setDictionaryInstance(dictionary);
    builder.setMode("normal");
    const tokenizer = builder.build();

    const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
    tokens.forEach(token => {
        console.log(`${token.surface}\t${token.details.join(',')}`);
    });
}

main();

Note: Download a pre-built dictionary from GitHub Releases. See OPFS Dictionary Storage for the full workflow.

Expected output:

関西国際空港    名詞,固有名詞,一般,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ    名詞,一般,*,*,*,*,*,*,*

Using Embedded Dictionaries (Advanced)

If you built with an embed-* feature flag, you can use embedded dictionaries:

import __wbg_init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';

async function main() {
    await __wbg_init();

    const builder = new TokenizerBuilder();
    builder.setDictionary("embedded://ipadic");
    builder.setMode("normal");
    const tokenizer = builder.build();

    const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
    tokens.forEach(token => {
        console.log(`${token.surface}\t${token.details.join(',')}`);
    });
}

main();

Using Filters

You can add character filters and token filters to the tokenization pipeline:

import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

async function main() {
    await __wbg_init();

    // Assume dictionary is already cached in OPFS
    const files = await loadDictionaryFiles("ipadic");
    const dictionary = loadDictionaryFromBytes(
        files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
        files.dictWords, files.matrixMtx, files.charDef, files.unk,
    );

    const builder = new TokenizerBuilder();
    builder.setDictionaryInstance(dictionary);
    builder.setMode("normal");

    // Add Unicode NFKC normalization
    builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

    // Add a stop-tags filter to remove particles and auxiliary verbs
    builder.appendTokenFilter("japanese_stop_tags", {
        tags: ["助詞", "助動詞"]
    });

    const tokenizer = builder.build();
    const tokens = tokenizer.tokenize("Linderaは形態素解析エンジンです");
    tokens.forEach(token => {
        console.log(`${token.surface}\t${token.details.join(',')}`);
    });
}

main();

N-Best Tokenization

Retrieve multiple tokenization candidates ranked by cost:

const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
results.forEach((result, rank) => {
    console.log(`--- NBEST ${rank + 1} (cost=${result.cost}) ---`);
    result.tokens.forEach(token => {
        console.log(`${token.surface}\t${token.details.join(',')}`);
    });
});

Tokenizer API

This page documents the JavaScript/TypeScript API exposed by lindera-wasm.

TokenizerBuilder

Builder class for creating a configured Tokenizer instance.

Constructor

const builder = new TokenizerBuilder();

Creates a new builder with default settings.

Methods

setMode(mode)

Sets the tokenization mode.

  • Parameters: mode (string) -- "normal" or "decompose"
  • Returns: void

builder.setMode("normal");

setDictionary(uri)

Sets the dictionary to use for tokenization.

  • Parameters: uri (string) -- Dictionary URI (e.g., "embedded://ipadic")
  • Returns: void

builder.setDictionary("embedded://ipadic");

setDictionaryInstance(dictionary)

Sets a pre-loaded dictionary instance for tokenization. Use this when the dictionary has been loaded from bytes (e.g., via loadDictionaryFromBytes()) instead of from a URI.

  • Parameters: dictionary (Dictionary) -- A loaded dictionary object
  • Returns: void

import { loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
    files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
    files.dictWords, files.matrixMtx, files.charDef, files.unk,
);

builder.setDictionaryInstance(dictionary);

setUserDictionary(uri)

Sets a user-defined dictionary by URI.

  • Parameters: uri (string) -- Path or URI to the user dictionary
  • Returns: void

builder.setUserDictionary("file:///path/to/user_dict.csv");

setUserDictionaryInstance(userDictionary)

Sets a pre-loaded user dictionary instance. Use this when the user dictionary has been loaded from bytes instead of from a URI.

  • Parameters: userDictionary (UserDictionary) -- A loaded user dictionary object
  • Returns: void
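
A sketch of the intended flow, using the URI-based loadUserDictionary helper documented below (assuming metadata is the system dictionary's Metadata object; a bytes-based loader would slot in the same way):

import { loadUserDictionary } from 'lindera-wasm-web';

// Load the user dictionary, then attach the instance to the builder.
const userDict = loadUserDictionary("file:///path/to/user_dict.csv", metadata);
builder.setUserDictionaryInstance(userDict);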

setKeepWhitespace(keep)

Sets whether whitespace tokens are preserved in the output.

  • Parameters: keep (boolean) -- true to keep whitespace tokens
  • Returns: void

builder.setKeepWhitespace(true);

appendCharacterFilter(name, args)

Appends a character filter to the preprocessing pipeline.

  • Parameters:
    • name (string) -- Filter name (e.g., "unicode_normalize", "japanese_iteration_mark")
    • args (object, optional) -- Filter configuration
  • Returns: void

builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

appendTokenFilter(name, args)

Appends a token filter to the postprocessing pipeline.

  • Parameters:
    • name (string) -- Filter name (e.g., "japanese_stop_tags", "lowercase")
    • args (object, optional) -- Filter configuration
  • Returns: void

builder.appendTokenFilter("japanese_stop_tags", {
    tags: ["助詞", "助動詞", "記号"]
});

build()

Builds and returns a configured Tokenizer instance. Consumes the builder.

  • Returns: Tokenizer

const tokenizer = builder.build();

Tokenizer

The main tokenizer class. Can be created via TokenizerBuilder.build() or directly via the constructor.

Tokenizer Constructor

const tokenizer = new Tokenizer(dictionary, mode, userDictionary);

  • Parameters:
    • dictionary (Dictionary) -- A loaded dictionary object
    • mode (string, optional) -- Tokenization mode ("normal" or "decompose", defaults to "normal")
    • userDictionary (UserDictionary, optional) -- A loaded user dictionary

Tokenizer Methods

tokenize(text)

Tokenizes the input text.

  • Parameters: text (string) -- Text to tokenize
  • Returns: Token[] -- Array of token objects

const tokens = tokenizer.tokenize("関西国際空港");

tokenizeNbest(text, n, unique?, costThreshold?)

Returns N-best tokenization results ordered by total path cost.

  • Parameters:
    • text (string) -- Text to tokenize
    • n (number) -- Number of results to return
    • unique (boolean, optional) -- Deduplicate results with identical segmentation (default: false)
    • costThreshold (number, optional) -- Only return paths within bestCost + threshold
  • Returns: Array of { tokens: object[], cost: number }

const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);

Token

Represents a single token produced by the tokenizer.

Properties

| Property | Type | Description |
|----------|------|-------------|
| surface | string | Surface form of the token |
| byteStart | number | Start byte offset in the original text |
| byteEnd | number | End byte offset in the original text |
| position | number | Position index of the token |
| wordId | number | Word ID in the dictionary |
| isUnknown | boolean | Whether the token is an unknown word |
| details | string[] | Morphological detail fields |

Token Methods

getDetail(index)

Returns the detail string at the specified index.

  • Parameters: index (number) -- Zero-based index into the details array
  • Returns: string | undefined
const pos = token.getDetail(0);   // e.g., "名詞"
const reading = token.getDetail(7); // e.g., "トウキョウ"

toJSON()

Returns a plain JavaScript object representation of the token.

  • Returns: object with keys: surface, byteStart, byteEnd, position, wordId, isUnknown, details
console.log(JSON.stringify(token.toJSON(), null, 2));

Helper Functions

loadDictionary(uri)

Loads a dictionary from the specified URI.

  • Parameters: uri (string) -- Dictionary URI (e.g., "embedded://ipadic")
  • Returns: Dictionary
import { loadDictionary } from 'lindera-wasm-web-ipadic';

const dict = loadDictionary("embedded://ipadic");

loadUserDictionary(uri, metadata)

Loads a user dictionary from the specified URI.

  • Parameters:
    • uri (string) -- Path or URI to the user dictionary file
    • metadata (Metadata) -- Dictionary metadata object
  • Returns: UserDictionary
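For example, reusing the metadata of an already-loaded dictionary (the dictionary variable is assumed to be a loaded Dictionary):

import { loadUserDictionary } from 'lindera-wasm-web';

const userDict = loadUserDictionary("/path/to/user_dict.csv", dictionary.metadata);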

buildDictionary(inputDir, outputDir, metadata)

Builds a compiled dictionary from source files.

  • Parameters:
    • inputDir (string) -- Path to the directory containing source dictionary files
    • outputDir (string) -- Path to the output directory
    • metadata (Metadata) -- Dictionary metadata object
  • Returns: void

buildUserDictionary(inputFile, outputDir, metadata?)

Builds a compiled user dictionary from a CSV file.

  • Parameters:
    • inputFile (string) -- Path to the user dictionary CSV file
    • outputDir (string) -- Path to the output directory
    • metadata (Metadata, optional) -- Dictionary metadata object
  • Returns: void

version() / getVersion()

Returns the version string of the lindera-wasm package.

  • Returns: string
import { version } from 'lindera-wasm-web-ipadic';

console.log(version()); // e.g., "2.1.1"

Enums and Utility Classes

Mode

Tokenization mode enum.

| Value | Description |
|---|---|
| Mode.Normal | Standard tokenization based on dictionary cost |
| Mode.Decompose | Decompose compound words using penalty-based segmentation |

Penalty

Configuration for decompose mode. Controls how aggressively compound words are decomposed.

const penalty = new Penalty(
    kanjiThreshold?,     // Kanji length threshold (default: 2)
    kanjiPenalty?,       // Kanji length penalty (default: 3000)
    otherThreshold?,     // Other character length threshold (default: 7)
    otherPenalty?,       // Other character length penalty (default: 1700)
);

| Property | Type | Default | Description |
|---|---|---|---|
| kanji_penalty_length_threshold | number | 2 | Length threshold for kanji compound splitting |
| kanji_penalty_length_penalty | number | 3000 | Penalty cost for kanji compounds exceeding the threshold |
| other_penalty_length_threshold | number | 7 | Length threshold for non-kanji compound splitting |
| other_penalty_length_penalty | number | 1700 | Penalty cost for non-kanji compounds exceeding the threshold |

LinderaError

Error type for Lindera operations.

const error = new LinderaError("message");
console.log(error.message);    // "message"
console.log(error.toString()); // "message"

| Property / Method | Type | Description |
|---|---|---|
| message | string | Error message |
| toString() | string | Returns the error message |

Snake-Case Aliases

For consistency with the Python API, all methods are also available in snake_case form:

| camelCase | snake_case |
|---|---|
| setMode() | set_mode() |
| setDictionary() | set_dictionary() |
| setDictionaryInstance() | set_dictionary_instance() |
| setUserDictionary() | set_user_dictionary() |
| setUserDictionaryInstance() | set_user_dictionary_instance() |
| setKeepWhitespace() | set_keep_whitespace() |
| appendCharacterFilter() | append_character_filter() |
| appendTokenFilter() | append_token_filter() |
| tokenizeNbest() | tokenize_nbest() |
| loadDictionary() | load_dictionary() |
| loadDictionaryFromBytes() | load_dictionary_from_bytes() |
| loadUserDictionary() | load_user_dictionary() |
| buildDictionary() | build_dictionary() |
| buildUserDictionary() | build_user_dictionary() |
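For example, the following two calls are equivalent:

builder.setDictionary("embedded://ipadic");
builder.set_dictionary("embedded://ipadic");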

Dictionary Management

Loading Dictionaries from OPFS

The recommended way to use dictionaries in WASM is to download them from GitHub Releases and load them via OPFS. This avoids embedding large dictionaries in the WASM binary.

Loading from Bytes

Use loadDictionaryFromBytes() to construct a Dictionary from raw byte arrays stored in OPFS or other browser storage.

loadDictionaryFromBytes(metadata, dictDa, dictVals, dictWordsIdx, dictWords, matrixMtx, charDef, unk)

  • Parameters:
    • metadata (Uint8Array) -- Contents of metadata.json
    • dictDa (Uint8Array) -- Contents of dict.da (Double-Array Trie)
    • dictVals (Uint8Array) -- Contents of dict.vals (word value data)
    • dictWordsIdx (Uint8Array) -- Contents of dict.wordsidx (word details index)
    • dictWords (Uint8Array) -- Contents of dict.words (word details)
    • matrixMtx (Uint8Array) -- Contents of matrix.mtx (connection cost matrix)
    • charDef (Uint8Array) -- Contents of char_def.bin (character definitions)
    • unk (Uint8Array) -- Contents of unk.bin (unknown word dictionary)
  • Returns: Dictionary
import { loadDictionaryFromBytes, TokenizerBuilder } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

// Load dictionary files from OPFS
const files = await loadDictionaryFiles("ipadic");

// Create a Dictionary from bytes
const dictionary = loadDictionaryFromBytes(
    files.metadata,
    files.dictDa,
    files.dictVals,
    files.dictWordsIdx,
    files.dictWords,
    files.matrixMtx,
    files.charDef,
    files.unk,
);

// Use with TokenizerBuilder
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();

See OPFS Dictionary Storage for the full OPFS workflow including downloading and caching.

Embedded Dictionaries (Advanced)

If you built with an embed-* feature flag, you can load embedded dictionaries via the embedded:// URI scheme. This increases the WASM binary size significantly.

Loading an Embedded Dictionary

import { loadDictionary } from 'lindera-wasm-web-ipadic';

const dictionary = loadDictionary("embedded://ipadic");

Available embedded dictionary URIs (depending on which features were enabled at build time):

| URI | Feature Flag |
|---|---|
| embedded://ipadic | embed-ipadic |
| embedded://unidic | embed-unidic |
| embedded://ko-dic | embed-ko-dic |
| embedded://cc-cedict | embed-cc-cedict |
| embedded://jieba | embed-jieba |

Using with TokenizerBuilder

const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();

Using with Tokenizer Constructor

import { loadDictionary, Tokenizer } from 'lindera-wasm-web-ipadic';

const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");

Dictionary Class

The Dictionary class represents a loaded morphological analysis dictionary.

Properties

| Property | Type | Description |
|---|---|---|
| name | string | Dictionary name (e.g., "ipadic") |
| encoding | string | Character encoding of the dictionary |
| metadata | Metadata | Full metadata object |

console.log(dictionary.name);     // "ipadic"
console.log(dictionary.encoding); // "utf-8"

User Dictionaries

User dictionaries allow you to add custom words that are not in the system dictionary.

Loading a User Dictionary

import { loadUserDictionary } from 'lindera-wasm-web';

const metadata = dictionary.metadata;
const userDict = loadUserDictionary("/path/to/user_dict.csv", metadata);

Using a User Dictionary with Tokenizer

import { loadDictionaryFromBytes, loadUserDictionary, Tokenizer } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
    files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
    files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const userDict = loadUserDictionary("/path/to/user_dict.csv", dictionary.metadata);
const tokenizer = new Tokenizer(dictionary, "normal", userDict);

User Dictionary CSV Format

The user dictionary CSV follows the same format as the Lindera user dictionary:

東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン

Each line contains: surface,part_of_speech,reading

Building Dictionaries

You can build compiled dictionaries from source files using the JavaScript API.

Building a System Dictionary

import { buildDictionary } from 'lindera-wasm-web';

const metadata = {
    name: "custom-dict",
    encoding: "utf-8",
    // ... other metadata fields
};

buildDictionary("/path/to/source/dir", "/path/to/output/dir", metadata);

Building a User Dictionary

import { buildUserDictionary } from 'lindera-wasm-web';

buildUserDictionary("/path/to/user_dict.csv", "/path/to/output/dir");

The metadata parameter is optional for buildUserDictionary. If omitted, default metadata is used.

Metadata

The Metadata class configures dictionary parameters.

Constructor

const metadata = new Metadata(name?, encoding?);
  • Parameters:
    • name (string, optional) -- Dictionary name (default: "default")
    • encoding (string, optional) -- Character encoding (default: "UTF-8")

Static Methods

Metadata.createDefault()

Creates a Metadata instance with default values.

const metadata = Metadata.createDefault();

Metadata Properties

| Property | Type | Default | Description |
|---|---|---|---|
| name | string | "default" | Dictionary name |
| encoding | string | "UTF-8" | Character encoding |
| dictionary_schema | Schema | IPADIC schema | Schema for the main dictionary |
| user_dictionary_schema | Schema | Minimal schema | Schema for user dictionaries |

All properties support both getting and setting:

const metadata = Metadata.createDefault();
metadata.name = "custom_dict";
metadata.encoding = "EUC-JP";
console.log(metadata.name); // "custom_dict"

You can also access the metadata from a loaded dictionary via dictionary.metadata.

Schema

The Schema class defines the field structure of dictionary entries.

Schema Constructor

const schema = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);

Schema Static Methods

  • Schema.create_default() -- Creates a default IPADIC-like schema

Schema Methods

| Method | Returns | Description |
|---|---|---|
| get_field_index(name) | number \| undefined | Get field index by name |
| field_count() | number | Total number of fields |
| get_field_name(index) | string \| undefined | Get field name by index |
| get_custom_fields() | string[] | Fields beyond index 3 (morphological features) |
| get_all_fields() | string[] | All field names |
| get_field_by_name(name) | FieldDefinition \| undefined | Get full field definition |
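A brief sketch of inspecting a schema (the printed values assume the default IPADIC-like layout described above):

const schema = Schema.create_default();
console.log(schema.field_count());            // total number of fields
console.log(schema.get_field_index("cost"));  // e.g., 3 in the default layout
console.log(schema.get_custom_fields());      // field names beyond index 3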

FieldDefinition

| Property | Type | Description |
|---|---|---|
| index | number | Field position index |
| name | string | Field name |
| field_type | FieldType | Field type enum |
| description | string \| undefined | Optional description |

FieldType

| Value | Description |
|---|---|
| FieldType.Surface | Word surface text |
| FieldType.LeftContextId | Left context ID |
| FieldType.RightContextId | Right context ID |
| FieldType.Cost | Word cost |
| FieldType.Custom | Morphological feature field |

Browser Usage

ES Module Import

In browser environments, you must initialize the WASM module before using any Lindera functions. The default export __wbg_init handles this initialization.

The recommended approach is to load dictionaries from OPFS rather than embedding them in the WASM binary:

import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from 'lindera-wasm-web/opfs';

async function main() {
    // Initialize the WASM module (must be called once before using any API)
    await __wbg_init();

    // Download dictionary if not cached
    if (!await hasDictionary("ipadic")) {
        await downloadDictionary(
            "https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
            "ipadic",
        );
    }

    // Load dictionary from OPFS
    const files = await loadDictionaryFiles("ipadic");
    const dictionary = loadDictionaryFromBytes(
        files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
        files.dictWords, files.matrixMtx, files.charDef, files.unk,
    );

    const builder = new TokenizerBuilder();
    builder.setDictionaryInstance(dictionary);
    builder.setMode("normal");
    const tokenizer = builder.build();

    const tokens = tokenizer.tokenize("形態素解析を行います");
    tokens.forEach(token => {
        console.log(`${token.surface}: ${token.details.join(',')}`);
    });
}

main();

Using Embedded Dictionaries (Advanced)

If you built with an embed-* feature flag, you can use embedded dictionaries instead of OPFS:

import __wbg_init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';

async function main() {
    await __wbg_init();

    const builder = new TokenizerBuilder();
    builder.setDictionary("embedded://ipadic");
    builder.setMode("normal");
    const tokenizer = builder.build();

    const tokens = tokenizer.tokenize("形態素解析を行います");
    tokens.forEach(token => {
        console.log(`${token.surface}: ${token.details.join(',')}`);
    });
}

main();

HTML Example

A minimal HTML page using lindera-wasm with OPFS dictionary loading:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Lindera WASM Demo</title>
</head>
<body>
    <textarea id="input" rows="4" cols="50">関西国際空港限定トートバッグ</textarea>
    <br>
    <button id="tokenize" disabled>Tokenize</button>
    <pre id="output">Loading dictionary...</pre>

    <script type="module">
        import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from './pkg/lindera_wasm.js';
        import { downloadDictionary, loadDictionaryFiles, hasDictionary } from './pkg/opfs.js';

        let tokenizer;

        async function init() {
            await __wbg_init();

            // Download dictionary if not cached
            if (!await hasDictionary("ipadic")) {
                document.getElementById('output').textContent = 'Downloading dictionary...';
                await downloadDictionary(
                    "https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
                    "ipadic",
                );
            }

            // Load dictionary from OPFS
            const files = await loadDictionaryFiles("ipadic");
            const dictionary = loadDictionaryFromBytes(
                files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
                files.dictWords, files.matrixMtx, files.charDef, files.unk,
            );

            const builder = new TokenizerBuilder();
            builder.setDictionaryInstance(dictionary);
            builder.setMode("normal");
            tokenizer = builder.build();

            document.getElementById('tokenize').disabled = false;
            document.getElementById('output').textContent = 'Ready!';
        }

        document.getElementById('tokenize').addEventListener('click', () => {
            const text = document.getElementById('input').value;
            const tokens = tokenizer.tokenize(text);
            const output = tokens.map(t =>
                `${t.surface}\t${t.details.join(',')}`
            ).join('\n');
            document.getElementById('output').textContent = output;
        });

        init();
    </script>
</body>
</html>

Webpack Configuration

When using Webpack 5, enable the asyncWebAssembly experiment:

// webpack.config.js
module.exports = {
    experiments: {
        asyncWebAssembly: true,
    },
    module: {
        rules: [
            {
                test: /\.wasm$/,
                type: "webassembly/async",
            },
        ],
    },
};

Then import using the bundler target build:

import { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-bundler';
import { loadDictionaryFiles } from 'lindera-wasm-bundler/opfs';

// Load dictionary from OPFS (see OPFS Dictionary Storage for setup)
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
    files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
    files.dictWords, files.matrixMtx, files.charDef, files.unk,
);

const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();

With the bundler target, __wbg_init() is called automatically by the bundler.

Vite / Rollup Setup

Vite supports WASM out of the box with the web target. Place the built pkg/ directory in your project and import directly:

import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from './pkg/lindera_wasm.js';
import { loadDictionaryFiles } from './pkg/opfs.js';

await __wbg_init();
// Load dictionary from OPFS and use TokenizerBuilder as shown above

For the bundler target with Vite, you may need the vite-plugin-wasm plugin:

// vite.config.js
import wasm from 'vite-plugin-wasm';

export default {
    plugins: [wasm()],
};

Chrome Extension Considerations

Chrome extensions using Manifest V3 restrict WebAssembly.compile and WebAssembly.instantiate by default. To use lindera-wasm in an extension, you need to add wasm-unsafe-eval to your Content Security Policy:

{
    "content_security_policy": {
        "extension_pages": "script-src 'self' 'wasm-unsafe-eval'; object-src 'self'"
    }
}

Note that wasm-unsafe-eval only allows WebAssembly execution and does not permit arbitrary JavaScript eval().

Performance Tips

  • Initialize once: Call __wbg_init() once at application startup, not on every tokenization request.
  • Reuse the tokenizer: Create the Tokenizer instance once and reuse it for multiple calls to tokenize().
  • Web Workers: For heavy tokenization workloads, consider running Lindera in a Web Worker to avoid blocking the main thread (see the sketch below).
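A minimal worker sketch, assuming the web target build and a dictionary already cached in OPFS under the name "ipadic" (file names and layout here are illustrative):

// worker.js -- tokenizes off the main thread
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

let tokenizer;

self.onmessage = async (event) => {
    if (!tokenizer) {
        // One-time initialization inside the worker
        await __wbg_init();
        const files = await loadDictionaryFiles("ipadic");
        const dictionary = loadDictionaryFromBytes(
            files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
            files.dictWords, files.matrixMtx, files.charDef, files.unk,
        );
        const builder = new TokenizerBuilder();
        builder.setDictionaryInstance(dictionary);
        builder.setMode("normal");
        tokenizer = builder.build();
    }
    // Convert tokens to plain objects so they can be structured-cloned
    self.postMessage(tokenizer.tokenize(event.data).map(t => t.toJSON()));
};

// main.js -- spawn the worker as a module and send text to it
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
worker.onmessage = (event) => console.log(event.data);
worker.postMessage("関西国際空港限定トートバッグ");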

OPFS Dictionary Storage

Lindera WASM provides OPFS (Origin Private File System) helper utilities for persistent dictionary caching in web browsers. This allows you to download dictionaries once and reuse them across sessions without embedding them in the WASM binary.

Overview

The OPFS helpers are distributed as a separate JavaScript module (opfs.js) alongside the WASM package. They provide functions to download, store, load, and manage dictionaries using the browser's Origin Private File System.

Dictionaries are stored under the OPFS path lindera/dictionaries/<name>/.

Import

import { downloadDictionary, loadDictionaryFiles, removeDictionary,
         listDictionaries, hasDictionary } from 'lindera-wasm-web/opfs';

Functions

downloadDictionary(url, name, options?)

Downloads a dictionary zip archive, extracts it, and stores the files in OPFS.

The archive should be a zip file containing the 8 required dictionary files, optionally nested in a subdirectory.

  • Parameters:
    • url (string) -- URL of the dictionary zip archive
    • name (string) -- Name to store the dictionary under (e.g., "ipadic")
    • options (object, optional):
      • onProgress (function) -- Progress callback
  • Returns: Promise<void>
await downloadDictionary(
    "https://example.com/ipadic.zip",
    "ipadic",
    {
        onProgress: (progress) => {
            switch (progress.phase) {
                case "downloading":
                    console.log(`Downloading: ${progress.loaded}/${progress.total} bytes`);
                    break;
                case "extracting":
                    console.log("Extracting archive...");
                    break;
                case "storing":
                    console.log("Storing in OPFS...");
                    break;
                case "complete":
                    console.log("Done!");
                    break;
            }
        },
    },
);

Progress Callback

The onProgress callback receives an object with the following shape:

| Property | Type | Description |
|---|---|---|
| phase | string | "downloading", "extracting", "storing", or "complete" |
| loaded | number \| undefined | Bytes downloaded (only during the "downloading" phase) |
| total | number \| undefined | Total bytes if known (only during the "downloading" phase) |

loadDictionaryFiles(name)

Loads dictionary files from OPFS as an object of Uint8Array values.

The returned object can be passed directly to loadDictionaryFromBytes().

  • Parameters: name (string) -- The dictionary name (e.g., "ipadic")
  • Returns: Promise<DictionaryFiles>
const files = await loadDictionaryFiles("ipadic");

DictionaryFiles

| Property | Type | Source File |
|---|---|---|
| metadata | Uint8Array | metadata.json |
| dictDa | Uint8Array | dict.da (Double-Array Trie) |
| dictVals | Uint8Array | dict.vals (word value data) |
| dictWordsIdx | Uint8Array | dict.wordsidx (word details index) |
| dictWords | Uint8Array | dict.words (word details) |
| matrixMtx | Uint8Array | matrix.mtx (connection cost matrix) |
| charDef | Uint8Array | char_def.bin (character definitions) |
| unk | Uint8Array | unk.bin (unknown word dictionary) |

removeDictionary(name)

Removes a dictionary from OPFS.

  • Parameters: name (string) -- The dictionary name to remove
  • Returns: Promise<void>
await removeDictionary("ipadic");

listDictionaries()

Lists all dictionaries stored in OPFS.

  • Returns: Promise<string[]> -- Array of dictionary names
const names = await listDictionaries();
console.log(names); // e.g., ["ipadic", "unidic"]

hasDictionary(name)

Checks if a dictionary exists in OPFS.

  • Parameters: name (string) -- The dictionary name to check
  • Returns: Promise<boolean>
if (await hasDictionary("ipadic")) {
    console.log("Dictionary is cached");
}

Complete Workflow

A typical workflow for using OPFS-based dictionaries:

import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from 'lindera-wasm-web/opfs';

async function main() {
    await __wbg_init();

    const DICT_NAME = "ipadic";
    const DICT_URL = "https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip";

    // Download dictionary if not already cached
    if (!await hasDictionary(DICT_NAME)) {
        await downloadDictionary(DICT_URL, DICT_NAME, {
            onProgress: ({ phase, loaded, total }) => {
                if (phase === "downloading" && total) {
                    console.log(`${(loaded / total * 100).toFixed(1)}%`);
                }
            },
        });
    }

    // Load dictionary from OPFS
    const files = await loadDictionaryFiles(DICT_NAME);
    const dictionary = loadDictionaryFromBytes(
        files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
        files.dictWords, files.matrixMtx, files.charDef, files.unk,
    );

    // Build tokenizer
    const builder = new TokenizerBuilder();
    builder.setDictionaryInstance(dictionary);
    builder.setMode("normal");
    const tokenizer = builder.build();

    // Tokenize
    const tokens = tokenizer.tokenize("形態素解析を行います");
    tokens.forEach(token => {
        console.log(`${token.surface}\t${token.details.join(',')}`);
    });
}

main();

Required Dictionary Files

A valid dictionary archive must contain these 8 files:

| File | Description |
|---|---|
| metadata.json | Dictionary metadata (name, encoding, schema, etc.) |
| dict.da | Double-Array Trie structure |
| dict.vals | Word value data |
| dict.wordsidx | Word details index |
| dict.words | Word details (morphological features) |
| matrix.mtx | Connection cost matrix |
| char_def.bin | Character category definitions |
| unk.bin | Unknown word dictionary |

Browser Compatibility

OPFS requires a secure context (HTTPS or localhost) and is supported in:

  • Chrome 86+
  • Edge 86+
  • Firefox 111+
  • Safari 15.2+

The zip extraction uses the DecompressionStream API, which requires:

  • Chrome 80+
  • Edge 80+
  • Firefox 113+
  • Safari 16.4+
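Before attempting a download, you can feature-detect both prerequisites at runtime. A minimal sketch using standard web APIs (the helper name is illustrative):

function canUseOpfsDictionaries() {
    // OPFS needs a secure context and navigator.storage.getDirectory();
    // zip extraction additionally needs DecompressionStream.
    return self.isSecureContext
        && typeof navigator.storage?.getDirectory === 'function'
        && typeof DecompressionStream === 'function';
}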

Lindera IPADIC

Lindera IPADIC is a Japanese dictionary crate based on IPADIC. IPADIC is the most common dictionary for Japanese morphological analysis.

Contents

  • Dictionary Format -- Field definitions for system and user dictionaries
  • Build -- How to build the dictionary from source
  • Examples -- Tokenization examples

API Reference

Lindera IPADIC

Dictionary version

This repository contains mecab-ipadic.

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |

User dictionary format (CSV)

Simple version

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Part-of-speech | |
| 2 | 読み | Reading | |
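For example, the simple format looks like the following entries (the same custom nouns used in the Examples section below):

東京スカイツリー,カスタム名詞,トウキョウスカイツリー
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ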

Detailed version

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13+ | | | Fields from index 13 onward can be freely extended |
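A detailed entry spells out the context IDs and cost explicitly. The numeric values in this sketch are hypothetical; the trainer or an existing dictionary determines real values:

東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー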

API reference

The API reference for this crate is available on docs.rs.

Build

This page describes how to build the IPADIC dictionary from source files.

Build system dictionary

Download the IPADIC source files and build the dictionary:

# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp

# Build the dictionary
% lindera build \
  --src /tmp/mecab-ipadic-2.7.0-20250920 \
  --dest /tmp/lindera-ipadic-2.7.0-20250920 \
  --metadata ./lindera-ipadic/metadata.json

Build user dictionary

Build a user dictionary from a CSV file:

% lindera build \
  --src ./resources/user_dict/ipadic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ipadic/metadata.json \
  --user

For more details about user dictionary format, see Dictionary Format.

Embedding in binary

To embed the IPADIC dictionary directly into the binary:

cargo build --features=embed-ipadic

This allows using embedded://ipadic as the dictionary path without external dictionary files.

Examples

This page shows tokenization examples using the IPADIC dictionary.

Tokenize with external IPADIC

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Tokenize with embedded IPADIC

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素  名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析    名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC dictionary in the binary, you must build with the --features=embed-ipadic option.

Tokenize with user dictionary (CSV format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict embedded://ipadic \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Tokenize with user dictionary (binary format)

% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
  --dict /tmp/lindera-ipadic-2.7.0-20250920 \
  --user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー        カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の      助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅        名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅        カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です    助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }
    Ok(())
}

Lindera IPADIC NEologd

Lindera IPADIC NEologd is a Japanese dictionary crate based on IPADIC NEologd, which includes neologisms (new words). It extends the standard IPADIC dictionary with additional vocabulary covering recent terms and proper nouns.

Contents

  • Dictionary Format -- Field definitions for system and user dictionaries
  • Build -- How to build the dictionary from source
  • Examples -- Tokenization examples

API Reference

Lindera IPADIC NEologd

Dictionary version

This repository contains mecab-ipadic-neologd.

Dictionary format

Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |

User dictionary format (CSV)

Simple version

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Part-of-speech | |
| 2 | 読み | Reading | |

Detailed version

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13+ | | | Fields from index 13 onward can be freely extended |

API reference

The API reference for this crate is available on docs.rs.

Build

This page describes how to build the IPADIC NEologd dictionary from source files.

Build system dictionary

Download the IPADIC NEologd source files and build the dictionary:

% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp

% lindera build \
  --src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
  --dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
  --metadata ./lindera-ipadic-neologd/metadata.json

Embedding in binary

To embed the IPADIC NEologd dictionary directly into the binary:

cargo build --features=embed-ipadic-neologd

This allows using embedded://ipadic-neologd as the dictionary path without external dictionary files.

Examples

This page shows tokenization examples using the IPADIC NEologd dictionary.

Tokenize with external IPADIC NEologd

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

Notice that NEologd treats "形態素解析" (morphological analysis) as a single compound noun, whereas standard IPADIC splits it into "形態素" and "解析".

Tokenize with embedded IPADIC NEologd

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://ipadic-neologd
日本語  名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析      名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う    動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき    動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。      記号,句点,*,*,*,*,。,。,。
EOS

NOTE: To include IPADIC NEologd dictionary in the binary, you must build with the --features=embed-ipadic-neologd option.

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic-neologd")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }
    Ok(())
}

Lindera UniDic

Lindera UniDic is a Japanese dictionary crate based on UniDic, which uses uniform word unit definitions. UniDic provides more detailed morphological information than IPADIC, with 21 fields per entry.

Contents

  • Dictionary Format -- Field definitions for system and user dictionaries
  • Build -- How to build the dictionary from source
  • Examples -- Tokenization examples

API Reference

Lindera UniDic

Dictionary version

This repository contains unidic-mecab.

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Part-of-speech | |
| 5 | 品詞中分類 | Part-of-speech subcategory 1 | |
| 6 | 品詞小分類 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類 | Part-of-speech subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthographic surface form | |
| 13 | 発音形出現形 | Phonological surface form | |
| 14 | 書字形基本形 | Orthographic base form | |
| 15 | 発音形基本形 | Phonological base form | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Initial mutation type | |
| 18 | 語頭変化形 | Initial mutation form | |
| 19 | 語末変化型 | Final mutation type | |
| 20 | 語末変化形 | Final mutation form | |

User dictionary format (CSV)

Simple version

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞大分類 | Part-of-speech | |
| 2 | 語彙素読み | Reading | |

Detailed version

| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Part-of-speech | |
| 5 | 品詞中分類 | Part-of-speech subcategory 1 | |
| 6 | 品詞小分類 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類 | Part-of-speech subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthographic surface form | |
| 13 | 発音形出現形 | Phonological surface form | |
| 14 | 書字形基本形 | Orthographic base form | |
| 15 | 発音形基本形 | Phonological base form | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Initial mutation type | |
| 18 | 語頭変化形 | Initial mutation form | |
| 19 | 語末変化型 | Final mutation type | |
| 20 | 語末変化形 | Final mutation form | |
| 21+ | | | Fields from index 21 onward can be freely extended |

API reference

The API reference for this crate is available on docs.rs.

Build

This page describes how to build the UniDic dictionary from source files.

Build system dictionary

Download the UniDic source files and build the dictionary:

% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp

% lindera build \
  --src /tmp/unidic-mecab-2.1.2 \
  --dest /tmp/lindera-unidic-2.1.2 \
  --metadata ./lindera-unidic/metadata.json

Build user dictionary

Build a user dictionary from a CSV file:

% lindera build \
  --src ./resources/user_dict/unidic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-unidic/metadata.json \
  --user

For more details about user dictionary format, see Dictionary Format.

Embedding in binary

To embed the UniDic dictionary directly into the binary:

cargo build --features=embed-unidic

This allows using embedded://unidic as the dictionary path without external dictionary files.

Examples

This page shows tokenization examples using the UniDic dictionary.

Tokenize with external UniDic

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict /tmp/lindera-unidic-2.1.2
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Notice that UniDic splits "日本語" into "日本" and "語", and "形態素" into "形態" and "素", reflecting its uniform word unit definitions.

Tokenize with embedded UniDic

% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
  --dict embedded://unidic
日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の      助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態    名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素      接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析    名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を      助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う    動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと    名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が      助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき    動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます    助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。      補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

NOTE: To include UniDic dictionary in the binary, you must build with the --features=embed-unidic option.

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://unidic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }
    Ok(())
}

Lindera ko-dic

Lindera ko-dic is a Korean dictionary crate based on mecab-ko-dic.

Contents

  • Dictionary Format -- Field definitions for system and user dictionaries
  • Build -- How to build the dictionary from source
  • Examples -- Tokenization examples

API Reference

Lindera ko-dic

Dictionary version

This repository contains mecab-ko-dic.

Dictionary format

Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in a Google Spreadsheet linked from the mecab-ko-dic repository README.

Note that ko-dic has one fewer feature column than NAIST JDIC, and carries an altogether different set of information (for example, it does not provide the "original form" of the word).

The tags are a slight modification of those specified by 세종 (Sejong). The mappings from Sejong to mecab-ko-dic's tag names are given in the 태그 v2.0 tab of the spreadsheet.

The dictionary format is specified fully (in Korean) in the 사전 형식 v2.0 tab of the spreadsheet. Any blank values default to *.

| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab of the spreadsheet |
| 5 | 의미 부류 | Meaning | Semantic class; the source data provides few examples |
| 6 | 종성 유무 | Final consonant (jongseong) presence | T for true; F for false; otherwise * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese-character words |
| 8 | 타입 | Type | One of: Inflect (활용), Compound (복합명사), or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", this is VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", this is EP |
| 11 | 표현 | Expression | Describes how Inflect, Compound, and Preanalysis entries are composed |

User dictionary format (CSV)

Simple version

| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab of the spreadsheet |
| 2 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese-character words |

Detailed version

| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See the 태그 v2.0 tab of the spreadsheet |
| 5 | 의미 부류 | Meaning | Semantic class; the source data provides few examples |
| 6 | 종성 유무 | Final consonant (jongseong) presence | T for true; F for false; otherwise * |
| 7 | 읽기 | Reading | Usually matches the surface, but may differ for foreign words, e.g. Chinese-character words |
| 8 | 타입 | Type | One of: Inflect (활용), Compound (복합명사), or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", this is VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", this is EP |
| 11 | 표현 | Expression | Describes how Inflect, Compound, and Preanalysis entries are composed |
| 12+ | | | Fields from index 12 onward can be freely extended |

API reference

The API reference for this crate is available on docs.rs.

Build

Build system dictionary

Download and extract the mecab-ko-dic source files, then build the dictionary:

% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp
% lindera build \
  --src /tmp/mecab-ko-dic-2.1.1-20180720 \
  --dest /tmp/lindera-ko-dic-2.1.1-20180720 \
  --metadata ./lindera-ko-dic/metadata.json

Build user dictionary

% lindera build \
  --src ./resources/user_dict/ko-dic_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-ko-dic/metadata.json \
  --user

Embedding the dictionary

To embed the ko-dic dictionary directly into the binary, build with the following feature flag:

% cargo build --features=embed-ko-dic

Examples

Tokenize with external ko-dic

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수      NNB,*,F,수,*,*,*,*
있      VV,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       SF,*,*,*,*,*,*,*
EOS

Tokenize with embedded ko-dic

% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
  --dict embedded://ko-dic
한국어  NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의      JKG,*,F,의,*,*,*,*
형태    NNG,*,F,형태,*,*,*,*
해석    NNG,행위,T,해석,*,*,*,*
을      JKO,*,T,을,*,*,*,*
실시    NNG,행위,F,실시,*,*,*,*
할      XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수      NNB,*,F,수,*,*,*,*
있      VV,*,T,있,*,*,*,*
습니다  EF,*,F,습니다,*,*,*,*
.       SF,*,*,*,*,*,*,*
EOS

NOTE: To include ko-dic dictionary in the binary, you must build with the --features=embed-ko-dic option.

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ko-dic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "한국어의형태해석을실시할수있습니다.";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }
    Ok(())
}

Lindera CC-CEDICT

Lindera CC-CEDICT is a Chinese dictionary crate based on CC-CEDICT-MeCab.

Contents

  • Dictionary Format -- Field definitions for system and user dictionaries
  • Build -- How to build the dictionary from source
  • Examples -- Tokenization examples

API Reference

Lindera CC-CEDICT

Dictionary version

This repository contains CC-CEDICT-MeCab.

Dictionary format

Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.

| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 词类1 | Part-of-speech subcategory 1 | |
| 6 | 词类2 | Part-of-speech subcategory 2 | |
| 7 | 词类3 | Part-of-speech subcategory 3 | |
| 8 | 併音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 簡体字 | Simplified | |
| 11 | 定义 | Definition | |

User dictionary format (CSV)

Simple version

| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 词类 | Part-of-speech | |
| 2 | 併音 | Pinyin | |

Detailed version

| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 词类1 | Part-of-speech subcategory 1 | |
| 6 | 词类2 | Part-of-speech subcategory 2 | |
| 7 | 词类3 | Part-of-speech subcategory 3 | |
| 8 | 併音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 簡体字 | Simplified | |
| 11 | 定义 | Definition | |
| 12+ | | | Fields from index 12 onward can be freely extended |

API reference

The API reference for this crate is available on docs.rs.

Build

Build system dictionary

Download and extract the CC-CEDICT-MeCab source files, then build the dictionary:

% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp
% lindera build \
  --src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
  --dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
  --metadata ./lindera-cc-cedict/metadata.json

Build user dictionary

% lindera build \
  --src ./resources/user_dict/cc-cedict_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-cc-cedict/metadata.json \
  --user

Embedding the dictionary

To embed the CC-CEDICT dictionary directly into the binary, build with the following feature flag:

% cargo build --features=embed-cc-cedict

Examples

Tokenize with external CC-CEDICT

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      *,*,*,*,*,*,*,*
EOS

Tokenize with embedded CC-CEDICT

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict embedded://cc-cedict
可以    *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行    *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文    *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学  *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析    *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。      *,*,*,*,*,*,*,*
EOS

NOTE: To include CC-CEDICT dictionary in the binary, you must build with the --features=embed-cc-cedict option.

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://cc-cedict")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "可以进行中文形态学分析。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }
    Ok(())
}

Lindera Jieba

Lindera Jieba is a Chinese dictionary crate based on mecab-jieba.

Contents

  • Dictionary Format -- Field definitions for system and user dictionaries
  • Build -- How to build the dictionary from source
  • Examples -- Tokenization examples

API Reference

Lindera Jieba

Dictionary version

This repository contains mecab-jieba.

Dictionary format

| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 併音 | Pinyin | |
| 6 | 繁体字 | Traditional | |
| 7 | 簡体字 | Simplified | |
| 8 | 定义 | Definition | |

User dictionary format (CSV)

Simple version

| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 词类 | Part-of-speech | |
| 2 | 併音 | Pinyin | |

Detailed version

| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 併音 | Pinyin | |
| 6 | 繁体字 | Traditional | |
| 7 | 簡体字 | Simplified | |
| 8 | 定义 | Definition | |
| 9+ | | | Fields from index 9 onward can be freely extended |

API reference

The API reference for this crate is available on docs.rs.

Build

Build system dictionary

Download and extract the mecab-jieba source files, then build the dictionary:

% curl -L -o /tmp/mecab-jieba-0.1.1.tar.gz "https://lindera.dev/mecab-jieba-0.1.1.tar.gz"
% tar zxvf /tmp/mecab-jieba-0.1.1.tar.gz -C /tmp
% lindera build \
  --src /tmp/mecab-jieba-0.1.1/dict-src \
  --dest /tmp/lindera-jieba-0.1.1 \
  --metadata ./lindera-jieba/metadata.json

Build user dictionary

% lindera build \
  --src ./resources/user_dict/jieba_simple_userdic.csv \
  --dest ./resources/user_dict \
  --metadata ./lindera-jieba/metadata.json \
  --user

Embedding the dictionary

To embed the Jieba dictionary directly into the binary, build with the following feature flag:

% cargo build --features=embed-jieba

Examples

Tokenize with external Jieba

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict /tmp/lindera-jieba-0.1.1
可以    c,CHINESE,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good,2,可,以,high
进行    v,CHINESE,jin4 xing2,進行,进行,(of a process etc) to proceed; to be in progress; to be underway/(of people) to carry out; to conduct (an investigation or discussion etc)/(of an army etc) to be on the march; to advance,2,进,行,high
中文    nz,CHINESE,Zhong1 wen2,中文,中文,Chinese language,2,中,文,high
形态    n,CHINESE,xing2 tai4,形態,形态,shape/form/pattern/morphology,2,形,态,high
学      n,CHINESE,xue2,學,学,to learn/to study/to imitate/science/-ology,1,学,学,high
分析    vn,CHINESE,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4],2,分,析,high
。      w,*,*,*,*,*,*,*,*,*
EOS

Tokenize with embedded Jieba

% echo "可以进行中文形态学分析。" | lindera tokenize \
  --dict embedded://jieba
可以    c,CHINESE,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good,2,可,以,high
进行    v,CHINESE,jin4 xing2,進行,进行,(of a process etc) to proceed; to be in progress; to be underway/(of people) to carry out; to conduct (an investigation or discussion etc)/(of an army etc) to be on the march; to advance,2,进,行,high
中文    nz,CHINESE,Zhong1 wen2,中文,中文,Chinese language,2,中,文,high
形态    n,CHINESE,xing2 tai4,形態,形态,shape/form/pattern/morphology,2,形,态,high
学      n,CHINESE,xue2,學,学,to learn/to study/to imitate/science/-ology,1,学,学,high
分析    vn,CHINESE,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4],2,分,析,high
。      w,*,*,*,*,*,*,*,*,*
EOS

NOTE: To include Jieba dictionary in the binary, you must build with the --features=embed-jieba option.

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://jieba")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "可以进行中文形态学分析。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }
    Ok(())
}

Development Guide

This section provides information for developers who want to build, test, or contribute to Lindera.

Build & Test

Build

Default Build

Build the workspace with default features (mmap):

cargo build

Build with Training Support

Include CRF-based dictionary training functionality:

cargo build --features train

Build CLI Only

cargo build -p lindera-cli

The CLI has the train feature enabled by default.

Test

Single Test

Run a specific test within a crate (recommended for development):

cargo test -p <crate> <test_name>

Training Feature Tests

cargo test -p lindera-dictionary --features train

All Features for a Crate

Run the full test suite for a single crate (matches CI):

cargo test -p <crate> --all-features

Workspace-Wide Tests

cargo test

Quality Checks

Format Check

Verify code formatting matches the project style:

cargo fmt --all -- --check

To auto-fix formatting:

cargo fmt --all

Lint

Run Clippy with warnings treated as errors:

cargo clippy -- -D warnings

Documentation

API Documentation

Generate and open Rust API documentation:

cargo doc --no-deps --open

mdBook Documentation

Build the user-facing documentation:

mdbook build docs

Preview locally at http://localhost:3000:

mdbook serve docs

Markdown Lint

Check documentation for Markdown style issues:

markdownlint-cli2 "docs/src/**/*.md"

Rules are configured in .markdownlint.json at the repository root.

Feature Flags

Lindera uses Cargo feature flags to control optional functionality and dictionary embedding.

Core Features

| Feature | Description | Default |
|---|---|---|
| mmap | Memory-mapped file support | Yes |
| train | CRF-based dictionary training (depends on lindera-crf) | CLI only |

  • mmap is enabled by default in the main lindera crate.
  • train is enabled by default only in lindera-cli. For library usage, enable it explicitly with --features train.

The recommended approach is to use pre-built dictionaries as external files. Download a dictionary from GitHub Releases and specify its path at runtime:

let dictionary = load_dictionary("/path/to/ipadic")?;

No additional feature flags are required for this usage.

Dictionary Embedding Features (Advanced)

These features embed pre-built dictionaries directly into the binary, eliminating the need for external dictionary files at runtime. This is intended for advanced users who need self-contained binaries.

| Feature | Dictionary | Language |
|---|---|---|
| embed-ipadic | IPADIC | Japanese |
| embed-ipadic-neologd | IPADIC NEologd | Japanese |
| embed-unidic | UniDic | Japanese |
| embed-ko-dic | ko-dic | Korean |
| embed-cc-cedict | CC-CEDICT | Chinese |
| embed-jieba | Jieba | Chinese |

None of these are enabled by default. Enable them as needed:

[dependencies]
lindera = { version = "2.3.2", features = ["embed-ipadic"] }

When embedding is enabled, you can load the dictionary with:

let dictionary = load_dictionary("embedded://ipadic")?;

Combination Features

These meta-features enable multiple dictionaries at once for multilingual applications.

| Feature | Included Dictionaries |
|---|---|
| embed-cjk | IPADIC + ko-dic + Jieba |
| embed-cjk2 | UniDic + ko-dic + Jieba |
| embed-cjk3 | IPADIC NEologd + ko-dic + Jieba |
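For example, a single flag pulls in all three CJK dictionaries:

cargo build --features embed-cjk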

Combining Feature Flags

Multiple feature flags can be combined. For example, to embed both Japanese and Korean dictionaries:

[dependencies]
lindera = { version = "2.3.2", features = ["embed-ipadic", "embed-ko-dic"] }

Or from the command line:

cargo build --features embed-ipadic,embed-ko-dic

Notes

  • Embedding dictionaries increases binary size significantly. Only embed dictionaries you actually need.
  • The train feature adds a dependency on lindera-crf and increases compile time. It is not needed for tokenization-only use cases.
  • The mmap feature enables memory-mapped dictionary loading, which reduces memory usage for large dictionaries loaded from disk. It has no effect on embedded dictionaries.

Project Structure

Lindera is organized as a Cargo workspace with multiple crates.

Directory Layout

lindera/
├── lindera-crf/            # CRF engine (pure Rust, no_std)
├── lindera-dictionary/     # Dictionary base library
├── lindera/                # Core morphological analysis library
├── lindera-cli/            # CLI tool
├── lindera-ipadic/         # IPADIC dictionary (Japanese)
├── lindera-ipadic-neologd/ # IPADIC NEologd dictionary (Japanese)
├── lindera-unidic/         # UniDic dictionary (Japanese)
├── lindera-ko-dic/         # ko-dic dictionary (Korean)
├── lindera-cc-cedict/      # CC-CEDICT dictionary (Chinese)
├── lindera-jieba/          # Jieba dictionary (Chinese)
├── lindera-python/         # Python bindings (PyO3)
├── lindera-wasm/           # WebAssembly bindings (wasm-bindgen)
├── resources/              # Test resources and sample data
├── docs/                   # Documentation (mdBook)
└── examples/               # Example code

Crate Descriptions

Core Crates

lindera-crf

Pure Rust implementation of Conditional Random Fields (CRF). Supports no_std environments. Uses rkyv for fast zero-copy serialization. This crate provides the statistical learning engine used in dictionary training.

lindera-dictionary

Base library for dictionary handling: loading, building, and querying dictionaries. With the train feature enabled, it also provides the CRF training pipeline for creating custom dictionaries.

Key modules under src/trainer/:

| Module | Role |
|--------|------|
| config.rs | Configuration management (seed dict, char.def, feature.def, rewrite.def) |
| corpus.rs | Training corpus processing |
| feature_extractor.rs | Feature template parsing and feature ID management |
| feature_rewriter.rs | MeCab-compatible feature rewriting (3-section format) |
| model.rs | Trained model storage, serialization, and dictionary output |

lindera

The main morphological analysis library. Integrates dictionary crates and provides the Tokenizer, Segmenter, character filters, and token filters.

lindera-cli

Command-line interface for tokenization, dictionary training, export, and building. The train feature is enabled by default.

Dictionary Crates

Each dictionary crate contains pre-built dictionary data for a specific language and dictionary source.

| Crate | Language | Dictionary Source |
|-------|----------|-------------------|
| lindera-ipadic | Japanese | IPADIC |
| lindera-ipadic-neologd | Japanese | IPADIC NEologd (extended vocabulary) |
| lindera-unidic | Japanese | UniDic |
| lindera-ko-dic | Korean | ko-dic |
| lindera-cc-cedict | Chinese | CC-CEDICT |
| lindera-jieba | Chinese | Jieba |

Bindings

lindera-python

Python bindings built with PyO3. Exposes the Lindera tokenizer API to Python applications.

lindera-wasm

WebAssembly bindings built with wasm-bindgen. Enables tokenization in browsers and Node.js.

Other Directories

resources/

Test resources including sample dictionaries, user dictionaries, and test corpora used by the test suite.

docs/

User-facing documentation built with mdBook. The table of contents is defined in docs/src/SUMMARY.md. A Japanese translation is available under docs/ja/.

examples/

Runnable example programs demonstrating common usage patterns. Run with:

cargo run --features=embed-ipadic --example=<example_name>

Training Pipeline

Lindera provides CRF-based dictionary training functionality for creating custom morphological analysis models. This feature requires the train feature flag.

Overview

The training pipeline follows three stages:

lindera train --> model.dat --> lindera export --> dictionary files --> lindera build --> compiled dictionary
  1. Train: Learn CRF weights from an annotated corpus and seed dictionary, producing a binary model file.
  2. Export: Convert the trained model into Lindera dictionary source files.
  3. Build: Compile the source files into a binary dictionary that Lindera can load at runtime.

Required Input Files

1. Seed Lexicon (seed.csv)

Base vocabulary dictionary in MeCab CSV format.

外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
参政,0,0,0,名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ

Each line contains: surface,left_id,right_id,cost,pos,pos_detail1,pos_detail2,pos_detail3,inflection_type,inflection_form,base_form,reading,pronunciation

The left_id, right_id, and cost fields are set to 0 in the seed dictionary; the trainer computes appropriate values from the CRF model.

2. Training Corpus (corpus.txt)

Annotated text data in tab-separated format. Each line is surface<TAB>pos_info, and sentences are separated by EOS.

外国	名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人	名詞,接尾,一般,*,*,*,人,ジン,ジン
参政	名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権	名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS

これ	連体詞,*,*,*,*,*,これ,コレ,コレ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト	名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
EOS

Training quality depends heavily on the quantity and quality of this corpus.

3. Character Definition (char.def)

Defines character type categories and Unicode code point ranges.

# Category definition: category_name compatibility_flag continuity_flag length
DEFAULT 0 1 0
HIRAGANA 1 1 0
KATAKANA 1 1 0
KANJI 0 0 2
ALPHA 1 1 0
NUMERIC 1 1 0

# Character range mapping
0x3041..0x3096 HIRAGANA  # Hiragana
0x30A1..0x30F6 KATAKANA  # Katakana
0x4E00..0x9FAF KANJI     # Kanji
0x0030..0x0039 NUMERIC   # Numbers
0x0041..0x005A ALPHA     # Uppercase letters
0x0061..0x007A ALPHA     # Lowercase letters

The three numeric parameters after each category name control how unknown words of that character type are segmented: compatibility with adjacent characters (compatibility_flag), whether runs of the same type continue as a single token (continuity_flag), and the default token length (length).
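
Reading the columns this way, the line KANJI 0 0 2 above says that kanji runs do not automatically continue as a single token and that unknown kanji words default to a length of 2, while HIRAGANA 1 1 0 lets a hiragana run continue as one token with no fixed default length.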

4. Unknown Word Definition (unk.def)

Defines how out-of-vocabulary words are handled by character type.

DEFAULT,0,0,0,名詞,一般,*,*,*,*,*,*,*
HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KATAKANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KANJI,0,0,0,名詞,一般,*,*,*,*,*,*,*
ALPHA,0,0,0,名詞,固有名詞,一般,*,*,*,*,*,*
NUMERIC,0,0,0,名詞,数,*,*,*,*,*,*,*
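
For example, with the definitions above, an out-of-vocabulary run of Latin letters (ALPHA) would be tagged as a proper noun (名詞,固有名詞,一般), while an unknown digit run (NUMERIC) would be tagged as a numeral (名詞,数).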

5. Feature Template (feature.def)

MeCab-compatible feature extraction patterns that define what information the CRF model uses for learning.

# Unigram features (word-level)
UNIGRAM U00:%F[0]           # POS
UNIGRAM U01:%F[0],%F?[1]    # POS + POS detail (%F?[n] = optional, skipped if *)
UNIGRAM U02:%F[6]           # Base form
UNIGRAM U03:%w              # Surface form

# Bigram features (context combination)
BIGRAM B00:%L[0]/%R[0]      # Left POS / Right POS
BIGRAM B01:%L[0],%L[1]/%R[0],%R[1]  # Left POS detail / Right POS detail

Template variables:

| Variable | Description |
|----------|-------------|
| %F[n] / %F?[n] | Feature field at index n (? = optional, skipped if value is *) |
| %L[n] | Left context feature field (from rewrite.def left section) |
| %R[n] | Right context feature field (from rewrite.def right section) |
| %w | Surface form of the word |
| %u | Unigram rewritten feature string |
| %l | Left rewritten feature string |
| %r | Right rewritten feature string |
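
As an illustration, take the corpus entry 外国 with the feature string 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク. The unigram templates above would expand roughly as follows:

U00:名詞        # %F[0] (POS)
U01:名詞,一般   # %F[0],%F?[1] (POS + POS detail)
U02:外国        # %F[6] (base form)
U03:外国        # %w (surface form)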

6. Feature Rewrite Rules (rewrite.def)

Feature normalization rules in MeCab-compatible 3-section format. Sections are separated by blank lines.

# Section 1: Unigram rewrite rules
名詞,固有名詞,*  名詞,固有名詞
助動詞,*,*,*,特殊・デス  助動詞
*  *

# Section 2: Left context rewrite rules
名詞,固有名詞,*  名詞,固有名詞
助詞,*  助詞
*  *

# Section 3: Right context rewrite rules
名詞,固有名詞,*  名詞,固有名詞
助詞,*  助詞
*  *

Each line is pattern<TAB>replacement. Patterns use * as a wildcard and are matched by prefix. The first matching rule in each section is applied. Different rules can be applied to unigram, left context, and right context independently, enabling fine-grained feature normalization to reduce sparsity.
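
A minimal Rust sketch of this first-match, prefix-with-wildcard behavior (illustrative only, not Lindera's internal implementation):

/// Return the replacement of the first rule whose pattern prefix-matches
/// `features`, treating `*` as a single-field wildcard.
fn apply_rewrite<'a>(
    features: &[&str],
    rules: &'a [(Vec<&'a str>, Vec<&'a str>)], // (pattern, replacement)
) -> Option<&'a [&'a str]> {
    rules
        .iter()
        .find(|(pattern, _)| {
            // Prefix match: every pattern field must equal the corresponding
            // feature field or be the `*` wildcard.
            pattern.len() <= features.len()
                && pattern
                    .iter()
                    .zip(features)
                    .all(|(p, f)| *p == "*" || p == f)
        })
        .map(|(_, replacement)| replacement.as_slice())
}

fn main() {
    let rules = vec![
        (vec!["名詞", "固有名詞", "*"], vec!["名詞", "固有名詞"]),
        (vec!["*"], vec!["*"]), // catch-all rule
    ];
    // 名詞,固有名詞,組織 matches the first rule by prefix.
    let rewritten = apply_rewrite(&["名詞", "固有名詞", "組織"], &rules);
    println!("{:?}", rewritten); // Some(["名詞", "固有名詞"])
}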

Training Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| lambda | L1 regularization coefficient (controls overfitting) | 0.01 |
| max-iter | Maximum number of training iterations | 100 |
| max-threads | Number of parallel processing threads | 1 |
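
As with any L1-regularized model, larger lambda values drive more feature weights to exactly zero, producing smaller models that may generalize less precisely; smaller values fit the corpus more closely at the risk of overfitting.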

CLI Usage

Train

lindera train \
    --seed seed.csv \
    --corpus corpus.txt \
    --char-def char.def \
    --unk-def unk.def \
    --feature-def feature.def \
    --rewrite-def rewrite.def \
    --lambda 0.01 \
    --max-iter 100 \
    --max-threads 4 \
    --output model.dat

Export

Convert the trained model into dictionary source files:

lindera export --model model.dat --output-dir ./dict-source

This produces the following files:

| File | Description |
|------|-------------|
| lex.csv | Lexicon with trained costs |
| matrix.def | Connection cost matrix |
| unk.def | Unknown word definition |
| char.def | Character definition |
| feature.def | Feature template |
| rewrite.def | Feature rewrite rules |
| left-id.def | Left context ID mapping |
| right-id.def | Right context ID mapping |
| metadata.json | Dictionary metadata |

Build

Compile the exported source files into a binary dictionary:

lindera build --input-dir ./dict-source --output-dir ./dict-compiled
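
The compiled dictionary can then be loaded by path like any external dictionary. A minimal sketch reusing the tokenizer setup shown earlier (the sample text comes from the training corpus above):

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Point load_dictionary at the directory produced by `lindera build`.
    let dictionary = load_dictionary("./dict-compiled")?;
    let tokenizer = Tokenizer::new(Segmenter::new(Mode::Normal, dictionary, None));

    let mut tokens = tokenizer.tokenize("外国人参政権")?;
    for token in tokens.iter_mut() {
        println!("{}\t{}", token.surface.as_ref(), token.details().join(","));
    }
    Ok(())
}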

Output Model Format

The trained model is serialized in rkyv binary format for fast loading. It contains:

  • Feature weights learned by the CRF
  • Label set (vocabulary entries)
  • Part-of-speech information
  • Feature templates
  • Training metadata (regularization, iterations, feature/label counts)

API Usage

The same pipeline can be driven programmatically through lindera-dictionary (build with the train feature enabled):

use std::fs::File;

use lindera_dictionary::trainer::{Corpus, Trainer, TrainerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load configuration from files
    let seed_file = File::open("resources/training/seed.csv")?;
    let char_file = File::open("resources/training/char.def")?;
    let unk_file = File::open("resources/training/unk.def")?;
    let feature_file = File::open("resources/training/feature.def")?;
    let rewrite_file = File::open("resources/training/rewrite.def")?;

    let config = TrainerConfig::from_readers(
        seed_file,
        char_file,
        unk_file,
        feature_file,
        rewrite_file,
    )?;

    // Initialize and configure the trainer
    let trainer = Trainer::new(config)?
        .regularization_cost(0.01)
        .max_iter(100)
        .num_threads(4);

    // Load the training corpus
    let corpus_file = File::open("resources/training/corpus.txt")?;
    let corpus = Corpus::from_reader(corpus_file)?;

    // Execute training
    let model = trainer.train(corpus)?;

    // Save the model (binary format)
    let mut output = File::create("trained_model.dat")?;
    model.write_model(&mut output)?;

    // Output in Lindera dictionary format
    let mut lex_out = File::create("output_lex.csv")?;
    let mut conn_out = File::create("output_conn.dat")?;
    let mut unk_out = File::create("output_unk.def")?;
    let mut user_out = File::create("output_user.csv")?;
    model.write_dictionary(&mut lex_out, &mut conn_out, &mut unk_out, &mut user_out)?;

    Ok(())
}

The following guidelines help produce effective dictionaries for real applications.

Corpus Size

| Level | Sentences | Use Case |
|-------|-----------|----------|
| Minimum | 100+ | Basic operation verification |
| Recommended | 1,000+ | Practical applications |
| Ideal | 10,000+ | Commercial quality |

Quality Guidelines

  • Vocabulary diversity: Balanced distribution of different parts of speech, coverage of inflections and suffixes, appropriate inclusion of technical terms and proper nouns.
  • Consistency: Apply analysis criteria consistently across the corpus.
  • Verification: Manually verify morphological analysis results. Maintain an error rate below 5%.

Contributing

Thank you for your interest in contributing to Lindera! This page provides guidelines to help you get started.

Getting Started

  1. Fork the repository on GitHub.

  2. Clone your fork locally:

    git clone https://github.com/<your-username>/lindera.git
    cd lindera
    
  3. Create a feature branch:

    git checkout -b feature/my-feature
    
  4. Make your changes, then verify they pass all checks:

    cargo fmt --all -- --check
    cargo clippy -- -D warnings
    cargo test
    
  5. Commit and push your changes, then open a pull request.

Code Style

  • Follow the existing code style in the repository.
  • Run cargo fmt before committing.
  • All public and private items (types, functions, modules, fields, constants, type aliases) must have documentation comments (///).
  • Trait implementation methods should also have documentation comments describing implementation-specific behavior.
  • Function and method documentation should include # Arguments and # Returns sections where applicable (see the sketch after this list).
  • Code comments, documentation comments, commit messages, log messages, and error messages should be written in English.
  • Avoid unwrap() and expect() in production code (test code is fine).
  • Use unsafe blocks only when necessary, and always include a // SAFETY: ... comment.
  • Use file-based module style (src/tokenizer.rs) instead of mod.rs style.
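
A minimal sketch of these documentation and safety conventions (the functions themselves are hypothetical):

/// Returns the number of Unicode scalar values in `text`.
///
/// # Arguments
///
/// * `text` - The input string to measure.
///
/// # Returns
///
/// The number of `char`s in `text`.
fn char_count(text: &str) -> usize {
    text.chars().count()
}

/// Returns the first byte of `bytes` without a bounds check.
fn first_byte(bytes: &[u8]) -> u8 {
    // SAFETY: callers must guarantee `bytes` is non-empty; this skips the
    // bounds check performed by normal indexing.
    unsafe { *bytes.get_unchecked(0) }
}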

Testing

  • Write unit tests for all new functionality.

  • Run the relevant test(s) during development for fast feedback:

    cargo test -p <crate> <test_name>
    
  • When working with the train feature, include the feature flag:

    cargo test -p lindera-dictionary --features train
    

Commit Messages

Follow the Conventional Commits specification. Write commit messages in English.

Examples:

  • feat: add Korean dictionary support
  • fix: correct character category ID in trainer
  • docs: update installation instructions
  • refactor: split large training method into smaller functions

Documentation

  • If your change affects user-facing documentation, update the relevant files in docs/src/.

  • After editing Markdown files, verify there are no lint errors:

    markdownlint-cli2 "docs/src/**/*.md"
    
  • Rules are configured in .markdownlint.json at the repository root.

Dependencies

When adding new dependencies, verify license compatibility. Lindera is released under the MIT License, so dependencies must be compatible with it.

Feature Flags

Use #[cfg(feature = "train")] for conditional compilation of training-related code. See Feature Flags for a full list.
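
A minimal sketch of this pattern (the function names are illustrative):

/// Training entry point, only compiled with `--features train`.
#[cfg(feature = "train")]
pub fn train_from_corpus(path: &str) {
    // Training-only logic lives behind the feature gate.
    println!("training from {path}");
}

/// Stub that surfaces a clear error when the feature is disabled.
#[cfg(not(feature = "train"))]
pub fn train_from_corpus(_path: &str) {
    unimplemented!("rebuild with --features train");
}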

Reporting Issues

When reporting a bug, please include:

  • Lindera version (lindera --version or check Cargo.toml)
  • Rust version (rustc --version)
  • Operating system
  • Steps to reproduce the issue
  • Expected and actual behavior