Architecture

Lindera is organized as a Cargo workspace comprising multiple crates. Each crate has a focused responsibility, from low-level CRF computation to high-level CLI and language bindings.

Crate Dependency Graph

graph TB
    CRF["lindera-crf\n(CRF Engine)"]
    DICT["lindera-dictionary\n(Dictionary Base)"]
    IPADIC["lindera-ipadic"]
    UNIDIC["lindera-unidic"]
    KODIC["lindera-ko-dic"]
    CCCEDICT["lindera-cc-cedict"]
    JIEBA["lindera-jieba"]
    NEOLOGD["lindera-ipadic-neologd"]
    LIB["lindera\n(Core Library)"]
    CLI["lindera-cli\n(CLI)"]
    PY["lindera-python\n(Python)"]
    WASM["lindera-wasm\n(WebAssembly)"]

    CRF --> DICT
    DICT --> IPADIC
    DICT --> UNIDIC
    DICT --> KODIC
    DICT --> CCCEDICT
    DICT --> JIEBA
    DICT --> NEOLOGD
    DICT --> LIB
    IPADIC --> LIB
    UNIDIC --> LIB
    KODIC --> LIB
    CCCEDICT --> LIB
    JIEBA --> LIB
    NEOLOGD --> LIB
    LIB --> CLI
    LIB --> PY
    LIB --> WASM
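
Each arrow in the graph points from a dependency to the crate that depends on it, so any valid build order starts with `lindera-crf` and ends with the CLI and the bindings. The following is a minimal sketch, not part of Lindera: `EDGES` and `build_order` are hypothetical names introduced here. It transcribes the edges above and derives one such order with Kahn's topological sort. Cargo resolves this order itself; the snippet only makes the arrow direction concrete.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Edges transcribed from the graph: `(a, b)` means crate `b` depends on crate `a`.
/// (Hypothetical constant for illustration only.)
const EDGES: &[(&str, &str)] = &[
    ("lindera-crf", "lindera-dictionary"),
    ("lindera-dictionary", "lindera-ipadic"),
    ("lindera-dictionary", "lindera-unidic"),
    ("lindera-dictionary", "lindera-ko-dic"),
    ("lindera-dictionary", "lindera-cc-cedict"),
    ("lindera-dictionary", "lindera-jieba"),
    ("lindera-dictionary", "lindera-ipadic-neologd"),
    ("lindera-dictionary", "lindera"),
    ("lindera-ipadic", "lindera"),
    ("lindera-unidic", "lindera"),
    ("lindera-ko-dic", "lindera"),
    ("lindera-cc-cedict", "lindera"),
    ("lindera-jieba", "lindera"),
    ("lindera-ipadic-neologd", "lindera"),
    ("lindera", "lindera-cli"),
    ("lindera", "lindera-python"),
    ("lindera", "lindera-wasm"),
];

/// Kahn's algorithm: returns the crates ordered so that every crate
/// appears after all of the crates it depends on.
fn build_order<'a>(edges: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    // Number of not-yet-emitted dependencies per crate.
    let mut indegree: BTreeMap<&'a str, usize> = BTreeMap::new();
    // For each crate, the crates that depend on it.
    let mut dependents: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for &(dep, user) in edges {
        indegree.entry(dep).or_insert(0);
        *indegree.entry(user).or_insert(0) += 1;
        dependents.entry(dep).or_default().push(user);
    }

    // Start with crates that have no dependencies inside the workspace.
    let mut ready: VecDeque<&'a str> = VecDeque::new();
    for (name, degree) in &indegree {
        if *degree == 0 {
            ready.push_back(*name);
        }
    }

    let mut order = Vec::new();
    while let Some(name) = ready.pop_front() {
        order.push(name);
        if let Some(users) = dependents.get(name) {
            for &user in users {
                let degree = indegree.get_mut(user).expect("every crate has an entry");
                *degree -= 1;
                if *degree == 0 {
                    ready.push_back(user);
                }
            }
        }
    }
    order
}
```

Calling `build_order(EDGES)` yields all twelve crates, with `lindera-crf` first, every dictionary crate before `lindera`, and `lindera` before `lindera-cli`, `lindera-python`, and `lindera-wasm`.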

Crate Overview

| Crate | Type | Description |
| --- | --- | --- |
| `lindera-crf` | Core | Pure Rust CRF (Conditional Random Field) implementation. Supports `no_std`. Uses `rkyv` for serialization. |
| `lindera-dictionary` | Core | Dictionary base library. Provides dictionary loading, building, and training (with the `train` feature). |
| `lindera` | Core | Main morphological analysis library. Integrates dictionaries, segmenter, character filters, and token filters. |
| `lindera-cli` | Application | Command-line interface for tokenization, dictionary building, and CRF training. |
| `lindera-ipadic` | Dictionary | Japanese dictionary based on IPADIC. |
| `lindera-ipadic-neologd` | Dictionary | Japanese dictionary based on IPADIC NEologd (includes neologisms). |
| `lindera-unidic` | Dictionary | Japanese dictionary based on UniDic. |
| `lindera-ko-dic` | Dictionary | Korean dictionary based on ko-dic. |
| `lindera-cc-cedict` | Dictionary | Chinese dictionary based on CC-CEDICT. |
| `lindera-jieba` | Dictionary | Chinese dictionary based on Jieba. |
| `lindera-python` | Binding | Python bindings via PyO3. |
| `lindera-wasm` | Binding | WebAssembly bindings via wasm-bindgen. |

Tokenization Pipeline

Lindera processes text through a three-stage pipeline:

Input Text
  |
  v
Character Filters    -- Normalize characters (e.g., Unicode normalization, mapping)
  |
  v
Segmenter            -- Segment text into tokens using a dictionary and the Viterbi algorithm
  |
  v
Token Filters        -- Transform tokens (e.g., POS filtering, stop words, stemming)
  |
  v
Output Tokens
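
The three stages compose as plain function calls. The sketch below is a toy model of that flow and does not use Lindera's API: `Token`, `char_filter`, `segment`, `stop_word_filter`, and `analyze` are hypothetical names, and the segmenter is replaced by a whitespace split so that the example stays self-contained.

```rust
/// Hypothetical token type for this sketch (not Lindera's token type).
#[derive(Debug, Clone, PartialEq)]
struct Token {
    surface: String,
}

/// Stage 1, character filter: map full-width ASCII variants (U+FF01..U+FF5E)
/// to their half-width forms and the ideographic space to a plain space.
fn char_filter(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            '\u{3000}' => ' ',
            other => other,
        })
        .collect()
}

/// Stage 2, segmenter: a stand-in that splits on whitespace.
/// A dictionary-based segmenter would build a lattice and run Viterbi here.
fn segment(text: &str) -> Vec<Token> {
    text.split_whitespace()
        .map(|s| Token { surface: s.to_string() })
        .collect()
}

/// Stage 3, token filter: drop tokens whose surface is a stop word.
fn stop_word_filter(tokens: Vec<Token>, stop_words: &[&str]) -> Vec<Token> {
    tokens
        .into_iter()
        .filter(|t| !stop_words.contains(&t.surface.as_str()))
        .collect()
}

/// Runs the stages in pipeline order: character filter, segmenter, token filter.
fn analyze(text: &str, stop_words: &[&str]) -> Vec<Token> {
    let normalized = char_filter(text);
    let tokens = segment(&normalized);
    stop_word_filter(tokens, stop_words)
}
```

Because character filters run before segmentation, the segmenter only ever sees normalized text. Token filters run afterwards and can therefore rely on token boundaries already being fixed.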

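The Segmenter is the core component. It builds a lattice of every candidate token that the dictionary matches at each position in the input, then applies the Viterbi algorithm to find the lowest-cost path through that lattice. A path's cost combines the word cost of each token with the connection cost between adjacent tokens, so the lowest-cost path corresponds to the most likely segmentation.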
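
To make the lattice search concrete, here is a minimal sketch of a lowest-cost segmentation by dynamic programming. It is a simplification under stated assumptions, not Lindera's implementation: the function name `viterbi` and the toy dictionary are hypothetical, and only per-word costs are scored, whereas MeCab-style dictionaries also define connection costs between adjacent tokens.

```rust
use std::collections::HashMap;

/// Returns the lowest-cost segmentation of `text` and its total cost,
/// or `None` if the dictionary entries cannot cover the whole text.
/// Simplification: only word costs are summed; connection costs are omitted.
fn viterbi(text: &str, dict: &HashMap<&str, u32>) -> Option<(Vec<String>, u32)> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();

    // best[i] = (lowest cost of any path ending at position i,
    //            start position of the last token on that path)
    let mut best: Vec<Option<(u32, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));

    // Forward pass: every dictionary match chars[start..end] is a lattice edge.
    for end in 1..=n {
        for start in 0..end {
            let (cost_so_far, _) = match best[start] {
                Some(entry) => entry,
                None => continue, // position `start` is unreachable
            };
            let candidate: String = chars[start..end].iter().collect();
            if let Some(&word_cost) = dict.get(candidate.as_str()) {
                let total = cost_so_far + word_cost;
                if best[end].map_or(true, |(current, _)| total < current) {
                    best[end] = Some((total, start));
                }
            }
        }
    }

    // Backward pass: follow the stored start positions from the end of the text.
    let (total, _) = best[n]?;
    let mut tokens = Vec::new();
    let mut end = n;
    while end > 0 {
        let (_, start) = best[end]?;
        tokens.push(chars[start..end].iter().collect::<String>());
        end = start;
    }
    tokens.reverse();
    Some((tokens, total))
}
```

With a toy dictionary of `東京` (cost 3), `京都` (3), `東` (5), `京` (5), and `都` (2), the input `東京都` has three candidate paths: `東京` + `都` at cost 5, `東` + `京都` at cost 8, and `東` + `京` + `都` at cost 12. The function returns the first.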

Feature Flags

| Feature | Description | Default |
| --- | --- | --- |
| `mmap` | Memory-mapped file support for dictionary loading | Enabled |
| `train` | CRF-based dictionary training functionality (depends on `lindera-crf`) | CLI only |
| `embed-ipadic` | Embed the IPADIC dictionary into the binary | Disabled |
| `embed-cjk` | Embed IPADIC + ko-dic + Jieba dictionaries | Disabled |
| `embed-cjk2` | Embed UniDic + ko-dic + Jieba dictionaries | Disabled |
| `embed-cjk3` | Embed IPADIC NEologd + ko-dic + Jieba dictionaries | Disabled |

Learn More

Getting Started -- Installation and first steps
Core Concepts -- Dictionaries, tokenization, and filters
Lindera Library -- Configuration, segmenter, and API
Lindera CLI -- Command-line interface
Development Guide -- Build, test, and contribute