Architecture

Lindera is organized as a Cargo workspace comprising multiple crates. Each crate has a focused responsibility, from low-level CRF computation to high-level CLI and language bindings.

Crate Dependency Graph

```mermaid
graph TB
    CRF["lindera-crf\n(CRF Engine)"]
    DICT["lindera-dictionary\n(Dictionary Base)"]
    IPADIC["lindera-ipadic"]
    UNIDIC["lindera-unidic"]
    KODIC["lindera-ko-dic"]
    CCCEDICT["lindera-cc-cedict"]
    JIEBA["lindera-jieba"]
    NEOLOGD["lindera-ipadic-neologd"]
    LIB["lindera\n(Core Library)"]
    CLI["lindera-cli\n(CLI)"]
    PY["lindera-python\n(Python)"]
    WASM["lindera-wasm\n(WebAssembly)"]

    CRF --> DICT
    DICT --> IPADIC
    DICT --> UNIDIC
    DICT --> KODIC
    DICT --> CCCEDICT
    DICT --> JIEBA
    DICT --> NEOLOGD
    DICT --> LIB
    IPADIC --> LIB
    UNIDIC --> LIB
    KODIC --> LIB
    CCCEDICT --> LIB
    JIEBA --> LIB
    NEOLOGD --> LIB
    LIB --> CLI
    LIB --> PY
    LIB --> WASM
```

Crate Overview

| Crate | Type | Description |
|-------|------|-------------|
| lindera-crf | Core | Pure Rust CRF (Conditional Random Field) implementation. Supports `no_std`. Uses rkyv for serialization. |
| lindera-dictionary | Core | Dictionary base library. Provides dictionary loading, building, and training (with the `train` feature). |
| lindera | Core | Main morphological analysis library. Integrates dictionaries, segmenter, character filters, and token filters. |
| lindera-cli | Application | Command-line interface for tokenization, dictionary building, and CRF training. |
| lindera-ipadic | Dictionary | Japanese dictionary based on IPADIC. |
| lindera-ipadic-neologd | Dictionary | Japanese dictionary based on IPADIC NEologd (includes neologisms). |
| lindera-unidic | Dictionary | Japanese dictionary based on UniDic. |
| lindera-ko-dic | Dictionary | Korean dictionary based on ko-dic. |
| lindera-cc-cedict | Dictionary | Chinese dictionary based on CC-CEDICT. |
| lindera-jieba | Dictionary | Chinese dictionary based on Jieba. |
| lindera-python | Binding | Python bindings via PyO3. |
| lindera-wasm | Binding | WebAssembly bindings via wasm-bindgen. |

Tokenization Pipeline

Lindera processes text through a multi-stage pipeline:

Input Text
  |
  v
Character Filters    -- Normalize characters (e.g., Unicode normalization, mapping)
  |
  v
Segmenter            -- Segment text into tokens using a dictionary and the Viterbi algorithm
  |
  v
Token Filters        -- Transform tokens (e.g., POS filtering, stop words, stemming)
  |
  v
Output Tokens

The Segmenter is the core component. It builds a lattice of candidate tokens from the dictionary, then applies the Viterbi algorithm to find the lowest-cost path, producing the most likely segmentation.
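The lattice-and-Viterbi idea can be illustrated with a self-contained sketch. This is not Lindera's actual API: a real segmenter also consults a connection-cost matrix between adjacent entries and handles unknown words, while this toy version scores each candidate token by a per-word cost only.

```rust
use std::collections::HashMap;

/// Toy lattice-based Viterbi segmentation: every dictionary word that
/// matches a substring becomes a lattice edge, and we pick the path of
/// minimum total cost from the start of the text to the end.
fn segment(text: &str, dict: &HashMap<&str, i64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (minimum cost to reach char position i, start of the last token)
    let mut best: Vec<Option<(i64, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));
    for end in 1..=n {
        for start in 0..end {
            let Some((cost_to_start, _)) = best[start] else { continue };
            let surface: String = chars[start..end].iter().collect();
            if let Some(&word_cost) = dict.get(surface.as_str()) {
                let total = cost_to_start + word_cost;
                if best[end].map_or(true, |(c, _)| total < c) {
                    best[end] = Some((total, start));
                }
            }
        }
    }
    // Backtrack along the recorded token starts to recover the best path.
    let mut tokens = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let (_, start) = best[pos].expect("no path through lattice");
        tokens.push(chars[start..pos].iter().collect());
        pos = start;
    }
    tokens.reverse();
    tokens
}

fn main() {
    // Hypothetical word costs, chosen so that "東京"+"都" (800) beats
    // "東"+"京都" (850) and "東"+"京"+"都" (1100).
    let dict = HashMap::from([
        ("東京", 500), ("都", 300), ("東", 400), ("京都", 450), ("京", 400),
    ]);
    println!("{:?}", segment("東京都", &dict)); // prints ["東京", "都"]
}
```

The dynamic program runs left to right over character positions, so each position stores only its single best predecessor; that is what makes the lowest-cost path recoverable by a simple backtrack.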

Feature Flags

| Feature | Description | Default |
|---------|-------------|---------|
| `mmap` | Memory-mapped file support for dictionary loading | Enabled |
| `train` | CRF-based dictionary training functionality (depends on lindera-crf) | CLI only |
| `embed-ipadic` | Embed the IPADIC dictionary into the binary | Disabled |
| `embed-cjk` | Embed IPADIC + ko-dic + Jieba dictionaries | Disabled |
| `embed-cjk2` | Embed UniDic + ko-dic + Jieba dictionaries | Disabled |
| `embed-cjk3` | Embed IPADIC NEologd + ko-dic + Jieba dictionaries | Disabled |
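Features are enabled from a downstream crate's Cargo.toml. A minimal sketch (the version number is illustrative; check crates.io for the current release and which features it exposes):

```toml
[dependencies]
# Embedding IPADIC means the binary needs no external dictionary files.
lindera = { version = "0.38", features = ["embed-ipadic"] }
```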

Learn More