Project Structure

Lindera is organized as a Cargo workspace with multiple crates.

Directory Layout

lindera/
├── lindera-crf/            # CRF engine (pure Rust, no_std)
├── lindera-dictionary/     # Dictionary base library
├── lindera/                # Core morphological analysis library
├── lindera-cli/            # CLI tool
├── lindera-ipadic/         # IPADIC dictionary (Japanese)
├── lindera-ipadic-neologd/ # IPADIC NEologd dictionary (Japanese)
├── lindera-unidic/         # UniDic dictionary (Japanese)
├── lindera-ko-dic/         # ko-dic dictionary (Korean)
├── lindera-cc-cedict/      # CC-CEDICT dictionary (Chinese)
├── lindera-jieba/          # Jieba dictionary (Chinese)
├── lindera-python/         # Python bindings (PyO3)
├── lindera-wasm/           # WebAssembly bindings (wasm-bindgen)
├── resources/              # Test resources and sample data
├── docs/                   # Documentation (mdBook)
└── examples/               # Example code

Crate Descriptions

Core Crates

lindera-crf

Pure Rust implementation of Conditional Random Fields (CRF). Supports no_std environments. Uses rkyv for fast zero-copy serialization. This crate provides the statistical learning engine used in dictionary training.

lindera-dictionary

Base library for dictionary handling: loading, building, and querying dictionaries. With the train feature enabled, it also provides the CRF training pipeline for creating custom dictionaries.

Key modules under src/trainer/:

Module                Role
config.rs             Configuration management (seed dict, char.def, feature.def, rewrite.def)
corpus.rs             Training corpus processing
feature_extractor.rs  Feature template parsing and feature ID management
feature_rewriter.rs   MeCab-compatible feature rewriting (3-section format)
model.rs              Trained model storage, serialization, and dictionary output
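
To make the "3-section format" handled by feature_rewriter.rs more concrete, the sketch below splits a MeCab-style rewrite.def into its sections. It assumes the three sections are MeCab's [unigram rewrite], [left rewrite], and [right rewrite], each holding whitespace-separated pattern/replacement pairs. The function parse_rewrite_def, the Rule alias, and the SAMPLE text are hypothetical names made up for illustration; they are not taken from lindera-dictionary, and the sketch only parses the file layout without applying any rules.

```rust
use std::collections::BTreeMap;

/// A (pattern, replacement) pair read from one rule line. Hypothetical type.
type Rule = (String, String);

/// Illustrative sample in the assumed MeCab rewrite.def layout.
const SAMPLE: &str = "\
[unigram rewrite]
*,*,*  $1,$2,$3
[left rewrite]
*,*,*  $1,$2
[right rewrite]
*,*,*  $1,$2
";

/// Group rule lines under the most recent `[section]` header.
/// This is a sketch, not lindera-dictionary's actual parser.
fn parse_rewrite_def(src: &str) -> BTreeMap<String, Vec<Rule>> {
    let mut sections: BTreeMap<String, Vec<Rule>> = BTreeMap::new();
    let mut current: Option<String> = None;

    for line in src.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        // A header such as "[left rewrite]" starts a new section.
        if line.starts_with('[') && line.ends_with(']') {
            let name = line[1..line.len() - 1].trim().to_string();
            sections.entry(name.clone()).or_default();
            current = Some(name);
            continue;
        }
        // Rule lines before the first header are ignored.
        if let Some(name) = &current {
            let mut parts = line.split_whitespace();
            if let (Some(pattern), Some(replacement)) = (parts.next(), parts.next()) {
                if let Some(rules) = sections.get_mut(name) {
                    rules.push((pattern.to_string(), replacement.to_string()));
                }
            }
        }
    }
    sections
}

fn main() {
    for (name, rules) in parse_rewrite_def(SAMPLE) {
        println!("{name}: {} rule(s)", rules.len());
    }
}
```

Keeping the sections in a map keyed by header name means each of the three rule sets can be looked up independently, which is all this sketch aims to show.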

lindera

The main morphological analysis library. Integrates dictionary crates and provides the Tokenizer, Segmenter, character filters, and token filters.
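
How these pieces fit together can be sketched with a toy pipeline: character filters normalize the raw text, a segmenter splits it into tokens, and token filters post-process the result. That ordering is an assumption about the design, and every trait, struct, and function below (CharacterFilter, TokenFilter, Lowercase, StopWords, segment, Pipeline, build_pipeline) is a hypothetical stand-in written for illustration, not the lindera crate's API. The stand-in segmenter simply splits on whitespace instead of consulting a dictionary.

```rust
/// Hypothetical trait: rewrites raw text before segmentation.
trait CharacterFilter {
    fn apply(&self, text: &str) -> String;
}

/// Hypothetical trait: post-processes the token list after segmentation.
trait TokenFilter {
    fn apply(&self, tokens: Vec<String>) -> Vec<String>;
}

/// Example character filter: lowercases the input.
struct Lowercase;

impl CharacterFilter for Lowercase {
    fn apply(&self, text: &str) -> String {
        text.to_lowercase()
    }
}

/// Example token filter: drops tokens found in a stop-word list.
struct StopWords(Vec<&'static str>);

impl TokenFilter for StopWords {
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        tokens
            .into_iter()
            .filter(|t| !self.0.iter().any(|stop| *stop == t.as_str()))
            .collect()
    }
}

/// Stand-in segmenter: whitespace splitting, not dictionary-based analysis.
fn segment(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

/// Toy pipeline chaining the three stages in order.
struct Pipeline {
    char_filters: Vec<Box<dyn CharacterFilter>>,
    token_filters: Vec<Box<dyn TokenFilter>>,
}

impl Pipeline {
    fn tokenize(&self, text: &str) -> Vec<String> {
        // 1. Apply every character filter to the raw text.
        let filtered = self
            .char_filters
            .iter()
            .fold(text.to_string(), |acc, f| f.apply(&acc));
        // 2. Split the filtered text into tokens.
        let tokens = segment(&filtered);
        // 3. Apply every token filter to the token list.
        self.token_filters
            .iter()
            .fold(tokens, |acc, f| f.apply(acc))
    }
}

/// Build a pipeline with one filter of each kind.
fn build_pipeline() -> Pipeline {
    Pipeline {
        char_filters: vec![Box::new(Lowercase)],
        token_filters: vec![Box::new(StopWords(vec!["the"]))],
    }
}

fn main() {
    let pipeline = build_pipeline();
    println!("{:?}", pipeline.tokenize("The Quick fox"));
}
```

Using trait objects for both filter kinds lets a caller stack any number of filters without changing the pipeline itself.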

lindera-cli

Command-line interface for tokenization, dictionary training, export, and building. The train feature is enabled by default.

Dictionary Crates

Each dictionary crate contains pre-built dictionary data for a specific language and dictionary source.

Crate                   Language  Dictionary Source
lindera-ipadic          Japanese  IPADIC
lindera-ipadic-neologd  Japanese  IPADIC NEologd (extended vocabulary)
lindera-unidic          Japanese  UniDic
lindera-ko-dic          Korean    ko-dic
lindera-cc-cedict       Chinese   CC-CEDICT
lindera-jieba           Chinese   Jieba

Bindings

lindera-python

Python bindings built with PyO3. Exposes the Lindera tokenizer API to Python applications.

lindera-wasm

WebAssembly bindings built with wasm-bindgen. Enables tokenization in browsers and Node.js.

Other Directories

resources/

Test resources used by the test suite, including sample dictionaries, user dictionaries, and test corpora.

docs/

User-facing documentation built with mdBook. The table of contents is defined in docs/src/SUMMARY.md. A Japanese translation is available under docs/ja/.

examples/

Runnable example programs demonstrating common usage patterns. Run one with the following command, replacing <example_name> with the name of an example:

cargo run --features=embed-ipadic --example=<example_name>