Project Structure

Lindera is organized as a Cargo workspace with multiple crates.

Directory Layout

lindera/
├── lindera-crf/            # CRF engine (pure Rust, no_std)
├── lindera-dictionary/     # Dictionary base library
├── lindera-trainer/        # CRF-based dictionary training
├── lindera/                # Core morphological segmentation library
├── lindera-analysis/       # Analysis chain (character/token filters, tokenizer)
├── lindera-cli/            # CLI tool
├── lindera-binding-core/   # FFI-independent helpers shared by the language bindings
├── lindera-ipadic/         # IPADIC dictionary (Japanese)
├── lindera-ipadic-neologd/ # IPADIC NEologd dictionary (Japanese)
├── lindera-unidic/         # UniDic dictionary (Japanese)
├── lindera-ko-dic/         # ko-dic dictionary (Korean)
├── lindera-cc-cedict/      # CC-CEDICT dictionary (Chinese)
├── lindera-jieba/          # Jieba dictionary (Chinese)
├── lindera-python/         # Python bindings (PyO3)
├── lindera-nodejs/         # Node.js bindings (NAPI-RS)
├── lindera-ruby/           # Ruby bindings (Magnus + rb-sys)
├── lindera-php/            # PHP bindings (ext-php-rs)
├── lindera-wasm/           # WebAssembly bindings (wasm-bindgen)
├── resources/              # Test resources and sample data
├── docs/                   # Documentation (mdBook)
└── examples/               # Example code

`lindera-crf`

Pure Rust implementation of Conditional Random Fields (CRF). Supports no_std environments. Uses rkyv for fast zero-copy serialization. This crate provides the statistical learning engine used in dictionary training.

`lindera-dictionary`

Base library for dictionary handling: loading, building, and querying dictionaries.

`lindera-trainer`

CRF training pipeline for creating custom dictionaries. Builds on the runtime types from `lindera-dictionary` and on the `lindera-crf` engine. Exposed through the `train` feature of the `lindera` facade crate, which re-exports it as `lindera::dictionary::trainer`.

Module	Role
`config.rs`	Configuration management (seed dict, char.def, feature.def, rewrite.def)
`corpus.rs`	Training corpus processing
`feature_extractor.rs`	Feature template parsing and feature ID management
`feature_rewriter.rs`	MeCab-compatible feature rewriting (3-section format)
`model.rs`	Trained model storage, serialization, and dictionary output

`lindera`

The main morphological segmentation library. Integrates dictionary crates and provides the Segmenter API.

`lindera-analysis`

Lucene-style analysis chain on top of lindera: character filters, token filters, and the Tokenizer that composes them around a Segmenter.
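The ordering in such a chain (character filters first, then segmentation, then token filters) can be sketched as follows. This is a minimal illustration of the general pattern only: the traits `CharacterFilter`, `Segment`, and `TokenFilter`, the `Pipeline` struct, and the helper `build_demo_pipeline` are hypothetical names invented for this sketch and are not the `lindera-analysis` API. A whitespace splitter stands in for a real Segmenter.

```rust
// Hypothetical sketch: none of these names come from lindera-analysis.

/// Rewrites raw text before segmentation.
trait CharacterFilter {
    fn apply(&self, text: &str) -> String;
}

/// Splits text into surface strings (stands in for a Segmenter).
trait Segment {
    fn segment(&self, text: &str) -> Vec<String>;
}

/// Transforms the token stream after segmentation.
trait TokenFilter {
    fn apply(&self, tokens: Vec<String>) -> Vec<String>;
}

struct Lowercase;
impl CharacterFilter for Lowercase {
    fn apply(&self, text: &str) -> String {
        text.to_lowercase()
    }
}

struct WhitespaceSegmenter;
impl Segment for WhitespaceSegmenter {
    fn segment(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

struct StopWords(Vec<&'static str>);
impl TokenFilter for StopWords {
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        tokens
            .into_iter()
            .filter(|t| !self.0.iter().any(|s| *s == t.as_str()))
            .collect()
    }
}

/// Composes the three stages in a fixed order.
struct Pipeline {
    char_filters: Vec<Box<dyn CharacterFilter>>,
    segmenter: Box<dyn Segment>,
    token_filters: Vec<Box<dyn TokenFilter>>,
}

impl Pipeline {
    fn tokenize(&self, text: &str) -> Vec<String> {
        // 1. Character filters run on the raw text, in order.
        let filtered = self
            .char_filters
            .iter()
            .fold(text.to_owned(), |acc, f| f.apply(&acc));
        // 2. The segmenter splits the filtered text.
        let tokens = self.segmenter.segment(&filtered);
        // 3. Token filters run on the resulting tokens, in order.
        self.token_filters.iter().fold(tokens, |acc, f| f.apply(acc))
    }
}

fn build_demo_pipeline() -> Pipeline {
    Pipeline {
        char_filters: vec![Box::new(Lowercase)],
        segmenter: Box::new(WhitespaceSegmenter),
        token_filters: vec![Box::new(StopWords(vec!["the"]))],
    }
}

fn main() {
    let pipeline = build_demo_pipeline();
    // prints ["quick", "brown", "fox"]
    println!("{:?}", pipeline.tokenize("The Quick Brown Fox"));
}
```

Keeping each stage behind its own trait lets filters be added or reordered without touching the segmentation step, which is the main appeal of this style of chain.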

`lindera-cli`

Command-line interface for tokenizing text and for training, exporting, and building dictionaries. The `train` feature is enabled by default.

`lindera-binding-core`

FFI-independent helpers shared by all five language bindings (lindera-python, lindera-nodejs, lindera-ruby, lindera-php, lindera-wasm): a core tokenizer/schema/metadata layer that each binding wraps in its own language-native API.
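The "shared core, thin wrappers" layout described above might look roughly like the sketch below. All names here (`shared_core`, `CoreToken`, `tuple_binding`, `surface_binding`) are hypothetical and chosen for illustration; they are not taken from `lindera-binding-core`, and a whitespace splitter stands in for real tokenization. The point is only that the core uses plain Rust types, while each wrapper reshapes the result for its host language.

```rust
// Hypothetical sketch: none of these names come from lindera-binding-core.

/// FFI-independent layer: plain Rust types, no binding framework involved.
mod shared_core {
    #[derive(Debug, Clone, PartialEq)]
    pub struct CoreToken {
        pub surface: String,
        pub byte_start: usize,
        pub byte_end: usize,
    }

    /// Stand-in tokenizer: splits on whitespace and records byte offsets.
    pub fn tokenize(text: &str) -> Vec<CoreToken> {
        let mut tokens = Vec::new();
        let mut start: Option<usize> = None;
        for (i, ch) in text.char_indices() {
            match (ch.is_whitespace(), start) {
                // First non-whitespace character opens a token.
                (false, None) => start = Some(i),
                // Whitespace closes the currently open token.
                (true, Some(s)) => {
                    tokens.push(CoreToken {
                        surface: text[s..i].to_owned(),
                        byte_start: s,
                        byte_end: i,
                    });
                    start = None;
                }
                _ => {}
            }
        }
        // Flush a token that runs to the end of the input.
        if let Some(s) = start {
            tokens.push(CoreToken {
                surface: text[s..].to_owned(),
                byte_start: s,
                byte_end: text.len(),
            });
        }
        tokens
    }
}

/// One wrapper exposes tokens as (surface, start, end) tuples.
mod tuple_binding {
    pub fn tokenize(text: &str) -> Vec<(String, usize, usize)> {
        super::shared_core::tokenize(text)
            .into_iter()
            .map(|t| (t.surface, t.byte_start, t.byte_end))
            .collect()
    }
}

/// Another wrapper exposes only the surface strings.
mod surface_binding {
    pub fn tokenize(text: &str) -> Vec<String> {
        super::shared_core::tokenize(text)
            .into_iter()
            .map(|t| t.surface)
            .collect()
    }
}

fn main() {
    println!("{:?}", tuple_binding::tokenize("tokyo tower"));
    println!("{:?}", surface_binding::tokenize("tokyo tower"));
}
```

Because both wrappers delegate to the same core function, a fix in the shared layer reaches every binding without per-language changes.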

Dictionary Crates

Each dictionary crate contains pre-built dictionary data for a specific language and dictionary source.

Crate	Language	Dictionary Source
`lindera-ipadic`	Japanese	IPADIC
`lindera-ipadic-neologd`	Japanese	IPADIC NEologd (extended vocabulary)
`lindera-unidic`	Japanese	UniDic
`lindera-ko-dic`	Korean	ko-dic
`lindera-cc-cedict`	Chinese	CC-CEDICT
`lindera-jieba`	Chinese	Jieba

Bindings

`lindera-python`

Python bindings built with PyO3. Exposes the Lindera tokenizer API to Python applications.

`lindera-nodejs`

Node.js bindings built with NAPI-RS. Exposes the Lindera tokenizer API to Node.js applications.

`lindera-ruby`

Ruby bindings built with Magnus and rb-sys. Exposes the Lindera tokenizer API as a Ruby gem.

`lindera-php`

PHP bindings built with ext-php-rs. Exposes the Lindera tokenizer API as a PHP extension.

`lindera-wasm`

WebAssembly bindings built with wasm-bindgen. Enables tokenization in browsers and Node.js.

Other Directories

`resources/`

Test resources including sample dictionaries, user dictionaries, and test corpora used by the test suite.

`docs/`

User-facing documentation built with mdBook. The table of contents is defined in docs/src/SUMMARY.md. A Japanese translation is available under docs/ja/.

`examples/`

Runnable example programs demonstrating common usage patterns. Run with:

cargo run --features=embed-ipadic --example=<example_name>
