Lindera Python
Lindera Python provides Python bindings for the Lindera morphological analysis engine, built with PyO3. It brings Lindera's high-performance tokenization capabilities to the Python ecosystem with support for Python 3.10 and later.
Features
- Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
- Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
- CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires the train feature)
- Multiple tokenization modes: Normal and decompose modes for different analysis granularity
- N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
- User dictionaries: Extend system dictionaries with custom vocabulary
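To make the text processing pipeline idea concrete, here is a minimal conceptual sketch in plain Python. It does not use or mirror Lindera's API: every name in it (`run_pipeline`, `nfkc_normalize`, `whitespace_segment`, `lowercase`, `drop_stop_words`) is hypothetical, and a whitespace splitter stands in for the morphological analyzer. It only shows the order of operations: character filters rewrite the raw text, the segmenter produces tokens, and token filters rewrite the token list.

```python
import unicodedata
from typing import Callable

# Hypothetical type aliases for this sketch; not part of Lindera.
CharFilter = Callable[[str], str]
TokenFilter = Callable[[list[str]], list[str]]
Segmenter = Callable[[str], list[str]]


def nfkc_normalize(text: str) -> str:
    """Character filter: apply Unicode NFKC normalization to the raw text."""
    return unicodedata.normalize("NFKC", text)


def whitespace_segment(text: str) -> list[str]:
    """Stand-in segmenter: split on whitespace (a real analyzer uses a dictionary)."""
    return text.split()


def lowercase(tokens: list[str]) -> list[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def drop_stop_words(stop_words: set[str]) -> TokenFilter:
    """Build a token filter that removes the given stop words."""
    def apply(tokens: list[str]) -> list[str]:
        return [token for token in tokens if token not in stop_words]
    return apply


def run_pipeline(
    text: str,
    char_filters: list[CharFilter],
    segment: Segmenter,
    token_filters: list[TokenFilter],
) -> list[str]:
    """Run character filters, then segmentation, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = segment(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    result = run_pipeline(
        "Ｌｉｎｄｅｒａ is a ＴＯＫＥＮＩＺＥＲ",
        char_filters=[nfkc_normalize],
        segment=whitespace_segment,
        token_filters=[lowercase, drop_stop_words({"is", "a"})],
    )
    print(result)
```

Filters are kept as plain callables so they can be reordered or swapped independently. For the actual filter classes and configuration, see the Text Processing Pipeline page.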
Documentation
- Installation -- Prerequisites, build instructions, and feature flags
- Quick Start -- A minimal example to get started
- Tokenizer API -- TokenizerBuilder, Tokenizer, and Token class reference
- Dictionary Management -- Loading, building, and managing dictionaries
- Text Processing Pipeline -- Character filters and token filters
- Training -- Training custom CRF models and exporting dictionaries