Segmenter
The Segmenter is the core component that performs morphological analysis. It uses the Viterbi algorithm to find the optimal segmentation of input text based on a dictionary and cost model.
Creating a Segmenter
A Segmenter requires three components:
- Mode - the tokenization strategy (
NormalorDecompose) - Dictionary - a system dictionary for morphological analysis
- UserDictionary (optional) - a supplementary dictionary for custom words
#![allow(unused)] fn main() { use lindera::dictionary::load_dictionary; use lindera::mode::Mode; use lindera::segmenter::Segmenter; let dictionary = load_dictionary("embedded://ipadic")?; let segmenter = Segmenter::new(Mode::Normal, dictionary, None); }
Tokenization Modes
Mode::Normal
Standard tokenization based on the dictionary entries. Words are segmented faithfully according to what is registered in the dictionary.
#![allow(unused)] fn main() { use lindera::mode::Mode; let mode = Mode::Normal; }
Mode::Decompose
Decomposes compound nouns into their constituent parts. This mode applies a configurable penalty to long compound words, encouraging the segmenter to split them into shorter components.
For example, with Mode::Normal, the compound word "関西国際空港" remains as a single token, while with Mode::Decompose, it is split into "関西", "国際", and "空港".
#![allow(unused)] fn main() { use lindera::mode::Mode; let mode = Mode::Decompose(Default::default()); }
Dictionary Loading
Lindera provides the load_dictionary function to load dictionaries from various sources.
Embedded Dictionaries
When built with the appropriate feature flag (e.g., embed-ipadic), dictionaries can be loaded directly from the binary:
#![allow(unused)] fn main() { use lindera::dictionary::load_dictionary; let dictionary = load_dictionary("embedded://ipadic")?; }
Available embedded dictionary URIs:
embedded://ipadic- IPADIC (Japanese)embedded://ipadic-neologd- IPADIC NEologd (Japanese)embedded://unidic- UniDic (Japanese)embedded://ko-dic- ko-dic (Korean)embedded://cc-cedict- CC-CEDICT (Chinese)embedded://jieba- Jieba (Chinese)
External Dictionaries
Pre-built dictionary directories can be loaded from the filesystem:
#![allow(unused)] fn main() { use lindera::dictionary::load_dictionary; let dictionary = load_dictionary("/path/to/dictionary")?; }
Using with Tokenizer
The Segmenter is typically used through the Tokenizer, which adds support for character filters and token filters:
use lindera::dictionary::load_dictionary; use lindera::mode::Mode; use lindera::segmenter::Segmenter; use lindera::tokenizer::Tokenizer; use lindera::LinderaResult; fn main() -> LinderaResult<()> { let dictionary = load_dictionary("embedded://ipadic")?; let segmenter = Segmenter::new(Mode::Normal, dictionary, None); let tokenizer = Tokenizer::new(segmenter); let text = "日本語の形態素解析を行うことができます。"; let tokens = tokenizer.tokenize(text)?; for token in tokens { let details = token.details().join(","); println!("{}\t{}", token.surface.as_ref(), details); } Ok(()) }