Tokenization

Lindera provides multiple tokenization modes and supports N-Best analysis for enumerating alternative segmentation candidates.

Tokenization modes

Normal mode

Normal mode performs standard tokenization based on dictionary entries. Compound words that exist as single entries in the dictionary are kept as-is.

Example -- tokenizing "関西国際空港限定トートバッグ" in Normal mode:

関西国際空港 | 限定 | トートバッグ

The compound noun "関西国際空港" (Kansai International Airport) is preserved as a single token because it exists as one entry in the dictionary.

Decompose mode

Decompose mode further breaks down compound nouns into their constituent parts, even when the compound exists as a dictionary entry.

Example -- tokenizing "関西国際空港限定トートバッグ" in Decompose mode:

関西 | 国際 | 空港 | 限定 | トートバッグ

The compound "関西国際空港" is decomposed into "関西", "国際", and "空港".

Selecting a mode

In Rust, specify the mode when creating a Segmenter:

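use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Normal mode
    let dictionary = load_dictionary("embedded://ipadic")?;
    let normal_segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // Decompose mode. The dictionary was moved into the first segmenter,
    // so it is loaded again here.
    let dictionary = load_dictionary("embedded://ipadic")?;
    let decompose_segmenter =
        Segmenter::new(Mode::Decompose(Default::default()), dictionary, None);

    Ok(())
}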

With the CLI, use the --mode flag:

echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic --mode normal
echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic --mode decompose

N-Best tokenization

N-Best tokenization enumerates the top N tokenization candidates ordered by total path cost (lower cost = better segmentation). This is useful when the best result is ambiguous, or when you want to explore alternative interpretations of the input text.

Algorithm

N-Best tokenization is based on the Forward-DP Backward-A* algorithm, which is compatible with MeCab's N-Best implementation. The forward pass computes optimal costs using dynamic programming, and the backward pass uses A* search to enumerate paths in order of increasing total cost.
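
The two passes can be sketched on a toy lattice. This is a minimal illustration, not Lindera's or MeCab's actual implementation: `Edge`, `forward_costs`, `n_best`, and `toy_edges` are hypothetical names, nodes stand for character boundaries numbered left to right, each edge is a candidate word with a made-up cost, and the connection costs between adjacent words that a real analyzer adds are ignored.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A candidate word in a toy lattice, spanning boundary `from` to boundary `to`.
#[derive(Clone, Copy)]
struct Edge {
    from: usize,
    to: usize,
    cost: i64,
}

/// Forward pass (dynamic programming): cheapest cost from node 0 to each node.
/// Assumes `num_nodes >= 1` and that every edge goes from a lower to a higher index.
fn forward_costs(num_nodes: usize, edges: &[Edge]) -> Vec<Option<i64>> {
    let mut best = vec![None; num_nodes];
    best[0] = Some(0);
    // Processing edges by ascending `from` guarantees that `best[from]` is final.
    let mut sorted = edges.to_vec();
    sorted.sort_by_key(|e| e.from);
    for e in sorted {
        if let Some(c) = best[e.from] {
            let cand = c + e.cost;
            if best[e.to].map_or(true, |b| cand < b) {
                best[e.to] = Some(cand);
            }
        }
    }
    best
}

/// Backward pass (A*): walk from the last node toward node 0, always expanding
/// the partial path with the lowest estimated total cost. The estimate is
/// (cost of the suffix built so far) + (exact forward cost to the current node),
/// so complete paths are emitted in order of non-decreasing total cost.
fn n_best(num_nodes: usize, edges: &[Edge], n: usize) -> Vec<(Vec<usize>, i64)> {
    let best = forward_costs(num_nodes, edges);
    let goal = num_nodes - 1;
    let mut results = Vec::new();
    // Min-heap of (estimated total, suffix cost, nodes from goal back to current).
    let mut heap = BinaryHeap::new();
    if let Some(h) = best[goal] {
        heap.push(Reverse((h, 0i64, vec![goal])));
    }
    while results.len() < n {
        let Some(Reverse((total, suffix_cost, path))) = heap.pop() else { break };
        let node = *path.last().unwrap();
        if node == 0 {
            // Reached the start: the estimate is now the exact path cost.
            let mut forward_path = path;
            forward_path.reverse();
            results.push((forward_path, total));
            continue;
        }
        for e in edges.iter().filter(|e| e.to == node) {
            if let Some(h) = best[e.from] {
                let g = suffix_cost + e.cost;
                let mut next = path.clone();
                next.push(e.from);
                heap.push(Reverse((g + h, g, next)));
            }
        }
    }
    results
}

/// Made-up lattice with four boundaries (0..=3) and six candidate words.
fn toy_edges() -> Vec<Edge> {
    vec![
        Edge { from: 0, to: 1, cost: 2 },
        Edge { from: 1, to: 2, cost: 2 },
        Edge { from: 2, to: 3, cost: 2 },
        Edge { from: 0, to: 2, cost: 3 },
        Edge { from: 1, to: 3, cost: 5 },
        Edge { from: 0, to: 3, cost: 10 },
    ]
}

fn main() {
    for (rank, (path, cost)) in n_best(4, &toy_edges(), 3).iter().enumerate() {
        println!("--- NBEST {} (cost={}) --- boundaries {:?}", rank + 1, cost, path);
    }
}
```

On this toy input the three cheapest segmentations come out with costs 5, 6, and 7, in that order. Because the forward costs are exact rather than estimates, the backward search never needs to revisit an emitted path, which is why candidates can be produced one at a time until N is reached.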

Parameters

The tokenize_nbest method accepts the following parameters:

Parameter	Type	Description
`text`	`&str`	The text to tokenize.
`n`	`usize`	Number of N-best results to return.
`unique`	`bool`	When `true`, deduplicates results that produce the same word boundary positions.
`cost_threshold`	`Option<i64>`	When `Some(threshold)`, only returns paths with cost within `best_cost + threshold`.
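
To make the interaction of `n`, `unique`, and `cost_threshold` concrete, the sketch below applies them to a hand-written list of candidates that is already sorted by cost. It does not call Lindera: `Candidate`, `select_candidates`, and `sample_candidates` are hypothetical, the costs are invented, and two details are assumptions rather than documented behavior (the threshold comparison is treated as inclusive, and `n` is counted after deduplication).

```rust
use std::collections::HashSet;

/// Hypothetical candidate: word boundary offsets plus the total path cost.
type Candidate = (Vec<usize>, i64);

/// Filters candidates that are already sorted by ascending cost.
/// Assumptions: `cost <= best_cost + threshold` is inclusive, and duplicates
/// removed by `unique` do not count toward `n`.
fn select_candidates(
    candidates: &[Candidate],
    n: usize,
    unique: bool,
    cost_threshold: Option<i64>,
) -> Vec<Candidate> {
    let mut out = Vec::new();
    let mut seen = HashSet::new();
    let best_cost = match candidates.first() {
        Some((_, c)) => *c,
        None => return out,
    };
    for (bounds, cost) in candidates {
        if out.len() == n {
            break;
        }
        if let Some(t) = cost_threshold {
            // Input is sorted, so every later candidate is also too expensive.
            if *cost > best_cost + t {
                break;
            }
        }
        // Skip candidates whose boundaries were already returned.
        if unique && !seen.insert(bounds.clone()) {
            continue;
        }
        out.push((bounds.clone(), *cost));
    }
    out
}

/// Invented data: the first two entries share boundaries, as happens when the
/// same segmentation is reached with different part-of-speech assignments.
fn sample_candidates() -> Vec<Candidate> {
    vec![
        (vec![0, 3, 4, 6], 100),
        (vec![0, 3, 4, 6], 110),
        (vec![0, 2, 4, 6], 130),
        (vec![0, 6], 400),
    ]
}

fn main() {
    let c = sample_candidates();
    println!("{:?}", select_candidates(&c, 3, false, None));
    println!("{:?}", select_candidates(&c, 3, true, None));
    println!("{:?}", select_candidates(&c, 10, false, Some(30)));
}
```

With `unique` set to `false` the two entries sharing boundaries both appear; with `unique` set to `true` only the cheaper one survives and the next distinct segmentation takes its place. A threshold of 30 drops the candidate with cost 400 regardless of `n`.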

Rust API example

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "すもももももももものうち";

    // Get top 3 tokenization results
    let results = tokenizer.tokenize_nbest(text, 3, false, None)?;

    for (rank, (tokens, cost)) in results.iter().enumerate() {
        println!("--- NBEST {} (cost={}) ---", rank + 1, cost);
        for token in tokens {
            let details = token.details().join(",");
            println!("{}\t{}", token.surface.as_ref(), details);
        }
    }

    Ok(())
}

Output:

--- NBEST 1 (cost=7546) ---
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
--- NBEST 2 (cost=7914) ---
...

CLI example

echo "すもももももももものうち" | lindera tokenize --dict embedded://ipadic -N 3

Lattice reuse

For repeated tokenization, you can reuse a Lattice to reduce memory allocations:

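use lindera_dictionary::viterbi::Lattice;

// `tokenizer` and `text` are the values built in the Rust API example above.
// Create the lattice once and pass it to every call so its buffers are reused.
let mut lattice = Lattice::default();
let results = tokenizer.tokenize_nbest_with_lattice(text, &mut lattice, 3, false, None)?;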
