Character Filters

Character filters are pre-processing steps applied to the input text before tokenization. They normalize or transform characters to improve tokenization quality and consistency.

Available character filters

unicode_normalize

Applies Unicode normalization to the input text. This is useful for normalizing full-width characters to half-width, or for canonicalizing equivalent Unicode representations.

Supported normalization forms:

| Form | Description |
|------|-------------|
| NFKC | Compatibility decomposition followed by canonical composition. Converts full-width alphanumeric characters to half-width and normalizes Katakana variants. |
| NFC  | Canonical decomposition followed by canonical composition. |
| NFD  | Canonical decomposition. |
| NFKD | Compatibility decomposition. |
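
For illustration, the effect of these forms can be reproduced directly with the unicode-normalization crate. This is only a sketch of what the normalization forms do, not Lindera's own API, and it assumes that crate is available as a dependency:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let text = "Ｌｉｎｄｅｒａ１２３ ｶﾞｷﾞｸﾞ";

    // NFKC: full-width alphanumerics become half-width, and half-width
    // Katakana followed by a voiced sound mark is composed ("Lindera123 ガギグ").
    let nfkc: String = text.nfkc().collect();

    // NFC: canonical composition only; compatibility characters such as
    // full-width letters and half-width Katakana are left unchanged.
    let nfc: String = text.nfc().collect();

    println!("NFKC: {}", nfkc);
    println!("NFC:  {}", nfc);
}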

japanese_iteration_mark

Normalizes Japanese iteration marks into their expanded forms. Iteration marks are special characters that indicate the repetition of the preceding character.

| Mark | Name | Example |
|------|------|---------|
| 々 | Kanji iteration mark | 人々 (hitobito) |
| ゝ / ゞ | Hiragana iteration marks | いすゞ (isuzu) |
| ヽ / ヾ | Katakana iteration marks | バナナヽ |

The filter accepts two boolean parameters: normalize_kanji, which controls whether Kanji iteration marks are expanded, and normalize_kana, which controls whether Hiragana and Katakana iteration marks are expanded.
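
For example, with both normalize_kanji and normalize_kana enabled, the marks are expanded roughly as follows (illustrative inputs):

時々 → 時時
こゝろ → こころ
いすゞ → いすず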

mapping

Performs character-level string replacement based on a user-defined mapping table. This can be used for custom normalization rules.

For example, a mapping rule can replace every occurrence of "リンデラ" with "Lindera".
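
A minimal configuration sketch for this filter is shown below; the first entry comes from the example above, while the second one (folding the variant kanji 﨑 to 崎) is only an illustrative assumption:

- kind: mapping
  args:
    mapping:
      リンデラ: Lindera
      﨑: 崎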

YAML configuration example

When using Lindera with a YAML configuration file, character filters can be specified in the character_filters section:

segmenter:
  mode: normal
  dictionary: "embedded://ipadic"

character_filters:
  - kind: unicode_normalize
    args:
      kind: nfkc
  - kind: japanese_iteration_mark
    args:
      normalize_kanji: true
      normalize_kana: true
  - kind: mapping
    args:
      mapping:
        リンデラ: Lindera
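
The character filters are applied to the input text in the order they are listed, before the text reaches the segmenter. For illustration, assuming the configuration above, an input string would be rewritten step by step like this:

input:
  Ｌｉｎｄｅｒａはリンデラとも呼ばれ、人々に使われています。
after unicode_normalize (nfkc):
  Linderaはリンデラとも呼ばれ、人々に使われています。
after japanese_iteration_mark:
  Linderaはリンデラとも呼ばれ、人人に使われています。
after mapping:
  LinderaはLinderaとも呼ばれ、人人に使われています。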

Rust API example

Character filters can be created and appended to a Tokenizer programmatically:

use lindera::character_filter::BoxCharacterFilter;
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::character_filter::japanese_iteration_mark::JapaneseIterationMarkCharacterFilter;
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // Create character filters.
    let unicode_normalize_char_filter =
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC);

    let japanese_iteration_mark_char_filter =
        JapaneseIterationMarkCharacterFilter::new(true, true);

    // Create a tokenizer and append character filters.
    let mut tokenizer = Tokenizer::new(segmenter);

    tokenizer
        .append_character_filter(BoxCharacterFilter::from(unicode_normalize_char_filter))
        .append_character_filter(BoxCharacterFilter::from(
            japanese_iteration_mark_char_filter,
        ));

    // Tokenize text; the NFKC character filter normalizes the full-width
    // "Ｌｉｎｄｅｒａ" to "Lindera" before segmentation.
    let text = "Ｌｉｎｄｅｒａは形態素解析エンジンです。";
    let tokens = tokenizer.tokenize(text)?;

    for token in tokens {
        println!(
            "token: {:?}, details: {:?}",
            token.surface, token.details
        );
    }

    Ok(())
}

Output (with NFKC normalization applied):

token: "Lindera", details: Some(["名詞", "固有名詞", "組織", "*", "*", "*", "*", "*", "*"])
token: "は", details: Some(["助詞", "係助詞", "*", "*", "*", "*", "は", "ハ", "ワ"])
token: "形態素", details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: "解析", details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: "エンジン", details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: "です", details: Some(["助動詞", "*", "*", "*", "特殊・デス", "基本形", "です", "デス", "デス"])
token: "。", details: Some(["記号", "句点", "*", "*", "*", "*", "。", "。", "。"])