Tokenizer API

TokenizerBuilder

TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

TokenizerBuilder()

Creates a new builder with default configuration.

from lindera import TokenizerBuilder

builder = TokenizerBuilder()

TokenizerBuilder().from_file(file_path)

Loads configuration from a JSON file and returns a new builder.

builder = TokenizerBuilder().from_file("config.json")

Configuration Methods

All setter methods return self for method chaining.

set_mode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units

builder.set_mode("normal")

set_dictionary(path)

Sets the system dictionary path or URI.

# Use an embedded dictionary
builder.set_dictionary("embedded://ipadic")

# Use an external dictionary
builder.set_dictionary("/path/to/dictionary")

set_user_dictionary(uri)

Sets the user dictionary URI.

builder.set_user_dictionary("/path/to/user_dictionary")

set_keep_whitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.set_keep_whitespace(True)

append_character_filter(kind, args=None)

Appends a character filter to the preprocessing pipeline.

builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})

append_token_filter(kind, args=None)

Appends a token filter to the postprocessing pipeline.

builder.append_token_filter("lowercase", {})

Build

build()

Builds and returns a Tokenizer with the configured settings.

tokenizer = builder.build()
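
Because every setter returns the builder, the whole configuration can be written as one chained expression. The following is a minimal sketch, not part of the library: the helper name build_default_tokenizer and the chosen settings are assumptions for illustration, and the builder is passed in so the function relies only on the methods documented above.

```python
def build_default_tokenizer(builder):
    """Configure a TokenizerBuilder-like object in one chain and build it.

    `builder` is expected to be a lindera TokenizerBuilder. The function
    name and the chosen settings are illustrative assumptions.
    """
    return (
        builder
        .set_mode("normal")                    # default mode, set explicitly
        .set_dictionary("embedded://ipadic")   # embedded dictionary URI
        .set_keep_whitespace(False)            # keep=False: no whitespace tokens
        .append_character_filter("unicode_normalize", {"kind": "nfkc"})
        .append_token_filter("lowercase", {})
        .build()
    )
```

With lindera installed, build_default_tokenizer(TokenizerBuilder()) would be the intended call.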

Tokenizer

Tokenizer performs morphological analysis on text.

Creating a Tokenizer

Tokenizer(dictionary, mode="normal", user_dictionary=None)

Creates a tokenizer directly from a loaded dictionary.

from lindera import Tokenizer, load_dictionary

dictionary = load_dictionary("embedded://ipadic")
tokenizer = Tokenizer(dictionary, mode="normal")

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns a list of Token objects.

tokens = tokenizer.tokenize("形態素解析")

Parameters:

Name  Type  Description
text  str   Text to tokenize

Returns: list[Token]
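
A common use of tokenize is space-separated segmentation. The sketch below is not a library function: wakati is a hypothetical helper that relies only on tokenize and the surface property of each returned Token.

```python
def wakati(tokenizer, text):
    """Return `text` split into token surfaces joined by single spaces.

    `tokenizer` is any object with a tokenize(text) method that returns
    tokens exposing a `surface` attribute, such as a lindera Tokenizer.
    """
    return " ".join(token.surface for token in tokenizer.tokenize(text))
```

Calling wakati(tokenizer, "形態素解析") with a real tokenizer returns the surfaces in order; the exact segmentation depends on the dictionary and mode.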

tokenize_nbest(text, n, unique=False, cost_threshold=None)

Returns the N-best tokenization results, each paired with its total path cost.

results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)
for tokens, cost in results:
    print(cost, [t.surface for t in tokens])

Parameters:

Name            Type         Description
text            str          Text to tokenize
n               int          Number of results to return
unique          bool         Deduplicate results (default: False)
cost_threshold  int or None  Maximum cost difference from the best path (default: None)

Returns: list[tuple[list[Token], int]]
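
To compare candidates, it can help to express each cost relative to the best one. The sketch below assumes that a lower total path cost means a better path, which the description of cost_threshold suggests but does not state outright. cost_margins is a hypothetical helper, not part of the library.

```python
def cost_margins(results):
    """Pair each candidate's surfaces with its cost difference from the best.

    `results` is a list of (tokens, cost) tuples as returned by
    tokenize_nbest. Assumes the lowest cost marks the best path.
    """
    if not results:
        return []
    best = min(cost for _, cost in results)
    return [
        ([token.surface for token in tokens], cost - best)
        for tokens, cost in results
    ]
```

A candidate with margin 0 is the best path under this assumption; the other margins are the quantities a cost_threshold would be compared against.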

Token

Token represents a single morphological token.

Properties

Property    Type               Description
surface     str                Surface form of the token
byte_start  int                Start byte position in the original text
byte_end    int                End byte position in the original text
position    int                Token position index
word_id     int                Dictionary word ID
is_unknown  bool               True if the word is not in the dictionary
details     list[str] or None  Morphological details (part of speech, reading, etc.)
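
Since byte_start and byte_end are byte positions, they cannot be used directly to slice a Python str, which is indexed by character. The sketch below converts them to character offsets, assuming the byte positions refer to the UTF-8 encoding of the text. char_span is a hypothetical helper.

```python
def char_span(text, token):
    """Convert a token's byte offsets into character offsets within `text`.

    Assumes byte_start and byte_end index into the UTF-8 encoding of
    `text` and fall on character boundaries.
    """
    data = text.encode("utf-8")
    start = len(data[:token.byte_start].decode("utf-8"))
    end = len(data[:token.byte_end].decode("utf-8"))
    return start, end
```

Under that assumption, text[start:end] recovers the span of the original text that the token covers.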

Token Methods

get_detail(index)

Returns the detail string at the specified index, or None if the index is out of range.

token = tokenizer.tokenize("東京")[0]
pos = token.get_detail(0)        # e.g., "名詞"
subpos = token.get_detail(1)     # e.g., "固有名詞"
reading = token.get_detail(7)    # e.g., "トウキョウ"

Parameters:

Name   Type  Description
index  int   Zero-based index into the details list

Returns: str or None
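
Because get_detail returns None for an out-of-range index, callers usually need a fallback. The sketch below falls back to the surface form when no reading is available. reading_or_surface is a hypothetical helper; the default index 7 is the IPADIC reading position, and treating "*" as an empty field is an assumption based on the IPADIC convention.

```python
def reading_or_surface(token, reading_index=7):
    """Return the token's reading, falling back to its surface form.

    reading_index=7 matches the IPADIC layout; other dictionaries place
    the reading elsewhere. "*" is treated as an empty field (assumption).
    """
    reading = token.get_detail(reading_index)
    if reading is None or reading == "*":
        return token.surface
    return reading
```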

The structure of details depends on the dictionary:

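  • IPADIC: [品詞 (part of speech), 品詞細分類1 (POS subcategory 1), 品詞細分類2 (POS subcategory 2), 品詞細分類3 (POS subcategory 3), 活用型 (conjugation type), 活用形 (conjugation form), 原形 (base form), 読み (reading), 発音 (pronunciation)]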
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
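
For IPADIC, positional access can be replaced with named fields. The sketch below maps a details list to a dict; the English key names are assumptions chosen for this example, not identifiers defined by the library, and the mapping applies only to the IPADIC layout listed above.

```python
# English names for the nine IPADIC detail positions (illustrative names).
IPADIC_FIELDS = (
    "pos", "pos_detail_1", "pos_detail_2", "pos_detail_3",
    "conjugation_type", "conjugation_form",
    "base_form", "reading", "pronunciation",
)


def ipadic_features(details):
    """Map an IPADIC details list to a dict keyed by English field names.

    Missing trailing fields are left out of the result, and a details
    value of None yields an empty dict.
    """
    return dict(zip(IPADIC_FIELDS, details or []))
```

A typical call would be ipadic_features(token.details), followed by a lookup such as features.get("reading").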