# Tokenizer API

## TokenizerBuilder

`TokenizerBuilder` configures and constructs a `Tokenizer` instance using the builder pattern.
### Constructors

#### TokenizerBuilder()

Creates a new builder with the default configuration.

```python
from lindera import TokenizerBuilder

builder = TokenizerBuilder()
```
#### TokenizerBuilder().from_file(file_path)

Loads configuration from a JSON file and returns a new builder.

```python
builder = TokenizerBuilder().from_file("config.json")
```
### Configuration Methods

All setter methods return `self`, so calls can be chained, as in the sketch below.
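A sketch of a chained configuration, using only the setters documented in this section:

```python
from lindera import TokenizerBuilder

# Each setter returns the builder, so the whole configuration
# reads as a single expression.
builder = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .set_mode("normal")
    .set_keep_whitespace(False)
)
```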
#### set_mode(mode)

Sets the tokenization mode:

- `"normal"`: Standard tokenization (default)
- `"decompose"`: Decomposes compound words into smaller units

```python
builder.set_mode("normal")
```
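The sketch below contrasts the two modes; the segmentations shown in the comments are illustrative and depend on the dictionary:

```python
from lindera import TokenizerBuilder

# A typical compound: 関西国際空港 ("Kansai International Airport").
text = "関西国際空港"

normal = TokenizerBuilder().set_dictionary("embedded://ipadic").set_mode("normal").build()
decomposed = TokenizerBuilder().set_dictionary("embedded://ipadic").set_mode("decompose").build()

print([t.surface for t in normal.tokenize(text)])      # e.g., ['関西国際空港']
print([t.surface for t in decomposed.tokenize(text)])  # e.g., ['関西', '国際', '空港']
```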
#### set_dictionary(path)

Sets the system dictionary path or URI.

```python
# Use an embedded dictionary
builder.set_dictionary("embedded://ipadic")

# Use an external dictionary
builder.set_dictionary("/path/to/dictionary")
```
#### set_user_dictionary(uri)

Sets the user dictionary URI.

```python
builder.set_user_dictionary("/path/to/user_dictionary")
```
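As a sketch, assuming the three-column simple CSV format (surface, part of speech, reading) that Lindera user dictionaries commonly use; check your version's documentation for the exact format:

```python
# Write a minimal user dictionary and register it with the builder.
# Assumed format: surface,part-of-speech,reading (one entry per line).
with open("user_dictionary.csv", "w", encoding="utf-8") as f:
    f.write("東京スカイツリー,カスタム名詞,トウキョウスカイツリー\n")

builder.set_user_dictionary("user_dictionary.csv")
```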
#### set_keep_whitespace(keep)

Controls whether whitespace tokens appear in the output.

```python
builder.set_keep_whitespace(True)
```
#### append_character_filter(kind, args=None)

Appends a character filter to the preprocessing pipeline.

```python
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
```
#### append_token_filter(kind, args=None)

Appends a token filter to the postprocessing pipeline.

```python
builder.append_token_filter("lowercase", {})
```
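Both append methods return `self`, so filters can be stacked: character filters preprocess the raw text before tokenization, and token filters postprocess the resulting tokens. A sketch combining the two filter kinds shown above:

```python
from lindera import TokenizerBuilder

# Normalize the raw text first (character filter), then lowercase
# each resulting token (token filter).
builder = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .append_character_filter("unicode_normalize", {"kind": "nfkc"})
    .append_token_filter("lowercase", {})
)
```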
### Build

#### build()

Builds and returns a Tokenizer with the configured settings.

```python
tokenizer = builder.build()
```
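Putting it together, a minimal end-to-end sketch using only the methods documented above (assumes the embedded ipadic dictionary is available in your installation):

```python
from lindera import TokenizerBuilder

# Configure, build, and tokenize in one pass.
tokenizer = (
    TokenizerBuilder()
    .set_dictionary("embedded://ipadic")
    .set_mode("normal")
    .build()
)

tokens = tokenizer.tokenize("形態素解析")
print([t.surface for t in tokens])  # e.g., ['形態素', '解析']
```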
## Tokenizer

`Tokenizer` performs morphological analysis on text.

### Creating a Tokenizer

#### Tokenizer(dictionary, mode="normal", user_dictionary=None)

Creates a tokenizer directly from a loaded dictionary.

```python
from lindera import Tokenizer, load_dictionary

dictionary = load_dictionary("embedded://ipadic")
tokenizer = Tokenizer(dictionary, mode="normal")
```
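A tokenizer constructed this way is used the same as one produced by TokenizerBuilder.build(); for example:

```python
tokens = tokenizer.tokenize("すもももももももものうち")
print([t.surface for t in tokens])
```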
### Tokenizer Methods

#### tokenize(text)

Tokenizes the input text and returns a list of Token objects.

```python
tokens = tokenizer.tokenize("形態素解析")
```

Parameters:

| Name | Type | Description |
|---|---|---|
| text | str | Text to tokenize |

Returns: `list[Token]`
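For instance, to inspect each token's surface form alongside its morphological details (the printed details vary by dictionary):

```python
for token in tokenizer.tokenize("形態素解析"):
    print(token.surface, token.details)
```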
#### tokenize_nbest(text, n, unique=False, cost_threshold=None)

Returns the N-best tokenization results, each paired with its total path cost.

```python
results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)
for tokens, cost in results:
    print(cost, [t.surface for t in tokens])
```

Parameters:

| Name | Type | Description |
|---|---|---|
| text | str | Text to tokenize |
| n | int | Number of results to return |
| unique | bool | Deduplicate results (default: False) |
| cost_threshold | int or None | Maximum cost difference from the best path (default: None) |

Returns: `list[tuple[list[Token], int]]`
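The optional parameters combine to prune the candidate list. A sketch (the threshold value here is arbitrary; useful values depend on the dictionary's cost scale):

```python
# Keep at most 5 distinct segmentations whose total cost is
# within 500 of the best path.
results = tokenizer.tokenize_nbest(
    "すもももももももものうち", n=5, unique=True, cost_threshold=500
)
```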
## Token

`Token` represents a single morphological token.

### Properties

| Property | Type | Description |
|---|---|---|
| surface | str | Surface form of the token |
| byte_start | int | Start byte position in the original text |
| byte_end | int | End byte position in the original text |
| position | int | Token position index |
| word_id | int | Dictionary word ID |
| is_unknown | bool | True if the word is not in the dictionary |
| details | list[str] or None | Morphological details (part of speech, reading, etc.) |
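Because byte_start and byte_end are byte offsets rather than character offsets, recovering a token's span from the original text goes through its UTF-8 encoding. A sketch (assumes no character filters have rewritten the text):

```python
text = "形態素解析"
encoded = text.encode("utf-8")

for token in tokenizer.tokenize(text):
    # Offsets index into the UTF-8 bytes, not into the Python str.
    span = encoded[token.byte_start:token.byte_end].decode("utf-8")
    print(token.surface, span)  # the two should match
```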
### Token Methods

#### get_detail(index)

Returns the detail string at the specified index, or None if the index is out of range.

```python
token = tokenizer.tokenize("東京")[0]
pos = token.get_detail(0)      # e.g., "名詞" (noun)
subpos = token.get_detail(1)   # e.g., "固有名詞" (proper noun)
reading = token.get_detail(7)  # e.g., "トウキョウ" (reading of 東京)
```

Parameters:

| Name | Type | Description |
|---|---|---|
| index | int | Zero-based index into the details list |

Returns: `str or None`
The structure of details depends on the dictionary:

- IPADIC: `[品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]` (part of speech, POS subcategories 1-3, conjugation type, conjugation form, base form, reading, pronunciation)
- UniDic: Detailed morphological features following the UniDic specification
- ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
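Since details can be None for some tokens and the layout varies by dictionary, get_detail is a convenient defensive accessor. A sketch, assuming the IPADIC layout above (reading at index 7) and that get_detail returns None when no details are present:

```python
def reading_of(token):
    # IPADIC stores the reading at index 7; fall back to the
    # surface form when no detail is available.
    return token.get_detail(7) or token.surface

for token in tokenizer.tokenize("東京タワー"):
    print(token.surface, reading_of(token))
```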