Tokenizer API

TokenizerBuilder

Lindera::TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

Lindera::TokenizerBuilder.new

Creates a new builder with default configuration.

require 'lindera'

builder = Lindera::TokenizerBuilder.new

Lindera::TokenizerBuilder.new.from_file(file_path)

Loads configuration from a JSON file and returns a new builder.

builder = Lindera::TokenizerBuilder.new.from_file('config.json')
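The expected shape of the JSON file is defined by Lindera itself; the field names below (segmenter, mode, dictionary, character_filters, token_filters) are an assumption based on typical Lindera configurations, not a guaranteed schema, so check the version you are using before relying on them:

```json
{
  "segmenter": {
    "mode": "normal",
    "dictionary": { "kind": "ipadic" }
  },
  "character_filters": [
    { "kind": "unicode_normalize", "args": { "kind": "nfkc" } }
  ],
  "token_filters": [
    { "kind": "lowercase" }
  ]
}
```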

Configuration Methods

set_mode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units

builder.set_mode('normal')

set_dictionary(path)

Sets the system dictionary path or URI.

# Use an embedded dictionary
builder.set_dictionary('embedded://ipadic')

# Use an external dictionary
builder.set_dictionary('/path/to/dictionary')

set_user_dictionary(uri)

Sets the user dictionary URI.

builder.set_user_dictionary('/path/to/user_dictionary')

set_keep_whitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.set_keep_whitespace(true)

append_character_filter(kind, args)

Appends a character filter to the preprocessing pipeline. The args parameter is a hash with string keys.

builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })

append_token_filter(kind, args)

Appends a token filter to the postprocessing pipeline. The args parameter is a hash with string keys, or nil if the filter requires no arguments.

builder.append_token_filter('lowercase', nil)

Build

build

Builds and returns a Tokenizer with the configured settings.

tokenizer = builder.build

Tokenizer

Lindera::Tokenizer performs morphological analysis on text.

Creating a Tokenizer

Lindera::Tokenizer.new(dictionary, mode, user_dictionary)

Creates a tokenizer directly from a loaded dictionary.

require 'lindera'

dictionary = Lindera.load_dictionary('embedded://ipadic')
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', nil)

With a user dictionary:

dictionary = Lindera.load_dictionary('embedded://ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', user_dict)

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns an array of Token objects.

tokens = tokenizer.tokenize('形態素解析')

Parameters:

  • text (String) -- Text to tokenize

Returns: Array<Token>

tokenize_nbest(text, n, unique, cost_threshold)

Returns the N-best tokenization results, each paired with its total path cost.

results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)
results.each do |tokens, cost|
  puts "#{cost}: #{tokens.map(&:surface).inspect}"
end

Parameters:

  • text (String) -- Text to tokenize
  • n (Integer) -- Number of results to return
  • unique (Boolean or nil) -- Deduplicate results (default: false)
  • cost_threshold (Integer or nil) -- Maximum cost difference from the best path (default: nil)

Returns: Array<Array(Array<Token>, Integer)>

Token

Token represents a single morphological token.

Properties

  • surface (String) -- Surface form of the token
  • byte_start (Integer) -- Start byte position in the original text
  • byte_end (Integer) -- End byte position in the original text
  • position (Integer) -- Token position index
  • word_id (Integer) -- Dictionary word ID
  • is_unknown (Boolean) -- true if the word is not in the dictionary
  • details (Array<String> or nil) -- Morphological details (part of speech, reading, etc.)
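Note that byte_start and byte_end are byte offsets into the UTF-8 encoded input, not character indices, so a multibyte character advances them by more than one. A plain-Ruby illustration (no Lindera required):

```ruby
# Byte offsets vs. character indices in UTF-8 text.
text = '形態素解析'

text.length    # 5 characters
text.bytesize  # 15 bytes: each of these CJK characters is 3 bytes in UTF-8

# A token covering the first two characters ('形態') would therefore
# span byte_start = 0 to byte_end = 6, not 0 to 2.
'形態'.bytesize  # 6
```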

Token Methods

get_detail(index)

Returns the detail string at the specified index, or nil if the index is out of range.

token = tokenizer.tokenize('東京')[0]
pos = token.get_detail(0)        # e.g., "名詞"
subpos = token.get_detail(1)     # e.g., "固有名詞"
reading = token.get_detail(7)    # e.g., "トウキョウ"

Parameters:

  • index (Integer) -- Zero-based index into the details array

Returns: String or nil

The structure of details depends on the dictionary:

  • IPADIC: [品詞 (part of speech), 品詞細分類1 (POS subcategory 1), 品詞細分類2 (POS subcategory 2), 品詞細分類3 (POS subcategory 3), 活用型 (conjugation type), 活用形 (conjugation form), 原形 (base form), 読み (reading), 発音 (pronunciation)]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
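As a concrete illustration of the IPADIC layout, here is a hypothetical details array (the values are plausible examples, not guaranteed dictionary output); get_detail(i) then behaves like bounds-checked indexing into it:

```ruby
# Hypothetical IPADIC-style details array for 東京 (illustration only;
# real values come from the loaded dictionary).
details = ['名詞', '固有名詞', '地域', '一般', '*', '*', '東京', 'トウキョウ', 'トーキョー']

details[0]   # "名詞" (part of speech), same as token.get_detail(0)
details[7]   # "トウキョウ" (reading)
details[99]  # nil, mirroring get_detail's out-of-range behavior
```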