Tokenizer API

TokenizerBuilder

Lindera::TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

Lindera::TokenizerBuilder.new

Creates a new builder with default configuration.

require 'lindera'

builder = Lindera::TokenizerBuilder.new

Lindera::TokenizerBuilder.new.from_file(file_path)

Loads configuration from a JSON file and returns a new builder.

builder = Lindera::TokenizerBuilder.new.from_file('config.json')
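The expected shape of the JSON file is defined by Lindera itself; the field names below (segmenter, mode, dictionary, character_filters, token_filters) are an assumption based on typical Lindera configurations, not a guaranteed schema, so check the version you are using before relying on them:

```json
{
  "segmenter": {
    "mode": "normal",
    "dictionary": { "kind": "ipadic" }
  },
  "character_filters": [
    { "kind": "unicode_normalize", "args": { "kind": "nfkc" } }
  ],
  "token_filters": [
    { "kind": "lowercase" }
  ]
}
```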

Configuration Methods

set_mode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units

builder.set_mode('normal')

set_dictionary(path)

Sets the system dictionary path or URI.

# Use an embedded dictionary
builder.set_dictionary('embedded://ipadic')

# Use an external dictionary
builder.set_dictionary('/path/to/dictionary')

set_user_dictionary(uri)

Sets the user dictionary URI.

builder.set_user_dictionary('/path/to/user_dictionary')

set_keep_whitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.set_keep_whitespace(true)

append_character_filter(kind, args)

Appends a character filter to the preprocessing pipeline. The args parameter is a hash with string keys.

builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })

append_token_filter(kind, args)

Appends a token filter to the postprocessing pipeline. The args parameter is a hash with string keys, or nil if the filter requires no arguments.

builder.append_token_filter('lowercase', nil)

Build

build

Builds and returns a Tokenizer with the configured settings.

tokenizer = builder.build

Tokenizer

Lindera::Tokenizer performs morphological analysis on text.

Creating a Tokenizer

Lindera::Tokenizer.new(dictionary, mode, user_dictionary)

Creates a tokenizer directly from a loaded dictionary.

require 'lindera'

dictionary = Lindera.load_dictionary('embedded://ipadic')
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', nil)

With a user dictionary:

dictionary = Lindera.load_dictionary('embedded://ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', user_dict)

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns an array of Token objects.

tokens = tokenizer.tokenize('形態素解析')

Parameters:

  • text (String) -- Text to tokenize

Returns: Array<Token>

tokenize_nbest(text, n, unique, cost_threshold)

Returns the N-best tokenization results, each paired with its total path cost.

results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)
results.each do |tokens, cost|
  puts "#{cost}: #{tokens.map(&:surface).inspect}"
end

Parameters:

  • text (String) -- Text to tokenize
  • n (Integer) -- Number of results to return
  • unique (Boolean or nil) -- Deduplicate results (default: false)
  • cost_threshold (Integer or nil) -- Maximum cost difference from the best path (default: nil)

Returns: Array<Array(Array<Token>, Integer)>

Token

Token represents a single morphological token.

Properties

  • surface (String) -- Surface form of the token
  • byte_start (Integer) -- Start byte position in the original text
  • byte_end (Integer) -- End byte position in the original text
  • position (Integer) -- Token position index
  • word_id (Integer) -- Dictionary word ID
  • is_unknown (Boolean) -- true if the word is not in the dictionary
  • details (Array<String> or nil) -- Morphological details (part of speech, reading, etc.)
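Note that byte_start and byte_end are byte offsets into the UTF-8 encoded input, not character indices, so a multibyte character advances them by more than one. A plain-Ruby illustration (no Lindera required):

```ruby
# Byte offsets vs. character indices in UTF-8 text.
text = '形態素解析'

text.length    # 5 characters
text.bytesize  # 15 bytes: each of these CJK characters is 3 bytes in UTF-8

# A token covering the first two characters ('形態') would therefore
# span byte_start = 0 to byte_end = 6, not 0 to 2.
'形態'.bytesize  # 6
```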

Token Methods

get_detail(index)

Returns the detail string at the specified index, or nil if the index is out of range.

token = tokenizer.tokenize('東京')[0]
pos = token.get_detail(0)        # e.g., "名詞"
subpos = token.get_detail(1)     # e.g., "固有名詞"
reading = token.get_detail(7)    # e.g., "トウキョウ"

Parameters:

  • index (Integer) -- Zero-based index into the details array

Returns: String or nil

The structure of details depends on the dictionary:

  • IPADIC: [品詞 (part of speech), 品詞細分類1 (POS subcategory 1), 品詞細分類2 (POS subcategory 2), 品詞細分類3 (POS subcategory 3), 活用型 (conjugation type), 活用形 (conjugation form), 原形 (base form), 読み (reading), 発音 (pronunciation)]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
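As a concrete illustration of the IPADIC layout, here is a hypothetical details array (the values are plausible examples, not guaranteed dictionary output); get_detail(i) then behaves like bounds-checked indexing into it:

```ruby
# Hypothetical IPADIC-style details array for 東京 (illustration only;
# real values come from the loaded dictionary).
details = ['名詞', '固有名詞', '地域', '一般', '*', '*', '東京', 'トウキョウ', 'トーキョー']

details[0]   # "名詞" (part of speech), same as token.get_detail(0)
details[7]   # "トウキョウ" (reading)
details[99]  # nil, mirroring get_detail's out-of-range behavior
```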