Quick Start
This guide shows how to tokenize text using lindera-ruby.
Basic Tokenization
The recommended way to create a tokenizer is through Lindera::TokenizerBuilder:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
tokens = tokenizer.tokenize('関西国際空港限定トートバッグ')
tokens.each do |token|
  puts "#{token.surface}\t#{token.details.join(',')}"
end
Note: Download a pre-built dictionary from GitHub Releases, extract it, and pass the path of the extracted directory to set_dictionary (shown above as /path/to/ipadic).
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
Sequential Configuration
TokenizerBuilder is configured through sequential method calls:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
tokens = tokenizer.tokenize('すもももももももものうち')
tokens.each do |token|
  puts "#{token.surface}\t#{token.get_detail(0)}"
end
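With the IPADIC dictionary, the first detail field holds the part of speech, as the expected output earlier in this guide shows. The sketch below counts tokens per part of speech. PosToken and count_by_pos are hypothetical names, and the token list is hand-written sample data rather than real tokenizer output, so the snippet runs without a dictionary.

```ruby
# Hypothetical stand-in for a Lindera token; it mimics only `surface`
# and `get_detail`, which the example above uses.
PosToken = Struct.new(:surface, :details) do
  def get_detail(index)
    details[index]
  end
end

# Count how many tokens fall under each value of the first detail field.
def count_by_pos(tokens)
  tokens.map { |token| token.get_detail(0) }.tally
end

# Hand-written sample data (not produced by the tokenizer).
sample = [
  PosToken.new('すもも', ['名詞']),
  PosToken.new('も', ['助詞']),
  PosToken.new('もも', ['名詞']),
  PosToken.new('も', ['助詞']),
  PosToken.new('もも', ['名詞']),
  PosToken.new('の', ['助詞']),
  PosToken.new('うち', ['名詞'])
]

count_by_pos(sample).each do |pos, count|
  puts "#{pos}\t#{count}"
end
```

With real tokens, the same helper should work unchanged as long as get_detail(0) returns a string.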
Accessing Token Properties
Each token exposes the following properties:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
tokens = tokenizer.tokenize('東京タワー')
tokens.each do |token|
  puts "Surface: #{token.surface}"
  puts "Byte range: #{token.byte_start}..#{token.byte_end}"
  puts "Position: #{token.position}"
  puts "Word ID: #{token.word_id}"
  puts "Unknown: #{token.is_unknown}"
  puts "Details: #{token.details}"
  puts
end
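The byte range is easiest to interpret against the original string. The sketch below assumes that byte_start and byte_end are offsets into the UTF-8 encoded input and that byte_end is exclusive; this guide does not state either point, so verify it against your own output. RangeToken and surface_from_range are hypothetical names, and the offsets are hand-written so the snippet runs without a dictionary.

```ruby
# Hypothetical stand-in for a Lindera token; only the fields used here.
RangeToken = Struct.new(:surface, :byte_start, :byte_end)

# Recover a token's text from the original input using its byte range.
# Assumes the offsets count bytes (not characters) and byte_end is exclusive.
def surface_from_range(text, token)
  text.byteslice(token.byte_start, token.byte_end - token.byte_start)
end

text = '東京タワー'
# Hand-written offsets: each of these characters is 3 bytes in UTF-8.
stub_tokens = [
  RangeToken.new('東京', 0, 6),
  RangeToken.new('タワー', 6, 15)
]

stub_tokens.each do |token|
  puts "#{token.surface}\t#{surface_from_range(text, token)}"
end
```

Indexing with text[byte_start...byte_end] would count characters instead of bytes and return the wrong slice for multi-byte text, which is why the sketch uses byteslice.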
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)
results.each do |tokens, cost|
  surfaces = tokens.map(&:surface)
  puts "Cost #{cost}: #{surfaces.join(' / ')}"
end
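Because each result is destructured into a tokens list and a cost, ordinary Enumerable methods can select among candidates. The sketch below picks the lowest-cost candidate, assuming that a lower cost means a more likely segmentation, as is conventional for Viterbi-based analyzers. CandidateToken and best_candidate are hypothetical names, and the candidate list and costs are made-up placeholder data, not real tokenizer output.

```ruby
# Hypothetical stand-in for a Lindera token; only `surface` is used here.
CandidateToken = Struct.new(:surface)

# Return the [tokens, cost] pair with the lowest cost, or nil for an empty list.
def best_candidate(results)
  results.min_by { |_tokens, cost| cost }
end

# Placeholder data shaped like the [tokens, cost] pairs iterated above.
results = [
  [%w[すもも も もも もも も の うち].map { |s| CandidateToken.new(s) }, 120],
  [%w[すもも も もも も もも の うち].map { |s| CandidateToken.new(s) }, 100]
]

tokens, cost = best_candidate(results)
puts "Best (cost #{cost}): #{tokens.map(&:surface).join(' / ')}"
```

If the results are already sorted by ascending cost, results.first gives the same candidate; min_by simply avoids relying on that ordering.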