Quick Start
This guide shows how to tokenize text using lindera-python.
Basic Tokenization
The recommended way to create a tokenizer is through TokenizerBuilder:
from lindera import TokenizerBuilder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")
tokenizer = builder.build()
tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")
for token in tokens:
print(f"{token.surface}\t{','.join(token.details)}")
Note: Download a pre-built dictionary from GitHub Releases and pass the path of the extracted directory to set_dictionary().
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
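Tokens that have no dictionary entry, such as トートバッグ above, are reported with UNK in place of details. As a minimal sketch, reusing the tokenizer from the example above, you can collect them with the is_unknown property described later in this guide:
unknown_words = [
    token.surface
    for token in tokenizer.tokenize("関西国際空港限定トートバッグ")
    if token.is_unknown  # True for tokens with no dictionary entry
]
print(unknown_words)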
Method Chaining
TokenizerBuilder supports method chaining for concise configuration:
from lindera import TokenizerBuilder
tokenizer = (
    TokenizerBuilder()
    .set_mode("normal")
    .set_dictionary("/path/to/ipadic")
    .build()
)
tokens = tokenizer.tokenize("すもももももももものうち")
for token in tokens:
print(f"{token.surface}\t{token.get_detail(0)}")
Accessing Token Properties
Each token exposes the following properties:
from lindera import TokenizerBuilder
tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()
tokens = tokenizer.tokenize("東京タワー")
for token in tokens:
print(f"Surface: {token.surface}")
print(f"Byte range: {token.byte_start}..{token.byte_end}")
print(f"Position: {token.position}")
print(f"Word ID: {token.word_id}")
print(f"Unknown: {token.is_unknown}")
print(f"Details: {token.details}")
print()
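byte_start and byte_end are offsets into the input string; assuming they index its UTF-8 encoding, the surface form can be recovered by slicing the encoded bytes. A minimal sketch, continuing the example above:
text = "東京タワー"
encoded = text.encode("utf-8")
for token in tokenizer.tokenize(text):
    # Slice the UTF-8 bytes with the token's byte range, then decode back to a string.
    piece = encoded[token.byte_start:token.byte_end].decode("utf-8")
    assert piece == token.surface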
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
from lindera import TokenizerBuilder
tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()
results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)
for tokens, cost in results:
    surfaces = [t.surface for t in tokens]
    print(f"Cost {cost}: {' / '.join(surfaces)}")