Tokenizer API

TokenizerBuilder

Lindera\TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

`new Lindera\TokenizerBuilder()`

Creates a new builder with default configuration.

<?php

$builder = new Lindera\TokenizerBuilder();

`$builder->fromFile($filePath)`

Loads configuration from a JSON file.

<?php

$builder = new Lindera\TokenizerBuilder();
$builder->fromFile('config.json');

Configuration Methods

All setter methods return $this for method chaining.

`setMode($mode)`

Sets the tokenization mode.

"normal" -- Standard tokenization (default)
"decompose" -- Decomposes compound words into smaller units

<?php

$builder->setMode('normal');

`setDictionary($path)`

Sets the system dictionary path or URI.

<?php

// Use an embedded dictionary
$builder->setDictionary('embedded://ipadic');

// Use an external dictionary
$builder->setDictionary('/path/to/dictionary');

`setUserDictionary($uri)`

Sets the user dictionary URI.

<?php

$builder->setUserDictionary('/path/to/user_dictionary');

`setKeepWhitespace($keep)`

Controls whether whitespace tokens appear in the output.

<?php

$builder->setKeepWhitespace(true);

`appendCharacterFilter($kind, $args)`

Appends a character filter to the preprocessing pipeline.

<?php

$builder->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc']);

`appendTokenFilter($kind, $args)`

Appends a token filter to the postprocessing pipeline.

<?php

$builder->appendTokenFilter('lowercase');

Build

`build()`

Builds and returns a Tokenizer with the configured settings.

<?php

$tokenizer = $builder->build();
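
The builder calls documented in this section can be combined into one configuration routine. The following is a minimal sketch, assuming the Lindera PHP extension is loaded and the `embedded://ipadic` dictionary is available. The function name `buildNormalizingTokenizer` is hypothetical. Only the `set*` methods are chained; the `append*` calls are written as separate statements, so the sketch does not depend on what they return. The usage part is guarded and does nothing when the extension is absent.

```php
<?php

// Hypothetical helper: wires up a builder using only methods documented on this page.
function buildNormalizingTokenizer(string $dictionaryUri = 'embedded://ipadic')
{
    // Setter methods return $this, so they can be chained.
    $builder = (new Lindera\TokenizerBuilder())
        ->setMode('normal')
        ->setDictionary($dictionaryUri)
        ->setKeepWhitespace(false);

    // Character filters preprocess the text; token filters postprocess the tokens.
    $builder->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc']);
    $builder->appendTokenFilter('lowercase');

    return $builder->build();
}

// Only run when the extension is actually available.
if (class_exists(Lindera\TokenizerBuilder::class)) {
    $tokenizer = buildNormalizingTokenizer();
    foreach ($tokenizer->tokenize('Ｌｉｎｄｅｒａで形態素解析') as $token) {
        echo $token->surface, "\n";
    }
}
```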

Tokenizer

Lindera\Tokenizer performs morphological analysis on text.

Creating a Tokenizer

`new Lindera\Tokenizer($dictionary, $mode, $userDictionary)`

Creates a tokenizer directly from a loaded dictionary. The `$userDictionary` argument is optional.

<?php

$dictionary = Lindera\Dictionary::load('embedded://ipadic');
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal');

With a user dictionary:

<?php

$dictionary = Lindera\Dictionary::load('embedded://ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal', $userDict);

Tokenizer Methods

`tokenize($text)`

Tokenizes the input text and returns an array of Token objects.

<?php

$tokens = $tokenizer->tokenize('形態素解析');

Parameters:

Name	Type	Description
`$text`	`string`	Text to tokenize

Returns: array<Token>
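
Calling code often needs only the surface strings from the returned array, sometimes without unknown words. The following sketch relies only on the `$surface` and `$is_unknown` properties listed in the Token section of this page. The helper name `surfaces` is hypothetical. The token objects are hand-written stand-ins, not tokenizer output, so the snippet runs without the extension.

```php
<?php

// Hypothetical helper: collect surfaces, optionally skipping unknown words.
function surfaces(iterable $tokens, bool $skipUnknown = false): array
{
    $out = [];
    foreach ($tokens as $token) {
        if ($skipUnknown && $token->is_unknown) {
            continue;
        }
        $out[] = $token->surface;
    }
    return $out;
}

// Stand-in objects shaped like Lindera\Token (hand-written, not tokenizer output).
$tokens = [
    (object) ['surface' => '形態素', 'is_unknown' => false],
    (object) ['surface' => 'xyzzy', 'is_unknown' => true],
];

echo implode(' | ', surfaces($tokens)), "\n";       // both tokens
echo implode(' | ', surfaces($tokens, true)), "\n"; // unknown token skipped
```

With a real tokenizer, the same helper would be called as `surfaces($tokenizer->tokenize($text), true)`.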

`tokenizeNbest($text, $n, $unique, $costThreshold)`

Returns the N-best tokenization results as an array of NbestResult objects.

<?php

$results = $tokenizer->tokenizeNbest('すもももももももものうち', 3);
foreach ($results as $result) {
    echo "Cost: {$result->cost}\n";
    foreach ($result->tokens as $token) {
        echo "  {$token->surface}\n";
    }
}

Parameters:

Name	Type	Description
`$text`	`string`	Text to tokenize
`$n`	`int`	Number of results to return
`$unique`	`bool|null`	Deduplicate results (default: `false`)
`$costThreshold`	`int|null`	Maximum cost difference from the best path (default: `null`)

Returns: array<NbestResult>
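
Rendering each candidate segmentation as one line makes the results easier to compare. The following sketch uses only the `$tokens`, `$cost`, and `$surface` properties documented on this page. The helper name `formatNbest` is hypothetical. The stand-in objects and their costs are placeholders invented for illustration, not real tokenizer output.

```php
<?php

// Hypothetical helper: one line per N-best result, "surface/surface (cost N)".
function formatNbest(iterable $results): array
{
    $lines = [];
    foreach ($results as $result) {
        $surfaces = [];
        foreach ($result->tokens as $token) {
            $surfaces[] = $token->surface;
        }
        $lines[] = implode('/', $surfaces) . " (cost {$result->cost})";
    }
    return $lines;
}

// Stand-ins shaped like NbestResult and Token; the costs are placeholders.
$results = [
    (object) ['cost' => 100, 'tokens' => [
        (object) ['surface' => 'すもも'],
        (object) ['surface' => 'も'],
        (object) ['surface' => 'もも'],
    ]],
    (object) ['cost' => 200, 'tokens' => [
        (object) ['surface' => 'すもも'],
        (object) ['surface' => 'もも'],
        (object) ['surface' => 'も'],
    ]],
];

foreach (formatNbest($results) as $line) {
    echo $line, "\n";
}
```

With a real tokenizer, the input would come from a call such as `$tokenizer->tokenizeNbest($text, 3, true)`, where the third argument enables deduplication as described in the parameter table.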

NbestResult

Lindera\NbestResult represents a single N-best tokenization result.

NbestResult Properties

Property	Type	Description
`$tokens`	`array<Token>`	The tokens in this result
`$cost`	`int`	The total cost of this segmentation

Token

Lindera\Token represents a single morphological token.

Token Properties

Property	Type	Description
`$surface`	`string`	Surface form of the token
`$byte_start`	`int`	Start byte position in the original text
`$byte_end`	`int`	End byte position in the original text
`$position`	`int`	Token position index
`$word_id`	`int`	Dictionary word ID
`$is_unknown`	`bool`	`true` if the word is not in the dictionary
`$details`	`array<string>`	Morphological details (part of speech, reading, etc.)
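
Because `$byte_start` and `$byte_end` are byte positions rather than character positions, they pair with PHP's byte-oriented `substr()`, not with `mb_substr()`. The following sketch slices the original text by a token's byte range and converts a byte offset into a character offset. It assumes UTF-8 input and that `$byte_end` is exclusive, which this page does not state. The helper names are hypothetical, and the token objects are hand-written stand-ins so the snippet runs without the extension.

```php
<?php

// Hypothetical helper: slice the original text by a token's byte range.
// Assumes $byte_end is exclusive.
function sliceByBytes(string $text, object $token): string
{
    return substr($text, $token->byte_start, $token->byte_end - $token->byte_start);
}

// Hypothetical helper: convert a byte offset into a UTF-8 character offset
// by counting the code points in the prefix that precedes it.
function byteToCharOffset(string $text, int $byteOffset): int
{
    $prefix = substr($text, 0, $byteOffset);
    return $prefix === '' ? 0 : (int) preg_match_all('/./su', $prefix);
}

// Hand-written stand-ins for Lindera\Token objects (not real tokenizer output).
$text = '形態素解析';
$tokens = [
    (object) ['surface' => '形態素', 'byte_start' => 0, 'byte_end' => 9],
    (object) ['surface' => '解析', 'byte_start' => 9, 'byte_end' => 15],
];

foreach ($tokens as $token) {
    $charStart = byteToCharOffset($text, $token->byte_start);
    echo sliceByBytes($text, $token), " starts at character {$charStart}\n";
}
```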

Token Methods

`getDetail($index)`

Returns the detail string at the specified index, or null if the index is out of range.

<?php

$token = $tokenizer->tokenize('東京')[0];
$pos = $token->getDetail(0);        // e.g., "名詞"
$subpos = $token->getDetail(1);     // e.g., "固有名詞"
$reading = $token->getDetail(7);    // e.g., "トウキョウ"

Parameters:

Name	Type	Description
`$index`	`int`	Zero-based index into the details array

Returns: string|null

The structure of details depends on the dictionary:

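IPADIC: [品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音] (part of speech, POS subcategories 1 to 3, conjugation type, conjugation form, base form, reading, pronunciation)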
UniDic: Detailed morphological features following the UniDic specification
ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
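
Since detail positions are dictionary-specific, giving the indices names can make calling code easier to read. The following is a minimal sketch for the IPADIC layout listed above. The constant and function names are hypothetical, and the sample array is hand-written rather than captured tokenizer output. Missing positions become `null`, mirroring the out-of-range behavior of `getDetail()`.

```php
<?php

// English names for the nine IPADIC detail positions (hypothetical naming).
const IPADIC_FIELDS = [
    'pos', 'pos_sub1', 'pos_sub2', 'pos_sub3',
    'conjugation_type', 'conjugation_form',
    'base_form', 'reading', 'pronunciation',
];

// Hypothetical helper: map a details array to named keys.
// Positions missing from $details become null.
function namedIpadicDetails(array $details): array
{
    $named = [];
    foreach (IPADIC_FIELDS as $index => $name) {
        $named[$name] = $details[$index] ?? null;
    }
    return $named;
}

// Hand-written sample in IPADIC order.
$details = ['名詞', '固有名詞', '地域', '一般', '*', '*', '東京', 'トウキョウ', 'トーキョー'];
$named = namedIpadicDetails($details);
echo $named['pos'], ' / ', $named['reading'], "\n";

// A shorter array leaves the trailing fields as null.
$short = namedIpadicDetails(['名詞', '一般']);
var_dump($short['reading']);
```

With a real token, the input would be the `$details` property, for example `namedIpadicDetails($token->details)`. Other dictionaries would need their own field lists.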
