Tokenizer API

TokenizerBuilder

TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.

Constructors

new TokenizerBuilder()

Creates a new builder with default configuration.

const { TokenizerBuilder } = require("lindera-nodejs");

const builder = new TokenizerBuilder();

new TokenizerBuilder().fromFile(filePath)

Loads configuration from a JSON file and returns a new builder.

const builder = new TokenizerBuilder().fromFile("config.json");

Configuration Methods

All setter methods return this for method chaining.

setMode(mode)

Sets the tokenization mode.

  • "normal" -- Standard tokenization (default)
  • "decompose" -- Decomposes compound words into smaller units
builder.setMode("normal");

setDictionary(path)

Sets the system dictionary path or URI.

// Use an embedded dictionary
builder.setDictionary("embedded://ipadic");

// Use an external dictionary
builder.setDictionary("/path/to/dictionary");

setUserDictionary(uri)

Sets the user dictionary URI.

builder.setUserDictionary("/path/to/user_dictionary");

setKeepWhitespace(keep)

Controls whether whitespace tokens appear in the output.

builder.setKeepWhitespace(true);

appendCharacterFilter(kind, args?)

Appends a character filter to the preprocessing pipeline.

builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

appendTokenFilter(kind, args?)

Appends a token filter to the postprocessing pipeline.

builder.appendTokenFilter("lowercase", {});

Build

build()

Builds and returns a Tokenizer with the configured settings.

const tokenizer = builder.build();

Tokenizer

Tokenizer performs morphological analysis on text.

Creating a Tokenizer

new Tokenizer(dictionary, mode?, userDictionary?)

Creates a tokenizer directly from a loaded dictionary.

const { Tokenizer, loadDictionary } = require("lindera-nodejs");

const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");

Tokenizer Methods

tokenize(text)

Tokenizes the input text and returns an array of Token objects.

const tokens = tokenizer.tokenize("形態素解析");

Parameters:

NameTypeDescription
textstringText to tokenize

Returns: Token[]

tokenizeNbest(text, n, unique?, costThreshold?)

Returns the N-best tokenization results, each containing tokens and total path cost.

const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
  console.log(cost, tokens.map((t) => t.surface));
}

Parameters:

NameTypeDescription
textstringText to tokenize
nnumberNumber of results to return
uniquebooleanDeduplicate results (default: false)
costThresholdnumber | undefinedMaximum cost difference from the best path (default: undefined)

Returns: Array<{ tokens: Token[], cost: number }>

Token

Token represents a single morphological token.

Properties

PropertyTypeDescription
surfacestringSurface form of the token
byteStartnumberStart byte position in the original text
byteEndnumberEnd byte position in the original text
positionnumberToken position index
wordIdnumberDictionary word ID
isUnknownbooleantrue if the word is not in the dictionary
detailsstring[] | nullMorphological details (part of speech, reading, etc.)

Token Methods

getDetail(index)

Returns the detail string at the specified index, or null if the index is out of range.

const token = tokenizer.tokenize("東京")[0];
const pos = token.getDetail(0);      // e.g., "名詞"
const subpos = token.getDetail(1);   // e.g., "固有名詞"
const reading = token.getDetail(7);  // e.g., "トウキョウ"

Parameters:

NameTypeDescription
indexnumberZero-based index into the details array

Returns: string | null

The structure of details depends on the dictionary:

  • IPADIC: [品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]
  • UniDic: Detailed morphological features following the UniDic specification
  • ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats