Tokenizer API

This page documents the JavaScript/TypeScript API exposed by lindera-wasm.

TokenizerBuilder

Builder class for creating a configured Tokenizer instance.

Constructor

const builder = new TokenizerBuilder();

Creates a new builder with default settings.

Methods

setMode(mode)

Sets the tokenization mode.

  • Parameters: mode (string) -- "normal" or "decompose"
  • Returns: void
builder.setMode("normal");

setDictionary(uri)

Sets the dictionary to use for tokenization.

  • Parameters: uri (string) -- Dictionary URI (e.g., "embedded://ipadic")
  • Returns: void
builder.setDictionary("embedded://ipadic");

setDictionaryInstance(dictionary)

Sets a pre-loaded dictionary instance for tokenization. Use this when the dictionary has been loaded from bytes (e.g., via loadDictionaryFromBytes()) instead of from a URI.

  • Parameters: dictionary (Dictionary) -- A loaded dictionary object
  • Returns: void
import { loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
    files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
    files.dictWords, files.matrixMtx, files.charDef, files.unk,
);

builder.setDictionaryInstance(dictionary);

setUserDictionary(uri)

Sets a user-defined dictionary by URI.

  • Parameters: uri (string) -- Path or URI to the user dictionary
  • Returns: void
builder.setUserDictionary("file:///path/to/user_dict.csv");

setUserDictionaryInstance(userDictionary)

Sets a pre-loaded user dictionary instance. Use this when the user dictionary has been loaded from bytes instead of from a URI.

  • Parameters: userDictionary (UserDictionary) -- A loaded user dictionary object
  • Returns: void

setKeepWhitespace(keep)

Sets whether whitespace tokens are preserved in the output.

  • Parameters: keep (boolean) -- true to keep whitespace tokens
  • Returns: void
builder.setKeepWhitespace(true);

appendCharacterFilter(name, args)

Appends a character filter to the preprocessing pipeline.

  • Parameters:
    • name (string) -- Filter name (e.g., "unicode_normalize", "japanese_iteration_mark")
    • args (object, optional) -- Filter configuration
  • Returns: void
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

appendTokenFilter(name, args)

Appends a token filter to the postprocessing pipeline.

  • Parameters:
    • name (string) -- Filter name (e.g., "japanese_stop_tags", "lowercase")
    • args (object, optional) -- Filter configuration
  • Returns: void
builder.appendTokenFilter("japanese_stop_tags", {
    tags: ["助詞", "助動詞", "記号"]
});

build()

Builds and returns a configured Tokenizer instance. Consumes the builder.

  • Returns: Tokenizer
const tokenizer = builder.build();

Tokenizer

The main tokenizer class. Can be created via TokenizerBuilder.build() or directly via the constructor.

Tokenizer Constructor

const tokenizer = new Tokenizer(dictionary, mode, userDictionary);
  • Parameters:
    • dictionary (Dictionary) -- A loaded dictionary object
    • mode (string, optional) -- Tokenization mode ("normal" or "decompose", defaults to "normal")
    • userDictionary (UserDictionary, optional) -- A loaded user dictionary

Tokenizer Methods

tokenize(text)

Tokenizes the input text.

  • Parameters: text (string) -- Text to tokenize
  • Returns: Token[] -- Array of token objects
const tokens = tokenizer.tokenize("関西国際空港");

tokenizeNbest(text, n, unique?, costThreshold?)

Returns N-best tokenization results ordered by total path cost.

  • Parameters:
    • text (string) -- Text to tokenize
    • n (number) -- Number of results to return
    • unique (boolean, optional) -- Deduplicate results with identical segmentation (default: false)
    • costThreshold (number, optional) -- Only return paths within bestCost + threshold
  • Returns: Array of { tokens: object[], cost: number }
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);

Token

Represents a single token produced by the tokenizer.

Properties

PropertyTypeDescription
surfacestringSurface form of the token
byteStartnumberStart byte offset in the original text
byteEndnumberEnd byte offset in the original text
positionnumberPosition index of the token
wordIdnumberWord ID in the dictionary
isUnknownbooleanWhether the token is an unknown word
detailsstring[]Morphological detail fields

Token Methods

getDetail(index)

Returns the detail string at the specified index.

  • Parameters: index (number) -- Zero-based index into the details array
  • Returns: string | undefined
const pos = token.getDetail(0);   // e.g., "名詞"
const reading = token.getDetail(7); // e.g., "トウキョウ"

toJSON()

Returns a plain JavaScript object representation of the token.

  • Returns: object with keys: surface, byteStart, byteEnd, position, wordId, isUnknown, details
console.log(JSON.stringify(token.toJSON(), null, 2));

Helper Functions

loadDictionary(uri)

Loads a dictionary from the specified URI.

  • Parameters: uri (string) -- Dictionary URI (e.g., "embedded://ipadic")
  • Returns: Dictionary
import { loadDictionary } from 'lindera-wasm-web-ipadic';

const dict = loadDictionary("embedded://ipadic");

loadUserDictionary(uri, metadata)

Loads a user dictionary from the specified URI.

  • Parameters:
    • uri (string) -- Path or URI to the user dictionary file
    • metadata (Metadata) -- Dictionary metadata object
  • Returns: UserDictionary

buildDictionary(inputDir, outputDir, metadata)

Builds a compiled dictionary from source files.

  • Parameters:
    • inputDir (string) -- Path to the directory containing source dictionary files
    • outputDir (string) -- Path to the output directory
    • metadata (Metadata) -- Dictionary metadata object
  • Returns: void

buildUserDictionary(inputFile, outputDir, metadata?)

Builds a compiled user dictionary from a CSV file.

  • Parameters:
    • inputFile (string) -- Path to the user dictionary CSV file
    • outputDir (string) -- Path to the output directory
    • metadata (Metadata, optional) -- Dictionary metadata object
  • Returns: void

version() / getVersion()

Returns the version string of the lindera-wasm package.

  • Returns: string
import { version } from 'lindera-wasm-web-ipadic';

console.log(version()); // e.g., "2.1.1"

Enums and Utility Classes

Mode

Tokenization mode enum.

ValueDescription
Mode.NormalStandard tokenization based on dictionary cost
Mode.DecomposeDecompose compound words using penalty-based segmentation

Penalty

Configuration for decompose mode. Controls how aggressively compound words are decomposed.

const penalty = new Penalty(
    kanjiThreshold?,     // Kanji length threshold (default: 2)
    kanjiPenalty?,       // Kanji length penalty (default: 3000)
    otherThreshold?,     // Other character length threshold (default: 7)
    otherPenalty?,       // Other character length penalty (default: 1700)
);
PropertyTypeDefaultDescription
kanji_penalty_length_thresholdnumber2Length threshold for kanji compound splitting
kanji_penalty_length_penaltynumber3000Penalty cost for kanji compounds exceeding threshold
other_penalty_length_thresholdnumber7Length threshold for non-kanji compound splitting
other_penalty_length_penaltynumber1700Penalty cost for non-kanji compounds exceeding threshold

LinderaError

Error type for Lindera operations.

const error = new LinderaError("message");
console.log(error.message);    // "message"
console.log(error.toString()); // "message"
Property / MethodTypeDescription
messagestringError message
toString()stringReturns the error message

Snake-Case Aliases

For consistency with the Python API, all methods are also available in snake_case form:

camelCasesnake_case
setMode()set_mode()
setDictionary()set_dictionary()
setDictionaryInstance()set_dictionary_instance()
setUserDictionary()set_user_dictionary()
setUserDictionaryInstance()set_user_dictionary_instance()
setKeepWhitespace()set_keep_whitespace()
appendCharacterFilter()append_character_filter()
appendTokenFilter()append_token_filter()
tokenizeNbest()tokenize_nbest()
loadDictionary()load_dictionary()
loadDictionaryFromBytes()load_dictionary_from_bytes()
loadUserDictionary()load_user_dictionary()
buildDictionary()build_dictionary()
buildUserDictionary()build_user_dictionary()