Tokenizer API
This page documents the JavaScript/TypeScript API exposed by lindera-wasm.
TokenizerBuilder
Builder class for creating a configured Tokenizer instance.
Constructor
const builder = new TokenizerBuilder();
Creates a new builder with default settings.
Methods
setMode(mode)
Sets the tokenization mode.
- Parameters:
  - mode (string) -- "normal" or "decompose"
- Returns: void
builder.setMode("normal");
setDictionary(uri)
Sets the dictionary to use for tokenization.
- Parameters:
  - uri (string) -- Dictionary URI (e.g., "embedded://ipadic")
- Returns: void
builder.setDictionary("embedded://ipadic");
setDictionaryInstance(dictionary)
Sets a pre-loaded dictionary instance for tokenization.
Use this when the dictionary has been loaded from bytes (e.g., via loadDictionaryFromBytes()) instead of from a URI.
- Parameters:
  - dictionary (Dictionary) -- A loaded dictionary object
- Returns: void
import { loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
builder.setDictionaryInstance(dictionary);
setUserDictionary(uri)
Sets a user-defined dictionary by URI.
- Parameters:
  - uri (string) -- Path or URI to the user dictionary
- Returns: void
builder.setUserDictionary("file:///path/to/user_dict.csv");
setUserDictionaryInstance(userDictionary)
Sets a pre-loaded user dictionary instance. Use this when the user dictionary has been loaded from bytes instead of from a URI.
- Parameters:
  - userDictionary (UserDictionary) -- A loaded user dictionary object
- Returns: void
setKeepWhitespace(keep)
Sets whether whitespace tokens are preserved in the output.
- Parameters:
  - keep (boolean) -- true to keep whitespace tokens
- Returns: void
builder.setKeepWhitespace(true);
appendCharacterFilter(name, args)
Appends a character filter to the preprocessing pipeline.
- Parameters:
  - name (string) -- Filter name (e.g., "unicode_normalize", "japanese_iteration_mark")
  - args (object, optional) -- Filter configuration
- Returns: void
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
appendTokenFilter(name, args)
Appends a token filter to the postprocessing pipeline.
- Parameters:
  - name (string) -- Filter name (e.g., "japanese_stop_tags", "lowercase")
  - args (object, optional) -- Filter configuration
- Returns: void
builder.appendTokenFilter("japanese_stop_tags", {
tags: ["助詞", "助動詞", "記号"]
});
build()
Builds and returns a configured Tokenizer instance. Consumes the builder.
- Returns: Tokenizer
const tokenizer = builder.build();
Tokenizer
The main tokenizer class. Can be created via TokenizerBuilder.build() or directly via the constructor.
Tokenizer Constructor
const tokenizer = new Tokenizer(dictionary, mode, userDictionary);
- Parameters:
  - dictionary (Dictionary) -- A loaded dictionary object
  - mode (string, optional) -- Tokenization mode ("normal" or "decompose", defaults to "normal")
  - userDictionary (UserDictionary, optional) -- A loaded user dictionary
Tokenizer Methods
tokenize(text)
Tokenizes the input text.
- Parameters:
  - text (string) -- Text to tokenize
- Returns: Token[] -- Array of token objects
const tokens = tokenizer.tokenize("関西国際空港");
tokenizeNbest(text, n, unique?, costThreshold?)
Returns N-best tokenization results ordered by total path cost.
- Parameters:
  - text (string) -- Text to tokenize
  - n (number) -- Number of results to return
  - unique (boolean, optional) -- Deduplicate results with identical segmentation (default: false)
  - costThreshold (number, optional) -- Only return paths within bestCost + threshold
- Returns: Array of { tokens: object[], cost: number }
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
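The result shape described above can be inspected with a small helper. The following is a minimal sketch, not part of lindera-wasm: formatNbest and sampleResults are hypothetical names, the sample segmentations and costs are invented, and it assumes each entry in tokens exposes a surface property like a Token.

```javascript
/**
 * Format each N-best candidate as "cost: surface | surface | ...".
 * Expects the documented result shape: { tokens: object[], cost: number }.
 * Assumption: every token object has a `surface` string.
 */
function formatNbest(results) {
  return results.map(
    (r) => `${r.cost}: ${r.tokens.map((t) => t.surface).join(" | ")}`
  );
}

// Hypothetical data for illustration only; these segmentations and costs
// are invented and are not real tokenizer output.
const sampleResults = [
  { tokens: [{ surface: "東京" }, { surface: "都" }], cost: 100 },
  { tokens: [{ surface: "東" }, { surface: "京都" }], cost: 250 },
];

for (const line of formatNbest(sampleResults)) {
  console.log(line);
}
```

In real use, the array returned by tokenizer.tokenizeNbest(...) would take the place of sampleResults.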
Token
Represents a single token produced by the tokenizer.
Properties
| Property | Type | Description |
|---|---|---|
| surface | string | Surface form of the token |
| byteStart | number | Start byte offset in the original text |
| byteEnd | number | End byte offset in the original text |
| position | number | Position index of the token |
| wordId | number | Word ID in the dictionary |
| isUnknown | boolean | Whether the token is an unknown word |
| details | string[] | Morphological detail fields |
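Because byteStart and byteEnd are byte offsets, they cannot be passed directly to String.prototype.slice, which counts UTF-16 code units. The sketch below shows one way to bridge the two, assuming the offsets count UTF-8 bytes of the input text. sliceByBytes and byteOffsetToIndex are hypothetical helpers, and the token object is hand-written for illustration rather than real tokenizer output.

```javascript
// Return the substring covered by a UTF-8 byte range.
// Assumption: byteStart/byteEnd count UTF-8 bytes of `text`.
function sliceByBytes(text, byteStart, byteEnd) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes.subarray(byteStart, byteEnd));
}

// Convert a UTF-8 byte offset into a JavaScript string index
// (UTF-16 code units), e.g. for highlighting in the original string.
function byteOffsetToIndex(text, byteOffset) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes.subarray(0, byteOffset)).length;
}

// Hand-written token-like object; each kanji here is 3 bytes in UTF-8.
const text = "関西国際空港";
const tokenLike = { surface: "国際", byteStart: 6, byteEnd: 12 };

console.log(sliceByBytes(text, tokenLike.byteStart, tokenLike.byteEnd)); // "国際"
console.log(byteOffsetToIndex(text, tokenLike.byteStart)); // 2
```

Both helpers re-encode the whole text on every call, which is fine for short inputs; for long documents it would be cheaper to encode once and reuse the byte array.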
Token Methods
getDetail(index)
Returns the detail string at the specified index.
- Parameters:
  - index (number) -- Zero-based index into the details array
- Returns: string | undefined
const pos = token.getDetail(0); // e.g., "名詞"
const reading = token.getDetail(7); // e.g., "トウキョウ"
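Since getDetail() can return undefined, callers usually want a fallback. The following sketch wraps it in a hypothetical helper, detailOr, which is not part of lindera-wasm. It also treats "*" as missing, on the assumption that the dictionary follows the IPADIC convention of using "*" for empty fields. The mockToken object is hand-written and only mimics the documented details and getDetail members.

```javascript
// Return the detail at `index`, or `fallback` when the field is absent.
// Assumption: "*" marks an empty field, as in IPADIC-style dictionaries.
function detailOr(token, index, fallback = "") {
  const value = token.getDetail(index);
  return value === undefined || value === "*" ? fallback : value;
}

// Hand-written stand-in for a Token; not real tokenizer output.
const mockToken = {
  details: ["名詞", "固有名詞", "地域", "一般", "*", "*", "東京", "トウキョウ", "トーキョー"],
  getDetail(index) {
    return this.details[index];
  },
};

console.log(detailOr(mockToken, 0)); // "名詞"
console.log(detailOr(mockToken, 4, "-")); // "-" (field is "*")
console.log(detailOr(mockToken, 20, "-")); // "-" (index out of range)
```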
toJSON()
Returns a plain JavaScript object representation of the token.
- Returns: object with keys: surface, byteStart, byteEnd, position, wordId, isUnknown, details
console.log(JSON.stringify(token.toJSON(), null, 2));
Helper Functions
loadDictionary(uri)
Loads a dictionary from the specified URI.
- Parameters:
  - uri (string) -- Dictionary URI (e.g., "embedded://ipadic")
- Returns: Dictionary
import { loadDictionary } from 'lindera-wasm-web-ipadic';
const dict = loadDictionary("embedded://ipadic");
loadUserDictionary(uri, metadata)
Loads a user dictionary from the specified URI.
- Parameters:
  - uri (string) -- Path or URI to the user dictionary file
  - metadata (Metadata) -- Dictionary metadata object
- Returns: UserDictionary
buildDictionary(inputDir, outputDir, metadata)
Builds a compiled dictionary from source files.
- Parameters:
  - inputDir (string) -- Path to the directory containing source dictionary files
  - outputDir (string) -- Path to the output directory
  - metadata (Metadata) -- Dictionary metadata object
- Returns: void
buildUserDictionary(inputFile, outputDir, metadata?)
Builds a compiled user dictionary from a CSV file.
- Parameters:
  - inputFile (string) -- Path to the user dictionary CSV file
  - outputDir (string) -- Path to the output directory
  - metadata (Metadata, optional) -- Dictionary metadata object
- Returns: void
version() / getVersion()
Returns the version string of the lindera-wasm package.
- Returns: string
import { version } from 'lindera-wasm-web-ipadic';
console.log(version()); // e.g., "2.1.1"
Enums and Utility Classes
Mode
Tokenization mode enum.
| Value | Description |
|---|---|
| Mode.Normal | Standard tokenization based on dictionary cost |
| Mode.Decompose | Decompose compound words using penalty-based segmentation |
Penalty
Configuration for decompose mode. Controls how aggressively compound words are decomposed.
const penalty = new Penalty(
kanjiThreshold?, // Kanji length threshold (default: 2)
kanjiPenalty?, // Kanji length penalty (default: 3000)
otherThreshold?, // Other character length threshold (default: 7)
otherPenalty?, // Other character length penalty (default: 1700)
);
| Property | Type | Default | Description |
|---|---|---|---|
| kanji_penalty_length_threshold | number | 2 | Length threshold for kanji compound splitting |
| kanji_penalty_length_penalty | number | 3000 | Penalty cost for kanji compounds exceeding threshold |
| other_penalty_length_threshold | number | 7 | Length threshold for non-kanji compound splitting |
| other_penalty_length_penalty | number | 1700 | Penalty cost for non-kanji compounds exceeding threshold |
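To give a feel for how these four values interact, the sketch below computes an extra cost for a candidate word. This is an illustration under an assumed rule, not the library's confirmed implementation: it assumes a Kuromoji-style linear penalty in which each character beyond the threshold adds the penalty once. extraCost and DEFAULT_PENALTY are hypothetical names.

```javascript
// Defaults copied from the table above.
const DEFAULT_PENALTY = {
  kanjiThreshold: 2,
  kanjiPenalty: 3000,
  otherThreshold: 7,
  otherPenalty: 1700,
};

// Assumed rule (not confirmed by this page): a word longer than its
// threshold is charged (length - threshold) * penalty; shorter words
// are charged nothing. Kanji-only words use the kanji pair of values.
function extraCost(numChars, kanjiOnly, p = DEFAULT_PENALTY) {
  const threshold = kanjiOnly ? p.kanjiThreshold : p.otherThreshold;
  const penalty = kanjiOnly ? p.kanjiPenalty : p.otherPenalty;
  return numChars > threshold ? (numChars - threshold) * penalty : 0;
}

console.log(extraCost(6, true)); // 12000: six kanji, four over the threshold
console.log(extraCost(2, true)); // 0: at the threshold, no penalty
console.log(extraCost(9, false)); // 3400: nine non-kanji chars, two over
```

Under this assumption, lowering a threshold or raising a penalty makes long compounds more expensive, so the tokenizer would prefer to split them.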
LinderaError
Error type for Lindera operations.
const error = new LinderaError("message");
console.log(error.message); // "message"
console.log(error.toString()); // "message"
| Property / Method | Type | Description |
|---|---|---|
| message | string | Error message |
| toString() | string | Returns the error message |
Snake-Case Aliases
For consistency with the Python API, the following methods and functions are also available in snake_case form:
| camelCase | snake_case |
|---|---|
| setMode() | set_mode() |
| setDictionary() | set_dictionary() |
| setDictionaryInstance() | set_dictionary_instance() |
| setUserDictionary() | set_user_dictionary() |
| setUserDictionaryInstance() | set_user_dictionary_instance() |
| setKeepWhitespace() | set_keep_whitespace() |
| appendCharacterFilter() | append_character_filter() |
| appendTokenFilter() | append_token_filter() |
| tokenizeNbest() | tokenize_nbest() |
| loadDictionary() | load_dictionary() |
| loadDictionaryFromBytes() | load_dictionary_from_bytes() |
| loadUserDictionary() | load_user_dictionary() |
| buildDictionary() | build_dictionary() |
| buildUserDictionary() | build_user_dictionary() |
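Every pair in the table follows the same mechanical rule: each uppercase letter becomes an underscore plus its lowercase form. The sketch below expresses that rule as a hypothetical helper, toSnakeCase, which is not part of lindera-wasm and only reproduces the names listed above.

```javascript
// Convert a camelCase identifier to snake_case by prefixing each
// uppercase letter with "_" and lowercasing it.
function toSnakeCase(name) {
  return name.replace(/[A-Z]/g, (c) => "_" + c.toLowerCase());
}

console.log(toSnakeCase("tokenizeNbest")); // "tokenize_nbest"
console.log(toSnakeCase("loadDictionaryFromBytes")); // "load_dictionary_from_bytes"
```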