Tokenizer API
TokenizerBuilder
TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.
Constructors
new TokenizerBuilder()
Creates a new builder with default configuration.
const { TokenizerBuilder } = require("lindera-nodejs");
const builder = new TokenizerBuilder();
new TokenizerBuilder().fromFile(filePath)
Loads configuration from a JSON file and returns a new builder.
const builder = new TokenizerBuilder().fromFile("config.json");
Configuration Methods
All setter methods return this for method chaining.
setMode(mode)
Sets the tokenization mode.
"normal"-- Standard tokenization (default)"decompose"-- Decomposes compound words into smaller units
builder.setMode("normal");
setDictionary(path)
Sets the system dictionary path or URI.
// Use an embedded dictionary
builder.setDictionary("embedded://ipadic");
// Use an external dictionary
builder.setDictionary("/path/to/dictionary");
setUserDictionary(uri)
Sets the user dictionary URI.
builder.setUserDictionary("/path/to/user_dictionary");
setKeepWhitespace(keep)
Controls whether whitespace tokens appear in the output.
builder.setKeepWhitespace(true);
appendCharacterFilter(kind, args?)
Appends a character filter to the preprocessing pipeline.
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
appendTokenFilter(kind, args?)
Appends a token filter to the postprocessing pipeline.
builder.appendTokenFilter("lowercase", {});
Build
build()
Builds and returns a Tokenizer with the configured settings.
const tokenizer = builder.build();
Tokenizer
Tokenizer performs morphological analysis on text.
Creating a Tokenizer
new Tokenizer(dictionary, mode?, userDictionary?)
Creates a tokenizer directly from a loaded dictionary.
const { Tokenizer, loadDictionary } = require("lindera-nodejs");
const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");
Tokenizer Methods
tokenize(text)
Tokenizes the input text and returns an array of Token objects.
const tokens = tokenizer.tokenize("形態素解析");
Parameters:
| Name | Type | Description |
|---|---|---|
text | string | Text to tokenize |
Returns: Token[]
tokenizeNbest(text, n, unique?, costThreshold?)
Returns the N-best tokenization results, each containing tokens and total path cost.
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
console.log(cost, tokens.map((t) => t.surface));
}
Parameters:
| Name | Type | Description |
|---|---|---|
text | string | Text to tokenize |
n | number | Number of results to return |
unique | boolean | Deduplicate results (default: false) |
costThreshold | number | undefined | Maximum cost difference from the best path (default: undefined) |
Returns: Array<{ tokens: Token[], cost: number }>
Token
Token represents a single morphological token.
Properties
| Property | Type | Description |
|---|---|---|
surface | string | Surface form of the token |
byteStart | number | Start byte position in the original text |
byteEnd | number | End byte position in the original text |
position | number | Token position index |
wordId | number | Dictionary word ID |
isUnknown | boolean | true if the word is not in the dictionary |
details | string[] | null | Morphological details (part of speech, reading, etc.) |
Token Methods
getDetail(index)
Returns the detail string at the specified index, or null if the index is out of range.
const token = tokenizer.tokenize("東京")[0];
const pos = token.getDetail(0); // e.g., "名詞"
const subpos = token.getDetail(1); // e.g., "固有名詞"
const reading = token.getDetail(7); // e.g., "トウキョウ"
Parameters:
| Name | Type | Description |
|---|---|---|
index | number | Zero-based index into the details array |
Returns: string | null
The structure of details depends on the dictionary:
- IPADIC:
[品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音] - UniDic: Detailed morphological features following the UniDic specification
- ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats