Quick Start
This guide shows how to tokenize text using lindera-nodejs.
Basic Tokenization
The recommended way to create a tokenizer is through TokenizerBuilder:
const { TokenizerBuilder } = require("lindera-nodejs");
const builder = new TokenizerBuilder();
builder.setMode("normal");
builder.setDictionary("/path/to/ipadic");
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
for (const token of tokens) {
console.log(`${token.surface}\t${token.details.join(",")}`);
}
Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
Method Chaining
TokenizerBuilder supports method chaining for concise configuration:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setMode("normal")
.setDictionary("/path/to/ipadic")
.build();
const tokens = tokenizer.tokenize("すもももももももものうち");
for (const token of tokens) {
console.log(`${token.surface}\t${token.getDetail(0)}`);
}
Accessing Token Properties
Each token exposes the following properties:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setDictionary("/path/to/ipadic")
.build();
const tokens = tokenizer.tokenize("東京タワー");
for (const token of tokens) {
console.log(`Surface: ${token.surface}`);
console.log(`Byte range: ${token.byteStart}..${token.byteEnd}`);
console.log(`Position: ${token.position}`);
console.log(`Word ID: ${token.wordId}`);
console.log(`Unknown: ${token.isUnknown}`);
console.log(`Details: ${token.details}`);
console.log();
}
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setDictionary("/path/to/ipadic")
.build();
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
const surfaces = tokens.map((t) => t.surface);
console.log(`Cost ${cost}: ${surfaces.join(" / ")}`);
}
TypeScript
Lindera Node.js includes TypeScript type definitions. All classes and functions are fully typed:
import { TokenizerBuilder, Token } from "lindera-nodejs";
const tokenizer = new TokenizerBuilder()
.setMode("normal")
.setDictionary("/path/to/ipadic")
.build();
const tokens: Token[] = tokenizer.tokenize("形態素解析");
for (const token of tokens) {
console.log(`${token.surface}: ${token.details?.join(",")}`);
}