Dictionary Management
Loading Dictionaries from OPFS
The recommended way to use dictionaries in WASM is to download them from GitHub Releases and load them via OPFS. This avoids embedding large dictionaries in the WASM binary.
Loading from Bytes
Use loadDictionaryFromBytes() to construct a Dictionary from raw byte arrays stored in OPFS or other browser storage.
loadDictionaryFromBytes(metadata, dictDa, dictVals, dictWordsIdx, dictWords, matrixMtx, charDef, unk)
- Parameters:
metadata(Uint8Array) -- Contents ofmetadata.jsondictDa(Uint8Array) -- Contents ofdict.da(Double-Array Trie)dictVals(Uint8Array) -- Contents ofdict.vals(word value data)dictWordsIdx(Uint8Array) -- Contents ofdict.wordsidx(word details index)dictWords(Uint8Array) -- Contents ofdict.words(word details)matrixMtx(Uint8Array) -- Contents ofmatrix.mtx(connection cost matrix)charDef(Uint8Array) -- Contents ofchar_def.bin(character definitions)unk(Uint8Array) -- Contents ofunk.bin(unknown word dictionary)
- Returns:
Dictionary
import { loadDictionaryFromBytes, TokenizerBuilder } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
// Load dictionary files from OPFS
const files = await loadDictionaryFiles("ipadic");
// Create a Dictionary from bytes
const dictionary = loadDictionaryFromBytes(
files.metadata,
files.dictDa,
files.dictVals,
files.dictWordsIdx,
files.dictWords,
files.matrixMtx,
files.charDef,
files.unk,
);
// Use with TokenizerBuilder
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();
See OPFS Dictionary Storage for the full OPFS workflow including downloading and caching.
Embedded Dictionaries (Advanced)
If you built with an embed-* feature flag, you can load embedded dictionaries via the embedded:// URI scheme. This increases the WASM binary size significantly.
Loading an Embedded Dictionary
import { loadDictionary } from 'lindera-wasm-web-ipadic';
const dictionary = loadDictionary("embedded://ipadic");
Available embedded dictionary URIs (depending on which features were enabled at build time):
| URI | Feature Flag |
|---|---|
embedded://ipadic | embed-ipadic |
embedded://unidic | embed-unidic |
embedded://ko-dic | embed-ko-dic |
embedded://cc-cedict | embed-cc-cedict |
embedded://jieba | embed-jieba |
Using with TokenizerBuilder
const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();
Using with Tokenizer Constructor
import { loadDictionary, Tokenizer } from 'lindera-wasm-web-ipadic';
const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");
Dictionary Class
The Dictionary class represents a loaded morphological analysis dictionary.
Properties
| Property | Type | Description |
|---|---|---|
name | string | Dictionary name (e.g., "ipadic") |
encoding | string | Character encoding of the dictionary |
metadata | Metadata | Full metadata object |
console.log(dictionary.name); // "ipadic"
console.log(dictionary.encoding); // "utf-8"
User Dictionaries
User dictionaries allow you to add custom words that are not in the system dictionary.
Loading a User Dictionary
import { loadUserDictionary } from 'lindera-wasm-web';
const metadata = dictionary.metadata;
const userDict = loadUserDictionary("/path/to/user_dict.csv", metadata);
Using a User Dictionary with Tokenizer
import { loadDictionaryFromBytes, loadUserDictionary, Tokenizer } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const userDict = loadUserDictionary("/path/to/user_dict.csv", dictionary.metadata);
const tokenizer = new Tokenizer(dictionary, "normal", userDict);
User Dictionary CSV Format
The user dictionary CSV follows the same format as the Lindera user dictionary:
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
Each line contains: surface,part_of_speech,reading
Building Dictionaries
You can build compiled dictionaries from source files using the JavaScript API.
Building a System Dictionary
import { buildDictionary } from 'lindera-wasm-web';
const metadata = {
name: "custom-dict",
encoding: "utf-8",
// ... other metadata fields
};
buildDictionary("/path/to/source/dir", "/path/to/output/dir", metadata);
Building a User Dictionary
import { buildUserDictionary } from 'lindera-wasm-web';
buildUserDictionary("/path/to/user_dict.csv", "/path/to/output/dir");
The metadata parameter is optional for buildUserDictionary. If omitted, default metadata is used.
Metadata
The Metadata class configures dictionary parameters.
Constructor
const metadata = new Metadata(name?, encoding?);
- Parameters:
name(string, optional) -- Dictionary name (default:"default")encoding(string, optional) -- Character encoding (default:"UTF-8")
Static Methods
Metadata.createDefault()
Creates a Metadata instance with default values.
const metadata = Metadata.createDefault();
Metadata Properties
| Property | Type | Default | Description |
|---|---|---|---|
name | string | "default" | Dictionary name |
encoding | string | "UTF-8" | Character encoding |
dictionary_schema | Schema | IPADIC schema | Schema for the main dictionary |
user_dictionary_schema | Schema | Minimal schema | Schema for user dictionaries |
All properties support both getting and setting:
const metadata = Metadata.createDefault();
metadata.name = "custom_dict";
metadata.encoding = "EUC-JP";
console.log(metadata.name); // "custom_dict"
You can also access the metadata from a loaded dictionary via dictionary.metadata.
Schema
The Schema class defines the field structure of dictionary entries.
Schema Constructor
const schema = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);
Schema Static Methods
Schema.create_default()-- Creates a default IPADIC-like schema
Schema Methods
| Method | Returns | Description |
|---|---|---|
get_field_index(name) | number | undefined | Get field index by name |
field_count() | number | Total number of fields |
get_field_name(index) | string | undefined | Get field name by index |
get_custom_fields() | string[] | Fields beyond index 3 (morphological features) |
get_all_fields() | string[] | All field names |
get_field_by_name(name) | FieldDefinition | undefined | Get full field definition |
FieldDefinition
| Property | Type | Description |
|---|---|---|
index | number | Field position index |
name | string | Field name |
field_type | FieldType | Field type enum |
description | string | undefined | Optional description |
FieldType
| Value | Description |
|---|---|
FieldType.Surface | Word surface text |
FieldType.LeftContextId | Left context ID |
FieldType.RightContextId | Right context ID |
FieldType.Cost | Word cost |
FieldType.Custom | Morphological feature field |