Dictionary Management

Loading Dictionaries from OPFS

The recommended way to use dictionaries in WASM is to download them from GitHub Releases and load them via OPFS. This avoids embedding large dictionaries in the WASM binary.

Loading from Bytes

Use loadDictionaryFromBytes() to construct a Dictionary from raw byte arrays stored in OPFS or other browser storage.

loadDictionaryFromBytes(metadata, dictDa, dictVals, dictWordsIdx, dictWords, matrixMtx, charDef, unk)

  • Parameters:
    • metadata (Uint8Array) -- Contents of metadata.json
    • dictDa (Uint8Array) -- Contents of dict.da (Double-Array Trie)
    • dictVals (Uint8Array) -- Contents of dict.vals (word value data)
    • dictWordsIdx (Uint8Array) -- Contents of dict.wordsidx (word details index)
    • dictWords (Uint8Array) -- Contents of dict.words (word details)
    • matrixMtx (Uint8Array) -- Contents of matrix.mtx (connection cost matrix)
    • charDef (Uint8Array) -- Contents of char_def.bin (character definitions)
    • unk (Uint8Array) -- Contents of unk.bin (unknown word dictionary)
  • Returns: Dictionary
import { loadDictionaryFromBytes, TokenizerBuilder } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

// Load dictionary files from OPFS
const files = await loadDictionaryFiles("ipadic");

// Create a Dictionary from bytes
const dictionary = loadDictionaryFromBytes(
    files.metadata,
    files.dictDa,
    files.dictVals,
    files.dictWordsIdx,
    files.dictWords,
    files.matrixMtx,
    files.charDef,
    files.unk,
);

// Use with TokenizerBuilder
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();

See OPFS Dictionary Storage for the full OPFS workflow including downloading and caching.

Embedded Dictionaries (Advanced)

If you built with an embed-* feature flag, you can load embedded dictionaries via the embedded:// URI scheme. This increases the WASM binary size significantly.

Loading an Embedded Dictionary

import { loadDictionary } from 'lindera-wasm-web-ipadic';

const dictionary = loadDictionary("embedded://ipadic");

Available embedded dictionary URIs (depending on which features were enabled at build time):

URIFeature Flag
embedded://ipadicembed-ipadic
embedded://unidicembed-unidic
embedded://ko-dicembed-ko-dic
embedded://cc-cedictembed-cc-cedict
embedded://jiebaembed-jieba

Using with TokenizerBuilder

const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();

Using with Tokenizer Constructor

import { loadDictionary, Tokenizer } from 'lindera-wasm-web-ipadic';

const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");

Dictionary Class

The Dictionary class represents a loaded morphological analysis dictionary.

Properties

PropertyTypeDescription
namestringDictionary name (e.g., "ipadic")
encodingstringCharacter encoding of the dictionary
metadataMetadataFull metadata object
console.log(dictionary.name);     // "ipadic"
console.log(dictionary.encoding); // "utf-8"

User Dictionaries

User dictionaries allow you to add custom words that are not in the system dictionary.

Loading a User Dictionary

import { loadUserDictionary } from 'lindera-wasm-web';

const metadata = dictionary.metadata;
const userDict = loadUserDictionary("/path/to/user_dict.csv", metadata);

Using a User Dictionary with Tokenizer

import { loadDictionaryFromBytes, loadUserDictionary, Tokenizer } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';

const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
    files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
    files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const userDict = loadUserDictionary("/path/to/user_dict.csv", dictionary.metadata);
const tokenizer = new Tokenizer(dictionary, "normal", userDict);

User Dictionary CSV Format

The user dictionary CSV follows the same format as the Lindera user dictionary:

東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン

Each line contains: surface,part_of_speech,reading

Building Dictionaries

You can build compiled dictionaries from source files using the JavaScript API.

Building a System Dictionary

import { buildDictionary } from 'lindera-wasm-web';

const metadata = {
    name: "custom-dict",
    encoding: "utf-8",
    // ... other metadata fields
};

buildDictionary("/path/to/source/dir", "/path/to/output/dir", metadata);

Building a User Dictionary

import { buildUserDictionary } from 'lindera-wasm-web';

buildUserDictionary("/path/to/user_dict.csv", "/path/to/output/dir");

The metadata parameter is optional for buildUserDictionary. If omitted, default metadata is used.

Metadata

The Metadata class configures dictionary parameters.

Constructor

const metadata = new Metadata(name?, encoding?);
  • Parameters:
    • name (string, optional) -- Dictionary name (default: "default")
    • encoding (string, optional) -- Character encoding (default: "UTF-8")

Static Methods

Metadata.createDefault()

Creates a Metadata instance with default values.

const metadata = Metadata.createDefault();

Metadata Properties

PropertyTypeDefaultDescription
namestring"default"Dictionary name
encodingstring"UTF-8"Character encoding
dictionary_schemaSchemaIPADIC schemaSchema for the main dictionary
user_dictionary_schemaSchemaMinimal schemaSchema for user dictionaries

All properties support both getting and setting:

const metadata = Metadata.createDefault();
metadata.name = "custom_dict";
metadata.encoding = "EUC-JP";
console.log(metadata.name); // "custom_dict"

You can also access the metadata from a loaded dictionary via dictionary.metadata.

Schema

The Schema class defines the field structure of dictionary entries.

Schema Constructor

const schema = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);

Schema Static Methods

  • Schema.create_default() -- Creates a default IPADIC-like schema

Schema Methods

MethodReturnsDescription
get_field_index(name)number | undefinedGet field index by name
field_count()numberTotal number of fields
get_field_name(index)string | undefinedGet field name by index
get_custom_fields()string[]Fields beyond index 3 (morphological features)
get_all_fields()string[]All field names
get_field_by_name(name)FieldDefinition | undefinedGet full field definition

FieldDefinition

PropertyTypeDescription
indexnumberField position index
namestringField name
field_typeFieldTypeField type enum
descriptionstring | undefinedOptional description

FieldType

ValueDescription
FieldType.SurfaceWord surface text
FieldType.LeftContextIdLeft context ID
FieldType.RightContextIdRight context ID
FieldType.CostWord cost
FieldType.CustomMorphological feature field