Lindera
A morphological analysis library in Rust. Lindera is forked from kuromoji-rs and aims to provide easy installation and concise APIs for tokenizing text in multiple languages.
Key Features
| Feature | Description |
|---|---|
| Morphological Analysis | Viterbi-based segmentation and part-of-speech tagging |
| Multi-language Support | Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba) |
| Dictionary System | Pre-built dictionaries, user dictionaries, and custom dictionary training |
| Text Processing Pipeline | Composable character filters and token filters for flexible text normalization |
| CRF Training | Train custom CRF models for dictionary cost estimation |
| Python Bindings | Use Lindera from Python via PyO3 |
| WebAssembly | Run Lindera in the browser via wasm-bindgen |
| Pure Rust | No C/C++ dependencies; works on any platform Rust supports |
Tokenization Flow
graph LR
subgraph Your Application
T["Text"]
end
subgraph Lindera
CF["Character Filters"]
SEG["Segmenter\n(Dictionary + Viterbi)"]
TF["Token Filters"]
end
T --> CF --> SEG --> TF --> R["Tokens"]
Document Map
| Section | Description |
|---|---|
| Getting Started | Installation, quick start, and examples |
| Dictionaries | Available dictionaries and how to use them |
| Configuration | YAML-based tokenizer configuration |
| Advanced Usage | User dictionaries, filters, and CRF training |
| CLI | Command-line interface reference |
| Architecture | Crate structure and design overview |
| API Reference | Rust API documentation |
| Contributing | How to contribute to Lindera |
Quick Example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;

    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Run the example:
cargo run --features=embed-ipadic --example=tokenize
Output:
text: 関西国際空港限定トートバッグ
token: 関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token: 限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token: トートバッグ 名詞,一般,*,*,*,*,*,*,*
License
Lindera is released under the MIT License.
Architecture
Lindera is organized as a Cargo workspace comprising multiple crates. Each crate has a focused responsibility, from low-level CRF computation to high-level CLI and language bindings.
Crate Dependency Graph
graph TB
CRF["lindera-crf\n(CRF Engine)"]
DICT["lindera-dictionary\n(Dictionary Base)"]
IPADIC["lindera-ipadic"]
UNIDIC["lindera-unidic"]
KODIC["lindera-ko-dic"]
CCCEDICT["lindera-cc-cedict"]
JIEBA["lindera-jieba"]
NEOLOGD["lindera-ipadic-neologd"]
LIB["lindera\n(Core Library)"]
CLI["lindera-cli\n(CLI)"]
PY["lindera-python\n(Python)"]
WASM["lindera-wasm\n(WebAssembly)"]
CRF --> DICT
DICT --> IPADIC
DICT --> UNIDIC
DICT --> KODIC
DICT --> CCCEDICT
DICT --> JIEBA
DICT --> NEOLOGD
DICT --> LIB
IPADIC --> LIB
UNIDIC --> LIB
KODIC --> LIB
CCCEDICT --> LIB
JIEBA --> LIB
NEOLOGD --> LIB
LIB --> CLI
LIB --> PY
LIB --> WASM
Crate Overview
| Crate | Type | Description |
|---|---|---|
| lindera-crf | Core | Pure Rust CRF (Conditional Random Field) implementation. Supports no_std. Uses rkyv for serialization. |
| lindera-dictionary | Core | Dictionary base library. Provides dictionary loading, building, and training (with the train feature). |
| lindera | Core | Main morphological analysis library. Integrates dictionaries, segmenter, character filters, and token filters. |
| lindera-cli | Application | Command-line interface for tokenization, dictionary building, and CRF training. |
| lindera-ipadic | Dictionary | Japanese dictionary based on IPADIC. |
| lindera-ipadic-neologd | Dictionary | Japanese dictionary based on IPADIC NEologd (includes neologisms). |
| lindera-unidic | Dictionary | Japanese dictionary based on UniDic. |
| lindera-ko-dic | Dictionary | Korean dictionary based on ko-dic. |
| lindera-cc-cedict | Dictionary | Chinese dictionary based on CC-CEDICT. |
| lindera-jieba | Dictionary | Chinese dictionary based on Jieba. |
| lindera-python | Binding | Python bindings via PyO3. |
| lindera-wasm | Binding | WebAssembly bindings via wasm-bindgen. |
Tokenization Pipeline
Lindera processes text through a multi-stage pipeline:
Input Text
|
v
Character Filters -- Normalize characters (e.g., Unicode normalization, mapping)
|
v
Segmenter -- Segment text into tokens using a dictionary and the Viterbi algorithm
|
v
Token Filters -- Transform tokens (e.g., POS filtering, stop words, stemming)
|
v
Output Tokens
The Segmenter is the core component. It builds a lattice of candidate tokens from the dictionary, then applies the Viterbi algorithm to find the lowest-cost path, producing the most likely segmentation.
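Expressed with the library's API, the three stages line up one-to-one with calls on the Tokenizer. The following condensed sketch is assembled from the Character Filters and Token Filters chapters later in this book; it assumes a build with the embed-ipadic feature enabled.
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::character_filter::BoxCharacterFilter;
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::token_filter::japanese_stop_tags::JapaneseStopTagsTokenFilter;
use lindera::token_filter::BoxTokenFilter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Stage 2: the segmenter (dictionary + Viterbi).
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let mut tokenizer = Tokenizer::new(segmenter);

    // Stage 1: character filters run on the raw text before segmentation.
    tokenizer.append_character_filter(BoxCharacterFilter::from(
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC),
    ));

    // Stage 3: token filters run on the token stream after segmentation.
    tokenizer.append_token_filter(BoxTokenFilter::from(JapaneseStopTagsTokenFilter::new(
        vec!["助詞".to_string()].into_iter().collect(),
    )));

    for token in tokenizer.tokenize("Ｌｉｎｄｅｒａは形態素解析エンジンです。")? {
        println!("{}", token.surface.as_ref());
    }
    Ok(())
}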
Feature Flags
| Feature | Description | Default |
|---|---|---|
| mmap | Memory-mapped file support for dictionary loading | Enabled |
| train | CRF-based dictionary training functionality (depends on lindera-crf) | CLI only |
| embed-ipadic | Embed the IPADIC dictionary into the binary | Disabled |
| embed-cjk | Embed IPADIC + ko-dic + Jieba dictionaries | Disabled |
| embed-cjk2 | Embed UniDic + ko-dic + Jieba dictionaries | Disabled |
| embed-cjk3 | Embed IPADIC NEologd + ko-dic + Jieba dictionaries | Disabled |
Learn More
- Getting Started -- Installation and first steps
- Core Concepts -- Dictionaries, tokenization, and filters
- Lindera Library -- Configuration, segmenter, and API
- Lindera CLI -- Command-line interface
- Development Guide -- Build, test, and contribute
Getting Started
This section will guide you through installing Lindera and running your first morphological analysis.
- Installation -- Add Lindera to your project and configure environment variables
- Quick Start -- Tokenize your first text in just a few lines of code
- Examples -- Explore example programs for common use cases
Installation
Put the following in Cargo.toml:
[dependencies]
lindera = "3.0.0"
Dictionary Setup
Lindera requires a pre-built dictionary at runtime. Download a dictionary from GitHub Releases and specify its path when loading:
let dictionary = load_dictionary("/path/to/ipadic")?;
[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding embed-* feature flag and load it using the embedded:// scheme:

// Cargo.toml: lindera = { version = "3.0.0", features = ["embed-ipadic"] }
let dictionary = load_dictionary("embedded://ipadic")?;

See Feature Flags for details.
Environment Variables
LINDERA_DICTIONARIES_PATH
The LINDERA_DICTIONARIES_PATH environment variable specifies a directory for caching dictionary source files. This enables:
- Offline builds: Once downloaded, dictionary source files are preserved for future builds
- Faster builds: Subsequent builds skip downloading if valid cached files exist
- Reproducible builds: Ensures consistent dictionary versions across builds
Usage:
export LINDERA_DICTIONARIES_PATH=/path/to/dicts
cargo build --features=ipadic
When set, dictionary source files are stored in $LINDERA_DICTIONARIES_PATH/<version>/, where <version> is the lindera-dictionary crate version. The cache validates files using MD5 checksums; invalid files are automatically re-downloaded.
[!NOTE]
LINDERA_CACHE is deprecated but still supported for backward compatibility. It will be used if LINDERA_DICTIONARIES_PATH is not set.
LINDERA_CONFIG_PATH
The LINDERA_CONFIG_PATH environment variable specifies the path to a YAML configuration file for the tokenizer. This allows you to configure tokenizer behavior without modifying Rust code.
export LINDERA_CONFIG_PATH=./resources/config/lindera.yml
See the Configuration section for details on the configuration format.
DOCS_RS
The DOCS_RS environment variable is automatically set by docs.rs when building documentation. When this variable is detected, Lindera creates dummy dictionary files instead of downloading actual dictionary data, allowing documentation to be built without network access or large file downloads.
This is primarily used internally by docs.rs and typically doesn't need to be set by users.
LINDERA_WORKDIR
The LINDERA_WORKDIR environment variable is automatically set during the build process by the lindera-dictionary crate. It points to the directory containing the built dictionary data files and is used internally by dictionary crates to locate their data files.
This variable is set automatically and should not be modified by users.
Quick Start
This example covers the basic usage of Lindera.
It will:
- Create a tokenizer in normal mode
- Tokenize the input text
- Output the tokens
First, download a pre-built IPADIC dictionary from GitHub Releases and extract it to a local directory (e.g., /path/to/ipadic).
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("/path/to/ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;

    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
The above example can be run as follows:
% cargo run --example=tokenize
[!TIP] If you embed the dictionary into the binary using the embed-ipadic feature (advanced usage), you can use load_dictionary("embedded://ipadic") instead of specifying a file path. See Feature Flags for details.
You can see the result as follows:
text: 関西国際空港限定トートバッグ
token: 関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token: 限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token: トートバッグ 名詞,一般,*,*,*,*,*,*,*
Examples
Lindera includes several example programs that demonstrate common use cases. The source code is available in the examples directory on GitHub.
Before running the examples, download a pre-built IPADIC dictionary from GitHub Releases and extract it to a local directory.
Available Examples
tokenize
Basic tokenization using an external IPADIC dictionary. Segments input text and prints each token with its part-of-speech details.
cargo run --example=tokenize
tokenize_with_user_dict
Tokenization with a user dictionary. Shows how to supplement the dictionary with custom entries for domain-specific terms.
cargo run --example=tokenize_with_user_dict
tokenize_with_filters
Tokenization with character filters and token filters. Demonstrates the text processing pipeline, including Unicode normalization, part-of-speech filtering, and other transformations.
cargo run --example=tokenize_with_filters
tokenize_with_config
Tokenization using a YAML configuration file. Shows how to configure the tokenizer declaratively instead of programmatically.
cargo run --example=tokenize_with_config
Core Concepts
This section explains the fundamental concepts behind Lindera's morphological analysis system.
- Morphological Analysis - How Lindera segments and analyzes text.
- Dictionaries - Dictionary formats supported by Lindera.
- Tokenization - Tokenization modes and N-Best analysis.
- User Dictionary - Adding custom words with user dictionaries.
- Character Filters - Pre-processing text before tokenization.
Morphological Analysis
What is morphological analysis?
Morphological analysis is the process of breaking down text into its smallest meaningful units (morphemes) and identifying their grammatical properties. For languages like Japanese, Chinese, and Korean -- where words are not separated by spaces -- morphological analysis is an essential first step for natural language processing tasks such as search indexing, text classification, and machine translation.
How Lindera works
Lindera is a dictionary-based morphological analyzer. It uses a pre-compiled system dictionary containing known words along with their costs, and applies the Viterbi algorithm to find the optimal segmentation of input text.
The analysis process works as follows:
- Lattice construction: Lindera scans the input text and looks up all possible words in the dictionary at every position, building a directed acyclic graph (lattice) of candidate segmentations.
- Cost assignment: Each candidate word has an associated word cost (from the dictionary), and each pair of adjacent words has a connection cost (from the connection cost matrix).
- Optimal path search: The Viterbi algorithm finds the path through the lattice with the minimum total cost, producing the best segmentation.
Key terminology
| Term | Description |
|---|---|
| Surface form | The actual text as it appears in the input (e.g., "食べ"). |
| Part-of-speech (POS) | The grammatical category of a word (e.g., noun, verb, particle). Lindera dictionaries provide hierarchical POS tags with up to four levels of subcategories. |
| Reading | The pronunciation of a word, typically in Katakana for Japanese dictionaries. |
| Base form | The uninflected (dictionary) form of a word (e.g., "食べる" for the surface "食べ"). |
| Conjugation | Inflection information for words that conjugate, consisting of a conjugation type and a conjugation form. |
Cost-based segmentation
The Viterbi algorithm selects the segmentation path with the minimum total cost. The total cost of a path is the sum of:
- Word costs: Each word in the dictionary has an associated cost. Lower cost means the word is more likely to appear. Common words tend to have lower costs, while rare words have higher costs.
- Connection costs: The cost of connecting two adjacent words, determined by the right context ID of the left word and the left context ID of the right word.
The algorithm computes:
Total cost = sum of word costs + sum of connection costs
By minimizing this total cost, Lindera finds the most natural segmentation of the input text.
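As a toy illustration of this arithmetic (the numbers below are invented, not taken from any real dictionary):
fn main() {
    // Hypothetical costs for a three-token path A | B | C:
    let word_costs = [2000_i64, 500, 1800]; // one per token, from the dictionary
    let connection_costs = [300_i64, 200]; // one per boundary, from the matrix
    let total: i64 = word_costs.iter().sum::<i64>() + connection_costs.iter().sum::<i64>();
    assert_eq!(total, 4800); // Viterbi picks the path minimizing this sum
}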
Connection cost matrix
The connection cost matrix stores the cost of transitioning from one word to another. It is a two-dimensional matrix indexed by:
- The right context ID of the preceding word
- The left context ID of the following word
These context IDs encode grammatical information about word boundaries. For example, the connection cost between a noun and a particle is typically low (natural sequence), while the connection cost between two verbs in base form might be high (unnatural sequence).
The connection cost matrix is compiled into binary format as part of the dictionary build process and is loaded at runtime for efficient lookup.
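Conceptually, the lookup is a single two-dimensional index. A minimal sketch, assuming a row-major flat array (the actual on-disk layout is an implementation detail of lindera-dictionary):
// Hypothetical flat, row-major connection cost matrix.
fn connection_cost(matrix: &[i16], num_left_ids: usize, right_id: usize, left_id: usize) -> i16 {
    // right_id: right context ID of the preceding word
    // left_id:  left context ID of the following word
    matrix[right_id * num_left_ids + left_id]
}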
Dictionaries
Lindera supports various dictionaries for Japanese, Korean, and Chinese morphological analysis. Each dictionary is provided as a separate crate.
| Dictionary | Language | Crate | Description |
|---|---|---|---|
| IPADIC | Japanese | lindera-ipadic | The most common dictionary for Japanese |
| IPADIC NEologd | Japanese | lindera-ipadic-neologd | IPADIC with neologisms (new words) |
| UniDic | Japanese | lindera-unidic | Uniform word unit definitions |
| ko-dic | Korean | lindera-ko-dic | Korean morphological analysis |
| CC-CEDICT | Chinese | lindera-cc-cedict | Chinese-English dictionary |
| Jieba | Chinese | lindera-jieba | Jieba-based Chinese dictionary |
Obtaining Dictionaries
Pre-built dictionaries are available for download from GitHub Releases. Download the dictionary archive for your target language and extract it to a local directory.
// Load an external dictionary from a local path
let dictionary = load_dictionary("/path/to/ipadic")?;
[!TIP] If you need a self-contained binary without external dictionary files, you can embed dictionaries using the embed-* feature flags and load them using the embedded:// scheme:

let dictionary = load_dictionary("embedded://ipadic")?;

See Feature Flags for details.
See each dictionary crate's documentation for format details, build instructions, and usage examples.
Tokenization
Lindera provides multiple tokenization modes and supports N-Best analysis for enumerating alternative segmentation candidates.
Tokenization modes
Normal mode
Normal mode performs standard tokenization based on dictionary entries. Compound words that exist as single entries in the dictionary are kept as-is.
Example -- tokenizing "関西国際空港限定トートバッグ" in Normal mode:
関西国際空港 | 限定 | トートバッグ
The compound noun "関西国際空港" (Kansai International Airport) is preserved as a single token because it exists as one entry in the dictionary.
Decompose mode
Decompose mode further breaks down compound nouns into their constituent parts, even when the compound exists as a dictionary entry.
Example -- tokenizing "関西国際空港限定トートバッグ" in Decompose mode:
関西 | 国際 | 空港 | 限定 | トートバッグ
The compound "関西国際空港" is decomposed into "関西", "国際", and "空港".
Selecting a mode
In Rust, specify the mode when creating a Segmenter:
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;

// Normal mode
let dictionary = load_dictionary("embedded://ipadic")?;
let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

// Decompose mode (the dictionary is loaded again because it is moved into the segmenter)
let dictionary = load_dictionary("embedded://ipadic")?;
let segmenter = Segmenter::new(Mode::Decompose(Default::default()), dictionary, None);
With the CLI, use the --mode flag:
echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic --mode normal
echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic --mode decompose
N-Best tokenization
N-Best tokenization enumerates the top N tokenization candidates ordered by total path cost (lower cost = better segmentation). This is useful when the best result is ambiguous, or when you want to explore alternative interpretations of the input text.
Algorithm
N-Best tokenization is based on the Forward-DP Backward-A* algorithm, which is compatible with MeCab's N-Best implementation. The forward pass computes optimal costs using dynamic programming, and the backward pass uses A* search to enumerate paths in order of increasing total cost.
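The core idea can be shown in a self-contained sketch: an exact shortest-path DP in one direction gives a perfect heuristic for an A* search in the other, so complete paths leave the priority queue in increasing order of total cost. The toy version below runs the DP backward and the A* forward over a generic lattice; it is illustrative only and is not Lindera's implementation (which, like MeCab, runs the passes the other way around):
use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct Edge {
    start: usize, // start position of the candidate token
    end: usize,   // end position
    cost: i64,    // word cost + connection cost, folded together for simplicity
}

/// Enumerate up to `n` lowest-cost paths from position 0 to `len`.
fn nbest(edges: &[Edge], len: usize, n: usize) -> Vec<(i64, Vec<usize>)> {
    // DP pass: h[pos] = exact minimum cost from pos to the end (the A* heuristic).
    let mut h = vec![i64::MAX; len + 1];
    h[len] = 0;
    for pos in (0..len).rev() {
        for e in edges.iter().filter(|e| e.start == pos) {
            if h[e.end] != i64::MAX {
                h[pos] = h[pos].min(e.cost + h[e.end]);
            }
        }
    }
    let mut results = Vec::new();
    if h[0] == i64::MAX {
        return results; // no path covers the whole input
    }
    // A* pass: expand partial paths in order of g (cost so far) + h (remaining).
    let mut heap = BinaryHeap::new();
    heap.push(Reverse((h[0], 0i64, 0usize, Vec::<usize>::new())));
    while let Some(Reverse((_f, g, pos, path))) = heap.pop() {
        if pos == len {
            results.push((g, path)); // goals pop in increasing total cost
            if results.len() == n {
                break;
            }
            continue;
        }
        for (i, e) in edges.iter().enumerate().filter(|(_, e)| e.start == pos) {
            if h[e.end] == i64::MAX {
                continue;
            }
            let mut next = path.clone();
            next.push(i);
            heap.push(Reverse((g + e.cost + h[e.end], g + e.cost, e.end, next)));
        }
    }
    results
}

fn main() {
    let edges = vec![
        Edge { start: 0, end: 2, cost: 5 },
        Edge { start: 0, end: 1, cost: 2 },
        Edge { start: 1, end: 2, cost: 4 },
    ];
    // Two complete paths: [0] with cost 5, then [1, 2] with cost 6.
    for (cost, path) in nbest(&edges, 2, 10) {
        println!("cost={cost} edges={path:?}");
    }
}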
Parameters
The tokenize_nbest method accepts the following parameters:
| Parameter | Type | Description |
|---|---|---|
| text | &str | The text to tokenize. |
| n | usize | Number of N-best results to return. |
| unique | bool | When true, deduplicates results that produce the same word boundary positions. |
| cost_threshold | Option<i64> | When Some(threshold), only returns paths with cost within best_cost + threshold. |
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "すもももももももものうち";

    // Get top 3 tokenization results
    let results = tokenizer.tokenize_nbest(text, 3, false, None)?;

    for (rank, (tokens, cost)) in results.iter().enumerate() {
        println!("--- NBEST {} (cost={}) ---", rank + 1, cost);
        for token in tokens {
            let details = token.details().join(",");
            println!("{}\t{}", token.surface.as_ref(), details);
        }
    }

    Ok(())
}
Output:
--- NBEST 1 (cost=7546) ---
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
--- NBEST 2 (cost=7914) ---
...
CLI example
echo "すもももももももものうち" | lindera tokenize --dict embedded://ipadic -N 3
Lattice reuse
For repeated tokenization, you can reuse a Lattice to reduce memory allocations:
use lindera_dictionary::viterbi::Lattice;

let mut lattice = Lattice::default();
let results = tokenizer.tokenize_nbest_with_lattice(text, &mut lattice, 3, false, None)?;
User Dictionary
A user dictionary is a supplementary dictionary that allows you to register custom words alongside the system dictionary. This is useful for domain-specific terms, brand names, proper nouns, or any words that are not in the default system dictionary.
CSV format
The simplest user dictionary format is a CSV file with three columns:
<surface>,<part_of_speech>,<reading>
Example CSV content
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
Each dictionary type (IPADIC, UniDic, ko-dic, etc.) also supports a detailed CSV format with full control over context IDs, costs, and all feature fields. See the Dictionaries section for the detailed format of each dictionary type.
Rust API example
use std::fs::File;
use std::path::PathBuf;

use lindera::dictionary::{Metadata, load_dictionary, load_user_dictionary};
use lindera::error::LinderaErrorKind;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let user_dict_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("user_dict")
        .join("ipadic_simple_userdic.csv");

    let metadata_file = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../lindera-ipadic")
        .join("metadata.json");
    let metadata: Metadata = serde_json::from_reader(
        File::open(metadata_file)
            .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))?,
    )
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))?;

    let dictionary = load_dictionary("embedded://ipadic")?;
    let user_dictionary = load_user_dictionary(user_dict_path.to_str().unwrap(), &metadata)?;

    // Create a segmenter using the loaded user dictionary.
    let segmenter = Segmenter::new(Mode::Normal, dictionary, Some(user_dictionary));

    // Create a tokenizer.
    let tokenizer = Tokenizer::new(segmenter);

    // Tokenize a text.
    let text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です";
    let mut tokens = tokenizer.tokenize(text)?;

    // Print the text and tokens.
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Output:
text: 東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です
token: 東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
token: の 助詞,連体化,*,*,*,*,の,ノ,ノ
token: 最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
token: は 助詞,係助詞,*,*,*,*,は,ハ,ワ
token: とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
token: です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
Building a user dictionary with CLI
You can build a user dictionary from CSV to binary format using the CLI:
lindera build --src <source_dir> --dest <dest_dir> --metadata <metadata.json> --user
Binary vs CSV user dictionary
- CSV format: Loaded and parsed at runtime. Convenient for development and small dictionaries.
- Binary format: Pre-compiled for faster loading. Recommended for production use with large user dictionaries.
Both formats can be specified when creating a Segmenter. The binary format skips the CSV parsing step, resulting in faster startup times.
Character Filters
Character filters are pre-processing steps applied to the input text before tokenization. They normalize or transform characters to improve tokenization quality and consistency.
Available character filters
unicode_normalize
Applies Unicode normalization to the input text. This is useful for normalizing full-width characters to half-width, or for canonicalizing equivalent Unicode representations.
Supported normalization forms:
| Form | Description |
|---|---|
| NFKC | Compatibility decomposition followed by canonical composition. Converts full-width alphanumeric characters to half-width and normalizes Katakana variants. |
| NFC | Canonical decomposition followed by canonical composition. |
| NFD | Canonical decomposition. |
| NFKD | Compatibility decomposition. |
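For intuition, the same normalization is available from the unicode-normalization crate, used here purely for illustration (inside Lindera you would configure the unicode_normalize character filter instead):
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKC folds compatibility characters: full-width ASCII becomes half-width.
    let input = "Ｌｉｎｄｅｒａ１２３";
    let normalized: String = input.nfkc().collect();
    assert_eq!(normalized, "Lindera123");
}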
japanese_iteration_mark
Normalizes Japanese iteration marks into their expanded forms. Iteration marks are special characters that indicate the repetition of the preceding character.
| Mark | Name | Example |
|---|---|---|
| 々 | Kanji iteration mark | 人々 (hitobito) |
| ゝ / ゞ | Hiragana iteration marks | いすゞ (isuzu) |
| ヽ / ヾ | Katakana iteration marks | バナナヽ |
The filter accepts two boolean parameters: whether to normalize Hiragana iteration marks and whether to normalize Katakana iteration marks.
mapping
Performs character-level string replacement based on a user-defined mapping table. This can be used for custom normalization rules.
For example, mapping "リンデラ" to "Lindera".
YAML configuration example
When using Lindera with a YAML configuration file, character filters can be specified in the character_filters section:
segmenter:
mode: normal
dictionary: "embedded://ipadic"
character_filters:
- kind: unicode_normalize
args:
kind: nfkc
- kind: japanese_iteration_mark
args:
normalize_kanji: true
normalize_kana: true
- kind: mapping
args:
mapping:
リンデラ: Lindera
Rust API example
Character filters can be created and appended to a Tokenizer programmatically:
use lindera::character_filter::BoxCharacterFilter;
use lindera::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnicodeNormalizeKind,
};
use lindera::character_filter::japanese_iteration_mark::JapaneseIterationMarkCharacterFilter;
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // Create character filters.
    let unicode_normalize_char_filter =
        UnicodeNormalizeCharacterFilter::new(UnicodeNormalizeKind::NFKC);
    let japanese_iteration_mark_char_filter =
        JapaneseIterationMarkCharacterFilter::new(true, true);

    // Create a tokenizer and append character filters.
    let mut tokenizer = Tokenizer::new(segmenter);
    tokenizer
        .append_character_filter(BoxCharacterFilter::from(unicode_normalize_char_filter))
        .append_character_filter(BoxCharacterFilter::from(
            japanese_iteration_mark_char_filter,
        ));

    // Tokenize text -- the full-width "Ｌｉｎｄｅｒａ" will be normalized to "Lindera".
    let text = "Ｌｉｎｄｅｒａは形態素解析エンジンです。";
    let tokens = tokenizer.tokenize(text)?;

    for token in tokens {
        println!("token: {:?}, details: {:?}", token.surface, token.details);
    }

    Ok(())
}
Output (with NFKC normalization applied):
token: "Lindera", details: Some(["名詞", "固有名詞", "組織", "*", "*", "*", "*", "*", "*"])
token: "は", details: Some(["助詞", "係助詞", "*", "*", "*", "*", "は", "ハ", "ワ"])
token: "形態素", details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: "解析", details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: "エンジン", details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: "です", details: Some(["助動詞", "*", "*", "*", "特殊・デス", "基本形", "です", "デス", "デス"])
token: "。", details: Some(["記号", "句点", "*", "*", "*", "*", "。", "。", "。"])
Lindera CRF
Lindera CRF is a pure Rust implementation of Conditional Random Fields (CRFs), forked from rucrf. It provides a trainer and an estimator for CRFs with support for lattice structures.
Key Features
- Lattices with variable length edges
- L1, L2, and Elastic Net regularization
- Multi-threaded training
- Zero-copy deserialization with rkyv
- no_std support (without the train feature)
Contents
- Architecture -- Internal structure and key components
- API Reference -- API documentation
Changes from rucrf
- Serialization backend: Switched from bincode to rkyv for zero-copy deserialization
- Elastic Net regularization: Added Regularization::ElasticNet combining L1 and L2 penalties
- Rust 2024 edition: Updated to the Rust 2024 edition
- Dependency updates: Updated argmin, argmin-math, hashbrown, etc.
Architecture
Module Structure
lindera-crf/src/
├── lib.rs # Public API re-exports
├── feature.rs # FeatureSet, FeatureProvider
├── lattice.rs # Edge, Node, Lattice
├── model.rs # RawModel, MergedModel, Model trait
├── trainer.rs # Trainer, Regularization enum
├── errors.rs # Error types
├── forward_backward.rs # Forward-backward algorithm
├── math.rs # Mathematical utilities (logsumexp)
├── optimizers/
│ └── lbfgs.rs # L-BFGS optimization
└── utils.rs # Utility traits
Key Components
FeatureProvider / FeatureSet
Manage per-label feature sets. Each FeatureSet holds unigram features and left/right bigram features for a given label. FeatureProvider aggregates FeatureSet instances and maps feature IDs to weights.
Lattice / Edge / Node
Lattice structure with variable-length edges for sequence labeling. Edge represents a candidate span with a label, while Node aggregates edges at a given position. The Lattice is constructed from input data and used by the model to find the best path.
Trainer
Trains a CRF model using L-BFGS optimization with configurable regularization. The trainer accepts labeled lattice examples, computes gradients via the forward-backward algorithm, and iteratively updates model weights.
Regularization
Configurable regularization strategies:
- L1: Sparse models via L1 penalty
- L2: Smooth models via L2 penalty
- ElasticNet: Combines L1 and L2 with a configurable l1_ratio
Model (trait)
Interface for searching the best path through a lattice. Two implementations are provided:
- RawModel: Stores weights in a flat vector indexed by feature ID
- MergedModel: Optimized for inference; merges feature weights into a compact representation serializable with rkyv
Forward-backward Algorithm
Computes alpha (forward) and beta (backward) values over the lattice. Used during training to calculate expected feature counts and gradients.
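Log-space accumulation makes a numerically stable log-sum-exp the workhorse of this pass. A minimal sketch of the standard max-shift formulation (illustrative; not necessarily the exact code in math.rs):
// Stable log(sum(exp(x_i))): shift by the maximum before exponentiating,
// so no intermediate exp() can overflow.
fn logsumexp(xs: &[f64]) -> f64 {
    let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if !m.is_finite() {
        return m; // empty input or all -inf
    }
    m + xs.iter().map(|x| (x - m).exp()).sum::<f64>().ln()
}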
Feature Flags
| Feature | Description | Default |
|---|---|---|
| alloc | Alloc support for no_std | No |
| std | Standard library support (implies alloc) | No |
| train | Training functionality (L-BFGS, multi-threading, logging) | Yes |
API Reference
The API reference is available. Please see the following URL:
Lindera Dictionary
Lindera Dictionary is the base library for morphological analysis dictionaries. It provides dictionary loading, building, Viterbi-based segmentation, and CRF-based training functionality.
Key Features
- Dictionary loading from filesystem or embedded data
- Dictionary building from MeCab-format CSV source files
- Viterbi algorithm for optimal segmentation
- N-best path generation (Forward-DP Backward-A*)
- Memory-mapped file support
- CRF-based dictionary training (with the train feature)
Contents
- Architecture -- Internal structure and key components
- API Reference -- API documentation
Architecture
Module Structure
lindera-dictionary/src/
├── lib.rs # Public API
├── dictionary.rs # Dictionary, UserDictionary
├── builder.rs # DictionaryBuilder
├── loader.rs # DictionaryLoader trait, FSDictionaryLoader
├── viterbi.rs # Lattice, Edge, Viterbi segmentation
├── nbest.rs # NBestGenerator (Forward-DP Backward-A*)
├── mode.rs # Mode (Normal/Decompose), Penalty
├── error.rs # LinderaError, LinderaErrorKind
├── assets.rs # Download and file management
├── dictionary/
│ ├── character_definition.rs # Character type definitions
│ ├── connection_cost_matrix.rs # Connection cost matrix
│ ├── prefix_dictionary.rs # Double-array trie dictionary
│ ├── unknown_dictionary.rs # Unknown word handling
│ ├── metadata.rs # Dictionary metadata
│ └── schema.rs # Schema definitions
└── trainer/ # (train feature)
├── config.rs # TrainerConfig
├── corpus.rs # Corpus, Example, Word
├── feature_extractor.rs # Feature template parsing
├── feature_rewriter.rs # MeCab-compatible rewrite rules
└── model.rs # Trained model, tocost()
Key Components
Dictionary / UserDictionary
Main data structures holding the compiled dictionary data. A Dictionary contains the character definitions, connection cost matrix, prefix dictionary (double-array trie), and unknown word dictionary. UserDictionary allows users to add custom vocabulary on top of the system dictionary.
DictionaryBuilder
Fluent API for building dictionaries from source CSV files. It compiles MeCab-format dictionary sources into the binary format used at runtime.
DictionaryLoader / FSDictionaryLoader
DictionaryLoader is a trait for loading compiled dictionaries. FSDictionaryLoader is the filesystem-based implementation that reads dictionary files from a directory, with optional memory-mapped file support.
Viterbi (Lattice, Edge)
Builds a lattice of candidate tokens from the input text and finds the optimal segmentation path using the Viterbi algorithm. Each Edge in the lattice represents a candidate token with associated costs (word cost + connection cost).
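A stripped-down sketch of the forward dynamic program behind this search (illustrative only; the real viterbi.rs tracks back-pointers, context-ID-dependent connection costs, and unknown words):
#[derive(Clone)]
struct Edge {
    start: usize,   // start position of the candidate token
    end: usize,     // end position
    word_cost: i64, // cost from the dictionary entry
}

/// Minimum total cost to segment `len` positions, using a flat
/// connection cost between adjacent tokens for simplicity.
fn viterbi_min_cost(edges: &[Edge], len: usize, connection_cost: i64) -> Option<i64> {
    let mut best = vec![i64::MAX; len + 1]; // best[i] = cheapest way to reach position i
    best[0] = 0;
    for pos in 0..len {
        if best[pos] == i64::MAX {
            continue; // position unreachable
        }
        for e in edges.iter().filter(|e| e.start == pos) {
            let cost = best[pos] + e.word_cost + connection_cost;
            if cost < best[e.end] {
                best[e.end] = cost;
            }
        }
    }
    (best[len] != i64::MAX).then_some(best[len])
}

fn main() {
    // Candidates over three positions: one long token, or a short + medium pair.
    let edges = vec![
        Edge { start: 0, end: 3, word_cost: 10 },
        Edge { start: 0, end: 1, word_cost: 3 },
        Edge { start: 1, end: 3, word_cost: 4 },
    ];
    // 0 -> 1 -> 3 wins: (3 + 1) + (4 + 1) = 9 versus 10 + 1 = 11.
    assert_eq!(viterbi_min_cost(&edges, 3, 1), Some(9));
}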
NBestGenerator
Generates N-best segmentation paths using the Forward-DP Backward-A* algorithm. This enables applications to consider alternative segmentations beyond the single best path.
Mode
Controls tokenization behavior:
- Normal: Standard tokenization using the optimal Viterbi path
- Decompose: Further splits compound nouns based on configurable Penalty thresholds
Trainer (train feature)
CRF-based dictionary training pipeline using lindera-crf. The training workflow includes:
- TrainerConfig: Parses the seed dictionary, char.def, feature.def, and rewrite.def
- Corpus: Manages training data as labeled examples
- FeatureExtractor: Parses feature templates and assigns feature IDs
- DictionaryRewriter: Applies MeCab-compatible 3-section rewrite rules
- Model: Holds training results and exports dictionary files with cost conversion via tocost(weight, cost_factor)
Feature Flags
| Feature | Description | Default |
|---|---|---|
| mmap | Memory-mapped file support | Yes |
| build_rs | HTTP download for dictionary sources | No |
| train | CRF-based training (depends on lindera-crf) | No |
API Reference
The API reference is available. Please see the following URL:
Lindera Library
The lindera crate is the core morphological analysis library. This section covers configuration, segmentation, token filters, error handling, and API reference.
- Configuration - YAML-based tokenizer configuration
- Segmenter - Core segmentation component using the Viterbi algorithm
- Token Filters - Post-processing filters for tokens
- Error Handling - Error types and handling patterns
- API Reference - Links to generated API documentation
Configuration
Lindera can read YAML-format configuration files.
Set the LINDERA_CONFIG_PATH environment variable to the path of a file like the one below, and the tokenizer can be configured without writing any Rust code.
segmenter:
mode: "normal"
dictionary: "embedded://ipadic"
# user_dictionary: "./resources/user_dict/ipadic_simple_userdic.csv"
character_filters:
- kind: "unicode_normalize"
args:
kind: "nfkc"
- kind: "japanese_iteration_mark"
args:
normalize_kanji: true
normalize_kana: true
- kind: mapping
args:
mapping:
リンデラ: Lindera
token_filters:
- kind: "japanese_compound_word"
args:
tags:
- "名詞,数"
- "名詞,接尾,助数詞"
new_tag: "名詞,数"
- kind: "japanese_number"
args:
tags:
- "名詞,数"
- kind: "japanese_stop_tags"
args:
tags:
- "接続詞"
- "助詞"
- "助詞,格助詞"
- "助詞,格助詞,一般"
- "助詞,格助詞,引用"
- "助詞,格助詞,連語"
- "助詞,係助詞"
- "助詞,副助詞"
- "助詞,間投助詞"
- "助詞,並立助詞"
- "助詞,終助詞"
- "助詞,副助詞/並立助詞/終助詞"
- "助詞,連体化"
- "助詞,副詞化"
- "助詞,特殊"
- "助動詞"
- "記号"
- "記号,一般"
- "記号,読点"
- "記号,句点"
- "記号,空白"
- "記号,括弧閉"
- "その他,間投"
- "フィラー"
- "非言語音"
- kind: "japanese_katakana_stem"
args:
min: 3
- kind: "remove_diacritical_mark"
args:
japanese: false
% export LINDERA_CONFIG_PATH=./resources/config/lindera.yml
use std::path::PathBuf;

use lindera::tokenizer::TokenizerBuilder;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Load tokenizer configuration from file
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("config")
        .join("lindera.yml");
    let builder = TokenizerBuilder::from_file(&path)?;
    let tokenizer = builder.build()?;

    let text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {text}");

    let tokens = tokenizer.tokenize(&text)?;
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.surface, token.byte_start, token.byte_end, token.details
        );
    }

    Ok(())
}
Segmenter
The Segmenter is the core component that performs morphological analysis. It uses the Viterbi algorithm to find the optimal segmentation of input text based on a dictionary and cost model.
Creating a Segmenter
A Segmenter requires three components:
- Mode - the tokenization strategy (Normal or Decompose)
- Dictionary - a system dictionary for morphological analysis
- UserDictionary (optional) - a supplementary dictionary for custom words
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;

let dictionary = load_dictionary("embedded://ipadic")?;
let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
Tokenization Modes
Mode::Normal
Standard tokenization based on the dictionary entries. Words are segmented faithfully according to what is registered in the dictionary.
use lindera::mode::Mode;

let mode = Mode::Normal;
Mode::Decompose
Decomposes compound nouns into their constituent parts. This mode applies a configurable penalty to long compound words, encouraging the segmenter to split them into shorter components.
For example, with Mode::Normal, the compound word "関西国際空港" remains as a single token, while with Mode::Decompose, it is split into "関西", "国際", and "空港".
use lindera::mode::Mode;

let mode = Mode::Decompose(Default::default());
Dictionary Loading
Lindera provides the load_dictionary function to load dictionaries from various sources.
Embedded Dictionaries
When built with the appropriate feature flag (e.g., embed-ipadic), dictionaries can be loaded directly from the binary:
#![allow(unused)] fn main() { use lindera::dictionary::load_dictionary; let dictionary = load_dictionary("embedded://ipadic")?; }
Available embedded dictionary URIs:
- embedded://ipadic - IPADIC (Japanese)
- embedded://ipadic-neologd - IPADIC NEologd (Japanese)
- embedded://unidic - UniDic (Japanese)
- embedded://ko-dic - ko-dic (Korean)
- embedded://cc-cedict - CC-CEDICT (Chinese)
- embedded://jieba - Jieba (Chinese)
External Dictionaries
Pre-built dictionary directories can be loaded from the filesystem:
#![allow(unused)] fn main() { use lindera::dictionary::load_dictionary; let dictionary = load_dictionary("/path/to/dictionary")?; }
Using with Tokenizer
The Segmenter is typically used through the Tokenizer, which adds support for character filters and token filters:
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let tokens = tokenizer.tokenize(text)?;

    for token in tokens {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Token Filters
Token filters are post-processing components applied to tokens after segmentation. They can modify, remove, or transform tokens to suit specific use cases such as search indexing, text normalization, or linguistic analysis.
Available Token Filters
Japanese
| Filter | Description |
|---|---|
| japanese_compound_word | Combines consecutive tokens matching specified part-of-speech tags into compound words |
| japanese_number | Normalizes Japanese number representations (e.g., converts Kanji numerals) |
| japanese_stop_tags | Removes tokens with specified part-of-speech tags |
| japanese_katakana_stem | Stems Katakana words by removing trailing prolonged sound marks |
| japanese_base_form | Normalizes tokens to their base (dictionary) form |
| japanese_keep_tags | Keeps only tokens matching specified part-of-speech tags, removing all others |
| japanese_reading_form | Converts token text to its reading form (Katakana) |
| japanese_kana | Converts between Hiragana and Katakana |
Korean
| Filter | Description |
|---|---|
| korean_stop_tags | Removes Korean tokens with specified part-of-speech tags |
| korean_keep_tags | Keeps only Korean tokens matching specified part-of-speech tags |
| korean_reading_form | Converts Korean tokens to their reading form |
General
| Filter | Description |
|---|---|
| lowercase | Converts token text to lowercase |
| uppercase | Converts token text to uppercase |
| mapping | Maps token text according to a user-defined mapping table |
| length | Filters tokens by text length (minimum and/or maximum) |
| stop_words | Removes tokens matching a list of stop words |
| keep_words | Keeps only tokens matching a list of specified words |
| remove_diacritical_mark | Removes diacritical marks (accent marks) from token text |
YAML Configuration
Token filters can be configured in the YAML configuration file under the token_filters key:
token_filters:
- kind: "japanese_stop_tags"
args:
tags:
- "助詞"
- "助動詞"
- "記号"
- kind: "japanese_katakana_stem"
args:
min: 3
- kind: "lowercase"
- kind: "length"
args:
min: 2
Rust API
Token filters can also be created and applied programmatically:
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::token_filter::BoxTokenFilter;
use lindera::token_filter::japanese_stop_tags::JapaneseStopTagsTokenFilter;
use lindera::token_filter::japanese_katakana_stem::JapaneseKatakanaStemTokenFilter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let mut tokenizer = Tokenizer::new(segmenter);

    // Add token filters
    let stop_tags_filter = JapaneseStopTagsTokenFilter::new(
        vec![
            "助詞".to_string(),
            "助動詞".to_string(),
            "記号".to_string(),
        ]
        .into_iter()
        .collect(),
    );
    tokenizer.append_token_filter(BoxTokenFilter::from(stop_tags_filter));

    let katakana_stem_filter = JapaneseKatakanaStemTokenFilter::new(3);
    tokenizer.append_token_filter(BoxTokenFilter::from(katakana_stem_filter));

    // Tokenize with filters applied
    let tokens = tokenizer.tokenize("Linderaは形態素解析エンジンです。")?;

    for token in tokens {
        println!("token: {:?}, details: {:?}", token.surface, token.details);
    }

    Ok(())
}
The append_token_filter method adds filters in order. Filters are applied sequentially to the token list after segmentation.
Error Handling
Lindera uses a structured error system based on anyhow and thiserror for ergonomic error handling throughout the library.
LinderaResult
The LinderaResult<T> type alias is the standard return type for fallible operations in Lindera:
pub type LinderaResult<T> = Result<T, LinderaError>;
LinderaError
LinderaError is the main error type, containing an error kind and a source error with full context:
pub struct LinderaError {
    pub kind: LinderaErrorKind,
    source: anyhow::Error,
}
The add_context method allows attaching additional context to an error:
let error = error.add_context("failed to load dictionary from /path/to/dict");
LinderaErrorKind
LinderaErrorKind is an enum that categorizes errors:
| Kind | Description |
|---|---|
| Io | I/O errors (file read/write, network) |
| Parse | Parsing errors (invalid input format) |
| Serialize | Serialization errors |
| Deserialize | Deserialization errors |
| Content | Invalid content or data errors |
| Args | Invalid argument errors |
| Decode | Decoding errors |
| NotFound | Resource not found (e.g., dictionary file missing) |
| Build | Dictionary build errors |
| Dictionary | Dictionary-related errors |
| Mode | Invalid tokenization mode errors |
| Algorithm | Algorithm errors (e.g., Viterbi failure) |
| FeatureDisabled | Attempted to use a feature that is not enabled |
Creating Errors
Use LinderaErrorKind::with_error to create an error from a kind and a source:
use lindera::error::LinderaErrorKind;

let error = LinderaErrorKind::Io.with_error(anyhow::anyhow!("file not found: config.yml"));
Using the ? Operator
Since Lindera functions return LinderaResult, the ? operator can propagate errors naturally:
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn analyze(text: &str) -> LinderaResult<Vec<String>> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let tokens = tokenizer.tokenize(text)?;
    Ok(tokens.iter().map(|t| t.surface.as_ref().to_string()).collect())
}
Error Handling Patterns
Matching on Error Kind
use lindera::dictionary::load_dictionary;
use lindera::error::LinderaErrorKind;

match load_dictionary("/path/to/dictionary") {
    Ok(dict) => { /* use dictionary */ }
    Err(e) if e.kind() == LinderaErrorKind::NotFound => {
        eprintln!("Dictionary not found: {}", e);
    }
    Err(e) if e.kind() == LinderaErrorKind::Io => {
        eprintln!("I/O error loading dictionary: {}", e);
    }
    Err(e) => {
        eprintln!("Unexpected error: {}", e);
    }
}
Converting from External Errors
use lindera::error::LinderaErrorKind;

let content = std::fs::read_to_string("config.yml")
    .map_err(|err| LinderaErrorKind::Io.with_error(anyhow::anyhow!(err)))?;
API Reference
The API reference is available. Please see the following URL:
Lindera CLI
A morphological analysis command-line interface for Lindera.
- Installation - Install or build the CLI
- Commands - Command reference for tokenize, build, train, and export
- Tutorial - Step-by-step guide to get started
Installation
Install via Cargo
You can install the binary via cargo:
% cargo install lindera-cli
Download from GitHub Releases
Alternatively, you can download a pre-built binary from the release page.
Obtaining Dictionaries
Lindera does not bundle dictionaries with the binary. You need to download a pre-built dictionary separately from the GitHub Releases page:
# Example: download and extract the IPADIC dictionary
% curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
% unzip lindera-ipadic-<version>.zip -d /path/to/ipadic
Then specify the dictionary path when using the CLI:
% echo "関西国際空港限定トートバッグ" | lindera tokenize --dict /path/to/ipadic
Build from Source
Build without dictionaries (default)
Build a binary containing only the tokenizer and trainer without embedded dictionaries:
% cargo build --release
Build with all features
% cargo build --release --all-features
Build with Embedded Dictionaries (Advanced)
For advanced users who want to embed dictionaries directly into the binary, use the embed-* feature flags. This eliminates the need for external dictionary files at runtime but increases the binary size.
IPADIC (Japanese dictionary)
% cargo build --release --features=embed-ipadic
IPADIC NEologd (Japanese dictionary)
% cargo build --release --features=embed-ipadic-neologd
UniDic (Japanese dictionary)
% cargo build --release --features=embed-unidic
ko-dic (Korean dictionary)
% cargo build --release --features=embed-ko-dic
CC-CEDICT (Chinese dictionary)
% cargo build --release --features=embed-cc-cedict
Jieba (Chinese dictionary)
% cargo build --release --features=embed-jieba
[!TIP] After building with an embed-* feature flag, use the embedded:// scheme to load the embedded dictionary:

% echo "関西国際空港限定トートバッグ" | lindera tokenize --dict embedded://ipadic

See Feature Flags for details.
Commands
The Lindera CLI provides four main commands:
- tokenize - Perform morphological analysis on text
- build - Build a dictionary from source CSV files
- train - Train a CRF model from annotated corpus data
- export - Export a trained model to dictionary format
tokenize
Perform morphological analysis (tokenization) on Japanese, Chinese, or Korean text using various dictionaries.
Parameters
- --dict/-d: Dictionary path or URI (required)
  - File path: /path/to/dictionary
  - Embedded: embedded://ipadic, embedded://unidic, etc.
- --output/-o: Output format (default: mecab)
  - mecab: MeCab-compatible format with part-of-speech info
  - wakati: Space-separated tokens only
  - json: Detailed JSON format with all token information
- --user-dict/-u: User dictionary path (optional)
- --mode/-m: Tokenization mode (default: normal)
  - normal: Standard tokenization
  - decompose: Decompose compound words
- --char-filter/-c: Character filter configuration (JSON)
- --token-filter/-t: Token filter configuration (JSON)
- --nbest/-N: Number of N-best results to return (default: 1). When set to 2 or more, N-best output is enabled.
- --nbest-unique: Deduplicate N-best results by removing paths that produce the same segmentation.
- --nbest-cost-threshold: Maximum cost difference from the best path. Only paths with cost within best_cost + threshold are returned.
- Input file: Optional file path (default: stdin)
Basic usage
# Tokenize text using a dictionary directory
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /path/to/dictionary
# Tokenize text using embedded dictionary
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic
# Tokenize with different output format
echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic \
--output json
# Tokenize text from file
lindera tokenize \
--dict /path/to/dictionary \
--output wakati \
input.txt
Examples with external dictionaries
Tokenize with external IPADIC (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-ipadic-2.7.0-20250920
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Tokenize with external IPADIC Neologd (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析 名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Tokenize with external UniDic (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-unidic-2.1.2
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語 名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態 名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素 接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析 名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う 動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと 名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき 動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
Tokenize with external ko-dic (Korean dictionary)
% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
--dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어 NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의 JKG,*,F,의,*,*,*,*
형태 NNG,*,F,형태,*,*,*,*
해석 NNG,행위,T,해석,*,*,*,*
을 JKO,*,T,을,*,*,*,*
실시 NNG,행위,F,실시,*,*,*,*
할 XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수 NNB,*,F,수,*,*,*,*
있 VV,*,T,있,*,*,*,*
습니다 EF,*,F,습니다,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
Tokenize with external CC-CEDICT (Chinese dictionary)
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以 *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行 *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文 *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学 *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析 *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。 *,*,*,*,*,*,*,*
EOS
Tokenize with external Jieba (Chinese dictionary)
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict /tmp/lindera-jieba-0.1.1
Examples with embedded dictionaries
Lindera can include dictionaries directly in the binary when built with specific feature flags. This allows tokenization without external dictionary files.
Tokenize with embedded IPADIC (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
NOTE: To include the IPADIC dictionary in the binary, you must build with the --features=embed-ipadic option.
Tokenize with embedded UniDic (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://unidic
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語 名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態 名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素 接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析 名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う 動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと 名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき 動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
NOTE: To include the UniDic dictionary in the binary, you must build with the --features=embed-unidic option.
Tokenize with embedded IPADIC NEologd (Japanese dictionary)
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic-neologd
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析 名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
NOTE: To include the IPADIC NEologd dictionary in the binary, you must build with the --features=embed-ipadic-neologd option.
Tokenize with embedded ko-dic (Korean dictionary)
% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
--dict embedded://ko-dic
한국어 NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의 JKG,*,F,의,*,*,*,*
형태 NNG,*,F,형태,*,*,*,*
해석 NNG,행위,T,해석,*,*,*,*
을 JKO,*,T,을,*,*,*,*
실시 NNG,행위,F,실시,*,*,*,*
할 XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수 NNB,*,F,수,*,*,*,*
있 VV,*,T,있,*,*,*,*
습니다 EF,*,F,습니다,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
NOTE: To include the ko-dic dictionary in the binary, you must build with the --features=embed-ko-dic option.
Tokenize with embedded CC-CEDICT (Chinese dictionary)
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict embedded://cc-cedict
可以 *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行 *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文 *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学 *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析 *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。 *,*,*,*,*,*,*,*
EOS
NOTE: To include the CC-CEDICT dictionary in the binary, you must build with the --features=embed-cc-cedict option.
Tokenize with embedded Jieba (Chinese dictionary)
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict embedded://jieba
NOTE: To include the Jieba dictionary in the binary, you must build with the --features=embed-jieba option.
User dictionary examples
Lindera supports user dictionaries to add custom words alongside system dictionaries. User dictionaries can be in CSV or binary format.
Use user dictionary (CSV format)
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict embedded://ipadic \
--user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
Use user dictionary (Binary format)
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict /tmp/lindera-ipadic-2.7.0-20250920 \
--user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
Tokenization modes
Lindera provides two tokenization modes: normal and decompose.
Normal mode (default)
Tokenizes faithfully based on words registered in the dictionary:
% echo "関西国際空港限定トートバッグ" | lindera tokenize \
--dict embedded://ipadic \
--mode normal
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ 名詞,一般,*,*,*,*,*,*,*
EOS
Decompose mode
Additionally splits compound nouns registered in the dictionary into their constituent words:
% echo "関西国際空港限定トートバッグ" | lindera tokenize \
--dict embedded://ipadic \
--mode decompose
関西 名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際 名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港 名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ 名詞,一般,*,*,*,*,*,*,*
EOS
Output formats
Lindera provides three output formats: mecab, wakati and json.
MeCab format (default)
Outputs results in MeCab-compatible format with part-of-speech information:
% echo "お待ちしております。" | lindera tokenize \
--dict embedded://ipadic \
--output mecab
お待ち 名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
おり 動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Wakati format
Outputs only the token text separated by spaces:
% echo "お待ちしております。" | lindera tokenize \
--dict embedded://ipadic \
--output wakati
お待ち し て おり ます 。
JSON format
Outputs detailed token information in JSON format:
% echo "お待ちしております。" | lindera tokenize \
--dict embedded://ipadic \
--output json
[
{
"base_form": "お待ち",
"byte_end": 9,
"byte_start": 0,
"conjugation_form": "*",
"conjugation_type": "*",
"part_of_speech": "名詞",
"part_of_speech_subcategory_1": "サ変接続",
"part_of_speech_subcategory_2": "*",
"part_of_speech_subcategory_3": "*",
"pronunciation": "オマチ",
"reading": "オマチ",
"surface": "お待ち",
"word_id": 14698
},
...
]
N-Best tokenization
Lindera supports N-Best tokenization, which returns the top N tokenization candidates ordered by cost (lower cost = better). This is based on the Forward-DP Backward-A* algorithm, compatible with MeCab's N-Best implementation.
Basic N-Best example
% echo "すもももももももものうち" | lindera tokenize \
--dict embedded://ipadic \
-N 3
NBEST 1 (cost=7546)
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
NBEST 2 (cost=7914)
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
NBEST 3 (cost=10060)
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
も 助詞,係助詞,*,*,*,*,も,モ,モ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
N-Best with unique results
When the same segmentation appears in multiple paths (differing only in internal Viterbi states), use --nbest-unique to deduplicate:
% echo "営業部長谷川です" | lindera tokenize \
--dict embedded://ipadic \
-N 5 --nbest-unique -o wakati
NBEST 1 (cost=15760)
営業 部長 谷川 です
NBEST 2 (cost=17758)
営業 部長 谷 川 です
NBEST 3 (cost=18816)
営業 部 長谷川 です
NBEST 4 (cost=19320)
営業 部長 谷川 で す
NBEST 5 (cost=20814)
営業 部 長谷 川 です
N-Best with cost threshold
Use --nbest-cost-threshold to limit results to paths within a certain cost range of the best path:
% echo "営業部長谷川です" | lindera tokenize \
--dict embedded://ipadic \
-N 10 --nbest-unique --nbest-cost-threshold 5000 -o wakati
NBEST 1 (cost=15760)
営業 部長 谷川 です
NBEST 2 (cost=17758)
営業 部長 谷 川 です
NBEST 3 (cost=18816)
営業 部 長谷川 です
Only 3 results are returned because the remaining candidates exceed 15760 + 5000 = 20760.
Advanced tokenization with filters
Lindera provides an analytical framework that combines character filters, tokenizers, and token filters for advanced text processing. Filters are configured using JSON.
% echo "すもももももももものうち" | lindera tokenize \
--dict embedded://ipadic \
--char-filter 'unicode_normalize:{"kind":"nfkc"}' \
--token-filter 'japanese_keep_tags:{"tags":["名詞,一般"]}'
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
EOS
build
Build (compile) a morphological analysis dictionary from source CSV files for use with Lindera.
Build parameters
- `--src/-s`: Source directory containing dictionary CSV files (or a single CSV file for a user dictionary)
- `--dest/-d`: Destination directory for the compiled dictionary output
- `--metadata/-m`: Metadata configuration file (metadata.json) that defines the dictionary structure
- `--user/-u`: Build a user dictionary instead of a system dictionary (optional flag)
Dictionary types
System dictionary
A full morphological analysis dictionary containing:
- Lexicon entries (word definitions)
- Connection cost matrix
- Unknown word handling rules
- Character type definitions
User dictionary
A supplementary dictionary for custom words that works alongside a system dictionary.
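For illustration, the simple CSV form used by the user dictionary examples in this document has one custom word per line with three fields (surface, part-of-speech, reading):
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ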
Examples
Build IPADIC (Japanese dictionary)
# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/mecab-ipadic-2.7.0-20250920 \
--dest /tmp/lindera-ipadic-2.7.0-20250920 \
--metadata ./lindera-ipadic/metadata.json
Build IPADIC NEologd (Japanese dictionary)
% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp
% lindera build \
--src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
--dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
--metadata ./lindera-ipadic-neologd/metadata.json
Build UniDic (Japanese dictionary)
% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp
% lindera build \
--src /tmp/unidic-mecab-2.1.2 \
--dest /tmp/lindera-unidic-2.1.2 \
--metadata ./lindera-unidic/metadata.json
Build CC-CEDICT (Chinese dictionary)
% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp
% lindera build \
--src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
--dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
--metadata ./lindera-cc-cedict/metadata.json
Build Jieba (Chinese dictionary)
% curl -L -o /tmp/mecab-jieba-0.1.1.tar.gz "https://lindera.dev/mecab-jieba-0.1.1.tar.gz"
% tar zxvf /tmp/mecab-jieba-0.1.1.tar.gz -C /tmp
% lindera build \
--src /tmp/mecab-jieba-0.1.1/dict-src \
--dest /tmp/lindera-jieba-0.1.1 \
--metadata ./lindera-jieba/metadata.json
Build ko-dic (Korean dictionary)
% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp
% lindera build \
--src /tmp/mecab-ko-dic-2.1.1-20180720 \
--dest /tmp/lindera-ko-dic-2.1.1-20180720 \
--metadata ./lindera-ko-dic/metadata.json
Build user dictionaries
Build IPADIC user dictionary (Japanese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/ipadic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ipadic/metadata.json \
--user
Build UniDic user dictionary (Japanese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/unidic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-unidic/metadata.json \
--user
Build CC-CEDICT user dictionary (Chinese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/cc-cedict_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-cc-cedict/metadata.json \
--user
Build Jieba user dictionary (Chinese)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/jieba_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-jieba/metadata.json \
--user
Build ko-dic user dictionary (Korean)
For more details about the user dictionary format, please refer to the following URL:
% lindera build \
--src ./resources/user_dict/ko-dic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ko-dic/metadata.json \
--user
train
Train a new morphological analysis model from annotated corpus data. This command requires building with the train feature flag, which is enabled by default.
Train parameters
- `--seed/-s`: Seed lexicon file (CSV format) to be weighted
- `--corpus/-c`: Training corpus (annotated text)
- `--char-def/-C`: Character definition file (char.def)
- `--unk-def/-u`: Unknown word definition file (unk.def) to be weighted
- `--feature-def/-f`: Feature definition file (feature.def)
- `--rewrite-def/-r`: Rewrite rule definition file (rewrite.def)
- `--output/-o`: Output model file
- `--lambda/-l`: L1 regularization (0.0-1.0) (default: 0.01)
- `--max-iterations/-i`: Maximum number of training iterations (default: 100)
- `--max-threads/-t`: Maximum number of threads (defaults to CPU core count, auto-adjusted based on dataset size)
Basic workflow
1. Prepare training files
Seed lexicon file (seed.csv):
The seed lexicon file contains initial dictionary entries used for training the CRF model. Each line represents a word entry with comma-separated fields:
- Surface
- Left context ID
- Right context ID
- Word cost
- Part-of-speech tags (multiple fields)
- Base form
- Reading (katakana)
- Pronunciation
Note: The exact field definitions differ between dictionary formats (IPADIC, UniDic, ko-dic, CC-CEDICT). Please refer to each dictionary's format specification for details.
外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
Training corpus (corpus.txt):
The training corpus file contains annotated text data used to train the CRF model. Each line consists of:
- A surface form (word) followed by a tab character
- Comma-separated morphological features (part-of-speech tags, base form, reading, pronunciation)
- Sentences are separated by "EOS" (End Of Sentence) markers
外国 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人 名詞,接尾,一般,*,*,*,人,ジン,ジン
参政 名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権 名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS
For detailed information about file formats and advanced features, see TRAINER_README.md.
2. Train model
lindera train \
--seed ./resources/training/seed.csv \
--corpus ./resources/training/corpus.txt \
--unk-def ./resources/training/unk.def \
--char-def ./resources/training/char.def \
--feature-def ./resources/training/feature.def \
--rewrite-def ./resources/training/rewrite.def \
--output /tmp/lindera/training/model.dat \
--lambda 0.01 \
--max-iterations 100
3. Training results
The trained model will contain:
- Existing words: All seed dictionary records with newly learned weights
- New words: Words from the corpus not in the seed dictionary, added with appropriate weights
export
Export a trained model file to Lindera dictionary format files. This feature requires building with the train feature flag enabled.
Export parameters
- `--model/-m`: Path to the trained model file (.dat format)
- `--output/-o`: Directory to output the dictionary files
- `--metadata`: Optional metadata.json file to update with trained model information
- `--cost-factor`: Override the cost factor for weight-to-cost conversion (default: value from the trained model, typically 700)
Output files
The export command creates the following dictionary files in the output directory:
- `lex.csv`: Lexicon file with learned weights (MeCab-compatible costs via `to_cost()`)
- `matrix.def`: Dense connection cost matrix covering all (right_id, left_id) pairs
- `unk.def`: Unknown word definitions
- `char.def`: Character type definitions
- `feature.def`: Feature template definitions (copied from the trained model)
- `rewrite.def`: Feature rewrite rules (copied from the trained model)
- `left-id.def`: Left context ID to feature string mapping
- `right-id.def`: Right context ID to feature string mapping
- `metadata.json`: Updated metadata file (if the `--metadata` option is provided)
Complete workflow example
1. Train model
lindera train \
--seed ./resources/training/seed.csv \
--corpus ./resources/training/corpus.txt \
--unk-def ./resources/training/unk.def \
--char-def ./resources/training/char.def \
--feature-def ./resources/training/feature.def \
--rewrite-def ./resources/training/rewrite.def \
--output /tmp/lindera/training/model.dat \
--lambda 0.01 \
--max-iterations 100
2. Export to dictionary format
lindera export \
--model /tmp/lindera/training/model.dat \
--metadata ./resources/training/metadata.json \
--output /tmp/lindera/training/dictionary
3. Build dictionary
lindera build \
--src /tmp/lindera/training/dictionary \
--dest /tmp/lindera/training/compiled_dictionary \
--metadata /tmp/lindera/training/dictionary/metadata.json
4. Use trained dictionary
echo "これは外国人参政権です。" | lindera tokenize \
-d /tmp/lindera/training/compiled_dictionary
Metadata update feature
When the --metadata option is provided, the export command will:
- Read the base metadata.json file to preserve existing configuration
- Update specific fields with values from the trained model:
  - `default_left_context_id`: Maximum left context ID from the trained model
  - `default_right_context_id`: Maximum right context ID from the trained model
  - `default_word_cost`: Calculated from the median feature weight
  - `model_info`: Training statistics including feature count, label count, matrix size, iterations, regularization, version, and timestamp
- Preserve existing settings such as dictionary name, character encoding, schema definitions, and other user-defined configuration
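As a quick check of the result, the updated metadata.json can be inspected with the lindera-python bindings described later in this document (a sketch; it assumes the keys returned by to_dict() mirror the Metadata property names):
from lindera import Metadata
metadata = Metadata.from_json_file("/tmp/lindera/training/dictionary/metadata.json")
fields = metadata.to_dict()
# Context IDs and default word cost taken over from the trained model
print(fields["default_left_context_id"], fields["default_right_context_id"])
print(fields["default_word_cost"])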
Tutorial
This tutorial walks you through the basic usage of the Lindera CLI, from installation to advanced text processing.
1. Install the CLI
Install Lindera CLI with the embedded IPADIC dictionary:
% cargo install lindera-cli --features=embed-ipadic
Verify the installation:
% lindera --help
2. Basic tokenization with embedded dictionary
Tokenize Japanese text using the embedded IPADIC dictionary:
% echo "東京は日本の首都です。" | lindera tokenize \
--dict embedded://ipadic
Expected output:
東京 名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
日本 名詞,固有名詞,地域,国,*,*,日本,ニホン,ニホン
の 助詞,連体化,*,*,*,*,の,ノ,ノ
首都 名詞,一般,*,*,*,*,首都,シュト,シュト
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。 記号,句点,*,*,*,*,。,。,。
EOS
3. Try different output formats
Wakati format (word segmentation only)
% echo "東京は日本の首都です。" | lindera tokenize \
--dict embedded://ipadic \
--output wakati
Expected output:
東京 は 日本 の 首都 です 。
JSON format (detailed information)
% echo "東京は日本の首都です。" | lindera tokenize \
--dict embedded://ipadic \
--output json
This produces a JSON array with detailed token information including byte offsets, part-of-speech tags, readings, and more.
4. Use decompose mode
Decompose mode splits compound nouns into their constituent parts:
% echo "関西国際空港限定トートバッグ" | lindera tokenize \
--dict embedded://ipadic \
--mode decompose
Expected output:
関西 名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際 名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港 名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ 名詞,一般,*,*,*,*,*,*,*
EOS
Compare with normal mode, where "関西国際空港" remains as a single token.
5. Apply character and token filters
Use Unicode normalization and keep only common nouns:
% echo "Linderaは形態素解析エンジンです。" | lindera tokenize \
--dict embedded://ipadic \
--char-filter 'unicode_normalize:{"kind":"nfkc"}' \
--token-filter 'japanese_keep_tags:{"tags":["名詞,一般","名詞,固有名詞,組織"]}'
Expected output:
Lindera 名詞,固有名詞,組織,*,*,*,*,*,*
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
エンジン 名詞,一般,*,*,*,*,エンジン,エンジン,エンジン
EOS
The Unicode normalization converts full-width characters to half-width, and the token filter keeps only tokens matching the specified part-of-speech tags.
You can also combine multiple filters:
% echo "すもももももももものうち" | lindera tokenize \
--dict embedded://ipadic \
--token-filter 'japanese_stop_tags:{"tags":["助詞","助詞,係助詞","助詞,連体化"]}'
This removes the particles も and の, leaving only すもも, もも, もも, and うち.
6. Use user dictionary
Create a CSV file with custom word entries (e.g., my_dict.csv):
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
Tokenize with the user dictionary:
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict embedded://ipadic \
--user-dict ./my_dict.csv
Without the user dictionary, "東京スカイツリー" would be split into multiple tokens. With the user dictionary, it is recognized as a single token.
The repository also provides sample user dictionary files under ./resources/user_dict:
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict embedded://ipadic \
--user-dict ./resources/user_dict/ipadic_simple_userdic.csv
Expected output:
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
Lindera Python
Lindera Python provides Python bindings for the Lindera morphological analysis engine, built with PyO3. It brings Lindera's high-performance tokenization capabilities to the Python ecosystem with support for Python 3.10 and later.
Features
- Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
- Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
- CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires the `train` feature)
- Multiple tokenization modes: Normal and decompose modes for different analysis granularity
- N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
- User dictionaries: Extend system dictionaries with custom vocabulary
Documentation
- Installation -- Prerequisites, build instructions, and feature flags
- Quick Start -- A minimal example to get started
- Tokenizer API -- `TokenizerBuilder`, `Tokenizer`, and `Token` class reference
- Dictionary Management -- Loading, building, and managing dictionaries
- Text Processing Pipeline -- Character filters and token filters
- Training -- Training custom CRF models and exporting dictionaries
Installation
Installing from PyPI
Pre-built wheels are available on PyPI:
pip install lindera-python
[!NOTE] The PyPI package does not include dictionaries. See Obtaining Dictionaries below.
Obtaining Dictionaries
Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.
Download from GitHub Releases
Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:
# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic
Building from Source
If you need to build from source (e.g., to enable specific feature flags), the following prerequisites are required:
- Python 3.10 or later (up to 3.14)
- Rust toolchain -- Install via rustup
- maturin -- Python package for building Rust-based Python extensions
Install maturin with pip:
pip install maturin
Development Build
Build and install lindera-python in development mode:
cd lindera-python
maturin develop
Or use the project Makefile:
make python-develop
Build with Training Support
The train feature enables CRF-based dictionary training functionality. It is enabled by default:
maturin develop --features train
Feature Flags
| Feature | Description | Default |
|---|---|---|
| `train` | CRF training functionality | Enabled |
| `embed-ipadic` | Embed Japanese dictionary (IPADIC) into the binary | Disabled |
| `embed-unidic` | Embed Japanese dictionary (UniDic) into the binary | Disabled |
| `embed-ipadic-neologd` | Embed Japanese dictionary (IPADIC NEologd) into the binary | Disabled |
| `embed-ko-dic` | Embed Korean dictionary (ko-dic) into the binary | Disabled |
| `embed-cc-cedict` | Embed Chinese dictionary (CC-CEDICT) into the binary | Disabled |
| `embed-jieba` | Embed Chinese dictionary (Jieba) into the binary | Disabled |
| `embed-cjk` | Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary | Disabled |
Multiple features can be combined:
maturin develop --features "train,embed-ipadic,embed-ko-dic"
[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding `embed-*` feature flag and load it using the `embedded://` scheme:
dictionary = load_dictionary("embedded://ipadic")
See Feature Flags for details.
Verifying the Installation
After installation, verify that lindera is available in Python:
import lindera
print(lindera.version())
Quick Start
This guide shows how to tokenize text using lindera-python.
Basic Tokenization
The recommended way to create a tokenizer is through TokenizerBuilder:
from lindera import TokenizerBuilder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("/path/to/ipadic")
tokenizer = builder.build()
tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")
for token in tokens:
print(f"{token.surface}\t{','.join(token.details)}")
Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
Method Chaining
TokenizerBuilder supports method chaining for concise configuration:
from lindera import TokenizerBuilder
tokenizer = (
TokenizerBuilder()
.set_mode("normal")
.set_dictionary("/path/to/ipadic")
.build()
)
tokens = tokenizer.tokenize("すもももももももものうち")
for token in tokens:
print(f"{token.surface}\t{token.get_detail(0)}")
Accessing Token Properties
Each token exposes the following properties:
from lindera import TokenizerBuilder
tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()
tokens = tokenizer.tokenize("東京タワー")
for token in tokens:
print(f"Surface: {token.surface}")
print(f"Byte range: {token.byte_start}..{token.byte_end}")
print(f"Position: {token.position}")
print(f"Word ID: {token.word_id}")
print(f"Unknown: {token.is_unknown}")
print(f"Details: {token.details}")
print()
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
from lindera import TokenizerBuilder
tokenizer = TokenizerBuilder().set_dictionary("/path/to/ipadic").build()
results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)
for tokens, cost in results:
surfaces = [t.surface for t in tokens]
print(f"Cost {cost}: {' / '.join(surfaces)}")
Tokenizer API
TokenizerBuilder
TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.
Constructors
TokenizerBuilder()
Creates a new builder with default configuration.
from lindera import TokenizerBuilder
builder = TokenizerBuilder()
TokenizerBuilder().from_file(file_path)
Loads configuration from a JSON file and returns a new builder.
builder = TokenizerBuilder().from_file("config.json")
Configuration Methods
All setter methods return self for method chaining.
set_mode(mode)
Sets the tokenization mode.
"normal"-- Standard tokenization (default)"decompose"-- Decomposes compound words into smaller units
builder.set_mode("normal")
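For example, a tokenizer built in decompose mode splits compound nouns, mirroring the CLI example earlier in this document (a sketch; substitute your own dictionary path):
from lindera import TokenizerBuilder
tokenizer = (
    TokenizerBuilder()
    .set_mode("decompose")
    .set_dictionary("/path/to/ipadic")
    .build()
)
# With IPADIC, "関西国際空港" decomposes into 関西 / 国際 / 空港
for token in tokenizer.tokenize("関西国際空港限定トートバッグ"):
    print(token.surface)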
set_dictionary(path)
Sets the system dictionary path or URI.
# Use an embedded dictionary
builder.set_dictionary("embedded://ipadic")
# Use an external dictionary
builder.set_dictionary("/path/to/dictionary")
set_user_dictionary(uri)
Sets the user dictionary URI.
builder.set_user_dictionary("/path/to/user_dictionary")
set_keep_whitespace(keep)
Controls whether whitespace tokens appear in the output.
builder.set_keep_whitespace(True)
append_character_filter(kind, args=None)
Appends a character filter to the preprocessing pipeline.
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
append_token_filter(kind, args=None)
Appends a token filter to the postprocessing pipeline.
builder.append_token_filter("lowercase", {})
Build
build()
Builds and returns a Tokenizer with the configured settings.
tokenizer = builder.build()
Tokenizer
Tokenizer performs morphological analysis on text.
Creating a Tokenizer
Tokenizer(dictionary, mode="normal", user_dictionary=None)
Creates a tokenizer directly from a loaded dictionary.
from lindera import Tokenizer, load_dictionary
dictionary = load_dictionary("embedded://ipadic")
tokenizer = Tokenizer(dictionary, mode="normal")
Tokenizer Methods
tokenize(text)
Tokenizes the input text and returns a list of Token objects.
tokens = tokenizer.tokenize("形態素解析")
Parameters:
| Name | Type | Description |
|---|---|---|
| `text` | str | Text to tokenize |
Returns: list[Token]
tokenize_nbest(text, n, unique=False, cost_threshold=None)
Returns the N-best tokenization results, each paired with its total path cost.
results = tokenizer.tokenize_nbest("すもももももももものうち", n=3)
for tokens, cost in results:
print(cost, [t.surface for t in tokens])
Parameters:
| Name | Type | Description |
|---|---|---|
| `text` | str | Text to tokenize |
| `n` | int | Number of results to return |
| `unique` | bool | Deduplicate results (default: False) |
| `cost_threshold` | int or None | Maximum cost difference from the best path (default: None) |
Returns: list[tuple[list[Token], int]]
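The unique and cost_threshold parameters correspond to the CLI's --nbest-unique and --nbest-cost-threshold options. A sketch combining them:
# Deduplicated candidates within 5000 cost of the best path
results = tokenizer.tokenize_nbest(
    "営業部長谷川です", n=10, unique=True, cost_threshold=5000
)
for tokens, cost in results:
    print(cost, " ".join(t.surface for t in tokens))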
Token
Token represents a single morphological token.
Properties
| Property | Type | Description |
|---|---|---|
| `surface` | str | Surface form of the token |
| `byte_start` | int | Start byte position in the original text |
| `byte_end` | int | End byte position in the original text |
| `position` | int | Token position index |
| `word_id` | int | Dictionary word ID |
| `is_unknown` | bool | True if the word is not in the dictionary |
| `details` | list[str] or None | Morphological details (part of speech, reading, etc.) |
Token Methods
get_detail(index)
Returns the detail string at the specified index, or None if the index is out of range.
token = tokenizer.tokenize("東京")[0]
pos = token.get_detail(0) # e.g., "名詞"
subpos = token.get_detail(1) # e.g., "固有名詞"
reading = token.get_detail(7) # e.g., "トウキョウ"
Parameters:
| Name | Type | Description |
|---|---|---|
| `index` | int | Zero-based index into the details list |
Returns: str or None
The structure of details depends on the dictionary:
- IPADIC: `[品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]`
- UniDic: Detailed morphological features following the UniDic specification
- ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
Dictionary Management
Lindera Python provides functions for loading, building, and managing dictionaries used in morphological analysis.
Loading Dictionaries
System Dictionaries
Use load_dictionary(uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:
from lindera import load_dictionary
dictionary = load_dictionary("/path/to/ipadic")
Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:
dictionary = load_dictionary("embedded://ipadic")
User Dictionaries
User dictionaries add custom vocabulary on top of a system dictionary.
from lindera import load_user_dictionary, Metadata
metadata = Metadata()
user_dict = load_user_dictionary("/path/to/user_dictionary", metadata)
Pass the user dictionary when building a tokenizer:
from lindera import Tokenizer, load_dictionary, load_user_dictionary, Metadata
dictionary = load_dictionary("/path/to/ipadic")
metadata = Metadata()
user_dict = load_user_dictionary("/path/to/user_dictionary", metadata)
tokenizer = Tokenizer(dictionary, mode="normal", user_dictionary=user_dict)
Or via the builder:
from lindera import TokenizerBuilder
tokenizer = (
TokenizerBuilder()
.set_dictionary("/path/to/ipadic")
.set_user_dictionary("/path/to/user_dictionary")
.build()
)
Building Dictionaries
System Dictionary
Build a system dictionary from source files:
from lindera import build_dictionary, Metadata
metadata = Metadata(name="custom", encoding="UTF-8")
build_dictionary("/path/to/input_dir", "/path/to/output_dir", metadata)
The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).
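An illustrative layout, using the file names produced by the export command (other files may be present depending on the dictionary):
input_dir/
├── lex.csv       # lexicon entries
├── matrix.def    # connection cost matrix
├── unk.def       # unknown word definitions
└── char.def      # character category definitions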
User Dictionary
Build a user dictionary from a CSV file:
from lindera import build_user_dictionary, Metadata
metadata = Metadata()
build_user_dictionary("ipadic", "user_words.csv", "/path/to/output_dir", metadata)
The metadata parameter is optional. When omitted, default metadata values are used:
build_user_dictionary("ipadic", "user_words.csv", "/path/to/output_dir")
Metadata
The Metadata class configures dictionary parameters.
Creating Metadata
from lindera import Metadata
# Default metadata
metadata = Metadata()
# Custom metadata
metadata = Metadata(
name="my_dictionary",
encoding="UTF-8",
default_word_cost=-10000,
)
Loading from JSON
metadata = Metadata.from_json_file("metadata.json")
Properties
| Property | Type | Default | Description |
|---|---|---|---|
| `name` | str | "default" | Dictionary name |
| `encoding` | str | "UTF-8" | Character encoding |
| `default_word_cost` | int | -10000 | Default cost for unknown words |
| `default_left_context_id` | int | 1288 | Default left context ID |
| `default_right_context_id` | int | 1288 | Default right context ID |
| `default_field_value` | str | "*" | Default value for missing fields |
| `flexible_csv` | bool | False | Allow flexible CSV parsing |
| `skip_invalid_cost_or_id` | bool | False | Skip entries with invalid cost or ID |
| `normalize_details` | bool | False | Normalize morphological details |
| `dictionary_schema` | Schema | IPADIC schema | Schema for the main dictionary |
| `user_dictionary_schema` | Schema | Minimal schema | Schema for user dictionaries |
All properties support both getting and setting:
metadata = Metadata()
metadata.name = "custom_dict"
metadata.encoding = "EUC-JP"
print(metadata.name) # "custom_dict"
to_dict()
Returns a dictionary representation of the metadata:
metadata = Metadata(name="test")
print(metadata.to_dict())
Text Processing Pipeline
Lindera Python supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.
Input Text
--> Character Filters (preprocessing)
--> Tokenization
--> Token Filters (postprocessing)
--> Output Tokens
Character Filters
Character filters transform the input text before tokenization.
unicode_normalize
Applies Unicode normalization to the input text.
from lindera import TokenizerBuilder
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_character_filter("unicode_normalize", {"kind": "nfkc"})
.build()
)
Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".
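As a quick illustration of what NFKC normalization does (shown here with Python's standard library, independent of Lindera), full-width ASCII characters fold to their half-width forms:
import unicodedata
# Full-width "Ｌｉｎｄｅｒａ" becomes half-width "Lindera" under NFKC
print(unicodedata.normalize("NFKC", "Ｌｉｎｄｅｒａ"))  # -> Lindera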
mapping
Replaces characters or strings according to a mapping table.
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_character_filter("mapping", {
"mapping": {
"\u30fc": "-",
"\uff5e": "~",
}
})
.build()
)
japanese_iteration_mark
Resolves Japanese iteration marks (odoriji) into their full forms.
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_character_filter("japanese_iteration_mark", {
"normalize_kanji": True,
"normalize_kana": True,
})
.build()
)
Token Filters
Token filters transform or remove tokens after tokenization.
lowercase
Converts token surface forms to lowercase.
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_token_filter("lowercase", {})
.build()
)
japanese_base_form
Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_token_filter("japanese_base_form", {})
.build()
)
japanese_stop_tags
Removes tokens whose part-of-speech matches any of the specified tags.
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_token_filter("japanese_stop_tags", {
"tags": ["助詞", "助動詞"],
})
.build()
)
japanese_keep_tags
Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.
tokenizer = (
TokenizerBuilder()
.set_dictionary("embedded://ipadic")
.append_token_filter("japanese_keep_tags", {
"tags": ["名詞"],
})
.build()
)
Complete Pipeline Example
The following example combines multiple character filters and token filters into a single pipeline:
from lindera import TokenizerBuilder
tokenizer = (
TokenizerBuilder()
.set_mode("normal")
.set_dictionary("embedded://ipadic")
# Preprocessing
.append_character_filter("unicode_normalize", {"kind": "nfkc"})
.append_character_filter("japanese_iteration_mark", {
"normalize_kanji": True,
"normalize_kana": True,
})
# Postprocessing
.append_token_filter("japanese_base_form", {})
.append_token_filter("japanese_stop_tags", {
"tags": ["助詞", "助動詞", "記号"],
})
.append_token_filter("lowercase", {})
.build()
)
tokens = tokenizer.tokenize("Linderaは形態素解析を行うライブラリです。")
for token in tokens:
print(f"{token.surface}\t{','.join(token.details)}")
In this pipeline:
- `unicode_normalize` converts full-width characters to half-width (NFKC normalization)
- `japanese_iteration_mark` resolves iteration marks
- `japanese_base_form` converts inflected tokens to their base form
- `japanese_stop_tags` removes particles, auxiliary verbs, and symbols
- `lowercase` normalizes alphabetic characters to lowercase
Training
Lindera Python supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.
Prerequisites
Build lindera-python with the train feature enabled (enabled by default):
maturin develop --features train
Training a Model
Use lindera.train() to train a CRF model from a seed lexicon and annotated corpus:
import lindera
lindera.train(
seed="resources/training/seed.csv",
corpus="resources/training/corpus.txt",
char_def="resources/training/char.def",
unk_def="resources/training/unk.def",
feature_def="resources/training/feature.def",
rewrite_def="resources/training/rewrite.def",
output="/tmp/model.dat",
lambda_=0.01,
max_iter=100,
max_threads=4,
)
Training Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `seed` | str | required | Path to the seed lexicon file (CSV format) |
| `corpus` | str | required | Path to the annotated training corpus |
| `char_def` | str | required | Path to the character definition file (char.def) |
| `unk_def` | str | required | Path to the unknown word definition file (unk.def) |
| `feature_def` | str | required | Path to the feature definition file (feature.def) |
| `rewrite_def` | str | required | Path to the rewrite rule definition file (rewrite.def) |
| `output` | str | required | Output path for the trained model file |
| `lambda_` | float | 0.01 | L1 regularization cost (0.0-1.0); the trailing underscore avoids Python's reserved word lambda |
| `max_iter` | int | 100 | Maximum number of training iterations |
| `max_threads` | int or None | None | Number of threads (None = auto-detect CPU cores) |
Exporting a Trained Model
After training, export the model to dictionary source files using lindera.export():
import lindera
lindera.export(
model="/tmp/model.dat",
output="/tmp/dictionary_source",
metadata="resources/training/metadata.json",
)
Export Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | str | required | Path to the trained model file (.dat) |
| `output` | str | required | Output directory for dictionary source files |
| `metadata` | str or None | None | Path to a base metadata.json file |
The export creates the following files in the output directory:
- `lex.csv` -- Lexicon entries with trained costs
- `matrix.def` -- Connection cost matrix
- `unk.def` -- Unknown word definitions
- `char.def` -- Character category definitions
- `metadata.json` -- Updated metadata (when the `metadata` parameter is provided)
Complete Workflow
The full workflow for training and using a custom dictionary:
import lindera
# Step 1: Train the CRF model
lindera.train(
seed="resources/training/seed.csv",
corpus="resources/training/corpus.txt",
char_def="resources/training/char.def",
unk_def="resources/training/unk.def",
feature_def="resources/training/feature.def",
rewrite_def="resources/training/rewrite.def",
output="/tmp/model.dat",
lambda_=0.01,
max_iter=100,
)
# Step 2: Export to dictionary source files
lindera.export(
model="/tmp/model.dat",
output="/tmp/dictionary_source",
metadata="resources/training/metadata.json",
)
# Step 3: Build the dictionary from exported source files
metadata = lindera.Metadata.from_json_file("/tmp/dictionary_source/metadata.json")
lindera.build_dictionary("/tmp/dictionary_source", "/tmp/dictionary", metadata)
# Step 4: Use the trained dictionary
tokenizer = (
lindera.TokenizerBuilder()
.set_dictionary("/tmp/dictionary")
.set_mode("normal")
.build()
)
tokens = tokenizer.tokenize("形態素解析のテスト")
for token in tokens:
print(f"{token.surface}\t{','.join(token.details)}")
Lindera Node.js
Lindera Node.js provides Node.js bindings for the Lindera morphological analysis engine, built with NAPI-RS. It brings Lindera's high-performance tokenization capabilities to the Node.js ecosystem with support for Node.js 18 and later.
Features
- Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
- Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
- CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires the `train` feature)
- Multiple tokenization modes: Normal and decompose modes for different analysis granularity
- N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
- User dictionaries: Extend system dictionaries with custom vocabulary
- TypeScript support: Full type definitions included out of the box
Documentation
- Installation -- Prerequisites, build instructions, and feature flags
- Quick Start -- A minimal example to get started
- Tokenizer API -- `TokenizerBuilder`, `Tokenizer`, and `Token` class reference
- Dictionary Management -- Loading, building, and managing dictionaries
- Text Processing Pipeline -- Character filters and token filters
- Training -- Training custom CRF models and exporting dictionaries
Installation
Installing from npm
Pre-built packages will be available on npm:
npm install lindera-nodejs
[!NOTE] The npm package does not include dictionaries. See Obtaining Dictionaries below. For browser/WASM usage, see lindera-wasm.
Building from Source
Prerequisites
- Node.js 18 or later (LTS versions recommended)
- Rust toolchain -- Install via rustup
- NAPI-RS CLI -- CLI tool for building native Node.js addons in Rust
Install the NAPI-RS CLI globally:
npm install -g @napi-rs/cli
Obtaining Dictionaries
Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.
Download from GitHub Releases
Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:
# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic
Development Build
Build lindera-nodejs in development mode:
cd lindera-nodejs
npm install
npm run build
Or use the project Makefile:
make nodejs-develop
Build with Training Support
The train feature enables CRF-based dictionary training functionality. It is enabled by default:
npm run build -- --features train
Feature Flags
| Feature | Description | Default |
|---|---|---|
| `train` | CRF training functionality | Enabled |
| `embed-ipadic` | Embed Japanese dictionary (IPADIC) into the binary | Disabled |
| `embed-unidic` | Embed Japanese dictionary (UniDic) into the binary | Disabled |
| `embed-ipadic-neologd` | Embed Japanese dictionary (IPADIC NEologd) into the binary | Disabled |
| `embed-ko-dic` | Embed Korean dictionary (ko-dic) into the binary | Disabled |
| `embed-cc-cedict` | Embed Chinese dictionary (CC-CEDICT) into the binary | Disabled |
| `embed-jieba` | Embed Chinese dictionary (Jieba) into the binary | Disabled |
| `embed-cjk` | Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary | Disabled |
Multiple features can be combined:
npm run build -- --features "train,embed-ipadic,embed-ko-dic"
[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding `embed-*` feature flag and load it using the `embedded://` scheme:
const dictionary = loadDictionary("embedded://ipadic");
See Feature Flags for details.
Verifying the Installation
After installation, verify that lindera is available in Node.js:
const lindera = require("lindera-nodejs");
console.log(lindera.version());
Or with ES modules:
import { version } from "lindera-nodejs";
console.log(version());
Quick Start
This guide shows how to tokenize text using lindera-nodejs.
Basic Tokenization
The recommended way to create a tokenizer is through TokenizerBuilder:
const { TokenizerBuilder } = require("lindera-nodejs");
const builder = new TokenizerBuilder();
builder.setMode("normal");
builder.setDictionary("/path/to/ipadic");
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
for (const token of tokens) {
console.log(`${token.surface}\t${token.details.join(",")}`);
}
Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
Method Chaining
TokenizerBuilder supports method chaining for concise configuration:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setMode("normal")
.setDictionary("/path/to/ipadic")
.build();
const tokens = tokenizer.tokenize("すもももももももものうち");
for (const token of tokens) {
console.log(`${token.surface}\t${token.getDetail(0)}`);
}
Accessing Token Properties
Each token exposes the following properties:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setDictionary("/path/to/ipadic")
.build();
const tokens = tokenizer.tokenize("東京タワー");
for (const token of tokens) {
console.log(`Surface: ${token.surface}`);
console.log(`Byte range: ${token.byteStart}..${token.byteEnd}`);
console.log(`Position: ${token.position}`);
console.log(`Word ID: ${token.wordId}`);
console.log(`Unknown: ${token.isUnknown}`);
console.log(`Details: ${token.details}`);
console.log();
}
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setDictionary("/path/to/ipadic")
.build();
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
const surfaces = tokens.map((t) => t.surface);
console.log(`Cost ${cost}: ${surfaces.join(" / ")}`);
}
TypeScript
Lindera Node.js includes TypeScript type definitions. All classes and functions are fully typed:
import { TokenizerBuilder, Token } from "lindera-nodejs";
const tokenizer = new TokenizerBuilder()
.setMode("normal")
.setDictionary("/path/to/ipadic")
.build();
const tokens: Token[] = tokenizer.tokenize("形態素解析");
for (const token of tokens) {
console.log(`${token.surface}: ${token.details?.join(",")}`);
}
Tokenizer API
TokenizerBuilder
TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.
Constructors
new TokenizerBuilder()
Creates a new builder with default configuration.
const { TokenizerBuilder } = require("lindera-nodejs");
const builder = new TokenizerBuilder();
new TokenizerBuilder().fromFile(filePath)
Loads configuration from a JSON file and returns a new builder.
const builder = new TokenizerBuilder().fromFile("config.json");
Configuration Methods
All setter methods return this for method chaining.
setMode(mode)
Sets the tokenization mode.
"normal"-- Standard tokenization (default)"decompose"-- Decomposes compound words into smaller units
builder.setMode("normal");
setDictionary(path)
Sets the system dictionary path or URI.
// Use an embedded dictionary
builder.setDictionary("embedded://ipadic");
// Use an external dictionary
builder.setDictionary("/path/to/dictionary");
setUserDictionary(uri)
Sets the user dictionary URI.
builder.setUserDictionary("/path/to/user_dictionary");
setKeepWhitespace(keep)
Controls whether whitespace tokens appear in the output.
builder.setKeepWhitespace(true);
appendCharacterFilter(kind, args?)
Appends a character filter to the preprocessing pipeline.
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
appendTokenFilter(kind, args?)
Appends a token filter to the postprocessing pipeline.
builder.appendTokenFilter("lowercase", {});
Build
build()
Builds and returns a Tokenizer with the configured settings.
const tokenizer = builder.build();
Tokenizer
Tokenizer performs morphological analysis on text.
Creating a Tokenizer
new Tokenizer(dictionary, mode?, userDictionary?)
Creates a tokenizer directly from a loaded dictionary.
const { Tokenizer, loadDictionary } = require("lindera-nodejs");
const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");
Tokenizer Methods
tokenize(text)
Tokenizes the input text and returns an array of Token objects.
const tokens = tokenizer.tokenize("形態素解析");
Parameters:
| Name | Type | Description |
|---|---|---|
| `text` | string | Text to tokenize |
Returns: Token[]
tokenizeNbest(text, n, unique?, costThreshold?)
Returns the N-best tokenization results, each containing tokens and total path cost.
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
for (const { tokens, cost } of results) {
console.log(cost, tokens.map((t) => t.surface));
}
Parameters:
| Name | Type | Description |
|---|---|---|
| `text` | string | Text to tokenize |
| `n` | number | Number of results to return |
| `unique` | boolean | Deduplicate results (default: false) |
| `costThreshold` | number or undefined | Maximum cost difference from the best path (default: undefined) |
Returns: Array<{ tokens: Token[], cost: number }>
Token
Token represents a single morphological token.
Properties
| Property | Type | Description |
|---|---|---|
| `surface` | string | Surface form of the token |
| `byteStart` | number | Start byte position in the original text |
| `byteEnd` | number | End byte position in the original text |
| `position` | number | Token position index |
| `wordId` | number | Dictionary word ID |
| `isUnknown` | boolean | true if the word is not in the dictionary |
| `details` | string[] or null | Morphological details (part of speech, reading, etc.) |
Token Methods
getDetail(index)
Returns the detail string at the specified index, or null if the index is out of range.
const token = tokenizer.tokenize("東京")[0];
const pos = token.getDetail(0); // e.g., "名詞"
const subpos = token.getDetail(1); // e.g., "固有名詞"
const reading = token.getDetail(7); // e.g., "トウキョウ"
Parameters:
| Name | Type | Description |
|---|---|---|
| `index` | number | Zero-based index into the details array |
Returns: string | null
The structure of details depends on the dictionary:
- IPADIC: `[品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]`
- UniDic: Detailed morphological features following the UniDic specification
- ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
Dictionary Management
Lindera Node.js provides functions for loading, building, and managing dictionaries used in morphological analysis.
Loading Dictionaries
System Dictionaries
Use loadDictionary(uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:
const { loadDictionary } = require("lindera-nodejs");
const dictionary = loadDictionary("/path/to/ipadic");
Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:
const dictionary = loadDictionary("embedded://ipadic");
User Dictionaries
User dictionaries add custom vocabulary on top of a system dictionary.
const { loadUserDictionary, Metadata } = require("lindera-nodejs");
const metadata = new Metadata();
const userDict = loadUserDictionary("/path/to/user_dictionary", metadata);
Pass the user dictionary when building a tokenizer:
const { Tokenizer, loadDictionary, loadUserDictionary, Metadata } = require("lindera-nodejs");
const dictionary = loadDictionary("/path/to/ipadic");
const metadata = new Metadata();
const userDict = loadUserDictionary("/path/to/user_dictionary", metadata);
const tokenizer = new Tokenizer(dictionary, "normal", userDict);
Or via the builder:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setDictionary("/path/to/ipadic")
.setUserDictionary("/path/to/user_dictionary")
.build();
Building Dictionaries
System Dictionary
Build a system dictionary from source files:
const { buildDictionary, Metadata } = require("lindera-nodejs");
const metadata = new Metadata({ name: "custom", encoding: "UTF-8" });
buildDictionary("/path/to/input_dir", "/path/to/output_dir", metadata);
The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).
User Dictionary
Build a user dictionary from a CSV file:
const { buildUserDictionary, Metadata } = require("lindera-nodejs");
const metadata = new Metadata();
buildUserDictionary("ipadic", "user_words.csv", "/path/to/output_dir", metadata);
The metadata parameter is optional. When omitted, default metadata values are used:
buildUserDictionary("ipadic", "user_words.csv", "/path/to/output_dir");
Metadata
The Metadata class configures dictionary parameters.
Creating Metadata
const { Metadata } = require("lindera-nodejs");
// Default metadata
const metadata = new Metadata();
// Custom metadata
const metadata = new Metadata({
name: "my_dictionary",
encoding: "UTF-8",
defaultWordCost: -10000,
});
Loading from JSON
const metadata = Metadata.fromJsonFile("metadata.json");
Properties
| Property | Type | Default | Description |
|---|---|---|---|
| `name` | string | "default" | Dictionary name |
| `encoding` | string | "UTF-8" | Character encoding |
| `defaultWordCost` | number | -10000 | Default cost for unknown words |
| `defaultLeftContextId` | number | 1288 | Default left context ID |
| `defaultRightContextId` | number | 1288 | Default right context ID |
| `defaultFieldValue` | string | "*" | Default value for missing fields |
| `flexibleCsv` | boolean | false | Allow flexible CSV parsing |
| `skipInvalidCostOrId` | boolean | false | Skip entries with invalid cost or ID |
| `normalizeDetails` | boolean | false | Normalize morphological details |
| `dictionarySchema` | Schema | IPADIC schema | Schema for the main dictionary |
| `userDictionarySchema` | Schema | Minimal schema | Schema for user dictionaries |
All properties support both getting and setting:
const metadata = new Metadata();
metadata.name = "custom_dict";
metadata.encoding = "EUC-JP";
console.log(metadata.name); // "custom_dict"
toObject()
Returns a plain object representation of the metadata:
const metadata = new Metadata({ name: "test" });
console.log(metadata.toObject());
Schema
The Schema class defines the field structure of dictionary entries.
Creating a Schema
const { Schema } = require("lindera-nodejs");
// Default IPADIC-compatible schema
const schema = Schema.createDefault();
// Custom schema
const custom = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);
Schema Methods
| Method | Returns | Description |
|---|---|---|
| `getFieldIndex(name)` | number or null | Get field index by name |
| `fieldCount()` | number | Total number of fields |
| `getFieldName(index)` | string or null | Get field name by index |
| `getCustomFields()` | string[] | Fields beyond index 4 (morphological features) |
| `getAllFields()` | string[] | All field names |
| `getFieldByName(name)` | FieldDefinition or null | Get full field definition |
| `validateRecord(record)` | void | Validate a CSV record against the schema |
const schema = Schema.createDefault();
console.log(schema.fieldCount()); // 13 (IPADIC format)
console.log(schema.getFieldIndex("pos1")); // e.g., 4
console.log(schema.getAllFields()); // ["surface", "left_id", ...]
console.log(schema.getCustomFields()); // Fields after index 4
FieldDefinition
| Property | Type | Description |
|---|---|---|
| `index` | number | Field position index |
| `name` | string | Field name |
| `fieldType` | FieldType | Field type enum |
| `description` | string or undefined | Optional description |
FieldType
| Value | Description |
|---|---|
| `FieldType.Surface` | Word surface text |
| `FieldType.LeftContextId` | Left context ID |
| `FieldType.RightContextId` | Right context ID |
| `FieldType.Cost` | Word cost |
| `FieldType.Custom` | Morphological feature field |
Text Processing Pipeline
Lindera Node.js supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.
Input Text
--> Character Filters (preprocessing)
--> Tokenization
--> Token Filters (postprocessing)
--> Output Tokens
Character Filters
Character filters transform the input text before tokenization.
unicode_normalize
Applies Unicode normalization to the input text.
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendCharacterFilter("unicode_normalize", { kind: "nfkc" })
.build();
Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".
mapping
Replaces characters or strings according to a mapping table.
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendCharacterFilter("mapping", {
mapping: {
"\u30fc": "-",
"\uff5e": "~",
},
})
.build();
japanese_iteration_mark
Resolves Japanese iteration marks (odoriji) into their full forms.
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendCharacterFilter("japanese_iteration_mark", {
normalize_kanji: true,
normalize_kana: true,
})
.build();
Token Filters
Token filters transform or remove tokens after tokenization.
lowercase
Converts token surface forms to lowercase.
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendTokenFilter("lowercase", {})
.build();
japanese_base_form
Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendTokenFilter("japanese_base_form", {})
.build();
japanese_stop_tags
Removes tokens whose part-of-speech matches any of the specified tags.
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendTokenFilter("japanese_stop_tags", {
tags: ["助詞", "助動詞"],
})
.build();
japanese_keep_tags
Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.
const tokenizer = new TokenizerBuilder()
.setDictionary("embedded://ipadic")
.appendTokenFilter("japanese_keep_tags", {
tags: ["名詞"],
})
.build();
Complete Pipeline Example
The following example combines multiple character filters and token filters into a single pipeline:
const { TokenizerBuilder } = require("lindera-nodejs");
const tokenizer = new TokenizerBuilder()
.setMode("normal")
.setDictionary("embedded://ipadic")
// Preprocessing
.appendCharacterFilter("unicode_normalize", { kind: "nfkc" })
.appendCharacterFilter("japanese_iteration_mark", {
normalize_kanji: true,
normalize_kana: true,
})
// Postprocessing
.appendTokenFilter("japanese_base_form", {})
.appendTokenFilter("japanese_stop_tags", {
tags: ["助詞", "助動詞", "記号"],
})
.appendTokenFilter("lowercase", {})
.build();
const tokens = tokenizer.tokenize("Linderaは形態素解析を行うライブラリです。");
for (const token of tokens) {
console.log(`${token.surface}\t${token.details.join(",")}`);
}
In this pipeline:
- `unicode_normalize` converts full-width characters to half-width (NFKC normalization)
- `japanese_iteration_mark` resolves iteration marks
- `japanese_base_form` converts inflected tokens to their base form
- `japanese_stop_tags` removes particles, auxiliary verbs, and symbols
- `lowercase` normalizes alphabetic characters to lowercase
Training
Lindera Node.js supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.
Prerequisites
Build lindera-nodejs with the train feature enabled (enabled by default):
npm run build -- --features train
Training a Model
Use train() to train a CRF model from a seed lexicon and annotated corpus:
const { train } = require("lindera-nodejs");
train({
seed: "resources/training/seed.csv",
corpus: "resources/training/corpus.txt",
charDef: "resources/training/char.def",
unkDef: "resources/training/unk.def",
featureDef: "resources/training/feature.def",
rewriteDef: "resources/training/rewrite.def",
output: "/tmp/model.dat",
lambda: 0.01,
maxIter: 100,
maxThreads: 4,
});
Training Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
seed | string | required | Path to the seed lexicon file (CSV format) |
corpus | string | required | Path to the annotated training corpus |
charDef | string | required | Path to the character definition file (char.def) |
unkDef | string | required | Path to the unknown word definition file (unk.def) |
featureDef | string | required | Path to the feature definition file (feature.def) |
rewriteDef | string | required | Path to the rewrite rule definition file (rewrite.def) |
output | string | required | Output path for the trained model file |
lambda | number | 0.01 | L1 regularization cost (0.0--1.0) |
maxIter | number | 100 | Maximum number of training iterations |
maxThreads | number | undefined | Number of threads (undefined = auto-detect CPU cores) |
Exporting a Trained Model
After training, export the model to dictionary source files using exportModel():
const { exportModel } = require("lindera-nodejs");
exportModel({
model: "/tmp/model.dat",
output: "/tmp/dictionary_source",
metadata: "resources/training/metadata.json",
});
Export Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
model | string | required | Path to the trained model file (.dat) |
output | string | required | Output directory for dictionary source files |
metadata | string | undefined | Path to a base metadata.json file |
The export creates the following files in the output directory:
- `lex.csv` -- Lexicon entries with trained costs
- `matrix.def` -- Connection cost matrix
- `unk.def` -- Unknown word definitions
- `char.def` -- Character category definitions
- `metadata.json` -- Updated metadata (when the `metadata` parameter is provided)
Complete Workflow
The full workflow for training and using a custom dictionary:
const {
train,
exportModel,
buildDictionary,
Metadata,
TokenizerBuilder,
} = require("lindera-nodejs");
// Step 1: Train the CRF model
train({
seed: "resources/training/seed.csv",
corpus: "resources/training/corpus.txt",
charDef: "resources/training/char.def",
unkDef: "resources/training/unk.def",
featureDef: "resources/training/feature.def",
rewriteDef: "resources/training/rewrite.def",
output: "/tmp/model.dat",
lambda: 0.01,
maxIter: 100,
});
// Step 2: Export to dictionary source files
exportModel({
model: "/tmp/model.dat",
output: "/tmp/dictionary_source",
metadata: "resources/training/metadata.json",
});
// Step 3: Build the dictionary from exported source files
const metadata = Metadata.fromJsonFile("/tmp/dictionary_source/metadata.json");
buildDictionary("/tmp/dictionary_source", "/tmp/dictionary", metadata);
// Step 4: Use the trained dictionary
const tokenizer = new TokenizerBuilder()
.setDictionary("/tmp/dictionary")
.setMode("normal")
.build();
const tokens = tokenizer.tokenize("形態素解析のテスト");
for (const token of tokens) {
console.log(`${token.surface}\t${token.details.join(",")}`);
}
Lindera Ruby
Lindera Ruby provides Ruby bindings for the Lindera morphological analysis engine, built with Magnus and rb-sys. It brings Lindera's high-performance tokenization capabilities to the Ruby ecosystem with support for Ruby 3.1 and later.
Features
- Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
- Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
- CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires the `train` feature)
- Multiple tokenization modes: Normal and decompose modes for different analysis granularity
- N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
- User dictionaries: Extend system dictionaries with custom vocabulary
Documentation
- Installation -- Prerequisites, build instructions, and feature flags
- Quick Start -- A minimal example to get started
- Tokenizer API -- `TokenizerBuilder`, `Tokenizer`, and `Token` class reference
- Dictionary Management -- Loading, building, and managing dictionaries
- Text Processing Pipeline -- Character filters and token filters
- Training -- Training custom CRF models and exporting dictionaries
Installation
[!NOTE] lindera-ruby is not yet published to RubyGems. You need to build from source.
Prerequisites
- Ruby 3.1 or later
- Rust toolchain -- Install via rustup
- Bundler -- Ruby dependency manager (`gem install bundler`)
Obtaining Dictionaries
Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.
Download from GitHub Releases
Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:
# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic
Development Build
Build and install lindera-ruby in development mode:
cd lindera-ruby
bundle install
bundle exec rake compile
Or use the project Makefile:
make ruby-develop
Build with Training Support
The train feature enables CRF-based dictionary training functionality:
LINDERA_FEATURES="train" bundle exec rake compile
Feature Flags
Features are specified through the LINDERA_FEATURES environment variable as a comma-separated list.
| Feature | Description | Default |
|---|---|---|
train | CRF training functionality | Disabled |
embed-ipadic | Embed Japanese dictionary (IPADIC) into the binary | Disabled |
embed-unidic | Embed Japanese dictionary (UniDic) into the binary | Disabled |
embed-ipadic-neologd | Embed Japanese dictionary (IPADIC NEologd) into the binary | Disabled |
embed-ko-dic | Embed Korean dictionary (ko-dic) into the binary | Disabled |
embed-cc-cedict | Embed Chinese dictionary (CC-CEDICT) into the binary | Disabled |
embed-jieba | Embed Chinese dictionary (Jieba) into the binary | Disabled |
embed-cjk | Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary | Disabled |
Multiple features can be combined:
LINDERA_FEATURES="train,embed-ipadic,embed-ko-dic" bundle exec rake compile
[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding `embed-*` feature flag and load it using the `embedded://` scheme: `dictionary = Lindera.load_dictionary("embedded://ipadic")`. See Feature Flags for details.
Verifying the Installation
After installation, verify that lindera is available in Ruby:
require 'lindera'
puts Lindera.version
Quick Start
This guide shows how to tokenize text using lindera-ruby.
Basic Tokenization
The recommended way to create a tokenizer is through Lindera::TokenizerBuilder:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
tokens = tokenizer.tokenize('関西国際空港限定トートバッグ')
tokens.each do |token|
puts "#{token.surface}\t#{token.details.join(',')}"
end
Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
Sequential Configuration
TokenizerBuilder is configured through sequential method calls:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
tokens = tokenizer.tokenize('すもももももももものうち')
tokens.each do |token|
puts "#{token.surface}\t#{token.get_detail(0)}"
end
Accessing Token Properties
Each token exposes the following properties:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
tokens = tokenizer.tokenize('東京タワー')
tokens.each do |token|
puts "Surface: #{token.surface}"
puts "Byte range: #{token.byte_start}..#{token.byte_end}"
puts "Position: #{token.position}"
puts "Word ID: #{token.word_id}"
puts "Unknown: #{token.is_unknown}"
puts "Details: #{token.details}"
puts
end
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
tokenizer = builder.build
results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)
results.each do |tokens, cost|
surfaces = tokens.map(&:surface)
puts "Cost #{cost}: #{surfaces.join(' / ')}"
end
Tokenizer API
TokenizerBuilder
Lindera::TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.
Constructors
Lindera::TokenizerBuilder.new
Creates a new builder with default configuration.
require 'lindera'
builder = Lindera::TokenizerBuilder.new
Lindera::TokenizerBuilder.new.from_file(file_path)
Loads configuration from a JSON file and returns a new builder.
builder = Lindera::TokenizerBuilder.new.from_file('config.json')
Configuration Methods
set_mode(mode)
Sets the tokenization mode.
"normal"-- Standard tokenization (default)"decompose"-- Decomposes compound words into smaller units
builder.set_mode('normal')
set_dictionary(path)
Sets the system dictionary path or URI.
# Use an embedded dictionary
builder.set_dictionary('embedded://ipadic')
# Use an external dictionary
builder.set_dictionary('/path/to/dictionary')
set_user_dictionary(uri)
Sets the user dictionary URI.
builder.set_user_dictionary('/path/to/user_dictionary')
set_keep_whitespace(keep)
Controls whether whitespace tokens appear in the output.
builder.set_keep_whitespace(true)
append_character_filter(kind, args)
Appends a character filter to the preprocessing pipeline. The args parameter is a hash with string keys.
builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })
append_token_filter(kind, args)
Appends a token filter to the postprocessing pipeline. The args parameter is a hash with string keys, or nil if the filter requires no arguments.
builder.append_token_filter('lowercase', nil)
Build
build
Builds and returns a Tokenizer with the configured settings.
tokenizer = builder.build
Tokenizer
Lindera::Tokenizer performs morphological analysis on text.
Creating a Tokenizer
Lindera::Tokenizer.new(dictionary, mode, user_dictionary)
Creates a tokenizer directly from a loaded dictionary.
require 'lindera'
dictionary = Lindera.load_dictionary('embedded://ipadic')
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', nil)
With a user dictionary:
dictionary = Lindera.load_dictionary('embedded://ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', user_dict)
Tokenizer Methods
tokenize(text)
Tokenizes the input text and returns an array of Token objects.
tokens = tokenizer.tokenize('形態素解析')
Parameters:
| Name | Type | Description |
|---|---|---|
text | String | Text to tokenize |
Returns: Array<Token>
tokenize_nbest(text, n, unique, cost_threshold)
Returns the N-best tokenization results, each paired with its total path cost.
results = tokenizer.tokenize_nbest('すもももももももものうち', 3, false, nil)
results.each do |tokens, cost|
puts "#{cost}: #{tokens.map(&:surface).inspect}"
end
Parameters:
| Name | Type | Description |
|---|---|---|
text | String | Text to tokenize |
n | Integer | Number of results to return |
unique | Boolean or nil | Deduplicate results (default: false) |
cost_threshold | Integer or nil | Maximum cost difference from the best path (default: nil) |
Returns: Array<Array(Array<Token>, Integer)>
Token
Token represents a single morphological token.
Properties
| Property | Type | Description |
|---|---|---|
surface | String | Surface form of the token |
byte_start | Integer | Start byte position in the original text |
byte_end | Integer | End byte position in the original text |
position | Integer | Token position index |
word_id | Integer | Dictionary word ID |
is_unknown | Boolean | true if the word is not in the dictionary |
details | Array<String> or nil | Morphological details (part of speech, reading, etc.) |
Token Methods
get_detail(index)
Returns the detail string at the specified index, or nil if the index is out of range.
token = tokenizer.tokenize('東京')[0]
pos = token.get_detail(0) # e.g., "名詞"
subpos = token.get_detail(1) # e.g., "固有名詞"
reading = token.get_detail(7) # e.g., "トウキョウ"
Parameters:
| Name | Type | Description |
|---|---|---|
index | Integer | Zero-based index into the details array |
Returns: String or nil
The structure of details depends on the dictionary:
- IPADIC: `[品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]`
- UniDic: Detailed morphological features following the UniDic specification
- ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
Dictionary Management
Lindera Ruby provides functions for loading, building, and managing dictionaries used in morphological analysis.
Loading Dictionaries
System Dictionaries
Use Lindera.load_dictionary(uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:
require 'lindera'
dictionary = Lindera.load_dictionary('/path/to/ipadic')
Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:
dictionary = Lindera.load_dictionary('embedded://ipadic')
User Dictionaries
User dictionaries add custom vocabulary on top of a system dictionary.
require 'lindera'
dictionary = Lindera.load_dictionary('/path/to/ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)
Pass the user dictionary when building a tokenizer:
require 'lindera'
dictionary = Lindera.load_dictionary('/path/to/ipadic')
metadata = dictionary.metadata
user_dict = Lindera.load_user_dictionary('/path/to/user_dictionary', metadata)
tokenizer = Lindera::Tokenizer.new(dictionary, 'normal', user_dict)
Or via the builder:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/path/to/ipadic')
builder.set_user_dictionary('/path/to/user_dictionary')
tokenizer = builder.build
Building Dictionaries
System Dictionary
Build a system dictionary from source files:
require 'lindera'
metadata = Lindera::Metadata.from_json_file('metadata.json')
Lindera.build_dictionary('/path/to/input_dir', '/path/to/output_dir', metadata)
The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).
User Dictionary
Build a user dictionary from a CSV file:
require 'lindera'
metadata = Lindera::Metadata.from_json_file('metadata.json')
Lindera.build_user_dictionary('ipadic', 'user_words.csv', '/path/to/output_dir', metadata)
The metadata parameter is optional. When omitted, default metadata values are used:
Lindera.build_user_dictionary('ipadic', 'user_words.csv', '/path/to/output_dir', nil)
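For reference, a user dictionary CSV in Lindera's simple format has one entry per line with three columns: surface form, part-of-speech, and reading. A minimal sketch (the entries are illustrative, not from the source tree):
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東京メトロ,カスタム名詞,トウキョウメトロ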
Metadata
The Lindera::Metadata class configures dictionary parameters.
Creating Metadata
require 'lindera'
# Default metadata
metadata = Lindera::Metadata.new
# Create default metadata with standard settings
metadata = Lindera::Metadata.create_default
Loading from JSON
metadata = Lindera::Metadata.from_json_file('metadata.json')
Properties
| Property | Type | Default | Description |
|---|---|---|---|
name | String | "default" | Dictionary name |
encoding | String | "UTF-8" | Character encoding |
default_word_cost | Integer | -10000 | Default cost for unknown words |
default_left_context_id | Integer | 1288 | Default left context ID |
default_right_context_id | Integer | 1288 | Default right context ID |
default_field_value | String | "*" | Default value for missing fields |
flexible_csv | Boolean | false | Allow flexible CSV parsing |
skip_invalid_cost_or_id | Boolean | false | Skip entries with invalid cost or ID |
normalize_details | Boolean | false | Normalize morphological details |
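These values can be inspected on a metadata object. A minimal sketch, assuming the Ruby binding exposes reader methods named after the properties above (mirroring the getters in the PHP binding):
require 'lindera'
metadata = Lindera::Metadata.new
# Reader methods are assumed to match the property names in the table above
puts metadata.name     # => "default"
puts metadata.encoding # => "UTF-8"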
Text Processing Pipeline
Lindera Ruby supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.
Input Text
--> Character Filters (preprocessing)
--> Tokenization
--> Token Filters (postprocessing)
--> Output Tokens
Character Filters
Character filters transform the input text before tokenization.
unicode_normalize
Applies Unicode normalization to the input text.
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })
tokenizer = builder.build
Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".
mapping
Replaces characters or strings according to a mapping table.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_character_filter('mapping', {
'mapping' => {
"\u30fc" => '-',
"\uff5e" => '~'
}
})
tokenizer = builder.build
japanese_iteration_mark
Resolves Japanese iteration marks (odoriji) into their full forms.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_character_filter('japanese_iteration_mark', {
'normalize_kanji' => 'true',
'normalize_kana' => 'true'
})
tokenizer = builder.build
Token Filters
Token filters transform or remove tokens after tokenization.
lowercase
Converts token surface forms to lowercase.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('lowercase', nil)
tokenizer = builder.build
japanese_base_form
Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_base_form', nil)
tokenizer = builder.build
japanese_stop_tags
Removes tokens whose part-of-speech matches any of the specified tags.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_stop_tags', {
'tags' => ['助詞', '助動詞']
})
tokenizer = builder.build
japanese_keep_tags
Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_keep_tags', {
'tags' => ['名詞']
})
tokenizer = builder.build
japanese_katakana_stem
Removes trailing prolonged sound marks from katakana tokens that exceed a minimum length.
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('embedded://ipadic')
builder.append_token_filter('japanese_katakana_stem', { 'min' => 3 })
tokenizer = builder.build
Complete Pipeline Example
The following example combines multiple character filters and token filters into a single pipeline:
require 'lindera'
builder = Lindera::TokenizerBuilder.new
builder.set_mode('normal')
builder.set_dictionary('embedded://ipadic')
# Preprocessing
builder.append_character_filter('unicode_normalize', { 'kind' => 'nfkc' })
builder.append_character_filter('japanese_iteration_mark', {
'normalize_kanji' => 'true',
'normalize_kana' => 'true'
})
# Postprocessing
builder.append_token_filter('japanese_base_form', nil)
builder.append_token_filter('japanese_stop_tags', {
'tags' => ['助詞', '助動詞', '記号']
})
builder.append_token_filter('lowercase', nil)
tokenizer = builder.build
tokens = tokenizer.tokenize('Linderaは形態素解析を行うライブラリです。')
tokens.each do |token|
puts "#{token.surface}\t#{token.details.join(',')}"
end
In this pipeline:
- `unicode_normalize` converts full-width characters to half-width (NFKC normalization)
- `japanese_iteration_mark` resolves iteration marks
- `japanese_base_form` converts inflected tokens to base form
- `japanese_stop_tags` removes particles, auxiliary verbs, and symbols
- `lowercase` normalizes alphabetic characters to lowercase
Training
Lindera Ruby supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.
Prerequisites
Build lindera-ruby with the train feature enabled:
LINDERA_FEATURES="embed-ipadic,train" bundle exec rake compile
Training a Model
Use Lindera::Trainer.train to train a CRF model from a seed lexicon and annotated corpus:
require 'lindera'
Lindera::Trainer.train(
'resources/training/seed.csv',
'resources/training/corpus.txt',
'resources/training/char.def',
'resources/training/unk.def',
'resources/training/feature.def',
'resources/training/rewrite.def',
'/tmp/model.dat',
0.01, # lambda (L1 regularization)
100, # max_iter
nil # max_threads (nil = auto-detect CPU cores)
)
Training Parameters
Parameters are passed as positional arguments in the following order:
| Position | Name | Type | Description |
|---|---|---|---|
| 1 | seed | String | Path to the seed lexicon file (CSV format) |
| 2 | corpus | String | Path to the annotated training corpus |
| 3 | char_def | String | Path to the character definition file (char.def) |
| 4 | unk_def | String | Path to the unknown word definition file (unk.def) |
| 5 | feature_def | String | Path to the feature definition file (feature.def) |
| 6 | rewrite_def | String | Path to the rewrite rule definition file (rewrite.def) |
| 7 | output | String | Output path for the trained model file |
| 8 | lambda | Float | L1 regularization cost (0.0--1.0) |
| 9 | max_iter | Integer | Maximum number of training iterations |
| 10 | max_threads | Integer or nil | Number of threads (nil = auto-detect CPU cores) |
Exporting a Trained Model
After training, export the model to dictionary source files using Lindera::Trainer.export:
require 'lindera'
Lindera::Trainer.export(
'/tmp/model.dat',
'/tmp/dictionary_source',
'resources/training/metadata.json'
)
Export Parameters
| Position | Name | Type | Description |
|---|---|---|---|
| 1 | model | String | Path to the trained model file (.dat) |
| 2 | output | String | Output directory for dictionary source files |
| 3 | metadata | String or nil | Path to a base metadata.json file |
The export creates the following files in the output directory:
- `lex.csv` -- Lexicon entries with trained costs
- `matrix.def` -- Connection cost matrix
- `unk.def` -- Unknown word definitions
- `char.def` -- Character category definitions
- `metadata.json` -- Updated metadata (when the `metadata` parameter is provided)
Complete Workflow
The full workflow for training and using a custom dictionary:
require 'lindera'
# Step 1: Train the CRF model
Lindera::Trainer.train(
'resources/training/seed.csv',
'resources/training/corpus.txt',
'resources/training/char.def',
'resources/training/unk.def',
'resources/training/feature.def',
'resources/training/rewrite.def',
'/tmp/model.dat',
0.01, # lambda
100, # max_iter
nil # max_threads
)
# Step 2: Export to dictionary source files
Lindera::Trainer.export(
'/tmp/model.dat',
'/tmp/dictionary_source',
'resources/training/metadata.json'
)
# Step 3: Build the dictionary from exported source files
metadata = Lindera::Metadata.from_json_file('/tmp/dictionary_source/metadata.json')
Lindera.build_dictionary('/tmp/dictionary_source', '/tmp/dictionary', metadata)
# Step 4: Use the trained dictionary
builder = Lindera::TokenizerBuilder.new
builder.set_dictionary('/tmp/dictionary')
builder.set_mode('normal')
tokenizer = builder.build
tokens = tokenizer.tokenize('形態素解析のテスト')
tokens.each do |token|
puts "#{token.surface}\t#{token.details.join(',')}"
end
Lindera PHP
Lindera PHP provides PHP bindings for the Lindera morphological analysis engine, built with ext-php-rs. It brings Lindera's high-performance tokenization capabilities to the PHP ecosystem with support for PHP 8.1 and later.
Features
- Multi-language support: Tokenize Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), and Chinese (CC-CEDICT, Jieba) text
- Text processing pipeline: Compose character filters and token filters for flexible preprocessing and postprocessing
- CRF-based dictionary training: Train custom morphological analysis models from annotated corpora (requires the `train` feature)
- Multiple tokenization modes: Normal and decompose modes for different analysis granularity
- N-best tokenization: Retrieve multiple tokenization candidates ranked by cost
- User dictionaries: Extend system dictionaries with custom vocabulary
Documentation
- Installation -- Prerequisites, build instructions, and feature flags
- Quick Start -- A minimal example to get started
- Tokenizer API -- `TokenizerBuilder`, `Tokenizer`, and `Token` class reference
- Dictionary Management -- Loading, building, and managing dictionaries
- Text Processing Pipeline -- Character filters and token filters
- Training -- Training custom CRF models and exporting dictionaries
Installation
[!NOTE] lindera-php is not yet published to Packagist. You need to build from source.
Prerequisites
- PHP 8.1 or later
- Rust toolchain -- Install via rustup
- Composer -- PHP dependency manager (optional, for running tests)
Obtaining Dictionaries
Lindera does not bundle dictionaries with the package. You need to obtain a pre-built dictionary separately.
Download from GitHub Releases
Pre-built dictionaries are available on the GitHub Releases page. Download and extract the dictionary archive to a local directory:
# Example: download and extract the IPADIC dictionary
curl -LO https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip
unzip lindera-ipadic-<version>.zip -d /path/to/ipadic
Development Build
Build the lindera-php extension from the project root:
cargo build -p lindera-php
Or use the project Makefile:
make php-build
Build with Training Support
The train feature enables CRF-based dictionary training functionality:
cargo build -p lindera-php --features train
Feature Flags
| Feature | Description | Default |
|---|---|---|
train | CRF training functionality | Disabled |
embed-ipadic | Embed Japanese dictionary (IPADIC) into the binary | Disabled |
embed-unidic | Embed Japanese dictionary (UniDic) into the binary | Disabled |
embed-ipadic-neologd | Embed Japanese dictionary (IPADIC NEologd) into the binary | Disabled |
embed-ko-dic | Embed Korean dictionary (ko-dic) into the binary | Disabled |
embed-cc-cedict | Embed Chinese dictionary (CC-CEDICT) into the binary | Disabled |
embed-jieba | Embed Chinese dictionary (Jieba) into the binary | Disabled |
embed-cjk | Embed all CJK dictionaries (IPADIC, ko-dic, Jieba) into the binary | Disabled |
Multiple features can be combined:
cargo build -p lindera-php --features "train,embed-ipadic,embed-ko-dic"
[!TIP] If you want to embed a dictionary directly into the binary (advanced usage), enable the corresponding `embed-*` feature flag and load it using the `embedded://` scheme: `$dictionary = Lindera\Dictionary::load('embedded://ipadic');`. See Feature Flags for details.
Loading the Extension
Load the compiled shared library when running PHP:
php -d extension=target/debug/liblindera_php.so script.php
For release builds:
cargo build -p lindera-php --release
php -d extension=target/release/liblindera_php.so script.php
Alternatively, add the extension to your php.ini:
extension=/absolute/path/to/liblindera_php.so
Verifying the Installation
After building, verify that lindera is available in PHP:
php -d extension=target/debug/liblindera_php.so -r "echo Lindera\Dictionary::version() . PHP_EOL;"
Quick Start
This guide shows how to tokenize text using lindera-php.
Basic Tokenization
The recommended way to create a tokenizer is through TokenizerBuilder:
<?php
$builder = new Lindera\TokenizerBuilder();
$builder->setMode('normal');
$builder->setDictionary('/path/to/ipadic');
$tokenizer = $builder->build();
$tokens = $tokenizer->tokenize('関西国際空港限定トートバッグ');
foreach ($tokens as $token) {
echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}
Note: Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory.
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ UNK
Method Chaining
TokenizerBuilder supports method chaining for concise configuration:
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setMode('normal')
->setDictionary('/path/to/ipadic')
->build();
$tokens = $tokenizer->tokenize('すもももももももものうち');
foreach ($tokens as $token) {
echo $token->surface . "\t" . $token->getDetail(0) . "\n";
}
Accessing Token Properties
Each token exposes the following properties:
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder->setDictionary('/path/to/ipadic')->build();
$tokens = $tokenizer->tokenize('東京タワー');
foreach ($tokens as $token) {
echo "Surface: {$token->surface}\n";
echo "Byte range: {$token->byte_start}..{$token->byte_end}\n";
echo "Position: {$token->position}\n";
echo "Word ID: {$token->word_id}\n";
echo "Unknown: " . ($token->is_unknown ? 'true' : 'false') . "\n";
echo "Details: " . implode(',', $token->details) . "\n";
echo "\n";
}
N-best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder->setDictionary('/path/to/ipadic')->build();
$results = $tokenizer->tokenizeNbest('すもももももももものうち', 3);
foreach ($results as $result) {
$surfaces = array_map(fn($t) => $t->surface, $result->tokens);
echo "Cost {$result->cost}: " . implode(' / ', $surfaces) . "\n";
}
Tokenizer API
TokenizerBuilder
Lindera\TokenizerBuilder configures and constructs a Tokenizer instance using the builder pattern.
Constructors
new Lindera\TokenizerBuilder()
Creates a new builder with default configuration.
<?php
$builder = new Lindera\TokenizerBuilder();
$builder->fromFile($filePath)
Loads configuration from a JSON file.
<?php
$builder = new Lindera\TokenizerBuilder();
$builder->fromFile('config.json');
Configuration Methods
All setter methods return $this for method chaining.
setMode($mode)
Sets the tokenization mode.
"normal"-- Standard tokenization (default)"decompose"-- Decomposes compound words into smaller units
<?php
$builder->setMode('normal');
setDictionary($path)
Sets the system dictionary path or URI.
<?php
// Use an embedded dictionary
$builder->setDictionary('embedded://ipadic');
// Use an external dictionary
$builder->setDictionary('/path/to/dictionary');
setUserDictionary($uri)
Sets the user dictionary URI.
<?php
$builder->setUserDictionary('/path/to/user_dictionary');
setKeepWhitespace($keep)
Controls whether whitespace tokens appear in the output.
<?php
$builder->setKeepWhitespace(true);
appendCharacterFilter($kind, $args)
Appends a character filter to the preprocessing pipeline.
<?php
$builder->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc']);
appendTokenFilter($kind, $args)
Appends a token filter to the postprocessing pipeline.
<?php
$builder->appendTokenFilter('lowercase');
Build
build()
Builds and returns a Tokenizer with the configured settings.
<?php
$tokenizer = $builder->build();
Tokenizer
Lindera\Tokenizer performs morphological analysis on text.
Creating a Tokenizer
new Lindera\Tokenizer($dictionary, $mode, $userDictionary)
Creates a tokenizer directly from a loaded dictionary.
<?php
$dictionary = Lindera\Dictionary::load('embedded://ipadic');
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal');
With a user dictionary:
<?php
$dictionary = Lindera\Dictionary::load('embedded://ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal', $userDict);
Tokenizer Methods
tokenize($text)
Tokenizes the input text and returns an array of Token objects.
<?php
$tokens = $tokenizer->tokenize('形態素解析');
Parameters:
| Name | Type | Description |
|---|---|---|
$text | string | Text to tokenize |
Returns: array<Token>
tokenizeNbest($text, $n, $unique, $costThreshold)
Returns the N-best tokenization results as an array of NbestResult objects.
<?php
$results = $tokenizer->tokenizeNbest('すもももももももものうち', 3);
foreach ($results as $result) {
echo "Cost: {$result->cost}\n";
foreach ($result->tokens as $token) {
echo " {$token->surface}\n";
}
}
Parameters:
| Name | Type | Description |
|---|---|---|
$text | string | Text to tokenize |
$n | int | Number of results to return |
$unique | bool or null | Deduplicate results (default: false) |
$costThreshold | int or null | Maximum cost difference from the best path (default: null) |
Returns: array<NbestResult>
NbestResult
Lindera\NbestResult represents a single N-best tokenization result.
NbestResult Properties
| Property | Type | Description |
|---|---|---|
$tokens | array<Token> | The tokens in this result |
$cost | int | The total cost of this segmentation |
Token
Lindera\Token represents a single morphological token.
Token Properties
| Property | Type | Description |
|---|---|---|
$surface | string | Surface form of the token |
$byte_start | int | Start byte position in the original text |
$byte_end | int | End byte position in the original text |
$position | int | Token position index |
$word_id | int | Dictionary word ID |
$is_unknown | bool | true if the word is not in the dictionary |
$details | array<string> | Morphological details (part of speech, reading, etc.) |
Token Methods
getDetail($index)
Returns the detail string at the specified index, or null if the index is out of range.
<?php
$token = $tokenizer->tokenize('東京')[0];
$pos = $token->getDetail(0); // e.g., "名詞"
$subpos = $token->getDetail(1); // e.g., "固有名詞"
$reading = $token->getDetail(7); // e.g., "トウキョウ"
Parameters:
| Name | Type | Description |
|---|---|---|
$index | int | Zero-based index into the details array |
Returns: string|null
The structure of details depends on the dictionary:
- IPADIC: `[品詞, 品詞細分類1, 品詞細分類2, 品詞細分類3, 活用型, 活用形, 原形, 読み, 発音]`
- UniDic: Detailed morphological features following the UniDic specification
- ko-dic / CC-CEDICT / Jieba: Dictionary-specific detail formats
Dictionary Management
Lindera PHP provides static methods on the Lindera\Dictionary class for loading, building, and managing dictionaries used in morphological analysis.
Loading Dictionaries
System Dictionaries
Use Lindera\Dictionary::load($uri) to load a system dictionary. Download a pre-built dictionary from GitHub Releases and specify the path to the extracted directory:
<?php
$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
Embedded dictionaries (advanced) -- if you built with an embed-* feature flag, you can load an embedded dictionary:
<?php
$dictionary = Lindera\Dictionary::load('embedded://ipadic');
User Dictionaries
User dictionaries add custom vocabulary on top of a system dictionary.
<?php
$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);
Pass the user dictionary when creating a tokenizer directly:
<?php
$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
$metadata = $dictionary->metadata();
$userDict = Lindera\Dictionary::loadUser('/path/to/user_dictionary', $metadata);
$tokenizer = new Lindera\Tokenizer($dictionary, 'normal', $userDict);
Or via the builder:
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('/path/to/ipadic')
->setUserDictionary('/path/to/user_dictionary')
->build();
Building Dictionaries
System Dictionary
Build a system dictionary from source files:
<?php
$metadata = Lindera\Metadata::fromJsonFile('/path/to/metadata.json');
Lindera\Dictionary::build('/path/to/input_dir', '/path/to/output_dir', $metadata);
The input directory should contain the dictionary source files (CSV lexicon, matrix.def, etc.).
User Dictionary
Build a user dictionary from a CSV file:
<?php
$metadata = new Lindera\Metadata();
Lindera\Dictionary::buildUser('ipadic', 'user_words.csv', '/path/to/output_dir', $metadata);
Metadata
The Lindera\Metadata class configures dictionary parameters.
Creating Metadata
<?php
// Default metadata
$metadata = new Lindera\Metadata();
// Custom metadata
$metadata = new Lindera\Metadata(
name: 'my_dictionary',
encoding: 'UTF-8',
default_word_cost: -10000,
);
// Create with all defaults explicitly
$metadata = Lindera\Metadata::createDefault();
Loading from JSON
<?php
$metadata = Lindera\Metadata::fromJsonFile('metadata.json');
Properties
| Property | Type | Default | Description |
|---|---|---|---|
name | string | "default" | Dictionary name |
encoding | string | "UTF-8" | Character encoding |
default_word_cost | int | -10000 | Default cost for unknown words |
default_left_context_id | int | 1288 | Default left context ID |
default_right_context_id | int | 1288 | Default right context ID |
default_field_value | string | "*" | Default value for missing fields |
flexible_csv | bool | false | Allow flexible CSV parsing |
skip_invalid_cost_or_id | bool | false | Skip entries with invalid cost or ID |
normalize_details | bool | false | Normalize morphological details |
dictionary_schema_fields | array<string> | IPADIC schema | Schema fields for the main dictionary |
user_dictionary_schema_fields | array<string> | Minimal schema | Schema fields for user dictionaries |
All properties are read-only via getter methods:
<?php
$metadata = new Lindera\Metadata(name: 'custom_dict', encoding: 'EUC-JP');
echo $metadata->name; // "custom_dict"
echo $metadata->encoding; // "EUC-JP"
toArray()
Returns an associative array representation of the metadata:
<?php
$metadata = new Lindera\Metadata(name: 'test');
print_r($metadata->toArray());
Dictionary Info
The Lindera\Dictionary object provides metadata accessors:
<?php
$dictionary = Lindera\Dictionary::load('/path/to/ipadic');
echo $dictionary->metadataName(); // Dictionary name
echo $dictionary->metadataEncoding(); // Dictionary encoding
$metadata = $dictionary->metadata(); // Full Metadata object
Version
Retrieve the Lindera library version:
<?php
echo Lindera\Dictionary::version();
Text Processing Pipeline
Lindera PHP supports a composable text processing pipeline that applies character filters before tokenization and token filters after tokenization. Filters are added to the TokenizerBuilder and executed in the order they are appended.
Input Text
--> Character Filters (preprocessing)
--> Tokenization
--> Token Filters (postprocessing)
--> Output Tokens
Character Filters
Character filters transform the input text before tokenization.
unicode_normalize
Applies Unicode normalization to the input text.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc'])
->build();
Supported normalization forms: "nfc", "nfkc", "nfd", "nfkd".
mapping
Replaces characters or strings according to a mapping table.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendCharacterFilter('mapping', [
'mapping' => [
"\u{30FC}" => '-',
"\u{FF5E}" => '~',
],
])
->build();
japanese_iteration_mark
Resolves Japanese iteration marks (odoriji) into their full forms.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendCharacterFilter('japanese_iteration_mark', [
'normalize_kanji' => 'true',
'normalize_kana' => 'true',
])
->build();
Token Filters
Token filters transform or remove tokens after tokenization.
lowercase
Converts token surface forms to lowercase.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendTokenFilter('lowercase')
->build();
japanese_base_form
Replaces inflected forms with their base (dictionary) form using the morphological details from the dictionary.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendTokenFilter('japanese_base_form', [])
->build();
japanese_stop_tags
Removes tokens whose part-of-speech matches any of the specified tags.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendTokenFilter('japanese_stop_tags', [
'tags' => ['助詞', '助動詞'],
])
->build();
japanese_keep_tags
Keeps only tokens whose part-of-speech matches one of the specified tags. All other tokens are removed.
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('embedded://ipadic')
->appendTokenFilter('japanese_keep_tags', [
'tags' => ['名詞'],
])
->build();
Complete Pipeline Example
The following example combines multiple character filters and token filters into a single pipeline:
<?php
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setMode('normal')
->setDictionary('embedded://ipadic')
// Preprocessing
->appendCharacterFilter('unicode_normalize', ['kind' => 'nfkc'])
->appendCharacterFilter('japanese_iteration_mark', [
'normalize_kanji' => 'true',
'normalize_kana' => 'true',
])
// Postprocessing
->appendTokenFilter('japanese_base_form', [])
->appendTokenFilter('japanese_stop_tags', [
'tags' => ['助詞', '助動詞', '記号'],
])
->appendTokenFilter('lowercase')
->build();
$tokens = $tokenizer->tokenize('Linderaは形態素解析を行うライブラリです。');
foreach ($tokens as $token) {
echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}
In this pipeline:
- `unicode_normalize` converts full-width characters to half-width (NFKC normalization)
- `japanese_iteration_mark` resolves iteration marks
- `japanese_base_form` converts inflected tokens to base form
- `japanese_stop_tags` removes particles, auxiliary verbs, and symbols
- `lowercase` normalizes alphabetic characters to lowercase
Training
Lindera PHP supports training custom CRF-based morphological analysis models from annotated corpora. This functionality requires the train feature.
Prerequisites
Build lindera-php with the train feature enabled:
cargo build -p lindera-php --features train,embed-ipadic
Training a Model
Use Lindera\Trainer::train() to train a CRF model from a seed lexicon and annotated corpus:
<?php
Lindera\Trainer::train(
seed: 'resources/training/seed.csv',
corpus: 'resources/training/corpus.txt',
char_def: 'resources/training/char.def',
unk_def: 'resources/training/unk.def',
feature_def: 'resources/training/feature.def',
rewrite_def: 'resources/training/rewrite.def',
output: '/tmp/model.dat',
lambda: 0.01,
max_iter: 100,
max_threads: null,
);
Training Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
$seed | string | required | Path to the seed lexicon file (CSV format) |
$corpus | string | required | Path to the annotated training corpus |
$char_def | string | required | Path to the character definition file (char.def) |
$unk_def | string | required | Path to the unknown word definition file (unk.def) |
$feature_def | string | required | Path to the feature definition file (feature.def) |
$rewrite_def | string | required | Path to the rewrite rule definition file (rewrite.def) |
$output | string | required | Output path for the trained model file |
$lambda | float | 0.01 | L1 regularization cost (0.0--1.0) |
$max_iter | int | 100 | Maximum number of training iterations |
$max_threads | int or null | null | Number of threads (null = auto-detect CPU cores) |
Exporting a Trained Model
After training, export the model to dictionary source files using Lindera\Trainer::export():
<?php
Lindera\Trainer::export(
model: '/tmp/model.dat',
output: '/tmp/dictionary_source',
metadata: 'resources/training/metadata.json',
);
Export Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
$model | string | required | Path to the trained model file (.dat) |
$output | string | required | Output directory for dictionary source files |
$metadata | string or null | null | Path to a base metadata.json file |
The export creates the following files in the output directory:
- `lex.csv` -- Lexicon entries with trained costs
- `matrix.def` -- Connection cost matrix
- `unk.def` -- Unknown word definitions
- `char.def` -- Character category definitions
- `metadata.json` -- Updated metadata (when the `$metadata` parameter is provided)
Complete Workflow
The full workflow for training and using a custom dictionary:
<?php
// Step 1: Train the CRF model
Lindera\Trainer::train(
seed: 'resources/training/seed.csv',
corpus: 'resources/training/corpus.txt',
char_def: 'resources/training/char.def',
unk_def: 'resources/training/unk.def',
feature_def: 'resources/training/feature.def',
rewrite_def: 'resources/training/rewrite.def',
output: '/tmp/model.dat',
lambda: 0.01,
max_iter: 100,
);
// Step 2: Export to dictionary source files
Lindera\Trainer::export(
model: '/tmp/model.dat',
output: '/tmp/dictionary_source',
metadata: 'resources/training/metadata.json',
);
// Step 3: Build the dictionary from exported source files
$metadata = Lindera\Metadata::fromJsonFile('/tmp/dictionary_source/metadata.json');
Lindera\Dictionary::build('/tmp/dictionary_source', '/tmp/dictionary', $metadata);
// Step 4: Use the trained dictionary
$builder = new Lindera\TokenizerBuilder();
$tokenizer = $builder
->setDictionary('/tmp/dictionary')
->setMode('normal')
->build();
$tokens = $tokenizer->tokenize('形態素解析のテスト');
foreach ($tokens as $token) {
echo $token->surface . "\t" . implode(',', $token->details) . "\n";
}
Lindera WASM
Lindera WASM provides WebAssembly bindings for Lindera's morphological analysis engine, built with wasm-bindgen. It enables Japanese, Korean, and Chinese text tokenization directly in web browsers, Node.js, and bundler environments.
Distribution Formats
Lindera WASM supports multiple distribution formats via wasm-pack:
| Target | Use Case | Module System |
|---|---|---|
web | Browser ESM | ES Modules |
bundler | Webpack, Vite, Rollup | ES Modules (bundler-resolved) |
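The two targets differ mainly in initialization: with the web target you must call the generated init function before using the API (as shown in Quick Start), while the bundler target lets the bundler instantiate the WASM module during import. A minimal sketch, assuming a package published under the naming convention described in Installation (lindera-wasm-bundler-ipadic is a hypothetical package name):
// Bundler target: no manual init call; Webpack/Vite/Rollup instantiate the module on import
import { TokenizerBuilder } from 'lindera-wasm-bundler-ipadic';
const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();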
Dictionary Packages
Each package embeds a specific dictionary for offline use:
| Feature Flag | Dictionary | Language |
|---|---|---|
| (none) | No embedded dictionary | -- |
embed-ipadic | IPADIC | Japanese |
embed-unidic | UniDic | Japanese |
embed-ko-dic | ko-dic | Korean |
embed-cc-cedict | CC-CEDICT | Chinese |
embed-jieba | Jieba | Chinese |
embed-cjk | IPADIC + ko-dic + Jieba | CJK |
Sections
- Installation -- Building and installing lindera-wasm packages
- Quick Start -- Minimal working example
- Tokenizer API -- Full API reference for JavaScript/TypeScript
- Dictionary Management -- Loading and building dictionaries
- Browser Usage -- Integration with web applications
- OPFS Dictionary Storage -- Persistent dictionary caching with OPFS
Installation
Prerequisites
- Rust toolchain -- Install via rustup
- wasm-pack -- Build tool for compiling Rust to WebAssembly and generating the npm package
Obtaining Dictionaries
Lindera WASM does not bundle dictionaries by default. The recommended approach for browser environments is to download dictionaries at runtime using the OPFS (Origin Private File System) API.
Download from GitHub Releases
Pre-built dictionaries are available on the GitHub Releases page. In browser environments, use the OPFS helpers to download and cache dictionaries:
import { downloadDictionary, hasDictionary } from 'lindera-wasm-web/opfs';
if (!await hasDictionary("ipadic")) {
await downloadDictionary(
"https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
"ipadic",
);
}
See OPFS Dictionary Storage for the full workflow.
Building with wasm-pack
Build the WASM package for your target environment:
Web (ES Modules for browsers)
wasm-pack build --target web
Bundler (Webpack, Vite, Rollup)
wasm-pack build --target bundler
The output is written to the pkg/ directory inside the lindera-wasm crate.
Available Feature Flags (Advanced)
For advanced users who want to embed dictionaries directly into the WASM binary, the following feature flags are available. This increases the binary size significantly but eliminates the need to download dictionaries at runtime.
| Feature | Dictionary | Language |
|---|---|---|
embed-ipadic | IPADIC | Japanese |
embed-unidic | UniDic | Japanese |
embed-ko-dic | ko-dic | Korean |
embed-cc-cedict | CC-CEDICT | Chinese |
embed-jieba | Jieba | Chinese |
embed-cjk | IPADIC + ko-dic + Jieba | CJK (all) |
You can combine multiple dictionaries by enabling multiple feature flags:
wasm-pack build --target web --features embed-ipadic,embed-ko-dic
NPM Package Naming Convention
When publishing to npm, the recommended naming convention is:
lindera-wasm-{target}
lindera-wasm-{target}-{dict}
Examples:
- `lindera-wasm-web`
- `lindera-wasm-web-ipadic`
- `lindera-wasm-bundler-unidic`
- `lindera-wasm-web-cjk`
To set the package name before publishing, edit the name field in the generated pkg/package.json.
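For example, to publish the web build with embedded IPADIC under the conventional name, change only the name field and leave the other generated fields as wasm-pack wrote them:
{
  "name": "lindera-wasm-web-ipadic"
}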
Installing from npm
Pre-built packages are available on npm:
npm install lindera-wasm-web
Or with yarn:
yarn add lindera-wasm-web
[!NOTE] The npm package does not include dictionaries. Use the OPFS helpers to download dictionaries at runtime. See OPFS Dictionary Storage.
Quick Start
Web (Browser) -- OPFS Dictionary Loading
The recommended approach is to download dictionaries at runtime using the OPFS helpers:
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from 'lindera-wasm-web/opfs';
async function main() {
await __wbg_init();
// Download dictionary if not cached
if (!await hasDictionary("ipadic")) {
await downloadDictionary(
"https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
"ipadic",
);
}
// Load dictionary from OPFS
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
// Build tokenizer
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
tokens.forEach(token => {
console.log(`${token.surface}\t${token.details.join(',')}`);
});
}
main();
Note: Download a pre-built dictionary from GitHub Releases. See OPFS Dictionary Storage for the full workflow.
Expected output:
関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ 名詞,一般,*,*,*,*,*,*,*
Using Embedded Dictionaries (Advanced)
If you built with an embed-* feature flag, you can use embedded dictionaries:
import __wbg_init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';
async function main() {
await __wbg_init();
const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");
tokens.forEach(token => {
console.log(`${token.surface}\t${token.details.join(',')}`);
});
}
main();
Using Filters
You can add character filters and token filters to the tokenization pipeline:
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
async function main() {
await __wbg_init();
// Assume dictionary is already cached in OPFS
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
// Add Unicode NFKC normalization
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
// Add a stop-tags filter to remove particles and auxiliary verbs
builder.appendTokenFilter("japanese_stop_tags", {
tags: ["助詞", "助動詞"]
});
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("Linderaは形態素解析エンジンです");
tokens.forEach(token => {
console.log(`${token.surface}\t${token.details.join(',')}`);
});
}
main();
N-Best Tokenization
Retrieve multiple tokenization candidates ranked by cost:
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
results.forEach((result, rank) => {
console.log(`--- NBEST ${rank + 1} (cost=${result.cost}) ---`);
result.tokens.forEach(token => {
console.log(`${token.surface}\t${token.details.join(',')}`);
});
});
Tokenizer API
This page documents the JavaScript/TypeScript API exposed by lindera-wasm.
TokenizerBuilder
Builder class for creating a configured Tokenizer instance.
Constructor
const builder = new TokenizerBuilder();
Creates a new builder with default settings.
Methods
setMode(mode)
Sets the tokenization mode.
- Parameters:
  - `mode` (string) -- `"normal"` or `"decompose"`
- Returns: void
builder.setMode("normal");
setDictionary(uri)
Sets the dictionary to use for tokenization.
- Parameters:
  - `uri` (string) -- Dictionary URI (e.g., `"embedded://ipadic"`)
- Returns: void
builder.setDictionary("embedded://ipadic");
setDictionaryInstance(dictionary)
Sets a pre-loaded dictionary instance for tokenization.
Use this when the dictionary has been loaded from bytes (e.g., via loadDictionaryFromBytes()) instead of from a URI.
- Parameters:
  - `dictionary` (Dictionary) -- A loaded dictionary object
- Returns: void
import { loadDictionaryFromBytes } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
builder.setDictionaryInstance(dictionary);
setUserDictionary(uri)
Sets a user-defined dictionary by URI.
- Parameters:
  - `uri` (string) -- Path or URI to the user dictionary
- Returns: void
builder.setUserDictionary("file:///path/to/user_dict.csv");
setUserDictionaryInstance(userDictionary)
Sets a pre-loaded user dictionary instance. Use this when the user dictionary has been loaded from bytes instead of from a URI.
- Parameters:
  - `userDictionary` (UserDictionary) -- A loaded user dictionary object
- Returns: void
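A minimal sketch, assuming userDictionary is a UserDictionary instance loaded earlier (for example via loadUserDictionary(), described under Helper Functions):
// userDictionary: a UserDictionary loaded previously from bytes or a URI
builder.setUserDictionaryInstance(userDictionary);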
setKeepWhitespace(keep)
Sets whether whitespace tokens are preserved in the output.
- Parameters:
  - `keep` (boolean) -- `true` to keep whitespace tokens
- Returns: void
builder.setKeepWhitespace(true);
appendCharacterFilter(name, args)
Appends a character filter to the preprocessing pipeline.
- Parameters:
  - `name` (string) -- Filter name (e.g., `"unicode_normalize"`, `"japanese_iteration_mark"`)
  - `args` (object, optional) -- Filter configuration
- Returns: void
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
appendTokenFilter(name, args)
Appends a token filter to the postprocessing pipeline.
- Parameters:
  - `name` (string) -- Filter name (e.g., `"japanese_stop_tags"`, `"lowercase"`)
  - `args` (object, optional) -- Filter configuration
- Returns: void
builder.appendTokenFilter("japanese_stop_tags", {
tags: ["助詞", "助動詞", "記号"]
});
build()
Builds and returns a configured Tokenizer instance. Consumes the builder.
- Returns: `Tokenizer`
const tokenizer = builder.build();
Tokenizer
The main tokenizer class. Can be created via TokenizerBuilder.build() or directly via the constructor.
Tokenizer Constructor
const tokenizer = new Tokenizer(dictionary, mode, userDictionary);
- Parameters:
  - `dictionary` (Dictionary) -- A loaded dictionary object
  - `mode` (string, optional) -- Tokenization mode (`"normal"` or `"decompose"`, defaults to `"normal"`)
  - `userDictionary` (UserDictionary, optional) -- A loaded user dictionary
Tokenizer Methods
tokenize(text)
Tokenizes the input text.
- Parameters:
  - `text` (string) -- Text to tokenize
- Returns: `Token[]` -- Array of token objects
const tokens = tokenizer.tokenize("関西国際空港");
tokenizeNbest(text, n, unique?, costThreshold?)
Returns N-best tokenization results ordered by total path cost.
- Parameters:
  - `text` (string) -- Text to tokenize
  - `n` (number) -- Number of results to return
  - `unique` (boolean, optional) -- Deduplicate results with identical segmentation (default: `false`)
  - `costThreshold` (number, optional) -- Only return paths within `bestCost + threshold`
- Returns: Array of `{ tokens: object[], cost: number }`
const results = tokenizer.tokenizeNbest("すもももももももものうち", 3);
Token
Represents a single token produced by the tokenizer.
Properties
| Property | Type | Description |
|---|---|---|
surface | string | Surface form of the token |
byteStart | number | Start byte offset in the original text |
byteEnd | number | End byte offset in the original text |
position | number | Position index of the token |
wordId | number | Word ID in the dictionary |
isUnknown | boolean | Whether the token is an unknown word |
details | string[] | Morphological detail fields |
Token Methods
getDetail(index)
Returns the detail string at the specified index.
- Parameters:
  - `index` (number) -- Zero-based index into the details array
- Returns: `string | undefined`
const pos = token.getDetail(0); // e.g., "名詞"
const reading = token.getDetail(7); // e.g., "トウキョウ"
toJSON()
Returns a plain JavaScript object representation of the token.
- Returns: `object` with keys: `surface`, `byteStart`, `byteEnd`, `position`, `wordId`, `isUnknown`, `details`
console.log(JSON.stringify(token.toJSON(), null, 2));
Helper Functions
loadDictionary(uri)
Loads a dictionary from the specified URI.
- Parameters:
  - `uri` (string) -- Dictionary URI (e.g., `"embedded://ipadic"`)
- Returns: `Dictionary`
import { loadDictionary } from 'lindera-wasm-web-ipadic';
const dict = loadDictionary("embedded://ipadic");
loadUserDictionary(uri, metadata)
Loads a user dictionary from the specified URI.
- Parameters:
  - uri (string) -- Path or URI to the user dictionary file
  - metadata (Metadata) -- Dictionary metadata object
- Returns: UserDictionary
buildDictionary(inputDir, outputDir, metadata)
Builds a compiled dictionary from source files.
- Parameters:
  - inputDir (string) -- Path to the directory containing source dictionary files
  - outputDir (string) -- Path to the output directory
  - metadata (Metadata) -- Dictionary metadata object
- Returns: void
buildUserDictionary(inputFile, outputDir, metadata?)
Builds a compiled user dictionary from a CSV file.
- Parameters:
  - inputFile (string) -- Path to the user dictionary CSV file
  - outputDir (string) -- Path to the output directory
  - metadata (Metadata, optional) -- Dictionary metadata object
- Returns: void
version() / getVersion()
Returns the version string of the lindera-wasm package.
- Returns: string
import { version } from 'lindera-wasm-web-ipadic';
console.log(version()); // e.g., "2.1.1"
Enums and Utility Classes
Mode
Tokenization mode enum.
| Value | Description |
|---|---|
Mode.Normal | Standard tokenization based on dictionary cost |
Mode.Decompose | Decompose compound words using penalty-based segmentation |
Penalty
Configuration for decompose mode. Controls how aggressively compound words are decomposed.
const penalty = new Penalty(
kanjiThreshold?, // Kanji length threshold (default: 2)
kanjiPenalty?, // Kanji length penalty (default: 3000)
otherThreshold?, // Other character length threshold (default: 7)
otherPenalty?, // Other character length penalty (default: 1700)
);
| Property | Type | Default | Description |
|---|---|---|---|
kanji_penalty_length_threshold | number | 2 | Length threshold for kanji compound splitting |
kanji_penalty_length_penalty | number | 3000 | Penalty cost for kanji compounds exceeding threshold |
other_penalty_length_threshold | number | 7 | Length threshold for non-kanji compound splitting |
other_penalty_length_penalty | number | 1700 | Penalty cost for non-kanji compounds exceeding threshold |
LinderaError
Error type for Lindera operations.
const error = new LinderaError("message");
console.log(error.message); // "message"
console.log(error.toString()); // "message"
| Property / Method | Type | Description |
|---|---|---|
message | string | Error message |
toString() | string | Returns the error message |
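Operations that fail (for example, loading a dictionary from an invalid URI) surface as catchable JavaScript errors; a minimal sketch:
try {
  const dict = loadDictionary("embedded://no-such-dict");
} catch (e) {
  console.error(`Dictionary load failed: ${e}`);
}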
Snake-Case Aliases
For consistency with the Python API, all methods are also available in snake_case form:
| camelCase | snake_case |
|---|---|
setMode() | set_mode() |
setDictionary() | set_dictionary() |
setDictionaryInstance() | set_dictionary_instance() |
setUserDictionary() | set_user_dictionary() |
setUserDictionaryInstance() | set_user_dictionary_instance() |
setKeepWhitespace() | set_keep_whitespace() |
appendCharacterFilter() | append_character_filter() |
appendTokenFilter() | append_token_filter() |
tokenizeNbest() | tokenize_nbest() |
loadDictionary() | load_dictionary() |
loadDictionaryFromBytes() | load_dictionary_from_bytes() |
loadUserDictionary() | load_user_dictionary() |
buildDictionary() | build_dictionary() |
buildUserDictionary() | build_user_dictionary() |
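Both spellings refer to the same underlying method, so a builder can be configured in either style:
const builder = new TokenizerBuilder();
builder.set_mode("normal");       // snake_case alias
builder.setKeepWhitespace(true);  // camelCase original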
Dictionary Management
Loading Dictionaries from OPFS
The recommended way to use dictionaries in WASM is to download them from GitHub Releases and load them via OPFS. This avoids embedding large dictionaries in the WASM binary.
Loading from Bytes
Use loadDictionaryFromBytes() to construct a Dictionary from raw byte arrays stored in OPFS or other browser storage.
loadDictionaryFromBytes(metadata, dictDa, dictVals, dictWordsIdx, dictWords, matrixMtx, charDef, unk)
- Parameters:
  - metadata (Uint8Array) -- Contents of metadata.json
  - dictDa (Uint8Array) -- Contents of dict.da (Double-Array Trie)
  - dictVals (Uint8Array) -- Contents of dict.vals (word value data)
  - dictWordsIdx (Uint8Array) -- Contents of dict.wordsidx (word details index)
  - dictWords (Uint8Array) -- Contents of dict.words (word details)
  - matrixMtx (Uint8Array) -- Contents of matrix.mtx (connection cost matrix)
  - charDef (Uint8Array) -- Contents of char_def.bin (character definitions)
  - unk (Uint8Array) -- Contents of unk.bin (unknown word dictionary)
- Returns: Dictionary
import { loadDictionaryFromBytes, TokenizerBuilder } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
// Load dictionary files from OPFS
const files = await loadDictionaryFiles("ipadic");
// Create a Dictionary from bytes
const dictionary = loadDictionaryFromBytes(
files.metadata,
files.dictDa,
files.dictVals,
files.dictWordsIdx,
files.dictWords,
files.matrixMtx,
files.charDef,
files.unk,
);
// Use with TokenizerBuilder
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();
See OPFS Dictionary Storage for the full OPFS workflow including downloading and caching.
Embedded Dictionaries (Advanced)
If you built with an embed-* feature flag, you can load embedded dictionaries via the embedded:// URI scheme. This increases the WASM binary size significantly.
Loading an Embedded Dictionary
import { loadDictionary } from 'lindera-wasm-web-ipadic';
const dictionary = loadDictionary("embedded://ipadic");
Available embedded dictionary URIs (depending on which features were enabled at build time):
| URI | Feature Flag |
|---|---|
embedded://ipadic | embed-ipadic |
embedded://unidic | embed-unidic |
embedded://ko-dic | embed-ko-dic |
embedded://cc-cedict | embed-cc-cedict |
embedded://jieba | embed-jieba |
Using with TokenizerBuilder
const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();
Using with Tokenizer Constructor
import { loadDictionary, Tokenizer } from 'lindera-wasm-web-ipadic';
const dictionary = loadDictionary("embedded://ipadic");
const tokenizer = new Tokenizer(dictionary, "normal");
Dictionary Class
The Dictionary class represents a loaded morphological analysis dictionary.
Properties
| Property | Type | Description |
|---|---|---|
name | string | Dictionary name (e.g., "ipadic") |
encoding | string | Character encoding of the dictionary |
metadata | Metadata | Full metadata object |
console.log(dictionary.name); // "ipadic"
console.log(dictionary.encoding); // "utf-8"
User Dictionaries
User dictionaries allow you to add custom words that are not in the system dictionary.
Loading a User Dictionary
import { loadUserDictionary } from 'lindera-wasm-web';
const metadata = dictionary.metadata;
const userDict = loadUserDictionary("/path/to/user_dict.csv", metadata);
Using a User Dictionary with Tokenizer
import { loadDictionaryFromBytes, loadUserDictionary, Tokenizer } from 'lindera-wasm-web';
import { loadDictionaryFiles } from 'lindera-wasm-web/opfs';
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const userDict = loadUserDictionary("/path/to/user_dict.csv", dictionary.metadata);
const tokenizer = new Tokenizer(dictionary, "normal", userDict);
User Dictionary CSV Format
The user dictionary CSV follows the same format as the Lindera user dictionary:
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
Each line contains: surface,part_of_speech,reading
Building Dictionaries
You can build compiled dictionaries from source files using the JavaScript API.
Building a System Dictionary
import { buildDictionary } from 'lindera-wasm-web';
const metadata = {
name: "custom-dict",
encoding: "utf-8",
// ... other metadata fields
};
buildDictionary("/path/to/source/dir", "/path/to/output/dir", metadata);
Building a User Dictionary
import { buildUserDictionary } from 'lindera-wasm-web';
buildUserDictionary("/path/to/user_dict.csv", "/path/to/output/dir");
The metadata parameter is optional for buildUserDictionary. If omitted, default metadata is used.
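To build against a specific dictionary's schema instead of the defaults, pass the metadata of a loaded dictionary (see the Dictionary class above):
import { buildUserDictionary } from 'lindera-wasm-web';
buildUserDictionary("/path/to/user_dict.csv", "/path/to/output/dir", dictionary.metadata);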
Metadata
The Metadata class configures dictionary parameters.
Constructor
const metadata = new Metadata(name?, encoding?);
- Parameters:
  - name (string, optional) -- Dictionary name (default: "default")
  - encoding (string, optional) -- Character encoding (default: "UTF-8")
Static Methods
Metadata.createDefault()
Creates a Metadata instance with default values.
const metadata = Metadata.createDefault();
Metadata Properties
| Property | Type | Default | Description |
|---|---|---|---|
name | string | "default" | Dictionary name |
encoding | string | "UTF-8" | Character encoding |
dictionary_schema | Schema | IPADIC schema | Schema for the main dictionary |
user_dictionary_schema | Schema | Minimal schema | Schema for user dictionaries |
All properties support both getting and setting:
const metadata = Metadata.createDefault();
metadata.name = "custom_dict";
metadata.encoding = "EUC-JP";
console.log(metadata.name); // "custom_dict"
You can also access the metadata from a loaded dictionary via dictionary.metadata.
Schema
The Schema class defines the field structure of dictionary entries.
Schema Constructor
const schema = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);
Schema Static Methods
- Schema.create_default() -- Creates a default IPADIC-like schema
Schema Methods
| Method | Returns | Description |
|---|---|---|
get_field_index(name) | number | undefined | Get field index by name |
field_count() | number | Total number of fields |
get_field_name(index) | string | undefined | Get field name by index |
get_custom_fields() | string[] | Fields beyond index 3 (morphological features) |
get_all_fields() | string[] | All field names |
get_field_by_name(name) | FieldDefinition | undefined | Get full field definition |
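A short usage sketch of these methods, based on the six-field schema from the constructor example above (expected values follow from the table: the first four fields are fixed, the remainder are custom):
const schema = new Schema(["surface", "left_id", "right_id", "cost", "pos", "reading"]);
console.log(schema.field_count());           // 6
console.log(schema.get_field_index("pos"));  // 4
console.log(schema.get_field_name(5));       // "reading"
console.log(schema.get_custom_fields());     // ["pos", "reading"]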
FieldDefinition
| Property | Type | Description |
|---|---|---|
index | number | Field position index |
name | string | Field name |
field_type | FieldType | Field type enum |
description | string | undefined | Optional description |
FieldType
| Value | Description |
|---|---|
FieldType.Surface | Word surface text |
FieldType.LeftContextId | Left context ID |
FieldType.RightContextId | Right context ID |
FieldType.Cost | Word cost |
FieldType.Custom | Morphological feature field |
Browser Usage
ES Module Import
In browser environments, you must initialize the WASM module before using any Lindera functions. The default export __wbg_init handles this initialization.
The recommended approach is to load dictionaries from OPFS rather than embedding them in the WASM binary:
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from 'lindera-wasm-web/opfs';
async function main() {
// Initialize the WASM module (must be called once before using any API)
await __wbg_init();
// Download dictionary if not cached
if (!await hasDictionary("ipadic")) {
await downloadDictionary(
"https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
"ipadic",
);
}
// Load dictionary from OPFS
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("形態素解析を行います");
tokens.forEach(token => {
console.log(`${token.surface}: ${token.details.join(',')}`);
});
}
main();
Using Embedded Dictionaries (Advanced)
If you built with an embed-* feature flag, you can use embedded dictionaries instead of OPFS:
import __wbg_init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';
async function main() {
await __wbg_init();
const builder = new TokenizerBuilder();
builder.setDictionary("embedded://ipadic");
builder.setMode("normal");
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("形態素解析を行います");
tokens.forEach(token => {
console.log(`${token.surface}: ${token.details.join(',')}`);
});
}
main();
HTML Example
A minimal HTML page using lindera-wasm with OPFS dictionary loading:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Lindera WASM Demo</title>
</head>
<body>
<textarea id="input" rows="4" cols="50">関西国際空港限定トートバッグ</textarea>
<br>
<button id="tokenize" disabled>Tokenize</button>
<pre id="output">Loading dictionary...</pre>
<script type="module">
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from './pkg/lindera_wasm.js';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from './pkg/opfs.js';
let tokenizer;
async function init() {
await __wbg_init();
// Download dictionary if not cached
if (!await hasDictionary("ipadic")) {
document.getElementById('output').textContent = 'Downloading dictionary...';
await downloadDictionary(
"https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip",
"ipadic",
);
}
// Load dictionary from OPFS
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
tokenizer = builder.build();
document.getElementById('tokenize').disabled = false;
document.getElementById('output').textContent = 'Ready!';
}
document.getElementById('tokenize').addEventListener('click', () => {
const text = document.getElementById('input').value;
const tokens = tokenizer.tokenize(text);
const output = tokens.map(t =>
`${t.surface}\t${t.details.join(',')}`
).join('\n');
document.getElementById('output').textContent = output;
});
init();
</script>
</body>
</html>
Webpack Configuration
When using Webpack 5, enable the asyncWebAssembly experiment:
// webpack.config.js
module.exports = {
experiments: {
asyncWebAssembly: true,
},
module: {
rules: [
{
test: /\.wasm$/,
type: "webassembly/async",
},
],
},
};
Then import using the bundler target build:
import { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-bundler';
import { loadDictionaryFiles } from 'lindera-wasm-bundler/opfs';
// Load dictionary from OPFS (see OPFS Dictionary Storage for setup)
const files = await loadDictionaryFiles("ipadic");
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();
With the bundler target, __wbg_init() is called automatically by the bundler.
Vite / Rollup Setup
Vite supports WASM out of the box with the web target. Place the built pkg/ directory in your project and import directly:
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from './pkg/lindera_wasm.js';
import { loadDictionaryFiles } from './pkg/opfs.js';
await __wbg_init();
// Load dictionary from OPFS and use TokenizerBuilder as shown above
For the bundler target with Vite, you may need the vite-plugin-wasm plugin:
// vite.config.js
import wasm from 'vite-plugin-wasm';
export default {
plugins: [wasm()],
};
Chrome Extension Considerations
Chrome extensions using Manifest V3 restrict WebAssembly.compile and WebAssembly.instantiate by default. To use lindera-wasm in an extension, you need to add wasm-unsafe-eval to your Content Security Policy:
{
"content_security_policy": {
"extension_pages": "script-src 'self' 'wasm-unsafe-eval'; object-src 'self'"
}
}
Note that wasm-unsafe-eval only allows WebAssembly execution and does not permit arbitrary JavaScript eval().
Performance Tips
- Initialize once: Call __wbg_init() once at application startup, not on every tokenization request.
- Reuse the tokenizer: Create the Tokenizer instance once and reuse it for multiple calls to tokenize().
- Web Workers: For heavy tokenization workloads, consider running Lindera in a Web Worker to avoid blocking the main thread, as sketched below.
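A minimal Web Worker sketch under the same OPFS setup shown above (the file name tokenizer-worker.js is illustrative, not part of the package):
// tokenizer-worker.js -- created with: new Worker('./tokenizer-worker.js', { type: 'module' })
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from './pkg/lindera_wasm.js';
import { loadDictionaryFiles } from './pkg/opfs.js';
let tokenizer;
self.onmessage = async (event) => {
  if (!tokenizer) {
    // Lazy one-time initialization inside the worker
    await __wbg_init();
    const files = await loadDictionaryFiles("ipadic");
    const dictionary = loadDictionaryFromBytes(
      files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
      files.dictWords, files.matrixMtx, files.charDef, files.unk,
    );
    const builder = new TokenizerBuilder();
    builder.setDictionaryInstance(dictionary);
    tokenizer = builder.build();
  }
  // Convert tokens to plain objects so they survive structured cloning
  self.postMessage(tokenizer.tokenize(event.data).map(t => t.toJSON()));
};
The main thread can then post text to the worker and receive plain token objects back without blocking the UI.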
OPFS Dictionary Storage
Lindera WASM provides OPFS (Origin Private File System) helper utilities for persistent dictionary caching in web browsers. This allows you to download dictionaries once and reuse them across sessions without embedding them in the WASM binary.
Overview
The OPFS helpers are distributed as a separate JavaScript module (opfs.js) alongside the WASM package. They provide functions to download, store, load, and manage dictionaries using the browser's Origin Private File System.
Dictionaries are stored under the OPFS path lindera/dictionaries/<name>/.
Import
import { downloadDictionary, loadDictionaryFiles, removeDictionary,
listDictionaries, hasDictionary } from 'lindera-wasm-web/opfs';
Functions
downloadDictionary(url, name, options?)
Downloads a dictionary zip archive, extracts it, and stores the files in OPFS.
The archive should be a zip file containing the 8 required dictionary files, optionally nested in a subdirectory.
- Parameters:
  - url (string) -- URL of the dictionary zip archive
  - name (string) -- Name to store the dictionary under (e.g., "ipadic")
  - options (object, optional):
    - onProgress (function) -- Progress callback
- Returns: Promise<void>
await downloadDictionary(
"https://example.com/ipadic.zip",
"ipadic",
{
onProgress: (progress) => {
switch (progress.phase) {
case "downloading":
console.log(`Downloading: ${progress.loaded}/${progress.total} bytes`);
break;
case "extracting":
console.log("Extracting archive...");
break;
case "storing":
console.log("Storing in OPFS...");
break;
case "complete":
console.log("Done!");
break;
}
},
},
);
Progress Callback
The onProgress callback receives an object with the following shape:
| Property | Type | Description |
|---|---|---|
phase | string | "downloading", "extracting", "storing", or "complete" |
loaded | number | undefined | Bytes downloaded (only during "downloading" phase) |
total | number | undefined | Total bytes if known (only during "downloading" phase) |
loadDictionaryFiles(name)
Loads dictionary files from OPFS as an object of Uint8Array values.
The returned object can be passed directly to loadDictionaryFromBytes().
- Parameters:
  - name (string) -- The dictionary name (e.g., "ipadic")
- Returns: Promise<DictionaryFiles>
const files = await loadDictionaryFiles("ipadic");
DictionaryFiles
| Property | Type | Source File |
|---|---|---|
metadata | Uint8Array | metadata.json |
dictDa | Uint8Array | dict.da (Double-Array Trie) |
dictVals | Uint8Array | dict.vals (word value data) |
dictWordsIdx | Uint8Array | dict.wordsidx (word details index) |
dictWords | Uint8Array | dict.words (word details) |
matrixMtx | Uint8Array | matrix.mtx (connection cost matrix) |
charDef | Uint8Array | char_def.bin (character definitions) |
unk | Uint8Array | unk.bin (unknown word dictionary) |
removeDictionary(name)
Removes a dictionary from OPFS.
- Parameters:
  - name (string) -- The dictionary name to remove
- Returns: Promise<void>
await removeDictionary("ipadic");
listDictionaries()
Lists all dictionaries stored in OPFS.
- Returns: Promise<string[]> -- Array of dictionary names
const names = await listDictionaries();
console.log(names); // e.g., ["ipadic", "unidic"]
hasDictionary(name)
Checks if a dictionary exists in OPFS.
- Parameters:
  - name (string) -- The dictionary name to check
- Returns: Promise<boolean>
if (await hasDictionary("ipadic")) {
console.log("Dictionary is cached");
}
Complete Workflow
A typical workflow for using OPFS-based dictionaries:
import __wbg_init, { TokenizerBuilder, loadDictionaryFromBytes } from 'lindera-wasm-web';
import { downloadDictionary, loadDictionaryFiles, hasDictionary } from 'lindera-wasm-web/opfs';
async function main() {
await __wbg_init();
const DICT_NAME = "ipadic";
const DICT_URL = "https://github.com/lindera/lindera/releases/download/<version>/lindera-ipadic-<version>.zip";
// Download dictionary if not already cached
if (!await hasDictionary(DICT_NAME)) {
await downloadDictionary(DICT_URL, DICT_NAME, {
onProgress: ({ phase, loaded, total }) => {
if (phase === "downloading" && total) {
console.log(`${(loaded / total * 100).toFixed(1)}%`);
}
},
});
}
// Load dictionary from OPFS
const files = await loadDictionaryFiles(DICT_NAME);
const dictionary = loadDictionaryFromBytes(
files.metadata, files.dictDa, files.dictVals, files.dictWordsIdx,
files.dictWords, files.matrixMtx, files.charDef, files.unk,
);
// Build tokenizer
const builder = new TokenizerBuilder();
builder.setDictionaryInstance(dictionary);
builder.setMode("normal");
const tokenizer = builder.build();
// Tokenize
const tokens = tokenizer.tokenize("形態素解析を行います");
tokens.forEach(token => {
console.log(`${token.surface}\t${token.details.join(',')}`);
});
}
main();
Required Dictionary Files
A valid dictionary archive must contain these 8 files:
| File | Description |
|---|---|
metadata.json | Dictionary metadata (name, encoding, schema, etc.) |
dict.da | Double-Array Trie structure |
dict.vals | Word value data |
dict.wordsidx | Word details index |
dict.words | Word details (morphological features) |
matrix.mtx | Connection cost matrix |
char_def.bin | Character category definitions |
unk.bin | Unknown word dictionary |
Browser Compatibility
OPFS requires a secure context (HTTPS or localhost) and is supported in:
- Chrome 86+
- Edge 86+
- Firefox 111+
- Safari 15.2+
The zip extraction uses the DecompressionStream API, which requires:
- Chrome 80+
- Edge 80+
- Firefox 113+
- Safari 16.4+
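Both requirements can be feature-detected with standard browser APIs before attempting the OPFS workflow (nothing below is part of lindera-wasm):
const opfsSupported = typeof navigator !== "undefined"
  && typeof navigator.storage?.getDirectory === "function";
const unzipSupported = typeof DecompressionStream !== "undefined";
if (!opfsSupported || !unzipSupported) {
  console.warn("OPFS or DecompressionStream unavailable; consider an embedded dictionary build instead");
}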
Lindera IPADIC
Lindera IPADIC is a Japanese dictionary crate based on IPADIC. IPADIC is the most common dictionary for Japanese morphological analysis.
Contents
- Dictionary Format -- Field definitions for system and user dictionaries
- Build -- How to build the dictionary from source
- Examples -- Tokenization examples
API Reference
Lindera IPADIC
Dictionary version
This repository contains mecab-ipadic.
Dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation |
User dictionary format (CSV)
Simple version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Part-of-speech | |
| 2 | 読み | Reading |
Detailed version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13 | - | - | Fields from index 13 onward may be freely extended. |
API reference
The API reference is available; please see the following URL:
Build
This page describes how to build the IPADIC dictionary from source files.
Build system dictionary
Download the IPADIC source files and build the dictionary:
# Download and extract IPADIC source files
% curl -L -o /tmp/mecab-ipadic-2.7.0-20250920.tar.gz "https://lindera.dev/mecab-ipadic-2.7.0-20250920.tar.gz"
% tar zxvf /tmp/mecab-ipadic-2.7.0-20250920.tar.gz -C /tmp
# Build the dictionary
% lindera build \
--src /tmp/mecab-ipadic-2.7.0-20250920 \
--dest /tmp/lindera-ipadic-2.7.0-20250920 \
--metadata ./lindera-ipadic/metadata.json
Build user dictionary
Build a user dictionary from a CSV file:
% lindera build \
--src ./resources/user_dict/ipadic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ipadic/metadata.json \
--user
For more details about user dictionary format, see Dictionary Format.
Embedding in binary
To embed the IPADIC dictionary directly into the binary:
cargo build --features=embed-ipadic
This allows using embedded://ipadic as the dictionary path without external dictionary files.
Examples
This page shows tokenization examples using the IPADIC dictionary.
Tokenize with external IPADIC
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-ipadic-2.7.0-20250920
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Tokenize with embedded IPADIC
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
NOTE: To include IPADIC dictionary in the binary, you must build with the --features=embed-ipadic option.
Tokenize with user dictionary (CSV format)
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict embedded://ipadic \
--user-dict ./resources/user_dict/ipadic_simple_userdic.csv
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
Tokenize with user dictionary (binary format)
% echo "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です" | lindera tokenize \
--dict /tmp/lindera-ipadic-2.7.0-20250920 \
--user-dict ./resources/user_dict/ipadic_simple_userdic.bin
東京スカイツリー カスタム名詞,*,*,*,*,*,東京スカイツリー,トウキョウスカイツリー,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
最寄り駅 名詞,一般,*,*,*,*,最寄り駅,モヨリエキ,モヨリエキ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
とうきょうスカイツリー駅 カスタム名詞,*,*,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,*
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Lindera IPADIC NEologd
Lindera IPADIC NEologd is a Japanese dictionary crate based on IPADIC NEologd, which includes neologisms (new words). It extends the standard IPADIC dictionary with additional vocabulary covering recent terms and proper nouns.
Contents
- Dictionary Format -- Field definitions for system and user dictionaries
- Build -- How to build the dictionary from source
- Examples -- Tokenization examples
API Reference
Lindera IPADIC NEologd
Dictionary version
This repository contains mecab-ipadic-neologd.
Dictionary format
Refer to the manual for details on the IPADIC dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation |
User dictionary format (CSV)
Simple version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞 | Part-of-speech | |
| 2 | 読み | Reading |
Detailed version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞 | Part-of-speech | |
| 5 | 品詞細分類1 | Part-of-speech subcategory 1 | |
| 6 | 品詞細分類2 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類3 | Part-of-speech subcategory 3 | |
| 8 | 活用形 | Conjugation form | |
| 9 | 活用型 | Conjugation type | |
| 10 | 原形 | Base form | |
| 11 | 読み | Reading | |
| 12 | 発音 | Pronunciation | |
| 13 | - | - | Fields from index 13 onward may be freely extended. |
API reference
The API reference is available; please see the following URL:
Build
This page describes how to build the IPADIC NEologd dictionary from source files.
Build system dictionary
Download the IPADIC NEologd source files and build the dictionary:
% curl -L -o /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz "https://lindera.dev/mecab-ipadic-neologd-0.0.7-20200820.tar.gz"
% tar zxvf /tmp/mecab-ipadic-neologd-0.0.7-20200820.tar.gz -C /tmp
% lindera build \
--src /tmp/mecab-ipadic-neologd-0.0.7-20200820 \
--dest /tmp/lindera-ipadic-neologd-0.0.7-20200820 \
--metadata ./lindera-ipadic-neologd/metadata.json
Embedding in binary
To embed the IPADIC NEologd dictionary directly into the binary:
cargo build --features=embed-ipadic-neologd
This allows using embedded://ipadic-neologd as the dictionary path without external dictionary files.
Examples
This page shows tokenization examples using the IPADIC NEologd dictionary.
Tokenize with external IPADIC NEologd
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-ipadic-neologd-0.0.7-20200820
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析 名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
Notice that NEologd treats "形態素解析" (morphological analysis) as a single compound noun, whereas standard IPADIC splits it into "形態素" and "解析".
Tokenize with embedded IPADIC NEologd
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://ipadic-neologd
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素解析 名詞,固有名詞,一般,*,*,*,形態素解析,ケイタイソカイセキ,ケイタイソカイセキ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
行う 動詞,自立,*,*,五段・ワ行促音便,基本形,行う,オコナウ,オコナウ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
でき 動詞,自立,*,*,一段,連用形,できる,デキ,デキ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
NOTE: To include IPADIC NEologd dictionary in the binary, you must build with the --features=embed-ipadic-neologd option.
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic-neologd")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Lindera UniDic
Lindera UniDic is a Japanese dictionary crate based on UniDic, which uses uniform word unit definitions. UniDic provides more detailed morphological information than IPADIC, with 21 fields per entry.
Contents
- Dictionary Format -- Field definitions for system and user dictionaries
- Build -- How to build the dictionary from source
- Examples -- Tokenization examples
API Reference
Lindera UniDic
Dictionary version
This repository contains unidic-mecab.
Dictionary format
Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Part-of-speech | |
| 5 | 品詞中分類 | Part-of-speech subcategory 1 | |
| 6 | 品詞小分類 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類 | Part-of-speech subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthographic surface form | |
| 13 | 発音形出現形 | Phonological surface form | |
| 14 | 書字形基本形 | Orthographic base form | |
| 15 | 発音形基本形 | Phonological base form | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Initial mutation type | |
| 18 | 語頭変化形 | Initial mutation form | |
| 19 | 語末変化型 | Final mutation type | |
| 20 | 語末変化形 | Final mutation form |
User dictionary format (CSV)
Simple version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 品詞大分類 | Part-of-speech | |
| 2 | 語彙素読み | Reading |
Detailed version
| Index | Name (Japanese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表層形 | Surface | |
| 1 | 左文脈ID | Left context ID | |
| 2 | 右文脈ID | Right context ID | |
| 3 | コスト | Cost | |
| 4 | 品詞大分類 | Part-of-speech | |
| 5 | 品詞中分類 | Part-of-speech subcategory 1 | |
| 6 | 品詞小分類 | Part-of-speech subcategory 2 | |
| 7 | 品詞細分類 | Part-of-speech subcategory 3 | |
| 8 | 活用型 | Conjugation type | |
| 9 | 活用形 | Conjugation form | |
| 10 | 語彙素読み | Reading | |
| 11 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | |
| 12 | 書字形出現形 | Orthographic surface form | |
| 13 | 発音形出現形 | Phonological surface form | |
| 14 | 書字形基本形 | Orthographic base form | |
| 15 | 発音形基本形 | Phonological base form | |
| 16 | 語種 | Word type | |
| 17 | 語頭変化型 | Initial mutation type | |
| 18 | 語頭変化形 | Initial mutation form | |
| 19 | 語末変化型 | Final mutation type | |
| 20 | 語末変化形 | Final mutation form | |
| 21 | - | - | Fields from index 21 onward may be freely extended. |
API reference
The API reference is available; please see the following URL:
Build
This page describes how to build the UniDic dictionary from source files.
Build system dictionary
Download the UniDic source files and build the dictionary:
% curl -L -o /tmp/unidic-mecab-2.1.2.tar.gz "https://lindera.dev/unidic-mecab-2.1.2.tar.gz"
% tar zxvf /tmp/unidic-mecab-2.1.2.tar.gz -C /tmp
% lindera build \
--src /tmp/unidic-mecab-2.1.2 \
--dest /tmp/lindera-unidic-2.1.2 \
--metadata ./lindera-unidic/metadata.json
Build user dictionary
Build a user dictionary from a CSV file:
% lindera build \
--src ./resources/user_dict/unidic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-unidic/metadata.json \
--user
For more details about user dictionary format, see Dictionary Format.
Embedding in binary
To embed the UniDic dictionary directly into the binary:
cargo build --features=embed-unidic
This allows using embedded://unidic as the dictionary path without external dictionary files.
Examples
This page shows tokenization examples using the UniDic dictionary.
Tokenize with external UniDic
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict /tmp/lindera-unidic-2.1.2
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語 名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態 名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素 接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析 名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う 動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと 名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき 動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
Notice that UniDic splits "日本語" into "日本" and "語", and "形態素" into "形態" and "素", reflecting its uniform word unit definitions.
Tokenize with embedded UniDic
% echo "日本語の形態素解析を行うことができます。" | lindera tokenize \
--dict embedded://unidic
日本 名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語 名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*
形態 名詞,普通名詞,一般,*,*,*,ケイタイ,形態,形態,ケータイ,形態,ケータイ,漢,*,*,*,*
素 接尾辞,名詞的,一般,*,*,*,ソ,素,素,ソ,素,ソ,漢,*,*,*,*
解析 名詞,普通名詞,サ変可能,*,*,*,カイセキ,解析,解析,カイセキ,解析,カイセキ,漢,*,*,*,*
を 助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
行う 動詞,一般,*,*,五段-ワア行,連体形-一般,オコナウ,行う,行う,オコナウ,行う,オコナウ,和,*,*,*,*
こと 名詞,普通名詞,一般,*,*,*,コト,事,こと,コト,こと,コト,和,コ濁,基本形,*,*
が 助詞,格助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*
でき 動詞,非自立可能,*,*,上一段-カ行,連用形-一般,デキル,出来る,でき,デキ,できる,デキル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
。 補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
NOTE: To include UniDic dictionary in the binary, you must build with the --features=embed-unidic option.
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://unidic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "日本語の形態素解析を行うことができます。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Lindera ko-dic
Lindera ko-dic is a Korean dictionary crate based on mecab-ko-dic.
Contents
- Dictionary Format -- Field definitions for system and user dictionaries
- Build -- How to build the dictionary from source
- Examples -- Tokenization examples
API Reference
Lindera ko-dic
Dictionary version
This repository contains mecab-ko-dic.
Dictionary format
Information about the dictionary format and part-of-speech tags used by mecab-ko-dic is documented in a Google Spreadsheet linked from the mecab-ko-dic repository README.
Note that ko-dic has one fewer feature column than NAIST JDIC and provides an altogether different set of information (e.g. it does not include the word's original form).
The tags are a slight modification of those specified by 세종 (Sejong). The mappings from Sejong to mecab-ko-dic's tag names are given in the 태그 v2.0 tab of the above-linked spreadsheet.
The dictionary format is specified fully (in Korean) in the 사전 형식 v2.0 tab of the spreadsheet. Any blank values default to *.
| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Semantic class | Sparsely documented in the source |
| 6 | 종성 유무 | Final consonant (jongseong) presence | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP |
| 11 | 표현 | Expression | Describes how Inflect, Compound, and Preanalysis entries are composed |
User dictionary format (CSV)
Simple version
| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 2 | 읽기 | Reading | Usually matches surface, but may differ for foreign words, e.g. Chinese character words |
Detailed version
| Index | Name (Korean) | Name (English) | Notes |
|---|---|---|---|
| 0 | 표면 | Surface | |
| 1 | 왼쪽 문맥 ID | Left context ID | |
| 2 | 오른쪽 문맥 ID | Right context ID | |
| 3 | 비용 | Cost | |
| 4 | 품사 태그 | Part-of-speech tag | See 태그 v2.0 tab on spreadsheet |
| 5 | 의미 부류 | Semantic class | Sparsely documented in the source |
| 6 | 종성 유무 | Final consonant (jongseong) presence | T for true; F for false; else * |
| 7 | 읽기 | Reading | Usually matches surface, but may differ for foreign words, e.g. Chinese character words |
| 8 | 타입 | Type | One of: Inflect (활용); Compound (복합명사); or Preanalysis (기분석) |
| 9 | 첫번째 품사 | First part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return VV |
| 10 | 마지막 품사 | Last part-of-speech | e.g. given a part-of-speech tag of "VV+EM+VX+EP", would return EP |
| 11 | 표현 | Expression | Describes how Inflect, Compound, and Preanalysis entries are composed |
| 12 | - | - | Fields from index 12 onward may be freely extended. |
API reference
The API reference is available; please see the following URL:
Build
Build system dictionary
Download and extract the mecab-ko-dic source files, then build the dictionary:
% curl -L -o /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz "https://lindera.dev/mecab-ko-dic-2.1.1-20180720.tar.gz"
% tar zxvf /tmp/mecab-ko-dic-2.1.1-20180720.tar.gz -C /tmp
% lindera build \
--src /tmp/mecab-ko-dic-2.1.1-20180720 \
--dest /tmp/lindera-ko-dic-2.1.1-20180720 \
--metadata ./lindera-ko-dic/metadata.json
Build user dictionary
% lindera build \
--src ./resources/user_dict/ko-dic_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-ko-dic/metadata.json \
--user
Embedding the dictionary
To embed the ko-dic dictionary directly into the binary, build with the following feature flag:
% cargo build --features=embed-ko-dic
Examples
Tokenize with external ko-dic
% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
--dict /tmp/lindera-ko-dic-2.1.1-20180720
한국어 NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의 JKG,*,F,의,*,*,*,*
형태 NNG,*,F,형태,*,*,*,*
해석 NNG,행위,T,해석,*,*,*,*
을 JKO,*,T,을,*,*,*,*
실시 NNG,행위,F,실시,*,*,*,*
할 XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수 NNB,*,F,수,*,*,*,*
있 VV,*,T,있,*,*,*,*
습니다 EF,*,F,습니다,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
Tokenize with embedded ko-dic
% echo "한국어의형태해석을실시할수있습니다." | lindera tokenize \
--dict embedded://ko-dic
한국어 NNG,*,F,한국어,Compound,*,*,한국/NNG/*+어/NNG/*
의 JKG,*,F,의,*,*,*,*
형태 NNG,*,F,형태,*,*,*,*
해석 NNG,행위,T,해석,*,*,*,*
을 JKO,*,T,을,*,*,*,*
실시 NNG,행위,F,실시,*,*,*,*
할 XSV+ETM,*,T,할,Inflect,XSV,ETM,하/XSV/*+ᆯ/ETM/*
수 NNB,*,F,수,*,*,*,*
있 VV,*,T,있,*,*,*,*
습니다 EF,*,F,습니다,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
NOTE: To include ko-dic dictionary in the binary, you must build with the --features=embed-ko-dic option.
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ko-dic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "한국어의형태해석을실시할수있습니다.";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Lindera CC-CEDICT
Lindera CC-CEDICT is a Chinese dictionary crate based on CC-CEDICT-MeCab.
Contents
- Dictionary Format -- Field definitions for system and user dictionaries
- Build -- How to build the dictionary from source
- Examples -- Tokenization examples
API Reference
Lindera CC-CEDICT
Dictionary version
This repository contains CC-CEDICT-MeCab.
Dictionary format
Refer to the manual for details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags.
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 词类1 | Part-of-speech subcategory 1 | |
| 6 | 词类2 | Part-of-speech subcategory 2 | |
| 7 | 词类3 | Part-of-speech subcategory 3 | |
| 8 | 併音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 簡体字 | Simplified | |
| 11 | 定义 | Definition |
User dictionary format (CSV)
Simple version
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 词类 | Part-of-speech | |
| 2 | 併音 | Pinyin |
Detailed version
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 词类1 | Part-of-speech subcategory 1 | |
| 6 | 词类2 | Part-of-speech subcategory 2 | |
| 7 | 词类3 | Part-of-speech subcategory 3 | |
| 8 | 併音 | Pinyin | |
| 9 | 繁体字 | Traditional | |
| 10 | 簡体字 | Simplified | |
| 11 | 定义 | Definition | |
| 12 | - | - | Fields from index 12 onward may be freely extended. |
API reference
The API reference is available; please see the following URL:
Build
Build system dictionary
Download and extract the CC-CEDICT-MeCab source files, then build the dictionary:
% curl -L -o /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz "https://lindera.dev/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz"
% tar zxvf /tmp/CC-CEDICT-MeCab-0.1.0-20200409.tar.gz -C /tmp
% lindera build \
--src /tmp/CC-CEDICT-MeCab-0.1.0-20200409 \
--dest /tmp/lindera-cc-cedict-0.1.0-20200409 \
--metadata ./lindera-cc-cedict/metadata.json
Build user dictionary
% lindera build \
--src ./resources/user_dict/cc-cedict_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-cc-cedict/metadata.json \
--user
Embedding the dictionary
To embed the CC-CEDICT dictionary directly into the binary, build with the following feature flag:
% cargo build --features=embed-cc-cedict
Examples
Tokenize with external CC-CEDICT
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict /tmp/lindera-cc-cedict-0.1.0-20200409
可以 *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行 *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文 *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学 *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析 *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。 *,*,*,*,*,*,*,*
EOS
Tokenize with embedded CC-CEDICT
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict embedded://cc-cedict
可以 *,*,*,*,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good/
进行 *,*,*,*,jin4 xing2,進行,进行,to advance/to conduct/underway/in progress/to do/to carry out/to carry on/to execute/
中文 *,*,*,*,Zhong1 wen2,中文,中文,Chinese language/
形态学 *,*,*,*,xing2 tai4 xue2,形態學,形态学,morphology (in biology or linguistics)/
分析 *,*,*,*,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4]/
。 *,*,*,*,*,*,*,*
EOS
NOTE: To include CC-CEDICT dictionary in the binary, you must build with the --features=embed-cc-cedict option.
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://cc-cedict")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "可以进行中文形态学分析。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Lindera Jieba
Lindera Jieba is a Chinese dictionary crate based on mecab-jieba.
Contents
- Dictionary Format -- Field definitions for system and user dictionaries
- Build -- How to build the dictionary from source
- Examples -- Tokenization examples
API Reference
Lindera Jieba
Dictionary version
This repository contains mecab-jieba.
Dictionary format
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 併音 | Pinyin | |
| 6 | 繁体字 | Traditional | |
| 7 | 簡体字 | Simplified | |
| 8 | 定义 | Definition |
User dictionary format (CSV)
Simple version
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 词类 | Part-of-speech | |
| 2 | 併音 | Pinyin |
Detailed version
| Index | Name (Chinese) | Name (English) | Notes |
|---|---|---|---|
| 0 | 表面形式 | Surface | |
| 1 | 左语境ID | Left context ID | |
| 2 | 右语境ID | Right context ID | |
| 3 | 成本 | Cost | |
| 4 | 词类 | Part-of-speech | |
| 5 | 併音 | Pinyin | |
| 6 | 繁体字 | Traditional | |
| 7 | 簡体字 | Simplified | |
| 8 | 定义 | Definition | |
| 9 | - | - | Fields from index 9 onward may be freely extended. |
API reference
The API reference is available; please see the following URL:
Build
Build system dictionary
Download and extract the mecab-jieba source files, then build the dictionary:
% curl -L -o /tmp/mecab-jieba-0.1.1.tar.gz "https://lindera.dev/mecab-jieba-0.1.1.tar.gz"
% tar zxvf /tmp/mecab-jieba-0.1.1.tar.gz -C /tmp
% lindera build \
--src /tmp/mecab-jieba-0.1.1/dict-src \
--dest /tmp/lindera-jieba-0.1.1 \
--metadata ./lindera-jieba/metadata.json
Build user dictionary
% lindera build \
--src ./resources/user_dict/jieba_simple_userdic.csv \
--dest ./resources/user_dict \
--metadata ./lindera-jieba/metadata.json \
--user
Embedding the dictionary
To embed the Jieba dictionary directly into the binary, build with the following feature flag:
% cargo build --features=embed-jieba
Examples
Tokenize with external Jieba
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict /tmp/lindera-jieba-0.1.1
可以 c,CHINESE,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good,2,可,以,high
进行 v,CHINESE,jin4 xing2,進行,进行,(of a process etc) to proceed; to be in progress; to be underway/(of people) to carry out; to conduct (an investigation or discussion etc)/(of an army etc) to be on the march; to advance,2,进,行,high
中文 nz,CHINESE,Zhong1 wen2,中文,中文,Chinese language,2,中,文,high
形态 n,CHINESE,xing2 tai4,形態,形态,shape/form/pattern/morphology,2,形,态,high
学 n,CHINESE,xue2,學,学,to learn/to study/to imitate/science/-ology,1,学,学,high
分析 vn,CHINESE,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4],2,分,析,high
。 w,*,*,*,*,*,*,*,*,*
EOS
Tokenize with embedded Jieba
% echo "可以进行中文形态学分析。" | lindera tokenize \
--dict embedded://jieba
可以 c,CHINESE,ke3 yi3,可以,可以,can/may/possible/able to/not bad/pretty good,2,可,以,high
进行 v,CHINESE,jin4 xing2,進行,进行,(of a process etc) to proceed; to be in progress; to be underway/(of people) to carry out; to conduct (an investigation or discussion etc)/(of an army etc) to be on the march; to advance,2,进,行,high
中文 nz,CHINESE,Zhong1 wen2,中文,中文,Chinese language,2,中,文,high
形态 n,CHINESE,xing2 tai4,形態,形态,shape/form/pattern/morphology,2,形,态,high
学 n,CHINESE,xue2,學,学,to learn/to study/to imitate/science/-ology,1,学,学,high
分析 vn,CHINESE,fen1 xi1,分析,分析,to analyze/analysis/CL:個|个[ge4],2,分,析,high
。 w,*,*,*,*,*,*,*,*,*
EOS
NOTE: To include Jieba dictionary in the binary, you must build with the --features=embed-jieba option.
Rust API example
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://jieba")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "可以进行中文形态学分析。";
    let mut tokens = tokenizer.tokenize(text)?;
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
Development Guide
This section provides information for developers who want to build, test, or contribute to Lindera.
- Build & Test -- Build commands, test execution, and quality checks
- Feature Flags -- Available feature flags and their effects
- Project Structure -- Crate layout and module organization
- Training Pipeline -- CRF-based dictionary training workflow
- Contributing -- Guidelines for contributors
Build & Test
Build
Default Build
Build the workspace with default features (mmap):
cargo build
Build with Training Support
Include CRF-based dictionary training functionality:
cargo build --features train
Build CLI Only
cargo build -p lindera-cli
The CLI has the train feature enabled by default.
Test
Single Test
Run a specific test within a crate (recommended for development):
cargo test -p <crate> <test_name>
Training Feature Tests
cargo test -p lindera-dictionary --features train
All Features for a Crate
Run the full test suite for a single crate (matches CI):
cargo test -p <crate> --all-features
Workspace-Wide Tests
cargo test
Quality Checks
Format Check
Verify code formatting matches the project style:
cargo fmt --all -- --check
To auto-fix formatting:
cargo fmt --all
Lint
Run Clippy with warnings treated as errors:
cargo clippy -- -D warnings
Documentation
API Documentation
Generate and open Rust API documentation:
cargo doc --no-deps --open
mdBook Documentation
Build the user-facing documentation:
mdbook build docs
Preview locally at http://localhost:3000:
mdbook serve docs
Markdown Lint
Check documentation for Markdown style issues:
markdownlint-cli2 "docs/src/**/*.md"
Rules are configured in .markdownlint.json at the repository root.
Feature Flags
Lindera uses Cargo feature flags to control optional functionality and dictionary embedding.
Core Features
| Feature | Description | Default |
|---|---|---|
mmap | Memory-mapped file support | Yes |
train | CRF-based dictionary training (depends on lindera-crf) | CLI only |
mmap is enabled by default in the main lindera crate. train is enabled by default only in lindera-cli. For library usage, enable it explicitly with --features train.
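For a library consumer, that means an entry like this in Cargo.toml (version number as used elsewhere in this chapter):
[dependencies]
lindera = { version = "2.3.2", features = ["train"] }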
Using External Dictionaries (Recommended)
The recommended approach is to use pre-built dictionaries as external files. Download a dictionary from GitHub Releases and specify its path at runtime:
let dictionary = load_dictionary("/path/to/ipadic")?;
No additional feature flags are required for this usage.
Dictionary Embedding Features (Advanced)
These features embed pre-built dictionaries directly into the binary, eliminating the need for external dictionary files at runtime. This is intended for advanced users who need self-contained binaries.
| Feature | Dictionary | Language |
|---|---|---|
embed-ipadic | IPADIC | Japanese |
embed-ipadic-neologd | IPADIC NEologd | Japanese |
embed-unidic | UniDic | Japanese |
embed-ko-dic | ko-dic | Korean |
embed-cc-cedict | CC-CEDICT | Chinese |
embed-jieba | Jieba | Chinese |
None of these are enabled by default. Enable them as needed:
[dependencies]
lindera = { version = "2.3.2", features = ["embed-ipadic"] }
When embedding is enabled, you can load the dictionary with:
let dictionary = load_dictionary("embedded://ipadic")?;
Combination Features
These meta-features enable multiple dictionaries at once for multilingual applications.
| Feature | Included Dictionaries |
|---|---|
embed-cjk | IPADIC + ko-dic + Jieba |
embed-cjk2 | UniDic + ko-dic + Jieba |
embed-cjk3 | IPADIC NEologd + ko-dic + Jieba |
Combining Feature Flags
Multiple feature flags can be combined. For example, to embed both Japanese and Korean dictionaries:
[dependencies]
lindera = { version = "2.3.2", features = ["embed-ipadic", "embed-ko-dic"] }
Or from the command line:
cargo build --features embed-ipadic,embed-ko-dic
Notes
- Embedding dictionaries increases binary size significantly. Only embed dictionaries you actually need.
- The train feature adds a dependency on lindera-crf and increases compile time. It is not needed for tokenization-only use cases.
- The mmap feature enables memory-mapped dictionary loading, which reduces memory usage for large dictionaries loaded from disk. It has no effect on embedded dictionaries.
Project Structure
Lindera is organized as a Cargo workspace with multiple crates.
Directory Layout
lindera/
├── lindera-crf/ # CRF engine (pure Rust, no_std)
├── lindera-dictionary/ # Dictionary base library
├── lindera/ # Core morphological analysis library
├── lindera-cli/ # CLI tool
├── lindera-ipadic/ # IPADIC dictionary (Japanese)
├── lindera-ipadic-neologd/ # IPADIC NEologd dictionary (Japanese)
├── lindera-unidic/ # UniDic dictionary (Japanese)
├── lindera-ko-dic/ # ko-dic dictionary (Korean)
├── lindera-cc-cedict/ # CC-CEDICT dictionary (Chinese)
├── lindera-jieba/ # Jieba dictionary (Chinese)
├── lindera-python/ # Python bindings (PyO3)
├── lindera-wasm/ # WebAssembly bindings (wasm-bindgen)
├── resources/ # Test resources and sample data
├── docs/ # Documentation (mdBook)
└── examples/ # Example code
Crate Descriptions
Core Crates
lindera-crf
Pure Rust implementation of Conditional Random Fields (CRF). Supports no_std environments. Uses rkyv for fast zero-copy serialization. This crate provides the statistical learning engine used in dictionary training.
lindera-dictionary
Base library for dictionary handling: loading, building, and querying dictionaries. With the train feature enabled, it also provides the CRF training pipeline for creating custom dictionaries.
Key modules under src/trainer/:
| Module | Role |
|---|---|
| `config.rs` | Configuration management (seed dict, char.def, feature.def, rewrite.def) |
| `corpus.rs` | Training corpus processing |
| `feature_extractor.rs` | Feature template parsing and feature ID management |
| `feature_rewriter.rs` | MeCab-compatible feature rewriting (3-section format) |
| `model.rs` | Trained model storage, serialization, and dictionary output |
lindera
The main morphological analysis library. Integrates dictionary crates and provides the Tokenizer, Segmenter, character filters, and token filters.
lindera-cli
Command-line interface for tokenization, dictionary training, export, and building. The train feature is enabled by default.
Dictionary Crates
Each dictionary crate contains pre-built dictionary data for a specific language and dictionary source.
| Crate | Language | Dictionary Source |
|---|---|---|
| `lindera-ipadic` | Japanese | IPADIC |
| `lindera-ipadic-neologd` | Japanese | IPADIC NEologd (extended vocabulary) |
| `lindera-unidic` | Japanese | UniDic |
| `lindera-ko-dic` | Korean | ko-dic |
| `lindera-cc-cedict` | Chinese | CC-CEDICT |
| `lindera-jieba` | Chinese | Jieba |
Bindings
lindera-python
Python bindings built with PyO3. Exposes the Lindera tokenizer API to Python applications.
lindera-wasm
WebAssembly bindings built with wasm-bindgen. Enables tokenization in browsers and Node.js.
Other Directories
resources/
Test resources including sample dictionaries, user dictionaries, and test corpora used by the test suite.
docs/
User-facing documentation built with mdBook. The table of contents is defined in docs/src/SUMMARY.md. A Japanese translation is available under docs/ja/.
examples/
Runnable example programs demonstrating common usage patterns. Run with:
cargo run --features=embed-ipadic --example=<example_name>
Training Pipeline
Lindera provides CRF-based dictionary training functionality for creating custom morphological analysis models. This feature requires the train feature flag.
Overview
The training pipeline follows three stages:
lindera train --> model.dat --> lindera export --> dictionary files --> lindera build --> compiled dictionary
- Train: Learn CRF weights from an annotated corpus and seed dictionary, producing a binary model file.
- Export: Convert the trained model into Lindera dictionary source files.
- Build: Compile the source files into a binary dictionary that Lindera can load at runtime.
Required Input Files
1. Seed Lexicon (seed.csv)
Base vocabulary dictionary in MeCab CSV format.
外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人,0,0,0,名詞,接尾,一般,*,*,*,人,ジン,ジン
参政,0,0,0,名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
Each line contains: `surface,left_id,right_id,cost,pos,pos_detail1,pos_detail2,pos_detail3,inflection_type,inflection_form,base_form,reading,pronunciation`
The `left_id`, `right_id`, and `cost` fields are set to 0 in the seed dictionary; the trainer computes appropriate values from the CRF model.
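To make the column layout concrete, here is a small parsing sketch. The `SeedEntry` struct and `parse_seed_line` function are hypothetical illustrations, not Lindera types:

// Hypothetical struct mirroring the MeCab CSV columns described above.
#[derive(Debug)]
struct SeedEntry {
    surface: String,
    left_id: u16,
    right_id: u16,
    cost: i16,
    features: Vec<String>, // pos, pos_detail1..3, inflections, base_form, reading, pronunciation
}

fn parse_seed_line(line: &str) -> Option<SeedEntry> {
    let fields: Vec<&str> = line.split(',').collect();
    if fields.len() < 4 {
        return None;
    }
    Some(SeedEntry {
        surface: fields[0].to_string(),
        left_id: fields[1].parse().ok()?, // 0 in the seed; assigned by the trainer
        right_id: fields[2].parse().ok()?, // 0 in the seed; assigned by the trainer
        cost: fields[3].parse().ok()?,     // 0 in the seed; computed from CRF weights
        features: fields[4..].iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    let entry = parse_seed_line("外国,0,0,0,名詞,一般,*,*,*,*,外国,ガイコク,ガイコク");
    println!("{:?}", entry);
}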
2. Training Corpus (corpus.txt)
Annotated text data in tab-separated format. Each line is `surface<TAB>pos_info`, and sentences are separated by `EOS`.
外国 名詞,一般,*,*,*,*,外国,ガイコク,ガイコク
人 名詞,接尾,一般,*,*,*,人,ジン,ジン
参政 名詞,サ変接続,*,*,*,*,参政,サンセイ,サンセイ
権 名詞,接尾,一般,*,*,*,権,ケン,ケン
EOS
これ 連体詞,*,*,*,*,*,これ,コレ,コレ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト 名詞,サ変接続,*,*,*,*,テスト,テスト,テスト
EOS
Training quality depends heavily on the quantity and quality of this corpus.
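A minimal reading sketch for this format (illustrative only, not Lindera's corpus loader): split each line on the tab character and treat `EOS` as a sentence boundary.

// Split an annotated corpus into sentences of (surface, pos_info) pairs.
fn read_corpus(text: &str) -> Vec<Vec<(String, String)>> {
    let mut sentences = Vec::new();
    let mut current = Vec::new();
    for line in text.lines() {
        if line.trim() == "EOS" {
            // Sentence boundary: flush the accumulated tokens.
            sentences.push(std::mem::take(&mut current));
        } else if let Some((surface, pos_info)) = line.split_once('\t') {
            current.push((surface.to_string(), pos_info.to_string()));
        }
    }
    sentences
}

fn main() {
    let corpus = "外国\t名詞,一般,*,*,*,*,外国,ガイコク,ガイコク\nEOS\n";
    assert_eq!(read_corpus(corpus).len(), 1);
}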
3. Character Definition (char.def)
Defines character type categories and Unicode code point ranges.
# Category definition: category_name compatibility_flag continuity_flag length
DEFAULT 0 1 0
HIRAGANA 1 1 0
KATAKANA 1 1 0
KANJI 0 0 2
ALPHA 1 1 0
NUMERIC 1 1 0
# Character range mapping
0x3041..0x3096 HIRAGANA # Hiragana
0x30A1..0x30F6 KATAKANA # Katakana
0x4E00..0x9FAF KANJI # Kanji
0x0030..0x0039 NUMERIC # Numbers
0x0041..0x005A ALPHA # Uppercase letters
0x0061..0x007A ALPHA # Lowercase letters
Parameters control how unknown words of each character type are segmented: compatibility with adjacent characters, whether runs of the same type continue as a single token, and default token length.
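As an illustration of how the range mappings are consulted, assuming the ranges are inclusive as in MeCab (hypothetical lookup code, not the internal representation):

// Character ranges from the char.def excerpt above.
const RANGES: &[(u32, u32, &str)] = &[
    (0x3041, 0x3096, "HIRAGANA"),
    (0x30A1, 0x30F6, "KATAKANA"),
    (0x4E00, 0x9FAF, "KANJI"),
    (0x0030, 0x0039, "NUMERIC"),
    (0x0041, 0x005A, "ALPHA"),
    (0x0061, 0x007A, "ALPHA"),
];

// Map a character to its category, falling back to DEFAULT.
fn category(c: char) -> &'static str {
    let cp = c as u32;
    RANGES
        .iter()
        .find(|(lo, hi, _)| (*lo..=*hi).contains(&cp))
        .map(|(_, _, name)| *name)
        .unwrap_or("DEFAULT")
}

fn main() {
    assert_eq!(category('あ'), "HIRAGANA");
    assert_eq!(category('漢'), "KANJI");
    assert_eq!(category('!'), "DEFAULT");
}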
4. Unknown Word Definition (unk.def)
Defines how out-of-vocabulary words are handled by character type.
DEFAULT,0,0,0,名詞,一般,*,*,*,*,*,*,*
HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KATAKANA,0,0,0,名詞,一般,*,*,*,*,*,*,*
KANJI,0,0,0,名詞,一般,*,*,*,*,*,*,*
ALPHA,0,0,0,名詞,固有名詞,一般,*,*,*,*,*,*
NUMERIC,0,0,0,名詞,数,*,*,*,*,*,*,*
5. Feature Template (feature.def)
MeCab-compatible feature extraction patterns that define what information the CRF model uses for learning.
# Unigram features (word-level)
UNIGRAM U00:%F[0] # POS
UNIGRAM U01:%F[0],%F?[1] # POS + POS detail (%F?[n] = optional, skipped if *)
UNIGRAM U02:%F[6] # Base form
UNIGRAM U03:%w # Surface form
# Bigram features (context combination)
BIGRAM B00:%L[0]/%R[0] # Left POS / Right POS
BIGRAM B01:%L[0],%L[1]/%R[0],%R[1] # Left POS detail / Right POS detail
Template variables:
| Variable | Description |
|---|---|
| `%F[n]` / `%F?[n]` | Feature field at index n (`?` = optional, skipped if value is `*`) |
| `%L[n]` | Left context feature field (from rewrite.def left section) |
| `%R[n]` | Right context feature field (from rewrite.def right section) |
| `%w` | Surface form of the word |
| `%u` | Unigram rewritten feature string |
| `%l` | Left rewritten feature string |
| `%r` | Right rewritten feature string |
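A toy expansion of the `%F[n]` / `%F?[n]` forms against a feature vector makes the templates concrete. This is illustrative only; the real extractor also handles `%w`, `%L[n]`, `%R[n]`, and the rewritten-feature variables:

// Expand %F[n] and %F?[n] references in a template against feature fields.
// A %F?[n] reference is skipped when the field value is "*".
fn expand(template: &str, fields: &[&str]) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(pos) = rest.find("%F") {
        out.push_str(&rest[..pos]);
        rest = &rest[pos + 2..];
        let optional = rest.starts_with('?');
        if optional {
            rest = &rest[1..];
        }
        // Expect a bracketed index such as "[1]".
        let (idx, tail) = match rest.find(']') {
            Some(close) => (rest[1..close].parse::<usize>().ok(), &rest[close + 1..]),
            None => (None, rest),
        };
        rest = tail;
        if let Some(value) = idx.and_then(|i| fields.get(i)) {
            if !(optional && *value == "*") {
                out.push_str(value);
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let fields = ["名詞", "一般"];
    // U01:%F[0],%F?[1] expands to the POS plus the optional POS detail.
    assert_eq!(expand("U01:%F[0],%F?[1]", &fields), "U01:名詞,一般");
}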
6. Feature Rewrite Rules (rewrite.def)
Feature normalization rules in MeCab-compatible 3-section format. Sections are separated by blank lines.
# Section 1: Unigram rewrite rules
名詞,固有名詞,* 名詞,固有名詞
助動詞,*,*,*,特殊・デス 助動詞
* *
# Section 2: Left context rewrite rules
名詞,固有名詞,* 名詞,固有名詞
助詞,* 助詞
* *
# Section 3: Right context rewrite rules
名詞,固有名詞,* 名詞,固有名詞
助詞,* 助詞
* *
Each line is `pattern<TAB>replacement`. Patterns use `*` as a wildcard and are matched by prefix. The first matching rule in each section is applied. Different rules can be applied to unigram, left context, and right context independently, enabling fine-grained feature normalization to reduce sparsity.
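A sketch of the matching rule just described, where pattern fields are compared to feature fields position by position and `*` matches anything (hypothetical code, not Lindera's implementation):

// Does a rewrite pattern match a feature string by prefix?
// Each pattern field must equal the corresponding feature field or be "*".
fn pattern_matches(pattern: &str, features: &str) -> bool {
    pattern
        .split(',')
        .zip(features.split(','))
        .all(|(p, f)| p == "*" || p == f)
}

fn main() {
    // "名詞,固有名詞,*" matches any proper-noun feature string by prefix.
    assert!(pattern_matches("名詞,固有名詞,*", "名詞,固有名詞,人名,*,*,*"));
    assert!(!pattern_matches("助詞,*", "名詞,一般,*,*,*,*"));
    // "*" alone is the catch-all; the first matching rule in a section wins.
    assert!(pattern_matches("*", "動詞,自立,*,*,*,*"));
}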
Training Parameters
| Parameter | Description | Default |
|---|---|---|
| `lambda` | L1 regularization coefficient (controls overfitting) | 0.01 |
| `max-iterations` | Maximum number of training iterations | 100 |
| `max-threads` | Number of parallel processing threads | 1 |
CLI Usage
Train
lindera train \
--seed seed.csv \
--corpus corpus.txt \
--char-def char.def \
--unk-def unk.def \
--feature-def feature.def \
--rewrite-def rewrite.def \
--lambda 0.01 \
--max-iter 100 \
--max-threads 4 \
--output model.dat
Export
Convert the trained model into dictionary source files:
lindera export --model model.dat --output-dir ./dict-source
This produces the following files:
| File | Description |
|---|---|
| `lex.csv` | Lexicon with trained costs |
| `matrix.def` | Connection cost matrix |
| `unk.def` | Unknown word definition |
| `char.def` | Character definition |
| `feature.def` | Feature template |
| `rewrite.def` | Feature rewrite rules |
| `left-id.def` | Left context ID mapping |
| `right-id.def` | Right context ID mapping |
| `metadata.json` | Dictionary metadata |
Build
Compile the exported source files into a binary dictionary:
lindera build --input-dir ./dict-source --output-dir ./dict-compiled
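Once built, the compiled dictionary can be loaded by path like any external dictionary. A sketch reusing the loader API shown under Feature Flags; the path is the output directory from the build command above:

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // Load the dictionary compiled by `lindera build`.
    let dictionary = load_dictionary("./dict-compiled")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let tokens = tokenizer.tokenize("カスタム辞書のテスト")?;
    println!("{} tokens", tokens.len());
    Ok(())
}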
Output Model Format
The trained model is serialized in rkyv binary format for fast loading. It contains:
- Feature weights learned by the CRF
- Label set (vocabulary entries)
- Part-of-speech information
- Feature templates
- Training metadata (regularization, iterations, feature/label counts)
API Usage
use std::fs::File;

use lindera_dictionary::trainer::{Corpus, Trainer, TrainerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load configuration from files
    let seed_file = File::open("resources/training/seed.csv")?;
    let char_file = File::open("resources/training/char.def")?;
    let unk_file = File::open("resources/training/unk.def")?;
    let feature_file = File::open("resources/training/feature.def")?;
    let rewrite_file = File::open("resources/training/rewrite.def")?;

    let config = TrainerConfig::from_readers(
        seed_file,
        char_file,
        unk_file,
        feature_file,
        rewrite_file,
    )?;

    // Initialize and configure the trainer
    let trainer = Trainer::new(config)?
        .regularization_cost(0.01)
        .max_iter(100)
        .num_threads(4);

    // Load the corpus
    let corpus_file = File::open("resources/training/corpus.txt")?;
    let corpus = Corpus::from_reader(corpus_file)?;

    // Execute training
    let model = trainer.train(corpus)?;

    // Save the model (binary format)
    let mut output = File::create("trained_model.dat")?;
    model.write_model(&mut output)?;

    // Output in Lindera dictionary format
    let mut lex_out = File::create("output_lex.csv")?;
    let mut conn_out = File::create("output_conn.dat")?;
    let mut unk_out = File::create("output_unk.def")?;
    let mut user_out = File::create("output_user.csv")?;
    model.write_dictionary(&mut lex_out, &mut conn_out, &mut unk_out, &mut user_out)?;

    Ok(())
}
Recommended Corpus Specifications
For generating effective dictionaries for real applications:
Corpus Size
| Level | Sentences | Use Case |
|---|---|---|
| Minimum | 100+ | Basic operation verification |
| Recommended | 1,000+ | Practical applications |
| Ideal | 10,000+ | Commercial quality |
Quality Guidelines
- Vocabulary diversity: Balanced distribution of different parts of speech, coverage of inflections and suffixes, appropriate inclusion of technical terms and proper nouns.
- Consistency: Apply analysis criteria consistently across the corpus.
- Verification: Manually verify morphological analysis results. Maintain an error rate below 5%.
Contributing
Thank you for your interest in contributing to Lindera! This page provides guidelines to help you get started.
Getting Started
- Fork the repository on GitHub.

- Clone your fork locally:

  git clone https://github.com/<your-username>/lindera.git
  cd lindera

- Create a feature branch:

  git checkout -b feature/my-feature

- Make your changes, then verify they pass all checks:

  cargo fmt --all -- --check
  cargo clippy -- -D warnings
  cargo test

- Commit and push your changes, then open a pull request.
Code Style
- Follow the existing code style in the repository.
- Run `cargo fmt` before committing.
- All public and private items (types, functions, modules, fields, constants, type aliases) must have documentation comments (`///`).
- Trait implementation methods should also have documentation comments describing implementation-specific behavior.
- Function and method documentation should include `# Arguments` and `# Returns` sections where applicable.
- Code comments, documentation comments, commit messages, log messages, and error messages should be written in English.
- Avoid `unwrap()` and `expect()` in production code (test code is fine).
- Use `unsafe` blocks only when necessary, and always include a `// SAFETY: ...` comment.
- Use file-based module style (`src/tokenizer.rs`) instead of `mod.rs` style.
Testing
- Write unit tests for all new functionality.

- Run the relevant test(s) during development for fast feedback:

  cargo test -p <crate> <test_name>

- When working with the `train` feature, include the feature flag:

  cargo test -p lindera-dictionary --features train
Commit Messages
Follow the Conventional Commits specification. Write commit messages in English.
Examples:
- `feat: add Korean dictionary support`
- `fix: correct character category ID in trainer`
- `docs: update installation instructions`
- `refactor: split large training method into smaller functions`
Documentation
- If your change affects user-facing documentation, update the relevant files in `docs/src/`.

- After editing Markdown files, verify there are no lint errors:

  markdownlint-cli2 "docs/src/**/*.md"

- Rules are configured in `.markdownlint.json` at the repository root.
Dependencies
When adding new dependencies, verify license compatibility. Lindera is released under the MIT License.
Feature Flags
Use #[cfg(feature = "train")] for conditional compilation of training-related code. See Feature Flags for a full list.
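For example, a minimal sketch (the module and function names are hypothetical, and the snippet assumes a crate that declares a `train` feature):

// Compiled only when building with --features train.
#[cfg(feature = "train")]
mod trainer {
    /// Hypothetical item that exists only in train-enabled builds.
    pub fn hello() -> &'static str {
        "training enabled"
    }
}

#[cfg(feature = "train")]
fn main() {
    println!("{}", trainer::hello());
}

#[cfg(not(feature = "train"))]
fn main() {
    println!("built without the train feature");
}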
Reporting Issues
When reporting a bug, please include:
- Lindera version (`lindera --version` or check `Cargo.toml`)
- Rust version (`rustc --version`)
- Operating system
- Steps to reproduce the issue
- Expected and actual behavior