A lightweight Japanese tokenizer that runs in the browser via WebAssembly. Uses feature-based analysis instead of large dictionary files.
Suzume tokenizes Japanese text using character patterns, connection rules, and a small dictionary (~400KB), rather than the large dictionaries (20-50MB+) used by traditional morphological analyzers like MeCab or Kuromoji. The WASM build is around 360KB gzipped.
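To give a feel for what "character patterns" means, here is a deliberately naive sketch (not Suzume's actual algorithm, and `naiveSegment` is a hypothetical helper, not part of the library): in Japanese, transitions between character classes — kanji, hiragana, katakana — already hint at word boundaries, and feature-based analysis builds on signals like these combined with connection rules and a small dictionary.

```javascript
// NOT Suzume's algorithm — a naive illustration of how character-class
// transitions alone give a rough first-pass segmentation of Japanese.
function scriptOf(ch) {
  if (/\p{Script=Han}/u.test(ch)) return 'kanji'
  if (/\p{Script=Hiragana}/u.test(ch)) return 'hiragana'
  if (/\p{Script=Katakana}/u.test(ch)) return 'katakana'
  return 'other'
}

function naiveSegment(text) {
  const out = []
  for (const ch of text) {
    const s = scriptOf(ch)
    const last = out[out.length - 1]
    // Extend the current run while the character class stays the same;
    // start a new segment when it changes.
    if (last && last.script === s) last.surface += ch
    else out.push({ surface: ch, script: s })
  }
  return out.map((seg) => seg.surface)
}

console.log(naiveSegment('東京に行きました'))
// → [ '東京', 'に', '行', 'きました' ]
```

A real analyzer refines such boundaries further (e.g. joining 行 + きました into an inflected verb), which is where connection rules and the small dictionary come in.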
| | Traditional Analyzers | Suzume |
|---|---|---|
| Bundle Size | 20-50MB+ (dictionary) | <400KB gzipped |
| Browser Support | Limited or none | Supported (WASM) |
| Server Required | Usually yes | No |
| POS Tagging | Yes | Yes |
| Lemmatization | Yes | Yes |
- Smaller footprint — No large dictionary download; suitable for frontend, edge, and serverless environments
- Handles unknown words — Feature-based analysis doesn't fail on words missing from a dictionary
- Less accurate on edge cases — Traditional dictionary-based analyzers will be more accurate for specialized vocabulary and complex linguistic analysis
Install with npm:

```bash
npm install @libraz/suzume
```

Or use yarn/pnpm/bun:

```bash
yarn add @libraz/suzume
pnpm add @libraz/suzume
bun add @libraz/suzume
```

Then create an analyzer instance and tokenize:

```javascript
import { Suzume } from '@libraz/suzume'

const suzume = await Suzume.create()
const tokens = suzume.analyze('すもももももももものうち')

for (const t of tokens) {
  console.log(`${t.surface} [${t.posJa}]`)
}
```
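Tokens expose fields such as `surface`, `posJa`, and (judging from the C++ API below) `lemma`. As a quick illustration of post-processing that output, here is a small helper that counts lemma frequencies — note the `tokens` array is a hand-written stand-in for what `suzume.analyze(...)` might return, not actual library output:

```javascript
// Hypothetical token shapes modeled on the fields used in this README
// (surface, lemma, posJa); in real use they come from suzume.analyze(text).
const tokens = [
  { surface: '行き', lemma: '行く', posJa: '動詞' },
  { surface: 'まし', lemma: 'ます', posJa: '助動詞' },
  { surface: '行く', lemma: '行く', posJa: '動詞' },
]

// Count occurrences per lemma — a typical first step for tagging or search,
// since inflected forms (行き, 行く) collapse onto one dictionary form.
function lemmaCounts(tokens) {
  const counts = new Map()
  for (const t of tokens) {
    counts.set(t.lemma, (counts.get(t.lemma) ?? 0) + 1)
  }
  return counts
}

console.log(lemmaCounts(tokens))
// → Map(2) { '行く' => 2, 'ます' => 1 }
```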
```javascript
// Tag extraction (returns { tag, pos } objects)
const tags = suzume.generateTags('東京スカイツリーに行きました')
// → [{ tag: '東京', pos: 'noun' }, { tag: 'スカイツリー', pos: 'noun' }, { tag: '行く', pos: 'verb' }]

// Nouns only
suzume.generateTags('美味しいラーメンを食べた', { pos: ['noun'] })
// → [{ tag: 'ラーメン', pos: 'noun' }]

// Exclude basic words (hiragana-only lemmas such as する, ある, いい)
suzume.generateTags('今日はいい天気ですね', { excludeBasic: true })
// → [{ tag: '今日', pos: 'noun' }, { tag: '天気', pos: 'noun' }]
```

In the browser, Suzume can be loaded directly from a CDN:

```html
<script type="module">
  import { Suzume } from 'https://esm.sh/@libraz/suzume'

  const suzume = await Suzume.create()
  console.log(suzume.analyze('こんにちは'))
</script>
```

The core library can also be used directly from C++:

```cpp
#include <iostream>

#include "suzume.h"

int main() {
  suzume::Suzume tokenizer;
  auto tokens = tokenizer.analyze("東京に行きました");
  for (const auto& t : tokens) {
    std::cout << t.surface << "\t" << t.lemma << std::endl;
  }
}
```

Build from source (requires C++17 and CMake 3.15+):

```bash
make       # Build
make test  # Run tests
```

Documentation:

- Getting Started — Installation and basic usage
- API Reference — API documentation
- User Dictionary — Adding custom words
- How It Works — Technical details