Suzume

A lightweight Japanese tokenizer that runs in the browser via WebAssembly. Uses feature-based analysis instead of large dictionary files.

Documentation | Live Demo

Overview

Suzume tokenizes Japanese text using character patterns, connection rules, and a small dictionary (~400KB), rather than the large dictionaries (20-50MB+) used by traditional morphological analyzers like MeCab or Kuromoji. The WASM build is around 360KB gzipped.

Feature	Traditional Analyzers	Suzume
Bundle Size	20-50MB+ (dictionary)	<400KB gzipped
Browser Support	Limited or none	Supported (WASM)
Server Required	Usually yes	No
POS Tagging	Yes	Yes
Lemmatization	Yes	Yes

Trade-offs

Smaller footprint: No large dictionary download; suitable for frontend, edge, and serverless environments
Handles unknown words: Feature-based analysis doesn't fail on words missing from a dictionary
Less accurate on edge cases: Traditional dictionary-based analyzers tend to be more accurate on specialized vocabulary and complex linguistic analysis

Installation

npm install @libraz/suzume

Or use yarn/pnpm/bun:

yarn add @libraz/suzume
pnpm add @libraz/suzume
bun add @libraz/suzume

Quick Start

JavaScript / TypeScript

import { Suzume } from '@libraz/suzume'

const suzume = await Suzume.create()

const tokens = suzume.analyze('すもももももももものうち')
for (const t of tokens) {
  console.log(`${t.surface} [${t.posJa}]`)
}

// Tag extraction (returns { tag, pos } objects)
const tags = suzume.generateTags('東京スカイツリーに行きました')
// → [{ tag: '東京', pos: 'noun' }, { tag: 'スカイツリー', pos: 'noun' }, { tag: '行く', pos: 'verb' }]

// Nouns only
suzume.generateTags('美味しいラーメンを食べた', { pos: ['noun'] })
// → [{ tag: 'ラーメン', pos: 'noun' }]

// Exclude basic words (hiragana-only lemma like する, ある, いい)
suzume.generateTags('今日はいい天気ですね', { excludeBasic: true })
// → [{ tag: '今日', pos: 'noun' }, { tag: '天気', pos: 'noun' }]
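
The `{ tag, pos }` objects returned by `generateTags` are easy to post-process. As a rough sketch, the helper below counts how often each tag appears across several tag lists, for example one list per document. The `Tag` interface and the `countTags` function are illustrative names, not part of the Suzume API, and the sample data is hand-written to mirror the examples above rather than produced by the tokenizer.

```typescript
// Shape of the objects shown in the generateTags() examples above.
// (Assumed from the README output; the real type may carry more fields.)
interface Tag {
  tag: string
  pos: string
}

// Hypothetical helper: count how often each tag appears across several
// tag lists, sorted by descending count (ties keep first-seen order).
function countTags(tagLists: Tag[][]): Array<[string, number]> {
  const counts = new Map<string, number>()
  for (const tags of tagLists) {
    for (const { tag } of tags) {
      counts.set(tag, (counts.get(tag) ?? 0) + 1)
    }
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1])
}

// Hand-written sample data; in real use each list would come from
// suzume.generateTags(text).
const sample: Tag[][] = [
  [
    { tag: '東京', pos: 'noun' },
    { tag: 'スカイツリー', pos: 'noun' },
    { tag: '行く', pos: 'verb' },
  ],
  [
    { tag: '東京', pos: 'noun' },
    { tag: '天気', pos: 'noun' },
  ],
]

console.log(countTags(sample))
```

Using a `Map` keeps tags in first-seen order, so tags with equal counts stay in the order they were encountered.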

Browser (CDN)

<script type="module">
  import { Suzume } from 'https://esm.sh/@libraz/suzume'

  const suzume = await Suzume.create()
  console.log(suzume.analyze('こんにちは'))
</script>
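
Because `Suzume.create()` is asynchronous, a page that tokenizes text from several places may want to create the tokenizer once and share it. The sketch below shows one way to do that with a small memoizing wrapper. The `once` helper and the stand-in factory are hypothetical and not part of Suzume; in a real page the factory would be `() => Suzume.create()`. Whether instance creation is costly enough to justify this is an assumption, not something the README states.

```typescript
// Hypothetical helper: wrap an async factory so it runs at most once.
// Later calls reuse the promise returned by the first call.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined
  return () => {
    if (cached === undefined) {
      cached = factory()
    }
    return cached
  }
}

// Stand-in factory for illustration only; in a real page this would be
// () => Suzume.create()
let created = 0
const getTokenizer = once(async () => {
  created += 1
  return { id: created }
})

const first = getTokenizer()
const second = getTokenizer()
console.log(first === second) // true: both calls share one promise
```

Note that this caches the promise itself, so a failed initialization is also remembered; a production version might reset `cached` when the factory rejects.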

C++

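#include <iostream>

#include "suzume.h"

int main() {
    suzume::Suzume tokenizer;
    auto tokens = tokenizer.analyze("東京に行きました");

    // Print each token's surface form and lemma, tab-separated
    for (const auto& t : tokens) {
        std::cout << t.surface << "\t" << t.lemma << '\n';
    }
    return 0;
}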

Build from source (requires C++17, CMake 3.15+):

make          # Build
make test     # Run tests

Documentation

Getting Started — Installation and basic usage
API Reference — API documentation
User Dictionary — Adding custom words
How It Works — Technical details

License

Apache License 2.0
