Japanese Text on Computers

A tour of the writing system, the fonts, and the engineering problems they cause.

One sentence, three scripts

  • , : Kanji (Chinese-origin)
  • , , : Hiragana (grammar / verb endings)
  • コーヒー: Katakana (loanword from English "coffee")

The three writing systems

Hiragana
~46 chars, syllabic, for native words / grammar
Katakana
~46 chars, syllabic, for loanwords / emphasis
Kanji
thousands, logographic, borrowed from Chinese

Same sounds, two scripts

Columns = vowels, rows = consonants. Each cell has the hiragana and its katakana twin (same sound, different glyph).

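     a      i      u      e      o
·    あ ア  い イ  う ウ  え エ  お オ
k    か カ  き キ  く ク  け ケ  こ コ
s    さ サ  し シ  す ス  せ セ  そ ソ
t    た タ  ち チ  つ ツ  て テ  と ト
n    な ナ  に ニ  ぬ ヌ  ね ネ  の ノ
h    は ハ  ひ ヒ  ふ フ  へ ヘ  ほ ホ
m    ま マ  み ミ  む ム  め メ  も モ
y    や ヤ  ·      ゆ ユ  ·      よ ヨ
r    ら ラ  り リ  る ル  れ レ  ろ ロ
w    わ ワ  ·      ·      ·      を ヲ
n    ん ン  ·      ·      ·      ·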
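The "twin" relationship also shows up in Unicode: the hiragana and katakana blocks are laid out in parallel, so a twin is a fixed code-point offset away. A minimal sketch, assuming only the basic range ぁ to ゖ (U+3041 to U+3096) needs converting; the name `toKatakana` is illustrative, not from any library.

```javascript
// Hiragana ぁ..ゖ occupy U+3041..U+3096; each katakana twin sits exactly 0x60 higher.
const KANA_OFFSET = 0x30a1 - 0x3041; // 0x60

function toKatakana(text) {
  // Shift only hiragana; everything else (kanji, Latin, the long-vowel mark) passes through.
  return text.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + KANA_OFFSET)
  );
}

console.log(toKatakana("ひらがな")); // ヒラガナ
```

The reverse direction is the same idea with the offset subtracted over U+30A1 to U+30F6.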

Small kana: yōon

Add a small ゃ, ゅ, or ょ after any i-column kana → combined syllable.

         + ゃ       + ゅ       + ょ
き ki    きゃ kya   きゅ kyu   きょ kyo
し shi   しゃ sha   しゅ shu   しょ sho
ち chi   ちゃ cha   ちゅ chu   ちょ cho
り ri    りゃ rya   りゅ ryu   りょ ryo

しゃしん (photo), きょう (today), りょこう (trip)

Small kana: sokuon

A small っ (like つ, shrunk) doubles the next consonant.

Without      With small っ
おと oto     おっと otto     sound → husband
かこ kako    かっこ kakko    past → parentheses
まて mate    まって matte    wait (root) → "wait!"

In speech: a brief glottal stop. In writing: a tiny っ that changes the word entirely.

Voicing marks: dakuten

Add ゛ to flip a consonant from unvoiced to voiced. Applies across whole rows.

Row      Unvoiced   + ゛ → voiced   Sound shift
k-row    か ka      が ga           k → g
s-row    さ sa      ざ za           s → z
t-row    た ta      だ da           t → d
h-row    は ha      ば ba           h → b

Rule extends across all 5 vowels: か き く け こ → が ぎ ぐ げ ご

H-row's special twin: handakuten

A small circle ゜ (only valid on the h-row) gives the unvoiced plosive p.

base: は ha · ひ hi · ふ fu · へ he · ほ ho
+ ゛: ば ba · び bi · ぶ bu · べ be · ぼ bo (voiced)
+ ゜: ぱ pa · ぴ pi · ぷ pu · ぺ pe · ぽ po (plosive)

Only row with three forms. The h-row is phonetically unstable, so it picked up an extra diacritic over the centuries.
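For software, these marks mean a voiced kana can be stored two ways: as one precomposed code point, or as a base kana followed by a combining mark. A small sketch using the standard `String.prototype.normalize`; the variable names are illustrative.

```javascript
// U+3099 is the combining dakuten, U+309A the combining handakuten.
const decomposedGa = "\u304B\u3099"; // か + combining dakuten
const decomposedPa = "\u306F\u309A"; // は + combining handakuten

// NFC composes base + mark into the single precomposed code point.
const ga = decomposedGa.normalize("NFC");
const pa = decomposedPa.normalize("NFC");

console.log(ga === "\u304C", ga.length); // true 1  (が)
console.log(pa === "\u3071", pa.length); // true 1  (ぱ)

// Without normalization, visually identical strings compare unequal.
console.log(decomposedGa === "\u304C"); // false
```

Normalizing both sides to the same form before comparing or searching avoids this class of mismatch.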

How many glyphs?

Log scale: every order of magnitude is the same width on screen. Linear scale: Latin and Jōyō almost vanish next to the full CJK set.

What does this cost on the web?

woff2, same font family (Noto Sans / Inter) where possible.

Subsetting to the rescue


@font-face {
  font-family: "Noto Sans JP";
  src: url("noto-jp-subset-hiragana.woff2") format("woff2");
  unicode-range: U+3040-309F; /* hiragana block only */
}

Google Fonts splits Noto Sans JP into ~120 subsets.

The browser downloads just the chunks that contain characters on the page.
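Roughly what that selection amounts to, as a sketch: compare each character's code point against every subset's declared range and keep the files that match. The subset table below (file names and range boundaries) is hypothetical and far coarser than the real Google Fonts slices.

```javascript
// Hypothetical subset table: file names and ranges are made up for illustration.
const SUBSETS = [
  { file: "subset-hiragana.woff2", ranges: [[0x3040, 0x309f]] },
  { file: "subset-katakana.woff2", ranges: [[0x30a0, 0x30ff]] },
  { file: "subset-cjk.woff2", ranges: [[0x4e00, 0x9fff]] },
];

function subsetsNeeded(text) {
  const needed = new Set();
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    for (const { file, ranges } of SUBSETS) {
      // A file is needed as soon as one character falls inside one of its ranges.
      if (ranges.some(([lo, hi]) => cp >= lo && cp <= hi)) needed.add(file);
    }
  }
  return [...needed];
}

console.log(subsetsNeeded("コーヒー")); // [ 'subset-katakana.woff2' ]
```

In a real page the browser does this matching itself from the `unicode-range` descriptors; no script is involved.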

Or: skip kanji entirely

Pokémon FireRed (Game Boy Advance, 2004). All UI in hiragana & katakana, no kanji.

Pokémon FireRed name entry screen: 'あなたの なまえは?' above a hiragana selection grid

What in-game dialogue looked like

"You, a rookie trainer, set off on an adventure with Pokémon!"

Why drop kanji?

Each kanji is a 16×16 bitmap → 32 bytes.
A full Jōyō set: 2,136 × 32 ≈ 68 KB of pure glyph data.

For comparison: Pokémon Red & Blue (Game Boy, 1996) shipped the entire game in 1 MB.
151 Pokémon, all sprites, all maps, all music, all code.

A full kanji set would have been ~7% of that cartridge, just for the letters.

Easier: ship the 92 kana glyphs, skip kanji altogether.
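The arithmetic above, spelled out as a sketch. It assumes 1 bit per pixel and reads "1 MB" as 1,048,576 bytes, which is why the share comes out slightly under the rounded 7%.

```javascript
const BYTES_PER_GLYPH = (16 * 16) / 8; // 16×16 pixels at 1 bit each = 32 bytes
const JOYO_COUNT = 2136;
const KANA_COUNT = 92;
const CARTRIDGE_BYTES = 1024 * 1024; // assumption: "1 MB" means 1 MiB

const kanjiBytes = JOYO_COUNT * BYTES_PER_GLYPH; // 68352
const kanaBytes = KANA_COUNT * BYTES_PER_GLYPH; // 2944

console.log(kanjiBytes, kanaBytes); // 68352 2944
console.log(((kanjiBytes / CARTRIDGE_BYTES) * 100).toFixed(1) + "%"); // 6.5%
```

The kana-only set costs under 3 KB, about a twentieth of the kanji set.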

But then: readability tanks

Could be: (high school) / (filial piety) / (going second) / (at most)…

Solution Pokémon adopted: add spaces between phrases, a thing Japanese normally doesn't have.

Hardware caught up

2004 Pokémon FireRed (GBA): kana only
2006 Diamond/Pearl (DS): kanji as an option
2013 X/Y (3DS): kanji on by default

The same problem the web solved with subsetting, games solved by waiting for ROM to get cheap.

No spaces. Anywhere.

Where does each word start?

お (honorific) + すし (sushi) + を (object marker) + 食べ (eat, verb stem) + ました (polite past)

Need a dictionary and grammar rules to do this.
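Browsers and Node ship such a dictionary behind the standard `Intl.Segmenter` API. A minimal sketch, assuming the sentence in the breakdown above reads おすしを食べました. The exact splits come from the runtime's ICU data, so they can differ between engines and need not match the breakdown.

```javascript
// Word-granularity segmentation for Japanese, backed by the runtime's dictionary.
const segmenter = new Intl.Segmenter("ja", { granularity: "word" });
const sentence = "おすしを食べました";

const pieces = [...segmenter.segment(sentence)];
// isWordLike filters out punctuation and whitespace segments.
const words = pieces.filter((p) => p.isWordLike).map((p) => p.segment);

console.log(words);
```

Each segment also carries an `index`, which is what a "jump to word under cursor" feature would need.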

Compounds make it worse

A naïve Ctrl+F for finds , 自動, , 自転… you almost never want that.

…but compounds can be poetry

Each character carries meaning. Stack them and you get a tiny picture.

木漏れ日 (komorebi)
tree · leaked · sun
sunlight filtering through leaves
走馬灯 (sōmatō)
running · horse · lantern
life flashing before your eyes
猫舌 (nekojita)
cat · tongue
can't handle hot food
積ん読 (tsundoku)
pile up · read
books bought but never read
腹黒い (haraguroi)
belly · black
scheming, two-faced

Counters: count by shape

Japanese has no grammatical plural. But it has ~50 counter words that change based on the shape of the thing being counted.

Example             Counter    Used for
2 cars              台 dai     machines
2 cats              匹 hiki    small animals
2 horses            頭 tō      large animals
2 books             冊 satsu   bound things
2 sheets of paper   枚 mai     flat thin things
2 pencils           本 hon     long cylindrical things
2 rabbits           羽 wa      birds… and rabbits 🐇

So localization has to know which counter goes with which noun. (We'll come back to this.)

What i18n libraries can't do for you

English needs plural forms. Japanese needs something CLDR doesn't ship.


new Intl.PluralRules('en').select(1)  // "one"
new Intl.PluralRules('en').select(2)  // "other"

new Intl.PluralRules('ja').select(1)  // "other"
new Intl.PluralRules('ja').select(2)  // "other"

Japanese has exactly one plural category. The library has nothing to do.

The real problem isn't grammar, it's lexical:

Which counter to use is decided per noun, not per language, and no standard library knows the mapping.

Plus the phonetic mutation: the number itself changes the counter's reading.

What apps actually do: hard-code per noun, or punt to 個 (ko), the generic "thing" counter.
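A sketch of that hard-coded approach. The noun list, the function name `countPhrase`, and the output format are all illustrative; a real app needs one entry per noun it displays, plus a reading table for every counter whose sound shifts.

```javascript
// Illustrative data only: one entry per noun the app knows about.
const COUNTER_FOR_NOUN = new Map([
  ["車", "台"], ["猫", "匹"], ["馬", "頭"], ["本", "冊"],
  ["紙", "枚"], ["鉛筆", "本"], ["うさぎ", "羽"],
]);
const GENERIC_COUNTER = "個"; // fallback when the noun is unknown

// The number changes the reading of 本 (hon): ippon, nihon, sanbon.
const HON_READINGS = { 1: "いっぽん", 2: "にほん", 3: "さんぼん" };

function countPhrase(noun, n) {
  const counter = COUNTER_FOR_NOUN.get(noun) ?? GENERIC_COUNTER;
  return `${noun}${n}${counter}`;
}

console.log(countPhrase("猫", 2)); // 猫2匹
console.log(countPhrase("りんご", 3)); // りんご3個 (unknown noun, generic counter)
console.log(HON_READINGS[3]); // さんぼん
```

Written text can hide the sound change behind a numeral; anything that produces speech or furigana cannot.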

So how do learners read?

A browser extension that tokenizes + dictionary-looks-up under your cursor.

Yomitan

Yomitan in one sentence

Hover any Japanese text →
it segments words, deinflects them, and shows the dictionary entry.

  • Inflected forms are deinflected to their dictionary form
  • Shows readings, meanings, frequency, pitch accent
  • Bring your own dictionaries
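Deinflection can be pictured as suffix rewriting checked against the dictionary. This is a toy sketch, not Yomitan's actual algorithm: the rules cover only a few endings of ichidan (-ru) verbs, and the two-word dictionary is made up.

```javascript
// Toy rules for ichidan verbs only; real deinflection covers far more patterns.
const RULES = [
  { suffix: "ました", replacement: "る" }, // polite past
  { suffix: "ます", replacement: "る" }, // polite non-past
  { suffix: "ない", replacement: "る" }, // plain negative
  { suffix: "た", replacement: "る" }, // plain past
];
const DICTIONARY = new Set(["食べる", "見る"]);

function deinflect(word) {
  if (DICTIONARY.has(word)) return word; // already a dictionary form
  for (const { suffix, replacement } of RULES) {
    if (!word.endsWith(suffix)) continue;
    const candidate = word.slice(0, -suffix.length) + replacement;
    // Only accept a rewrite that lands on a real entry.
    if (DICTIONARY.has(candidate)) return candidate;
  }
  return null;
}

console.log(deinflect("食べました")); // 食べる
console.log(deinflect("見ない")); // 見る
```

Checking candidates against the dictionary is what keeps blind suffix stripping from inventing words.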

Bring your own dictionaries

Yomitan ships empty. You import whichever dictionary files suit how you read.

JMdict
The community Japanese ↔ English dictionary. The default starting point.
大辞泉 · 大辞林
Monolingual JP ↔ JP, what native speakers reach for.
Pixiv encyclopedia
Anime, manga, slang, internet coinages.
NHK pitch accent
Where the pitch rises and falls within a word.
Frequency lists
BCCWJ (balanced corpus of written Japanese), anime subs, light novels. "Is this word common?"
Your own
Format is plain JSON. People ship Anki-linked, medical, legal, dialect-specific dicts.

Longest match wins

Hover anywhere. Yomitan reads forward from your cursor and picks the longest entry it can find.

✗ not a word
✓ 図書館 library
図書 books (skipped)
図 diagram (skipped)

So hovering 図 in this sentence pops up 図書館 (library), not 図 (diagram), even though both are valid.
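The longest-match rule fits in a few lines. A sketch with a three-entry dictionary and an arbitrary scan window of 8 characters; the sample sentence and the name `longestMatch` are illustrative, and real lookups also deinflect each candidate.

```javascript
const ENTRIES = new Map([
  ["図", "diagram"],
  ["図書", "books"],
  ["図書館", "library"],
]);
const MAX_LEN = 8; // arbitrary scan window, in characters

function longestMatch(text, start) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  // Try the longest candidate first and shrink until something is found.
  for (let len = Math.min(MAX_LEN, chars.length - start); len > 0; len--) {
    const candidate = chars.slice(start, start + len).join("");
    if (ENTRIES.has(candidate)) {
      return { term: candidate, gloss: ENTRIES.get(candidate) };
    }
  }
  return null;
}

console.log(longestMatch("図書館で本を読む", 0)); // { term: '図書館', gloss: 'library' }
```

Shorter entries are never reached once a longer one matches, which is why 図書 and 図 are skipped.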

Live demo

NHK Easy: ne2026052112586 ↗

Open the link, hover any sentence.

And how do you type all this?

Reading was one side. The other side is input: the IME (Input Method Editor). Built into every desktop, phone, and browser.

1. Type romaji
nihongo
2. IME auto-converts to kana
にほんご
3. Press space, pick kanji
日本語

Three layers of conversion between the keyboard and what lands in the document.
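The first layer (romaji to kana) is a greedy table lookup. A sketch with a deliberately tiny table; a real IME covers every syllable plus many edge cases, and `romajiToKana` is an illustrative name.

```javascript
// Tiny illustrative table; a real one has an entry for every syllable.
const ROMAJI_TO_KANA = new Map([
  ["ni", "に"], ["ho", "ほ"], ["go", "ご"], ["n", "ん"],
  ["ma", "ま"], ["te", "て"], ["o", "お"], ["to", "と"],
]);

function romajiToKana(input) {
  let out = "";
  let i = 0;
  while (i < input.length) {
    // A doubled consonant (other than n) becomes the small っ.
    if (input[i] === input[i + 1] && !"aiueon".includes(input[i])) {
      out += "っ";
      i += 1;
      continue;
    }
    let matched = false;
    for (const len of [3, 2, 1]) {
      const key = input.slice(i, i + len);
      if (ROMAJI_TO_KANA.has(key)) {
        out += ROMAJI_TO_KANA.get(key);
        i += key.length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out += input[i]; // pass unknown input through unchanged
      i += 1;
    }
  }
  return out;
}

console.log(romajiToKana("nihongo")); // にほんご
console.log(romajiToKana("matte")); // まって
```

The later layers (kana to kanji candidates, then ranking) need a dictionary and a language model, which is where IMEs differ most.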

Picking from candidates

Press space again. The IME shows every kanji string the kana could possibly spell.

にほんご [space]
  1. 日本語 Japanese language ★ default
  2. 二本後 "after 2 cylindrical things"
  3. 2本後 mixed numeral + counter
  4. にほんご leave as hiragana
  5. ニホンゴ force katakana

Arrow keys or number keys to pick. The IME learns your preferences and reorders next time.

How many kanji do you actually need?

Approx. cumulative coverage of running newspaper text by the top-N most frequent kanji.
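How such a curve is computed, as a sketch: sort kanji by frequency, then accumulate each one's share of all kanji occurrences. The four counts below are made up to keep the arithmetic visible; they are not corpus data.

```javascript
// counts: occurrences per kanji in some corpus (made-up numbers below).
function cumulativeCoverage(counts) {
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const total = sorted.reduce((sum, c) => sum + c, 0);
  let running = 0;
  // Entry N-1 is the share of running text covered by the top-N kanji.
  return sorted.map((c) => (running += c) / total);
}

const sample = new Map([["日", 50], ["人", 30], ["語", 15], ["鬱", 5]]);
console.log(cumulativeCoverage(sample)); // [ 0.5, 0.8, 0.95, 1 ]
```

With real data the curve rises steeply at first and flattens into a long tail of rare characters.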

The official list: Jōyō kanji

常用漢字 (jōyō kanji): "regular-use kanji"

2,136
characters on the official list
(since the 2010 revision)
Grades 1–6 1,026
Grades 7–9 1,110
  • Defined by the government. Schools, newspapers, and official documents commit to staying within this set.
  • Anything outside → write in kana, or add furigana.
  • So "~2,000 kanji" on the previous chart isn't arbitrary. It's the curriculum.

Furigana: kanji with training wheels

Tiny kana printed above (or beside) kanji to show how to read them.

reading shown above

Where you see it: children's books, manga, learners' material, place & person names with rare readings.

HTML ships with <ruby>

The web has had a native element for exactly this for over 20 years, and every current browser supports it.


<ruby>
  漢字
  <rp>(</rp><rt>かんじ</rt><rp>)</rp>
</ruby>

Renders as:

  • <rt>: the small "ruby text" annotation
  • <rp>: fallback parentheses for browsers without ruby support (modern browsers ignore them)

Also handy for: chemical formulas (H2O with "two" annotated above the 2), abbreviation expansions, screen-reader-friendly pronunciation hints.
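Generating this markup from (text, reading) pairs is a one-liner plus escaping. A sketch; `toRuby` and `escapeHtml` are illustrative helper names, not part of any library.

```javascript
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap a base string and its reading in <ruby>, with <rp> fallback parentheses.
function toRuby(base, reading) {
  return `<ruby>${escapeHtml(base)}<rp>(</rp><rt>${escapeHtml(reading)}</rt><rp>)</rp></ruby>`;
}

console.log(toRuby("漢字", "かんじ"));
// <ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby>
```

Annotating per word, or even per kanji, rather than per sentence keeps each reading positioned over the characters it belongs to.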

Stroke order matters


語 (language), 14 strokes

Stroke data: KanjiVG (CC BY-SA 3.0)

Reading text from images

OCR: Optical Character Recognition. Pixels in, text out.

You use it constantly:

  • Google Lens, Apple Live Text
  • Searchable PDFs, document scanners
  • Receipts, business cards, license plates
  • Translating signs and menus on the go

For English: largely solved. For Japanese: still rough.

OCR: harder than it looks

26 letters vs ~6,000 glyphs

Some kanji differ by a single stroke.

Spot the difference

Pair      Meaning
末 / 未   end / not yet
土 / 士   earth / samurai
日 / 曰   sun / say
人 / 入   person / enter

At low resolution? Good luck.

And then there's manga

Vertical, right-to-left
Furigana: tiny kana on top
Tilted speech bubbles
Stylised handwritten SFX

→ specialised tools like mokuro, manga-ocr.

One last thing: emoji

絵 (e, picture) + 文字 (moji, character) = 絵文字 (emoji)

1999. Shigetaka Kurita at NTT DoCoMo designs 176 icons of 12×12 pixels for the i-mode mobile service, so messages can express more on tiny screens.

☀️❄️🌙🍦🍣🚗📞💌🎵

Same constraint we've been tracking throughout this talk: tiny budget, want to say more. The Japanese answer: ship pictures as characters.

2008: iPhone picks them up. 2010: Unicode adopts them. Today: ~3,800 in the standard, and you've sent thousands.

Thank you!

My name is Truls

Slides on GitHub:
github.com/trulshj/presentation

Tools mentioned: Yomitan, KanjiVG, mokuro, manga-ocr, Noto Sans JP