Introducing uu: a tool for inspecting Unicode text

I wrote a small tool called uu for examining streams of Unicode text.

The subcommand uu inspect reads standard input, parses it as UTF-8, and prints one line for each code point it finds, with details about that code point.

$ echo 'V = ⁴⁄₃πr³ 🤔' | uu inspect
GLYPH	CODE POINT  UTF-8 BYTES  NAME                    BLOCK                                 CATEGORY
V	U+0056      56           LATIN CAPITAL LETTER V  Basic Latin                           Uppercase Letter
	U+0020      20           SPACE                   Basic Latin                           Space
=	U+003D      3d           EQUALS SIGN             Basic Latin                           Math Symbol
	U+0020      20           SPACE                   Basic Latin                           Space
⁴	U+2074      e2 81 b4     SUPERSCRIPT FOUR        Superscripts and Subscripts           Other Numeric
⁄	U+2044      e2 81 84     FRACTION SLASH          General Punctuation                   Math Symbol
₃	U+2083      e2 82 83     SUBSCRIPT THREE         Superscripts and Subscripts           Other Numeric
π	U+03C0      cf 80        GREEK SMALL LETTER PI   Greek and Coptic                      Lowercase Letter
r	U+0072      72           LATIN SMALL LETTER R    Basic Latin                           Lowercase Letter
³	U+00B3      c2 b3        SUPERSCRIPT THREE       Latin-1 Supplement                    Other Numeric
	U+0020      20           SPACE                   Basic Latin                           Space
🤔	U+1F914     f0 9f a4 94  THINKING FACE           Supplemental Symbols and Pictographs  Other Symbol
^J	U+000A      0a           <LINE FEED>             Basic Latin                           Control
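
For readers curious about what goes into each row, here is a minimal Python sketch of the same idea. It is not uu's actual implementation; the function name `inspect` is made up for illustration. Python's standard library does not expose block names, so that column is left out, and the category is shown as its two-letter abbreviation rather than spelled out.

```python
import unicodedata

def inspect(text):
    """Yield (code point, UTF-8 bytes, name, category) for each code point in text.

    Hypothetical helper for illustration only; not how uu itself is written.
    """
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        # Encode this single code point and show its bytes as spaced hex.
        utf8_bytes = ch.encode("utf-8").hex(" ")
        # Control characters such as LINE FEED have no formal name,
        # so fall back to a placeholder instead of raising ValueError.
        name = unicodedata.name(ch, "<unnamed>")
        # Two-letter general category, e.g. "Lu" or "Sm".
        category = unicodedata.category(ch)
        yield code_point, utf8_bytes, name, category

for row in inspect("V = ⁴⁄₃πr³ 🤔\n"):
    print("  ".join(row))
```

The names and categories come from whatever Unicode version your Python build ships with, so very recent code points may show up as unnamed.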

The subcommand uu lookup takes a code point as a command line argument and prints a table of information about it.

$ uu lookup U+203D
Glyph:                ‽
Code point:           U+203D
Name:                 INTERROBANG
Block:                General Punctuation
Category:             Other Punctuation (Po)
Bidirectional Class:  OtherNeutral (ON)
Added in version:     1.1.0
UTF-8:                e2 80 bd
UTF-16BE:             20 3d
UTF-16LE:             3d 20
UTF-32BE:             00 00 20 3d
UTF-32LE:             3d 20 00 00
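
The encoding rows in that table are easy to reproduce by hand. Below is a rough Python sketch, again not uu's own code, with `lookup` as a hypothetical function name. It covers only the fields the standard library can supply, so the block and "added in version" rows are omitted.

```python
import unicodedata

def lookup(spec):
    """Return a dict describing the code point named by a string like 'U+203D'.

    Hypothetical helper for illustration only; not how uu itself is written.
    """
    # Accept the hex digits with or without the "U+" prefix.
    hex_digits = spec[2:] if spec.upper().startswith("U+") else spec
    ch = chr(int(hex_digits, 16))

    info = {
        "Glyph": ch,
        "Code point": f"U+{ord(ch):04X}",
        "Name": unicodedata.name(ch, "<unnamed>"),
        "Category": unicodedata.category(ch),                  # e.g. "Po"
        "Bidirectional Class": unicodedata.bidirectional(ch),  # e.g. "ON"
    }
    # The explicit -be/-le codecs do not emit a byte order mark,
    # so the output is just the encoded code point.
    for label, codec in [
        ("UTF-8", "utf-8"),
        ("UTF-16BE", "utf-16-be"),
        ("UTF-16LE", "utf-16-le"),
        ("UTF-32BE", "utf-32-be"),
        ("UTF-32LE", "utf-32-le"),
    ]:
        info[label] = ch.encode(codec).hex(" ")
    return info

for key, value in lookup("U+203D").items():
    print(f"{key + ':':<22}{value}")
```

Code points above U+FFFF, such as U+1F914, come out of the UTF-16 codecs as a surrogate pair (four bytes), which is a handy thing to check when debugging text that has passed through a UTF-16 system.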

I wrote an early version of this tool in 2018, while working on a project to pre-process human language text to make it suitable for input into a text-to-speech ML model. I was using Tim Whitlock’s Unicode character inspector app to examine the sample inputs I was working with, and wishing for a command line tool that offered similar features, so I hacked together a quick Python script to do the job.

I recently got around to rewriting the program and releasing it under the ISC license. It’s now a stand-alone executable with no dependencies. You can get the source code on GitHub (you’ll need a Rust toolchain installed to build it). Alternatively, if you’re on macOS you can install it with Homebrew:

brew install jake-low/tools/uu

I hope someone else out there will find it useful.