Syntax

2025/06/08

Since I've been adding a fair number of code snippets to the website, and because these snippets look fairly plain by default, I decided the other day to write a simple syntax highlighter. At first I began writing the tool in the usual way for parsing tools: read a stream of characters, process those into a stream of tokens, then finally process the tokens. After a bit, though, I realized I was overcomplicating things, and rewrote it to process the stream of characters directly.

The resulting approach is simpler, and removes the need for either allocating memory to store the tokens, or for providing a callback-style visitor interface for consuming the tokens. It's not as well-engineered, but it does exactly what I need it to in about 350 lines of documented code, uses no heap allocations[1], and processes files in about 10 milliseconds on my computer.
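To give a rough idea of what "processing the characters directly" can look like, here is a minimal sketch. It is not the code from the tool itself: the function name highlight, the str and com class names, and the choice to handle only string literals and // comments are all assumptions made for illustration. It ignores character literals, block comments, and HTML escaping.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch: copy C source from in to out, wrapping string
   literals and // comments in <span> tags as they are encountered.
   Output is written immediately, so no token list is ever built. */
static void highlight(FILE *in, FILE *out)
{
    enum { CODE, STRING, COMMENT } state = CODE;
    int c;

    assert(in != NULL && out != NULL);
    while ((c = getc(in)) != EOF) {
        if (state == CODE && c == '"') {
            fputs("<span class=\"str\">\"", out);
            state = STRING;
        } else if (state == CODE && c == '/') {
            /* One character of lookahead via getc/ungetc. */
            int next = getc(in);
            if (next == '/') {
                fputs("<span class=\"com\">//", out);
                state = COMMENT;
            } else {
                putc(c, out);
                if (next != EOF)
                    ungetc(next, in);
            }
        } else if (state == STRING && c == '\\') {
            /* Copy an escape sequence verbatim so \" does not end the string. */
            int next = getc(in);
            putc(c, out);
            if (next != EOF)
                putc(next, out);
        } else if (state == STRING && c == '"') {
            fputs("\"</span>", out);
            state = CODE;
        } else if (state == COMMENT && c == '\n') {
            fputs("</span>\n", out);
            state = CODE;
        } else {
            putc(c, out);
        }
    }
    if (state != CODE)
        fputs("</span>", out); /* close a span left open at end of input */
}
```

Calling highlight(stdin, stdout) would turn this into a filter. The only state carried between characters is the state variable, which is why nothing needs to be allocated to hold tokens.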

Before: A picture of C code without any syntax highlighting
After: A picture of C code with syntax highlighting added

I could have used a syntax highlighter that someone else had already written, but writing my own let me add some features that I haven't seen other tools provide. In particular, I've written the syntax highlighter so that it is idempotent: running the tool on the same file twice produces the same result as running it once. I wanted this because it lets me keep just one HTML file containing the code snippets, rather than keeping both an unprocessed HTML file and a post-processed HTML file. In addition, if the tool ever breaks or disappears, that single HTML file can continue to be served as-is. Finally, if I update the tool in the future, I can just re-process all the HTML files with the new version of the tool and they'll be automatically updated to show the new syntax highlighting.
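The post doesn't say how the idempotence is achieved, so the following is only one plausible way to get that property, sketched under assumptions: strip any spans left by a previous run, then highlight the plain text again. The names strip_spans, highlight_line, and rehighlight are hypothetical, the toy highlighter only knows about // comments, and the sketch assumes that every span in a snippet came from the highlighter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Remove every <span ...> and </span> tag from s, in place.
   Assumption: spans inside a snippet only ever come from the highlighter. */
static void strip_spans(char *s)
{
    char *dst = s;
    while (*s != '\0') {
        if (strncmp(s, "<span", 5) == 0 || strncmp(s, "</span>", 7) == 0) {
            char *end = strchr(s, '>');
            if (end != NULL) {
                s = end + 1; /* skip the whole tag */
                continue;
            }
        }
        *dst++ = *s++;
    }
    *dst = '\0';
}

/* Toy highlighter: wrap a trailing // comment in a span. */
static void highlight_line(const char *src, char *out, size_t cap)
{
    const char *com = strstr(src, "//");
    if (com == NULL)
        snprintf(out, cap, "%s", src);
    else
        snprintf(out, cap, "%.*s<span class=\"com\">%s</span>",
                 (int)(com - src), src, com);
}

/* Normalise first, then highlight, so a second run changes nothing. */
static void rehighlight(char *line, size_t cap)
{
    char plain[256]; /* small stack buffer, no heap allocation */
    strip_spans(line);
    assert(strlen(line) < sizeof plain);
    snprintf(plain, sizeof plain, "%s", line);
    highlight_line(plain, line, cap);
}
```

Because rehighlight always reduces its input to the same plain text before adding markup, running it on already-highlighted output gives that same output back. A newer version of the highlighter can then be run over old output without first recovering the original file.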

Anyway, links to the code are below. As mentioned above, it's not the most well-engineered, and it has some notes on how it could be improved, but it's working for me so far, and it has made the code snippets on this website a little bit nicer to look at.


1. As an aside, I mentioned above that the program does not make any heap allocations. It does, however, make a small stack allocation for a character buffer. At first I attempted to write the syntax highlighter without any memory allocations at all, but using getc and ungetc as a one-character buffer was a bit too limited in terms of lookahead, and probably wasn't particularly efficient either.

The character buffer uses a sliding window with peek and drop functions to retrieve/remove the next X characters from the stream, and is inspired by the byte buffer described in this article.
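A sliding window of this kind could look roughly like the sketch below. Only the peek and drop operations come from the description above; the struct layout, the window_ names, the WINDOW_SIZE value, and the window_starts_with helper are assumptions for illustration, not the tool's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WINDOW_SIZE 64 /* hypothetical capacity; must cover the longest lookahead */

/* A sliding window over an input stream, stored entirely on the stack. */
struct window {
    FILE  *in;
    char   buf[WINDOW_SIZE];
    size_t len; /* number of valid bytes currently in buf */
};

/* Make up to n bytes available without consuming them.
   Returns how many of the n bytes are available (fewer at end of input)
   and points *out at the first of them. */
static size_t window_peek(struct window *w, size_t n, const char **out)
{
    assert(n <= WINDOW_SIZE);
    if (w->len < n)
        w->len += fread(w->buf + w->len, 1, n - w->len, w->in);
    *out = w->buf;
    return w->len < n ? w->len : n;
}

/* Discard the first n buffered bytes and slide the rest to the front. */
static void window_drop(struct window *w, size_t n)
{
    assert(n <= w->len);
    memmove(w->buf, w->buf + n, w->len - n);
    w->len -= n;
}

/* Returns 1 if the upcoming bytes match the literal s, consuming nothing. */
static int window_starts_with(struct window *w, const char *s)
{
    const char *p;
    size_t n = strlen(s);
    return window_peek(w, n, &p) == n && memcmp(p, s, n) == 0;
}
```

With something like this, checking for a multi-character sequence such as "/*" becomes a single window_starts_with call followed by window_drop, which is the kind of lookahead that a getc and ungetc pair cannot portably provide, since only one character of pushback is guaranteed.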