Categories
clojure programming

Why Clojure (Lisp) is good for writing transpilers

OmnICU is a new project to create Internationalization (i18n) functionality in multiple target languages and multiple resource-constrained runtimes. Two different approaches to solve that problem are wrapping a single common binary in multiple target language wrappers, and to write a source-to-source transpiler in a one-to-many fashion. Here are reasons why choosing Clojure (Lisp) would be a good decision for writing a transpiler.

Why write a transpiler?

Well, take a step back to the original problem: how do we write code that works consistently across multiple languages? Your place to get industry-standard basic i18n functionality for all languages is ICU, but we only have it in C++ (ICU4C) and Java (ICU4J). It would be prohibitively difficult & complex to expand to other languages and maintain consistency across them with new features and fixes to old ones. Those who have used wrappers (FFI) for a single C++ binary using the dark arts of SWIG also complain of a high degree of complexity.

More fundamentally, for resource-constrained environments, binary size also matters, sometimes more so than runtime, etc. What should be included in a binary depends on the user’s application, and thus it varies. Meanwhile, language build tools often include the ability for custom optimizations like dead code elimination. A transpiler would provide all the code directly in the target language, thereby enabling per-application dead code elimination, and bypassing the need for FFI concerns (configuration, deployment complexity).

Does a transpiler have to be in Lisp?

No, it doesn’t. Anything written in a Turing-complete language can be written in another. But that doesn’t mean people write code in assembly as their first choice anymore. In other words, there are obvious practical reasons for choosing higher-level abstractions. We’re rehashing the Blub Paradox at this point, and we really shouldn’t.

(Disclaimer: I used to dislike the above essay about the Blub Paradox and disagree that it held any merit. To the point I refused to learn what it advocated. Only on the advice of a respected friend did I venture to learn more, and realized to my chagrin that the essay described my ignorance perfectly.)

Fundamentally, programming languages are tools, and as always, we should “use the best tool for the job”.

Why should a transpiler be in Lisp?

The power of a language is important — we don’t reach for Assembly or Perl anymore — so all things being equal, use the most powerful tool you can, right? When the difficulty and inherent complexity of a problem is formidable, such as a one-to-many transpiler, it seems power would matter all the more.

And in the case of Clojure, a particular dialect of Lisp, simplicity and data-orientation are important, too. The cumulative effect of power, simplicity, and data-orientation is profound — it must be experienced to be appreciated, but others can explain it better than I can.

Lisp has a long history in compiler applications. Part of the reason is the obvious — the S-expressions (code written inside parentheses, with nesting) means that the syntax is minimal, and that the code itself partially mirrors the structure of the AST. Another reason is the “code-is-data-is-code” aspect of Lisp — the use of S-expressions means that Lisp is written in the data structures of the language — which 1) makes it easy to manipulate code, and 2) possible to use macros, the most powerful programming construct.

The ease of manipulating code makes Lisp a natural choice for solving similarly hard problems — term-rewrite systems. It leaves the door open to blur the lines between what we call term-rewriting and a simple Datalog-style declarative pre-processing of code for a transpiler. Operating at the level of code structure is the simple reason why Lisp macros are the most powerful construct, and why AST-based macros or lexical pre-processing macros (C/C++) are not as simple / powerful.

Syntax as simplicity

Using S-expressions as syntax simplifies the semantics of the syntax. There are fewer irregularities to deal with (ex: are functions expressed as f(obj, ...) or obj.f(...)? Is the + sign the unary operator, binary operator, or syntax error?). The scope of an expression is also clear (ex: does an if statement block implicitly include the next line only? does a switch/case statement need break statements for each case?). Delimiting the scope of an operation may also be why WebAssembly text format uses S-expressions for even its low-level assembly-like instructions.

More importantly, the syntax never changes. Every form (set of parantheses) is a call to a function, macro, or special form, written in prefix notation. A transpiler implementation (both in input -> AST parsing and AST -> output emission) needs much less maintenance work over time if the input source code, and its syntax definition, remain constant.

Syntax as flexibility / power

A transpiler is easier to implement if it can minimize the number of programming constructs that it supports in the input source language, ideally to a minimally sufficient set that have analogues in the (all) target language(s). But there is always the possibility that a target language will have unavoidable constructs that must be represented in the input source code (ex: code in Java is organized into classes, but Clojure namespaces are not the same).

In this scenario, macros provide the maximum flexibility to use macros to “create syntax” as necessary to represent new constructs, and to do so in user-land (not alter the language / compiler). The use of a macro is especially important since an implicit yet necessary requirement of a transpiler is that the input source code still executes and behaves as it is written. In other words, it is not enough to use Lisp to represent the logic in a textual form; rather, we need to also make it run and execute Lisp unit tests. This is the only way for user-written input source code to be tested in a single authoritative place before transpiling occurs. So in Clojure, a form called return to represent a return statement in the target language might be implemented to perform a pass-through, like:

(defn return
  [expr]
  expr)

Whereas trickier cases like enums might require resorting to macros to achieve both goals of target language representation and input language execution and behavior.

While the above is true even for a 1-to-1 source-to-source transpiler, the inherent complexity of a 1-to-many source-to-source transpiler in supporting each of the unique target language constructs requires a language with flexibility to change and power to create abstractions as needed. One example is string operations, which can be represented well enough with string buffers of UTF-16-encoded codepoints in C++ and Java, but require a higher level of abstraction to be able to also represent the iteration required for Rust to iterate over UTF-8-encoded codepoints. Such a string iteration construct is similar to other seq comprehension forms in Clojure like for, doseq, dotimes, etc.

Syntax close to the AST?

To be clear, Lisp syntax is not the same as its AST. But it isn’t far off because of the regularity and nesting of its syntax, as mentioned earlier. There is the possibility that transpiling can be thought of not just 2 steps (input -> AST; AST -> output), but rather can be further decomplected into 3 steps (input -> input AST; input AST -> target AST; target AST -> output). Doing so would simplify the emission of target-specific strings of output code. In this line of thought, if the ASTs are represented in S-expressions or equivalent, then the critical transformation of input AST -> output AST might be reduced to mere munging of data structures of code, or better yet, a term-rewrite system described simply as declarative rules.

Why should Clojure be the Lisp dialect used for a transpiler?

Transpiler technical benefits

Clojure is designed around a few principles that are rooted in a clear philosophical understanding of simplicity. As such, Clojure makes immutability a default in the language, and mutation is hard to achieve without planning and extra effort. There is an overlap with Rust, whose syntax encourages immutability, and with it, a different style of coding. Even though Clojure’s conception of immutability (based on values) differs from Rust’s (compiler-enforced memory-based), Clojure’s idiomatic immutable bindings in a let block of lexical scope results in code that, when translated/transpiled, might fit Rust’s coding patterns just as it certainly fits good Scala code style.

Like other Lisps, Clojure allows metadata to be attached to nearly any symbol (“identifier”) in the code. Clojure’s “associative view of data” means that metadata are maps, which allows flexibility and semantics. Clojure also provides multimethods, which allow OOP-style dynamic dispatch in a more flexible way via customizable dispatch functions and customizable type hierarchy graphs, etc. (ex: C++ and Java languages are syntactically similar, so their emitters would share a lot of code in common that can be refactored this way). Clojure also provides type hints as a form of metadata, which makes representing type information convenient for statically typed language targets.

Library support via Java ecosystem

Clojure has a constantly growing population of users that form, by far, the largest segment of modern Lisp users, enough to convince O’Reilly to lift its decades-long ban on new book proposals for Lisp. One key attribute of Clojure’s success is the leverage it gains by being hosted on the JVM, giving its users access to the full Java ecosystem of libraries.

The clojure.tools.analyzer library does an excellent job of providing a data-oriented representation of an AST of Clojure code with static analysis annotations, making it an easy jumping-off point for writing the AST -> target language emitters of a transpiler since the input -> AST is solidly covered already.

Tooling, references, etc.

Clojure, as a language used both on the backend, the frontend (“ClojureScript”), etc. has a userbase supported by multiple editors, build tools, books, etc.

What about language popularity numbers?

The popularity of a language is important to the extent that it contributes to its ecosystem of libraries, tooling, learning materials, etc. But relative usage numbers are not always indicative of the suitability of a language to being the best tool for a job (ex: for low-level programming, Rust is very compelling over C++, despite how the user numbers stand currently).

What are points of inspiration?

Even in the recent past, some of the compiler-related work in Lisp / Clojure show seemingly much harder problems being solved, either elegantly or with sophistication. The results provide inspiration that a one-to-many transpiler is within reason, too.

Common Lisp’s SLIME mode in Emacs integrates the disassemble function into its functionality that allows the user to optimize the compiled byte code’s performance on a scale of 1-10.

A talk on clojure.tools.analyzer (previously mentioned as providing annotated AST) shows off the interactive data-oriented nature of the library. A side benefit of Clojure’s persistent data structures allow each node of the AST to be annotated (from static analysis) with the exact value of the environment at each node, but without paying the memory cost of shared/duplicate information.

Ramsey Nasser’s MAGIC compiler adapts Clojure’s CLR implementation to get Clojure code to run fast enough to work with the Unity game engine on the CLR, despite the challenge of Clojure (dynamically typed language) on the CLR (for which types are needed more and affect performance). Macros are used to infer and insert type information and get rid of boilerplate in the compiler’s emitted output.

How to learn more or get involved?

Visit the Github page for the Kalai transpiler, which is a one-to-many source-to-source transpiler motivated by the OmnICU use case, and/or contact me via email.

View presentation slides to the local Clojure meetup about the transpiler.

Kalai is currently in early stages, and currently targets Rust, C++, and Java.

Update

2023-05-04: The above blog post was a loose collection of thoughts, on purpose. It was written at an inflection point in which a redesign happened, and the above post was written with the hope and belief that the new design will work. Now, it’s 3 years and a pandemic later, and what you read above really did pay off. You can see our Clojure/conj talk to get the detailed story and the backstory (or at least as much as we could fit in 40 minutes):

Leave a comment