How to create your own compiled Programming Language in Java (the hard way)

Last updated: October 23, 2019

You might want to read this if:

  1. You want to create your own Programming Language
  2. You are prepared to invest 1000+ hours into 10000+ lines of code, writing your own compiler.
  3. You care about performance and want a compiled Language
  4. You want to learn something and do most of the work yourself, without tools like YACC, Bison, Flex, LLVM, ...
  5. You want to do it in Java or C++ or C# or C or ...

Why would I want to do the hard work myself? Just use Flex/YACC/Bison/LLVM!

Because hard work always pays off.

Why go compiled? I heard that writing an interpreter is so much easier!

Yeah, and your programs will execute much slower, and no real programmer will use your language to write anything remotely performance-critical. (In the end, every program is performance-critical, because it can take CPU cycles away from other programs that actually are performance-critical.)

What I would/will change about my compiler

I should probably switch to writing it in C or Rust to get the greatest speed gains.

1. Language Design

The most important part: your language should look pretty. It should be easy to write and easy to read.

Java Example:

				class Main{
					public static void main(String[] args) {
						System.out.println("Hello World");
					}
				}
				

Haskell Example:

				main = putStrLn "Hello, World!"
				

Rust Example:

				fn main() {
					println!("Hello World!");
				}
				

C Example:

				#include <stdio.h>
				int main() {
					printf("Hello, World!");
					return 0;
				}
				

Python Example:

				print("Hello, World!")
				

Dragon Example (the language I am developing):

				use Base.dg
				namespace Main{
					public ()~>PInt main{
						printString(string("Hello World!",12));
						println();
						return 0;
					}
				}
				

What do you see? Some languages are more verbose than others: they explicitly state what is going on. In some, the types being used must be declared. In others, type declarations are optional and the types can be inferred by the compiler.

[Optional] Exercise: write down some programs, in the language you want to develop, in a text editor. See how it looks. Then iterate on that, until you have a syntax which you feel good about.

1.1 Language Design: selecting features

You probably want features, or a combination of features, in your language that have not been done before. Otherwise, there would be no sense in investing 1000+ hours in a custom compiler.
Here is a list with common language features and their associated difficulty (I could be slightly off about the difficulty of stuff I have not implemented myself):

There are some books on advanced programming language and compiler features. Programming languages and compilers are both being actively researched, with new results coming out regularly.

1.2 Language Design: creating a grammar

To parse your language and put it into a representation that your compiler can easily work with, you first create a grammar.

The grammars of popular languages may guide you: Java Grammar, Haskell Grammar.

2.1 Writing a Lexer and a Parser

The standard approach for going from source code files to a representation that your compiler can understand, the Abstract Syntax Tree (AST), is to have a lexer convert the plain text of the source code into tokens, and then to have a parser convert these tokens into an AST.
This step can be outsourced to various tools, but you probably want to do it yourself for performance/swag reasons.
Expect to spend at least 10 hours on the lexer and at least 30 hours on the parser.
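
To make the lexer-then-parser pipeline a bit more concrete, here is a minimal hand-written sketch for arithmetic expressions. The grammar in the comment, the class name MiniFrontend and the token format are assumptions made up for this example; this is not Dragon's grammar or the Dragon compiler's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini front end for this illustrative grammar:
//   expr   ::= term (('+' | '-') term)*
//   term   ::= factor (('*' | '/') factor)*
//   factor ::= NUMBER | '(' expr ')'
final class MiniFrontend {

    // Lexer: turns plain text into a flat list of tokens.
    static List<String> lex(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i));
            } else if ("+-*/()".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    // AST node: a number leaf (op == null) or a binary operation.
    static final class Node {
        final String op;
        final int value;
        final Node left, right;
        Node(int value) { this.op = null; this.value = value; this.left = null; this.right = null; }
        Node(String op, Node left, Node right) { this.op = op; this.value = 0; this.left = left; this.right = right; }
        @Override public String toString() {
            return op == null ? Integer.toString(value) : "(" + op + " " + left + " " + right + ")";
        }
    }

    // Recursive-descent parser: one method per grammar rule.
    private final List<String> tokens;
    private int pos = 0;

    private MiniFrontend(List<String> tokens) { this.tokens = tokens; }

    static Node parse(String src) {
        MiniFrontend p = new MiniFrontend(lex(src));
        Node root = p.expr();
        if (p.pos != p.tokens.size()) throw new IllegalArgumentException("unexpected token '" + p.peek() + "'");
        return root;
    }

    // Current token, or "" at end of input.
    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private Node expr() {
        Node left = term();
        while (peek().equals("+") || peek().equals("-")) {
            String op = tokens.get(pos++);
            left = new Node(op, left, term());
        }
        return left;
    }

    private Node term() {
        Node left = factor();
        while (peek().equals("*") || peek().equals("/")) {
            String op = tokens.get(pos++);
            left = new Node(op, left, factor());
        }
        return left;
    }

    private Node factor() {
        String t = peek();
        if (t.equals("(")) {
            pos++;
            Node inner = expr();
            if (!peek().equals(")")) throw new IllegalArgumentException("expected ')'");
            pos++;
            return inner;
        }
        if (t.isEmpty() || !Character.isDigit(t.charAt(0))) throw new IllegalArgumentException("expected number, got '" + t + "'");
        pos++;
        return new Node(Integer.parseInt(t));
    }
}
```

Calling MiniFrontend.parse("1 + 2 * 3") yields a tree in which the multiplication sits below the addition, because term() binds tighter than expr(). A real language needs many more token kinds (identifiers, keywords, string literals) and proper error positions, which is where most of the hours go.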

3.1 Typechecking

You have your AST now. Now is the time to check whether all the rules of your language have been followed in the program: for example, that each subroutine call receives the correct number and types of parameters.
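
The check on subroutine calls could look roughly like the sketch below. Everything in it (the class TypeCheckSketch, the Expr and Signature shapes, the type names "Int" and "Bool") is hypothetical and only meant to show the idea of walking the AST and comparing each call against a declared signature; a real typechecker covers far more rules.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type checker sketch: verifies subroutine calls against declared signatures.
final class TypeCheckSketch {

    // Expression node: a literal with a known type, or a call with argument expressions.
    static final class Expr {
        final String literalType; // e.g. "Int" or "Bool"; null for calls
        final String callee;      // subroutine name; null for literals
        final List<Expr> args;
        private Expr(String literalType, String callee, List<Expr> args) {
            this.literalType = literalType;
            this.callee = callee;
            this.args = args;
        }
        static Expr lit(String type) { return new Expr(type, null, Arrays.<Expr>asList()); }
        static Expr call(String name, Expr... args) { return new Expr(null, name, Arrays.asList(args)); }
    }

    // Declared signature: return type followed by parameter types.
    static final class Signature {
        final String returns;
        final List<String> params;
        Signature(String returns, String... params) {
            this.returns = returns;
            this.params = Arrays.asList(params);
        }
    }

    private final Map<String, Signature> signatures = new HashMap<>();

    void declare(String name, Signature sig) { signatures.put(name, sig); }

    // Returns the type of the expression, or throws if a rule is violated.
    String typeOf(Expr e) {
        if (e.callee == null) return e.literalType;
        Signature sig = signatures.get(e.callee);
        if (sig == null) throw new IllegalStateException("unknown subroutine: " + e.callee);
        if (sig.params.size() != e.args.size()) {
            throw new IllegalStateException(e.callee + " expects " + sig.params.size()
                    + " arguments, got " + e.args.size());
        }
        for (int i = 0; i < e.args.size(); i++) {
            String actual = typeOf(e.args.get(i)); // recursion checks nested calls too
            if (!actual.equals(sig.params.get(i))) {
                throw new IllegalStateException("argument " + i + " of " + e.callee
                        + " should be " + sig.params.get(i) + ", got " + actual);
            }
        }
        return sig.returns;
    }
}
```

Throwing on the first error keeps the sketch short. A friendlier compiler would collect all errors together with their source positions and report them in one go.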

4.1 VM Code Generation

If you have a reasonably big/recursive/expressive grammar, your AST might not be straightforward to compile directly to x86 Intel assembly code. It is therefore sensible to have an intermediate VM language, which can later be translated to assembly. You could also have multiple intermediate languages if your language is truly sophisticated or you want to support multiple target languages. I chose a stack-based VM language for Dragon: programs in it perform computations on the stack, not on registers. Some subroutine-dependent abstract virtual memory segments can also be useful.
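
As a rough sketch of what "computations on the stack" means, the following code walks a tiny expression AST and emits instructions for a stack machine, plus a small reference interpreter to run them. The instruction names (push, add, sub, mul) and the class StackVmSketch are assumptions for illustration only; Dragon's actual VM instruction set is not shown in this article.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stack-VM code generator for a tiny expression AST.
final class StackVmSketch {

    // AST node: a number leaf (op == null) or a binary operation.
    static final class Node {
        final String op;
        final int value;
        final Node left, right;
        private Node(String op, int value, Node left, Node right) {
            this.op = op; this.value = value; this.left = left; this.right = right;
        }
        static Node num(int value) { return new Node(null, value, null, null); }
        static Node bin(String op, Node left, Node right) { return new Node(op, 0, left, right); }
    }

    // Post-order walk: emit code for both operands first, then the operator.
    static void emit(Node n, List<String> out) {
        if (n.op == null) {
            out.add("push " + n.value);
            return;
        }
        emit(n.left, out);
        emit(n.right, out);
        switch (n.op) {
            case "+": out.add("add"); break;
            case "-": out.add("sub"); break;
            case "*": out.add("mul"); break;
            default: throw new IllegalArgumentException("unsupported operator " + n.op);
        }
    }

    // Reference interpreter: operators pop two values and push the result.
    static int run(List<String> code) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String ins : code) {
            if (ins.startsWith("push ")) {
                stack.push(Integer.parseInt(ins.substring(5)));
                continue;
            }
            int right = stack.pop(); // pushed last, so popped first
            int left = stack.pop();
            switch (ins) {
                case "add": stack.push(left + right); break;
                case "sub": stack.push(left - right); break;
                case "mul": stack.push(left * right); break;
                default: throw new IllegalArgumentException("unknown instruction " + ins);
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        Node ast = Node.bin("+", Node.num(1), Node.bin("*", Node.num(2), Node.num(3)));
        List<String> code = new ArrayList<>();
        emit(ast, code);
        System.out.println(code + " evaluates to " + run(code));
    }
}
```

The nice property of a stack machine is that the code generator never has to allocate registers: every operator finds its operands on top of the stack, so code generation is a plain post-order tree walk.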

VM Code Compilation

The final step is to compile the VM code to assembly. For the last step of translating the resulting .asm file to a native executable, I used NASM and ld.
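
A very reduced sketch of this translation step is shown below. It assumes a hypothetical stack-VM instruction set (push, add, sub, mul), 32-bit x86 on Linux, and NASM syntax; the class name VmToAsmSketch and the choice to return the result as the process exit code are inventions for this example, not how the Dragon compiler necessarily does it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical translator from stack-VM instructions to 32-bit x86 NASM assembly (Linux).
final class VmToAsmSketch {

    static List<String> translate(List<String> vmCode) {
        List<String> asm = new ArrayList<>();
        asm.add("section .text");
        asm.add("global _start");
        asm.add("_start:");
        for (String ins : vmCode) {
            if (ins.startsWith("push ")) {
                // The VM stack maps directly onto the hardware stack.
                asm.add("    push " + Integer.parseInt(ins.substring(5)));
            } else if (ins.equals("add") || ins.equals("sub")) {
                asm.add("    pop ebx");               // right operand
                asm.add("    pop eax");               // left operand
                asm.add("    " + ins + " eax, ebx");  // eax = eax (+|-) ebx
                asm.add("    push eax");
            } else if (ins.equals("mul")) {
                asm.add("    pop ebx");
                asm.add("    pop eax");
                asm.add("    imul eax, ebx");         // signed multiply, result in eax
                asm.add("    push eax");
            } else {
                throw new IllegalArgumentException("unknown instruction " + ins);
            }
        }
        // Assumption: exit via the Linux 32-bit sys_exit call, using the result as exit status.
        asm.add("    pop ebx");
        asm.add("    mov eax, 1");
        asm.add("    int 0x80");
        return asm;
    }

    public static void main(String[] args) {
        List<String> vm = new ArrayList<>();
        vm.add("push 1");
        vm.add("push 2");
        vm.add("add");
        for (String line : translate(vm)) System.out.println(line);
    }
}
```

Under these assumptions, the printed text could be saved to a file, assembled with something like nasm -f elf32, and linked with ld -m elf_i386. The exact commands depend on your platform and on whether you target 32-bit or 64-bit code.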

Additional Resources

about which programming languages to learn