Writing your first Elixir code check


Learn how to analyze and validate Elixir code by taking a mind-blowing journey through string scanning, AST traversal and code compilation.

There comes a time when just writing the code is no longer enough and you want to look deeper under the hood: parse, analyze and perhaps even compile it on your own for the sake of enforcing some extra coding rules and conventions. Elixir is a particularly welcoming language when it comes to code analysis options. In this article I’ll explore them in an approachable way.

We’ll start with the simplest approaches and then climb all the way up to the top of Mount Olympus where we’ll dine with the father of all Elixir gods - the one and only Compilus.

Quest

Imagine that you’re working on a big Elixir project and you rely heavily on quick „fuzzy” search by filename to jump between modules. You’ve discovered that some developers sometimes put multiple module definitions in a single file (a defmodule inside a defmodule, or a sequence of defmodule calls). That completely kills your quick jumps between modules and causes confusion across the team. It’s clear to you that one file must define only one module.

Let’s make sure of that with an iron fist!

Setting the stage

Our script will have the following flow:

lang: elixir
source_file_pattern
|> get_files_from_wildcard()
|> get_files_mods()
|> filter_one_file_many_mods()
|> print_errors()
|> emit_exit_status()

All parts except the get_files_mods/1 function are simple I/O and issue filtering logic. We’ll code them once and they won’t change as we proceed. You'll find this boilerplate code in this gist.
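The gist itself is not reproduced here. As a rough idea of what such boilerplate could look like, here is a minimal sketch. The function names follow the pipeline above, but every function body, the module name and the exit_status/1 helper are assumptions made for illustration, so the gist may well do things differently.

```elixir
defmodule OneFileOneMod.Boilerplate do
  # Hypothetical stand-ins for the helpers named in the pipeline;
  # the bodies are assumptions, not the code from the gist.

  # Expands a wildcard such as "lib/**/*.ex" into a list of paths.
  def get_files_from_wildcard(pattern), do: Path.wildcard(pattern)

  # Keeps only the {file, modules} pairs that define more than one module.
  def filter_one_file_many_mods(files_mods) do
    Enum.filter(files_mods, fn {_file, mods} -> length(mods) > 1 end)
  end

  # Prints one line per offending file and passes the issues along.
  def print_errors(issues) do
    Enum.each(issues, fn {file, mods} ->
      names = Enum.map_join(mods, ", ", &inspect/1)
      IO.puts(:stderr, "#{file}: defines multiple modules (#{names})")
    end)

    issues
  end

  # Maps the issue list to an exit status: 0 when clean, 1 otherwise.
  def exit_status([]), do: 0
  def exit_status(_issues), do: 1

  # Terminates the script with the computed exit status.
  def emit_exit_status(issues), do: issues |> exit_status() |> System.halt()
end
```

Keeping exit_status/1 separate from the System.halt/1 call makes the decision logic easy to try out in IEx without killing the session.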

Once you have the script, you can put it into the priv directory and run it with the elixir or mix run command, depending on whether your solution uses Mix.

And once you're done, you can add it to your list of project checks and run it consistently along with the others via the mix check task. In that case, you'd add the following to your project's .check.exs:

lang: elixir
[
  tools: [
    # if it's plain Elixir...
    {:one_file_one_mod, "elixir priv/one_file_one_mod.exs"}

    # ...or if it uses Mix or its compilation artifacts (use this line instead)
    # {:one_file_one_mod, "mix run priv/one_file_one_mod.exs"}
  ]
]

Now the real challenge is finding the right way for fetching the list of module definitions in source files of interest. Let's iterate on that until we get it right.

Level 1: String scanning

When you first look at the source code, you may think that „it’s just a string”. Such a remark wouldn’t even be specific to Elixir (although we all know that Elixir syntax is particularly expressive and beautiful). Indeed, we could just read the source file and inspect the source string with functions from the String or Regex modules or via binary pattern matching.

Let’s extract aliases of modules defined in a specific source file:

lang: elixir
# the `m` (multiline) modifier makes `^` match at the start of every line
@defmodule_regex ~r/^\s*defmodule ([\w.]+)/m

defp get_file_mods(file) do
  @defmodule_regex
  |> Regex.scan(File.read!(file), capture: :all_but_first)
  |> List.flatten()
  |> Enum.map(&Module.concat("Elixir", &1))
end

Simple and quite effective. One thing that works to the advantage of this method is the Elixir formatter, which makes the string representation of your code a little more normalized and predictable.

Still, depending on the problem, sooner or later you may start seeing the limits of looking at your code as if it were just a string. Our goal is to detect all defmodule calls that actually result in defining modules during compilation. The regex above does cover false positives in regular code comments starting with #, but what about multi-line strings, often used when writing @moduledoc? Or defmodule calls inside macros and their quote blocks?
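To make the @moduledoc problem concrete, here is a small sketch. The Greeter and MyApp.Hello names are made up for the example; the pattern is the defmodule regex from above.

```elixir
# A single module whose documentation shows a usage example.
source = ~S'''
defmodule Greeter do
  @moduledoc """
  Usage:

  defmodule MyApp.Hello do
    use Greeter
  end
  """
end
'''

# The defmodule pattern from above; the `m` (multiline) modifier makes `^`
# match at the start of every line.
found =
  ~r/^\s*defmodule ([\w.]+)/m
  |> Regex.scan(source, capture: :all_but_first)
  |> List.flatten()

# The list includes "MyApp.Hello", which only exists inside the doc string.
IO.inspect(found)
```

The file defines one module, yet the scan reports two names, so a check built on it would flag a perfectly valid file.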

Indeed, once you deploy such a check to your team, from then on your every next day won’t be truly complete without someone catching yet another „funny” case of string scanning & matching being unable to really understand the nested, tree-like nature of your code.

Is the string approach useless then? Yes and no. It'll prove too simplistic for most analysis cases but at the same time it may be good enough for the "flat parts" of the source, like the code comments. That's why string scanning is often effectively used by analysis tools to find code comments that control the analysis, e.g. by telling the tool to ignore issues in the next line.
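As an illustration of that last use case, here is a minimal sketch of scanning for a control comment. The marker text one_file_one_mod:ignore-next-line and the ControlComments module are invented for this example and are not a convention of any existing tool.

```elixir
defmodule ControlComments do
  # Returns the 1-based numbers of lines on which issues should be ignored,
  # i.e. every line that directly follows an ignore marker comment.
  # The marker itself is hypothetical.
  def ignored_lines(source) do
    source
    |> String.split("\n")
    |> Enum.with_index(1)
    |> Enum.filter(fn {line, _number} ->
      Regex.match?(~r/^\s*#\s*one_file_one_mod:ignore-next-line\s*$/, line)
    end)
    |> Enum.map(fn {_line, number} -> number + 1 end)
  end
end
```

Because the source is split into lines first, each comment is matched on its own and no understanding of the code structure is needed.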

Level 2: AST traversal

Now that we're aware of the nested nature of code, let's talk about a representation that takes it into account - the AST (Abstract Syntax Tree). Elixir makes a first-class citizen out of its AST. Just wrap any Elixir code in a quote/2 block in your IEx session and its AST equivalent will be handed your way. You can also get it for any code string via Code.string_to_quoted/2. And you can traverse it with Macro.prewalk/3 or Macro.postwalk/3. All within the standard library!
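To get a feel for the shape we'll be matching on, here is a quick sketch that parses a one-module snippet (Foo.Bar is just a placeholder name) and destructures the resulting tuple:

```elixir
ast = Code.string_to_quoted!("defmodule Foo.Bar do\nend")

# A defmodule call is a three-element tuple: the call name, metadata and
# the arguments (the module alias and a keyword list holding the body).
{:defmodule, _meta, [{:__aliases__, _alias_meta, alias_parts}, [do: _body]]} = ast

# The alias is stored as a list of atoms, one per segment...
IO.inspect(alias_parts)

# ...which Module.concat/1 turns back into a module atom.
IO.inspect(Module.concat(alias_parts))
```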

With this, we should be able to solve the dilemmas from the previous sections:

lang: elixir
defp get_file_mods(file) do
  file
  |> File.read!()
  |> Code.string_to_quoted!()
  |> Macro.prewalk([], fn
    {:defmodule, _, [{:__aliases__, _, mod_alias}, _]} = ast, acc -> {ast, [mod_alias | acc]}
    ast, acc -> {ast, acc}
  end)
  |> elem(1)
  |> Enum.map(&Module.concat/1)
end

The most amazing thing you may notice about this solution is that it's not much harder than a string scan. It actually looks simpler and more logical, especially for those nested cases that would surely make your head explode in regex form (look at the regex definition of Elixir syntax).

This time we surely got it right. Right?

As opposed to string scanning, which proves limiting very fast, the AST approach is a good enough solution for many code checking scenarios. It gets the meaning of the code quite right, it can be run on a single source file regardless of its compile-time dependencies and it’s fast. This means you’ll get quality feedback from your check as soon as possible - even while coding.

But it also comes with limitations. The Elixir compiler, with its support for macros and metaprogramming, adds dynamism on top of the static AST. This means that we can’t be really sure that no extra defmodule calls will land in the final file, e.g. via a compile-time macro invocation. We’d have to go beyond AST analysis to fix that. And soon we will.
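Here is a small sketch of that blind spot. The Generator, Host and Sneaky names are made up; the macro is compiled from a separate string to stand in for code living in another file or a dependency. The AST walk should see one module definition in the host source, while compiling the same source defines two.

```elixir
# A macro that expands into a defmodule call (stand-in for a dependency).
Code.compile_string("""
defmodule Generator do
  defmacro define_extra(name) do
    quote do
      defmodule unquote(name) do
      end
    end
  end
end
""")

# The file under inspection: one visible defmodule plus a macro call.
host_source = """
defmodule Host do
  require Generator
  Generator.define_extra(Sneaky)
end
"""

# Static view: collect defmodule nodes from the AST, as in the solution above.
{_ast, ast_mods} =
  host_source
  |> Code.string_to_quoted!()
  |> Macro.prewalk([], fn
    {:defmodule, _, [{:__aliases__, _, mod_alias}, _]} = ast, acc -> {ast, [mod_alias | acc]}
    ast, acc -> {ast, acc}
  end)

# Compiler's view: modules actually defined once macros are expanded.
compiled_mods = host_source |> Code.compile_string() |> Enum.map(&elem(&1, 0))

IO.inspect(length(ast_mods), label: "modules seen in the AST")
IO.inspect(length(compiled_mods), label: "modules defined by the compiler")
```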

Don’t jump the static analysis ship just yet though - sometimes catching 95% of the issues is good enough and, as mentioned above, you get a fair share of benefits by intentionally missing out on that extra 5%. It really comes down to the problem that you’re trying to solve.

Sidequest: Custom Credo check

By mastering the string and AST mojo, you have unlocked a path towards writing custom checks for Credo, a popular code linter for Elixir. Indeed, Credo traverses the AST in most of its checks and scans the source string for code comments that control its behavior.

As for our exercise, Credo would take care of fetching source files, presenting errors and emitting the exit status - which is half of our boilerplate work from the Setting the stage section. It'd also provide some handy helpers for parsing the AST. And you’ll probably find a plugin that integrates Credo with your editor of choice (like the one for VSCode). On the other hand, you’re giving up the flexibility of a custom script by having to accept and live with the Credo flow and the boundaries of the Credo.Check API (or go one step further with a Credo plugin).

It’s really up to you whether you prefer making a separate tool just for your check (or a set of checks) or you want your check to be part of Credo. Either way, with the mix check task available, you don’t have to worry about teaching developers and your CI to consistently run custom checks, even if they come as separate Elixir/Mix scripts instead of a Credo check.

I won’t provide the code for the Credo equivalent of our AST solution - after all it’s a sidequest. But it should be fairly easy to write one once you take a look at:

Being a static analysis tool, even Credo won’t dare to go as far as we will in the next section...

Level 3: Code compilation

As established in the AST section, sometimes you can’t be really sure about some aspects of the code without compiling it. Reaching for the compiler itself may seem like a job for a real pro, but it may not be that hard, depending on how you approach it.

You can start by compiling a single file via Code.compile_file/1, which luckily for us returns a list of {module, bytecode} tuples for the modules defined in the file.

lang: elixir
defp get_file_mods(file) do
  file
  |> Code.compile_file()
  |> Enum.map(&elem(&1, 0))
end

Hooray - a list of defined modules approved by the compiler itself, fetched even more easily than in the string or AST approaches. After a quick celebration and a session of self-admiration for your boundless pragmatism, you feel like you’re ready to release v1.0 of your check.
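To try this outside of a project, here is a small sketch that writes a throwaway file into the system temp directory and compiles it. The file name and the DemoOuter modules are placeholders.

```elixir
# Write a throwaway source file containing a nested module definition.
path =
  Path.join(
    System.tmp_dir!(),
    "one_file_one_mod_demo_#{System.unique_integer([:positive])}.ex"
  )

File.write!(path, """
defmodule DemoOuter do
  defmodule Inner do
  end
end
""")

# Code.compile_file/1 returns {module, bytecode} tuples; keep the modules.
mods = path |> Code.compile_file() |> Enum.map(&elem(&1, 0))
File.rm!(path)

# Both the outer and the nested module are reported.
IO.inspect(Enum.sort(mods))
```

Note that the nested module comes back under its full name, which the AST walk from the previous level did not reconstruct.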

Not so fast. You’ll quickly realize that the Elixir compiler usually operates on Mix projects instead of single files. Those files may have compile-time dependencies that Code.compile_file/1 won’t be able to resolve, precisely because it doesn’t understand a Mix project. Also, source compilation isn’t a fast operation and you’d have to recompile all of the source already compiled via mix compile just to extract those damn defined modules.

When you browse through the standard library or just read the output of mix help, you may discover a little treasure among the built-in Mix tasks - mix xref. This task lets you track calls between modules in a compiled project, which is a much more ambitious job than what we’re doing in this exercise. But in order to do that, mix xref must also establish which modules are defined in which files. How does it do that?

As it turns out, mix xref operates on a file called „compiler manifest”. This is an extra artifact produced by mix compile and it stores all the information mentioned above, ready for reading and using for our purposes. Let's use it then:

lang: elixir
defp get_files_mods(files) do
  Mix.Project.manifest_path()
  |> Path.join("compile.elixir")
  |> Mix.Compilers.Elixir.read_manifest(Mix.Project.compile_path())
  |> Enum.reduce(%{}, fn
    {:module, mod, _, [file], _, _, _}, acc -> Map.update(acc, file, [mod], &[mod | &1])
    _, acc -> acc
  end)
  |> Enum.filter(fn {file, _} -> Enum.member?(files, file) end)
end

Now we don’t have to recompile the code, dependencies are resolved because we rely on Mix instead of the raw Code module, and we’re getting a whole lot more information that may be used for building checks and analysis tools that go well beyond this exercise.
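The grouping step can be tried in isolation. The sketch below runs the same reduce over a hand-written list that only mimics the tuple shape matched above; the module names, paths and nil fillers are invented, and real manifest entries are private to Mix and may look different in your Elixir version.

```elixir
# Hand-written entries mimicking the seven-element :module tuples matched
# in the reduce above. These are NOT real manifest contents.
entries = [
  {:module, MyApp.User, :struct, ["lib/my_app/user.ex"], nil, nil, nil},
  {:module, MyApp.User.Query, :module, ["lib/my_app/user.ex"], nil, nil, nil},
  {:module, MyApp.Repo, :module, ["lib/my_app/repo.ex"], nil, nil, nil},
  {:source, "lib/my_app/repo.ex"}
]

# Group modules by the file that defines them, skipping other entry types.
files_mods =
  Enum.reduce(entries, %{}, fn
    {:module, mod, _, [file], _, _, _}, acc -> Map.update(acc, file, [mod], &[mod | &1])
    _, acc -> acc
  end)

# Files with more than one module are the ones our check should report.
offenders = Enum.filter(files_mods, fn {_file, mods} -> length(mods) > 1 end)
IO.inspect(offenders)
```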

The only issue and limitation here is... the manifest file. We should consider it a private part of Mix and as such its content is dictated by the needs of built-in tasks such as mix xref - it may change, it may break and it surely won't be extended upon our request just because our check needs more than what it provides. We'll try to solve that in the next sidequest.

Regardless, the compilation manifest is a very powerful and cheap source of information about the Mix project and it may be a particularly welcoming place to start working with the compiler.

Sidequest: Custom Mix compiler

What if you still need more than what the Mix compilation manifest provides? For example, you could use the exact line and column numbers of references to modules in order to build a more precise tool that pinpoints and outputs the exact locations of rule breaches in the source.

Well, you can always write your own Mix compiler! This is what the boundary library does in order to enforce extra rules on cross-module calls. With this approach, you’ll execute your check simply by running the mix compile --warnings-as-errors task. Just as with Credo, you’ll also find editor plugins (e.g. those based on ElixirLS) that integrate these errors right into the editor UI. Neat and powerful, huh?

Take boundary as an example and, as a final exercise, see for yourself that it’s within the reach of mere humans to write a check that will become an actual part of project compilation and that will get to work with the most reliable representation of the Elixir code.

Summary

Now you’re familiar with the rich set of options that Elixir provides for code analysis. You have learned the trade-offs that come with the string, AST and compiler approaches, so you're ready to make informed choices when writing different kinds of checks.

The doors to the incredible world of code checking and analysis are now wide open in front of you. What will you do with this power?