Elixir vs Ruby: File I/O performance (updated)
Is Elixir an equal replacement for Ruby when it comes to processing files line by line in command line scripts or background jobs?
Recently, I stumbled upon this thread on the Elixir Forum and it got me interested. It turned out that reading and processing a big CSV file in Elixir was as much as 4 times slower than in Ruby, which itself is well known not to be a speed racer. I've done some research, developed the attached code further by applying the suggested optimizations, and here are the results.
Update: I've received some excellent feedback on the Elixir Forum from @sasajuric, the author of the Elixir in Action book and The Erlangelist blog. As a result, I've rewritten some parts of this article to better present the newest findings and to draw more truthful conclusions from them. It's also been restructured to better present the whole thinking process.
But first of all, I'd like to explain why I've put my time into researching this, even though I'm not implementing any file parser myself at the moment. There's a bunch of reasons:
- It's nice to touch and understand all the basic modules of the language's standard library. And in this case, quite a few are involved - namely File, IO and Stream. Ultimately, these are the basic building blocks of all sophisticated applications that we end up with.
- Understanding the limitations of those basic building blocks may come in useful when the performance of the whole system is bad and you have to pin down the bottleneck.
- It may also be crucial knowledge when choosing a framework for a project and/or the most fitting tool for a job, such as writing even the tiniest microservice.
- If you ever apply for an Elixir job, you may end up writing some basic tool, as interviewers often go after your basic language knowledge and basic problem solving skills (I can assure you that's the case in the company I'm working at right now).
- Lastly, I just love to write a shell script from time to time and I'd like to know what to expect from Elixir, especially since writing command-line scripts is so easy with escript.
In general, when you work on kickass web applications, it's quite common that you'll need to get back to basics from time to time. Whether the subject is a simple command-line tool or a vital part of the app, it's nice to be ready to jump in and do it right.
I've decided to extend the original script, which was just splitting lines and doing nothing with them, into something more realistic. So, my script does the following:
- Loads the input CSV, line by line.
- Parses the first column, which is of the format "Some text N".
- Keeps only those lines where N is divisible by 2 or 5.
- Saves the filtered, but otherwise unchanged, lines into another CSV.
So it's a typical "Hey, I've got this big CSV, please do your magic and fetch me only the rows with this and that, as I really need it for the investors by 5 pm" task. Simple enough, right? Let's assume nothing about the input file size, so the script should be able to handle even the largest files. It can, however, include an option to read the whole file at once to process smaller ones faster.
You can find and play with the implementation, including both Elixir and Ruby source, in this repo.
I took ideas for the optimizations mostly from the aforementioned thread (with a special mention of @sasajuric, who helped the most), the source code in this repo and José Valim's posts here and there. I've generated a sample file with 500K rows and used it to get the times presented below. Let's go through the process.
1. Basic streaming
I started with the following basic streaming implementation:
```elixir
def main([ filename, "stream" ]) do
  File.stream!(filename)
  |> Stream.filter(&filter_line/1)
  |> Stream.into(File.stream!(filename <> ".out"))
  |> Stream.run
end

defp filter_line(line) do
  [ first_col | _ ] = String.split(line, ",")
  num = Regex.run(~r/\d+$/, first_col) |> hd |> String.to_integer
  rem(num, 2) == 0 || rem(num, 5) == 0
end
```
The initial Elixir version needed about 18s to parse my sample, while the Ruby equivalent could do it in less than 3s. The difference was dramatic. At this stage I tried the following optimizations:
File.stream! instead of IO.stream
This one cut the execution time from 18s down to 12s. As explained by José Valim here, it's about avoiding an extra process. Such a process is usually performant, but here we're iterating millions of times, so the gain is big.
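To make the difference concrete, here's a minimal sketch of the two ways of obtaining a line stream (the file name and contents are made up for illustration):

```elixir
# Create a small sample file so the sketch is self-contained.
File.write!("input.csv", "Some text 2,a\nSome text 3,b\n")

# Variant A: File.open + IO.stream. Each line is served by a separate
# IO server process, costing a message round-trip per line.
{:ok, device} = File.open("input.csv", [:read])
via_io_server = IO.stream(device, :line) |> Enum.to_list()

# Variant B: File.stream!. The file is opened in raw mode and read
# directly in the calling process, with no per-line messaging.
direct = File.stream!("input.csv") |> Enum.to_list()
```

Both yield the same lines; only the process doing the reading differs.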
IO.binstream instead of IO.stream
This means streaming the file as binary instead of UTF-8. Unfortunately, this approach kills the previous optimization and doesn't offer nearly as much improvement on its own.
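A sketch of what the binstream variant looks like (again with a made-up sample file); note that it goes back through File.open, which reintroduces the IO server process:

```elixir
# Create a small sample file so the sketch is self-contained.
File.write!("input.csv", "Some text 2,a\nSome text 3,b\n")

# IO.binstream serves each line as a raw binary, skipping the UTF-8
# handling of IO.stream, but still going through the IO server process.
{:ok, device} = File.open("input.csv", [:read])

lines =
  device
  |> IO.binstream(:line)
  |> Enum.to_list()
```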
:re and :binary instead of Regex and String
Similarly, we could replace the costly UTF-8-aware calls offered by the Elixir standard library with their more basic Erlang equivalents. I tried that in the filter_line function, but it didn't change the execution time by more than 1s. That's something, but with the whole script running for 12s, it became obvious to me that the file I/O flow is the slowest part here. Which pushed me to the docs and to the next idea.
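A sketch of what that swap might look like inside filter_line (the module name is made up; the logic mirrors the original predicate):

```elixir
defmodule ErlangFilter do
  # :binary.split and :re.run are the raw Erlang counterparts of
  # String.split and Regex.run; they skip UTF-8 handling entirely.
  def filter_line(line) do
    [first_col | _] = :binary.split(line, ",")
    {:match, [digits]} = :re.run(first_col, "[0-9]+$", [{:capture, :first, :binary}])
    num = :erlang.binary_to_integer(digits)
    rem(num, 2) == 0 or rem(num, 5) == 0
  end
end
```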
:read_ahead option

This option is responsible for buffered stream reading, so it seemed like a good next stop for our streaming solution. But it wasn't. It turned out that buffered reading is enabled by default (which is understandable) and changing the size of the buffer doesn't help much.
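For reference, the buffer size can be set explicitly like this (the file and the ~1 MB value are arbitrary); in my runs it made little difference:

```elixir
# Create a small sample file so the sketch is self-contained.
File.write!("input.csv", "Some text 2,a\nSome text 3,b\n")

# Ask for a ~1 MB read-ahead buffer instead of the default one.
count =
  File.stream!("input.csv", read_ahead: 1_000_000)
  |> Enum.count()
```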
Compiling and consolidating the script
By running MIX_ENV=prod mix escript.build, I created a precompiled binary perfect for command-line use on any Unix system with Erlang installed. It's also consolidated so that calls to protocol functions run more efficiently. All of that reduced the initial script startup time a little, but didn't change much about the above streaming code being really, really slow.
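For reference, the escript setup boils down to one entry in mix.exs (the app and module names here are made up):

```elixir
# mix.exs (sketch): point escript at the module exporting main/1,
# then build the binary with: MIX_ENV=prod mix escript.build
defmodule CsvFilter.MixProject do
  use Mix.Project

  def project do
    [
      app: :csv_filter,
      version: "0.1.0",
      escript: [main_module: CsvFilter]
    ]
  end
end
```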
2. Concurrent approach
I thought I could save the day by focusing on concurrency. I tried doing so by:
- reading the file concurrently (using :file.pread and manually joining broken lines)
- filtering the lines concurrently
Unfortunately, I failed to implement a concurrent solution that would be faster than what I already had. I believe the reason is that the streaming solution is already limited by message passing; that's why the best optimization so far was to reduce messaging by eliminating an extra process. By going for concurrency I only increased the already large number of messages passed, so it didn't help at all. Furthermore, I had to do extra work to join chunks back into lines.
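For illustration, one of the concurrent variants looked roughly like this (a hypothetical reconstruction with a simplified stand-in predicate, not my exact code):

```elixir
# Create a small sample file so the sketch is self-contained.
File.write!("input.csv", "Some text 2,a\nSome text 3,b\nSome text 5,c\n")

# Stand-in predicate: keep lines whose first column ends in an even
# digit or 5 (a simplified version of the real filter).
keep? = fn line ->
  [first_col | _] = String.split(line, ",")
  String.last(first_col) in ["0", "2", "4", "5", "6", "8"]
end

# Filter chunks of lines in parallel; in my experiments the extra
# messages per chunk cost more than the parallelism gained.
File.stream!("input.csv")
|> Stream.chunk_every(2)
|> Task.async_stream(fn lines -> Enum.filter(lines, keep?) end)
|> Stream.flat_map(fn {:ok, lines} -> lines end)
|> Stream.into(File.stream!("input.csv.out"))
|> Stream.run()
```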
3. Read the whole file
At this point, I had tried all the ideas and advice applicable to the streaming solution. I decided to focus on a simplified approach, applicable only to files that fit into memory as a whole. This is what I came up with:
```elixir
def main([ filename, "read" ]) do
  File.write filename <> ".out",
    File.read!(filename)
    |> String.splitter("\n", trim: true)
    |> Enum.filter(&filter_line/1)
    |> Enum.join("\n")
end
```
With this approach, I went through the following optimizations:
File.read! and File.write instead of streaming
This cut the execution time from 12s down to 3s. Switching to a one-shot read & write puts this script pretty close to the Ruby one-shot code, which does the same in about 2s. Damn!
String.splitter instead of String.split
The splitter allows lazily fetching each line from a file loaded into memory. This seems to be a bit faster than either String.split or, the slowest, Regex.split, both of which do it all at once. This is interesting, because previously I assumed that using the Stream module, and therefore passing all those lines through a tight loop, was the reason for the streaming solution being slow. But if so, why is the splitter so fast?
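The laziness is easy to see in isolation; the splitter only does as much work as the consumer demands (the sample text is made up):

```elixir
text = "line 1\nline 2\nline 3\nline 4"

# String.splitter returns a lazy enumerable; nothing is split until
# it's consumed, and only as far as the consumer asks.
["line 1", "line 2"] = text |> String.splitter("\n") |> Enum.take(2)

# String.split performs the whole split eagerly, in one pass.
["line 1", "line 2", "line 3", "line 4"] = String.split(text, "\n")
```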
4. Understand your streams & embrace your patterns
So, as long as you can fit your file into memory, you'll be fine. Otherwise, keep away from Elixir when it comes to large text file processing. This is how I concluded the original article. But then, @sasajuric pointed out to me that the following implementation takes it all to the next level:
```elixir
def main([ filename, "stream" ]) do
  File.stream!(filename, read_ahead: 10_000_000)
  |> Stream.filter(&filter_line/1)
  |> Stream.into(File.stream!(filename <> ".out", [:delayed_write]))
  |> Stream.run
end

defp filter_line(<<c::utf8, ?,::utf8, _::binary>>) when c in [?0, ?2, ?4, ?5, ?6, ?8], do: true
defp filter_line(<<_::utf8, ?,::utf8, _::binary>>), do: false
defp filter_line(<<_other::utf8, rest::binary>>), do: filter_line(rest)
```
Let's go through what happened here.
:read_ahead option together with :delayed_write
I had already used the :read_ahead option in section 1 above, but it wasn't effective until it was put together with :delayed_write. Together, these options cut the script time in half, from 12s to about 6s. Why? Even though reading was buffered, writes were still flushed line by line. And delayed writes are not enabled by default, as they need additional attention from the developer.
Conclusion? In Elixir, it's crucial to pay attention to all parts of the stream. There must be a balance between them for optimal stream execution, as the slowest part will limit the performance of the whole. I was wrong to assume that it was the tight streaming loop or the process communication (which was already eliminated) that put the biggest overhead here.
Using binary pattern matching over Regex and String
Having gotten rid of the biggest bottleneck in the stream, it made sense to get back to the filter_line function. And what could be better for gaining an edge over Ruby than this unique feature of Elixir: pattern matching on binaries? The new recursive approach made the Elixir streaming code run as fast as Ruby, cutting the time from 6s to less than 3s. But that's not all: it made the one-shot implementation run in 0.8s, compared to 1.7s in Ruby.
While Ruby may not be the fastest kid in town for shell scripting, its performance is good enough for casual tool writing. In the case of Elixir, you can get similar performance if you put streams to proper use (as shown above) or if you go for a read-all-at-once approach. You can also gain a serious performance edge over Ruby if you make use of pattern matching and recursion.
Therefore, it makes the most sense to write such scripts from scratch, with a precise idea of how to put unique Elixir features to use. I can see some serious use cases here that could benefit from OTP, pattern matching and streaming, like supervised CSV import/export workers, Unix daemons or command-line tools. Doing a blind conversion, like I did in this experiment, makes little sense and doesn't yield a fair comparison.
Please remember one thing. Erlang and OTP are great when it comes to concurrency, data safety, process stability, supervision and more. But that comes at the cost of running the amazing OS-like layer that the Erlang VM surely is. So when it comes to processing big files, depending on the use case, you may find better ways (like writing interoperable native code).
Note: I'm not an ultimate expert on how the Erlang VM works. These are just logical observations I've made, based on as much information as I could find on my own and on the web. So please comment and correct me if I'm wrong or missing something important here.