Mastering Erlang OTP: Best Practices for Building Resilient Systems

Dive into essential best practices for Erlang OTP, covering robust supervision trees, effective message passing, error handling, secure distribution, and application structuring to build truly fault-tolerant and scalable systems.

Erlang OTP: Distributed & Fault-Tolerant Systems Programming · Feb 12, 2026 · 7 min read · 1,357 words

Welcome back, CoddyKit learners! In our previous post, we embarked on an exciting journey into the world of Erlang OTP, understanding its fundamental concepts and why it's a powerhouse for building distributed, fault-tolerant systems. We covered the basics – processes, message passing, and the 'let it crash' philosophy.

Now that you've got a taste of OTP's potential, it's time to elevate your game. Building robust systems isn't just about knowing the tools; it's about using them wisely. In this second installment of our Erlang OTP series, we'll dive deep into the best practices and essential tips that will transform your Erlang code from functional to truly resilient, scalable, and maintainable. Mastering these practices is key to harnessing the full power of OTP.

Embracing the OTP Philosophy: Beyond the Basics

Erlang OTP isn't just a library; it's a way of thinking about system design. Best practices often stem directly from its core philosophies:

Let It Crash, But Supervise Wisely: Don't try to catch every error. Instead, design your system so that when a component crashes, a supervisor process can restart it to a known good state.
Small, Isolated Processes: Each process should ideally have a single responsibility, making it easier to reason about, test, and replace.
Asynchronous Message Passing: Communication should primarily be asynchronous, which reduces coupling between processes and lowers the risk of deadlocks.

1. Designing Robust Supervision Trees

The supervision tree is the backbone of any fault-tolerant Erlang OTP application. A well-designed tree ensures your system can recover gracefully from failures.

a. Small, Focused Processes with `gen_server`

Think of gen_server as your go-to abstraction for stateful processes. Each gen_server should manage a specific piece of state or a particular service. Avoid monolithic gen_servers that try to do too much.

-module(my_counter).
-behaviour(gen_server).

-export([start_link/0, increment/0, get_count/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

increment() ->
    gen_server:cast(?MODULE, increment).

get_count() ->
    gen_server:call(?MODULE, get_count).

init([]) ->
    {ok, 0}. % Initial state: counter starts at 0

handle_call(get_count, _From, Count) ->
    {reply, Count, Count};
handle_call(_Request, _From, Count) ->
    {reply, unknown_request, Count}.

handle_cast(increment, Count) ->
    NewCount = Count + 1,
    {noreply, NewCount};
handle_cast(_Msg, Count) ->
    {noreply, Count}.

handle_info(_Info, Count) ->
    {noreply, Count}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

Tip: Keep the state in your gen_server simple and easy to serialize/deserialize. Complex state often indicates a need to break down the process further.

b. Strategic Supervision Strategies

When defining your supervisor, choose the right restart strategy:

one_for_one: If a child process terminates, only that child is restarted. Ideal for independent processes.
one_for_all: If a child process terminates, all other sibling processes are terminated and then all children are restarted. Use this when processes are tightly coupled and a failure in one implies a failure in the group.
rest_for_one: If a child process terminates, the terminating child and all children started after it are terminated and then restarted. Useful for pipelines or sequential dependencies.

Most applications start with one_for_one, but understanding the others is crucial for designing resilient groups of processes.
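As a minimal sketch, a `one_for_one` supervisor for the `my_counter` server from the previous section might look like the following. The module name `my_app_sup` and the restart intensity values are assumptions chosen for illustration, not recommendations.

```erlang
-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Give up (and terminate this supervisor) after more than
    %% 5 restarts within 10 seconds. Values are illustrative.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Counter = #{id => my_counter,
                start => {my_counter, start_link, []},
                restart => permanent,
                shutdown => 5000,
                type => worker,
                modules => [my_counter]},
    {ok, {SupFlags, [Counter]}}.
```

Switching to `one_for_all` or `rest_for_one` only requires changing the `strategy` value; for `rest_for_one`, the order of the child list matters because it defines which children count as "started after" the one that failed.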

c. Linking vs. Monitoring

Linking creates a bidirectional connection: if one linked process terminates, the other receives an exit signal, and unless it traps exits, an abnormal exit reason terminates it as well. Supervisors trap exits and use links to detect child process termination.

Monitoring creates a unidirectional connection: if the monitored process terminates, the monitor receives a {'DOWN', MonitorRef, process, Pid, Reason} message. The monitored process is unaware it's being monitored. Use monitoring when you want to observe a process without affecting its lifecycle, or when a supervisor isn't involved.

% Example: block until a monitored process (local or remote) goes down
monitor_remote_process(Pid) ->
    MonitorRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MonitorRef, process, Pid, Reason} ->
            io:format("Process ~p went down with reason ~p~n", [Pid, Reason])
    end.
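To make the difference concrete, here is a small sketch that observes the same crash once through a monitor and once through a link. The module and function names are hypothetical.

```erlang
-module(link_vs_monitor).
-export([watch_with_monitor/0, watch_with_link/0]).

%% Monitor: the watcher is notified but is never affected by the crash.
watch_with_monitor() ->
    {Pid, Ref} = spawn_monitor(fun() -> exit(boom) end),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> {down, Reason}
    after 1000 -> timeout
    end.

%% Link: the exit signal would terminate the watcher too, unless it
%% traps exits and so receives the signal as an ordinary message.
watch_with_link() ->
    OldFlag = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(boom) end),
    Result = receive
                 {'EXIT', Pid, Reason} -> {exit, Reason}
             after 1000 -> timeout
             end,
    process_flag(trap_exit, OldFlag),
    Result.
```

Calling `watch_with_monitor()` returns `{down, boom}` and `watch_with_link()` returns `{exit, boom}`. Without the `trap_exit` flag, the second function would take its caller down with it.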

2. Mastering Message Passing

Erlang's message passing is its superpower, but it needs to be wielded carefully.

a. Prioritize Asynchronous Communication (`gen_server:cast`)

Prefer `gen_server:cast` or a direct `Pid ! Message` when the sender does not need a reply. This avoids blocking the sender and improves concurrency. Keep in mind that a cast gives no confirmation of delivery and no backpressure, so a fast sender can flood a slow receiver's mailbox. Reserve `gen_server:call` for situations where the caller needs a response, such as querying state.
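When a synchronous call is needed, it helps to bound how long the caller waits. The sketch below wraps a call to the `my_counter` server from earlier with an explicit timeout; the module name `counter_client` and the returned error atoms are assumptions.

```erlang
-module(counter_client).
-export([get_count_safe/1]).

%% Ask the registered my_counter server for its count, waiting at most
%% TimeoutMs milliseconds instead of the default 5000.
get_count_safe(TimeoutMs) ->
    try gen_server:call(my_counter, get_count, TimeoutMs) of
        Count -> {ok, Count}
    catch
        %% gen_server:call/3 exits (it does not return an error tuple)
        %% when the server is slow or not running.
        exit:{timeout, _} -> {error, timeout};
        exit:{noproc, _} -> {error, not_running}
    end.
```

Whether to catch the exit at all is a design choice: in many cases letting the caller crash on a timeout and be restarted by its supervisor is the simpler option. Depending on the OTP release, a late reply may also still arrive in the caller's mailbox after a caught timeout.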

b. Handle All Message Types (Even Unexpected Ones)

Your `gen_server` callbacks (`handle_call`, `handle_cast`, `handle_info`) should have a catch-all clause to log or gracefully discard unexpected messages. This prevents processes from crashing due to unhandled patterns.

handle_cast(UnexpectedMsg, State) ->
    lager:warning("Received unexpected cast message: ~p", [UnexpectedMsg]),
    {noreply, State}.

handle_info(UnexpectedInfo, State) ->
    lager:warning("Received unexpected info message: ~p", [UnexpectedInfo]),
    {noreply, State}.

(Note: lager is a popular Erlang logging library. You might use io:format for simpler cases.)

c. Avoid Large Messages

Sending large messages (especially between nodes) can be inefficient and consume significant memory. If you need to transfer large data, consider alternative approaches like sending a reference (e.g., a file path, a database key) and having the receiver fetch the data directly.
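The "send a reference, not the payload" idea can be sketched on a single node with an ETS table standing in for the shared store. Everything here (the module name, the table, the message shapes) is hypothetical; ETS is node-local, so between nodes the key would instead point into a database or shared file system.

```erlang
-module(blob_handoff).
-export([demo/0]).

demo() ->
    %% A public ETS table stands in for shared storage (assumption).
    Table = ets:new(blob_store, [set, public]),
    Key = make_ref(),
    true = ets:insert(Table, {Key, lists:seq(1, 100000)}),
    Parent = self(),
    Worker = spawn(fun() ->
        receive
            %% Only the small {Table, Key} handle arrives in the message;
            %% the worker fetches the data itself when it is ready.
            {process_data, T, K} ->
                [{K, Data}] = ets:lookup(T, K),
                Parent ! {done, self(), length(Data)}
        end
    end),
    Worker ! {process_data, Table, Key},
    Result = receive
                 {done, Worker, Len} -> Len
             after 5000 -> timeout
             end,
    ets:delete(Table),
    Result.
```

The message that crosses the mailbox stays a few words in size regardless of how large the stored term is, and the receiver decides when (or whether) to load it.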

3. Robust Error Handling and Observability

While "let it crash" is fundamental, it doesn't mean ignoring errors. It means designing systems that recover from them.

a. Implement `try...catch` for Expected Failures

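Use `try...catch` for exceptions you can anticipate and handle within a function, like parsing invalid input. Many library functions, including file I/O, report expected failures as `{error, Reason}` tuples instead of raising, so match on those with `case`. For example:

read_config(File) ->
    case file:read_file(File) of
        {ok, Bin} ->
            %% binary_to_term/1 raises badarg on malformed input
            try
                {ok, binary_to_term(Bin)}
            catch
                error:badarg ->
                    {error, invalid_config_format}
            end;
        {error, Reason} ->
            lager:error("Failed to read config file ~s: ~p", [File, Reason]),
            {error, file_read_failure}
    end.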

Unanticipated errors (bugs, resource exhaustion) should be allowed to crash the process and be handled by the supervisor.

b. Comprehensive Logging

Integrate a robust logging framework (like lager or Erlang's built-in logger) from the start. Log important events, state changes, and especially errors. Good logs are invaluable for debugging and understanding system behavior in production.
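For the built-in `logger`, a minimal sketch using the macros from `kernel/include/logger.hrl` could look like this. The `charge/2` function and its arguments are hypothetical and exist only to give the log calls some context.

```erlang
-module(log_demo).
-export([charge/2]).

-include_lib("kernel/include/logger.hrl").

%% charge/2 is a made-up function, used only to show logging.
charge(UserId, Amount) when is_integer(Amount), Amount > 0 ->
    ?LOG_INFO("charging user ~p amount ~p", [UserId, Amount]),
    ok;
charge(UserId, Amount) ->
    %% Structured (map) reports are easier to filter than format strings.
    ?LOG_ERROR(#{what => invalid_amount, user => UserId, amount => Amount}),
    {error, invalid_amount}.
```

The macros attach the module, function, and line to each event as metadata, which is usually worth having in production logs.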

4. Distributed System Considerations

Erlang excels at distribution, but it requires careful setup.

a. Secure Your Nodes

Distributed Erlang nodes authenticate each other with a shared secret "cookie". Any client that knows the cookie can connect to your node and execute arbitrary code on it, so choose a long random value and keep it private. Also, configure firewalls to restrict access to EPMD (port 4369 by default) and to the distribution ports themselves, which are assigned dynamically unless you pin them to a range.

# Start Erlang with a cookie
erl -sname mynode -setcookie my_secret_cookie
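Firewall rules are easier to write when the distribution listener is limited to a known port range. One way to do that is through the `kernel` application's `inet_dist_listen_min` and `inet_dist_listen_max` parameters, for example in a `sys.config` file. The port numbers below are arbitrary assumptions.

```erlang
%% sys.config (fragment): restrict distribution to a fixed port range.
%% 9100-9105 is an illustrative range, not a convention.
[
 {kernel,
  [{inet_dist_listen_min, 9100},
   {inet_dist_listen_max, 9105}]}
].
```

With this in place, only port 4369 (EPMD) and the chosen range need to be reachable between cluster members.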

b. Graceful Node Shutdowns

When a node shuts down or becomes unreachable, processes on other nodes that are linked to processes on it receive exit signals (with reason `noconnection` when the connection is lost). A process that traps exits receives these as `{'EXIT', Pid, Reason}` messages and can clean up resources or re-establish connections gracefully.
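A sketch of a `gen_server` that links to a peer process and reacts to its exit instead of dying with it is shown below. The module name `conn_owner`, the `peer` request, and the state shape are assumptions for illustration.

```erlang
-module(conn_owner).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(PeerPid) ->
    gen_server:start_link(?MODULE, PeerPid, []).

init(PeerPid) ->
    %% Trap exits so a dying peer arrives as a message, not a kill.
    process_flag(trap_exit, true),
    link(PeerPid),
    {ok, #{peer => PeerPid}}.

handle_call(peer, _From, State) ->
    {reply, maps:get(peer, State), State};
handle_call(_Request, _From, State) ->
    {reply, unknown_request, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({'EXIT', Pid, Reason}, #{peer := Pid} = State) ->
    %% Reason is noconnection when the peer's node became unreachable.
    logger:warning("peer ~p exited: ~p", [Pid, Reason]),
    {noreply, State#{peer := undefined}};
handle_info(_Info, State) ->
    {noreply, State}.
```

A real server would typically schedule a reconnect attempt at the point where this sketch only clears the `peer` field.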

c. Monitor Nodes and Handle Network Partitions

Use `net_kernel:monitor_nodes(true)` to receive `{nodeup, Node}` and `{nodedown, Node}` messages when nodes connect or disconnect. Be prepared for network partitions where nodes temporarily lose connectivity. Your system should be designed to keep operating during a partition and to reconcile state once it heals.
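As a rough sketch, a small `gen_server` can subscribe to these notifications and keep a list of currently visible nodes. The module name `node_watcher` and its `known_nodes/0` API are hypothetical.

```erlang
-module(node_watcher).
-behaviour(gen_server).

-export([start_link/0, known_nodes/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

known_nodes() ->
    gen_server:call(?MODULE, known_nodes).

init([]) ->
    %% Subscribe this process to nodeup/nodedown messages.
    ok = net_kernel:monitor_nodes(true),
    {ok, nodes()}.

handle_call(known_nodes, _From, Nodes) ->
    {reply, Nodes, Nodes};
handle_call(_Request, _From, Nodes) ->
    {reply, unknown_request, Nodes}.

handle_cast(_Msg, Nodes) ->
    {noreply, Nodes}.

handle_info({nodeup, Node}, Nodes) ->
    {noreply, lists:usort([Node | Nodes])};
handle_info({nodedown, Node}, Nodes) ->
    %% A nodedown may be a crash or a partition; they look the same here.
    {noreply, lists:delete(Node, Nodes)};
handle_info(_Info, Nodes) ->
    {noreply, Nodes}.
```

Note that a `nodedown` message alone cannot tell a stopped node from an unreachable one, which is why partition handling needs an application-level policy on top of this.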

5. Structuring Your OTP Application

OTP applications provide a standard way to package and manage your Erlang code.

a. Define OTP Applications Clearly

Every logical component of your system should be an OTP application. This provides a clear structure, manages dependencies, and defines the startup and shutdown sequence of your processes.

An .app file defines your application's metadata, including its dependencies and the callback module (the `mod` entry) that starts its top-level supervisor.

{application, my_app,
 [
  {description, "My Awesome OTP Application"},
  {vsn, "1.0.0"},
  {registered, []},
  {applications,
   [kernel,
    stdlib,
    sasl
   ]},
  {mod, {my_app_app, []}}, % The entry point for starting the application
  {env, []}
 ]}.
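The `mod` entry above names an application callback module. A minimal sketch of `my_app_app` follows; `my_app_sup` is an assumed name for the top-level supervisor and is not defined by the .app file itself.

```erlang
-module(my_app_app).
-behaviour(application).

-export([start/2, stop/1]).

%% Called when the application starts; must return {ok, Pid} of the
%% top-level supervisor.
start(_StartType, _StartArgs) ->
    %% my_app_sup is an assumed supervisor module name.
    my_app_sup:start_link().

%% Called after the application's supervision tree has been shut down.
stop(_State) ->
    ok.
```

With both pieces in place, `application:ensure_all_started(my_app)` starts the listed dependencies first and then calls `start/2`.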

b. Use Behaviors Consistently

Stick to standard OTP behaviors like `gen_server`, `gen_statem`, `supervisor`, and `gen_event`. They provide battle-tested patterns for common concurrency problems and make your code easier for others (and your future self) to understand.

6. Testing for Resilience

Testing is paramount for fault-tolerant systems.

Unit Tests: Test individual functions and `gen_server` callbacks.
Integration Tests: Verify that your supervision trees behave as expected under various failure scenarios (e.g., intentionally crashing child processes).
Property-Based Testing (e.g., with PropEr): Generate random inputs to uncover edge cases and ensure your system behaves correctly across a wide range of scenarios.
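An integration test of restart behaviour can be sketched with EUnit as follows. To stay self-contained, the test module doubles as the supervisor callback and uses a trivial worker; all names here are hypothetical.

```erlang
-module(restart_tests).
-behaviour(supervisor).
-include_lib("eunit/include/eunit.hrl").

-export([init/1, start_worker/0]).

%% Supervisor callback: one worker under one_for_one.
init([]) ->
    Child = #{id => worker, start => {?MODULE, start_worker, []}},
    {ok, {#{strategy => one_for_one, intensity => 3, period => 5}, [Child]}}.

%% A trivial worker that just waits for a stop message.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

worker_pid(Sup) ->
    [{worker, Pid, worker, _}] = supervisor:which_children(Sup),
    Pid.

%% Poll until the supervisor reports a live child other than Old.
wait_for_new_worker(_Sup, _Old, 0) ->
    not_restarted;
wait_for_new_worker(Sup, Old, Retries) ->
    case worker_pid(Sup) of
        Pid when is_pid(Pid), Pid =/= Old -> Pid;
        _ -> timer:sleep(10),
             wait_for_new_worker(Sup, Old, Retries - 1)
    end.

child_is_restarted_test() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Old = worker_pid(Sup),
    exit(Old, kill),                      % simulate a crash
    New = wait_for_new_worker(Sup, Old, 100),
    ?assert(is_pid(New)),
    ?assert(is_process_alive(New)),
    %% Stop the supervisor without taking the test process with it.
    unlink(Sup),
    exit(Sup, shutdown).
```

Running `eunit:test(restart_tests)` executes the test. The same shape extends to checking that `one_for_all` or `rest_for_one` restart the expected siblings.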

Conclusion

Building robust, distributed systems with Erlang OTP is immensely rewarding, but it demands discipline and adherence to best practices. By focusing on well-structured supervision trees, thoughtful message passing, proactive error handling, secure distribution, and clear application architecture, you'll be well on your way to crafting highly available and fault-tolerant applications.

These tips are just the beginning. In our next post, we'll shift gears from what to do to what to avoid, exploring common mistakes Erlang developers make and how to steer clear of them. Stay tuned!
