~ne02ptzero/libfloat

[PATCH] election: Implement gray-failure-proof elections, as described in the DynamoDB paper

Louis Solofrizzo <lsolofrizzo@scaleway.com>
Details
Message ID
<20221025154022.140172-1-lsolofrizzo@scaleway.com>
DKIM signature
missing
Patch: +155 -2
This patch implements a gray-failure election interface, as described below:

> Gray failures aka partial failures can lead the system to complete collapse if not handled carefully.
> These are failures where a replica is unable to reach the leader due to certain issues such as network
> failure, background garbage collection process etc. This results in replica invoking a leader election and
> in turn disrupting an already functioning system. To overcome this, DynamoDB has added a brilliant
> improvement in leader election process. Now whenever a node fails to reach the leader, it sends a
> message to other nodes in the system for confirmation. If any of the nodes send an acknowledgement
> about a healthy leader, it can cancel the leader election process and retry reaching the leader node. This
> in turn avoids a full-fledged leader election in case of transient issues in communication.

- Before any election request, a probe is sent to every member of the
  cluster in order to check whether it still has a leader. An election
  is launched only if no node in the cluster has one.
- This mitigates possible election storms on node restarts, as well as
  other useless election packets that would otherwise be sent.

Additionally, the last election reason is now stored and exposed through
the API for debugging purposes.
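
For illustration only (not part of the diff below), a consumer could
read the new field like this; the snippet assumes an initialised
libfloat_ctx_t named ctx and <stdio.h>:

    /* Sketch: print why the last election happened, for debugging */
    if (ctx->stat.last_election_reason != NULL)
        fprintf(stderr, "last election reason: %s\n", ctx->stat.last_election_reason);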

If the leader-check RPC (is_leader_healthy) is implemented by the
interface, the behavior on leader loss changes completely (a rough
wiring sketch follows the list below):

- We now send "is_leader_healthy" requests to every node in the cluster,
  including the leader we just lost.
- On the periodic loop, we check for the responses. During this time,
  elections are not started.
    - If any node responds that it still has a leader, the election is
      postponed for an entire new election_time. The idea is that either
      the leader is busy and needs some time to reach us, or the
      leader loss has not propagated yet and the last node to notice it
      will launch the election.
    - If every node (minus ourselves and the potentially dead leader)
      responds that it does not have a leader, an election is launched.
    - If no response at all arrives within an election_time, an
      election is launched.
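
As mentioned above, a rough sketch of what a consumer might wire up
(the libfloat symbols are the ones added by this patch;
my_send_leader_probe and peer_has_a_leader are hypothetical stand-ins
for the application's transport layer):

    /* Callback installed on the context: fire a "do you have a leader?"
     * probe towards `node` over whatever RPC layer the application uses. */
    static bool my_is_leader_healthy(struct libfloat_ctx_s *ctx, libfloat_node_t *node)
    {
        return my_send_leader_probe(ctx, node);
    }

    /* At setup time: */
    ctx->is_leader_healthy = my_is_leader_healthy;

    /* When the probe response from node `id` comes back: */
    libfloat_update_leader_check(ctx, id, peer_has_a_leader);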

Paper: https://www.usenix.org/system/files/atc22-elhemali.pdf

Signed-off-by: Louis Solofrizzo <lsolofrizzo@scaleway.com>
---
 election.c | 16 ++++++++++
 libfloat.c | 11 +++++++
 libfloat.h | 35 ++++++++++++++++++++++
 log.c      |  6 ++++
 node.h     |  3 ++
 periodic.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 6 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/election.c b/election.c
index 88530ec..601ab67 100644
--- a/election.c
+++ b/election.c
@@ -50,6 +50,8 @@ void __libfloat_election_start(libfloat_ctx_t *ctx, libfloat_elections_args_t *a
        libfloat_get_last_term(ctx, &last_id, &last_term);
    }

    ctx->stat.last_election_reason = args->reason;

    /* Send a vote request to each node of the cluster */
    for_every_node(ctx, node, {
        libfloat_rpc_request_vote_t     vote = { 0 };
@@ -232,3 +234,17 @@ void libfloat_election_resign_receive(libfloat_ctx_t *ctx)
    ERROR(ctx, "Restart election");
    libfloat_election_start(ctx, .reason = "Resign received");
}

void libfloat_update_leader_check(libfloat_ctx_t *ctx, libfloat_node_id_t id, bool has_leader)
{
    libfloat_node_t     *node = libfloat_get_node(ctx, id);

    if (node == NULL)
    {
        /* I don't know you, go away */
        return;
    }

    node->has_responded_to_leader_check = 1;
    node->has_a_leader = has_leader;
}
diff --git a/libfloat.c b/libfloat.c
index 62525fb..1000dc5 100644
--- a/libfloat.c
+++ b/libfloat.c
@@ -134,3 +134,14 @@ void libfloat_update_last_update(libfloat_ctx_t *ctx, libfloat_node_id_t id)
            node->last_update = ctx->time(NULL);
    }
}

void libfloat_send_heartbeat_to_node(libfloat_ctx_t *ctx, libfloat_node_id_t id)
{
    if (libfloat_am_i_leader(ctx))
    {
        libfloat_node_t         *node = libfloat_get_node(ctx, id);

        if (node != NULL)
            libfloat_send_append_entries(ctx, node, true);
    }
}
diff --git a/libfloat.h b/libfloat.h
index 17b3c89..b10d330 100644
--- a/libfloat.h
+++ b/libfloat.h
@@ -47,6 +47,7 @@ struct libfloat_ctx_s {
        uint64_t        leader_election;                /*!< Count of leader elections for this cluster */
        uint64_t        orphans_logs;                   /*!< Count of logs that are applied on the leader only */
        time_t          leader_election_time;           /*!< Timestamp of the last leader election */
        const char      *last_election_reason;          /*!< Last reason we had an election */
    } stat;

    struct {
@@ -73,6 +74,12 @@ struct libfloat_ctx_s {
    uint32_t                    logs_check;             /*!< Last check counter of log stuck */
    bool                        stepping_down;          /*!< Is the node stepping down from leadership */

    struct {
        bool                    checking;               /*!< Are we checking that every node has lost the leader? */
        time_t                  check_time;             /*!< Time that we launched the check */
    } gray_failures;


    bool                        is_snapshotting;        /*!< Is the cluster currently snapshotting */
    libfloat_entry_id_t         snapshot_to_commit;     /*!< Last log ID that is to be written after the snapshot is done */

@@ -96,6 +103,14 @@ struct libfloat_ctx_s {
     */
    void (*become_leader_cb)(struct libfloat_ctx_s *);

    /*!
     * \brief Send an "is leader healthy" request to a node
     *
     * \param[in] struct libfloat_ctx_s * - Cluster context
     * \param[in] libfloat_node_t * - Node to send the request to
     */
    bool (*is_leader_healthy)(struct libfloat_ctx_s *, libfloat_node_t *);

    /*!
     * \brief Send an AppendEntries request to a node
     *
@@ -577,4 +592,24 @@ void libfloat_reset_timeout_elapsed(libfloat_ctx_t *ctx);
 */
void libfloat_update_last_update(libfloat_ctx_t *ctx, libfloat_node_id_t id);

/*!
 * \brief Send a heartbeat RPC to a specific node
 *
 * \param[in] ctx Libfloat context
 * \param[in] id Node ID to send the heartbeat to
 *
 * \note Leader function only
 * \note You probably do not need to call this function by hand
 */
void libfloat_send_heartbeat_to_node(libfloat_ctx_t *ctx, libfloat_node_id_t id);

/*!
 * \brief Update a node-response for a gray-failure leader check
 *
 * \param[in] ctx Libfloat context
 * \param[in] id Node ID to update
 * \param[in] has_leader Whether or not the node has a leader
 */
void libfloat_update_leader_check(libfloat_ctx_t *ctx, libfloat_node_id_t id, bool has_leader);

#endif /* LIBFLOAT_H */
diff --git a/log.c b/log.c
index 2d740b0..5d0ef2b 100644
--- a/log.c
+++ b/log.c
@@ -380,6 +380,12 @@ void libfloat_append_entries_receive(libfloat_ctx_t *ctx, libfloat_rpc_append_en
{
    libfloat_node_t *node = libfloat_get_node(ctx, req->leader_id);

    if (ctx->gray_failures.checking == 1)
    {
        /* We were checking for gray failures, and a leader just appeared! */
        libfloat_become_follower(ctx, .reason = "Gray failure: Leader has woken up");
    }

    if (ctx->state == RAFT_STATE_CANDIDATE && ctx->persistent.term == req->term)
    {
        /* Someone has been elected! */
diff --git a/node.h b/node.h
index 9f0bed8..654f275 100644
--- a/node.h
+++ b/node.h
@@ -13,6 +13,9 @@ typedef struct {
    uint8_t             has_voted_for_me        : 1;
    uint8_t             is_up_to_date           : 1;

    uint8_t             has_a_leader                    : 1;
    uint8_t             has_responded_to_leader_check   : 1;

    void                *udata;                 /*!< User data */
    time_t              last_update;            /*!< Time of the last AE response (If I am the leader) */
    int                 snapshot_count;         /*!< Count of the times we are supposed to send a snapshot */
diff --git a/periodic.c b/periodic.c
index 15cf986..ce2c6f4 100644
--- a/periodic.c
+++ b/periodic.c
@@ -89,8 +89,90 @@ void libfloat_periodic(libfloat_ctx_t *ctx, uint32_t time)
        if (!ctx->is_snapshotting && !ctx->stepping_down)
        {
            /* We don't want leadership to change in the middle of a snapshot */
-            DEBUG(ctx, "New election: timeout elapsed %d", ctx->timeout_elapsed);
-            libfloat_election_start(ctx, .reason = "Election timeout, start a new one");
            if (ctx->is_leader_healthy == NULL)
            {
                /* DynamoDB-like elections are not implemented, let's simply start an election */
                DEBUG(ctx, "New election: timeout elapsed %d", ctx->timeout_elapsed);
                libfloat_election_start(ctx, .reason = "Election timeout, start a new one");
            }
            else
            {
                libfloat_node_t         *node = NULL;

                /* We have a DynamoDB-like gray-failure election implementation! Let's use it */
                if (ctx->gray_failures.checking == false)
                {
                    /* Discard leadership */
                    ctx->leader = NULL;

                    /* We did not fire the requests yet, let's init stuff and do it */
                    ctx->gray_failures.checking = true;
                    ctx->gray_failures.check_time = ctx->time(NULL);

                    for_every_node(ctx, node, {
                        if (node->id == ctx->me->id)
                        {
                            /* Skip myself */
                            continue;
                        }

                        node->has_a_leader = 0;
                        node->has_responded_to_leader_check = 0;
                        ctx->is_leader_healthy(ctx, node);
                    });
                }
                else
                {
                    uint32_t    node_with_leaders = 0;
                    uint32_t    node_responses = 0;

                    /* Requests have been fired, let's check the responses */
                    for_every_node(ctx, node, {
                        if (node->id == ctx->me->id)
                        {
                            /* Skip myself */
                            continue;
                        }

                        if (node->has_responded_to_leader_check == 1)
                        {
                            node_responses++;
                            if (node->has_a_leader == 1)
                                node_with_leaders++;
                        }
                    })

                    if (node_with_leaders > 0)
                    {
                        /* Some nodes in the cluster still have a leader, let's postpone the election for now */
                        ctx->timeout_elapsed = 0;
                        ctx->gray_failures.checking = false;
                    }
                    else if (node_responses >= ctx->n_nodes - 2 && node_with_leaders == 0)
                    {
                        /* We received a response from all the nodes (minus myself and the potentially dead leader), and no one has a leader, let's trigger an election */
                        ctx->gray_failures.checking = false;
                        libfloat_election_start(ctx, .reason = "Gray failures: Total leader loss");
                    }
                    else if (node_responses == 0)
                    {
                        if (ctx->gray_failures.check_time + (ctx->conf.election_timeout / 1000) < ctx->time(NULL))
                        {
                            /* We did not receive any responses, and the election timeout has expired twice, let's launch an election */
                            ctx->gray_failures.checking = false;
                            libfloat_election_start(ctx, .reason = "Gray failures: Complete timeout");

                            /* XXX: In this specific case, I'm not sure that triggering an election actually does something:
                             * - Either every node in the cluster is down / unreachable, and a quorum will never be reached anyway
                             * - Or a network partition is on-going, and a quorum will never be reached anyway
                             *
                             * This is a bit of a no-op in this case, but at least we're reporting a leader loss.
                             * This behavior is open to review; the DynamoDB paper does not seem to describe a behavior for these cases
                             */
                        }
                    }
                }
            }
        }
    }
}
-- 
2.38.0
Details
Message ID
<49145b9c5eef137357a7eebdadf64d7d@ptrk.io>
In-Reply-To
<20221025154022.140172-1-lsolofrizzo@scaleway.com>
DKIM signature
missing
LG
