Redis 7.0 Multi Part AOF: Design and Implementation

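Summary: This article details the shortcomings of the existing AOF mechanism in Redis, and the design and implementation of Multi Part AOF, introduced in Redis 7.0.

Redis is a very popular in-memory database. Keeping data in memory gives Redis extremely high read and write performance, but once the process exits, all of that data is lost.

To solve this problem, Redis provides two persistence schemes, RDB and AOF, which save the in-memory data to disk to avoid data loss. This article focuses on AOF persistence and some of its problems, and discusses the design and implementation details of Multi Part AOF (MP-AOF below; the feature was contributed by the Alibaba Cloud database Tair team) in Redis 7.0 (RC1 has been released).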

AOF

AOF (append only file) persistence records every write command in a separate log file, and Redis replays the commands in the AOF file at startup to restore the data.

AOFRW

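Because AOF appends every Redis write command, the AOF file keeps growing as Redis processes more write commands, and replaying it takes longer. To solve this problem, Redis introduced the AOF rewrite mechanism (AOFRW below). AOFRW removes redundant write commands from the AOF and rewrites it in an equivalent form as a new AOF file, which reduces the size of the AOF file.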
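The idea of removing redundant commands can be illustrated with a toy example. This is only a sketch of the concept, not how Redis implements AOFRW (Redis rewrites from a snapshot of the dataset rather than by scanning the old log). The `set_cmd` type, `MAX_KEYS`, and `rewrite_log` are hypothetical names introduced for illustration.

```c
#include <string.h>

#define MAX_KEYS 16  /* hypothetical capacity of the toy keyspace */

/* One logged write, modelled as "SET key value" (simplified record). */
typedef struct {
    const char *key;
    const char *value;
} set_cmd;

/* Rewrites `cmds` (n entries) into `out`, keeping only the final value of
 * each key, in order of first appearance. Returns the number of entries
 * written. `out` must have room for MAX_KEYS entries. */
size_t rewrite_log(const set_cmd *cmds, size_t n, set_cmd *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        /* Look for an entry that already holds this key. */
        for (j = 0; j < count; j++) {
            if (strcmp(out[j].key, cmds[i].key) == 0) break;
        }
        if (j == count) {
            if (count == MAX_KEYS) continue; /* sketch: drop keys past capacity */
            out[count++].key = cmds[i].key;
        }
        out[j].value = cmds[i].value; /* later writes win */
    }
    return count;
}
```

Four logged writes to two keys collapse to two commands, which is why the rewritten file is smaller and faster to replay.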

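Figure 1 shows how AOFRW works. When AOFRW is triggered, Redis first forks a child process to perform the rewrite in the background. The child writes the complete data snapshot of Redis, as of the moment of the fork, into a temporary AOF file named temp-rewriteaof-bg-pid.aof.

Because the rewrite runs in a background child process, the main process can still serve user commands normally during the rewrite. So that the child eventually also gets the incremental changes made by the main process during the rewrite, the main process writes each executed write command not only to aof_buf but also to aof_rewrite_buf, where it is cached. In the late stage of the child's rewrite, the main process sends the data accumulated in aof_rewrite_buf to the child through a pipe, and the child appends it to the temporary AOF file.

When the main process is handling heavy write traffic, a large amount of data can pile up in aof_rewrite_buf, so that the child cannot consume all of it during the rewrite. Whatever remains in aof_rewrite_buf is then handled by the main process when the rewrite ends.

After the child finishes the rewrite and exits, the main process handles the follow-up work in backgroundRewriteDoneHandler. First, it appends the data in aof_rewrite_buf that was not consumed during the rewrite to the temporary AOF file. Then, when everything is ready, Redis uses rename to atomically rename the temporary AOF file to server.aof_filename, overwriting the original AOF file. This completes the AOFRW process.

Figure 1: How AOFRW works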

Problems with AOFRW

Memory overhead

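As Figure 1 shows, during AOFRW the main process writes the data changes made after the fork into aof_rewrite_buf. Most of the content in aof_rewrite_buf and aof_buf is identical, so this is redundant memory overhead.

The aof_rewrite_buffer_length field in Redis INFO shows how much memory aof_rewrite_buf currently occupies. As shown below, under heavy write traffic aof_rewrite_buffer_length takes up almost as much memory as aof_buffer_length, so nearly twice the memory is used for the same data.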

aof_pending_rewrite:0
aof_buffer_length:35500
aof_rewrite_buffer_length:34000
aof_pending_bio_fsync:0

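When the memory occupied by aof_rewrite_buf exceeds a certain threshold, Redis logs messages like the following. Here aof_rewrite_buf occupied 100 MB of memory, and 2135 MB of data was transferred between the main process and the child (the child also pays for an internal read buffer when it reads this data from the pipe). For an in-memory database like Redis, this is a significant cost.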

3351:M 25 Jan 2022 09:55:39.655 * Background append only file rewriting started by pid 6817
3351:M 25 Jan 2022 09:57:51.864 * AOF rewrite child asks to stop sending diffs.
6817:C 25 Jan 2022 09:57:51.864 * Parent agreed to stop sending diffs. Finalizing AOF...
6817:C 25 Jan 2022 09:57:51.864 * Concatenating 2135.60 MB of AOF diff received from parent.
3351:M 25 Jan 2022 09:57:56.545 * Background AOF buffer size: 100 MB

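The memory overhead of AOFRW can make Redis suddenly hit its maxmemory limit, which affects normal writes. It can even push the process past operating-system limits so that the OOM Killer kills it, leaving Redis unavailable.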

CPU overhead

The CPU overhead comes from three places, explained below.

During AOFRW, the main process spends CPU time writing data to aof_rewrite_buf, and uses the event loop to send the data in aof_rewrite_buf to the child:

/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Other details omitted...

    /* Install a file event to send data to the rewrite child if there is
     * not one already. */
    if (!server.aof_stop_sending_diff &&
        aeGetFileEvents(server.el,server.aof_pipe_write_data_to_child) == 0)
    {
        aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
            AE_WRITABLE, aofChildWriteDiffData, NULL);
    }

    // Other details omitted...
}

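In the late stage of the rewrite, the child repeatedly reads the incremental data sent by the main process from the pipe and appends it to the temporary AOF file: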

int rewriteAppendOnlyFile(char *filename) {
    // Other details omitted...

    /* Read again a few times to get more data from the parent.
     * We can't read forever (the server may receive data from clients
     * faster than it is able to send data to the child), so we try to read
     * some more data in a loop as soon as there is a good chance more data
     * will come. If it looks like we are wasting time, we abort (this
     * happens after 20 ms without new data). */
    int nodata = 0;
    mstime_t start = mstime();
    while(mstime()-start < 1000 && nodata < 20) {
        if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
        {
            nodata++;
            continue;
        }
        nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                       timeouts. */
        aofReadDiffFromParent();
    }
    // Other details omitted...
}

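After the child finishes the rewrite, the main process does the wrap-up work in backgroundRewriteDoneHandler. One of its tasks is to write the data in aof_rewrite_buf that was not consumed during the rewrite to the temporary AOF file. If a lot of data is left in aof_rewrite_buf, this also consumes CPU time.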

void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    // Other details omitted...

    /* Flush the differences accumulated by the parent to the rewritten AOF. */
    if (aofRewriteBufferWrite(newfd) == -1) {
        serverLog(LL_WARNING,
                "Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
        close(newfd);
        goto cleanup;
    }

    // Other details omitted...
}

The CPU overhead of AOFRW can cause response-time (RT) jitter when Redis executes commands, and can even cause client timeouts.

Disk IO overhead

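As described above, during AOFRW the main process writes each executed write command to aof_rewrite_buf in addition to aof_buf. The data in aof_buf is eventually written to the old AOF file currently in use, which costs disk IO. The data in aof_rewrite_buf is also written to the new AOF file produced by the rewrite, which costs disk IO again. The same data is therefore written to disk twice.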

Code complexity

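Redis uses the six pipes shown below for data transfer and control interaction between the main process and the child, which makes the whole AOFRW logic more complex and harder to understand.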

/* AOF pipes used to communicate between parent and child during rewrite. */
int aof_pipe_write_data_to_child;
int aof_pipe_read_data_from_parent;
int aof_pipe_write_ack_to_parent;
int aof_pipe_read_ack_from_child;
int aof_pipe_write_ack_to_child;
int aof_pipe_read_ack_from_parent;

MP-AOF implementation

Overview

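As the name suggests, MP-AOF splits the original single AOF file into multiple AOF files. In MP-AOF, we divide AOFs into three types:

BASE: the base AOF. It is generally produced by the child process through a rewrite, and there is at most one such file.
INCR: an incremental AOF. It is generally created when AOFRW starts, and there may be several such files.
HISTORY: a historical AOF. It is derived from BASE and INCR AOFs: each time AOFRW completes successfully, the BASE and INCR AOFs that existed before that AOFRW become HISTORY. Redis deletes HISTORY AOFs automatically.

To manage these AOF files, we introduced a manifest file that tracks and manages them. To make AOF backup and copying easier, we also put all AOF files and the manifest file into a separate directory, whose name is set by the appenddirname configuration (new in Redis 7.0).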

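Figure 2: How MP-AOF rewrite works

Figure 2 shows the general flow of one AOFRW in MP-AOF. At the start we still fork a child process to do the rewrite. In the main process, we open a new INCR AOF file at the same time, and all data changes made while the child is rewriting are written to this newly opened INCR AOF. The child's rewrite is completely independent: during the rewrite it has no data or control interaction with the main process, and it finally produces a BASE AOF. The newly generated BASE AOF together with the newly opened INCR AOF represents all of the data in Redis at that moment. When AOFRW ends, the main process updates the manifest file: it adds the information of the newly generated BASE AOF and INCR AOF and marks the previous BASE AOF and INCR AOFs as HISTORY (Redis deletes these HISTORY AOFs asynchronously). Once the manifest file has been updated, the whole AOFRW process is complete.

As Figure 2 shows, we no longer need aof_rewrite_buf during AOFRW, so the corresponding memory consumption is gone. The main process and the child no longer transfer data or exchange control messages, so the corresponding CPU overhead is gone as well. Accordingly, the six pipes mentioned above and their code have all been removed, which makes the AOFRW logic simpler and clearer.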

Key implementation

Manifest

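In-memory representation

MP-AOF depends heavily on the manifest file. In memory, the manifest is represented by the following structures, where:

aofInfo: information about one AOF file; currently only the file name, file sequence number, and file type
base_aof_info: information about the BASE AOF; NULL when there is no BASE AOF
incr_aof_list: holds the information of all INCR AOF files, ordered by the time the files were opened
history_aof_list: holds the information of HISTORY AOFs; its elements are all moved over from base_aof_info and incr_aof_list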

typedef struct {
    sds           file_name;  /* file name */
    long long     file_seq;   /* file sequence */
    aof_file_type file_type;  /* file type */
} aofInfo;
typedef struct {
    aofInfo     *base_aof_info;       /* BASE file information. NULL if there is no BASE file. */
    list        *incr_aof_list;       /* INCR AOFs list. We may have multiple INCR AOF when rewrite fails. */
    list        *history_aof_list;    /* HISTORY AOF list. When the AOFRW success, The aofInfo contained in
                                         `base_aof_info` and `incr_aof_list` will be moved to this list. We
                                         will delete these AOF files when AOFRW finish. */
    long long   curr_base_file_seq;   /* The sequence number used by the current BASE file. */
    long long   curr_incr_file_seq;   /* The sequence number used by the current INCR file. */
    int         dirty;                /* 1 Indicates that the aofManifest in the memory is inconsistent with
                                         disk, we need to persist it immediately. */
} aofManifest;

To make atomic modification and rollback easier, the redisServer structure references aofManifest through a pointer.

struct redisServer {
    // Other details omitted...
    aofManifest *aof_manifest;       /* Used to track AOFs. */
    // Other details omitted...
};

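On-disk representation

The manifest is essentially a text file with multiple lines, where each line describes one AOF file. The information is expressed as key/value pairs, which is easy for Redis to process and easy for people to read and modify. A possible manifest file looks like this: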

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

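The manifest format itself needs to be extensible so that other features can be added or supported in the future. For example, it can easily support new key/value pairs and annotations (similar to the annotations in AOF), which ensures good forward compatibility.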

file appendonly.aof.1.base.rdb seq 1 type b newkey newvalue
file appendonly.aof.1.incr.aof type i seq 1
# this is annotations
seq 2 type i file appendonly.aof.2.incr.aof
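A line-oriented key/value format like the one above can be read with a small tokenizer. The following is a minimal sketch, assuming whitespace-separated tokens, `#` as the annotation marker, and the three keys `file`, `seq`, and `type` seen in the samples. `manifest_entry`, `next_token`, and `parse_manifest_line` are hypothetical names, not the Redis functions. Unknown keys are skipped, which is what lets an older reader accept lines written by a newer version.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char name[256];  /* value of "file" */
    long long seq;   /* value of "seq" */
    char type;       /* value of "type": 'b', 'i' or 'h' */
} manifest_entry;

/* Copies the next whitespace-delimited token from *p into tok (capacity n)
 * and advances *p. Returns 1 if a token was read, 0 at end of line,
 * -1 if the token does not fit. */
static int next_token(const char **p, char *tok, size_t n) {
    const char *s = *p + strspn(*p, " \t\r\n");
    size_t len = strcspn(s, " \t\r\n");
    if (len == 0) return 0;
    if (len >= n) return -1;
    memcpy(tok, s, len);
    tok[len] = '\0';
    *p = s + len;
    return 1;
}

/* Parses one manifest line into `e`. Key/value pairs may come in any order.
 * Returns 1 on success, 0 for a blank or annotation line, -1 if malformed. */
int parse_manifest_line(const char *line, manifest_entry *e) {
    char key[64], val[256];
    int have_file = 0, have_seq = 0, have_type = 0;
    const char *p = line;

    int r = next_token(&p, key, sizeof(key));
    if (r < 0) return -1;
    if (r == 0 || key[0] == '#') return 0;

    while (r == 1) {
        if (next_token(&p, val, sizeof(val)) != 1) return -1; /* key without value */
        if (strcmp(key, "file") == 0) {
            strcpy(e->name, val);   /* safe: val and name have the same capacity */
            have_file = 1;
        } else if (strcmp(key, "seq") == 0) {
            e->seq = strtoll(val, NULL, 10);
            have_seq = 1;
        } else if (strcmp(key, "type") == 0) {
            e->type = val[0];
            have_type = 1;
        } /* any other key is ignored for forward compatibility */
        r = next_token(&p, key, sizeof(key));
    }
    if (r < 0) return -1;
    return (have_file && have_seq && have_type) ? 1 : -1;
}
```

With this sketch, the reordered line `seq 2 type i file appendonly.aof.2.incr.aof` and the line carrying the extra `newkey newvalue` pair both parse to the same three fields as their plain counterparts.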

File naming rules

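Before MP-AOF, the AOF file name was the value of the appendfilename parameter (appendonly.aof by default).

In MP-AOF, we name the AOF files in the form basename.suffix. The appendfilename setting is used as the basename, and the suffix has three parts in the form seq.type.format, where:

seq is the sequence number of the file. It increases monotonically from 1, and BASE and INCR files have independent sequence numbers
type is the type of the AOF, indicating whether the file is a BASE or an INCR
format indicates the encoding used inside the AOF. Because Redis supports the RDB preamble mechanism, a BASE AOF may be encoded in RDB format or in AOF format: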

#define BASE_FILE_SUFFIX           ".base"
#define INCR_FILE_SUFFIX           ".incr"
#define RDB_FORMAT_SUFFIX          ".rdb"
#define AOF_FORMAT_SUFFIX          ".aof"
#define MANIFEST_NAME_SUFFIX       ".manifest"

So with the default appendfilename, the BASE, INCR, and manifest files may be named as follows:

appendonly.aof.1.base.rdb // RDB preamble enabled
appendonly.aof.1.base.aof // RDB preamble disabled
appendonly.aof.1.incr.aof
appendonly.aof.2.incr.aof
appendonly.aof.manifest
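The basename.seq.type.format rule can be sketched as a single formatting call. `aof_file_name` and its parameters are hypothetical names chosen for this illustration. The sketch assumes that INCR files always use the AOF format, as in the examples above.

```c
#include <stdio.h>

/* Builds "<basename>.<seq>.base.rdb", "<basename>.<seq>.base.aof" or
 * "<basename>.<seq>.incr.aof" into `out` (capacity n).
 * is_base: nonzero for a BASE file, zero for an INCR file.
 * use_rdb: nonzero when the RDB preamble is enabled (only affects BASE).
 * Returns 0 on success, -1 if the name does not fit. */
int aof_file_name(char *out, size_t n, const char *basename,
                  long long seq, int is_base, int use_rdb) {
    const char *type = is_base ? ".base" : ".incr";
    const char *format = (is_base && use_rdb) ? ".rdb" : ".aof";
    int len = snprintf(out, n, "%s.%lld%s%s", basename, seq, type, format);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}
```

Calling it with `"appendonly.aof"`, sequence 1, BASE, and the RDB preamble enabled produces `appendonly.aof.1.base.rdb`, matching the first name in the list.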

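Compatibility with upgrades from older versions

Because MP-AOF depends heavily on the manifest file, Redis loads AOF files at startup strictly as the manifest directs. When upgrading from an older Redis (any version before 7.0) to Redis 7.0, however, there is no manifest file yet. Redis must therefore be able to recognize that this is an upgrade and load the old AOF correctly and safely.

Recognition is the first step of this process. Before actually loading any AOF file, we check whether an AOF file named server.aof_filename exists in the Redis working directory. If it does, we may be upgrading from an older Redis. We then check further, and treat this as an upgrade startup when any one of the following three conditions holds:

The appenddirname directory does not exist
The appenddirname directory exists, but it contains no manifest file
The appenddirname directory exists and contains a manifest file, the manifest contains only BASE AOF information, the name of that BASE AOF is the same as server.aof_filename, and no file named server.aof_filename exists in the appenddirname directory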

/* Load the AOF files according the aofManifest pointed by am. */
int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted...

    /* If the 'server.aof_filename' file exists in dir, we may be starting
     * from an old redis version. We will use enter upgrade mode in three situations.
     *
     * 1. If the 'server.aof_dirname' directory not exist
     * 2. If the 'server.aof_dirname' directory exists but the manifest file is missing
     * 3. If the 'server.aof_dirname' directory exists and the manifest file it contains
     *    has only one base AOF record, and the file name of this base AOF is 'server.aof_filename',
     *    and the 'server.aof_filename' file not exist in 'server.aof_dirname' directory
     * */
    if (fileExist(server.aof_filename)) {
        if (!dirExists(server.aof_dirname) ||
            (am->base_aof_info == NULL && listLength(am->incr_aof_list) == 0) ||
            (am->base_aof_info != NULL && listLength(am->incr_aof_list) == 0 &&
             !strcmp(am->base_aof_info->file_name, server.aof_filename) && !aofFileExist(server.aof_filename)))
        {
            aofUpgradePrepare(am);
        }
    }

    // Other details omitted...
}

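Once a startup is recognized as an upgrade, we use the aofUpgradePrepare function to do the preparation work for the upgrade.

The preparation has three main parts:

Construct BASE AOF information using server.aof_filename as the file name
Persist that BASE AOF information to the manifest file
Use rename to move the old AOF file into the appenddirname directory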

void aofUpgradePrepare(aofManifest *am) {
    // Other details omitted...

    /* 1. Manually construct a BASE type aofInfo and add it to aofManifest. */
    if (am->base_aof_info) aofInfoFree(am->base_aof_info);
    aofInfo *ai = aofInfoCreate();
    ai->file_name = sdsnew(server.aof_filename);
    ai->file_seq = 1;
    ai->file_type = AOF_FILE_TYPE_BASE;
    am->base_aof_info = ai;
    am->curr_base_file_seq = 1;
    am->dirty = 1;
    /* 2. Persist the manifest file to AOF directory. */
    if (persistAofManifest(am) != C_OK) {
        exit(1);
    }
    /* 3. Move the old AOF file to AOF directory. */
    sds aof_filepath = makePath(server.aof_dirname, server.aof_filename);
    if (rename(server.aof_filename, aof_filepath) == -1) {
        sdsfree(aof_filepath);
        exit(1);
    }

    // Other details omitted...
}

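The upgrade preparation is crash safe: if a crash happens in any of the three steps above, the next startup correctly recognizes the situation and retries the whole upgrade.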

Multi-file loading and progress calculation

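While loading an AOF, Redis records the loading progress and exposes it through the loading_loaded_perc field of Redis INFO. In MP-AOF, the loadAppendOnlyFiles function loads AOF files according to the aofManifest passed in. Before loading, we need to calculate the total size of all AOF files to be loaded in advance and pass it to the startLoading function, and then keep reporting the loading progress from loadSingleAppendOnlyFile.

Next, loadAppendOnlyFiles loads the BASE AOF and then the INCR AOFs in the order given by aofManifest. Once all AOF files have been loaded, it calls stopLoading to end the loading state.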

int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted...
    /* Here we calculate the total size of all BASE and INCR files in
     * advance, it will be set to `server.loading_total_bytes`. */
    total_size = getBaseAndIncrAppendOnlyFilesSize(am);
    startLoading(total_size, RDBFLAGS_AOF_PREAMBLE, 0);
    /* Load BASE AOF if needed. */
    if (am->base_aof_info) {
        aof_name = (char*)am->base_aof_info->file_name;
        updateLoadingFileName(aof_name);
        loadSingleAppendOnlyFile(aof_name);
    }
    /* Load INCR AOFs if needed. */
    if (listLength(am->incr_aof_list)) {
        listNode *ln;
        listIter li;
        listRewind(am->incr_aof_list, &li);
        while ((ln = listNext(&li)) != NULL) {
            aofInfo *ai = (aofInfo*)ln->value;
            aof_name = (char*)ai->file_name;
            updateLoadingFileName(aof_name);
            loadSingleAppendOnlyFile(aof_name);
        }
    }

    server.aof_current_size = total_size;
    server.aof_rewrite_base_size = server.aof_current_size;
    server.aof_fsync_offset = server.aof_current_size;
    stopLoading();

    // Other details omitted...
}
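The reason the total size has to be known up front is that a single percentage must span several files. The following sketch shows that bookkeeping in isolation. `load_progress` and the `progress_*` functions are hypothetical names, and returning 100 for an empty set of files is a choice made for this example rather than something the article states.

```c
typedef struct {
    long long total_bytes;   /* sum of the sizes of all BASE and INCR files */
    long long loaded_bytes;  /* bytes replayed so far, across all files */
} load_progress;

/* Starts tracking: the total is computed once, before any file is loaded. */
void progress_start(load_progress *p, const long long *sizes, int nfiles) {
    p->total_bytes = 0;
    p->loaded_bytes = 0;
    for (int i = 0; i < nfiles; i++) p->total_bytes += sizes[i];
}

/* Called from the per-file loader as it consumes bytes. */
void progress_add(load_progress *p, long long nbytes) {
    p->loaded_bytes += nbytes;
}

/* Percentage of the whole multi-file load that is done. */
double progress_perc(const load_progress *p) {
    if (p->total_bytes <= 0) return 100.0; /* nothing to load */
    return (double)p->loaded_bytes * 100.0 / (double)p->total_bytes;
}
```

With file sizes of 600, 300, and 100 bytes, finishing the first file reports 60 percent rather than 100 percent, because the denominator covers all three files.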

AOFRW Crash Safety

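When the child finishes the rewrite, it has created a temporary AOF file named temp-rewriteaof-bg-pid.aof. At this point the file is still invisible to Redis, because it has not been added to the manifest file. For Redis to recognize it and load it correctly at startup, we still need to rename it according to the naming rules described above and add its information to the manifest file.

Renaming the AOF file and modifying the manifest file are two separate operations, but we must guarantee that together they behave atomically so that Redis loads the right AOFs at startup. MP-AOF uses two design choices to solve this problem:

The name of a BASE AOF contains the file sequence number, so a newly created BASE AOF never conflicts with a previous BASE AOF
The AOF rename is performed first, and the manifest file is modified afterwards

For illustration, assume the manifest file has the following content before AOFRW starts: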

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i

Then after AOFRW starts, the manifest file has the following content:

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

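After the child finishes the rewrite, the main process renames temp-rewriteaof-bg-pid.aof to appendonly.aof.2.base.rdb and adds it to the manifest, and marks the previous BASE and INCR AOFs as HISTORY. The manifest file then has the following content: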

file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.1.base.rdb seq 1 type h
file appendonly.aof.1.incr.aof seq 1 type h
file appendonly.aof.2.incr.aof seq 2 type i

At this point the result of this AOFRW is visible to Redis, and the HISTORY AOFs are cleaned up asynchronously by Redis.
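The manifest transitions in this walkthrough can be reproduced with a small in-memory model. This is a simplified sketch, not the Redis data structure: it keeps all entries in one flat array, hard-codes the default `appendonly.aof` basename and the RDB format for BASE, and uses the hypothetical names `entry`, `manifest`, `add_file`, `rewrite_start`, and `rewrite_success`.

```c
#include <stdio.h>

#define MAX_FILES 16  /* hypothetical capacity of the toy manifest */

typedef struct {
    char name[64];
    long long seq;
    char type;            /* 'b' = BASE, 'i' = INCR, 'h' = HISTORY */
} entry;

typedef struct {
    entry files[MAX_FILES];
    int count;
    long long base_seq;   /* last sequence number used for a BASE file */
    long long incr_seq;   /* last sequence number used for an INCR file */
} manifest;

/* Appends one entry named appendonly.aof.<seq>.<suffix>. */
static int add_file(manifest *m, long long seq, char type, const char *suffix) {
    if (m->count == MAX_FILES) return -1;
    entry *e = &m->files[m->count++];
    snprintf(e->name, sizeof(e->name), "appendonly.aof.%lld.%s", seq, suffix);
    e->seq = seq;
    e->type = type;
    return 0;
}

/* AOFRW start: open a new INCR file for writes made during the rewrite. */
int rewrite_start(manifest *m) {
    if (m->count == MAX_FILES) return -1;
    return add_file(m, ++m->incr_seq, 'i', "incr.aof");
}

/* AOFRW success: every file except the newest INCR becomes HISTORY,
 * then the new BASE is registered under the next BASE sequence number. */
int rewrite_success(manifest *m) {
    if (m->count == MAX_FILES) return -1;
    for (int i = 0; i < m->count; i++) {
        entry *e = &m->files[i];
        int newest_incr = (e->type == 'i' && e->seq == m->incr_seq);
        if (!newest_incr) e->type = 'h';
    }
    return add_file(m, ++m->base_seq, 'b', "base.rdb");
}
```

Starting from one BASE and one INCR with sequence 1, calling `rewrite_start` and then `rewrite_success` leaves two HISTORY entries, the INCR with sequence 2, and a new BASE with sequence 2, which is the same set of records as the final manifest above (the order of lines differs in this flat model).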

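The backgroundRewriteDoneHandler function implements the logic above in seven steps:

Before modifying the in-memory server.aof_manifest, dup a temporary manifest structure; all subsequent modifications are made to this temporary manifest. The benefit is that if a later step fails, we can simply destroy the temporary manifest to roll back the whole operation, without polluting the global server.aof_manifest data structure
Get the new BASE AOF file name from the temporary manifest (call it new_base_filename), and mark the previous BASE AOF (if any) as HISTORY
Rename the temp-rewriteaof-bg-pid.aof temporary file produced by the child to new_base_filename
Mark all INCR AOFs from before this rewrite in the temporary manifest as HISTORY
Persist the temporary manifest to disk (persistAofManifest internally guarantees that the modification of the manifest itself is atomic)
If all the steps above succeed, point the in-memory server.aof_manifest at the temporary manifest structure (and free the previous manifest structure); at this point the whole modification is visible to Redis
Clean up the HISTORY AOFs. This step is allowed to fail, because it cannot cause data consistency problems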

void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof",
        (int)server.child_pid);
    /* 1. Dup a temporary aof_manifest for subsequent modifications. */
    temp_am = aofManifestDup(server.aof_manifest);
    /* 2. Get a new BASE file name and mark the previous (if we have)
     * as the HISTORY type. */
    new_base_filename = getNewBaseFileNameAndMarkPreAsHistory(temp_am);
    /* 3. Rename the temporary aof file to 'new_base_filename'. */
    if (rename(tmpfile, new_base_filename) == -1) {
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 4. Change the AOF file type in 'incr_aof_list' from AOF_FILE_TYPE_INCR
     * to AOF_FILE_TYPE_HIST, and move them to the 'history_aof_list'. */
    markRewrittenIncrAofAsHistory(temp_am);
    /* 5. Persist our modifications. */
    if (persistAofManifest(temp_am) == C_ERR) {
        bg_unlink(new_base_filename);
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 6. We can safely let `server.aof_manifest` point to 'temp_am' and free the previous one. */
    aofManifestFreeAndUpdate(temp_am);
    /* 7. We don't care about the return value of `aofDelHistoryFiles`, because the history
     * deletion failure will not cause any problems. */
    aofDelHistoryFiles();
}

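AOF truncate support

When the process crashes, the AOF file may well have been written incompletely, for example when only the MULTI of a transaction was written and Redis crashed before EXEC was written. Redis cannot load such an incomplete AOF as it is, but it supports an AOF truncate feature (controlled by the aof-load-truncated configuration). It works by using server.aof_current_size to track the last valid file offset in the AOF, and then using ftruncate to delete everything after that offset. Some data may be lost, but the integrity of the AOF is preserved.

In MP-AOF, server.aof_current_size no longer represents the size of a single AOF file but the total size of all AOF files. Because only the last INCR AOF can be incompletely written, we introduced a separate field, server.aof_last_incr_size, to track the size of the last INCR AOF file. When the last INCR AOF is incompletely written, we only need to delete the file content after server.aof_last_incr_size.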

if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
    // Other details omitted...
}
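The truncation step itself is a plain POSIX call. The sketch below isolates it from Redis: `truncate_to_valid` is a hypothetical helper that cuts a file back to a known-good offset, standing in for the way the last INCR AOF is cut back to server.aof_last_incr_size. How that offset is determined is outside the scope of the example.

```c
#define _POSIX_C_SOURCE 200809L  /* expose ftruncate under strict -std=c11 */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Cuts the file behind `fd` back to `valid_size` bytes if it is longer.
 * Returns the resulting file size, or -1 on error. */
long long truncate_to_valid(int fd, long long valid_size) {
    struct stat st;
    if (fstat(fd, &st) == -1) return -1;
    if ((long long)st.st_size <= valid_size) {
        return (long long)st.st_size;   /* nothing to cut */
    }
    if (ftruncate(fd, (off_t)valid_size) == -1) return -1;
    return valid_size;
}
```

For a file that ends with a half-written transaction, passing the offset of the last complete command drops the partial tail and leaves everything before it untouched.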

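AOFRW rate limiting

Redis can run AOFRW automatically when the AOF size exceeds a certain threshold. When a disk failure or a code bug makes AOFRW fail, Redis keeps retrying AOFRW until it succeeds. Before MP-AOF this did not look like a big problem (at worst it cost some CPU time and fork overhead). In MP-AOF, however, every AOFRW opens a new INCR AOF, and only a successful AOFRW turns the previous INCR and BASE into HISTORY and deletes them. Consecutive AOFRW failures therefore inevitably leave multiple INCR AOFs side by side. In the extreme case, if AOFRW is retried at a high frequency, we would see hundreds or thousands of INCR AOF files.

For this reason we introduced the AOFRW rate limiting mechanism. Once AOFRW has failed three times in a row, the next AOFRW is forcibly delayed by 1 minute. If that AOFRW also fails, the delay becomes 2 minutes, then 4, 8, 16, and so on. The maximum delay is currently 1 hour.

While AOFRW is rate limited, we can still use the bgrewriteaof command to run an AOFRW immediately.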

if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_rewrite_perc &&
    server.aof_current_size > server.aof_rewrite_min_size &&
    !aofRewriteLimited())
{
    long long base = server.aof_rewrite_base_size ?
        server.aof_rewrite_base_size : 1;
    long long growth = (server.aof_current_size*100/base) - 100;
    if (growth >= server.aof_rewrite_perc) {
        rewriteAppendOnlyFileBackground();
    }
}

The AOFRW rate limiting mechanism also effectively avoids the CPU and fork overhead of high-frequency AOFRW retries. Much of the RT jitter in Redis is related to fork.
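The delay schedule described in this section (no delay below three consecutive failures, then 1 minute, doubling each time, capped at 1 hour) can be written as a pure function. This is a stateless sketch for illustration only. `rewrite_delay_minutes`, `LIMIT_THRESHOLD`, and `MAX_DELAY_MINUTES` are hypothetical names, and the actual aofRewriteLimited function may track its state differently.

```c
#define LIMIT_THRESHOLD   3   /* consecutive failures before limiting starts */
#define MAX_DELAY_MINUTES 60  /* upper bound on the delay: 1 hour */

/* Returns how many minutes the next automatic AOFRW should be delayed,
 * given the number of consecutive AOFRW failures so far. */
int rewrite_delay_minutes(int consecutive_failures) {
    if (consecutive_failures < LIMIT_THRESHOLD) return 0;
    int delay = 1;
    for (int i = LIMIT_THRESHOLD; i < consecutive_failures; i++) {
        delay *= 2;
        /* Stop early at the cap, which also avoids integer overflow. */
        if (delay >= MAX_DELAY_MINUTES) return MAX_DELAY_MINUTES;
    }
    return delay;
}
```

Under these assumptions the delays for 3 through 9 consecutive failures are 1, 2, 4, 8, 16, 32, and 60 minutes, and every later failure stays at 60.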

Summary

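The introduction of MP-AOF resolves the adverse effects that the memory and CPU overhead of the old AOFRW had on Redis instances and even on business traffic. While solving these problems we also ran into many unexpected challenges, which came mainly from Redis's huge user base and diverse usage scenarios. We therefore had to consider the problems users might hit when using MP-AOF in all kinds of scenarios, such as compatibility, ease of use, and keeping the changes to the Redis code as non-intrusive as possible. These are top priorities in the evolution of features in the Redis community.

MP-AOF also opens up more possibilities for data persistence in Redis. For example, when aof-use-rdb-preamble is enabled, the BASE AOF is essentially an RDB file, so a full backup no longer needs a separate BGSAVE; backing up the BASE AOF directly is enough. MP-AOF supports turning off the automatic cleanup of HISTORY AOFs, so those historical AOFs can be retained, and Redis now supports adding timestamp annotations to the AOF. Building on these, we could even implement a simple PITR (point-in-time recovery) capability.

The design prototype of MP-AOF comes from the binlog implementation in the Tair for Redis enterprise edition. It is a core feature that has been proven over a long time on the Alibaba Cloud Tair service, and on top of it Alibaba Cloud Tair has built enterprise capabilities such as global active-active deployment and PITR, meeting the needs of more business scenarios. Today we are contributing this core capability to the Redis community, hoping that community users can also enjoy these enterprise-grade features and use them to better optimize and build their own business code. For more details about MP-AOF, see the related PR (#9788), which has more of the original design and the complete code.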

This article is original content from Alibaba Cloud and may not be reproduced without permission.


