Author: Tan Long (谭龙), China Telecom Cloud (天翼云)
Keywords: SPDK, NVMe-oF, Ceph, CPU load balancing
SPDK (Storage Performance Development Kit) is a set of tools and libraries, developed under Intel's lead, for writing high-performance, scalable, user-mode storage applications. It achieves its performance through a few key techniques:

1. All required drivers are moved into user space, which avoids system calls and enables zero-copy access.
2. I/O is driven by polling the hardware rather than relying on interrupts, which lowers latency.
3. Message passing is used to avoid locks in the I/O path.

SPDK is a framework, not a distributed system. Its cornerstone is a user-space, polled-mode, asynchronous, lock-free NVMe driver that provides zero-copy, highly concurrent, direct user-mode access to SSDs. SPDK was originally created to optimize the performance of writing block storage to disk, but as it has evolved it has been applied to optimizing a variety of storage protocol stacks. The SPDK architecture is divided into a protocol layer, a service layer, and a driver layer. The protocol layer includes the NVMe-oF target, vhost-nvme target, iSCSI target, vhost-scsi target, and vhost-blk target; the service layer includes logical volumes (LV), RAID, AIO, malloc, and Ceph RBD; the driver layer mainly consists of the NVMe-oF initiator, the NVMe PCIe driver, virtio, and other drivers for persistent memory.
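As a concrete illustration of the message-passing point above, the sketch below shows SPDK's message-passing primitive, spdk_thread_send_msg: instead of guarding shared state with a lock, a component sends a small message to the spdk_thread that owns the state and lets that thread apply the change. This is a minimal sketch only; the owner thread and counter are hypothetical, and it assumes the code runs inside an already-initialized SPDK application whose reactors are running.

/* Minimal sketch of SPDK's message-passing model: rather than taking a lock,
 * work is handed to the spdk_thread that owns the data structure.
 * g_owner_thread and g_counter are illustrative placeholders. */
#include "spdk/stdinc.h"
#include "spdk/thread.h"

static struct spdk_thread *g_owner_thread; /* thread that owns g_counter */
static uint64_t g_counter;                 /* only ever touched on g_owner_thread */

static void
increment_on_owner(void *ctx)
{
    /* Runs on g_owner_thread, so no lock is needed. */
    g_counter += (uintptr_t)ctx;
}

static void
increment_from_any_thread(uint64_t delta)
{
    if (spdk_get_thread() == g_owner_thread) {
        /* Already on the owning thread: apply the update directly. */
        g_counter += delta;
    } else {
        /* Cross-thread update: send a message instead of locking. */
        spdk_thread_send_msg(g_owner_thread, increment_on_owner,
                             (void *)(uintptr_t)delta);
    }
}

The same pattern reappears later in this article: bdev_rbd_submit_request forwards each I/O to the thread that owns the RBD image context.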
[Figure: SPDK architecture]

Ceph is a widely deployed distributed storage system that provides block, object, and file storage services, and SPDK has long supported attaching Ceph RBD as a block storage backend. While running performance tests against RBD with SPDK, we found that performance stopped improving once it reached a certain level, limiting the overall performance of the product. Using several debugging techniques together with on-site observation and code analysis, we eventually identified and resolved the root cause. The process was as follows.

1. Test method: start SPDK nvmf_tgt bound to cores 0-7 (./build/bin/nvmf_tgt -m 0xff); create 8 rbd bdevs and 8 NVMe-oF subsystems, attach each rbd bdev to a subsystem as a namespace, and start the listeners. On the initiator side, connect to each subsystem over NVMe over RDMA, which produces 8 NVMe bdevs, and run fio against all 8 NVMe bdevs simultaneously.
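For reference, the target-side configuration corresponds roughly to the RPC sequence below. This is a sketch only: the pool name rbd, image names image0..image7, NQNs, IP address, and serial number are placeholders, and only one of the eight bdev/subsystem pairs is written out.

# Target side (nvmf_tgt is already running with -m 0xff).
spdk_rpc.py nvmf_create_transport -t RDMA
# Repeat the next four commands for i = 0..7; only i = 0 is shown.
spdk_rpc.py bdev_rbd_create -b Rbd0 rbd image0 4096
spdk_rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode0 -a -s SPDK00000000000001
spdk_rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode0 Rbd0
spdk_rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode0 -t rdma -a 192.168.0.10 -s 4420

# Initiator side (on the initiator's SPDK instance): attach each subsystem over
# NVMe/RDMA to produce one NVMe bdev per subsystem, then run fio against them.
spdk_rpc.py bdev_nvme_attach_controller -b Nvme0 -t rdma -f ipv4 -a 192.168.0.10 -s 4420 -n nqn.2016-06.io.spdk:cnode0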
2. Problem: we built a 48-OSD all-flash Ceph cluster capable of roughly 400K IOPS, but the test topped out at about 200K IOPS and would not go any higher, no matter how many disks we added or which other parameters we tuned.

3. Analysis: spdk_top showed core 0 to be noticeably busier than the other cores, and as we increased the load core 0 became even busier while the other cores barely changed.

Checking the pollers shows that rbd has only a single poller, bdev_rbd_group_poll, and that it runs on the thread with id 2 together with nvmf_tgt_poll_group_0. Since nvmf_tgt_poll_group_0 runs on core 0, bdev_rbd_group_poll also runs on core 0.

[root@test]# spdk_rpc.py thread_get_pollers
{
  "tick_rate": 2300000000,
  "threads": [
    {
      "timed_pollers": [
        { "period_ticks": 23000000, "run_count": 77622, "busy_count": 0, "state": "waiting", "name": "nvmf_tgt_accept" },
        { "period_ticks": 9200000, "run_count": 194034, "busy_count": 194034, "state": "waiting", "name": "rpc_subsystem_poll" }
      ],
      "active_pollers": [],
      "paused_pollers": [],
      "id": 1,
      "name": "app_thread"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5919074761, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" },
        { "run_count": 40969661, "busy_count": 0, "state": "waiting", "name": "bdev_rbd_group_poll" }
      ],
      "paused_pollers": [],
      "id": 2,
      "name": "nvmf_tgt_poll_group_0"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5937329587, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 3,
      "name": "nvmf_tgt_poll_group_1"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5927158562, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 4,
      "name": "nvmf_tgt_poll_group_2"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5971529095, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 5,
      "name": "nvmf_tgt_poll_group_3"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5923260338, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 6,
      "name": "nvmf_tgt_poll_group_4"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5968032945, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 7,
      "name": "nvmf_tgt_poll_group_5"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5931553507, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 8,
      "name": "nvmf_tgt_poll_group_6"
    },
    {
      "timed_pollers": [],
      "active_pollers": [
        { "run_count": 5058745767, "busy_count": 0, "state": "waiting", "name": "nvmf_poll_group_poll" }
      ],
      "paused_pollers": [],
      "id": 9,
      "name": "nvmf_tgt_poll_group_7"
    }
  ]
}

Combining this with code analysis: when the rbd module is loaded, it registers bdev_rbd_create_cb, its io_channel creation callback, with the bdev layer. When an rbd bdev is created, bdev_examine runs on it by default, which creates an io_channel once and then destroys it. When the rbd bdev is attached to an NVMe-oF subsystem, the io_channel creation callback is invoked again; because nvmf_tgt has 8 threads, it is invoked 8 times, but disk->main_td is always the thread of the first caller, namely nvmf_tgt_poll_group_0. When an I/O reaches the rbd module, bdev_rbd_submit_request schedules the I/O context onto disk->main_td, so the I/O handling thread of every rbd bdev ends up running on core 0.
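The pattern can be summarized with the following simplified sketch. It is paraphrased from the behavior described above, not the verbatim SPDK source, and the struct layouts are trimmed to the fields that matter here.

/* Simplified illustration (not the verbatim SPDK source) of why all RBD I/O
 * lands on one thread: the first io_channel creator becomes main_td, and
 * every later submission is forwarded to it. */
#include "spdk/stdinc.h"
#include "spdk/thread.h"
#include "spdk/bdev_module.h"

struct bdev_rbd {                      /* trimmed-down stand-in */
    struct spdk_thread *main_td;       /* thread that owns the rados/rbd handles */
};

static void
_bdev_rbd_submit_request(void *arg)
{
    struct spdk_bdev_io *bdev_io = arg;
    /* ... the actual librbd submission happens here, on disk->main_td ... */
    (void)bdev_io;
}

/* Channel-create callback: invoked once per spdk_thread that obtains an
 * io_channel for this bdev (8 times in our test, once per poll group). */
static int
bdev_rbd_create_cb(void *io_device, void *ctx_buf)
{
    struct bdev_rbd *disk = io_device;

    (void)ctx_buf;
    if (disk->main_td == NULL) {
        /* Only the FIRST caller wins; in our test that is
         * nvmf_tgt_poll_group_0, which is pinned to core 0. */
        disk->main_td = spdk_get_thread();
    }
    return 0;
}

/* Every I/O, regardless of which poll group received it, is forwarded to
 * disk->main_td, so all librbd work for this bdev runs on that one thread. */
static void
bdev_rbd_submit_request(struct spdk_io_channel *ch, struct spdk_bdev_io *bdev_io)
{
    struct bdev_rbd *disk = bdev_io->bdev->ctxt;
    struct spdk_thread *submit_td = spdk_io_channel_get_thread(ch);

    if (submit_td != disk->main_td) {
        spdk_thread_send_msg(disk->main_td, _bdev_rbd_submit_request, bdev_io);
    } else {
        _bdev_rbd_submit_request(bdev_io);
    }
}

Because the first io_channel for every rbd bdev in this setup is created from the same thread, all eight bdevs end up with the same main_td, which matches the single busy core reported by spdk_top.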
...