关于apache:博文推荐｜使用-Pulsar-IO-打造流数据管道

本文翻译自 StreamNative 博客。博客原作者：Ioannis Polyzos，StreamNative 解决方案工程师。原文链接：https://streamnative.io/blog/…

构建古代数据基础设施始终是当今企业的难题。当今的企业须要治理全天候生成和交付的大量异构数据。然而，因为企业对数据的数量和速度等等有多种要求，没有“一刀切”的解决方案。相同，企业需在不同零碎之间挪动数据，以便存储、解决和提供数据。

粗看搭建基础设施的历史，企业应用了许多不同的工具来尝试挪动数据，例如用于流式工作负载的 Apache Kafka 和用于音讯工作负载的 RabbitMQ。当初，Apache Pulsar 的诞生为企业简化了这个过程。

Apache Pulsar 是一个云原生的分布式音讯流平台。Pulsar 旨在满足古代数据需要，反对灵便的消息传递语义、分层存储、多租户和异地复制（跨区域数据复制）。自 2018 年毕业成为 Apache 软件基金会顶级我的项目以来，Pulsar 我的项目经验了疾速的社区增长、周边生态的倒退和寰球用户的增长。将 Pulsar 用作数据基础设施的支柱，公司可能以疾速且可扩大的形式挪动数据。在这篇博文中，咱们将介绍如何应用 Pulsar IO 在 Pulsar 和内部零碎之间轻松导入和导出数据。

Pulsar IO 是一个残缺的工具包，用于创立、部署和治理与内部零碎（如键 / 值存储、分布式文件系统、搜寻索引、数据库、数据仓库、其余消息传递零碎等）集成的 Pulsar 连接器。因为 Pulsar IO 构建在 Pulsar 的无服务器计算层（称为 Pulsar Function）之上，因而编写 Pulsar IO 连接器就像编写 Pulsar Function 一样简略。

借助 Pulsar IO，用户能够应用现有的 Pulsar 连接器或编写本人的自定义连接器，轻松地将数据移入和移出 Pulsar。Pulsar IO 领有以下劣势：

多样的连接器：以后 Pulsar 生态中有许多现有的 Pulsar IO 连接器用于内部零碎，例如 Apache Kafka、Cassandra 和 Aerospike。应用这些连接器有助于缩短生产工夫，因为创立集成所需的所有部件都已就位。开发人员只须要提供配置（如连贯 url 和凭据）来运行连接器。
托管运行时：Pulsar IO 带有托管运行时，负责执行、调度、扩大和容错。开发人员能够专一于配置和业务逻辑。
多接口：通过 Pulsar IO 提供的接口，用户能够缩小用于生成和应用应用程序的样板代码。
高扩展性：在须要更多实例来解决传入流量的场景下，用户能够通过更改一个简略的配置值轻松横向扩大；如果用户应用 Kubernetes 运行时，可依据流量需要进行弹性扩大。
充分利用 schema：Pulsar IO 通过在数据模型上指定 schema 类型来帮忙用户充分运用 schema，Pulsar IO 反对 JSON、Avro 和 Protobufs 等 schema 类型。

因为 Pulsar IO 建设在 Pulsar Function 之上，因而 Pulsar IO 和 Pulsar Function 具备雷同的运行时选项。部署 Pulsar IO 连接器时，用户有以下抉择：

线程：在与工作线程雷同的 JVM 中运行。（通常用于测试的和本地运行，不举荐用于生产部署。）
过程：在不同的过程中运行，用户能够应用多个工作线程跨多个节点横向扩大。
Kubernetes：在 Kubernetes 集群中作为 Pod 运行，worker 与 Kubernetes 协调。这种运行时形式保障用户能够充分利用 Kubernetes 这样的云原生环境提供的劣势，比方轻松横向扩大。云原生环境提供的劣势，比方轻松横向扩大。

如前所述，Pulsar IO 缩小了生成和生产应用程序所需的样板代码。它通过提供不同的根本接口来实现这一点，这些接口形象出样板代码并容许咱们专一于业务逻辑。

Pulsar IO 反对 Source 和 Sink 的根本接口。Source 连接器（Source connector）容许用户将数据从内部零碎带入 Pulsar，而 Sink 连接器（Sink Connector）可用于将数据移出 Pulsar 并移入内部零碎，例如数据库。

还有一种非凡类型的 Source 连接器，称为 Push Source。Push Source 连接器能够轻松实现某些须要推送数据的集成。举例来说，Push Source 能够是变更数据捕捉源零碎，它在接管到新变更后，会主动将该变更推送到 Pulsar。

public interface Source<T> extends AutoCloseable {
 
    /**
     * Open connector with configuration.
     *
     * @param config initialization config
     * @param sourceContext environment where the source connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SourceContext sourceContext) throws Exception;
 
    /**
     * Reads the next message from source.
     * If source does not have any new messages, this call should block.
     * @return next message from source.  The return result should never be null
     * @throws Exception
     */
    Record<T> read() throws Exception;}

public interface BatchSource<T> extends AutoCloseable {
 
    /**
     * Open connector with configuration.
     *
     * @param config config that's supplied for source
     * @param context environment where the source connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SourceContext context) throws Exception;
 
    /**
     * Discovery phase of a connector.  This phase will only be run on one instance, i.e. instance 0, of the connector.
     * Implementations use the taskEater consumer to output serialized representation of tasks as they are discovered.
     *
     * @param taskEater function to notify the framework about the new task received.
     * @throws Exception during discover
     */
    void discover(Consumer<byte[]> taskEater) throws Exception;
 
    /**
     * Called when a new task appears for this connector instance.
     *
     * @param task the serialized representation of the task
     */
    void prepare(byte[] task) throws Exception;
 
    /**
     * Read data and return a record
     * Return null if no more records are present for this task
     * @return a record
     */
    Record<T> readNext() throws Exception;}

public interface Sink<T> extends AutoCloseable {
    /**
     * Open connector with configuration.
     *
     * @param config initialization config
     * @param sinkContext environment where the sink connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SinkContext sinkContext) throws Exception;
 
    /**
     * Write a message to Sink.
     *
     * @param record record to write to sink
     * @throws Exception
     */
    void write(Record<T> record) throws Exception;
}

Apache Pulsar 可能作为古代数据基础设施的支柱，它使企业可能以疾速且可扩大的形式搬运数据。Pulsar IO 是一个连接器框架，它为开发人员提供了所有必要的工具来创立、部署和治理与不同系统集成的 Pulsar 连接器。Pulsar IO 形象掉所有样板代码，使开发人员能够专一于利用程序逻辑。

如果您有趣味理解更多信息并构建本人的连接器，请查看以下资源：

查看 Pulsar 周边生态中所有 Pulsar IO 连接器
构建和部署 Source 连接器
为 Pulsar IO 编写自定义 Sink 连接器
监控和故障排除连接器

宋博，就任于北京百观科技有限公司，高级开发工程师，专一于微服务，云计算，大数据畛域。

退出 Apache Pulsar 中文交换群 👇🏻

点击链接，查看 Apache Pulsar 干货集锦

关于apache:博文推荐｜使用-Pulsar-IO-打造流数据管道

背景

1. Pulsar IO 简介

2. Pulsar IO 运行时

3. Pulsar IO 接口

Source 接口

Push Source 接口

Sink 接口

4. 总结

5. 延长浏览

译者简介