关于flink:Flink中基于State的有状态计算开发方法

作者：王东阳

前言
状态在Flink中叫作State，用来保留两头计算结果或者缓存数据。依据是否须要保留两头后果，分为无状态计算和有状态计算。对于流计算而言，事件继续一直地产生，如果每次计算都是互相独立的，不依赖于上下游的事件，则是无状态计算。如果计算须要依赖于之前或者后续的事件，则是有状态计算。

在Flink中依据数据集是否依据Key进行分区，将状态分为Keyed State和Operator State（Non-keyed State）两种类型，本文次要介绍Keyed State，后续文章会介绍Operator State。

Keyed State 示意和key相干的一种State，只能用于KeydStream类型数据集对应的Functions和Operators之上。Keyed State是Operator State的特例，区别在于Keyed State当时依照key对数据集进行了分区，每个Key State仅对应一个Operator和Key的组合。Keyed State能够通过Key Groups进行治理，次要用于当算子并行度发生变化时，主动从新散布 Keyed State数据。在零碎运行过程中，一个Keyed算子实例可能运行一个或者多个Key Groups的keys。

依照数据结构的不同，Flink中定义了多种Keyed State，具体如下:

ValueState<T> 即类型为T的单值状态。这个状态与对应的Key绑定，是最简略的状态。
ListState<T>即Key上的状态值为一个列表。
MapState<UK,UV> 定义与Key对应键值对的状态，用于保护具备key-value构造类型的状态数据。
ReducingState<T>这种State通过用户传入的ReduceFucntion，每次调用add(T)办法增加元素时，会调用ReduceFucntion，最初合并到一个繁多的状态值。
AggregatingState<IN,OUT> 聚合State和ReducingState<T>十分相似，不同的是，这里聚合的类型能够是不同的元素类型，应用add（IN）来退出元素，并应用AggregateFunction函数计算聚合后果。
State开发实战
本章节将通过理论的我的项目代码演示不同数据结构在不同计算场景下的开发方法。

本文中我的项目 pom文件
ValueState
ValueState<T> 即类型为T的单值状态。这个状态与对应的Key绑定，是最简略的状态。能够通过update(T)办法更新状态值，通过T value()办法获取状态值。

因为ValueState<T> 是与Key对应单个值的状态，利用场景能够是例如统计user_id对应的交易次数，每次用户交易都会在count状态值上进行更新。

接下来咱们将通过一个具体的实例代码展现如何利用ValueState去统计各个用户的交易次数。

ValueStateCountUser实现
通过定义ValueState<Integer>类型的countState存储用户曾经拜访的次数，并在每次收到新的元素时，对countState中的数据加1更新，同时把以后用户名和曾经拜访的总次数返回。在本样例代码中将会通过 RichMapFunction 来对流元素进行解决。

在其中的 open(Configuration parameters) 办法中对 countState进行初始化。Flink提供了RuntimeContext用于获取状态数据，同时RuntimeContext提供了罕用的Managed Keyd State的获取形式，能够通过创立相应的StateDescriptor并调用RuntimeContext办法来获取状态数据。例如获取ValueState能够调用ValueState[T] getState(ValueStateDescriptor[T])办法。
在 Tuple2<String, Integer> map(Tuple2<String, String> s)中，通过 countState.value() 获取用户之前的拜访次数，而后对countState中的数据加1更新，同时把以后用户名和曾经拜访的总次数返回。
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class ValueStateCountUser extends RichMapFunction<Tuple2<String, String>, Tuple2<String, Integer>> {
private ValueState<Integer> countState;

public ValueStateCountUser() {}

@Override
public void open(Configuration parameters) throws Exception {

// super.open(parameters);countState = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("count", Integer.class));

}

@Override
public Tuple2<String, Integer> map(Tuple2<String, String> s) throws Exception {

Integer count = countState.value();if (count == null) {  countState.update(1);  return Tuple2.of(s.f0, 1);}countState.update(count + 1);return Tuple2.of(s.f0, count + 1);

}

@Override
public void close() throws Exception {

// super.close();System.out.println(String.format("finally %d", countState.value()));

}
}
代码地址：ValueStateCountUser.java

验证代码
编写验证代码如下

private static void valueStateUserCount() throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();env.setParallelism(1);DataStream<Tuple2<String, String>> inputStream = env.fromElements(    new Tuple2<>("zhangsan","aa"),    new Tuple2<>("lisi","aa"),    new Tuple2<>("zhangsan","aa"),    new Tuple2<>("lisi","aa"),    new Tuple2<>("wangwu","aa"));inputStream.keyBy(new KeySelector<Tuple2<String, String>, String>() {      @Override      public String getKey(Tuple2<String, String> integerLongTuple2) throws Exception {        return integerLongTuple2.f0;      }    })    .map(new ValueStateCountUser())    .print();env.execute("StateCompute");

}
代码地址：valueStateUserCount

程序运行失去输入如下

(zhangsan,1)
(lisi,1)
(zhangsan,2)
(lisi,2)
(wangwu,1)
finally 1
能够看到对于流数据中的每一个元素，依据用户名进行统计，实时显示该用户曾经拜访了多少次。

ListState
ListState<T>即Key上的状态值为一个列表。能够通过add(T)办法或者addAll(List[T])往列表中附加值；也能够通过Iterable get()办法返回一个Iterable<T>来遍历状态值；应用update(List[T])来更新元素。

因为ListState保留的是与Key对应元素列表的状态，状态中寄存元素的List列表，利用场景能够是例如定义ListState存储用户常常拜访的IP地址。

接下来咱们将通过一个具体的实例代码展现如何利用ListState去统计各个用户拜访过的IP地址列表。为了更宽泛的展现State的利用场景，本样例代码中将会基于KeyedProcessFunction来解决流元素。

ListStateUserIP 实现
申明ListState<String> ipState 用于记录用户拜访过的ip地址
在open(Configuration parameters)中利用 getRuntimeContext().getListState 对 ipState 进行初始化，getListState须要传入ListStateDescriptor类型的参数。
在 processElement中，将以后记录中的IP地址查究到 ipState中，而后从ipState 获取曾经拜访过的所有IP地址，和用户名一起放入到Collector
import com.google.common.collect.Lists;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ListStateUserIpKeyedProcessFunction
extends KeyedProcessFunction<String, Tuple2<String,String>, Tuple2<String, List<String>>>{
private ListState<String> ipState;

@Override
public void open(Configuration parameters) throws Exception {

// super.open(parameters);ipState = getRuntimeContext().getListState(new ListStateDescriptor<String>("ipstate", String.class));

}

@Override
public void processElement(

  Tuple2<String, String> value,  KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, List<String>>>.Context context,  Collector<Tuple2<String, List<String>>> collector) throws Exception {ipState.add(value.f1);List<String> ips = Lists.newArrayList(ipState.get());collector.collect(Tuple2.of(value.f0, ips));

}
}
代码地址：ListStateUserIpKeyedProcessFunction.java

验证代码
编写验证程序如下

private static void listStateUserIP() throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();env.setParallelism(1);DataStream<Tuple2<String, String>> inputStream = env.fromElements(    new Tuple2<>("zhangsan","192.168.0.1"),    new Tuple2<>("lisi","192.168.0.1"),    new Tuple2<>("zhangsan","192.168.0.2"),    new Tuple2<>("lisi","192.168.0.3"),    new Tuple2<>("wangwu","192.168.0.1"));inputStream.keyBy(new KeySelector<Tuple2<String, String>, String>() {      @Override      public String getKey(Tuple2<String, String> integerLongTuple2) throws Exception {        return integerLongTuple2.f0;      }    })    .process(new ListStateUserIpKeyedProcessFunction())    .print();env.execute("listStateUserIP");

}
}
代码地址：listStateUserIP

执行后果打印如下

(zhangsan,[192.168.0.1])
(lisi,[192.168.0.1])
(zhangsan,[192.168.0.1, 192.168.0.2])
(lisi,[192.168.0.1, 192.168.0.3])
(wangwu,[192.168.0.1])
达到预期成果

MapState
MapState<UK,UV> 定义与Key对应键值对的状态，用于保护具备key-value构造类型的状态数据。 MapState<UK,UV> 使用 Map 存储 Key-Value 对， MapState的办法和Java的Map的办法极为类似，所以上手绝对容易。罕用的有如下：

get()办法获取值
put()，putAll()办法更新值
remove()删除某个key
contains()判断是否存在某个key
isEmpty() 判断是否为空
和HashMap接口类似，MapState也能够通过entries()、keys()、values()获取对应的keys或values的汇合。

接下来咱们通过一个具体的实例介绍如何通过ReducingState<T> 计算每个人的以后最高和最低体温。

实现KeyedProcessFunction
定义 MapState<String, Integer> minMaxState 用于记录每个用户的最低高温和最高体温。
在 open办法中通过getRuntimeContext().getMapState初始化minMaxState，getMapState 须要传入MapStateDescriptor作为参数，构建MapStateDescriptor的第二个参数标识Map中Key的类型，第三个参数标识Map中Value的类型
在processElement办法中，将输出的流元素也就是用户以后的体温记录和minMaxState中的最低温和最低温进行比照，并更新到minMaxState中。
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MapStateMinMaxTempKeyedProcessFunction

extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>> {

private MapState<String, Integer> minMaxState;

private final String minKey = "min";
private final String maxKey = "max";

@Override
public void open(Configuration parameters) throws Exception {

// super.open(parameters);minMaxState = getRuntimeContext().getMapState(    new MapStateDescriptor<String, Integer>("minMaxState",        String.class,        Integer.class));

}

@Override
public void processElement(

  Tuple2<String, Integer> in,  KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>>.Context context,  Collector<Tuple3<String, Integer, Integer>> collector) throws Exception {if (!minMaxState.contains(minKey)) {  minMaxState.put(minKey, in.f1);}if (!minMaxState.contains(maxKey)) {  minMaxState.put(maxKey, in.f1);}if (in.f1 > minMaxState.get(maxKey)) {  minMaxState.put(maxKey, in.f1);}if (in.f1 < minMaxState.get(minKey)) {  minMaxState.put(minKey, in.f1);}collector.collect(Tuple3.of(in.f0, minMaxState.get(minKey), minMaxState.get(maxKey)));

}
}
代码地址：MapStateMinMaxTempKeyedProcessFunction.java

验证代码
编写验证代码如下：

private static void mapStateMinMaxUserTemperature() throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();env.setParallelism(1);DataStream<Tuple2<String, Integer>> inputStream = env.fromElements(    new Tuple2<>("zhangsan",36),    new Tuple2<>("lisi",37),    new Tuple2<>("zhangsan",35),    new Tuple2<>("lisi",38),    new Tuple2<>("wangwu",37));inputStream.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {      @Override      public String getKey(Tuple2<String, Integer> integerLongTuple2) throws Exception {        return integerLongTuple2.f0;      }    })    .process(new MapStateMinMaxTempKeyedProcessFunction())    .print();env.execute("mapStateMinMaxUserTemperature");

}
代码地址：mapStateMinMaxUserTemperature

程序运行输入

(zhangsan,36,36)
(lisi,37,37)
(zhangsan,35,36)
(lisi,37,38)
(wangwu,37,37)
能够看到对于每一条用户记录，正确计算输入了以后的最低和最高高温，达到了预期的计算需要。

ReducingState
ReducingState<T>这种State通过用户传入的ReduceFucntion，每次调用add(T)办法增加元素时，会调用ReduceFucntion，最初合并到一个繁多的状态值，因而，ReducingState须要指定ReduceFucntion实现状态数据的聚合。

ReducingState获取元素应用T get()办法。

接下来咱们通过一个具体的实例介绍如何通过ReducingState<T> 计算每个人的以后最高体温。

定义ReduceFunction
实现 reduce办法，求以后最大值

new ReduceFunction<Integer>() {

      @Override      public Integer reduce(Integer integer, Integer t1) throws Exception {        return integer > t1 ? integer : t1;      }    }

实现KeyedProcessFunction
申明ReducingState<Integer> tempState 用于记录用户以后的最高体温；
在open(Configuration parameters)中利用 getRuntimeContext().getReducingState 对 tempState进行初始化，getReducingState须要传入ReducingStateDescriptor，ReducingStateDescriptor构建的时候，第二个参数要参入ReduceFunction;
在 processElement中，将以后用户的以后体温传入到 tempState中(tempState在外部会调用ReduceFunction), 而后从tempState 获取用户目前为止的最高发问，和用户名一起放入到Collector
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ReducingStateKeyedProcessFunction
extends KeyedProcessFunction<String, Tuple2<String, Integer> , Tuple2<String, Integer>> {

private ReducingState<Integer> tempState;

@Override
public void open(Configuration parameters) throws Exception {

// super.open(parameters);tempState = getRuntimeContext().getReducingState(new ReducingStateDescriptor<Integer>("reduce",    new ReduceFunction<Integer>() {      @Override      public Integer reduce(Integer integer, Integer t1) throws Exception {        return integer > t1 ? integer : t1;      }    }, Integer.class));

}

@Override
public void processElement(

  Tuple2<String, Integer> stringIntegerTuple2,  KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Integer>>.Context context,  Collector<Tuple2<String, Integer>> collector) throws Exception {tempState.add(stringIntegerTuple2.f1);collector.collect(Tuple2.of(stringIntegerTuple2.f0, tempState.get()));

}
}
代码地址：ReducingStateKeyedProcessFunction.java

验证代码
private static void reducingStateUserTemperature() throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();env.setParallelism(1);DataStream<Tuple2<String, Integer>> inputStream = env.fromElements(    new Tuple2<>("zhangsan",36),    new Tuple2<>("lisi",37),    new Tuple2<>("zhangsan",35),    new Tuple2<>("lisi",38),    new Tuple2<>("wangwu",37));inputStream.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {      @Override      public String getKey(Tuple2<String, Integer> integerLongTuple2) throws Exception {        return integerLongTuple2.f0;      }    })    .process(new ReducingStateKeyedProcessFunction())    .print();env.execute("reducingStateUserTemperature");

}
代码地址：reducingStateUserTemperature

程序运行后果如下

(zhangsan,36)
(lisi,37)
(zhangsan,36)
(lisi,38)
(wangwu,37)
达到预期成果

AggregatingState
AggregatingState<IN,OUT> 聚合State和ReducingState<T>十分相似，不同的是，这里聚合的类型能够是不同的元素类型，应用add（IN）来退出元素，并应用AggregateFunction函数计算聚合后果。

AggregatingState<IN,OUT>这种State通过用户传入的AggregateFunction，每次调用add(IN)办法增加元素时，会调用AggregateFunction，最初合并到一个繁多的状态值，因而，ReducingState须要指定AggregateFunction实现状态数据的聚合。

定义 AggregateFunction
AggregateFunction<IN, ACC, OUT> 中的三个类型参数，别离对应
IN: 流元素的类型;
ACC: AggregateFunction外部累加器的类型
OUT: 最终输入的聚合后果的类型
AggregateFunction<IN, ACC, OUT> 须要重写 createAccumulator, add, getResult, merge这四个函数
createAccumulator 用于创立累加器Accumulator
add 用于将流元素退出到累加器Accumulator
getResult 用于从累加器Accumulator中计算获取聚合后果
merge 目前临时用不到
代码如下

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

public class AvgTempAggregateFunction

implements AggregateFunction<Tuple2<String, Integer>, Tuple2<Integer, Integer>, Double> {

@Override
public Tuple2<Integer, Integer> createAccumulator() {

return Tuple2.of(0, 0);

}

@Override
public Tuple2<Integer, Integer> add(

  Tuple2<String, Integer> in, Tuple2<Integer, Integer> acc) {  // in 流元素， acc 外部累加器acc.f0 = acc.f0 + in.f1;acc.f1 = acc.f1 + 1;return acc;

}

@Override
public Double getResult(Tuple2<Integer, Integer> acc) {

return new Double(acc.f0)/ acc.f1;

}

@Override
public Tuple2<Integer, Integer> merge(

  Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) {return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);

}
}
代码地址：AvgTempAggregateFunction.java

实现 KeyedProcesssFunction
申明AggregatingState<Tuple2<String, Integer>, Double> avgState 用于记录用户以后的均匀体温。其中第一个类型参数标识流元素类型，第二个类型参数标识聚合后果的类型；
在open(Configuration parameters)中利用 getRuntimeContext().getAggregatingState 对 avgState进行初始化，getAggregatingState须要传入AggregatingStateDescriptor，AggregatingStateDescriptor构建的时候，第二个参数要参入AggregateFunction，第三个参数要传入AggregateFunction中累加器的TypeInformation；
在 processElement中，将以后用户的以后体温传入到 avgState中(avgState外部会调用add)，而后从avgState 获取用户体温的聚合后果(avgState外部调用getResult )，和用户名一起放入到Collector
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class AggregatingStateKeyedProcessFunction
extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Double>> {
private AggregatingState<Tuple2<String, Integer>, Double> avgState;

@Override
public void open(Configuration parameters) throws Exception {

// super.open(parameters);avgState = getRuntimeContext().getAggregatingState(new AggregatingStateDescriptor<Tuple2<String, Integer>,    Tuple2<Integer, Integer>, Double>("aggregating", new AvgTempAggregateFunction(),    TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {})));

}

@Override
public void processElement(

  Tuple2<String, Integer> in,  KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Double>>.Context context,  Collector<Tuple2<String, Double>> collector) throws Exception {avgState.add(in);collector.collect(Tuple2.of(in.f0, avgState.get().doubleValue()));

}
}
代码地址：AggregatingStateKeyedProcessFunction.java

验证代码
private static void aggregatingStateAvgUserTemperature() throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();env.setParallelism(1);DataStream<Tuple2<String, Integer>> inputStream = env.fromElements(    new Tuple2<>("zhangsan",36),    new Tuple2<>("lisi",37),    new Tuple2<>("zhangsan",35),    new Tuple2<>("lisi",38),    new Tuple2<>("wangwu",37));inputStream.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {      @Override      public String getKey(Tuple2<String, Integer> integerLongTuple2) throws Exception {        return integerLongTuple2.f0;      }    })    .process(new AggregatingStateKeyedProcessFunction())    .print();env.execute("aggregatingStateAvgUserTemperature");

}
代码地址：aggregatingStateAvgUserTemperature，程序运行后果如下

(zhangsan,36.0)
(lisi,37.0)
(zhangsan,35.5)
(lisi,37.5)
(wangwu,37.0)
能够看到对于每一条流元素，输入了用户以及以后的均匀体温，实现了预期性能需要。

总结
本文通过典型的场景用例，展现了不同数据结构的State在不同计算场景下的应用办法，帮忙Flink开发者相熟State相干API以及State在相应的计算场景中如何存储、查问相干的两头计算信息从而实现计算目标。

参考资料
《Flink原理、实战与性能优化》5.1 有状态计算
《Flink内核原理与实现》第7章状态原理
Flink教程(17) Keyed State状态治理之AggregatingState应用案例求平均值 https://blog.csdn.net/winterk...
公布于 2022-03-11 09:57