[case44]聊聊storm trident的operations

51次阅读

共计 16795 个字符,预计需要花费 42 分钟才能阅读完成。


本文主要研究一下 storm trident 的 operations
function filter projection
Function
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Function.java
public interface Function extends EachOperation {
/**
* Performs the function logic on an individual tuple and emits 0 or more tuples.
*
* @param tuple The incoming tuple
* @param collector A collector instance that can be used to emit tuples
*/
void execute(TridentTuple tuple, TridentCollector collector);
}
Function 定义了 execute 方法,它发射的字段会追加到 input tuple 中
Filter
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Filter.java
public interface Filter extends EachOperation {

/**
* Determines if a tuple should be filtered out of a stream
*
* @param tuple the tuple being evaluated
* @return `false` to drop the tuple, `true` to keep the tuple
*/
boolean isKeep(TridentTuple tuple);
}
Filter 提供一个 isKeep 方法,用来决定该 tuple 是否输出
projection
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* Filters out fields from a stream, resulting in a Stream containing only the fields specified by `keepFields`.
*
* For example, if you had a Stream `mystream` containing the fields `[“a”, “b”, “c”,”d”]`, calling”
*
* “`java
* mystream.project(new Fields(“b”, “d”))
* “`
*
* would produce a stream containing only the fields `[“b”, “d”]`.
*
*
* @param keepFields The fields in the Stream to keep
* @return
*/
public Stream project(Fields keepFields) {
projectionValidation(keepFields);
return _topology.addSourcedNode(this, new ProcessorNode(_topology.getUniqueStreamId(), _name, keepFields, new Fields(), new ProjectedProcessor(keepFields)));
}
这里使用了 ProjectedProcessor 来进行 projection 操作
repartitioning operations
partition
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* @param partitioner
* @return
*/
public Stream partition(CustomStreamGrouping partitioner) {
return partition(Grouping.custom_serialized(Utils.javaSerialize(partitioner)));
}
这里使用了 CustomStreamGrouping
partitionBy
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* @param fields
* @return
*/
public Stream partitionBy(Fields fields) {
projectionValidation(fields);
return partition(Grouping.fields(fields.toList()));
}
这里使用 Grouping.fields
identityPartition
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* @return
*/
public Stream identityPartition() {
return partition(new IdentityGrouping());
}
这里使用 IdentityGrouping
shuffle
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* Use random round robin algorithm to evenly redistribute tuples across all target partitions
*
* @return
*/
public Stream shuffle() {
return partition(Grouping.shuffle(new NullStruct()));
}
这里使用 Grouping.shuffle
localOrShuffle
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* Use random round robin algorithm to evenly redistribute tuples across all target partitions, with a preference
* for local tasks.
*
* @return
*/
public Stream localOrShuffle() {
return partition(Grouping.local_or_shuffle(new NullStruct()));
}
这里使用 Grouping.local_or_shuffle
global
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* All tuples are sent to the same partition. The same partition is chosen for all batches in the stream.
* @return
*/
public Stream global() {
// use this instead of storm’s built in one so that we can specify a singleemitbatchtopartition
// without knowledge of storm’s internals
return partition(new GlobalGrouping());
}
这里使用 GlobalGrouping
batchGlobal
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* All tuples in the batch are sent to the same partition. Different batches in the stream may go to different
* partitions.
*
* @return
*/
public Stream batchGlobal() {
// the first field is the batch id
return partition(new IndexHashGrouping(0));
}
这里使用 IndexHashGrouping,是对整个 batch 维度的 repartition
broadcast
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Repartitioning Operation
*
* Every tuple is replicated to all target partitions. This can useful during DRPC – for example, if you need to do
* a stateQuery on every partition of data.
*
* @return
*/
public Stream broadcast() {
return partition(Grouping.all(new NullStruct()));
}
这里使用 Grouping.all
groupBy
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
/**
* ## Grouping Operation
*
* @param fields
* @return
*/
public GroupedStream groupBy(Fields fields) {
projectionValidation(fields);
return new GroupedStream(this, fields);
}
这里返回的是 GroupedStream
aggregators
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java
//partition aggregate
public Stream partitionAggregate(Aggregator agg, Fields functionFields) {
return partitionAggregate(null, agg, functionFields);
}

public Stream partitionAggregate(CombinerAggregator agg, Fields functionFields) {
return partitionAggregate(null, agg, functionFields);
}

public Stream partitionAggregate(Fields inputFields, CombinerAggregator agg, Fields functionFields) {
projectionValidation(inputFields);
return chainedAgg()
.partitionAggregate(inputFields, agg, functionFields)
.chainEnd();
}

public Stream partitionAggregate(ReducerAggregator agg, Fields functionFields) {
return partitionAggregate(null, agg, functionFields);
}

public Stream partitionAggregate(Fields inputFields, ReducerAggregator agg, Fields functionFields) {
projectionValidation(inputFields);
return chainedAgg()
.partitionAggregate(inputFields, agg, functionFields)
.chainEnd();
}

//aggregate
public Stream aggregate(Fields inputFields, Aggregator agg, Fields functionFields) {
projectionValidation(inputFields);
return chainedAgg()
.aggregate(inputFields, agg, functionFields)
.chainEnd();
}

public Stream aggregate(Fields inputFields, CombinerAggregator agg, Fields functionFields) {
projectionValidation(inputFields);
return chainedAgg()
.aggregate(inputFields, agg, functionFields)
.chainEnd();
}

public Stream aggregate(Fields inputFields, ReducerAggregator agg, Fields functionFields) {
projectionValidation(inputFields);
return chainedAgg()
.aggregate(inputFields, agg, functionFields)
.chainEnd();
}

//persistent aggregate
public TridentState persistentAggregate(StateFactory stateFactory, CombinerAggregator agg, Fields functionFields) {
return persistentAggregate(new StateSpec(stateFactory), agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, CombinerAggregator agg, Fields functionFields) {
return persistentAggregate(spec, null, agg, functionFields);
}

public TridentState persistentAggregate(StateFactory stateFactory, Fields inputFields, CombinerAggregator agg, Fields functionFields) {
return persistentAggregate(new StateSpec(stateFactory), inputFields, agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, Fields inputFields, CombinerAggregator agg, Fields functionFields) {
projectionValidation(inputFields);
// replaces normal aggregation here with a global grouping because it needs to be consistent across batches
return new ChainedAggregatorDeclarer(this, new GlobalAggScheme())
.aggregate(inputFields, agg, functionFields)
.chainEnd()
.partitionPersist(spec, functionFields, new CombinerAggStateUpdater(agg), functionFields);
}

public TridentState persistentAggregate(StateFactory stateFactory, ReducerAggregator agg, Fields functionFields) {
return persistentAggregate(new StateSpec(stateFactory), agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, ReducerAggregator agg, Fields functionFields) {
return persistentAggregate(spec, null, agg, functionFields);
}

public TridentState persistentAggregate(StateFactory stateFactory, Fields inputFields, ReducerAggregator agg, Fields functionFields) {
return persistentAggregate(new StateSpec(stateFactory), inputFields, agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, Fields inputFields, ReducerAggregator agg, Fields functionFields) {
projectionValidation(inputFields);
return global().partitionPersist(spec, inputFields, new ReducerAggStateUpdater(agg), functionFields);
}

trident 的 aggregators 主要分为三类,分别是 partitionAggregate、aggregate、persistentAggregate;aggregator 操作会改变输出
partitionAggregate 其作用的粒度为每个 partition,而非整个 batch
aggregrate 操作作用的粒度为 batch,对每个 batch,它先使用 global 操作将该 batch 的 tuple 从所有 partition 合并到一个 partition,最后再对 batch 进行 aggregation 操作;这里提供了三类参数,分别是 Aggregator、CombinerAggregator、ReducerAggregator;调用 stream.aggregrate 方法时,相当于一次 global aggregation,此时使用 Aggregator 或 ReducerAggregator 时,stream 会先将 tuple 划分到一个 partition,然后再进行 aggregate 操作;而使用 CombinerAggregator 时,trident 会进行优化,先对每个 partition 进行局部的 aggregate 操作,然后再划分到一个 partition,最后再进行 aggregate 操作,因而相对 Aggregator 或 ReducerAggregator 可以节省网络传输耗时
persistentAggregate 操作会对 stream 上所有 batch 的 tuple 进行 aggretation,然后将结果存储在 state 中

Aggregator
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Aggregator.java
public interface Aggregator<T> extends Operation {
T init(Object batchId, TridentCollector collector);
void aggregate(T val, TridentTuple tuple, TridentCollector collector);
void complete(T val, TridentCollector collector);
}

Aggregator 首先会调用 init 进行初始化,然后通过参数传递给 aggregate 以及 complete 方法
对于 batch partition 中的每个 tuple 执行一次 aggregate;当 batch partition 中的 tuple 执行完 aggregate 之后执行 complete 方法
假设自定义 Aggregator 为累加操作,那么对于 [4]、[7]、[8] 这批 tuple,init 为 0,对于[4],val=0,0+4=4;对于[7],val=4,4+7=11;对于[8],val=11,11+8=19;然后 batch 结束,val=19,此时执行 complete,可以使用 collector 发射数据

CombinerAggregator
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/CombinerAggregator.java
public interface CombinerAggregator<T> extends Serializable {
T init(TridentTuple tuple);
T combine(T val1, T val2);
T zero();
}

CombinerAggregator 每收到一个 tuple,就调用 init 获取当前 tuple 的值,调用 combine 操作使用前一个 combine 的结果 (没有的话取 zero 的值) 与 init 取得的值进行新的 combine 操作,如果该 partition 中没有 tuple,则返回 zero 方法的值
假设 combine 为累加操作,zero 返回 0,那么对于 [4]、[7]、[8] 这批 tuple,init 值分别是 4、7、8,对于[4],没有前一个 combine 结果,于是 val1=0,val2=4,combine 结果为 4;对于[7],val1=4,val2=7,combine 结果为 11;对于[8],val1 为 11,val2 为 8,combine 结果为 19
CombinerAggregator 操作的网络开销相对较低,因此性能比其他两类 aggratator 好

ReducerAggregator
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/ReducerAggregator.java
public interface ReducerAggregator<T> extends Serializable {
T init();
T reduce(T curr, TridentTuple tuple);
}

ReducerAggregator 在对一批 tuple 进行计算时,先调用一次 init 获取初始值,然后再执行 reduce 操作,curr 值为前一次 reduce 操作的值,没有的话,就是 init 值
假设 reduce 为累加操作,init 返回 0,那么对于 [4]、[7]、[8] 这批 tuple,对于[4],init 为 0,然后 curr=0,先是 0 +4=4;对于[7],curr 为 4,就是 4 +7=11;对于[8],curr 为 11,最后就是 11+8=19

topology stream operations
join
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java
public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields) {
return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields) {
return join(streams, joinFields, outFields, JoinType.INNER);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinType type) {
return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, type);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinType type) {
return join(streams, joinFields, outFields, repeat(streams.size(), type));
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, List<JoinType> mixed) {
return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mixed);

}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, List<JoinType> mixed) {
return join(streams, joinFields, outFields, mixed, JoinOutFieldsMode.COMPACT);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinOutFieldsMode mode) {
return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mode);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinOutFieldsMode mode) {
return join(streams, joinFields, outFields, JoinType.INNER, mode);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinType type, JoinOutFieldsMode mode) {
return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, type, mode);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinType type, JoinOutFieldsMode mode) {
return join(streams, joinFields, outFields, repeat(streams.size(), type), mode);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, List<JoinType> mixed, JoinOutFieldsMode mode) {
return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mixed, mode);

}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, List<JoinType> mixed, JoinOutFieldsMode mode) {
switch (mode) {
case COMPACT:
return multiReduce(strippedInputFields(streams, joinFields),
groupedStreams(streams, joinFields),
new JoinerMultiReducer(mixed, joinFields.get(0).size(), strippedInputFields(streams, joinFields)),
outFields);

case PRESERVE:
return multiReduce(strippedInputFields(streams, joinFields),
groupedStreams(streams, joinFields),
new PreservingFieldsOrderJoinerMultiReducer(mixed, joinFields.get(0).size(),
getAllOutputFields(streams), joinFields, strippedInputFields(streams, joinFields)),
outFields);

default:
throw new IllegalArgumentException(“Unsupported out-fields mode: ” + mode);
}
}
可以看到 join 最后调用了 multiReduce,对于 COMPACT 类型使用的 GroupedMultiReducer 是 JoinerMultiReducer,对于 PRESERVE 类型使用的 GroupedMultiReducer 是 PreservingFieldsOrderJoinerMultiReducer
merge
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java
public Stream merge(Fields outputFields, Stream… streams) {
return merge(outputFields, Arrays.asList(streams));
}

public Stream merge(Stream… streams) {
return merge(Arrays.asList(streams));
}

public Stream merge(List<Stream> streams) {
return merge(streams.get(0).getOutputFields(), streams);
}

public Stream merge(Fields outputFields, List<Stream> streams) {
return multiReduce(streams, new IdentityMultiReducer(), outputFields);
}
可以看到 merge 最后是调用了 multiReduce,使用的 MultiReducer 是 IdentityMultiReducer
multiReduce
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java
public Stream multiReduce(Stream s1, Stream s2, MultiReducer function, Fields outputFields) {
return multiReduce(Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(Fields inputFields1, Stream s1, Fields inputFields2, Stream s2, MultiReducer function, Fields outputFields) {
return multiReduce(Arrays.asList(inputFields1, inputFields2), Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(GroupedStream s1, GroupedStream s2, GroupedMultiReducer function, Fields outputFields) {
return multiReduce(Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(Fields inputFields1, GroupedStream s1, Fields inputFields2, GroupedStream s2, GroupedMultiReducer function, Fields outputFields) {
return multiReduce(Arrays.asList(inputFields1, inputFields2), Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(List<Stream> streams, MultiReducer function, Fields outputFields) {
return multiReduce(getAllOutputFields(streams), streams, function, outputFields);
}

public Stream multiReduce(List<GroupedStream> streams, GroupedMultiReducer function, Fields outputFields) {
return multiReduce(getAllOutputFields(streams), streams, function, outputFields);
}

public Stream multiReduce(List<Fields> inputFields, List<GroupedStream> groupedStreams, GroupedMultiReducer function, Fields outputFields) {
List<Fields> fullInputFields = new ArrayList<>();
List<Stream> streams = new ArrayList<>();
List<Fields> fullGroupFields = new ArrayList<>();
for(int i=0; i<groupedStreams.size(); i++) {
GroupedStream gs = groupedStreams.get(i);
Fields groupFields = gs.getGroupFields();
fullGroupFields.add(groupFields);
streams.add(gs.toStream().partitionBy(groupFields));
fullInputFields.add(TridentUtils.fieldsUnion(groupFields, inputFields.get(i)));

}
return multiReduce(fullInputFields, streams, new GroupedMultiReducerExecutor(function, fullGroupFields, inputFields), outputFields);
}

public Stream multiReduce(List<Fields> inputFields, List<Stream> streams, MultiReducer function, Fields outputFields) {
List<String> names = new ArrayList<>();
for(Stream s: streams) {
if(s._name!=null) {
names.add(s._name);
}
}
Node n = new ProcessorNode(getUniqueStreamId(), Utils.join(names, “-“), outputFields, outputFields, new MultiReducerProcessor(inputFields, function));
return addSourcedNode(streams, n);
}
multiReduce 方法有个 MultiReducer 参数,join 与 merge 虽然都调用了 multiReduce,但是他们传的 MultiReducer 值不一样
小结

trident 的操作主要有几类,一类是基本的 function、filter、projection 操作;一类是 repartitioning 操作,主要是一些 grouping;一类是 aggregate 操作,包括 aggregate、partitionAggregate、persistentAggregate;一类是在 topology 对 stream 的 join、merge 操作
function 的话,若有 emit 字段会追加到原始的 tuple 上;filter 用于过滤 tuple;projection 用于提取字段
repartitioning 操作有 Grouping.local_or_shuffle、Grouping.shuffle、Grouping.all、GlobalGrouping、CustomStreamGrouping、IdentityGrouping、IndexHashGrouping 等;partition 操作可以理解为将输入的 tuple 分配到 task 上,也可以理解为是对 stream 进行 grouping
aggregate 操作的话,普通的 aggregate 操作有 3 类接口,分别是 Aggregator、CombinerAggregator、ReducerAggregator,其中 Aggregator 是最为通用的,它继承了 Operation 接口,而且在方法参数里头可以使用到 collector,这是 CombinerAggregator 与 ReducerAggregator 所没有的;而 CombinerAggregator 与 Aggregator 及 ReducerAggregator 不同的是,调用 stream.aggregrate 方法时,trident 会优先在 partition 进行局部聚合,然后再归一到一个 partition 做最后聚合,相对来说比较节省网络传输耗时,但是如果将 CombinerAggregator 与非 CombinerAggregator 的进行 chaining 的话,就享受不到这个优化;partitionAggregate 主要是在 partition 维度上进行操作;而 persistentAggregate 则是在整个 stream 的维度上对所有 batch 的 tuple 进行操作,结果持久化在 state 上
对于 stream 的 join 及 merge 操作,其最后都是依赖 multiReduce 来实现,只是传递的 MultiReducer 值不一样;join 的话 join 的话需要字段来进行匹配(字段名可以不一样),可以选择 JoinType,是 INNER 还是 OUTER,不过 join 是对于 spout 的 small batch 来进行 join 的;merge 的话,就是纯粹的几个 stream 进行 tuple 的归总。

doc

Trident API Overview
Trident API 综述
JStorm Trident 入门精华一页纸
Everything You Need to Know about Apache Storm
batch and partition – differences

正文完
 0