A Look at the FreshCollector in storm window trident

This article takes a look at the FreshCollector used by storm window trident.
Example
TridentTopology topology = new TridentTopology();
topology.newStream("spout1", spout)
.partitionBy(new Fields("user"))
.window(windowConfig, windowsStoreFactory, new Fields("user", "score"), new UserCountAggregator(), new Fields("aggData"))
.parallelismHint(1)
.each(new Fields("aggData"), new PrintEachFunc(), new Fields());
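
UserCountAggregator and PrintEachFunc are user classes that are not shown in this post; the following is a minimal sketch of what they might look like (both class bodies are hypothetical), assuming storm-core 1.2.2 on the classpath:

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.trident.operation.BaseAggregator;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class WindowDemoFunctions {

    // counts tuples per user within one window and emits the map as the single "aggData" value
    public static class UserCountAggregator extends BaseAggregator<Map<String, Long>> {
        @Override
        public Map<String, Long> init(Object batchId, TridentCollector collector) {
            return new HashMap<>();
        }

        @Override
        public void aggregate(Map<String, Long> counts, TridentTuple tuple, TridentCollector collector) {
            counts.merge(tuple.getStringByField("user"), 1L, Long::sum);
        }

        @Override
        public void complete(Map<String, Long> counts, TridentCollector collector) {
            collector.emit(new Values(counts));
        }
    }

    // prints the aggregated window result handed to the each() step
    public static class PrintEachFunc extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            System.out.println("aggData: " + tuple.getValueByField("aggData"));
        }
    }
}
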
WindowTridentProcessor
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/windowing/WindowTridentProcessor.java
public class WindowTridentProcessor implements TridentProcessor {

private FreshCollector collector;

//……

public void prepare(Map stormConf, TopologyContext context, TridentContext tridentContext) {
this.topologyContext = context;
List<TridentTuple.Factory> parents = tridentContext.getParentTupleFactories();
if (parents.size() != 1) {
throw new RuntimeException("Aggregation related operation can only have one parent");
}

Long maxTuplesCacheSize = getWindowTuplesCacheSize(stormConf);

this.tridentContext = tridentContext;
collector = new FreshCollector(tridentContext);
projection = new TridentTupleView.ProjectionFactory(parents.get(0), inputFields);

windowStore = windowStoreFactory.create(stormConf);
windowTaskId = windowId + WindowsStore.KEY_SEPARATOR + topologyContext.getThisTaskId() + WindowsStore.KEY_SEPARATOR;
windowTriggerInprocessId = getWindowTriggerInprocessIdPrefix(windowTaskId);

tridentWindowManager = storeTuplesInStore ?
new StoreBasedTridentWindowManager(windowConfig, windowTaskId, windowStore, aggregator, tridentContext.getDelegateCollector(), maxTuplesCacheSize, inputFields)
: new InMemoryTridentWindowManager(windowConfig, windowTaskId, windowStore, aggregator, tridentContext.getDelegateCollector());

tridentWindowManager.prepare();
}

public void finishBatch(ProcessorContext processorContext) {

Object batchId = processorContext.batchId;
Object batchTxnId = getBatchTxnId(batchId);

LOG.debug("Received finishBatch of : [{}] ", batchId);
// get all the tuples in a batch and add it to trident-window-manager
List<TridentTuple> tuples = (List<TridentTuple>) processorContext.state[tridentContext.getStateIndex()];
tridentWindowManager.addTuplesBatch(batchId, tuples);

List<Integer> pendingTriggerIds = null;
List<String> triggerKeys = new ArrayList<>();
Iterable<Object> triggerValues = null;

if (retriedAttempt(batchId)) {
pendingTriggerIds = (List<Integer>) windowStore.get(inprocessTriggerKey(batchTxnId));
if (pendingTriggerIds != null) {
for (Integer pendingTriggerId : pendingTriggerIds) {
triggerKeys.add(triggerKey(pendingTriggerId));
}
triggerValues = windowStore.get(triggerKeys);
}
}

// if there are no trigger values in earlier attempts or this is a new batch, emit pending triggers.
if(triggerValues == null) {
pendingTriggerIds = new ArrayList<>();
Queue<StoreBasedTridentWindowManager.TriggerResult> pendingTriggers = tridentWindowManager.getPendingTriggers();
LOG.debug("pending triggers at batch: [{}] and triggers.size: [{}] ", batchId, pendingTriggers.size());
try {
Iterator<StoreBasedTridentWindowManager.TriggerResult> pendingTriggersIter = pendingTriggers.iterator();
List<Object> values = new ArrayList<>();
StoreBasedTridentWindowManager.TriggerResult triggerResult = null;
while (pendingTriggersIter.hasNext()) {
triggerResult = pendingTriggersIter.next();
for (List<Object> aggregatedResult : triggerResult.result) {
String triggerKey = triggerKey(triggerResult.id);
triggerKeys.add(triggerKey);
values.add(aggregatedResult);
pendingTriggerIds.add(triggerResult.id);
}
pendingTriggersIter.remove();
}
triggerValues = values;
} finally {
// store inprocess triggers of a batch in store for batch retries for any failures
if (!pendingTriggerIds.isEmpty()) {
windowStore.put(inprocessTriggerKey(batchTxnId), pendingTriggerIds);
}
}
}

collector.setContext(processorContext);
int i = 0;
for (Object resultValue : triggerValues) {
collector.emit(new ConsList(new TriggerInfo(windowTaskId, pendingTriggerIds.get(i++)), (List<Object>) resultValue));
}
collector.setContext(null);
}
}

WindowTridentProcessor creates the FreshCollector in its prepare method.
In finishBatch, it calls FreshCollector.emit to pass along the window's aggregate result set.
The data structure it passes is a ConsList, which is in fact an AbstractList implementation composed of a first element of type Object plus a List<Object> named _elems.
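
A small sketch of that shape (assuming storm-core 1.2.2; the placeholder string stands in for the TriggerInfo object that WindowTridentProcessor actually puts into the first slot):

import java.util.Arrays;
import java.util.List;

import org.apache.storm.trident.tuple.ConsList;

public class ConsListDemo {
    public static void main(String[] args) {
        List<Object> aggregatedResult = Arrays.<Object>asList("user1", 3L);
        // first element plus the aggregated values, without copying the original list
        List<Object> emitted = new ConsList("trigger-info-placeholder", aggregatedResult);
        System.out.println(emitted.get(0)); // trigger-info-placeholder
        System.out.println(emitted);        // [trigger-info-placeholder, user1, 3]
    }
}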

FreshCollector
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/planner/processor/FreshCollector.java
public class FreshCollector implements TridentCollector {
FreshOutputFactory _factory;
TridentContext _triContext;
ProcessorContext context;

public FreshCollector(TridentContext context) {
_triContext = context;
_factory = new FreshOutputFactory(context.getSelfOutputFields());
}

public void setContext(ProcessorContext pc) {
this.context = pc;
}

@Override
public void emit(List<Object> values) {
TridentTuple toEmit = _factory.create(values);
for(TupleReceiver r: _triContext.getReceivers()) {
r.execute(context, _triContext.getOutStreamId(), toEmit);
}
}

@Override
public void reportError(Throwable t) {
_triContext.getDelegateCollector().reportError(t);
}

public Factory getOutputFactory() {
return _factory;
}
}

In its constructor, FreshCollector builds a FreshOutputFactory from the context's selfOutputFields (the first field is fixed as _task_info, and the following fields are the functionFields the user declared in the window method).
Its emit method first uses the FreshOutputFactory to build a TridentTupleView from the output fields, then fetches the TupleReceivers and calls each receiver's execute method to pass the TridentTupleView on.
The TupleReceivers here include ProjectedProcessor and PartitionPersistProcessor.
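
A minimal sketch (assuming storm-core 1.2.2 on the classpath) of how a FreshOutputFactory built from those selfOutputFields maps the emitted values onto fields:

import java.util.Arrays;

import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.trident.tuple.TridentTupleView.FreshOutputFactory;
import org.apache.storm.tuple.Fields;

public class FreshOutputFactoryDemo {
    public static void main(String[] args) {
        // selfOutputFields of the window node: _task_info first, then the functionFields of window()
        FreshOutputFactory factory = new FreshOutputFactory(new Fields("_task_info", "aggData"));
        TridentTuple tuple = factory.create(Arrays.<Object>asList("trigger-info-placeholder", "user1=3"));
        System.out.println(tuple.getValueByField("_task_info")); // trigger-info-placeholder
        System.out.println(tuple.getValueByField("aggData"));    // user1=3
    }
}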

TridentTupleView.FreshOutputFactory
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/tuple/TridentTupleView.java
public static class FreshOutputFactory implements Factory {
Map<String, ValuePointer> _fieldIndex;
ValuePointer[] _index;

public FreshOutputFactory(Fields selfFields) {
_fieldIndex = new HashMap<>();
for(int i=0; i<selfFields.size(); i++) {
String field = selfFields.get(i);
_fieldIndex.put(field, new ValuePointer(0, i, field));
}
_index = ValuePointer.buildIndex(selfFields, _fieldIndex);
}

public TridentTuple create(List<Object> selfVals) {
return new TridentTupleView(PersistentVector.EMPTY.cons(selfVals), _index, _fieldIndex);
}

@Override
public Map<String, ValuePointer> getFieldIndex() {
return _fieldIndex;
}

@Override
public int numDelegates() {
return 1;
}

@Override
public List<String> getOutputFields() {
return indexToFieldsList(_index);
}
}

FreshOutputFactory is a static class inside TridentTupleView; its constructor mainly computes _index and _fieldIndex.
_fieldIndex is a map keyed by field name whose value is a ValuePointer recording the delegateIndex (fixed at 0 here), the index, and the field name; the first field is _task_info with index 0, and the fields after it are the functionFields the user declared in the window method.
The create method builds a TridentTupleView whose constructor takes an IPersistentVector as the first argument, _index as the second, and _fieldIndex as the third.
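
Continuing the sketch above, the index the factory builds can be inspected directly (ValuePointer#toString goes through reflection, so the exact output format may vary):

import org.apache.storm.trident.tuple.TridentTupleView.FreshOutputFactory;
import org.apache.storm.tuple.Fields;

public class FreshOutputFactoryIndexDemo {
    public static void main(String[] args) {
        FreshOutputFactory factory = new FreshOutputFactory(new Fields("_task_info", "aggData"));
        // expected: [_task_info, aggData]
        System.out.println(factory.getOutputFields());
        // expected: _task_info -> ValuePointer with delegateIndex=0, index=0; aggData -> delegateIndex=0, index=1
        System.out.println(factory.getFieldIndex());
    }
}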

ValuePointer
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/tuple/ValuePointer.java
public class ValuePointer {
public static Map<String, ValuePointer> buildFieldIndex(ValuePointer[] pointers) {
Map<String, ValuePointer> ret = new HashMap<String, ValuePointer>();
for(ValuePointer ptr: pointers) {
ret.put(ptr.field, ptr);
}
return ret;
}

public static ValuePointer[] buildIndex(Fields fieldsOrder, Map<String, ValuePointer> pointers) {
if(fieldsOrder.size()!=pointers.size()) {
throw new IllegalArgumentException("Fields order must be same length as pointers map");
}
ValuePointer[] ret = new ValuePointer[pointers.size()];
for(int i=0; i<fieldsOrder.size(); i++) {
ret[i] = pointers.get(fieldsOrder.get(i));
}
return ret;
}

public int delegateIndex;
protected int index;
protected String field;

public ValuePointer(int delegateIndex, int index, String field) {
this.delegateIndex = delegateIndex;
this.index = index;
this.field = field;
}

@Override
public String toString() {
return ToStringBuilder.reflectionToString(this);
}
}
buildIndex here simply returns an array of ValuePointers ordered by the selfOutputFields.
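
A small sketch (same assumption: storm-core 1.2.2) showing that buildIndex just reorders the pointers to match the given field order:

import java.util.Map;

import org.apache.storm.trident.tuple.ValuePointer;
import org.apache.storm.tuple.Fields;

public class ValuePointerDemo {
    public static void main(String[] args) {
        ValuePointer[] pointers = new ValuePointer[] {
                new ValuePointer(0, 0, "_task_info"),
                new ValuePointer(0, 1, "aggData")
        };
        Map<String, ValuePointer> fieldIndex = ValuePointer.buildFieldIndex(pointers);

        // ordered[0] is the pointer for "aggData" (index 1), ordered[1] the one for "_task_info" (index 0)
        ValuePointer[] ordered = ValuePointer.buildIndex(new Fields("aggData", "_task_info"), fieldIndex);
        System.out.println(ordered[0]);
        System.out.println(ordered[1]);
    }
}
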
ProjectedProcessor
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/planner/processor/ProjectedProcessor.java
public class ProjectedProcessor implements TridentProcessor {
Fields _projectFields;
ProjectionFactory _factory;
TridentContext _context;

public ProjectedProcessor(Fields projectFields) {
_projectFields = projectFields;
}

@Override
public void prepare(Map conf, TopologyContext context, TridentContext tridentContext) {
if(tridentContext.getParentTupleFactories().size()!=1) {
throw new RuntimeException("Projection processor can only have one parent");
}
_context = tridentContext;
_factory = new ProjectionFactory(tridentContext.getParentTupleFactories().get(0), _projectFields);
}

@Override
public void cleanup() {
}

@Override
public void startBatch(ProcessorContext processorContext) {
}

@Override
public void execute(ProcessorContext processorContext, String streamId, TridentTuple tuple) {
TridentTuple toEmit = _factory.create(tuple);
for(TupleReceiver r: _context.getReceivers()) {
r.execute(processorContext, _context.getOutStreamId(), toEmit);
}
}

@Override
public void finishBatch(ProcessorContext processorContext) {
}

@Override
public Factory getOutputFactory() {
return _factory;
}
}

When ProjectedProcessor is prepared, it creates a ProjectionFactory whose _projectFields are the functionFields declared in the window method; it also takes the first parent factory via tridentContext.getParentTupleFactories().get(0), which, because the tuple is handed over by FreshCollector, is a TridentTupleView.FreshOutputFactory.
In execute, it first calls ProjectionFactory.create to project fields out of the incoming TridentTupleView; toEmit is a TridentTupleView re-projected onto the functionFields declared in the window method.
execute then calls execute on each receiver from _context.getReceivers(), passing toEmit along; these receivers are the processors that follow the window operation, for example EachProcessor.
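
A minimal sketch (assuming storm-core 1.2.2) of that projection step: re-viewing a tuple produced by a FreshOutputFactory through only the functionFields, dropping _task_info:

import java.util.Arrays;

import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.trident.tuple.TridentTupleView.FreshOutputFactory;
import org.apache.storm.trident.tuple.TridentTupleView.ProjectionFactory;
import org.apache.storm.tuple.Fields;

public class ProjectionDemo {
    public static void main(String[] args) {
        FreshOutputFactory parent = new FreshOutputFactory(new Fields("_task_info", "aggData"));
        TridentTuple full = parent.create(Arrays.<Object>asList("trigger-info-placeholder", "user1=3"));

        // project onto the window's functionFields only
        ProjectionFactory projection = new ProjectionFactory(parent, new Fields("aggData"));
        TridentTuple projected = projection.create(full);
        System.out.println(projected.size());                     // 1
        System.out.println(projected.getValueByField("aggData")); // user1=3
    }
}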

TridentTupleView.ProjectionFactory
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/tuple/TridentTupleView.java
public static class ProjectionFactory implements Factory {
Map<String, ValuePointer> _fieldIndex;
ValuePointer[] _index;
Factory _parent;

public ProjectionFactory(Factory parent, Fields projectFields) {
_parent = parent;
if(projectFields==null) projectFields = new Fields();
Map<String, ValuePointer> parentFieldIndex = parent.getFieldIndex();
_fieldIndex = new HashMap<>();
for(String f: projectFields) {
_fieldIndex.put(f, parentFieldIndex.get(f));
}
_index = ValuePointer.buildIndex(projectFields, _fieldIndex);
}

public TridentTuple create(TridentTuple parent) {
if(_index.length==0) return EMPTY_TUPLE;
else return new TridentTupleView(((TridentTupleView)parent)._delegates, _index, _fieldIndex);
}

@Override
public Map<String, ValuePointer> getFieldIndex() {
return _fieldIndex;
}

@Override
public int numDelegates() {
return _parent.numDelegates();
}

@Override
public List<String> getOutputFields() {
return indexToFieldsList(_index);
}
}
ProjectionFactory is a static class inside TridentTupleView; in its constructor it builds _index and _fieldIndex from projectFields, so create can produce a TridentTupleView containing only the requested fields.
EachProcessor
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/planner/processor/EachProcessor.java
public class EachProcessor implements TridentProcessor {
Function _function;
TridentContext _context;
AppendCollector _collector;
Fields _inputFields;
ProjectionFactory _projection;

public EachProcessor(Fields inputFields, Function function) {
_function = function;
_inputFields = inputFields;
}

@Override
public void prepare(Map conf, TopologyContext context, TridentContext tridentContext) {
List<Factory> parents = tridentContext.getParentTupleFactories();
if(parents.size()!=1) {
throw new RuntimeException("Each operation can only have one parent");
}
_context = tridentContext;
_collector = new AppendCollector(tridentContext);
_projection = new ProjectionFactory(parents.get(0), _inputFields);
_function.prepare(conf, new TridentOperationContext(context, _projection));
}

@Override
public void cleanup() {
_function.cleanup();
}

@Override
public void execute(ProcessorContext processorContext, String streamId, TridentTuple tuple) {
_collector.setContext(processorContext, tuple);
_function.execute(_projection.create(tuple), _collector);
}

@Override
public void startBatch(ProcessorContext processorContext) {
}

@Override
public void finishBatch(ProcessorContext processorContext) {
}

@Override
public Factory getOutputFactory() {
return _collector.getOutputFactory();
}
}

EachProcessor's execute method first sets the _collector's context to the processorContext and then invokes _function.execute.
It calls _projection.create(tuple) to project the tuple onto the inputFields declared for the _function.
The collector handed to the _function is an AppendCollector.

AppendCollector
storm-core-1.2.2-sources.jar!/org/apache/storm/trident/planner/processor/AppendCollector.java
public class AppendCollector implements TridentCollector {
OperationOutputFactory _factory;
TridentContext _triContext;
TridentTuple tuple;
ProcessorContext context;

public AppendCollector(TridentContext context) {
_triContext = context;
_factory = new OperationOutputFactory(context.getParentTupleFactories().get(0), context.getSelfOutputFields());
}

public void setContext(ProcessorContext pc, TridentTuple t) {
this.context = pc;
this.tuple = t;
}

@Override
public void emit(List<Object> values) {
TridentTuple toEmit = _factory.create((TridentTupleView) tuple, values);
for(TupleReceiver r: _triContext.getReceivers()) {
r.execute(context, _triContext.getOutStreamId(), toEmit);
}
}

@Override
public void reportError(Throwable t) {
_triContext.getDelegateCollector().reportError(t);
}

public Factory getOutputFactory() {
return _factory;
}
}
AppendCollector creates an OperationOutputFactory in its constructor; its emit method likewise builds the output tuple via the OperationOutputFactory and then calls execute on each receiver from _triContext.getReceivers(); if nothing follows the each operation, AppendCollector's _triContext.getReceivers() is empty.
Summary
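
A minimal sketch (assuming storm-core 1.2.2) of the append step AppendCollector performs through TridentTupleView.OperationOutputFactory: the values emitted by the Function are attached to the incoming tuple as new fields (the field name "printed" is purely illustrative, since the example topology declares an empty Fields() for each()):

import java.util.Arrays;

import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.trident.tuple.TridentTupleView;
import org.apache.storm.trident.tuple.TridentTupleView.FreshOutputFactory;
import org.apache.storm.trident.tuple.TridentTupleView.OperationOutputFactory;
import org.apache.storm.tuple.Fields;

public class AppendDemo {
    public static void main(String[] args) {
        FreshOutputFactory parent = new FreshOutputFactory(new Fields("aggData"));
        TridentTuple in = parent.create(Arrays.<Object>asList("user1=3"));

        // selfOutputFields of the each() node: the new fields declared by the Function
        OperationOutputFactory append = new OperationOutputFactory(parent, new Fields("printed"));
        TridentTuple out = append.create((TridentTupleView) in, Arrays.<Object>asList(Boolean.TRUE));
        System.out.println(out.getValueByField("aggData")); // user1=3 (parent field still visible)
        System.out.println(out.getValueByField("printed")); // true (appended by the collector)
    }
}
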

WindowTridentProcessor uses a FreshCollector; in finishBatch it drains the pendingTriggers created by the window from the TridentWindowManager (entries are removed from pendingTriggers as they are read), which hold the data accumulated by the window, and then emits that data with the FreshCollector; by default the first value is a TriggerInfo and the following values are the accumulated window values.
FreshCollector.emit first uses TridentTupleView.FreshOutputFactory to build a TridentTupleView from the selfOutputFields (the first field is fixed as _task_info, followed by the functionFields declared in the window method), then calls execute on each receiver from _triContext.getReceivers().
Among the downstream receivers is a ProjectedProcessor, which re-projects the TridentTupleView onto the functionFields declared in the window method; its execute method works much like FreshCollector.emit: it builds a TridentTupleView with only the required fields and then calls execute on each receiver from _context.getReceivers() (e.g. EachProcessor.execute).
The collector used by EachProcessor is an AppendCollector, whose emit method again resembles FreshCollector.emit: it builds a TridentTupleView from the projected and appended fields and then calls execute on each receiver from _triContext.getReceivers().
FreshCollector.emit, ProjectedProcessor.execute, and AppendCollector.emit are thus all very similar: they use a Factory to build a TridentTupleView with the required fields and then call execute on each receiver from _triContext.getReceivers(); once a _triContext has no receivers, tuple propagation stops.

doc
Windowing Support in Core Storm
