
Flink Learning Series: Fundamentals, Part 3

Preface

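In Part 2 we promised to introduce Flink's watermarks and triggers. This part covers watermarks, triggers, and allowed lateness (how long a window stays alive for late data). These concepts are somewhat abstract and take some thought to absorb.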

Event time, ingestion time, processing time

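Before we can understand watermarks, we need to introduce the three notions of time in Flink: event time, ingestion time, and processing time.
Let's look at a diagram first: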

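Let me ask and answer a question myself:
Q: Why do we need both event time and processing time?
A: Suppose a producer produces a message. Because of network delay or other factors, the time at which we (Flink) receive the record is always later than the time at which the producer created it. There has to be some bound on that gap. Say I (Flink) am willing to wait 10 s or longer: records that arrive within those 10 seconds are early or on time, and records that arrive after them are late. Early and on-time records can be processed normally, but what about the late ones? Should they be dropped? Or stored somewhere and processed together later? Flink has thought of all this and provides classes and methods for it. The wheel has already been built, and all you need to do is set sail... haha, I digress.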
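To make the "on time versus late" idea concrete, here is a minimal plain-Java sketch. It is not Flink API: the class `LatenessCheck` and the sample timestamps are made up for illustration, and only the 10 s bound comes from the answer above. Flink itself decides lateness by comparing against the watermark rather than against the arrival time, so treat this purely as intuition.

```java
// Plain-Java illustration only; not Flink API.
class LatenessCheck {
    /** Returns true if the record arrived more than boundMs after it was produced. */
    static boolean isLate(long eventTimeMs, long arrivalTimeMs, long boundMs) {
        return arrivalTimeMs - eventTimeMs > boundMs;
    }

    public static void main(String[] args) {
        long bound = 10_000L;        // the 10 s bound from the text
        long produced = 1_000_000L;  // hypothetical event time
        // Arrived 4 s after it was produced: within the bound
        System.out.println(isLate(produced, produced + 4_000L, bound));
        // Arrived 12 s after it was produced: outside the bound
        System.out.println(isLate(produced, produced + 12_000L, bound));
    }
}
```

The first call reports an on-time record and the second a late one. What to do with the late record (drop it, or route it elsewhere) is exactly the decision the rest of this article is about.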

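As the diagram above shows, taking reading from a queue as the example: event time is the time at which the producer created the record, and it is stored in the record itself. Ingestion time is the time at which we fetch the producer's message from the data source, and processing time is the time at which we actually process the record. Compared with event time, ingestion-time programs cannot handle out-of-order or late events, but they also do not need to specify how watermarks are generated. Internally, ingestion time is treated much like event time, but with automatic timestamp assignment and automatic watermark generation.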

Below we walk through a concrete example to see how watermarks and triggers are used.

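import cn.crawler.mft_seconed.KafkaEntity;
import com.alibaba.fastjson.JSON;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import javax.annotation.Nullable;
import java.text.SimpleDateFormat;
import java.util.Properties;

public class WatermarkTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();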
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "172.19.141.60:31090");
        properties.setProperty("group.id", "crm_stream_window");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        DataStream<String> stream =
                env.addSource(new FlinkKafkaConsumer011<>("test-demo12", new SimpleStringSchema(), properties));
        env.setParallelism(1);
        DataStream<Tuple3<String, Long, Integer>> inputMap = stream.map(new MapFunction<String, Tuple3<String, Long, Integer>>() {
            private static final long serialVersionUID = -8812094804806854937L;

            @Override
            public Tuple3<String, Long, Integer> map(String value) throws Exception {
                // Parse the JSON message into the KafkaEntity POJO
                KafkaEntity kafkaEntity = JSON.parseObject(value, KafkaEntity.class);
                return new Tuple3<>(kafkaEntity.getName(), kafkaEntity.getCreate_time(), kafkaEntity.getId());
            }
        });
        DataStream<Tuple3<String, Long, Integer>> watermark =
                inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple3<String, Long, Integer>>() {

                    private static final long serialVersionUID = 8252616297345284790L;
                    Long currentMaxTimestamp = 0L;
                    Long maxOutOfOrderness = 2000L; // maximum allowed out-of-orderness: 2 s
                    Watermark watermark = null;
                    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        // The watermark trails the largest timestamp seen so far by maxOutOfOrderness
                        watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                        return watermark;
                    }

                    @Override
                    public long extractTimestamp(Tuple3<String, Long, Integer> element, long previousElementTimestamp) {
                        Long timestamp = element.f1;
                        // Last emitted watermark; the field stays null until the first periodic
                        // getCurrentWatermark() call, so fall back to the value that call would return
                        long lastWatermark = (watermark == null) ? currentMaxTimestamp - maxOutOfOrderness : watermark.getTimestamp();
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        System.out.println("timestamp : " + timestamp + "|" + format.format(timestamp)
                                + " currentMaxTimestamp : " + currentMaxTimestamp + "|" + format.format(currentMaxTimestamp)
                                + ", watermark : " + lastWatermark + "|" + format.format(lastWatermark));
                        return timestamp;
                    }
                });

        // Side-output tag for records that arrive after the allowed lateness has expired
        OutputTag<Tuple3<String, Long, Integer>> lateOutputTag = new OutputTag<Tuple3<String, Long, Integer>>("late-data") {
            private static final long serialVersionUID = -1552769100986888698L;
        };

        SingleOutputStreamOperator<String> resultStream = watermark
                .keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .trigger(new Trigger<Tuple3<String, Long, Integer>, TimeWindow>() {
                    private static final long serialVersionUID = 2742133264310093792L;
                    ValueStateDescriptor<Integer> sumStateDescriptor = new ValueStateDescriptor<Integer>("sum", Integer.class);

                    @Override
                    public TriggerResult onElement(Tuple3<String, Long, Integer> element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
                        ValueState<Integer> sumState = ctx.getPartitionedState(sumStateDescriptor);
                        if (null == sumState.value()) {
                            sumState.update(0);
                        }
                        sumState.update(element.f2 + sumState.value());
                        System.out.println(sumState.value());
                        // The state could be handled manually here, e.g. fire only once the sum reaches a threshold:
                        // if (sumState.value() >= 2) {
                        //     return TriggerResult.FIRE_AND_PURGE;
                        // }
                        // return TriggerResult.CONTINUE;
                        //
                        // The default trigger returns TriggerResult.FIRE, which does not purge the window contents.
                        // Here every element fires the window and then purges its contents.
                        return TriggerResult.FIRE_AND_PURGE;
                    }

                    @Override
                    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
                        return TriggerResult.CONTINUE;
                    }

                    @Override
                    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
                        return TriggerResult.CONTINUE;
                    }

                    @Override
                    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
                        // Called when the window is cleaned up; release the trigger's own state here
                        System.out.println("Clearing window state | value stored in window: " + ctx.getPartitionedState(sumStateDescriptor).value());
                        ctx.getPartitionedState(sumStateDescriptor).clear();
                    }
                })
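                // Using allowedLateness can cause a window to be computed more than once.
                //
                // With the default trigger:
                //   the window is cleared once the event time seen exceeds
                //   window_end_time + watermark delay + allowedLateness
                //   (that is, once the watermark passes window_end_time + allowedLateness);
                //   until then, a late element that belongs to the window and has
                //   event_time > watermark - allowedLateness triggers a recomputation.
                //
                // With a custom trigger:
                //   the window can keep emitting data downstream whenever the trigger's condition is met;
                //   clearing and recomputation follow the same two rules as above.
                //
                // Clearing window state depends only on time, not on whether a custom trigger is used.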
                .allowedLateness(Time.seconds(3))
                .sideOutputLateData(lateOutputTag)
                .apply(new WindowFunction<Tuple3<String, Long, Integer>, String, Tuple, TimeWindow>() {
                    private static final long serialVersionUID = 7813420265419629362L;

                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Long, Integer>> input, Collector<String> out) throws Exception {
                        // Print the event timestamp of every element the window currently holds
                        for (Tuple3<String, Long, Integer> element : input) {
                            System.out.println(element.f1);
                        }
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        out.collect("window" + format.format(window.getStart()) + "window" + format.format(window.getEnd()));
                        System.out.println("-------------------------");
                    }
                });

        resultStream.print();
        resultStream.getSideOutput(lateOutputTag).print();
        env.execute("window test");

    }
}


// Producer used to feed test messages into Kafka
package cn.crawler.mft_seconed.demo4;

import cn.crawler.mft_seconed.KafkaEntity;
import com.alibaba.fastjson.JSON;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SendDataToKafkaDemo4 {
    public static void main(String[] args) {
        SendDataToKafkaDemo4 sendDataToKafkaDemo4 = new SendDataToKafkaDemo4();
        // Send 40 messages; every message has id 1 and name "1", so they all share one key after keyBy
        for (int i = 0; i < 40; i++) {
            KafkaEntity build = KafkaEntity.builder().id(1).message("message" + i).create_time(System.currentTimeMillis()).name("" + 1).build();
            System.out.println(build.toString());
            // The topic must match the one WatermarkTest consumes from ("test-demo12")
            sendDataToKafkaDemo4.send("test-demo12", "123", JSON.toJSONString(build));
        }
    }

    public void send(String topic, String key, String data) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.19.141.60:31090");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

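        // A new producer is created and closed on every call: fine for a demo, wasteful in production
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // The loop body runs exactly once (i = 1), so each call sends one record with key + "1" as its key
        for (int i = 1; i < 2; i++) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            producer.send(new ProducerRecord<>(topic, key + i, data));
        }
        producer.close();
    }
}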
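
The watermark rule used by the assigner in WatermarkTest (largest timestamp seen so far minus a fixed delay) can be tried out without a cluster. The following is a plain-Java sketch, not Flink API: the class `WatermarkSketch` is hypothetical, the first two timestamps are taken from the run shown in this article, and the third (out-of-order) timestamp is invented for illustration.

```java
// Plain-Java sketch of the "max timestamp minus delay" watermark rule; not Flink API.
class WatermarkSketch {
    private long currentMaxTimestamp = 0L;
    private final long maxOutOfOrderness;

    WatermarkSketch(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    /** Records an event timestamp; only a larger timestamp moves the maximum forward. */
    void onEvent(long timestamp) {
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
    }

    /** The watermark trails the largest timestamp seen so far by the configured delay. */
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(2000L); // 2 s, as in WatermarkTest
        System.out.println("before any event -> " + w.currentWatermark());
        // The third timestamp is hypothetical and arrives out of order
        long[] events = {1564038394140L, 1564038395056L, 1564038394900L};
        for (long ts : events) {
            w.onEvent(ts);
            System.out.println(ts + " -> " + w.currentWatermark());
        }
    }
}
```

Note that the out-of-order event does not move the watermark backwards: the watermark only ever follows the maximum.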

Now let's look at the output:

timestamp : 1564038394140|2019-07-25 15:06:34.140 currentMaxTimestamp : 1564038394140|2019-07-25 15:06:34.140, watermark : -2000|1970-01-01 07:59:58.000
1
1564038394140
window  2019-07-25 15:06:33.000   window  2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038395056|2019-07-25 15:06:35.056 currentMaxTimestamp : 1564038395056|2019-07-25 15:06:35.056, watermark : 1564038392140|2019-07-25 15:06:32.140
2
1564038395056
window  2019-07-25 15:06:33.000   window  2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038395363|2019-07-25 15:06:35.363 currentMaxTimestamp : 1564038395363|2019-07-25 15:06:35.363, watermark : 1564038393056|2019-07-25 15:06:33.056
3
1564038395363
window  2019-07-25 15:06:33.000   window  2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038395786|2019-07-25 15:06:35.786 currentMaxTimestamp : 1564038395786|2019-07-25 15:06:35.786, watermark : 1564038393363|2019-07-25 15:06:33.363
4
1564038395786
window  2019-07-25 15:06:33.000   window  2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038396216|2019-07-25 15:06:36.216 currentMaxTimestamp : 1564038396216|2019-07-25 15:06:36.216, watermark : 1564038393786|2019-07-25 15:06:33.786
1
1564038396216
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038396504|2019-07-25 15:06:36.504 currentMaxTimestamp : 1564038396504|2019-07-25 15:06:36.504, watermark : 1564038394216|2019-07-25 15:06:34.216
2
1564038396504
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038396960|2019-07-25 15:06:36.960 currentMaxTimestamp : 1564038396960|2019-07-25 15:06:36.960, watermark : 1564038394504|2019-07-25 15:06:34.504
3
1564038396960
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038397376|2019-07-25 15:06:37.376 currentMaxTimestamp : 1564038397376|2019-07-25 15:06:37.376, watermark : 1564038394960|2019-07-25 15:06:34.960
4
1564038397376
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038397755|2019-07-25 15:06:37.755 currentMaxTimestamp : 1564038397755|2019-07-25 15:06:37.755, watermark : 1564038395376|2019-07-25 15:06:35.376
5
1564038397755
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038398077|2019-07-25 15:06:38.077 currentMaxTimestamp : 1564038398077|2019-07-25 15:06:38.077, watermark : 1564038395755|2019-07-25 15:06:35.755
6
1564038398077
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038398511|2019-07-25 15:06:38.511 currentMaxTimestamp : 1564038398511|2019-07-25 15:06:38.511, watermark : 1564038396077|2019-07-25 15:06:36.077
7
1564038398511
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038398904|2019-07-25 15:06:38.904 currentMaxTimestamp : 1564038398904|2019-07-25 15:06:38.904, watermark : 1564038396511|2019-07-25 15:06:36.511
8
1564038398904
window  2019-07-25 15:06:36.000   window  2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038399218|2019-07-25 15:06:39.218 currentMaxTimestamp : 1564038399218|2019-07-25 15:06:39.218, watermark : 1564038396904|2019-07-25 15:06:36.904
1
1564038399218
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038399635|2019-07-25 15:06:39.635 currentMaxTimestamp : 1564038399635|2019-07-25 15:06:39.635, watermark : 1564038397218|2019-07-25 15:06:37.218
2
1564038399635
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038399874|2019-07-25 15:06:39.874 currentMaxTimestamp : 1564038399874|2019-07-25 15:06:39.874, watermark : 1564038397635|2019-07-25 15:06:37.635
3
1564038399874
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038400261|2019-07-25 15:06:40.261 currentMaxTimestamp : 1564038400261|2019-07-25 15:06:40.261, watermark : 1564038397874|2019-07-25 15:06:37.874
4
1564038400261
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038400614|2019-07-25 15:06:40.614 currentMaxTimestamp : 1564038400614|2019-07-25 15:06:40.614, watermark : 1564038398261|2019-07-25 15:06:38.261
5
1564038400614
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038400935|2019-07-25 15:06:40.935 currentMaxTimestamp : 1564038400935|2019-07-25 15:06:40.935, watermark : 1564038398614|2019-07-25 15:06:38.614
6
1564038400935
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038401351|2019-07-25 15:06:41.351 currentMaxTimestamp : 1564038401351|2019-07-25 15:06:41.351, watermark : 1564038398935|2019-07-25 15:06:38.935
7
1564038401351
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
Clearing window state | value stored in window: 4 // Here! The trigger's clear() method has been invoked
timestamp : 1564038401856|2019-07-25 15:06:41.856 currentMaxTimestamp : 1564038401856|2019-07-25 15:06:41.856, watermark : 1564038399351|2019-07-25 15:06:39.351
8
1564038401856
window  2019-07-25 15:06:39.000   window  2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038402142|2019-07-25 15:06:42.142 currentMaxTimestamp : 1564038402142|2019-07-25 15:06:42.142, watermark : 1564038399856|2019-07-25 15:06:39.856
1
1564038402142
window  2019-07-25 15:06:42.000   window  2019-07-25 15:06:45.000
-------------------------
timestamp : 1564038402501|2019-07-25 15:06:42.501 currentMaxTimestamp : 1564038402501|2019-07-25 15:06:42.501, watermark : 1564038400142|2019-07-25 15:06:40.142
2
1564038402501
window  2019-07-25 15:06:42.000   window  2019-07-25 15:06:45.000
-------------------------
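
One detail in this log is worth spelling out: every firing prints a single timestamp, yet the counter keeps growing (1, 2, 3, 4) within a window. That is what you would expect if FIRE_AND_PURGE drops the window contents after each firing while the trigger's own "sum" state is left alone until clear() runs. The sketch below mimics this in plain Java. It is not Flink API, and the class `PurgeSketch` and its fields are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch, not Flink API: a window buffer plus separately kept trigger state.
class PurgeSketch {
    private final List<Long> windowBuffer = new ArrayList<>(); // elements the window currently holds
    private int sum = 0;                                       // stands in for the trigger's "sum" state

    /** Adds one element, fires immediately, and returns what the window function would iterate over. */
    List<Long> onElement(long timestamp, int id, boolean purgeAfterFire) {
        windowBuffer.add(timestamp);
        sum += id; // trigger state is not affected by a purge
        List<Long> seen = new ArrayList<>(windowBuffer);
        if (purgeAfterFire) {
            windowBuffer.clear(); // FIRE_AND_PURGE: drop the window contents after firing
        }
        return seen;
    }

    int sum() {
        return sum;
    }

    public static void main(String[] args) {
        long[] firstWindow = {1564038394140L, 1564038395056L, 1564038395363L};
        PurgeSketch purge = new PurgeSketch();
        PurgeSketch keep = new PurgeSketch();
        for (long ts : firstWindow) {
            List<Long> a = purge.onElement(ts, 1, true);
            List<Long> b = keep.onElement(ts, 1, false);
            System.out.println("FIRE_AND_PURGE: sum=" + purge.sum() + " sees " + a);
            System.out.println("FIRE:           sum=" + keep.sum() + " sees " + b);
        }
    }
}
```

With purging, each firing sees only the newest element, which matches the log; with plain FIRE the window function would see all elements accumulated so far on every firing.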


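Let's analyze the code above:
SendDataToKafkaDemo4 sends 40 records to Kafka. WatermarkTest receives them and converts each one into a Java entity. It then assigns watermarks (maximum out-of-orderness 2 s) and splits the stream into fixed-size 3 s windows. After keying by the first field, it sets (updates) a state value for each keyed window. When the watermark reaches window end time + 3 s, the trigger's clear method is invoked and the window's data is cleaned up. As you can also see, a trigger has several methods that can be overridden, and we can override whichever ones we need.
For example:
First time window: 15:06:33.000 – 15:06:36.000
Final value of the first time window: 4
Take the end of the first window, 36 s, and add 3 s (the allowedLateness) to get the 39 s mark. Once the watermark reaches 15:06:39.000, the window trigger's clear method runs. Indeed, by the time the event with timestamp : 1564038401856|2019-07-25 15:06:41.856 is processed, the watermark has already reached the 39 s mark, namely:
watermark : 1564038399351|2019-07-25 15:06:39.351 is that point. At this moment it triggers ….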
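
The same check can be done with plain arithmetic. The sketch below is not Flink API, and its two formulas are my assumptions about Flink's behaviour: a tumbling window with no offset starts at `ts - (ts % size)`, and a window is cleaned up once the watermark reaches its last valid timestamp (end minus 1 ms) plus the allowed lateness. The class `WindowMath` is hypothetical; the timestamps are taken from the log above.

```java
// Plain-Java arithmetic only; both formulas are assumptions about Flink's behaviour.
class WindowMath {
    /** Start of the tumbling window containing ts (offset 0, non-negative ts). */
    static long windowStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    /** Watermark value at which the window is cleaned up: last valid timestamp plus allowed lateness. */
    static long cleanupTime(long windowEndMs, long allowedLatenessMs) {
        return (windowEndMs - 1) + allowedLatenessMs;
    }

    public static void main(String[] args) {
        long size = 3000L, allowedLateness = 3000L, maxOutOfOrderness = 2000L;
        long start = windowStart(1564038394140L, size); // first record in the log
        long end = start + size;
        long cleanup = cleanupTime(end, allowedLateness);
        System.out.println("window  = [" + start + ", " + end + ")");
        System.out.println("cleanup = " + cleanup);
        // Largest timestamps seen after the records at 15:06:40.935 and 15:06:41.351
        long[] maxTimestamps = {1564038400935L, 1564038401351L};
        for (long max : maxTimestamps) {
            long watermark = max - maxOutOfOrderness;
            System.out.println(watermark + " reaches cleanup time: " + (watermark >= cleanup));
        }
    }
}
```

Under these assumptions the watermark derived from the 15:06:40.935 record still falls just short of the cleanup time, while the one derived from 15:06:41.351 passes it, which is consistent with where the clear message appears in the log.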
