共计 34647 个字符,预计需要花费 87 分钟才能阅读完成。
简介: 数据库是什么?将来的数据会被存在 DNA 里?数据库里的数据湖是什么?1 月 16 日,扫地僧做了一场直播,请到我的共事——数据库资深专家封神,和来自帝国理工的高级讲师 Thomas Heinis(托马斯·海尼斯),2 人就数据库这个话题做了比拟深刻的探讨,老僧印象比拟深的是一些前沿的 DNA 贮存大数据等概念。在此老僧奉上单方谈话的全部内容,因为英国学者应用英文解说,所以对全文进行了中英文的翻译。心愿这个速记能帮忙对前沿科学有趣味的同好。
1 月 16 日,扫地僧做了一场直播,请到我的共事——数据库资深专家封神,和来自帝国理工的高级讲师 Thomas Heinis(托马斯·海尼斯),2 人就数据库这个话题做了比拟深刻的探讨,老僧印象比拟深的是一些前沿的 DNA 贮存大数据等概念。在此老僧奉上单方谈话的全部内容,因为英国学者应用英文解说,所以对全文进行了中英文的翻译。心愿这个速记能帮忙对前沿科学有趣味的同好。
主持人:大家好,欢送来到阿里达摩院扫地僧的直播间,我是扫地僧的小助理。明天,咱们的主动驾驶机器人小蛮驴带大家逛了一下阿里云飞天园区,一路没人讲话,不晓得盆友们有没有看急了,那接下来咱们就聊聊天。
Moderator: Hello everyone. Welcome to the live streaming studio of the sweeper monks of Alibaba DAMO Academy. I am the assistant to the sweeper monks. This morning, our autonomous robot Xiaomanlv (Little Competent Donkey) took you guys on a tour in Alibaba Cloud Apsara Park without saying anything. I wonder if you felt anxious or not. Now let’s have a good chat.
这次陪咱们聊天的是扫地僧的老同事封神。
Today, we have with us our old friend, the Sweeper Monk Fengshen.
封神(阿里云智能 云原生数据湖剖析 DLA 技术负责人):大家好,我叫封神,来自阿里云数据库团队,09 年退出阿里,目前次要做数据库数据湖剖析方向,次要负责云原生数据湖剖析 DLA 的技术,之前也做了 10 年左右的数据库与大数据相干的事。
Hello everyone, I am Fengshen from Alibaba Cloud database team. I joined Alibaba in 2009. Currently, I am mainly responsible for Data Lake Analytics, known as DLA. Before joining Alibaba, I had spent 10 years doing things related to databases and big data.
主持人:直播间还请到了 1 位来自远方的客人,Mr. Thomas Heinis,他是帝国理工学院的讲师,请客人自我介绍。
Moderator: We are also honored to have with us here Mr. Thomas Heinis, lecturer at Imperial College London. Mr. Heinis, please kindly introduce yourself.
托马斯·海尼斯(帝国理工数据库业余高级讲师):是的,当然。我叫托马斯·海尼斯,当初是帝国理工学院的高级讲师。我在钻研小组做钻研,咱们钻研小组基本上负责所有与数据无关的钻研工作。我从事大量的数据分析和数据可视化工作,目前也负责数据存储工作,包含所有的新技术,确保咱们将来可能无效地剖析和了解数据。
Yes, sure. My name is Thomas Heinis. I am a senior lecturer at the moment at the Imperial College in London. I do research here in the research group, which basically takes care of everything to do with data.So I do a lot of data analytics, also, data visualization, and then also data storage at the moment, including all new technologies such that we can basically analyze data efficiently and understand it in the future.
主持人:既然请到 2 位数据库畛域的专家,咱们明天探讨的主题必定离不开数据库。数据库对咱们这些非专业人士而言,可能最常听的词是删库跑路,而对数据库最直观的了解就是 Excel 表格。先给咱们的观众介绍一下什么是数据库。请问 Mr. Heinis,你给学生上第一堂课时如何介绍?
Moderator: Since we have invited two experts in the field of databases, the topic of our discussion today is certainly inseparable from databases. For non-professionals like us, probably the most common words we have heard is dropping a database. And our most intuitive understanding of databases is Excel tables. First, please introduce to our audience what a database is. Mr. Heinis, how would you explain databases to freshmen in your first class?
托马斯·海尼斯:我通常会给学生略微介绍一下数据的由来,这与银行密不可分。上个世纪六七十年代期间,银行贮存了大量的数据。它们须要将这些数据组织起来,关系型数据库应运而生。数据库实质上是很多信息和数据的汇合,这些数据被组织起来,从而实现高效的剖析、拜访、治理和更新的目标。所以,计算机数据库通常来自于数据文件的导航,从传统上而言,或者就历史渊源而言,数据库的确来自银行,有很多对于银行客户的信息以及他们账户余额的信息。
I usually explain a little bit of history about it. That’s all to do with banks. Banks had a lot of data in the 60s and 70s. And they needed to organize that data. That’s kind of where relational databases come from.
Essentially, what I tell students is that a database is a collection of lots of information, lots of data that are organized so that it can be analyzed, accessed, managed, updated very efficiently, right?
So computer databases, typically from the navigation of data files. Databases, traditionally, or historically, do come from banks. There was a lot about bank customers and clients their balance of their accounts and information.
And within databases, then we have this massive for this big branch of this big technology called relational database, which is really where we organize data to tables, rows, columns, which really contain information about customers, clients, transactions, sales, etc. and all kinds of very well structured information.
而在数据库外面,有一个宏大的技术分支,叫做“关系型数据库”,其实就是咱们把客户、交易、销售等信息以高度结构化的模式组织到表、行、列当中。而有些同学一开始就曾经据说过 SQL,也就是用于查询数据库的查询语言。最近几年,状况有所变动。但总的来说,关系型数据库的确是在上世纪六七十年代期间由银行的利用案例所驱动的。过来二十年间,状况产生了微小的变动。因为咱们有了须要组织数据的新型利用,比方迷信利用,或者是社交网络等类型的利用。
基本上,这些利用须要稍微不同的数据库。因而,起初咱们转向了 NoSQL 数据库或非关系型数据库,有了不同的用例,也开始治理更多的数据,也就是当初所说的大数据。简略地说,咱们收集了海量的数据,须要用数据库来剖析和存储这些数据。
And some of the students have already, on the start, already heard about SQL, which is the query language to ask for a query database.
In recent years, things have changed a little bit. So relational databases, they do come from the 60s 70s driven by this banking use case, for this banking application.
And in the last 20 years, things have changed drastically, right?
Because we have new applications that need to organize the data such as scientific applications, or the types of applications like social networks, and etc.
Basically, these applications require slightly different databases. And so that’s where then we went to kind of noSQL databases or non-relational databases and we moved to different use cases. And we also move to managing much much more data, What is nowadays called Big Data. Basically, we collect tremendous amounts of data. And then we need to come to databases to analyze and store these data.
主持人:想问从事这个行业的封神老师,进入到古代,数据库为何越来越重要?
Moderator: I have a question to Fengshen. Why are databases more and more important in the modern era?
其实数据库始终很重要,最为简略的一条就是 数据不能丢。企业如果丢了最为外围的数据库,则企业可能间接面临破产。
Actually, databases have always been important. To put it simply, data cannot be lost. If an enterprise loses its core database, it might directly go bankrupt.
为什么数据库越来越重要 那是因为数据还蕴藏着宝藏,之前数据存着也就存着,个别存外围的数据,如交易、客户、商品的数据,一些日志数据存就是了查问问题。也不会去埋点,更加谈不上去爬取或者购买数据了。
Why are databases more and more important? That is because data also contain treasure, In the past, data were just stored there. Generally, core data were stored, such as the data of transactions, customers, and goods. Some log data were stored just for query purposes. There were no buried points, let alone crawling or purchasing data.
互联网也经验了好几个阶段,从开始的新闻门户时代,到能够有交互的相似 BBS、淘宝购物时代,到 all in 无线,到智能时代,产生的数据也越来越多。
The Internet has gone through several stages, from the news portal era in the beginning, to the era of interactions with such platforms as BBS and Taobao, to the era of wireless technology, and then to the era of intelligence. More and more data have been generated in the process of evolution.
- 从数据量来看,IDC 统计,2005 年寰球数据是 130EB,2019 年为 41ZB,涨了 322 倍;
-According to the statistics released by IDC, the amount of global data generated in 2005 were 130 EB, and that in 2019 was 41ZB, an increase of 322 times.
- 从数据利用来看,越来越多的公司也应用数据做出了杰出的问题,阿里、头条、百度、滴滴等 Top100 出名互联网企业都是数据驱动的企业。咱们再来看传统的产业,有智慧园区、城市大脑、县域大脑、智慧农业、智慧城市、智慧医疗、工业 4.0 等等,也都在应用数据技术在赋能各个产品,帮忙这些产业数字化转型,晋升效率。在产业利用看,要充分发挥设想的空间,应用数据赋能产业转型成长。
-From the perspective of data application, more and more companies have also made outstanding achievements with data. For example, the Top 100 well-known internet enterprises, including Alibaba, Toutiao, Baidu and DiDi, are all data-driven enterprises. In terms of the traditional industries, data technology is used in smart parks, city brain, agriculture brain, smart agriculture, smart cities, smart medicine and Industry 4.0 to empower each product and help the respective industries achieve digital transformation and increased efficiency. Seen from the perspective of industrial application, it is necessary to give full play to imagination and use data to empower industrial transformation and growth.
- 从国内形势看,国家也提出了新基建,外围是以 云计算、大数据、人工智能、5G、区块链为外围,这些外围中的外围是数据的利用,再过 10 年,真的是万物互联的时代,数据量增长的速度会更放慢;在疫情时代,有相干机构钻研表明,疫情让数字化转型快了 5 年左右。
-Seen from the domestic situation, the Chinese government has also put forward the concept of new infrastructure. Its core technologies include cloud computing, big data, artificial intelligence, 5G and blockchain. And the core of these core technologies is the application of data. It is expected that in 10 years, we will enter the era of the Internet of Everything, when the growth of data volume will be even faster. Relevant institutions have found through research that the COVID-19 pandemic has expediated the digital transformation by 5 years.
- 从高校和钻研机构看,国内高校业余增设最多是 大数据技术、人工智能的业余;总之,21 世纪除了人才,还有什么最贵,那就是数据,数据相当于 20 理论的石油,是 21 世纪整个社会效力运行的润滑剂。因为数据越来越重要,所以数据库越来越重要,数据库是这所有的外围载体。
-From the perspective of universities and research institutions, big data technology and artificial intelligence programs have been widely offered these days. In short, in addition to talents in the 21st century, what else is the most expensive? The answer is data. Data are equivalent to the oil of the 20th century. Data are the lubricant for the functioning of the whole society in the 21st Century. Due to the increasing importance of data, databases are also becoming more and more important. And databases are the core carriers of everything.
主持人:中国和英国的对大数据的定义有什么不同?这会导致单方程序员对数据库的了解不同吗?
Questions: What is the difference in the definition of big data between China and the U.K.? Will such differences lead to different perceptions of databases between the programmers in the two countries?
封神:大数据其实我认为没有精确的定义。比方,如果数据量比拟小,然而训练用的机器比拟多,也能够认为是用到了大数据技术。我个别认为用数据驱动业务倒退就属于应用了大数据相干的技术。之前国内提 Big Data(中文指:大数据)比拟多,当初国外提 Data Lake(中文指:数据湖)的概念比拟多,次要还是云公司在主导,数据很多存在了对象存储上。阿里数据库团队提的是库、仓(Data Warehouse)、湖(DataLake)、多模(Multi-Model),并且咱们还专门做了一个 云原生数据湖剖析 DLA 的产品。另外,咱们看到驰名的征询公司 Gartner 把大数据报告合并到了数据库报告外面。
Fengshen: I don’t think there is a precise definition of big data. For example, if the amount of data is small, but many machines are used for training, it can still be considered that big data technology has been used. I generally think that adopting the data-driven business mode is equivalent to using big data-related technologies. In the past, the concept of Big Data was mentioned a lot in China. Nowadays, the concept of Data Lake becomes popular in foreign countries. In most cases, cloud companies are taking the lead. And a lot of data exist in object-based storage. Alibaba’s database team mentions data library, Data Warehouse, Data Lake, and multi-model. We have also specifically made a product based on Data Lake Analytics (DLA). In addition, the renowned advisory company Gartner Inc. integrated big data reports into its database report.
我认为数据库包含传统的数据库技术(如 MySQL、PG);也包含数据仓库、数据湖的技术,如开源的 Spark、Hadoop,阿里的 ADB、DLA 等,也包含最近风行的 LakeHouse 技术。
In my view, databases include traditional database technologies (such as MySQL and PG) as well as data warehouse and data lake technologies, such as open-source Spark, Hadoop, Alibaba’s ADB and DLA. They also include the Lakehouse technology which has been quite popular these days.
我跟英国的程序员交换比拟少,跟北美有一些交换,整体应该了解差不多。技术流传的速度也比拟快,大家了解应该比拟相似。因为中国的市场比拟大,因为某一些起因,数据也比拟多,这些会放慢对数据的利用的倒退。
I haven’t had much communication with programmers in the UK. But I have had some communication with programmers in North America. Overall, our perceptions are more or less similar. The speed of technology communication is also relatively fast. So, the understanding should be more or less similar. Besides, the huge market in China and the larger amount of data generated here will accelerate the development of data application.
托马斯·海尼斯:On some level, yes. I think this is gonna be long long rounds. But I don’t think you know, if you look at it, from a technical perspective, the types of data could be the same, it’s going to be very similar. But what I do think is that the scale is massively different. And that’s quite, you know, just because there’s so much more data available in China.
某种程度上,会的。这一点说来话长。但我认为,从技术角度而言,数据的类型可能雷同,或者十分类似。但数据的规模却差别微小,这次要是因为在中国,可获取的数据更多。
And it’s because I think personally, I think it’s because the society is a little bit more technologically advanced, or is it easier to…I think China adopts technology easier, which means that, for example, you have more sensors everywhere with traffic measurements, and an engine mentioned the DiDi, right, which, which, you know, produces the same kind of data as Uber, for example, but on a different scale, right. And that applies to everything.
我集体认为,这是因为相比之下,中国社会的科技更为发达,并且中国更易采纳技术。这就意味着,例如,中国的路面上会有更多的传感器来监测交通状况。又如,滴滴打车软件所产生的数据和 Uber 所产生的数据一样,然而规模却不同。其余畛域也是如此。
And what he said about the pandemic, being a catalyst for this transformation is absolutely also absolutely true. Like we have made way more electronic payments, now, we have just everything is more digital, and I think a lot of it will stay digital, and that produces data that produces data that we need to analyze.
刚刚他提到说新冠肺炎疫情是推动这一转型的催化剂,这一点毋庸置疑。比方,咱们当初应用电子领取的频率比以前高了很多,生存的方方面面都更加数字化了,我认为数字化的趋势会继续上来,这就产生了须要咱们剖析的数据。
So with what I mentioned, about China, being a technological, a bit more about having more technical technological affinity means that we basically have more places where we collect data, and much more sensors, people interacting online that produces more data.
我刚提到说中国更亲热技术,这就意味着在中国,数据的起源更多,有更多的传感器,网络用户互动所产生的数据也更多。
And then also, you know, China’s population is huge. So that also means that more data is being produced. So it’s, I would say, you know, there are two technical challenges when it comes to big data or databases in general.
此外,中国人口众多,因此产生的数据也更多。因而,我认为,大数据或数据库次要面临两大技术挑战。
One is the data formats, you know, that’s life changing, as mentioned, is going towards also data lake right loads of different formats. But I don’t think that differs much between China and the rest of the world.
一是数据格式上的挑战,其发展趋势是走向数据湖等不同类型的格局。但在这一点上,中国和世界其余中央的区别不大。
But what what is really different is the amounts of data. And so that means that we need whatever we develop the analysis, visualization, storing the data that needs to happen efficiently on a much, much larger scale, which then again, brings in a lot of technical challenges and challenges as well. Right. So I think i think that’s that’s, that’s what I think that’s the difference. But I think the definition as such, it’s roughly roughly the same.
中外真正的区别在于数据量。具体而言,中国须要更大规模和更高效地进行数据分析、可视化和存储数据等工作。这又带来了很多技术挑战。我认为中外在数据库畛域的区别就在于此,但两者对数据库的定义基本相同。
It’s also I also have the feeling That, that, you know, China has a different, the Chinese people have a different understanding of personal data as well,
此外,我认为中国人对个人资料的了解与外国也不同。
it’s very, very difficult for us to get data from companies here or from you know, there’s always a perception of these breaches data privacy, whereas in China and appear of works a lot with with Chinese University as well. So we get a lot of data from I don’t think it’s DiDi, but somebody saw some similar data sets, it’s quite easy. So there’s also kind of like, I feel like China has really this, this kind of this, this more affinity towards technology, and this is a and like, this more of a project, let’s see what we can do with the data. Let’s see if we can improve things, you know, so we’re also collaborating on a traffic optimization project that you know, they collect massive amounts of data about which vehicle passes through the road, where at what speed what’s the congestion level? Can we remove traffic and all these kinds- of things? It’s really kind of like a very pragmatic approach to using data really, what can we do to improve you know, everything really. That’s what data does these days.
在英国或者其余国家,从公司获取数据十分艰难,因为外国企业总是放心进犯数据隐衷。相对而言,从中国企业或者院校获取数据容易得多。因而,咱们获取了很多数据,可能不是间接来自滴滴公司,但数据集也比拟相似。总地来说,在中国,数据的获取更为容易。我感觉中国更亲热技术,更违心应用技术,更违心利用技术来改善现状。咱们正在与中方单干一个交通优化的我的项目,他们收集了大量的数据,包含通过路线的车辆信息,车辆的地位,车辆的行驶速度,路面的拥挤水平,是否能够打消拥挤等。这种数据应用真的十分求实,相当于应用数据来改善所有可改善之处。这便是数据在当今社会所施展的作用。
主持人:随着大数据这个概念的呈现,数据库是怎么进化的,请封神老师讲讲?
With the emergence of the concept of big data, how has databases evolved over the years? Fengshen, please share with us your view on this question.
封神:数据库怎么进化,先看看数据库怎么来的。从狭义看,数据记录的历史早就有了。在 5000 年前,人类开始用绳结计数;在 2000 年前有纸张,到 1946 第一台计算机的诞生;计算机的诞生后,才有了古代意义的数据库。为了形象说数据库是什么,好比 你有一个管家,管家有一个记账本,你每天破费多少钱,支出多少都会告知管家,管家记录下来。你就能够晓得你目前多少钱,每个月破费多少钱;数据库就是管家 + 账本;管家提供计算力,账本提供存储;
To figure out how databases have evolved, let’s first look at how they came into being. In a broad sense, data recording dates back to a long time ago. Back 5000 years ago, humans began to count with knots; 2000 years ago, paper started to be used. And in 1946, the first computer was invented. After the invention of the first computer, the databases in the modern sense came into being. Let me use an analogy to explain what a database is. Suppose you have a housekeeper, and the housekeeper has a ledger. You inform the housekeeper how much you spend and how much you earn every day. The housekeeper records your income and expenditure in the ledger accordingly. You can then know how much money you currently have and how much you spend each month. In this case, the database is the housekeeper + the ledger: The housekeeper provides the computing power, while the ledger provides the storage.
数据库也倒退经验了很多阶段,为了“记好账”,数据库也在一直演进。咱们个别依据把数据库倒退分为了 4 个阶段:
The development of databases has gone through several stages. And in order to“keep good accounts”, databases have also been evolving. We generally divide the development of databases into 4 stages:
- 1970~1990 商业数据库时代 免费时代
- The two decades from 1970 to 1990 was an era of business databases when fees were charged for the use of databases.
- 1990~2000 开源数据库时代 开源时代
- The decade from 1990 to 2000 was an era of open-source databases.
- 2000~2015 互联网浪潮 大数据时代(大数据计算、存储、NoSQL)
- The period from 2000 to 2015 was an era of Internet and big data (big data computing, storage, NoSQL)
- 2015~ 当初 云的浪潮 云原生时代 +AI
- Finally, the era from 2015 to the present is an era of cloud technology and the widespread application of Cloud Native and AI technologies.
数据库的倒退跟几个因素无关,硬件的倒退,需要;硬件次要指 存储、网络、内存、CPU。存储就是存数据,内存与 CPU 关系到计算力,网络就是传输。
The development of databases is related to several factors, including the development of hardware and the market demand; hardware mainly refers to storage, network, memory and CPU; storage refers to data storage; memory and CPU are related to computing power; and network concerns transmission.
大数据这个词语大略在 10 年前开始风行,大数据系统开始独立于数据库系统倒退的,随着最近 5 年的倒退,大数据相干技术又缓缓与数据库技术联合回归到数据库的小家庭。比方,2020 年,驰名的征询公司 Gartner 把大数据报告合并到了数据库报告外面。最为典型的是 DalaLake 的倒退,融入了事务 &MVCC 的概念,NewSQL 的倒退,NewSQL 也交融了分布式的实践,并且还有一个 HTAP 的方向在摸索。目前数据库畛域分为 TP、NoSQL、AP 等畛域。TP 个别有单机、分布式、事务型的数据库;NoSQL 就绝对散一些:宽表、图、文档、时序、时空等;AP 有 Data Warehouse、DalaLake 畛域。
The term big data became popular around 10 years ago, when big data systems started to develop independent from database systems. With the development in the last 5 years, big data-related technologies have slowly returned to the database family by combining with database technologies. For example, in 2020, the famous research and advisory company Gartner integrated its big data report into its database report. And the most typical case is the development of Alibaba’s Dala Lake Analytics team, which incorporates the concepts of transactional databases & Multi-Version Concurrency Control (MVCC). Besides, the development of NewSQL also incorporates the theory of distributed databases. And the direction of HTAP is under exploration. Currently, the database field consists of such segments as TP, NoSQL and AP, TP generally consists of standalone databases, distributed databases and transactional databases; NoSQL covers a relatively wider scope, including wide tables, graphs, documents, time sequence and space-time. AP consists of such fields as data warehouses and data lakes.
在企业界,必定是做看得见,并且在 5 年内能落地的事件。将来 5 年,数据库畛域外围倒退方向是云原生 + 分布式,具体讲:Serverless、数据库与大数据一体化、智能化、平安可信、软硬件一体化、离在线一体化、多模数据处理。举个例子,我负责做的云原生数据湖剖析 DLA 就是传统大数据、Hadoop、Spark 的降级,须要交融传统数据库技术,并且基于存储与计算齐全拆散的云原生架构。咱们选用对象存储,反对常见的音讯、TP 和 NoSQ 数据库系统数据的归档。咱们个别归档到 DalaLake 外面。还反对了一些事务、版本的货色,并且把 Spark、Presto 等组件做成云原生的弹性、随时可用,即开即用,按需计费,拆散后带宽的损耗通过引入本地的 Cache 解决。
The prospects of the business community in the next 5 years are definitely foreseeable. In the next 5 years, the core development directions of the database field are Cloud Native and distributed databases, which specifically include serverless, integration of databases and big data, intelligence, security and trustworthiness, hardware and software integration, offline and online integration, and multi-mode data processing. For example, I am currently responsible for Data Lake Analytics (DLA), which can be considered the upgrade of the traditional technologies such as big data, Hadoop and Spark. It requires the integration of traditional database technologies and is based on the Cloud Native architecture of complete separation of storage and computing. It selects object-based storage and supports the archiving of common messages as well as the data in TP & NoSQL databases systems. Normally we archive the data in the data lakes. Besides, DLA also supports transactions and versions. Besides, the Spark and Presto components are also incorporated to achieve elasticity for the Cloud Native, which is accessible at any time and charged on the basis of demand. The loss of bandwidth after separation is solved by introducing local Cache.
托马斯·海尼斯:其实封神的分享曾经很具体,并无太多可补充之处。如果非要补充的话,如您刚才所言,开源数据库大大推动了数据在社区内的扩散。我认为相较以前,数据库的应用也变得容易得多了。
Well, there’s not much more to add to what Fengshen has already said. Right. But I think I think what would what would I do want to maybe add really is that also, you know, Like you said absolutely correctly, is that open source databases have done a tremendous service to the community in kind of getting databases everywhere. I will say that I think it’s also kind of the databases that have become much, much, much easier to use as well. And, you know, years back the first year students, they’d never seen a database. Today, if I asked, they’ve all seen MongoDB, or other technologies, kind of like easy-to-use databases, you know, not relational databases necessarily, but easy to use technology.
比方说,若干年前,大一的学生从未见过数据库。而当初的大一新生。简直都见过 MongoDB 或者其余一些易于应用的数据技术。他们未必见过关系型数据库,但根本都见过易于应用的数据科技。
And that really has made a difference in terms of training people, they also kind of has changed their understanding the approach of people to using databases. Now, back in the day, people were storing data just in RAW files. Nowadays, they know, if I want to have efficient access to the data, then I need to use a database and they know how to do so.
这对人员的培训起到了微小的助推作用。这也扭转了人们对数据库应用办法的了解。以前,人们只是把数据存储在原始档案中。当初,他们晓得,须要应用数据库能力无效地拜访数据,并且他们也晓得如何应用数据库。
So databases have really changed, or kind of database have become much more pervasive. They’re used everywhere these days. So that has definitely changed.
所以,数据库真的产生了变动,或者说数据库的应用变得更加广泛,当初各地的人们都在应用数据库,这是一个很显著的变动。
And now I unfortunately, forgot your question, which I didn’t answer.
不好意思,我遗记你提的问题了,所以没有答复。
So essentially, what really changed, right, and I said this initially, already kind of databases were designed relational databases were designed for banking applications that were revolving around transactions, which was really the centerpiece of banking applications.And that has made a lot of design decisions difficult.
至于数据库产生了什么样的演变。之前我有提到,数据库,或者说关系型数据库是为银行的利用而设计的,是围绕交易设计的,而交易是银行利用的真正外围。这也使得许多设计决策变得艰难。
And then in recent years, like XX had mentioned, right, databases are kind of new use cases emerged, all of a sudden, we no longer have the data we have along fits nicely in a in a table, we actually have a graph.
然而,最近几年,正如封神刚提到的,呈现了一些数据库的新用例。忽然间,咱们取得的曾经不再是表格数据,而是图形数据。
So we have graph databases, or we realized a lot of data is natively very structured in a document, XML or similar. So we develop document databases.
于是咱们有了图形数据库,或者说咱们意识到很多数据天生就是高度结构化的,相似于文档数据或 XML 数据。所以咱们开发了文档数据库。
So there’s, there’s the now we, in the early, maybe around 2000, a little bit after 2000, we had this understanding that one size doesn’t fit all. So we need to have different types of databases. So I mentioned graph database, document databases. But there have also been other other databases, very customized databases. For scientific applications, right?
所以,大概在 2000 年左右,具体说是刚过 2000 年的时候,咱们意识到不能一刀切,而是须要不同类型的数据库。我刚提到了图形数据库和文档数据库。然而也有其余类型的数据库,比方用于迷信利用的高度客制化的数据库。
They produce massive amounts of data, like physics experiments, like astronomy, DNA experiments in biological experiments, they have all kinds of their own database technology these days,
这些迷信利用数据库会产生大量的数据,比方物理试验、天文学、生物试验中的 DNA 试验等。这些畛域都有各自不同类型的数据库技术。
Back in the day, we’ve tried to fit everything in relational database and it didn’t work really well. So each one of those now as their own title type of database.
以前,咱们试图把所有数据都装进关系型数据库,但成果并不现实。因而,当初不同畛域都有各自不同类型的数据库。
At the same time, we also, you know, more and more data has been produced in different formats. And this is really where the kind of what is this notion of a data lake of engine has been mentioning is coming from that we have tons of data in different formats, we still want to analyze the data as a whole. So we need some sort of kind of some sort of integration between that or some sort of way of analyzing heterogeneous different data types. And that has also changed. So we have now this capability to just produce data, throw it in a database, Put simply, and then analyze it efficient, efficiently, efficiently at scale. Right. So that’s really how things I believe have changed.
同时,越来越多数据也在以不同的格局产生,因而咱们一直提到数据湖引擎的概念。之所以引出这个概念,是因为咱们有大量的不同格局的数据,但依然心愿将数据作为一个整体来剖析。所以咱们须要某种整合技术,或者说须要采纳某种形式来剖析不同类型的数据。所以咱们当初有这样的能力:即把数据生产进去,简略地扔到数据库外面,而后高效地进行大规模的剖析。我认为这就是数据库所产生的演变。
There’s also other trends like cloud computing, in general, which has made it which is also supported for particularly smaller businesses to have their own database their own data solution. Because they no longer need to own the resources. And they can just if they have a big analysis to run on their data, or they just use cloud resources to do so temporarily, without having the hassle of owning. Right. So that has also held,
还有其余的趋势,比方云计算。总地来说,云计算帮忙小微企业领有本人的数据库和本人的数据解决方案,因为它们不再须要领有资源,只须要把握数据的剖析能力,或者只是临时应用云端资源来进行剖析,而不须要领有资源。所以云计算对小微企业起到了助推作用。
then we also have a huge trend in terms of hardware. So we have, obviously we have better hardware. And every now and again, the database community tries to really optimize the database for new hardware beat is multi core processors, which are not particularly new, but all kinds of hardware aspects of new CPUs, new types of memory, non volatile memory, for example, change a little bit how we organize and analyze data. So a lot of hardware trends has also changed or shaped database, database technology.
此外,硬件方面也有很大的发展趋势。显然,咱们有更好的硬件。而且每隔一段时间,数据库社区就会尝试真正的优化数据库,针对新的硬件采纳多核处理器,这也并不离奇,然而硬件各个方面的优化,比方应用新的中央处理器、新的储存器,非易失性存储器等,在肯定水平上扭转了咱们组织和剖析数据的形式。所以说,硬件的很多发展趋势也扭转或者塑造了数据库技术。
And then finally, what has happened in the last couple of years is really the use of machine learning or artificial intelligence in and around data and that has driven a lot of research and has also produced a lot of products. And when I talk about AI or artificial intelligence databases, it’s really kind of The database research community has taken an approach And has done a lot of different things
最初,过来几年间,机器学习或人工智能技术开始利用于数据畛域或与数据相干的畛域,这推动了很多钻研,也催生了很多产品。谈到人工智能或人工智能数据库,数据库钻研社区做了很多不同的事件。
for example, you know, artificial intelligence machine learning requires a lot of learning, which requires a lot of data, and for that we need to have data that is clean and has been processed and has been manual has been brought in the right format. So, that’s a database task.
例如,人工智能和机器学习的倒退须要大量的学习,这就须要大量的数据。为此,咱们须要有洁净的数据,通过解决的数据,和经人工解决为正确格局的的数据。这是数据库层面的工作。
And then the learning itself is also to some degree a database. Right And so, we have worked on that, that has had a tremendous impact in recent years.
此外,从肯定水平上而言,学习自身也是一个数据库。咱们在这方面也下了不少功夫,这在近几年产生了微小的影响。
We also use artificial intelligence within the database itself to accelerate the database accelerate query execution the analysis. And then we also use artificial intelligence to organize the database itself.
咱们还在数据库外部应用了人工智能技术来减速数据库的索引、执行和剖析。咱们也应用了人工智能技术来组织数据库自身。
so that so I would say that artificial intelligence is a mega trend of course, we all know and you know has touched all aspects of life but it is also interesting enough to touch databases which not just touched but changed profoundly how we design and use databases.
因而,我认为,人工智能当然是大趋势。家喻户晓,人工智能曾经涉及到咱们生存的方方面面,但更乏味的是,它也涉及到了数据库畛域。精确的说,不仅是涉及,而且粗浅地扭转了咱们设计和应用数据库的形式。
主持人:刚刚两位老师说到了很多对于数据库的基础知识,如果我当初给这场直播起个名字,我会叫它“数据库入门必看”。开玩笑,咱们实际上是个前沿学术分享的直播。其实任何一门学科在学界和工业界都有 2 种状态,在工业界落地很重要,你能在工业界为一门学科找到很多利用场景,比方刚刚封神老师讲到的双十一、工业大脑,而学界的摸索往往十分有想象力。咱们很想请 2 位来瞻望一下,数据库的将来会如何倒退?比方 5 年内、10 年内、50 年内、100 年内?
Our two teachers just shared a lot of basics about databases with us. If I were to give this livestreaming interview a name, I would call it“Database Essentials”. Just kidding. This is actually a live interview about cutting-edge database technologies. In fact, any discipline exists in two different forms, one in academia and the other in industry. It is important to apply technologies in industry. You can find many application scenarios for a discipline in industry, such as Double 11 Shopping Festival and Industrial Brain mentioned by Fengshen just now. In comparison, the exploration in academia is often very imaginative. We would love to ask you two to look ahead into the development of databases in the future. For example, what will databases be like in 5 years, 10 years, 50 years, or even 100 years?
封神:我关注的一些方向,将来 5 年,数据库畛域外围倒退方向是云原生 + 分布式,具体讲:Serverless、数据库与大数据一体化、智能化、平安可信、软硬件一体化、离在线一体化、多模数据处理,这个会对每个数据库的每个子畛域都有影响。具体在学术界钻研的,我看的还绝对含糊一些。依照人类倒退来看,倒退应该是越来越快。不过,计算机还是冯诺依曼架构,未什么时候会颠覆,目前我也没有概念。10 年是什么样,我其实压根不晓得。目前惟一的就是放弃敬畏之心,放弃学习。
I would like to talk about some of the directions I focus on. In the next 5 years, the core development directions of databases would be Cloud Native and distributed databases. Specifically, I’m talking about serverless, integration of databases and big data, intelligence, security and trustworthiness, software and hardware integration, offline and online integration, and multi-mode data processing. These technologies will have an impact on each subfield of each database. As to database research in academia, I only have a vague idea. Seen from the history of human development, the development of databases should be faster and faster. However, computers nowadays are still based on the von Neumann architecture. I have no idea when it will be replaced. And I actually have no idea what kind of development will have happened in 10 years. At present, the only thing I am sure about is to maintain a sense of awe and keep learning.
主持人:学术界就是 Mr.Heinis 的钻研方向了,请 Mr.Heinis 持续来说
托马斯·海尼斯:Yeah, well, what’s the future? It’s difficult to predict, right?But in terms of, you know, kind of like a five year perspective, the only thing I would add in that I think will make a difference in the in the short term is probably also, like I mentioned, AI, artificial intelligence helping us a little bit to organize the data to accelerate analysis, etc.
将来会如何?咱们很难预测。但就将来 5 年而言,短期内可能呈现的停顿,就是我刚提到的:人工智能将在肯定水平上帮忙咱们组织数据和减速数据分析。
So I think we’re kind of lucky that this is the case. Because a lot of students want to work on AI. And if you can kind of combine as a database technology, we get a lot of talented students involved. But yeah, so I think in the short term, I think AI will also have an impact on databases.
从这点来看,我认为咱们很侥幸。因为很多学生想钻研人工智能。如果能把数据库技术联合起来,咱们就能吸引大量人才参加进来。所以,我认为在短期内,人工智能也会对数据库产生影响。
I think also that visualization will become important. And we move there to virtual reality, right, kind of which, which offers us a much more, much more kind of, you know,
可视化技术也会变得更加重要,以及与之相干的虚拟现实技术。
we can kind of interact with the data, we can touch the data to some degree, you know,
它能帮忙咱们与数据互动,在某种程度上,咱们将能“触摸”到数据。
I used to do research and have a feed with gloves with haptic feedback, we can touch to data, this kind of thing will I think will become more important not for an individual analyzing data. But I think for collaborative analysis of data to analyze data together to understand it together, I think that’s where we also need to put in and put some research to kind of like help to, to find easier ways for people to understand the impact of things.
我曾做过相干钻研,戴上触觉反馈手套,咱们能够“触摸”到数据。我认为这类技术会变得更加重要,不是针对集体剖析数据而言,而是针对数据协同剖析,即团队独特剖析和了解数据。这也是咱们须要投入钻研的中央,以便找到更简略的办法,帮忙公众了解数据及其带来的影响。
And then, like Fengshen said right, at one, one important thing that’s going to happen fairly soon, probably five to 10 years, maybe a little bit more, it’s going to be quantum
另外,正如封神所言,不久的未来数字畛域将产生重大突破,那便是量子科技。这兴许会产生在 5 到 10 年之后,兴许更久一点。
and quantum. You know, it’s difficult to fathom what is gonna, what it’s going to do to our two databases. But one thing is for sure, I believe with quantum sensing, quantum sensors will just have so much more data to deal with. And that will challenge database technology, or Big Data technology in itself, right?
咱们很难弄清楚量子科技会对数据库造成什么影响。但有一点是必定的,我置信随着量子传感器的利用和遍及,咱们将有更多的数据须要解决。而这将对数据库技术或大数据技术带来挑战。
Then when it comes to go a little bit beyond 20-30-50 years, maybe or 50 years a bit later. But yeah, one of my favorite topics DNA storage basic for the store information to store data within synthetic DNA. And this is interesting, because we know essentially has been talking about numbers initially how much data we have
再过 20、30、50 年,甚至超过 50 年之后,就得谈到我最喜爱探讨的话题之一,DNA 存储,也就是在合成 DNA 中存储数据。这个话题很有意思,因为封神刚刚始终在议论咱们目前领有海量数据。
a lot of this data we don’t look at every day, right, we store it in the long from the long term, because we need to for the law says we have to keep records around for hundreds of years, right.
很多数据咱们并不是每天查看,只是长期保留而已,因为法律规定咱们必须保留数百年的数据记录。
And we do this with traditional technology with tape, disk, they don’t last forever, the last maybe 10,15 years. And then we need to copy the data on to a new disk or a new tape etc. So as always, as data migration, as much as the hassling is also quite expensive. And a lot of companies don’t want to afford this anymore, can’t afford to do this anymore.
咱们应用磁带和光盘等传统技术保留数据。然而磁带和光盘无奈永恒保留,可能顶多保留 10 到 15 年。接着,咱们就须要把数据复制到新的光盘或磁带上。数据迁徙耗时耗力耗财,很多公司要么不想再承当这样的老本,要么承当不起这样的老本。
So what we’re looking at with DNA storage, for example, is really to store data for 10s of years, maybe hundreds of years, right, such that we can retrieve it.
通过 DNA 存储技术,咱们心愿将数据存储几十年,甚至几百年,以便日后检索。
So we really can take the data, convert it to two strings of nucleotides and then synthesize this and store it in, in the fridge essentially. And when we need it, we sequence and get it back. So anyway, that’s kind of I think that’s going to happen.
咱们能够把数据转换成核苷酸串,而后合成并贮存在冰箱里。须要时,咱们再进行测序,并取回数据。这就是我所构想的将来。
So generally, I don’t want to focus too much on DNA storage itself, I think like, the underlying technology will change drastically
总体来讲,我不想过多地关注 DNA 存储自身,我认为其底层技术将会产生天翻地覆的变动。
in the past, we looked a lot of when we looked at storage, the storage medium, we had a lot of collaboration with computing, and electrical engineering. Now I think we’re getting to a point where we go from, from computing, collaborating between computing and biology, or chemistry, etc.
过来钻研存储介质的时候,咱们与计算和电气工程畛域有很多单干。而当初,咱们开始在计算和生物或化学等等畛域之间进行合作。
Doesn’t have to be DNA can be another kind of storage medium. But I think that’s what’s going on.
不肯定是 DNA,也能够是另一种存储介质。但我的构想大略就是这样。
And what’s quite interesting there is also I think, when we look at a little bit beyond 20 years, when it comes to DNA storage, but we can also implement some of some data processing some data analytics on top of the DNA using biological processes,
同样有意思的是,瞻望 20 年之后,在 DNA 存储方面,咱们还能够通过生物过程在 DNA 之上实现数据处理和数据分析。
which is extremely energy efficient, and also very, very fast.
这种做法十分节能,而且速度极快。
There are limits to this technology, but we’ll find out over the next couple of years. next couple of decades, maybe we’ll come in on there
这项技术存在局限性,咱们将在将来几年或几十年内找到答案。
but I think generally that we also will the whole field of computing will expand into other into other will collaborate more with other fields. And that also has implications for databases for data analytics to
can use biological processes or chemical processes or anything or similar to do computations right. I think that’s that’s what’s gonna that’s what’s definitely gonna happen. But it’s, you know, the difficulty in the future is very difficult to predict.
总体而言,整个计算畛域将会扩大到其它畛域,与其它畛域发展更多的合作。而这也将对数据库和数据分析产生影响,咱们能够利用生物过程、化学过程或其它相似的办法进行计算。我想这相对是一个趋势。但将来是很难预测的。
The implication of quantum, for example, like I mentioned, quantum sensing will deliver tons of data. But there’s gotta be other implications.
例如量子技术的影响。如我之前所言,量子传感技术的利用和遍及将给咱们提供大量数据。但除此之外,必定还会产生其它影响。
For example, it’s one tiny operation in a database, query optimization, which is kind of like you give the database a query, it figures out how to do it efficiently. And that takes a lot of time to compute to figure out how to execute that query efficiently.
例如,数据库中有这样一个小操作,即查问优化,也就是说,你在数据库里进行一项查问,它会找到高效执行的方法。这一操作须要破费大量的工夫。
And we’ve also already seen in the community that somebody took a query optimization and implemented it on a quantum computer showing that this would be massively faster to optimize the query on the quantum computer. So there’s a lot of really I don’t think I understand all the implications of quantum but there the quantum computing but that will be definitely also have an impact on databases.
而咱们曾经在社区里目击了这样一个案例,有人在一台量子计算机上进行查问优化,结果表明,在量子计算机上优化查问,速度要快得多。我无奈理解量子技术的所有影响,但量子计算必定会对数据库产生影响。
So in the short term, adding to essentially I say, is really kind of I think AI is having a tremendous impact in the short term, in the somewhat longer term, I think, we really have to think about interfaces to data virtual reality being one of them, right?Augmented reality being another, but we need to think about how can we make it easy for people to interact with data and understanding typing the query that works for an analyst that’s not going to work for everyone right for pretty good for for a broad class of people who need to you know, I think we all need to deal with interpret and analyze data and I think we need to make it easy for everyone. That’s a little bit more medium term and in the long term, I think that hardware will change dramatically with quantum with DNA storage with other types of storage medium etc. But 100 years I’m not gonna make a prediction here that’s too far out.
短期而言,接着封神刚刚的观点讲,我认为人工智能将在短期内产生微小影响,而更久远来讲,咱们必须思考数据界面,比方虚拟现实和加强事实。咱们必须思考如何找到更简略的办法,帮忙公众与数据互动,了解数据。输出查问对数据分析师而言是可行的,但并不适用于所有人。所以,咱们须要解释和剖析数据,升高数据的门槛。这是针对中期而言。久远来看,在量子技术、DNA 存储和其它类型的存储介质影响下,硬件将产生微小的变动。但 100 年后我就不做预测了,那太边远了。
主持人:感激 2 位敌人,本期节目的最初,咱们也为数据库团队和 Dr. Heinis 打个招聘广告。
Databases research falls within the expertise of Mr. Heinis. Let’s invite Mr. Heinis to share with us the prospects of database research in academia. I would like to thank our two friends. I’d like to take this opportunity to share a recruitment ad for the database team and Dr. Heinis.
封神:对数据库技术有激情的,有技术现实,且技术过硬的同学。具体数据库 TP、NoSQL、AP 各个方向都在招聘。目前我在阿里云数据库重点做数据湖剖析,欢送大家分割我。封神:dragon.caol@alibaba-inc.com
托马斯·海尼斯:Currently, the specific sub-fields of databases, including TP, NoSQL and AP are all recruiting talents. We welcome those who are passionate about database technology, who have technical aspirations, and who are technically proficient. Currently I focus on DLA at Alibaba Cloud. Look forward to hearing from you. Fengshen: dragon.caol@alibaba-inc.com.
Absolutely, of course I do. We have Imperial has been very good collaborations with China in general. And we, for some reason, I don’t know why. But we have a big share of Chinese students though, and they are very talented. So if any one of those would like to work on kind of and we offer everything, you know, internships, student shapes for PhD and postdocs as well, if anybody wants to work to to change database technology in the future, of course, you can go to Alibaba or you can come You know, seriously, I’m really looking always recruiting interested to the students, we look at all kinds of aspects of databases, like I mentioned. So, one of the some of the topics that I work on with my team are AI, virtual reality and DNA storage, but we also have other aspects. So, if anybody wants to kind of you know, learn learn these technologies work with these technologies and contribute to this research please do get in touch.
当然了。帝国理工学院和中国始终放弃着十分良好的单干。出于某种原因,我也不晓得是为什么,咱们的学生中有很大一部分是中国学生,他们才华横溢。如果他们有趣味参加钻研,我校提供各种实际机会,招收实习生、博士和博士后等等。如果你想要扭转将来的数据技术,你能够抉择退出阿里巴巴团体,或者成为帝国理工学院的一份子。我始终在寻找有志于此的学生。正如我一开始提到的,咱们关注数据库的方方面面。咱们团队正在钻研的课题包含人工智能、虚拟现实和 DNA 存储等技术,但也包含其它方面。如果有人想学习这些技术、应用这些技术并为这类钻研做出奉献,欢送分割咱们。
作者:stromal
原文链接
本文为阿里云原创内容,未经容许不得转载