Giraph源码分析六Edge-分析

120次阅读

共计 1202 个字符，预计需要花费 4 分钟才能阅读完成。

1. 在 Vertex 类中，顶点的存储方式采用邻接表形式。每个顶点有 VertexId、VertexValue、OutgoingEdges 和 Halt，boolean 型的 halt 变量用于记录顶点的状态，false 时表示 active，true 表示 inactive 状态。片段代码如下。

2.org.apache.giraph.edge.Edge 接口，用于存储顶点的边，每条边包含 targetVertexId 和 edgeValue 两个属性。类关系图如下：

Giraph 默认使用 DefaultEdge 类存储边，该类中有两个变量：I targetVertexId 和 E value，I 为顶点 ID 的类型，E 为边的类型。注意，DefaultEdge 类同时继承 ReusableEdge<I,E> 接口，在 ReusableEdge<I,E> 类的定义中，有如下说明文字：
A complete edge, the target vertex and the edge value. Can only be one edge with a destination vertex id per edge map. This edge can be reused, that is you can set it’s target vertex ID and edge value. Note: this class is useful for certain optimizations, but it’s not meant to be exposed to the user. Look at MutableEdge instead.

从上述说明文字可知，edge 可以被重用，只需要修改 targetVertexId 和 value 的值就行。即每个 Vertex 若有多条出边，只会创建一个 DefaultEdge 对象来存储边。
3.org.apache.giraph.edge.OutEdges 用于存储每个顶点的 out-edges。从 Vertex 类的定义可知，顶点的每条边都被存储在 OutEdges 类型的 edge 对象中，OutEdges 接口的关系图如下：

Giraph 默认的使用 ByteArrayEdges<I,E>，每个顶点的所有边都被存储在 byte[] 中。当顶点向它的出边发送消息时，需要遍历 Vertex 类中的 edges 对象。示例代码如下：

注意：由 DefaultEdge 的定义可知，遍历 getEdges 时，返回的 Edge 对象时同一个对象，只是该对象中值改变了。下面继续查看代码来证明此观点。
查看 ByteArrayEdges 类的 iterator() 方法，如下：

返回的是内部类 ByteArrayEdgeIterator 对象，定义如下：

总结：当顶点的出度很大时，此优化甚好，能很好的节约内存。如 UK-2005 数据中，顶点的最大出度为 5213。
假设顶点 1 的出度顶点有 <2 , 0.4>，<3 , 7.8>，<5 , 6.4>。如下代码：

输出结果为：
[2]
[3 , 3]
[5 , 5 , 5]
并非是希望的 [2 , 3 , 5]

正文完

大数据

发表至： hadoop

2019-08-12

0

贾扬清达摩院OceanBase都会到场分享的大会一场中国互联网技术人的盛宴

如何进入大数据领域学习路线是什么