About Go: How UTF-8 Encoding Works and How the Go Standard Library Implements Encoding and Decoding

Background

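I have recently been writing business code in Go, and in my spare time I read how some of the utilities in third-party libraries and the standard library are implemented.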

Practice

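All underlying data is stored in units of bytes.
ASCII codes range from 0 to 127, which in binary is 0b00000000 to 0b01111111.
UTF-8 encoding: depending on the character, 1 to 4 bytes are used to store it.
1 byte: a character that ASCII can represent is stored directly as its ASCII code and takes 1 byte.
Multiple (2 to 4) bytes: the high 3 to 5 bits of the lead byte indicate how many bytes the character occupies (110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes), and the remaining bits hold payload data. Every other byte must start with the two bits 10, and its remaining 6 bits hold payload data.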
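The bit-prefix rules above can be sketched as a small classifier. byteKind is a hypothetical helper written for illustration, not a standard library function, and it only inspects the bit pattern; it does not reject bytes such as 0xC0 that match a lead-byte pattern but are never valid in UTF-8.

```go
package main

import "fmt"

// byteKind is a hypothetical helper (not part of the standard library).
// It classifies a single byte purely by its high bits:
//   1..4  the sequence length announced by a lead byte
//   0     a continuation byte (10xxxxxx)
//   -1    a pattern UTF-8 never uses (11111xxx)
func byteKind(b byte) int {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx: ASCII, 1 byte
		return 1
	case b&0xC0 == 0x80: // 10xxxxxx: continuation byte
		return 0
	case b&0xE0 == 0xC0: // 110xxxxx: lead byte of a 2-byte sequence
		return 2
	case b&0xF0 == 0xE0: // 1110xxxx: lead byte of a 3-byte sequence
		return 3
	case b&0xF8 == 0xF0: // 11110xxx: lead byte of a 4-byte sequence
		return 4
	default: // 11111xxx
		return -1
	}
}

func main() {
	for _, b := range []byte{0x41, 0xC3, 0xE7, 0xF0, 0x95, 0xFF} {
		fmt.Printf("%08b -> %d\n", b, byteKind(b))
	}
}
```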

UTF-8 storage example

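The character "界" (U+754C) takes 3 bytes in UTF-8, laid out as follows:
First byte: 0b11100111 (the high 4 bits, 1110, are the length marker: the three leading 1s mean the character occupies 3 bytes and the 0 in the fourth position is a separator; the trailing 0111 is the payload stored in the first byte)
Second byte: 0b10010101 (the high 2 bits, 10, are the fixed marker for a non-lead byte, which distinguishes it from ASCII data; the trailing 6 bits, 010101, are the payload)
Third byte: 0b10001100 (same as the second byte)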
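To check the layout above, the following minimal sketch prints the bytes of "界" in binary and then reassembles the code point by hand from the three payload fields. The helper names bits and decode3 are made up for this example.

```go
package main

import "fmt"

// bits formats each byte of s as an 8-digit binary string.
func bits(s string) []string {
	out := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, fmt.Sprintf("%08b", s[i]))
	}
	return out
}

// decode3 reassembles the code point of a 3-byte sequence by masking off
// the 1110 and 10 prefixes and concatenating the 4+6+6 payload bits.
// It performs no validation; it is only meant to mirror the layout above.
func decode3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

func main() {
	fmt.Println(bits("界"))                         // [11100111 10010101 10001100]
	fmt.Printf("%U\n", decode3(0xE7, 0x95, 0x8C)) // U+754C
}
```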

Encoding

// EncodeRune writes into p (which must be large enough) the UTF-8 encoding of the rune.
// If the rune is out of range, it writes the encoding of RuneError.
// It returns the number of bytes written.
func EncodeRune(p []byte, r rune) int {
  // Negative values are erroneous. Making it unsigned addresses the problem.
  switch i := uint32(r); {
  // rune1Max is 127; runes up to this value are stored directly as a 1-byte ASCII code
  case i <= rune1Max:
      p[0] = byte(r)
      return 1
  // rune2Max is 2047; runes up to this value fit in 2 bytes
  case i <= rune2Max:
      _ = p[1] // eliminate bounds checks
      // t2 is 0b11000000; r>>6 fits in 5 bits here, so the lead byte has the form 110xxxxx
      p[0] = t2 | byte(r>>6)
      p[1] = tx | byte(r)&maskx
      return 2
  case i > MaxRune, surrogateMin <= i && i <= surrogateMax:
      r = RuneError
      fallthrough
  // rune3Max is 65535; runes up to this value fit in 3 bytes
  case i <= rune3Max:
      _ = p[2] // eliminate bounds checks
      // t3 is 0b11100000; r>>12 fits in 4 bits here, so the lead byte has the form 1110xxxx
      p[0] = t3 | byte(r>>12)
      // tx is 0b10000000, the marker for a continuation byte; maskx (0b00111111) keeps the low 6 bits
      p[1] = tx | byte(r>>6)&maskx
      p[2] = tx | byte(r)&maskx
      return 3
  // Everything else is stored in 4 bytes
  default:
      _ = p[3] // eliminate bounds checks
      // t4 is 0b11110000; r>>18 fits in 3 bits here, so the lead byte has the form 11110xxx
      p[0] = t4 | byte(r>>18)
      p[1] = tx | byte(r>>12)&maskx
      p[2] = tx | byte(r>>6)&maskx
      p[3] = tx | byte(r)&maskx
      return 4
  }
}

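This function decides how many bytes to use for encoding based on the magnitude of the rune's value. During encoding, the number of leading 1 bits in the first byte is set according to the byte length. Because of the value range that each length covers, and because each continuation byte stores at most 6 bits of payload, the leading 1s are always followed by a 0 bit, which marks the end of the length prefix and the start of the payload.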
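A short usage sketch of utf8.EncodeRune shows one rune from each length class, plus a surrogate value that takes the RuneError path. The encode wrapper is an illustrative helper, not part of the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode is a small illustrative wrapper (not a standard library function)
// that returns the UTF-8 bytes of r using utf8.EncodeRune.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, enough for any rune
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// 'A' takes 1 byte, 'é' 2 bytes, '界' 3 bytes, '😀' 4 bytes.
	// 0xD800 is a surrogate, so it is encoded as RuneError (U+FFFD).
	for _, r := range []rune{'A', 'é', '界', '😀', 0xD800} {
		fmt.Printf("%U -> % x\n", r, encode(r))
	}
}
```

Sizing the buffer with utf8.UTFMax avoids the out-of-range panic that EncodeRune would otherwise trigger when p is too short.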

Decoding

// DecodeRune unpacks the first UTF-8 encoding in p and returns the rune and
// its width in bytes. If p is empty it returns (RuneError, 0). Otherwise, if
// the encoding is invalid, it returns (RuneError, 1). Both are impossible
// results for correct, non-empty UTF-8.
//
// An encoding is invalid if it is incorrect UTF-8, encodes a rune that is
// out of range, or is not the shortest possible UTF-8 encoding for the
// value. No other validation is performed.
func DecodeRune(p []byte) (r rune, size int) {
  n := len(p)
  if n < 1 {
      return RuneError, 0
  }
  // Take the first byte
  p0 := p[0]
  // Use the first byte to look up its length information in the first table
  x := first[p0]
  if x >= as {
      // The following code simulates an additional check for x == xx and
      // handling the ASCII and invalid cases accordingly. This mask-and-or
      // approach prevents an additional branch.
      mask := rune(x) << 31 >> 31 // Create 0x0000 or 0xFFFF.
      return rune(p[0])&^mask | RuneError&mask, 1
  }
  sz := int(x & 7)
  accept := acceptRanges[x>>4]
  if n < sz {
      return RuneError, 1
  }
  b1 := p[1]
  if b1 < accept.lo || accept.hi < b1 {
      return RuneError, 1
  }
  if sz <= 2 { // <= instead of == to help the compiler eliminate some bounds checks
      return rune(p0&mask2)<<6 | rune(b1&maskx), 2
  }
  b2 := p[2]
  if b2 < locb || hicb < b2 {
      return RuneError, 1
  }
  if sz <= 3 {
      return rune(p0&mask3)<<12 | rune(b1&maskx)<<6 | rune(b2&maskx), 3
  }
  b3 := p[3]
  if b3 < locb || hicb < b3 {
      return RuneError, 1
  }
  return rune(p0&mask4)<<18 | rune(b1&maskx)<<12 | rune(b2&maskx)<<6 | rune(b3&maskx), 4
}
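As a usage sketch, the loop below walks a byte slice with utf8.DecodeRune. The runes helper is made up for this example. Because invalid input is reported as RuneError with a width of 1, advancing by the returned size always makes progress.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runes is an illustrative helper (not a standard library function) that
// decodes b one rune at a time with utf8.DecodeRune. Invalid bytes come
// back as utf8.RuneError with a width of 1, so the loop always terminates.
func runes(b []byte) []rune {
	var out []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		out = append(out, r)
		b = b[size:]
	}
	return out
}

func main() {
	fmt.Printf("%U\n", runes([]byte("a界")))      // [U+0061 U+754C]
	fmt.Printf("%U\n", runes([]byte{0xE7, 0x95})) // truncated 3-byte sequence
	fmt.Printf("%U\n", runes([]byte{0xC0, 0x80})) // overlong encoding, rejected
}
```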
