Notes from Hands-On Experience: TensorFlow 2.0 Syntax - Dataset Wrapping and Train/Test/Validation Splitting (Part 2)

jiezi

5 years ago

sklearn's train_test_split can only split data into two parts, so we need to split twice:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y,                # x, y are the original data
    test_size=0.2        # test_size defaults to 0.25
)  # returns the remaining training set + the test set

x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train,    # split the remaining x_train, y_train again
    test_size=0.2        # 0.2 of the remainder, i.e. 16% of the original data
)  # returns the final training set + the validation set
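Because the second split is applied to what is left over, its test_size is a fraction of the remainder, not of the original data. Here is a minimal sketch of the resulting proportions, assuming toy NumPy data; `three_way_split`, `valid_size_of_rest` and `seed` are hypothetical names for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 3 features.
x = np.arange(3000).reshape(1000, 3)
y = np.arange(1000)

def three_way_split(x, y, test_size, valid_size_of_rest, seed=0):
    """Split twice; valid_size_of_rest is a fraction of what remains after the test split."""
    x_rest, x_test, y_rest, y_test = train_test_split(
        x, y, test_size=test_size, random_state=seed)
    x_train, x_valid, y_train, y_valid = train_test_split(
        x_rest, y_rest, test_size=valid_size_of_rest, random_state=seed)
    return (x_train, y_train), (x_valid, y_valid), (x_test, y_test)

# 0.2, then 0.2 of the remaining 800 samples
sizes_a = [len(part[0]) for part in three_way_split(x, y, 0.2, 0.2)]
# 0.2, then 0.25 of the remaining 800 samples
sizes_b = [len(part[0]) for part in three_way_split(x, y, 0.2, 0.25)]

print(sizes_a)  # [640, 160, 200]  (train, valid, test)
print(sizes_b)  # [600, 200, 200]
```

So if you want a 60/20/20 split of the original data, the second call needs test_size=0.25 rather than 0.2.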

Once the data is split, you usually still need to batch and shuffle it. The fit() method of a tf.keras model handles both in a single call.
e.g.:

model.compile(
    loss=keras.losses.mean_squared_error, 
    optimizer=keras.optimizers.SGD(),
    metrics=['acc']    # note this metrics argument; it comes up again below
)

history = model.fit(
    x_train,
    y_train,
    validation_data=(x_valid, y_valid),     # the validation set is used here
    epochs=100,
    batch_size=32,       # optional, the default is already 32
    shuffle=True,        # optional, the default is already True
    # callbacks=callbacks,
)
eval_metrics = model.evaluate(x_test, y_test)    # returns the metrics (may include loss and acc)
# Why "may include": the return value depends on what you passed to model.compile().
#   With metrics=['acc'], evaluate() returns [loss, acc].
#   Without metrics, evaluate() returns a single loss value.

y_predict = model.predict(x_test)            # returns the predictions
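To make the point about the return value of evaluate() concrete, here is a minimal sketch, assuming a tiny untrained regression model and random toy data. `build_model`, `x_demo` and `y_demo` are hypothetical names, and 'mae' stands in for 'acc' because it is a more natural metric for an MSE loss.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Hypothetical toy regression data: 64 samples, 4 features.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(64, 4)).astype("float32")
y_demo = x_demo.sum(axis=1, keepdims=True)

def build_model(metrics=None):
    """Build and compile a one-layer model; metrics is forwarded to compile()."""
    model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
    model.compile(loss="mse", optimizer="sgd", metrics=metrics)
    return model

# No metrics passed to compile(): evaluate() gives back a single loss value.
loss_only = build_model().evaluate(x_demo, y_demo, verbose=0)
# One metric passed to compile(): evaluate() gives back [loss, mae].
loss_and_mae = build_model(metrics=["mae"]).evaluate(x_demo, y_demo, verbose=0)

print(np.ndim(loss_only))   # 0, a scalar
print(len(loss_and_mae))    # 2
```

If the position-based list is inconvenient, evaluate() also accepts return_dict=True in recent TensorFlow versions, which returns the values keyed by name.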

My own wrapper code. It does a three-way split, shuffles the dataset, and batches it, all in one place (it may have flaws).
Uploaded to GitHub: https://github.com/hacker-lin…
Definition:

import tensorflow as tf

class HandlerData:
    def __init__(self, x, y):
        """Wrapper class; the data is passed in at instantiation and stored."""
        self.x = x
        self.y = y

    def shuffle_and_batch(self, x, y, batch_size=None):
        """Always shuffles; batch_size is optional. Shuffling should really be optional too, but I got lazy."""
        data = tf.data.Dataset.from_tensor_slices((x, y))    # wrap the data as a Dataset

        data_ = data.shuffle(        # shuffle
            buffer_size=x.shape[0],  # per the official docs, a perfect shuffle needs buffer_size >= the number of samples
        )
        if batch_size:
            data_ = data_.batch(batch_size)
        return data_

    def train_test_valid_split(self, 
        test_size=0.2,                 # fraction used for the test set
        valid_size=0.2,                # fraction used for the validation set
        batch_size=32,                 # I set the default batch_size to 32
        is_batch_and_shuffle=True      # whether to shuffle and batch; enabled by default
    ):
    
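        sample_num = self.x.shape[0]    # total number of samples
        train_sample = int(sample_num * (1 - test_size - valid_size))  # size of the training set
        test_sample = int(sample_num * test_size)                      # size of the test set
        valid_train = int(sample_num * valid_size)                     # size of the validation set
        # Why all three are wrapped in int(): while debugging I found floating-point
        # precision loss, so they must be converted to integers.
        
        # tf.split() was covered in the previous post: it splits into n parts, each of a possibly different size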
        x_train, x_test, x_valid = tf.split(  
            self.x,
            num_or_size_splits=[train_sample, test_sample, valid_train],
            axis=0
        )
        y_train, y_test, y_valid = tf.split(
            self.y,
            [train_sample, test_sample, valid_train],
            axis=0
        )
        # The sizes are shared variables computed before splitting x and y, so x and y cannot get misaligned.
        if is_batch_and_shuffle:   # shuffle and batch; enabled by default, so this branch runs
            return (self.shuffle_and_batch(x_train, y_train, batch_size=batch_size),
                self.shuffle_and_batch(x_test, y_test, batch_size=batch_size),
                self.shuffle_and_batch(x_valid, y_valid, batch_size=batch_size),
            )
        else:    # pass is_batch_and_shuffle=False if you only want the raw split data
            return ((x_train, y_train),
                (x_test, y_test),
                (x_valid, y_valid)
            )
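One thing to watch in the size computation: truncating each share independently with int() can leave the three sizes short of the total when the sample count does not divide evenly, and tf.split expects the sizes to add up to the length of the axis. Here is a minimal pure-Python sketch of the problem and one possible workaround, where the training set absorbs the remainder. `naive_sizes` and `split_sizes` are hypothetical helpers, not part of the class above.

```python
def naive_sizes(sample_num, test_size=0.2, valid_size=0.2):
    """Truncate each share independently; the total may fall short of sample_num."""
    return (int(sample_num * (1 - test_size - valid_size)),
            int(sample_num * test_size),
            int(sample_num * valid_size))

def split_sizes(sample_num, test_size=0.2, valid_size=0.2):
    """Return (train, test, valid) sizes that always sum to sample_num."""
    test_n = int(sample_num * test_size)
    valid_n = int(sample_num * valid_size)
    train_n = sample_num - test_n - valid_n   # the remainder goes to training
    return train_n, test_n, valid_n

print(naive_sizes(1000), sum(naive_sizes(1000)))  # (600, 200, 200) 1000
print(naive_sizes(7), sum(naive_sizes(7)))        # (4, 1, 1) 6, one sample is lost
print(split_sizes(7), sum(split_sizes(7)))        # (5, 1, 1) 7
```

With 1000 samples both versions agree, which is why the usage example in this post works as written.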

Usage example:

x = tf.ones([1000, 5000])
y = tf.ones([1000, 1])

data_obj = HandlerData(x,y)   # x is the raw sample data, y is the raw label data

# Option 1: shuffle and batch; pass no arguments at all, everything uses the defaults
train, test, valid = data_obj.train_test_valid_split(
    # test_size=0.2, 
    # valid_size=0.2, 
    # batch_size=32, 
    # is_batch_and_shuffle=True
) # all of these arguments are optional; the values shown are the defaults.
print(train)
print(test)
print(valid)

# Output
>>> <BatchDataset shapes: ((None, 5000), (None, 1)), types: (tf.float32, tf.float32)>
>>> <BatchDataset shapes: ((None, 5000), (None, 1)), types: (tf.float32, tf.float32)>
>>> <BatchDataset shapes: ((None, 5000), (None, 1)), types: (tf.float32, tf.float32)>

# The batch dimension shows as None because it is not fixed statically (the last batch may be smaller); iterate to see the actual sizes
for x_train,y_train in train:
    print(x_train.shape,y_train.shape)

# Output: 600 // 32 == 18 (count them, exactly 18 full batches)
#         600 % 32 == 24 (the last batch holds 24)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(24, 5000) (24, 1)   # batches of 32; the last one holds the remaining 24

# Option 2: no shuffle, no batching, just the raw split data
(x_train, y_train), (x_test, y_test), (x_valid, y_valid) = data_obj.train_test_valid_split(
    # test_size=0.2,
    # valid_size=0.2,
    # batch_size=32,
    is_batch_and_shuffle=False    # just set this to False; the other arguments are optional
)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
print(x_valid.shape, y_valid.shape)

# Output
>>> (600, 5000) (600, 1)
>>> (200, 5000) (200, 1)
>>> (200, 5000) (200, 1)
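The shuffled and batched datasets from Option 1 can be handed to fit() directly. Here is a minimal sketch, assuming small random tensors and a one-layer model as stand-ins; `train_ds`, `valid_ds`, `x_small` and `y_small` are hypothetical and are built inline so the snippet runs on its own.

```python
import tensorflow as tf
from tensorflow import keras

# Hypothetical stand-ins for the datasets returned by train_test_valid_split().
x_small = tf.random.normal([100, 8])
y_small = tf.random.normal([100, 1])
train_ds = tf.data.Dataset.from_tensor_slices((x_small[:80], y_small[:80])).shuffle(80).batch(32)
valid_ds = tf.data.Dataset.from_tensor_slices((x_small[80:], y_small[80:])).batch(32)

model = keras.Sequential([keras.Input(shape=(8,)), keras.layers.Dense(1)])
model.compile(loss="mse", optimizer="sgd")

# The dataset already yields (features, labels) batches,
# so neither y nor batch_size is passed to fit().
history = model.fit(train_ds, validation_data=valid_ds, epochs=2, verbose=0)
print(sorted(history.history.keys()))  # ['loss', 'val_loss']
```

The test dataset can be passed to model.evaluate() in the same way.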

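The tf.data module wraps our data, or TF tensors, into a dataset.
The dataset comes with ready-made APIs that help with batching, shuffling, building iterators, and a series of other operations.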

dataset = tf.data.Dataset.from_tensor_slices(np.arange(16).reshape(4,4))
In principle (before reading anything out of it), the data is shaped like this: one big list containing 4 small lists.
[
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

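for data in dataset:   # a dataset must be iterated (or turned into an iterator with iter()) to yield values
    print(data)        # each iteration yields one inner list, e.g. the first one is [0, 1, 2, 3]
                       # But remember this is TensorFlow: every element is wrapped as a Tensor,
                       # so each item we iterate over is a Tensor.
Output:
>>    tf.Tensor([0 1 2 3], shape=(4,), dtype=int32)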
      tf.Tensor([4 5 6 7], shape=(4,), dtype=int32)
      tf.Tensor([8  9 10 11], shape=(4,), dtype=int32)
      tf.Tensor([12 13 14 15], shape=(4,), dtype=int32)

As noted, this data is one big list containing 4 small lists. Correspondingly, think of it as one big Tensor containing 4 small Tensors. Keep this idea in mind.
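The "one big Tensor containing 4 small Tensors" idea is easiest to see by comparing from_tensor_slices with from_tensors, which keeps the input whole. A minimal sketch, assuming the same 4x4 array; `matrix`, `sliced` and `whole` are names chosen here for illustration.

```python
import numpy as np
import tensorflow as tf

matrix = np.arange(16).reshape(4, 4)

sliced = tf.data.Dataset.from_tensor_slices(matrix)  # sliced along axis 0: 4 elements
whole = tf.data.Dataset.from_tensors(matrix)         # kept whole: 1 element

sliced_shapes = [t.numpy().shape for t in sliced]
whole_shapes = [t.numpy().shape for t in whole]

print(sliced_shapes)  # [(4,), (4,), (4,), (4,)]
print(whole_shapes)   # [(4, 4)]
```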

Passing a tuple:

question = [[1, 0], [1, 1]]
answer = ['encode', 'decoder']
dataset = tf.data.Dataset.from_tensor_slices((question, answer))  # wrapped in a tuple
for data in dataset:
    print(data[0],'=>' ,data[1])
Output:
>> tf.Tensor([1 0], shape=(2,), dtype=int32) => tf.Tensor(b'encode', shape=(), dtype=string)
   tf.Tensor([1 1], shape=(2,), dtype=int32) => tf.Tensor(b'decoder', shape=(), dtype=string)
   
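You can see that it automatically pairs up the two lists we passed, question and answer, as if zip() had been applied.
From my own experiments: when training an Encoder-Decoder model, the question-answer pairs can be passed as a tuple like this once they are encoded.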
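Since each element is a tuple, it can be unpacked directly in the loop, and the string component comes back as bytes that need decoding. A minimal sketch with the same data; `pairs_ds` and `pairs` are names introduced here. Note that the components of the tuple must have the same length along the first axis.

```python
import tensorflow as tf

question = [[1, 0], [1, 1]]
answer = ['encode', 'decoder']
pairs_ds = tf.data.Dataset.from_tensor_slices((question, answer))

# Unpack each (question, answer) element and convert back to plain Python values.
pairs = [(q.numpy().tolist(), a.numpy().decode('utf-8')) for q, a in pairs_ds]

print(pairs)  # [([1, 0], 'encode'), ([1, 1], 'decoder')]
```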

Passing a dict:

data_dict = {'encoder': [1, 0],
    'decoder': [1, 1]
}

dataset = tf.data.Dataset.from_tensor_slices(data_dict)
for data in dataset:    # each element is itself a dict
    print(data)

# The values are sliced along the first axis and converted to Tensors; the overall dict structure is unchanged
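The dict case is easy to misread, so here is a minimal sketch of what each element contains, using the same data; `dict_ds` and `rows` are names introduced here. Each value list is sliced by position, so element i holds the i-th entry of every key.

```python
import tensorflow as tf

data_dict = {'encoder': [1, 0], 'decoder': [1, 1]}
dict_ds = tf.data.Dataset.from_tensor_slices(data_dict)

# Convert every scalar Tensor back to a Python int for readability.
rows = [{key: int(value.numpy()) for key, value in element.items()}
        for element in dict_ds]

print(rows)
```

The first element pairs encoder=1 with decoder=1 and the second pairs encoder=0 with decoder=1, so the keys behave like column names rather than like separate samples.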

Most Dataset API operations can be chained (much like the replace method of Python strings).
Using the 4x4 dataset from earlier as sample data, here are a few of the APIs:

for data in dataset.batch(2):    
    print(data) 
Output:
>>    tf.Tensor([[0 1 2 3] [4 5 6 7]], shape=(2, 4), dtype=int32)
      tf.Tensor([[8  9 10 11] [12 13 14 15]], shape=(2, 4), dtype=int32)
                     
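As noted above, by default each item yielded by iteration is a Tensor, so the data above yields 4 Tensors.
After batch(2), every 2 items are grouped into one batch and wrapped as a single Tensor, so 4 / 2 = 2 batches, i.e. 2 Tensors.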
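When the number of items is not a multiple of the batch size, the last batch is smaller. A minimal sketch, assuming the same 4x4 data, showing that and the drop_remainder argument of batch(), which discards the short tail; `demo_ds`, `keep_tail` and `drop_tail` are names introduced here.

```python
import numpy as np
import tensorflow as tf

demo_ds = tf.data.Dataset.from_tensor_slices(np.arange(16).reshape(4, 4))

# 4 items in batches of 3: one full batch plus a leftover batch of 1.
keep_tail = [b.numpy().shape for b in demo_ds.batch(3)]
# drop_remainder=True discards the incomplete last batch.
drop_tail = [b.numpy().shape for b in demo_ds.batch(3, drop_remainder=True)]

print(keep_tail)  # [(3, 4), (1, 4)]
print(drop_tail)  # [(3, 4)]
```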

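Notes on repeat() (the argument is the total number of passes, counting the original):
    1. With no argument, repeat() repeats indefinitely.
    2. With an argument of 0, no data is produced.
    3. With an argument of 1, there is exactly one copy of the data.
    4. With an argument of 2, there are 2 copies in total (counting the original; that is what "repeat" means here).

for data in dataset.repeat(2).batch(3):   # repeat twice, group in threes (this is chaining)
    print(data)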

Output
>>  tf.Tensor([[0  1  2  3] [4  5  6  7] [8  9 10 11]], shape=(3, 4), dtype=int32)  
    tf.Tensor([[12 13 14 15] [0  1  2  3] [4  5  6  7]], shape=(3, 4), dtype=int32)
    tf.Tensor([[8  9 10 11] [12 13 14 15]], shape=(2, 4), dtype=int32)  
    
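    The original data has 4 items; repeated twice: 4 * 2 = 8.
    Chained batching in groups of 3: 8 / 3 = 2 remainder 2 (two full batches of 3, and the remainder forms the last batch).
    Note also that the repeats are concatenated in order, so a batch can span the boundary between one pass and the next (like the strip of lollipops from childhood: pull carelessly and you tear off the wrapper of the next one).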
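Because the calls are chained, their order matters. A minimal sketch, assuming a small range dataset for readability, comparing repeat-then-batch with batch-then-repeat; `numbers`, `repeat_then_batch` and `batch_then_repeat` are names introduced here.

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(4)   # elements 0, 1, 2, 3

# Repeat first: the two passes are joined, so a batch can straddle the boundary.
repeat_then_batch = [b.numpy().tolist() for b in numbers.repeat(2).batch(3)]
# Batch first: each pass is batched on its own, so no batch mixes two passes.
batch_then_repeat = [b.numpy().tolist() for b in numbers.batch(3).repeat(2)]

print(repeat_then_batch)  # [[0, 1, 2], [3, 0, 1], [2, 3]]
print(batch_then_repeat)  # [[0, 1, 2], [3], [0, 1, 2], [3]]
```

If every batch should stay inside a single pass over the data (one epoch), batching before repeating is one way to get that.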


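Train / test / validation splitting

Method 1: (using the third-party sklearn library)

Method 2: (tf.split)

Data processing (dataset)

Basic understanding

Data source argument types

Chaining

batch (batching)

repeat (reusing the data: the epoch idea, training for n rounds)

To be continued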