On Neural Networks: MindSpore Deep-Dive Series, Dataset Loading with CSVDataset


Dive Into MindSpore – CSVDataset For Dataset Load

MindSpore Deep-Dive Series – Dataset Loading with CSVDataset

Development environment for this article:

Ubuntu 20.04
Python 3.8
MindSpore 1.7.0

Outline:

1. A First Look at the API
2. Data Preparation
3. Two Kinds of Trial and Error
4. A Correct Example
5. Summary
6. Suggested Improvements
7. References

1. A First Look at the API

Following the usual tradition, let's start with the official documentation:

Parameter walkthrough:

dataset_files – path(s) to the dataset files; either a single file or a list of files
field_delim – field delimiter, default ","
column_defaults – a treacherous parameter; we will come back to it later
column_names – column names, used as the keys of the loaded data fields
num_parallel_workers – needs no further explanation
shuffle – whether to shuffle the data; three choices: [False, Shuffle.GLOBAL, Shuffle.FILES]
    Shuffle.GLOBAL – shuffle both the files and the data within the files (default)
    Shuffle.FILES – shuffle the files only
num_shards – needs no further explanation
shard_id – needs no further explanation
cache – needs no further explanation

2. Data Preparation

2.1 Data Download

Note: the data can be downloaded from UCI Machine Learning Repository: Iris Data Set. Use the following commands to download iris.data and iris.names into the target directory:

mkdir iris && cd iris
wget -c https://archive.ics.uci.edu/m…
wget -c https://archive.ics.uci.edu/m…
Note: if wget is unavailable on your system, consider downloading the files with a browser; the address is given above.

2.2 Data Overview

Iris, also known as the iris flower dataset, is a classic dataset for multivariate analysis. It contains 150 instances in 3 classes, 50 per class, each with 4 attributes. From the four attributes sepal length, sepal width, petal length, and petal width, one can predict which of the three species (Setosa, Versicolour, Virginica) an iris flower belongs to. For a more detailed description see the official documentation:

5. Number of Instances: 150 (50 in each of three classes)

6. Number of Attributes: 4 numeric, predictive attributes and the class

7. Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class:
      -- Iris Setosa
      -- Iris Versicolour
      -- Iris Virginica

8. Missing Attribute Values: None

Summary Statistics:

                  Min  Max   Mean   SD    Class Correlation
   sepal length:  4.3  7.9   5.84  0.83    0.7826
   sepal width:   2.0  4.4   3.05  0.43   -0.4194
   petal length:  1.0  6.9   3.76  1.76    0.9490  (high!)
   petal width:   0.1  2.5   1.20  0.76    0.9565  (high!)

9. Class Distribution: 33.3% for each of 3 classes.
2.3 Data Splitting

Here we split the data into a training set and a test set at a ratio of 4:1. The processing code is as follows:

from random import shuffle

def preprocess_iris_data(iris_data_file, train_file, test_file, header=True):
    cls_0 = "Iris-setosa"
    cls_1 = "Iris-versicolor"
    cls_2 = "Iris-virginica"

    cls_0_samples = []
    cls_1_samples = []
    cls_2_samples = []

    with open(iris_data_file, "r", encoding="UTF8") as fp:
        lines = fp.readlines()
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if cls_0 in line:
                cls_0_samples.append(line)
                continue
            if cls_1 in line:
                cls_1_samples.append(line)
                continue
            if cls_2 in line:
                cls_2_samples.append(line)

    shuffle(cls_0_samples)
    shuffle(cls_1_samples)
    shuffle(cls_2_samples)

    print("number of class 0: {}".format(len(cls_0_samples)), flush=True)
    print("number of class 1: {}".format(len(cls_1_samples)), flush=True)
    print("number of class 2: {}".format(len(cls_2_samples)), flush=True)

    train_samples = cls_0_samples[:40] + cls_1_samples[:40] + cls_2_samples[:40]
    test_samples = cls_0_samples[40:] + cls_1_samples[40:] + cls_2_samples[40:]

    header_content = "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Classes"

    with open(train_file, "w", encoding="UTF8") as fp:
        if header:
            fp.write("{}\n".format(header_content))
        for sample in train_samples:
            fp.write("{}\n".format(sample))

    with open(test_file, "w", encoding="UTF8") as fp:
        if header:
            fp.write("{}\n".format(header_content))
        for sample in test_samples:
            fp.write("{}\n".format(sample))


def main():
    iris_data_file = "{your_path}/iris/iris.data"
    iris_train_file = "{your_path}/iris/iris_train.csv"
    iris_test_file = "{your_path}/iris/iris_test.csv"

    preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file)


if __name__ == "__main__":
    main()

Save the code above to preprocess.py and run it with the following command (remember to adjust the data file paths):

python3 preprocess.py
The output is as follows:

number of class 0: 50
number of class 1: 50
number of class 2: 50
At the same time, iris_train.csv and iris_test.csv are generated in the target directory, which now looks like this:

.
├── iris.data
├── iris.names
├── iris_test.csv
└── iris_train.csv
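The 4:1 per-class split performed above can be sketched in isolation with toy data. The class names and counts mirror Iris, but the helper `stratified_split` is a hypothetical illustration, not part of the article's code:

```python
from random import seed, shuffle

def stratified_split(samples_by_class, train_per_class):
    """Shuffle each class independently, then put the first
    train_per_class samples of each class into the training set
    and the remainder into the test set."""
    train, test = [], []
    for samples in samples_by_class.values():
        samples = list(samples)
        shuffle(samples)
        train.extend(samples[:train_per_class])
        test.extend(samples[train_per_class:])
    return train, test

seed(0)  # fixed seed so the toy run is reproducible
# 50 samples per class, mirroring the Iris class counts
data = {c: ["{}-{}".format(c, i) for i in range(50)]
        for c in ("setosa", "versicolor", "virginica")}
train, test = stratified_split(data, train_per_class=40)
print(len(train), len(test))  # 120 30
```

With 40 of each class's 50 samples in the training set, the split is exactly 120:30, i.e. the 4:1 ratio used in the article.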

3. Two Kinds of Trial and Error

Below, a couple of "wrong" (note the quotes) usages give us a first acquaintance with CSVDataset.

3.1 What Kind of Thing Is column_defaults

First, a simple load. The code is as follows (shuffle is set to False so that readers can reproduce the results):

from mindspore.dataset import CSVDataset

def dataset_load(data_files):
    column_defaults = [float, float, float, float, str]
    column_names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Classes"]

    dataset = CSVDataset(
        dataset_files=data_files,
        field_delim=",",
        column_defaults=column_defaults,
        column_names=column_names,
        num_samples=None,
        shuffle=False)

    data_iter = dataset.create_dict_iterator()
    item = None
    for data in data_iter:
        item = data
        break

    print("====== sample ======\n{}".format(item), flush=True)


def main():
    iris_train_file = "{your_path}/iris/iris_train.csv"

    dataset_load(data_files=iris_train_file)


if __name__ == "__main__":
    main()

Save the code above to load.py and run it (remember to adjust the data file path):

python3 load.py

What? An error. Let's look at the traceback:

Traceback (most recent call last):
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 107, in <module>
    main()
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 103, in main
    dataset_load(data_files=iris_train_file)
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 75, in dataset_load
    dataset = CSVDataset(
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/validators.py", line 1634, in new_method
    raise TypeError("column type in column_defaults is invalid.")
TypeError: column type in column_defaults is invalid.
Let's look at the source that raises the error, line 1634 of mindspore/dataset/engine/validators.py. The relevant code:

# check column_defaults

    column_defaults = param_dict.get('column_defaults')
    if column_defaults is not None:
        if not isinstance(column_defaults, list):
            raise TypeError("column_defaults should be type of list.")
        for item in column_defaults:
            if not isinstance(item, (str, int, float)):
                raise TypeError("column type in column_defaults is invalid.")

3.1.1 Error Analysis

(For a fuller analysis of the column_defaults parameter, see Section 6.1.)

Remember the official parameter description? If not, here it is again: column_defaults (list, optional) – specifies the data type of each column; valid types are float, int, and string. Default: None, unspecified; if the parameter is not given, all columns are treated as string.

Clearly, the official description speaks of data types, yet the code in mindspore/dataset/engine/validators.py checks for instances of those types. With that understood, change the line

column_defaults = [float, float, float, float, str]
to the following (the numeric values are taken from iris.names; see that file for details):

column_defaults = [5.84, 3.05, 3.76, 1.20, "Classes"]
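Why the first attempt fails is easy to reproduce in plain Python: isinstance tests whether an object is an instance of str, int, or float, so the class objects float and str themselves fail the validator's check, while concrete values pass. Below is a minimal re-implementation of the check, mirroring the validator snippet quoted above rather than importing MindSpore's actual module:

```python
def check_column_defaults(column_defaults):
    # Mirrors the validator logic quoted above: every item must be a
    # str/int/float *value*; the type objects float and str are not.
    if not isinstance(column_defaults, list):
        raise TypeError("column_defaults should be type of list.")
    for item in column_defaults:
        if not isinstance(item, (str, int, float)):
            raise TypeError("column type in column_defaults is invalid.")

try:
    check_column_defaults([float, float, float, float, str])  # type objects
except TypeError as err:
    print(err)  # column type in column_defaults is invalid.

check_column_defaults([5.84, 3.05, 3.76, 1.20, "Classes"])  # values: passes
print("ok")
```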
Run the code again, and it fails again:

WARNING: Logging before InitGoogleLogging() is written to STDERR
[ERROR] MD(13306,0x70000269b000,Python):2022-06-14-16:51:59.681.109 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. Invalid csv, csv file: /Users/kaierlong/Downloads/iris/iris_train.csv parse failed at line 1, type does not match.
Line of code : 506
File : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_X86_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc

Traceback (most recent call last):
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 107, in <module>
    main()
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 103, in main
    dataset_load(data_files=iris_train_file)
  File "/Users/kaierlong/Documents/Codes/OpenI/Dive_Into_MindSpore/code/chapter_01/csv_dataset.py", line 90, in dataset_load
    for data in data_iter:
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 147, in __next__
    data = self._get_next()
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 211, in _get_next
    raise err
  File "/Users/kaierlong/Documents/PyEnv/env_ms_1.7.0/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 204, in _get_next
    return {k: self._transform_tensor(t) for k, t in self._iterator.GetNextAsMap().items()}
RuntimeError: Unexpected error. Invalid csv, csv file: /Users/kaierlong/Downloads/iris/iris_train.csv parse failed at line 1, type does not match.
Line of code : 506
File : /Users/jenkins/agent-working-dir/workspace/Compile_CPU_X86_MacOS_PY39/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc
All right, we will analyze this error in Section 3.2.

3.2 To Header or Not to Header

In 3.1 we worked out the correct usage of column_defaults from the error-raising source, yet one error still remains.

3.2.1 Error Analysis

The error message points to line 506 of mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc. The relevant source:

Status CsvOp::LoadFile(const std::string &file, int64_t start_offset, int64_t end_offset, int32_t worker_id) {
  CsvParser csv_parser(worker_id, jagged_rows_connector_.get(), field_delim_, column_default_list_, file);
  RETURN_IF_NOT_OK(csv_parser.InitCsvParser());
  csv_parser.SetStartOffset(start_offset);
  csv_parser.SetEndOffset(end_offset);

  auto realpath = FileUtils::GetRealPath(file.c_str());
  if (!realpath.has_value()) {
    MS_LOG(ERROR) << "Invalid file path," << file << "does not exist.";
    RETURN_STATUS_UNEXPECTED("Invalid file path," + file + "does not exist.");
  }

  std::ifstream ifs;
  ifs.open(realpath.value(), std::ifstream::in);
  if (!ifs.is_open()) {
    RETURN_STATUS_UNEXPECTED("Invalid file, failed to open" + file + ", the file is damaged or permission denied.");
  }
  if (column_name_list_.empty()) {
    std::string tmp;
    getline(ifs, tmp);
  }
  csv_parser.Reset();
  try {
    while (ifs.good()) {
      // when ifstream reaches the end of file, the function get() return std::char_traits<char>::eof()
      // which is a 32-bit -1, it's not equal to the 8-bit -1 on Euler OS. So instead of char, we use
      // int to receive its return value.
      int chr = ifs.get();
      int err = csv_parser.ProcessMessage(chr);
      if (err != 0) {
        // if error code is -2, the returned error is interrupted
        if (err == -2) return Status(kMDInterrupted);
        RETURN_STATUS_UNEXPECTED("Invalid file, failed to parse csv file:" + file + "at line" +
                                 std::to_string(csv_parser.GetTotalRows() + 1) +
                                 ". Error message:" + csv_parser.GetErrorMessage());
      }
    }
  } catch (std::invalid_argument &ia) {
    std::string err_row = std::to_string(csv_parser.GetTotalRows() + 1);
    RETURN_STATUS_UNEXPECTED("Invalid csv, csv file:" + file + "parse failed at line" + err_row +
                             ", type does not match.");
  } catch (std::out_of_range &oor) {
    std::string err_row = std::to_string(csv_parser.GetTotalRows() + 1);
    RETURN_STATUS_UNEXPECTED("Invalid csv," + file + "parse failed at line" + err_row + ": value out of range.");
  }
  return Status::OK();
}
Reading the source above, we see that the first line is consumed as a header only when column_name_list_ is empty, i.e. when no column_names were passed; since we did pass column_names, every line, including a header, is parsed as data. Recall that the data-splitting code of 2.3 wrote a header line into the files. Based on this diagnosis, modify the splitting code of 2.3, changing

preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file)
to

preprocess_iris_data(iris_data_file, iris_train_file, iris_test_file, header=False)
Run preprocess.py again to regenerate the data, then run load.py (no code change is needed this time). The output is as follows (the formatting has been adjusted for readability; the content is unchanged). The data is now read correctly: it contains 5 fields, keyed by the specified column_names.

====== sample ======
{'Sepal.Length': Tensor(shape=[], dtype=Float32, value= 5.5), 'Sepal.Width': Tensor(shape=[], dtype=Float32, value= 4.2), 'Petal.Length': Tensor(shape=[], dtype=Float32, value= 1.4), 'Petal.Width': Tensor(shape=[], dtype=Float32, value= 0.2),
'Classes': Tensor(shape=[], dtype=String, value= 'Iris-setosa')}
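Instead of regenerating the files without a header, one could also strip a header row before handing files to CSVDataset. Here is a minimal sketch, assuming the first column is numeric in every data row (a heuristic that fits Iris but not every CSV; `strip_header_if_present` is a hypothetical helper, not a MindSpore API):

```python
def strip_header_if_present(lines):
    """Drop the first line when its leading field is not numeric,
    i.e. when it looks like a header rather than an Iris data row."""
    first_field = lines[0].split(",")[0]
    try:
        float(first_field)
        return lines        # first row parses as data: keep everything
    except ValueError:
        return lines[1:]    # first row is a header: drop it

sample = [
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Classes",
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "4.9,3.0,1.4,0.2,Iris-setosa",
]
print(strip_header_if_present(sample)[0])  # 5.1,3.5,1.4,0.2,Iris-setosa
```

The cleaned lines can then be written back to a temporary file before constructing the dataset.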

4. A Correct Example

Through the two missteps in Section 3 we have gained a first understanding of CSVDataset. Careful readers may have noticed that one issue remains: the Classes field has not been numericalized. Below we introduce one way to numericalize it. The source code:

from mindspore.dataset import CSVDataset
from mindspore.dataset.text import Lookup, Vocab

def dataset_load(data_files):
    column_defaults = [5.84, 3.05, 3.76, 1.20, "Classes"]
    column_names = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Classes"]

    dataset = CSVDataset(
        dataset_files=data_files,
        field_delim=",",
        column_defaults=column_defaults,
        column_names=column_names,
        num_samples=None,
        shuffle=False)

    cls_to_id_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
    vocab = Vocab.from_dict(word_dict=cls_to_id_dict)
    lookup = Lookup(vocab)
    dataset = dataset.map(input_columns="Classes", operations=lookup)

    data_iter = dataset.create_dict_iterator()
    item = None
    for data in data_iter:
        item = data
        break

    print("====== sample ======\n{}".format(item), flush=True)


def main():
    iris_train_file = "{your_path}/iris/iris_train.csv"

    dataset_load(data_files=iris_train_file)


if __name__ == "__main__":
    main()

Save the code above to load.py and run it (remember to adjust the data file path):

python3 load.py
The output is as follows. Notes: the data contains 5 fields; the Classes field has been numericalized according to cls_to_id_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}; the conversion uses facilities from mindspore.dataset.text, which readers can explore on their own; a dedicated article on them will follow.

====== sample ======
{'Sepal.Length': Tensor(shape=[], dtype=Float32, value= 5.5), 'Sepal.Width': Tensor(shape=[], dtype=Float32, value= 4.2), 'Petal.Length': Tensor(shape=[], dtype=Float32, value= 1.4), 'Petal.Width': Tensor(shape=[], dtype=Float32, value= 0.2),
'Classes': Tensor(shape=[], dtype=Int32, value= 0)}
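What Lookup does to the Classes column amounts to a per-sample dictionary lookup; stripped of MindSpore, the transformation is just the following (a plain-Python equivalent for illustration; the rows list is made-up sample data, not output of the code above):

```python
cls_to_id = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Made-up sample rows standing in for what CSVDataset yields per record.
rows = [
    {"Sepal.Length": 5.5, "Classes": "Iris-setosa"},
    {"Sepal.Length": 6.3, "Classes": "Iris-virginica"},
]

# Equivalent of dataset.map(input_columns="Classes", operations=lookup):
# replace the string label in each row with its integer id.
for row in rows:
    row["Classes"] = cls_to_id[row["Classes"]]

print([row["Classes"] for row in rows])  # [0, 2]
```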
Follow-up: normalization of the other fields still remains and is left to the reader. The numericalization could also be done ahead of time by extending the data-splitting code; readers may try that as well.

5. Summary

This article explored MindSpore's CSVDataset interface through worked examples. The trial-and-error above suggests that CSVDataset's documentation and functionality are still relatively weak; it is usable, but no more.

6. Suggested Improvements

6.1 column_defaults Documentation Error

English documentation: column_defaults (list, optional) – List of default values for the CSV field (default=None). Each item in the list is either a valid type (float, int, or string). If this is not provided, treats all columns as string type.

Chinese documentation: column_defaults (list, optional) – specifies the data type of each column; valid types are float, int, and string. Default: None, unspecified. If this parameter is not given, all columns are treated as string.

The Chinese translation is wrong, and even the English API text is ambiguous: it first speaks of default values for each field (CSV files may contain empty fields), then says that if the parameter is absent all columns are treated as string, leaving it unclear whether the items should be type instances or types. Note: in fact both meanings apply. When column_defaults is specified, a field's default value is the value at the corresponding position in column_defaults, and the field's type is the Python type of that value. For example, given a CSV file with three fields and column_defaults = [2.0, 1, "x"], the three fields are read as float, int, and str, and if the second field of some row is empty, the default value 1 is filled in.

6.2 No Support for Files with a Header

As the title says: when column_names is specified, a header row cannot be skipped.

6.3 No Support for Reading Selected Columns

As the title says: the API does not support this explicitly, though it can be achieved through subsequent data processing.

7. References

mindspore.dataset.CSVDataset
mindspore/ccsrc/minddata/dataset/engine/datasetops/source/csv_op.cc

This is an original article; copyright belongs to the author. Reproduction without permission is prohibited.
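The dual role described in 6.1 (each item of column_defaults supplies both the column's type, via the value's Python type, and the fill-in value for empty fields) can be modeled in a few lines. This is a simplified sketch of the behavior, not MindSpore's actual parser; `parse_row` is a hypothetical helper:

```python
def parse_row(line, column_defaults, field_delim=","):
    """Parse one CSV line: cast each field to the type of the matching
    default, and substitute the default itself when the field is empty."""
    out = []
    for field, default in zip(line.split(field_delim), column_defaults):
        if field == "":
            out.append(default)               # empty field: fill with the default
        else:
            out.append(type(default)(field))  # cast to the default's Python type
    return out

defaults = [2.0, 1, "x"]                # float, int, and str columns
print(parse_row("3.5,,y", defaults))    # [3.5, 1, 'y']
```

In the printed example, the empty second field is replaced by the default 1, while the other fields are cast to float and str according to the defaults' types, matching the three-field example in 6.1.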
