About artificial intelligence: py-spy, a handy tool for debugging MindSpore dataset loading


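When writing MindSpore dataset loading code, you sometimes run into puzzling behavior: the code seems stuck somewhere and you cannot tell where, processing seems very slow, or you have accidentally written an infinite loop. Problems like these often cost a lot of time and effort to debug in Python. Is there a convenient tool that can help? Meet today's star: py-spy.

py-spy introduction

Quoting the official description: py-spy is a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way. py-spy is extremely low overhead: it is written in Rust for speed and doesn't run in the same process as the profiled Python program. This means py-spy is safe to use against production Python code.

Official GitHub repository: https://github.com/benfred/py… If you have related questions, you can also ask them on GitHub.

py-spy installation

pip install py-spy

After installing, run py-spy -h to verify the installation and view the usage help.

py-spy -h
py-spy 0.3.12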
Sampling profiler for Python programs

USAGE:
    py-spy <SUBCOMMAND>

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    record    Records stack trace information to a flamegraph, speedscope or raw file
    top       Displays a top like view of functions consuming CPU
    dump      Dumps stack traces for a target program to stdout
    help      Print this message or the help of the given subcommand(s)
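To make the phrase "sampling profiler" a bit more concrete, here is a toy sketch of the idea in pure Python. It is not how py-spy is implemented: py-spy is written in Rust and reads the target's memory from a separate process, whereas this sketch samples from a background thread inside the same process. The names busy_loop, sample_function_names and profile are made up for illustration, and sys._current_frames() is specific to CPython.

```python
import collections
import sys
import threading
import time


def sample_function_names(target_thread_id, interval, stop_event, counts):
    """Periodically record which function the target thread is executing."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def busy_loop(n):
    """Hypothetical slow function, similar in shape to do_something."""
    cnt = 0
    for _ in range(n):
        cnt += 1
    return cnt


def profile(fn, *args, interval=0.001):
    """Run fn(*args) while a sampler thread counts the innermost function seen."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_function_names,
        args=(threading.get_ident(), interval, stop_event, counts),
    )
    sampler.start()
    try:
        fn(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts


counts = profile(busy_loop, 10_000_000)
# The function that appears in the most samples is where the time went.
print(counts.most_common(3))
```

The more samples land in a function, the larger its share of the running time. That share is the kind of figure a sampling profiler reports.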

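py-spy basic features

py-spy is normally run from the command line, and you pass it either the PID of the process to analyze or the Python program to run. py-spy has three subcommands, record, top and dump:

record: generates a flame graph

top: shows a live view of the time spent in each function, with running totals

dump: shows the current call stack of every Python thread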
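If you would rather drive py-spy from a script than type the commands by hand, the three subcommands can be wrapped in a tiny helper. This is only a sketch: py_spy_command is a hypothetical name, the PID 12345 and the file name profile.svg are placeholders, and the helper only builds the command line without running it. The --pid and --output options are taken from py-spy's own help; check py-spy record -h on your version before relying on them.

```python
import shlex


def py_spy_command(subcommand, pid, output=None):
    """Build the argument list for running py-spy against an existing process."""
    if subcommand not in ("record", "top", "dump"):
        raise ValueError(f"unknown py-spy subcommand: {subcommand}")
    cmd = ["py-spy", subcommand, "--pid", str(pid)]
    if output is not None:
        if subcommand != "record":
            raise ValueError("only 'record' writes to an output file")
        cmd += ["--output", output]
    return cmd


# Placeholder PID and file name, for illustration only.
for cmd in (
    py_spy_command("record", 12345, output="profile.svg"),
    py_spy_command("top", 12345),
    py_spy_command("dump", 12345),
):
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) once py-spy is installed and the PID points at a real Python process.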

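Debugging MindSpore dataset loading with py-spy

Constructing a scenario where data iteration is very slow

Let's start with a classic piece of code, test_dataset.py:

import mindspore.dataset as ds
import numpy as np
import time


class DatasetGenerator:
    """Random-access data source that is deliberately slow."""

    def __init__(self):
        pass

    def __getitem__(self, item):
        # Every sample triggers the expensive call below.
        self.do_something()
        return (np.array([item]),)

    def do_something(self):
        cnt = 0
        for i in range(100000000):
            cnt += 1

    def __len__(self):
        return 50


def test_generator_0():
    data1 = ds.GeneratorDataset(DatasetGenerator(), ["data"])

    # Print how long each sample takes to come out of the iterator.
    start = time.time()
    for item in data1.create_dict_iterator(num_epochs=1, output_numpy=True):
        print("data time:", time.time() - start)
        start = time.time()


# Entry point so that "python test_dataset.py" actually runs the test.
if __name__ == "__main__":
    test_generator_0()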

Although this example is simple, the data really does come out very slowly during iteration. Running it prints:

data time: 5.431891679763794
data time: 5.6114866733551025
data time: 5.38549542427063
data time: 5.577831268310547
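For a rough sense of what timings like these mean for a whole dataset, here is a back-of-the-envelope sketch. It assumes every sample costs about the same and that samples are loaded one after another. The 5-second figure is the measured time rounded down, and estimate_total_seconds is a hypothetical helper name.

```python
def estimate_total_seconds(seconds_per_sample, num_samples):
    """Total load time if every sample costs the same and nothing runs in parallel."""
    return seconds_per_sample * num_samples


per_sample = 5  # seconds, rounded from the measurements above

small = estimate_total_seconds(per_sample, 50)
large = estimate_total_seconds(per_sample, 100_000)

print(f"50 samples: {small} s")
print(f"100,000 samples: {large} s, about {large / 3600:.1f} hours")
```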
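That is roughly 5 seconds per sample, so 50 samples take about 250 seconds! And this is only 50 samples. With 100,000 samples, the loading time would be unacceptable. So where exactly is the problem? (Anyone can see at a glance that do_something looks suspicious, but let's pretend we don't know that yet.) We need py-spy to help us locate it.

Using py-spy top to view the call stack and time share

Let's run the script again:

python test_dataset.py &

Note the & at the end: it makes the script run in the background, and the shell prints the PID of the new process.

Here the Python process ID is 116079. With this PID, we can have py-spy analyze the process live. Enter the following on the command line:

py-spy top --pid 116079

If you see this message:

Permission Denied: Try running again with elevated permissions by going 'sudo env "PATH=$PATH" !!'

just enter the suggested command:

sudo env "PATH=$PATH" !!

py-spy then lists the call stack along with each function's share of the running time.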

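Reading this call stack directly, do_something accounts for 100% of the running time and has already been running for more than 50 seconds, so something must be wrong. Either the processing logic is too slow, or there is an infinite loop. We can now jump straight to that part of the code to see what is going on:

def do_something(self):
    cnt = 0
    for i in range(100000000):
        cnt += 1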

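Good grief! There is a loop of 100 million additions here. Who wrote this code? (Not me.) Optimizing this part of the code will speed up the whole data pipeline.

Overall, py-spy is a handy tool that helps users quickly locate Python performance problems. Besides top, py-spy also provides dump, which prints the complete call stack of every thread to stdout, where it can be redirected to a local file for later analysis. When debugging a program that is already running in production, you only need to provide the process ID and py-spy can attach to it directly.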
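What "optimizing this part" looks like depends entirely on what the function is really meant to do. In this toy example the loop only counts its own iterations, so the result can be computed directly. The sketch below compares the two, with the iteration count scaled down from the article's 100,000,000 so that it finishes quickly. The names count_with_loop, count_directly and timed are hypothetical.

```python
import time


def count_with_loop(n):
    """Same shape as do_something: n pure-Python increments."""
    cnt = 0
    for _ in range(n):
        cnt += 1
    return cnt


def count_directly(n):
    """The loop only counts its own iterations, so the answer is simply n."""
    return n


def timed(fn, n):
    """Return fn(n) together with the wall-clock seconds it took."""
    start = time.perf_counter()
    result = fn(n)
    return result, time.perf_counter() - start


n = 1_000_000  # scaled down from 100_000_000 to keep the demo short
loop_result, loop_seconds = timed(count_with_loop, n)
direct_result, direct_seconds = timed(count_directly, n)

print(f"loop:   result={loop_result}, {loop_seconds:.4f} s")
print(f"direct: result={direct_result}, {direct_seconds:.6f} s")
```

After a change like this, running py-spy top again against the new process is a quick way to confirm that the hot spot has actually gone away.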
