DevSecOps: Meeting Advanced Requirements with SARIF in Practice

Abstract: To reduce the cost and complexity of aggregating the results of different analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF).

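This article is shared from the Huawei Cloud community post 《DevSecOps工具与平台交互的桥梁 -- SARIF进阶》 (SARIF Advanced: The Bridge Between DevSecOps Tools and Platforms), original author: Uncle_Tom.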

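1. Introduction

DevSecOps has become an important model for building secure development at enterprise scale. Static scanning tools integrated into the DevSecOps development process play a major role in raising the overall security of a product. To maximize the coverage of their security checks, development teams usually bring in several security scanning tools, but this creates more problems for developers and platforms. To reduce the cost and complexity of aggregating the results of different analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF). This article is the advanced part of a two-part series (introductory and advanced) on applying SARIF, and describes how SARIF meets deeper requirements in practice. For the basics of SARIF, see 《DevSecOps工具与平台间交互的桥梁–SARIF入门》 (SARIF Basics: The Bridge Between DevSecOps Tools and Platforms).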

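2. SARIF Advanced

The previous article covered some basic uses of SARIF. Here we look at how SARIF is used in more complex scenarios, so that it can serve as a complete reporting solution for static scanning tools.

The latest release (2021.03) of Coverity, a well-known static analysis tool, adds support for displaying Coverity scan results in GitHub repositories in SARIF format. Coverity, too, has therefore adapted to the SARIF format.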

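2.1. Using metadata

To keep scan reports from growing too large, information that is used repeatedly should be extracted as metadata, for example rules, rule messages, and the scanned content.

In the example below, the rule and its message are defined in tool.driver.rules. Each scan result (results) refers to the rule by its rule number, ruleId, and obtains the alert text through message.id. This avoids repeating the same message for every alert a rule produces and effectively reduces the size of the report.

As displayed in VS Code: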

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "This is the message text. It might be very long."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default"
          }
        }
      ]
    }
  ]
}
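The lookup a report consumer performs for such a result can be sketched in Python. This is a minimal illustration rather than a reference implementation: the helper names find_rule and message_text are made up for this sketch, and it assumes rules live only in tool.driver.rules (not in extensions).

```python
import json

# The example log from this section, trimmed to the fields the lookup needs.
SARIF_LOG = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "CodeScanner", "rules": [
      {"id": "CS0001",
       "messageStrings": {"default": {"text": "This is the message text. It might be very long."}}}
    ]}},
    "results": [{"ruleId": "CS0001", "ruleIndex": 0, "message": {"id": "default"}}]
  }]
}
""")


def find_rule(run, result):
    """Locate the rule for a result: try ruleIndex first, then fall back to ruleId."""
    rules = run["tool"]["driver"].get("rules", [])
    index = result.get("ruleIndex", -1)
    if 0 <= index < len(rules):
        return rules[index]
    for rule in rules:
        if rule["id"] == result.get("ruleId"):
            return rule
    return None


def message_text(run, result):
    """Return inline message text if present, otherwise look message.id up in the rule."""
    message = result["message"]
    if "text" in message:
        return message["text"]
    rule = find_rule(run, result)
    if rule is None:
        return None
    return rule["messageStrings"][message["id"]]["text"]


run = SARIF_LOG["runs"][0]
for result in run["results"]:
    print(result["ruleId"], "->", message_text(run, result))
```

Because the text is stored once per rule, a thousand results for CS0001 would each carry only the short ruleId and message.id pair.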

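2.2. Using message arguments

An alert often needs to name the specific variable or function involved in the code problem so that the user can understand it. Message arguments make this possible by letting the defect message vary from result to result.

In the example below, the rule's message is a template containing a placeholder ("{0}"), and the scan result (results) supplies the corresponding values through the arguments array. As displayed in VS Code: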

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "Variable '{0}' was used without being initialized."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default",
            "arguments": [
              "x"
            ]
          }
        }
      ]
    }
  ]
}
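How a viewer might expand such a template can be sketched as follows. The function name expand_message is hypothetical, and the sketch assumes placeholders are plain non-negative integers in braces, with doubled braces standing for literal braces.

```python
import re


def expand_message(template, arguments):
    """Replace placeholders {0}, {1}, ... with the matching entries of arguments."""
    def substitute(match):
        token = match.group(0)
        if token == "{{":          # doubled brace: emit a literal brace
            return "{"
        if token == "}}":
            return "}"
        return arguments[int(match.group(1))]

    return re.sub(r"\{\{|\}\}|\{(\d+)\}", substitute, template)


template = "Variable '{0}' was used without being initialized."
print(expand_message(template, ["x"]))
```

A regular expression is used here instead of str.format so that text which merely looks like a Python format field is left alone.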

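2.3. Using related locations in messages

Sometimes, to explain better why an alert was raised, the tool needs to give users additional reference information that helps them understand the problem: for example, where a variable was defined, where tainted data was introduced, or other supporting details.

In the example below, the related locations (relatedLocations) that accompany the location of the problem (locations) give the point where the tainted data was introduced. As displayed in VS Code, when the user clicks "here", the tool can jump to the location where the variable expr was introduced.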

{
  "ruleId": "PY2335",
  "message": {
    "text": "Use of tainted variable 'expr' (which entered the system [here](1)) in the insecure function 'eval'."
  },
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 4
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "id": 1,
      "message": {
        "text": "The tainted data entered the system here."
      },
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 3
        }
      }
    }
  ]
}
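The jump from the link text to a source position can be sketched as below. The helper name resolve_links is hypothetical, and the sketch handles only links whose target is an integer that matches the id of an entry in relatedLocations, as in the example above.

```python
import re

RESULT = {
    "ruleId": "PY2335",
    "message": {
        "text": "Use of tainted variable 'expr' (which entered the system "
                "[here](1)) in the insecure function 'eval'."
    },
    "relatedLocations": [
        {
            "id": 1,
            "message": {"text": "The tainted data entered the system here."},
            "physicalLocation": {
                "artifactLocation": {"uri": "3-Beyond-basics/bad-eval.py"},
                "region": {"startLine": 3},
            },
        }
    ],
}


def resolve_links(result):
    """Map each [text](id) link in the message to (text, uri, line)."""
    by_id = {loc["id"]: loc for loc in result.get("relatedLocations", [])}
    links = []
    for text, target in re.findall(r"\[([^\]]+)\]\((\d+)\)", result["message"]["text"]):
        related = by_id.get(int(target))
        if related is None:
            continue  # link points at an id that is not declared
        physical = related["physicalLocation"]
        links.append((text, physical["artifactLocation"]["uri"], physical["region"]["startLine"]))
    return links


print(resolve_links(RESULT))
```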

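2.4. Using defect classification information

Classifying defects is very important both for tools and for analyzing scan results. A tool can manage its rules by defect category, which makes it easier for users to select the rules they need. Users reading an analysis report can also filter the results quickly by category. A tool can follow an industry standard, such as the widely used Common Weakness Enumeration (CWE), or define its own categories. SARIF supports both.

Defect classification example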

{
  "version": "2.1.0",
  "runs": [
    {
      "taxonomies": [
        {
          "name": "CWE",
          "version": "3.2",
          "releaseDateUtc": "2019-01-03",
          "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82",
          "informationUri": "https://cwe.mitre.org/data/published/cwe_v3.2.pdf/",
          "downloadUri": "https://cwe.mitre.org/data/xml/cwec_v3.2.xml.zip",
          "organization": "MITRE",
          "shortDescription": {
            "text": "The MITRE Common Weakness Enumeration"
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "name": "Memory Leak",
              "shortDescription": {
                "text": "Missing Release of Memory After Effective Lifetime"
              },
              "defaultConfiguration": {
                "level": "warning"
              }
            }
          ],
          "isComprehensive": false
        }
      ],
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "supportedTaxonomies": [
            {
              "name": "CWE",
              "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
            }
          ],
          "rules": [
            {
              "id": "CA2101",
              "shortDescription": {
                "text": "Failed to release dynamic memory."
              },
              "relationships": [
                {
                  "target": {
                    "id": "401",
                    "guid": "10F28368-3A92-4396-A318-75B9743282F6",
                    "toolComponent": {
                      "name": "CWE",
                      "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
                    }
                  },
                  "kinds": [
                    "superset"
                  ]
                }
              ]
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CA2101",
          "message": {
            "text": "Memory allocated in variable 'p' was not released."
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "toolComponent": {
                "name": "CWE",
                "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
              }
            }
          ]
        }
      ]
    }
  ]
}

2.4.1. Introducing industry taxonomies (runs.taxonomies)

Definition of taxonomies

"taxonomies": {
  "description": "An array of toolComponent objects relevant to a taxonomy in which results are categorized.",
  "type": "array",
  "minItems": 0,
  "uniqueItems": true,
  "default": [],
  "items": {
    "$ref": "#/definitions/toolComponent"
  }
},

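The taxonomies node is an array, so several taxonomies can be defined. Its items refer to the toolComponent definition under the definitions node, the same definition used for the tool's scan engine (tool.driver) and tool extensions (tool.extensions). The reason for this design is the strong correlation between the engine and its results: sharing one definition keeps their properties consistent.

Defining a standard taxonomy
The example declares the industry taxonomy CWE through the runs.taxonomies node. Properties on the taxonomies node describe the taxonomy. The ones below are only a sample; see the SARIF specification for details:
name: the name of the taxonomy;
version: its version;
releaseDateUtc: its release date;
guid: a unique identifier, so that the taxonomy can be referenced elsewhere;
informationUri: where the taxonomy is documented;
downloadUri: the download address;
organization: the publishing organization;
shortDescription: a short description of the taxonomy.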

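2.4.2. Introducing custom classifications (runs.taxonomies.taxa)

taxa is an array node. To keep the report small, there is no need to place every classification under the taxa node; listing only those relevant to the current scan is enough. This is also why the isComprehensive node, which states whether the list is complete, defaults to false.

The example uses the taxa node to introduce one classification the tool needs, CWE-401 (memory leak), and identifies it uniquely with guid and id so that the tool can refer to it later from rules or defects.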

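2.4.3. Associating the tool with an industry taxonomy (tool.driver.supportedTaxonomies)

The tool object is associated with the declared industry taxonomy through the tool.driver.supportedTaxonomies node. The elements of supportedTaxonomies are toolComponentReference objects, because each taxonomy is itself a toolComponent object. The toolComponentReference.guid property matches the guid property of the taxonomy object defined in run.taxonomies[].

In the example, supportedTaxonomies.name is CWE, which indicates that the tool supports the CWE taxonomy, and the entry references the guid of taxonomies[0], A9282C88-F1FE-4A01-8137-E8D2A037AB82, which ties it to the industry taxonomy CWE.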
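A consumer can check that every supportedTaxonomies entry really points at a declared taxonomy. The sketch below is only an illustration; the function name unresolved_taxonomies is hypothetical, and comparing GUIDs case-insensitively is a choice made for this sketch.

```python
RUN = {
    "taxonomies": [
        {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"},
    ],
    "tool": {
        "driver": {
            "name": "CodeScanner",
            "supportedTaxonomies": [
                {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"},
            ],
        }
    },
}


def unresolved_taxonomies(run):
    """Return names of supportedTaxonomies entries whose guid matches no run.taxonomies item."""
    declared = {t["guid"].lower() for t in run.get("taxonomies", []) if "guid" in t}
    missing = []
    for reference in run["tool"]["driver"].get("supportedTaxonomies", []):
        if reference.get("guid", "").lower() not in declared:
            missing.append(reference.get("name"))
    return missing


print(unresolved_taxonomies(RUN))
```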

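2.5. Relating rules to defect classifications (rule.relationships)

Rules are defined under the tool.driver.rules node. rules is an array node, and each rule is defined by a reportingDescriptor object in that array;
Each rule (reportingDescriptor) has a relationships array whose elements are reportingDescriptorRelationship objects. Each one establishes a relationship from the rule to another reportingDescriptor object. The target of the relationship can be a taxon in a taxonomy (as in this example) or another rule in another tool component;
The target property of a relationship (reportingDescriptorRelationship) identifies the target of the relationship. Its value is a reportingDescriptorReference object, which refers to a reportingDescriptor inside a toolComponent;
The toolComponent property of the reportingDescriptorReference object is a toolComponentReference object that points to the taxonomy declared in the tool's supportedTaxonomies.

The figure below shows how the rule and the defect classification in the example are related: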

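2.5.1. Classifications in scan results (result.taxa)

In the scan results (run.results), each result has a taxa property. taxa is an array in which each element is a reportingDescriptorReference object that specifies a classification of the defect, in the same way that a rule refers to its classification. It follows that taxa can be omitted from a result and the defect's classification derived through its rule instead.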
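The two lookup paths, directly from result.taxa or indirectly through the rule's relationships, can be sketched as follows. The helper names find_taxon and taxa_for_result are hypothetical, and for brevity the sketch matches the taxonomy by name and the taxon by id rather than by guid.

```python
RUN = {
    "taxonomies": [{
        "name": "CWE",
        "taxa": [{"id": "401", "name": "Memory Leak"}],
    }],
    "tool": {"driver": {"name": "CodeScanner", "rules": [{
        "id": "CA2101",
        "relationships": [{
            "target": {"id": "401", "toolComponent": {"name": "CWE"}},
            "kinds": ["superset"],
        }],
    }]}},
    # This result carries no taxa of its own, so the rule is consulted.
    "results": [{
        "ruleId": "CA2101",
        "message": {"text": "Memory allocated in variable 'p' was not released."},
    }],
}


def find_taxon(run, reference):
    """Resolve a reportingDescriptorReference to a (taxonomy name, taxon) pair."""
    component = reference.get("toolComponent", {}).get("name")
    for taxonomy in run.get("taxonomies", []):
        if taxonomy["name"] != component:
            continue
        for taxon in taxonomy.get("taxa", []):
            if taxon["id"] == reference["id"]:
                return taxonomy["name"], taxon
    return None


def taxa_for_result(run, result):
    """Use result.taxa when present; otherwise fall back to the rule's relationships."""
    references = result.get("taxa")
    if not references:
        rules = run["tool"]["driver"].get("rules", [])
        rule = next((r for r in rules if r["id"] == result.get("ruleId")), None)
        references = [rel["target"] for rel in (rule or {}).get("relationships", [])]
    labels = []
    for reference in references:
        found = find_taxon(run, reference)
        if found:
            name, taxon = found
            labels.append(f"{name}-{taxon['id']} ({taxon['name']})")
    return labels


print(taxa_for_result(RUN, RUN["results"][0]))
```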

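2.6. Code flow

Some tools detect problems by simulating the execution of a program, sometimes across multiple threads of execution. SARIF models that execution as a sequence of locations, called a code flow. A SARIF code flow contains one or more thread flows, each of which describes the code locations visited on a single thread of execution, in chronological order.

2.6.1. Code flows of a defect (result.codeFlows)

Because a defect may have more than one code flow, the optional result.codeFlows property is an array of codeFlow objects.

"result": {
  "description": "A result produced by an analysis tool.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    ... ...
    "codeFlows": {
      "description": "An array of 'codeFlow' objects relevant to the result.",
      "type": "array",
      "minItems": 0,
      "uniqueItems": false,
      "default": [],
      "items": {
        "$ref": "#/definitions/codeFlow"
      }
    },
  }
}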

2.6.2. Thread flows of a code flow (codeFlow.threadFlows)

As the definition of codeFlow shows, each code flow consists of an array of thread flows (threadFlows), and threadFlows is required.

"codeFlow": {
  "description": "A set of threadFlows which together describe a pattern of code execution relevant to detecting a result.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    "message": {
      "description": "A message relevant to the code flow.",
      "$ref": "#/definitions/message"
    },
    "threadFlows": {
      "description": "An array of one or more unique threadFlow objects, each of which describes the progress of a program through a thread of execution.",
      "type": "array",
      "minItems": 1,
      "uniqueItems": false,
      "items": {
        "$ref": "#/definitions/threadFlow"
      }
    },
  },
  "required": [ "threadFlows" ]
},

2.6.3. Thread flow (threadFlow) and thread flow location (threadFlowLocation)

In each thread flow (threadFlow), an array of locations (locations) describes how the tool analyzed the code.

Definition of threadFlow:

"threadFlow": {
  "description": "Describes a sequence of code locations that specify a path through a single thread of execution such as an operating system or fiber.",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "id": {
    ...
    "message": {
    ...
    "initialState": {
    ...
    "immutableState": {
    ...
    "locations": {
      "description": "A temporally ordered array of 'threadFlowLocation' objects, each of which describes a location visited by the tool while producing the result.",
      "type": "array",
      "minItems": 1,
      "uniqueItems": false,
      "items": {
        "$ref": "#/definitions/threadFlowLocation"
      }
    },
    "properties": {
    ...
  },
  "required": [ "locations" ]
},

Definition of threadFlowLocation:
Each element of locations is a threadFlowLocation, which represents a code location the tool visited. The location being analyzed is given by its location property, which is of type location. A location can hold both physical and logical location information, so codeFlow can also represent analysis flows over binaries.

threadFlowLocation also has a state property, which can store the values of variables and expressions or symbol table information, or be used to describe a state machine.

"threadFlowLocation": {
  "description": "A location visited by an analysis tool while simulating or monitoring the execution of a program.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    "index": {
      "description": "The index within the run threadFlowLocations array.",
    ...
    "location": {
      "description": "The code location.",
      "$ref": "#/definitions/location"
    },
    "state": {
      "description": "A dictionary, each of whose keys specifies a variable or expression, the associated value of which represents the variable or expression value. For an annotation of kind 'continuation', for example, this dictionary might hold the current assumed values of a set of global variables.",
      "type": "object",
      "additionalProperties": {
        "$ref": "#/definitions/multiformatMessageString"
      }
    },
    ...
  }
},

2.6.4. Code flow example

Sample code

1. # 3-Beyond-basics/bad-eval-with-code-flow.py
2.
3. print("Hello, world!")
4. expr = input("Expression> ")
5. use_input(expr)
6.
7. def use_input(raw_input):
8.    print(eval(raw_input))

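The code above is a Python example of code injection.

On line 4, the user input is assigned to the variable expr;
On line 5, expr is passed into the function use_input as its first argument;
On line 8, the function print outputs the result, but the argument is first processed by eval(). Because the input reaches eval without any validation, this may introduce a code injection vulnerability.

The scan result below captures this analysis so that users can follow how the problem arises.

Scan result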

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "PythonScanner"
        }
      },
      "results": [
        {
          "ruleId": "PY2335",
          "message": {
            "text": "Use of tainted variable 'raw_input' in the insecure function 'eval'."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                },
                "region": {
                  "startLine": 8
                }
              }
            }
          ],
          "codeFlows": [
            {
              "message": {
                "text": "Tracing the path from user input to insecure usage."
              },
              "threadFlows": [
                {
                  "locations": [
                    {
                      "message": {
                        "text": "The tainted data enters the system here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 4
                          }
                        }
                      },
                      "state": {
                        "expr": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 0
                    },
                    {
                      "message": {
                        "text": "The tainted data is used insecurely here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 8
                          }
                        }
                      },
                      "state": {
                        "raw_input": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 1
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

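This is only a simple example. With SARIF's codeFlow we can describe much more complex analyses, helping users understand a problem better and decide on and make a fix quickly.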
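How a viewer might turn a code flow like the one in this section into a readable trace can be sketched as follows. The function name describe_code_flows and the output layout are choices made for this sketch only; the data is the thread flow from the scan result, trimmed to the fields used.

```python
URI = "3-Beyond-basics/bad-eval-with-code-flow.py"


def step(text, line, state, nesting):
    """Build one threadFlowLocation with the fields this sketch reads."""
    return {
        "message": {"text": text},
        "location": {"physicalLocation": {
            "artifactLocation": {"uri": URI},
            "region": {"startLine": line},
        }},
        "state": {name: {"text": value} for name, value in state.items()},
        "nestingLevel": nesting,
    }


RESULT = {
    "ruleId": "PY2335",
    "codeFlows": [{
        "message": {"text": "Tracing the path from user input to insecure usage."},
        "threadFlows": [{"locations": [
            step("The tainted data enters the system here.", 4, {"expr": "42"}, 0),
            step("The tainted data is used insecurely here.", 8, {"raw_input": "42"}, 1),
        ]}],
    }],
}


def describe_code_flows(result):
    """Return one line per visited location, indented by nestingLevel."""
    lines = []
    for code_flow in result.get("codeFlows", []):
        for thread_flow in code_flow["threadFlows"]:
            for number, visited in enumerate(thread_flow["locations"], start=1):
                physical = visited["location"]["physicalLocation"]
                uri = physical["artifactLocation"]["uri"]
                line = physical["region"]["startLine"]
                state = ", ".join(
                    f"{name}={value['text']}"
                    for name, value in visited.get("state", {}).items()
                )
                indent = "  " * visited.get("nestingLevel", 0)
                text = visited["message"]["text"]
                lines.append(f"{indent}{number}. {uri}:{line} {text} [{state}]")
    return lines


for entry in describe_code_flows(RESULT):
    print(entry)
```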

2.7. Defect fingerprint (fingerprint)

In a large software project, an analysis tool can produce thousands of results in a single run. To manage that many results, we need to record the existing defects, establish a scan baseline, and then work through the existing problems. Later scans must be compared with the baseline to tell whether new problems have been introduced. Deciding whether a result from a later run is logically the same as a baseline result requires an algorithm that uses information specific to the defect to build a stable identifier. We call this identifier a fingerprint. Because it captures what distinguishes one defect from the others, we also call it the defect fingerprint.

A defect fingerprint should be built from information about the defect that is relatively stable:

the name of the tool that produced the result;
the rule number;
the file system path of the analysis target. This should be a path relative to the project itself and should not include the location where the project is stored, because that may differ from machine to machine;
defect feature values (partialFingerprints).

Each scan result (result) in SARIF provides a group of properties for storing defect fingerprints, so that a defect management system can use them to identify each defect uniquely.

"result": {
  "description": "A result produced by an analysis tool.",
  "additionalProperties": false,
  "type": "object",
  "properties": {
    ... ...
    "guid": {
      "description": "A stable, unique identifier for the result in the form of a GUID.",
      "type": "string",
      "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
    },
    "correlationGuid": {
      "description": "A stable, unique identifier for the equivalence class of logically identical results to which this result belongs, in the form of a GUID.",
      "type": "string",
      "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
    },
    "occurrenceCount": {
      "description": "A positive integer specifying the number of times this logically unique result was observed in this run.",
      "type": "integer",
      "minimum": 1
    },
    "partialFingerprints": {
      "description": "A set of strings that contribute to the stable, unique identity of the result.",
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    "fingerprints": {
      "description": "A set of strings each of which individually defines a stable, unique identity for the result.",
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    ... ...
  }
}

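In some cases the inherent characteristics of a defect are not enough to identify a result uniquely. We then need to add property values that are strongly tied to the defect as extra input to the fingerprint calculation, so that the resulting fingerprint is unique. This is somewhat like a salt in cryptography, except that this salt must be reproducible: the next scan has to produce the same input for the same defect and therefore the same fingerprint as before. For example, suppose a tool checks documents for sensitive words and reports "xxx should not be used in the document." The word itself can then serve as a feature value of the defect.

SARIF provides the partialFingerprints property to hold such feature values so that analysis tools and other components in the SARIF ecosystem can use them. A defect management system can append them to the fingerprint it builds for each result. In the example above, the tool could set the value of a property in the partialFingerprints object to the prohibited word. The defect management system should include the information in partialFingerprints in its fingerprint calculation.

Only properties strongly tied to the nature of the defect, with relatively stable values, should be added to partialFingerprints. The line number where the defect occurs, for instance, is not suitable input for the fingerprint calculation, because line numbers change often. By the next scan a developer may well have added or removed lines above the problem, so the same problem appears at a different line in the new report. That changes the computed fingerprint and produces a spurious difference in the comparison.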
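One way a defect management system might combine these stable inputs is sketched below. It is an illustration under stated assumptions, not a prescribed algorithm: SHA-256 is an arbitrary choice of hash, and the names defect_fingerprint, make_result, DocScanner, DOC0001, docs/guide.md and the partialFingerprints key prohibitedWord/v1 are all hypothetical.

```python
import hashlib


def defect_fingerprint(tool_name, result):
    """Hash the stable parts of a result; line numbers are deliberately left out."""
    uri = result["locations"][0]["physicalLocation"]["artifactLocation"]["uri"]
    parts = [tool_name, result["ruleId"], uri]
    partial = result.get("partialFingerprints", {})
    # Sort the keys so the same features always hash in the same order.
    parts.extend(f"{key}={partial[key]}" for key in sorted(partial))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def make_result(word, line):
    """Build a hypothetical 'prohibited word' result reported at the given line."""
    return {
        "ruleId": "DOC0001",
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": "docs/guide.md"},  # project-relative path
            "region": {"startLine": line},
        }}],
        "partialFingerprints": {"prohibitedWord/v1": word},
    }


before = defect_fingerprint("DocScanner", make_result("foo", 10))
after = defect_fingerprint("DocScanner", make_result("foo", 14))
other = defect_fingerprint("DocScanner", make_result("bar", 10))

print(before == after)   # same word, moved four lines down
print(before == other)   # different word at the original line
```

Moving the finding to another line leaves the fingerprint unchanged, while a different prohibited word yields a different one.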

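Even though we try to find unique identifying characteristics for every defect, and add variable feature properties as well, it is still hard to design an algorithm that yields a truly stable fingerprint. In the example above, if the same sensitive word appears several times in one file, we still cannot give each alert its own identifier. The function name can be added as another input to the fingerprint, because function names are relatively stable within a program and help narrow down where in a file a given problem occurs. Even so, the same problem can occur several times inside one function. However hard we try to tell alerts apart, identical fingerprints will still turn up in real scans.

Fortunately, for practical purposes a fingerprint does not have to be absolutely stable. It only has to be stable enough to keep the number of results wrongly reported as "new" low enough that the development team can manage them without excessive effort.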
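The comparison against a baseline can be sketched with fingerprints treated as a multiset, which tolerates the duplicate fingerprints discussed above. The function name compare_with_baseline and the category labels are choices made for this sketch; the fingerprint strings are placeholders.

```python
from collections import Counter


def compare_with_baseline(baseline, current):
    """Classify fingerprints as new, absent, or unchanged, counting duplicates."""
    base, cur = Counter(baseline), Counter(current)
    return {
        "new": sorted((cur - base).elements()),        # only in the current scan
        "absent": sorted((base - cur).elements()),     # only in the baseline
        "unchanged": sorted((cur & base).elements()),  # present in both
    }


baseline_scan = ["fp-a", "fp-b", "fp-b"]   # two defects share fingerprint fp-b
current_scan = ["fp-b", "fp-c"]

print(compare_with_baseline(baseline_scan, current_scan))
```

Counting occurrences rather than using plain sets means that fixing one of two defects with identical fingerprints still shows up as one absent result.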

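3. Summary

SARIF defines a common, standard output format for static scanning tools and can meet the various requirements of their report output;
When static scanning tools are integrated into a DevSecOps platform, SARIF reduces the cost and complexity of aggregating scan results into a common workflow;
SARIF also makes it possible for IDEs to integrate results from different scanners and to offer a unified defect-handling module that covers displaying and fixing defects in the IDE. Tool vendors can then focus on finding problems and spend less effort adapting to each IDE;
SARIF has become an OASIS standard and is supported in the tools of major static analysis vendors such as Microsoft and GrammaTech. U.S. DHS and U.S. NIST also require SARIF as the report format in some evaluations and competitions for static checking tools;
Although SARIF was designed mainly for the results of static scanning tools, its design is general enough that some dynamic analysis tool vendors have also applied it successfully.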

4. References

Industry leaders collaborate to define SARIF interoperability standard for detecting software defects and vulnerabilities
OASIS Awards 2018 Open Standards Cup to KMIP for Key Management Security and SARIF for Static Analysis Tools
OASIS Static Analysis Results Interchange Format (SARIF) Technical Committee
SARIF Specification
SARIF Tutorials
Vscode Extension: Sarif Viewer
SARIF-SDK
Fortify FPR to SARIF
GrammaTech SARIF integration for GitHub
Static Analysis Results: A Format and a Protocol: SARIF & SASP
A brief discussion of language server, LSIF, SARIF, Babelfish, Semantic, Tree-sitter, Kythe, Glean, and others
