关于php:一起学习PHP中的Tidy扩展库

这个扩大预计很多同学可能都没听说过，这可不是泰迪熊呀，而是一个解决 HTML 相干操作的扩大，次要是能够用于 HTML、XHTML、XML 这类数据格式内容的格式化及展现。

Tidy 库扩大是随 PHP 一起公布的，也就是说，咱们能够在编译装置 PHP 时加上 –with-tidy 来一起装置这个扩大，也能够在预先通过源码包中 ext/ 文件夹下的 tidy 目录中的源码来进行装置。同时，Tidy 扩大还须要依赖一个 tidy 函数库，咱们须要在操作系统上装置，如果是 CentOS 的话，间接 yum install libtidy-devel 就能够了。

首先咱们来看一下如何通过这个 Tidy 扩大库来格式化一段 HTML 代码。

$content = <<<EOF
<html><head><title>test</title></head> <body><p>error<br>another line</i></body>
</html>
EOF;

$tidy = new Tidy();
$config = [
        'indent'=>true,
        'output-xhtml'=>true,
];
$tidy->parseString($content, $config);
$tidy->cleanRepair();

echo $tidy, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

咱们定义的 $content 中的这段 HTML 代码是没有任何格局的十分不标准的一段 HTML 代码。通过实例化一个 Tidy 对象之后，应用 parseString() 办法，并执行 cleanRepair() 办法之后，再间接打印 $tidy 对象，咱们就取得了格式化之后的 HTML 代码。看起来是不是十分地标准，不论是 xmlns 还是缩进格局都十分规范。

parseString() 办法有两个参数，第一个参数就是须要格式化的字符串。第二个参数是格式化的配置，这个配置接管的是一个数组，同时它外部的内容也必须是 Tidy 组件中所定义的那些配置信息。这些配置信息咱们能够在文后的第二条链接中进行查问。这里咱们只配置了两个内容，indent 示意是否利用缩进块级，output-xhtml 示意是否输入为 xhtml。

cleanRepair() 办法用于对已解析的内容执行革除和修复的操作，其实也就是格式化的清理工作。

留神咱们在测试代码中是间接打印的 Tidy 对象，也就是说，这个对象实现了 \_\_toString()，而它真正的样子其实是这样的。

var_dump($tidy);
// object(tidy)#1 (2) {//     ["errorBuffer"]=>
//     string(112) "line 1 column 1 - Warning: missing <!DOCTYPE> declaration
//   line 1 column 70 - Warning: discarding unexpected </i>"//     ["value"]=>
//     string(195) "<html xmlns="http://www.w3.org/1999/xhtml">
//     <head>
//       <title>
//         test
//       </title>
//     </head>
//     <body>
//       <p>
//         error<br />
//         another line
//       </p>
//     </body>
//   </html>"
//   }

var_dump($tidy->isXml()); // bool(false)

var_dump($tidy->isXhtml()); // bool(false)

var_dump($tidy->getStatus()); // int(1)

var_dump($tidy->getRelease());  // string(10) "2017/11/25"

var_dump($tidy->getHtmlVer()); // int(500)

咱们能够通过 Tidy 对象的属性获取一些对于待处理文档的信息，比方是否是 XML，是否是 XHTML 内容。

getStatus() 返回的是 Tidy 对象的状态信息，以后这个 1 示意的是有正告或辅助性能谬误的信息，从下面打印的 Tidy 对象的内容咱们就能够看出，在这个对象的 errorBuffer 属性中是有 warning 报警信息的。

getRelease() 返回的是以后 Tidy 组件的版本信息，也就是你在操作系统上装置的那个 tidy 组件的信息。getHtmlVer() 返回的是检测到的 HTML 版本，这里的 500 没有更多的阐明和介绍材料，不晓得这个 500 是什么意思。

除了下面的这些内容之后，咱们还能够取得后面 $config 中的配置信息及相干的阐明。

var_dump($tidy->getOpt('indent')); // int(1)

var_dump($tidy->getOptDoc('output-xhtml'));
// string(489) "This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. <br/>This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML, and will use the corrected value in output regardless of other sources. <br/>For XHTML, entities can be written as named or numeric entities according to the setting of <code>numeric-entities</code>. <br/>The original case of tags and attributes will be preserved, regardless of other options."

getOpt() 办法须要一个参数，也就是须要查问的 $config 中配置的信息内容，如果是查看咱们没有在 $config 中配置的参数的话，那么返回就都是默认的配置值。getOptDoc() 十分贴心，它返回的是对于某个参数的阐明文档。

最初，是更加干货的一些办法，能够间接操作节点。

echo $tidy->head(), PHP_EOL;
// <head>
//   <title>
//   test
// </title>
// </head>

$body = $tidy->body();

var_dump($body);
// object(tidyNode)#2 (9) {//     ["value"]=>
//     string(60) "<body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>"//     ["name"]=>
//     string(4) "body"
//     ["type"]=>
//     int(5)
//     ["line"]=>
//     int(1)
//     ["column"]=>
//     int(40)
//     ["proprietary"]=>
//     bool(false)
//     ["id"]=>
//     int(16)
//     ["attribute"]=>
//     NULL
//     ["child"]=>
//     array(1) {//       [0]=>
//       object(tidyNode)#3 (9) {//         ["value"]=>
//         string(37) "<p>
// ………………
// ………………

echo $tidy->html(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

echo $tidy->root(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

置信不须要过多地解释就可能看出，head() 返回的就是 <head> 标签外面的内容，而 body()、html() 也都是对应的相干标签，root() 返回的则是根结点的全部内容，能够看作是整个文档内容。

这些办法函数返回的内容其实都是一个 TidyNode 对象，这个咱们在前面再具体地阐明。

下面的操作代码咱们都是基于 parseString() 这个办法。它没有返回值，或者说返回的只是一个布尔类型的成功失败标识。如果咱们须要获取格式化之后的内容，只能间接将对象当做字符串或者应用 root() 来取得所有的内容。其实，还有一个办法间接就是返回一个格式化后的字符串的。

$tidy = new Tidy();
$repair = $tidy->repairString($content, $config);

echo $repair, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

repairString() 办法的参数和 parseString() 是截然不同的，惟一不同的就是它是返回的一个字符串，而不是在 Tidy 对象外部进行操作。

在最开始的测试代码中，咱们应用 var_dump() 打印 Tidy 对象时就看到了 errorBuffer 这个变量里是有错误信息的。这回咱们再来一个有更多问题的 HTML 代码片断。

$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<p>paragraph</p>
HTML;
$tidy = new Tidy();
$tidy->parseString($html);
$tidy->cleanRepair();

echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element

$tidy ->diagnose();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
// Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
// Info: Document content looks like XHTML 1.0 Strict
// Tidy found 3 warnings and 0 errors!

在这段测试代码中，咱们又应用了一个新的 diagnose() 办法，它的作用是对文档进行诊断测试，并且在 errorBuffer 这个对象变量中增加无关文档的更多信息。

之前咱们说到过，head()、html()、body()、root() 这几个办法返回的都是一个 TidyNode 对象，那么这个对象有什么非凡的中央吗？

$html = <<<EOF
<html><head>
<?php echo '<title>title</title>'; ?>
<#
  /* JSTE code */
  alert('Hello World');
#>
</head>
<body>

<?php
  // PHP code
  echo 'hello world!';
?>

<%
  /* ASP code */
  response.write("Hello World!")
%>

<!-- Comments -->
Hello World
</body></html>
Outside HTML
EOF;

$tidy = new Tidy();
$tidy->parseString($html);

$tidyNode = $tidy->html();

showNodes($tidyNode);

function showNodes($node){if($node->isComment()){echo '========', PHP_EOL,'This is Comment Node :"', $node->value, '"', PHP_EOL;}
    if($node->isText()){echo '--------', PHP_EOL,'This is Text Node :"', $node->value, '"', PHP_EOL;}
    if($node->isAsp()){echo '++++++++', PHP_EOL,'This is Asp Script :"', $node->value, '"', PHP_EOL;}
    if($node->isHtml()){echo '********', PHP_EOL,'This is HTML Node :"', $node->value, '"', PHP_EOL;}
    if($node->isPhp()){echo '########', PHP_EOL,'This is PHP Script :"', $node->value, '"', PHP_EOL;}
    if($node->isJste()){echo '@@@@@@@@', PHP_EOL,'This is JSTE Script :"', $node->value, '"', PHP_EOL;}

    if($node->name){// getParent()
        if($node->getParent()){echo '&&&&&&&&', $node->name ,'getParent is :', $node->getParent()->name, PHP_EOL;
        }

        // hasSiblings
        echo '^^^^^^^^', $node->name, 'has siblings is :';
        var_dump($node->hasSiblings());
        echo PHP_EOL;
    }

    if($node->hasChildren()){foreach($node->child as $child){showNodes($child);
        }
    }
}

// ………………
// ………………
// ********
// This is HTML Node :"<head>
// <?php echo '<title>title</title>'; ><#
//   /* JSTE code */
//   alert('Hello World');
// #>
// <title></title>
// </head>
// "
// &&&&&&&& head getParent is : html
// ^^^^^^^^ head has siblings is : bool(true)
// ………………
// ………………
// ++++++++
// This is Asp Script :"<%
//   /* ASP code */
//   response.write("Hello World!")
// %>" 
// ………………
// ………………

这段代码具体的测试步骤和各个函数的解释就不具体地一一列举阐明了。大家通过代码就可以看进去，咱们的 TidyNode 对象能够判断各个节点的内容，比方是否还有子结点、是否有兄弟结点。对象结点内容，能够判断结点的格局，是否是正文、是否是文本、是否是 JS 代码、是否是 PHP 代码、是否是 ASP 代码之类的内容。不晓得看到这里的你是什么感觉，反正我是感觉这个玩意就十分有意思了，特地是判断 PHP 代码这些的办法。

最初咱们再来看一下 Tidy 扩大库中的一些统计函数。

$html = <<<EOF
<p>test</i>
<bogustag>bogus</bogustag>
EOF;
$config = array('accessibility-check' => 3,'doctype'=>'bogus');
$tidy = new Tidy();
$tidy->parseString($html, $config);

echo 'tidy access count:', tidy_access_count($tidy), PHP_EOL;
echo 'tidy config count:', tidy_config_count($tidy), PHP_EOL;
echo 'tidy error count:', tidy_error_count($tidy), PHP_EOL;
echo 'tidy warning count:', tidy_warning_count($tidy), PHP_EOL;

// tidy access count: 4
// tidy config count: 2
// tidy error count: 1
// tidy warning count: 6

其实它们返回的这些数量都是一些错误信息的数量。tidy_access_count() 示意的是遇到的辅助性能正告数量，tidy_config_count() 是配置信息谬误的数量，另外两个从名字就看进去了，也就不必我多说了。

总之，Tidy 扩大库又是一个不太常见但十分有意思的库。对于某些场景，比方模板开发之类的性能来说还是有一些用武之地的。大家能够报着学习的心态好好再深刻的理解一下，说不定它正好就能解决你当初最辣手的问题哦！

测试代码：

https://github.com/zhangyue0503/dev-blog/blob/master/php/2021/01/source/8. 一起学习 PHP 中的 Tidy 扩大库.php

参考文档：

https://www.php.net/manual/zh/book.tidy.php

http://tidy.sourceforge.net/docs/quickref.html

关于php:一起学习PHP中的Tidy扩展库

对于 Tidy 库

Tidy 格式化

各种属性信息获取

间接转换为字符串

转换错误信息

TidyNode 操作

信息统计函数

总结