Using Regular Expressions in C

jiezi

5 years ago

A regular expression (often abbreviated in code as regex, regexp, or RE) is a single string that describes, and is used to match, a whole set of strings that follow some syntactic rule.

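In C, regular expressions are handled with the POSIX functions regcomp, regexec, regfree and regerror, all declared in <regex.h>. Working with a regular expression takes three steps:

Compile the regular expression with regcomp;
Match against it with regexec;
Release it with regfree.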

/*
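Description: regcomp compiles the regular expression string regex into a regex_t, which regexec later uses for searching.
Parameters:
    preg:   pointer to a regex_t structure.
    regex:  the regular expression string.
    cflags: one of the four values below, or several of them combined with bitwise OR (|).
        REG_EXTENDED: interpret regex using POSIX extended regular expression syntax. If not set, POSIX basic syntax is used.
        REG_ICASE:    ignore letter case.
        REG_NOSUB:    do not store the match results.
        REG_NEWLINE:  give newlines "special treatment"; explained in detail later.
Return value: 0 if compilation succeeded; non-zero if it failed, in which case regerror gives the error message.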
*/
int regcomp(regex_t *preg, const char *regex, int cflags);

/*
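Description: regexec matches a string against a compiled regular expression.
Parameters:
    preg:   pointer to a regex_t that was compiled by regcomp.
    string: the string to match against.
    nmatch: number of elements in the pmatch array.
    pmatch: array of regmatch_t that receives the offsets of the matched substrings. regmatch_t is defined as follows:
        typedef struct {
            regoff_t rm_so;
            regoff_t rm_eo;
        } regmatch_t;
        If rm_so is not -1, it is the offset in string at which the match starts; rm_eo is the offset of the first character after the end of the match.
    eflags: REG_NOTBOL, REG_NOTEOL, or the bitwise OR (|) of the two; covered later.
Return value: 0 if a match was found; non-zero otherwise (REG_NOMATCH when nothing matched, or an error code). regerror turns either into a message.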
*/
int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);

/*
Description: frees the memory that regcomp allocated for a compiled regular expression.
Parameters:
    preg: pointer to a regex_t that was compiled by regcomp.
*/
void regfree(regex_t *preg);

/*
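Description: when regcomp or regexec fails, it returns a non-zero error code; regerror turns that code into a readable message.
Parameters:
    errcode:     the value returned by regcomp or regexec.
    preg:        pointer to the regex_t that was passed to regcomp.
    errbuf:      buffer that receives the error message.
    errbuf_size: size of errbuf in bytes.
*/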
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

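int main(void)
{
    char ebuff[256];
    int ret;
    int cflags;
    regex_t reg;

    /* REG_NOSUB: only a yes/no answer is needed, not the match offsets */
    cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;

    const char *test_str = "Hello World";
    const char *reg_str = "H.*";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof(ebuff));
        fprintf(stderr, "%s\n", ebuff);
        return 1;   /* compilation failed, so there is nothing to free */
    }

    ret = regexec(&reg, test_str, 0, NULL, 0);
    if (ret)
    {
        /* REG_NOMATCH or an error; regerror gives the text for either */
        regerror(ret, &reg, ebuff, sizeof(ebuff));
        fprintf(stderr, "%s\n", ebuff);
        goto end;
    }

    /* ret is 0 here, so this prints the message for "no error" */
    regerror(ret, &reg, ebuff, sizeof(ebuff));
    fprintf(stderr, "result is:\n%s\n", ebuff);

end:
    regfree(&reg);

    return 0;
}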

Compile and run; the output is:

[root@zxy regex]# ./test 
result is:
Success

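The match succeeded.

What if I want to keep the matched text? That is what the regmatch_t structure is for. Let's rewrite the code above; this time the REG_NOSUB option can no longer be used. The code is as follows: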

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

int main(void)
{
    int i;
    char ebuff[256];
    int ret;
    int cflags;
    regex_t reg;
    regmatch_t rm[5];
    char *part_str = NULL;

    /* no REG_NOSUB this time: the match offsets are wanted */
    cflags = REG_EXTENDED | REG_ICASE;

    const char *test_str = "Hello World";
    const char *reg_str = "e(.*)o";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof(ebuff));
        fprintf(stderr, "%s\n", ebuff);
        return 1;   /* compilation failed, so there is nothing to free */
    }

    ret = regexec(&reg, test_str, 5, rm, 0);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof(ebuff));
        fprintf(stderr, "%s\n", ebuff);
        goto end;
    }

    regerror(ret, &reg, ebuff, sizeof(ebuff));
    fprintf(stderr, "result is:\n%s\n\n", ebuff);

    /* slots that did not take part in the match have rm_so == -1 */
    for (i = 0; i < 5; i++)
    {
        if (rm[i].rm_so > -1)
        {
            /* copy the substring [rm_so, rm_eo) out of test_str */
            part_str = strndup(test_str + rm[i].rm_so, rm[i].rm_eo - rm[i].rm_so);
            if (part_str == NULL)
                break;   /* out of memory */
            fprintf(stderr, "%s\n", part_str);
            free(part_str);
            part_str = NULL;
        }
    }

end:
    regfree(&reg);

    return 0;
}

Compile and run; the output is:

[root@zxy regex]# ./test 
result is:
Success

ello Wo
llo W

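Huh? I only wanted one match result, so why were two printed?

It turns out the first element of the regmatch_t array has a special meaning: it holds the start and end offsets of the substring matched by the entire regular expression. The elements after it hold the parenthesized subexpressions, in order. So when sizing the regmatch_t array, remember that it needs one more element than the number of subexpression results you want to keep.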

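That covers basic usage. Now for REG_NEWLINE, REG_NOTBOL and REG_NOTEOL. Many people find these three flags confusing, and so did I: yesterday someone asked me about them, I passed on my own wrong understanding, and got thoroughly scorned by an expert for it. I had always believed that the ^ and $ anchors only work when REG_NEWLINE is set. That is not the case.

First, here is what the man page says about REG_NEWLINE: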

REG_NEWLINE
   Match-any-character operators don’t match a newline.

   A non-matching list ([^...])  not containing a newline does not match a newline.

   Match-beginning-of-line operator (^) matches the empty string immediately after a newline, regardless of whether eflags, the  execution  flags  of regexec(), contains REG_NOTBOL.

   Match-end-of-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.

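My English is not great, so I ran it through Google Translate. In plain terms:

REG_NEWLINE
  1. Match-any-character operators (such as .) do not match a newline ('\n').
  2. A non-matching list ([^...]) that does not itself contain a newline does not match a newline.
  3. The beginning-of-line operator (^) matches the empty string immediately after a newline, whether or not eflags contains REG_NOTBOL when regexec() is called.
  4. The end-of-line operator ($) matches the empty string immediately before a newline, whether or not eflags contains REG_NOTEOL when regexec() is called.

That still did not tell me much, so let's test it with programs.

Point 1, the code is as follows: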

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

int main (void)
{

    int i;
    char ebuff[256];
    int ret;
    int cflags;

    regex_t reg;

    cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;

    char *test_str = "Hello World\n";
    char *reg_str = "Hello World.";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "1. %s\n", ebuff);
        goto end;
    }

    ret = regexec(&reg, test_str, 0, NULL, 0); 
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "2. %s\n", ebuff);

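    /* free the first compiled pattern before compiling again, or it leaks */
    regfree(&reg);
    cflags |= REG_NEWLINE;

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;   /* nothing is compiled at this point, so nothing to free */
    }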

    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "4. %s\n", ebuff);

end:
    regfree(&reg);

    return 0;
}

Compile and run; the output is:

[root@zxy regex]# ./test 
2. Success
4. No match

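The result is clear: without REG_NEWLINE the match succeeds, and with it the match fails. In other words, without REG_NEWLINE the match-any-character operator (.) matches '\n'; with REG_NEWLINE it does not.

Point 2, the code is as follows: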

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

int main (void)
{

    int i;
    char ebuff[256];
    int ret;
    int cflags;

    regex_t reg;

    cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;

    char *test_str = "Hello\nWorld";
    char *reg_str = "Hello[^ ]";   /* non-matching list: any character except a space */

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "1. %s\n", ebuff);
        goto end;
    }

    ret = regexec(&reg, test_str, 0, NULL, 0); 
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "2. %s\n", ebuff);

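    /* free the first compiled pattern before compiling again, or it leaks */
    regfree(&reg);
    cflags |= REG_NEWLINE;

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;   /* nothing is compiled at this point, so nothing to free */
    }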

    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "4. %s\n", ebuff);

end:
    regfree(&reg);

    return 0;
}

Compile and run; the output is:

[root@zxy regex]# ./test 
2. Success
4. No match

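What the result shows: without REG_NEWLINE, a non-matching list ([^...]) that does not contain '\n' matches '\n' just like any other character that is not in the list; with REG_NEWLINE, it no longer matches '\n'.

Point 3, the code is as follows: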

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

int main (void)
{

    int i;
    char ebuff[256];
    int ret;
    int cflags;

    regex_t reg;

    cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;

    char *test_str = "\nHello World";
    char *reg_str = "^Hello";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "1. %s\n", ebuff);
        goto end;
    }

    ret = regexec(&reg, test_str, 0, NULL, 0); 
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "2. %s\n", ebuff);

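    /* free the first compiled pattern before compiling again, or it leaks */
    regfree(&reg);
    cflags |= REG_NEWLINE;

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;   /* nothing is compiled at this point, so nothing to free */
    }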

    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "4. %s\n", ebuff);

end:
    regfree(&reg);

    return 0;
}

Compile and run; the output is:

[root@zxy regex]# ./test 
2. No match
4. Success

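What the result shows: without REG_NEWLINE, '^' matches only at the very start of the string, so it cannot match "Hello" when a '\n' comes before it. With REG_NEWLINE, '^' also matches immediately after a '\n', so text that follows a newline can be anchored with '^'.

Point 4, the code is as follows: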

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

int main (void)
{

    int i;
    char ebuff[256];
    int ret;
    int cflags;

    regex_t reg;

    cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;

    char *test_str = "Hello World\n";
    char *reg_str = "d$";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "1. %s\n", ebuff);
        goto end;
    }

    ret = regexec(&reg, test_str, 0, NULL, 0); 
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "2. %s\n", ebuff);

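    /* free the first compiled pattern before compiling again, or it leaks */
    regfree(&reg);
    cflags |= REG_NEWLINE;

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;   /* nothing is compiled at this point, so nothing to free */
    }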

    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "4. %s\n", ebuff);

end:
    regfree(&reg);

    return 0;
}

Compile and run; the output is:

[root@zxy regex]# ./test 
2. No match
4. Success

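What the result shows: without REG_NEWLINE, '$' matches only at the very end of the string, so it cannot match when a trailing '\n' follows the 'd'. With REG_NEWLINE, '$' also matches immediately before a '\n', so a string that ends with '\n' can be anchored with '$'.

That concludes the REG_NEWLINE tests. To summarize:

With the REG_NEWLINE option set:
1. The match-any-character operator (.) does not match '\n'.
2. A non-matching list ([^...]) that does not contain '\n' does not match '\n' either.
3. '^' also matches right after a '\n' and '$' also matches right before a '\n', so every line of a multi-line string can be anchored. Without the option, '^' and '$' still work, but only at the very start and the very end of the string.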

Now for REG_NOTBOL and REG_NOTEOL. First, here is what the man page says about these two options:

REG_NOTBOL
  The  match-beginning-of-line  operator always fails to match (but see the compilation flag REG_NEWLINE above) This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.
REG_NOTEOL
  The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above)

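Back to Google Translate. In plain terms:

REG_NOTBOL
  The beginning-of-line operator (^) always fails to match (but see REG_NEWLINE). This flag is meant for when different portions of a string are passed to regexec() and the start of the portion passed in should not be treated as the beginning of a line.
REG_NOTEOL
  The end-of-line operator ($) always fails to match (but see REG_NEWLINE). (By analogy: this is for when a portion of a string is passed to regexec() and the end of that portion should not be treated as the end of a line, even if the text before it would otherwise satisfy '$'.)

All right, on with the testing. The code for the first question is as follows: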

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

int main (void)
{

    int i;
    char ebuff[256];
    int ret;
    int cflags;

    regex_t reg;

    cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;

    char *test_str = "Hello World\n";
    char *reg_str = "^e";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, 256);
        fprintf(stderr, "1. %s\n", ebuff);
        goto end;
    }

    ret = regexec(&reg, test_str+1, 0, NULL, 0); 
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "2. %s\n", ebuff);

    ret = regexec(&reg, test_str+1, 0, NULL, REG_NOTBOL);
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "4. %s\n", ebuff);

end:
    regfree(&reg);

    return 0;
}

Compile and run; the output is:

[root@zxy regex]# ./test 
2. Success
4. No match

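What the result shows: without REG_NOTBOL, '^' matches at the start of whatever portion of the string is passed to regexec (here test_str+1, which begins with 'e'); with REG_NOTBOL, '^' does not match there.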

As for the second question, I really could not figure it out; everything I found online about it was unverified...
