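lilac-parser is a library I wrote in ClojureScript that can do some of what regular expressions do.
As the name suggests, it was designed more along the lines of a parser,
but in practice it also works fairly smoothly as a regex substitute, even if it is not as short and concise as a regex.
The main weakness of regexes is that they are written as strings, need escaping, and become hard to maintain once the rules grow long.
lilac-parser rules, on the other hand, are quite easy to compose. Here are some examples.
First is the is+ rule, which performs exact matching: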
(parse-lilac "x" (is+ "x")) ; {:ok? true, :rest nil}
(parse-lilac "xyz" (is+ "xyz")) ; {:ok? true, :rest nil}
(parse-lilac "xy" (is+ "x")) ; {:ok? true, :rest ("y")}
(parse-lilac "y" (is+ "x")) ; {:ok? false}
As you can see, every expression whose head matches returns true.
Whether there is any remaining content has to be checked separately through the :rest field.
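To make the result shape concrete, here is a minimal sketch of how an exact-match rule could be modeled in plain Clojure. This is not how lilac-parser is implemented; exact and run-rule are hypothetical names made up for illustration, and only the {:ok? ..., :rest ...} shape is borrowed from the examples above. On the JVM the leftover input shows up as characters such as \y, where ClojureScript would print "y".

```clojure
;; Hypothetical sketch, not lilac-parser's implementation.
;; A rule is a function from a seq of characters to a result map.
(defn exact
  "Returns a rule that succeeds when the input starts with `target`."
  [target]
  (fn [chars]
    (let [n (count target)]
      (if (= (apply str (take n chars)) target)
        ;; (seq ...) turns an empty remainder into nil
        {:ok? true, :value target, :rest (seq (drop n chars))}
        {:ok? false}))))

(defn run-rule
  "Applies `rule` to the string `text`."
  [text rule]
  (rule (seq text)))

(run-rule "x" (exact "x"))  ; => {:ok? true, :value "x", :rest nil}
(run-rule "xy" (exact "x")) ; => {:ok? true, :value "x", :rest (\y)}
(run-rule "y" (exact "x"))  ; => {:ok? false}
```

A caller that needs a full match would check both :ok? and that :rest is nil.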
Exact matching is the simple case, of course. Next comes matching any one of a set of characters:
(parse-lilac "x" (one-of+ "xyz")) ; {:ok? true}
(parse-lilac "y" (one-of+ "xyz")) ; {:ok? true}
(parse-lilac "z" (one-of+ "xyz")) ; {:ok? true}
(parse-lilac "w" (one-of+ "xyz")) ; {:ok? false}
(parse-lilac "xy" (one-of+ "xyz")) ; {:ok? true, :rest ("y")}
Conversely, there is an exclusion rule:
(parse-lilac "x" (other-than+ "abc")) ; {:ok? true, :rest nil}
(parse-lilac "xy" (other-than+ "abc")) ; {:ok? true, :rest ("y")}
(parse-lilac "a" (other-than+ "abc")) ; {:ok? false}
Building on these, we can add some logic, such as marking a rule as optional.
Since the rule is allowed to be absent, the result can always fall back to true:
(parse-lilac "x" (optional+ (is+ "x"))) ; {:ok? true, :rest nil}
(parse-lilac "" (optional+ (is+ "x"))) ; {:ok? true, :rest nil}
(parse-lilac "x" (optional+ (is+ "y"))) ; {:ok? true, :rest ("x")}
You can also set up a rule that matches repeatedly, meaning one or more times (the exact count cannot be controlled yet):
(parse-lilac "x" (many+ (is+ "x")))
(parse-lilac "xx" (many+ (is+ "x")))
(parse-lilac "xxx" (many+ (is+ "x")))
(parse-lilac "xxxy" (many+ (is+ "x")))
If zero occurrences should be allowed, the rule is some rather than many:
(parse-lilac "" (some+ (is+ "x")))
(parse-lilac "x" (some+ (is+ "x")))
(parse-lilac "xx" (some+ (is+ "x")))
(parse-lilac "xxy" (some+ (is+ "x")))
(parse-lilac "y" (some+ (is+ "x")))
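The difference between the two repetition rules can be sketched in plain Clojure. This is only an illustration under my own assumptions, not lilac-parser's code: exact, zero-or-more, and one-or-more are hypothetical names, and the :value key is something I added to collect what was matched.

```clojure
;; Hypothetical sketch, not lilac-parser's implementation.
(defn exact
  "Returns a rule that succeeds when the input starts with `target`."
  [target]
  (fn [chars]
    (let [n (count target)]
      (if (= (apply str (take n chars)) target)
        {:ok? true, :value target, :rest (seq (drop n chars))}
        {:ok? false}))))

(defn zero-or-more
  "Repeats `rule` until it fails; succeeds even with zero matches."
  [rule]
  (fn [chars]
    (loop [acc [], remaining chars]
      (let [result (rule remaining)]
        ;; also stop when nothing was consumed, to avoid looping forever
        (if (and (:ok? result) (not= (:rest result) remaining))
          (recur (conj acc (:value result)) (:rest result))
          {:ok? true, :value acc, :rest remaining})))))

(defn one-or-more
  "Like zero-or-more, but fails when there is no match at all."
  [rule]
  (fn [chars]
    (let [result ((zero-or-more rule) chars)]
      (if (empty? (:value result))
        {:ok? false}
        result))))

((one-or-more (exact "x")) (seq "xxxy"))
;; => {:ok? true, :value ["x" "x" "x"], :rest (\y)}
((one-or-more (exact "x")) (seq "y"))  ; => {:ok? false}
((zero-or-more (exact "x")) (seq "y")) ; => {:ok? true, :value [], :rest (\y)}
```

In this sketch the only difference between the two is whether an empty list of matches counts as success.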
Correspondingly, an or rule can be written:
(parse-lilac "x" (or+ [(is+ "x") (is+ "y")]))
(parse-lilac "y" (or+ [(is+ "x") (is+ "y")]))
(parse-lilac "z" (or+ [(is+ "x") (is+ "y")]))
combine, in turn, is for composing multiple rules in sequence:
(parse-lilac "xy" (combine+ [(is+ "x") (is+ "y")])) ; {:ok? true, :rest nil}
(parse-lilac "xyz" (combine+ [(is+ "x") (is+ "y")])) ; {:ok? true, :rest ("z")}
(parse-lilac "xy" (combine+ [(is+ "y") (is+ "x")])) ; {:ok? false}
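For readers curious how choice and sequencing might work underneath, here is a small sketch in plain Clojure. It rests on my own assumptions and is not taken from lilac-parser: exact, either, and in-sequence are hypothetical names, and :value is an extra key I use to carry the matched pieces.

```clojure
;; Hypothetical sketch, not lilac-parser's implementation.
(defn exact
  "Returns a rule that succeeds when the input starts with `target`."
  [target]
  (fn [chars]
    (let [n (count target)]
      (if (= (apply str (take n chars)) target)
        {:ok? true, :value target, :rest (seq (drop n chars))}
        {:ok? false}))))

(defn either
  "Tries each rule on the same input and returns the first success."
  [rules]
  (fn [chars]
    (or (some (fn [rule]
                (let [result (rule chars)]
                  (when (:ok? result) result)))
              rules)
        {:ok? false})))

(defn in-sequence
  "Runs the rules one after another, each on what the previous one left over."
  [rules]
  (fn [chars]
    (loop [acc [], remaining chars, pending rules]
      (if (empty? pending)
        {:ok? true, :value acc, :rest remaining}
        (let [result ((first pending) remaining)]
          (if (:ok? result)
            (recur (conj acc (:value result)) (:rest result) (rest pending))
            {:ok? false}))))))

((either [(exact "x") (exact "y")]) (seq "y"))
;; => {:ok? true, :value "y", :rest nil}
((in-sequence [(exact "x") (exact "y")]) (seq "xyz"))
;; => {:ok? true, :value ["x" "y"], :rest (\z)}
((in-sequence [(exact "y") (exact "x")]) (seq "xy"))
;; => {:ok? false}
```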
interleave takes two rules and repeats them alternately, one after the other.
This comes up a lot when handling comma-separated expressions:
(parse-lilac "xy" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "xyx" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "xyxy" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "yxy" (interleave+ (is+ "x") (is+ "y")))
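The comma-separated case can be sketched in plain Clojure as follows. This is an illustration under my own assumptions, not lilac-parser's implementation: exact and interleaved are hypothetical names, and leaving a trailing separator unconsumed is a choice I made for this sketch, which may differ from what interleave+ does.

```clojure
;; Hypothetical sketch, not lilac-parser's implementation.
(defn exact
  "Returns a rule that succeeds when the input starts with `target`."
  [target]
  (fn [chars]
    (let [n (count target)]
      (if (= (apply str (take n chars)) target)
        {:ok? true, :value target, :rest (seq (drop n chars))}
        {:ok? false}))))

(defn interleaved
  "Matches item, then any number of separator+item pairs,
  collecting every matched value in order."
  [item separator]
  (fn [chars]
    (let [first-result (item chars)]
      (if-not (:ok? first-result)
        {:ok? false}
        (loop [acc [(:value first-result)], remaining (:rest first-result)]
          (let [sep-result (separator remaining)
                ;; only try the next item when the separator matched
                item-result (when (:ok? sep-result) (item (:rest sep-result)))]
            (if (:ok? item-result)
              (recur (conj acc (:value sep-result) (:value item-result))
                     (:rest item-result))
              {:ok? true, :value acc, :rest remaining})))))))

(def comma-list (interleaved (exact "a") (exact ",")))

(:value (comma-list (seq "a,a,a")))              ; => ["a" "," "a" "," "a"]
(take-nth 2 (:value (comma-list (seq "a,a,a")))) ; => ("a" "a" "a")
(:rest (comma-list (seq "a,")))                  ; => (\,)
```

Since items and separators alternate in the collected values, take-nth 2 keeps just the items.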
The current code also ships a few built-in rules for checking letters, digits, and Chinese characters:
(parse-lilac "a" lilac-alphabet)
(parse-lilac "A" lilac-alphabet)
(parse-lilac "." lilac-alphabet) ; {:ok? false}
(parse-lilac "1" lilac-digit)
(parse-lilac "a" lilac-digit) ; {:ok? false}
(parse-lilac "汉" lilac-chinese-char)
(parse-lilac "E" lilac-chinese-char) ; {:ok? false}
(parse-lilac "," lilac-chinese-char) ; {:ok? false}
For other specific special characters, for now the only way is to specify them by unicode range:
(parse-lilac "a" (unicode-range+ 97 122))
(parse-lilac "z" (unicode-range+ 97 122))
(parse-lilac "A" (unicode-range+ 97 122))
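The numbers 97 and 122 above are the character codes of a and z. A minimal sketch of a range check, written for Clojure on the JVM, could look like this. It is my own illustration and not lilac-parser's implementation; in-code-range and lower-case-letter are hypothetical names.

```clojure
;; Hypothetical sketch (JVM Clojure), not lilac-parser's implementation.
(defn in-code-range
  "Returns a rule matching one character whose code lies in [low, high]."
  [low high]
  (fn [chars]
    (let [c (first chars)]
      ;; c is nil on empty input, which makes the rule fail
      (if (and c (<= low (int c) high))
        {:ok? true, :value (str c), :rest (next chars)}
        {:ok? false}))))

(def lower-case-letter (in-code-range 97 122)) ; 97 is \a, 122 is \z

(lower-case-letter (seq "a")) ; => {:ok? true, :value "a", :rest nil}
(lower-case-letter (seq "z")) ; => {:ok? true, :value "z", :rest nil}
(lower-case-letter (seq "A")) ; => {:ok? false}, since \A is 65
```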
With these rules in hand, you can combine them to emulate what regexes do, for example finding how many matches there are:
(find-lilac "write cumulo and respo" (or+ [(is+ "cumulo") (is+ "respo")]))
; find 2
(find-lilac "write cumulo and phlox" (or+ [(is+ "cumulo") (is+ "respo")]))
; find 1
(find-lilac "write cumulo and phlox" (or+ [(is+ "cirru") (is+ "respo")]))
; find 0
Or do string replacement directly, which works much like a regex:
(replace-lilac "cumulo project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my"))
; "my project"
(replace-lilac "respo project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my"))
; "my project"
(replace-lilac "phlox project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my"))
; "phlox project"
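For comparison, here is roughly what the same counting and replacing looks like with ordinary regexes in Clojure, using only re-seq and clojure.string/replace. The pattern cumulo|respo is my regex rendering of the or+ rule above.

```clojure
(require '[clojure.string :as string])

;; Counting matches with a regex alternation.
(count (re-seq #"cumulo|respo" "write cumulo and respo")) ; => 2
(count (re-seq #"cumulo|respo" "write cumulo and phlox")) ; => 1
(count (re-seq #"cirru|respo" "write cumulo and phlox"))  ; => 0

;; Replacing matches; the replacement may be a string or a function of the match.
(string/replace "cumulo project" #"cumulo|respo" "my")               ; => "my project"
(string/replace "respo project" #"cumulo|respo" (fn [matched] "my")) ; => "my project"
(string/replace "phlox project" #"cumulo|respo" (fn [matched] "my")) ; => "phlox project"
```

The regex is shorter, but the alternation lives inside a pattern literal, while the or+ version is built from ordinary values that can be named and reused.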
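As you can see, rules here are built up by composition. They are longer to write than a regex, but you can define variables and build abstractions.
Simple examples may not show why this is useful; it may just look longer, and slower on top of that.
My project includes a simple JSON parsing example, which is something a regex really cannot handle...
The code is copied over directly below: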
; match the two cases true and false, returning a boolean
(def boolean-parser
  (label+ "boolean" (or+ [(is+ "true") (is+ "false")] (fn [x] (if (= x "true") true false)))))
; zero or more spaces, with the matched value discarded
(def space-parser (label+ "space" (some+ (is+ " ") (fn [x] nil))))
; compose a parser for a comma surrounded by whitespace; label+ is only an annotation and can be ignored
(def comma-parser
  (label+ "comma" (combine+ [space-parser (is+ ",") space-parser] (fn [x] nil))))
; one or more digits, joined back into a string (assuming string is the usual clojure.string alias)
(def digits-parser (many+ (one-of+ "0123456789") (fn [xs] (string/join "" xs))))
; for simplicity, null and undefined both just return nil
(def nil-parser (label+ "nil" (or+ [(is+ "null") (is+ "undefined")] (fn [x] nil))))
; for numbers, allow for a possible leading minus sign and a possible decimal part
; I was lazy here and did not handle scientific notation...
(def number-parser
  (label+
   "number"
   (combine+
    ; the minus sign, which is optional
    [(optional+ (is+ "-"))
     digits-parser
     ; compose the decimal part, which is also optional
     (optional+ (combine+ [(is+ ".") digits-parser] (fn [xs] (string/join "" xs))))]
    (fn [xs] (js/Number (string/join "" xs))))))
(def string-parser
  (label+
   "string"
   (combine+
    ; parsing a string: it starts with a quote and ends with a quote
    [(is+ "\"")
     ; in between are non-quote characters, or escape sequences
     (some+ (or+ [(other-than+ "\"\\") (is+ "\\\"") (is+ "\\\\") (is+ "\\n")]))
     (is+ "\"")]
    (fn [xs] (string/join "" (nth xs 1))))))
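; forward declarations, so that value-parser+ can refer to parsers defined further down
(declare array-parser+ object-parser+)
(defparser
 value-parser+
 ()
 identity
 (or+
  [number-parser string-parser nil-parser boolean-parser (array-parser+) (object-parser+)]))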
(defparser
 object-parser+
 ()
 identity
 (combine+
  [(is+ "{")
   (optional+
    ; objects are more involved; mostly look at the interleave+ part, the outside only handles the braces
    (interleave+
     (combine+
      [string-parser space-parser (is+ ":") space-parser (value-parser+)]
      (fn [xs] [(nth xs 0) (nth xs 4)]))
     comma-parser
     (fn [xs] (take-nth 2 xs))))
   (is+ "}")]
  (fn [xs] (into {} (nth xs 1)))))
(defparser
 array-parser+
 ()
 (fn [x] (vec (first (nth x 1))))
 (combine+
  [(is+ "[")
   ; arrays are likewise an interleave+ case
   (some+ (interleave+ (value-parser+) comma-parser (fn [xs] (take-nth 2 xs))))
   (is+ "]")]))
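As you can see, by building up rules with lilac-parser it was fairly easy to produce a JSON parser.
The supported rules are fairly simple and the performance is far from ideal, but compared with regexes this code is a lot more readable.
I believe it can serve as an approach for many text-processing scenarios.
In the future I may provide a simplified version that can be used directly from JavaScript as a replacement for regexes.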