实现拼写检查器(spell check)

本文同时发在我的github博客上，欢迎star
在百度或者Google搜索的时候，有时会小手一抖，打错了个别字母，比如我们想搜索apple，错打成了appel，但神奇的是，即使我们敲下回车，搜索引擎也会自动搜索apple而不是appel，这是怎么实现的呢？本文就将从头实现一个JavaScript版的拼写检查器
基础理论
首先，我们要确定如何量化敲错单词的概率，我们将原本想打出的单词设为origin(O)，错打的单词设为error(E)
由贝叶斯定理我们可知：P(O|E)=P(O)*P(E|O)/P(E)
P(O|E)是我们需要的结果，也就是在打出错误单词E的情况下，原本想打的单词是O的概率
P(O)我们可以看作是O出现的概率，是先验概率，这个我们可以从大量的语料环境中获取
P(E|O)是原本想打单词O却打成了E的概率，这个可以用最短编辑距离模拟概率，比如原本想打的单词是apple，打成applee(最短编辑距离为1)的概率比appleee(最短编辑距离为2)自然要大
P(E)由于我们已知E，这个概念是固定的，而我们需要对比的是P(O1|E)、P(O2|E)…P(On|E)的概率，不需要精确的计算值，我们可以不用管它
具体实现
这部分的实现我参考了natural的代码，传送门
首先是构造函数：
function SpellCheck(priorList) {
//to do trie
this.priorList = priorList;
this.priorHash = {};
priorList.forEach(item => {
!this.priorHash[item] && (this.priorHash[item] = 0);
this.priorHash[item]++;
});
}
priorList是语料库，在构造函数中我们对priorList中的单词进行了出现次数的统计，这也就可以被我们看作是先验概率P(O)
接下来是check函数，用来检测这个单词是否在语料库中出现
SpellCheck.prototype.check = function(word) {
return this.priorList.indexOf(word) !== -1;
};
然后我们需要获取单词指定编辑距离内的所有可能性：
SpellCheck.prototype.getWordsByMaxDistance = function(wordList, maxDistance) {
if (maxDistance === 0) {
return wordList;
}
const listLength = wordList.length;
wordList[listLength] = [];
wordList[listLength – 1].forEach(item => {
wordList[listLength].push(…this.getWordsByOneDistance(item));
});
return this.getWordsByMaxDistance(wordList, maxDistance – 1);
};
SpellCheck.prototype.getWordsByOneDistance = function(word) {
const alphabet = “abcdefghijklmnopqrstuvwxyz”;
let result = [];
for (let i = 0; i < word.length + 1; i++) {
for (let j = 0; j < alphabet.length; j++) {
//插入
result.push(
word.slice(0, i) + alphabet[j] + word.slice(i, word.length)
);
//替换
if (i > 0) {
result.push(
word.slice(0, i – 1) +
alphabet[j] +
word.slice(i, word.length)
);
}
}
if (i > 0) {
//删除
result.push(word.slice(0, i – 1) + word.slice(i, word.length));
//前后替换
if (i < word.length) {
result.push(
word.slice(0, i – 1) +
word[i] +
word[i – 1] +
word.slice(i + 1, word.length)
);
}
}
}
return result.filter((item, index) => {
return index === result.indexOf(item);
});
};
wordList是一个数组，它的第一项是只有原始单词的数组，第二项是存放距离原始单词编辑距离为1的单词数组，以此类推，直到到达了指定的最大编辑距离maxDistance
以下四种情况被视为编辑距离为1:

插入一项，比如ab->abc

替换一项，比如ab->ac

删除一项，比如ab->a

前后替换，比如ab->ba

获取了所有在指定编辑距离的单词候选集，再比较它们的先验概率：
SpellCheck.prototype.getCorrections = function(word, maxDistance = 1) {
const candidate = this.getWordsByMaxDistance([[word]], maxDistance);
let result = [];
candidate
.map(candidateList => {
return candidateList
.filter(item => this.check(item))
.map(item => {
return [item, this.priorHash[item]];
})
.sort((item1, item2) => item2[1] – item1[1])
.map(item => item[0]);
})
.forEach(item => {
result.push(…item);
});
return result.filter((item, index) => {
return index === result.indexOf(item);
});
};
最后得到的就是修正后的单词
我们来测试一下：
const spellCheck = new SpellCheck([
“apple”,
“apples”,
“pear”,
“grape”,
“banana”
]);
spellCheck.getCorrectionsByCalcDistance(“appel”, 1); //[ ‘apple’ ]
spellCheck.getCorrectionsByCalcDistance(“appel”, 2); //[ ‘apple’, ‘apples’ ]
可以看到，在第一次测试的时候，我们指定了最大编辑距离为1，输入了错误的单词appel，最后返回修正项apple；而在第二次测试时，将最大编辑距离设为2，则返回了两个修正项
语料库较少的情况
上面的实现方法是先获取了单词所有指定编辑距离内的候选项，而在语料库单词较少的情况下，这种方法比较耗费时间，我们可以改成先获取语料库中符合指定最短编辑距离的单词
计算最短编辑距离是一种比较经典的动态规划(leetcode:72)，dp即可。这里的计算最短编辑距离与leetcode的情况略有不同，需要多考虑一层临近字母左右替换的情况
leetcode情况下的状态转换方程：

dp[i][j]=0 i===0,j===0

dp[i][j]=j i===0,j>0

dp[i][j]=i j===0,i>0

min(dp[i-1][j-1]+cost,dp[i-1][j]+1,dp[i][j-1]+1) i,j>0

其中当word1[i-1]===word2[j-1]时，cost为0，否则为1
考虑临近字母左右替换的情况，则需要在i>1,j>1且word1[i – 2] === word2[j – 1]&&word1[i – 1] === word2[j – 2]为true的条件下，再作min(dp[i-1][j-1]+cost,dp[i-1][j]+1,dp[i][j-1]+1,dp[i-2][j-2]+1)
拿到语料库中符合指定最短编辑距离的单词在对先验概率作比较，代码如下：
SpellCheck.prototype.getCorrectionsByCalcDistance = function(
word,
maxDistance = 1
) {
const candidate = [];
for (let key in this.priorHash) {
this.calcDistance(key, word) <= maxDistance && candidate.push(key);
}
return candidate
.map(item => {
return [item, this.priorHash[item]];
})
.sort((item1, item2) => item2[1] – item1[1])
.map(item => item[0]);
};
SpellCheck.prototype.calcDistance = function(word1, word2) {
const length1 = word1.length;
const length2 = word2.length;
let dp = [];
for (let i = 0; i <= length1; i++) {
dp[i] = [];
for (let j = 0; j <= length2; j++) {
if (i === 0) {
dp[i][j] = j;
continue;
}
if (j === 0) {
dp[i][j] = i;
continue;
}
const replaceCost =
dp[i – 1][j – 1] + (word1[i – 1] === word2[j – 1] ? 0 : 1);
let transposeCost = Infinity;
if (
i > 1 &&
j > 1 &&
word1[i – 2] === word2[j – 1] &&
word1[i – 1] === word2[j – 2]
) {
transposeCost = dp[i – 2][i – 2] + 1;
}
dp[i][j] = Math.min(
replaceCost,
transposeCost,
dp[i – 1][j] + 1,
dp[i][j – 1] + 1
);
}
}
return dp[length1][length2];
};
最后
这份代码还有很多可以优化的地方，比如check函数使用的是indexOf判断单词是否在语料库中出现，我们可以改用单词查找树(Trie)或者hash的方式加速查询

实现拼写检查器(spell check)

评论

发表回复取消回复

更多文章

DDN HPC 存储硬件架构设计深度分析

探秘IO500：从Lustre并行文件系统出发，开启HPC存储性能新征程

苹果iOS打包的ipa应用无法安装？一篇文章带你了解可能的原因及排查方法

图解Golang：从零开始实现简易版过期LRU缓存

实现拼写检查器(spell check)

评论

发表回复 取消回复

更多文章

DDN HPC 存储硬件架构设计深度分析

探秘IO500：从Lustre并行文件系统出发，开启HPC存储性能新征程

苹果iOS打包的ipa应用无法安装？一篇文章带你了解可能的原因及排查方法

图解Golang：从零开始实现简易版过期LRU缓存

发表回复取消回复