## Edit Distance

Edit distance is the minimum number of edit operations needed to turn one string into the other; the more operations required, the larger the distance and the less related the two strings are. For example, converting "xiaoming" to "xiamin" takes two steps: delete 'o' and delete 'g', so the count/distance is 2.

```python
!pip install distance
import distance

def edit_distance(s1, s2):
    # Levenshtein distance: minimum number of insertions,
    # deletions and substitutions to turn s1 into s2
    return distance.levenshtein(s1, s2)

s1 = 'xiaoming'
s2 = 'xiamin'
print('Distance: ' + str(edit_distance(s1, s2)))
```

## Jaccard Coefficient

The Jaccard coefficient measures the similarity and diversity of finite sample sets: the larger the coefficient, the more similar the samples. It is the size of the intersection of the two samples divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

def jaccard_similarity(s1, s2):
    def add_space(s):
        # insert a space between characters so each character becomes a token
        return ' '.join(list(s))

    s1, s2 = add_space(s1), add_space(s2)
    # build the term-frequency (TF) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # intersection: element-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # union: element-wise maximum of the two count vectors
    denominator = np.sum(np.max(vectors, axis=0))
    # Jaccard coefficient
    return 1.0 * numerator / denominator

s1 = '你在干啥呢'
s2 = '你在干什么呢'
print(jaccard_similarity(s1, s2))
```

## TF Similarity

Here we compute the similarity of two vectors in the TF matrix as the cosine of the angle between them:

cos θ = a · b / (|a| * |b|)

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from scipy.linalg import norm

def tf_similarity(s1, s2):
    def add_space(s):
        # insert a space between characters so each character becomes a token
        return ' '.join(list(s))

    s1, s2 = add_space(s1), add_space(s2)
    # build the term-frequency (TF) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # cosine similarity of the two TF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

s1 = '你在干啥呢'
s2 = '你在干什么呢'
print(tf_similarity(s1, s2))
```

## A Higher-Order Model: BERT

For BERT's internal structure, see the article "From word2vec to BERT"; here we only cover the code. You can either download the BERT source code or use the model through TF-HUB; this time we use the downloaded source. First, get the source from GitHub, then download Google's pretrained model; we choose BERT-Base Chinese. After downloading and extracting the pretrained model, you get the following files: vocab.txt, the character vocabulary used for the Chinese training text; bert_config.json, the adjustable training configuration; and the remaining files, which hold the model structure and weights.

### Preparing the dataset: adding a processor

```python
# Added to run_classifier.py in the BERT source tree
class MoveProcessor(DataProcessor):
  """Processor for the move data set."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  @classmethod
  def _read_tsv(cls, input_file, quotechar=None):
    """Reads a tab separated value file."""
    with tf.gfile.Open(input_file, "r") as f:
      reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
      lines = []
      for line in reader:
        lines.append(line)
      return lines

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        text_a = tokenization.convert_to_unicode(line[0])
        label = "0"
      else:
        text_a = tokenization.convert_to_unicode(line[1])
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
```

### Registering the processor

```python
def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "setest": MoveProcessor,
  }
```

### Training the BERT model

```bash
export BERT_BASE_DIR=/Users/xiaomingtai/Downloads/chinese_L-12_H-768_A-12
export MY_DATASET=/Users/xiaomingtai/Downloads/bert_model

python run_classifier.py \
  --data_dir=$MY_DATASET \
  --task_name=setest \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=/Users/xiaomingtai/Downloads/ber_model_output/ \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --eval_batch_size=8 \
  --predict_batch_size=2 \
  --learning_rate=5e-5 \
  --num_train_epochs=3.0
```
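With `--do_predict=true`, run_classifier.py writes one line of tab-separated class probabilities per test example to `test_results.tsv` under `--output_dir`. Below is a minimal sketch of turning those probabilities back into labels, assuming the output path from the command above and the label order from `MoveProcessor.get_labels()`:

```python
import csv

# path assumed from the --output_dir flag above
results_path = '/Users/xiaomingtai/Downloads/ber_model_output/test_results.tsv'
labels = ['0', '1']  # same order as MoveProcessor.get_labels()

with open(results_path) as f:
    for row in csv.reader(f, delimiter='\t'):
        probs = [float(p) for p in row]  # one probability per label
        print(labels[probs.index(max(probs))], probs)
```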
### BERT training results
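With `--do_eval=true`, run_classifier.py writes the evaluation metrics (eval_accuracy, eval_loss, global_step, loss) to `eval_results.txt` in the output directory as `key = value` lines. The actual numbers depend on your dataset and run; a quick way to load them, assuming the output path from the command above:

```python
# parse the "key = value" lines run_classifier.py writes after evaluation;
# actual values depend on your dataset and training run
metrics = {}
with open('/Users/xiaomingtai/Downloads/ber_model_output/eval_results.txt') as f:
    for line in f:
        key, value = line.strip().split(' = ')
        metrics[key] = float(value)

print(metrics)  # e.g. {'eval_accuracy': ..., 'eval_loss': ..., ...}
```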