Recommendation Algorithms: Introductory Notes on Recommender Systems, Part 9: Content-Based Recommendation

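I. Introduction

Content-based recommendation makes its recommendations from the content description of the items. In essence, it directly analyzes and computes over the features or attributes of the items and of the users themselves.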

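For example, suppose we know that movie A is a comedy and we also happen to know that a certain user likes watching comedies. From this information alone we can recommend movie A to that user.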

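II. Implementation steps for content-based recommendation

Profile construction (a profile describes the features of an item or a user; in essence, it means attaching tags to users or items)
- Item profile: the tags attached to an item
- User profile: the tags attached to a user
How profiles are built:
- Build item profiles from PGC/UGC content (PGC: tags that come with the item; UGC: tags supplied by users)
- Build user profiles from the users' activity records
Recommendation: using the user profile, find the top-N best-matching items and recommend them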
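
The steps above can be illustrated with a minimal sketch in which a profile is nothing more than a set of tags and "best matching" means "shares the most tags". The movie names, the tags, and the helper `recommend_top_n` are all hypothetical and only serve the illustration; the weighted version used in these notes comes later.

```python
# Toy item profiles: each movie is described by a set of tags (hypothetical data).
item_profiles = {
    "Movie A": {"comedy", "family"},
    "Movie B": {"horror", "thriller"},
    "Movie C": {"comedy", "romance"},
}

# Toy user profile: tags inferred from what the user liked (hypothetical data).
user_tags = {"comedy", "romance"}


def recommend_top_n(user_tags, item_profiles, n=2):
    """Rank items by the number of tags they share with the user."""
    scores = {
        item: len(tags & user_tags)
        for item, tags in item_profiles.items()
    }
    # Keep only items that share at least one tag, best match first.
    ranked = sorted(
        ((item, score) for item, score in scores.items() if score > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


top = recommend_top_n(user_tags, item_profiles)
print(top)  # [('Movie C', 2), ('Movie A', 1)]
```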

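III. Content-based movie recommendation: item profiles

1. Construction steps

Use each movie's tags from tags.csv as its candidate keywords
Use TF-IDF or word2vec to compute a weight for each word
Take the top-n tags by weight as the movie's profile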

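2. The TF-IDF algorithm

1. How it works

TF-IDF is a method from natural language processing for weighting a word or phrase within a document. It is the product of term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word occurs in the document. This count is usually normalized so that it does not favor long documents (the same word may occur more often in a long document than in a short one, whether or not the word is important). IDF measures the general importance of a word: the IDF of a given word is obtained by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm of the quotient.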
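
Written as formulas, the description above reads roughly as follows. The symbols are notation introduced here for illustration and do not come from the original notes: $n_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the total number of documents, and the denominator of the IDF is the number of documents containing $t$. The text does not fix the base of the logarithm; the worked example in this section uses base 10.

```latex
\mathrm{TF}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{IDF}(t) = \log \frac{N}{\lvert \{ d : t \in d \} \rvert}, \qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```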

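2. Worked example

To compute TF-IDF for movie reviews, take the film "Pirates of the Caribbean: The Curse of the Black Pearl" as an example. Suppose it has 1000 reviews in total and one of those reviews contains 200 words. The most frequent words in that review are "pirate", "captain", and "freedom", which occur 20, 15, and 10 times, and the three words appear in 1000, 500, and 100 of the reviews respectively. Their weights as keywords are calculated as follows.

1. Filter the stop words out of the review and compute the term frequency of the remaining words. For the three most frequent words:
- TF of "pirate": 20/200 = 0.1
- TF of "captain": 15/200 = 0.075
- TF of "freedom": 10/200 = 0.05
2. Compute the inverse document frequency of each word (base-10 logarithm):
- IDF of "pirate": log(1000/1000) = 0
- IDF of "captain": log(1000/500) ≈ 0.3
- IDF of "freedom": log(1000/100) = 1
3. Multiply the results of steps 1 and 2 to get the TF-IDF of each word: "pirate" is 0, "captain" is 0.075 × 0.3 = 0.0225, and "freedom" is 0.05.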
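
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the worked example: the variable names are made up here, and the base-10 logarithm is assumed because it reproduces the IDF values in the text. Without rounding the IDF of "captain" to 0.3, its TF-IDF comes out slightly higher than 0.0225.

```python
import math

# Numbers taken from the worked example.
total_reviews = 1000      # reviews of the movie
words_in_review = 200     # words in the review being scored
term_counts = {"pirate": 20, "captain": 15, "freedom": 10}
reviews_containing = {"pirate": 1000, "captain": 500, "freedom": 100}


def tf_idf(term):
    """TF-IDF of one term: normalized count times log10 of the inverse document ratio."""
    tf = term_counts[term] / words_in_review
    idf = math.log10(total_reviews / reviews_containing[term])
    return tf * idf


scores = {term: tf_idf(term) for term in term_counts}
for term, score in scores.items():
    print(f"{term}: {score:.4f}")
```

A word that appears in every review, such as "pirate" here, gets a weight of 0 no matter how often it occurs, which is exactly why TF-IDF is preferred over raw counts for picking keywords.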

3. Implementation

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import TfidfModel


def get_movie_dataset(tag_path, movie_path):
    # Read the tag file, keeping the 2nd and 3rd columns (movieId, tag)
    _tags = pd.read_csv(tag_path, usecols=[1, 2]).dropna()
    # Collect each movie's tags into a list
    tags = _tags.groupby('movieId').agg(list)
    # Read the movie file so it can be combined with the tags
    movies = pd.read_csv(movie_path, index_col='movieId')
    # The genres are used as tags too, so split them into a list
    movies['genres'] = movies['genres'].apply(lambda x: x.split('|'))
    # Keep the movieIds that exist in both movies and tags
    movie_index = set(movies.index) & set(tags.index)
    # Select the tags of those movies (.loc needs a list, not a set)
    new_tags = tags.loc[list(movie_index)]
    # Left join: movies without tags get NaN in the tag column
    ret = movies.join(new_tags)
    # Build the rows; tags = genres + user tags, or [] when the movie has no tags
    rows = [
        (x[0], x[1], x[2], x[2] + x[3] if isinstance(x[3], list) else [])
        for x in ret.itertuples()
    ]
    movie_dataset = pd.DataFrame(rows, columns=["movieId", "title", "genres", "tags"])
    # Use movieId as the index of movie_dataset
    movie_dataset.set_index('movieId', inplace=True)
    return movie_dataset


def create_movie_profile(movie_dataset):
    # 1. Iterate over the dataset
    # 2. Compute TF-IDF weights for each movie's existing tags
    # 3. Sort the tags by weight
    # 4. Keep the top-n tags as the movie's profile
    # Take the tags of all movies
    dataset = movie_dataset['tags'].values
    # Compute TF-IDF with gensim: first put every word into one dictionary
    dct = Dictionary(dataset)
    # For each movie, compute (word index, word count) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    model = TfidfModel(corpus)
    # Tag every movie
    _movie_profile = []
    for i, data in enumerate(movie_dataset.itertuples()):
        mid = data[0]
        title = data[1]
        genres = data[2]
        # (tag index, weight) vector for this movie
        vector = model[corpus[i]]
        # Sort the tags by weight and keep the top 30 as the movie's tags
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        # Map the top 30 to {tag: weight}
        topN_tags_weights = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))
        # Add the genre words to the tags with a weight of 1
        for g in genres:
            topN_tags_weights[g] = 1.0
        topN_tags = list(topN_tags_weights.keys())
        _movie_profile.append((mid, title, topN_tags, topN_tags_weights))
    movie_profile = pd.DataFrame(_movie_profile, columns=["movieId", "title", "profile", "weights"])
    movie_profile.set_index("movieId", inplace=True)
    return movie_profile


def create_inverted_table(movie_profile):
    # Build the tag -> movies inverted table from the movie profiles:
    # for every keyword, the movies it belongs to and its weight in each
    inverted_table = {}
    # Loop over the tag weights of all movies (Series.iteritems was removed in pandas 2.0)
    for mid, weights in movie_profile['weights'].items():
        # Loop over the tags of this movie
        for tag, weight in weights.items():
            # Create the tag's list on first sight, then append (movie, weight)
            inverted_table.setdefault(tag, []).append((mid, weight))
    return inverted_table


if __name__ == '__main__':
    tag_path = 'E:/ml/recommand/ml-latest-small/all-tags.csv'
    movie_path = 'E:/ml/recommand/ml-latest-small/movies.csv'
    watch_path = 'E:/ml/recommand/ml-latest-small/ratings.csv'
    # 1. Load and preprocess the dataset
    movie_dataset = get_movie_dataset(tag_path, movie_path)
    # 2. Build the movie profiles
    movie_profile = create_movie_profile(movie_dataset)
    inverted_table = create_inverted_table(movie_profile)
    print(inverted_table)
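
To make the shape of the inverted table concrete, here is a small standalone sketch that runs on hand-written profiles instead of the MovieLens files. The movie IDs, tags, and weights are invented for illustration, and `build_inverted_table` is a hypothetical pure-Python stand-in for `create_inverted_table`.

```python
from collections import defaultdict

# Hypothetical movie profiles: movieId -> {tag: weight}.
movie_weights = {
    1: {"pirate": 0.75, "adventure": 0.5},
    2: {"adventure": 1.0, "space": 0.25},
}


def build_inverted_table(movie_weights):
    """Turn movie -> {tag: weight} into tag -> [(movie, weight), ...]."""
    inverted = defaultdict(list)
    for mid, weights in movie_weights.items():
        for tag, weight in weights.items():
            inverted[tag].append((mid, weight))
    return dict(inverted)


inverted_table = build_inverted_table(movie_weights)
print(inverted_table["adventure"])  # [(1, 0.5), (2, 1.0)]
```

Looking movies up by tag is what the recommendation step needs, so the table is built once and then queried per user tag instead of scanning every movie profile each time.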

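IV. Content-based movie recommendation: user profiles

Steps for building a user profile:

Based on the user's rating history, take the tags of the movies the user rated highly and attach them to the user as initial tags

Accumulate the weights of the user's tags -> sort -> keep the top N as the user's tags

1. Code implementation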

import numpy as np
import pandas as pd


def create_user_profile(watch_path, movie_profile):
    # Based on the rating history, attach the tags of highly rated movies to the user,
    # then accumulate the tag weights -> sort -> keep the top N as the user's tags
    watch_record = pd.read_csv(watch_path, usecols=range(3),
                               dtype={"userId": np.int32, "movieId": np.int32, "rating": np.float32})
    # Aggregate each user's ratings into lists
    watch_record = watch_record.groupby('userId').agg(list)
    user_profile = {}
    for uid, mids, ratings in watch_record.itertuples():
        # Convert to numpy arrays so the above-average ratings can be selected
        mids = np.array(mids)
        ratings = np.array(ratings)
        # The user's mean rating
        user_mean = ratings.mean()
        # All movieIds the user rated above his or her own mean
        final_mids = mids[ratings > user_mean]
        # Look up each movie's tags by movieId and merge them;
        # when the same tag occurs more than once, its weights are summed
        user_tag_weight = {}
        for mid in final_mids:
            # The movie's tags and weights
            movie_data_dict = movie_profile.loc[mid]['weights']
            for tag, weight in movie_data_dict.items():
                user_tag_weight[tag] = user_tag_weight.get(tag, 0) + weight
        # Sort the tags by weight and keep the top 50 as the user's tags
        user_tags = sorted(user_tag_weight.items(), key=lambda x: x[1], reverse=True)[:50]
        user_profile[uid] = user_tags
    return user_profile
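
The same idea can be traced by hand on a tiny example. The sketch below is a pure-Python illustration with invented ratings and tag weights; `build_user_profile` is a hypothetical helper, not part of the code above. It keeps the movies rated above the user's own mean and sums their tag weights.

```python
from collections import defaultdict

# Hypothetical movie profiles: movieId -> {tag: weight}.
movie_weights = {
    1: {"pirate": 0.75, "adventure": 0.5},
    2: {"adventure": 1.0, "space": 0.25},
    3: {"romance": 0.5},
}

# Hypothetical ratings of one user: movieId -> rating.
ratings = {1: 5.0, 2: 4.0, 3: 1.0}


def build_user_profile(ratings, movie_weights, n=2):
    """Sum the tag weights of movies rated above the user's mean; keep the top n tags."""
    mean = sum(ratings.values()) / len(ratings)
    tag_weight = defaultdict(float)
    for mid, rating in ratings.items():
        if rating > mean:
            for tag, weight in movie_weights.get(mid, {}).items():
                tag_weight[tag] += weight
    return sorted(tag_weight.items(), key=lambda kv: kv[1], reverse=True)[:n]


user_tags = build_user_profile(ratings, movie_weights)
print(user_tags)  # [('adventure', 1.5), ('pirate', 0.75)]
```

Comparing against the user's own mean rather than a fixed threshold such as 4 stars compensates for users who rate everything high or everything low.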

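V. Recommending movies from the item and user profiles

1. Approach

Iterate over the user's tags
For each tag, look up its movies in the inverted table of the item profiles; a movie's score for that tag is the tag's weight in the user profile * the tag's weight in the movie
Sort the movies by their total score
Take the top-N movies

2. Code implementation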

def user_recommand_top_N(user_profile, inverted_table):
    # Recommend movies to each user
    # 1. Iterate over the user's tags
    # 2. Look up each tag's movies in the inverted table; a movie's score for a tag is
    #    the tag's weight in the user profile * the tag's weight in the movie
    # 3. Sort the movies by score
    # 4. Take the top-N movies
    user_movie_profile = {}
    # Go through every user's profile and build the recommendations
    for uid, tags in user_profile.items():
        movie_weight_dict = {}
        # Iterate over the user's tags and fetch the movies of each tag
        for tag, t_weight in tags:
            # Movies for this tag from the inverted table
            movie_weight_list = inverted_table.get(tag, [])
            for mid, m_weight in movie_weight_list:
                # A movie reached through several tags accumulates its scores
                movie_weight_dict[mid] = movie_weight_dict.get(mid, 0) + t_weight * m_weight
        # Sort once all tags are processed (sorting inside the tag loop would
        # replace the dict with a list and break the next iteration)
        top_movies = sorted(movie_weight_dict.items(), key=lambda x: x[1], reverse=True)[:30]
        user_movie_profile[uid] = top_movies
    return user_movie_profile
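
A hand-checkable run of this scoring rule is sketched below on invented data; the tags, weights, and the helper `recommend` are hypothetical and independent of the MovieLens files. Each movie's score is the sum, over the user's tags, of user tag weight times the movie's weight for that tag.

```python
# Hypothetical user profile: [(tag, weight), ...], already sorted by weight.
user_tags = [("adventure", 1.5), ("pirate", 0.75)]

# Hypothetical inverted table: tag -> [(movieId, weight), ...].
inverted_table = {
    "adventure": [(1, 0.5), (2, 1.0)],
    "pirate": [(1, 0.75)],
    "space": [(2, 0.25)],
}


def recommend(user_tags, inverted_table, n=2):
    """Score movies by summing user_tag_weight * movie_tag_weight; return the top n."""
    scores = {}
    for tag, t_weight in user_tags:
        for mid, m_weight in inverted_table.get(tag, []):
            scores[mid] = scores.get(mid, 0.0) + t_weight * m_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


top = recommend(user_tags, inverted_table)
print(top)  # [(2, 1.5), (1, 1.3125)]
```

Movie 1 scores 1.5 × 0.5 + 0.75 × 0.75 = 1.3125 and movie 2 scores 1.5 × 1.0 = 1.5. Like the function above, this sketch does not filter out movies the user has already rated.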

VI. Complete code

import pandas as pd
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel


def get_movie_dataset(tag_path, movie_path):
    # Read the tag file, keeping the 2nd and 3rd columns (movieId, tag)
    _tags = pd.read_csv(tag_path, usecols=[1, 2]).dropna()
    # Collect each movie's tags into a list
    tags = _tags.groupby('movieId').agg(list)
    # Read the movie file so it can be combined with the tags
    movies = pd.read_csv(movie_path, index_col='movieId')
    # The genres are used as tags too, so split them into a list
    movies['genres'] = movies['genres'].apply(lambda x: x.split('|'))
    # Keep the movieIds that exist in both movies and tags
    movie_index = set(movies.index) & set(tags.index)
    # Select the tags of those movies (.loc needs a list, not a set)
    new_tags = tags.loc[list(movie_index)]
    # Left join: movies without tags get NaN in the tag column
    ret = movies.join(new_tags)
    # Build the rows; tags = genres + user tags, or [] when the movie has no tags
    rows = [
        (x[0], x[1], x[2], x[2] + x[3] if isinstance(x[3], list) else [])
        for x in ret.itertuples()
    ]
    movie_dataset = pd.DataFrame(rows, columns=["movieId", "title", "genres", "tags"])
    # Use movieId as the index of movie_dataset
    movie_dataset.set_index('movieId', inplace=True)
    return movie_dataset


def create_movie_profile(movie_dataset):
    # 1. Iterate over the dataset
    # 2. Compute TF-IDF weights for each movie's existing tags
    # 3. Sort the tags by weight
    # 4. Keep the top-n tags as the movie's profile
    # Take the tags of all movies
    dataset = movie_dataset['tags'].values
    # Compute TF-IDF with gensim: first put every word into one dictionary
    dct = Dictionary(dataset)
    # For each movie, compute (word index, word count) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    model = TfidfModel(corpus)
    # Tag every movie
    _movie_profile = []
    for i, data in enumerate(movie_dataset.itertuples()):
        mid = data[0]
        title = data[1]
        genres = data[2]
        # (tag index, weight) vector for this movie
        vector = model[corpus[i]]
        # Sort the tags by weight and keep the top 30 as the movie's tags
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        # Map the top 30 to {tag: weight}
        topN_tags_weights = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))
        # Add the genre words to the tags with a weight of 1
        # for g in genres:
        #     topN_tags_weights[g] = 1.0
        topN_tags = list(topN_tags_weights.keys())
        _movie_profile.append((mid, title, topN_tags, topN_tags_weights))
    movie_profile = pd.DataFrame(_movie_profile, columns=["movieId", "title", "profile", "weights"])
    movie_profile.set_index("movieId", inplace=True)
    return movie_profile


def create_inverted_table(movie_profile):
    # Build the tag -> movies inverted table from the movie profiles:
    # for every keyword, the movies it belongs to and its weight in each
    inverted_table = {}
    # Loop over the tag weights of all movies (Series.iteritems was removed in pandas 2.0)
    for mid, weights in movie_profile['weights'].items():
        # Loop over the tags of this movie
        for tag, weight in weights.items():
            # Create the tag's list on first sight, then append (movie, weight)
            inverted_table.setdefault(tag, []).append((mid, weight))
    return inverted_table


def create_user_profile(watch_path, movie_profile):
    # Based on the rating history, attach the tags of highly rated movies to the user,
    # then accumulate the tag weights -> sort -> keep the top N as the user's tags
    watch_record = pd.read_csv(watch_path, usecols=range(3),
                               dtype={"userId": np.int32, "movieId": np.int32, "rating": np.float32})
    # Aggregate each user's ratings into lists
    watch_record = watch_record.groupby('userId').agg(list)
    user_profile = {}
    for uid, mids, ratings in watch_record.itertuples():
        # Convert to numpy arrays so the above-average ratings can be selected
        mids = np.array(mids)
        ratings = np.array(ratings)
        # The user's mean rating
        user_mean = ratings.mean()
        # All movieIds the user rated above his or her own mean
        final_mids = mids[ratings > user_mean]
        # Look up each movie's tags by movieId and merge them;
        # when the same tag occurs more than once, its weights are summed
        user_tag_weight = {}
        for mid in final_mids:
            # The movie's tags and weights
            movie_data_dict = movie_profile.loc[mid]['weights']
            for tag, weight in movie_data_dict.items():
                user_tag_weight[tag] = user_tag_weight.get(tag, 0) + weight
        # Sort the tags by weight and keep the top 50 as the user's tags
        user_tags = sorted(user_tag_weight.items(), key=lambda x: x[1], reverse=True)[:50]
        user_profile[uid] = user_tags
    return user_profile


def user_recommand_top_N(user_profile, inverted_table):
    # Recommend movies to each user
    # 1. Iterate over the user's tags
    # 2. Look up each tag's movies in the inverted table; a movie's score for a tag is
    #    the tag's weight in the user profile * the tag's weight in the movie
    # 3. Sort the movies by score
    # 4. Take the top-N movies
    user_movie_profile = {}
    # Go through every user's profile and build the recommendations
    for uid, tags in user_profile.items():
        movie_weight_dict = {}
        # Iterate over the user's tags and fetch the movies of each tag
        for tag, t_weight in tags:
            # Movies for this tag from the inverted table
            movie_weight_list = inverted_table.get(tag, [])
            for mid, m_weight in movie_weight_list:
                # A movie reached through several tags accumulates its scores
                movie_weight_dict[mid] = movie_weight_dict.get(mid, 0) + t_weight * m_weight
        # Sort once all tags are processed (sorting inside the tag loop would
        # replace the dict with a list and break the next iteration)
        top_movies = sorted(movie_weight_dict.items(), key=lambda x: x[1], reverse=True)[:30]
        user_movie_profile[uid] = top_movies
    return user_movie_profile


if __name__ == '__main__':
    tag_path = 'E:/ml/recommand/ml-latest-small/all-tags.csv'
    movie_path = 'E:/ml/recommand/ml-latest-small/movies.csv'
    watch_path = 'E:/ml/recommand/ml-latest-small/ratings.csv'
    # 1. Load and preprocess the dataset
    movie_dataset = get_movie_dataset(tag_path, movie_path)
    # 2. Build the movie profiles
    movie_profile = create_movie_profile(movie_dataset)
    # 3. Build the inverted table
    inverted_table = create_inverted_table(movie_profile)
    # 4. Build the user profiles
    user_profile = create_user_profile(watch_path, movie_profile)
    # 5. Recommend movies to every user
    recommendations = user_recommand_top_N(user_profile, inverted_table)
    print(recommendations)