The formatting here has not been polished much; see the linked OneNote notebook for reference.
Since OneNote discontinued single-page sharing, please leave your email address if you need these notes and I will send a PDF copy; I will sort this out properly later.
Installing the surprise recommendation library
pip install surprise

Basic usage
• Automatic cross-validation

    # Load the movielens-100k dataset (download it if needed)
    data = Dataset.load_builtin('ml-100k')

    # We'll use the famous SVD algorithm.
    algo = SVD()

    # Run 5-fold cross-validation and print results
    cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

  The load_builtin method downloads the "movielens-100k" dataset automatically and stores it under the .surprise_data directory.

• Using a custom dataset

    # path to dataset file
    file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

    # As we're loading a custom dataset, we need to define a reader. In the
    # movielens-100k dataset, each line has the following format:
    # 'user item rating timestamp', separated by '\t' characters.
    reader = Reader(line_format='user item rating timestamp', sep='\t')

    data = Dataset.load_from_file(file_path, reader=reader)

    # We can now use this dataset as we please, e.g. calling cross_validate
    cross_validate(BaselineOnly(), data, verbose=True)

• Cross-validation
  ○ cross_validate(algorithm, dataset, evaluation measures=[], number of folds cv)
  ○ For finer-grained control over the dataset you can combine the test method with KFold; LeaveOneOut and ShuffleSplit can also be used.

    from surprise import SVD
    from surprise import Dataset
    from surprise import accuracy
    from surprise.model_selection import KFold

    # Load the movielens-100k dataset
    data = Dataset.load_builtin('ml-100k')

    # define a cross-validation iterator
    kf = KFold(n_splits=3)

    algo = SVD()

    for trainset, testset in kf.split(data):
        # train and test algorithm.
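To make the splitting step above less opaque, here is a plain-Python sketch of what a k-fold iterator does conceptually: partition the rating indices into k disjoint test folds so that every rating is tested exactly once. The helper `kfold_indices` is hypothetical and not part of the surprise API; surprise's own KFold also shuffles by default, which this sketch omits.

```python
# Minimal sketch of k-fold splitting over n_ratings items (no shuffling).
# kfold_indices is a hypothetical helper, NOT part of the surprise library.

def kfold_indices(n_ratings, n_splits):
    """Yield (train_indices, test_indices) pairs covering all ratings."""
    fold_sizes = [n_ratings // n_splits] * n_splits
    for i in range(n_ratings % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_ratings))
        yield train, test
        start += size

# Every rating lands in a test set exactly once across the 3 folds.
folds = list(kfold_indices(10, 3))
assert len(folds) == 3
assert sorted(i for _, test in folds for i in test) == list(range(10))
```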
        algo.fit(trainset)
        predictions = algo.test(testset)

        # Compute and print Root Mean Squared Error
        accuracy.rmse(predictions, verbose=True)

• Tuning algorithm parameters with GridSearchCV
  ○ When you need to compare different parameter settings for an algorithm, the GridSearchCV class provides a solution.
  ○ For example, to try different values for the SVD parameters:

    from surprise import SVD
    from surprise import Dataset
    from surprise.model_selection import GridSearchCV

    # Use movielens-100K
    data = Dataset.load_builtin('ml-100k')

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
                  'reg_all': [0.4, 0.6]}
    gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

    gs.fit(data)

    # best RMSE score
    print(gs.best_score['rmse'])

    # combination of parameters that gave the best RMSE score
    print(gs.best_params['rmse'])

    # We can now use the algorithm that yields the best rmse:
    algo = gs.best_estimator['rmse']
    algo.fit(data.build_full_trainset())

• Using prediction algorithms
  ○ Baseline estimates configuration
    § Parameters when using alternating least squares (ALS):
      1) reg_i: item regularization parameter; default is 10
      2) reg_u: user regularization parameter; default is 15
      3) n_epochs: number of iterations of the ALS procedure; default is 10

        print('Using ALS')
        bsl_options = {'method': 'als',
                       'n_epochs': 5,
                       'reg_u': 12,
                       'reg_i': 5
                       }
        algo = BaselineOnly(bsl_options=bsl_options)

    § Parameters when using stochastic gradient descent (SGD):
      1) reg: regularization parameter of the cost function being optimized; default is 0.02
      2) learning_rate: learning rate of SGD; default is 0.005
      3) n_epochs: number of iterations of the SGD procedure; default is 20

        print('Using SGD')
        bsl_options = {'method': 'sgd',
                       'learning_rate': .00005,
                       }
        algo = BaselineOnly(bsl_options=bsl_options)

    § Passing baseline options when creating a KNN algorithm:

        bsl_options = {'method': 'als',
                       'n_epochs': 20,
                       }
        sim_options = {'name': 'pearson_baseline'}
        algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

  ○ Similarity measure configuration
    § name: the name of the similarity measure to use; default is MSD
    § user_based: whether similarities are computed between users (True) or between items (False); default is True
    § min_support: the minimum number of common items/users; when the number of common users or items is below min_support, the similarity is 0
    § shrinkage: shrinkage parameter; default is 100

      i.
        sim_options = {'name': 'cosine',
                       'user_based': False  # compute similarities between items
                       }
        algo = KNNBasic(sim_options=sim_options)
      ii.
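The baseline options above configure how the model b_ui = mu + b_u + b_i is fitted. As an illustration of the SGD variant, here is a plain-Python sketch mirroring the reg / learning_rate / n_epochs parameters; this is my own simplified illustration, not surprise's actual implementation (the hypothetical helper `fit_baselines` is not part of the library).

```python
# Sketch of fitting user/item biases b_u, b_i for b_ui = mu + b_u + b_i by SGD,
# minimizing the regularized squared error. Illustration only, NOT surprise code.

def fit_baselines(ratings, n_epochs=20, learning_rate=0.005, reg=0.02):
    """ratings: list of (user, item, rating) triples. Returns (mu, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    b_u, b_i = {}, {}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u.get(u, 0.0) + b_i.get(i, 0.0))
            # gradient step on the regularized squared error
            b_u[u] = b_u.get(u, 0.0) + learning_rate * (err - reg * b_u.get(u, 0.0))
            b_i[i] = b_i.get(i, 0.0) + learning_rate * (err - reg * b_i.get(i, 0.0))
    return mu, b_u, b_i

ratings = [('u1', 'i1', 5), ('u1', 'i2', 3), ('u2', 'i1', 4), ('u2', 'i2', 2)]
mu, b_u, b_i = fit_baselines(ratings, n_epochs=100)
# u1 rates higher than u2 on average, so its learned bias should be larger;
# likewise i1 is rated higher than i2.
assert b_u['u1'] > b_u['u2']
assert b_i['i1'] > b_i['i2']
```

The ALS variant optimizes the same objective but solves for the user biases and item biases in alternation instead of taking per-rating gradient steps.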
        sim_options = {'name': 'pearson_baseline',
                       'shrinkage': 0  # no shrinkage
                       }
        algo = KNNBasic(sim_options=sim_options)

• Miscellaneous questions
  ○ How to get the top-N recommendations

    from collections import defaultdict

    from surprise import SVD
    from surprise import Dataset

    def get_top_n(predictions, n=10):
        '''Return the top-N recommendation for each user from a set of predictions.

        Args:
            predictions(list of Prediction objects): The list of predictions, as
                returned by the test method of an algorithm.
            n(int): The number of recommendation to output for each user. Default
                is 10.

        Returns:
            A dict where keys are user (raw) ids and values are lists of tuples:
            [(raw item id, rating estimation), ...] of size n.
        '''

        # First map the predictions to each user.
        top_n = defaultdict(list)
        for uid, iid, true_r, est, _ in predictions:
            top_n[uid].append((iid, est))

        # Then sort the predictions for each user and retrieve the k highest ones.
        for uid, user_ratings in top_n.items():
            user_ratings.sort(key=lambda x: x[1], reverse=True)
            top_n[uid] = user_ratings[:n]

        return top_n

    # First train an SVD algorithm on the movielens dataset.
    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()
    algo = SVD()
    algo.fit(trainset)

    # Then predict ratings for all pairs (u, i) that are NOT in the training set.
    testset = trainset.build_anti_testset()
    predictions = algo.test(testset)

    top_n = get_top_n(predictions, n=10)

    # Print the recommended items for each user
    for uid, user_ratings in top_n.items():
        print(uid, [iid for (iid, _) in user_ratings])

  ○ How to compute precision

    from collections import defaultdict
...