554.488/688 Computing for Applied Mathematics
Spring 2023 – Final Project Assignment

The aim of this assignment is to give you a chance to exercise your skills at prediction using Python. You have been sent an email with a link to data collected on a random sample from some population of Wikipedia pages, which you will use to develop prediction models for three different web page attributes. Each student is provided with their own data drawn from a Wikipedia page population unique to that student, and this comes in the form of two files:

• A training set which is a pickled pandas data frame with 200,000 rows and 44 columns. Each row corresponds to a distinct Wikipedia page/url drawn at random from a certain population of Wikipedia pages. The columns are
  – URLID in column 0, which gives a unique identifier for each url. You will not be able to determine the url from the URLID or the rest of the data. (It would be a waste of time to try, so the only information you have about this url is provided in the dataset itself.)
  – 40 feature/predictor variable columns in columns 1,…,40, each associated with a particular word (the word is in the header). For each url/Wikipedia page, the word column gives the number of times that word appears in the associated page.
  – Three response variables in columns 41, 42 and 43
    * length = the length of the page, defined as the total number of characters in the page
    * date = the last date when the page was edited
    * word_present = a binary variable indicating whether at least one of 5 possible words (using a word list of 5 words specific to each student and not among the 40 feature words) [1] appears in the page
• A test set which is also a pickled pandas data frame, with 50,000 rows but only 41 columns since the response variables (length, date, word_present) are not available to you. The rows of the test dataset also correspond to distinct url/pages drawn from the same Wikipedia url/page population as the training dataset (with no pages in common with the training set pages). The response variables have been removed so that the columns that are available are
  – URLID in column 0
  – the same 40 feature/predictor variable columns corresponding to word counts for the same 40 words as in the training set

[1] What this list of 5 words is will not be revealed to you, and it would be a waste of time trying to figure out what it is.

Your goal is to use the training data to
• predict the length variable for pages in the test dataset
• predict the mean absolute error you expect to achieve in your predictions of length in the test dataset
• predict word_present for pages in the test dataset, attempting to make the false positive rate as close as you can to .05 [2], and make the true positive rate as high as you possibly can [3]
• predict your true positive rate for word_present in the test dataset
• predict edited_2023 for pages in the test dataset, attempting to make the false positive rate as close as you can to .05 [4], and make the true positive rate as high as you possibly can [5]
• predict your true positive rate for edited_2023 in the test dataset

Since I have the response variable values (length, word_present, date) for the pages in your test dataset, I can determine the performance of your predictions. Since you do not have those variables, you will need to set aside some data in your training set or use cross-validation to estimate the performance of your prediction models.

There are 3 different parts of this assignment, each requiring a submission:

• Part 1 (30 points) – a Jupyter notebook containing
  – a description (in words, no code) of the steps you followed to arrive at your predictions and your estimates of prediction quality, including a description of any separation of your training data into training and testing data, the method you used for imputation, and the methods you tried for making predictions (e.g. regression, logistic regression, …), followed by
  – the code you used in your calculations
• Part 2 (60 points) – a csv file with your predictions. This file should consist of exactly 4 columns, with [6]
  – a header row with URLID, length, word_present, edited_2023
  – 50,000 additional rows
  – every URLID in your test dataset appearing in the URLID column, not altered in any way!
  – no missing values
  – data type for the length column should be integer or float
  – data type for the word_present column should be either integer (0 or 1), float (0. or 1.) or Boolean (False/True)
  – data type for the edited_2023 column should be either integer (0 or 1), float (0. or 1.) or Boolean (False/True)

[2] false positive rate = proportion of the pages for which word_present is 0 that are predicted to be 1
[3] true positive rate = proportion of the pages for which word_present is 1 that are predicted to be 1
[4] false positive rate = proportion of the pages for which edited_2023 is 0 that are predicted to be 1
[5] true positive rate = proportion of the pages for which edited_2023 is 1 that are predicted to be 1
[6] a notebook is provided to you for checking that your csv file is properly formatted
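
The Part 2 file requirements above can be checked before submission. The following is only a minimal sketch and does not replace the provided checking notebook: the helper `check_submission`, the synthetic URLID values and the placeholder predictions are all hypothetical, and the header spelling with underscores (`word_present`, `edited_2023`) is an assumption that should be confirmed against the checking notebook.

```python
import io

import numpy as np
import pandas as pd

# Assumed header spelling; confirm against the provided checking notebook.
COLUMNS = ["URLID", "length", "word_present", "edited_2023"]


def check_submission(df, expected_ids):
    """Return a list of problems found; an empty list means every check passed."""
    if list(df.columns) != COLUMNS:
        return ["columns are not exactly " + ", ".join(COLUMNS)]
    problems = []
    if len(df) != len(expected_ids):
        problems.append("wrong number of rows")
    if set(df["URLID"]) != set(expected_ids):
        problems.append("URLID values differ from the test set")
    if df.isna().any().any():
        problems.append("missing values present")
    if not pd.api.types.is_numeric_dtype(df["length"]):
        problems.append("length is not integer or float")
    for col in ["word_present", "edited_2023"]:
        # Integer, float and Boolean columns all convert cleanly to 0.0 / 1.0.
        if not df[col].astype(float).isin([0.0, 1.0]).all():
            problems.append(f"{col} contains values other than 0/1")
    return problems


# Synthetic stand-ins: hypothetical URLIDs and random placeholder predictions.
rng = np.random.default_rng(0)
test_ids = rng.choice(1_000_000, size=50_000, replace=False)
submission = pd.DataFrame({
    "URLID": test_ids,
    "length": rng.integers(100, 50_000, size=len(test_ids)),
    "word_present": rng.integers(0, 2, size=len(test_ids)),
    "edited_2023": rng.integers(0, 2, size=len(test_ids)),
})

# Write the csv without the pandas index, then read it back and check it.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

problems = check_submission(roundtrip, test_ids)
print(problems or "all checks passed")
```

The in-memory buffer keeps the sketch free of file writes; for a real submission, the same `to_csv(..., index=False)` call would take a file name of your choosing instead. Passing `index=False` matters because the default would add a fifth, unnamed column.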
• Part 3 (30 points) – providing estimates of the following in a form:
  – what do you predict the mean absolute error of your length predictions to be?
  – what do you predict the true positive rate for your word_present predictions to be?
  – what do you predict the true positive rate for your edited_2023 predictions to be?

Your score in this assignment will be based on

• Part 1 (30 points)
  – evidence of how much effort you put into the assignment (how many different methods did you try?)
  – how well did you document what you did?
  – was your method for predicting the quality of your performance prone to over-fitting?
• Part 2 (60 points)
  – how good are your predictions of length, word_present, edited_2023? I will do predictions using your training data and I will compare
    * your length mean absolute error to what I obtained in my predictions
    * your true positive rate to what I obtained for the binary variables (assuming you managed to appropriately control the false positive rate)
  – how well did you meet specifications: did you get your false positive rate in predictions of the binary variables close to .05 (again, compared to how well I was able to do this)?
• Part 3 (30 points)
  – how good is your prediction of the length mean absolute error
  – how good is your prediction of the true positive rate for the word_present variable
  – how good is your prediction of the true positive rate for the edited_2023 variable

How the datasets were produced

This is information that will not be of much help to you in completing the assignment, except maybe to convince you that there would be no point in using one of the other students' data in completing this assignment.

• I web crawled in Wikipedia to arrive at a random sample of around 2,000,000 pages.
• I made a list of 100 random words and extracted the length, the word counts, and the last date edited for each page.

To create the personal datasets, I repeated the following steps for each student:

Repeat
    Chose 10 random words w0,w1,…,w9 out of the 100 words in the list above
    Determined the subsample of pages having w0 and w1 but not w2, w3 or w4
    Used the words w5,w6,w7,w8 and w9 to create the word_present variable
Until the subsample has at least 250,000 pages
Randomly sampled 40 of the 90 unsampled words without replacement
Randomly sampled without replacement 250,000 pages out of the subsample
Retained only the 250,000 pages and
    word counts for the 40 words
    length
    word_present
    last date edited
Randomly assigned missing values in the feature (word count) data
Randomly separated the 250,000 pages into
    200,000 training pages
    50,000 test pages
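
Because the test pages are a random split of the same 250,000-page sample as the training pages, a random hold-out taken from the training set should behave much like the test set. The sketch below illustrates one possible way to use such a hold-out to pick a cutoff giving a false positive rate near .05 and to estimate the resulting true positive rate. It is an assumption-laden illustration, not the instructor's method: the scores are synthetic rather than produced by a fitted model, and the helper names `threshold_for_fpr` and `rates` are hypothetical.

```python
import numpy as np


def threshold_for_fpr(scores, y_true, target_fpr=0.05):
    """Cutoff such that roughly target_fpr of the true 0s score above it."""
    negatives = scores[y_true == 0]
    # The (1 - target_fpr) quantile of the 0-class scores leaves about
    # target_fpr of them above the cutoff.
    return np.quantile(negatives, 1.0 - target_fpr)


def rates(pred, y_true):
    """False positive rate and true positive rate of 0/1 predictions."""
    fpr = np.mean(pred[y_true == 0] == 1)
    tpr = np.mean(pred[y_true == 1] == 1)
    return fpr, tpr


rng = np.random.default_rng(0)
n = 20_000
y = rng.integers(0, 2, size=n)                # synthetic 0/1 response
scores = rng.normal(loc=1.5 * y, scale=1.0)   # synthetic classifier scores

# Random split: choose the cutoff on one half, evaluate it on the other half.
idx = rng.permutation(n)
calib, valid = idx[: n // 2], idx[n // 2:]

cutoff = threshold_for_fpr(scores[calib], y[calib])
pred_valid = (scores[valid] > cutoff).astype(int)
fpr, tpr = rates(pred_valid, y[valid])
print(f"hold-out false positive rate = {fpr:.3f}, true positive rate = {tpr:.3f}")
```

Choosing the cutoff and measuring the rates on different pages is the point of the split: rates measured on the same pages used to tune the cutoff (or to fit the model) tend to look better than what the test set will deliver.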