COMP34711 Natural Language Processing
Coursework 2, Nov 2021
You are provided with the product review corpus. Check the README file and observe the content,
format and structure of the corpus. You are asked to design and evaluate solutions for two NLP
tasks using this corpus. To implement your design, you may use only functions available in the NLTK
framework and the machine learning libraries specified in the instructions, e.g., Weka, scikit-learn,
PyTorch, Keras, TensorFlow. Overall, this coursework is marked on the
basis of
• rigorous experimentation,
• knowledge displayed in report,
• independent problem-solving skill,
• self-learning ability,
• how informative your analysis is,
• language and ease of reading of the report,
• code quality based on correctness and readability (which includes comments).
You should solve all the tasks on your own. You are not permitted to collaborate with other
students on this coursework. In lab support sessions, you can ask TAs to explain knowledge
taught in the lecture or seek advice on how to use a natural language processing or machine
learning library. But you are not permitted to ask TAs to help with the solution design, or to check
the correctness of your solution.
Your submission should include both code and report. About your code, provide comments when
you see fit and your code will be marked based on both correctness and readability (which
includes comments). About your report, use Arial Font 11. Your main report should be no more
than 3 pages, with up to 2 pages for Task 1 and up to 1 page for Task 2. If needed, you can
additionally include up to 2 pages of screenshots (e.g., of your results) as an Appendix to your
report.
Task 1: Distributional Semantics (15 marks)
The following experiment is designed to evaluate the performance of a distributional semantic
approach.
• Step 1: Clean and pre-process all reviews in your text corpus as you see fit. Choose the 50
most frequently occurring words (after removing the stop words) as the target words. You are
free to use functions that are available in the NLTK framework to help your text pre-processing.
• Step 2: For each of the 50 target words, uniformly sample half of its occurrences in the corpus
and substitute these with a made-up reversed word, e.g., half of the occurrences of “canon” will
be transformed into “nonac”. Refer to these 50 new words as pseudowords.
• Step 3: Construct a d-dimensional feature vector to characterise each of the 50 target words
and 50 pseudowords (N=50+50=100) using a distributional semantic approach (more detailed
requirements on this are provided later). Store your obtained feature vectors in a 100×d matrix
X.
• Step 4: Take the feature matrix X as the input, and apply a clustering algorithm to cluster the
100 words into 50 clusters. You are free to use existing clustering algorithm implementations as
you see fit, for instance, the clustering modules (https://www.nltk.org/api/nltk…) from
NLTK, or the machine-learning frameworks for clustering from Weka
(http://www.cs.waikato.ac.nz/m…) and scikit-learn (https://scikitlearn.org/stabl…).
• Step 5: For each pair of a target word and its corresponding pseudoword, the pair is defined
as correct if the two words are grouped into the same cluster. Among the 50 pairs, compute the
percentage of correct pairs, denoted by p.
• Step 6: Repeat this whole process multiple times, e.g., 5-10 times, and calculate the mean and
standard deviation of the obtained percentages p.
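As an illustration of the mechanics of Steps 2, 5 and 6 (not a prescribed solution), the sketch below uses only the Python standard library. The function names `make_pseudowords` and `pair_accuracy`, the toy token list, the cluster labels and the run values are all hypothetical. It assumes that an odd number of occurrences is halved by rounding down, that no target word is a palindrome or the reverse of another corpus word, and that the sample standard deviation is wanted; none of these choices is fixed by the steps above.

```python
import random
import statistics


def make_pseudowords(tokens, targets, rng):
    """Replace a uniformly sampled half of each target's occurrences with its reverse."""
    out = list(tokens)
    for word in targets:
        positions = [i for i, tok in enumerate(tokens) if tok == word]
        # Sample half of the occurrences without replacement (rounding down).
        for i in rng.sample(positions, len(positions) // 2):
            out[i] = word[::-1]
    return out


def pair_accuracy(cluster_of, targets):
    """Fraction of targets that share a cluster with their pseudoword (p in Step 5)."""
    correct = sum(cluster_of[w] == cluster_of[w[::-1]] for w in targets)
    return correct / len(targets)


rng = random.Random(0)  # fixed seed so the toy run is repeatable
tokens = ["canon", "camera", "canon", "lens", "canon", "canon", "camera", "camera"]
targets = ["canon", "camera"]
new_tokens = make_pseudowords(tokens, targets, rng)

# Hypothetical cluster labels, in the form a clustering step might return them.
cluster_of = {"canon": 0, "nonac": 0, "camera": 1, "aremac": 2}
p = pair_accuracy(cluster_of, targets)

# Step 6: summarise p over repeated runs (made-up values, for illustration only).
runs = [0.5, 0.5, 1.0]
mean_p, std_p = statistics.mean(runs), statistics.stdev(runs)
```

In a full experiment, each repetition would presumably redo the sampling with a different seed, so that the spread of p reflects the randomness of the substitution as well as of the clustering.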
Applying what you have learned on lexical processing and distributional semantics, you should
come up with 2 different approaches for constructing the distributional semantic representations.
For instance, they can differ in ways of constructing the dictionary (e.g., stems vs. words) and of
extracting the context features, or differ in the approach principles. You should aim at achieving a
good clustering performance and understanding the reasons behind it.
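To make "extracting the context features" concrete, here is a minimal sketch of one possible sparse representation: raw co-occurrence counts within a symmetric context window, compared by cosine similarity. The helper names `cooccurrence_vectors` and `cosine`, the window size of 2 and the toy sentences are assumptions made for illustration only. A dense approach is not shown here, and weighting schemes or other design choices are left to you.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, targets, window=2):
    """Count context words within +/- `window` positions of each target word."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            if word not in targets:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy, already tokenised sentences (hypothetical data).
sentences = [
    ["great", "canon", "camera", "quality"],
    ["great", "nonac", "camera", "lens"],
    ["battery", "died", "quickly"],
]
vecs = cooccurrence_vectors(sentences, {"canon", "nonac", "battery"}, window=2)
sim_pair = cosine(vecs["canon"], vecs["nonac"])     # shared contexts: great, camera
sim_other = cosine(vecs["canon"], vecs["battery"])  # no shared contexts
```

In this toy data the target word and its pseudoword share two of their three context words, so they come out more similar to each other than to an unrelated word, which is the property the clustering evaluation relies on.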
Here are the requirements of the 2 approaches:
• They should differ significantly. For instance, the same context feature extraction approach
with different window sizes is considered as one approach.
• They should include one sparse approach and one dense approach.
• They should be evaluated and compared thoroughly, e.g., their performance, and effect of
their hyperparameter setting.
- Submission Instruction
Your implementation should be well-structured, defining a function for each step and executing the
functions in a main file.
You should submit the implementation and evaluation of your 2 approaches as 2 separate Jupyter
notebook files, named “Task1_Approach1” and “Task1_Approach2”. The TA will run each file
separately during marking.
You should prepare a report (up to 2 pages) containing two sections:
• Methods: Explanation of your text cleaning and pre-processing steps, as well as the 2
approaches for constructing the distributional semantic representations.
• Result Analysis: Analyse and discuss the obtained clustering results for each approach.
You should discuss hyperparameter-related issues if your approach requires any
hyperparameter setting, e.g., setting context window size, determining feature
dimensionality d for a dense approach, etc.
- Mark Allocation
Marks are allocated as below:
• 1 mark for text cleaning, pre-processing, target words selection, pseudo words
construction.
• 10 marks for implementation, description, and result analysis of the 2 approaches, with 5
marks for each approach.
• 2 marks for the clustering performance award, given for achieving a clustering
performance that exceeds a percentage threshold with at least one approach and for explaining
the reason behind your success. This percentage threshold will only be released to you
after the marking.
• 2 marks for design novelty that can be either an improvement of what has been taught or a
new reasonable approach not taught in the “Distributional Semantics” Chapter. To gain these
marks, you need to highlight in the report what the novelty is.
Task 2: Neural Network for Classifying Product Reviews (10 marks)
The product review corpus contains reviews scored as positive and negative opinions. Pre-process
your text, prepare the review examples for training and evaluation. Implement, train and evaluate a
neural network that can classify an input review into either a positive or a negative class. You are
free to choose any neural network/deep learning technique taught in the Chapter “Deep Learning
for NLP”, e.g., multi-layer perceptron, LSTM, bi-directional LSTM, etc. You should evaluate your
classifier’s classification accuracy using 5-fold cross validation (CV). You are free to use the PyTorch,
Keras or TensorFlow library.
- Submission Instruction
Your implementation should be well-structured with comments. You should submit the
implementation and its evaluation as a single file, named “Task2”.
Prepare a report (up to 1 page) containing 2 short sections:
• Method: Explanation of your classification model design and training.
• Experiment and Result Analysis: Describe your experiment and evaluation approach. You
should discuss hyperparameter-related issues if your approach requires any
hyperparameter setting. Report and analyse classification accuracy.
- Mark Allocation
Marks are allocated as below:
• 2 marks for text cleaning, pre-processing, and preparing the input data for the classifier.
• 7 marks for implementation, classification accuracy evaluation by 5-fold cross validation,
method description, and result analysis of the classifier.
• 1 mark for the classification accuracy award, given for achieving a classification
accuracy that exceeds an accuracy threshold. This threshold will only be released to you after the marking.
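The 5-fold cross validation required for Task 2 can be sketched as follows, using only the Python standard library. This is an illustrative outline, not a reference implementation: `k_fold_indices`, `cross_validate` and `majority_baseline` are hypothetical names, the data are made up, and the majority-class baseline merely stands in for the neural network you would train with PyTorch, Keras or TensorFlow.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle range(n) and split it into k nearly equal, disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(examples, labels, train_and_score, k=5):
    """Return one accuracy per fold; `train_and_score` is a hypothetical callback."""
    scores = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        scores.append(train_and_score(
            [examples[i] for i in train_idx], [labels[i] for i in train_idx],
            [examples[i] for i in test_idx], [labels[i] for i in test_idx],
        ))
    return scores


def majority_baseline(train_x, train_y, test_x, test_y):
    """Stand-in for a neural classifier: always predicts the most common training label."""
    guess = max(set(train_y), key=train_y.count)
    return sum(y == guess for y in test_y) / len(test_y)


# Made-up data: 6 positive and 4 negative reviews.
labels = [1] * 6 + [0] * 4
examples = [f"review {i}" for i in range(10)]
fold_scores = cross_validate(examples, labels, majority_baseline, k=5)
mean_accuracy = sum(fold_scores) / len(fold_scores)
```

Each example appears in exactly one test fold, and the model is trained from scratch on the remaining four folds each time, so the reported accuracy is the mean over the five held-out folds.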
Submission Checklist
A .zip file named “34711-Cwk-S-DeepLearning” containing
• Three code files: Task1_Approach1, Task1_Approach2, Task2.
• One .pdf file, combining reports for both Task 1 and Task 2.