CHEN E4580 Artificial Intelligence in Chemical Engineering Fall 2021
Final Project Due Date: Dec 15, 2021
The final project for the course carries 60% of your final grade. You can choose to do the
problem below, or an individual problem related to chemical engineering with prior
permission from the instructor or TA. If you choose a separate problem, it has to be
complex enough to justify the grade. In either case, the team/group that you are
assigned to remains the same.
Your final grade will be based on the report submitted for the project. This report should
be in the form of a research article with a title, list of authors, an abstract, introduction,
methods, results and discussion, conclusions that highlight the major findings of your
work, and a list of references. There are no constraints in terms of the number of pages
or the word limit but try not to exceed 10 pages (excluding references). Excessive and
unnecessary verbosity will be penalized.
You need to submit your code as separate .py or .ipynb files. A major part of the project
is to apply the methods learnt during the course and to demonstrate a strong knowledge of
at least a couple of them (including visualization techniques).
Problem Statement: Thermodynamic Property Prediction
Computational or data-driven property prediction is an important task that helps in
minimizing the experimentation effort required to measure and estimate molecular
properties. Most molecules are either difficult to procure or expensive to synthesize under
normal laboratory conditions. In addition, property measurement experiments are often
challenging. This necessitates the development of property estimation tools that can
predict such properties. These tools are data-driven: they use existing databases of
chemical properties to learn the underlying relationships between molecular features
and physical properties.
One way to build such tools is to train (separate) regression models that relate the
molecular property descriptors or features to their property values. Such models would
take as input the molecular descriptors and give as output the predicted value of the
property of interest.
Molecular Representation
In order to predict molecular properties using data-driven algorithms, we need a way to
represent the molecules numerically so that they could be fed to a regression algorithm.
This could be done in several ways but two common approaches are group contribution
and mol2vec-based methods.
- Group Contribution: Molecules are often characterized by the presence of
several different groups and they could be used to capture the chemical and
structural properties that define their thermodynamic properties. The idea behind
group contribution features is to represent the frequency of occurrence (or
absence) of each functional group in a molecule as a vector. These vectors could
be used as features in regression models trained for the property prediction task.
A recent work that used these representations for the property prediction task is
available here (paper, codebase). The data that has been given to you is
primarily based on this work and reading this paper is strongly
recommended. Molecules are represented using their SMILES string
representations (paper).
- Mol2vec: Mol2vec is an unsupervised machine learning approach to learn vector
representations of molecular substructures. The idea is primarily based on the
word2vec [1] algorithm that is used for learning vector representations of English
words in natural language processing. The mol2vec algorithm learns vector
representations of molecules such that the molecular substructures that are
chemically related point in similar directions. The research article where mol2vec
was first published is posted on the course website and could also be accessed
here.
Since mol2vec requires a cheminformatics package called RDKit to be installed,
you first need to install RDKit using the following command:
pip install rdkit-pypi
In order to understand the implementation details of mol2vec, you can go through
the mol2vec documentation. The mol2vec package could be installed using the
following command:
pip install git+https://github.com/samoturk/m…
You’re given a pre-trained mol2vec model (model_300dim.pkl) that could be used
to generate features from SMILES strings. Refer to the following code snippet that
explains generating features using mol2vec.
import numpy as np
import pandas as pd
from gensim.models import word2vec
from mol2vec.features import mol2alt_sentence, mol2sentence, MolSentence, DfVec, sentences2vec
from rdkit import Chem

# read the dataset as a pandas dataframe with only the first two columns
mdf = pd.read_csv('example_dataset.csv').iloc[:, :2]
# rename the columns to 'smiles' and 'target'
mdf.columns = ['smiles', 'target']
# change data type to object (required to store molecule objects in the dataframe later)
mdf = mdf.astype(object)
mdf.head()  # look at the first few rows in the dataframe

# use rdkit to convert smiles strings to 'molecule' objects and store in a new column 'mol'
mdf['mol'] = mdf['smiles'].apply(lambda x: Chem.MolFromSmiles(x))

# load the pre-trained mol2vec model. This file is provided to you separately
model = word2vec.Word2Vec.load('model_300dim.pkl')

# create a new column 'sentence' to store 'molecular sentences'
mdf['sentence'] = None
for i in range(mdf.shape[0]):
    # the following 'try except' code block is required to skip
    # erroneous molecules that could not be processed by rdkit
    try:  # try the following lines of code
        m = mdf.loc[i, 'mol']
        mdf.loc[i, 'sentence'] = MolSentence(mol2alt_sentence(m, 1))
    except Exception:  # do the following if there's an exception (or error)
        mdf.loc[i, 'sentence'] = None
        print('skipped: {}'.format(mdf.loc[i, 'smiles']))
mdf.dropna(inplace=True)  # drop nans
mdf.head()

# generate vector representations of molecules using 'molecular
# sentences' and pre-trained mol2vec model
mdf['mol2vec'] = [DfVec(x) for x in sentences2vec(mdf['sentence'], model, unseen='UNK')]
X = np.array([x.vec for x in mdf['mol2vec']])  # feature matrix
y = mdf['target'].values.astype(np.float32)  # target values

[1] You can get a very intuitive understanding of the word2vec algorithm here.
- Other representations: There are several other representations for molecules
such as the SMILES grammar-based representations (paper, codebase). In
addition, a molecular descriptors calculation software called Mordred (paper,
codebase) is freely available online that provides an exhaustive set of features that
could be used for property prediction tasks. You may go through these papers
and use them (even partially) in your work and contrast their
advantages/disadvantages with the mol2vec and group contribution
representations.
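To make the group contribution idea concrete, the sketch below turns a list of functional groups into a count vector. The vocabulary GROUPS, the helper group_count_vector, and the hand-written decompositions are hypothetical and chosen only for illustration; the datasets provided for this project already contain the 424 group contribution features, so this is not the procedure used to generate them.

```python
from collections import Counter

# Hypothetical, tiny vocabulary of functional groups (the project data uses 424).
GROUPS = ["CH3", "CH2", "OH", "COOH"]


def group_count_vector(groups_in_molecule, vocabulary=GROUPS):
    """Return how often each vocabulary group occurs, in vocabulary order."""
    counts = Counter(groups_in_molecule)
    return [counts.get(group, 0) for group in vocabulary]


# Hand-written decompositions (assumed for illustration only).
ethanol = ["CH3", "CH2", "OH"]                  # CH3-CH2-OH
butanol = ["CH3", "CH2", "CH2", "CH2", "OH"]    # CH3-(CH2)3-OH
propanoic_acid = ["CH3", "CH2", "COOH"]         # CH3-CH2-COOH

for name, groups in [("ethanol", ethanol),
                     ("1-butanol", butanol),
                     ("propanoic acid", propanoic_acid)]:
    print(name, group_count_vector(groups))
```

Groups that are absent from a molecule simply receive a count of zero, which is how the same fixed-length vector can describe molecules of different sizes.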
Visualization
Irrespective of the type of molecular representation that you decide to work with, visualize
a subset of molecules to see if chemically similar molecules are closer to each other.
Ideally, the molecular representations should capture the underlying similarities and
differences between molecules to the maximum possible extent. Since you have
high-dimensional features, you may use the t-SNE technique to project the features onto
two dimensions and then visualize them using a scatter plot.
More information on this method could be found here. The research article that proposed
t-SNE has been added to the project folder.
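As a minimal sketch of this projection step, the snippet below applies scikit-learn's TSNE to a synthetic feature matrix. The data (X_demo, labels), the perplexity value, and the random seed are assumptions made for illustration; in your project you would pass your own feature matrix instead.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: two clusters of 30 "molecules"
# each, with 50 features per molecule (replace with your own features).
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
cluster_b = rng.normal(loc=5.0, scale=1.0, size=(30, 50))
X_demo = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 30 + [1] * 30)  # hypothetical class of each molecule

# Perplexity must be smaller than the number of samples (60 here).
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X_demo)

print(embedding.shape)  # one 2-D point per molecule
```

The two columns of embedding can then be passed to a scatter plot routine (for example matplotlib's pyplot.scatter), colored by labels or by any chemical class you want to inspect. Note that t-SNE results depend on the perplexity and the random seed, so it is worth trying a few settings.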
Regression
The property prediction problem could be modeled as a regression task where you may
try different regression methods covered in class such as linear regression, polynomial
regression, support vector regression, decision trees, random forests, k-NN, and so on.
Do not forget to split the given data into train, test, and validation sets, perform appropriate
regularization, tune model hyperparameters, and transform the data using kernels, if
required.
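One possible way to organize the splitting, regularization, and tuning steps with scikit-learn is sketched below. The synthetic data, the choice of ridge regression, and the grid of alpha values are assumptions for illustration. Note also that 5-fold cross-validation on the training set is used here in place of a single fixed validation set; this is a substitution, not a requirement of the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 10 features, linear target plus noise.
X_demo = rng.normal(size=(200, 10))
true_coef = rng.normal(size=10)
y_demo = X_demo @ true_coef + 0.1 * rng.normal(size=200)

# Hold out 20% of the data as a test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale the features, then fit ridge regression (L2-regularized linear regression).
pipeline = make_pipeline(StandardScaler(), Ridge())

# Tune the regularization strength by 5-fold cross-validation on the training set.
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
# R^2 of the refitted best model on the untouched test set.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(best_alpha, test_r2)
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation folds or the test set leaks into the preprocessing.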
Report the final model performance, including all the regression metrics that capture
how well your model performs.
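The project does not prescribe specific metrics. As an illustration, the sketch below computes three commonly used ones (mean absolute error, root mean squared error, and the coefficient of determination R^2) with NumPy. The helper name regression_metrics and the example values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equally long sequences of values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}


# Made-up values, used only to exercise the function.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Report these on the held-out test set rather than on the training data, since training-set values overstate how well the model generalizes.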
Dataset description
Each team is given two datasets containing data on two different properties that need to
be predicted. The first column contains the SMILES strings of molecules, the second
contains the true property values, and the remaining 424 columns are the group
contribution-based features. These 424 features encode the presence of different
functional groups and their frequency of occurrence in a given molecule. You may use all
of these columns as features while training the group-contribution based regression
models.
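The column layout described above can be separated into SMILES strings, targets, and group contribution features as sketched below. The in-memory CSV, its column names, and its placeholder values are hypothetical (three feature columns stand in for the 424 real ones), and the sketch assumes the file has a header row.

```python
import io

import pandas as pd

# Hypothetical miniature dataset with placeholder values: SMILES, property
# value, then 3 group-count columns (the real files have 424 such columns).
csv_text = """smiles,target,g1,g2,g3
CCO,1.0,1,1,1
CCCO,2.0,1,2,1
CC(=O)O,3.0,1,0,0
"""
df = pd.read_csv(io.StringIO(csv_text))

smiles = df.iloc[:, 0]                          # first column: SMILES strings
y = df.iloc[:, 1].to_numpy(dtype=float)         # second column: property values
X_gc = df.iloc[:, 2:].to_numpy(dtype=float)     # remaining columns: group features

print(X_gc.shape, y.shape)
```

For the real files, replace io.StringIO(csv_text) with the path to the dataset assigned to your team.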
In order to generate the features based on other representations (such as mol2vec), you
need to use the SMILES strings of molecules and pass them to the mol2vec model as
explained in the section on mol2vec above.
Note that, for each of the properties, you need to train at least two separate models:
one using group contribution representations and the other using any other molecular
representation of your choice.
Provided files
Here’s an overview of the files you are provided with:
S.No.  Filename                 Description
1      teams.pdf                Project teams-assignment
2      property-assignment.pdf  Property assignment for each team
3      datasets                 Datasets for all groups. You just need to work with
                                the dataset assigned to your team
4      model_300dim.pkl         Pre-trained mol2vec model
5      papers                   Folder containing all the relevant research articles
The following commands would be useful while reading/writing the dataset files:
numpy.loadtxt() # for loading text files
pandas.read_csv() # for loading csv files as pandas dataframes
pandas.read_excel() # for loading excel files as pandas dataframes
pandas.DataFrame.to_csv() # write dataframe to csv file
pandas.DataFrame.to_excel() # write dataframe to excel file