CHEN E4580 Artificial Intelligence in Chemical Engineering Fall 2021
Final Project Due Date: Dec 15, 2021
The final project for the course carries 60% of your final grade. You can choose to do the
problem below, or an individual problem related to chemical engineering with prior
permission from the instructor or TA. In the event of you choosing a separate problem,
the problem has to be complex enough to justify the necessary grade. In either case, the
team/group that you are assigned to remains the same.
Your final grade will be based on the report submitted for the project. This report should
be in the form of a research article with a title, list of authors, an abstract, introduction,
methods, results and discussion, conclusions that highlight the major findings of your
work, and a list of references. There are no constraints in terms of the number of pages
or the word limit but try not to exceed 10 pages (excluding references). Excessive and
unnecessary verbosity will be penalized.
You need to submit your code as separate .py or .ipynb files. A major part of the project
is to apply the methods learnt during the course and to demonstrate a strong knowledge of
at least a couple of them (including visualization techniques).
Problem Statement: Thermodynamic Property Prediction
Computational or data-driven property prediction is an important task that helps in
minimizing the experimentation effort required to measure and estimate molecular
properties. Most molecules are either difficult to procure or expensive to synthesize under
normal laboratory conditions. In addition, property measurement experiments are often
challenging. This necessitates the development of property estimation tools that can
predict such properties. These tools are data-driven: they use existing databases of
chemical properties to learn the underlying relationships between molecular features
and physical properties.
One way to build such tools is to train (separate) regression models that relate the
molecular property descriptors or features to their property values. Such models would
take as input the molecular descriptors and give as output the predicted value of the
property of interest.
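As a minimal sketch of such a model, assuming a purely linear relationship, ordinary least squares can map a descriptor matrix to a property. The descriptor values, weights, and the `predict` helper below are made up for illustration and are not taken from the project datasets.

```python
import numpy as np

# Hypothetical descriptor matrix: 5 molecules x 3 descriptors (made-up values).
X = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 1.0, 1.0],
    [3.0, 0.0, 1.0],
])

# Synthetic property values generated from known weights, so the fit can be checked.
true_weights = np.array([10.0, -2.0, 0.5])
true_bias = 4.0
y = X @ true_weights + true_bias

# Append a column of ones so the bias is learned together with the weights.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)


def predict(descriptors):
    """Predict the property for one or more descriptor vectors."""
    descriptors = np.atleast_2d(descriptors)
    return descriptors @ coef[:-1] + coef[-1]
```

Calling `predict([1.0, 1.0, 1.0])` returns the fitted property value for a molecule with those descriptors. Real property data is noisy and often nonlinear, which is why the regression methods listed later in this handout are worth comparing.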
Molecular Representation
In order to predict molecular properties using data-driven algorithms, we need a way to
represent the molecules numerically so that they could be fed to a regression algorithm.
This could be done in several ways but two common approaches are group contribution
and mol2vec-based methods.

  1. Group Contribution: Molecules are often characterized by the presence of
    several different groups, and these groups can be used to capture the chemical and
    structural properties that define their thermodynamic properties. The idea behind
    group contribution features is to represent the frequency of occurrence (or
    absence) of each functional group in a molecule as a vector. These vectors could
    be used as features in regression models trained for the property prediction task.
    A recent work that used these representations for the property prediction task
    is available here (paper, codebase). The data that has been given to you is
    primarily based on this work and reading this paper is strongly
    recommended. Molecules are represented using their SMILES string
    representations (paper).
  2. Mol2vec: Mol2vec is an unsupervised machine learning approach to learn vector
    representations of molecular substructures. The idea is primarily based on the
    word2vec¹ algorithm that is used for learning vector representations of English
    words in natural language processing. The mol2vec algorithm learns vector
    representations of molecules such that the molecular substructures that are
    chemically related point in similar directions. The research article where mol2vec
    was first published is posted on the course website and could also be accessed
    here.
    Since mol2vec requires a cheminformatics package called RDKit to be installed,
    you first need to install RDKit using the following command:

    pip install rdkit-pypi
    In order to understand the implementation details of mol2vec, you can go through
    the mol2vec documentation. The mol2vec package could be installed using the
    following command:
    pip install git+https://github.com/samoturk/m…
    You’re given a pre-trained mol2vec model (model_300dim.pkl) that could be used
    to generate features from SMILES strings. Refer to the following code snippet that
    explains generating features using mol2vec.
    import pandas as pd
    import numpy as np
    from rdkit import Chem
    from gensim.models import word2vec
    from mol2vec.features import (mol2alt_sentence, mol2sentence,
                                  MolSentence, DfVec, sentences2vec)

    # read the dataset as a pandas dataframe with only the first two columns
    mdf = pd.read_csv('example_dataset.csv').iloc[:, :2]
    # rename the columns to 'smiles' and 'target'
    mdf.columns = ['smiles', 'target']
    # change data type to object (required to store molecule objects
    # in the dataframe later)
    mdf = mdf.astype(object)
    mdf.head()  # look at the first few rows in the dataframe

    # use rdkit to convert smiles strings to 'molecule' objects and
    # store in a new column 'mol'
    mdf['mol'] = mdf['smiles'].apply(lambda x: Chem.MolFromSmiles(x))

    # load the pre-trained mol2vec model. This file is provided to you separately
    model = word2vec.Word2Vec.load('model_300dim.pkl')

    # create a new column 'sentence' to store 'molecular sentences'
    mdf['sentence'] = None
    for i in range(mdf.shape[0]):
        # the following 'try except' code block is required to skip
        # erroneous molecules that could not be processed by rdkit
        try:  # try the following lines of code
            m = mdf['mol'][i]
            mdf.loc[i, 'sentence'] = MolSentence(mol2alt_sentence(m, 1))
        except Exception:  # do the following if there's an exception (or error)
            # while executing the previous lines
            mdf.loc[i, 'sentence'] = None
            print('skipped: {}'.format(mdf['smiles'][i]))
    mdf.dropna(inplace=True)  # drop nans
    mdf.head()

    # generate vector representations of molecules using 'molecular
    # sentences' and pre-trained mol2vec model
    mdf['mol2vec'] = [DfVec(x) for x in sentences2vec(mdf['sentence'],
                                                      model, unseen='UNK')]
    X = np.array([x.vec for x in mdf['mol2vec']])  # feature matrix
    y = mdf['target'].values.astype(np.float32)  # target values

    ¹ You can get a very intuitive understanding of the word2vec algorithm here.

  3. Other representations: There are several other representations for molecules
    such as the SMILES grammar-based representations (paper, codebase). In
    addition, a molecular descriptors calculation software called Mordred (paper,
    codebase) is freely available online that provides an exhaustive set of features that
    could be used for property prediction tasks. You may go through these papers
    and use them (even partially) in your work and contrast their
    advantages/disadvantages with the mol2vec and group contribution
    representations.
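To make the group contribution idea concrete, the following is a minimal sketch of turning a molecule's functional groups into a fixed-length count vector. The four-group vocabulary, the hand-labelled molecules, and the `group_count_vector` helper are all hypothetical; the provided datasets already contain 424 such columns, and in practice groups are identified by substructure matching rather than by hand.

```python
from collections import Counter

# Hypothetical, tiny group vocabulary; the provided datasets use 424 groups instead.
GROUP_VOCAB = ["CH3", "CH2", "OH", "COOH"]


def group_count_vector(groups_in_molecule, vocab=GROUP_VOCAB):
    """Return the count of each vocabulary group, in vocabulary order."""
    counts = Counter(groups_in_molecule)
    return [counts.get(group, 0) for group in vocab]


# Hand-labelled groups for three small molecules (illustrative only).
ethanol = ["CH3", "CH2", "OH"]                        # SMILES: CCO
propanoic_acid = ["CH3", "CH2", "COOH"]               # SMILES: CCC(=O)O
hexane = ["CH3", "CH2", "CH2", "CH2", "CH2", "CH3"]   # SMILES: CCCCCC

feature_rows = [group_count_vector(m) for m in (ethanol, propanoic_acid, hexane)]
```

Every molecule maps to a vector of the same length regardless of its size, which is what allows these vectors to be stacked into a feature matrix for regression.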
    Visualization
    Irrespective of the type of molecular representation that you decide to work with, visualize
    a subset of molecules to see if chemically similar molecules are closer to each other.
    Ideally, the molecular representations should capture the underlying similarities and
    differences between molecules to the maximum possible extent. Since you have
    high-dimensional features, you may use the t-SNE technique to project them to two
    dimensions and then visualize the result using a scatter plot.
    More information on this method could be found here. The research article that proposed
    t-SNE has been added to the project folder.
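The projection step could look like the sketch below, which assumes scikit-learn's `TSNE` and uses two synthetic clusters as a stand-in for real molecular features. The `project_to_2d` and `plot_embedding` helpers, the perplexity value, and the data are illustrative choices, not part of the provided material.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(features, perplexity=10.0, seed=0):
    """Project a (n_samples, n_features) matrix to 2-D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(features)


def plot_embedding(embedding, labels):
    """Scatter plot of the 2-D embedding, colored by label."""
    import matplotlib.pyplot as plt  # imported here so the projection works without it
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)
    plt.xlabel("t-SNE 1")
    plt.ylabel("t-SNE 2")
    plt.show()


# Synthetic stand-in for molecular features: two well-separated clusters
# of 30 points each in 50 dimensions (not real molecules).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
cluster_b = rng.normal(loc=8.0, scale=1.0, size=(30, 50))
features = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 30 + [1] * 30)

embedding = project_to_2d(features)
```

Calling `plot_embedding(embedding, labels)` draws the scatter plot. With real data, the labels might be a chemical family or a binned property value. The perplexity must be smaller than the number of samples, and the layout can change noticeably with it, so it is worth trying a few values.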
    Regression
    The property prediction problem could be modeled as a regression task where you may
    try different regression methods covered in class such as linear regression, polynomial
    regression, support vector regression, decision trees, random forests, k-NN, and so on.
    Do not forget to split the given data into train, test, and validation sets, perform appropriate
    regularization, tune model hyperparameters, and transform the data using kernels, if
    required.
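One possible way to organize splitting, regularization, and hyperparameter tuning is sketched below with scikit-learn on synthetic data. It swaps a single fixed validation set for 5-fold cross-validation on the non-test data, and it uses ridge regression only as an example; the model, the grid of `alpha` values, and the split ratio are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

# Hold out 20% as the test set; it is used only once, at the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling inside the pipeline keeps held-out statistics out of training.
pipeline = make_pipeline(StandardScaler(), Ridge())

# 5-fold cross-validation on the remaining data plays the role of the
# validation set while searching over the regularization strength alpha.
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_trainval, y_trainval)

best_alpha = search.best_params_["ridge__alpha"]
# The refitted pipeline's own score method returns R^2 on the test set.
test_r2 = search.best_estimator_.score(X_test, y_test)
```

The same pattern applies to the other regressors mentioned above by replacing `Ridge()` and the parameter grid.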
    Report the final model performance, including all the regression metrics needed to
    capture how well your model performs.
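The handout does not prescribe specific metrics. As a sketch, three commonly reported ones (mean absolute error, root mean squared error, and the coefficient of determination R²) can be computed as follows. The `regression_metrics` helper and the numbers are illustrative.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up true and predicted values, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

MAE and RMSE are in the units of the property itself, while R² is dimensionless, so reporting them together gives both an absolute and a relative view of the error.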
    Dataset description
    Each team is given two datasets containing data on two different properties that need to
    be predicted. The first column contains the SMILES strings of molecules, the second
    contains the true property values, and the remaining 424 columns are the group
    contribution-based features. These 424 features encode the presence of different
    functional groups and their frequency of occurrence in a given molecule. You may use all
    of these columns as features while training the group-contribution based regression
    models.
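Given that column layout, separating the SMILES strings, targets, and group contribution features could be sketched as below. The small frame stands in for a real dataset file, with 3 group columns instead of 424, and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the layout described above, but with
# 3 group columns instead of 424. Column names are hypothetical.
df = pd.DataFrame({
    "smiles": ["CCO", "CCC", "CCCC"],
    "property": [1.2, 3.4, 5.6],
    "group_1": [1, 2, 2],
    "group_2": [1, 1, 2],
    "group_3": [1, 0, 0],
})

smiles = df.iloc[:, 0]                             # first column: SMILES strings
y = df.iloc[:, 1].to_numpy(dtype=np.float32)       # second column: property values
X_gc = df.iloc[:, 2:].to_numpy(dtype=np.float32)   # remaining columns: group counts
```

Selecting by position rather than by name avoids depending on the exact headers used in each team's file.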
    In order to generate the features based on other representations (such as mol2vec), you
    need to use the SMILES strings of molecules and pass them to the mol2vec model as
    explained in the section on mol2vec above.
    Note that you need to train at least two separate models – one using group contribution
    representations and the other using any other choice of molecular representations for
    each of the properties.
    Provided files
    Here’s an overview of the files you are provided with:

    S.No.  Filename                 Description
    1      teams.pdf                Project teams-assignment
    2      property-assignment.pdf  Property assignment for each team
    3      datasets                 Datasets for all groups. You just need to work with
                                    the dataset assigned to your team
    4      model_300dim.pkl         Pre-trained mol2vec model
    5      papers                   Folder containing all the relevant research articles
    The following commands would be useful while reading/writing the dataset files:
    numpy.loadtxt() # for loading text files
    pandas.read_csv() # for loading csv files as pandas dataframes
    pandas.read_excel() # for loading excel files as pandas dataframes
    pandas.DataFrame.to_csv() # write dataframe to csv file
    pandas.DataFrame.to_excel() # write dataframe to excel file
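As a short usage sketch of the CSV commands, the snippet below writes a small table of predictions to a temporary file and reads it back. The file name, column names, and values are made up.

```python
import os
import tempfile

import pandas as pd

# Made-up predictions table; the column names are hypothetical.
results = pd.DataFrame({
    "smiles": ["CCO", "CCC"],
    "predicted": [1.25, 3.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictions.csv")
    results.to_csv(path, index=False)   # index=False avoids writing the row index
    reloaded = pd.read_csv(path)        # read the file back as a dataframe
```

Without `index=False`, the row index is written as an extra unnamed first column, which would shift the column positions when the file is read back.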