Example molecules to try out

#	Molecule	SMILES String	Description
1	Tylenol	CC(=O)NC1=CC=C(C=C1)O	a medication used to treat pain and fever
2	1-Chloro-2-nitropropane	CC(CCl)[N+](=O)[O-]	residues can be found in food, typically from pesticides
3	9-Bromoanthracene	C1=CC=C2C(=C1)C=C3C=CC=CC3=C2Br	halogenated polycyclic aromatic hydrocarbon found in urban air

The science under the hood

Molecular Fingerprinting

Fingerprinting is a way to represent molecular structure and properties as binary bit strings (0s and 1s). This representation was initially developed for searching databases for molecules with a specific substructure, but it can also be applied to machine learning.

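A hash function deterministically maps its input to a pseudo-random-looking number. It is applied to each feature of a molecule, such as the types of atoms and bonds present, so that each feature acts as a seed to the function, and the resulting number selects which bit of the fingerprint to set.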
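As a rough illustration of how hashing turns features into bits, here is a minimal sketch using only the Python standard library. It is not the scheme any cheminformatics toolkit uses: the feature labels, the 64-bit length, and the helper names `feature_to_bit` and `toy_fingerprint` are all assumptions made for the example.

```python
import hashlib

N_BITS = 64  # toy length chosen for readability (an assumption)


def feature_to_bit(feature: str, n_bits: int = N_BITS) -> int:
    """Deterministically map a feature label to a bit position."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def toy_fingerprint(features, n_bits: int = N_BITS):
    """Set one bit per feature; distinct features may collide on the same bit."""
    bits = [0] * n_bits
    for feature in features:
        bits[feature_to_bit(feature, n_bits)] = 1
    return bits


# Hypothetical, hand-written feature labels (not extracted by a toolkit).
features = ["C-aromatic", "N-amide", "O-hydroxyl", "C=O"]
fp = toy_fingerprint(features)
print("".join(str(b) for b in fp))
```

Because the mapping is deterministic, the same feature always sets the same bit, which is what makes fingerprints comparable across molecules.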

Four types of fingerprints were generated: Morgan circular, topological torsion, Daylight, and atom-pair.

After testing these four popular fingerprint types, the final model used Morgan circular fingerprints, which are generated by the Morgan algorithm.

These fingerprints represent molecular structures and the presence of substructures by means of circular atom neighborhoods, each defined by a bond radius around a central atom. Another important feature is their ability to encode the presence or absence of molecular functionality, which can further help discriminate when classifying molecules.
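The idea of circular atom neighborhoods can be sketched in a few lines of plain Python. This is a simplified toy, not the Morgan algorithm as implemented in any toolkit: it ignores bond orders, charges, and duplicate-environment removal, and the molecule is written out by hand as a list of atoms and bonds. The function names and the 128-bit length are assumptions for illustration.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=128):
    """Return the set of on-bits for a molecule given as atoms plus bonds."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Radius 0: each atom is described by its element symbol alone.
    ids = [stable_hash(symbol) for symbol in atoms]
    on_bits = {identifier % n_bits for identifier in ids}

    # Each round widens every atom's neighborhood by one bond.
    for _ in range(radius):
        ids = [
            stable_hash(f"{ids[i]}|{sorted(ids[j] for j in neighbors[i])}")
            for i in range(len(atoms))
        ]
        on_bits.update(identifier % n_bits for identifier in ids)
    return on_bits


# Ethanol (SMILES: CCO), heavy atoms only, written out by hand.
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
# Propane (SMILES: CCC) has the same skeleton with a different terminal atom.
propane = toy_circular_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
print(sorted(ethanol))
print(sorted(propane))
```

Bits shared by the two molecules correspond to neighborhoods they have in common, such as a lone carbon at radius 0.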

The p53 gene encodes a protein of the same name and this protein is known as a tumor-suppressor protein. The p53 protein is expressed in cells when they undergo DNA damage — which can transform a normal cell into a cancerous one. To counteract the effects, p53 can cause growth arrest, repair DNA, or begin the process of cell death. Therefore, when DNA damage occurs, there is a significant increase in p53 expression. This increase in protein expression is a good indicator of irregular cell health.

This model was built using the Tox21 dataset, which was generated by testing cell lines that produce a fluorescent reporter gene product under the control of p53 cellular machinery. By measuring levels of the reporter gene product in response to various compounds, researchers were able to determine whether a compound was an agonist (activator) of the p53 pathway or not.

Machine Learning

The Toxicology in the 21st Century program, or Tox21, is a unique collaboration between several federal agencies to develop new ways to rapidly test whether substances adversely affect human health. Substances assayed in Tox21 include a diverse range of products such as commercial chemicals, pesticides, food additives and contaminants, and medical compounds.

The Tox21 data is labeled with active (1) and inactive (0) states. The data was loaded, the column names were formatted, and a mol file was generated for each molecule. Mol files are data files that contain molecular information such as atoms, bonds, coordinates, and connectivity.

This project was built using a logistic regression algorithm, which predicts a binary categorical response (outcome) variable, 1 or 0, from its relationship with predictive features. In contrast, linear regression outputs response variables that are continuous and can be any real number.

Most importantly, linear regression does not output probabilities; it fits the best hyperplane through the data, so its predictions are unbounded. Logistic regression is therefore the natural choice for a binary classification problem such as this.
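The contrast between the two models can be made concrete with a small sketch. Logistic regression passes the same linear score through the sigmoid function, which maps any real number into the interval (0, 1) so it can be read as a probability. The weights, bias, and 4-bit input below are hypothetical values chosen for illustration, not parameters from the trained model.

```python
import math


def linear_score(weights, bias, x):
    """Linear-regression-style output: can be any real number."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def sigmoid(z: float) -> float:
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Logistic-regression-style output: probability of the active class."""
    return sigmoid(linear_score(weights, bias, x))


def predict_label(weights, bias, x, threshold=0.5):
    """Turn the probability into a 1 (active) or 0 (inactive) label."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


# Hypothetical weights for a 4-bit fingerprint (illustration only).
weights = [1.5, -2.0, 0.5, 3.0]
bias = -1.0
x = [1, 0, 1, 1]

print(linear_score(weights, bias, x))   # unbounded score
print(predict_proba(weights, bias, x))  # probability between 0 and 1
print(predict_label(weights, bias, x))  # class label
```

The 0.5 threshold is a common default; a different cutoff trades false positives against false negatives.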

Three models were compared: KNN, logistic regression, and gradient boosting. On validation accuracy, logistic regression consistently had the highest scores. On AUC, arguably the most important metric, gradient boosting produced the highest individual score, but logistic regression had the highest average score.
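For readers unfamiliar with AUC, one way to read it is as the probability that a randomly chosen active compound is scored higher than a randomly chosen inactive one. The sketch below computes it directly from that definition in plain Python; the labels and scores are made-up toy values, not outputs of any of the three models.

```python
def roc_auc(labels, scores):
    """Fraction of (active, inactive) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores (illustrative only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Unlike accuracy, this value does not depend on a decision threshold, which is one reason it is often preferred when active compounds are rare.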

The highest individual and average F1 scores belong to logistic regression, suggesting that it produces models with the greatest balance between precision and recall.
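The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. A minimal sketch of the calculation follows; the true and predicted labels are toy values invented for the example, not results from this project.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the active (1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (illustrative only): 3 actives, 5 inactives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```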

The full write-up can be found here

PubChem molecule builder

The PubChem molecule building tool is built and maintained by the National Center for Biotechnology Information. It can be used to build chemical queries and output their SMILES strings, which in turn can be used as input for prediction. This tool can help show which functional groups may elicit carcinogenic and non-carcinogenic classifications.