Predictive Modeling - pKa edition
Today, we delve into the application of regression analysis in chemistry and data science. Our dataset comprises p$K_\text{a}$ values and corresponding molecular descriptors, offering a quantitative approach to understanding molecular properties.
pKa
The p$K_\text{a}$ measures how acidic a substance is. It is the negative logarithm (base 10) of the acid dissociation constant ($K_\text{a}$) of that acid, so it quantifies the strength of the acid in solution on a compact logarithmic scale.
The acid dissociation constant ($K_\text{a}$), from which p$K_\text{a}$ is derived, is defined for the following dissociation equilibrium of a generic acid (HA) in water:
$$ \text{HA} \rightleftharpoons \text{H}^+ + \text{A}^- $$
The equilibrium constant ($K_\text{a}$) for this reaction is defined as the product of the concentrations of the dissociated ions ($\text{H}^+$ and $\text{A}^-$) divided by the concentration of the undissociated acid ($\text{HA}$):
$$ K_\text{a} = \frac{[\text{H}^+][\text{A}^-]}{[\text{HA}]} $$
Taking the negative logarithm (base 10) of $K_\text{a}$ gives the expression for p$K_\text{a}$:
$$ \text{p}K_\text{a} = -\log_{10}(K_\text{a}) $$
So, in summary, the p$K_\text{a}$ is the negative base-10 logarithm of the acid dissociation constant ($K_\text{a}$) for a given acid. Because of the negative sign, a larger $K_\text{a}$ (more dissociation) corresponds to a smaller p$K_\text{a}$.
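The conversion between $K_\text{a}$ and p$K_\text{a}$ can be sketched in a few lines of Python. This is only an illustration: the helper names `pka_from_ka` and `ka_from_pka` are made up for this example, and the $K_\text{a}$ used for acetic acid is an approximate textbook value, not a number taken from our dataset.

```python
import math


def pka_from_ka(ka: float) -> float:
    """Return pKa = -log10(Ka) for a positive dissociation constant."""
    if ka <= 0:
        raise ValueError("Ka must be positive")
    return -math.log10(ka)


def ka_from_pka(pka: float) -> float:
    """Invert the definition: Ka = 10 ** (-pKa)."""
    return 10.0 ** (-pka)


# Approximate textbook Ka for acetic acid in water at 25 °C (assumed value).
acetic_ka = 1.8e-5
acetic_pka = pka_from_ka(acetic_ka)
print(f"pKa = {acetic_pka:.2f}")  # prints: pKa = 4.74
```

Applying `ka_from_pka` to the result recovers the original $K_\text{a}$, which is a quick check that the two helpers are consistent with each other.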
In simpler terms:
- A lower p$K_\text{a}$ indicates a stronger acid because it means the acid is more likely to donate a proton ($\text{H}^+$) in a chemical reaction.
- A higher p$K_\text{a}$ indicates a weaker acid as it is less likely to donate a proton.
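To make the comparison concrete, here is a small sketch that ranks a few acids by strength and converts a p$K_\text{a}$ difference into a ratio of $K_\text{a}$ values. The p$K_\text{a}$ numbers are approximate literature values for water at 25 °C and are used purely for illustration; the helper `ka_ratio` is a hypothetical name.

```python
# Approximate aqueous pKa values at 25 °C (illustrative, not from our dataset).
pka_values = {"acetic acid": 4.76, "phenol": 9.99, "formic acid": 3.75}

# Lower pKa means stronger acid, so an ascending sort puts the strongest first.
strongest_first = sorted(pka_values, key=pka_values.get)
print(strongest_first)


def ka_ratio(pka_a: float, pka_b: float) -> float:
    """Return Ka(a) / Ka(b), which equals 10 ** (pKa(b) - pKa(a))."""
    return 10.0 ** (pka_b - pka_a)


# Formic acid is about one pKa unit below acetic acid, so its Ka is
# roughly ten times larger.
ratio = ka_ratio(pka_values["formic acid"], pka_values["acetic acid"])
print(f"Ka(formic) / Ka(acetic) = {ratio:.1f}")
```

Because the scale is logarithmic, each unit of p$K_\text{a}$ corresponds to a factor of ten in $K_\text{a}$.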
The p$K_\text{a}$ is a crucial parameter in understanding the behavior of acids and bases in various chemical reactions. It is commonly used in fields such as medicinal chemistry, biochemistry, and environmental science to describe and predict the behavior of molecules in solution.
Exploring the dataset
Now, let's shift our focus to a practical application of our theoretical knowledge.
I found this dataset, which contains many high-quality experimental p$K_\text{a}$ measurements. In its raw form it also includes extra information and quirks that make regression a bit of a nightmare, so I cleaned the data and computed some molecular features (i.e., descriptors) that we can use. Before we start on regression, we should review the dataset systematically so that we know what we are working with when we relate molecular features to acidity.
Loading
Using the pandas library, read the CSV file into a DataFrame. Pass the `CSV_PATH` variable defined in the code below to `pd.read_csv`.
import numpy as np
import pandas as pd
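from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import py3Dmol


def show_mol(smi, style="stick"):
    """Show an interactive 3D view of the molecule described by a SMILES string."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smi!r}")
    mol = Chem.AddHs(mol)  # add explicit hydrogens before building 3D coordinates
    AllChem.EmbedMolecule(mol)  # generate an initial 3D conformer
    AllChem.MMFFOptimizeMolecule(mol, maxIters=200)  # relax the geometry with MMFF
    mblock = Chem.MolToMolBlock(mol)
    view = py3Dmol.view(width=500, height=500)
    view.addModel(mblock, "mol")
    view.setStyle({style: {}})
    view.zoomTo()
    view.show()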
CSV_PATH = "https://github.com/oasci-courses/pitt-biosc1540-2024f/raw/refs/heads/main/content/lectures/21/smiles-pka-desc.csv"
df = pd.read_csv(CSV_PATH)
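A first pass at the systematic review mentioned above could look like the sketch below. It is a minimal example rather than the notebook's actual workflow: `quick_overview` is a hypothetical helper, and the small `demo` frame with `smiles` and `pka` columns is a stand-in, since the real column names in the CSV may differ.

```python
import pandas as pd


def quick_overview(frame: pd.DataFrame) -> dict:
    """Collect basic facts about a DataFrame before modeling."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        # Number of missing entries in each column.
        "missing_per_column": frame.isna().sum().to_dict(),
        # Count, mean, std, min, quartiles, and max of the numeric columns.
        "numeric_summary": frame.describe(),
    }


# Stand-in data with assumed column names; replace `demo` with the loaded `df`.
demo = pd.DataFrame(
    {
        "smiles": ["CC(=O)O", "OC=O", "Oc1ccccc1"],
        "pka": [4.76, 3.75, None],
    }
)
overview = quick_overview(demo)
print(overview["n_rows"], overview["n_cols"])
print(overview["missing_per_column"])
print(overview["numeric_summary"])
```

Calling `quick_overview(df)` on the loaded data would report its size, where values are missing, and the range of each numeric descriptor, which is useful to know before fitting any regression.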