PDB#
This tutorial showcases how to use ProLIF to generate an interaction fingerprint for interactions between a protein and a ligand in a PDB file. There are several preparation steps which you need to follow before generating an interaction fingerprint with ProLIF.
Important
For convenience, the different files for this tutorial are included with the ProLIF installation, and we’ll access these files through plf.datafiles.datapath
which is a pathlib.Path
object pointing to the tutorials data directory. This makes it easier to manipulate paths to files, match filenames using wildcards…etc. in a Pythonic way, but you can also use plain strings, e.g. "/home/user/prolif/data/vina/rec.pdb"
instead of plf.datafiles.datapath / "vina" / "rec.pdb"
if you prefer.
Remember to replace any reference to plf.datafiles.datapath
with the actual paths to your inputs outside of this tutorial.
Tip
At the top of the page you can find links to either download this notebook or run it in Google Colab. You can install the dependencies for the tutorials with the command:
pip install prolif[tutorials]
Prerequisite#
The first step is to split your PDB file in two: one file for the protein and one file for the ligand.
There are different ways to do that, from the command line with grep
, or from a GUI such as PyMOL.
Important
Make sure that each file produced either has all bonds explicitely stated, or none of them.
Protein preparation#
In order to generate a valid molecule (with bond orders and charges) from a PDB file and correctly detect HBond interactions, ProLIF requires the protein file to contain explicit hydrogens.
Note
Having only polar hydrogens explicit may work if you use RDKit for reading the file, although they are some caveats around protonable residues such as histidines which RDKit may not recognize correctly due to naming conventions, we therefore encourage users to have all hydrogens explicit in the file and use MDAnalysis instead.
In order to protonate residues in your protein file, there are several strategies available, via command-line tools that you can install locally, or webservers, here are a few suggestions:
H++ webserver
PDB2PQR available through a webserver or as a CLI tool. Make sure to then convert the output PQR file to a PDB file.
Once this step is done, and you’ve verified that the prepared PDB file contains explicit hydrogens (and optionally explicit bonds), we can finally use that as input. We have 2 options to read the protein file: MDAnalysis (preferred) or RDKit (faster but risky, see below).
With MDAnalysis#
This is the recommended way of reading your prepared protein PDB file:
import MDAnalysis as mda
import prolif as plf
protein_file = str(plf.datafiles.datapath / "vina" / "rec.pdb")
u = mda.Universe(protein_file)
protein_mol = plf.Molecule.from_mda(u)
protein_mol.n_residues
302
We can have a quick look at each residue individually using the following command (we’ll only check residues 260 to 262 as there’s an histidine in there which could be problematic):
# check that residues 260 to 262 were processed correctly
# remove the `slice(260, 263)` part to show all residues
plf.display_residues(protein_mol, slice(260, 263))
Troubleshooting
Explicit valence for atom <...> is greater than permitted
If your file did not contain explicit bonds, MDAnalysis will automatically run its connectivity-perception algorithm. This could result in some atomic clashes that may be incorrectly classified as bonds and will prevent the conversion of the MDAnalysis molecule to RDKit through ProLIF. Since MDAnalysis uses van der Waals radii for bond detection, one can modify the default radii that are used:
u.atoms.guess_bonds(vdwradii={"H": 1.05, "O": 1.48})
Alternatively you could use some GUI tools like PyMOL to write the bonds (CONECT record) in your file and then reread it here.
With RDKit#
Important
While RDKit is going to be much faster at parsing the file, it can only recognize standard residue names, so some protonated residues may be incorrectly parsed.
For example, the protonated histidine residue HSE347 in our protein is not correcltly parsed which removes aromaticity on the ring, meaning that ProLIF will not be able to match this residue for pi-stacking interactions.
from rdkit import Chem
rdkit_prot = Chem.MolFromPDBFile(protein_file, removeHs=False)
rdkit_protein_mol = plf.Molecule(rdkit_prot)
# histidine HSE347 not recognized by RDKit
plf.display_residues(rdkit_protein_mol, slice(260, 263))
Ligand preparation#
As for the protein structure, we’ll need our prepared ligand file to contain explicit hydrogens. There are different ways to do this.
One could load the ligand file (or the whole structure) in PyMOL, select the ligand, add explicit hydrogens, and export the file as SDF, then use the following snippet:
# read SDF
ligand_file = str(plf.datafiles.datapath / "vina" / "vina_output.sdf")
ligand_mol = plf.sdf_supplier(ligand_file)[0]
# display ligand
plf.display_residues(ligand_mol, size=(400, 200))
If you only have a PDB file with hydrogens, you could use this code snippet instead:
# load PDB with explicit hydrogens
ligand_file = str(plf.datafiles.datapath / "vina" / "lig.pdb")
u = mda.Universe(ligand_file)
ligand_mol = plf.Molecule.from_mda(u)
# display ligand
plf.display_residues(ligand_mol, size=(400, 200))
Alternatively, PDB files from the Protein Data Bank usually provide the ligand as a SMILES string.
We can use the following snippet to combine the SMILES string and the ligand PDB file to prepare a valid molecule for ProLIF:
from rdkit import Chem
from rdkit.Chem.AllChem import AssignBondOrdersFromTemplate
pdb_ligand = Chem.MolFromPDBFile(str(plf.datafiles.datapath / "vina" / "lig.pdb"))
smiles_mol = Chem.MolFromSmiles("C[NH+]1CC(C(=O)NC2(C)OC3(O)C4CCCN4C(=O)C(Cc4ccccc4)N3C2=O)C=C2c3cccc4[nH]cc(c34)CC21")
mol = AssignBondOrdersFromTemplate(smiles_mol, pdb_ligand)
mol_h = Chem.AddHs(mol)
ligand_mol = plf.Molecule.from_rdkit(mol_h)
Fingerprint generation#
We can now generate a fingerprint. By default, ProLIF will calculate the following interactions: Hydrophobic, HBDonor, HBAcceptor, PiStacking, Anionic, Cationic, CationPi, PiCation, VdWContact. You can list all interactions that are available with the following command:
plf.Fingerprint.list_available()
['Anionic',
'CationPi',
'Cationic',
'EdgeToFace',
'FaceToFace',
'HBAcceptor',
'HBDonor',
'Hydrophobic',
'MetalAcceptor',
'MetalDonor',
'PiCation',
'PiStacking',
'VdWContact',
'XBAcceptor',
'XBDonor']
Tip
The default fingerprint will only keep track of the first group of atoms that satisfied the constraints per interaction type and residue pair.
If you want to keep track of all possible interactions to generate a count-fingerprint (e.g. when there are two atoms in the ligand that make an HBond-donor interaction with residue X), use plf.Fingerprint(count=True)
.
This is also quite useful for visualization purposes as you can then display the atom pair that has the shortest distance which will look more accurate.
This fingerprint type is however a bit slower to compute.
# use default interactions
fp = plf.Fingerprint()
# run on your poses
fp.run_from_iterable([ligand_mol], protein_mol)
<prolif.fingerprint.Fingerprint: 9 interactions: ['Hydrophobic', 'HBAcceptor', 'HBDonor', 'Cationic', 'Anionic', 'CationPi', 'PiCation', 'PiStacking', 'VdWContact'] at 0x7fcf0e934190>
Tip
The run_from_iterable
method will automatically select residues that are close to the ligand (6.0 Å) when computing the fingerprint. You can modify the 6.0 Å cutoff by specifying plf.Fingerprint(vicinity_cutoff=7.0)
, but this is only useful if you decide to change the distance parameters for an interaction class (see in the advanced section of the tutorials).
Alternatively, you can pass a list of residues like so:
fp.run_from_iterable(<other parameters>, residues=["TYR38.A", "ASP129.A"])
You can save the fingerprint object with fp.to_pickle
and reload it later with Fingerprint.from_pickle
:
fp.to_pickle("fingerprint.pkl")
fp = plf.Fingerprint.from_pickle("fingerprint.pkl")
Analysis#
Once the execution is done, you can access the results through fp.ifp
which is a nested dictionary:
ligand_residue = "LIG1.G"
protein_residue = "ASP129.A"
fp.ifp[0][(ligand_residue, protein_residue)]
{'HBDonor': ({'indices': {'ligand': (13, 52), 'protein': (8,)},
'parent_indices': {'ligand': (13, 52), 'protein': (1489,)},
'distance': 2.7534161093413183,
'DHA_angle': 158.09471780792316},),
'Cationic': ({'indices': {'ligand': (13,), 'protein': (8,)},
'parent_indices': {'ligand': (13,), 'protein': (1489,)},
'distance': 2.7534161093413183},),
'VdWContact': ({'indices': {'ligand': (8,), 'protein': (8,)},
'parent_indices': {'ligand': (8,), 'protein': (1489,)},
'distance': 3.1985935319784264},)}
While this contains all the details about the different interactions that were detected, it’s not the easiest thing to digest.
The best way to analyse our results is to export the interaction fingerprint to a Pandas DataFrame. You can read more about pandas in their user_guide.
df = fp.to_dataframe()
df.T
Frame | 0 | ||
---|---|---|---|
ligand | protein | interaction | |
LIG1.G | TYR109.A | Hydrophobic | True |
TRP125.A | Hydrophobic | True | |
LEU126.A | Hydrophobic | True | |
ASP129.A | HBDonor | True | |
Cationic | True | ||
VdWContact | True | ||
ILE130.A | Hydrophobic | True | |
CYS133.A | Hydrophobic | True | |
VAL200.A | VdWContact | True | |
VAL201.A | Hydrophobic | True | |
HBAcceptor | True | ||
VdWContact | True | ||
THR203.A | Hydrophobic | True | |
THR209.A | Hydrophobic | True | |
SER212.A | HBDonor | True | |
VdWContact | True | ||
THR213.A | Hydrophobic | True | |
VdWContact | True | ||
ALA216.A | Hydrophobic | True | |
VdWContact | True | ||
PHE330.B | Hydrophobic | True | |
VdWContact | True | ||
PHE331.B | Hydrophobic | True | |
PiStacking | True | ||
VdWContact | True | ||
MET337.B | Hydrophobic | True | |
PHE351.B | Hydrophobic | True | |
ASP352.B | VdWContact | True |
Troubleshooting
If the dataframe only shows VdWContact interactions and nothing else, you may have a skipped the protein preparation step for protein PDB file: either all bonds should be explicit, or none of them.
If you only have partial bonds in your PDB file, MDAnalysis won’t trigger the bond guessing algorithm so your protein is essentially a point cloud that can’t match with any of the atom selections specified by interactions other than VdW.
To fix this, simply add the following line after creating a Universe
from your protein file
with MDAnalysis:
u.atoms.guess_bonds()
and rerun the rest of the notebook.
Here are some common pandas snippets to extract useful information from the fingerprint table.
# hide an interaction type (Hydrophobic)
df.drop("Hydrophobic", level="interaction", axis=1).T
Frame | 0 | ||
---|---|---|---|
ligand | protein | interaction | |
LIG1.G | ASP129.A | HBDonor | True |
Cationic | True | ||
VdWContact | True | ||
VAL200.A | VdWContact | True | |
VAL201.A | HBAcceptor | True | |
VdWContact | True | ||
SER212.A | HBDonor | True | |
VdWContact | True | ||
THR213.A | VdWContact | True | |
ALA216.A | VdWContact | True | |
PHE330.B | VdWContact | True | |
PHE331.B | PiStacking | True | |
VdWContact | True | ||
ASP352.B | VdWContact | True |
# show only one protein residue (ASP129.A)
df.xs("ASP129.A", level="protein", axis=1).T
Frame | 0 | |
---|---|---|
ligand | interaction | |
LIG1.G | HBDonor | True |
Cationic | True | |
VdWContact | True |
# show only an interaction type (PiStacking)
df.xs("PiStacking", level="interaction", axis=1).T
Frame | 0 | |
---|---|---|
ligand | protein | |
LIG1.G | PHE331.B | True |
Visualisation#
There are a few different options builtin when it comes to visualisation.
You can display the interactions in a 2D interactive diagram:
fp.plot_lignetwork(ligand_mol, kind="frame", frame=0)
This diagram is interactive, you can:
zoom and pan,
move residues around,
click on the legend to display or hide types of residues or interactions,
hover an interaction line to display the distance.
Note
It is not possible to export it as an image, but you can always take a screenshot.
Up to now we’ve only been using the default fingerprint generator, but you can optionally enable the count
parameter to enumerate all occurences of an interaction (the default fingerprint generator will stop at the first occurence), and then display all of them by specifying display_all=True
:
fp_count = plf.Fingerprint(count=True)
fp_count.run_from_iterable([ligand_mol], protein_mol)
fp_count.plot_lignetwork(ligand_mol, kind="frame", frame=0, display_all=True)
You can also visualize this information in 3D:
view = fp_count.plot_3d(ligand_mol, protein_mol, frame=0, display_all=False)
view
3Dmol.js failed to load for some reason. Please check your browser console for error messages.
<py3Dmol.view at 0x7fcf47fe0a10>
As in the lignetwork plot, you can hover atoms and interactions to display more information.
The advantage of using a count fingerprint in that case is that it will automatically select the interaction occurence with the shortest distance for a more intuitive visualization.
Once you’re satisfied with the orientation, you can export the view as a PNG image with the following snippet:
from IPython.display import Javascript
Javascript(
"""
var png = viewer_%s.pngURI()
var a = document.createElement('a')
a.href = png
a.download = "prolif-3d.png"
a.click()
a.remove()
"""
% view.uniqueid
)