PDB#

This tutorial showcases how to use ProLIF to generate an interaction fingerprint for interactions between a protein and a ligand in a PDB file. There are several preparation steps which you need to follow before generating an interaction fingerprint with ProLIF.

Important

For convenience, the different files for this tutorial are included with the ProLIF installation, and we’ll access these files through plf.datafiles.datapath which is a pathlib.Path object pointing to the tutorials data directory. This makes it easier to manipulate paths to files, match filenames using wildcards…etc. in a Pythonic way, but you can also use plain strings, e.g. "/home/user/prolif/data/vina/rec.pdb" instead of plf.datafiles.datapath / "vina" / "rec.pdb" if you prefer.

Remember to replace any reference to plf.datafiles.datapath with the actual paths to your inputs outside of this tutorial.

Tip

At the top of the page you can find links to either download this notebook or run it in Google Colab. You can install the dependencies for the tutorials with the command:

pip install prolif[tutorials]

Prerequisite#

The first step is to split your PDB file in two: one file for the protein and one file for the ligand. There are different ways to do that, from the command line with grep, or from a GUI such as PyMOL.

Important

Make sure that each file produced either has all bonds explicitely stated, or none of them.

Protein preparation#

In order to generate a valid molecule (with bond orders and charges) from a PDB file and correctly detect HBond interactions, ProLIF requires the protein file to contain explicit hydrogens.

Note

Having only polar hydrogens explicit may work if you use RDKit for reading the file, although they are some caveats around protonable residues such as histidines which RDKit may not recognize correctly due to naming conventions, we therefore encourage users to have all hydrogens explicit in the file and use MDAnalysis instead.

In order to protonate residues in your protein file, there are several strategies available, via command-line tools that you can install locally, or webservers, here are a few suggestions:

PypKa available through a webserver or as a CLI tool,
H++ webserver
PDB2PQR available through a webserver or as a CLI tool. Make sure to then convert the output PQR file to a PDB file.

Once this step is done, and you’ve verified that the prepared PDB file contains explicit hydrogens (and optionally explicit bonds), we can finally use that as input. We have 2 options to read the protein file: MDAnalysis (preferred) or RDKit (faster but risky, see below).

With MDAnalysis#

This is the recommended way of reading your prepared protein PDB file:

import MDAnalysis as mda
import prolif as plf

protein_file = str(plf.datafiles.datapath / "vina" / "rec.pdb")

u = mda.Universe(protein_file)
protein_mol = plf.Molecule.from_mda(u)
protein_mol.n_residues

We can have a quick look at each residue individually using the following command (we’ll only check residues 260 to 262 as there’s an histidine in there which could be problematic):

# check that residues 260 to 262 were processed correctly
# remove the `slice(260, 263)` part to show all residues
plf.display_residues(protein_mol, slice(260, 263))

../_images/4ae9c235183875ed46d6a857e5dc0e17044df5fe0af9b7a86079c88d3a8461c4.svg

Troubleshooting

Explicit valence for atom <...> is greater than permitted

If your file did not contain explicit bonds, MDAnalysis will automatically run its connectivity-perception algorithm. This could result in some atomic clashes that may be incorrectly classified as bonds and will prevent the conversion of the MDAnalysis molecule to RDKit through ProLIF. Since MDAnalysis uses van der Waals radii for bond detection, one can modify the default radii that are used:

u.atoms.guess_bonds(vdwradii={"H": 1.05, "O": 1.48})

Alternatively you could use some GUI tools like PyMOL to write the bonds (CONECT record) in your file and then reread it here.

With RDKit#

Important

While RDKit is going to be much faster at parsing the file, it can only recognize standard residue names, so some protonated residues may be incorrectly parsed.

For example, the protonated histidine residue HSE347 in our protein is not correcltly parsed which removes aromaticity on the ring, meaning that ProLIF will not be able to match this residue for pi-stacking interactions.

from rdkit import Chem

rdkit_prot = Chem.MolFromPDBFile(protein_file, removeHs=False)
rdkit_protein_mol = plf.Molecule(rdkit_prot)
# histidine HSE347 not recognized by RDKit
plf.display_residues(rdkit_protein_mol, slice(260, 263))

../_images/491ec40bfd0b373c0a5398e820e29c4de2b01a1a13c57e855367e0ce545c4c5c.svg

Ligand preparation#

As for the protein structure, we’ll need our prepared ligand file to contain explicit hydrogens. There are different ways to do this.

One could load the ligand file (or the whole structure) in PyMOL, select the ligand, add explicit hydrogens, and export the file as SDF, then use the following snippet:

# read SDF
ligand_file = str(plf.datafiles.datapath / "vina" / "vina_output.sdf")
ligand_mol = plf.sdf_supplier(ligand_file)[0]
# display ligand
plf.display_residues(ligand_mol, size=(400, 200))

../_images/180b0b91d1856bebce404359ed402eff04298afa07400f642f2e3764b7f2b41e.svg

If you only have a PDB file with hydrogens, you could use this code snippet instead:

# load PDB with explicit hydrogens
ligand_file = str(plf.datafiles.datapath / "vina" / "lig.pdb")
u = mda.Universe(ligand_file)
ligand_mol = plf.Molecule.from_mda(u)
# display ligand
plf.display_residues(ligand_mol, size=(400, 200))

../_images/289fd4a5d23cf69a9bf2e3eebf1142caa5f7cce48e3512220791fdf315fe57e0.svg

Alternatively, PDB files from the Protein Data Bank usually provide the ligand as a SMILES string.

We can use the following snippet to combine the SMILES string and the ligand PDB file to prepare a valid molecule for ProLIF:

from rdkit import Chem
from rdkit.Chem.AllChem import AssignBondOrdersFromTemplate

pdb_ligand = Chem.MolFromPDBFile(str(plf.datafiles.datapath / "vina" / "lig.pdb"))
smiles_mol = Chem.MolFromSmiles("C[NH+]1CC(C(=O)NC2(C)OC3(O)C4CCCN4C(=O)C(Cc4ccccc4)N3C2=O)C=C2c3cccc4[nH]cc(c34)CC21")
mol = AssignBondOrdersFromTemplate(smiles_mol, pdb_ligand)
mol_h = Chem.AddHs(mol)
ligand_mol = plf.Molecule.from_rdkit(mol_h)

Fingerprint generation#

We can now generate a fingerprint. By default, ProLIF will calculate the following interactions: Hydrophobic, HBDonor, HBAcceptor, PiStacking, Anionic, Cationic, CationPi, PiCation, VdWContact. You can list all interactions that are available with the following command:

plf.Fingerprint.list_available()

['Anionic',
 'CationPi',
 'Cationic',
 'EdgeToFace',
 'FaceToFace',
 'HBAcceptor',
 'HBDonor',
 'Hydrophobic',
 'MetalAcceptor',
 'MetalDonor',
 'PiCation',
 'PiStacking',
 'VdWContact',
 'XBAcceptor',
 'XBDonor']

Tip

The default fingerprint will only keep track of the first group of atoms that satisfied the constraints per interaction type and residue pair.

If you want to keep track of all possible interactions to generate a count-fingerprint (e.g. when there are two atoms in the ligand that make an HBond-donor interaction with residue X), use plf.Fingerprint(count=True). This is also quite useful for visualization purposes as you can then display the atom pair that has the shortest distance which will look more accurate. This fingerprint type is however a bit slower to compute.

# use default interactions
fp = plf.Fingerprint()
# run on your poses
fp.run_from_iterable([ligand_mol], protein_mol)

<prolif.fingerprint.Fingerprint: 9 interactions: ['Hydrophobic', 'HBAcceptor', 'HBDonor', 'Cationic', 'Anionic', 'CationPi', 'PiCation', 'PiStacking', 'VdWContact'] at 0x7fcf0e934190>

Tip

The run_from_iterable method will automatically select residues that are close to the ligand (6.0 Å) when computing the fingerprint. You can modify the 6.0 Å cutoff by specifying plf.Fingerprint(vicinity_cutoff=7.0), but this is only useful if you decide to change the distance parameters for an interaction class (see in the advanced section of the tutorials).

Alternatively, you can pass a list of residues like so:

fp.run_from_iterable(<other parameters>, residues=["TYR38.A", "ASP129.A"])

You can save the fingerprint object with fp.to_pickle and reload it later with Fingerprint.from_pickle:

fp.to_pickle("fingerprint.pkl")
fp = plf.Fingerprint.from_pickle("fingerprint.pkl")

Analysis#

Once the execution is done, you can access the results through fp.ifp which is a nested dictionary:

ligand_residue = "LIG1.G"
protein_residue = "ASP129.A"

fp.ifp[0][(ligand_residue, protein_residue)]

{'HBDonor': ({'indices': {'ligand': (13, 52), 'protein': (8,)},
   'parent_indices': {'ligand': (13, 52), 'protein': (1489,)},
   'distance': 2.7534161093413183,
   'DHA_angle': 158.09471780792316},),
 'Cationic': ({'indices': {'ligand': (13,), 'protein': (8,)},
   'parent_indices': {'ligand': (13,), 'protein': (1489,)},
   'distance': 2.7534161093413183},),
 'VdWContact': ({'indices': {'ligand': (8,), 'protein': (8,)},
   'parent_indices': {'ligand': (8,), 'protein': (1489,)},
   'distance': 3.1985935319784264},)}

While this contains all the details about the different interactions that were detected, it’s not the easiest thing to digest.

The best way to analyse our results is to export the interaction fingerprint to a Pandas DataFrame. You can read more about pandas in their user_guide.

df = fp.to_dataframe()
df.T

		Frame	0
ligand	protein	interaction
LIG1.G	TYR109.A	Hydrophobic	True
	TRP125.A	Hydrophobic	True
	LEU126.A	Hydrophobic	True
	ASP129.A	HBDonor	True
		Cationic	True
		VdWContact	True
	ILE130.A	Hydrophobic	True
	CYS133.A	Hydrophobic	True
	VAL200.A	VdWContact	True
	VAL201.A	Hydrophobic	True
		HBAcceptor	True
		VdWContact	True
	THR203.A	Hydrophobic	True
	THR209.A	Hydrophobic	True
	SER212.A	HBDonor	True
	SER212.A	VdWContact	True
	THR213.A	Hydrophobic	True
	THR213.A	VdWContact	True
	ALA216.A	Hydrophobic	True
	ALA216.A	VdWContact	True
	PHE330.B	Hydrophobic	True
	PHE330.B	VdWContact	True
	PHE331.B	Hydrophobic	True
		PiStacking	True
		VdWContact	True
	MET337.B	Hydrophobic	True
	PHE351.B	Hydrophobic	True
	ASP352.B	VdWContact	True

Troubleshooting

If the dataframe only shows VdWContact interactions and nothing else, you may have a skipped the protein preparation step for protein PDB file: either all bonds should be explicit, or none of them.

If you only have partial bonds in your PDB file, MDAnalysis won’t trigger the bond guessing algorithm so your protein is essentially a point cloud that can’t match with any of the atom selections specified by interactions other than VdW.

To fix this, simply add the following line after creating a Universe from your protein file with MDAnalysis:

u.atoms.guess_bonds()

and rerun the rest of the notebook.

Here are some common pandas snippets to extract useful information from the fingerprint table.

# hide an interaction type (Hydrophobic)
df.drop("Hydrophobic", level="interaction", axis=1).T

		Frame	0
ligand	protein	interaction
LIG1.G	ASP129.A	HBDonor	True
		Cationic	True
		VdWContact	True
	VAL200.A	VdWContact	True
	VAL201.A	HBAcceptor	True
	VAL201.A	VdWContact	True
	SER212.A	HBDonor	True
	SER212.A	VdWContact	True
	THR213.A	VdWContact	True
	ALA216.A	VdWContact	True
	PHE330.B	VdWContact	True
	PHE331.B	PiStacking	True
	PHE331.B	VdWContact	True
	ASP352.B	VdWContact	True

# show only one protein residue (ASP129.A)
df.xs("ASP129.A", level="protein", axis=1).T

	Frame	0
ligand	interaction
LIG1.G	HBDonor	True
	Cationic	True
	VdWContact	True

# show only an interaction type (PiStacking)
df.xs("PiStacking", level="interaction", axis=1).T

	Frame	0
ligand	protein
LIG1.G	PHE331.B	True

Visualisation#

There are a few different options builtin when it comes to visualisation.

You can display the interactions in a 2D interactive diagram:

fp.plot_lignetwork(ligand_mol, kind="frame", frame=0)

This diagram is interactive, you can:

zoom and pan,
move residues around,
click on the legend to display or hide types of residues or interactions,
hover an interaction line to display the distance.

Note

It is not possible to export it as an image, but you can always take a screenshot.

Up to now we’ve only been using the default fingerprint generator, but you can optionally enable the count parameter to enumerate all occurences of an interaction (the default fingerprint generator will stop at the first occurence), and then display all of them by specifying display_all=True:

fp_count = plf.Fingerprint(count=True)
fp_count.run_from_iterable([ligand_mol], protein_mol)
fp_count.plot_lignetwork(ligand_mol, kind="frame", frame=0, display_all=True)

You can also visualize this information in 3D:

view = fp_count.plot_3d(ligand_mol, protein_mol, frame=0, display_all=False)
view

3Dmol.js failed to load for some reason. Please check your browser console for error messages.

<py3Dmol.view at 0x7fcf47fe0a10>

As in the lignetwork plot, you can hover atoms and interactions to display more information.

The advantage of using a count fingerprint in that case is that it will automatically select the interaction occurence with the shortest distance for a more intuitive visualization.

Once you’re satisfied with the orientation, you can export the view as a PNG image with the following snippet:

from IPython.display import Javascript

Javascript(
    """
    var png = viewer_%s.pngURI()
    var a = document.createElement('a')
    a.href = png
    a.download = "prolif-3d.png"
    a.click()
    a.remove()
    """
    % view.uniqueid
)

PDB

Contents

PDB#

Prerequisite#

Protein preparation#

With MDAnalysis#

With RDKit#

Ligand preparation#

Fingerprint generation#

Analysis#

Visualisation#