3D-SPECS: Frequently Asked Questions

What can 3D-SPECS be used for?

3D-SPECS contains crystallisation predictions for all human proteins*. This information can be used for target selection and construct design, both of which are described in the sections that follow.

*The longest protein product of each gene is used as the representative 3D-SPECS entry.

What is Target Selection?

Target selection is the process a lab or individual uses to identify proteins for structure determination studies.

Different laboratories have different reasons for choosing targets for structure determination. Some labs may be trying to solve the structure of individual proteins, while others may wish to target particular protein families. Many structural genomics laboratories have much more flexibility when it comes to selecting targets, and this is generally where 3D-SPECS data will be most applicable. For example, selecting 100 XtalPred class 1 protein regions is statistically likely to result in more structures than selecting 100 XtalPred class 5 protein regions. (How much more likely? See the section 'What do XtalPred Crystallisation Classes mean?')

Target selection is something of a chicken-and-egg problem. How can you select the best targets without first selecting the best constructs/truncations? 3D-SPECS solves this problem by matching human proteins to PDB templates to calculate an approximate 'best' construct, and then using XtalPred to calculate the likelihood of crystallisation success. By doing this for all human proteins, 3D-SPECS automatically identifies protein candidates with a high predicted crystallisation success. These regions can then be optimised by following good construct design principles.

Whether you are trying to solve the structure of one protein or a hundred proteins, the prediction data in 3D-SPECS can be useful for designing constructs that maximise the chance of crystallisation success.

What is Construct Design?

Construct design is the first step in the structure determination pipeline. Once a target has been selected, the target sequence is examined using bioinformatics tools, and regions that may hinder crystallisation are generally excluded from the sequence that is cloned into the bacterial vector. Examples are long regions (>20 residues) of predicted disorder at the N or C terminus. Where there is a clear distinction between two domains, the domains may also be separated and attempted individually.
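
The terminal-disorder rule above can be sketched in a few lines of Python. This is an illustration only, not how 3D-SPECS itself works: the function name trim_disordered_termini, the per-residue boolean disorder mask, and the min_run parameter are all assumptions introduced for this example. The mask would come from whichever disorder predictor you use.

```python
def trim_disordered_termini(sequence, disordered, min_run=20):
    """Trim long disordered runs from the N and C termini.

    sequence   -- protein sequence as a string
    disordered -- list of booleans, one per residue (hypothetical input
                  taken from a disorder predictor of your choice)
    min_run    -- a terminal run is removed only if it is LONGER than this

    Returns (start, stop, trimmed_sequence), 1-based and inclusive.
    """
    if len(sequence) != len(disordered):
        raise ValueError("sequence and disorder mask must have equal length")

    n = len(sequence)

    # Count consecutive disordered residues from the N terminus.
    n_run = 0
    while n_run < n and disordered[n_run]:
        n_run += 1

    # Count consecutive disordered residues from the C terminus,
    # without re-counting residues already assigned to the N-terminal run.
    c_run = 0
    while c_run < n - n_run and disordered[n - 1 - c_run]:
        c_run += 1

    start = n_run if n_run > min_run else 0       # 0-based, inclusive
    stop = n - c_run if c_run > min_run else n    # 0-based, exclusive
    return start + 1, stop, sequence[start:stop]
```

Short disordered tails are deliberately kept, since only long runs are named above as a problem. The result is a starting point for the template-based checks described next, not a final construct.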

It is important to avoid cutting off any critical secondary structures that are responsible for protein stability or for important protein-protein interactions (e.g. in multimer formation). The tricky part of construct design is knowing which residues are 'critical' and which are not. Generally, the best guide for where to start/stop a protein truncation is to identify a protein in the Protein Data Bank (PDB) that has the same folding arrangement of secondary structures, accurately align the two proteins, and then begin/end the construct near the start/stop of the PDB template.

The impact of including / excluding certain residues can be estimated by assessing the 3D structure of the PDB template and asking questions like:

If these questions seem a bit daunting, don't worry. By selecting multiple start and stop sites (e.g. 2, 3 or 4 start sites and 2, 3 or 4 stop sites) you can experimentally test between 4 and 16 constructs, depending on your experimental pipeline bandwidth. Around 9 constructs (3x3) with a start/stop position separation of 5-10 residues makes for a good sampling compromise. There is no magic 'optimum' number of constructs. The more constructs you make, the more chance you have of finding the right region to express/purify/crystallise. At the end of the day, whether a protein region expresses and can be purified matters more than the theoretical construct design, the number of constructs required, or the sampling strategy it took to find the 'expressable' region.
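
The start/stop sampling described above amounts to taking every combination of the chosen start and stop sites. A minimal Python sketch follows; the function name enumerate_constructs and the example positions are hypothetical, chosen only to show a 3x3 grid with 5-residue spacing.

```python
from itertools import product


def enumerate_constructs(sequence, starts, stops):
    """Yield (start, stop, subsequence) for every start/stop combination.

    Positions are 1-based and inclusive, as residue numbers usually are.
    """
    for start, stop in product(starts, stops):
        if not (1 <= start <= stop <= len(sequence)):
            raise ValueError(f"invalid construct boundaries: {start}-{stop}")
        yield start, stop, sequence[start - 1:stop]


# Hypothetical example: a 300-residue placeholder sequence, with three
# start sites and three stop sites spaced 5 residues apart.
sequence = "A" * 300
for start, stop, construct in enumerate_constructs(
    sequence, starts=[25, 30, 35], stops=[245, 250, 255]
):
    print(f"{start}-{stop}: {len(construct)} residues")
```

With three starts and three stops this produces the 9 constructs mentioned above. Two of each gives 4, and four of each gives 16.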

3D-SPECS gives the regions with the best alignments to PDB templates on the 'SUMMARY' page in the 'Crystallisable Regions' table. The individual alignment(s) can be viewed under the 'TEMPLATES' tab. The combination of the PDB template information and the secondary structure detail gives the construct designer the opportunity to adjust the start/stop positions to accommodate the unique features of each query sequence.

The 'XTALPRED' tab shows the XtalPred scores for many 'in silico' truncations. This gives the construct designer an indication of whether removing residues at each terminus will help or hinder crystallisation. However, it is best to combine this information with the PDB template data rather than use the XtalPred scores as a guide on their own. As a general rule, 3D information/knowledge should always take priority over sequence-based predictions (of which XtalPred is an example).

What do XtalPred Crystallisation Classes mean?

The crystallisation class definition given by XtalPred is:
1 = optimal, 2 = suboptimal, 3 = average, 4 = difficult, 5 = very difficult.
(1 is the most likely to crystallise and 5 is the least likely to crystallise).

In terms of real-world success, these classes roughly translate into percent crystallisation success, based on the XtalPred benchmarking in Slabinski et al., Protein Science, 2007, which used structural genomics data from a database called TargetDB:

These are the results from testing ~4,000 different proteins. Notice that XtalPred class 5 has a considerably lower success rate than the other classes. XtalPred will put a protein in class 5 if it has any single bad property; a combination of bad properties is not required. Such properties include a long region of disorder (>40 residues), a single TM helix, or an unusually long or short sequence (longer than 700 or shorter than 70 residues).
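
The 'any single bad property' rule can be illustrated with a short Python sketch. This is not the XtalPred algorithm: it encodes only the three example properties named above, the ProteinFeatures fields and the class5_flags function are hypothetical names, and reading 'a single TM helix' as 'one or more TM helices' is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProteinFeatures:
    """Hypothetical container for precomputed sequence features."""
    length: int            # sequence length in residues
    longest_disorder: int  # longest predicted disordered region, in residues
    tm_helices: int        # number of predicted transmembrane helices


def class5_flags(protein):
    """Return the class 5 triggers named in the text that apply.

    Any one flag is enough; an empty list means none of these
    particular triggers fired (other criteria may still apply).
    """
    flags = []
    if protein.longest_disorder > 40:
        flags.append("disordered region longer than 40 residues")
    if protein.tm_helices >= 1:  # assumption: one TM helix or more
        flags.append("transmembrane helix present")
    if protein.length > 700:
        flags.append("sequence longer than 700 residues")
    if protein.length < 70:
        flags.append("sequence shorter than 70 residues")
    return flags


# Hypothetical example: otherwise well-behaved, but one long disordered region.
example = ProteinFeatures(length=300, longest_disorder=45, tm_helices=0)
print(class5_flags(example))
```

Because each check is independent, a single flag is sufficient, which is why an in silico truncation that removes a long disordered terminus can move a region out of class 5.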

For more detail, take a look at the local XtalPred page and the references therein, or the help page on the remote XtalPred website.

What are the 3D-SPECS identifiers?

The 3D-SPECS identifier is composed of three parts. All identifiers start with 'GDB' and end with 'A' (A signifies that this is the longest protein isoform). The numbers in the middle are the NCBI Gene ID. For example, GDB79716A is the longest protein isoform of NCBI gene entry 79716 (at the time of building 3D-SPECS). GDB79716A has an associated GenBank GI number for the protein, which in this case is 'gi|47155554'. All calculations/predictions for GDB79716A were carried out using the sequence defined by 'gi|47155554'.
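
If you need to convert between 3D-SPECS identifiers and NCBI Gene IDs in a script, the three-part format above is easy to handle with a regular expression. The helper names parse_3dspecs_id and make_3dspecs_id below are hypothetical and are not part of 3D-SPECS.

```python
import re

# 'GDB' prefix, NCBI Gene ID digits, 'A' suffix (longest isoform).
_ID_PATTERN = re.compile(r"GDB(\d+)A")


def parse_3dspecs_id(identifier):
    """Return the NCBI Gene ID (as an int) from a 3D-SPECS identifier."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a 3D-SPECS identifier: {identifier!r}")
    return int(match.group(1))


def make_3dspecs_id(gene_id):
    """Build the 3D-SPECS identifier for an NCBI Gene ID."""
    return f"GDB{int(gene_id)}A"


print(parse_3dspecs_id("GDB79716A"))  # prints 79716
print(make_3dspecs_id(79716))         # prints GDB79716A
```

The GI number is not encoded in the identifier, so mapping an entry to its GenBank sequence still requires looking it up on the entry page.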
