The essential steps needed to map Protein Data Bank (PDB) entries into valid mmCIF
data sets are discussed. Examples of converting both routine and complex structures
using actual PDB entries with the program pdb2cif [BBB98] are given.
The Protein Data Bank format [PDB77, PDB95, PDB96] has been used for over 20 years
to archive macromolecular data, is produced by many refinement programs, and is used
as an input format by many applications. The pending adoption of the mmCIF dictionary
[FBB96] by the IUCr, in response to the need to explicitly represent a larger amount
of data which can be parsed by computer, necessary as the number of structures continues
to grow exponentially, has made translation from PDB format to mmCIF format a pressing issue.
In this talk we review the techniques needed to move from structures represented in
PDB format to mmCIF format. Some data items have a direct mapping with minor syntactic
adjustment, such as author names and journal references. Other data items, however, require us to recast our thinking along new lines. For example, the PDB format works
with chains and HET groups, while mmCIF uses entities (discrete chemical components).
Proper identification of entities in a PDB entry may require looking for sequence
homologies. As another example, consider beta sheets. The PDB format treats a bifurcated
sheet as two distinct sheets which happen to have certain strands in common, while
mmCIF allows all the strands involved to be represented as a single sheet. This requires strand matching and alignment to go from PDB format to mmCIF. What has currently
been automated in pdb2cif and what still requires human intervention will be discussed.
The Protein Data Bank [PDB96] uses a format with fixed fields and is order-dependent. Here is part of the list of atomic coordinates information from the PDB entry 4INS [DHH89] in the 1989 format and the 1996 format:
1989 format:
ATOM      1  N   GLY A   1      -8.863  16.944  14.289  1.00 21.88   1  4INS 235
ATOM      2  CA  GLY A   1      -9.929  17.026  13.244  1.00 22.85   1  4INS 236
ATOM      3  C   GLY A   1     -10.051  15.625  12.618  1.00 43.92   1  4INS 237
ATOM      4  O   GLY A   1      -9.782  14.728  13.407  1.00 25.22   1  4INS 238
ATOM      5  N   ILE A   2     -10.333  15.531  11.332  1.00 26.28   1  4INS 239
ATOM      6  CA  ILE A   2     -10.488  14.266  10.600  1.00 20.84   1  4INS 240
ATOM      7  C   ILE A   2      -9.367  13.302  10.658  1.00 11.81   1  4INS 241
ATOM      8  O   ILE A   2      -9.580  12.092  10.969  1.00 20.31   1  4INS 242
ATOM      9  CB  ILE A   2     -10.883  14.493   9.095  1.00 40.00   1  4INS 243
ATOM     10  CG1 ILE A   2     -11.579  13.146   8.697  1.00 36.74   1  4INS 244

1996 format:
ATOM      1  N   GLY A   1      -8.863  16.944  14.289  1.00 21.88   1       N
ATOM      2  CA  GLY A   1      -9.929  17.026  13.244  1.00 22.85   1       C
ATOM      3  C   GLY A   1     -10.051  15.625  12.618  1.00 43.92   1       C
ATOM      4  O   GLY A   1      -9.782  14.728  13.407  1.00 25.22   1       O
ATOM      5  N   ILE A   2     -10.333  15.531  11.332  1.00 26.28   1       N
ATOM      6  CA  ILE A   2     -10.488  14.266  10.600  1.00 20.84   1       C
ATOM      7  C   ILE A   2      -9.367  13.302  10.658  1.00 11.81   1       C
ATOM      8  O   ILE A   2      -9.580  12.092  10.969  1.00 20.31   1       O
ATOM      9  CB  ILE A   2     -10.883  14.493   9.095  1.00 40.00   1       C
ATOM     10  CG1 ILE A   2     -11.579  13.146   8.697  1.00 36.74   1       C
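Because the PDB format uses fixed fields, a reader slices each record by column position rather than splitting on whitespace. The following is a minimal sketch, not pdb2cif's own code: the helper names `make_atom_record` and `parse_atom_record` are hypothetical, and the column positions are those of the standard ATOM record layout (serial in columns 7-11, atom name in 13-16, coordinates in 31-54, occupancy and B value in 55-66).

```python
def make_atom_record(serial, name, res_name, chain_id, res_seq, x, y, z, occ, b_iso):
    """Build columns 1-66 of an ATOM record (hypothetical helper for the demo)."""
    return "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
        "ATOM", serial, name, "", res_name, chain_id, res_seq, "", x, y, z, occ, b_iso)


def parse_atom_record(line):
    """Slice an ATOM record by column position (0-based slices of 1-based columns)."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "alt_loc": line[16].strip(),       # column 17
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21].strip(),      # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_iso": float(line[60:66]),       # columns 61-66
    }


# First atom of 4INS as quoted in the text.
record = make_atom_record(1, " N", "GLY", "A", 1, -8.863, 16.944, 14.289, 1.00, 21.88)
atom = parse_atom_record(record)
print(atom)
```

Slicing by position keeps working when adjacent fields run together, for example a four-character atom name followed by an alternate location flag, which whitespace splitting would merge into one token.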
The new mmCIF format is one of a family of STAR (Self-Defining Text Archive and Retrieval
File [HS94]) formats which uses a tag-value style of presentation and has very little
sensitivity to the ordering of the information. Here is an extract from an mmCIF
conversion of PDB entry 4INS:
loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1 ATOM N N   GLY A 1 .  -8.863 16.944 14.289 1.00 21.88 1 1  1
1 ATOM C CA  GLY A 1 .  -9.929 17.026 13.244 1.00 22.85 1 1  2
1 ATOM C C   GLY A 1 . -10.051 15.625 12.618 1.00 43.92 1 1  3
1 ATOM O O   GLY A 1 .  -9.782 14.728 13.407 1.00 25.22 1 1  4
2 ATOM N N   ILE A 2 . -10.333 15.531 11.332 1.00 26.28 1 1  5
2 ATOM C CA  ILE A 2 . -10.488 14.266 10.600 1.00 20.84 1 1  6
2 ATOM C C   ILE A 2 .  -9.367 13.302 10.658 1.00 11.81 1 1  7
2 ATOM O O   ILE A 2 .  -9.580 12.092 10.969 1.00 20.31 1 1  8
2 ATOM C CB  ILE A 2 . -10.883 14.493  9.095 1.00 40.00 1 1  9
2 ATOM C CG1 ILE A 2 . -11.579 13.146  8.697 1.00 36.74 1 1 10
Because tags are always given, the same information can be presented in different
orderings. Note that the mmCIF format does not depend on the columns shown here,
just on a consistent ordering of tags versus data values.
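The independence from tag ordering can be illustrated with a small reader that pairs each value with its tag. This is a simplified sketch under stated assumptions: the function name `parse_simple_loop` is hypothetical, and it handles only a single `loop_` whose values contain no quotes or embedded whitespace, which is far less than a real STAR/CIF parser must support.

```python
def parse_simple_loop(text):
    """Parse one mmCIF loop_ into a list of {tag: value} rows.

    Assumes whitespace-separated tokens with no quoted strings,
    which is enough for the atom_site extract discussed here.
    """
    tokens = text.split()
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected a loop_ block")
    tags = []
    pos = 1
    while pos < len(tokens) and tokens[pos].startswith("_"):
        tags.append(tokens[pos])
        pos += 1
    values = tokens[pos:]
    if not tags or len(values) % len(tags) != 0:
        raise ValueError("value count is not a multiple of the tag count")
    width = len(tags)
    return [dict(zip(tags, values[i:i + width])) for i in range(0, len(values), width)]


# The same two atoms written with two different tag orderings.
loop_a = """
loop_
_atom_site.id _atom_site.label_atom_id _atom_site.cartn_x
1 N  -8.863
2 CA -9.929
"""
loop_b = """
loop_
_atom_site.cartn_x _atom_site.id _atom_site.label_atom_id
-8.863 1 N
-9.929 2 CA
"""

rows_a = parse_simple_loop(loop_a)
rows_b = parse_simple_loop(loop_b)
print(rows_a == rows_b)
```

Both orderings yield the same rows because every value is looked up by tag, never by position in the line.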
The major differences in syntax are as follows:
PDB and mmCIF formats agree simply and directly for some data items, such as cell parameters, and admit a simple tabular mapping, as shown by this extract from the concordance [B96], which is available as part of pdb2cif [BBB98]:
PDB Field        Content   Type of Transformation and Related mmCIF field
CRYST1[1-6]      CRYST1    NA
CRYST1[7-15]     a         equivalent to _cell.length_a
CRYST1[16-24]    b         equivalent to _cell.length_b
CRYST1[25-33]    c         equivalent to _cell.length_c
CRYST1[34-40]    alpha     equivalent to _cell.angle_alpha
CRYST1[41-47]    beta      equivalent to _cell.angle_beta
CRYST1[48-54]    gamma     equivalent to _cell.angle_gamma
CRYST1[56-66]    sGroup    equivalent to _symmetry.space_group_name_H-M
CRYST1[67-70]    z         equivalent to _cell.Z_PDB
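A tabular mapping of this kind translates almost directly into a lookup table. The sketch below uses the column ranges and mmCIF tags from the concordance extract; the function names and the cell values are illustrative assumptions (a made-up orthorhombic cell, not taken from any PDB entry), and this is not how pdb2cif itself is written.

```python
# (first column, last column, mmCIF tag), 1-based inclusive, from the concordance.
CRYST1_MAP = [
    (7, 15, "_cell.length_a"),
    (16, 24, "_cell.length_b"),
    (25, 33, "_cell.length_c"),
    (34, 40, "_cell.angle_alpha"),
    (41, 47, "_cell.angle_beta"),
    (48, 54, "_cell.angle_gamma"),
    (56, 66, "_symmetry.space_group_name_H-M"),
    (67, 70, "_cell.Z_PDB"),
]


def cryst1_to_mmcif(line):
    """Map the fixed columns of a CRYST1 record onto mmCIF tag/value pairs."""
    if not line.startswith("CRYST1"):
        raise ValueError("not a CRYST1 record")
    return {tag: line[first - 1:last].strip() for first, last, tag in CRYST1_MAP}


def format_item(tag, value):
    """Quote values containing blanks, as CIF syntax requires."""
    if " " in value:
        value = "'" + value + "'"
    return "{:<34}{}".format(tag, value)


# Hypothetical cell, built column by column so the layout is exact.
cryst1 = "CRYST1{:9.3f}{:9.3f}{:9.3f}{:7.2f}{:7.2f}{:7.2f} {:<11}{:>4}".format(
    50.0, 60.0, 70.0, 90.0, 90.0, 90.0, "P 21 21 21", 4)

cell = cryst1_to_mmcif(cryst1)
for tag, value in cell.items():
    print(format_item(tag, value))
```

Keeping the mapping as data rather than code means a new "equivalent to" row in the concordance needs only one more tuple.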
Other data items require restructuring rather than a field-by-field mapping. For example, in mmCIF, sheets are built up out of strands. All the strands in all sheets are listed in one STRUCT_SHEET_RANGE table. The relative ordering and orientation of all strands in all sheets are given in one STRUCT_SHEET_ORDER table. The hydrogen bonding among all strands in all sheets is listed in one STRUCT_SHEET_HBOND table. The general characteristics of all sheets per se are given in one STRUCT_SHEET table. In PDB format, sheets are described by one set of SHEET records per simple, non-bifurcated sheet. To convert from PDB format to mmCIF format, a list of all strands must be extracted from the SHEET records, sorted to remove duplicates, and placed in a STRUCT_SHEET_RANGE table. All strand-to-strand relationships are extracted and placed in a STRUCT_SHEET_ORDER table, and so on. Here is a diagram of PDB entry 2ACE [HRSS96] showing strands forming sheets:
This is presented in the PDB entry as:
SHEET    1   A 3 LEU     6  THR    10  0
SHEET    2   A 3 GLY    13  MET    16 -1  N  VAL    15   O  VAL     8
SHEET    3   A 3 VAL    57  ALA    60  1  N  TRP    58   O  LYS    14
SHEET    1   B11 MET    16  PRO    21  0
SHEET    2   B11 HIS    26  PRO    34 -1  O  ALA    29   N  THR    18
SHEET    3   B11 TYR    96  PRO   102 -1  N  ILE    99   O  PHE    30
SHEET    4   B11 VAL   142  SER   147 -1  N  LEU   143   O  TRP   100
SHEET    5   B11 THR   109  TYR   116  1  N  MET   112   O  VAL   142
SHEET    6   B11 THR   193  GLU   199  1  O  THR   195   N  VAL   113
SHEET    7   B11 ARG   220  SER   226  1  N  ILE   223   O  ILE   196
SHEET    8   B11 GLN   318  ASN   324  1  N  GLY   322   O  LEU   224
SHEET    9   B11 GLY   417  PHE   423  1  N  TYR   421   O  LEU   321
SHEET   10   B11 PHE   502  LEU   505  1  N  ILE   503   O  LEU   420
SHEET   11   B11 MET   510  GLN   514 -1  N  HIS   513   O  PHE   502
Here is the same information converted to mmCIF format by pdb2cif:
loop_
_struct_sheet.id
_struct_sheet.number_strands
A 3
B 11

loop_
_struct_sheet_hbond.sheet_id
_struct_sheet_hbond.range_id_1
_struct_sheet_hbond.range_id_2
_struct_sheet_hbond.range_1_beg_label_seq_id
_struct_sheet_hbond.range_1_beg_label_atom_id
_struct_sheet_hbond.range_2_beg_label_seq_id
_struct_sheet_hbond.range_2_beg_label_atom_id
_struct_sheet_hbond.range_1_end_label_seq_id
_struct_sheet_hbond.range_1_end_label_atom_id
_struct_sheet_hbond.range_2_end_label_seq_id
_struct_sheet_hbond.range_2_end_label_atom_id
A 1_A  2_A    8 O  15 N   8 O  15 N
A 2_A  3_A   14 O  58 N  14 O  58 N
B 1_B  2_B   18 N  29 O  18 N  29 O
B 10_B 11_B 502 O 513 N 502 O 513 N
B 2_B  3_B   30 O  99 N  30 O  99 N
B 3_B  4_B  100 O 143 N 100 O 143 N
B 4_B  5_B  142 O 112 N 142 O 112 N
B 5_B  6_B  113 N 195 O 113 N 195 O
B 6_B  7_B  196 O 223 N 196 O 223 N
B 7_B  8_B  224 O 322 N 224 O 322 N
B 8_B  9_B  321 O 421 N 321 O 421 N
B 9_B  10_B 420 O 503 N 420 O 503 N

loop_
_struct_sheet_order.sheet_id
_struct_sheet_order.range_id_1
_struct_sheet_order.range_id_2
_struct_sheet_order.offset
_struct_sheet_order.sense
A 1_A  2_A  +1 anti-parallel
A 2_A  3_A  +1 parallel
B 1_B  2_B  +1 anti-parallel
B 10_B 11_B +1 anti-parallel
B 2_B  3_B  +1 anti-parallel
B 3_B  4_B  +1 anti-parallel
B 4_B  5_B  +1 parallel
B 5_B  6_B  +1 parallel
B 6_B  7_B  +1 parallel
B 7_B  8_B  +1 parallel
B 8_B  9_B  +1 parallel
B 9_B  10_B +1 parallel

loop_
_struct_sheet_range.sheet_id
_struct_sheet_range.id
_struct_sheet_range.beg_label_comp_id
_struct_sheet_range.beg_label_asym_id
_struct_sheet_range.beg_label_seq_id
_struct_sheet_range.end_label_comp_id
_struct_sheet_range.end_label_asym_id
_struct_sheet_range.end_label_seq_id
A 1_A  LEU *   6 THR *  10
A 2_A  GLY *  13 MET *  16
A 3_A  VAL *  57 ALA *  60
B 1_B  MET *  16 PRO *  21
B 10_B PHE * 502 LEU * 505
B 11_B MET * 510 GLN * 514
B 2_B  HIS *  26 PRO *  34
B 3_B  TYR *  96 PRO * 102
B 4_B  VAL * 142 SER * 147
B 5_B  THR * 109 TYR * 116
B 6_B  THR * 193 GLU * 199
B 7_B  ARG * 220 SER * 226
B 8_B  GLN * 318 ASN * 324
B 9_B  GLY * 417 PHE * 423
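The extraction of strand ranges and strand-to-strand order from SHEET records can be sketched as follows. This is an illustration under stated assumptions rather than pdb2cif's implementation: the helper names are hypothetical, the column positions follow the standard SHEET record layout (the sheet identifier and strand count are adjacent fields, which is why "B11" must be sliced rather than split), range identifiers imitate the "1_A" style seen in the converted output, and no attempt is made to merge bifurcated sheets by matching strands across sheets.

```python
def make_sheet_record(strand, sheet_id, n_strands, beg_res, beg_seq, end_res, end_seq, sense):
    """Build columns 1-40 of a SHEET record with blank chain IDs (demo helper)."""
    return "SHEET  {:>3} {:>3}{:>2} {:>3}  {:>4}  {:>3}  {:>4} {:>2}".format(
        strand, sheet_id, n_strands, beg_res, beg_seq, end_res, end_seq, sense)


def parse_sheet_record(line):
    """Slice the strand description out of a SHEET record by column."""
    return {
        "strand": int(line[7:10]),          # columns 8-10
        "sheet_id": line[11:14].strip(),    # columns 12-14
        "beg_res": line[17:20].strip(),     # columns 18-20
        "beg_seq": int(line[22:26]),        # columns 23-26
        "end_res": line[28:31].strip(),     # columns 29-31
        "end_seq": int(line[33:37]),        # columns 34-37
        "sense": int(line[38:40]),          # columns 39-40: 0 first, 1 parallel, -1 anti-parallel
    }


def sheets_to_tables(lines):
    """Return (range rows, order rows) in the spirit of STRUCT_SHEET_RANGE/ORDER."""
    ranges = {}  # keyed on sheet and residue span, so repeated strands collapse
    order = []
    for line in lines:
        s = parse_sheet_record(line)
        range_id = "{}_{}".format(s["strand"], s["sheet_id"])
        key = (s["sheet_id"], s["beg_res"], s["beg_seq"], s["end_res"], s["end_seq"])
        ranges.setdefault(key, (s["sheet_id"], range_id, s["beg_res"], s["beg_seq"],
                                s["end_res"], s["end_seq"]))
        if s["sense"] != 0:  # sense is relative to the previous strand in the sheet
            previous_id = "{}_{}".format(s["strand"] - 1, s["sheet_id"])
            sense = "parallel" if s["sense"] == 1 else "anti-parallel"
            order.append((s["sheet_id"], previous_id, range_id, sense))
    return list(ranges.values()), order


# Sheet A of 2ACE, as listed in the SHEET records above.
sheet_a = [
    make_sheet_record(1, "A", 3, "LEU", 6, "THR", 10, 0),
    make_sheet_record(2, "A", 3, "GLY", 13, "MET", 16, -1),
    make_sheet_record(3, "A", 3, "VAL", 57, "ALA", 60, 1),
]
range_rows, order_rows = sheets_to_tables(sheet_a)
for row in range_rows + order_rows:
    print(*row)
```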
pdb2cif [BBB98] is a program which converts PDB entries into mmCIF datasets. Most,
but not all, common PDB record types are converted. The program cannot resolve some
of the ambiguities involved in the conversion. The program has gone through extensive
changes since 1993 as both mmCIF and the PDB format have evolved. The program, which
was initially written as an awk script, is now available as an m4 macro document
which produces either perl or awk versions. The perl version is recommended.
The pdb2cif.m4 document contains approximately 6500 lines of text, which generates either a similarly sized awk script or over 10,000 lines of perl code (due to in-lining of certain critical functions). On modern processors with sufficient memory (32 to 64 MB available RAM), the conversion takes from several seconds to a few minutes (e.g. for large NMR entries) depending on the size of the PDB entry. The mmCIF data sets produced are approximately the same size as the original PDB entries. Here are the statistics for some conversions done on an SGI R8000 Indigo:
             Size in Characters (* 1000)    Conversion
PDB Entry    PDB          mmCIF             Time (secs.)
4INS           117          130              2.7
1CTJ           170          179              2.7
2ACE           393          433              7.3
4HIR         1,753        1,896             28.8
The time is approximately linear in the file size, dominated by the processing time for the atom list. The times given are real times and approximate the processor time on larger machines, but for large NMR entries processed on small machines, the real time can become very large due to extensive page swapping for the arrays used to hold the atom list.
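As a quick check of the two claims about size and time, the figures in the table can be reduced to a size ratio and a throughput per entry. The snippet below only re-expresses the numbers quoted above; the variable names are illustrative.

```python
# (entry, PDB size, mmCIF size, conversion time) with sizes in thousands of characters.
conversions = [
    ("4INS", 117, 130, 2.7),
    ("1CTJ", 170, 179, 2.7),
    ("2ACE", 393, 433, 7.3),
    ("4HIR", 1753, 1896, 28.8),
]

ratios = []
rates = []
for entry, pdb_size, cif_size, seconds in conversions:
    ratio = cif_size / pdb_size   # growth of the file on conversion
    rate = pdb_size / seconds     # thousands of input characters per second
    ratios.append(ratio)
    rates.append(rate)
    print("{}  mmCIF/PDB = {:.2f}  rate = {:.0f}k chars/sec".format(entry, ratio, rate))
```

The mmCIF files come out between roughly 5% and 11% larger than the PDB entries, and the throughput stays within roughly 43 to 63 thousand characters per second across a fifteen-fold range of file sizes, which is consistent with approximately linear behaviour.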
The program produces summary warnings as comments at the end of each output CIF. Unconverted records are captured in the AUDIT category warnings and converted records should be examined carefully, especially for the following record types:
COMPND, SOURCE, TITLE and CAVEAT are merged into _struct.title without further parsing.
A great deal of information could be derived from the entries which follow the PDB
1995 format description when sufficient information for mapping of MOL_ID to entities is available.
One of the most challenging parts of the conversion done by pdb2cif is the identification
of chemical entities. pdb2cif does this by scanning SEQRES and ATOM list information
for sequence homologies. Doubtful cases are reported by warning comments in the mmCIF output.
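The idea of deriving entities from chains can be shown with a deliberately simplified stand-in: chains whose residue sequences are identical are assigned to the same entity. This is an assumption-laden sketch, not the pdb2cif algorithm. The function name is hypothetical, the sequences are short made-up fragments, and exact matching is used where a real conversion would need a homology search that tolerates differences between SEQRES and the ATOM list.

```python
def assign_entities(chain_sequences):
    """Map each chain ID to an entity number; identical sequences share an entity."""
    entity_of_sequence = {}
    chain_to_entity = {}
    for chain_id, residues in chain_sequences.items():
        key = tuple(residues)
        if key not in entity_of_sequence:
            # First time this sequence is seen: open a new entity.
            entity_of_sequence[key] = len(entity_of_sequence) + 1
        chain_to_entity[chain_id] = entity_of_sequence[key]
    return chain_to_entity


# Made-up four-chain example: two copies each of two different molecules.
chains = {
    "A": ["GLY", "ILE", "VAL", "GLU"],
    "B": ["PHE", "VAL", "ASN"],
    "C": ["GLY", "ILE", "VAL", "GLU"],
    "D": ["PHE", "VAL", "ASN"],
}
entities = assign_entities(chains)
print(entities)
```

Four chains collapse to two entities here. Cases where sequences nearly match are the doubtful ones that a converter would have to flag for human review.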
pdb2cif is used to produce mmCIF output from a
Browser [APMS96] available on the PDB home page since August 1996 at:
http://www.pdb.bnl.gov
The program accepts all current PDB record types. Here is the DBREF information from the PDB entry 1CTJ [S95].
DBREF 1CTJ 1 89 SWS Q09099 CYC6_MONBR 1 89
loop_
_struct_ref.id
_struct_ref.entity_id
_struct_ref.biol_id
_struct_ref.db_name
_struct_ref.db_code
_struct_ref.seq_align
_struct_ref.seq_dif
_struct_ref.details
1 1 * SWS 'Q09099 CYC6_MONBR' partial no .

loop_
_struct_ref_seq.align_id
_struct_ref_seq.ref_id
_struct_ref_seq.seq_align_beg
_struct_ref_seq.seq_align_end
_struct_ref_seq.db_align_beg
_struct_ref_seq.db_align_end
_struct_ref_seq.details
1 1 '1' '89' '1' '89' .
1CTJ has anisotropic U's:
ATOM      1  N  AGLU     1       4.127  26.179  -7.903  0.49 57.53           N
ANISOU    1  N  AGLU     1     9336   7394   4591      4   2737   2771       N
ATOM      2  N  BGLU     1       3.535  25.488 -12.889  0.51 54.52           N
ANISOU    2  N  BGLU     1     8406   5015   6783   -887   3093    161       N
ATOM      3  CA AGLU     1       5.490  26.607  -8.207  0.49 52.50           C
ANISOU    3  CA AGLU     1     9283   5563   4611   -256   2331   1241       C
ATOM      4  CA BGLU     1       2.754  26.395 -12.051  0.51 51.27           C
ANISOU    4  CA BGLU     1     7663   5124   6212   -653   2258    184       C
ATOM      5  C  AGLU     1       5.550  27.734  -9.233  0.49 47.55           C
ANISOU    5  C  AGLU     1     8593   4752   4275   -880   1820    625       C
loop_
_atom_site.label_seq_id
_atom_site.auth_asym_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
_atom_site.aniso_U[1][1]
_atom_site.aniso_U[1][2]
_atom_site.aniso_U[1][3]
_atom_site.aniso_U[2][2]
_atom_site.aniso_U[2][3]
_atom_site.aniso_U[3][3]
1 ' ' ATOM N N  GLU * 1 A 4.127 26.179  -7.903 0.49 57.53 . 1 1 0.9336  0.0004 0.2737 0.7394 0.2771 0.4591
1 ' ' ATOM N N  GLU * 1 B 3.535 25.488 -12.889 0.51 54.52 . 1 2 0.8406 -0.0887 0.3093 0.5015 0.0161 0.6783
1 ' ' ATOM C CA GLU * 1 A 5.490 26.607  -8.207 0.49 52.50 . 1 3 0.9283 -0.0256 0.2331 0.5563 0.1241 0.4611
1 ' ' ATOM C CA GLU * 1 B 2.754 26.395 -12.051 0.51 51.27 . 1 4 0.7663 -0.0653 0.2258 0.5124 0.0184 0.6212
1 ' ' ATOM C C  GLU * 1 A 5.550 27.734  -9.233 0.49 47.55 . 1 5 0.8593 -0.088  0.182  0.4752 0.0625 0.4275
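Two things happen to the anisotropic terms in this conversion: the integers of the ANISOU record are scaled by 10^-4, and they are reordered, because the PDB record stores U11, U22, U33, U12, U13, U23 while the mmCIF loop above lists U11, U12, U13, U22, U23, U33. A minimal sketch of that step follows; the function name is hypothetical and the input is the first ANISOU record of 1CTJ quoted above.

```python
# Order of the six integer fields in a PDB ANISOU record.
PDB_ANISOU_ORDER = ["U11", "U22", "U33", "U12", "U13", "U23"]

# Order of the tags used in the mmCIF loop shown in the text.
MMCIF_ANISO_TAGS = [
    ("U11", "_atom_site.aniso_U[1][1]"),
    ("U12", "_atom_site.aniso_U[1][2]"),
    ("U13", "_atom_site.aniso_U[1][3]"),
    ("U22", "_atom_site.aniso_U[2][2]"),
    ("U23", "_atom_site.aniso_U[2][3]"),
    ("U33", "_atom_site.aniso_U[3][3]"),
]


def anisou_to_mmcif(pdb_values):
    """Scale six ANISOU integers by 1/10000 and return them keyed by mmCIF tag."""
    if len(pdb_values) != 6:
        raise ValueError("an ANISOU record carries exactly six U terms")
    by_name = dict(zip(PDB_ANISOU_ORDER, pdb_values))
    return {tag: by_name[name] / 10000 for name, tag in MMCIF_ANISO_TAGS}


# ANISOU record for atom 1 of 1CTJ: 9336 7394 4591 4 2737 2771
u_terms = anisou_to_mmcif([9336, 7394, 4591, 4, 2737, 2771])
for tag, value in u_terms.items():
    print("{}  {:.4f}".format(tag, value))
```

The printed values match the first row of the converted loop (0.9336, 0.0004, 0.2737, 0.7394, 0.2771, 0.4591), apart from the fixed four-decimal formatting chosen here.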
As we gain more experience with the new PDB format and with mmCIF we hope to extend
the mapping of record types into the internal fields of COMPND and SOURCE and of
the newer, more structured remarks. Ultimately we hope to be able to do conversions
from PDB format to mmCIF in sufficient detail to extract all information for which mmCIF
tokens exist and for which information was provided in an entry, while preserving
the names and relationships which existed in the PDB entry, so that all records of
the original entry can be reconstructed from the new mmCIF data set.
There are many useful sites on the World Wide Web where information, tools and software related to CIF, mmCIF and the PDB can be found. The following are good starting points for exploration:
The International Union of Crystallography (IUCr) provides access to software, dictionaries, policy statements and documentation relating to CIF and mmCIF at:
with mirror sites at:
The Nucleic Acid Database Project provides access to its entries, software and documentation, with an mmCIF page giving access to the dictionary and mmCIF software tools at:
with mirror sites at:
The Protein Data Bank provides access to entries, software and documentation with a browser, and an on-line PDB format description at:
with mirror sites at many locations (see http://www.pdb.bnl.gov/pdb-docs/mirror_sites.html).
Tutorials on mmCIF and the relationship to PDB format can be found at: http://www.sdsc.edu/pb/cif/tutorials.html
Here are direct links to copies of the IUCr CIF home page, the NDB's mmCIF home page, pdb2cif, cif2pdb and CIFtbx (with Cyclops and cif2cif).
United States
NDB, Rutgers, NJ | mmCIF | pdb2cif | cif2pdb | CIFtbx...
SDSC, San Diego, CA | CIF | mmCIF | pdb2cif | cif2pdb | CIFtbx...
United Kingdom
IUCr, Chester | CIF | pdb2cif | cif2pdb | CIFtbx...
EBI, Hinxton | mmCIF | pdb2cif | cif2pdb | CIFtbx...
France
U. P. et M. Curie, Paris | CIF | pdb2cif | cif2pdb | CIFtbx...
Sweden
U. of Stockholm | CIF | pdb2cif | cif2pdb | CIFtbx...
South Africa
U. of the Witwatersrand | CIF | pdb2cif | cif2pdb | CIFtbx...
Japan
NIBH, Ibaraki | mmCIF | pdb2cif | cif2pdb | CIFtbx...
Australia
UWA, Nedlands | STAR/CIF | pdb2cif | cif2pdb | CIFtbx...
Herbert J. Bernstein (yaya@bernstein-plus-sons.com)