INTERNATIONAL TASK FORCE ON DEPOSITION, ARCHIVING,
AND CURATION OF THE PRIMARY INFORMATION
Airlie House Meeting
April 4-6, 2001

Introduction

The structural genomics initiative will require the collection and curation of large amounts of experimental and structural data. Each project will collect information about all the experiments that lead to successful (and unsuccessful) structure determinations. Once a structure is determined, the results will be deposited into the Protein Data Bank.

Primary Objective

It is critical that the information that is collected by these projects is available and usable by humans and computers, and can be archived in the PDB in a high throughput mode. This implies that items of data common to all projects are named consistently and represented in a format that can be stored electronically without loss of information. The task of this committee then becomes one of considering what data the PDB needs to collect, and how to optimize data exchange among the various projects and the PDB.

The primary objective requires the endorsement and cooperation of all structural genomics projects. It will further require that each project implements its own well-thought-out LIMS (Laboratory Information Management System), and takes advantage of the experience gained by the major genome sequencing centers in managing ambitious high throughput projects.

Summary of Committee Activities

The committee considered four questions:

1. What additional data will be collected and archived for the structural genomics projects? These may include all aspects of the experiments including cloning, purification, crystallization, data collection, structure determination, structure refinement, analysis, and function.

2. In addition to what is already included in the current PDB, what additional information should be included in the PDB archive? For example, should there be more detail about data collection and structure determination?

3. Should there be some sort of tag that indicates that a structure has been determined as part of a structural genomics project?

4. Which software should be extended such that the output can be automatically archived?

There is no final consensus yet about the answers to questions 1 and 3. Item 4 is still being considered. A consensus view was developed concerning question 2, namely that the PDB should collect more information about the experiment than it currently does. At present the Protein Data Bank collects the following information:
  • Atomic coordinates
  • Structure factors
  • Journal citations
  • Names of macromolecule and ligands
  • Sequence of macromolecule
  • Crystallization information
  • Unit cell and space group
  • Source information
  • Data collection information
  • Refinement information

    At the meeting in Yokohama, it was suggested that the PDB collect all information that appears in the materials and methods section of a journal, such as the Journal of Molecular Biology. This committee vigorously endorses this suggestion.

    To accomplish this objective, it was first necessary to define all the terms for the data that appear in those methods sections. The PDB staff reviewed several articles, extracted the data items and made the correspondences with the data dictionary that underlies the PDB. Most aspects of the pipeline from crystallization to refinement were found in the dictionary. The data items were sent out for review to the Task Force (Appendix 1). Several suggestions were made for additions and modifications. These will be implemented by September 2001 after full review by the Task Force with input from other members of the community.

    The set of data items that describes protein production is not found in the current data dictionary. A provisional list was developed and sent to the Task Force for review. In addition, a Web site was established for review of these items and for the submission of possible new items (http://www-pdb.rutgers.edu:5005/). These data items involving protein production include information about the following (Appendix 2):
  • General source information
  • Production of the target gene
  • Cloning
  • Expression
  • Purification

    Proposed Procedures and Activities

    There will be a worldwide effort to collect vast amounts of information about proteins and their structures. In order to prevent the loss of important information and to ensure the maximum potential collaboration among projects, it is essential at the early stage of this effort to be certain that information can be exchanged within and between projects. The only way that this can happen is to come to an agreement as to which data items are mandatory and to define each of these items carefully. A definitive list of all data items must be finally established. This will require thorough review by members of the Task Force as well as by other members of the community. Once the items are agreed upon, it will be necessary to specify clear definitions for inclusion in a data dictionary. The plan for accomplishing this is given here.

  • Most of the data items representing the experiment from crystallization to structure refinement (Appendix 1) are found in the dictionary that underlies the PDB. The task force will attempt to reach agreement for a set of mandatory items by September 2001. Once that occurs, these data items will become a part of every structural genomics project's submission to the PDB.

  • The data items representing protein production (Appendix 2) will require more careful discussion and thought. Wherever possible, publicly vetted nomenclatures and controlled vocabularies should be used.

  • A review process has been established. Members of the structural genomics community will actively participate in proposing and reviewing these items with the goal of having a mandatory list established in one year's time.

  • The PDB will take responsibility for the technical implementation of the dictionary.

  • Once the mandatory data items are established, all structural genomics projects will deposit mandatory data to the PDB in a consistent format.
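
    Once a mandatory list exists, each project's LIMS could check a record before deposition. The following is a minimal sketch only: the item names come from Appendix 1, but the choice of which items are mandatory, the function name, and the dictionary-based record layout are all hypothetical, since the Task Force had not yet agreed on a list.

```python
# Hypothetical mandatory list; the real list was still under discussion.
MANDATORY_ITEMS = {
    "_exptl_crystal_grow.method",
    "_exptl_crystal_grow.temp",
    "_cell.length_a",
    "_symmetry.space_group_name_H-M",
}


def missing_mandatory_items(record, mandatory=MANDATORY_ITEMS):
    """Return the sorted mandatory item names that are absent or empty in record.

    `record` maps dictionary item names to string values.
    """
    missing = []
    for name in sorted(mandatory):
        value = record.get(name)
        # mmCIF writes "?" for an unknown value and "." for an inapplicable one;
        # this sketch treats both as "not supplied" (an assumption).
        if value is None or str(value).strip() in ("", "?", "."):
            missing.append(name)
    return missing


example_record = {
    "_cell.length_a": "52.1",
    "_exptl_crystal_grow.method": "?",
}
print(missing_mandatory_items(example_record))
```

    Reporting the missing names, rather than simply rejecting the record, lets a project see exactly what its LIMS still has to capture.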





    International Task Force on Deposition, Archiving, and Curation of the Primary Information



    Helen M. Berman, Chair
    Rutgers University

    Geoff Barton, Co-Chair
    EMBL-European Bioinformatics Institute

    Stephen Burley
    The Rockefeller University

    Aled Edwards
    University of Toronto

    Udo Heinemann
    Max-Delbruck Center for Molecular Medicine

    Haruki Nakamura
    Osaka University

    Osnat Herzberg
    University of Maryland Biotechnology Institute

    Andrzej Joachimiak
    Argonne National Lab

    Sung-Hou Kim
    University of California, Berkeley

    Guy (Gaetano) Montelione
    Rutgers University

    Dino Moras
    IGBMC

    John Rose
    University of Georgia

    Joel L. Sussman
    The Weizmann Institute of Science

    Thomas C Terwilliger
    Los Alamos National Lab

    Eldon Ulrich
    University of Wisconsin-Madison

    Bi-Cheng Wang
    University of Georgia

    Ian Wilson
    The Scripps Research Institute

    Shigeyuki Yokoyama
    RIKEN Genomic Sciences Centre





    Appendix 1: DATA ITEMS REPRESENTED IN PAPERS DESCRIBING COMPLETED STRUCTURES

    MACROMOLECULE NAME:
    Data Item         Dictionary Item Name
    Molecule name     _entity.pdbx_description holds the name corresponding to the
                      PDB compound name. Multiple systematic and common names can
                      be supplied in mmCIF categories entity_name_sys and
                      entity_name_com.
    Fragment          _entity_keywords.pdbx_fragment
    Mutations         _entity_keywords.pdbx_mutation
    E.C. number       _entity_keywords.pdbx_ec
    Notes: Macromolecule names are recorded in PDB COMPND records. All of the above items are included in the current format. The assignment of multiple common and systematic names is supported by mmCIF but not in the PDB format.

    CRYSTALLIZATION CONDITIONS AND UNIT CELL PARAMETERS:
    Data Item                               Dictionary Item Name
    Crystallization method                  _exptl_crystal_grow.method
    Apparatus                               _exptl_crystal_grow.apparatus
    Temperature                             _exptl_crystal_grow.temp
                                            _exptl_crystal_grow.temp_details
    pH                                      _exptl_crystal_grow.pH
                                            _exptl_crystal_grow.pdbx_pH_range
    Crystallization solution compositions   Tabulated in mmCIF category exptl_crystal_grow_comp
    Additional treatments (e.g. soaking)    _exptl_crystal.preparation
    Cell constants                          _cell.length_a
                                            _cell.length_b
                                            _cell.length_c
                                            _cell.angle_alpha
                                            _cell.angle_beta
                                            _cell.angle_gamma
    Space Group                             _symmetry.space_group_name_H-M
    Notes: Crystallization conditions are recorded as free text in PDB REMARK 280. Cell constants are recorded on the PDB CRYST1 records. mmCIF provides for description of multiple crystals and maintains the correspondences between each crystal and its associated diffraction data sets.
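
    The correspondence between crystals and diffraction data sets can be pictured as a simple key relationship. The sketch below assumes rows held as Python dictionaries, with each diffraction row carrying a crystal_id that points at a crystal row's id (mmCIF expresses this link through _diffrn.crystal_id and _exptl_crystal.id). The function name and the sample values are hypothetical.

```python
def datasets_by_crystal(crystal_rows, diffrn_rows):
    """Group diffraction data set ids under the crystal they were collected from.

    Raises ValueError if a data set refers to a crystal that was not described,
    which is the kind of broken correspondence a deposition check should catch.
    """
    grouped = {row["id"]: [] for row in crystal_rows}
    for row in diffrn_rows:
        crystal_id = row["crystal_id"]
        if crystal_id not in grouped:
            raise ValueError(f"data set {row['id']} refers to unknown crystal {crystal_id}")
        grouped[crystal_id].append(row["id"])
    return grouped


# Hypothetical example: two crystals, three data sets.
crystals = [
    {"id": "1", "preparation": "native"},
    {"id": "2", "preparation": "heavy-atom soak"},
]
data_sets = [
    {"id": "1", "crystal_id": "1"},
    {"id": "2", "crystal_id": "1"},
    {"id": "3", "crystal_id": "2"},
]
print(datasets_by_crystal(crystals, data_sets))
```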

    SOURCE INFORMATION:
    Data Item                             Dictionary Item Name
    Organism common name                  _entity_src_gen.gene_src_common_name
    Organism scientific name              _entity_src_gen.pdbx_gene_src_scientific_name
    Organ                                 _entity_src_gen.pdbx_gene_src_organ
    Gene                                  _entity_src_gen.pdbx_gene_src_gene
    Cellular location                     _entity_src_gen.pdbx_gene_src_cellular_location

    Expression system common name         _entity_src_gen.host_org_common_name
    Expression system scientific name     _entity_src_gen.pdbx_host_org_scientific_name
    Expression system cell line           _entity_src_gen.pdbx_host_org_cell_line
    Expression system strain              _entity_src_gen.pdbx_host_org_strain
    Expression system variant             _entity_src_gen.pdbx_host_org_variant
    Expression vector                     _entity_src_gen.pdbx_host_org_vector
    Expression plasmid                    _entity_src_gen.plasmid_name
    Expression system cellular location   _entity_src_gen.pdbx_host_org_cellular_location
    Expression system gene                _entity_src_gen.pdbx_host_org_gene
    Notes: Source information is recorded in PDB SOURCE records. All of the above source items are represented in the current format.

    DATA COLLECTION:
    Data Item                               Dictionary Item Name
    Data collection site                    _diffrn_source.pdbx_synchrotron_site
    Beamline                                _diffrn_source.pdbx_synchrotron_beamline
    Detector                                _diffrn_detector.detector
                                            _diffrn_detector.type
    Collection temperature                  _diffrn.ambient_temp
                                            _diffrn.ambient_temp_details
    Total unique reflections collected      _reflns.number_all
    Observed reflections (> Sigma cutoff)   _reflns.number_obs
    Criterion for "observed" reflections    _reflns.observed_criterion
    Wavelength(s) used (simplified)         _diffrn_radiation.pdbx_wavelength_list
    Wavelength(s) used (detailed)           _diffrn_radiation_wavelength.wavelength
    Resolution range                        _reflns.d_resolution_high
                                            _reflns.d_resolution_low
    Completeness (observed)                 _reflns.percent_possible_obs
    Completeness of high resolution shell   _reflns_shell.percent_possible_obs
    Redundancy overall                      _reflns.pdbx_redundancy
    Redundancy for high resolution shell    _reflns_shell.pdbx_redundancy
    R-Merge (overall observed)              _reflns.Rmerge_F_obs
                                            _reflns.pdbx_Rmerge_I_obs
    R-Merge (high resolution shell)         _reflns_shell.Rmerge_F_obs
                                            _reflns_shell.Rmerge_I_obs
    R-Symm                                  _reflns.pdbx_Rsym_value
                                            _reflns_shell.pdbx_Rsym_value
    <I> over <sigma I>                      _reflns.pdbx_netI_over_av_sigmaI
                                            _reflns_shell.meanI_over_sigI
    Data processing software                mmCIF category "software" provides for complete program description.
    Notes: Data collection details are recorded in PDB REMARK 200. The description above provides a summary of the collected data with respect to the solved structure. If the data are originally encoded in imgCIF/CBF, then much greater detail is available describing the diffraction data sets that contribute to the final merged data set.
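
    Two of the summary statistics listed above can be illustrated with a short calculation. This is a sketch under stated assumptions, not the procedure of any particular data processing program: it takes R-merge on intensities as the sum of |I_i - <I>| divided by the sum of I_i, restricted to reflections measured more than once, and redundancy as the number of observations per unique reflection. The function name and the input layout are hypothetical.

```python
def merge_statistics(observations):
    """Compute R-merge (on intensities) and redundancy.

    `observations` maps a reflection index (h, k, l) to the list of
    intensities measured for that reflection.
    """
    deviation_sum = 0.0
    intensity_sum = 0.0
    n_observations = 0
    for intensities in observations.values():
        n_observations += len(intensities)
        if len(intensities) < 2:
            # Assumption: singly measured reflections do not enter R-merge.
            continue
        mean_intensity = sum(intensities) / len(intensities)
        deviation_sum += sum(abs(i - mean_intensity) for i in intensities)
        intensity_sum += sum(intensities)
    return {
        "rmerge_i": deviation_sum / intensity_sum,
        "redundancy": n_observations / len(observations),
    }


# Hypothetical measurements for two unique reflections.
stats = merge_statistics({(1, 0, 0): [100.0, 110.0], (0, 1, 0): [50.0, 50.0, 50.0]})
print(stats)
```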

    STRUCTURE SOLUTION AND PHASING:
    Data Item                       Dictionary Item Name

    For each MAD data set:
    Wavelength                      _phasing_MAD_set.wavelength
    Resolution range                _phasing_MAD_set.d_res_high
                                    _phasing_MAD_set.d_res_low
    f'                              _phasing_MAD_set.f_prime
    f''                             _phasing_MAD_set.f_double_prime
    <FOM>                           _phasing_MAD_expt.mean_fom
    R-Cullis (acentric)
    R-Cullis (centric)
    R-Cullis (anomalous)
    Phasing power (acentric)
    Phasing power (centric)

    For each MIR data set:
    Resolution range                _phasing_MIR_der.d_res_high
                                    _phasing_MIR_der.d_res_low
    Number of sites                 _phasing_MIR_der.number_of_sites
    Power acentric                  _phasing_MIR_der.power_acentric
    Power centric                   _phasing_MIR_der.power_centric
    R-Cullis (acentric)             _phasing_MIR_der.R_cullis_acentric
    R-Cullis (centric)              _phasing_MIR_der.R_cullis_centric
    R-Cullis (anomalous)            _phasing_MIR_der.R_cullis_anomalous
    <FOM> (overall)                 _phasing_MIR.FOM
    <FOM> (high resolution shell)   _phasing_MIR_der_shell.fom

    Structure solution software     mmCIF category "software" provides for complete program description.
    Notes: The details of MAD and MIR experiments are not captured in the current PDB data file.

    REFINEMENT INFORMATION:
    Data Item                                     Dictionary Item Name
    Resolution range                              _refine.ls_d_res_low
                                                  _refine.ls_d_res_high
    Resolution range (highest res. shell)         _refine_ls_shell.d_res_low
                                                  _refine_ls_shell.d_res_high
    Number of reflections used in refinement      _refine.ls_number_reflns_obs
    Number of reflections in R-Free set           _refine.ls_number_reflns_R_free
    R-factor                                      _refine.ls_R_factor_R_work
                                                  _refine.ls_R_factor_R_free
    Number of atoms refined                       _refine_hist.number_atoms_total
                                                  _refine_hist.number_atoms_solvent
                                                  _refine_hist.pdbx_number_atoms_protein
                                                  _refine_hist.pdbx_number_atoms_nucleic_acid
                                                  _refine_hist.pdbx_number_atoms_ligand
    RMS Bond Distances                            _refine_ls_restr.type
                                                  _refine_ls_restr.dev_ideal_target
                                                  _refine_ls_restr.dev_ideal
    RMS Bond Angles
    RMS Chiral Volume
    RMS Planar Torsion Angles
    RMS Staggered Torsion Angles
    RMS Orthonormal Torsion Angles

    Isotropic temperature factor restraints       _refine_b_iso.class
                                                  _refine_b_iso.treatment
                                                  _refine_b_iso.value
    Non-crystallographic symmetry restraints      NCS related domains are described in mmCIF
                                                  categories struct_ncs_dom and struct_ncs_dom_lim.
                                                  The NCS operations relating the domain ensembles
                                                  are described in categories struct_ncs_ens,
                                                  struct_ncs_ens_gen, and struct_ncs_oper. NCS
                                                  restraints used in refinement are described in
                                                  category refine_ls_restr_ncs.
    Solvent model used                            _refine.solvent_model_details
                                                  _refine.solvent_model_param_bsol
                                                  _refine.solvent_model_param_ksol
    Starting model                                _refine.pdbx_starting_model
    Overall Average Isotropic B Factor            _refine.B_iso_mean
    Overall Anisotropic B Factor                  _refine.aniso_B[1][1]
                                                  _refine.aniso_B[1][2]
                                                  _refine.aniso_B[1][3]
                                                  _refine.aniso_B[2][2]
                                                  _refine.aniso_B[2][3]
                                                  _refine.aniso_B[3][3]
    Overall Isotropic B Factor                    Computed from _atom_site.B_iso_or_equiv
      + main chain atoms
      + side chain atoms
      + ligand atoms
      + solvent
    Refinement software                           mmCIF category "software" provides for complete program description.
    Stereochemical quality/Ramachandran analysis
      + number of residues in favored regions
      + number of residues in additionally allowed regions
      + number of residues in generously allowed regions
      + number of residues in disallowed regions

    Notes: Refinement details are recorded in PDB REMARK 3. All of the above refinement parameters, except the Ramachandran analysis, are included in the current PDB format file. Matrices describing NCS operations are recorded in PDB MTRIX records. There are many more data items associated with refinement defined in the mmCIF dictionary that could be easily captured
    (e.g. refinement statistics for each resolution shell).
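
    The per-class isotropic B factors in the table are described as computed from _atom_site.B_iso_or_equiv. A minimal sketch of such a calculation follows. The classification rule is an assumption made for illustration: water is identified by the component name HOH, other HETATM records count as ligand, and protein atoms named N, CA, C, or O count as main chain. The function names and the sample atoms are hypothetical.

```python
from collections import defaultdict

MAIN_CHAIN_ATOMS = {"N", "CA", "C", "O"}  # protein backbone atom names


def classify_atom(atom):
    """Assign an atom row to one of the four classes used in the table."""
    if atom["label_comp_id"] == "HOH":
        return "solvent"
    if atom["group_PDB"] == "HETATM":
        return "ligand"
    if atom["label_atom_id"] in MAIN_CHAIN_ATOMS:
        return "main chain"
    return "side chain"


def mean_b_by_class(atoms):
    """Return the mean B_iso_or_equiv per class, plus the overall mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for atom in atoms:
        atom_class = classify_atom(atom)
        sums[atom_class] += float(atom["B_iso_or_equiv"])
        counts[atom_class] += 1
    means = {name: sums[name] / counts[name] for name in sums}
    means["overall"] = sum(sums.values()) / sum(counts.values())
    return means


# Hypothetical atom_site rows, reduced to the items this sketch needs.
atoms = [
    {"group_PDB": "ATOM", "label_comp_id": "ALA", "label_atom_id": "N", "B_iso_or_equiv": "10.0"},
    {"group_PDB": "ATOM", "label_comp_id": "ALA", "label_atom_id": "CA", "B_iso_or_equiv": "20.0"},
    {"group_PDB": "ATOM", "label_comp_id": "ALA", "label_atom_id": "CB", "B_iso_or_equiv": "30.0"},
    {"group_PDB": "HETATM", "label_comp_id": "HOH", "label_atom_id": "O", "B_iso_or_equiv": "40.0"},
    {"group_PDB": "HETATM", "label_comp_id": "GOL", "label_atom_id": "C1", "B_iso_or_equiv": "50.0"},
]
print(mean_b_by_class(atoms))
```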








    Appendix 2: DATA ITEMS FOR PROTEIN PRODUCTION
    GENERAL SOURCE INFORMATION:
    Data Item                             Dictionary Item Name
    Organism common name                  _entity_src_gen.gene_src_common_name
    Organism scientific name              _entity_src_gen.pdbx_gene_src_scientific_name
    Organ                                 _entity_src_gen.pdbx_gene_src_organ
    Gene                                  _entity_src_gen.pdbx_gene_src_gene
    Cellular location                     _entity_src_gen.pdbx_gene_src_cellular_location

    Expression system common name         _entity_src_gen.host_org_common_name
    Expression system scientific name     _entity_src_gen.pdbx_host_org_scientific_name
    Expression system cell line           _entity_src_gen.pdbx_host_org_cell_line
    Expression system strain              _entity_src_gen.pdbx_host_org_strain
    Expression system variant             _entity_src_gen.pdbx_host_org_variant
    Expression vector                     _entity_src_gen.pdbx_host_org_vector
    Expression plasmid                    _entity_src_gen.plasmid_name
    Expression system cellular location   _entity_src_gen.pdbx_host_org_cellular_location
    Expression system gene                _entity_src_gen.pdbx_host_org_gene


    PRODUCTION OF THE TARGET GENE:
    Data Item                             Dictionary Item Name
    Source organism or original gene      _entity_src_gen.gene_src_common_name
                                          _entity_src_gen.pdbx_gene_src_scientific_name

    PCR step number                       _entity_src_gen_prod_pcr.step_id
    PCR gene source                       _entity_src_gen_prod_pcr.gene_source
    Forward PCR primer sequence (5')      _entity_src_gen_prod_pcr.forward_primer_sequence
    Reverse PCR primer sequence (3')      _entity_src_gen_prod_pcr.reverse_primer_sequence
    PCR reaction conditions               _entity_src_gen_prod_pcr.reaction_details
    PCR purification details              _entity_src_gen_prod_pcr.purification_details
    Overall production step number        _entity_src_gen_prod_pcr.prod_step_id

    Digestion step number                 _entity_src_gen_prod_digest.step_id
    First digestion restriction site      _entity_src_gen_prod_digest.restriction_site_1
    Second digestion restriction site     _entity_src_gen_prod_digest.restriction_site_2
    Purification of gene product          _entity_src_gen_prod_digest.purification_details
    Overall production step number [1]    _entity_src_gen_prod_digest.prod_step_id


    [1] Step number in the overall protein production process. This item is provided to
    allow the sequence of production operations to be recorded.
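
    To illustrate how the overall production step number ties the categories together, the sketch below merges rows from several production categories into one ordered list. The category and item names are those proposed in this appendix; the function name, the dictionary-based row layout, and the sample values are hypothetical.

```python
def production_sequence(categories):
    """Order production operations by their overall step number.

    `categories` maps a category name to its list of rows; each row is a
    dict that carries a prod_step_id value.
    Returns (overall step number, category name) pairs in process order.
    """
    steps = []
    for category, rows in categories.items():
        for row in rows:
            steps.append((int(row["prod_step_id"]), category))
    steps.sort()
    return steps


# Hypothetical records: PCR, then digestion, then bacterial cloning.
records = {
    "entity_src_gen_prod_digest": [{"step_id": "1", "prod_step_id": "2"}],
    "entity_src_gen_prod_pcr": [{"step_id": "1", "prod_step_id": "1"}],
    "entity_src_gen_bact_clone": [{"prod_step_id": "3"}],
}
for number, category in production_sequence(records):
    print(number, category)
```

    Each category keeps its own local step_id, so a single shared counter is what lets operations from different categories be interleaved in the order they were performed.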

    BACTERIAL CLONING:
    Data Item                             Dictionary Item Name
    Cloning vector                        _entity_src_gen.pdbx_host_org_vector
    Plasmid name                          _entity_src_gen.plasmid_name
    Enzyme(s) used to prepare vector      _entity_src_gen_bact_clone.cleavage_enzymes
    Vector purification details           _entity_src_gen_bact_clone.purification_details
    Enzymes used for ligation             _entity_src_gen_bact_clone.ligation_enzymes
    Ligation temperature                  _entity_src_gen_bact_clone.ligation_temperature
    Ligation time                         _entity_src_gen_bact_clone.ligation_time
    Transformation method                 _entity_src_gen_bact_clone.transformation_method
    Clone selection marker                _entity_src_gen_bact_clone.clone_marker
    Clone selection criteria              _entity_src_gen_bact_clone.clone_selection_criteria
    Overall production step number        _entity_src_gen_bact_clone.prod_step_id


    BACTERIAL EXPRESSION:
    Data Item                             Dictionary Item Name
    Promoter type                         _entity_src_gen_bact_express.promoter_type
    Gene insertion length                 _entity_src_gen_bact_express.gene_insert_length
    Gene mutations                        _entity_src_gen_bact_express.gene_mutations
    N-terminal sequence tags              _entity_src_gen_bact_express.N_terminal_seq_tag
    C-terminal sequence tags              _entity_src_gen_bact_express.C_terminal_seq_tag
    Culture base media                    _entity_src_gen_bact_express.culture_base_media
    Culture additives                     _entity_src_gen_bact_express.culture_additives
    Culture volume                        _entity_src_gen_bact_express.culture_volume
    Culture time                          _entity_src_gen_bact_express.culture_time
    Induction procedure                   _entity_src_gen_bact_express.induction_details
    Induction timepoint                   _entity_src_gen_bact_express.induction_timepoint
    Growth time after induction           _entity_src_gen_bact_express.induction_growth_time
    Protein location                      _entity_src_gen_bact_express.protein_location
    Harvesting protocol                   _entity_src_gen_bact_express.harvesting_details
    Storage conditions                    _entity_src_gen_bact_express.storage_details
    Overall production step number        _entity_src_gen_bact_express.prod_step_id


    PURIFICATION:
    Data Item                             Dictionary Item Name
    Assay methods                         _entity_src_gen_pure.assay_method_details
    Purification preparation scale        _entity_src_gen_pure.preparation_scale

    Lysis method                          _entity_src_gen_pure_lysis.method_details
    Lysis buffer composition              _entity_src_gen_pure_lysis.buffer
    Lysis buffer volume                   _entity_src_gen_pure_lysis.buffer_volume
    Lysis temperature                     _entity_src_gen_pure_lysis.temperature
    Lysis separation details              _entity_src_gen_pure_lysis.separation_details
    Overall production step number        _entity_src_gen_pure_lysis.prod_step_id

    Fractionation step number             _entity_src_gen_pure_fract.step_id
    Fractionation method                  _entity_src_gen_pure_fract.method_details
    Fractionation temperature             _entity_src_gen_pure_fract.temperature
    Fractionation separation details      _entity_src_gen_pure_fract.separation_details
    Protein location                      _entity_src_gen_pure_fract.protein_location
    Overall production step number        _entity_src_gen_pure_fract.prod_step_id

    Chromatographic step number           _entity_src_gen_pure_chrom.step_id
    Column type                           _entity_src_gen_pure_chrom.column_type
    Column volume                         _entity_src_gen_pure_chrom.column_volume
    Temperature                           _entity_src_gen_pure_chrom.column_temperature
    Equilibration buffer                  _entity_src_gen_pure_chrom.equilibration_buffer
    Elution protocol                      _entity_src_gen_pure_chrom.elution_protocol
    Sample preparation                    _entity_src_gen_pure_chrom.sample_prep_details
    Sample volume                         _entity_src_gen_pure_chrom.sample_volume
    Sample amount                         _entity_src_gen_pure_chrom.sample_amount
    Volume of pooled fractions            _entity_src_gen_pure_chrom.volume_pooled_fractions
    Yield of pooled fractions             _entity_src_gen_pure_chrom.yield_pooled_fractions
    Overall production step number        _entity_src_gen_pure_chrom.prod_step_id

    Concentration procedure               _entity_src_gen_pure.concentration_details
    Concentration device                  _entity_src_gen_pure.concentration_device
    Final storage buffer                  _entity_src_gen_pure.storage_buffer
    Final storage temperature             _entity_src_gen_pure.storage_temperature
    Final protein concentration           _entity_src_gen_pure.protein_concentration
    Protein conc. measurement method      _entity_src_gen_pure.protein_measurement_details