Draft Report of the Task Force on Target Tracking
Our committee recommends that Structural Genomics Laboratories adopt a policy of open exchange of target information. This will allow each laboratory to consider work in progress elsewhere as a criterion in target selection, and help to avoid unnecessary duplication of effort.
As the simplest possible mechanism for data exchange, our committee recommends that each Structural Genomics Laboratory maintain on its web site a list of protein targets currently under investigation. Targets should be described by their sequence and status of work, using a standard format finalized at the Airlie House meeting, as described below.
Our committee recognizes that a central repository of target information may be a more efficient means of data exchange. We recommend that Structural Genomics Laboratories explore this possibility using a trial repository, perhaps the one to be developed for the NIH-funded centers.
Background and Summary of Task Force Discussions:
The remit of this task force was to examine mechanisms for tracking targets for protein structure determination within the international structural genomics community. The motivation is to make fully public all information on what protein structures are being attempted and which are complete, so as to assist the experimental centers world-wide in target selection, and to provide up-to-date information to those analyzing the results. Note that this task force activity is separate from the NIH work on establishing tracking facilities for the NIH-funded P50 structural genomics centers.
We considered three possible scenarios:
1. One or more databases and associated web sites, providing information on targets that are being attempted in the context of more general information, such as the families to which the targets belong, the current knowledge of structure, and the models that are available. Several sites, for example Presage and genome3D, have already begun work in this area.
Because many groups would be interested in providing such a service, this option is very unattractive. How would the privileged group be chosen? How would it be funded? What would happen if it were not effective? These are all serious concerns. We feel that sharing the task amongst multiple sites is also undesirable, as that makes changes in format extremely difficult and introduces complexity without compensatory benefit.
2. A single 'bare-bones' repository, collecting raw information from all registered structural genomics laboratories and making it available in a unified way. This center would not provide additional information or services. The site manager would ensure compliance with the required format, information content, and update frequency standards. The attraction of this scenario is that no advantage would be given to the site owners. The site would also be cheap to run. It must, however, be extremely robust, provide archival-quality data, and support both those making submissions and those who wish to make use of the data, especially "value added" sites.
The task force considers this an attractive option, assuming that a suitable host for the site can be found.
3. A distributed information system in which registered sites make target information available in a standard form, accessible to anyone via FTP. The advantage of this system is that it is a cheap, peer-to-peer arrangement, involving no centralization. The disadvantage is that it would be difficult to ensure compliance with the required format, information content, and update frequency standards. In practice, what might happen is that "value added" sites would effectively take over the job of ensuring that standards are met.
The task force considers this scenario less desirable than number 2, in principle, but it can be undertaken immediately with little effort and may well prove an acceptable arrangement.
Data Exchange Recommendation:
At the Airlie meeting, option 3, an exchange between web sites, was agreed. A sample target entry in the agreed form is given below, together with a formal XML specification of the format. Aside from the format for an individual target entry, it was agreed that the following associated procedures shall be followed:
1) The targets file is a concatenation of target entries.
2) Targets files are to be updated weekly.
3) Each target entry represents a single protein, not a family.
4) Target entries are not deleted. Work stopped is a status code.
5) Each lab will communicate its targets FTP address to Steve Bryant.
6) Labs will prepare targets files and FTP sites by 1 June 2001.
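The weekly update implied by items 2 and 4 can be sketched in Python as follows. This is a minimal illustration and not part of the agreed format: the function name `update_targets`, the dict-based in-memory table keyed by lab-specified id, and the choice to replace earlier status terms when work stops are all assumptions made for the example.

```python
from datetime import date

STOPPED = "Work Stopped"  # the only status term named in this report


def update_targets(published, active, today):
    """Build this week's targets table from last week's table and current work.

    published: dict of target id -> entry dict from the last released file
    active:    dict of target id -> entry dict for targets still being worked on
    today:     datetime.date of the update
    """
    merged = {}
    for target_id, entry in published.items():
        if target_id in active:
            merged[target_id] = dict(active[target_id])
        elif STOPPED in entry["status"]:
            merged[target_id] = dict(entry)  # already stopped; keep unchanged
        else:
            # Item 4: never delete an entry; record that work has stopped.
            # Replacing the earlier status terms is an assumption of this sketch.
            merged[target_id] = dict(entry, status=[STOPPED],
                                     date=today.isoformat())
    for target_id, entry in active.items():
        merged.setdefault(target_id, dict(entry))  # newly selected targets
    return merged
```

A target that drops out of a lab's active list therefore stays in the published file with the status "Work Stopped" and a new date, rather than disappearing.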
The contents of the "sequence" item require further explanation. The intention is that the sequence identify the target with precision sufficient to allow other labs to avoid unintentional duplication. For example, when one domain of a large protein is under study, the sequence should correspond to that domain, not the whole protein. Minor sequence variations introduced in the course of sample preparation need not be indicated. For example, if a domain boundary is modified slightly or a His tag is added, the sequence need not be updated to reflect these changes.
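As an illustration of the domain rule, the Python sketch below reports the sequence of the domain under study rather than the whole protein, and shows a naive duplication check that another lab might run against published targets. The function names, the 1-based inclusive residue numbering, and the exact-substring comparison are assumptions for this example; in practice a lab would more likely compare sequences by similarity search.

```python
def domain_sequence(protein, first, last):
    """Return residues first..last (1-based, inclusive) of a protein sequence."""
    if not (1 <= first <= last <= len(protein)):
        raise ValueError("domain boundaries fall outside the protein")
    return protein[first - 1:last]


def normalize(seq):
    """Strip whitespace and upper-case a sequence of 1-letter codes."""
    return "".join(seq.split()).upper()


def overlaps_existing(candidate, published_sequences):
    """True if candidate contains, or is contained in, a published sequence."""
    cand = normalize(candidate)
    if not cand:
        return False
    for seq in published_sequences:
        other = normalize(seq)
        if other and (cand in other or other in cand):
            return True
    return False
```

Because minor changes such as a shifted boundary or an added tag are not reflected in the published sequence, an exact-match test of this kind can miss near-duplicates; it is shown only to make the intent of the "sequence" item concrete.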
Here is an example targets file with one target entry:
<!DOCTYPE target PUBLIC "-//Structure//target/EN" "target.dtd">
<status> Work Stopped</status>
<name>A target example</name>
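A targets file built as a concatenation of entries is not a single well-formed XML document, since it has several root elements, so a reader has to handle each entry separately. The Python sketch below writes and reads such a file using the standard library. It assumes that each entry is a `<target>` element without attributes and that each entry is preceded by the DOCTYPE line shown in the example; the function names and the dict representation are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE target PUBLIC "-//Structure//target/EN" "target.dtd">'


def format_entry(fields):
    """Serialize one target entry, emitting children in the specified order."""
    target = ET.Element("target")
    for tag in ("id", "lab", "date"):
        ET.SubElement(target, tag).text = fields[tag]
    for term in fields["status"]:            # one or more status terms
        ET.SubElement(target, "status").text = term
    ET.SubElement(target, "sequence").text = fields["sequence"]
    if "name" in fields:                      # optional
        ET.SubElement(target, "name").text = fields["name"]
    for tag in ("url", "remark"):             # optional, repeatable
        for value in fields.get(tag, []):
            ET.SubElement(target, tag).text = value
    return DOCTYPE + "\n" + ET.tostring(target, encoding="unicode") + "\n"


def write_targets_file(entries):
    """Item 1 of the procedures: the file is a concatenation of entries."""
    return "".join(format_entry(fields) for fields in entries)


def read_targets_file(text):
    """Return one Element per <target>...</target> block in the file."""
    blocks = re.findall(r"<target>.*?</target>", text, flags=re.DOTALL)
    return [ET.fromstring(block) for block in blocks]
```

Splitting on the `<target>` element rather than on the DOCTYPE line means the reader also accepts files in which the DOCTYPE appears only once at the top.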
Here is the formal XML specification for target entries:
<!ELEMENT target (id, lab, date, status+, sequence, name?, url*, remark*)>
<!-- required data items -->
<!-- any lab specified id -->
<!ELEMENT id (#PCDATA)>
<!-- name of the lab -->
<!ELEMENT lab (#PCDATA)>
<!-- most recent update. format: YYYY-MM-DD -->
<!ELEMENT date (#PCDATA)>
<!-- status. One or more of the following descriptive terms: -->
<!ELEMENT status (#PCDATA)>
<!-- protein sequence in IUPAC 1-letter codes -->
<!ELEMENT sequence (#PCDATA)>
<!-- optional data items -->
<!-- any lab-specified name for the protein -->
<!ELEMENT name (#PCDATA)>
<!-- any url of interest regarding the protein -->
<!ELEMENT url (#PCDATA)>
<!-- remarks in free text -->
<!ELEMENT remark (#PCDATA)>
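A site that collects targets files may wish to check entries against this specification. Python's standard `xml.etree.ElementTree` parser does not validate against a DTD, so the sketch below re-expresses the content model of the `target` element as a regular expression over the child element names. The function name `check_target`, the wording of the messages, and the loose letters-only test for the sequence are assumptions of this example, not part of the agreed format.

```python
import re
import xml.etree.ElementTree as ET

# Content model: id, lab, date, status+, sequence, name?, url*, remark*
CONTENT_MODEL = re.compile(
    r"id lab date (status )+sequence (name )?(url )*(remark )*")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # YYYY-MM-DD


def check_target(xml_text):
    """Return a list of problems found in one target entry (empty if none)."""
    try:
        target = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    if target.tag != "target":
        return [f"root element is <{target.tag}>, expected <target>"]

    problems = []
    # Child names in document order, each followed by a space.
    order = "".join(child.tag + " " for child in target)
    if not CONTENT_MODEL.fullmatch(order):
        problems.append("child elements do not match the content model")
    if not DATE.fullmatch(target.findtext("date", default="").strip()):
        problems.append("date is not in YYYY-MM-DD form")
    # Loose check only: 1-letter codes are letters; whitespace is ignored.
    sequence = "".join(target.findtext("sequence", default="").split())
    if not sequence.isalpha():
        problems.append("sequence is empty or contains non-letter characters")
    return problems
```

An entry that lacks required items such as `id` or `sequence` is reported as not matching the content model, which is the kind of compliance check the task force expected a repository or "value added" site to perform.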