The germplasm accession metadata from many genebanks is combined into regional data warehouses that give an overview of the germplasm conserved in a specific region. EURISCO combines the metadata on germplasm accessions conserved by genebanks in Europe. SINGER combines the data from the international CGIAR research centers. GRIN combines the data from the American USDA germplasm conservation institutes. AMERISCO is under development for the Americas. WIEWS collects and indexes summary metadata on the global germplasm genebank collections. No complete global index of accession unit level metadata on all the stored germplasm accessions worldwide exists. The FAO germplasm index is under further development. The GBIF catalogue of primary biodiversity specimen unit level data is under development. The methods to collect, combine and index germplasm accession level metadata are developing rapidly, and the establishment of online web service data providers (BioCASE, DiGIR) published from the individual genebanks makes the automatic harvesting of accession level metadata much simpler. A global index of all germplasm accessions conserved by genebanks worldwide is an obvious element of an Accession Clearing House. If regional or global portals produce metadata of sufficient quality and update frequency, this data could be used by the Accession Clearing House rather than exposing the original data provider to repeated data harvesting by different portals and for different purposes. Supplementary data harvesting would however most likely be needed, for example to add important descriptors not indexed by GBIF or FAO, or to add germplasm outside the geographic scope of regional portals like EURISCO, SINGER, USDA-GRIN or AMERISCO.
Avoiding adding the same accession metadata more than once to the global index will become a big challenge when the very same metadata can be included from different sources (the original data provider or various regional or global data portals).
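A minimal sketch of one way to guard against such duplicates, assuming each harvested record carries the holding institute code, the accession number and the kind of source it was harvested from. The field names, the `SOURCE_PRIORITY` ranking and the sample records are hypothetical and only serve as illustration.

```python
# Hypothetical ranking of sources: a lower number means a preferred source.
SOURCE_PRIORITY = {"provider": 0, "regional_portal": 1, "global_portal": 2}


def deduplicate(records):
    """Keep one record per (institute, accession number), preferring the
    copy harvested closest to the original data provider."""
    index = {}
    for rec in records:
        # Normalise case and surrounding whitespace before comparing.
        key = (
            rec["institute"].strip().upper(),
            rec["accession_number"].strip().upper(),
        )
        kept = index.get(key)
        if kept is None or SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[kept["source"]]:
            index[key] = rec
    return list(index.values())


records = [
    {"institute": "INST1", "accession_number": "H9879", "source": "regional_portal"},
    {"institute": "inst1", "accession_number": "H9879 ", "source": "provider"},
    {"institute": "INST2", "accession_number": "H9879", "source": "global_portal"},
]
unique = deduplicate(records)
print(len(unique))  # two distinct accessions remain
```

The same accession number held by two different institutes is kept as two entries, since the number alone is not assumed to be unique across genebanks.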
Interoperability of metadata from different sources
The potential confusion and duplication of germplasm accession level metadata could be avoided if the portal harvested and indexed data from the original data provider only. Technical considerations may make this impossible or inconvenient. Bandwidth is still a limited resource for many of the relevant institutes. The performance of the data provider's database server can become a limitation if many portals harvest all the provided data too often. Harvesting data directly from the original data provider could also be politically sensitive, as some countries could demand that the provided data be filtered through a national data inventory index. The uptime of the original data provider service can be a problem as well. Furthermore, additional useful accession level metadata, statements and suggested corrections can come from sources other than the original germplasm metadata provider. One example is the statement on "most original sample" being implemented at some of the European Central Crop Databases (ECCDBs). Other examples arise when germplasm accessions from different genebanks are part of a characterization study and the research institute publishes the results, or when a study on duplication of germplasm accessions across genebank collections is performed and published. IPGRI holds collecting event data (original collecting forms) on many accessions stored at genebanks worldwide, and other institutes may likewise hold more detailed passport data on accessions stored at different genebanks. Combining germplasm metadata on the very same accession from different sources around the world is a challenge not yet sufficiently solved.
Germplasm IDs and the global unique ID (GUID)
If all germplasm accession units were assigned a global unique ID (GUID), combining metadata from different sources as mentioned above would be significantly easier. The various genebanks worldwide have however different models for building a local unique ID for the stored accessions. For some institutes the accession number alone is not unique within the institute database, but only unique within the relevant crop or thematic sub collection (e.g. VIR). The accession numbers are often given as integer numbers or as a character prefix followed by an integer number (34, 00034, PI56, CIho454, HOR565, H9879, K6787, K-5676 etc). The accession number prefixes are not standardized, and more than one institute can often use the same prefix. This complicates the interoperability of accession level metadata from different sources. It is probably too optimistic to think that all holders of germplasm accession level metadata will modify their local database systems to include a GUID. One possible model is to build a central GUID resolving system that links a metadata record to the GUID and links different metadata records to the physically stored germplasm (seed) sample. The original holders of the metadata would of course be encouraged to implement the GUID in their local databases as well, but would not be required to do so for the interoperability to work. To maintain a stable resolving mechanism, some parts of the metadata must be stable and never change once defined. For germplasm accessions the holding genebank, the genus or species name and the accession number would mostly provide this unique stable key. If any of these values changes we would talk of a different germplasm accession! Other data classes with metadata units, like collecting events, taxonomic names and concepts, commercial cultivars, landraces, evaluation and characterization observations, institutes, persons, etc., could also be assigned a GUID.
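The central resolving idea could be sketched roughly as follows, assuming the stable key is the triple of holding genebank, genus and accession number. The class and function names, the namespace `accessions.example.org` and the use of name-based UUIDs (`uuid.uuid5`) as a stand-in for a real GUID technology are all assumptions made for illustration; they are not part of any existing system.

```python
import uuid

# Hypothetical namespace for the clearing house; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "accessions.example.org")


def stable_key(institute, genus, accession_number):
    """Normalise the three stable fields into one comparable key.

    Only case and whitespace are normalised; prefixes and hyphens are left
    alone because K6787 and K-6787 may be different accessions.
    """
    parts = (institute, genus, accession_number)
    return "|".join(" ".join(p.split()).upper() for p in parts)


class GuidResolver:
    """Maps stable keys to GUIDs and GUIDs to the metadata records seen."""

    def __init__(self):
        self._guid_by_key = {}
        self._records_by_guid = {}

    def register(self, institute, genus, accession_number, record):
        """Attach a metadata record to the GUID of its accession."""
        key = stable_key(institute, genus, accession_number)
        guid = self._guid_by_key.setdefault(key, str(uuid.uuid5(NAMESPACE, key)))
        self._records_by_guid.setdefault(guid, []).append(record)
        return guid

    def resolve(self, guid):
        """Return every metadata record linked to this GUID."""
        return self._records_by_guid.get(guid, [])


resolver = GuidResolver()
g1 = resolver.register("INST1", "Hordeum", "H9879", {"source": "provider"})
g2 = resolver.register("inst1", "hordeum", " H9879", {"source": "regional_portal"})
g3 = resolver.register("INST2", "Hordeum", "H9879", {"source": "provider"})
```

Here the two records for the same accession at `INST1` resolve to one GUID, while the identical accession number at `INST2` receives its own. A data provider that does not store GUIDs locally can still be matched through the stable key.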
For some GUID technologies there can be a number of parallel services to resolve the GUID. These GUID resolving services would be registered and discoverable through a central DNS (e.g. the LSID technology). Some of the GUID technologies have developed a web browser plug-in so that a GUID in a web page becomes clickable. Clicking the GUID directs the user, via a DNS lookup, to a resolving service that displays the GUID metadata (e.g. LSID). Relationships between GUIDs could be described with e.g. RDF technology. An RDF statement could be that one specific germplasm accession (GUID#1) originates from a specific germplasm origin unit (GUID#2).
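As an illustration of such a statement, the sketch below builds two LSID-style identifiers and serialises the relationship as a single RDF triple in N-Triples syntax. The authority `example.org`, the namespaces, the object IDs and the predicate URI are hypothetical placeholders; a real deployment would take them from its own registry and vocabulary.

```python
def lsid(authority, namespace, object_id):
    """Build an LSID-style URN: urn:lsid:<authority>:<namespace>:<object>."""
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


def n_triple(subject, predicate, obj):
    """Serialise one RDF statement, with all three terms given as IRIs,
    as a single N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical identifiers standing in for GUID#1 and GUID#2.
accession = lsid("example.org", "accession", "H9879")
origin = lsid("example.org", "origin", "1234")

# Hypothetical predicate; no published vocabulary is implied.
ORIGINATES_FROM = "http://example.org/terms/originatesFrom"

statement = n_triple(accession, ORIGINATES_FROM, origin)
print(statement)
# <urn:lsid:example.org:accession:H9879> <http://example.org/terms/originatesFrom> <urn:lsid:example.org:origin:1234> .
```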
Other relevant germplasm names and IDs
A global index to additional germplasm IDs besides the accession number would also be of value. The most important IDs are the collecting number, assigned by the collector during the collecting event, and the breeder's accession designation, assigned during the breeding/cultivation event. The collecting number is recorded and documented by most genebank information systems. The accession name is often used to describe the germplasm name used by the breeder. Some genebanks also record the commercial cultivar name or the landrace name where appropriate. Most genebanks have documentation on the donor or immediate source of the germplasm, that is, from whom the genebank received the germplasm sample. The donor institute and person as well as the donor accession designation are documented.
Data harvest issues
To harvest data efficiently from the network of data providers, a few modifications to the local database are recommended. First, the local system of the data provider should be strongly encouraged to provide GUIDs and a last-updated timestamp for each shared unit level record. Further, the last update for the complete dataset would be very useful. If the dataset last update is provided, the harvesting engine can compare this value to the time of the last harvesting event for this specific dataset. If the dataset has not been updated since the last harvesting and indexing, no new harvesting is needed. Further, if the unit last update is provided, the harvesting engine can ask for only the units updated since the last harvest event. To detect that a unit has been removed (if this should be allowed), the complete list of UIDs (local unique IDs) could be retrieved and compared to the UIDs for this dataset in the global index. If the data provider supplies the GUIDs, the lookup of the appropriate GUID in the central GUID resolving system would not be needed.
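The harvesting decisions described above can be sketched as a small planning function. This is an illustration under stated assumptions, not a description of BioCASE or DiGIR behaviour: the function name `plan_harvest`, its parameters and the sample UIDs and dates are hypothetical, and the provider is assumed to report a last-update timestamp per unit.

```python
from datetime import datetime


def plan_harvest(dataset_last_update, last_harvest, provider_units, indexed_uids):
    """Decide what to fetch and what to drop for one dataset.

    provider_units: dict mapping local UID -> unit last-update timestamp,
                    as reported by the data provider.
    indexed_uids:   set of local UIDs currently held in the global index.
    Returns (uids_to_fetch, uids_removed), both sorted lists.
    """
    # Dataset unchanged since the last harvest: no new harvesting is needed.
    if dataset_last_update <= last_harvest:
        return [], []

    # Fetch only units updated since the last harvest, plus units that are
    # new to the index.
    to_fetch = sorted(
        uid
        for uid, updated in provider_units.items()
        if updated > last_harvest or uid not in indexed_uids
    )

    # UIDs in the index that the provider no longer lists were removed.
    removed = sorted(indexed_uids - set(provider_units))
    return to_fetch, removed


last_harvest = datetime(2006, 1, 1)
provider_units = {
    "A1": datetime(2005, 6, 1),  # unchanged, already indexed
    "A2": datetime(2006, 2, 1),  # updated after the last harvest
    "A4": datetime(2005, 1, 1),  # old, but not yet in the index
}
indexed_uids = {"A1", "A3"}

to_fetch, removed = plan_harvest(datetime(2006, 3, 1), last_harvest, provider_units, indexed_uids)
print(to_fetch, removed)  # ['A2', 'A4'] ['A3']
```

Whether a removed unit should actually be deleted from the global index, or only flagged, is a policy question the sketch leaves open.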
Text: Dag Endresen, NGB/IPGRI.
Please feel free to reuse or modify the text above.