BioSamples Metadata Model

The BioSamples repository stores and displays metadata about samples to enable their discovery and re-analysis. Each sample receives a unique sample accession, which can be referenced in other archives such as ENA, EGA, and PRIDE, increasing findability and interoperability through cross-referencing.

BioSamples requires only a few mandatory fields for each sample:

sample name
release date (publication date for the sample)
organism (must be in NCBI Taxonomy)

Partners should submit rich metadata where possible as this will enable discovery and reuse of registered samples. Submitters may add as many custom metadata attributes as desired, which will be indexed and searchable in BioSamples.
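
As an illustration of the mandatory fields and custom attributes described above, the following is a minimal sketch in Python. The function name `build_sample`, the dictionary keys (`name`, `release`, `attributes`), and the date format are assumptions made for this example and are not taken from the BioSamples API; the sketch also does not check that the organism exists in NCBI Taxonomy.

```python
from datetime import datetime, timezone

# The three mandatory fields named in this section (key names are hypothetical).
MANDATORY_FIELDS = ("name", "release", "organism")


def build_sample(name, release, organism, custom_attributes=None):
    """Assemble a sample record as a plain dict (hypothetical layout).

    Raises ValueError if any mandatory field is empty or None.
    """
    values = {"name": name, "release": release, "organism": organism}
    missing = [field for field in MANDATORY_FIELDS if not values[field]]
    if missing:
        raise ValueError(f"missing mandatory field(s): {', '.join(missing)}")

    # Custom attributes are free-form; they are simply merged in here.
    attributes = {"organism": organism}
    attributes.update(custom_attributes or {})

    return {
        "name": name,
        # Store the release date as an ISO 8601 string in UTC (an assumption).
        "release": release.astimezone(timezone.utc).isoformat(),
        "attributes": attributes,
    }


sample = build_sample(
    "example-sample-1",
    datetime(2024, 1, 15, tzinfo=timezone.utc),
    "Homo sapiens",
    custom_attributes={"tissue": "liver", "collection method": "biopsy"},
)
```

Checking the mandatory fields before submission gives earlier feedback than waiting for the repository to reject the record.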

Sample Checklists

To increase standardisation and ensure that each sample is registered with at least a minimum amount of metadata, ENA provides Genomic Standards Consortium (GSC) Sample Checklists. Each checklist provides a minimum set of mandatory attributes that are required for a particular environment in an ENA submission. Recommended and optional attributes are also available. It is possible to update your samples with the appropriate metadata later. If you cannot provide a value for a mandatory field, please see Reporting Missing Values for the appropriate values.

Note

Registering a BioSample with an ENA checklist is a requirement for submitting data related to this sample to ENA.

These checklists are developed in collaboration with different research communities to ensure that they are relevant and realistic for their context. When registering a sample, it is important to choose the most relevant sample checklist available and to provide as much metadata as possible.

Checklists are maintained in collaboration with the ENA team and are available in the JSON Schema Store. Submissions are automatically validated against their selected checklist via bioValidator at the time of submission or curation. This ensures that key fields are present and consistent.

Sample Relationships in BioSamples

Sample relationships describe how two BioSamples are related. A relationship can be a submission, technical, or biological relationship. Relationships link different samples together and support relationship-based graph searches. A sample relationship is submitted to BioSamples by providing the source, type, and target. Below is an example of sample relationships in BioSamples.

Please note that the direction of relationships should always start from the source to the target. For example, if adding a sample relationship to a sample with accession SAME123456, the ‘source’ should always be SAME123456.

"relationships" : [ {
    "source" : "SAMEA1111111",
    "type" : "derived from",
    "target" : "SAMEA2222222"
  }, {
    "source" : "SAMEG00000",
    "type" : "has member",
    "target" : "SAMEA1111111"
} ]

When the submitter provides relationship information in one sample, the reverse relationships in the corresponding samples are generated automatically. BioSamples does not validate the type, direction, or logic of the relationships. BioSamples currently supports four types of sample relationships:

Sample Relationships
Relationship types	Reverse relationships	Description
`derived from`	`derived from (reverse)`	Sample A is derived from Sample B. E.g. tissue samples derived from donor samples; cell line samples derived from tissue samples; microbial samples derived from environmental samples.
`same as`	`same as`	Sample A is the same as Sample B. This can be used to link duplicated samples.
`child of`	`child of (reverse)`	Sample A is the child of Sample B. E.g. Patient A is the child of Patient B.
`negative control of`	`negative control of (reverse)`	Sample A is the negative control of Sample B. e.g.
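
Because BioSamples does not validate the type or direction of relationships, a submitter may want to check them before submission. The sketch below is one way to do that in Python; the helper names (`make_relationship`, `reverse_label`, `misdirected`) are hypothetical, and the lookup covers only the four types listed in the table, so a type such as `has member` from the example above yields `None`.

```python
# Reverse labels as listed in the table above.
REVERSE_LABELS = {
    "derived from": "derived from (reverse)",
    "same as": "same as",
    "child of": "child of (reverse)",
    "negative control of": "negative control of (reverse)",
}


def make_relationship(sample_accession, rel_type, target):
    """Build a relationship whose source is the sample being edited."""
    return {"source": sample_accession, "type": rel_type, "target": target}


def reverse_label(rel_type):
    """Return the reverse label from the table, or None if the type is not listed."""
    return REVERSE_LABELS.get(rel_type)


def misdirected(sample_accession, relationships):
    """Return the relationships whose source is not the given sample."""
    return [rel for rel in relationships if rel["source"] != sample_accession]


relationships = [
    make_relationship("SAMEA1111111", "derived from", "SAMEA2222222"),
    {"source": "SAMEG00000", "type": "has member", "target": "SAMEA1111111"},
]
# Relationships that do not start from SAMEA1111111.
wrong_direction = misdirected("SAMEA1111111", relationships)
```

Here `wrong_direction` contains only the `has member` entry, since its source is a different accession from the sample being checked.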

Sample Dates

BioSamples keeps records of different dates related to the sample lifecycle. The dates can be generated either by the data repositories or by the data submitters, for data exchange or experimental purposes.

Sample date fields
Date type	Description
`Submitted on`	The earliest date at which valid metadata has been provided by the submitter. This attribute is generated by BioSamples and other INSDC partners.
`Released on`	The user-supplied date at which the sample metadata is made publicly available for the first time.
`Last reviewed`	The date at which a new curation object has been created or automatic curation pipelines have been run on the sample metadata. This field is only present if at least one curation object has been added by the curation pipelines. The “last reviewed” date is updated when the curation objects are reviewed, even if they are found to be still valid and unmodified, and indicates that the sample is compliant with the latest BioSamples curation rules. See Submit curation object. This attribute is generated by BioSamples.
`INSDC first public` and `INSDC last update`	You might see additional dates or timestamps in the sample’s `attributes` section, such as INSDC first public and INSDC last update. These are generated by other data repositories and appear due to data exchange with other archives participating in the International Nucleotide Sequence Database Collaboration (INSDC).

Reporting Missing Values

The International Nucleotide Sequence Database Collaboration (INSDC) has a standardised missing/null value reporting language to be used where a value of an expected format for sample metadata reporting cannot be provided.

The controlled vocabulary takes into account different types of constraints. Submitters are strongly encouraged to always provide true values. However, if reporting a missing/null value is required, submitters are asked to use the term with the finest granularity for their situation. See the table below for accepted missing value reporting terms.

Recommended terms for reporting missing values
Value	Definition
`not collected`	Information was not given because it has not been collected, and will always be missing.
`not provided`	Information may have been collected but was not provided with the submission. It may be added later.
`restricted access`	Information exists but cannot be released openly because of privacy or confidentiality concerns.

Important: Any other placeholder values (such as n/a, na, n.a, none, unknown, --, ., null, missing, not reported, not requested, not applicable, not specified, and not known) should not be used and must be removed from submissions. If included, these will be eliminated during automatic curation.
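
The terms and placeholders above can be checked before submission. The following Python sketch classifies attribute values against the two lists given in this section; the function names are hypothetical, and the case-insensitive, whitespace-trimmed comparison is an assumption about how placeholders should be matched, not documented BioSamples behaviour.

```python
# Accepted missing-value terms from the table above.
ACCEPTED_MISSING_TERMS = {"not collected", "not provided", "restricted access"}

# Placeholder values that this section says must not be used.
DISALLOWED_PLACEHOLDERS = {
    "n/a", "na", "n.a", "none", "unknown", "--", ".", "null", "missing",
    "not reported", "not requested", "not applicable", "not specified",
    "not known",
}


def classify_value(value):
    """Classify a value as 'missing-term', 'disallowed-placeholder', or 'value'."""
    normalised = value.strip().lower()  # matching rule is an assumption
    if normalised in ACCEPTED_MISSING_TERMS:
        return "missing-term"
    if normalised in DISALLOWED_PLACEHOLDERS:
        return "disallowed-placeholder"
    return "value"


def find_disallowed(attributes):
    """Return the sorted names of attributes holding a disallowed placeholder."""
    return sorted(
        name
        for name, value in attributes.items()
        if classify_value(value) == "disallowed-placeholder"
    )


attributes = {
    "organism": "Homo sapiens",
    "collection date": "not collected",
    "tissue": "N/A",
    "sex": "unknown",
}
to_fix = find_disallowed(attributes)
```

In this example `to_fix` lists `sex` and `tissue`, which a submitter would replace with a true value or with one of the three accepted terms.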