The cellranger-arc pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and a valid UMI and were assigned with high confidence to a gene. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, features, and barcode lists used for the analysis.
(root)
├─ barcode_idx
├─ barcode_info [HDF5 group]
│ ├─ genomes
│ └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features [HDF5 group]
│ ├─ _all_tag_keys
│ ├─ feature_type
│ ├─ genome
│ ├─ id
│ └─ name
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics_json
├─ umi
└─ umi_type
The following HDF5 datasets in the molecule information file correspond to columns of a table. Each row of that table corresponds to a unique molecule, specified by a (UMI, cell-barcode, feature) tuple. The feature in this tuple is the one best supported by the reads (including PCR duplicates) assigned to that unique pairing of UMI and 10x Barcode.
Column | Type | Description |
---|---|---|
barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the 10x Barcode sequence assigned to this putative molecule. |
count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
feature_idx | uint32 | A zero-based index into the features HDF5 group (see next section), indicating the feature to which this putative molecule was assigned. |
gem_group | uint16 | Integer label that distinguishes data derived from distinct 10x Genomics GEM reactions (such as different chips or chip channels). |
library_idx | uint16 | A zero-based index into the library_info array (see next section) that distinguishes data coming from distinct 10x Genomics libraries. For the Chromium Single Cell Multiome ATAC + Gene Expression assay, only one library can be associated with a single GEM well. |
umi | uint32 | 2-bit encoded (see note below) processed (i.e. corrected) UMI sequence. |
umi_type | uint32 | A boolean array specifying whether the molecule aligned to an exonic (1) or intronic (0) region of the associated feature. |
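
Because each row is one unique molecule, the number of rows sharing a (barcode_idx, feature_idx) pair gives the UMI count for that barcode and feature. The following is a minimal sketch of that idea, assuming the columns have already been read into plain Python lists with an HDF5 library of your choice and that all rows come from a single library. The function name `umi_counts` and the toy barcode and feature values are hypothetical and serve only as illustration.

```python
from collections import Counter


def umi_counts(barcode_idx, feature_idx, barcodes, feature_ids):
    """Count unique molecules (table rows) per (barcode, feature) pair.

    barcode_idx and feature_idx are the per-molecule index columns;
    barcodes and feature_ids are the lists those indices point into.
    """
    pair_counts = Counter(zip(barcode_idx, feature_idx))
    # Resolve the zero-based indices into readable values.
    return {
        (barcodes[b], feature_ids[f]): n
        for (b, f), n in pair_counts.items()
    }


# Toy values, made up for illustration (not real barcodes or gene IDs).
barcodes = ["AAAC", "CCGT"]
feature_ids = ["GENE_A", "GENE_B"]
barcode_idx = [0, 0, 1, 0]
feature_idx = [1, 1, 0, 0]

counts = umi_counts(barcode_idx, feature_idx, barcodes, feature_ids)
```

Note that the count column holds reads per molecule, so summing it instead of counting rows would yield read counts rather than UMI counts.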
The barcodes and library_info datasets provide information about the experiments contained in this analysis.
Dataset | Type | Description |
---|---|---|
barcodes | string | A list of all 10x Barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. |
library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. |
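
Since library_info is a JSON array and library_idx is a zero-based index into it, the per-library metadata can be looked up by position. The sketch below assumes the dataset has already been read as a single string (or bytes) value; the helper name `libraries_by_index` and the example field values are hypothetical, and only the three field names come from the table above.

```python
import json


def libraries_by_index(library_info_raw):
    """Parse the library_info JSON and key each library by its array position."""
    # HDF5 readers often return string data as bytes; decode if needed.
    if isinstance(library_info_raw, bytes):
        library_info_raw = library_info_raw.decode("utf-8")
    libraries = json.loads(library_info_raw)
    return dict(enumerate(libraries))


# Illustrative value only; a real file may carry additional fields.
raw = b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1}]'

libs = libraries_by_index(raw)
library_type_of_first = libs[0]["library_type"]
```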
The HDF5 group barcode_info provides information regarding the barcodes that were called as cells during the analysis. This HDF5 group contains two datasets.
Dataset | Type | Description |
---|---|---|
genomes | string | A list of all genome references used in this analysis. In most cases, this will be a single genome. |
pass_filter | uint64 | A matrix with three columns that contains one row per cell-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. |
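
One common use of pass_filter is to restrict the molecule table to barcodes that were called as cells. The sketch below assumes pass_filter has been read as a sequence of (barcode_idx, library_idx, genome_idx) rows and that barcode_idx is the per-molecule column as a list. It matches on barcode_idx only, which is a simplification that is reasonable for a single library and a single genome. The function name `cell_molecule_rows` is hypothetical.

```python
def cell_molecule_rows(barcode_idx, pass_filter):
    """Return the row numbers of molecules whose barcode was called as a cell."""
    # First element of each pass_filter row is the barcode index.
    cell_barcodes = {row[0] for row in pass_filter}
    return [i for i, b in enumerate(barcode_idx) if b in cell_barcodes]


# Toy values, made up for illustration.
pass_filter = [(0, 0, 0), (2, 0, 0)]
barcode_idx = [0, 1, 2, 2, 1]

rows = cell_molecule_rows(barcode_idx, pass_filter)
```

The returned row numbers can then be used to subset the other per-molecule columns, such as count, feature_idx, and umi.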
The HDF5 group features contains information regarding the feature reference used for the analysis. The datasets within the features group represent columns of a table containing one row per feature. Values in the feature_idx column described in the previous section provide indices into the rows of this table.
In addition to the columns described below, _all_tag_keys contains a list of built-in tags (genome).
Column | Type | Description |
---|---|---|
feature_type | string | The type of feature reference to which this feature belongs (Gene Expression). |
genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). |
id | string | The Ensembl gene ID corresponding to this feature. |
name | string | The common gene symbol associated with each of the above ids. |
The UMI sequences are 2-bit encoded as follows:
- Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
- The least significant byte (LSB) contains the 3'-most nucleotides.
Note that the cell-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes HDF5 dataset.
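
The 2-bit UMI encoding described above can be reversed with a few bit operations. The following is a minimal sketch, not the pipeline's own code. The function name `decode_umi` is hypothetical, and the UMI length is not stored in the encoded value, so it must be supplied by the caller based on the chemistry used.

```python
NUCLEOTIDES = "ACGT"  # 0="A", 1="C", 2="G", 3="T"


def decode_umi(encoded, length):
    """Decode a 2-bit encoded UMI integer into a nucleotide string.

    The lowest two bits hold the 3'-most nucleotide, so bases are read
    from the 3' end and the result is reversed to give 5'->3' order.
    """
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[encoded & 0b11])  # take the lowest bit pair
        encoded >>= 2                              # move to the next base
    return "".join(reversed(bases))


# 0b00_01_10_11 encodes A, C, G, T from 5' to 3'.
example = decode_umi(0b00011011, 4)  # "ACGT"
```

Leading "A" bases are encoded as zero bits, which is why the length cannot be inferred from the integer alone.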
The metrics_json dataset contains pipeline metrics in JSON format that are used internally by Cell Ranger. Users should view metrics using the Cell Ranger ARC metrics outputs.