Molecule Info HDF5 (H5)

The cellranger-arc pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and valid UMI and were assigned with high confidence to a gene. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, features, and barcode lists used for the analysis.

(root)
├─ barcode_idx
├─ barcode_info [HDF5 group]
│  ├─ genomes
│  └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features [HDF5 group]
│  ├─ _all_tag_keys
│  ├─ feature_type
│  ├─ genome
│  ├─ id
│  └─ name
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics_json
├─ umi
└─ umi_type

The following HDF5 datasets in the molecule information file correspond to columns of a table. Each row of that table corresponds to a unique molecule, identified by a (UMI, cell-barcode, feature) tuple. The feature in this tuple is the one best supported by the reads (including PCR duplicates) assigned to that unique pairing of UMI and 10x Barcode.

| Column | Type | Description |
|---|---|---|
| barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the 10x Barcode sequence assigned to this putative molecule. |
| count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
| feature_idx | uint32 | A zero-based index into the features HDF5 group (see next section), indicating the feature to which this putative molecule was assigned. |
| gem_group | uint16 | Integer label that distinguishes data derived from distinct 10x Genomics GEM reactions (such as different chips or chip channels). |
| library_idx | uint16 | A zero-based index into the library_info array (see next section) that distinguishes data coming from distinct 10x Genomics libraries. For the Chromium Single Cell Multiome ATAC + Gene Expression assay, only one library can be associated with a single GEM well. |
| umi | uint32 | 2-bit encoded (see note below), processed (i.e., corrected) UMI sequence. |
| umi_type | uint32 | A boolean array specifying whether the molecule aligned to an exonic (1) or intronic (0) region of the associated feature. |
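
Because these datasets are parallel columns, row i of the table is obtained by taking element i of each column. The following is a minimal sketch of that idea, assuming the columns have already been loaded into memory as equal-length sequences (for example with h5py, which is an assumption here and not something the pipeline requires). The `Molecule` type and the `iter_molecules` helper are hypothetical names introduced only for illustration.

```python
from typing import Iterator, Mapping, NamedTuple, Sequence


class Molecule(NamedTuple):
    """One row of the per-molecule table (hypothetical helper type)."""
    barcode_idx: int
    feature_idx: int
    umi: int
    count: int
    gem_group: int
    library_idx: int
    umi_type: int


def iter_molecules(columns: Mapping[str, Sequence[int]]) -> Iterator[Molecule]:
    """Yield one Molecule per row from parallel per-molecule columns.

    `columns` maps each dataset name listed in the table above to a
    sequence of integers; all sequences must have the same length.
    """
    names = Molecule._fields
    lengths = {len(columns[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("per-molecule columns must all have the same length")
    for values in zip(*(columns[name] for name in names)):
        yield Molecule(*(int(v) for v in values))


# Toy data standing in for the real datasets (values are made up).
toy_columns = {
    "barcode_idx": [0, 0, 2],
    "feature_idx": [5, 7, 5],
    "umi": [27, 4, 27],
    "count": [3, 1, 8],
    "gem_group": [1, 1, 1],
    "library_idx": [0, 0, 0],
    "umi_type": [1, 0, 1],
}
for molecule in iter_molecules(toy_columns):
    print(molecule)
```

With a real file, the same mapping could be built by reading each dataset fully into memory first; iterating row by row in pure Python is slow for large files, so this form is mainly useful for inspection.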

The barcodes and library_info datasets provide information about the experiments contained in this analysis.

| Dataset | Type | Description |
|---|---|---|
| barcodes | string | A list of all 10x Barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. |
| library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will, at a minimum, contain the metadata library_id, library_type, and gem_group. |
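
Since library_info is a JSON array stored as a string, it has to be parsed before library_idx values can be resolved. A minimal sketch follows, assuming the raw dataset value has already been read as `bytes` or `str` (the exact read call depends on the HDF5 library used). The function name and the sample values are hypothetical; only the three keys named above come from this page.

```python
import json
from typing import Any, Dict, List, Union


def parse_library_info(raw: Union[bytes, str]) -> List[Dict[str, Any]]:
    """Parse the library_info JSON string into a list of per-library dicts."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    libraries = json.loads(text)
    if not isinstance(libraries, list):
        raise ValueError("library_info is expected to be a JSON array")
    return libraries


# Made-up example value; real files may carry additional keys.
sample = b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1}]'
libraries = parse_library_info(sample)

# A library_idx from the per-molecule table indexes into this list.
library_idx = 0
print(libraries[library_idx]["library_type"])
```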

The HDF5 group barcode_info provides information about the barcodes that were called as cells during the analysis. This HDF5 group contains two datasets.

| Dataset | Type | Description |
|---|---|---|
| genomes | string | A list of all genome references used in this analysis. In most cases, this will be a single genome. |
| pass_filter | uint64 | A matrix with three columns that contains one row per cell-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. |
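
The first column of pass_filter can be used to restrict the per-molecule table to cell-associated barcodes. The sketch below assumes pass_filter has been loaded as an iterable of three-element rows and barcodes as a sequence of strings (with some HDF5 readers the barcodes arrive as bytes and need decoding first). Both helper names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set


def cell_barcode_indices(pass_filter: Iterable[Sequence[int]]) -> Set[int]:
    """Collect the barcode_idx values (column 0) of all cell-called barcodes."""
    return {int(row[0]) for row in pass_filter}


def cell_barcodes(pass_filter: Iterable[Sequence[int]],
                  barcodes: Sequence[str]) -> List[str]:
    """Resolve cell-called barcode indices to sequences, in index order."""
    return [barcodes[i] for i in sorted(cell_barcode_indices(pass_filter))]


# Toy data: rows are (barcode_idx, library_idx, genome_idx); values are made up.
toy_pass_filter = [(2, 0, 0), (0, 0, 0)]
toy_barcodes = ["AAAC", "AAAG", "AAAT"]

cells = cell_barcode_indices(toy_pass_filter)
print(cell_barcodes(toy_pass_filter, toy_barcodes))

# Keep only molecules whose barcode was called as a cell.
toy_barcode_idx = [0, 1, 1, 2]
kept_rows = [i for i, b in enumerate(toy_barcode_idx) if b in cells]
print(kept_rows)
```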

The HDF5 group features contains information regarding the feature reference used for the analysis. The datasets within the features group represent columns of a table containing one row per feature. Values in the feature_idx column described in the previous section provide indices into the rows of this table.

In addition to the columns described below, _all_tag_keys contains a list of built-in tags (genome).

| Column | Type | Description |
|---|---|---|
| feature_type | string | The type of feature reference to which this feature belongs (Gene Expression). |
| genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). |
| id | string | The Ensembl gene ID corresponding to this feature. |
| name | string | The common gene symbol associated with each of the above ids. |
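
Because each row of the per-molecule table is one unique molecule, counting rows per feature_idx and looking the index up in the features table gives a per-feature molecule count. The following is a rough sketch under the assumption that feature_idx and the features id column are already in memory as plain sequences. The helper name and the placeholder IDs are hypothetical, and the count covers all barcodes, not only those called as cells.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def molecules_per_feature(feature_idx: Iterable[int],
                          feature_ids: Sequence[str]) -> Dict[str, int]:
    """Count molecules (rows) per feature, keyed by the feature's id."""
    counts = Counter(int(i) for i in feature_idx)
    return {feature_ids[i]: n for i, n in counts.items()}


# Toy data: placeholder IDs, not real Ensembl identifiers.
toy_feature_ids = ["GENE_A", "GENE_B", "GENE_C"]
toy_feature_idx = [0, 2, 2]

print(molecules_per_feature(toy_feature_idx, toy_feature_ids))
```

The same lookup works for the name column if gene symbols are preferred over IDs.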

The UMI sequences are 2-bit encoded as follows:

  • Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
  • The least significant byte (LSB) contains the 3'-most nucleotides.
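
The two rules above can be sketched as a small decoder. This is an illustrative implementation, not code from the pipeline: `decode_umi` is a hypothetical name, and the UMI length must be supplied by the caller because leading "A" bases are encoded as zero bits and cannot be recovered from the integer alone (the length is not stated on this page, so check it against your chemistry).

```python
NUCLEOTIDES = "ACGT"  # 2-bit code -> base: 0="A", 1="C", 2="G", 3="T"


def decode_umi(value: int, length: int) -> str:
    """Decode a 2-bit encoded UMI integer into a nucleotide string.

    The lowest two bits hold the 3'-most base, so bases are read from the
    low end and the result is reversed to give 5'-to-3' order.
    """
    value = int(value)
    if length <= 0:
        raise ValueError("length must be positive")
    if value < 0 or value >> (2 * length):
        raise ValueError("value does not fit in the given UMI length")
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


# 0b00_01_10_11 encodes A, C, G, T from the 5' end to the 3' end.
print(decode_umi(0b00011011, 4))  # ACGT
```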

Note that the cell-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes HDF5 dataset.

The metrics_json dataset contains pipeline metrics in JSON format that are used internally by Cell Ranger. Users should view metrics using the Cell Ranger ARC metrics outputs.