The cellranger-arc pipeline outputs a position-sorted and indexed BAM file of read alignments to the genome and transcriptome. Reads aligned to the transcriptome across exon junctions in the genome tend to have a large gap in their CIGAR string, e.g. 35M225N64M. Each read in this BAM file has a 10x Chromium cellular barcode (associated with a 10x Genomics gel bead) and molecular barcode information attached. Cell Ranger ARC modifies MAPQ values; see the MM tag below. The following assumes basic familiarity with the BAM format. More details on the SAM/BAM standard are available online.
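As an illustration of the gapped CIGAR strings mentioned above, the following minimal sketch splits a CIGAR string into its operations and reports the skipped (N) regions that typically correspond to introns. The helper names `parse_cigar`, `skipped_regions`, and `reference_span` are hypothetical and not part of Cell Ranger ARC; the operation codes are those defined by the SAM specification.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing characters the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops

def skipped_regions(cigar):
    """Return the lengths of all N (skipped reference) operations."""
    return [n for n, op in parse_cigar(cigar) if op == "N"]

def reference_span(cigar):
    """Number of reference bases covered, including skipped regions."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

print(parse_cigar("35M225N64M"))      # [(35, 'M'), (225, 'N'), (64, 'M')]
print(skipped_regions("35M225N64M"))  # [225]
print(reference_span("35M225N64M"))   # 324
```

In this example the read has 99 aligned bases (35 + 64) but spans 324 bases of the reference because of the 225-base gap.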
The cellular barcode and molecular barcode information for each read is stored as TAG fields:
Tag | Type | Description |
---|---|---|
CB | Z | Chromium cellular barcode sequence that is error-corrected and confirmed against a list of known-good barcode sequences. |
CR | Z | Chromium cellular barcode sequence as reported by the sequencer. |
CY | Z | Chromium cellular barcode read quality. Phred scores as reported by sequencer. |
UB | Z | Chromium molecular barcode sequence that is error-corrected among other molecular barcodes with the same cellular barcode and gene alignment. |
UR | Z | Chromium molecular barcode sequence as reported by the sequencer. |
UY | Z | Chromium molecular barcode read quality. Phred scores as reported by sequencer. |
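The tags in the table above appear as optional `TAG:TYPE:VALUE` fields after the 11 mandatory columns of a SAM record. The sketch below shows one way to pull them out of a text SAM line (for example, one produced by `samtools view`). The function name `parse_sam_tags` is hypothetical, and the example record, including its UMI and quality strings, is invented for illustration.

```python
def parse_sam_tags(sam_line):
    """Return the optional fields of one SAM record as {tag: (type, value)}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory
        # Split on the first two colons only; values may contain ':' themselves.
        tag, typ, value = field.split(":", 2)
        if typ == "i":
            value = int(value)
        tags[tag] = (typ, value)
    return tags

# Invented record; SEQ and QUAL are left as '*' to keep the example short.
record = "\t".join([
    "read1", "0", "chr1", "1000", "255", "35M225N64M", "*", "0", "0", "*", "*",
    "CB:Z:AGAATGGTCTGCAT-1", "CR:Z:AGAATGGTCTGCAT", "CY:Z:FFFFFFFFFFFFFF",
    "UB:Z:ACGTACGTAC", "UR:Z:ACGTACGTAC", "UY:Z:FFFFFFFFFF",
])

tags = parse_sam_tags(record)
print(tags["CB"])  # ('Z', 'AGAATGGTCTGCAT-1')
print(tags["UB"])  # ('Z', 'ACGTACGTAC')
```

Reads whose barcode could not be corrected would be expected to lack the CB tag, so lookups in real data should allow for missing keys.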
The cell barcode CB tag includes a suffix with a dash separator followed by a number:
AGAATGGTCTGCAT-1
This number denotes what we call a GEM well, and is used to serialize barcodes in order to achieve a higher effective barcode diversity when combining samples generated from separate GEM chip channel runs. Note that Cell Ranger ARC does not support multi-GEM-well analysis at this time. Normally, this number will be "1" across all barcodes when analyzing a sample generated from a single GEM chip channel. It can either be left in place and treated as part of a unique barcode identifier, or explicitly parsed out to leave only the barcode sequence itself.
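If the suffix is to be parsed out, a minimal sketch could look like the following. The function name `split_barcode` is hypothetical; the only assumption is that the CB value has the form `SEQUENCE-N` described above.

```python
def split_barcode(cb):
    """Split a CB value of the form 'SEQUENCE-N' into (sequence, gem_well)."""
    sequence, sep, well = cb.rpartition("-")
    if not sep or not sequence or not well.isdigit():
        raise ValueError(f"unexpected barcode format: {cb!r}")
    return sequence, int(well)

sequence, gem_well = split_barcode("AGAATGGTCTGCAT-1")
print(sequence)  # AGAATGGTCTGCAT
print(gem_well)  # 1
```

Keeping the suffix as part of the identifier is the simpler choice when barcodes only need to be compared for equality; splitting is useful mainly when the bare sequence is needed.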
The following tags will also be present on reads that mapped to the genome and overlapped either an exon or an intron by at least one base pair (default mode). When cellranger-arc count is run with the --gex-exclude-introns argument, alignment tags are restricted to reads overlapping exons. A read may align to multiple transcripts and genes, but it is only considered confidently mapped to the transcriptome if it maps to a single gene.
Tag | Type | Description |
---|---|---|
RE | A | Single character indicating the region type of this alignment (E = exonic, N = intronic, I = intergenic). |
TX | Z | Semicolon-separated list of the transcripts (or genes) compatible with this alignment, present on reads aligned to the same strand as those transcripts (or genes). For a transcriptomic alignment overlapping an exonic region, the format of each entry is [transcript_id],[strand][pos],[cigar] ; where transcript_id is specified by the reference GTF, strand is either + or - , pos is the alignment offset in transcript coordinates, and cigar is the CIGAR string in transcript coordinates. For a genomic alignment overlapping an intronic region, the format of each entry is [gene_id],[strand] ; where gene_id is specified by the reference GTF and strand is either + or - . |
AN | Z | Same as the TX tag, but for reads that are aligned to the antisense strand of annotated transcripts (or genes). |
GX | Z | Semicolon-separated list of gene IDs that are compatible with this alignment. Gene IDs are specified with the gene_id key in the reference GTF attribute column. |
GN | Z | Semicolon-separated list of gene names that are compatible with this alignment. Gene names are specified with the gene_name key in the reference GTF attribute column. |
MM | i | Set to 1 if the genome-aligner (STAR) originally gave a MAPQ < 255 (it multi-mapped to the genome) and Cell Ranger changed it to 255 because the read overlapped exactly one gene. |
pa | i | The number of poly-A nucleotides trimmed from the 3' end of read 2. Up to 10% mismatches are permitted. |
ts | i | The number of template switch oligo (TSO) nucleotides trimmed from the 5' end of read 2. Up to 3 mismatches are permitted. The 30-bp TSO sequence is AAGCAGTGGTATCAACGCAGAGTACATGGG . |
xf | i | Extra alignment flags. The bits of this tag are interpreted as follows: |
The following tag represents the feature sequence extracted from the read, and the feature reference it was matched to (if any). The BAM read sequence will contain all the bases outside of the cell barcode and UMI regions.
Tag | Type | Description |
---|---|---|
fx | Z | Feature identifier matched to this feature read. Specified in the id column of the feature reference. |
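The two entry formats described for the TX tag (and shared by the AN tag) can be told apart by the number of comma-separated parts in each entry. The sketch below is one possible parser under that reading; `TxEntry` and `parse_tx` are hypothetical names, the identifiers in the example are invented, and the code assumes that transcript and gene IDs contain no commas or semicolons.

```python
from typing import NamedTuple, Optional

class TxEntry(NamedTuple):
    feature_id: str        # transcript_id (exonic entry) or gene_id (intronic entry)
    strand: str            # '+' or '-'
    pos: Optional[int]     # offset in transcript coordinates; None for intronic entries
    cigar: Optional[str]   # CIGAR in transcript coordinates; None for intronic entries

def parse_tx(value):
    """Parse a TX or AN tag value into a list of TxEntry records."""
    entries = []
    for item in value.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        parts = item.split(",")
        if len(parts) == 3:
            # [transcript_id],[strand][pos],[cigar]
            feature_id, strand_pos, cigar = parts
            entries.append(TxEntry(feature_id, strand_pos[0], int(strand_pos[1:]), cigar))
        elif len(parts) == 2:
            # [gene_id],[strand]
            feature_id, strand = parts
            entries.append(TxEntry(feature_id, strand, None, None))
        else:
            raise ValueError(f"unexpected TX entry: {item!r}")
    return entries

# Invented identifiers, used only to exercise both entry formats.
for entry in parse_tx("ENST00000000001,+120,99M;ENSG00000000002,-"):
    print(entry)
```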