Cell Ranger vdj Annotations

The structure of a typical V(D)J transcript:

UTR: Untranslated region; FWR: Framework region; CDR: Complementarity determining region

The cellranger vdj pipeline provides amino acid and nucleotide sequences for framework and complementarity determining regions (CDRs). The V(D)J annotations on the assembled contigs and on the clonotype consensus sequences are produced in multiple formats.

Learn more about productive contigs on the Annotation Algorithm page.

CSV: High-level annotations with one contig, consensus, or clonotype per row.
JSON: Detailed annotations, including alignment coordinates and amino acid translations.
BED: Germline V(D)J segments as features for use with tools like IGV.
TSV: Used for the AIRR rearrangement format of V(D)J contigs and consensus sequences.

clonotypes.csv: High-level descriptions of each clonotype.
consensus_annotations.csv: High-level and detailed annotations of each clonotype consensus sequence.
filtered_contig_annotations.csv: High-level annotations of each high-confidence contig from cell-associated barcodes. This is a subset of all_contig_annotations.csv.
all_contig_annotations.csv: High-level and detailed annotations of all contigs (from cell and background barcodes) in CSV format.
all_contig_annotations.bed: High-level and detailed annotations of all contigs (from cell and background barcodes) in BED format.
all_contig_annotations.json: High-level and detailed annotations of all contigs (from cell and background barcodes) in JSON format.
airr_rearrangement.tsv: Annotated contigs and consensus sequences of V(D)J rearrangements in the AIRR format.

The clonotypes.csv file provides high-level descriptions of each clonotype.

Column	Description
`clonotype_id`	The ID of the clonotype to which this consensus sequence was assigned.
`frequency`	The observed number of cell barcodes with this clonotype.
`proportion`	The observed fraction of cell barcodes with this clonotype.
`cdr3s_aa`	A semicolon-delimited list of chain:sequence pairs, where chain is TRA, TRB, TRG, TRD, IGK, IGL, or IGH and sequence is the CDR3 amino acid sequence for that chain.
`cdr3s_nt`	A semicolon-delimited list of chain:sequence pairs, where chain is TRA, TRB, TRG, TRD, IGK, IGL, or IGH and sequence is the CDR3 nucleotide sequence for that chain.
`inkt_evidence`	For T cells, this column indicates whether the clonotype is a group of iNKT cells. The evidence is a semicolon-delimited list of `chain:matches` pairs, where chain is one of TRA or TRB and matches is one of `genes`, `junction`, or `genes+junction`. See iNKT/MAIT for more information.
`mait_evidence`	For T cells, this column indicates whether the clonotype is a group of MAIT cells. The evidence is a semicolon-delimited list of `chain:matches` pairs, where chain is one of TRA or TRB and matches is one of `genes`, `junction`, or `genes+junction`. See iNKT/MAIT for more information.
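
The `cdr3s_aa` and `cdr3s_nt` columns pack several chains into one string, so they usually need to be split before analysis. The following is a minimal sketch of that step; `parse_cdr3s` is a hypothetical helper name and the example value is made up for illustration, not taken from a real run.

```python
def parse_cdr3s(field):
    """Split a cdr3s_aa / cdr3s_nt value into (chain, sequence) pairs."""
    pairs = []
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty fields and trailing semicolons
        chain, _, sequence = item.partition(":")
        pairs.append((chain, sequence))
    return pairs


# Made-up value in the chain:sequence format described in the table above.
example = "TRA:CAVSDLEPNSSASKIIF;TRB:CASSVVGDRGSPLHF"
pairs = parse_cdr3s(example)
print(pairs)
```

The same function also splits the `chain:matches` lists in `inkt_evidence` and `mait_evidence`, since they use the same delimiters.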

The consensus_annotations.csv file provides high-level and detailed annotations of each clonotype consensus sequence.

Column	Description
`clonotype_id`	The ID of the clonotype to which this consensus sequence was assigned.
`consensus_id`	The ID of this consensus sequence.
`v_start`	0-based index of the V region start position on the consensus sequence.
`v_end`	0-based index of the V region end position on the consensus sequence.
`v_end_ref`	0-based index of the V gene end position on the reference.
`j_start`	0-based index of the J region start position on the consensus sequence.
`j_start_ref`	0-based index of the J gene start position on the reference.
`j_end`	0-based index of the J region end position on the consensus sequence.
`cdr3_start`	0-based index of the CDR3 region start position on the consensus sequence.
`cdr3_end`	0-based index of the CDR3 region end position on the consensus sequence.

The all_contig_annotations.json file provides high-level and detailed annotations of all contigs (from cell and background barcodes) in JSON format. This file can be used to learn more about each assembled contig and to investigate why some contigs were filtered out. The all_contig_annotations.json file is the input required to run enclone.

The table below lists the columns of the contig annotation CSV files (all_contig_annotations.csv and its subset, filtered_contig_annotations.csv).

Column	Description
`barcode`	Cell barcode for this contig.
`is_cell`	True or False value indicating whether the barcode was called as a cell.
`contig_id`	Unique identifier for this contig.
`high_confidence`	True or False value indicating whether the contig was called as high-confidence (unlikely to be a chimeric sequence or other artifact).
`length`	The contig sequence length in nucleotides.
`chain`	The chain associated with this contig: TRA, TRB, IGK, IGL, or IGH.
`v_gene`	The highest-scoring V segment, e.g., TRAV1-1.
`d_gene`	The highest-scoring D segment, e.g., TRBD1.
`j_gene`	The highest-scoring J segment, e.g., TRAJ1-1.
`full_length`	True or False value indicating if the contig was declared as full-length.
`productive`	True or False value indicating if the contig was declared as productive.
`fwr1`	The predicted FWR1 amino acid sequence.
`fwr1_nt`	The predicted FWR1 nucleotide sequence.
`cdr1`	The predicted CDR1 amino acid sequence.
`cdr1_nt`	The predicted CDR1 nucleotide sequence.
`fwr2`	The predicted FWR2 amino acid sequence.
`fwr2_nt`	The predicted FWR2 nucleotide sequence.
`cdr2`	The predicted CDR2 amino acid sequence.
`cdr2_nt`	The predicted CDR2 nucleotide sequence.
`fwr3`	The predicted FWR3 amino acid sequence.
`fwr3_nt`	The predicted FWR3 nucleotide sequence.
`cdr3`	The predicted CDR3 amino acid sequence.
`cdr3_nt`	The predicted CDR3 nucleotide sequence.
`fwr4`	The predicted FWR4 amino acid sequence.
`fwr4_nt`	The predicted FWR4 nucleotide sequence.
`reads`	The number of reads aligned to this contig.
`umis`	The number of distinct UMIs aligned to this contig.
`raw_clonotype_id`	The ID of the clonotype to which this cell barcode was assigned.
`raw_consensus_id`	The ID of the consensus sequence to which this contig was assigned.
`exact_subclonotype_id`	The ID of the exact subclonotype to which this cell barcode was assigned.
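
A common first step with the contig table is to keep only contigs that are cell-associated, high-confidence, and productive, then group them by barcode. The sketch below does this with the standard `csv` module. The function names and the tiny inline table are hypothetical, and the True/False values are compared case-insensitively on the assumption that capitalisation may vary between files.

```python
import csv
import io
from collections import defaultdict


def is_true(value):
    """Interpret a True/False column value regardless of capitalisation."""
    return value.strip().lower() == "true"


def productive_chains_per_barcode(handle):
    """Map each cell barcode to the chains of its high-confidence, productive contigs."""
    chains = defaultdict(list)
    for row in csv.DictReader(handle):
        if (is_true(row["is_cell"])
                and is_true(row["high_confidence"])
                and is_true(row["productive"])):
            chains[row["barcode"]].append(row["chain"])
    return dict(chains)


# Made-up rows containing only the columns used here.
sample = io.StringIO(
    "barcode,is_cell,contig_id,high_confidence,chain,productive\n"
    "AAAC-1,True,AAAC-1_contig_1,True,TRA,True\n"
    "AAAC-1,True,AAAC-1_contig_2,True,TRB,True\n"
    "AAAC-1,True,AAAC-1_contig_3,True,TRB,False\n"
    "CCCG-1,False,CCCG-1_contig_1,True,TRB,True\n"
)
result = productive_chains_per_barcode(sample)
print(result)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.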

Details on how the Cell Ranger algorithm delimits CDRs (Complementarity Determining Regions) and FWRs (Framework Regions) are provided on the enclone features page.

The all_contig_annotations.bed file provides high-level and detailed annotations of all contigs (from cell and background barcodes) in BED format. The columns are not named but correspond to:

Contig name
Nucleotide position at which the contig annotation starts
Nucleotide position at which the contig annotation ends
Annotation

The all_contig_annotations.bed provides information about the structure of each assembled contig and allows further investigation into why some contigs were filtered out. An example all_contig_annotations.bed is shown here:


AAACCTGAGACAGGCT-1_contig_1	0	36	IGKV3-11_5'UTR
AAACCTGAGACAGGCT-1_contig_1	36	381	IGKV3-11_L-REGION+V-REGION
AAACCTGAGACAGGCT-1_contig_1	376	415	IGKJ2_J-REGION
AAACCTGAGACAGGCT-1_contig_1	415	551	IGKC_C-REGION
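
The coordinates in the example above can be turned into feature lengths with a few lines of Python. This is a minimal sketch that assumes the file follows the standard BED convention (0-based start, end-exclusive) and that fields are tab-separated; `parse_bed_line` is a hypothetical helper, not part of Cell Ranger.

```python
def parse_bed_line(line):
    """Split one four-column BED line into (contig, start, end, annotation)."""
    contig, start, end, annotation = line.rstrip("\n").split("\t")
    return contig, int(start), int(end), annotation


# The four example lines shown above.
bed_text = (
    "AAACCTGAGACAGGCT-1_contig_1\t0\t36\tIGKV3-11_5'UTR\n"
    "AAACCTGAGACAGGCT-1_contig_1\t36\t381\tIGKV3-11_L-REGION+V-REGION\n"
    "AAACCTGAGACAGGCT-1_contig_1\t376\t415\tIGKJ2_J-REGION\n"
    "AAACCTGAGACAGGCT-1_contig_1\t415\t551\tIGKC_C-REGION\n"
)

features = [parse_bed_line(line) for line in bed_text.splitlines()]

# Under the end-exclusive assumption, length is simply end minus start.
lengths = {annotation: end - start for _, start, end, annotation in features}
for annotation, length in lengths.items():
    print(annotation, length)
```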

Each contig record in all_contig_annotations.json contains the following fields:

Field	Description
`barcode`	Barcode sequence
`contig_name`	Name of the contig
`sequence`	Nucleotide sequence of the contig
`quals`	Contig quality score
`fraction_of_reads_for_this_barcode_provided_as_input_to_assembly`	Fraction of reads for this barcode that were provided as input to the assembly algorithm
`read_count`	Number of reads assigned to this contig
`umi_count`	Number of UMIs assigned to this contig
`start_codon_pos`	Starting nucleotide base position of the start codon on the contig
`stop_codon_pos`	Last nucleotide base position of stop codon on the contig
`aa_sequence`	Amino acid sequence of the contig
`frame`	Unused field. Ignored by the algorithm.
`cdr3`	Amino acid sequence of the contig's CDR3
`cdr3_seq`	Nucleotide sequence of the contig's CDR3
`cdr3_start`	Starting base of the contig's CDR3
`cdr3_stop`	Last base of the contig's CDR3
`fwr1`-`fwr4`	Optional; Start and stop positions of the contig's FWR1-FWR4 regions
`cdr1`-`cdr2`	Optional; Start and stop positions of the contig's CDR1-CDR2 regions
`annotations`	The annotations for the contig from the reference file
`clonotype`	Null; filled in after clonotyping
`high_confidence`	TRUE or FALSE statement of whether the contig has high confidence
`validated_umis`	A list of UMIs that have been validated
`non_validated_umis`	A list of UMIs that have not been validated
`invalidated_umis`	A list of invalidated UMIs
`is_cell`	TRUE or FALSE statement about whether the barcode was declared a cell
`productive`	TRUE or FALSE statement about whether the contig was productive based on five criteria. NULL=not full length.
`filtered`	Always TRUE
`is_gex_cell`	TRUE or FALSE statement about whether the barcode was declared a cell by Gene expression data. Null=Data not available
`is_asm_cell`	TRUE or FALSE statement about whether the barcode was declared a cell by the VDJ assembler. Null=Data not available
`full_length`	TRUE or FALSE statement about whether the contig is full length.
`junction_support`	A map of `{Reads: x, UMIs: y}` supporting the junction region of a contig. This information is generated by the `cellranger vdj` assembler for productive contigs in reference-assisted assembly (or valid contigs in de novo assembly) and used for confidence determination and cell filtering.
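
Because several of these fields can be null, a quick tally is a useful way to see how contigs break down before looking at individual records. The sketch below counts contigs by `is_cell` and `productive`. It assumes the file is a top-level JSON array with one object per contig; `summarize_contigs` and the three records are made up for illustration.

```python
import json
from collections import Counter


def summarize_contigs(records):
    """Count contigs by (is_cell, productive); JSON null is kept as None."""
    return Counter((rec.get("is_cell"), rec.get("productive")) for rec in records)


# Made-up records with only a few of the fields listed above.
text = """[
  {"barcode": "AAAC-1", "contig_name": "AAAC-1_contig_1", "is_cell": true, "productive": true},
  {"barcode": "AAAC-1", "contig_name": "AAAC-1_contig_2", "is_cell": true, "productive": null},
  {"barcode": "CCCG-1", "contig_name": "CCCG-1_contig_1", "is_cell": false, "productive": false}
]"""

summary = summarize_contigs(json.loads(text))
for key, count in summary.items():
    print(key, count)
```

For a real file, `json.loads(text)` would be replaced by `json.load` on an open file handle.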

The airr_rearrangement.tsv file provides the annotated contigs and consensus sequences of V(D)J rearrangements in the AIRR format.

Column	Description
`cell_id`	Cell barcode defining the cell for the query sequence.
`clone_id`	Clonotype ID/clonotype assignment.
`sequence_id`	The name of the contig associated with the rearrangement.
`sequence`	The nucleotide sequence of the rearrangement.
`sequence_aa`	The amino acid sequence of the rearrangement.
`productive`	Whether or not the rearrangement is productive.
`rev_comp`	Set to `false` by default (10x Genomics V(D)J sequences are not reverse complemented).
`v_call`	The name of the aligned V gene for the rearrangement.
`v_cigar`	The CIGAR string of the V gene alignment.
`d_call`	The name of the aligned D gene for the rearrangement.
`d_cigar`	The CIGAR string of the D gene alignment.
`j_call`	The name of the aligned J gene for the rearrangement.
`j_cigar`	The CIGAR string of the J gene alignment.
`c_call`	The name of the aligned C gene for the rearrangement.
`c_cigar`	The CIGAR string of the C gene alignment.
`sequence_alignment`	The aligned sequence of the VDJ rearrangement.
`germline_alignment`	The assembled, aligned, full-length inferred germline sequence of the aligned sequence.
`junction`	The nucleotide sequence of the rearrangement's junction (CDR3).
`junction_aa`	The amino acid sequence of the rearrangement's junction (CDR3).
`junction_length`	The length of the rearrangement's junction nucleotide sequence.
`junction_aa_length`	The length of the rearrangement's junction amino acid sequence.
`v_sequence_start`	1-based index on the contig of the V region start position.
`v_sequence_end`	1-based index on the contig of the V region end position.
`d_sequence_start`	1-based index on the contig of the D region start position.
`d_sequence_end`	1-based index on the contig of the D region end position.
`j_sequence_start`	1-based index on the contig of the J region start position.
`j_sequence_end`	1-based index on the contig of the J region end position.
`c_sequence_start`	1-based index on the contig of the C region start position.
`c_sequence_end`	1-based index on the contig of the C region end position.
`consensus_count`	The number of reads associated with this rearrangement.
`duplicate_count`	The number of unique molecular identifiers associated with this rearrangement.
`is_cell`	Is this rearrangement cell-associated?

The AIRR rearrangement file includes all mandatory AIRR fields and several optional variables to enhance reproducibility and guide analyses.
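
The `*_sequence_start` and `*_sequence_end` columns are 1-based, whereas Python slices are 0-based and end-exclusive, so extracting a region takes a small conversion. The sketch below assumes the end positions are inclusive, following the closed-interval convention of the AIRR Rearrangement schema; `airr_region` is a hypothetical helper and the single row is made up for illustration.

```python
import csv
import io


def airr_region(sequence, start, end):
    """Return the region covered by 1-based, inclusive AIRR coordinates."""
    return sequence[int(start) - 1:int(end)]


# Made-up row containing only the columns used here.
sample = io.StringIO(
    "sequence_id\tsequence\tv_sequence_start\tv_sequence_end"
    "\tj_sequence_start\tj_sequence_end\n"
    "AAAC-1_contig_1\tGGATGCCCTTAGGA\t3\t8\t10\t13\n"
)

rows = list(csv.DictReader(sample, delimiter="\t"))
row = rows[0]
v_region = airr_region(row["sequence"], row["v_sequence_start"], row["v_sequence_end"])
j_region = airr_region(row["sequence"], row["j_sequence_start"], row["j_sequence_end"])
print(v_region, j_region)
```

Rows without a call for a given segment may have empty coordinate fields, so a real script would need to check for that before converting to `int`.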
