Input

Clonotype tables

The processing stage of RepSeq analysis starts with mapping of Variable, Diversity and Joining segments. Mapped reads are then assembled into clonotypes and stored as clonotype abundance tables.

Clonotype

VDJtools clonotype specification includes the following fields:

  • Variable (V) segment name.
  • Diversity (D) segment name for some of the receptor chains (TRB, TRD and IGH). Set to . if not applicable or if the D segment was not identified.
  • Joining (J) segment name.
  • Complementarity determining region 3 nucleotide sequence (CDR3nt). CDR3 starts with the Variable segment reference point (conserved Cys residue) and ends with the Joining segment reference point (conserved Phe/Trp residue).
  • Translated CDR3 sequence (CDR3aa).
  • Somatic hypermutations (SHMs) in the variable segment (antibody only, planned).

Important

For ambiguous segment assignments, encoded as a comma-separated list of segment names, only the first name is selected.
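The rule above can be illustrated with a minimal Python sketch. The function name primary_segment is hypothetical and is not part of VDJtools; it only shows the "keep the first name" behaviour described here.

```python
def primary_segment(assignment: str) -> str:
    """Return the first name from a comma-separated segment assignment.

    Hypothetical helper for illustration; not a VDJtools function.
    """
    return assignment.split(",")[0].strip()


# An ambiguous V assignment keeps only its first entry.
print(primary_segment("TRBV12-3,TRBV12-4"))
```

An unambiguous assignment (a single name, or the . placeholder) is returned unchanged.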

Hint

In case of non-coding CDR3 sequences, the convention is to translate in both directions: upstream from V segment reference point and downstream from J segment reference point. The resulting sequence (e.g. CASSLA_TNEKFF) is linked by a _ symbol that marks the incomplete codon.
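The two-sided translation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the VDJtools implementation: the function names are hypothetical, the standard genetic code is used, and the split point (half of the complete codons translated from the V side, the rest from the J side) is an assumption, since the text does not say where the incomplete codon is placed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt: str) -> str:
    """Translate complete codons left to right; unknown codons become '?'."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "?") for i in range(0, usable, 3))


def translate_cdr3(cdr3nt: str) -> str:
    """Translate a CDR3; mark the incomplete codon of a non-coding one with '_'."""
    cdr3nt = cdr3nt.upper()
    n_codons, remainder = divmod(len(cdr3nt), 3)
    if remainder == 0:
        return translate(cdr3nt)
    # Assumption: the incomplete codon sits in the middle of the sequence.
    left_codons = n_codons // 2
    right_codons = n_codons - left_codons
    left = translate(cdr3nt[:3 * left_codons])
    right = translate(cdr3nt[len(cdr3nt) - 3 * right_codons:])
    return left + "_" + right
```

With this split, a 23-nucleotide sequence yields three residues read from the V side, the _ marker, and four residues read from the J side.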

Clonotype abundance data is represented by count and frequency fields:

  • Count: number of reads, or number of cDNA/DNA molecules when UMIs are used.
  • Frequency: the share of clonotype in the sample. While seemingly redundant, this property is left for compatibility with cases when the sample represents a subset of another one, e.g. clonotypes from PBMCs filtered by intersection with lymph node clonotypes.
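As a small illustration of how the two fields relate, the Python sketch below derives frequencies from counts for a complete sample. The function name is hypothetical. For a sample that is a subset of another one, the stored frequencies may instead refer to the parent sample and need not sum to 1, which is why the frequency field is kept separately.

```python
def frequencies(counts):
    """Return each clonotype's share of the total count in a complete sample."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads or molecules")
    return [count / total for count in counts]


# Three clonotypes with 2, 1 and 1 reads (or UMI-tagged molecules).
print(frequencies([2, 1, 1]))
```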

The following fields are optional, but are used for computing various statistics and visualization:

  • Vend, Dstart, Dend and Jstart: positions marking the V, D and J segment boundaries within the CDR3 nucleotide sequence (inclusive).

Tip

VDJtools accepts gzip-compressed files; such files should have a .gz suffix. Input data should be provided as a tab-delimited table.
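When inspecting such tables outside VDJtools, the same suffix convention is easy to follow. The sketch below is a hypothetical helper, not VDJtools code; it opens a table as text and decompresses it on the fly when the name ends in .gz.

```python
import gzip


def open_table(path):
    """Open a clonotype table for reading as text, handling .gz by suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

The returned handle can be used in a with statement and read line by line in both cases.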

VDJtools format

This is the core tabular format for VDJtools. All datasets should be converted to this format using the Convert routine prior to analysis. Columns 8-11 (Vend, Dstart, Dend and Jstart) are optional.

column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11
count frequency CDR3nt CDR3aa V D J Vend Dstart Dend Jstart
1176 9.90E-02 TGTGCCAGC...AAGCTTTCTTT CAST...EAFF TRBV12-4 TRBD1 TRBJ1-1 11 14 16 23

All additional columns after column 11 will be considered clonotype annotations and carried over unmodified during most stages of VDJtools analysis. This is especially useful when processing results of the Annotation and ScanDatabase (available only up to v1.0.5, use VDJdb) routines.
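For readers who want to inspect a table in this format programmatically, here is a minimal Python sketch. It is not part of VDJtools. It assumes that the eleven columns shown in the example table are all present and treats every column to their right as an annotation. The function name and the annotations key are hypothetical.

```python
import csv
import io

CORE_COLUMNS = ["count", "frequency", "CDR3nt", "CDR3aa", "V", "D", "J"]
OPTIONAL_COLUMNS = ["Vend", "Dstart", "Dend", "Jstart"]


def read_clonotypes(handle):
    """Parse a tab-delimited clonotype table into a list of dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: all eleven fixed columns are present in the file.
    n_fixed = len(CORE_COLUMNS) + len(OPTIONAL_COLUMNS)
    annotation_names = header[n_fixed:]
    clonotypes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header[:n_fixed], row[:n_fixed]))
        record["count"] = int(record["count"])
        record["frequency"] = float(record["frequency"])
        # Everything to the right of the fixed columns is kept verbatim.
        record["annotations"] = dict(zip(annotation_names, row[n_fixed:]))
        clonotypes.append(record)
    return clonotypes


example = io.StringIO(
    "count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tVend\tDstart\tDend\tJstart\tnote\n"
    "5\t0.5\tTGTGCC\tCA\tTRBV12-4\tTRBD1\tTRBJ1-1\t11\t14\t16\t23\thello\n"
)
print(read_clonotypes(example)[0]["annotations"])
```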

Formats supported for conversion

MiTCR

Output from MiTCR software (executable jar, documentation) in full mode can be used without any pre-processing. The corresponding table should start with two header lines (default MiTCR output stores processing options and version in the first line), followed by a clonotype list.

Run Convert routine with -S mitcr argument to prepare datasets in this format for VDJtools analysis.

MiGEC

MiGEC is software for V/D/J mapping and CDR3 extraction that relies on the BLAST algorithm for running alignments. MiGEC additionally implements processing of unique molecular identifier (UMI)-tagged libraries for error correction and dataset normalization. The default output of MiGEC can be used directly with VDJtools.

Run Convert routine with -S migec argument to prepare datasets in this format for VDJtools analysis.

IgBlast (MIGMAP)

As IgBlast doesn't compute a canonical clonotype abundance table, VDJtools supports the output of MIGMAP, a versatile IgBlast wrapper. Note that VDJtools currently does not import somatic hypermutation (SHM) information, and there are no dedicated VDJtools routines for analyzing SHM profiles; you can, however, check out the post-analysis provided by MIGMAP.

Run Convert routine with -S migmap argument to prepare datasets in this format for VDJtools analysis.

ImmunoSEQ

This is one of the most commonly used RepSeq data formats: more than 90% of recently published studies were performed using the immunoSEQ assay. We have implemented a parser for clonotype tables as provided by Adaptive Biotechnologies.

  • The resulting datasets for most studies that use ImmunoSEQ technology can be accessed and exported using the ImmunoSEQ Analyzer.
  • Example datasets in this format can be found in the Supplementary Data section of Spreafico R et al. Ann Rheum Dis. 2014.
  • Column header information was taken from page 24 of the immunoSEQ Analyzer manual.
  • VDJtools will use V/J segment information only at the family level, as many of the clonotypes lack segment (-X) and allele (-X*0Y) information. The clonotype table is then collapsed to handle unique V/J/CDR3 entries.
  • Raw clonotype tables in this format do not contain CDR3 nucleotide sequence. Instead, an entire sequencing read (first column) is provided. Therefore, we have implemented additional algorithms for CDR3 extraction and “virtual” translation to tell out-of-frame clonotypes from partially read ones.
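The family-level collapsing mentioned in the list above can be sketched in Python as follows. This is an illustration only, not the parser used by VDJtools: the function names are hypothetical, and it assumes that the family name is everything before the first - or * in a segment name.

```python
import re
from collections import defaultdict


def to_family(segment: str) -> str:
    """Drop the segment (-X) and allele (*0Y) suffixes, keeping the family name."""
    return re.split(r"[-*]", segment, maxsplit=1)[0]


def collapse(clonotypes):
    """Sum counts of (V, J, CDR3nt, count) tuples that share family-level V/J and CDR3."""
    totals = defaultdict(int)
    for v, j, cdr3nt, count in clonotypes:
        totals[(to_family(v), to_family(j), cdr3nt)] += count
    return dict(totals)


# Two entries that differ only below the family level are merged into one.
print(collapse([
    ("TRBV12-3", "TRBJ1-1", "TGTGCC", 2),
    ("TRBV12-4*01", "TRBJ1-1*01", "TGTGCC", 3),
]))
```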

Attention

Some of the clonotype entries will be dropped during conversion because they contain an incomplete CDR3 sequence (lacking the J segment), which is due to the short reads used in the immunoSEQ assay; see this blog post for details.

Run Convert routine with -S immunoseq argument to prepare datasets in this format for VDJtools analysis. Note that there are currently two possible ImmunoSEQ output formats that have different column naming:

  • This option should be used if you have selected the Export samples option in the ImmunoSEQ Analyzer.
  • If you have used the Export samples v2 option, you should pass the -S immunoseqv2 argument to the VDJtools Convert routine.

IMGT/HighV-QUEST

Another commonly used RepSeq processing tool is the IMGT/HighV-QUEST web server.

Please refer to the official documentation to see the description of output files and their formats.

Tip

The output for each submission consists of several files and only

3_Nt-sequences_${chain}_${sx}_${date}.txt

should be used as an input for VDJtools Convert routine.

Run Convert routine with -S imgthighvquest argument to prepare datasets in this format for VDJtools analysis.

VDJdb

VDJtools has native support for the analysis of clonotype tables annotated with VDJdb software. Note that because those tables can list the same clonotype several times with different annotations, they should not be used directly in most VDJtools routines (e.g. diversity statistics). Check out the VDJdb README for the corresponding guidelines and workarounds.

Vidjil

VDJtools supports parsing the output JSON files produced by the Vidjil software. VDJtools will only use the top clonotypes, which have detailed V/D/J information in the output.

RTCR

VDJtools supports parsing the results.tsv table with clonotype list generated by the RTCR software.

Run Convert routine with -S rtcr argument to prepare datasets in this format for VDJtools analysis.

MiXCR

Output from MiXCR software export routine in full (default) mode can be used without any pre-processing.

Run Convert routine with -S mixcr argument to prepare datasets in this format for VDJtools analysis.

IMSEQ

Output from IMSEQ software can be used if results are collapsed to nucleotide-level clonotypes using -on argument with IMSEQ.

Run Convert routine with -S imseq argument to prepare datasets in this format for VDJtools analysis.

Metadata

Most VDJtools routines will accept multiple sample files as command line arguments for batch processing. This should always be preferred over multiple calls to VDJtools with a single sample, because each call incurs the VDJtools initialisation time.

An alternative way to specify a sample batch is to pass a sample metadata file with the -m option. The file should contain sample file paths and sample names. It can also be supplemented with optional metadata columns that will be appended to analysis results and can be used for plotting.

Additionally, for each step that involves modification of samples (e.g. converting or filtering non-functional rearrangements) a new metadata file will be created in the folder containing the processed sample batch.

Note

  • VDJtools will append metadata fields to its output tables to facilitate the exploration of analysis results.
  • Metadata entries are used as a factor in some analysis routines and most plotting routines.
  • When performing tasks that involve modifying clonotype abundance tables themselves, such as down-sampling, VDJtools will also provide a copy of metadata file pointing to newly generated samples.
  • The newly generated metadata file will contain an additional ..filter.. column, which holds a comma-separated list of the filters that were applied. For example, the DownSample routine run with -n 50000 will append ds:50000 to the ..filter.. column. Note that this column name is reserved and should not be modified.
  • Some routines for working with metadata files can be found in Utilities section.
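The bookkeeping of the ..filter.. column can be pictured with a short Python sketch. It is purely illustrative: the function name is hypothetical, and the handling of an empty value is an assumption, since how VDJtools represents a sample with no filters applied is not stated here.

```python
def append_filter(current: str, applied: str) -> str:
    """Append a filter label to a comma-separated ..filter.. value."""
    # Assumption: a sample with no filters applied has an empty value.
    entries = [entry for entry in current.split(",") if entry]
    entries.append(applied)
    return ",".join(entries)


# Down-sampling to 50000 reads adds the ds:50000 label.
print(append_filter("", "ds:50000"))
```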

Below are the basic guidelines for creating a metadata file.

  • Metadata file should be a tab-delimited table, e.g.

    #file.name sample.id col.name ...
    sample_1.txt sample_1 A ...
    sample_2.txt sample_2 A ...
    sample_3.txt sample_3 B ...
    sample_4.txt sample_4 C ...
    ... ... ... ...
  • The header is mandatory; the first two columns should be named file_name and sample_id. The names of the remaining columns will later be used as metadata variable names.

  • First two columns should contain the file name and sample id respectively.

    • The file name should be either an absolute path (e.g. /Users/username/somedir/file.txt) or a path relative to the parent directory of metadata file (e.g. ../file.txt)
    • Sample IDs should be unique
  • Columns after sample.id are treated as metadata entries. There are also several cases when info from metadata is used during execution:

    • VDJtools plotting routines can be directed to use metadata fields for naming samples and creating intuitive legends. If a column name contains spaces, it should be quoted, e.g. -f "patient id"
    • Metadata fields are categorized as factor (contain only strings), numeric (contain only numbers) and semi-numeric (numbers and strings). Numeric and semi-numeric fields can be used for gradient coloring by plotting routines.
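The guidelines above can be checked with a short Python sketch before a metadata file is passed to VDJtools. It is a hypothetical helper, not VDJtools code. It assumes a tab-delimited file with a header line, resolves relative file names against the directory of the metadata file, verifies that sample IDs are unique, and classifies each metadata column as factor, numeric or semi-numeric. Treating any value accepted by float() as a number is an assumption about how "numeric" is decided.

```python
import csv
import io
from pathlib import Path


def is_number(value: str) -> bool:
    """Return True if the value parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify(values) -> str:
    """Classify a metadata column by the kinds of values it contains."""
    numeric = [is_number(v) for v in values]
    if all(numeric):
        return "numeric"
    if any(numeric):
        return "semi-numeric"
    return "factor"


def read_metadata(handle, metadata_path):
    """Return sample files, sample IDs and the type of each metadata column."""
    base_dir = Path(metadata_path).parent
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # the header line is mandatory
    rows = [row for row in reader if row]
    sample_ids = [row[1] for row in rows]
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample IDs must be unique")
    # Joining an absolute path onto base_dir leaves it unchanged.
    files = [base_dir / row[0] for row in rows]
    column_types = {
        name: classify([row[i] for row in rows])
        for i, name in enumerate(header)
        if i >= 2
    }
    return files, sample_ids, column_types
```

For example, a column holding 10 and 12 is classified as numeric, one holding 5 and low as semi-numeric, and one holding A and B as factor.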