Input¶
Clonotype tables¶
The processing stage of RepSeq analysis starts with mapping of Variable, Diversity and Joining segments. Mapped reads are then assembled into clonotypes and stored as clonotype abundance tables.
Clonotype¶
VDJtools clonotype specification includes the following fields:
- Variable (V) segment name.
- Diversity (D) segment name for some of the receptor chains (TRB, TRD and IGH). Set to . if not applicable or if the D segment was not identified.
- Joining (J) segment name.
- Complementarity determining region 3 nucleotide sequence (CDR3nt). CDR3 starts with the Variable segment reference point (conserved Cys residue) and ends with the Joining segment reference point (conserved Phe/Trp residue).
- Translated CDR3 sequence (CDR3aa).
- Somatic hypermutations (SHMs) in the variable segment (antibody only, planned).
Important
For ambiguous segment assignments encoded by a comma separated list of segment names only the first one is selected.
Hint
In case of non-coding CDR3 sequences, the convention is to translate in both directions: upstream from V segment reference point and downstream from J segment reference point. The resulting sequence (e.g. CASSLA_TNEKFF) is linked by a _ symbol that marks the incomplete codon.
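The two-sided translation can be illustrated with a minimal Python sketch. It is not VDJtools code: the function name `translate_cdr3` is hypothetical, and the choice to split the whole codons evenly between the two ends (placing `_` at the midpoint) is an assumption made for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt):
    """Translate a nucleotide string whose length is a multiple of 3."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "?") for i in range(0, len(nt), 3))


def translate_cdr3(cdr3nt):
    """Translate a CDR3; out-of-frame sequences get a '_' for the incomplete codon.

    Assumption: whole codons are split evenly between the V side and the J side.
    """
    cdr3nt = cdr3nt.upper()
    n_codons, remainder = divmod(len(cdr3nt), 3)
    if remainder == 0:
        return translate(cdr3nt)
    v_codons = n_codons // 2            # codons read from the V side
    j_codons = n_codons - v_codons      # codons read from the J side
    v_part = cdr3nt[:3 * v_codons]
    j_part = cdr3nt[len(cdr3nt) - 3 * j_codons:]
    return translate(v_part) + "_" + translate(j_part)
```

For instance, `translate_cdr3("TGTGCCAGCAGTTTAGCG" + "G" + "ACTAATGAAAAATTTTTC")` leaves the single middle nucleotide untranslated and marks its position with `_`.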
Clonotype abundance data is represented by count and frequency fields:
- Count: number of reads or cDNA/DNA molecules in case UMIs are used.
- Frequency: the share of clonotype in the sample. While seemingly redundant, this property is left for compatibility with cases when the sample represents a subset of another one, e.g. clonotypes from PBMCs filtered by intersection with lymph node clonotypes.
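In the simplest case, where the sample is not a subset of another one, the frequency is just the count divided by the sample total. The sketch below shows that arithmetic; the helper name `frequencies` is hypothetical and not part of VDJtools.

```python
def frequencies(counts):
    """Return the share of each clonotype given raw read/molecule counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads, frequencies are undefined")
    return [count / total for count in counts]
```

When a sample is a filtered subset of a larger one, the stored frequencies may intentionally not sum to 1, which is why the field is kept separately from the count.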
The following fields are optional, but are used for computing various statistics and visualization:
- Vend, Dstart, Dend and Jstart - marking V, D and J segment boundaries within CDR3 nucleotide sequence (inclusive)
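As an illustration of how these markers partition a CDR3, here is a minimal Python sketch. It assumes the positions are 0-based (the text only states that they are inclusive) and uses `None` for a missing D segment; both are assumptions, and the function name `split_cdr3` is hypothetical.

```python
def split_cdr3(cdr3nt, v_end, d_start, d_end, j_start):
    """Split a CDR3 nucleotide sequence into germline and inserted regions.

    Assumptions: positions are 0-based and inclusive; d_start/d_end are None
    when no D segment was identified.
    """
    v_region = cdr3nt[:v_end + 1]
    j_region = cdr3nt[j_start:]
    if d_start is None or d_end is None:
        # No D segment: everything between V and J is a single insert.
        return {"V": v_region, "VJ": cdr3nt[v_end + 1:j_start], "J": j_region}
    return {
        "V": v_region,
        "VD": cdr3nt[v_end + 1:d_start],   # insert between V and D
        "D": cdr3nt[d_start:d_end + 1],
        "DJ": cdr3nt[d_end + 1:j_start],   # insert between D and J
        "J": j_region,
    }
```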
Tip
VDJtools accepts gzip-compressed files; such files should have a .gz suffix. Input data should be provided in the form of a tab-delimited table.
VDJtools format¶
This is the core tabular format for VDJtools. All datasets should be converted to this format using the Convert routine prior to analysis. Columns 8-11 are optional.
column1 | column2 | column3 | column4 | column5 | column6 | column7 | column8 | column9 | column10 | column11 |
---|---|---|---|---|---|---|---|---|---|---|
count | frequency | CDR3nt | CDR3aa | V | D | J | Vend | Dstart | Dend | Jstart |
1176 | 9.90E-02 | TGTGCCAGC…AAGCTTTCTTT | CAST…EAFF | TRBV12-4 | TRBD1 | TRBJ1-1 | 11 | 14 | 16 | 23 |
All additional columns after column 11 will be considered clonotype annotations and carried over unmodified during most stages of VDJtools analysis. This is especially useful when processing results of Annotation and ScanDatabase (DEPRECATED since v1.0.5, use VDJmatch) routines.
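For readers who want to inspect such a table outside VDJtools, the following Python sketch reads it by column position. It is an illustration under stated assumptions, not the VDJtools parser: the names `read_clonotypes`, `CORE_FIELDS` and `OPTIONAL_FIELDS` are hypothetical, the first line is assumed to be a header, and the four boundary columns are assumed to be present whenever annotation columns follow the eleven columns shown in the table.

```python
import csv
import gzip

CORE_FIELDS = ["count", "frequency", "cdr3nt", "cdr3aa", "v", "d", "j"]
OPTIONAL_FIELDS = ["v_end", "d_start", "d_end", "j_start"]


def open_text(path):
    """Open a plain or gzip-compressed (.gz suffix) text file."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_clonotypes(path):
    """Read a VDJtools-format table into a list of dictionaries."""
    clonotypes = []
    with open_text(path) as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)            # assumed: first line is the header
        annotation_names = header[11:]   # columns after the eleven shown above
        for row in reader:
            if not row:
                continue
            record = dict(zip(CORE_FIELDS, row[:7]))
            record["count"] = int(record["count"])
            record["frequency"] = float(record["frequency"])
            for name, value in zip(OPTIONAL_FIELDS, row[7:11]):
                record[name] = int(value)
            # Extra columns are kept as-is, keyed by their header names.
            record["annotations"] = dict(zip(annotation_names, row[11:]))
            clonotypes.append(record)
    return clonotypes
```

Reading by position rather than by header name keeps the sketch independent of the exact header spelling used in a given file.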
Formats supported for conversion¶
MiTCR¶
Output from MiTCR software (executable jar, documentation) in full mode can be used without any pre-processing. The corresponding table should start with two header lines (default MiTCR output stores processing options and version in the first line), followed by a clonotype list.
Run Convert routine with -S mitcr argument to prepare datasets in this format for VDJtools analysis.
MiGEC¶
MiGEC is software for V/D/J mapping and CDR3 extraction that relies on the BLAST algorithm for running alignments. MIGEC software additionally implements processing of unique molecular identifier (UMI)-tagged libraries for error correction and dataset normalization. Default output of MIGEC software can be directly used with VDJtools.
Run Convert routine with -S migec argument to prepare datasets in this format for VDJtools analysis.
IgBlast (MIGMAP)¶
As IgBlast doesn’t compute a canonical clonotype abundance table, VDJtools supports output of MIGMAP, a versatile IgBlast wrapper. Note that currently no somatic hypermutation (SHM) information is imported by VDJtools, nor are there any dedicated VDJtools routines to analyze SHM profiles, but you can check out the post-analysis provided by MIGMAP.
Run Convert routine with -S migmap argument to prepare datasets in this format for VDJtools analysis.
ImmunoSEQ¶
This is one of the most commonly used RepSeq data formats: more than 90% of recently published studies were performed using the immunoSEQ assay. We have implemented a parser for clonotype tables as provided by Adaptive Biotechnologies.
- The resulting datasets for most studies that use ImmunoSEQ technology can be accessed and exported using the ImmunoSEQ Analyzer.
- Example datasets in this format can be found in the Supplementary Data section of Spreafico R et al. Ann Rheum Dis. 2014.
- Column header information was taken from page 24 of the immunoSEQ Analyzer manual.
- VDJtools will use V/J segment information only at the family level, as many of the clonotypes lack segment (-X) and allele (-X*0Y) information. The clonotype table is then collapsed to handle unique V/J/CDR3 entries.
- Raw clonotype tables in this format do not contain CDR3 nucleotide sequence. Instead, an entire sequencing read (first column) is provided. Therefore, we have implemented additional algorithms for CDR3 extraction and “virtual” translation to tell out-of-frame clonotypes from partially read ones.
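The family-level collapse described in the list above can be sketched in Python as follows. This is a simplified illustration, not the VDJtools implementation: the helper names `to_family` and `collapse` are hypothetical, and truncating the segment name at the first `-` or `*` is an assumed way of reducing a name to its family.

```python
import re
from collections import defaultdict


def to_family(segment):
    """Drop segment (-X) and allele (*0Y) suffixes, e.g. TRBV12-4*01 -> TRBV12."""
    return re.split(r"[-*]", segment, maxsplit=1)[0]


def collapse(clonotypes):
    """Sum counts of entries sharing the same V family, J family and CDR3.

    `clonotypes` is an iterable of (v, j, cdr3nt, count) tuples.
    """
    totals = defaultdict(int)
    for v, j, cdr3nt, count in clonotypes:
        totals[(to_family(v), to_family(j), cdr3nt)] += count
    return dict(totals)
```

Entries that differed only in segment or allele detail end up merged into a single clonotype whose count is the sum of the originals.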
Attention
Some of the clonotype entries will be dropped during conversion as they contain an incomplete CDR3 sequence (lacking J segment), which is due to the short reads used in the immunoSEQ assay; see this blog post for details.
Run Convert routine with -S immunoseq argument to prepare datasets in this format for VDJtools analysis. Note that there are currently two possible ImmunoSEQ output formats that have different column naming:
- This option should be used in case you have selected the Export samples option in the ImmunoSEQ analyzer.
- In case you have used the Export samples v2 option you should pass the -S immunoseqv2 argument to VDJtools Convert routine.
IMGT/HighV-QUEST¶
Another commonly used RepSeq processing tool is the IMGT/HighV-QUEST web server.
Please refer to the official documentation to see the description of output files and their formats.
Tip
The output for each submission consists of several files and only 3_Nt-sequences_${chain}_${sx}_${date}.txt should be used as an input for VDJtools Convert routine.
Run Convert routine with -S imgthighvquest argument to prepare datasets in this format for VDJtools analysis.
VDJdb¶
VDJtools has native support for the analysis of clonotype tables annotated with VDJdb software. Note that as those tables can list the same clonotype several times with different annotations, they should not be used directly in most VDJtools routines (e.g. diversity statistics). Check out the VDJdb README for corresponding guidelines and workarounds.
Vidjil¶
VDJtools supports parsing output JSON files produced by the Vidjil software. VDJtools will only use the top clonotypes that have detailed V/D/J information in the output.
RTCR¶
VDJtools supports parsing the results.tsv table with clonotype list generated by the RTCR software.
Run Convert routine with -S rtcr argument to prepare datasets in this format for VDJtools analysis.
Metadata¶
Most VDJtools routines will accept multiple sample files as command line arguments for batch processing. This should always be preferred over multiple calls to VDJtools with a single sample due to the initialisation time of VDJtools.
An alternative way to specify a sample batch is to pass the sample metadata file with the -m option. The file should contain sample file paths and sample names. It can also be supplemented with optional metadata columns that will be appended to analysis results and can be used for plotting.
Additionally, for each step that involves modification of samples (e.g. converting or filtering non-functional rearrangements) a new metadata file will be created in the folder containing the processed sample batch.
Note
- VDJtools will append metadata fields to its output tables to facilitate the exploration of analysis results.
- Metadata entries are used as a factor in some analysis routines and most plotting routines.
- When performing tasks that involve modifying clonotype abundance tables themselves, such as down-sampling, VDJtools will also provide a copy of metadata file pointing to newly generated samples.
- Newly generated metadata file would contain an additional ..filter.. column, which has a comma-separated list of filters that were applied. For example the DownSample routine run with -n 50000 will append ds:50000 to the ..filter.. column. Note that this column name is reserved and should not be modified.
- Some routines for working with metadata files can be found in Utilities section.
Below are the basic guidelines for creating a metadata file.
Metadata file should be a tab-delimited table, e.g.
#file.name | sample.id | col.name | … |
---|---|---|---|
sample_1.txt | sample_1 | A | … |
sample_2.txt | sample_2 | A | … |
sample_3.txt | sample_3 | B | … |
sample_4.txt | sample_4 | C | … |
… | … | … | … |
Header is mandatory, first two columns should be named file_name and sample_id. Names of the remaining columns will be later used to specify metadata variable name.
First two columns should contain the file name and sample id respectively.
- The file name should be either an absolute path (e.g. /Users/username/somedir/file.txt) or a path relative to the parent directory of metadata file (e.g. ../file.txt)
- Sample IDs should be unique
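The path and uniqueness rules above can be checked with a short Python sketch before handing a metadata file to VDJtools. It is only an illustration: the function name `read_metadata` and the returned dictionary layout are hypothetical, and the header is read as-is without validating the names of the first two columns.

```python
import csv
from pathlib import Path


def read_metadata(metadata_path):
    """Read a tab-delimited metadata file and resolve sample paths.

    Relative file names are resolved against the directory that contains the
    metadata file; duplicate sample IDs raise a ValueError.
    """
    metadata_path = Path(metadata_path)
    base_dir = metadata_path.parent
    samples = []
    seen_ids = set()
    with metadata_path.open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)  # header line is mandatory
        for row in reader:
            if not row:
                continue
            file_name, sample_id, *values = row
            if sample_id in seen_ids:
                raise ValueError(f"duplicate sample ID: {sample_id}")
            seen_ids.add(sample_id)
            sample_path = Path(file_name)
            if not sample_path.is_absolute():
                sample_path = base_dir / sample_path
            samples.append({
                "path": sample_path,
                "sample_id": sample_id,
                # Remaining columns are metadata, keyed by header name.
                "metadata": dict(zip(header[2:], values)),
            })
    return samples
```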
Columns after sample.id are treated as metadata entries. There are also several cases when info from metadata is used during execution:
- VDJtools plotting routines could be directed to use metadata fields for naming samples and creating intuitive legends. If column name contains spaces it should be quoted, e.g. -f "patient id"
- Metadata fields are categorized as factor (contain only strings), numeric (contain only numbers) and semi-numeric (numbers and strings). Numeric and semi-numeric fields could be used for gradient coloring by plotting routines.
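The factor / numeric / semi-numeric distinction can be sketched in Python as below. How VDJtools decides whether a value counts as a number is not stated here, so the sketch assumes that anything Python's `float()` accepts is numeric; the function names are hypothetical.

```python
def is_number(value):
    """Return True if the string parses as a number (assumed rule: float())."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify_field(values):
    """Classify a metadata column as 'numeric', 'semi-numeric' or 'factor'."""
    if not values:
        raise ValueError("cannot classify an empty column")
    flags = [is_number(value) for value in values]
    if all(flags):
        return "numeric"
    if any(flags):
        return "semi-numeric"
    return "factor"
```

Under this rule a column such as age with a few missing-value labels would be semi-numeric and, per the text above, still usable for gradient coloring.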