Repertoire overlap analysis

OverlapPair

Performs a comprehensive analysis of clonotype sharing for a pair of samples.

Command line usage

$VDJTOOLS OverlapPair [options] sample1.txt sample2.txt output_prefix

Parameters:

Shorthand Long name Argument Description
-i --intersect-type string Sample intersection rule. Defaults to strict. See Common parameters
-t --top int Number of top clonotypes to visualize explicitly on stack are plot and provide in the collapsed joint table. Should not exceed 100, default is 20
-p --plot   Turns on plotting. See Common parameters
  --plot-area-v2   Alternative plotting mode, clonotype CDR3 sequences are shown at plot sides and connected to corresponding areas with lines.
-h --help   Display help message

Tabular output

Two joint clonotype abundance tables with paired.[intersection type shorthand].table.txt and paired.[intersection type shorthand].table.collapsed.txt suffices are generated. Tables are written in VDJtools format. Collapsed table contains rows corresponding to top N clonotypes and summary abundances for non-overlapping and hidden clonotypes.

See Joint clonotype abundance table structure for a detailed description of table columns.

A summary table (paired.[intersection type shorthand].summary.txt suffix) containing information on sample overlap size, etc, is also provided. See tabular output in CalcPairwiseDistances section below for details.

Graphical output

A composite scatterplot plot having paired.[intersection type shorthand].scatter.pdf suffix is generated. The second plot file with .paired.[intersection type shorthand].table.collapsed.pdf suffix contains a clonotype stack area plot.

_images/intersect-pair-scatter.png

Clonotype scatterplot. Main frame contains a scatterplot of clonotype abundances (overlapping clonotypes only) and a linear regression. Point size is scaled to the geometric mean of clonotype frequency in both samples. Scatterplot axes represent log10 clonotype frequencies in each sample. Two marginal histograms show the overlapping (red) and total clonotype (grey) abundance distributions in corresponding sample. Histograms are weighted by clonotype abundance, i.e. they display read distribution by clonotype size.

_images/intersect-pair-stack.png

Shared clonotype abundance plot. Plot shows details for top 20 clonotypes shared between samples, as well as collapsed (“NotShown”) and non-overlapping (“NonOverlapping”) clonotypes. Clonotype CDR3 amino acid sequence is plotted against the sample where the clonotype reaches maximum abundance.


CalcPairwiseDistances

Performs an all-versus-all pairwise overlap for a list of samples and computes a set of repertoire similarity measures. At least 3 samples should be provided. Note that this is one of most the memory-demanding routines, as it will load all samples into memory at once (unless used with --low-mem option).

Repertoire similarity measures include

  • Pearson correlation of clonotype frequencies, restricted only to the overlapping clonotypes

    \[R_{ij} = \frac{\sum^N_{k=1} \left(\phi _{ik} - \bar{\phi _{i}} \right ) \left(\phi _{jk} - \bar{\phi _{j}} \right )}{\sqrt{\sum^N_{k=1} \left(\phi _{ik} - \bar{\phi _{i}} \right )^2 \sum^N_{k=1} \left(\phi _{jk} - \bar{\phi _{j}} \right )^2}}\]

    where \(k=1..N\) are the indices of overlapping clonotypes, \(\phi_{ik}\) is the frequency of clonotype \(k\) in sample \(i\) and \(\bar{\phi_{i}}\) is the average frequency of overlapping clonotypes in sample \(i\).

  • Relative overlap diversity, computed with the following normalization

    \[D_{ij} = \frac{d_{ij}}{d_{i}d_{j}}\]

    where \(d_{ij}\) is the number of clonotypes present in both samples and \(d_{i}\) is the diversity of sample \(i\). See this paper for the rationale behind normalization.

  • Geometric mean of relative overlap frequencies

    \[F_{ij} = \sqrt{f_{ij}f_{ji}}\]

    where \(f_{ij}=\sum^N_{k=1}\phi_{ik}\) is the total frequency of clonotypes that overlap between samples \(i\) and \(j\) in sample \(i\).

  • Сlonotype-wise sum of geometric mean frequencies

    \[F2_{ij} = \sum^N_{k=1}\sqrt{\phi_{ik}\phi_{jk}}\]

    Note that this measure performs similar to \(F\) and provides slightly more robust results in case cross-sample contamination is present.

  • Jensen-Shannon divergence between Variable segment usage profiles (will be moved to CalcSegmentUsage in near future).

  • Jaccard index.

  • Morisita-Horm index.

ClusterSamples routine can be additionally run for CalcPairwiseDistances results.

Command line usage

$VDJTOOLS CalcPairwiseDistances \
[options] [sample1.txt sample2.txt sample3.txt ... if -m is not specified] output_prefix

Parameters:

Shorthand Long name Argument Description
-m --metadata path Path to metadata file. See Common parameters
-i --intersect-type string Sample intersection rule. Defaults to aa. See Common parameters
  --low-mem   Low memory mode, will keep only a pair of samples in memory during execution, but run much slower.
-p --plot   Turns on plotting. See Common parameters
-h --help   Display help message

Tabular output

A table suffixed intersect.batch.[intersection type shorthand].summary.txt with a comprehensive information on sample pair intersections is generated. This table is non-redundant: it contains N * (N - 1) / 2 rows corresponding to upper diagonal of matrix of possible pairs (i,j). Table layout is given below in three parts.

General info

Column Description
1_sample_id First sample unique identifier
2_sample_id Second sample unique identifier
div1 Total number of clonotypes in the first sample after identical clonotypes are collapsed based on intersection type -i
div2 Same as above, second sample
div12 Number of overlapping clonotypes
div21 Same as above
count1 Total number of reads in the first sample
count2
count12 For clonotypes overlapping between two samples: total number of reads they have in the first sample
count21
freq1 Total clonotype relative abundance for the first sample (should be 1.0 if sample is unaltered)
freq2
freq12 For clonotypes overlapping between two samples: their sum of relative abundances in the first sample
freq21

Overlap metrics

Column Description
R Pearson correlation
D Relative overlap diversity
F Geometric mean of relative overlap frequencies
F2 Sum of geometric means of overlapping clonotype frequencies.
vJSD Jensen-Shannon divergence of Variable segment usage distributions
vjJSD <experimental>
vj2JSD <experimental>
sJSD <experimental>
Jaccard Jaccard index
MorisitaHorn Morisita-Horn index

Sample metadata

Column Description
1_… First sample metadata columns. See Metadata section
2_… Second sample metadata columns

Graphical output

Circos plots showing pairwise overlap are stored in a file suffixed intersect.batch.[intersection type shorthand].summary.pdf.

_images/intersect-batch-circos.png

Pairwise overlap circos plot. Count, frequency and diversity panels correspond to the read count, frequency (both non-symmetric) and the total number of clonotypes that are shared between samples. Pairwise overlaps are stacked, i.e. segment arc length is not equal to sample size.


ClusterSamples

This routine provides additional cluster analysis (hierarchical clustering), multi-dimensional scaling (MDS) and plotting for CalcPairwiseDistances output.

Note that this routine requires the following parameter setting:

Command line usage

$VDJTOOLS ClusterSamples \
[options] input_prefix [output_prefix]

Parameters:

Shorthand Long name Argument Description
-e --measure string Sample overlap metric, see Overlap metrics section of CalcPairwiseDistances tabular output for allowed values. Defaults to F
-i --intersect-type string Intersection type, defaults to aa. See Common parameters
-f --factor string Specifies metadata column with plotting factor (is used to color for sample labels and figure legend). See Common parameters
-n --numeric   Specifies if plotting factor is continuous. See Common parameters
-l --label string Specifies metadata column with sample labelslabel . See Common parameters
-h --help   Display help message
-p --plot   Turns on plotting. See Common parameters

Tabular output

Two output files are generated:

  • Table suffixed mds.[value of -i argument].[value of -e argument].txt that contains coordinates of samples computed using multi-dimensional scaling (MDS), i.e. the coordinates of samples projected to a 2D plane in a manner that pairwise sample distances are preserved.
  • A file in Newick format suffixed hc.[value of -i argument].[value of -e argument].newick is generated that contains sample dendrogram produced by hierarchical clustering.

Note

Hierarchical clustering and MDS are performed using hclust() and isoMDS() (MASS package) R functions. Default parameters are used for those algorithms.

Distances are scaled as -log10(.) and (1-.)/2 for relative overlap and correlation metrics respectively; in case of Jensen-Shannon divergence, Jaccard and Morisita-Horn indices no scaling is performed.

Graphical output

Hierarchical clustering plot is stored in a file suffixed hc.[value of -i argument].[value of -e argument].pdf. MDS plot is stored in a file with mds.[value of -i argument].[value of -e argument].pdf suffix.

_images/intersect-batch-dendro.png

Hierarchical clustering. Dendrogram of samples, branch length shows the distance between repertoires. Node colors correspond to factor value, continuous scale is used in present case (-n -f age argument).

_images/intersect-batch-mds.png

MDS plot. A scatterplot of samples. Euclidean distance between points reflects the distance between repertoires. Points are colored by factor value.


TestClusters

This routine allows to test whether a given factor influences repertoire clustering. It assesses compactness of samples that have the same factor level and separation between samples with distinct factor levels for the factor specified in ClusterSamples.

Performs post-hoc permutation testing based on MDS coordinates generated by ClusterSamples routine. Can only be performed if a discrete factor (-f) was specified in ClusterSamples.

Note that this routine requires the following parameter setting:

  • Input file prefix (input_prefix) is set to the same value as the output prefix of ClusterSamples
  • The -i and -e argument setting is the same as in ClusterSamples

Command line usage

$VDJTOOLS TestClusters \
[options] input_prefix [output_prefix]

Parameters:

Shorthand Long name Argument Description
-e --measure string Sample overlap metric, see Overlap metrics section of CalcPairwiseDistances tabular output for allowed values. Defaults to F
-i --intersect-type string Intersection type, defaults to aa. See Common parameters

Tabular output

none

Graphical output

Permutation summary plot is generated having the perms.[value of -i argument].[value of -e argument].pdf suffix.

_images/test-clusters.png

Testing compactness and separation of sample clustering for a given factor. Average repertoire similarity values for sample pairs in which both samples have the same (within panel) and different (between panel) factor levels. Each row correspond to a specific factor level. Red lines show observed values, histograms correspond to values generated by randomly permuting factor levels. Numbers near red lines indicate P-values for n=10000 permutations.


TrackClonotypes

This routine performs an all-vs-all intersection between an ordered list of samples for clonotype tracking purposes. User can specify sample which clonotypes will be traced, e.g. the pre-therapy sample.

Command line usage

$VDJTOOLS TrackClonotypes \
[options] [sample1.txt sample2.txt sample3.txt ... if -m is not specified] output_prefix

Parameters:

Shorthand Long name Argument Description
-m --metadata path Path to metadata file. See See Common parameters
-i --intersect-type string Sample intersection rule. Defaults to strict. See Common parameters
-f --factor string Specifies factor that should be treated as time variable. Factor values should be numeric. If such column not set, time points are taken either from values provided with -s argument or sample order. See Common parameters
-x --track-sample integer A zero-based index of time point to track. If not provided, will consider all clonotypes that were detected in 2+ samples
-s --sequence [t1,t2,...] Time point sequence. Unused if -m is specified. If not specified, either time column values from metadata, or sample indexes (as in command line) are used.
-t --top integer Number of top clonotypes to visualize explicitly on stack are plot and provide in the collapsed joint table. Should not exceed 100, default is 200
-p --plot   Turns on plotting. See Common parameters
-c --compress   Compressed output for clonotype table. See Common parameters
-h --help   Display help message

Tabular output

Summary table suffixed sequential.[value of -i argument].summary.txt is created with the following columns.

Column Description
1_sample_id First sample unique identifier
2_sample_id Second sample unique identifier
value Value of the intersection metric
metric Metric type: diversity, frequency or count. Metrics correspond to the number of unique clonotypes, total frequency and total read count for clonotypes overlapping between first and second sample. In case tracking is on (-x), only clonotypes present in tracked sample are counted.
1_time Time value for the first sample
2_time Time value for the second sample
1_… First sample metadata columns. See Metadata section
2_… Second sample metadata columns

Two joint clonotype abundance tables with sequential.[intersection type shorthand].table.txt and sequential.[intersection type shorthand].table.collapsed.txt suffices are generated. The latter contains top -t clonotypes, with two additional rows containing summary count and frequency for non-overlapping and collapsed clonotypes.

See Joint clonotype abundance table structure for a detailed description of table columns.

Graphical output

Summary table is visualized in a plot file suffixed sequential.[value of -i argument].summary.pdf. A plot file with .sequential.[value of -i argument].stackplot.pdf suffix contains a clonotype abundance stack area plot. The same is also visualized using a heatmap in a file with .sequential.[value of -i argument].heatplot.pdf).

_images/intersect-seq-summary.png

Clonotype tracking summary. Count, frequency and diversity panels correspond to the read count, frequency (both non-symmetric) and the total number of clonotypes that are shared between samples. Rows and columns of each matrix are sorted according to time point sequence.

_images/intersect-seq-stackplot.png

Clonotype tracking stackplot. Contains detailed profiles for top -t clonotypes, as well as collapsed (“NotShown”) and non-overlapping (“NonOverlapping”) clonotypes. Clonotype CDR3 amino acid sequence is plotted against the sample where the clonotype reaches maximum abundance. Clonotypes are colored by the peak position of their abundance profile.

_images/intersect-seq-heatplot.png

Clonotype tracking heatmap. Shows a heatmap for top -t joint clonotype abundances.