Annotation

CalcDegreeStats

Performs a TCR neighborhood enrichment test (TCRNET). Each sample is tested for clonotypes that have more neighbours (i.e. a higher degree in the graph of clonotypes with similar CDR3 amino acid sequences) than would be expected by chance given a control dataset. The user can specify the search scope (the number of allowed CDR3 mismatches), whether to compare only clonotypes with the same V/J, and the control sample. If no control sample is provided, a pool (see PoolSamples) of all provided samples is used. Note that if this test is supplied with real samples and a control pooled using the -i strict option, it will count neighbours that have the same CDR3 amino acid sequence but distinct nucleotide sequences. If this is not desired, all input samples and the control should be pre-pooled with -i aa or -i aaVJ to collapse variants coding for the same CDR3 amino acid sequence.
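To make the notion of degree concrete, here is a minimal Python sketch (not VDJtools code) that counts neighbours under a search scope of 1,0,1, i.e. at most one substitution and no indels. The names hamming and degree and the toy CDR3 sequences are illustrative assumptions. The sketch also assumes that a sequence identical to the query is not counted as its own neighbour, which may differ from how VDJtools treats nucleotide variants.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> Optional[int]:
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def degree(query: str, sample: Sequence[str], max_subst: int = 1) -> int:
    """Count sequences in `sample` within `max_subst` substitutions of `query`.

    Assumption: sequences identical to the query are skipped.
    """
    count = 0
    for other in sample:
        if other == query:
            continue
        d = hamming(query, other)
        if d is not None and d <= max_subst:
            count += 1
    return count


# Toy CDR3 amino acid sequences (made up for illustration).
cdr3s = ["CASSLGQYF", "CASSLGQFF", "CASSVGQYF", "CASSLAPGYF", "CAWSLGNYF"]
print(degree("CASSLGQYF", cdr3s))
```

Here two sequences differ from the query by a single substitution, one has a different length (so it cannot match without indels) and one differs at two positions.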

Note

Running this routine will not return the actual clonotype graph; it only annotates the input samples. To build the graph, refer to the VDJmatch software and its Cluster routine. Make sure the search scope option there is the same as the -o value used for CalcDegreeStats and that all scoring/filtering is turned off. Then retain only the edges that connect pairs of enriched clonotypes, or enriched clonotypes with their neighbours.
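The edge-retention step can be sketched in Python as follows. This is only an illustration: the edge list, the clonotype labels and the set of enriched clonotypes are hypothetical, and in practice they would come from the graph-building step and from thresholding a P-value column of the annotated samples.

```python
# Hypothetical inputs: graph edges and the set of clonotypes called enriched.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
enriched = {"A", "B"}

# Keep an edge if at least one endpoint is enriched. This covers both
# enriched-enriched pairs and enriched clonotypes with their neighbours.
kept = [(u, v) for u, v in edges if u in enriched or v in enriched]
print(kept)
```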

Command line usage

$VDJTOOLS CalcDegreeStats \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix

Parameters:

Shorthand Long name Argument Description
-m --metadata path Path to metadata file. See Common parameters
-b --background path Path to the background (control) sample, used to compute expected statistics/P-values. If not provided, input samples are pooled and used as the control.
-o --search-scope s,id,t Search scope: number of substitutions (s), indels (id) and total number of mismatches (t) allowed. Default is 1,0,1
-g --grouping string Primary grouping type, limits set of clonotype comparisons: ‘dummy’ (no grouping, default), ‘vj’ (same V and J) or ‘vjl’ (same V, J and CDR3 length).
-g2 --grouping2 string Secondary grouping, used for computing statistics, accepts same values as -g. By default will select ‘vjl’ if no indels allowed and ‘vj’ otherwise.
-h --help   Display help message

Note

There are two possible schemes for running the algorithm. In the first, one selects, say, a search scope of 1,0,1 (no indels allowed) and -g vjl to allow comparisons only between clonotypes that match in V, J and CDR3 length. In this case, only p.value.g in the output should be considered, and all columns with g2/group2 should be disregarded. In the second, if one wants to allow comparison of clonotypes with different V/J and/or comparisons with indels, the option -g dummy should be used. If there might be biases in V/J frequencies between the control/background sample and the input samples that need to be controlled for, one should also select -g2 vj. The observed degree values will then be provided as is (i.e. without limiting clonotype comparisons to a fixed V/J), but the expected degree will be corrected to account for the V/J usage difference between the input sample and the control. Only p.value.g2 should be considered in this case. See below for more explanation of the output columns.

Tabular output

Processed samples will have additional annotation columns appended to VDJtools clonotype table columns. These columns are the following:

Column Description
degree.s Degree (number of neighbours) of a given clonotype in sample. The degree is the number of unique clonotypes (incl. nucleotide variants) that match a given clonotype under specified search scope.
group.count.s Number of unique clonotypes that match the group, defined by primary grouping (-g), of a given clonotype in sample, say have the same V and J.
group2.count.s Same as above, but the group is defined by secondary grouping -g2.
degree.c Degree (number of neighbours) of a given clonotype in the control sample.
group.count.c Number of unique clonotypes in the control sample that match the group of given clonotype as defined by primary grouping (-g).
group2.count.c Same as above, but the group is defined by secondary grouping -g2.
p.value.g P-value for the neighbour (degree) enrichment of a given clonotype according to primary grouping. The P-value is computed as Pbinom(n=degree.s|p=degree.c/group.count.c, N=group.count.s).
p.value.g2 P-value for the neighbour (degree) enrichment of a given clonotype according to secondary grouping. The P-value is computed as Ppoisson(n=degree.s|lambda=group.count.s*degree.c/group.count.c).
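As a worked illustration of the two formulas in the table above, the following Python sketch evaluates them for made-up counts. It assumes that the enrichment P-value is the upper tail P(X >= degree.s), which the table does not state explicitly, and it plugs in the counts exactly as the formulas are written. The function names are hypothetical and this is not the VDJtools implementation.

```python
from math import comb, exp, factorial


def binom_upper_tail(n: int, N: int, p: float) -> float:
    """P(X >= n) for X ~ Binomial(N, p)."""
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(n, N + 1))


def poisson_upper_tail(n: int, lam: float) -> float:
    """P(X >= n) for X ~ Poisson(lam), computed as 1 - P(X <= n - 1)."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(n))


# Made-up counts for a single clonotype.
degree_s, group_count_s = 5, 1000      # sample
degree_c, group_count_c = 20, 10000    # control

p = degree_c / group_count_c           # per-comparison neighbour probability
p_value_g = binom_upper_tail(degree_s, group_count_s, p)
p_value_g2 = poisson_upper_tail(degree_s, group_count_s * p)
print(p_value_g, p_value_g2)
```

With these numbers the expected degree is 1000 * 20 / 10000 = 2, and observing 5 neighbours gives a P-value of roughly 0.05 under either formula.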

A metadata file will be created for resulting samples with degstat appended to the ..filter.. metadata column.

Graphical output

none


CalcCdrAAProfile

Generates an amino acid physical properties profile of CDR3. Amino acids are first grouped into the corresponding CDR3 sub-regions and then binned by position within the sub-region. Each amino acid in a given bin is scored according to its physical properties; the sums of those scores and the total number of amino acids are reported for each sample/sub-region/bin/property combination.

For example, under the polarity property, amino acids are marked as polar (1) or non-polar (0) and the sum of these values is returned. Dividing this sum by the total number of amino acids gives the fraction of polar amino acids in a given sample/sub-region. For volume, the same operation returns the average volume of amino acids.
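The polarity example can be sketched in Python as follows. The polarity flags below are a hypothetical subset chosen for illustration; the values actually used by VDJtools are those in its full property table. The function name polar_fraction is likewise an assumption.

```python
# Hypothetical polarity flags (1 = polar, 0 = non-polar) for a few residues.
POLARITY = {"S": 1, "T": 1, "N": 1, "Q": 1, "A": 0, "L": 0, "V": 0, "F": 0}


def polar_fraction(residues: str) -> float:
    """Sum of polarity flags divided by the total number of amino acids."""
    total = len(residues)
    if total == 0:
        raise ValueError("no amino acids to score")
    polar_sum = sum(POLARITY[aa] for aa in residues)
    return polar_sum / total


# Three polar residues (S, S, Q) out of six.
print(polar_fraction("SSLQAV"))
```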

Command line usage

$VDJTOOLS CalcCdrAAProfile \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix

Parameters:

Shorthand Long name Argument Description
-m --metadata path Path to metadata file. See Common parameters
-w --weighted   If set, will weight amino acid property values by clonotype frequency.
-n --normalize   If set, will normalize amino acid property values by dividing them by corresponding CDR3 sub-region size.
-r --region-list region1,… List of CDR3 sub-regions to count statistics for, default is "CDR3-full,VJ-junc,V-germ,J-germ"
-o --property-list property1,… List of amino acid physicochemical properties to use, see below for allowed values. Uses all amino acid properties from the list below by default.
-h --help   Display help message

Supported CDR3 sub-regions:

Name Description
CDR3-full Complete CDR3 region
CDR3-center Central 5 amino acids of CDR3
V-germ Germline part of CDR3 region corresponding to Variable segment
D-germ Germline part of CDR3 region corresponding to Diversity segment
J-germ Germline part of CDR3 region corresponding to Joining segment
VD-junc Variable-Diversity segment junction, applicable when D segment is mapped
DJ-junc Diversity-Joining segment junction, applicable when D segment is mapped
VJ-junc Variable-Joining segment junction, including D segment if it is mapped

Supported amino acid physical properties (see full table for raw values):

Name Description Reference
alpha Preference to appear in alpha helices Stryer L et al. Biochemistry, 5th edition. ISBN 978-0716746843
beta Preference to appear in beta sheets Stryer L et al. Biochemistry, 5th edition. ISBN 978-0716746843
turn Preference to appear in turns Stryer L et al. Biochemistry, 5th edition. ISBN 978-0716746843
surface Residues that have unchanged accessibility area when PPI partner is present PMID:22559010
rim Residues that have changed accessibility area, but no atoms with zero accessibility in PPI interfaces PMID:22559010
core Residues that have changed accessibility area and at least one atom with zero accessibility in PPI interfaces PMID:22559010
disorder Intrinsic structural disorder-promoting, order-promoting and neutral amino acids PMID:11381529
charge Charged/non-charged amino acids Wikipedia
pH Amino acid pH level Wikipedia
polarity Polar/non-polar amino acids Wikipedia
hydropathy Amino acid hydropathy Wikipedia
volume Amino acid volume Wikipedia
strength Strongly-interacting amino acids / amino acids depleted by purifying selection in thymus PMID:18946038
mjenergy Mean value of MJ statistical potential for each amino acid, used to derive ‘strength’ PMID:8604144
kf1..kf10 Values of 10 Kidera factors summarizing physicochemical properties of amino acids unpublished

Tabular output

A summary table with averaged amino acid property values is generated, suffixed cdr3aa.profile.[wt or unwt based on -w].txt. The table contains the following columns:

Column Description
sample_id Sample unique identifier
Sample metadata columns. See Metadata section
region Current CDR3 sub-region, see above
property Amino acid physical property name, see above
mean Mean property value

Graphical output

none


Annotate

This routine will compute a set of properties for each clonotype’s CDR3 sequence and append them to the resulting clonotype table, for example the number of added N-nucleotides and the sum of polar amino acids in CDR3. The main difference from CalcCdrAAProfile is that CalcCdrAAProfile computes sample-level averages, while this routine performs calculations at the clonotype level.

Command line usage

$VDJTOOLS Annotate \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix

Parameters:

Shorthand Long name Argument Description
-m --metadata path Path to metadata file. See Common parameters
-b --base param1,param2,… Comma-separated list of basic clonotype features to calculate and append to resulting clonotype tables. See below for allowed values. Default: cdr3Length,ndnSize,insertSize
-a --aaprop property1,… Comma-separated list of amino acid properties. Amino acid property value sum will be calculated for CDR3 sequence (blank annotations will be generated for non-coding clonotypes). See below for allowed values. Default: hydropathy,charge,polarity,strength,contact
-h --help   Display help message

List of basic annotation properties:

Name Description
cdr3Length Length of CDR3 region
NDNSize Number of nucleotides between last base of V germline and first base of J germline parts of CDR3
insertSize Number of added N-nucleotides
VDIns Number of added N-nucleotides in V-D junction or -1 if D segment is undefined
DJIns Number of added N-nucleotides in D-J junction or -1 if D segment is undefined

See CalcCdrAAProfile for the list of amino acid properties available for annotation. The sum of the specified amino acid property values across all amino acids of CDR3 will be computed. To get the average per amino acid, divide this sum by the cdr3Length basic property value divided by 3 (i.e. the number of amino acids in CDR3).
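A minimal Python sketch of this conversion is shown below. It assumes that cdr3Length is given in nucleotides (as the division by 3 implies) and that the clonotype is coding; the function name mean_property and the input values are hypothetical.

```python
def mean_property(aaprop_sum: float, cdr3_length_nt: int) -> float:
    """Average property value per amino acid of a coding CDR3."""
    n_residues = cdr3_length_nt / 3  # assumed: cdr3Length is in nucleotides
    return aaprop_sum / n_residues


# A CDR3 of 36 nt (12 amino acids) with a polarity sum of 4.
print(mean_property(4, 36))
```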

Tabular output

Processed samples will have additional annotation columns appended to VDJtools clonotype table columns. Those columns will be prefixed with base. for basic CDR3 properties and aaprop. for CDR3 amino acid composition properties.

A metadata file will be created for resulting samples with annot:[-b value]:[-a value] appended to the ..filter.. metadata column.

Graphical output

none


ScanDatabase (DEPRECATED since v1.0.5, use VDJmatch)

Annotates a set of samples using an immune receptor database based on V-(D)-J junction matching. By default uses VDJdb, which contains CDR3 sequences and Variable and Joining segments of known specificity obtained using literature mining. This routine supports user-provided databases and allows flexible filtering of results based on database fields. The output of ScanDatabase includes both detailed (clonotype-wise) annotation of samples and summary statistics. Only amino acid CDR3 sequences are used in database querying.

Command line usage

$VDJTOOLS ScanDatabase \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix

Parameters:

Shorthand Long name Argument Description
-m --metadata path Path to metadata file. See Common parameters
-D --database path Path to an external database file. Will use built-in VDJdb if not specified.
-d --details   Will provide a detailed output for each sample with annotated clonotype matches
-f --fuzzy   Will query database allowing at most 2 substitutions, 1 deletion and 1 insertion but no more than 2 mismatches simultaneously. If not set, only exact matches will be reported
  --filter expression Logical pre-filter on database columns. See below
  --v-match   V segment must match
  --j-match   J segment must match
-h --help   Display help message

Note

The database filter is a logical expression that contains references to input table columns. Database column name references should be surrounded with double underscores (__). The syntax supports regular expressions and standard Java/Groovy functions such as .contains(), .startsWith(), etc. Here are some examples:

__origin__=~/EBV/
!(__origin__=~/CMV/)

Note that the expression should be quoted: --filter "__origin__=~/HSV/"
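For readers less familiar with Groovy, the two example filters can be approximated in Python as shown below. This is only an analogue written for illustration, not how ScanDatabase evaluates filters; it assumes that =~ succeeds when the pattern is found anywhere in the column value (similar to re.search). The database rows are made up.

```python
import re

# Hypothetical database rows; only the `origin` column is used here.
db = [
    {"cdr3": "CASSLGQYF", "origin": "EBV"},
    {"cdr3": "CASSVGQYF", "origin": "CMV"},
    {"cdr3": "CASSPGTYF", "origin": "HSV-1"},
]

# Analogue of __origin__=~/EBV/ : keep rows whose origin contains "EBV".
ebv = [row for row in db if re.search(r"EBV", row["origin"])]

# Analogue of !(__origin__=~/CMV/) : keep rows whose origin lacks "CMV".
not_cmv = [row for row in db if not re.search(r"CMV", row["origin"])]

print([row["origin"] for row in ebv])
print([row["origin"] for row in not_cmv])
```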

Tabular output

A summary table suffixed annot.[database name].summary.txt is generated. First header line marked with ##FILTER contains filtering expression that was used. The table contains the following columns:

Column Description
sample_id Sample unique identifier
Sample metadata columns. See Metadata section
diversity Number of clonotypes in sample
match_size Number of matches between sample and database. In case --fuzzy mode is on, all matches will be counted. E.g. if clonotype a in the sample matches clonotypes A and B in the database and clonotype b in the sample matches clonotype B the value in this column will be 3.
sample_diversity_in_matches Number of unique clonotypes in the sample that matched clonotypes from the database
db_diversity_in_matches Number of unique clonotypes in the database that matched clonotypes from the sample
sample_freq_in_matches Overall frequency of unique clonotypes in the sample that matched clonotypes from the database
mean_matched_clone_size Geometric mean of frequency of unique clonotypes in the sample that matched clonotypes from the database
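The summary columns can be illustrated with the example from the match_size description: clonotype a matches database entries A and B, and clonotype b matches B. The Python sketch below recomputes the columns from such match pairs; the clonotype frequencies are hypothetical, and the code is an illustration of the definitions rather than the ScanDatabase implementation.

```python
from math import exp, log

# (sample clonotype, database clonotype) match pairs from the example above.
matches = [("a", "A"), ("a", "B"), ("b", "B")]

# Hypothetical clonotype frequencies in the sample; c has no match.
freq = {"a": 0.04, "b": 0.01, "c": 0.95}

diversity = len(freq)
match_size = len(matches)
sample_hits = {s for s, _ in matches}
db_hits = {d for _, d in matches}

sample_diversity_in_matches = len(sample_hits)
db_diversity_in_matches = len(db_hits)
sample_freq_in_matches = sum(freq[s] for s in sample_hits)

# Geometric mean of the frequencies of matched sample clonotypes.
mean_matched_clone_size = exp(
    sum(log(freq[s]) for s in sample_hits) / len(sample_hits)
)

print(match_size, sample_diversity_in_matches, db_diversity_in_matches)
print(sample_freq_in_matches, mean_matched_clone_size)
```

With these values match_size is 3 while both diversity-in-matches columns are 2, which shows how one-to-many matches inflate match_size relative to the number of unique clonotypes involved.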

Detailed database query results will also be reported for each sample if -d is specified. Those tables are suffixed annot.[database name].[sample id].txt and contain the following columns:

Column Description
score CDR3 sequence alignment score
query_cdr3aa Query CDR3 amino acid sequence
query_v Query Variable segment
query_j Query Joining segment
subject_cdr3aa Subject CDR3 amino acid sequence
subject_v Subject Variable segment
subject_j Subject Joining segment
v_match true if Variable segments of query and subject clonotypes match
j_match true if Joining segments of query and subject clonotypes match
mismatches Comma-separated list of query->subject mismatches
Database fields corresponding to subject clonotype

Graphical output

none