Annotation
CalcDegreeStats
Performs a TCR neighborhood enrichment test (TCRNET): for each sample, it identifies clonotypes that have more neighbours (a higher degree in the graph), i.e. clonotypes with similar CDR3 amino acid sequences, than would be expected by chance given a control dataset. The user can specify the search scope (i.e. the number of allowed CDR3 mismatches), whether to compare only clonotypes with the same V/J, and the control sample. If no control sample is provided, a pool (see PoolSamples) of all provided samples is used.
Note that when this test is supplied with real samples and a control pooled using the -i strict option, it will account for neighbours that have the same CDR3 amino acid sequence but distinct nucleotide sequences. If this is not desired, all input samples and the control should be pre-pooled with -i aa or -i aaVJ to collapse variants coding for the same amino acid CDR3 sequence.
Note
Running this routine will not return the actual clonotype graph; it only annotates the input samples. To build the graph, refer to the VDJmatch software and its Cluster routine. Make sure the search scope option is the same as the -o value used for CalcDegreeStats and that all scoring/filtering is turned off. Next, retain only the edges that connect pairs of enriched clonotypes and enriched clonotypes with their neighbours.
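The edge-retention step described in the note can be sketched in a few lines of Python. This is a minimal illustration, not part of VDJtools or VDJmatch: the edge list format, the function name `enriched_subgraph` and the example CDR3 sequences are all hypothetical, and the set of enriched clonotypes is assumed to have been chosen beforehand (for instance by thresholding the P-value columns, at a cut-off of the user's choosing).

```python
def enriched_subgraph(edges, enriched):
    """Keep edges that have at least one enriched endpoint.

    This retains enriched-enriched pairs as well as edges linking an
    enriched clonotype to one of its (non-enriched) neighbours.
    """
    enriched = set(enriched)
    return [(a, b) for a, b in edges if a in enriched or b in enriched]


# Hypothetical edge list: pairs of CDR3 amino acid sequences that are neighbours.
edges = [
    ("CASSLAPGATNEKLFF", "CASSLAPGTTNEKLFF"),
    ("CASSLAPGTTNEKLFF", "CASSLGPGTTNEKLFF"),
    ("CASSIRSSYEQYF", "CASSIRSAYEQYF"),
]
# Hypothetical set of clonotypes considered enriched.
enriched = {"CASSLAPGATNEKLFF"}

kept = enriched_subgraph(edges, enriched)
```

Edges between two non-enriched clonotypes are dropped, so the resulting graph contains only enriched clonotypes and their direct neighbours.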
Command line usage
$VDJTOOLS CalcDegreeStats \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix
Parameters:
Shorthand | Long name | Argument | Description |
---|---|---|---|
-m | --metadata | path | Path to metadata file. See Common parameters |
-b | --background | path | Path to the background (control) sample, used to compute expected statistics/P-values. If not provided, the input samples are pooled and used as the control. |
-o | --search-scope | s,id,t | Search scope: number of substitutions (s), indels (id) and total number of mismatches (t) allowed. Default is 1,0,1 |
-g | --grouping | string | Primary grouping type, limits the set of clonotype comparisons: ‘dummy’ (no grouping, default), ‘vj’ (same V and J) or ‘vjl’ (same V, J and CDR3 length). |
-g2 | --grouping2 | string | Secondary grouping, used for computing statistics; accepts the same values as -g. By default ‘vjl’ is selected if no indels are allowed and ‘vj’ otherwise. |
-h | --help | | Display help message |
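To make the search scope and grouping options more concrete, the following Python sketch counts the degree of a clonotype for the substitution-only scope 1,0,1 combined with ‘vjl’ grouping. It is an illustration under stated assumptions rather than the VDJtools implementation: the `Clonotype` tuple, the `degree` function and the example records are hypothetical, indels are not handled, and the clonotype itself is assumed not to count as its own neighbour.

```python
from collections import namedtuple

# Hypothetical minimal clonotype record (not the VDJtools table format).
Clonotype = namedtuple("Clonotype", ["cdr3aa", "v", "j"])


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def degree(query, sample, max_subst=1):
    """Count neighbours of `query` with the same V, J and CDR3 length
    ('vjl' grouping) and at most `max_subst` amino acid substitutions."""
    count = 0
    for other in sample:
        if other is query:  # assumption: a clonotype is not its own neighbour
            continue
        same_group = (
            other.v == query.v
            and other.j == query.j
            and len(other.cdr3aa) == len(query.cdr3aa)
        )
        if same_group and hamming(query.cdr3aa, other.cdr3aa) <= max_subst:
            count += 1
    return count


query = Clonotype("CASSLAPGATNEKLFF", "TRBV7-9", "TRBJ1-4")
sample = [
    query,
    Clonotype("CASSLAPGTTNEKLFF", "TRBV7-9", "TRBJ1-4"),  # one substitution
    Clonotype("CASSLGPGTTNEKLFF", "TRBV7-9", "TRBJ1-4"),  # two substitutions
    Clonotype("CASSLAPGATNEKLFF", "TRBV5-1", "TRBJ1-4"),  # different V segment
    Clonotype("CASSIRSSYEQYF", "TRBV7-9", "TRBJ1-4"),     # different length
]
query_degree = degree(query, sample)
```

Relaxing the grouping to ‘dummy’ would correspond to dropping the `same_group` check on V and J, at which point the third-from-last record would also count as a neighbour.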
Note
There are two possible schemes for running the algorithm. First, one can select a search scope of, say, 1,0,1 (allowing no indels) and -g vjl to only allow comparisons between clonotypes that match in V, J and CDR3 length. In this case, one should only consider p.value.g in the output and disregard all columns with g2/group2.
Alternatively, if one wants to allow comparison of clonotypes with different V/J, and/or comparisons with indels, the option -g dummy should be used. If there might be biases in V/J frequencies between the control/background sample and the input samples, and one wants to control for them, one should select -g2 vj. The observed degree values will then be provided as is (i.e. without limiting clonotype comparisons to a fixed V/J), but the expected degree will be corrected to account for the V/J usage difference between the input sample and the control. One should only consider p.value.g2 in this case. See below for more explanation of the output columns.
Tabular output
Processed samples will have additional annotation columns appended to VDJtools clonotype table columns. These columns are the following:
Column | Description |
---|---|
degree.s | Degree (number of neighbours) of a given clonotype in sample. The degree is the number of unique clonotypes (incl. nucleotide variants) that match a given clonotype under specified search scope. |
group.count.s | Number of unique clonotypes in the sample that belong to the same group as a given clonotype, as defined by primary grouping (-g ), e.g. those that have the same V and J. |
group2.count.s | Same as above, but the group is defined by secondary grouping -g2 . |
degree.c | Degree (number of neighbours) of a given clonotype in the control sample. |
group.count.c | Number of unique clonotypes in the control sample that match the group of given clonotype as defined by primary grouping (-g ). |
group2.count.c | Same as above, but the group is defined by secondary grouping -g2 . |
p.value.g | P-value for the neighbour (degree) enrichment of a given clonotype according to primary grouping. The P-value is computed as Pbinom(n=degree.s|p=degree.c/group.count.c, N=group.count.s) . |
p.value.g2 | P-value for the neighbour (degree) enrichment of a given clonotype according to secondary grouping. The P-value is computed as Ppoisson(n=degree.s|lambda=group.count.s*degree.c/group.count.c) . |
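The two P-value formulas in the table can be reproduced from the output columns with a short Python sketch. It follows the formulas as written above and makes two assumptions that the table does not state: that the P-value is an upper tail, i.e. the probability of observing at least degree.s neighbours, and that group.count.c is non-zero. The function names and the example row are hypothetical.

```python
from math import comb, exp


def binom_upper_tail(n, N, p):
    """P(X >= n) for X ~ Binomial(N, p)."""
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(n, N + 1))


def poisson_upper_tail(n, lam):
    """P(X >= n) for X ~ Poisson(lam)."""
    term = exp(-lam)  # P(X = 0)
    below = 0.0
    for k in range(n):
        below += term
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return 1.0 - below


# Hypothetical annotated row; keys follow the column names in the table above.
row = {"degree.s": 5, "group.count.s": 200, "degree.c": 10, "group.count.c": 4000}

# Probability of a neighbour, estimated from the control sample.
p = row["degree.c"] / row["group.count.c"]

p_value_g = binom_upper_tail(row["degree.s"], row["group.count.s"], p)
p_value_g2 = poisson_upper_tail(row["degree.s"], row["group.count.s"] * p)
```

For small neighbour probabilities the binomial and Poisson tails are close to each other, so large differences between p.value.g and p.value.g2 in real output mostly reflect the different groupings rather than the choice of distribution.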
A metadata file will be created for resulting samples with degstat appended to the ..filter.. metadata column.
CalcCdrAAProfile
Generates an amino acid physical property profile of CDR3. Amino acids are first grouped into the corresponding CDR3 sub-regions and then binned by position within the sub-region. Each amino acid in a given bin is scored according to its physical properties; the sum of those scores and the total number of amino acids are reported for each sample/sub-region/bin/property combination.
For example, under the polarity property amino acids are marked as polar (1) and non-polar (0) and the sum of these values is returned. Dividing it by the total number of amino acids gives the fraction of polar amino acids in a given sample/sub-region. For volume, the same operation returns the average volume of amino acids.
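The sum-and-count scheme from the polarity example can be sketched as follows. The polarity flags below are illustrative placeholders for a handful of residues and are not the raw property table used by VDJtools; the function name and the residue string are likewise hypothetical.

```python
# Illustrative polarity flags for a few residues only (assumption, not the
# VDJtools property table): 1 = polar, 0 = non-polar.
POLARITY = {"S": 1, "T": 1, "N": 1, "Q": 1, "A": 0, "L": 0, "F": 0, "V": 0}


def property_sum_and_count(residues, scores):
    """Return (sum of property scores, number of amino acids) for a bin."""
    total = sum(scores[aa] for aa in residues)
    return total, len(residues)


# Hypothetical amino acids falling into one sub-region bin.
total, count = property_sum_and_count("SSLATNQF", POLARITY)

# Dividing the sum by the count gives the fraction of polar residues.
fraction_polar = total / count
```

Reporting the sum and the count separately, rather than the ratio, keeps the values additive, so bins or samples can be merged before the final division.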
Command line usage
$VDJTOOLS CalcCdrAAProfile \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix
Parameters:
Shorthand | Long name | Argument | Description |
---|---|---|---|
-m | --metadata | path | Path to metadata file. See Common parameters |
-w | --weighted | | If set, will weight amino acid property values by clonotype frequency. |
-n | --normalize | | If set, will normalize amino acid property values by dividing them by corresponding CDR3 sub-region size. |
-r | --region-list | region1,… | List of CDR3 sub-regions to count statistics for, default is CDR3-full,VJ-junc,V-germ,J-germ |
-o | --property-list | property1,… | List of amino acid physicochemical properties to use, see below for allowed values. Uses all amino acid properties from the list below by default. |
-h | --help | | Display help message |
Supported CDR3 sub-regions:
Name | Description |
---|---|
CDR3-full | Complete CDR3 region |
CDR3-center | Central 5 amino acids of CDR3 |
V-germ | Germline part of CDR3 region corresponding to Variable segment |
D-germ | Germline part of CDR3 region corresponding to Diversity segment |
J-germ | Germline part of CDR3 region corresponding to Joining segment |
VD-junc | Variable-Diversity segment junction, applicable when D segment is mapped |
DJ-junc | Diversity-Joining segment junction, applicable when D segment is mapped |
VJ-junc | Variable-Joining segment junction, including D segment if it is mapped |
Supported amino acid physical properties (see full table for raw values):
Name | Description | Reference |
---|---|---|
alpha | Preference to appear in alpha helices | Stryer L et al. Biochemistry, 5th edition. ISBN 978-0716746843 |
beta | Preference to appear in beta sheets | Stryer L et al. Biochemistry, 5th edition. ISBN 978-0716746843 |
turn | Preference to appear in turns | Stryer L et al. Biochemistry, 5th edition. ISBN 978-0716746843 |
surface | Residues that have unchanged accessibility area when PPI partner is present | PMID:22559010 |
rim | Residues that have changed accessibility area, but no atoms with zero accessibility in PPI interfaces | PMID:22559010 |
core | Residues that have changed accessibility area and at least one atom with zero accessibility in PPI interfaces | PMID:22559010 |
disorder | Intrinsic structural disorder-promoting, order-promoting and neutral amino acids | PMID:11381529 |
charge | Charged/non-charged amino acids | Wikipedia |
pH | Amino acid pH level | Wikipedia |
polarity | Polar/non-polar amino acids | Wikipedia |
hydropathy | Amino acid hydropathy | Wikipedia |
volume | Amino acid volume | Wikipedia |
strength | Strongly-interacting amino acids / amino acids depleted by purifying selection in thymus | PMID:18946038 |
mjenergy | Mean value of MJ statistical potential for each amino acid, used to derive ‘strength’ | PMID:8604144 |
kf1 .. kf10 | Values of 10 Kidera factors summarizing physicochemical properties of amino acids | unpublished |
Tabular output
A summary table with averaged amino acid property values is generated, suffixed cdr3aa.profile.[wt or unwt based on -w].txt. The table contains the following columns:
Column | Description |
---|---|
sample_id | Sample unique identifier |
… | Sample metadata columns. See Metadata section |
region | Current CDR3 sub-region, see above |
property | Amino acid physical property name, see above |
mean | Mean property value |
Annotate
This routine computes a set of properties for each clonotype’s CDR3 sequence and appends them to the resulting clonotype table, for example the number of added N-nucleotides and the sum of polar amino acids in CDR3. The main difference from CalcCdrAAProfile is that CalcCdrAAProfile computes sample-level averages, while this routine performs the calculation at the clonotype level.
Command line usage
$VDJTOOLS Annotate \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix
Parameters:
Shorthand | Long name | Argument | Description |
---|---|---|---|
-m | --metadata | path | Path to metadata file. See Common parameters |
-b | --base | param1,param2,… | Comma-separated list of basic clonotype features to calculate and append to resulting clonotype tables. See below for allowed values. Default: cdr3Length,ndnSize,insertSize |
-a | --aaprop | property1,… | Comma-separated list of amino acid properties. Amino acid property value sum will be calculated for CDR3 sequence (blank annotations will be generated for non-coding clonotypes). See below for allowed values. Default: hydropathy,charge,polarity,strength,contact |
-h | --help | | Display help message |
List of basic annotation properties:
Name | Description |
---|---|
cdr3Length | Length of CDR3 region |
NDNSize | Number of nucleotides between last base of V germline and first base of J germline parts of CDR3 |
insertSize | Number of added N-nucleotides |
VDIns | Number of added N-nucleotides in V-D junction or -1 if D segment is undefined |
DJIns | Number of added N-nucleotides in D-J junction or -1 if D segment is undefined |
See CalcCdrAAProfile for the list of amino acid properties available for annotation.
The sum of the specified amino acid property values across all amino acids of CDR3 will be computed. It can be divided by the cdr3Length / 3 basic property value to get the average.
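The conversion from a per-clonotype sum to a per-residue average can be sketched in Python. The column names `base.cdr3Length` and `aaprop.polarity` are an assumption built from the prefixes this section describes for the output and may not match the exact headers written by the tool; the rows and the `mean_property` helper are hypothetical.

```python
def mean_property(row, prop):
    """Average per-residue value of an amino acid property for one clonotype.

    Returns None when the annotation is blank (e.g. a non-coding clonotype).
    """
    total = row.get("aaprop." + prop)
    if total is None:
        return None
    # CDR3 length is divided by 3 to convert nucleotides to amino acids.
    n_residues = row["base.cdr3Length"] / 3
    return total / n_residues


# Hypothetical annotated rows; column names are assumed, not confirmed.
rows = [
    {"base.cdr3Length": 45, "aaprop.polarity": 6.0},
    {"base.cdr3Length": 43, "aaprop.polarity": None},  # non-coding, blank value
]
means = [mean_property(r, "polarity") for r in rows]
```

Blank annotations are passed through as None rather than treated as zero, so non-coding clonotypes do not distort downstream averages.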
Tabular output
Processed samples will have additional annotation columns appended to VDJtools clonotype table columns. Those columns will be prefixed with base. for basic CDR3 properties and aaprop. for CDR3 amino acid composition properties.
A metadata file will be created for resulting samples with annot:[-b value]:[-a value] appended to the ..filter.. metadata column.
ScanDatabase (DEPRECATED since v1.0.5, use VDJmatch)
Annotates a set of samples using an immune receptor database, based on V-(D)-J junction matching. By default it uses VDJdb, which contains CDR3 sequences and Variable and Joining segments of known specificity obtained by literature mining. This routine supports user-provided databases and allows flexible filtering of results based on database fields. The output of ScanDatabase includes both detailed (clonotype-wise) annotation of samples and summary statistics. Only amino acid CDR3 sequences are used in database querying.
Command line usage
$VDJTOOLS ScanDatabase \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix
Parameters:
Shorthand | Long name | Argument | Description |
---|---|---|---|
-m | --metadata | path | Path to metadata file. See Common parameters |
-D | --database | path | Path to an external database file. Will use built-in VDJdb if not specified. |
-d | --details | | Will provide a detailed output for each sample with annotated clonotype matches |
-f | --fuzzy | | Will query database allowing at most 2 substitutions, 1 deletion and 1 insertion but no more than 2 mismatches simultaneously. If not set, only exact matches will be reported |
 | --filter | expression | Logical pre-filter on database columns. See below |
 | --v-match | | V segment must match |
 | --j-match | | J segment must match |
-h | --help | | Display help message |
Note
The database filter is a logical expression that contains references to input table columns. Database column name references should be surrounded with double underscores (__). The syntax supports Regex and standard Java/Groovy functions such as .contains(), .startsWith(), etc. Here are some examples:
__origin__=~/EBV/
!(__origin__=~/CMV/)
Note that the expression should be quoted: --filter "__origin__=~/HSV/"
Tabular output
A summary table suffixed annot.[database name].summary.txt is generated. The first header line, marked with ##FILTER, contains the filtering expression that was used. The table contains the following columns:
Column | Description |
---|---|
sample_id | Sample unique identifier |
… | Sample metadata columns. See Metadata section |
diversity | Number of clonotypes in sample |
match_size | Number of matches between sample and database. In case --fuzzy mode is on, all matches will be counted. E.g. if clonotype a in the sample matches clonotypes A and B in the database and clonotype b in the sample matches clonotype B the value in this column will be 3. |
sample_diversity_in_matches | Number of unique clonotypes in the sample that matched clonotypes from the database |
db_diversity_in_matches | Number of unique clonotypes in the database that matched clonotypes from the sample |
sample_freq_in_matches | Overall frequency of unique clonotypes in the sample that matched clonotypes from the database |
mean_matched_clone_size | Geometric mean of frequency of unique clonotypes in the sample that matched clonotypes from the database |
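The relationship between the summary columns can be illustrated with a small Python sketch that reproduces the worked example from the match_size row (clonotype a matches A and B, clonotype b matches B). The `summarize` function, the match list format and the clonotype frequencies are hypothetical and only mirror the column descriptions above; this is not the ScanDatabase implementation.

```python
from math import exp, log


def summarize(matches, sample_freq):
    """Compute summary statistics from (sample clonotype, db clonotype) pairs."""
    sample_hits = {s for s, _ in matches}
    db_hits = {d for _, d in matches}
    freqs = [sample_freq[s] for s in sample_hits]
    # Geometric mean of the frequencies of matched sample clonotypes.
    geo_mean = exp(sum(log(f) for f in freqs) / len(freqs)) if freqs else 0.0
    return {
        "match_size": len(matches),
        "sample_diversity_in_matches": len(sample_hits),
        "db_diversity_in_matches": len(db_hits),
        "sample_freq_in_matches": sum(freqs),
        "mean_matched_clone_size": geo_mean,
    }


# Example from the table: a matches A and B, b matches B.
matches = [("a", "A"), ("a", "B"), ("b", "B")]
# Hypothetical clonotype frequencies in the sample.
sample_freq = {"a": 0.04, "b": 0.01, "c": 0.95}

summary = summarize(matches, sample_freq)
```

Here match_size counts every pair, while the two diversity columns count each sample or database clonotype only once, which is why they can be smaller than match_size in fuzzy mode.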
Detailed database query results will also be reported for each sample if -d is specified. Those tables are suffixed annot.[database name].[sample id].txt and contain the following columns.
Column | Description |
---|---|
score | CDR3 sequence alignment score |
query_cdr3aa | Query CDR3 amino acid sequence |
query_v | Query Variable segment |
query_j | Query Joining segment |
subject_cdr3aa | Subject CDR3 amino acid sequence |
subject_v | Subject Variable segment |
subject_j | Subject Joining segment |
v_match | true if Variable segments of query and subject clonotypes match |
j_match | true if Joining segments of query and subject clonotypes match |
mismatches | Comma-separated list of query->subject mismatches |
… | Database fields corresponding to subject clonotype |
Graphical output
none