# Diversity estimation¶

Note

Application of routines from this section is described in the following tutorial.

## PlotQuantileStats¶

Plots a three-layer donut chart to visualize the repertoire clonality.

- First layer (“set”) includes the frequency of singleton (“1”, met once), doubleton (“2”, met twice) and high-order (“3+”, met three or more times) clonotypes. Singleton and doubleton frequency is an important factor in estimating the total repertoire diversity, e.g. Chao1 diversity estimator (see Colwell et al). We have also recently shown that in whole blood samples, singletons have very nice correlation with the number of naive T-cells, which are the backbone of immune repertoire diversity.
- The second layer (“quantile”), displays the abundance of top 20% (“Q1”), next 20% (“Q2”), … (up to “Q5”) clonotypes for clonotypes from “3+” set. In our experience this quantile plot is a simple and efficient way to display repertoire clonality.
- The last layer (“top”) displays the individual abundances of top N clonotypes.

### Command line usage¶

```
java -Xmx4G -jar vdjtools.jar PlotQuantileStats [options] sample.txt output_prefix
```

Parameters:

Shorthand | Long name | Argument | Description |
---|---|---|---|

`-t` |
`--top` |
int | Number of top clonotypes to visualize. Should not exceed 10, default is 5 |

`-h` |
`--help` |
Display help message |

### Tabular output¶

Following table with `.qstat.txt`

prefix is generated,

Column | Description |
---|---|

Type | Detalization level: `set` , `quantile` or `top` |

Name | Variable name: “1”, “Q1”, “CASSLAPGATNEKLFF”, etc |

Value | Corresponding relative abundance |

### Graphical output¶

Following plot with `.qstat.pdf`

prefix is generated,

**Sample clonality plot**. See above for the description of plot structure.

## RarefactionPlot¶

Plots rarefaction curves for specified list of samples, that is, the dependencies between sample diversity and sample size. Those curves are interpolated from 0 to the current sample size and then extrapolated up to the size of the largest of samples, allowing comparison of diversity estimates. Interpolation and extrapolation are based on multinomial models, see Colwell et al for details.

### Command line usage¶

```
java -Xmx4G -jar vdjtools.jar RarefactionPlot \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix
```

### Parameters¶

Shorthand | Long name | Argument | Description |
---|---|---|---|

`-m` |
`--metadata` |
path | Path to metadata file. See Common parameters |

`-i` |
`--intersect-type` |
string | Set the intersection type used to collapse clonotypes before computing diversity. Defaults to `strict` (don’t collapse at all). See Common parameters |

`-s` |
`--steps` |
integer | Set the total number of points in the rarefaction curve, default is `101` |

`-f` |
`--factor` |
string | Specifies plotting factor. See Common parameters |

`-n` |
`--numeric` |
Specifies if plotting factor is numeric. See Common parameters | |

`-l` |
`--label` |
string | Specifies label used for plotting. See Common parameters |

`--wide-plot` |
Set wide plotting area | ||

`--label-exact` |
If set to true, will position sample labels exactly at observed samle size, will use the extrapolated sample size otherwise | ||

`-h` |
`--help` |
Display help message |

### Tabular output¶

The following table with
`rarefaction.[intersection type shorthand].txt`

is generated:

Column | Definition |
---|---|

sample_id | Sample unique identifier |

… | Sample metadata columns, see Metadata section |

x | Subsample size, reads |

mean | Mean diversity at given size |

ciL | Lower bound of 95% confidence interval |

ciU | Upper bound of 95% confidence interval |

type | Data point type: `0=interpolation` , `1=exact` , `2=extrapolation` |

### Graphical output¶

A figure with the same suffix as output table and `.pdf`

extension is
provided.

**Rarefaction plot**. Solid and dashed lines mark interpolated and extrapolated
regions of rarefaction curves respectively, points mark exact sample
size and diversity. Shaded areas mark 95% confidence intervals.

## CalcDiversityStats¶

Computes a set of diversity statistics, including

- Observed diversity, the total number of clonotypes in a sample
- Lower bound total diversity (LBTD) estimates
- Chao estimate
(denoted
*chao1*) - Efron-Thisted estimate

- Chao estimate
(denoted
- Diversity indices
- Shannon-Wiener index. The exponent of clonotype frequency distribution entropy is returned.
- Normalized Shannon-Wiener index. Normalized (divided by
`log[number of clonotypes]`

) entropy of clonotype frequency distribution. Note that plain entropy is returned, not its exponent. - Inverse Simpson index

- Extrapolated Chao diversity estimate,
denoted
*chaoE*here. - The d50 index, a recently developed immune diversity estimate

Diversity estimates are computed in two modes: using original data and via several re-sampling steps (usually down-sampling to the size of smallest dataset).

- The estimates computed on original data could be biased by uneven
sampling depth (sample size), of those only
`chaoE`

is properly normalized to be compared between samples. While not good for between-sample comparison, the LBTD estimates provided for original data are most useful for studying the fundamental properties of repertoires under study, i.e. to answer the question how large the repertoire diversity of an entire organism could be. - Estimates computed using re-sampling are useful for between-sample comparison, e.g. we have successfully used the re-sampled (normalized) observed diversity to measure the repertoire aging trends (see this paper).

Hint

In our recent experience the observed diversity and LBTD estimates computed on re-sampled data provide best results for between-sample comparisons.

### Command line usage¶

```
java -Xmx4G -jar vdjtools.jar CalcDiversityStats \
[options] [sample1.txt sample2.txt ... if -m is not specified] output_prefix
```

Parameters:

Shorthand | Long name | Argument | Description |
---|---|---|---|

`-m` |
`--metadata` |
path | Path to metadata file. See Common parameters |

`-i` |
`--intersect-type` |
string | Set the intersection type used to collapse clonotypes before computing diversity. Defaults to `strict` (don’t collapse at all). See Common parameters |

`-x` |
`--downsample-to` |
integer | Set the sample size to interpolate the diversity estimates via resampling. Default = size of smallest sample. Applies to diversity estimates stored in `.resampled.txt` table |

`-X` |
`--extrapolate-to` |
integer | Set the sample size to extrapolate the diversity estimates. Default = size of largest sample. Currently, only applies to `chaoE` diversity estimate. |

`--resample-trials` |
integer | Number of resamples for corresponding estimator. Default = 3 | |

`-h` |
`--help` |
Display help message |

### Tabular output¶

Two tables with `diversity.[intersection type shorthand].txt`

and
`diversity.[intersection type shorthand].resampled.txt`

are generated,
containing diversity estimates computed on original and down-sampled
datasets respectively.

Note that `chaoE`

estimate is only present in the table generated for
original samples. Both tables contain means and standard deviations of
diversity estimates. Also note that standard deviation and mean values
for down-sampled datasets are computed based on N=3 re-samples.

Here is an example column layout, similar between both output tables:

Column | Definition |
---|---|

sample_id | Sample unique identifier |

… | Sample metadata columns, see Metadata section |

reads | Number of reads in the sample |

diversity | Diversity of the original sample (after collapsing to unique clonotypes according to `-i` parameter) |

extrapolate_reads / resample_reads | The reads used to extrapolate or re-sample in order to compute present diversity estiamtes |

<name>_mean |
Mean value of the diversity estimate <name> |

<name>_std |
Standard deviation of the diversity estimate <name> |

… |

### Graphical output¶

none