README.Rmd

---
title: "tidybulk - part of tidyTranscriptomics"
output: github_document
---

<!-- badges: start -->
[![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) [![R build status](https://github.com/stemangiola/tidybulk/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/stemangiola/tidybulk/actions)
<!-- badges: end -->

```{r echo=FALSE}
knitr::opts_chunk$set( fig.path = "man/figures/")
```

**Brings transcriptomics to the tidyverse!**

The code is released under the version 3 of the GNU General Public License.

```{r, echo=FALSE, out.height = "139px", out.width = "120px"}
knitr::include_graphics("man/figures/logo.png")
```
  
website: [stemangiola.github.io/tidybulk/](http://stemangiola.github.io/tidybulk/)
[Third party tutorials](https://rstudio-pubs-static.s3.amazonaws.com/792462_f948e766b15d4ee5be5c860493bda0b3.html)
Please have a look also to 

- [tidySummarizedExperiment](https://github.com/stemangiola/tidySummarizedExperiment) for bulk data tidy representation
- [tidySingleCellExperiment](https://github.com/stemangiola/tidySingleCellExperiment) for single-cell data tidy representation
- [tidyseurat](https://github.com/stemangiola/tidyseurat) for single-cell data tidy representation
- [tidyHeatmap](https://github.com/stemangiola/tidyHeatmap) for heatmaps produced with tidy principles
analysis and manipulation 
- [tidygate](https://github.com/stemangiola/tidygate) for adding custom
gate information to your tibble 


<!---

[![Build Status](https://travis-ci.org/stemangiola/tidybulk.svg?branch=master)](https://travis-ci.org/stemangiola/tidybulk) [![Coverage Status](https://coveralls.io/repos/github/stemangiola/tidybulk/badge.svg?branch=master)](https://coveralls.io/github/stemangiola/tidybulk?branch=master)

-->

```{r, echo=FALSE, out.width = "800px"}
knitr::include_graphics("man/figures/new_SE_usage-01.png")
```


## Functions/utilities available

Function | Description
------------ | -------------
`aggregate_duplicates` | Aggregate abundance and annotation of duplicated transcripts in a robust way
`identify_abundant` `keep_abundant` | identify or keep the abundant genes
`keep_variable` | Filter for top variable features
`scale_abundance` | Scale (normalise) abundance for RNA sequencing depth
`reduce_dimensions` | Perform dimensionality reduction (PCA, MDS, tSNE, UMAP)
`cluster_elements` | Labels elements with cluster identity (kmeans, SNN)
`remove_redundancy` | Filter out elements with highly correlated features
`adjust_abundance` | Remove known unwanted variation (Combat)
`test_differential_abundance` | Differential transcript abundance testing (DESeq2, edgeR, voom) 
`deconvolve_cellularity` | Estimated tissue composition (Cibersort, llsr, epic, xCell, mcp_counter, quantiseq
`test_differential_cellularity` | Differential cell-type abundance testing
`test_stratification_cellularity` | Estimate Kaplan-Meier survival differences
`test_gene_enrichment` | Gene enrichment analyses (EGSEA)
`test_gene_overrepresentation` | Gene enrichment on list of transcript names (no rank)
`test_gene_rank` | Gene enrichment on list of transcript (GSEA)
`impute_missing_abundance` | Impute abundance for missing data points using sample groupings


Utilities | Description
------------ | -------------
`get_bibliography` | Get the bibliography of your workflow
`tidybulk` | add tidybulk attributes to a tibble object
`tidybulk_SAM_BAM` | Convert SAM BAM files into tidybulk tibble
`pivot_sample` | Select sample-wise columns/information
`pivot_transcript` | Select transcript-wise columns/information
`rotate_dimensions` | Rotate two dimensions of a degree
`ensembl_to_symbol` | Add gene symbol from ensembl IDs
`symbol_to_entrez` | Add entrez ID from gene symbol
`describe_transcript` | Add gene description from gene symbol

All functions are directly compatibble with `SummarizedExperiment` object.


```{r, echo=FALSE, include=FALSE, }
library(dplyr)
library(tidyr)
library(tibble)
library(magrittr)
library(ggplot2)
library(ggrepel)
library(tidybulk)
library(tidySummarizedExperiment)
library(here)

my_theme = 	
	theme_bw() +
	theme(
		panel.border = element_blank(),
		axis.line = element_line(),
		panel.grid.major = element_line(size = 0.2),
		panel.grid.minor = element_line(size = 0.1),
		text = element_text(size=12),
		legend.position="bottom",
		aspect.ratio=1,
		strip.background = element_blank(),
		axis.title.x  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
		axis.title.y  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
	)

utils::download.file("https://zenodo.org/records/11201167/files/counts_SE.rda?download=1", destfile = "counts_SE.rda") 
here("counts_SE.rda") |> load()
tibble_counts = counts_SE |> as_tibble()

```

## Installation

From Bioconductor
```{r eval=FALSE}
BiocManager::install("tidybulk")
```

From Github
```{r, eval=FALSE}
devtools::install_github("stemangiola/tidybulk")
```

# Data

We will use a `SummarizedExperiment` object

```{r}
counts_SE
```

Loading `tidySummarizedExperiment` will automatically abstract this object as `tibble`, so we can display it and manipulate it with tidy tools. Although it looks different, and more tools (tidyverse) are available to us, this object is in fact a `SummarizedExperiment` object.

```{r}
class(counts_SE)
```

## Get the bibliography of your workflow 
First of all, you can cite all articles utilised within your workflow automatically from any tidybulk tibble

```{r eval=FALSE}
counts_SE |>	get_bibliography()
```

## Aggregate duplicated `transcripts`

tidybulk provide the `aggregate_duplicates` function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. `aggregate_duplicates` takes a tibble and column names (as symbols; for `sample`, `transcript` and `count`) as arguments and returns a tibble with transcripts with the same name aggregated. All the rest of the columns are appended, and factors and boolean are appended as characters.

<div class="column-left">
TidyTranscriptomics
```{r aggregate, message=FALSE, warning=FALSE, results='hide', class.source='yellow'}
rowData(counts_SE)$gene_name = rownames(counts_SE)
counts_SE.aggr = counts_SE |> aggregate_duplicates(.transcript = gene_name)
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r aggregate long, eval=FALSE}
temp = data.frame(
	symbol = dge_list$genes$symbol,
	dge_list$counts
)
dge_list.nr <- by(temp,	temp$symbol,
	function(df)
		if(length(df[1,1])>0)
			matrixStats:::colSums(as.matrix(df[,-1]))
)
dge_list.nr <- do.call("rbind", dge_list.nr)
colnames(dge_list.nr) <- colnames(dge_list)
```
</div>
<div style="clear:both;"></div>

## Scale `counts`

We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). `scale_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method as arguments and returns a tibble with additional columns with scaled data as `<NAME OF COUNT COLUMN>_scaled`.

<div class="column-left">
TidyTranscriptomics
```{r normalise, cache=TRUE}
counts_SE.norm = counts_SE.aggr |> identify_abundant(factor_of_interest = condition) |> scale_abundance()
```

</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r normalise long, eval=FALSE}
library(edgeR)

dgList <- DGEList(count_m=x,group=group)
keep <- filterByExpr(dgList)
dgList <- dgList[keep,,keep.lib.sizes=FALSE]
[...]
dgList <- calcNormFactors(dgList, method="TMM")
norm_counts.table <- cpm(dgList)
```
</div>
<div style="clear:both;"></div>

```{r, include=FALSE}
counts_SE.norm |> select(`count`, count_scaled, .abundant, everything())
```

We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.


```{r plot_normalise, cache=TRUE}
counts_SE.norm |>
	ggplot(aes(count_scaled + 1, group=.sample, color=`Cell.type`)) +
	geom_density() +
	scale_x_log10() +
	my_theme
```

## Filter `variable transcripts`

We may want to identify and filter variable transcripts.

<div class="column-left">
TidyTranscriptomics
```{r filter variable, cache=TRUE}
counts_SE.norm.variable = counts_SE.norm |> keep_variable()
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r filter variable long, eval=FALSE}
library(edgeR)

x = norm_counts.table

s <- rowMeans((x-rowMeans(x))^2)
o <- order(s,decreasing=TRUE)
x <- x[o[1L:top],,drop=FALSE]

norm_counts.table = norm_counts.table[rownames(x)]

norm_counts.table$cell_type = tibble_counts[
	match(
		tibble_counts$sample,
		rownames(norm_counts.table)
	),
	"Cell.type"
]
```

</div>
<div style="clear:both;"></div>


## Reduce `dimensions`

We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. `reduce_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.

**MDS** (Robinson et al., 10.1093/bioinformatics/btp616)

<div class="column-left">
TidyTranscriptomics
```{r mds, cache=TRUE}
counts_SE.norm.MDS =
  counts_SE.norm |>
  reduce_dimensions(method="MDS", .dims = 6)

```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval = FALSE}
library(limma)

count_m_log = log(count_m + 1)
cmds = limma::plotMDS(ndim = .dims, plot = FALSE)

cmds = cmds %$%	
	cmdscale.out |>
	setNames(sprintf("Dim%s", 1:6))

cmds$cell_type = tibble_counts[
	match(tibble_counts$sample, rownames(cmds)),
	"Cell.type"
]
```
</div>
<div style="clear:both;"></div>

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

```{r plot_mds, cache=TRUE}
counts_SE.norm.MDS |> pivot_sample()  |> select(contains("Dim"), everything())

counts_SE.norm.MDS |>
	pivot_sample() |>
  GGally::ggpairs(columns = 6:(6+5), ggplot2::aes(colour=`Cell.type`))


```

**PCA**

<div class="column-left">
TidyTranscriptomics
```{r pca, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
counts_SE.norm.PCA =
  counts_SE.norm |>
  reduce_dimensions(method="PCA", .dims = 6)
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r,eval=FALSE}
count_m_log = log(count_m + 1)
pc = count_m_log |> prcomp(scale = TRUE)
variance = pc$sdev^2
variance = (variance / sum(variance))[1:6]
pc$cell_type = counts[
	match(counts$sample, rownames(pc)),
	"Cell.type"
]
```
</div>
<div style="clear:both;"></div>

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

```{r plot_pca, cache=TRUE}

counts_SE.norm.PCA |> pivot_sample() |> select(contains("PC"), everything())

counts_SE.norm.PCA |>
	 pivot_sample() |>
  GGally::ggpairs(columns = 11:13, ggplot2::aes(colour=`Cell.type`))
```

**tSNE**
<div class="column-left">
TidyTranscriptomics
```{r tsne, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
counts_SE.norm.tSNE =
	breast_tcga_mini_SE |>
	identify_abundant() |>
	reduce_dimensions(
		method = "tSNE",
		perplexity=10,
		pca_scale =TRUE
	)
```


</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
count_m_log = log(count_m + 1)

tsne = Rtsne::Rtsne(
	t(count_m_log),
	perplexity=10,
		pca_scale =TRUE
)$Y
tsne$cell_type = tibble_counts[
	match(tibble_counts$sample, rownames(tsne)),
	"Cell.type"
]
```
</div>
<div style="clear:both;"></div>

Plot

```{r}
counts_SE.norm.tSNE |>
	pivot_sample() |>
	select(contains("tSNE"), everything()) 

counts_SE.norm.tSNE |>
	pivot_sample() |>
	ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + my_theme
```

## Rotate `dimensions`

We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. `rotate_dimensions` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as `<NAME OF DIMENSION> rotated <ANGLE>` by default, or as specified in the input arguments.
<div class="column-left">
TidyTranscriptomics
```{r rotate, cache=TRUE}
counts_SE.norm.MDS.rotated =
  counts_SE.norm.MDS |>
	rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, action="get")
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
rotation = function(m, d) {
	r = d * pi / 180
	((bind_rows(
		c(`1` = cos(r), `2` = -sin(r)),
		c(`1` = sin(r), `2` = cos(r))
	) |> as_matrix()) %*% m)
}
mds_r = pca |> rotation(rotation_degrees)
mds_r$cell_type = counts[
	match(counts$sample, rownames(mds_r)),
	"Cell.type"
]
```
</div>
<div style="clear:both;"></div>

**Original**
On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.

```{r plot_rotate_1, cache=TRUE}
counts_SE.norm.MDS.rotated |>
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell.type` )) +
  geom_point() +
  my_theme
```

**Rotated**
On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.

```{r plot_rotate_2, cache=TRUE}
counts_SE.norm.MDS.rotated |>
	pivot_sample() |>
	ggplot(aes(x=`Dim1_rotated_45`, y=`Dim2_rotated_45`, color=`Cell.type` )) +
  geom_point() +
  my_theme
```

## Test `differential abundance`

We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). `test_differential_abundance` takes a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g.,  log fold change, p-value and false discovery rate).
<div class="column-left">
TidyTranscriptomics
```{r de, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
counts_SE.de =
	counts_SE |>
	test_differential_abundance( ~ condition, action="get")
counts_SE.de
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
library(edgeR)

dgList <- DGEList(counts=counts_m,group=group)
keep <- filterByExpr(dgList)
dgList <- dgList[keep,,keep.lib.sizes=FALSE]
dgList <- calcNormFactors(dgList)
design <- model.matrix(~group)
dgList <- estimateDisp(dgList,design)
fit <- glmQLFit(dgList,design)
qlf <- glmQLFTest(fit,coef=2)
topTags(qlf, n=Inf)
```
</div>
<div style="clear:both;"></div>

The functon `test_differential_abundance` operated with contrasts too. The constrasts hve the name of the design matrix (generally <NAME_COLUMN_COVARIATE><VALUES_OF_COVARIATE>)
```{r de contrast, cache=TRUE, message=FALSE, warning=FALSE, results='hide', eval=FALSE}
counts_SE.de =
	counts_SE |>
	identify_abundant(factor_of_interest = condition) |>
	test_differential_abundance(
		~ 0 + condition,                  
		.contrasts = c( "conditionTRUE - conditionFALSE"),
		action="get"
	)
```

## Adjust `counts`

We may want to adjust `counts` for (known) unwanted variation. `adjust_abundance` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as `<COUNT COLUMN>_adjusted`. At the moment just an unwanted covariated is allowed at a time.

<div class="column-left">
TidyTranscriptomics
```{r adjust, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
counts_SE.norm.adj =
	counts_SE.norm |> adjust_abundance(	.factor_unwanted = batch, .factor_of_interest = factor_of_interest)

```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
library(sva)

count_m_log = log(count_m + 1)

design =
		model.matrix(
			object = ~ factor_of_interest + batch,
			data = annotation
		)

count_m_log.sva =
	ComBat(
			batch =	design[,2],
			mod = design,
			...
		)

count_m_log.sva = ceiling(exp(count_m_log.sva) -1)
count_m_log.sva$cell_type = counts[
	match(counts$sample, rownames(count_m_log.sva)),
	"Cell.type"
]

```
</div>
<div style="clear:both;"></div>

## Deconvolve `Cell type composition`

We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). `deconvolve_cellularity` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the adjusted  cell type proportions.


<div class="column-left">
TidyTranscriptomics
```{r cibersort, cache=TRUE}
counts_SE.cibersort =
	counts_SE |>
	deconvolve_cellularity(action="get", cores=1, prefix = "cibersort__") 

```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}

source(‘CIBERSORT.R’)
count_m |> write.table("mixture_file.txt")
results <- CIBERSORT(
	"sig_matrix_file.txt",
	"mixture_file.txt",
	perm=100, QN=TRUE
)
results$cell_type = tibble_counts[
	match(tibble_counts$sample, rownames(results)),
	"Cell.type"
]

```
</div>
<div style="clear:both;"></div>

With the new annotated data frame, we can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation. On the x axis we have the cell types inferred by Cibersort, on the y axis we have the inferred proportions. The data is facetted and coloured by nominal cell types (annotation given by the researcher after FACS sorting).

```{r plot_cibersort, cache=TRUE}
counts_SE.cibersort |>
	pivot_longer(
		names_to= "Cell_type_inferred", 
		values_to = "proportion", 
		names_prefix ="cibersort__", 
		cols=contains("cibersort__")
	) |>
  ggplot(aes(x=`Cell_type_inferred`, y=proportion, fill=`Cell.type`)) +
  geom_boxplot() +
  facet_wrap(~`Cell.type`) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5)
```

## Test differential cell-type abundance

We can also perform a statistical test on the differential cell-type abundance across conditions

```{r DC, cache=TRUE}

	counts_SE |>
	test_differential_cellularity(. ~ condition )

```

We can also perform regression analysis with censored data (coxph).

```{r DC_censored}
	# Add survival data

counts_SE_survival = 
	counts_SE |>
	nest(data = -sample) |>
		mutate(
			days = sample(1:1000, size = n()),
			dead = sample(c(0,1), size = n(), replace = TRUE)
		) |>
	unnest(data) 

# Test
counts_SE_survival |>
	test_differential_cellularity(survival::Surv(days, dead) ~ .)

```

We can also perform test of Kaplan-Meier curves.

```{r DC_censored_stratification}

counts_stratified = 
	counts_SE_survival |>

	# Test
	test_stratification_cellularity(
		survival::Surv(days, dead) ~ .,
		sample, transcript, count
	)

counts_stratified

```

Plot Kaplan-Meier curves

```{r}
counts_stratified$plot[[1]]
```

## Cluster `samples`

We may want to cluster our data (e.g., using k-means sample-wise). `cluster_elements` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods.

**k-means**

<div class="column-left">
TidyTranscriptomics
```{r cluster, cache=TRUE}
counts_SE.norm.cluster = counts_SE.norm.MDS |>
  cluster_elements(method="kmeans",	centers = 2, action="get" )
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
count_m_log = log(count_m + 1)

k = kmeans(count_m_log, iter.max = 1000, ...)
cluster = k$cluster

cluster$cell_type = tibble_counts[
	match(tibble_counts$sample, rownames(cluster)),
	c("Cell.type", "Dim1", "Dim2")
]

```
</div>
<div style="clear:both;"></div>

We can add cluster annotation to the MDS dimension reduced data set and plot.

```{r plot_cluster, cache=TRUE}
 counts_SE.norm.cluster |>
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster_kmeans`)) +
  geom_point() +
  my_theme
```

**SNN**

Matrix package (v1.3-3) causes an error with Seurat::FindNeighbors used in this method. We are trying to solve this issue. At the moment this option in unaviable.


<div class="column-left">
TidyTranscriptomics
```{r SNN, eval=FALSE, cache=TRUE, message=FALSE, warning=FALSE, results='hide'}
counts_SE.norm.SNN =
	counts_SE.norm.tSNE |>
	cluster_elements(method = "SNN")
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
library(Seurat)

snn = CreateSeuratObject(count_m)
snn = ScaleData(
	snn, display.progress = TRUE,
	num.cores=4, do.par = TRUE
)
snn = FindVariableFeatures(snn, selection.method = "vst")
snn = FindVariableFeatures(snn, selection.method = "vst")
snn = RunPCA(snn, npcs = 30)
snn = FindNeighbors(snn)
snn = FindClusters(snn, method = "igraph", ...)
snn = snn[["seurat_clusters"]]

snn$cell_type = tibble_counts[
	match(tibble_counts$sample, rownames(snn)),
	c("Cell.type", "Dim1", "Dim2")
]

```
</div>
<div style="clear:both;"></div>

```{r SNN_plot, eval=FALSE, cache=TRUE}
counts_SE.norm.SNN |>
	pivot_sample() |>
	select(contains("tSNE"), everything()) 

counts_SE.norm.SNN |>
	pivot_sample() |>
	gather(source, Call, c("cluster_SNN", "Call")) |>
	distinct() |>
	ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + facet_grid(~source) + my_theme


# Do differential transcription between clusters
counts_SE.norm.SNN |>
	mutate(factor_of_interest = `cluster_SNN` == 3) |>
	test_differential_abundance(
    ~ factor_of_interest,
    action="get"
   )
```

## Drop `redundant` transcripts

We may want to remove redundant elements from the original data set (e.g., samples or transcripts), for example if we want to define cell-type specific signatures with low sample redundancy. `remove_redundancy` takes as arguments a tibble, column names (as symbols; for `sample`, `transcript` and `count`) and returns a tibble with redundant elements removed (e.g., samples). Two redundancy estimation approaches are supported:

+ removal of highly correlated clusters of elements (keeping a representative) with method="correlation"
+ removal of most proximal element pairs in a reduced dimensional space.

**Approach 1**

<div class="column-left">
TidyTranscriptomics
```{r drop, cache=TRUE}
counts_SE.norm.non_redundant =
	counts_SE.norm.MDS |>
  remove_redundancy(	method = "correlation" )
```
</div>
<div class="column-right">
Standard procedure (comparative purpose)
```{r, eval=FALSE}
library(widyr)

.data.correlated =
	pairwise_cor(
		counts,
		sample,
		transcript,
		rc,
		sort = TRUE,
		diag = FALSE,
		upper = FALSE
	) |>
	filter(correlation > correlation_threshold) |>
	distinct(item1) |>
	rename(!!.element := item1)

# Return non redudant data frame
counts |> anti_join(.data.correlated) |>
	spread(sample, rc, - transcript) |>
	left_join(annotation)


```
</div>
<div style="clear:both;"></div>

We can visualise how the reduced redundancy with the reduced dimentions look like

```{r plot_drop, cache=TRUE}
counts_SE.norm.non_redundant |>
	pivot_sample() |>
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell.type`)) +
  geom_point() +
  my_theme

```

**Approach 2**

```{r drop2, cache=TRUE}
counts_SE.norm.non_redundant =
	counts_SE.norm.MDS |>
  remove_redundancy(
  	method = "reduced_dimensions",
  	Dim_a_column = `Dim1`,
  	Dim_b_column = `Dim2`
  )
```

We can visualise MDS reduced dimensions of the samples with the closest pair removed.

```{r plot_drop2, cache=TRUE}
counts_SE.norm.non_redundant |>
	pivot_sample() |>
	ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell.type`)) +
  geom_point() +
  my_theme

```

## Other useful wrappers

The above wrapper streamline the most common processing of bulk RNA sequencing data. Other useful wrappers are listed above.

## From BAM/SAM to tibble of gene counts

We can calculate gene counts (using FeatureCounts; Liao Y et al., 10.1093/nar/gkz114) from a list of BAM/SAM files and format them into a tidy structure (similar to counts).

```{r eval=FALSE}
counts = tidybulk_SAM_BAM(
	file_names,
	genome = "hg38",
	isPairedEnd = TRUE,
	requireBothEndsMapped = TRUE,
	checkFragLength = FALSE,
	useMetaFeatures = TRUE
)
```

## From ensembl IDs to gene symbol IDs

We can add gene symbols from ensembl identifiers. This is useful since different resources use ensembl IDs while others use gene symbol IDs. This currently works for human and mouse.

```{r ensembl, cache=TRUE}
counts_ensembl |> ensembl_to_symbol(ens)
```

## From gene symbol to gene description (gene name in full)

We can add gene full name (and in future description) from symbol identifiers. This currently works for human and mouse.

```{r description}
counts_SE |> 
	describe_transcript() |> 
	select(feature, description, everything())
```