LBFextract tutorial
-------------------

In this tutorial, we show how to use the CLI and python interfaces provided by LBFextract. In LBFextract, we provide 
fragmentomics-based and coverage-based feature extraction methods. In this tutorial we will focus on transcription 
factors (TFs).
This will be done for CTCF and AR, but can be done for any TF for which a BED file is provided.  A TF is 
represented by a BED file containing regions corresponding to its TFBSs as provided by the GTRD database. 
When a TF binds to the chromatin it may cause nucleosome displacement and a chromatin conformation change.
This results in a TF footprint on the chromatin, which can be observed from liquid biopsy data. In liquid biopsy, we have cfDNA,
which is the DNA shed into body fluids by cells undergoing apoptosis, necrosis, necroptosis or NETosis. DNA molecules bound to
nucleosomes have higher protection against degradation. In regulatory regions, TF binding  
displaces nucleosomes in a TFBS  and causes a phasing effect of nucleosomes within regions bound by the same TF within 
a cell and between cells.
This causes a signal to emerge when these regions are analyzed from coverage prospective or fragmentation perspective.
Here we will see how these information can be extracted from liquid biopsy data using LBFextract.

1. Installation
---------------
To run this tutorial, it is suggested to create a folder in which the code and the results will be stored and set this as
the current working directory. 

Open a terminal and run the following command:

.. code-block:: bash
    
    mkdir -p tutorial_lbf && cd tutorial_lbf

To run this tutorial, conda is required. If you do not have conda installed, the following can be used to install it:

.. code-block:: bash
    
    curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
    rm Miniconda3-latest-Linux-x86_64.sh

After the download is complete and conda is installed, the prompt should show a `(base)` prefix. Running `conda init`
may be necessary to initialize conda.
A new conda environment for this tutorial can be created and activated as follow:

.. code-block:: bash

    conda create -n LBF_tutorial python=3.10 -y
    conda activate LBF_tutorial

Inside this environment we will install the necessary packages required
for the tutorial (jupyter, ipykernel):

.. code-block:: bash

    python -m pip install jupyter
    python -m pip install ipykernel
    python -m ipykernel install --name LBF_tutorial --user

We will now clone the LBF repository and enter the folder containing the code (LBF folder):

.. code-block:: bash

    git clone https://github.com/Isy89/LBF.git && cd LBF

To install the LBFextract package the following command can be run:

.. code-block:: bash

    python -m pip install .

To be sure that the installation succeeded, we can run the following command:

.. code-block:: bash

    lbfextract --help

This should show the help message of the CLI.
To install lbfextract specific conda environment to have a separate running samtools with the correct version, run:

.. code-block:: bash

    lbfextract setup create-conda-env

LBF extract can be used in 3 ways:

1. Through the command line interface (CLI) 
2. Through the terminal user interface (TUI) 
3. Through the python interface

We should be now inside the LBF folder and the conda environment LBF_tutorial should be active. This can be seen by the 
prompt being preceded by (LBF_tutorial).

2. CLI inerface 
---------------

2.1. Help

After running the command `lbfextract --help`, in the help message we can see the available subcommands:

1. feature_extraction_commands
2. post_extraction_analysis_commands
3. setup
4. start-tui

to have a look at the available feature extraction methods one can run:

.. code-block:: bash

    lbfextract feature_extraction_commands --help

This will show the help message of the feature_extraction_commands subcommand. The feature_extraction_commands subcommand 
are reported with a truncated help message. To see for example the full explanation of the extract-coverage subcommand
one can run the following command:

.. code-block:: bash

    lbfextract feature_extraction_commands extract-coverage --help

2.2. Extract coverage

We will now extract coverage from the sample provided in the test dataset folder.
To do this, we need to provide the following arguments to the `lbfextract feature_extraction_commands extract-coverage` 
command:

1. --path_to_bam: the path to the BAM file containing the reads
2. --path_to_bed: the path to the BED file containing the regions of interest
3. --output_path: the path to the output file

We will use the datasets provided in the `tests/test_datasets` folder. We will use the CTCF and AR BED files and the 
fextract_anonymized_test.bam file. 

To extract coverage for CTCF, the following command can be run:

.. code-block:: bash

    lbfextract feature_extraction_commands extract-coverage \
    --path_to_bam "tests/test_dataset/bam/fextract_anonymized_test.bam" \
    --path_to_bed "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/CTCF.sorted.gtrd_version_21_12.1000_sites.hg38.bed" \
    --output_path "tests/test_out/test_cli" \
    --summarization_method mean \
    --cores 4

The result will be a pickle (.pkl ) file containing the Signal object, which contains:

- the coverage signal in the array attribute, 
- the metadata in the metadata attribute
- the tags in the tags attribute and a pdf containing the plot

The coverage signal is a 1D numpy vector containing the coverage signal for the regions of interest. In case the value of 
the summarization method was set to skipp the array will be a 2D matrix.
We can extract also the same signal from more than one BED file using the feature extraction methods containing an 
``in-batch` suffix.

To extract coverage for CTCF and AR, run the following command:

.. code-block:: bash

    lbfextract feature_extraction_commands extract-coverage-in-batch \
    --path_to_bam "tests/test_dataset/bam/fextract_anonymized_test.bam" \
    --path_to_bed "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/" \
    --output_path "tests/test_out/test_cli" \
    --summarization_method mean \
    --cores 4

The output will contain:

1. a plot for each BED file
2. a batch_signals plot containing the top and bottom 5 signals one next to the other 
3. a csv file containing one signal per BED file in each row 
4. a pkl file containing the Signal object
5. a correlation matrix 
6. the summary containing:
    - the amplitude for the top and bottom 5 signals
    - the correlation for the top and bottom 5 signals
    - the heatmap for the top and bottom 5 signals
    - the PCA plot for all BED files


2.3. Extract fragmentomics-based features

The fragment length distribution per position can be extracted in the following way:

.. code-block:: bash

    lbfextract feature_extraction_commands extract-fragment-length-distribution \
    --path_to_bam "tests/test_dataset/bam/fextract_anonymized_test.bam" \
    --path_to_bed "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/CTCF.sorted.gtrd_version_21_12.1000_sites.hg38.bed" \
    --output_path "tests/test_out/test_cli" \
    --cores 4

The results is a plot showing a heatmap of the distribution per position. Each column corresponds to a position relative
to the center of the binding site. The rows correspond to the fragment length.
At the bottom the averaged fragment length distribution is shown for chunks of position over the +- n positions considered 
(+-2000 by default). The same can be done using the in-batch version of this feature extraction method

.. code-block:: bash

    lbfextract feature_extraction_commands extract-fragment-length-distribution-in-batch \
    --path_to_bam "tests/test_dataset/bam/fextract_anonymized_test.bam" \
    --path_to_bed "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/" \
    --output_path "tests/test_out/test_cli" \
    --cores 4

The output in this case will be a plot of the fragment length distribution per position for each BED file and a .npz file 
containing each array in a different key.

3. Python interface
-------------------
To better enjoy the python interface, we suggest to use a jupyter notebook. To start a jupyter notebook or lab, run the following:

.. code-block:: bash

    jupyter notebook

or:

.. code-block:: bash

    jupyter lab

This will open a new tab in the browser. In the new tab, navigate to the folder where the tutorial is located and create a
new notebook. In the notebook, one can copy paste the following commands.
    
The python interface can be used to extract the same features as the CLI.

To use the python interface the `FeatureExtractor` object needs to be imported from the `lbfextract.feature_extractor` 
module.

.. code-block:: python

    import lbfextract
    from lbfextract.feature_extractor import FeatureExtractor
    import pathlib
    
    fe = FeatureExtractor()

we can now extract the coverage signal for CTCF using the following code:

.. code-block:: python

    path_to_bam = "tests/test_dataset/bam/fextract_anonymized_test.bam"
    path_to_bed = "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/CTCF.sorted.gtrd_version_21_12.1000_sites.hg38.bed"
    output_path = "tests/test_out/test_python"
    summarization_method = "mean"
    
    fe.extract(
        "extract_coverage",
         path_to_bam = pathlib.Path(path_to_bam),
         path_to_bed = pathlib.Path(path_to_bed),
         output_path = pathlib.Path(output_path),
         summarization_method = summarization_method,
         cores = 4
    )

The same can be done for the in-batch version of the feature extraction method:

.. code-block:: python

    path_to_bam = "tests/test_dataset/bam/fextract_anonymized_test.bam"
    path_to_bed = "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/"
    output_path = "tests/test_out/test_python"
    summarization_method = "mean"
    
    fe.extract(
        "extract_coverage_in_batch",
         path_to_bam = pathlib.Path(path_to_bam),
         path_to_bed = pathlib.Path(path_to_bed),
         output_path = pathlib.Path(output_path),
         summarization_method = summarization_method,
         cores = 4
    )

The fragment length distribution can be extracted using the following code:

.. code-block:: python

    path_to_bam = "tests/test_dataset/bam/fextract_anonymized_test.bam"
    path_to_bed = "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/CTCF.sorted.gtrd_version_21_12.1000_sites.hg38.bed"
    output_path = "tests/test_out/test_python"
    
    fe.extract(
        "extract_fragment_length_distribution",
         path_to_bam = pathlib.Path(path_to_bam),
         path_to_bed = pathlib.Path(path_to_bed),
         output_path = pathlib.Path(output_path),
         cores = 4
    )


The same can be done for the in-batch version of the feature extraction method:

.. code-block:: python

    path_to_bam = "tests/test_dataset/bam/fextract_anonymized_test.bam"
    path_to_bed = "tests/test_dataset/multi_bed_ar_ctcf_for_dyads/"
    output_path = "tests/test_out/test_python"
    
    fe.extract(
        "extract_fragment_length_distribution_in_batch",
         path_to_bam = pathlib.Path(path_to_bam),
         path_to_bed = pathlib.Path(path_to_bed),
         output_path = pathlib.Path(output_path),
         cores = 4
    )


4. Differentially active Transcription Factors (TFs)
----------------------------------------------------

In this section we will show how to use the CLI and python interface to extract differentially active TFs. We will use
a generated dataset containing 2 groups of samples having 20% of differentially active TFs.

4.1. Generate the dataset

We can generate the dataset using the `DatasetGenerator` object. This can be done in the following way:

.. code-block:: python

    from lbfextract.data.dummy_dataset_generator import DummyDatasetGenerator
    
    dg = DummyDatasetGenerator(n_tf=100, groups=2, dimensions=2000, n_samples=20, n_diff_active=0.2, noise_level=1,sign=-1)
    dataset = dg.create_dataset()
    dg.save_dataset_by_sample(dataset, output_path="tests/test_dataset/generated_dataset")
    
    
This creates a datasets having two groups having 20 samples each having 100 TF with signals having 4000 dimensions 
with 20% of differentially active TFs. Each sample is saved as a csv file in the generated_dataset folder.

4.2. Generation of the sample sheet

To run the rest of the analysis we can go back to the previous terminal. Be sure to be in the `LBF` folder and the
`LBF_tutorial` conda environment is active.
LBFextract offers a way to automatically generate a sample sheet containing a column with the paths to each generated
.csv file and a empty column (group), which will contain the group membership of each sample.
This is done in the following way: 

.. code-block:: bash

    lbfextract post_extraction_analysis_commands \
    --path_to_res_summary "tests/test_dataset/generated_dataset" \
    --output_path "tests/test_out/" \
    generate-sample-sheet

Once the sample sheet is generated the information about the group membership can be filled in and the information about the 
samples name can be corrected. When sample specific datasets are generated through a normal lbfextract feature extraction method
they will be imported with the correct name. This may need to be changed when the structure of the results of the feature 
extraction method dose not follow the standard structure.

We can attach the group information to the sample sheet using the following command:

.. code-block:: python

    import pandas as pd
    import pathlib
    
    df = pd.read_csv("tests/test_out/fextract_diff_signal_results/sample_sheet.csv", index_col=0)
    df["sample_name"] = df["paths_to_sample_result"].apply(lambda x: pathlib.Path(x).stem)
    df["group"] = df["sample_name"].apply(lambda x: x.split("_")[0])
    df.to_csv("tests/test_out/fextract_diff_signal_results/sample_sheet_filled.csv")

4.3. Extract differentially active TFs

Once the groups are given and the sample name corrected we can start the differentially active analysis in the following way:

.. code-block:: bash

    lbfextract post_extraction_analysis_commands \
    --save_individual_plots \
    --commit_hash "TEST" \
    --center_signal_indices "1985,2015" \
    --rm_outliers \
    --flanking_signal_indices "1000,3000" \
    --correction_method "fdr_bh" \
    --max_iter 1 \
    --path_to_res_summary "tests/test_dataset/generated_dataset" \
    --alpha 0.05 \
    --output_path "tests/diff_active_tfs" \
    --path_to_sample_sheet "tests/test_out/fextract_diff_signal_results/sample_sheet_filled.csv" \
    --outer_group_column "group" \
    get-differentially-active-genomic-intervals

The output will be located under `tests/diff_active_tfs/`. The output will contain:

- a folder containing the heatmap of the differentially active transcription factors per group
- a barplot of the number of differentially active TFs per group
- a subfolder for each pair of groups containing the plots for each differentially active TFs, the table of the enrichment
  analysis
- the table of the differentially active transcription factors.
- a .pkl file containing the results as python object
- a metadata.json file containing the metadata of the analysis