AMATERASU: Automated Mapping And Transcriptome Expression Research Analysis Suite

Published:

Developer: HikariMusic
Overview: AMATERASU (Automated Mapping And Transcriptome Expression Research Analysis Suite) is a command-line toolkit that streamlines RNA sequencing data analysis from raw data processing to advanced statistical analysis. The suite combines raw data alignment using STAR, expression profiling, clustering analysis, differential expression analysis, gene set enrichment analysis, and survival analysis into a unified workflow with straightforward commands. AMATERASU simplifies complex bioinformatics processes while generating publication-ready visualizations, making RNA-seq analysis more accessible to researchers working with large-scale transcriptomics data. The project is open-source and can be found on GitHub.

License DOI Stars

Setup

Clone the repository:

git clone https://github.com/hikarimusic/AMATERASU.git

Make a virtual environment:

python3 -m venv .amaterasu
source .amaterasu/bin/activate

Install python packages:

cd AMATERASU/
pip install -r requirements.txt

Download and index genome files with STAR:

chmod +x setup_STAR.sh process_STAR.sh
./setup_STAR.sh

Raw Data Processing

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Align and profile RNA-seq data with STAR:

./process_STAR.sh <input_dir/> <output_dir/>

The <input_dir/> should contain sequencing raw data. Both pair-end and single-end data are acceptable. Example of <input_dir/>:

my_sequences/
├── sample1_1.fastq # paired-end data
├── sample1_2.fastq
├── sample2.fastq # single-end data
└── sample3/
    ├── sample3_1.fastq # Files can be in subfolders
    └── sample3_2.fastq

The <output_dir/> will contain expression profiles of the samples. Example of <output_dir/>:

my_profiles/
├── sample1.tsv
├── sample2.tsv
└── sample3.tsv

Cohort Summary

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Summarize the profiles based on your cohort:

python3 summarize.py <cohort.csv> <profile_dir/> <value_type>

The first column of the cohort should be the sample ID. Example of <cohort.csv> (or <cohort.tsv>):

sample_idgenderracetumor_stage
TCGA_DD_AAVPmaleasianStage I
TCGA_WX_AA46malewhiteStage II
TCGA_BD_A3EPfemaleblackStage I

The <profile_dir/> should contain profiles of those samples:

my_profiles/
├── TCGA_DD_AAVP.tsv
├── TCGA_WX_AA46.tsv
└── TCGA_BD_A3EP.tsv

<value_type> specifies the expression value to be used. Example of a profile .tsv file (in this case we can use tpm_unstranded as the value type):

# gene-model: GENCODE v36
gene_id	gene_name	gene_type	unstranded	stranded_first	stranded_second	tpm_unstranded	fpkm_unstranded	fpkm_uq_unstranded
N_unmapped			8120268	8120268	8120268			
N_multimapping			4402480	4402480	4402480			
N_noFeature			1446120	24573336	24781939			
N_ambiguous			5551010	1476463	1462076			
ENSG00000000003.15	TSPAN6	protein_coding	3231	1634	1597	44.6685	16.8557	23.8956
ENSG00000000005.6	TNMD	protein_coding	0	0	0	0.0000	0.0000	0.0000
ENSG00000000419.13	DPM1	protein_coding	1083	556	528	56.2676	21.2327	30.1006

After running the command, a folder <cohort/> will be created. The original cohort with gene expression values will be generated as <cohort/summary.csv>, which can be used in the following analysis. Example of <cohort/summary.csv>:

sample_idTSPAN6TNMD
TCGA_DD_AAVP136.40.00
TCGA_WX_AA4645.30.19
TCGA_BD_A3EP60.10.04

Clustering Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform clustering analysis with columns in interest:

python3 cluster.py <cohort/summary.csv> <group_column1> <group_column2> ... 

Each of the <group_column?> will be labeled in the clustering results.

PCA plots corresponding to each column will be generated as <cohort/cluster_PCA_....png>. Example:

Heatmaps with hierarchy clustering will be generated as <cohort/cluster_heatmap_....png>. Example:

If [-n] is specified in the command, samples will be labeled into n clusters and cohort/summary_cluster.csv will be generated. You can use it instead of cohort/summary.csv in the following analysis to access the Cluster column. Example of cohort/summary_cluster.csv:

sample_idgenderCluster
TCGA_DD_AAVPmaleCluster1
TCGA_WX_AA46maleCluster2
TCGA_BD_A3EPfemaleCluster3

All the configuration such as figure size and format can be set in the head of cluster.py.

Differential Expression Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform differential expression analysis between two groups:

python3 DEA.py <cohort/summary.csv> <group_column> <group_a1> <group_a2> ... -- <group_b1> <group_b2> ... 

The first group will contain samples with <group_column> equal to <group_a?>, and the second group will contain samples with <group_column> equal to <group_b?>.

Volcano plots and heatmaps comparing the two groups will be generated as <cohort/DEA_volcano_....png> and <cohort/DEA_heatmap_....png>. Example:

Strip plots of the differential expression genes will be generated as <cohort/DEA_strip_....png>. Example:

The differential expression genes will be summarized as cohort/DEA_genes_....csv. The up-regulated genes of <group_b?> compared to <group_a?> will be summarized as cohort/DEA_genes_up_....csv. The down-regulated genes of <group_b?> compared to <group_a?> will be summarized as cohort/DEA_genes_down_....csv. Example:

genelog2_fold_changep_valueadjusted_pvalue
SAR1B-1.0989112062849027.924706080956533e-441.2597112786288506e-39
ACSM2A-2.255934217384251.080155659882675e-408.585077184747501e-37
ACSM2B-2.0235531558386232.481105941588788e-401.3146553349165125e-36

All the configuration such as figure size and format can be set in the head of DEA.py.

Gene Set Enrichment Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform gene set enrichment analysis between two groups based on the predefined gene set:

python3 GSEA.py <cohort/summary.csv> <group_column> <group_a1> <group_a2> ... -- <group_b1> <group_b2> ... <geneset.gmt>

The first group will contain samples with <group_column> equal to <group_a?>, and the second group will contain samples with <group_column> equal to <group_b?>. Predefined gene sets and their genes should be specified in <geneset.gmt>. Example:

HALLMARK_ADIPOGENESIS    <url>    ABCA1    ABCB8    ACAA2    ...
HALLMARK_ALLOGRAFT_REJECTION    <url>    AARS1    ABCE1    ABI1    ...
HALLMARK_ANDROGEN_RESPONSE    <url>    ABCC4    ABHD2    ACSL3    ...

GSEA plots of significant gene sets will be generated in <cohort/GSEA_gsea_.../> as <up_....png> or <down_....png>. Example:

Bar plots showing leading genes of significant gene sets will be generated in <cohort/GSEA_bar_.../> as <up_....png> or <down_....png>. Example:

The gene sets will be summarized as cohort/GSEA_genesets_....csv. The up-regulated gene sets of <group_b?> compared to <group_a?> will be summarized as cohort/GSEA_genesets_up_....csv. The down-regulated gene sets of <group_b?> compared to <group_a?> will be summarized as cohort/GSEA_genesets_down_....csv. Example:

gene_setenrichment_scorepositionp_valueadjusted_pvalue
HALLMARK_XENOBIOTIC_METABOLISM-0.3500.6392.31e-211.15e-19
HALLMARK_BILE_ACID_METABOLISM-0.4570.6571.27e-203.19e-19
HALLMARK_MYC_TARGETS_V10.3360.2402.92e-204.86e-19

All the configuration such as figure size and format can be set in the head of GSEA.py.

Single Sample Gene Set Enrichment Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform single sample gene set enrichment analysis on all samples based on the predefined gene set:

python3 ssGSEA.py <cohort/summary.csv> <geneset.gmt>

Enrichment scores between each pair of sample and gene set will be calculated and summarized as cohort/summary_ssGSEA.csv, which can be used in other analysis. Example of cohort/summary_ssGSEA.csv:

sample_idADIPOGENESISALLOGRAFT_REJECTION
TCGA_DD_AAVP28463.124099.9
TCGA_DD_A4NP28476.422982.5
TCGA_BC_407328279.427140.9

All the configuration such as removing prefix and combine mode can be set in the head of ssGSEA.py.

Survival Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform survival analysis to compare among different groups:

python3 survival.py <cohort/summary.csv> <time> <event> <group_column1> <group_column2> ...

<time> should record the event time, and <event> should record the event status (0 for alive, 1 for dead). <group_column?> can be group columns or gene names to be compared.

Kaplan–Meier plots corresponding to each column will be generated as <cohort/survival_KM_....png>. Example:

The survival differences will be summarized as <cohort/survival_variables.csv>. Example:

variabletypethresholddfp_value
Tumorcategorical 32.71e-08
PCLAFnumerical2.2918.82e-06

If [-all] is specified, all genes will be analyzed and summarized as <cohort/survival_genes.csv> (instead of finding optimal thresholds, the medians will be used). Example:

genethresholddfp_valueadjusted_pvalue
LYPLA299.912.89e-060.00288
MRTO430.617.68e-060.00378
YBX1184.211.14e-050.00378

All the configuration such as figure size and format can be set in the head of survival.py.

Others

  • Contact: hikarimusic.tm@gmail.com
  • Citation: Tung, Yeu-Guang (2024). AMATERASU: Automated Mapping And Transcriptome Expression Research Analysis Suite. Zenodo. https://doi.org/10.5281/zenodo.14337458