AMATERASU: Automated Mapping And Transcriptome Expression Research Analysis Suite

Published: December 10, 2024

Developer: HikariMusic
Overview: AMATERASU (Automated Mapping And Transcriptome Expression Research Analysis Suite) is a command-line toolkit that streamlines RNA sequencing data analysis from raw data processing to advanced statistical analysis. The suite combines raw data alignment using STAR, expression profiling, clustering analysis, differential expression analysis, gene set enrichment analysis, and survival analysis into a unified workflow with straightforward commands. AMATERASU simplifies complex bioinformatics processes while generating publication-ready visualizations, making RNA-seq analysis more accessible to researchers working with large-scale transcriptomics data. The project is open-source and can be found on GitHub.

Setup

Clone the repository:

git clone https://github.com/hikarimusic/AMATERASU.git

Make a virtual environment:

python3 -m venv .amaterasu
source .amaterasu/bin/activate

Install python packages:

cd AMATERASU/
pip install -r requirements.txt

Download and index genome files with STAR:

chmod +x setup_STAR.sh process_STAR.sh
./setup_STAR.sh

Raw Data Processing

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Align and profile RNA-seq data with STAR:

./process_STAR.sh <input_dir/> <output_dir/>

The <input_dir/> should contain sequencing raw data. Both pair-end and single-end data are acceptable. Example of <input_dir/>:

my_sequences/
├── sample1_1.fastq # paired-end data
├── sample1_2.fastq
├── sample2.fastq # single-end data
└── sample3/
    ├── sample3_1.fastq # Files can be in subfolders
    └── sample3_2.fastq

The <output_dir/> will contain expression profiles of the samples. Example of <output_dir/>:

my_profiles/
├── sample1.tsv
├── sample2.tsv
└── sample3.tsv

Cohort Summary

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Summarize the profiles based on your cohort:

python3 summarize.py <cohort.csv> <profile_dir/> <value_type>

The first column of the cohort should be the sample ID. Example of <cohort.csv> (or <cohort.tsv>):

sample_id	gender	race	tumor_stage
TCGA_DD_AAVP	male	asian	Stage I
TCGA_WX_AA46	male	white	Stage II
TCGA_BD_A3EP	female	black	Stage I

The <profile_dir/> should contain profiles of those samples:

my_profiles/
├── TCGA_DD_AAVP.tsv
├── TCGA_WX_AA46.tsv
└── TCGA_BD_A3EP.tsv

<value_type> specifies the expression value to be used. Example of a profile .tsv file (in this case we can use tpm_unstranded as the value type):

# gene-model: GENCODE v36
gene_id	gene_name	gene_type	unstranded	stranded_first	stranded_second	tpm_unstranded	fpkm_unstranded	fpkm_uq_unstranded
N_unmapped			8120268	8120268	8120268			
N_multimapping			4402480	4402480	4402480			
N_noFeature			1446120	24573336	24781939			
N_ambiguous			5551010	1476463	1462076			
ENSG00000000003.15	TSPAN6	protein_coding	3231	1634	1597	44.6685	16.8557	23.8956
ENSG00000000005.6	TNMD	protein_coding	0	0	0	0.0000	0.0000	0.0000
ENSG00000000419.13	DPM1	protein_coding	1083	556	528	56.2676	21.2327	30.1006

After running the command, a folder <cohort/> will be created. The original cohort with gene expression values will be generated as <cohort/summary.csv>, which can be used in the following analysis. Example of <cohort/summary.csv>:

sample_id	…	TSPAN6	TNMD	…
TCGA_DD_AAVP	…	136.4	0.00	…
TCGA_WX_AA46	…	45.3	0.19	…
TCGA_BD_A3EP	…	60.1	0.04	…

Clustering Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform clustering analysis with columns in interest:

python3 cluster.py <cohort/summary.csv> <group_column1> <group_column2> ...

Each of the <group_column?> will be labeled in the clustering results.

PCA plots corresponding to each column will be generated as <cohort/cluster_PCA_....png>. Example:

Heatmaps with hierarchy clustering will be generated as <cohort/cluster_heatmap_....png>. Example:

If [-n] is specified in the command, samples will be labeled into n clusters and cohort/summary_cluster.csv will be generated. You can use it instead of cohort/summary.csv in the following analysis to access the Cluster column. Example of cohort/summary_cluster.csv:

sample_id	gender	…	Cluster
TCGA_DD_AAVP	male	…	Cluster1
TCGA_WX_AA46	male	…	Cluster2
TCGA_BD_A3EP	female	…	Cluster3

All the configuration such as figure size and format can be set in the head of cluster.py.

Differential Expression Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform differential expression analysis between two groups:

python3 DEA.py <cohort/summary.csv> <group_column> <group_a1> <group_a2> ... -- <group_b1> <group_b2> ... 

The first group will contain samples with <group_column> equal to <group_a?>, and the second group will contain samples with <group_column> equal to <group_b?>.

Volcano plots and heatmaps comparing the two groups will be generated as <cohort/DEA_volcano_....png> and <cohort/DEA_heatmap_....png>. Example:

Strip plots of the differential expression genes will be generated as <cohort/DEA_strip_....png>. Example:

The differential expression genes will be summarized as cohort/DEA_genes_....csv. The up-regulated genes of <group_b?> compared to <group_a?> will be summarized as cohort/DEA_genes_up_....csv. The down-regulated genes of <group_b?> compared to <group_a?> will be summarized as cohort/DEA_genes_down_....csv. Example:

gene	log2_fold_change	p_value	adjusted_pvalue
SAR1B	-1.098911206284902	7.924706080956533e-44	1.2597112786288506e-39
ACSM2A	-2.25593421738425	1.080155659882675e-40	8.585077184747501e-37
ACSM2B	-2.023553155838623	2.481105941588788e-40	1.3146553349165125e-36

All the configuration such as figure size and format can be set in the head of DEA.py.

Gene Set Enrichment Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform gene set enrichment analysis between two groups based on the predefined gene set:

python3 GSEA.py <cohort/summary.csv> <group_column> <group_a1> <group_a2> ... -- <group_b1> <group_b2> ... <geneset.gmt>

The first group will contain samples with <group_column> equal to <group_a?>, and the second group will contain samples with <group_column> equal to <group_b?>. Predefined gene sets and their genes should be specified in <geneset.gmt>. Example:

HALLMARK_ADIPOGENESIS    <url>    ABCA1    ABCB8    ACAA2    ...
HALLMARK_ALLOGRAFT_REJECTION    <url>    AARS1    ABCE1    ABI1    ...
HALLMARK_ANDROGEN_RESPONSE    <url>    ABCC4    ABHD2    ACSL3    ...

GSEA plots of significant gene sets will be generated in <cohort/GSEA_gsea_.../> as <up_....png> or <down_....png>. Example:

Bar plots showing leading genes of significant gene sets will be generated in <cohort/GSEA_bar_.../> as <up_....png> or <down_....png>. Example:

The gene sets will be summarized as cohort/GSEA_genesets_....csv. The up-regulated gene sets of <group_b?> compared to <group_a?> will be summarized as cohort/GSEA_genesets_up_....csv. The down-regulated gene sets of <group_b?> compared to <group_a?> will be summarized as cohort/GSEA_genesets_down_....csv. Example:

gene_set	enrichment_score	position	p_value	adjusted_pvalue
HALLMARK_XENOBIOTIC_METABOLISM	-0.350	0.639	2.31e-21	1.15e-19
HALLMARK_BILE_ACID_METABOLISM	-0.457	0.657	1.27e-20	3.19e-19
HALLMARK_MYC_TARGETS_V1	0.336	0.240	2.92e-20	4.86e-19

All the configuration such as figure size and format can be set in the head of GSEA.py.

Single Sample Gene Set Enrichment Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform single sample gene set enrichment analysis on all samples based on the predefined gene set:

python3 ssGSEA.py <cohort/summary.csv> <geneset.gmt>

Enrichment scores between each pair of sample and gene set will be calculated and summarized as cohort/summary_ssGSEA.csv, which can be used in other analysis. Example of cohort/summary_ssGSEA.csv:

sample_id	…	ADIPOGENESIS	ALLOGRAFT_REJECTION	…
TCGA_DD_AAVP	…	28463.1	24099.9	…
TCGA_DD_A4NP	…	28476.4	22982.5	…
TCGA_BC_4073	…	28279.4	27140.9	…

All the configuration such as removing prefix and combine mode can be set in the head of ssGSEA.py.

Survival Analysis

Start AMATERASU:

source .amaterasu/bin/activate
cd AMATERASU/

Perform survival analysis to compare among different groups:

python3 survival.py <cohort/summary.csv> <time> <event> <group_column1> <group_column2> ...

<time> should record the event time, and <event> should record the event status (0 for alive, 1 for dead). <group_column?> can be group columns or gene names to be compared.

Kaplan–Meier plots corresponding to each column will be generated as <cohort/survival_KM_....png>. Example:

The survival differences will be summarized as <cohort/survival_variables.csv>. Example:

variable	type	threshold	df	p_value
Tumor	categorical		3	2.71e-08
PCLAF	numerical	2.29	1	8.82e-06

If [-all] is specified, all genes will be analyzed and summarized as <cohort/survival_genes.csv> (instead of finding optimal thresholds, the medians will be used). Example:

gene	threshold	df	p_value	adjusted_pvalue
LYPLA2	99.9	1	2.89e-06	0.00288
MRTO4	30.6	1	7.68e-06	0.00378
YBX1	184.2	1	1.14e-05	0.00378

All the configuration such as figure size and format can be set in the head of survival.py.

Others

Contact: hikarimusic.tm@gmail.com
Citation: Tung, Yeu-Guang (2024). AMATERASU: Automated Mapping And Transcriptome Expression Research Analysis Suite. Zenodo. https://doi.org/10.5281/zenodo.14337458

Share on

Twitter Facebook LinkedIn