In this course, the following next-generation sequencing (NGS) datasets/protocols are described in detail:
Analysis of the major project dataset requires quality control (quality and sequencing adapter trimming), genome alignment, feature counts, differential expression and downstream visualisation. The data also aim to address a particular scientific question and investigate a hypothesis. For the major project, you will be assigned a dataset (from a published project) and complete all the analysis tasks (from raw data to final results) and write up a report. This report must be structured like a small journal article, with abstract (summarising the project), introduction (background on the study and identification of the research hypothesis), methods (analysis steps and programs used), results (what you found) and discussion (how the results relate to the research question) sections. Marks will also be awarded to the bash/R or RMarkdown scripts that you use.
Section | Mark |
---|---|
Abstract | 5% |
Introduction + hypothesis | 10% |
Methods | 20% |
Results & Discussion | 30% |
References | 5% |
Analysis scripts | 30% |
Your major project RNA-Seq data is from a gene expression study. The study (Kumar et al., 2021) was trying to identify candidate genes that contribute to a long circadian period in the model organism Drosophila melanogaster by performing transcriptional profiling of whole fly heads from two genotypes: DGRP_892 (DGRP892), a line with a long 31-hour circadian period and Canton-S B (CantonSB), a line with a normal 24-hour circadian period. If you want to learn more about the study, you can read their published paper 1. This RNA-Seq data was sequenced on Illumina HiSeq4000 (PE76).
For your major project, you will compare a pair of transcriptome data between male or female long circadian line DGRP892 and normal circadian line CantonSB at a specific circadian time (every 2 hours of a 24 hours circadian time period, from CT02 to CT24). See details in the Materials and Methods section of the reference paper 1 if you want to get more information about the experimental design. Your research goal will be to investigate the transcriptional difference between your designated pair of Drasophila samples and interpret your results.
Each student will be provided with a unique pair of transcriptome dataset (12 zipped fastq
files), including 6 files representing the RNA-Seq of three biological replicates from the long circadian line DGRP892 and 6 files representing the RNA-seq of corresponding normal circadian line CantonSB. For example, the following is a typical list of files for one student:
Files (paired end) | File format | Line (genotype) | Tissue | Circadian time | Sex | Description of files |
---|---|---|---|---|---|---|
DGRP892_Head_CT02_male_rep1_R1.fastq.gz, DGRP892_Head_CT02_male_rep1_R2.fastq.gz | fastq | DGRP_892 | Head | CT02 | male | Biological replicate 1 (rep1) of long circadian line DGRP_892 at 2 hours circadian time (CT02) |
DGRP892_Head_CT02_male_rep2_R1.fastq.gz, DGRP892_Head_CT02_male_rep2_R2.fastq.gz | fastq | DGRP_892 | Head | CT02 | male | Biological replicate 2 (rep2) of long circadian line DGRP_892 at 2 hours circadian time (CT02) |
DGRP892_Head_CT02_male_rep3_R1.fastq.gz, DGRP892_Head_CT02_male_rep3_R2.fastq.gz | fastq | DGRP_892 | Head | CT02 | male | Biological replicate 3 (rep3) of long circadian line DGRP_892 at 2 hours circadian time (CT02) |
CantonSB_Head_CT02_male_rep1_R1.fastq.gz, CantonSB_Head_CT02_male_rep1_R2.fastq.gz | fastq | Canton-S B | Head | CT02 | male | Biological replicate 1 (rep1) of normal circadian line Canton-S B at 2 hours circadian time (CT02) |
CantonSB_Head_CT02_male_rep2_R1.fastq.gz, CantonSB_Head_CT02_male_rep2_R2.fastq.gz | fastq | Canton-S B | Head | CT02 | male | Biological replicate 2 (rep2) of normal circadian line Canton-S B at 2 hours circadian time (CT02) |
CantonSB_Head_CT02_male_rep3_R1.fastq.gz, CantonSB_Head_CT02_male_rep3_R2.fastq.gz | fastq | Canton-S B | Head | CT02 | male | Biological replicate 3 (rep3) of normal circadian line Canton-S B at 2 hours circadian time (CT02) |
Your unique transcriptome dataset can be accessed through VM path /shared/data/major_project_2024/01_raw_data/axxxxxxx
, in which axxxxxxx
is your university a
number. You have two options to access/use your unique transcriptome dataset:
For example, if your major project top level directory is ~/major_project
, and you want to access your raw fastq data in directory ~/major_project/01_raw_data
in your analysis, you can run the following commands (replace axxxxxxx
with your a
number) to create soft links of your unique transcriptome raw data files:
mkdir -p ~/major_project/01_raw_data
cd ~/major_project/01_raw_data
ln -s /shared/data/major_project_2024/01_raw_data/axxxxxxx/*.fastq.gz ./
You can hard code your unique transcriptome dataset path (/shared/data/major_project_2024/01_raw_data/axxxxxxx
, replace axxxxxxx
with your a
number) in your bash script.
[Important] Please DO NOT copy the transcriptome raw data files into your home directory If everyone did this, we would run out of storage.
The major project is mainly based on the practicals of Week 6, Alignment/NGS
and Week 11, Transciptomics/Gene Expression
, and also requires skills that you have learned from other practicals (R, bash).
Set up your project folder structure in an organised way.
/shared/data/major_project_2024/00_DB
. There are three files in the folder:
Drosophila_melanogaster.BDGP6.46.dna.primary_assembly.no_scaffolds.fa
, this is the reference genome sequences file in fasta
format.Drosophila_melanogaster.BDGP6.46.112.chr.gtf
, this is the reference gene annotation file in gtf
format.Drosophila_melanogaster.ensDB.112.gene_annotation.txt
, this is a text file including additional information about reference genes.You may want to include figures/tables in your report, however, please don’t try to include every figure/table from your analysis in your report. The general rule for figures is no more than 6-8 figures (you can have multiple sub-panels in one figure) in a scientific article.
[1] Kumar S, Tunc I, Tansey TR, Pirooznia M, Harbison ST. Identification of Genes Contributing to a Long Circadian Period in Drosophila Melanogaster. J Biol Rhythms. 2021 Jun;36(3):239-253. doi: 10.1177/0748730420975946. Epub 2020 Dec 4. PMID: 33274675; PMCID: PMC8390771.