Input Files
Lorraine Ayad edited this page Jul 1, 2022 · 10 revisions
Certain input files are required to run CNEFinder. These include:
Mandatory
- Reference genome (FASTA File)
- Query genome (FASTA File)
Repetitive regions must be annotated in the input FASTA files with either 'N' or lowercase letters.
- Exon coordinates for reference genome (see example file below)
- Exon coordinates for query genome (see example file below)
The following files are required only if CNEs are to be identified using gene coordinates rather than coordinates supplied by the user:
- Gene coordinates for reference genome (see example file below)
- Gene coordinates for query genome (see example file below)
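Because repetitive regions must be marked with 'N' or lowercase letters, it can help to confirm that a FASTA file actually contains masked bases before running CNEFinder. The following is a minimal base-R sketch, not part of CNEFinder: the function name `masking_summary` and the demo file are hypothetical, and it assumes the file is small enough to read into memory (whole chromosomes would call for a streaming reader).

```r
# Summarise masking in a FASTA file using base R only.
# Assumption: the whole file fits in memory; this is an illustrative check,
# not something CNEFinder itself provides.
masking_summary <- function(fasta_path) {
  lines <- readLines(fasta_path)
  seq_lines <- lines[!startsWith(lines, ">")]   # drop FASTA header lines
  bases <- unlist(strsplit(paste(seq_lines, collapse = ""), ""))
  hard <- sum(bases %in% c("N", "n"))           # bases masked with N
  soft <- sum(bases %in% letters) - sum(bases == "n")  # other lowercase bases
  c(total = length(bases), hard_masked = hard, soft_masked = soft)
}

# Tiny made-up example: 4 unmasked bases, 3 lowercase bases, 2 N's
demo_fasta <- tempfile(fileext = ".fa")
writeLines(c(">chrDemo", "ACGTacg", "NN"), demo_fasta)
masking_summary(demo_fasta)
```

If both `hard_masked` and `soft_masked` are zero for a real genome, the file has probably not been repeat-masked.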
- Click here to download hg38 chromosomes.
- Click here to download galGal4 chromosomes.
- Create exon file hg38_exons for hg38. Exonic coordinates for hg38 can be retrieved using R/Bioconductor:
require(TxDb.Hsapiens.UCSC.hg38.knownGene)
exonsRanges <- exons(TxDb.Hsapiens.UCSC.hg38.knownGene)
- Please note that exon files must be in TSV format.
- Create exon file galGal4_exons for galGal4. Exonic coordinates for galGal4 can be retrieved using the biomaRt Bioconductor R package:
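require(biomaRt)
require(GenomicRanges)  # attaches the packages behind GRanges(), IRanges() and seqlevelsStyle()
# Connect to the December 2015 Ensembl archive and select the chicken dataset
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="dec2015.archive.ensembl.org")
ensembl <- useDataset("ggallus_gene_ensembl", mart=ensembl)
attributes <- listAttributes(ensembl)  # attributes available to getBM()
# Retrieve exon coordinates and strand
exons <- getBM(attributes=c("chromosome_name", "exon_chrom_start",
                            "exon_chrom_end", "strand"),
               mart=ensembl)
# Build a GRanges object, converting Ensembl strand codes (1/-1) to "+"/"-"
exonsRanges <- GRanges(seqnames=exons$chromosome_name,
                       ranges=IRanges(start=exons$exon_chrom_start,
                                      end=exons$exon_chrom_end),
                       strand=ifelse(exons$strand==1L, "+", "-")
                       )
seqlevelsStyle(exonsRanges) <- "UCSC"  # use UCSC-style chromosome names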
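The snippets above leave the exon coordinates in an `exonsRanges` object, which still has to be written to a TSV file. The following is a minimal sketch of that step using base R. The column selection (chromosome, start, end) and the absence of a header row are assumptions, since the expected layout is not spelled out here; check them against the example exon file. The helper `write_exon_tsv` and the demo values are hypothetical, and `exon_df` stands in for the result of `as.data.frame(exonsRanges)`, which has `seqnames`, `start`, `end`, `width` and `strand` columns.

```r
# Stand-in for as.data.frame(exonsRanges); the values are made up.
exon_df <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  start    = c(100L, 500L, 250L),
  end      = c(200L, 650L, 400L),
  strand   = c("+", "-", "+"),
  stringsAsFactors = FALSE
)

# Assumption: the exon file holds chromosome, start and end, tab-separated,
# with no header row. Adjust `cols` and `col.names` to match the example file.
write_exon_tsv <- function(df, path, cols = c("seqnames", "start", "end")) {
  write.table(df[, cols], file = path, sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = FALSE)
  invisible(path)
}

out_file <- file.path(tempdir(), "galGal4_exons")
write_exon_tsv(exon_df, out_file)
readLines(out_file)  # one tab-separated line per exon
```

Setting `quote = FALSE` and `row.names = FALSE` keeps the output free of quotation marks and row indices.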
- Create gene file hg38_genes for hg38. Follow the instructions for the biomaRt Bioconductor R package to retrieve genic coordinates for GRCh38 (hg38). The column names should follow this format:
gene_name       chromosome_name   start_position   end_position   status
WI2-2998D17.2   1                 223741977        223742862      KNOWN
AC020910.2      19                35166998         35168486       KNOWN
SNRPEP10        1                 223831811        223832085      KNOWN
CTC-523E23.1    19                35302738         35305249       NOVEL
PHBP11          1                 224044281        224045089      KNOWN
NOTE: The gene name must be in the first column of the gene files.
- Create gene file galGal4_genes for galGal4. Same as above, but select ggallus_gene_ensembl as the Ensembl dataset.
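Since the gene name must be in the first column and the column names must follow the format shown above, a quick check of a finished gene file can catch layout mistakes early. The following is a minimal base-R sketch, not part of CNEFinder: `check_gene_file` is a hypothetical helper, and it assumes the gene file is tab-separated with a header row.

```r
# Check a gene file against the column layout described above.
# Assumption: the file is tab-separated and starts with a header row.
check_gene_file <- function(path) {
  expected <- c("gene_name", "chromosome_name", "start_position",
                "end_position", "status")
  genes <- read.delim(path, header = TRUE, colClasses = "character")
  problems <- character(0)
  if (!identical(colnames(genes)[1], "gene_name")) {
    problems <- c(problems, "first column is not gene_name")
  }
  missing_cols <- setdiff(expected, colnames(genes))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }
  if (all(c("start_position", "end_position") %in% colnames(genes))) {
    start <- suppressWarnings(as.numeric(genes$start_position))
    end   <- suppressWarnings(as.numeric(genes$end_position))
    if (any(is.na(start) | is.na(end))) {
      problems <- c(problems, "non-numeric coordinates")
    } else if (any(start > end)) {
      problems <- c(problems, "start_position greater than end_position")
    }
  }
  problems  # an empty character vector means no problems were found
}

# Demo with two rows taken from the format example above
demo_genes <- tempfile()
writeLines(c(
  "gene_name\tchromosome_name\tstart_position\tend_position\tstatus",
  "SNRPEP10\t1\t223831811\t223832085\tKNOWN",
  "PHBP11\t1\t224044281\t224045089\tKNOWN"
), demo_genes)
check_gene_file(demo_genes)
```

The function returns a list of problems rather than stopping at the first one, so all layout issues in a file can be fixed in one pass.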