
Input Files

Lorraine Ayad edited this page Jul 1, 2022 · 10 revisions

Certain input files are required to run CNEFinder. These include:

Mandatory

  1. Reference genome (FASTA File)
  2. Query genome (FASTA File)

Repetitive regions must be masked in the input FASTA files, either with 'N' (hard-masked) or with lowercase letters (soft-masked).

  3. Exon coordinates for reference genome (see example file below)
  4. Exon coordinates for query genome (see example file below)
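
The masking requirement for the FASTA inputs can be checked before running CNEFinder. The following is a minimal Python sketch, not part of CNEFinder itself: the function name `masking_summary` and the in-memory example sequences are hypothetical, and counting lowercase `n` as hard-masked is an assumption made here for illustration.

```python
import io


def masking_summary(fasta_handle):
    """Count hard-masked (N), soft-masked (lowercase) and unmasked bases per sequence."""
    summary = {}
    name = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            summary[name] = {"hard": 0, "soft": 0, "unmasked": 0}
        elif name is not None:
            counts = summary[name]
            for base in line:
                if base in "Nn":  # assumption: 'n' is treated like 'N'
                    counts["hard"] += 1
                elif base.islower():
                    counts["soft"] += 1
                else:
                    counts["unmasked"] += 1
    return summary


# Tiny in-memory example; for a real genome, pass an open file handle instead.
example = io.StringIO(">chr1\nACGTacgtNNNN\nacgtACGT\n>chr2\nACGTACGT\n")
result = masking_summary(example)
for seq, counts in result.items():
    masked = counts["hard"] + counts["soft"]
    note = "" if masked else "  <- no masked bases found"
    print(seq, counts, note)
```

A sequence that reports no masked bases at all is worth a second look, since it may mean the FASTA file was downloaded without repeat masking.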

The following files are required only if CNEs are to be identified using gene coordinates rather than coordinates supplied by the user:

  1. Gene coordinates for reference genome (see example file below)
  2. Gene coordinates for query genome (see example file below)

Example

  1. Download the hg38 chromosomes.
  2. Download the galGal4 chromosomes.
  3. Create the exon file hg38_exons for hg38. Exonic coordinates for hg38 can be retrieved using R/Bioconductor:

    # Load the UCSC knownGene annotation for hg38
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    # Extract all exons as a GRanges object
    exonsRanges <- exons(TxDb.Hsapiens.UCSC.hg38.knownGene)
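
Please note that exon files must be in TSV format.

  4. Create the exon file galGal4_exons for galGal4. Exonic coordinates for galGal4 can be retrieved using the biomaRt Bioconductor R package:

    library(biomaRt)
    library(GenomicRanges)  # GRanges and IRanges constructors
    library(GenomeInfoDb)   # seqlevelsStyle

    # Connect to the December 2015 Ensembl archive and select the chicken gene dataset
    ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="dec2015.archive.ensembl.org")
    ensembl <- useDataset("ggallus_gene_ensembl", mart=ensembl)

    # Optional: list the attributes available for this dataset
    attributes <- listAttributes(ensembl)

    # Retrieve chromosome, start, end and strand for every exon
    exons <- getBM(attributes=c("chromosome_name", "exon_chrom_start",
                                "exon_chrom_end", "strand"),
                   mart=ensembl)

    # Build a GRanges object; Ensembl encodes the forward strand as 1
    exonsRanges <- GRanges(seqnames=exons$chromosome_name,
                           ranges=IRanges(start=exons$exon_chrom_start,
                                          end=exons$exon_chrom_end),
                           strand=ifelse(exons$strand==1L, "+", "-"))

    # Convert Ensembl chromosome names to UCSC style
    seqlevelsStyle(exonsRanges) <- "UCSC"
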
  5. Create the gene file hg38_genes for hg38. Follow the instructions for the biomaRt Bioconductor R package to retrieve genic coordinates for GRCh38 (hg38). Column names should comply with the following format:

    gene_name	chromosome_name	start_position	end_position	status
    WI2-2998D17.2	1	223741977	223742862	KNOWN
    AC020910.2	19	35166998	35168486	KNOWN
    SNRPEP10	1	223831811	223832085	KNOWN
    CTC-523E23.1	19	35302738	35305249	NOVEL
    PHBP11	1	224044281	224045089	KNOWN
    
    

NOTE: The gene name must be in the first column of the gene files.
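
The column layout above can be checked before a gene file is passed to CNEFinder. The following is a minimal Python sketch, not part of CNEFinder: the function name `check_gene_file` is hypothetical, the expected header is taken from the example above, and the checks on integer coordinates and on start not exceeding end are assumptions added for illustration.

```python
import csv
import io

# Header taken from the example gene file shown above.
EXPECTED_COLUMNS = ["gene_name", "chromosome_name", "start_position",
                    "end_position", "status"]


def check_gene_file(handle):
    """Return a list of problems found in a tab-separated gene file (empty if none)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    if header[:1] != ["gene_name"]:
        problems.append("first column must be gene_name")
    if header != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {header}")
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(f"line {line_no}: expected {len(EXPECTED_COLUMNS)} "
                            f"tab-separated fields, found {len(row)}")
            continue
        start, end = row[2], row[3]
        # Assumption: coordinates are non-negative integers with start <= end.
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {line_no}: non-integer coordinates")
        elif int(start) > int(end):
            problems.append(f"line {line_no}: start is greater than end")
    return problems


good = io.StringIO(
    "gene_name\tchromosome_name\tstart_position\tend_position\tstatus\n"
    "SNRPEP10\t1\t223831811\t223832085\tKNOWN\n"
    "PHBP11\t1\t224044281\t224045089\tKNOWN\n"
)
bad = io.StringIO(
    "chromosome_name\tgene_name\tstart_position\tend_position\tstatus\n"
    "1\tSNRPEP10\t223831811\t223832085\tKNOWN\n"
)
good_problems = check_gene_file(good)
bad_problems = check_gene_file(bad)
print(good_problems)
print(bad_problems)
```

For a real file, an open handle such as `open("hg38_genes", newline="")` can be passed in place of the in-memory examples.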

  6. Create the gene file galGal4_genes for galGal4. Follow the same steps as for hg38_genes, but select ggallus_gene_ensembl as the Ensembl dataset.
