01 Developer Setup
- Conda (miniconda preferred)
1. Create a new conda environment named guidescan with Python version 3.10 (and pip installed in the environment to avoid any surprises later).
   conda create --name guidescan python=3.10 pip
2. Activate the environment.
   conda activate guidescan
   The command prompt will change to indicate the new conda environment by prepending (guidescan).
3. Clone the repository and enter it:
   git clone https://github.com/pritykinlab/guidescanpy.git
   cd guidescanpy
4. Install the package in editable mode, together with its optional dependencies:
   pip install -e ".[dev]"
5. Install guidescan
   The core guidescan program is needed for indexing genomes and creating new databases, and it is sufficient that the binary be accessible in the activated conda environment. Unless you want to download and compile guidescan yourself, the easiest option is to install it from bioconda. The command that is likely to work for most platforms is:
   conda install -c conda-forge -c bioconda guidescan
   Verify the guidescan version by running guidescan --version on the command line. Read the guidescan documentation on how to use the utility.
6. Run tests
   This step is crucial to see if guidescanpy and guidescan are working correctly. Run:
   cd docker/snakemake
   snakemake -F guidescan_pytest --cores 1 --use-conda --config max_kmers=1000 enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
   This will run a workflow that generates a small amount of test data (1000 kmers) for the sacCer3 organism and the cas9 enzyme, and runs the unit tests found in the tests folder.
   Mac users on Apple Silicon (M1/M2/M3 CPUs): One of the steps in the workflow adds cutting-efficiency values to the generated databases, and uses Python 2.7 code supplied by a different research lab. You will want to set the environment variable CONDA_SUBDIR to osx-64 to allow conda to use Rosetta 2 emulation for these steps. In other words, the command you will want to run is:
   CONDA_SUBDIR=osx-64 snakemake -F guidescan_pytest --cores 1 --use-conda --config max_kmers=1000 enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
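Before running the test workflow, it can help to confirm from Python that the guidescan binary really is visible in the activated environment. The following is a minimal sketch, not part of guidescanpy: the helper name tool_version is hypothetical, and it assumes only that the tool accepts the --version flag mentioned above.

```python
import shutil
import subprocess
import sys


def tool_version(executable, flag="--version"):
    """Return the version text printed by `executable`, or None if it is not on PATH.

    Hypothetical helper for illustration; it is not part of guidescanpy.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, flag], capture_output=True, text=True, check=False)
    # Some tools print their version to stderr, so fall back to it when stdout is empty.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = tool_version("guidescan")
    if version is None:
        print("guidescan not found on PATH; is the conda environment active?", file=sys.stderr)
    else:
        print(f"guidescan reports: {version}")
```

If the helper returns None, the usual cause is that the guidescan conda environment has not been activated in the current shell.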
To start working on guidescanpy, we will likely need some real data.
"Data" in guidescanpy consists of:
- A relational database to store chromosome and gene information for organisms. By default this is a local sqlite database (guidescan.db).
- BAM files that store on-target and off-target information for an organism + enzyme combination. For example, the sacCer3 organism + the cas9 enzyme combination will make up a single .bam file.
- Index files that allow guidescan to quickly search an organism's genomic sequence. For example, the sacCer3 organism's sequence will have a single index (each index is made up of 3 files, as we'll see shortly).
To generate sample data for sacCer3/cas9, repeat the step we ran in (6) above, but with minor variations:
cd docker/snakemake
snakemake --cores 1 --use-conda --config enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
Mac users on Apple Silicon (M1/M2/M3 CPUs): One of the steps in the workflow adds cutting-efficiency values to the generated databases, and uses Python 2.7 code supplied by a different research lab. You will want to set the environment variable CONDA_SUBDIR to osx-64 to allow conda to use Rosetta 2 emulation for these steps. In other words, the command you will want to run is:
CONDA_SUBDIR=osx-64 snakemake --cores 1 --use-conda --config enzymes="[\"cas9\"]" organisms="[\"sacCer3\"]"
This step will likely take a couple of hours. For other organisms, including hg38, it will take substantially longer. If you're impatient, you can download pre-generated BAM and index files from our website. See this link for instructions.
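The quoting in the --config arguments above is easy to get wrong when more organisms or enzymes are added. As a sketch only, the argument list can be assembled in Python, where json.dumps produces the list syntax and no shell escaping is needed. The function names snakemake_command and workflow_env are hypothetical; the flags and config keys are the ones used in the commands above.

```python
import json
import os
import shlex


def snakemake_command(organisms, enzymes, max_kmers=None, target=None):
    """Build the snakemake argument list used in this guide (hypothetical helper)."""
    cmd = ["snakemake"]
    if target is not None:
        # e.g. target="guidescan_pytest" for the test workflow, forced with -F
        cmd += ["-F", target]
    cmd += ["--cores", "1", "--use-conda", "--config"]
    if max_kmers is not None:
        cmd.append(f"max_kmers={max_kmers}")
    # json.dumps renders the Python lists in the ["cas9"] form used above
    cmd.append(f"enzymes={json.dumps(list(enzymes))}")
    cmd.append(f"organisms={json.dumps(list(organisms))}")
    return cmd


def workflow_env(apple_silicon=False):
    """Copy the current environment, adding CONDA_SUBDIR=osx-64 on Apple Silicon."""
    env = dict(os.environ)
    if apple_silicon:
        env["CONDA_SUBDIR"] = "osx-64"
    return env


if __name__ == "__main__":
    # Print the command rather than running it; the real run can take hours.
    print(shlex.join(snakemake_command(["sacCer3"], ["cas9"])))
```

To actually launch the workflow, the list and environment could be passed to subprocess.run(cmd, env=workflow_env(...), cwd="docker/snakemake"), assuming the script is started from the repository root.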
However you choose to generate the data, you will need to set two environment variables, which tell guidescanpy the location of the BAM files and index files. These are GUIDESCAN_BAM_PATH and GUIDESCAN_INDEX_PATH respectively.
In the following example, we have downloaded the sacCer3+cas9 BAM file in databases/cas9, and the sacCer3 index files in indices.
$ pwd
/home/joe/guidescan/data
$ tree
.
├── databases
│   └── cas9
│       └── sacCer3.bam.sorted
└── indices
    ├── sacCer3.index.forward
    ├── sacCer3.index.gs
    └── sacCer3.index.reverse
Note the folder structure - the BAM file is stored in a sub-folder <enzyme> (cas9 or cpf1) inside databases, and the index files are <organism>.index.<extension> inside indices. So we can set the 2 required environment variables as:
export GUIDESCAN_BAM_PATH=/home/joe/guidescan/data/databases
export GUIDESCAN_INDEX_PATH=/home/joe/guidescan/data/indices
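A quick way to confirm that the two variables point at a layout like the one above is to check for the expected files before starting the application. This is a minimal sketch under stated assumptions: the helper missing_data_files is hypothetical and not part of guidescanpy, and the file names (the .bam.sorted suffix and the three index extensions) are taken from the sacCer3 example, so they are assumed to generalise to other organisms.

```python
import os
from pathlib import Path

# Index file extensions as they appear in the example tree above.
INDEX_EXTENSIONS = ("forward", "gs", "reverse")


def missing_data_files(bam_root, index_root, organism, enzyme):
    """Return the expected data files that do not exist, as a list of Paths."""
    expected = [Path(bam_root) / enzyme / f"{organism}.bam.sorted"]
    expected += [Path(index_root) / f"{organism}.index.{ext}" for ext in INDEX_EXTENSIONS]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    bam_root = os.environ.get("GUIDESCAN_BAM_PATH")
    index_root = os.environ.get("GUIDESCAN_INDEX_PATH")
    if not bam_root or not index_root:
        print("Set GUIDESCAN_BAM_PATH and GUIDESCAN_INDEX_PATH first.")
    else:
        missing = missing_data_files(bam_root, index_root, "sacCer3", "cas9")
        for path in missing:
            print(f"missing: {path}")
        if not missing:
            print("All expected sacCer3/cas9 files were found.")
```

An empty result means the BAM file and all three index files for that organism and enzyme were found where this guide expects them.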
The guidescan.com website is made up of two parts - a Flask component which is the main web application, and a Celery task management component which handles long-running requests on the website. To start both of these, open up two terminal windows, and run the following commands. Both terminals need to have access to the environment variables we set above, so you may want to set those environment variables in your user profile.
--- Terminal 1 ---
conda activate guidescan
guidescan worker
--- Terminal 2 ---
conda activate guidescan
guidescan web
Note the link in the terminal when you run guidescan web (typically http://127.0.0.1:5001). This is the link you will use to open up the browser.
If you see a "Not Found" error (404) in the browser, append /py to the address bar.
Keep both terminals active while you're interacting with the web application.
If you generated/downloaded data only for sacCer3, you will obviously only be able to run queries for that organism.
- Start a new branch
cd <path_to_guidescanpy>
git checkout -b <your_awesome_branch_name>
- Install the pre-commit hook. This will allow you to identify style/formatting/coding issues every time you commit your code. Pre-commit automatically formats the files in your repository according to certain standards, and/or warns you if certain best practices are not followed.
pre-commit install
- Tweak/modify the code, make guidescanpy better! Send a PR towards the main branch.
Our CI will automatically run the pre-commit and pytest steps for PRs towards the protected branches, so running these steps on your local installation will prevent surprises for you later.
When you are done with development, deactivate the guidescan environment and return to (base) with the following command:
conda deactivate