diff --git a/README.md b/README.md
index b3c8553..3d924e0 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,19 @@
# Tree Nine
-Put diff files on an existing phylogenetic tree using [UShER](https://www.nature.com/articles/s41588-021-00862-7)'s `usher sampled` task with a bit of help from [SRANWRP](https://www.github.com/aofarrel/SRANWRP), followed by conversion of that tree to Taxonium, Newick, and Nextstrain formats. Samples' SNP distance is calculated and output as a distance matrix, and samples will be placed into clusters based on the distance.
+Master repo for the following:
+* The eponymous Tree Nine workflow: A WDL workflow for placing tuberculosis samples on an existing phylogenetic tree with UShER, converting that tree into multiple output formats, then clustering said samples based on genetic distance **<--- if you are a public health department, this is probably what you're looking for**
+* WDLizations of several common matUtils commands
+ * matUtils annotate
+ * matUtils extract (for subtrees and for file conversion)
+ * matUtils mask
+ * matUtils summarize
+* WDLization of [UShER](https://www.nature.com/articles/s41588-021-00862-7)
-Verified on Terra-Cromwell and miniwdl. Make sure to add `--copy-input-files` for miniwdl. Default inputs assume you're working with _Mycobacterium tuberculosis_, be sure to change them if you aren't working with that bacterium.
+For organizational sake, WDLizations of matUtils and/or UShER may be split off into different repo in the future.
-This repo also contains the following subworkflows:
-* [Annotate](./annotate.md)
-* [Convert to Nextstrain](./convert_to_nextstrain.md) (for viewing in Auspice, non-clade sample annotations, etc)
-* [Extract](./extract.md)
-* [Mask tree](./mask_tree.wdl)
-* [Mask subtree](./mask_subtree.wdl)
-* [Summarize](./summarize.md)
+## Getting Tree Nine running
+Tree Nine is verifed on Terra-Cromwell and miniwdl. Make sure to add `--copy-input-files` for miniwdl. **The clustering portion of the workflow may not work on ARM machines (including Apple Silicon) due to a known Rosetta incompatibility (everything else should be fine though).**
-## features
+## Tree Nine features
* Highly scalable, even on lower-end computes
* Can input a single pre-combined diff file
* Includes a sample input tree created from SRA data if no input tree is specified
@@ -22,15 +24,14 @@ This repo also contains the following subworkflows:
* Clustering is also performed after backmasking
* (optional) Create per-cluster Nextstrain subtrees
* (optional) Reroot the tree to a specified node
-* (optional) Backmask newly-added samples against each other to hide positions where any newly-added sample lacks data, then create a new set of trees based on the backmasked diff files
+* (optional) "Backmask" newly-added samples against each other to hide positions where any newly-added sample lacks data, then create a new set of trees based on the backmasked diff files
* Designed for highly clonal samples which have a plausible direct epidemiological relationship
* Backmasking can only be performed on samples which have a sample-level diff files
* (optional) Summarize input, reroot, and output trees with matutils
* (optional) Filter out positions by coverage at that position and/or entire samples by overall coverage
* (optional) Specify your own reference genome if you don't want to work with H37Rv
* (optional) Annotate clades via matutils with a specified annotation TSV
+* (optional) Upload to clusters to Microreact for visualization
## benchmarking
-Formal benchmarks have not been established, but a full run of placing 60 new TB samples on an existing 7000+ TB sample tree, conversion to taxonium and newick formats, distance matrixing, clustering finding, and creating cluster-specific Nextstrain trees executes in about five minutes on a 2019 Macbook Pro.
-
-Backmasking is the least scalable part of the pipeline. The comparison itself theoretically scales n2 and once the comparison is completed, n backmasked disk files must be written to the disk. We have observed that memory problems tend to arise during the file-writing part when n≥55 on a local machine. Runtime attributes are adjustable as task-level variables to aid with scaling on cloud backends, although we have seen the default handle 60 samples at a time without much issue.
\ No newline at end of file
+Formal benchmarks have not been established, but a full run of placing 60 new TB samples on an existing 7000+ TB sample tree, conversion to taxonium and newick formats, distance matrixing, clustering, and creating cluster-specific Nextstrain trees executes in about five minutes on a 2019 Macbook Pro.
\ No newline at end of file
diff --git a/attic/Test_Distance_Matrix_Methods.wdl b/attic/Test_Distance_Matrix_Methods.wdl
new file mode 100644
index 0000000..2d3b7b1
--- /dev/null
+++ b/attic/Test_Distance_Matrix_Methods.wdl
@@ -0,0 +1,267 @@
+version 1.0
+
+workflow Test_Distance_Matrix_Methods {
+ input {
+ File tree_nwk
+ File tree_pb
+ File samples
+ Int memory_gb = 32
+ Int cpu = 12
+ }
+
+ call create_distance_matrices {
+ input:
+ tree_nwk = tree_nwk,
+ tree_pb = tree_pb,
+ samples = samples,
+ memory = memory_gb,
+ cpu = cpu
+ }
+
+ meta {
+ description: "Let's see if PhyloDM really is as fast as they say"
+ }
+}
+
+task create_distance_matrices {
+ input {
+ File tree_nwk
+ File tree_pb
+ File samples
+ Int memory
+ Int cpu
+ }
+
+ command <<<
+ set -eux pipefail
+ pip install phylodm
+
+ python3 << CODE
+ TREE_AS_NWK = "~{tree_nwk}"
+ TREE_AS_PB = "~{tree_pb}"
+ MATUTILS_SUMMARY_SAMPLES = "~{samples}"
+ UNCLUSTERED_SAMPLES = set()
+
+ import sys
+ import csv
+ import time
+ import logging
+ import numpy as np
+ import bte
+ from phylodm import PhyloDM
+ import dendropy
+
+ np.set_printoptions(threshold=sys.maxsize)
+ logging.basicConfig(
+ format='[%(asctime)s] %(levelname)s %(message)s',
+ level=logging.INFO,
+ datefmt='%Y-%m-%d %H:%M:%S')
+ UINT8_MAX = np.iinfo(np.uint8).max # UNSIGNED!
+ UINT16_MAX = np.iinfo(np.uint16).max # UNSIGNED!
+ UINT32_MAX = np.iinfo(np.uint32).max # UNSIGNED!
+ UINT64_MAX = np.iinfo(np.uint64).max # UNSIGNED!
+ MATRIX_INTEGER_MAX = UINT64_MAX
+
+
+ with open(MATUTILS_SUMMARY_SAMPLES) as f: next(f); SAMPLES = sorted(line.split("\t")[0] for line in f)
+
+ def main():
+ # PhyloDM's native load_from_newick_path() seems to hate nwks from UShER, but
+ # for some reason it has no issue with them if passed through dendropy first?
+ dendro_tree = dendropy.Tree.get_from_path(TREE_AS_NWK, schema='newick')
+ phylodm_tree = PhyloDM.load_from_dendropy(dendro_tree)
+ matrix_start_time, matrix_name = time.time(), "PHYLODM"
+ phylodm_matrix = phylodm_tree.dm(norm=False)
+ logging.warning("[%s] Finished calculating matrix samples in %.2f sec", matrix_name, time.time() - matrix_start_time)
+
+ bte_tree = bte.MATree(TREE_AS_PB)
+ bte_matrix_simple = dist_matrix_and_get_subclusters(bte_tree, UINT32_MAX, SAMPLES, get_subclusters=False, matrix_name="SIMPLE BTE")
+ bte_matrix_checked = dist_matrix_and_get_subclusters(bte_tree, UINT32_MAX, SAMPLES, get_subclusters=True, matrix_name="DOUBLECHECK BTE")
+
+ assert bte_matrix_simple.shape == bte_matrix_checked.shape
+ assert bte_matrix_simple.shape == phylodm_matrix.shape
+ print(f"Matrixes all have the same shape: {phylodm_matrix.shape}")
+
+ compare_matrices(bte_matrix_simple, bte_matrix_checked, "bte_matrix_simple", "bte_matrix_checked", SAMPLES)
+ compare_matrices(bte_matrix_checked, bte_matrix_simple, "bte_matrix_checked", "bte_matrix_simple", SAMPLES)
+
+ # Check labels - note that phyloDM inserts spaces where the tree (and BTE) has underscores!
+ phylodm_labels = phylodm_tree.taxa()
+ bte_labels = SAMPLES
+ assert len(phylodm_labels) == len(bte_labels)
+ print(f"Matrixes all have the same number of labels: {len(phylodm_labels)}")
+ fixed_labels = list()
+ for label in phylodm_labels:
+ fixed_labels.append(label.replace(" ", "_"))
+ phylodm_labels = fixed_labels
+
+ if sorted(phylodm_labels) != bte_labels:
+ print(f"Total number of labels: {len(phylodm_labels)}")
+ print(f"Number of intersecting labels: {len(set(phylodm_labels).intersection(set(bte_labels)))}")
+ print(f"Symmetric difference: {set(phylodm_labels).symmetric_difference(set(bte_labels))}")
+ print("PhyloDM - bte")
+ print(set(phylodm_labels) - set(bte_labels))
+ print("bte - PhyloDM")
+ print(set(bte_labels) - set(phylodm_labels))
+
+ # 1. Create a mapping from labels to indices for the second matrix
+ label_to_idx_b = {label: i for i, label in enumerate(bte_labels)}
+
+ # 2. Get the new order of indices based on phylodm_labels
+ # (Assumes phylodm_labels and bte_labels contain the same set of strings)
+ new_indices = [label_to_idx_b[label] for label in phylodm_labels]
+
+ # 3. Reorder Matrix B to match Matrix A
+ # np.ix_ creates a mesh so we reorder rows AND columns
+ bte_matrix_simple_aligned = bte_matrix_simple[np.ix_(new_indices, new_indices)]
+
+ compare_matrices(phylodm_matrix, bte_matrix_simple_aligned, "phylodm_matrix", "bte_matrix_simple_aligned", SAMPLES)
+
+
+ def compare_matrices(matrix_a, matrix_b, a_name, b_name, labels):
+ diff_norm = np.linalg.norm(matrix_a - matrix_b)
+ print(f"[{a_name}, {b_name}] Frobenius Norm: {diff_norm}")
+ if diff_norm == 0.0:
+ print(f"[{a_name}, {b_name}] appear to be functionally indentical")
+ return
+ print(labels)
+ print(matrix_a)
+ print(labels)
+ print(matrix_b)
+ sample_differences = np.abs(matrix_a - matrix_b).mean(axis=1)
+ silliest_samples = sorted(zip(labels, sample_differences), key=lambda x: x[1], reverse=True)
+ print(f"[{a_name}, {b_name}] Silliest (most different) samples: {silliest_samples}")
+
+
+ def dist_matrix_and_get_subclusters(tree_to_matrix, subcluster_distance, samples: list, get_subclusters, matrix_name="SOME_MATRIX"):
+ matrix_start_time = time.time()
+ i_samples = samples # this was sorted() earlier so it should be sorted in matrix
+ j_ghost_index = 0
+ neighbors = []
+
+ # Currently using a 32-bit unsigned int matrix in hopes of less aggressive RAM usage
+ if MATRIX_INTEGER_MAX == UINT8_MAX:
+ matrix = np.full((len(samples),len(samples)), 0, dtype=np.uint8) # UNSIGNED!
+ elif MATRIX_INTEGER_MAX == UINT16_MAX:
+ matrix = np.full((len(samples),len(samples)), 0, dtype=np.uint16) # UNSIGNED!
+ else:
+ matrix = np.full((len(samples),len(samples)), 0, dtype=np.uint32) # UNSIGNED!
+
+ for i, this_samp in enumerate(i_samples):
+ definitely_in_a_cluster = False
+
+ # The pool of samples we allow for j shrinks by one with every iteration of i,
+ # in order to prevent calculating distances twice. (We can do this only because
+ # our matrix is square and we're starting with two equivalent sorted lists.)
+ #
+ # We keep track of how many shrinks we have done using "j_ghost_index".
+ #
+ # on first iteration:
+ # j_ghost = 1
+ # i = [A*, B, C, D]
+ # j = [B, C, D]
+ # ---> A:B, A:C, and A:D
+ # next:
+ # j_ghost = 2
+ # i = [A, B*, C, D]
+ # j = [C, D]
+ # ---> B:C and B:D (we already had B:A from previous iteration)
+ # next:
+ # j_ghost = 3
+ # i = [A, B, C*, D]
+ # j = [D]
+ # ---> C:D (already have C:A and C:B)
+ # next:
+ # j_ghost = 4
+ # i = [A, B, C, D*]
+ # j = []
+ # ---> finished
+ #
+ # We need to keep track of how many shrinks we have done with "j_ghost_index"
+ # in order to place these calculated values in the correct place on the matrix.
+ # j_ghost_index + enumerate(j_samples) = correct index for the j bit of the matrix
+ j_ghost_index += 1
+ j_samples = samples[j_ghost_index:]
+
+ for j, that_samp in enumerate(j_samples):
+
+ j_matrix = j + j_ghost_index
+ logging.debug("j %s, j_ghost_index %s, j_matrix %s, that_samp %s", j, j_ghost_index, j_matrix, that_samp)
+
+ this_node, that_node = this_samp, that_samp
+ LCA = tree_to_matrix.LCA([this_samp, that_samp]) # type str
+ logging.debug("%s:%s LCA is %s", this_samp, that_samp, LCA)
+ total_distance = sum_paths_to_LCA_plus_overflow_check(tree_to_matrix, this_node, that_node, LCA)
+ matrix[i][j_matrix], matrix[j_matrix][i] = total_distance, total_distance
+ if get_subclusters and total_distance <= subcluster_distance:
+ logging.debug(" %s and %s seem to be within a %sSNP-cluster (%s)", this_samp, that_samp, subcluster_distance, total_distance)
+ neighbors.append(tuple((this_samp, that_samp)))
+ definitely_in_a_cluster = True
+
+ # Consider samples A, B, C, D, and E. When i = A, j=B, so we calculate their distance, then assign the result to matrix[A][B]
+ # and matrix[B][A]. Then j=C, so we get the distance, assign matrix[A][C] and matrix[C][A], etc...
+ # Because the j array is shrinking per iteration of i, that can prevent definitely_in_a_cluster from being triggered if it
+ # ought to, which is why we need this bit below.
+ if get_subclusters and not definitely_in_a_cluster:
+ second_smallest_distance = np.partition(matrix[i], 1)[1] # second smallest, because smallest is self-self at 0
+ if second_smallest_distance <= subcluster_distance:
+ #logging.debug(" Oops, %s was already clustered! (closest sample is %s SNPs away)", this_samp, second_smallest_distance)
+ pass
+ else:
+ #logging.debug(" %s appears to be truly unclustered (closest sample is %s SNPs away)", this_samp, second_smallest_distance)
+ if subcluster_distance in (UINT32_MAX, 20): # pylint: disable=else-if-used # only add to global unclustered if it's not in a 20 SNP cluster
+ if this_samp in INITIAL_SAMPS:
+ UNCLUSTERED_SAMPLES.add(this_samp) # attempt to fix https://github.com/aofarrel/tree_nine/issues/41
+
+ # finished iterating, let's see what our clusters look like
+ #logging.info("Here is our matrix")
+ #logging.info(self.matrix)
+ # This doesn't print len(self.samples) because that was printed earlier already
+ logging.warning("[%s] Finished calculating matrix samples in %.2f sec", matrix_name, time.time() - matrix_start_time)
+ return matrix
+ #subclusters = self.get_true_clusters(neighbors, self.get_subclusters, subcluster_distance) # None if !get_subclusters
+ #return subclusters
+
+ def sum_paths_to_LCA_plus_overflow_check(tree_to_matrix, this_node, that_node, LCA):
+ this_path, that_path = 0,0
+ while tree_to_matrix.get_node(this_node).id != LCA:
+ this_node = tree_to_matrix.get_node(this_node) # type MATnode
+ this_path += this_node.branch_length # type float
+ logging.debug(f"{this_node.id}x{this_node.parent.id} branch length: {this_path}")
+ this_node = this_node.parent.id # type str
+ while tree_to_matrix.get_node(that_node).id != LCA:
+ that_node = tree_to_matrix.get_node(that_node) # type MATnode
+ that_path += that_node.branch_length # type float
+ logging.debug(f"{that_node.id}x{that_node.parent.id} branch length: {that_path}")
+ that_node = that_node.parent.id # type str
+ total_distance_i64 = this_path + that_path
+ logging.debug(f"Total distance is {total_distance_i64}")
+ if total_distance_i64 > MATRIX_INTEGER_MAX:
+ # this is a debug instead of a warning because it happens so often in the uint8 case
+ logging.debug("Total distance between %s and %s is %s, greater than integer maximum; will store as %s", this_node, that_node, total_distance_i64, MATRIX_INTEGER_MAX)
+ return convert_64int_to_whatever(MATRIX_INTEGER_MAX)
+ else:
+ return convert_64int_to_whatever(total_distance_i64)
+
+ def convert_64int_to_whatever(python_int64):
+ if MATRIX_INTEGER_MAX == UINT8_MAX:
+ return np.uint8(python_int64) # UNSIGNED!
+ elif MATRIX_INTEGER_MAX == UINT16_MAX:
+ return np.uint16(python_int64) # UNSIGNED!
+ else:
+ return np.uint32(python_int64) # UNSIGNED!
+
+ if __name__=="__main__":
+ main()
+
+ CODE
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: cpu
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4ash_2"
+ memory: memory + " GB"
+ }
+}
\ No newline at end of file
diff --git a/attic/benchmark_usher_gcp.wdl b/attic/benchmark_usher_gcp.wdl
new file mode 100644
index 0000000..1465edd
--- /dev/null
+++ b/attic/benchmark_usher_gcp.wdl
@@ -0,0 +1,161 @@
+version 1.0
+
+# !! This calls the Google Cloud metadata server and WILL NOT WORK on other backends, including local !!
+
+import "https://raw.githubusercontent.com/aofarrel/SRANWRP/v1.1.29/tasks/processing_tasks.wdl" as processing
+
+workflow Tree_Nine {
+ input {
+ Array[File] diffs
+ File? input_tree
+ File? existing_diffs
+ File? existing_samples
+
+ # matUtils/UShER options
+ Boolean detailed_clades = false
+ Float? max_low_coverage_sites
+
+ # output file names and prefixes, extension not included
+ String? comment
+ Array[String]? rename_samples
+ String out_prefix = "bigtree"
+ String out_diffs = "_combined"
+
+ # testing functions
+ Boolean concat_files_then_exit = false
+ }
+
+ call processing.cat_files as cat_diff_files {
+ input:
+ new_files_to_concat = diffs,
+ out_concat_file = out_prefix + out_diffs,
+ keep_only_unique_lines = false,
+ keep_only_unique_files = true, # STRICTLY NECESSARY UNLESS YOUR DATA *AND* SAMPLE IDS ARE DEDUPLICATED
+ quality_report_removal_threshold = max_low_coverage_sites,
+ out_sample_names = "samples_added",
+ new_files_override_sample_names = rename_samples,
+ king_file = existing_diffs,
+ king_file_sample_names = existing_samples,
+ and_then_exit_1 = concat_files_then_exit,
+ datestamp_main_files = true, # does not datestamp diffs
+ out_concat_extension = ".diff"
+ }
+ String optional_datestamp = cat_diff_files.today
+ String outfile_usher_tree_raw = out_prefix + optional_datestamp + "_raw.pb"
+
+ call usher_sampled_diff as usher_sampled_diff_one {
+ input:
+ detailed_clades = detailed_clades,
+ diff = cat_diff_files.outfile,
+ input_mat = input_tree,
+ output_mat = outfile_usher_tree_raw
+ }
+
+ call usher_sampled_diff as usher_sampled_diff_two {
+ input:
+ detailed_clades = detailed_clades,
+ diff = cat_diff_files.outfile,
+ input_mat = input_tree,
+ output_mat = outfile_usher_tree_raw
+ }
+
+ output {
+ String? out_comment = comment
+ File raw_tree_one = usher_sampled_diff_one.usher_tree
+ File raw_tree_two = usher_sampled_diff_two.usher_tree
+ }
+}
+
+
+task usher_sampled_diff {
+ input {
+ # main files -- for TB, do not include ref_genome, it's already baked in!
+ File diff
+ File? input_mat
+ File? ref_genome
+
+ # usher options
+ Int batch_size_per_process = 5
+ Boolean detailed_clades
+ Int optimization_radius = 0
+ Int max_parsimony_per_sample = 1000000
+ Int max_uncertainty_per_sample = 1000000
+ String output_mat = basename(select_first([input_mat, "debugtree"]), ".pb") + "_new.pb"
+
+ # WDL specific -- note that cpu does not directly set usher's
+ # threads argument, but it does affect the number of cores
+ # available for use (by default usher uses all available)
+ Int addldisk = 10
+ Int cpu = 40 # needed for CPDH but overkill for small numbers of samples -- 8 (yes, eight!) would do fine
+ Int memory = 32 # needed for CDPH but overkill for small numbers of samples -- 16 would do fine
+ Int preempt = 1
+ }
+
+ Int disk_size = ceil(size(diff, "GB")) + ceil(size(ref_genome, "GB")) + ceil(size(input_mat, "GB")) + addldisk
+ String D = if !(detailed_clades) then "" else "-D "
+
+ command <<<
+ if [[ "~{input_mat}" = "" ]]
+ then
+ i="/HOME/usher/example_tree/for_debugging_only__tb_7K_noQC_diffs_mask2ref.L.fixed.pb"
+ else
+ i="~{input_mat}"
+ fi
+
+ if [[ "~{ref_genome}" = "" ]]
+ then
+ ref="/HOME/usher/ref/Ref.H37Rv/ref.fa"
+ else
+ ref="~{ref_genome}"
+ fi
+
+ echo "~{input_mat}"
+ echo $i
+ echo "~{ref_genome}"
+ echo $ref
+ echo "----- CURRENT WORKDIR -----"
+ ls -lha
+ echo "---------- USHER ----------"
+
+ usher-sampled ~{D} --optimization_radius=~{optimization_radius} \
+ -e ~{max_uncertainty_per_sample} \
+ -E ~{max_parsimony_per_sample} \
+ --batch_size_per_process ~{batch_size_per_process} \
+ --diff "~{diff}" \
+ -i "$i" \
+ --ref "$ref" \
+ -o "~{output_mat}" >/dev/null 2>&1
+
+ echo "--- CLOUD VM DIAGNOSTICS ---"
+ apt-get update
+ apt-get install -y curl
+
+ echo -n "CPU Platform: "
+ curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/cpu-platform || echo "Unknown"
+
+ echo -n ""
+ echo -n "Machine Type: "
+ curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/machine-type | awk -F'/' '{print $NF}'
+
+ lscpu | grep -E "Model name|flags" | head -n 2
+
+ echo "------- DISK I/O TEST -------"
+ # write 1 GB of 0s as a test of disk speed
+ dd if=/dev/zero of=test_speed_file bs=1M count=1000 oflag=dsync 2>&1 | grep -E "copied|MB/s"
+ rm test_speed_file
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File usher_tree = output_mat
+ File? clades = "clades.txt" # only if detailed_clades = true
+ }
+
+}
\ No newline at end of file
diff --git a/sanity_check_diffs.wdl b/attic/sanity_check_diffs.wdl
similarity index 100%
rename from sanity_check_diffs.wdl
rename to attic/sanity_check_diffs.wdl
diff --git a/attic/tree_nein.wdl b/attic/tree_nein.wdl
new file mode 100644
index 0000000..d74cd50
--- /dev/null
+++ b/attic/tree_nein.wdl
@@ -0,0 +1,1499 @@
+version 1.0
+
+# "Batteries included" version of Tree Nine used for Firecloud API (FISS), which appears to not support Dockstore workflows
+
+import "https://raw.githubusercontent.com/aofarrel/SRANWRP/v1.1.29/tasks/processing_tasks.wdl" as processing
+import "https://raw.githubusercontent.com/aofarrel/dropkick/1.0.0/dropkick.wdl" as dropkick
+import "https://raw.githubusercontent.com/aofarrel/microreact_WDLs/refs/heads/main/share_projects_with_team_via_file.wdl"
+
+# User notes:
+# * If user doesn't define input_tree, a rudimentary 7K sample tree will serve as the base tree. This base tree
+# doesn't represent genetic diversity of MTBC well and should not be used for anything besides quick testing.
+# * Should be run with --copy-input-files on miniwdl (required if clustering, may work w/o it if not clustering)
+
+# Dev notes:
+# * Anything marked !ForwardReference is using a bogus fallback value with select_first() to coerce to not-optional
+
+workflow Tree_Nine {
+ input {
+ Array[File] diffs
+ Array[String]? entity_ids # will be necessary for lining up metadata in later versions
+ File? input_tree
+ File? existing_diffs
+ File? existing_samples
+ String? listener_bucket
+
+ # matUtils/UShER options
+ Boolean detailed_clades = false
+ Float? max_low_coverage_sites
+ File? matutils_clade_annotations
+ Boolean optimize = true
+ String? reroot_to_this_node
+ Boolean summarize_tree_before_placing_samples = false
+ Boolean summarize_tree_after_placing_samples = false
+
+ # related to clustering/distance matrix
+ Boolean identify_clusters = false
+ Boolean cluster_entire_tree = false
+ File? special_samples
+ File? persistent_denylist
+
+ # if you are running with persistent clusters, both of these must be filled in
+ # if you are identifying clusters ad-hoc, both of these must be undefined
+ File? persistent_cluster_meta
+ File? persistent_cluster_ids
+
+ # related to putting clusters on Microreact
+ Boolean upload_clusters_to_microreact = false
+ File? microreact_blank_template_json
+ File? microreact_decimated_template_json
+ File? microreact_key
+ File? microreact_update_template_json
+
+ # non-exclusive ways of sharing your microreact projects
+ String? microreact_share_email
+ String? microreact_share_team
+
+ # rarely used files (see parameter_meta)
+ Array[File]? coverage_reports
+ File? ref_genome # do not define this if you're using H37Rv!
+
+ # output file names and prefixes, extension not included
+ String? comment
+ Array[String]? rename_samples
+ Boolean datestamp_outs = true
+ String out_prefix = "bigtree"
+ String out_diffs = "_combined"
+
+ # testing functions
+ Boolean concat_files_then_exit = false
+ }
+
+ parameter_meta {
+ comment: "String that gets copied directly to output (useful for Terra data tables)"
+
+ diffs: "Array of single-sample MAPLE-formatted diff files from myco"
+
+ input_tree: "The base MAT tree (.pb) that samples will be placed upon; will fall back to a test tree on ~7K TB samples from SRA if not defined (test tree should ONLY be used for quick debugging; it is not representative of MTBC diversity nor proper sample QC)"
+
+ existing_diffs: "A pre-concatenated multi-sample .diff file which will be concatenated with the single-sample diffs input (requires existing_samples)"
+
+ existing_samples: "A file listing all samples within the multiple-sample existing_diffs file (requires existing_diffs)"
+
+ matutils_clade_annotations: "Two column TSV for clade annotation via matUtils"
+
+ cluster_entire_tree: "If true, matrix and cluster all samples on tree; if false, only matrix and cluster special_samples (if defined) or newly added samples."
+
+ cluster_max_distance: "Soft-maximum SNP distance between two samples for them to be in the same cluster. NOTE if cluster_max_distance=10, A:B=10, B:C=5, and A:C=15, then all three will still be in a cluster even though A:C is above cluster_max_distance, since both are within 10 of another sample in that cluster."
+
+ coverage_reports: "NOT USUALLY NEEDED - Single line text files generated by Lily's vcf to diff script, used to filter samples with low overall coverage. By default, myco's vcf_to_diff.py filters sites per site coverage, so this isn't typically needed here in Tree Nine."
+
+ detailed_clades: "usher_sampled_diff -D"
+
+ max_low_coverage_sites: "Maximum percentage of low coverage sites a sample can have before throwing it out (requires coverage_reports, does not apply to backmasked diffs)"
+
+ special_samples: "Provide an override file containing names of the only samples to consider for matrix and clustering. If this isn't defined, matrixing and clustering is done on either entire tree (if cluster_entire_tree) or all samples with a diff file (if not cluster_entire_tree)."
+
+ ref_genome: "Reference genome, equivalent to UShER's ref argument, default is H37Rv (M tuberculosis)"
+
+ rename_samples: "For file at index i in diffs[i], rename it to the corresponding string at rename_samples[i]."
+
+ reroot_to_this_node: "matUtils extract -y (Reroot the output tree relative to this node, leave blank to not reroot)"
+
+ out_prefix: "Prefix for all output files"
+
+ upload_clusters_to_microreact: "If you know, you know"
+ }
+
+ call processing.cat_files as cat_diff_files {
+ input:
+ new_files_to_concat = diffs,
+ out_concat_file = out_prefix + out_diffs,
+ keep_only_unique_lines = false,
+ keep_only_unique_files = true, # STRICTLY NECESSARY UNLESS YOUR DATA *AND* SAMPLE IDS ARE DEDUPLICATED
+ new_files_quality_reports = coverage_reports,
+ quality_report_removal_threshold = max_low_coverage_sites,
+ out_sample_names = "samples_added",
+ new_files_override_sample_names = rename_samples,
+ king_file = existing_diffs,
+ king_file_sample_names = existing_samples,
+ and_then_exit_1 = concat_files_then_exit,
+ datestamp_main_files = true, # does not datestamp diffs
+ out_concat_extension = ".diff"
+ }
+
+ File samples_considered_for_clustering = select_first([special_samples, cat_diff_files.first_lines, usher_sampled_diff.usher_tree]) #!ForwardReference
+
+ # Tree Nine attempts to use a clear naming scheme to make its large number of output files unambigious, but you might have a better
+ # system than I do, so I'm going to define all remaining major outfile-controlling variables here so you can edit it easily.
+
+ String empty_string = ""
+ if(!(datestamp_outs)) { String no_datestamp = "" }
+ String optional_datestamp = select_first([no_datestamp, cat_diff_files.today])
+ String presumed_input_mat_basename = basename(select_first([input_tree, "default_debug-only_basetree"]))
+
+ String outfile_annotated_input_tree = "input_" + presumed_input_mat_basename + optional_datestamp + ".pb"
+ String outfile_usher_tree_raw = out_prefix + optional_datestamp + "_raw.pb"
+ String outfile_usher_tree_optimized = basename(outfile_usher_tree_raw, "_raw.pb") + optional_datestamp + "_optimized.pb"
+ String outfile_usher_tree_annotated = basename(outfile_usher_tree_raw, "_raw.pb") + optional_datestamp + "_annotated.pb"
+ String outfile_usher_tree_rerooted = basename(outfile_usher_tree_raw, "_raw.pb") + optional_datestamp + "_reroot_to_" + select_first([reroot_to_this_node, empty_string]) + ".pb"
+ String outfile_taxonium_tree = basename(outfile_usher_tree_raw, "_raw.pb") + optional_datestamp + "_taxonium.jsonl.gz"
+ String outfile_nextstrain_tree = basename(outfile_usher_tree_raw, "_raw.pb") + optional_datestamp + ".json"
+ String outfile_nwk_matutils_tree = basename(outfile_usher_tree_raw, "_raw.pb") + optional_datestamp + ".nwk"
+ # There is also a nwk tree generated by the clustering script, that one is always called a000000.nwk and should be identical to the matutils one
+
+ String outfile_input_tree_summaries = "input_" + presumed_input_mat_basename + "_"
+ String outfile_usher_tree_summaries_before_reroot = out_prefix + optional_datestamp + "_before_reroot_"
+ String outfile_usher_tree_summaries_final = out_prefix + optional_datestamp + "_final_"
+
+
+ if(summarize_tree_before_placing_samples) {
+ if (defined(input_tree)) {
+
+ # iff there is a metadata tsv, annotate input tree with it before summarizing
+ if (defined(matutils_clade_annotations)) {
+
+ call annotate as annotate_input_tree {
+ input:
+ input_mat = select_first([input_tree, usher_sampled_diff.usher_tree]), #!ForwardReference
+ metadata_tsv = select_first([matutils_clade_annotations, usher_sampled_diff.usher_tree]), #!ForwardReference
+ outfile_mat = outfile_annotated_input_tree
+ }
+ }
+
+ File possibly_annotated_input_tree = select_first([annotate_input_tree.annotated_tree, input_tree])
+
+ call summarize as summarize_input_tree {
+ input:
+ input_mat = possibly_annotated_input_tree,
+ prefix_outs = outfile_input_tree_summaries
+ }
+ }
+ }
+
+ call usher_sampled_diff as usher_sampled_diff {
+ input:
+ detailed_clades = detailed_clades,
+ diff = cat_diff_files.outfile,
+ input_mat = input_tree,
+ output_mat = outfile_usher_tree_raw,
+ ref_genome = ref_genome
+ }
+
+ if (optimize) {
+ call matOptimize as matOptimize_usher {
+ input:
+ input_mat = usher_sampled_diff.usher_tree,
+ output_mat = outfile_usher_tree_optimized
+ }
+ }
+
+ File optimized_or_raw_tree = select_first([matOptimize_usher.optimized_tree, usher_sampled_diff.usher_tree])
+
+
+ if (defined(matutils_clade_annotations)) {
+ call annotate as annotate_usher {
+ input:
+ input_mat = optimized_or_raw_tree,
+ metadata_tsv = select_first([matutils_clade_annotations, usher_sampled_diff.usher_tree]), # bogus fallback
+ outfile_mat = outfile_usher_tree_annotated
+ }
+ }
+
+ File possibly_annotated_maximal_output_tree = select_first([annotate_usher.annotated_tree, optimized_or_raw_tree])
+
+ if(defined(reroot_to_this_node)) {
+
+ if(summarize_tree_after_placing_samples) {
+ call summarize as summarize_before_reroot {
+ input:
+ input_mat = possibly_annotated_maximal_output_tree,
+ prefix_outs = outfile_usher_tree_summaries_before_reroot
+ }
+ }
+
+ call reroot as reroot_usher {
+ input:
+ input_mat = possibly_annotated_maximal_output_tree,
+ reroot_to_this_node = select_first([reroot_to_this_node, ""]),
+ output_mat = outfile_usher_tree_rerooted
+ }
+ }
+
+ File final_maximal_output_tree = select_first([reroot_usher.rerooted_tree, possibly_annotated_maximal_output_tree])
+
+ # defined(matutils_clade_annotations) defined(reroot_to_this_node) final_maximal_output_tree
+ # ----------------------------------------------------------------------------------------------------------------
+ # true true annotated and rerooted
+ # true false annotated
+ # false true rerooted
+ # false false neither, just the output matOptimize_usher.optimized_tree
+
+ call convert_to_taxonium as to_taxonium {
+ input:
+ input_mat = final_maximal_output_tree,
+ outfile_taxonium = outfile_taxonium_tree
+ }
+
+ if (identify_clusters) {
+ call cluster_CDPH_method as cluster {
+ input:
+ shareemail = microreact_share_email,
+ input_mat_with_new_samples = final_maximal_output_tree,
+ special_samples = samples_considered_for_clustering,
+ combined_diff_file = cat_diff_files.outfile,
+ only_matrix_special_samples = !(cluster_entire_tree),
+ persistent_ids = persistent_cluster_ids,
+ persistent_cluster_meta = persistent_cluster_meta,
+ microreact_key = microreact_key,
+ microreact_update_template_json = microreact_update_template_json,
+ microreact_blank_template_json = microreact_blank_template_json,
+ microreact_decimated_template_json = microreact_decimated_template_json,
+ persistent_denylist = persistent_denylist,
+ upload_clusters_to_microreact = upload_clusters_to_microreact,
+ datestamp = cat_diff_files.today,
+ #metadata_fields = metadata_fields,
+ #metadata_values = metadata_values
+ }
+
+ # This is some trickery to prevent Cromwell from complaining about us putting an "optional" output
+ # into a non-optional task input. We can do this because the "optional" output actually gets created
+ # in non-error cases (at least, this is the case with how we call the task here in Tree Nine)
+ # so it will never actually fall back on optimized_or_raw_tree -- and if it does error, the entire
+ # pipeline crashes so what happens here is moot.
+ File coerced_cluster_json = select_first([cluster.final_cluster_information_json, optimized_or_raw_tree])
+ File coerced_unclustered_txt = select_first([cluster.unclustered_samples, optimized_or_raw_tree])
+
+ if (defined(listener_bucket)) {
+ # More coercion workarounds here, this one is even sillier because we're in a defined() block, alas
+ # this is required.
+ String coerced_destination_bucket = select_first([listener_bucket, "nonsense fallback value"])
+ call dropkick.Dropkick_Curl as upload_cluster_json {
+ input:
+ destination_bucket = coerced_destination_bucket,
+ files_to_upload = [coerced_cluster_json]
+ }
+
+ call dropkick.Dropkick_Curl as upload_unclustered_txt {
+ input:
+ destination_bucket = coerced_destination_bucket,
+ files_to_upload = [coerced_unclustered_txt]
+ }
+ }
+
+ if (defined(microreact_share_team)) {
+ if (defined(microreact_key)) {
+ if (defined(cluster.updated_mr_URIs_file)) { # must explictly check as it could be undefined if nothing got updated
+ File coerced_microeract_key = select_first([microreact_key, optimized_or_raw_tree])
+ String coerced_microreact_share_team = select_first([microreact_share_team, "nonsense fallback value"])
+ File coerced_updated_mr_URIs_file = select_first([cluster.updated_mr_URIs_file, optimized_or_raw_tree])
+ call share_projects_with_team_via_file.Microreact_Share_Projects_With_Team {
+ input:
+ token = coerced_microeract_key,
+ team_uri = coerced_microreact_share_team,
+ project_uris = coerced_updated_mr_URIs_file
+ }
+ }
+ }
+ }
+
+ call convert_to_nextstrain_single_terra_compatiable as to_nextstrain_cluster {
+ input:
+ input_mat = final_maximal_output_tree,
+ outfile_nextstrain = outfile_nextstrain_tree,
+ one_metadata_file = cluster.samp_cluster_ten
+ }
+ }
+
+ if (!(identify_clusters)) {
+ call convert_to_nextstrain_single_terra_compatiable as to_nextstrain {
+ input:
+ input_mat = final_maximal_output_tree,
+ outfile_nextstrain = outfile_nextstrain_tree
+ }
+ }
+
+
+ if(summarize_tree_after_placing_samples) {
+ call summarize as summarize_final {
+ input:
+ input_mat = final_maximal_output_tree,
+ prefix_outs = outfile_usher_tree_summaries_final
+ }
+ }
+
+ output {
+ String? out_comment = comment
+
+ # This has a ton of outputs and we want them to be easily viewable in Terra's UI, so they have a consistent naming scheme:
+ #
+ # [BM/NB/IN]_[BIG/CLS/UNC]_[THING]_[FILETYPE]
+ # where:
+ # * BM = backmask, NB = not backmasked, IN = "raw" input
+ # * BIG = big tree, CLS = cluster, UNC = unclustered
+
+ # big trees - protobuff
+ #
+ # note that tree_usher_rerooted is annotated if defined(matutils_clade_annotations), but tree_usher_annotated is NOT rerooted
+ # even if defined(reroot_to_this_node) -- this was done on purpose so people can get two annotated trees if they
+ # want to easily compare the tree before and after rerooting
+ #
+ File BIG_tree_usher = optimized_or_raw_tree
+ File BIG_tree_usher_raw_dont_use = usher_sampled_diff.usher_tree
+ File? BIG_tree_reroot = reroot_usher.rerooted_tree
+ File? BIG_tree_ushanno = annotate_usher.annotated_tree
+
+ # big trees - other formats
+ #
+ # iff defined(reroot_to_this_node), these are based on usher_tree_rerooted
+ # else, these are based on usher_tree_raw (and usher_tree_rerooted doesn't exist)
+ #
+ File? BIG_tree_nwk_raw = cluster.bigtree_raw
+ File? BIG_tree_nwk_gen = cluster.bigtree_gen
+ File BIG_tree_taxonium = to_taxonium.taxonium_tree
+ File? BIG_tree_json_noanno = to_nextstrain.nextstrain_singular_tree
+ File? BIG_tree_json_clusteranno = to_nextstrain_cluster.nextstrain_singular_tree
+
+ # cluster subtrees
+ # ultimately derived from nb_big_tree_nwk/bm_big_tree_nwk
+ #
+ #Array[File]? CLUSTER_trees_json = cluster.cluster_trees_json
+ #Array[File]? CLUSTER_trees_nwk = cluster.acluster_trees
+ #Array[File]? BM_CLUSTER_trees_json = cluster.cluster_trees_json
+ Array[File]? BM_CLUSTER_trees_nwk = cluster.bcluster_trees
+
+ # distance matrices
+ File? BIG_matrix_nb = cluster.bigtree_matrix
+ #Array[File]? CLUSTER_dmatrices = cluster.acluster_matrices
+ Array[File]? BM_CLUSTER_dmatrices = cluster.bcluster_matrices
+
+ # other cluster information
+ File updated_diff_file = cat_diff_files.outfile
+ File updated_diff_contents = samples_considered_for_clustering
+ File? updated_persistent_ids = cluster.new_persistent_ids
+ File? updated_persistent_meta = cluster.new_persistent_meta
+ File? updated_cluster_information_json = cluster.final_cluster_information_json
+ Int? n_new_samps_input = cat_diff_files.files_input
+ Int? n_new_samps_skipped = cat_diff_files.files_removed
+ Int? n_20SNP_clusters = cluster.n_big_clusters
+ Int? n_samps_unclustered = cluster.n_unclustered
+ Int? n_samps_clustered = cluster.n_samples_in_clusters
+ Int? n_samps_processed = cluster.n_samples_processed
+
+ # unclustered stuff
+ File? unclustered_neighbors = cluster.all_nearest_relatives
+ File? unclusted_samples = cluster.unclustered_samples
+ #File nb_unc_tree_nwk = cluster.unclustered_tree_nwk
+ #Array[File]? unclustered_subtrees = cluster.unclustered_subtrees
+
+ # summaries
+ File? info_new_samples = cluster.new_samples
+ File? in_summary = summarize_input_tree.summary
+ File? nb_summary_preroot = summarize_before_reroot.summary
+ File? nb_summary_final = summarize_final.summary
+
+ # sample information
+ File? in_list_samples = summarize_input_tree.samples
+ File? nb_list_samples_preroot = summarize_before_reroot.samples # iff defined(reroot_to_this_node)
+ File? nb_list_samples_final = summarize_final.samples
+
+ #Array[String] samples_processed = read_lines(samples_considered_for_clustering) # non-array version also exists
+ Array[String] samples_dropped = cat_diff_files.removed_files
+
+ }
+}
+
+task extract {
+ input {
+ File input_mat
+ File samples
+ Int? nearest_k
+
+ Int addldisk = 10
+ Int cpu = 8
+ Int memory = 16
+ Int preempt = 1
+ }
+ Int disk_size = ceil(size(input_mat, "GB")) + addldisk
+ String output_mat = basename(input_mat, ".pb") + ".subtree" + ".pb"
+
+ command <<<
+ if [[ "~{nearest_k}" = "" ]]
+ then
+ matUtils extract -i ~{input_mat} -s ~{samples} -o ~{output_mat}
+ else
+ matUtils extract -i ~{input_mat} -K ~{samples}:~{nearest_k} -o ~{output_mat}
+ fi
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File subtree = output_mat
+ }
+}
+
+task mask {
+ input {
+ File input_mat
+ File mask_tsv
+
+ Int addldisk = 10
+ Int cpu = 8
+ Int memory = 16
+ Int preempt = 1
+ }
+ Int disk_size = ceil(size(input_mat, "GB")) + addldisk
+ String output_mat = basename(input_mat, ".pb") + ".masked" + ".pb"
+
+ command <<<
+ matUtils mask -i ~{input_mat} --mask-mutations ~{mask_tsv} -o ~{output_mat}
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File masked_tree = output_mat
+ }
+}
+
+task reroot {
+ input {
+ File input_mat
+ String reroot_to_this_node
+ String output_mat = basename(input_mat, ".pb") + ".reroot_to_~{reroot_to_this_node}" + ".pb"
+
+ Int addldisk = 10
+ Int cpu = 8
+ Int memory = 16
+ Int preempt = 1
+ }
+ Int disk_size = ceil(size(input_mat, "GB")) + addldisk
+
+ command <<<
+ if [[ "~{reroot_to_this_node}" = "" ]]
+ then
+ echo "You need to specify the node to reroot upon"
+ exit 1
+ fi
+ matUtils extract -i "~{input_mat}" -y "~{reroot_to_this_node}" -o "~{output_mat}"
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File rerooted_tree = output_mat
+ }
+}
+
+task summarize {
+ # Generates most of the possible outputs of matUtils summarize:
+ #
+ # --samples (-s): Write a two-column tsv listing all samples in the tree and their parsimony score (terminal branch length). Auspice-compatible.
+ # --clades (-c): Write a tsv listing all clades and the count of associated samples in the tree.
+ # --mutations (-m): Write a tsv listing all mutations in the tree and their occurrence count.
+ # --aberrant (-a): Write a tsv listing potentially problematic nodes, including duplicates and internal nodes with no mutations and/or branch length 0.
+ # --haplotype (-H): Write a tsv listing haplotypes represented by comma-delimited lists of mutations and their count across the tree.
+ # --sample-clades (-C): Write a tsv listing all samples and their clades.
+ # --calculate-roho (-R): Write a tsv listing, for each mutation occurrence that is valid, the number of offspring and other numbers for RoHo calculation.
+ #
+ # Two outputs are not generated:
+ # * expanded_roho: this slows things down too much
+ # * translate: this would require taking in a gtf and ref genome
+
+ input {
+ File? input_mat
+ String? prefix_outs
+
+ Int addldisk = 10
+ Int cpu = 8
+ Int memory = 16
+ Int preempt = 1
+ }
+ Int disk_size = if defined(input_mat) then ceil(size(input_mat, "GB")) + addldisk else addldisk
+ String prefix = select_first([prefix_outs, ""])
+
+ command <<<
+ if [[ "~{input_mat}" = "" ]]
+ then
+ i="/HOME/usher/example_tree/for_debugging_only__tb_7K_noQC_diffs_mask2ref.L.fixed.pb"
+ else
+ i="~{input_mat}"
+ fi
+
+ matUtils summary -i "$i" > "~{prefix}.summary.txt"
+ matUtils summary -i "$i" -A # samples, clades, mutations, aberrant
+ matUtils summary -i "$i" -H haplotypes.tsv
+ matUtils summary -i "$i" -C sample_clades.tsv
+ matUtils summary -i "$i" -R roho.tsv
+ for file in *.tsv
+ do
+ mv -- "$file" "~{prefix}.${file}"
+ done
+
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File summary = prefix + ".summary.txt"
+ File samples = prefix + ".samples.tsv"
+ File clades = prefix + ".clades.tsv"
+ File mutations = prefix + ".mutations.tsv"
+ File aberrant = prefix + ".aberrant.tsv"
+ File haplotype = prefix + ".haplotypes.tsv"
+ File sample_clades = prefix + ".sample_clades.tsv"
+ File calculate_roho = prefix + ".roho.tsv"
+ }
+}
+
+task annotate {
+ input {
+ File? input_mat
+ File metadata_tsv # only can annotate one column at a time
+ String outfile_mat
+
+ Int addldisk = 10
+ Int cpu = 8
+ Int memory = 16
+ Int preempt = 1
+ }
+ Int disk_size = ceil(size(input_mat, "GB")) + ceil(size(metadata_tsv, "GB")) + addldisk
+
+ command <<<
+ matUtils annotate -i "~{input_mat}" -P "~{metadata_tsv}" -o "~{outfile_mat}"
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File annotated_tree = outfile_mat
+ }
+}
+
+task convert_to_newick_subtrees_by_cluster {
+ input {
+ File input_mat
+ File metadata_tsv
+ File grouped_clusters
+ Int context_samples
+ Int memory = 32
+ Boolean debug = true
+ String prefix = ""
+ }
+
+ command <<<
+ i=2
+ number_of_clusters=$(wc -l ~{grouped_clusters} | awk '{ print $1 }')
+ if [ ~{debug} = "true" ]; then printf "while %s < %s\n" "$i" "$number_of_clusters"; fi
+ # shellcheck disable=SC2086
+ # We are going to copy groups.tsv within this loop, but that shouldn't cause issues.
+ # Older versions of this task mv'd groups.txt and were fine, so this should be extra-safe.
+ while [ $i -le $number_of_clusters ]
+ do
+ head -$i "~{grouped_clusters}" | tail -n 1 > groups.tsv
+ if [ ~{debug} = "true" ]
+ then
+ printf "line %s grouped clusters now in groups.tsv as:\n" "$i"
+ cat groups.tsv
+ printf "\n"
+ fi
+ # shellcheck disable=SC2094
+ while IFS=" " read -r cluster samples
+ do
+ # shellcheck disable=SC2086
+ echo $samples > this_cluster_samples.txt
+ sed -i 's/,/\n/g' this_cluster_samples.txt
+ number_of_samples_in_cluster=$(wc -l this_cluster_samples.txt | awk '{ print $1 }')
+ minimum_tree_size=$((number_of_samples_in_cluster+~{context_samples}))
+ if [ ~{debug} = "true" ]
+ then
+ printf "%s in cluster + ~{context_samples} context: expecting %s samples in output" "$number_of_samples_in_cluster" "$minimum_tree_size"
+ printf "passing this_cluster_samples.txt:\n"
+ cat this_cluster_samples.txt
+ printf "matutils extract -i ~{input_mat} -t %s -s this_cluster_samples.txt -N %s \n" "$cluster" "$minimum_tree_size"
+ fi
+ matUtils extract -i "~{input_mat}" -t "~{prefix}$cluster" -s this_cluster_samples.txt -N $minimum_tree_size -M "~{metadata_tsv}"
+ if [ ~{debug} = "true" ]
+ # for some reason, subtrees seem to end up with .nw as their extension
+ for tree in *.nw; do
+ mv -- "$tree" "${tree%.nw}.nwk"
+ done
+ mv subtree-assignments.tsv "$cluster-subtree-assignments.tsv"
+ cp groups.tsv "$cluster-groups.tsv"
+ then
+ printf "Finished %s and moving files.\n" "$cluster"
+ ls -lha
+ fi
+ i=$((i+1))
+ done < groups.tsv
+ rm groups.tsv
+ done
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 12
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: 1
+ }
+
+ output {
+ Array[File] newick_subtrees = glob("*.nwk")
+ Array[File] subtree_assignments = glob("*-subtree-assignments.tsv")
+ Array[File] groups = glob("*groups.tsv")
+ Array[File] metadata_tsvs = glob("*.tsv") # for auspice.us, which supports nwk
+ }
+}
+
+task convert_to_nextstrain_subtrees_by_cluster {
+ input {
+ File input_mat
+ File metadata_tsv
+ File grouped_clusters
+ Int context_samples
+ Int memory = 32
+ Boolean debug = true
+ String prefix = ""
+ }
+
+ command <<<
+ i=2
+ number_of_clusters=$(wc -l ~{grouped_clusters} | awk '{ print $1 }')
+ if [ ~{debug} = "true" ]; then printf "while %s < %s\n" "$i" "$number_of_clusters"; fi
+ # shellcheck disable=SC2086
+ # We are going to copy groups.tsv within this loop, but that shouldn't cause issues.
+ # Older versions of this task mv'd groups.txt and were fine, so this should be extra-safe.
+ while [ $i -le $number_of_clusters ]
+ do
+ head -$i "~{grouped_clusters}" | tail -n 1 > groups.tsv
+ if [ ~{debug} = "true" ]
+ then
+ printf "line %s grouped clusters now in groups.tsv as:\n" "$i"
+ cat groups.tsv
+ printf "\n"
+ fi
+ # shellcheck disable=SC2094
+ while IFS=" " read -r cluster samples
+ do
+ # shellcheck disable=SC2086
+ echo $samples > this_cluster_samples.txt
+ sed -i 's/,/\n/g' this_cluster_samples.txt
+ number_of_samples_in_cluster=$(wc -l this_cluster_samples.txt | awk '{ print $1 }')
+ minimum_tree_size=$((number_of_samples_in_cluster+~{context_samples}))
+ if [ ~{debug} = "true" ]
+ then
+ printf "%s in cluster + ~{context_samples} context: expecting %s samples in output" "$number_of_samples_in_cluster" "$minimum_tree_size"
+ printf "passing this_cluster_samples.txt:\n"
+ cat this_cluster_samples.txt
+ printf "matutils extract -i ~{input_mat} -j %s -s this_cluster_samples.txt -N %s -M ~{metadata_tsv}\n" "$cluster" "$minimum_tree_size"
+ fi
+ matUtils extract -i "~{input_mat}" -j "~{prefix}$cluster" -s this_cluster_samples.txt -N $minimum_tree_size -M "~{metadata_tsv}"
+ mv subtree-assignments.tsv "$cluster-subtree-assignments.tsv"
+ cp groups.tsv "$cluster-groups.tsv"
+ i=$((i+1))
+ done < groups.tsv
+ rm groups.tsv
+ done
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 12
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: 1
+ }
+
+ output {
+ Array[File] nextstrain_subtrees = glob("*.json")
+ Array[File] subtree_assignments = glob("*-subtree-assignments.tsv")
+ Array[File] groups = glob("*groups.tsv")
+ }
+}
+
+
+task convert_to_nextstrain_subtrees {
+ # based loosely on Marc Perry's version
+ input {
+ File input_mat # aka tree_pb
+
+ Int memory = 32
+ Array[File?] metadata_files
+ Int? nearest_k
+ String outfile_nextstrain = "nextstrain"
+ File? selected_samples
+ Int treesize = 0
+ }
+
+ String metadata = if length(metadata_files) > 0 then "-M" else ""
+
+ command <<<
+ METAFILES_OR_EMPTY="~{sep=',' metadata_files}"
+ if [[ "~{selected_samples}" == "" ]]
+ then
+ matUtils extract -i ~{input_mat} -S sample_paths.txt
+ cut -f1 sample_paths.txt | tail -n +2 > sample.ids
+ if [[ "~{nearest_k}" == "" ]]
+ then
+ matUtils extract -i ~{input_mat} -j ~{outfile_nextstrain} -s sample.ids -N ~{treesize} ~{metadata} $METAFILES_OR_EMPTY
+ else
+ matUtils extract -i ~{input_mat} -j ~{outfile_nextstrain} -K sample.ids:~{nearest_k} -N ~{treesize} ~{metadata} $METAFILES_OR_EMPTY
+ fi
+ else
+ if [[ "~{nearest_k}" == "" ]]
+ then
+ matUtils extract -i ~{input_mat} -j ~{outfile_nextstrain} -s ~{selected_samples} -N ~{treesize} ~{metadata} $METAFILES_OR_EMPTY
+ else
+ matUtils extract -i ~{input_mat} -j ~{outfile_nextstrain} -K ~{selected_samples}:~{nearest_k} -N ~{treesize} ~{metadata} $METAFILES_OR_EMPTY
+ fi
+ fi
+ ls -lha
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 12
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: 1
+ }
+
+ output {
+ Array[File] nextstrain_subtrees = glob("*.json")
+ }
+}
+
+task convert_to_nextstrain_single {
+ input {
+ File input_mat # aka tree_pb
+ Int memory = 32
+ String outfile_nextstrain
+ Array[File?] metadata_files
+ }
+
+ String metadata = if length(metadata_files) > 0 then "-M" else ""
+
+ command <<<
+ METAFILES_OR_EMPTY="~{sep=',' metadata_files}"
+ matUtils extract -i ~{input_mat} ~{metadata} $METAFILES_OR_EMPTY -j ~{outfile_nextstrain}
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 12
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: 1
+ }
+
+ output {
+ File nextstrain_singular_tree = outfile_nextstrain
+ }
+}
+
+task convert_to_nextstrain_single_terra_compatiable {
+ input {
+ File input_mat # aka tree_pb
+ Int memory = 32
+ String outfile_nextstrain
+ File? one_metadata_file
+ }
+
+ command <<<
+ if [ "~{one_metadata_file}" != "" ]
+ then
+ matUtils extract -i ~{input_mat} -M ~{one_metadata_file} -j ~{outfile_nextstrain}
+ else
+ matUtils extract -i ~{input_mat} -j ~{outfile_nextstrain}
+ fi
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 12
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: 1
+ }
+
+ output {
+ File nextstrain_singular_tree = outfile_nextstrain
+ }
+}
+
+task convert_to_newick {
+ input {
+ File input_mat
+ String outfile_nwk
+ }
+
+ command <<<
+ matUtils extract -i ~{input_mat} -t ~{outfile_nwk}
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 8
+ disks: "local-disk " + 100 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: 8 + " GB"
+ preemptible: 1
+ }
+
+ output {
+ File newick_tree = outfile_nwk
+ }
+}
+
+task usher_sampled_diff {
+ input {
+ # main files -- for TB, do not include ref_genome, it's already baked in!
+ File diff
+ File? input_mat
+ File? ref_genome
+
+ # usher options
+ Int batch_size_per_process = 5
+ Boolean detailed_clades
+ Int optimization_radius = 0
+ Int max_parsimony_per_sample = 1000000
+ Int max_uncertainty_per_sample = 1000000
+ String output_mat = basename(select_first([input_mat, "debugtree"]), ".pb") + "_new.pb"
+
+ # WDL specific -- note that cpu does not directly set usher's
+ # threads argument, but it does affect the number of cores
+ # available for use (by default usher uses all available)
+ Int addldisk = 10
+ Int cpu = 40 # needed for CPDH but overkill for small numbers of samples -- 8 (yes, eight!) would do fine
+ Int memory = 32 # needed for CDPH but overkill for small numbers of samples -- 16 would do fine
+ Int preempt = 1
+ }
+
+ Int disk_size = ceil(size(diff, "GB")) + ceil(size(ref_genome, "GB")) + ceil(size(input_mat, "GB")) + addldisk
+ String D = if !(detailed_clades) then "" else "-D "
+
+ command <<<
+ if [[ "~{input_mat}" = "" ]]
+ then
+ i="/HOME/usher/example_tree/for_debugging_only__tb_7K_noQC_diffs_mask2ref.L.fixed.pb"
+ else
+ i="~{input_mat}"
+ fi
+
+ if [[ "~{ref_genome}" = "" ]]
+ then
+ ref="/HOME/usher/ref/Ref.H37Rv/ref.fa"
+ else
+ ref="~{ref_genome}"
+ fi
+
+ echo "~{input_mat}"
+ echo $i
+ echo "~{ref_genome}"
+ echo $ref
+ echo "------------------"
+ ls -lha
+ echo "------------------"
+
+ usher-sampled ~{D} --optimization_radius=~{optimization_radius} \
+ -e ~{max_uncertainty_per_sample} \
+ -E ~{max_parsimony_per_sample} \
+ --batch_size_per_process ~{batch_size_per_process} \
+ --diff "~{diff}" \
+ -i "$i" \
+ --ref "$ref" \
+ -o "~{output_mat}" >/dev/null 2>&1
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4_4"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File usher_tree = output_mat
+ File? clades = "clades.txt" # only if detailed_clades = true
+ }
+
+}
+
+task matOptimize {
+ input {
+ File input_mat
+ Int max_hours = 1
+ Int? max_iterations
+ Float min_improvement = 0.00000001
+ String output_mat = basename(input_mat, ".pb") + "_optimized.pb"
+
+ Int addldisk = 100
+ Int cpu = 24
+ Int memory = 32
+ Int preempt = 1
+ }
+
+ Int disk_size = ceil(size(input_mat, "GB")) + addldisk
+
+ command <<<
+ if [[ ! "~{max_iterations}" = "" ]]
+ then
+ MAX_ITERATIONS="--max-iterations ~{max_iterations}"
+ else
+ MAX_ITERATIONS=""
+ fi
+ # shellcheck disable=SC2086
+ matOptimize -i "~{input_mat}" --max-hours ~{max_hours} --min-improvement ~{min_improvement} $MAX_ITERATIONS -o "~{output_mat}"
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4ash_1"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File optimized_tree = output_mat
+ }
+}
+
+task convert_to_taxonium {
+ input {
+ File input_mat
+ String outfile_taxonium
+
+ Int addldisk = 100
+ Int cpu = 12
+ Int memory = 16
+ Int preempt = 1
+ }
+
+ Int disk_size = ceil(size(input_mat, "GB")) + addldisk
+
+ command <<<
+ echo "booted into Docker successfully"
+ echo "input file: ~{input_mat}"
+ ls -lha ~{input_mat}
+ echo "running usher_to_taxonium..."
+ usher_to_taxonium -i "~{input_mat}" -o "~{outfile_taxonium}"
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + disk_size + " SSD"
+ docker: "ashedpotatoes/sranwrp:1.1.6"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ File taxonium_tree = outfile_taxonium
+ }
+}
+
+task matrix {
+ input {
+ File input_nwk
+ Boolean only_matrix_special_samples
+ File? special_samples
+ String? outfile_matrix
+ }
+
+ command <<<
+ wget https://raw.githubusercontent.com/aofarrel/parsevcf/1.3.1/distancematrix_nwk.py
+ if [[ "~{only_matrix_special_samples}" = "true" ]]
+ then
+ samples=$(< "~{special_samples}" tr -s '\n' ',' | head -c -1)
+ echo "Samples that will be in the distance matrix: $samples"
+ python3 distancematrix_nwk.py "~{input_nwk}" --samples "$samples" -v
+ else
+ python3 distancematrix_nwk.py "~{input_nwk}" -v
+ fi
+ if [[ "~{outfile_matrix}" != "" ]]
+ then
+ icky_filename=$(find . -maxdepth 1 -name "*.tsv")
+ mv "$icky_filename" "~{outfile_matrix}"
+ fi
+ >>>
+
+ runtime {
+ cpu: 8
+ disks: "local-disk " + 100 + " SSD"
+ docker: "ashedpotatoes/sranwrp:1.1.15"
+ memory: 8 + " GB"
+ preemptible: 1
+ }
+
+ output {
+ File out_matrix = glob("*.tsv")[0]
+ }
+
+}
+
+task matrix_and_find_clusters {
+ input {
+ File input_nwk
+ File? persistent_cluster_tsv
+ Boolean only_matrix_special_samples
+ File? special_samples
+ Int distance
+ Int cpu = 8
+ Int memory = 8
+ }
+
+ command <<<
+ # temporarily overwriting DM script in the docker image until it's more stable
+ rm /scripts/distancematrix_nwk.py
+ wget https://raw.githubusercontent.com/aofarrel/parsevcf/refs/heads/slight-refactor/distancematrix_nwk.py
+ mv distancematrix_nwk.py /scripts/distancematrix_nwk.py
+
+ # TODO: persistent cluster IDs will kind of break on the "lonely" cluster. modify marc's script so
+ # that if a sample in previous run was "lonely" and in most recent run is not "lonely", give it a
+ # new name (that hasn't been used before) and mark it as a previously-run-newly-clustered sample.
+ if [[ "~{only_matrix_special_samples}" = "true" ]]
+ then
+ samples=$(< "~{special_samples}" tr -s '\n' ',' | head -c -1)
+ echo "Samples that will be in the distance matrix: $samples"
+ python3 /scripts/distancematrix_nwk.py "~{input_nwk}" --samples "$samples" -d ~{distance}
+ else
+ python3 /scripts/distancematrix_nwk.py "~{input_nwk}" -d ~{distance}
+ fi
+ if [[ "~{persistent_cluster_tsv}" != "" ]]
+ then
+ perl /scripts/marcs_incredible_script.pl ""
+ awk 'NR==FNR {keys[$1]; next} $1 in keys' "$(find . -name '*_cluster_annotation.tsv' | head -n 1)" ~{persistent_cluster_tsv} > filtered_latest_clusters.tsv
+ awk 'NR==FNR {keys[$1]; next} $1 in keys' ~{persistent_cluster_tsv} "$(find . -name '*_cluster_annotation.tsv' | head -n 1)" > filtered_persistent_clusters.tsv
+ perl /scripts/marcs_incredible_script.pl filtered_persistent_clusters.tsv filtered_latest_clustes.tsv
+ fi
+
+ >>>
+
+ runtime {
+ cpu: cpu
+ disks: "local-disk " + 100 + " SSD"
+ docker: "ashedpotatoes/sranwrp:1.1.26"
+ memory: memory + " GB"
+ preemptible: 1
+ }
+
+ output {
+ Array[File] out_matrices = glob("*_dmtrx.tsv")
+ File samp_cluster = glob("*_cluster_annotation.tsv")[0]
+ File cluster_samps = glob("*_cluster_extraction.tsv")[0]
+ File samp_UUID = glob("*_cluster_UUIDs.tsv")[0]
+ File? persistent_cluster_translator = "mapped_persistent_cluster_ids_to_new_cluster_ids.tsv"
+ Int n_clusters = read_int("n_clusters")
+ Int n_samples_in_clusters = read_int("n_samples_in_clusters")
+ Int total_samples_processed = read_int("total_samples_processed")
+ }
+
+}
+
+task cluster_CDPH_method {
+ # find_clusters.py: Generates 20-10-5 clusters and distance matrices (normal and backmasked)
+ # process_clusters.py: Persistent cluster IDs, subtrees, and MR upload
+ # Any clusters that have at least one sample without a diff file will NOT be backmasked
+ # This might not work properly if any sample IDs contain a space
+ input {
+ File input_mat_with_new_samples
+ String datestamp # has to be defined here for non-glob delocalization to work properly
+
+ Boolean upload_clusters_to_microreact = true
+ Boolean disable_decimated_failsafe = false
+ Boolean inteight = false
+ Boolean only_matrix_special_samples # arg is assumed to be passed in from Tree Nine
+ File? special_samples
+
+ File? persistent_denylist
+ File? persistent_ids
+ File? persistent_cluster_meta
+ File combined_diff_file # used for local masking
+ File? previous_run_cluster_json # for comparisons -- currently we do this another way so this is unused
+
+ # keep these files in the workspace bucket for now
+ File? microreact_decimated_template_json
+ File? microreact_update_template_json
+ File? microreact_blank_template_json # hardcoded to expect a file named BLANK_template.json
+ File? microreact_key
+
+ # actually optional
+ File? metadata_csv
+ String? shareemail
+
+ Int preempt = 0 # only set if you're doing a small test run
+ Int memory = 50
+ Boolean debug = true
+
+ # temporary overrides
+ File? override_latest_samples_tsv # if provided, skips find_clusters.py
+ File? override_find_clusters_script
+ File? override_process_clusters_script
+ File? override_summarize_changes_script
+
+ }
+ # We cannot `String arg_token = if upload_clusters_to_microreact then "--token ~{microreact_key}" else "" ` or else the literal gs:// will
+ # instead of the delocalized version, so some args will need to be handled in the command section itself
+
+ Array[Int] cluster_distances = [20, 10, 5] # CHANGING THIS WILL BREAK SECOND SCRIPT!
+ String arg_denylist = if defined(persistent_denylist) then "--dl ~{persistent_denylist}" else ""
+ String arg_shareemail = if defined(shareemail) then "-s ~{shareemail}" else ""
+ String arg_microreact = if upload_clusters_to_microreact then "--upload_to_microreact" else ""
+ String arg_ieight = if inteight then "--int8" else ""
+ String arg_disable_decimated_failsafe = if disable_decimated_failsafe then "--disable_decimated_failsafe" else ""
+
+ command <<<
+ set -eux pipefail
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Starting task"
+
+ # validate inputs
+ if [[ "~{upload_clusters_to_microreact}" = "true" ]]
+ then
+ if [ -f "~{microreact_key}" ]
+ then
+ TOKEN_ARG="--token ~{microreact_key}"
+ else
+ echo "Upload to microreact is true, but no token provided. Crashing!"
+ exit 1
+ fi
+
+ if [ -f "~{microreact_update_template_json}" ]
+ then
+ MR_UPDATE_JSON_ARG="--mr_update_template ~{microreact_update_template_json}"
+ else
+ echo "Upload to microreact is true, but no microreact_update_template_json provided. Crashing!"
+ exit 1
+ fi
+
+ if [ -f "~{microreact_blank_template_json}" ]
+ then
+ MR_BLANK_JSON_ARG="--mr_blank_template ~{microreact_blank_template_json}"
+ else
+ echo "Upload to microreact is true, but no microreact_blank_template_json provided. Crashing!"
+ exit 1
+ fi
+
+ if [ -f "~{microreact_decimated_template_json}" ]
+ then
+ MR_DECIMATED_JSON_ARG="--mr_decimated_template ~{microreact_decimated_template_json}"
+ else
+ echo -n "Upload to microreact is true, but no microreact_decimated_template_json provided. This isn't recommended,"
+ echo -n "because decimated clusters on Microreact will never be updated, which might lead to incorrect assumptions."
+ fi
+ else
+ TOKEN_ARG=""
+ MR_UPDATE_JSON_ARG=""
+ MR_BLANK_JSON_ARG=""
+ MR_DECIMATED_JSON_ARG=""
+ fi
+
+ # we do similar logic within process_clusters.py too, but if we can crash before find_clusters.py that'd be ideal
+ if [ -f "~{persistent_ids}" ]
+ then
+ if [ -f "~{persistent_cluster_meta}" ]
+ then
+ PERSISTENTIDS_ARG="--persistentids ~{persistent_ids}"
+ PERSISTENTMETA_ARG="--persistentclustermeta ~{persistent_cluster_meta}"
+ else
+ echo "Found persistent IDs file but no persistent cluster meta. You need neither or both. Crashing!"
+ exit 1
+ fi
+ else
+ if [ -f "~{persistent_cluster_meta}" ]
+ then
+ echo "Found persistent cluster meta file but no persistent IDs. You need neither or both. Crashing!"
+ exit 1
+ else
+ echo "Found neither persistent IDs file nor persistent cluster meta, will be running without persistent IDs"
+ PERSISTENTIDS_ARG=""
+ PERSISTENTMETA_ARG=""
+ fi
+ fi
+
+ matUtils extract -i ~{input_mat_with_new_samples} -t A_big.nwk
+ cp ~{input_mat_with_new_samples} .
+
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Downloading files"
+
+ if [[ "~{override_find_clusters_script}" == '' ]]
+ then
+ wget https://raw.githubusercontent.com/aofarrel/tree_nine/0.6.2/find_clusters.py
+ mv find_clusters.py /scripts/find_clusters.py
+ else
+ mv "~{override_find_clusters_script}" /scripts/find_clusters.py
+ fi
+
+ if [[ "~{override_process_clusters_script}" == '' ]]
+ then
+ wget https://raw.githubusercontent.com/aofarrel/tree_nine/0.6.2/process_clusters.py
+ mv process_clusters.py /scripts/process_clusters.py
+ else
+ mv "~{override_process_clusters_script}" /scripts/process_clusters.py
+ fi
+
+ if [[ "~{override_summarize_changes_script}" == '' ]]
+ then
+ wget https://raw.githubusercontent.com/aofarrel/tree_nine/0.6.2/summarize_changes_alt.py
+ mv summarize_changes_alt.py /scripts/summarize_changes_alt.py
+ else
+ mv "~{override_summarize_changes_script}" /scripts/summarize_changes_alt.py
+ fi
+
+ wget https://gist.githubusercontent.com/aofarrel/6a458634abbca4eb16d120cc6694d5aa/raw/d6f5466e04394ca38f1a92b1580a9a5bd436bbc8/marcs_incredible_script_update.pl
+ mv marcs_incredible_script_update.pl /scripts/marcs_incredible_script_update.pl
+
+ wget https://raw.githubusercontent.com/aofarrel/tsvutils/refs/heads/main/extract_long_rows_and_truncate.sh
+ mv extract_long_rows_and_truncate.sh /scripts/strip_tsv.sh
+ wget https://raw.githubusercontent.com/aofarrel/tsvutils/refs/heads/main/equalize_tabs.sh
+ mv equalize_tabs.sh /scripts/equalize_tabs.sh
+
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Files downloaded and moved. Workdir:"
+ tree
+
+ CLUSTER_DISTANCES="~{sep=',' cluster_distances}"
+ FIRST_DISTANCE="${CLUSTER_DISTANCES%%,*}"
+ OTHER_DISTANCES="${CLUSTER_DISTANCES#*,}"
+ echo "cluster distances $CLUSTER_DISTANCES"
+ echo "First distance $FIRST_DISTANCE"
+ echo "Other distances $OTHER_DISTANCES"
+
+ # Turn off pipefail at this point for a few reasons
+ # 1) find_clusters.py can return not-0 in non-error cases
+ # 2) process_clusters.py writes a lot of logs to disk and we need them if it fails
+ set +eo pipefail
+
+ # TODO: on very large runs, the size of $/samples may eventually cause issues with ARG_MAX
+ # should be fine for our purposes though
+
+ if [[ "~{override_latest_samples_tsv}" == '' ]]
+ then
+
+ # shellcheck disable=SC2086
+ if [[ "~{only_matrix_special_samples}" = "true" ]]
+ then
+ samples=$(< "~{special_samples}" tr -s '\n' ',' | head -c -1)
+ echo "Samples that will be in the distance matrix: $samples"
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Running find_clusters.py"
+
+ python3 /scripts/find_clusters.py \
+ "~{input_mat_with_new_samples}" \
+ --samples $samples \
+ --collection-name big \
+ -t NB \
+ -d "$FIRST_DISTANCE" \
+ -rd "$OTHER_DISTANCES" \
+ -v ~{arg_ieight}
+
+ ALLSAMPLES_ARG_1="--allsamples"
+ ALLSAMPLES_ARG_2="$samples"
+
+ else
+ echo "No sample selection file passed in, will matrix the entire tree (WARNING: THIS MAY BE VERY SLOW)"
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Running find_clusters.py"
+ python3 /scripts/find_clusters.py \
+ "~{input_mat_with_new_samples}" \
+ --collection-name big \
+ -t NB \
+ -d "$FIRST_DISTANCE" \
+ -rd "$OTHER_DISTANCES" \
+ -v ~{arg_ieight}
+
+ ALLSAMPLES_ARG_1=""
+ ALLSAMPLES_ARG_2=""
+ fi
+
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Finished running find_clusters.py"
+
+ else
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Skipped find_clusters.py because you provided override_latest_samples_tsv"
+ mv "~{override_latest_samples_tsv}" ./latest_samples.tsv
+
+ fi
+
+ echo "Current sample information:"
+ cat latest_samples.tsv
+ echo "Contents of workdir:"
+ tree
+ # A_big.nwk big tree, nwk format (will be renamed later)
+ # LONELY-subtree-n.nwk (n as variable) subtrees (usually multiple) of unclustered samples
+ # unclustered_samples.txt what it says on the tin
+ # lonely-subtree-assignments.tsv which subtree each unclustered sample ended up in
+ # cluster_annotation_workdirIDs.tsv can be used to annotate by nonpersistent cluster (but isn't, at least not yet)
+ # latest_samples.tsv used by persistent ID script (will be renamed later)
+ # n_big_clusters (n as constant) # of 20SNP clusters
+ # n_samples_in_clusters (n as constant) # of samples that clustered
+ # n_samples_processed (n as constant) # of samples processed by find_clusters.py
+ # n_unclustered (n as constant) # of samples that failed to cluster
+ # ...and one distance matrix per cluster, and also one(?) subtree per cluster. Later, there will be two of each per cluster thanks to backmasking
+
+ mkdir logs
+ echo "Running second script"
+
+ # shellcheck disable=SC2086 # already dquoted
+ python3 /scripts/process_clusters.py \
+ --latestsamples latest_samples.tsv \
+ --latestclustermeta latest_clusters.tsv \
+ -mat "~{input_mat_with_new_samples}" \
+ -cd "~{combined_diff_file}" \
+ --no_err_on_decimated_on_mr \
+ ~{arg_denylist} ~{arg_shareemail} ~{arg_microreact} --today ~{datestamp} ~{arg_disable_decimated_failsafe} \
+ $MR_UPDATE_JSON_ARG $TOKEN_ARG $MR_BLANK_JSON_ARG $MR_DECIMATED_JSON_ARG $PERSISTENTIDS_ARG $PERSISTENTMETA_ARG $ALLSAMPLES_ARG_1 $ALLSAMPLES_ARG_2
+
+ PY_EXIT_CODE=$? # this does not seem reliable on WDL nowadays? hmmmm...
+
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Zipping process_clusters.py's logs"
+ zip -r logs.zip ./logs
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Logs zipped"
+
+ if [ -f "rosetta_stone_20_merges.tsv" ]
+ then
+ echo "Rosetta 20 merges"
+ cat rosetta_stone_20_merges.tsv
+ fi
+ if [ -f "rosetta_stone_10_merges.tsv" ]
+ then
+ echo "Rosetta 10 merges"
+ cat rosetta_stone_10_merges.tsv
+ fi
+ if [ -f "rosetta_stone_5_merges.tsv" ]
+ then
+ echo "Rosetta 5 merges"
+ cat rosetta_stone_5_merges.tsv
+ fi
+
+ if [ "~{previous_run_cluster_json}" != "" ]
+ then
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Running summarize_changes_alt.py"
+ python3 /scripts/summarize_changes_alt.py "all_cluster_information~{datestamp}.json"
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Finished find_clusters.py"
+ fi
+ if [ ~{debug} = "true" ]; then ls -lha; fi
+
+
+ # MR templates are generally deleted in the script itself to avoid globbing with the subtrees
+ mv A_big.nwk "BIGTREE~{datestamp}.nwk"
+ echo "Renamed A_big.nwk to BIGTREE~{datestamp}.nwk"
+ mv latest_samples.tsv "latest_samples~{datestamp}.tsv"
+ echo "Renamed latest_samples.tsv to latest_samples~{datestamp}.tsv"
+ mv latest_clusters.tsv "latest_clusters~{datestamp}.tsv"
+ echo "Renamed latest_clusters.tsv to latest_clusters~{datestamp}.tsv"
+ mv all_closest_relatives.txt "all_nearest_relatives~{datestamp}.txt"
+ echo "Renamed all_closest_relatives.txt to all_nearest_relatives~{datestamp}.txt"
+ mv unclustered_samples.txt "unclustered_samples~{datestamp}.txt"
+ echo "Renamed all_closest_relatives.txt to unclustered_samples~{datestamp}.txt"
+
+ # if process_clusters.py errored, NOW we should crash, since we have logs and such
+ exit $PY_EXIT_CODE
+
+ # shellcheck disable=SC2317
+ echo "[$(date '+%Y-%m-%d %H:%M:%S')] Finished task"
+
+ >>>
+
+ runtime {
+ bootDiskSizeGb: 15
+ cpu: 12
+ disks: "local-disk " + 150 + " SSD"
+ docker: "ashedpotatoes/usher-plus:0.6.4ash_2"
+ memory: memory + " GB"
+ preemptible: preempt
+ }
+
+ output {
+ # The amount of outputs we originally had was overloading Terra, so some of these are commented out now.
+ # Also, we try to avoid globbing where possible to make finding outs in Terra bucket easier since globs
+ # create a folder with a randomized name, which is annoying!
+
+ ###### IMPORTANT FILES THAT SHOULD ALWAYS GO INTO SUBSEQUENT RUNS IF THEY EXIST ######
+ File? new_samples = "new_samples" + datestamp + ".tsv"
+ File? clusterid_denylist = "clusterid_denylist.txt"
+ File? new_persistent_ids = "persistentIDS" + datestamp + ".tsv"
+ File? new_persistent_meta = "persistentMETA" + datestamp + ".tsv"
+ File? final_cluster_information_json = "all_cluster_information" + datestamp + ".json"
+ File? change_report_json = "change_report" + datestamp + ".json"
+
+ File? updated_mr_URIs_file = "updated_mr_URIs" + datestamp + ".txt"
+
+ # trees -- A = not internally masked, B = internally masked
+ # there is no internally masked big tree because masking is done per-cluster
+ File? bigtree_raw = "BIGTREE"+datestamp+".nwk" # generated directly via matUtils
+ File? bigtree_gen = "a000000.nwk" # generated by cluster script (but should be equivalent to bigtree_raw)
+ #Array[File]? acluster_trees = glob("a*.nwk") # !UnnecessaryQuantifier
+ Array[File]? bcluster_trees = glob("b*.nwk") # !UnnecessaryQuantifier
+
+ # stuff related to unclustered samples
+ File? all_nearest_relatives = "all_nearest_relatives" + datestamp + ".txt"
+ Array[File]? unclustered_subtree_assignments = glob("*subtree-assignments.tsv") # !UnnecessaryQuantifier
+ #Array[File]? unclustered_subtrees = glob("LONELY*.nwk") # !UnnecessaryQuantifier
+ File? unclustered_samples = "unclustered_samples" + datestamp + ".txt"
+
+ # distance matrices
+ File? bigtree_matrix = "a000000_dmtrx.tsv"
+ Array[File]? acluster_matrices = glob("a*_dmtrx.tsv") # !UnnecessaryQuantifier
+ Array[File]? bcluster_matrices = glob("b*_dmtrx.tsv") # !UnnecessaryQuantifier
+
+ # general cluster stats
+ Int n_big_clusters = read_int("n_big_clusters")
+ Int n_samples_in_clusters = read_int("n_samples_in_clusters")
+ Int n_samples_processed = read_int("n_samples_processed")
+ Int n_unclustered = read_int("n_unclustered")
+
+ # debug
+ File? logs = "logs.zip"
+ File? change_report_full = "change_report_full"+datestamp+".txt" # all clusters
+ File? change_report_cdph = "change_report_cdph"+datestamp+".txt" # excludes 20-clusters
+ File? intermediate_samplewise = "latest_samples"+datestamp+".tsv" # from find_clusters.py
+ File? intermediate_clusterwise = "latest_samples"+datestamp+".tsv" # from find_clusters.py, currently only for matrix_max
+
+ # for annotation of trees
+ File? samp_cluster_twn = "samp_persis20cluster" + datestamp + ".tsv" # this format is specifically for nextstrain conversion
+ File? samp_cluster_ten = "samp_persis10cluster" + datestamp + ".tsv" # this format is specifically for nextstrain conversion
+ File? samp_cluster_fiv = "samp_persis5cluster" + datestamp + ".tsv" # this format is specifically for nextstrain conversion
+
+ # old old old
+ #Array[File] abig_subtrees = glob("abig-subtree-*.nwk")
+ #File? persistent_cluster_translator = "mapped_persistent_cluster_ids_to_new_cluster_ids.tsv"
+ #Array[File] cluster_trees_json = glob("*.json")
+ #Array[File] metadata_tsvs = glob("*.tsv") # for auspice.us, which supports nwk
+ }
+}
+