Rr 115 update docs #138
Conversation
Pull request overview
Updates the documentation to improve onboarding/troubleshooting guidance and reflect newer client capabilities in cfa-cloudops.
Changes:
- Add new troubleshooting sections for CloudClient instantiation and file-path/mount issues during jobs
- Expand the CloudClient getting-started notebook with clearer guidance around env files, blob containers, ACR uploads, pools, and jobs
- Update docs to mention `ContainerAppClient` and adjust the service-principal sample.env
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/troubleshooting.md | Adds new troubleshooting guidance for CloudClient instantiation and file-not-found issues. |
| docs/overview.md | Mentions ContainerAppClient as a repo component. |
| docs/files/sp_sample.env | Removes AZURE_SP_CLIENT_ID from the SP sample env. |
| docs/examples/getting_started/cloudclient_walkthrough.ipynb | Expands the getting-started walkthrough with additional explanations and examples. |
| docs/examples/getting_started/Dockerfile | Replaces the minimal Dockerfile with a more explanatory example. |
```diff
@@ -2,7 +2,6 @@
 AZURE_TENANT_ID="your azure tenant id"
 AZURE_SUBSCRIPTION_ID="your subscription id"
 AZURE_CLIENT_ID="your azure service principal client id"
```
| "metadata": {}, | ||
| "source": [ | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working." | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working. **Make sure to update your .env file based on the sample with values relevant to your Azure environment.**" |
I do not think I understand what this direction is telling the user to do. Is it saying you can have a .env file with a single line AZURE_KEYVAULT_NAME = "CFA-Predict"? Is it saying you should manually set a variable AZURE_KEYVAULT_NAME = "CFA-Predict" prior to running CloudClient()? Neither one worked for me. In both cases I get `AttributeError: A non-None value for attribute azure_batch_account is required to obtain a value for Azure batch endpoint URL.`
It did work to use `cc = CloudClient(keyvault = 'CFA-Predict')`, but if that is the easiest way to do things, why does it appear commented out and second in the walkthrough? This line implies there is a way to do the same thing by having "a variable called AZURE_KEYVAULT_NAME is saved to your environment", but neither way I could think to do that worked.
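For what it's worth, the mechanism the walkthrough text seems to describe is an ordinary environment-variable lookup, sketched below. Whether `CloudClient` actually reads the variable this way is exactly what's in question here, so `resolve_keyvault_name` is a hypothetical stand-in for the client's internal lookup, not the real code:

```python
import os

def resolve_keyvault_name(env):
    # Hypothetical stand-in for the client's internal lookup; per the
    # walkthrough, a Key Vault name found in the environment should
    # take precedence over .env-file-based authentication.
    return env.get("AZURE_KEYVAULT_NAME")

# Setting the variable in-process before instantiating the client:
os.environ["AZURE_KEYVAULT_NAME"] = "CFA-Predict"  # placeholder vault name
print(resolve_keyvault_name(os.environ))  # prints: CFA-Predict
```

If this lookup order is what the library intends, the docs could show the exact `os.environ` or .env line that makes it work, which would settle the ambiguity raised above.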
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
sfrosenb left a comment
Cloudops Comments Not Directly Addressed by PR Suggestions
Overview
Several members of my team spent significant time (over 2 weeks) debugging the walkthrough for this package before we could get it to work, and some of us are still encountering issues. This seems to point to some underlying difficulties with CFA–Azure interaction or with the package itself. Below are a few suggestions that may help address these challenges.
Stress‑Test Functions
Several cloudops functions are somewhat brittle, failing when usage falls outside the assumptions of typical user workflows.
`cc.monitor_task(download_task_output=True)`
- This function can fail if any tasks associated with a job no longer have nodes.
- Possible improvements:
  - Would it be possible to add a try/except or filtering step to ignore tasks lacking nodes?
  - Alternatively, a more informative error message could help users understand when a new job is required.
  - If helpful, I'm happy to submit a feature request.
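The filtering idea above could look something like the sketch below. `Task`, `node_id`, and the `downloadable` helper are hypothetical stand-ins for whatever cloudops uses internally; the point is only the shape of the guard, not the real API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    # Hypothetical stand-in for a Batch task record; the real objects
    # carry much more state than an id and an optional node assignment.
    task_id: str
    node_id: Optional[str]

def downloadable(tasks):
    """Filter out tasks whose node is gone instead of letting a download raise."""
    skipped = [t.task_id for t in tasks if t.node_id is None]
    if skipped:
        print(f"skipping tasks with no node (node reclaimed?): {skipped}")
    return [t for t in tasks if t.node_id is not None]

tasks = [Task("t1", "node-a"), Task("t2", None)]
for t in downloadable(tasks):
    print(f"would download output for {t.task_id}")
```

Even just the warning line here would be more informative than the current failure mode, since it tells the user which tasks were affected and hints at why.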
`cc.create_job`
- If the pool does not exist, the job appears to be created, but tasks submitted to it remain queued indefinitely.
Pre‑Identify User Issues
- Reviewing feature requests from the predecessor package (`cfa-azure`) may help proactively address common user concerns.
Improve Error Messages
`cc.monitor_task` fails when tasks lack associated nodes, but the resulting error is not very informative.
Reduce Debugging Time
Debugging has been more challenging because, even with `vm_size = 'xsmall'`, the transition from queued to running typically takes at least 4 minutes.
Possible improvements:
- Is there any possibility of creating a pool that allows for faster debugging?
- Would dedicated nodes help? Or would they pose too high a cost risk if users forget to delete the pool?
- If startup time cannot be reduced, making it easier to launch multiple tasks for testing might help:
  - Currently, if one task fails earlier than another, the second call to `download_task_output` errors out.
Other Suggestions
- In the walkthrough, it could be more helpful for users if argument names were explicitly specified rather than relying on positional interpretation.
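On that last point, a minimal illustration of why keyword arguments help in a walkthrough. The `create_pool` signature below is a placeholder for illustration only, not the real cloudops signature:

```python
def create_pool(pool_name, container_name, vm_size="small", max_autoscale_nodes=1):
    # Placeholder signature for illustration only.
    return dict(pool_name=pool_name, container_name=container_name,
                vm_size=vm_size, max_autoscale_nodes=max_autoscale_nodes)

# Positional: readers must remember (or look up) the parameter order.
create_pool("getting-started-pool", "input-test", "xsmall", 5)

# Keyword: self-documenting, and robust if the signature gains parameters.
create_pool(
    pool_name="getting-started-pool",
    container_name="input-test",
    vm_size="xsmall",
    max_autoscale_nodes=5,
)
```

Both calls are equivalent, but the keyword form doubles as documentation, which matters most in a getting-started notebook.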
| "metadata": {}, | ||
| "source": [ | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working." | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working. **Make sure to update your .env file based on the sample with values relevant to your Azure environment.**" |
There was a problem hiding this comment.
I do not think I understand what this direction is telling the user to to do. Is it saying you can have a .env file with a single line AZURE_KEYVAULT_NAME = "CFA-Predict"? Is it saying you should manually set a variable AZURE_KEYVAULT_NAME = "CFA-Predict" prior to running CloudClient()? Neither one worked for me. In both cases I get AttributeError: A non-None value for attribute azure_batch_account is required to obtain a value for Azure batch endpoint URL.
| "metadata": {}, | ||
| "source": [ | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working." | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working. **Make sure to update your .env file based on the sample with values relevant to your Azure environment.**" |
There was a problem hiding this comment.
It did work to use cc = CloudClient(keyvault = 'CFA-Predict'), but if that is the easiest way to do things, why does it appear commented out and second in the walkthrough? This line implies there is a way to do the same thing by having "a variable called AZURE_KEYVAULT_NAME is saved to your environment" but neither way I could think to do that worked
```diff
@@ -183,7 +203,8 @@
 "container_name = cc.package_and_upload_dockerfile(\n",
 "    registry_name = \"my_azure_registry\",\n",
```
the only value for `registry_name` that works for me is "cfprdbatchcr". Using the value in this example for the walkthrough results in the following error:

```
ERROR: Registry names may contain only alpha numeric characters and must be between 5 and 50 characters
The push refers to repository [my_azure_registry.azurecr.io/cloudops-demo]
Get "https:/v2/": http: no Host in request URL
```

Adjusting based on this error to use `-` instead of `_` results in the following error:

```
WARNING: The resource with name 'my-azure-registry' and type 'Microsoft.ContainerRegistry/registries' could not be found in subscription 'EXT-EDAV-CFA-PRD (ef340bd6-2809-4635-b18b-7e6583a8803b)'.Using 'my-azure-registry.azurecr.io' as the default registry login server.
ERROR: Could not connect to the registry login server 'my-azure-registry.azurecr.io'. Please verify that the registry exists and the URL 'https://my-azure-registry.azurecr.io/v2/' is reachable from your environment.Try running 'az acr check-health -n my-azure-registry --yes' to diagnose this issue.
The push refers to repository [my-azure-registry.azurecr.io/cloudops-demo]
Get "https://my-azure-registry.azurecr.io/v2/": dial tcp: lookup my-azure-registry.azurecr.io on 127.0.0.53:53: no such host
```
Provide users the correct value that will work for them, or, if it is not the same between users, provide a way to determine which value applies. If a value is meant to be a stand-in that prompts users to find their appropriate value rather than being taken literally, this needs to be made clear in the text above or in the code comments, with a detailed description of how to find the appropriate value.
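The CLI error quoted above actually states the naming rule (5-50 alphanumeric characters), so a small client-side check could fail fast with a clear message before any push is attempted. The rule here is taken from that error text, and the function name is a suggestion, not part of the cloudops API:

```python
import re

# Rule from the CLI error quoted above: alphanumeric only, 5-50 chars.
REGISTRY_NAME_RE = re.compile(r"[a-zA-Z0-9]{5,50}")

def validate_registry_name(name: str) -> None:
    """Raise early, with a pointer to the fix, instead of a failed push."""
    if not REGISTRY_NAME_RE.fullmatch(name):
        raise ValueError(
            f"'{name}' is not a valid ACR registry name: use 5-50 "
            "alphanumeric characters (no '_' or '-'). Check your registry "
            "name in the Azure portal, or ask your Azure admin which "
            "registry your team uses."
        )

validate_registry_name("cfprdbatchcr")        # passes
# validate_registry_name("my_azure_registry") # would raise ValueError
```

A check like this would have turned the two confusing failures above into one actionable message.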
| "Pools are usually created for each team or per project. It spins up nodes when necessary based on the container you specify. The following would create a pool based on the Docker image we just uploaded, autoscaling to 5 nodes, mounting to the 'input-test' container we uploaded to, an 8 core CPU, and call it 'getting-started-pool'. " | ||
| "Pools are usually created for each team or per project. It spins up nodes when necessary based on the container you specify. \n", | ||
| "\n", | ||
| "It's at this point we specify which Blob Containers we mount to the pool. This will make blobs in Blob Storage accessible to read or write for the containers that we mount. The mounts are then accessible in your code at the root of the node, i.e. a mounted container called 'input-test' will be accessible in your code via `/input-test`. \n", |
Access to mounted files continues to be an issue for my team. It is unclear as of yet what is going on, but in some cases, tasks fail to access the files in the blob container mounted to the pool.
| "At this point we are ready to add tasks to the job we created. We can run the `main.py` python script that we uploaded to the 'input-test' container. It takes an argument called '--user' and prints a welcome message to the console. We will add two tasks to our job for two different users. In general, any number of tasks can be added to a job." | ||
| "At this point we are ready to add tasks to the job we created. We can run the `main.py` python script that we uploaded to the 'input-test' container. It takes an argument called '--user' and prints a welcome message to the console. We will add two tasks to our job for two different users. In general, any number of tasks can be added to a job.\n", | ||
| "\n", | ||
| "Notice that we reference the mounted Blob Container with /input-test (the leading / is important)." |
it seems there may be additional issues related to the pathing in this line besides the importance of the leading `/`:

```
Fatal error: cannot open file '/input-test-edp/cloudops_helloworld.r': No such file or directory
```
Several members of my team work almost exclusively in R, so it was important to establish the ability to run simple R scripts rather than simple Python scripts. To test this, I created a simple Dockerfile and corresponding hello-world script that you could consider including in this PR, or I can include in my own PR.
| "cc.delete_job(\"getting-started-job\")\n", | ||
| "\n", | ||
| "# delete the pool\n", | ||
| "cc.delete_pool(\"getting-started-pool\")" |
`cc.delete_pool` seems to not work as expected. I tested this and then tried to `create_pool` with the same name afterwards and received an error stating that the pool already existed. So it either does not work or has a delay. This should be determined and either fixed if it doesn't work, or, if there is an unfixable delay, a warning should be given here and output when users run `delete_pool`, or in the error message for `create_pool` (e.g. "a pool with this name already exists. If you recently deleted this pool, it may take up to X minutes to successfully delete such that you can create a pool with the same name").
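If deletion really is asynchronous on the service side, a wait-and-retry wrapper along these lines might bridge the gap in the walkthrough. `pool_exists` is a hypothetical stand-in for however the real client checks pool existence; this is a sketch of the pattern, not cloudops code:

```python
import time

def wait_for_pool_deletion(pool_exists, pool_name, timeout_s=600, poll_s=15):
    """Poll until the service no longer reports the pool, or time out.

    pool_exists: a callable taking a pool name and returning bool; a
    stand-in for however the real client queries pool state.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pool_exists(pool_name):
            return True  # safe to recreate a pool with this name
        time.sleep(poll_s)
    return False  # still reported as existing; deletion pending or failed

# Toy check: a pool that "finishes deleting" after two polls.
states = iter([True, True, False])
print(wait_for_pool_deletion(lambda name: next(states), "demo-pool", poll_s=0))
```

Even without the wrapper, documenting the observed delay between `delete_pool` returning and the name becoming reusable would save users the confusion described above.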
cc.delete_pool() followed by a creation of a pool with the same name results in:

```
Pool with name kps-cloudops-demo-pool already exists.
Skipping pool creation.
```
> If you are interacting with files during a job and getting errors like a file is not found, it can be originating from two places:
> 1. incorrect mount reference. The blob container should be mounted during pool creation and referenced at the root of the Docker container. For example, a container called `my-container` would be referenced as `/my-container` in code, unless you provided a relative mount path when creating the pool.
> 2. file not present in container. If you are referencing a file that should exist in your Docker container, confirm the path where it exists. Note that Docker sets a working directory so any relative paths will start from the working directory specified in your Docker container.
Two of my teammates have been troubleshooting a similar issue and have looked into both of these suggestions, as well as a number of other avenues to debug, and gotten nowhere.
This is the stdout after including a `pwd` and `ls` in the add_task line, in addition to running the hello-world script:
```
/
bin
boot
dev
etc
home
lib
lib32
lib64
libx32
media
mnt
opt
proc
rocker_scripts
root
run
sbin
srv
sys
tmp
usr
var
Fatal error: cannot open file '/input-test-edp/cloudops_helloworld.r': No such file or directory
```
I think adding clarity around the Dockerfile beyond pointing to the generic Docker docs would help users and troubleshooting. Your example Dockerfile sets WORKDIR to /app, but then in the walkthrough, "app" is not used in create_pool() nor add_task() nor main.py, and docs/troubleshooting.md does not use it either. It would be helpful to provide the user an example Dockerfile that works with the walkthrough.
| "\n", | ||
| "The following would create a pool based on the Docker image we just uploaded, autoscaling to 5 nodes, mounting to the 'input-test' container to which we uploaded, use an 8 core CPU, and call it 'getting-started-pool'. \n", | ||
| "\n", | ||
| "You could also specify vm_size from a list of xsmall, small, medium, large, and xlarge. These will use 2, 4, 8, 16, or 32 cores, respectively." |
Can you add more information about what the various arguments for create_pool do? In the walkthrough you have max_autoscale_nodes=5, but when I tried this out my job did not seem to autoscale like it used to with our old cfa-azure pipeline. Does autoscale need to be set to true? What's going on?