Rr 115 update docs #138
Conversation
Pull request overview
Updates the documentation to improve onboarding/troubleshooting guidance and reflect newer client capabilities in cfa-cloudops.
Changes:
- Add new troubleshooting sections for CloudClient instantiation and file-path/mount issues during jobs
- Expand the CloudClient getting-started notebook with clearer guidance around env files, blob containers, ACR uploads, pools, and jobs
- Update docs to mention `ContainerAppClient` and adjust the service-principal sample.env
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/troubleshooting.md | Adds new troubleshooting guidance for CloudClient instantiation and file-not-found issues. |
| docs/overview.md | Mentions ContainerAppClient as a repo component. |
| docs/files/sp_sample.env | Removes AZURE_SP_CLIENT_ID from the SP sample env. |
| docs/examples/getting_started/cloudclient_walkthrough.ipynb | Expands the getting-started walkthrough with additional explanations and examples. |
| docs/examples/getting_started/Dockerfile | Replaces the minimal Dockerfile with a more explanatory example. |
```diff
@@ -2,7 +2,6 @@
 AZURE_TENANT_ID="your azure tenant id"
 AZURE_SUBSCRIPTION_ID="your subscription id"
 AZURE_CLIENT_ID="your azure service principal client id"
```
| "metadata": {}, | ||
| "source": [ | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working." | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working. **Make sure to update your .env file based on the sample with values relevant to your Azure environment.**" |
I do not think I understand what this direction is telling the user to do. Is it saying you can have a .env file with a single line AZURE_KEYVAULT_NAME = "CFA-Predict"? Is it saying you should manually set a variable AZURE_KEYVAULT_NAME = "CFA-Predict" prior to running CloudClient()? Neither one worked for me. In both cases I get `AttributeError: A non-None value for attribute azure_batch_account is required to obtain a value for Azure batch endpoint URL.`
It did work to use `cc = CloudClient(keyvault = 'CFA-Predict')`, but if that is the easiest way to do things, why does it appear commented out and second in the walkthrough? This line implies there is a way to do the same thing by having "a variable called AZURE_KEYVAULT_NAME is saved to your environment", but neither way I could think to do that worked.
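For what it's worth, the mechanism the walkthrough text seems to describe is an ordinary environment-variable lookup, sketched below. Whether `CloudClient` actually reads the variable this way is exactly what's in question here, so `resolve_keyvault_name` is a hypothetical stand-in for the client's internal lookup, not the real code:

```python
import os

def resolve_keyvault_name(env):
    # Hypothetical stand-in for the client's internal lookup; per the
    # walkthrough, a Key Vault name found in the environment should
    # take precedence over .env-file-based authentication.
    return env.get("AZURE_KEYVAULT_NAME")

# Setting the variable in-process before instantiating the client:
os.environ["AZURE_KEYVAULT_NAME"] = "CFA-Predict"  # placeholder vault name
print(resolve_keyvault_name(os.environ))  # prints: CFA-Predict
```

If this lookup order is what the library intends, the docs could show the exact `os.environ` or .env line that makes it work, which would settle the ambiguity raised above.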
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
sfrosenb left a comment
Cloudops Comments Not Directly Addressed by PR Suggestions
Overview
Several members of my team spent significant time (over 2 weeks) debugging the walkthrough for this package before we could get it to work, and some of us are still encountering issues. This seems to point to some underlying difficulties with CFA–Azure interaction or with the package itself. Below are a few suggestions that may help address these challenges.
Stress‑Test Functions
Several cloudops functions are somewhat brittle, failing when usage falls outside the assumptions of typical user workflows.
`cc.monitor_task(download_task_output=True)`
- This function can fail if any tasks associated with a job no longer have nodes.
- Possible improvements:
  - Would it be possible to add a try/except or filtering step to ignore tasks lacking nodes?
  - Alternatively, a more informative error message could help users understand when a new job is required.
  - If helpful, I'm happy to submit a feature request.
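The filtering idea above could look something like the sketch below. `Task`, `node_id`, and the `downloadable` helper are hypothetical stand-ins for whatever cloudops uses internally; the point is only the shape of the guard, not the real API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    # Hypothetical stand-in for a Batch task record; the real objects
    # carry much more state than an id and an optional node assignment.
    task_id: str
    node_id: Optional[str]

def downloadable(tasks):
    """Filter out tasks whose node is gone instead of letting a download raise."""
    skipped = [t.task_id for t in tasks if t.node_id is None]
    if skipped:
        print(f"skipping tasks with no node (node reclaimed?): {skipped}")
    return [t for t in tasks if t.node_id is not None]

tasks = [Task("t1", "node-a"), Task("t2", None)]
for t in downloadable(tasks):
    print(f"would download output for {t.task_id}")
```

Even just the warning line here would be more informative than the current failure mode, since it tells the user which tasks were affected and hints at why.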
`cc.create_job`
- If the pool does not exist, the job appears to be created, but tasks submitted to it remain queued indefinitely.
Pre‑Identify User Issues
- Reviewing feature requests from the predecessor package (`cfa-azure`) may help proactively address common user concerns.
Improve Error Messages
`cc.monitor_task` fails when tasks lack associated nodes, but the resulting error is not very informative.
Reduce Debugging Time
Debugging has been more challenging because, even with `vm_size = 'xsmall'`, the transition from queued to running typically takes at least 4 minutes.
Possible improvements:
- Is there any possibility of creating a pool that allows for faster debugging?
- Would dedicated nodes help? Or would they pose too high a cost risk if users forget to delete the pool?
- If startup time cannot be reduced, making it easier to launch multiple tasks for testing might help:
  - Currently, if one task fails earlier than another, the second call to `download_task_output` errors out.
Other Suggestions
- In the walkthrough, it could be more helpful for users if argument names were explicitly specified rather than relying on positional interpretation.
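On that last point, a minimal illustration of why keyword arguments help in a walkthrough. The `create_pool` signature below is a placeholder for illustration only, not the real cloudops signature:

```python
def create_pool(pool_name, container_name, vm_size="small", max_autoscale_nodes=1):
    # Placeholder signature for illustration only.
    return dict(pool_name=pool_name, container_name=container_name,
                vm_size=vm_size, max_autoscale_nodes=max_autoscale_nodes)

# Positional: readers must remember (or look up) the parameter order.
create_pool("getting-started-pool", "input-test", "xsmall", 5)

# Keyword: self-documenting, and robust if the signature gains parameters.
create_pool(
    pool_name="getting-started-pool",
    container_name="input-test",
    vm_size="xsmall",
    max_autoscale_nodes=5,
)
```

Both calls are equivalent, but the keyword form doubles as documentation, which matters most in a getting-started notebook.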
| "metadata": {}, | ||
| "source": [ | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working." | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working. **Make sure to update your .env file based on the sample with values relevant to your Azure environment.**" |
There was a problem hiding this comment.
I do not think I understand what this direction is telling the user to to do. Is it saying you can have a .env file with a single line AZURE_KEYVAULT_NAME = "CFA-Predict"? Is it saying you should manually set a variable AZURE_KEYVAULT_NAME = "CFA-Predict" prior to running CloudClient()? Neither one worked for me. In both cases I get AttributeError: A non-None value for attribute azure_batch_account is required to obtain a value for Azure batch endpoint URL.
| "metadata": {}, | ||
| "source": [ | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working." | ||
| "The initialization below is the simplest way to create and instance of the `CloudClient` class. If a variable called AZURE_KEYVAULT_NAME is saved to your environment, the `CloudClient` will initialize based on some Azure values stored in the Key Vault. Otherwise it will use environment variables or values stored in a .env file to authenticate, like the .env file stored [here](../../files/sample.env), and a managed identity credential based on your local working environment. The .env file should be stored at the same level in the directory in which you're working. **Make sure to update your .env file based on the sample with values relevant to your Azure environment.**" |
There was a problem hiding this comment.
It did work to use cc = CloudClient(keyvault = 'CFA-Predict'), but if that is the easiest way to do things, why does it appear commented out and second in the walkthrough? This line implies there is a way to do the same thing by having "a variable called AZURE_KEYVAULT_NAME is saved to your environment" but neither way I could think to do that worked
```diff
@@ -183,7 +203,8 @@
 "container_name = cc.package_and_upload_dockerfile(\n",
 "    registry_name = \"my_azure_registry\",\n",
```
the only value for `registry_name` that works for me is "cfprdbatchcr". Using the value in this example for the walkthrough results in the following error:

```
ERROR: Registry names may contain only alpha numeric characters and must be between 5 and 50 characters
The push refers to repository [my_azure_registry.azurecr.io/cloudops-demo]
Get "https:/v2/": http: no Host in request URL
```

Adjusting based on this error to use `-` instead of `_` results in the following error:

```
WARNING: The resource with name 'my-azure-registry' and type 'Microsoft.ContainerRegistry/registries' could not be found in subscription 'EXT-EDAV-CFA-PRD (ef340bd6-2809-4635-b18b-7e6583a8803b)'.Using 'my-azure-registry.azurecr.io' as the default registry login server.
ERROR: Could not connect to the registry login server 'my-azure-registry.azurecr.io'. Please verify that the registry exists and the URL 'https://my-azure-registry.azurecr.io/v2/' is reachable from your environment.Try running 'az acr check-health -n my-azure-registry --yes' to diagnose this issue.
The push refers to repository [my-azure-registry.azurecr.io/cloudops-demo]
Get "https://my-azure-registry.azurecr.io/v2/": dial tcp: lookup my-azure-registry.azurecr.io on 127.0.0.53:53: no such host
```
Provide users the correct value that will work for them, or, if it is not the same between users, provide a way to determine which value applies. If a value is meant to be a stand-in that prompts users to find their appropriate value rather than being taken literally, this needs to be made clear in the text above or in the code comments, with a detailed description of how to find the appropriate value.
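The CLI error quoted above actually states the naming rule (5-50 alphanumeric characters), so a small client-side check could fail fast with a clear message before any push is attempted. The rule here is taken from that error text, and the function name is a suggestion, not part of the cloudops API:

```python
import re

# Rule from the CLI error quoted above: alphanumeric only, 5-50 chars.
REGISTRY_NAME_RE = re.compile(r"[a-zA-Z0-9]{5,50}")

def validate_registry_name(name: str) -> None:
    """Raise early, with a pointer to the fix, instead of a failed push."""
    if not REGISTRY_NAME_RE.fullmatch(name):
        raise ValueError(
            f"'{name}' is not a valid ACR registry name: use 5-50 "
            "alphanumeric characters (no '_' or '-'). Check your registry "
            "name in the Azure portal, or ask your Azure admin which "
            "registry your team uses."
        )

validate_registry_name("cfprdbatchcr")        # passes
# validate_registry_name("my_azure_registry") # would raise ValueError
```

A check like this would have turned the two confusing failures above into one actionable message.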
| "Pools are usually created for each team or per project. It spins up nodes when necessary based on the container you specify. The following would create a pool based on the Docker image we just uploaded, autoscaling to 5 nodes, mounting to the 'input-test' container we uploaded to, an 8 core CPU, and call it 'getting-started-pool'. " | ||
| "Pools are usually created for each team or per project. It spins up nodes when necessary based on the container you specify. \n", | ||
| "\n", | ||
| "It's at this point we specify which Blob Containers we mount to the pool. This will make blobs in Blob Storage accessible to read or write for the containers that we mount. The mounts are then accessible in your code at the root of the node, i.e. a mounted container called 'input-test' will be accessible in your code via `/input-test`. \n", |
Access to mounted files continues to be an issue for my team. It is unclear as of yet what is going on, but in some cases, tasks fail to access the files in the blob container mounted to the pool.
| "At this point we are ready to add tasks to the job we created. We can run the `main.py` python script that we uploaded to the 'input-test' container. It takes an argument called '--user' and prints a welcome message to the console. We will add two tasks to our job for two different users. In general, any number of tasks can be added to a job." | ||
| "At this point we are ready to add tasks to the job we created. We can run the `main.py` python script that we uploaded to the 'input-test' container. It takes an argument called '--user' and prints a welcome message to the console. We will add two tasks to our job for two different users. In general, any number of tasks can be added to a job.\n", | ||
| "\n", | ||
| "Notice that we reference the mounted Blob Container with /input-test (the leading / is important)." |
it seems there may be additional issues related to the pathing in this line besides the importance of the leading `/`:

```
Fatal error: cannot open file '/input-test-edp/cloudops_helloworld.r': No such file or directory
```
Several members of my team work almost exclusively in R, so it was important to establish the ability to run simple R scripts rather than simple Python scripts. To test this, I created a simple Dockerfile and corresponding hello-world script that you could consider including in this PR, or I can include in my own PR.
| "cc.delete_job(\"getting-started-job\")\n", | ||
| "\n", | ||
| "# delete the pool\n", | ||
| "cc.delete_pool(\"getting-started-pool\")" |
`cc.delete_pool` seems to not work as expected. I tested this and then tried to `create_pool` with the same name afterwards and received an error stating that the pool already existed. So it either does not work or has a delay. This should be determined and either fixed if it doesn't work, or, if there is an unfixable delay, a warning should be given here and output when users run `delete_pool`, or in the error message for `create_pool` (e.g. "a pool with this name already exists. If you recently deleted this pool, it may take up to X minutes to successfully delete such that you can create a pool with the same name").
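If deletion really is asynchronous on the service side, a wait-and-retry wrapper along these lines might bridge the gap in the walkthrough. `pool_exists` is a hypothetical stand-in for however the real client checks pool existence; this is a sketch of the pattern, not cloudops code:

```python
import time

def wait_for_pool_deletion(pool_exists, pool_name, timeout_s=600, poll_s=15):
    """Poll until the service no longer reports the pool, or time out.

    pool_exists: a callable taking a pool name and returning bool; a
    stand-in for however the real client queries pool state.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pool_exists(pool_name):
            return True  # safe to recreate a pool with this name
        time.sleep(poll_s)
    return False  # still reported as existing; deletion pending or failed

# Toy check: a pool that "finishes deleting" after two polls.
states = iter([True, True, False])
print(wait_for_pool_deletion(lambda name: next(states), "demo-pool", poll_s=0))
```

Even without the wrapper, documenting the observed delay between `delete_pool` returning and the name becoming reusable would save users the confusion described above.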
cc.delete_pool() followed by a creation of a pool with the same name results in:

```
Pool with name kps-cloudops-demo-pool already exists.
Skipping pool creation.
```
> If you are interacting with files during a job and getting errors like a file is not found, it can be originating from two places:
> 1. incorrect mount reference. The blob container should be mounted during pool creation and referenced at the root of the Docker container. For example, a container called `my-container` would be referenced as `/my-container` in code, unless you provided a relative mount path when creating the pool.
> 2. file not present in container. If you are referencing a file that should exist in your Docker container, confirm the path where it exists. Note that Docker sets a working directory so any relative paths will start from the working directory specified in your Docker container.
Two of my teammates have been troubleshooting a similar issue and have looked into both of these suggestions, as well as a number of other avenues to debug, and gotten nowhere.
This is the stdout after including a `pwd` and `ls` in the add_task line, in addition to running the hello-world script:
```
/
bin
boot
dev
etc
home
lib
lib32
lib64
libx32
media
mnt
opt
proc
rocker_scripts
root
run
sbin
srv
sys
tmp
usr
var
Fatal error: cannot open file '/input-test-edp/cloudops_helloworld.r': No such file or directory
```
I think adding clarity around the Dockerfile beyond pointing to the generic Docker docs would help users and troubleshooting. Your example Dockerfile sets WORKDIR to /app, but then in the walkthrough, "app" is not used in create_pool() nor add_task() nor main.py, and docs/troubleshooting.md does not use it either. It would be helpful to provide the user an example Dockerfile that works with the walkthrough.
| "\n", | ||
| "The following would create a pool based on the Docker image we just uploaded, autoscaling to 5 nodes, mounting to the 'input-test' container to which we uploaded, use an 8 core CPU, and call it 'getting-started-pool'. \n", | ||
| "\n", | ||
| "You could also specify vm_size from a list of xsmall, small, medium, large, and xlarge. These will use 2, 4, 8, 16, or 32 cores, respectively." |
Can you add more information about what the various arguments for create_pool do? In the walkthrough you have max_autoscale_nodes=5, but when I tried this out my job did not seem to autoscale like it used to with our old cfa-azure pipeline. Does autoscale need to be set to true? What's going on?