initial update#879

Draft
engineeredcurlz wants to merge 65 commits into main from test-refactor

Conversation

@engineeredcurlz

No description provided.

@engineeredcurlz
Author

@microsoft-github-policy-service agree company="Microsoft"

yaml.safe_load_all() enters an infinite loop when passed a MagicMock
object because PyYAML detects the .read attribute and treats it as a
file-like stream, then loops forever waiting to buffer enough bytes
(len(MagicMock()) returns 0 by default).

Fix by setting create_template.return_value to a valid YAML string in
the three create_deployment tests, so yaml.safe_load_all receives a
real string and parses it via the non-blocking code path.

Affected tests:
- test_create_deployment_success
- test_create_deployment_failure
- test_create_deployment_partial_success
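A minimal sketch of the failure and the fix (the `create_template` mock mirrors the tests; the YAML content is illustrative):

```python
from unittest.mock import MagicMock
import yaml

# A bare MagicMock auto-creates any attribute, including .read, so PyYAML
# treats it as a file-like stream and enters the blocking buffer loop.
assert hasattr(MagicMock(), "read")

# The fix: make the mocked template helper return a real YAML string, so
# yaml.safe_load_all takes the plain-string (non-blocking) code path.
VALID_YAML = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
"""
create_template = MagicMock()
create_template.return_value = VALID_YAML

docs = list(yaml.safe_load_all(create_template()))
assert docs[0]["kind"] == "Deployment"
```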
begin_create_or_update() returns an LROPoller that was being discarded, allowing execution to continue while Azure still had an operation in progress. Subsequent scale/delete calls were then rejected with OperationNotAllowed.

Fix by calling poller.result() in scale_node_pool and _progressive_scale to block until Azure fully completes each operation before proceeding.
)
 node_pool.count = step  # Update node count in the node pool object
-result = self.aks_client.agent_pools.begin_create_or_update(
+poller = self.aks_client.agent_pools.begin_create_or_update(
Contributor

Why did we add poller here? We do not need it.

Author

While running the pipeline I was getting the failure "OperationNotAllowed: Operation is not allowed because there's an in progress scale node pool operation". Execution kept moving forward while the prior operation was still in progress.

begin_create_or_update was already returning a poller, but it was being discarded. Calling poller.result() (line 473) enforces a wait so that each operation finishes before the next one starts, which fixes the failure I was seeing.
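The pattern can be sketched as follows; the method and argument names are illustrative, and a MagicMock stands in for the real AKS client so the sketch runs without Azure:

```python
from unittest.mock import MagicMock

def scale_node_pool(aks_client, resource_group, cluster_name, pool_name, node_pool):
    # begin_create_or_update returns an LROPoller immediately; discarding it
    # lets the caller issue the next scale/delete call while Azure is still
    # working, which surfaces as OperationNotAllowed.
    poller = aks_client.agent_pools.begin_create_or_update(
        resource_group, cluster_name, pool_name, node_pool
    )
    # Block until the long-running operation actually completes.
    return poller.result()

# Exercised with a mock client (no real Azure call):
client = MagicMock()
client.agent_pools.begin_create_or_update.return_value.result.return_value = "Succeeded"
assert scale_node_pool(client, "rg", "aks", "np1", object()) == "Succeeded"
```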

wait_condition_type="available",
resource_name=deployment_name,
namespace="default",
timeout_seconds=300 # 5 minutes timeout
Contributor

Can we use the self.step_timeout here

Author

Yes, have removed the hardcoded timeout and now use self.step_timeout.
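A sketch of the change; the class and method names are illustrative:

```python
class WorkloadRunner:
    def __init__(self, step_timeout=300):
        # One configurable timeout shared by every wait step, instead of a
        # hardcoded 300 repeated at each call site.
        self.step_timeout = step_timeout

    def wait_kwargs(self, deployment_name):
        return {
            "resource_type": "deployment",
            "wait_condition_type": "available",
            "resource_name": deployment_name,
            "namespace": "default",
            "timeout_seconds": self.step_timeout,
        }

runner = WorkloadRunner(step_timeout=120)
assert runner.wait_kwargs("web")["timeout_seconds"] == 120
```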

Returns:
True if all deployment creations were successful, False otherwise
"""
logger.info(f"Creating {number_of_deployments} deployment(s)")
Contributor

Follow %-style logging for consistency with other methods.

Author

Have refactored this and fixed it.
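The refactored call would look something like this (logger name is illustrative):

```python
import logging

logger = logging.getLogger("workload")

def log_creation(number_of_deployments):
    # %-style formatting is lazy: the message is only interpolated when a
    # handler actually emits the record, unlike an f-string, which always
    # pays the formatting cost up front.
    logger.info("Creating %d deployment(s)", number_of_deployments)

# What such a record renders to:
record = logging.LogRecord("workload", logging.INFO, __file__, 0,
                           "Creating %d deployment(s)", (3,), None)
assert record.getMessage() == "Creating 3 deployment(s)"
```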

operation_timeout_in_minutes=5,
namespace="default",
pod_count=replicas,
label_selector="app=nginx-container"
Contributor

Let's make label_selector derive from the parameters instead of hardcoding it in the template and here.

Author

Have refactored this and updated the deployment file.
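The derivation can be sketched like this; the helper name and parameter are illustrative:

```python
def build_label_selector(app_name):
    # Derive the selector from the same parameter that populates the
    # deployment template, instead of hardcoding "app=nginx-container"
    # both in the template and at the call site.
    return f"app={app_name}"

assert build_label_selector("nginx-container") == "app=nginx-container"
```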

resource_type="deployment",
wait_condition_type="available",
resource_name=deployment_name,
namespace="default",
Contributor

Pass namespace as a parameter and make "default" its default value.

Author

refactored this
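A sketch of the refactored signature; the function name and kwargs are illustrative:

```python
def wait_for_deployment(deployment_name, namespace="default", timeout_seconds=300):
    # namespace is now a parameter whose default preserves the old behavior;
    # callers targeting another namespace can simply override it.
    return {
        "resource_type": "deployment",
        "wait_condition_type": "available",
        "resource_name": deployment_name,
        "namespace": namespace,
        "timeout_seconds": timeout_seconds,
    }

assert wait_for_deployment("web")["namespace"] == "default"
assert wait_for_deployment("web", namespace="staging")["namespace"] == "staging"
```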

"deployment", parents=[common_parser], help="create deployments"
)
deployment_parser.add_argument("--node-pool-name", required=True, help="Node pool name")
deployment_parser.add_argument("--deployment-name", required=True, help="Deployment name")
Contributor

Do we need deployment name?

Author

Umm, probably not; it's not mentioned in deploy_kwargs. I'll remove it.

deployment_parser.add_argument("--node-pool-name", required=True, help="Node pool name")
deployment_parser.add_argument("--deployment-name", required=True, help="Deployment name")
deployment_parser.add_argument(
"--number_of_deployments",
Contributor

Should be --number-of-deployments, not --number_of_deployments.

Author

fixed the hyphen
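For reference, argparse maps the hyphenated flag back to an underscored attribute, so the rest of the code keeps reading args.number_of_deployments:

```python
import argparse

parser = argparse.ArgumentParser()
# Hyphenated flag per CLI convention; argparse exposes it on the namespace
# as args.number_of_deployments automatically.
parser.add_argument("--number-of-deployments", type=int, default=1,
                    help="Number of deployments to create")

args = parser.parse_args(["--number-of-deployments", "3"])
assert args.number_of_deployments == 3
```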

"replicas": args.replicas,
"manifest_dir": args.manifest_dir,
"number_of_deployments": args.number_of_deployments
}
Contributor

add else here

else:
    logger.error("Unknown workload command: '%s'", command)
    return 1

Author

Have added this to stop false successes; unknown commands now log an error and return a failure code.
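A sketch of the dispatch with the fallback branch; the command names and handler are illustrative:

```python
import logging

logger = logging.getLogger("workload")

def run_workload_command(command, args):
    # Explicit dispatch with an else branch: an unrecognized command now logs
    # an error and returns a nonzero exit code instead of silently falling
    # through and being reported as a success.
    if command == "deployment":
        return 0  # would call the deployment handler in the real code
    else:
        logger.error("Unknown workload command: '%s'", command)
        return 1

assert run_workload_command("deployment", None) == 0
assert run_workload_command("bogus", None) == 1
```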

-node groups, including create, scale (up/down), and delete operations. It supports
-both direct and progressive scaling operations and handles GPU-enabled node groups.
+node groups, including create, scale (up/down), and delete operations.
+It supports both direct and progressive scaling operations and handles GPU-enabled node groups.
Contributor

Lets revert this change

Author

Restored the original line wrapping.
