Commit 48c5ff2

Add GENESIS GPU MLP estimation bring-up
Connect GENESIS GPU runs to the PerfTools MLP_NN/v1.5 estimator path for early CI validation. GENESIS now emits a gpu_kernel_region estimation section when GPU MLP profiling is enabled and a padata archive is available, and the gpu_kernel_mlp_v15 section package can consume BenchKit padata archives containing Nsight Compute raw CSV data.

Add an NCU-to-PerfTools input bridge that extracts profile_raw.csv from padata, normalizes the Nsight Compute columns observed on MiyabiG, fills the current v1.5 static GPU spec gaps for known GPUs, and produces a prepared CSV before invoking predict_v15.py.

Add local validation support via scripts/test_estimate_submit.sh, mirroring test_submit.sh style for scheduler submission and adding an --estimate-only mode that can run inside Apptainer with PERFTOOLS/SIF or the corresponding BK_* variables.

Record GENESIS results with the GENESIS-specific Exp p8 and declare the same baseline Exp for estimation. This avoids falling through to the common CASE0 default, which is QWS-specific and is not a valid GENESIS experiment label.

Keep GENESIS padata estimation artifacts as repo-root-relative paths such as results/padata0.tgz, even when run.sh has executed from inside the benchmark input tree. This lets downstream estimate jobs resolve the artifact downloaded from GitLab CI artifacts.

Treat padata upload HTTP 413 as a non-fatal condition in send_results.sh. The result JSON is already ingested, and large profiler archives remain available to downstream estimation jobs through GitLab artifacts even when the Result Server refuses the padata upload size.

Upload generated results/estimation_inputs from send_estimate.sh using the source result UUID from the Estimate JSON, so portal re-estimation and /api/query/estimation-inputs can retrieve GPU MLP prepared inputs produced in the estimate job.

Allow local matrix generation without CI_PIPELINE_SOURCE by recording PARENT_PIPELINE_SOURCE=local, and align the PerfTools smoke-mode documentation with the repository-wide Python 3.12+ runtime expectation.

Temporary bring-up wiring is intentionally explicit in .gitlab-ci.yml: BK_QWS_GPU_MLP_SMOKE, BK_ESTIMATE_RUNNER_TAG=fncx-estimate-python, BK_GPU_MLP_PERFTOOLS_REPO/REF, BK_GENESIS_GPU_MLP_PROFILE, BK_GPU_MLP_NCU_LAUNCH_COUNT, BK_GPU_MLP_SOURCE_GPU, and BK_GPU_MLP_KERNEL_COUNT are provisional switches. Remove or replace them once the real estimator runner/package flow is settled.

Validation: WSL bash -n and shellcheck -S error for changed shell scripts; python py_compile for prepare_gpu_mlp_ncu_input.py; env -u CI_PIPELINE_SOURCE bash scripts/matrix_generate.sh code=qws system=Fugaku; test_send_estimate_inputs.sh, test_genesis_gpu_mlp_estimation.sh, test_bk_profiler.sh, test_estimation_gpu_kernel_mlp_v15.sh, test_qws_gpu_mlp_smoke_estimation.sh, test_result_profile_data.sh, and test_send_results_profile_data.sh.

Signed-off-by: Yoshifumi Nakamura <nakamura@riken.jp>
1 parent afbfd23 commit 48c5ff2
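The commit message says send_results.sh treats HTTP 413 on the padata upload as non-fatal, but that script is not part of the diff shown here. The following is only a minimal sketch of such a policy. The function name `padata_upload_is_fatal` and the messages are hypothetical, and the status code is assumed to have been captured separately (for example with curl's `-w '%{http_code}'`).

```bash
#!/bin/bash
# Hypothetical helper, not taken from send_results.sh:
# returns 0 (true) when an upload status should fail the job.
padata_upload_is_fatal() {
  local status="$1"
  case "$status" in
    2??)
      return 1 ;;   # success: not fatal
    413)
      # Archive too large for the server: warn and keep going,
      # because the result JSON was already ingested.
      echo "padata upload refused with HTTP 413; continuing" >&2
      return 1 ;;
    *)
      return 0 ;;   # any other status is treated as fatal
  esac
}

# Example with a fixed status code standing in for a real upload.
status=413
if padata_upload_is_fatal "$status"; then
  echo "padata upload failed with HTTP ${status}" >&2
else
  echo "padata upload step finished (HTTP ${status})"
fi
```

Keeping the decision in one small function makes it easy to add other tolerated status codes later without touching the upload call itself.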

16 files changed

Lines changed: 988 additions & 23 deletions

.gitlab-ci.yml

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@ variables:
   BK_ESTIMATE_RUNNER_TAG: "fncx-estimate-python"
   BK_GPU_MLP_PERFTOOLS_REPO: "https://github.com/masaaki-kondo/PerfTools.git"
   BK_GPU_MLP_PERFTOOLS_REF: "main"
+  BK_GENESIS_GPU_MLP_PROFILE: "true"
+  BK_GPU_MLP_NCU_LAUNCH_COUNT: "20"
+  BK_GPU_MLP_SOURCE_GPU: "H100"
+  BK_GPU_MLP_KERNEL_COUNT: "20"

 # Extract system and code filters from API variables or commit message
 .filters: &filters

docs/guides/add-estimation-package.md

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ export BK_GPU_MLP_PERFTOOLS_REF=main
 `BK_QWS_GPU_MLP_SMOKE` is for checking the plumbing with qws, `BK_QWS_GPU_MLP_SMOKE_MODE` switches between importing a prediction fixture and running PerfTools, and `BK_ESTIMATE_RUNNER_TAG` is for manually redirecting the job to an estimation runner/container.
 Once the operation of the real GPU profiling input and the estimation runner is settled, replace these with dedicated package/runner settings and review these provisional variables as candidates for removal.

-Because `perftools` smoke mode fetches PerfTools from GitHub, the estimation runner/container needs `git`, external network access, Python 3.11, and numpy/pandas/torch.
+Because `perftools` smoke mode fetches PerfTools from GitHub, the estimation runner/container needs `git`, external network access, Python 3.12 or later, and numpy/pandas/torch.
 In production, instead of smoke mode, provide a PerfTools checkout in the estimation runner/container and pass a prepared input CSV derived from the real application as the section artifact.

 ## 5. What to store in metadata

programs/genesis/estimate.sh

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+#!/bin/bash
+# estimate.sh — GENESIS estimation entrypoint and run-time section metadata.
+
+genesis_declare_estimation_layout() {
+  bk_clear_estimation_defaults
+  bk_clear_estimation_declarations
+  bk_define_current_estimation_package weakscaling
+  bk_define_future_estimation_package instrumented_app_sections_dummy
+  bk_define_baseline_system "${BK_ESTIMATION_BASELINE_SYSTEM:-MiyabiG}"
+  bk_define_baseline_exp "${BK_ESTIMATION_BASELINE_EXP:-${BK_GENESIS_EXP:-p8}}"
+  bk_define_future_system "${BK_ESTIMATION_FUTURE_SYSTEM:-GPU_MLP_TARGET}"
+  bk_define_current_target_nodes "${BK_ESTIMATION_CURRENT_TARGET_NODES:-1}"
+  bk_define_future_target_nodes "${BK_ESTIMATION_FUTURE_TARGET_NODES:-1}"
+  bk_declare_section --side future gpu_kernel_region gpu_kernel_mlp_v15
+}
+
+genesis_emit_estimation_data_from_fom() {
+  local fom="$1"
+  local artifact_path="results/padata0.tgz"
+  local padata_path="$artifact_path"
+
+  case "${BK_GENESIS_GPU_MLP_PROFILE:-false}" in
+    1|true|TRUE|yes|YES|on|ON) ;;
+    *) return 0 ;;
+  esac
+
+  if [[ -n "${GENESIS_BENCHKIT_ROOT:-}" ]]; then
+    padata_path="${GENESIS_BENCHKIT_ROOT}/${artifact_path}"
+  fi
+  if [[ ! -f "$padata_path" ]]; then
+    echo "Genesis GPU MLP estimation requested but profiler archive was not found: ${padata_path}" >&2
+    return 0
+  fi
+
+  bk_emit_declared_section --side future gpu_kernel_region "$fom" "$artifact_path"
+}
+
+source scripts/bk_functions.sh
+source scripts/estimation/common.sh
+
+BK_ESTIMATION_SECTION_DEFAULT_FACTOR="${BK_ESTIMATION_SECTION_DEFAULT_FACTOR:-1.0}"
+BK_GPU_MLP_ARTIFACT_MODE="${BK_GPU_MLP_ARTIFACT_MODE:-ncu}"
+BK_GPU_MLP_SOURCE_GPU="${BK_GPU_MLP_SOURCE_GPU:-H100}"
+BK_GPU_MLP_KERNEL_COUNT="${BK_GPU_MLP_KERNEL_COUNT:-20}"
+export BK_GPU_MLP_ARTIFACT_MODE
+export BK_GPU_MLP_SOURCE_GPU
+export BK_GPU_MLP_KERNEL_COUNT
+
+genesis_declare_estimation_layout
+bk_estimation_apply_declared_defaults
+BK_ESTIMATION_PACKAGE="${BK_ESTIMATION_PACKAGE:-$BK_ESTIMATION_FUTURE_PACKAGE}"
+
+if [[ "${BASH_SOURCE[0]}" != "$0" ]]; then
+  return 0 2>/dev/null || exit 0
+fi
+
+BK_ESTIMATION_INPUT_JSON="$1"
+
+bk_estimation_run_declared_future_package "$BK_ESTIMATION_INPUT_JSON"
+bk_estimation_run_recorded_current_with_weakscaling \
+  "${BK_ESTIMATION_BASELINE_SYSTEM:-MiyabiG}" \
+  "${BK_ESTIMATION_BASELINE_EXP:-}" \
+  "${BK_ESTIMATION_CURRENT_TARGET_NODES:-1}" \
+  "${BK_ESTIMATION_CURRENT_PACKAGE:-weakscaling}"
+
+bk_estimation_write_output "results/estimate_${est_code}_0.json"
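estimate.sh gates the GPU MLP section on `BK_GENESIS_GPU_MLP_PROFILE` with a `case` pattern that accepts a fixed set of spellings. As a rough illustration of which values that pattern accepts, the sketch below wraps the same pattern in a helper. The name `flag_enabled` is hypothetical and does not appear in the commit.

```bash
#!/bin/bash
# Hypothetical wrapper around the case pattern used in estimate.sh.
# An unset or empty argument is treated as "false".
flag_enabled() {
  case "${1:-false}" in
    1|true|TRUE|yes|YES|on|ON) return 0 ;;
    *) return 1 ;;
  esac
}

# Try a few spellings, including ones the pattern does not list.
for value in true ON 1 True 0 ""; do
  if flag_enabled "$value"; then
    echo "'${value}' -> enabled"
  else
    echo "'${value}' -> disabled"
  fi
done
```

Because the pattern lists exact spellings, mixed-case values such as `True` are not matched and count as disabled, so CI variables should use one of the listed forms.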

programs/genesis/run.sh

Lines changed: 17 additions & 3 deletions
@@ -8,13 +8,16 @@ nthreads="$4"
 numproc=$(( numproc_node * nodes ))

 source "${PWD}/scripts/bk_functions.sh"
+source "${PWD}/programs/genesis/estimate.sh"

 SCRIPT_DIR="${PWD}"
+export GENESIS_BENCHKIT_ROOT="$SCRIPT_DIR"
 REPO_DIR="genesis_benchmark_input"
 REPO_URL="https://github.com/genesis-release-r-ccs/${REPO_DIR}.git"
 BRANCH="main"
 dir_path="npt/genesis2.0beta_3.5fs/apoa1"
 header=p8
+exp="${BK_GENESIS_EXP:-$header}"
 input=${header}.inp
 resultsdir=${SCRIPT_DIR}/results
 artifactsdir=${SCRIPT_DIR}/artifacts
@@ -152,7 +155,15 @@ run_genesis_gh200_gpu() {
   fi

   genesis_profiler_tool=$(bk_get_profiler_tool "$genesis_profiler_requested") || return 1
-  genesis_profiler_level="${!profiler_level_var:-${GENESIS_PROFILER_LEVEL:-single}}"
+  local genesis_default_profiler_level="single"
+  case "${BK_GENESIS_GPU_MLP_PROFILE:-false}" in
+    1|true|TRUE|yes|YES|on|ON)
+      genesis_default_profiler_level="detailed"
+      export BK_PROFILER_NCU_RAW_CSV="${BK_PROFILER_NCU_RAW_CSV:-true}"
+      export BK_PROFILER_ARGS="${BK_PROFILER_ARGS:---launch-count ${BK_GPU_MLP_NCU_LAUNCH_COUNT:-20}}"
+      ;;
+  esac
+  genesis_profiler_level="${!profiler_level_var:-${GENESIS_PROFILER_LEVEL:-${genesis_default_profiler_level}}}"
   if [ -n "$genesis_profiler_tool" ]; then
     if [ "$genesis_profiler_tool" = "ncu" ] && ! command -v ncu >/dev/null 2>&1; then
       if [ "$genesis_profiler_explicit" -eq 1 ]; then
@@ -223,14 +234,17 @@ fom_val=$(awk -F'=' '/^[[:space:]]*dynamics[[:space:]]*=/ {
     print $2;
     exit
 }' ${output})
-cd - > /dev/null
+cd "$SCRIPT_DIR" > /dev/null

 if [[ -z "$fom_val" ]]; then
   echo "Warning: FOM value not found in ${output}" >&2
   fom_val="nan" # or 0.0
 fi

-bk_emit_result --fom "$fom_val" --nodes "$nodes" --numproc-node "$numproc_node" --nthreads "$nthreads" >> ${resultsdir}/result
+{
+  bk_emit_result --fom "$fom_val" --exp "$exp" --nodes "$nodes" --numproc-node "$numproc_node" --nthreads "$nthreads"
+  genesis_emit_estimation_data_from_fom "$fom_val"
+} >> ${resultsdir}/result
 # if information is requierd
 #printf "%-10s nodes=%2d numproc=%3d FOM: %.3f\n" \
 # "$system" "$nodes" "$numproc" "$fom_val" >> ../results/result
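The profiler level in run.sh is resolved through a three-step fallback: a per-case variable read via indirect expansion, then `GENESIS_PROFILER_LEVEL`, then a default that becomes `detailed` when GPU MLP profiling is on. The sketch below isolates that precedence. `resolve_profiler_level` and `DEMO_LEVEL` are hypothetical names used only for illustration.

```bash
#!/bin/bash
# Hypothetical helper mirroring the nested expansion in run.sh:
#   1. the variable whose *name* is passed in $1 (indirect expansion)
#   2. GENESIS_PROFILER_LEVEL
#   3. the default level passed in $2
resolve_profiler_level() {
  local var_name="$1" default_level="$2"
  echo "${!var_name:-${GENESIS_PROFILER_LEVEL:-${default_level}}}"
}

# Nothing set: the default wins.
unset GENESIS_PROFILER_LEVEL DEMO_LEVEL
resolve_profiler_level DEMO_LEVEL detailed

# The global override beats the default.
GENESIS_PROFILER_LEVEL=single
resolve_profiler_level DEMO_LEVEL detailed

# The per-case variable beats both.
DEMO_LEVEL=full
resolve_profiler_level DEMO_LEVEL detailed
```

With this ordering, enabling `BK_GENESIS_GPU_MLP_PROFILE` only changes the last fallback, so any level a user sets explicitly still takes precedence.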

scripts/bk_functions.sh

Lines changed: 22 additions & 0 deletions
@@ -908,6 +908,8 @@ bk_profiler_write_meta() {
       else
         _bk_meta_ncu_report_path=""
       fi
+      _bk_meta_ncu_raw_csv_path="raw/${_bk_meta_name}/profile_raw.csv"
+      _bk_meta_ncu_raw_csv_abs="${_bk_meta_stage_dir}/${_bk_meta_ncu_raw_csv_path}"
       ;;
     *)
       _bk_meta_text_path=""
@@ -965,6 +967,13 @@ bk_profiler_write_meta() {
     printf ' {"kind": "ncu_report", "path": "%s"}' "$_bk_meta_ncu_report_path"
     _bk_meta_has_report=1
   fi
+  if [ "${_bk_meta_tool}" = "ncu" ] && [ -f "${_bk_meta_ncu_raw_csv_abs:-}" ]; then
+    if [ "$_bk_meta_has_report" -eq 1 ]; then
+      printf ',\n'
+    fi
+    printf ' {"kind": "ncu_raw_csv", "path": "%s"}' "$_bk_meta_ncu_raw_csv_path"
+    _bk_meta_has_report=1
+  fi
   if [ "$_bk_meta_has_report" -eq 1 ]; then
     printf '\n'
   fi
@@ -1141,6 +1150,19 @@ bk_profiler() {
   else
     echo "bk_profiler[ncu]: failed ${_bk_ncu_rep_name} level=${_bk_profiler_level} status=${_bk_profiler_status}" >&2
   fi
+  case "${BK_PROFILER_NCU_RAW_CSV:-false}" in
+    1|true|TRUE|yes|YES|on|ON)
+      _bk_ncu_report_file=$(bk_profiler_find_ncu_report "$_bk_ncu_rep_dir" || true)
+      if [ -n "$_bk_ncu_report_file" ]; then
+        ncu --import "$_bk_ncu_report_file" \
+          --page raw \
+          --csv \
+          --print-units base \
+          --print-fp \
+          > "${_bk_ncu_rep_dir}/profile_raw.csv" 2> "${_bk_ncu_rep_dir}/profile_raw.csv.log" || true
+      fi
+      ;;
+  esac
   cp -R "$_bk_ncu_rep_dir" "$_bk_stage_dir/raw/${_bk_ncu_rep_name}"
   _bk_profiler_run_names="${_bk_ncu_rep_name}"
   _bk_profiler_run_events="${_bk_profiler_level}"
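The metadata hunk above appends an `ncu_raw_csv` entry only when the CSV exists, and uses a "has report" flag to decide whether a separating comma is needed. A minimal sketch of that pattern follows. The function `emit_report_entries`, the `text_report` kind, and the surrounding `"reports"` key are assumptions made for the example, not names confirmed by the diff.

```bash
#!/bin/bash
# Hypothetical helper: takes kind/path pairs and prints one JSON object
# per existing file, separated by ",\n" with no trailing comma.
# Paths are not JSON-escaped, so this assumes plain file names.
emit_report_entries() {
  local has_report=0 kind path
  while [ "$#" -ge 2 ]; do
    kind="$1"
    path="$2"
    shift 2
    [ -f "$path" ] || continue          # skip entries whose file is missing
    if [ "$has_report" -eq 1 ]; then
      printf ',\n'                      # separator before every entry but the first
    fi
    printf '    {"kind": "%s", "path": "%s"}' "$kind" "$path"
    has_report=1
  done
  if [ "$has_report" -eq 1 ]; then
    printf '\n'                         # close the last entry's line
  fi
}

# Example: two files exist, one does not.
workdir=$(mktemp -d)
touch "${workdir}/profile.ncu-rep" "${workdir}/profile_raw.csv"
printf '  "reports": [\n'
emit_report_entries \
  ncu_report  "${workdir}/profile.ncu-rep" \
  text_report "${workdir}/missing.txt" \
  ncu_raw_csv "${workdir}/profile_raw.csv"
printf '  ]\n'
rm -rf "$workdir"
```

Emitting the comma before each entry after the first, rather than after each entry, keeps the output valid JSON regardless of which optional files are present.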
