Error Codes
Paul Nilsson edited this page Apr 17, 2026 · 22 revisions
When it detects a fatal problem, the Pilot assigns an error code and informs the server. Along with the numerical code, it reports the meaning of the error and more detailed error diagnostics.
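
As a rough illustration of the three pieces of information reported for a fatal problem (numerical code, meaning, diagnostics), the sketch below builds such a report from a small lookup table. The names `ErrorReport`, `make_report` and `ERROR_MEANINGS` are hypothetical and chosen for this example only; they are not part of the Pilot API. The three codes and their meanings are copied from the table on this page, and the "UNKNOWN" placeholder for unlisted codes is an assumption.

```python
from dataclasses import dataclass

# Hypothetical excerpt of a code -> (acronym, meaning) lookup.
# The three entries are copied from the error code table on this page.
ERROR_MEANINGS = {
    1098: ("NOLOCALSPACE", "Not enough local space"),
    1099: ("STAGEINFAILED", "Failed to stage-in file"),
    1150: ("LOOPINGJOB", "Looping job killed by pilot"),
}


@dataclass(frozen=True)
class ErrorReport:
    """The information reported for a fatal problem."""
    code: int          # numerical error code
    acronym: str       # short symbolic name
    meaning: str       # human-readable meaning of the code
    diagnostics: str   # more detailed, case-specific diagnostics


def make_report(code: int, diagnostics: str = "") -> ErrorReport:
    """Build a report; codes missing from the lookup get a placeholder (assumption)."""
    acronym, meaning = ERROR_MEANINGS.get(code, ("UNKNOWN", f"Unknown error code: {code}"))
    return ErrorReport(code, acronym, meaning, diagnostics)


report = make_report(1098, "No space left on device")
print(f"{report.code} {report.acronym}: {report.meaning} ({report.diagnostics})")
```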
| Error code | Acronym | Meaning | Notes |
|---|---|---|---|
| 1008 | GENERALERROR | General pilot error, consult batch log | Catch-all error set when no more specific code applies. Also set by the proxy validation code (proxy.py) when grid-proxy-info or voms-proxy-info return an unexpected error. Maps to wrapper exit code 65. |
| 1098 | NOLOCALSPACE | Not enough local space | Set e.g. by job monitoring, and also when a copytool command fails and "No space left on device" is found in the command output |
| 1099 | STAGEINFAILED | Failed to stage-in file | Set by the data layer (api/data.py, control/data.py) and individual copytools (rucio, mv, lsm) when a file transfer into the work directory fails and no more specific error code (e.g. STAGEINTIMEOUT, REPLICANOTFOUND) applies. Also the error code of the StageInFailure exception class. |
| 1100 | REPLICANOTFOUND | The rucio API function list_replicas() did not return any replicas. Check log for details. | Defined and present in error messages but not currently assigned anywhere in the Pilot 3 source. It is a legacy code inherited from Pilot 1 and reserved for future use. The nearest active equivalents are NOREPLICAS (1326) and RUCIOLISTREPLICASFAILED (1322). |
| 1103 | NOSUCHFILE | No such file or directory | Raised by the open_file() function. Also set if a copytool fails and "No such file or directory" is found in its output |
| 1104 | USERDIRTOOLARGE | User work directory too large | The error is set if the user work directory exceeds the maximum allowed limit, as defined by schedconfig.maxwdir (default: 14 GB) |
| 1106 | STDOUTTOOBIG | Payload log or stdout file too big | Set if the payload stdout exceeds the maximum allowed size of 2 GB, as defined in the Pilot's default config file |
| 1110 | SETUPFAILURE | Failed during payload setup | Set if the string "General payload setup verification error" is found in the transform stderr |
| 1115 | NFSSQLITE | NFS SQLite locking problems | The Pilot identifies this error by searching for the strings "prepare 5 database is locked" and "Error SQLiteStatement" in the payload stdout |
| 1116 | QUEUEDATA | Pilot could not download queuedata | Raised as a PilotException from info/extinfo.py when the queue name is not specified, or when the server response contains an error. Also the error code of the QueuedataFailure exception class. Maps to wrapper exit code 71. |
| 1117 | QUEUEDATANOTOK | Pilot found non-valid queuedata | Raised when downloaded queuedata fails internal validation. Also the error code of the QueuedataNotOK exception class. Maps to wrapper exit code 72. Note: the source comments mark this as "not implemented yet, error code added". |
| 1124 | OUTPUTFILETOOLARGE | Output file too large | Set before stage-out if an output file exceeds the maximum allowed size (defined in the Pilot config or schedconfig) |
| 1133 | NOSTORAGE | Fetching default storage failed: no activity related storage defined | Raised as a PilotException from api/data.py when no DDM endpoint can be resolved for the required activity (states: NO_ASTORAGES_DEFINED, NO_OUTPUTSTORAGE_DEFINED). Also raised in the Event Service executor when failover storage is exhausted. |
| 1137 | STAGEOUTFAILED | Failed to stage-out file | Set by control/data.py and individual copytools (gs, mv, lsm, s3, rucio) when a file transfer out of the work directory fails and no more specific code applies. Also the error code of the StageOutFailure exception class. |
| 1141 | PUTMD5MISMATCH | md5sum mismatch on output file | Set in copytool/rucio.py and copytool/common.py when the md5 checksum of a staged-out file does not match the catalog value. When the checksum type is adler32, PUTADMISMATCH (1172) is used instead. Error acronym should be renamed. |
| 1143 | CHMODTRF | Failed to chmod trf | Returned by get_setup_command() in all user setup modules (atlas, epic, rubin, ska, darkside, sphenix, generic) when chmod 0755 on the downloaded transform binary fails. |
| 1144 | PANDAKILL | This job was killed by panda server | Set when the server responds with 'tobekilled' during a state update, when a 'kill' or 'softkill' debug command is received, or when job_monitor detects abort_job without an OS-level kill signal. Also set if gdb successfully produced a core dump as requested by the server. |
| 1145 | GETMD5MISMATCH | md5sum mismatch on input file | Set in copytool/common.py when an md5 checksum mismatch is detected on a staged-in file. When the checksum type is adler32, GETADMISMATCH (1171) is used instead. Error acronym should be renamed. |
| 1149 | TRFDOWNLOADFAILURE | Transform could not be downloaded | Returned by get_setup_command() in all user setup modules (atlas, epic, rubin, ska, darkside, sphenix, generic) when the HTTP(S) download of the transform executable fails (e.g. curl or wget exits non-zero). |
| 1150 | LOOPINGJOB | Looping job killed by pilot | The pilot kills the payload (or stops stage-in/out) if it detects no activity within the allowed time, i.e. if no files are touched in the work directory or if a file transfer is stuck. The default looping job time limit is 12*3600 s for production jobs and 3*3600 s for user analysis jobs. The limit can be overridden in the pilot's config file, or set by the user via the maxCPUCount variable |
| 1151 | STAGEINTIMEOUT | File transfer timed out during stage-in | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr). Called GETTIMEOUT in Pilot 1. |
| 1152 | STAGEOUTTIMEOUT | File transfer timed out during stage-out | Currently only identified for rucio file transfer (unless "Operation timed out" is in stderr). Called PUTTIMEOUT in Pilot 1. |
| 1163 | NOPROXY | Grid proxy not valid | Raised by the NoGridProxy exception class. Set in copytool/common.py when "Could not establish context" is found in copytool output, and in atlas/proxy.py when grid-proxy-info fails or returns a short-lived proxy. Maps to wrapper exit code 68. |
| 1165 | MISSINGOUTPUTFILE | Local output file is missing | Set in atlas/common.py when a file listed in the job report cannot be found in the work directory after payload execution. Also raised as a PilotException from api/data.py (state FILE_INFO_FAIL) when output file metadata cannot be resolved. |
| 1168 | SIZETOOLARGE | Total file size too large | Before stage-in, the pilot verifies that the sum of the input file sizes does not exceed maxwdir (set in schedconfig or in pilot config file). Any files that are to be accessed directly/remotely are excluded |
| 1171 | GETADMISMATCH | adler32 mismatch on input file | Set in copytool/common.py when adler32 verification fails on a staged-in file ("failed xrdadler32" or "does not match the checksum" with 'adler32' present in the output). When the checksum type is md5, GETMD5MISMATCH (1145) is used instead. Error acronym should be renamed. |
| 1172 | PUTADMISMATCH | adler32 mismatch on output file | Set in copytool/rucio.py and copytool/common.py when the adler32 checksum of a staged-out file does not match the catalog value. When the checksum type is md5, PUTMD5MISMATCH (1141) is used instead. Error acronym should be renamed. |
| 1177 | NOVOMSPROXY | Voms proxy not valid | Raised by the NoVomsProxy exception class. Set when arcproxy exits non-zero in any user proxy module (atlas, epic, rubin, etc.). For shared-library failures, ARCPROXYLIBFAILURE (1381) is used instead. |
| 1180 | GETGLOBUSSYSERR | Globus system error during stage-in | Detected in copytool/common.py when the string "globus_xio:" is found in copytool command output during stage-in. Note: the wiki previously contained a typo ("globes_xio"). |
| 1181 | PUTGLOBUSSYSERR | Globus system error during stage-out | Detected in copytool/common.py when the string "globus_xio:" is found in copytool command output during stage-out. The detection logic is shared with GETGLOBUSSYSERR (1180); stage direction determines which code is assigned. |
| 1186 | NOSOFTWAREDIR | Software directory does not exist | Raised as a NoSoftwareDir exception from all user setup modules (atlas, epic, rubin, ska, darkside, sphenix, generic) when the application directory (appdir) does not exist on the worker node at setup time. Maps to wrapper exit code 73. |
| 1187 | NOPAYLOADMETADATA | Payload metadata does not exist | Set when job report metadata is absent after payload execution. In atlas/diagnose.py, if this is the only error and the transform exit code is non-zero, the pilot attempts further diagnosis to find the real cause — so this code often indicates a preceding uncaught error rather than a genuine missing metadata condition. |
| 1190 | LFNTOOLONG | LFN too long (exceeding limit of 255 characters) | When validating a job definition, before executing the payload, the Pilot makes sure that no output file has an LFN that is longer than 255 characters (which is not supported by the DDM system) |
| 1191 | ZEROFILESIZE | File size cannot be zero | Before executing the stage-out command, the Pilot verifies that the size of the file is not zero (which will not be accepted by any storage system) |
| 1199 | MKDIR | Failed to create local directory | Also set if "cannot create directory" is found in transform stderr (via resolve_transform_error()). Also the error code of the MkdirFailure exception class. |
| 1200 | KILLSIGNAL | Job terminated by unknown kill signal | Catch-all for kill signals that do not map to a more specific signal error code (1201–1208). Maps to wrapper exit code 137. |
| 1201 | SIGTERM | Job killed by signal: SIGTERM | Maps to wrapper exit code 143 (128+15). On Kubernetes resources, SIGTERM caused by node preemption is reported as PREEMPTION (1379) instead. |
| 1202 | SIGQUIT | Job killed by signal: SIGQUIT | Registered in the pilot's signal handler. Typically sent by the terminal or batch system to request a core dump before terminating the process. |
| 1203 | SIGSEGV | Job killed by signal: SIGSEGV | Registered in the pilot's signal handler. When received at the pilot process level it indicates a segmentation fault in the pilot itself. Contrast with PAYLOADSIGSEGV (1328), which is extracted from the payload job report. |
| 1204 | SIGXCPU | Job killed by signal: SIGXCPU | Typically raised by the batch system when a job exceeds its CPU time limit |
| 1205 | USERKILL | Job killed by user | Reserved error code for user defined kill instructions. Currently not implemented |
| 1206 | SIGBUS | Job killed by signal: SIGBUS | Typically indicates a bus error (misaligned memory access or hardware fault) |
| 1207 | SIGUSR1 | Job killed by signal: SIGUSR1 | User-defined signal 1; used by some batch systems to request graceful job termination before walltime is reached |
| 1208 | SIGINT | Job killed by signal: SIGINT | Maps to wrapper exit code 130 (128+2). Registered as a catchable signal alongside SIGTERM, SIGQUIT, SIGSEGV, SIGXCPU, SIGUSR1 and SIGBUS in the pilot's signal handler. |
| 1211 | MISSINGINSTALLATION | Missing installation | Assigned error code if the payload fails to execute the transform |
| 1212 | PAYLOADOUTOFMEMORY | Payload ran out of memory | Assigned error code if the pilot finds the string "FATAL out of memory: taking the application down" in the stderr and "St9bad_alloc", "std::bad_alloc" in the stdout |
| 1213 | REACHEDMAXTIME | Reached batch system time limit | Pilot aborts automatically when 10 minutes remain of the maximum allowed running time, as set by 1) schedconfig.maxtime or 2) Pilot option -l <maxtime> (both values are in seconds). Maps to wrapper exit code 81. |
| 1220 | UNKNOWNPAYLOADFAILURE | Job failed due to unknown reason (consult log file) | Fallback code set when the payload exits non-zero but no specific error pattern is matched in stdout/stderr and no recognised error is found in the job report error list. |
| 1221 | FILEEXISTS | File already exists | Error code is set if "File exists", "SRM_FILE_BUSY" or "file already exists" is found in copytool command output |
| 1223 | BADALLOC | Transform failed due to bad_alloc | Assigned error code if the pilot finds "badalloc" among the job report errors |
| 1224 | ESRECOVERABLE | Event service: recoverable error | Used in Event Service / Jumbo Jobs mode. Signals that the current event range can be retried |
| 1228 | ESFATAL | Event service: fatal error | Used in Event Service / Jumbo Jobs mode. Signals an unrecoverable error that should not be retried |
| 1234 | EXECUTEDCLONEJOB | Clone job is already executed | Set when the pilot detects that a clone of the current job has already completed successfully elsewhere, preventing duplicate output |
| 1235 | PAYLOADEXCEEDMAXMEM | Payload exceeded maximum memory | Set when the memory monitor (prmon) reports that the payload exceeded the maximum allowed memory as defined by the job definition (maxRSS) |
| 1236 | FAILEDBYSERVER | Failed by server | This error is not set by the pilot. It is currently only set by Harvester. Note: the acronym in the source is FAILEDBYSERVER (not KILLEDBYSERVER as in Pilot 1) |
| 1238 | ESNOEVENTS | Event service: no events | Used in Event Service / Jumbo Jobs mode. Set when the server has no more event ranges to assign to this pilot |
| 1240 | MESSAGEHANDLINGFAILURE | Failed to handle message from payload | Set when the pilot fails to parse or act on a message received from the payload process (e.g. malformed event service status message) |
| 1242 | CHKSUMNOTSUP | Query checksum is not supported | The error code is set if Pilot finds "query chksum is not supported" or "Unable to checksum" in command output |
| 1244 | NORELEASEFOUND | No release candidates found | Set in atlas/resource/grid.py when the asetup/lsetup output contains the string "No release candidates found", indicating the requested ATLAS software release is not installed at the site. |
| 1246 | NOUSERTARBALL | User tarball could not be downloaded from PanDA server | Set via set_error_nousertarball() in atlas/diagnose.py when payload exit code 146 is detected, indicating the user code tarball could not be downloaded. The tarball URL is extracted from the tail of payload stdout and included in diagnostics. |
| 1247 | BADXML | Badly formed XML | Raised as a BadXML exception from atlas/diagnose.py when the job report XML metadata file exists but cannot be parsed, typically due to an illegal character in the output. |
| 1300 | NOTIMPLEMENTED | The class or function is not implemented | Error code of the NotImplemented exception class. Used as a placeholder for abstract methods in base classes that have not yet been implemented in a concrete subclass. |
| 1301 | UNKNOWNEXCEPTION | An unknown pilot exception has occurred | Default error code of the PilotException base class and the UnknownException subclass. Also raised from util/workernode.py when unexpected OS-level errors occur during worker node probing. Maps to wrapper exit code 70. |
| 1302 | CONVERSIONFAILURE | Failed to convert object data | E.g. if a JSON dictionary can't be converted from unicode to utf-8 |
| 1303 | FILEHANDLINGFAILURE | Failed during file handling | E.g. if a file can't be opened or a dictionary can't be loaded from file |
| 1305 | PAYLOADEXECUTIONFAILURE | Failed to execute payload | Set when the transform returns a non-zero exit code that does not map to a more specific error. Also the fallback code in resolve_transform_error() for unrecognised non-zero exits |
| 1306 | SINGULARITYGENERALFAILURE | Singularity/Apptainer: general failure | Site issue; set if the Pilot finds "Operation not permitted" in stderr |
| 1307 | SINGULARITYNOLOOPDEVICES | Singularity/Apptainer: No more available loop devices | Site issue; set if Pilot finds "No more available loop devices" in stderr |
| 1308 | SINGULARITYBINDPOINTFAILURE | Singularity/Apptainer: Not mounting requested bind point | Site issue; set if the Pilot finds "Not mounting requested bind point" in stderr |
| 1309 | SINGULARITYIMAGEMOUNTFAILURE | Singularity/Apptainer: Failed to mount image | Site issue; set if the Pilot finds "Failed to mount image" or "error: while mounting" in stderr |
| 1310 | PAYLOADEXECUTIONEXCEPTION | Exception caught during payload execution | Internal pilot problem |
| 1311 | NOTDEFINED | Not defined | A general, internally used error that is explained in the error diagnostics of the corresponding exception (NotDefined); e.g. the analytics package throws this exception if a fit has not been defined, or if a math function fails to convert a string to an integer |
| 1312 | NOTSAMELENGTH | Not same length | Internally used error corresponding to exception NotSameLength, which is thrown if input data are not of same length in a fit |
| 1313 | NOSTORAGEPROTOCOL | No protocol defined for storage endpoint | Set when building a storage endpoint SURL and no transfer protocol (e.g. srm, gsiftp, davs) is defined for that endpoint |
| 1314 | UNKNOWNCHECKSUMTYPE | Unknown checksum type | Set when the pilot encounters a checksum type that is not adler32 or md5 |
| 1315 | UNKNOWNTRFFAILURE | Unknown transform failure | Set for transform exit code 251, -1, or any other exit code that cannot be mapped to a more specific error |
| 1316 | RUCIOSERVICEUNAVAILABLE | Rucio: Service unavailable | Set if the corresponding Rucio error details (a regular expression match or the string "service_unavailable") are found in the copytool command output |
| 1317 | EXCEEDEDMAXWAITTIME | Exceeded maximum waiting time | Internally used exception/error code. Exception thrown by pilot monitoring when abort_job wait time has been exceeded (and when other threads have not finished cleaning up on time). abort_job is set when pilot has received a kill signal |
| 1318 | COMMUNICATIONFAILURE | Failed to communicate with server | Set in control/job.py when the pilot fails to report job state to the PanDA server (e.g. in pod/stager mode when send_state() returns False). Also used throughout the Event Service communication manager for PanDA server contact failures. Maps to wrapper exit code 79. |
| 1319 | INTERNALPILOTPROBLEM | An internal Pilot problem has occurred (consult Pilot log) | Error code used for internal debugging. A more precise error message should be written to the log |
| 1320 | LOGFILECREATIONFAILURE | Failed during creation of log file | In case tarfile.open() or the archive.add() fails, the pilot will set this error code |
| 1321 | RUCIOLOCATIONFAILED | Failed to get client location for Rucio | Set when the pilot cannot determine the correct Rucio client location (e.g. during RSE endpoint resolution) |
| 1322 | RUCIOLISTREPLICASFAILED | Failed to get replicas from Rucio | Set when the Rucio list_replicas() call raises an exception (as opposed to returning an empty result, which is 1100) |
| 1323 | UNKNOWNCOPYTOOL | Unknown copy tool | Set if the requested copy tool has no implementation |
| 1324 | SERVICENOTAVAILABLE | Service not available at the moment | Rucio server not available |
| 1325 | SINGULARITYNOTINSTALLED | Singularity: not installed | Identified by trf exit code 64 and the string "Singularity is not installed" present in the stderr |
| 1326 | NOREPLICAS | No matching replicas were found in list_replicas() output | list_replicas() returned replicas but no local matching replica was found |
| 1327 | UNREACHABLENETWORK | Unable to stage-in file since network is unreachable | Problem seen with xrdcp command during stage-in |
| 1328 | PAYLOADSIGSEGV | SIGSEGV: Invalid memory reference or a segmentation fault | Special payload error extracted from job report. A SIGSEGV is an error (signal) caused by an invalid memory reference or a segmentation fault. The payload is probably trying to access an array element out of bounds or trying to use too much memory |
| 1329 | NONDETERMINISTICDDM | Failed to construct SURL for non-deterministic ddm (update CRIC) | While Pilot 1 ignored the is_deterministic endpoint field if the storage path ended in /rucio, Pilot 3 will instead fail the job if the endpoint is not deterministic. The endpoint should be fixed in CRIC |
| 1330 | JSONRETRIEVALTIMEOUT | JSON retrieval timed out | Error is assigned if the pilot fails to download JSON |
| 1331 | MISSINGINPUTFILE | Input file is missing in storage element | Detected in copytool/common.py when "No such file or directory" is found in copytool output during stage-in. Means the file is absent from the storage element rather than being a local path error. Maps to wrapper exit code 77. |
| 1332 | BLACKHOLE | Black hole detected in file system (consult Pilot log) | This error is assigned if a pilot module goes missing. Typically this would mean that it cannot be imported |
| 1333 | NOREMOTESPACE | No space left on device | Set during stage-out when the remote storage element reports that it has no space remaining |
| 1334 | SETUPFATAL | Setup failed with a fatal exception (consult payload log) | Set in atlas/diagnose.py when the string "AtlasSetup(FATAL): Fatal exception" is found in the tail of payload stdout. Distinct from SETUPFAILURE (1110), which is triggered by a transform stderr pattern; this code indicates an unhandled exception inside AtlasSetup itself. |
| 1335 | MISSINGUSERCODE | User code not available on PanDA server (resubmit task with --useNewCode) | Error occurs when user tarball has been deleted from the server and the pilot tries to download it. User must resubmit task with prun/pathena option --useNewCode |
| 1336 | JOBALREADYRUNNING | Job is already running elsewhere | Error code of the JobAlreadyRunning exception class, defined for the case where a clone of the current job is detected as already executing elsewhere. Not currently raised anywhere in Pilot 3 — reserved for future use. |
| 1337 | BADMEMORYMONITORJSON | Memory monitor produced bad output | Failure to parse JSON file from Memory monitor |
| 1338 | STAGEINAUTHENTICATIONFAILURE | Authentication failure during stage-in | Set in api/data.py when "Cannot authenticate" is found in the caught exception string from a failed stage-in attempt. The same detection path sets STAGEOUTAUTHENTICATIONFAILURE (1383) for stage-out. Typically indicates an expired or invalid token or certificate at transfer time. |
| 1339 | DBRELEASEFAILURE | Local DBRelease handling failed (consult Pilot log) | Set when the pilot fails to set up a local DBRelease tarball that is required by the job (e.g. for conditions data access) |
| 1340 | SINGULARITYNEWUSERNAMESPACE | Singularity/Apptainer: Failed invoking the NEWUSER namespace runtime | Site issue; set if Pilot finds "Failed invoking the NEWUSER namespace runtime" in stderr. Indicates the kernel on the worker node does not support user namespaces |
| 1341 | BADQUEUECONFIGURATION | Bad queue configuration detected | Set when mandatory queue configuration fields (e.g. container_type in CRIC) are missing or inconsistent, causing the pilot to be unable to run the job |
| 1342 | MIDDLEWAREIMPORTFAILURE | Failed to import middleware (consult Pilot log) | Error code of the MiddlewareImportFailure exception class. Set when a required third-party Python module (e.g. psutil, rucio-client) cannot be imported. Maps to wrapper exit code 76. |
| 1343 | NOOUTPUTINJOBREPORT | Found no output in job report | Set when the job report contains an empty output list and the job's allowNoOutput flag is not set. The add_error_code call is currently commented out in atlas/common.py — a warning is logged instead, pending a policy decision. |
| 1344 | RESOURCEUNAVAILABLE | Resource temporarily unavailable (consult Pilot log) | Set when get_current_cpu_consumption_time() fails due to OSError exception raised in subprocess module (failed os.fork()). To be extended in v 2.1.22+ |
| 1345 | SINGULARITYFAILEDUSERNAMESPACE | Singularity/Apptainer: Failed to create user namespace | Site issue; detected in stderr when the transform has a non-zero exit code and "Failed to create user namespace" is present |
| 1346 | TRANSFORMNOTFOUND | Transform not found | Detected in stderr when the transform has a non-zero exit code |
| 1347 | UNSUPPORTEDSL5OS | Unsupported SL5 OS | Detected in stderr when the transform has a non-zero exit code |
| 1348 | SINGULARITYRESOURCEUNAVAILABLE | Singularity/Apptainer: Resource temporarily unavailable | Site issue; detected in stderr when the transform has a non-zero exit code. Distinct from RESOURCEUNAVAILABLE (1344), which is a pilot-level OS fork failure |
| 1349 | UNRECOGNIZEDTRFARGUMENTS | Unrecognized transform arguments | Detected in stderr when the transform has a non-zero exit code |
| 1350 | EMPTYOUTPUTFILE | Empty output file detected | Detected in stderr when the transform has a non-zero exit code |
| 1351 | UNRECOGNIZEDTRFSTDERR | Unrecognized fatal error in transform stderr | Detected in stderr when the transform has a non-zero exit code |
| 1352 | STATFILEPROBLEM | Failed to stat proc file for CPU consumption calculation | The pilot sets this error during the CPU consumption calculation if reading /proc/pid/stat fails with "No such file or directory" |
| 1353 | NOSUCHPROCESS | CPU consumption calculation failed: No such process | The pilot sets this error during the CPU consumption calculation if reading /proc/pid/stat fails with "No such process" |
| 1354 | GENERALCPUCALCPROBLEM | General CPU consumption calculation problem (consult Pilot log) | If there is a problem accessing the /proc/pid/stat file that is not recognised, this error will be set |
| 1355 | COREDUMP | Core dump detected | Set if a core dump is found for a failed job in the payload work dir (during the initial payload error analysis). The core dump is removed. Note: currently the file name must be "core" (i.e. not "core.*") |
| 1356 | PREPROCESSFAILURE | Pre-process command failed | Set when a pre-processing command defined in the job definition (preprocess) exits with a non-zero code |
| 1357 | POSTPROCESSFAILURE | Post-process command failed | Set when a post-processing command defined in the job definition (postprocess) exits with a non-zero code |
| 1358 | MISSINGRELEASEUNPACKED | Missing release setup in unpacked container | Pilot requires that /release_setup.sh is present in unpacked containers. It is not present in older containers |
| 1359 | PANDAQUEUENOTACTIVE | PanDA queue is not active | The error is set as soon as the pilot has downloaded queue data if the queue is not active |
| 1360 | IMAGENOTFOUND | Image not found | The error is set if the pilot cannot find an image whose path is known |
| 1361 | REMOTEFILECOULDNOTBEOPENED | Remote file could not be opened | For direct access jobs, the pilot attempts to open (and close) all input root files to avoid wasting CPU with the payload |
| 1362 | XRDCPERROR | Xrdcp was unable to open file | Detected in copytool/common.py when the string "Run: [ERROR] Server responded with an error" is found in xrdcp output. Distinct from UNREACHABLENETWORK (1327, network-level failure) and REMOTEFILECOULDNOTBEOPENED (1361, pre-flight open check). |
| 1363 | KILLPAYLOAD | Raythena has decided to kill payload | If the pilot monitoring discovers a kill instruction file in the pilot's work directory ($PILOT_HOME), it will terminate the payload and set this error. The kill instruction file name and checking time are defined in the pilot configuration file |
| 1364 | MISSINGCREDENTIALS | Unable to locate credentials for S3 transfer | Error set if "Unable to locate credentials" is found in the S3 transfer command output |
| 1365 | NOCTYPES | Python module ctypes not available on worker node | The ctypes module is used by the pilot for CPU consumption calculations. If it cannot be imported, this error is set |
| 1366 | CHECKSUMCALCFAILURE | Failure during checksum calculation | Set when the adler32 or md5 checksum calculation of a staged-in or staged-out file raises an unexpected exception |
| 1367 | COMMANDTIMEDOUT | Command timed out | Set when a shell command run via execute_command_with_timeout exceeds its timeout. Also maps directly as a transform exit code in resolve_transform_error() |
| 1368 | REMOTEFILEOPENTIMEDOUT | Remote file open timed out | Set when the remote file open script returns exit code 3, indicating the xrdcp/open attempt exceeded the allowed time |
| 1369 | FRONTIER | Frontier error | Set when the payload log contains Frontier-related error strings, indicating a conditions-data retrieval failure |
| 1370 | VOMSPROXYABOUTTOEXPIRE | VOMS proxy is about to expire | Internal use only. Not a job failure; used as a trigger to download a refreshed VOMS proxy before it expires |
| 1371 | BADOUTPUTFILENAME | Output file name does not follow naming convention | Naming convention can be defined in copytool_definitions.py. The _error_messages entry describes this as "contains illegal characters" |
| 1372 | APPTAINERNOTINSTALLED | Apptainer not installed | Identified by the string "Apptainer is not installed" in stderr. Companion to SINGULARITYNOTINSTALLED (1325) |
| 1373 | CERTIFICATEHASEXPIRED | Certificate has expired | Set when the grid or VOMS certificate used for authentication has passed its validity end date |
| 1374 | REMOTEFILEDICTDOESNOTEXIST | Remote file open dictionary does not exist | Set when the pilot cannot locate the state dictionary file written by the remote file open script, which is required to determine whether input files are accessible |
| 1375 | LEASETIME | Lease time is up | For DASK mode (internal use only) |
| 1376 | LOGCREATIONTIMEOUT | Log file creation timed out | Set when the log tarball creation step does not complete within the allowed time (defined in the pilot config file) |
| 1377 | CVMFSISNOTALIVE | CVMFS is not responding | Pilot returns exit code 64 to the wrapper |
| 1378 | LSETUPTIMEDOUT | Lsetup command timed out during remote file open | Set when the remote file open script returns exit code 2, indicating that the lsetup step (ATLAS software setup) did not complete in time |
| 1379 | PREEMPTION | Job was preempted | Replaces SIGTERM (1201) for Kubernetes resources where a SIGTERM is caused by node preemption rather than a deliberate kill. Allows the server to distinguish preempted jobs and reschedule them automatically |
| 1380 | ARCPROXYFAILURE | General arcproxy failure | Set when the arcproxy command fails for a reason other than a shared-library loading error (ARCPROXYLIBFAILURE, 1381) or a non-valid VOMS proxy (NOVOMSPROXY, 1177). |
| 1381 | ARCPROXYLIBFAILURE | Arcproxy failure while loading shared libraries | Set when arcproxy fails with a shared-library loading error, typically indicating a missing or incompatible GSI/Globus installation on the worker node |
| 1382 | PROXYTOOSHORT | Proxy is too short | Verified when the pilot starts. The proxy must be valid for more than 72 h. The pilot returns exit code 80 to the wrapper |
| 1383 | STAGEOUTAUTHENTICATIONFAILURE | Authentication failure during stage-out | Set when authentication fails during stage-out (e.g. invalid token or certificate). Complement to STAGEINAUTHENTICATIONFAILURE (1338). |
| 1384 | QUEUENOTSETUPFORCONTAINERS | Queue is not set up for containers | Set when the pilot determines that a container job must be run but container_type is not defined in CRIC for this queue. The job is failed immediately without attempting execution |
| 1385 | NOJOBSINPANDA | No jobs in PanDA | Internally used code. Set on HPC resources when the maximum number of consecutive getjob failures (≥5) is reached, indicating PanDA has no work to assign. Maps to wrapper exit code 82 |
| 1386 | PANDAQUEUENOTONLINE | PanDA queue is not online | Set when the pilot finds that the target PanDA queue is in an offline state (e.g. deliberately disabled by a site admin). Maps to wrapper exit code 83 ("Site offline") |
| 1387 | ALLOCATIONERROR | Failed to allocate memory for transform execution (cling JIT failure) | Set when the cling JIT compiler inside ROOT reports "Cannot allocate memory". This is caused by the worker node exhausting the kernel's 64k VMA limit and is distinct from a genuine OOM condition: increasing the memory allocation will not help. The recommended retry action is to reduce the number of input files per job (retryModule action 5) |
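
Several of the codes above are assigned by searching the payload stderr for a known string once the transform has returned a non-zero exit code. The sketch below shows that kind of lookup for the Singularity/Apptainer codes. The function name `classify_stderr` and the list `STDERR_PATTERNS` are hypothetical and do not come from the Pilot source; the strings and codes are copied from the table, while the "first match wins" ordering and the `None` result for unmatched output are assumptions made for this example.

```python
from typing import Optional, Tuple

# (substring, error code, acronym) triples copied from the table above.
# The ordering is an assumption; the first matching entry wins.
STDERR_PATTERNS = [
    ("Operation not permitted", 1306, "SINGULARITYGENERALFAILURE"),
    ("No more available loop devices", 1307, "SINGULARITYNOLOOPDEVICES"),
    ("Not mounting requested bind point", 1308, "SINGULARITYBINDPOINTFAILURE"),
    ("Failed to mount image", 1309, "SINGULARITYIMAGEMOUNTFAILURE"),
    ("error: while mounting", 1309, "SINGULARITYIMAGEMOUNTFAILURE"),
    ("Failed invoking the NEWUSER namespace runtime", 1340, "SINGULARITYNEWUSERNAMESPACE"),
    ("Failed to create user namespace", 1345, "SINGULARITYFAILEDUSERNAMESPACE"),
]


def classify_stderr(exit_code: int, stderr: str) -> Optional[Tuple[int, str]]:
    """Return (code, acronym) for the first known string found in stderr.

    Returns None for a zero exit code or when no listed string is present.
    """
    if exit_code == 0:
        return None
    for needle, code, acronym in STDERR_PATTERNS:
        if needle in stderr:
            return code, acronym
    return None


print(classify_stderr(1, "FATAL: No more available loop devices"))
```

A plain substring search keeps the lookup easy to extend with further rows from the table; which fallback code applies when nothing matches is deliberately left to the caller.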