Simple program fails with mixed AMD/Intel machines #74

@ghost

Description

The program works when all nodes are Intel or all nodes are AMD. With a mix of the two, it always fails in the same way: both ranks complete barrier 3 and then get stuck in the MPI_Barrier call for loop 4.

#include <iostream>
#include <mpi.h>

int main()
{
    int argc = 0;
    MPI_Init(&argc, nullptr);

    const int count = 100;
    for (int i = 0; i < count; ++i)
    {
        std::cout << " Attempting Barrier " << i + 1 << std::endl;
        MPI_Barrier(MPI_COMM_WORLD);
        std::cout << " Completed Barrier " << i + 1 << std::endl;
    }

    MPI_Finalize();
}

command line, from intel_machine:
mpiexec -hosts 2 localhost amd_machine -wdir "\network\path" \path-to-exe

output:

[0] Attempting Barrier 1
[1] Attempting Barrier 1
[0] Completed Barrier 1
[0] Attempting Barrier 2
[1] Completed Barrier 1
[0] Completed Barrier 2
[1] Attempting Barrier 2
[0] Attempting Barrier 3
[0] Completed Barrier 3
[0] Attempting Barrier 4
[1] Completed Barrier 2
[1] Attempting Barrier 3
[1] Completed Barrier 3
[1] Attempting Barrier 4

job aborted:
[ranks] message

[0] terminated

[1] fatal error
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(MPI_COMM_WORLD) failed
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)
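
Because the two ranks print interleaved, the barrier lines in the output above are easy to misread. As a minimal sketch, the helper below scans lines of the form `[rank] Completed Barrier N` and records the highest barrier each rank reported finishing. The name `lastCompletedBarrier` is hypothetical and is not part of MPI or of the original report, and the line format is assumed to match the output shown in this issue.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper, not part of any MPI library.
// Assumes each progress line looks like "[<rank>] Completed Barrier <N>",
// as in the output pasted in this issue. All other lines are ignored.
// Returns a map from rank to the highest barrier number that rank completed.
std::map<int, int> lastCompletedBarrier(const std::string& log)
{
    // regex_match requires the whole line to match; the trailing \s* tolerates
    // a stray '\r' from Windows line endings.
    static const std::regex pattern(R"(\[(\d+)\]\s+Completed Barrier\s+(\d+)\s*)");

    std::map<int, int> last;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line))
    {
        std::smatch m;
        if (!std::regex_match(line, m, pattern))
            continue;

        const int rank = std::stoi(m[1].str());
        const int barrier = std::stoi(m[2].str());

        auto it = last.find(rank);
        if (it == last.end() || it->second < barrier)
            last[rank] = barrier;
    }
    return last;
}
```

Applied to the output in this report, the map would hold barrier 3 for rank 0 and barrier 3 for rank 1, which matches the description: neither rank gets past the fourth MPI_Barrier call.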
