pairwise() with direction = "lt" filters pairs by alphabetical variable name, not by declaration order #5

@AntoineSoetewey

Description

pairs_generator() in R/generate-pairs.R works by expanding a grid of all
name combinations and then keeping only those where the inequality holds between the
two name strings. Because R's < operator on character vectors compares strings
lexicographically (using the collating sequence of the current locale),
direction = "lt" currently means "keep pairs where name_a sorts before name_b
alphabetically", not "keep pairs where name_a was listed before name_b in the
call to pairwise()".

A concrete example:

pairwise(z_score, age, bmi)
# Declaration order: z_score (1st), age (2nd), bmi (3rd)

# Current behaviour with direction = "lt":
# Keeps pairs whose left-hand name sorts alphabetically before the right-hand name:
# → (age, bmi), (age, z_score), (bmi, z_score)

# What a user might reasonably expect ("lt" = earlier in my list):
# → (z_score, age), (z_score, bmi), (age, bmi)
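
The alphabetical filtering can be reproduced outside the package with a few lines of base R. This is only a sketch of the mechanism described above, not the actual body of pairs_generator(); the names vars, grid and by_name are illustrative, and expand.grid() stands in for whatever grid expansion the package uses.

```r
# Hypothetical stand-in for name-based filtering; not the package's code.
vars <- c("z_score", "age", "bmi")  # declaration order

# All ordered combinations of the names, kept as plain character columns
grid <- expand.grid(a = vars, b = vars, stringsAsFactors = FALSE)

# `<` on character vectors compares the strings themselves,
# so declaration order plays no role in which rows survive
by_name <- grid[grid$a < grid$b, ]
by_name <- by_name[order(by_name$a, by_name$b), ]
rownames(by_name) <- NULL

print(by_name)
# pairs kept: (age, bmi), (age, z_score), (bmi, z_score)
```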

Two things are affected:

  1. The direction within each pair. If downstream code computes a directed
    difference (e.g. mean(group_a) - mean(group_b)), the sign of the result
    depends on which name ends up on the left. The current code always puts the
    alphabetically earlier name on the left, regardless of what the user wrote.
  2. Predictability. Renaming a column can silently change which name appears
    first in each pair, which could affect printed output or downstream
    comparisons. For example, renaming bmi to BMI moves it ahead of age in the
    C locale, where uppercase letters sort before lowercase ones.

For a pure lower-triangle use case (where the set of pairs is all that matters and
direction within each pair is irrelevant), this is harmless. But it is worth
clarifying the intended semantics before the pairwise path is used more broadly.

Proposed solution / discussion point

One option is to work with integer positions rather than name strings inside
pairs_generator():

pairs_generator = function(x, direction = "lteq", simplify = TRUE) {
    idx = seq_along(x)
    pairs = tidyr::expand_grid(i = idx, j = idx) |>
        dplyr::filter(inequality(.data$i, .data$j, direction = direction))
    # then map back: x[pairs$i], x[pairs$j]
    ...
}

This would make "lt" mean "declared before in the list", which is probably what
most users expect.
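
A self-contained base R sketch of the position-based idea follows. pairs_by_position() and its strict argument are hypothetical names introduced here for illustration; the sketch does not use the package's inequality() helper, tidyr or dplyr, and only covers the "lt" and "lteq" cases.

```r
# Hypothetical helper, not the package's pairs_generator()
pairs_by_position <- function(x, strict = TRUE) {
  idx <- seq_along(x)
  # expand.grid() varies its first argument fastest, so listing j first
  # yields rows ordered by i, then j
  grid <- expand.grid(j = idx, i = idx)
  keep <- if (strict) grid$i < grid$j else grid$i <= grid$j
  grid <- grid[keep, ]
  # Map positions back to names
  data.frame(a = x[grid$i], b = x[grid$j], stringsAsFactors = FALSE)
}

res <- pairs_by_position(c("z_score", "age", "bmi"))
print(res)
# pairs kept: (z_score, age), (z_score, bmi), (age, bmi)
```

Because the comparison is on integer positions, the result depends neither on the spelling of the names nor on the locale.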

The alternative is to keep the current alphabetical behaviour but document it
explicitly, since it is at least deterministic for a given locale. The key
question is what direction should express: position in the declaration list, or
alphabetical order of names?
