Merged
141 changes: 141 additions & 0 deletions api/PAGINATION.md
@@ -0,0 +1,141 @@
# Pagination

Every response from the `/timeseries/...` endpoints is a JSON envelope:

```json
{
"docs": [...],
"next_url": "/timeseries/bsose?...&tile_index=N",
"message": "page N"
}
```

- `docs` — array of result documents (or stubs / metadata documents,
depending on the mode flags). May be empty.
- `next_url` — relative path + query for the next page. `null` when this
is the last page. Clients resolve it against the original request's
origin and follow it until they see `null`.
- `message` — human-readable status, currently the served tile index.

There is no separate "no results" status code: an empty response is
`200 + {docs: [], next_url: null, ...}`, never `404`.
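The client-side loop implied by the envelope can be sketched as follows. This is a minimal illustration, not the project's client: `Page`, `collect_all_docs`, and the `fetch` callback are hypothetical names, documents are modelled as opaque strings, and the HTTP transport plus JSON decoding are assumed to live inside `fetch`.

```rust
/// One decoded response envelope. Field names mirror the JSON keys; the
/// document type is left as an opaque string for this sketch.
#[derive(Debug, Clone)]
pub struct Page {
    pub docs: Vec<String>,
    pub next_url: Option<String>,
    pub message: String,
}

/// Follows `next_url` from `first_url` until the server returns `null`
/// (modelled here as `None`), collecting every document along the way.
///
/// `fetch` is a hypothetical transport callback: it receives a relative
/// path + query and returns the decoded envelope or an error string.
pub fn collect_all_docs<F>(first_url: &str, mut fetch: F) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<Page, String>,
{
    let mut all_docs = Vec::new();
    let mut url = Some(first_url.to_string());
    while let Some(current) = url {
        let page = fetch(&current)?;
        // An empty `docs` array is a normal 200 response, not an error.
        all_docs.extend(page.docs);
        url = page.next_url;
    }
    Ok(all_docs)
}
```

The loop terminates only on `next_url == null`, never on an empty `docs` array, which matches the "empty is still 200" rule above.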

## How pagination walks

Server-side, each request's spatial parameters define a sequence of
**tiles**. A tile is one spatial sub-region paired with one discrete
depth level. Tiles are ordered *spatial outer, level inner*: all levels
for one (lon, lat) cell come out before moving to the next cell.
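To make the "spatial outer, level inner" ordering concrete, here is a small sketch of how a flat `tile_index` could map to a (cell, level) pair. The function name and the idea that the server uses plain division and remainder are assumptions for illustration; only the ordering itself comes from the text above.

```rust
/// Splits a flat tile index into (spatial cell index, level index) under
/// "spatial outer, level inner" ordering. Returns `None` when there are no
/// levels or the index is past the end of the sequence.
pub fn decompose_tile_index(
    tile_index: usize,
    n_cells: usize,
    n_levels: usize,
) -> Option<(usize, usize)> {
    let n_tiles = n_cells.checked_mul(n_levels)?;
    if n_levels == 0 || tile_index >= n_tiles {
        return None;
    }
    // All levels of one cell are consecutive, so the level is the remainder.
    Some((tile_index / n_levels, tile_index % n_levels))
}
```

With 52 levels, indices 0 through 51 all belong to the first cell and index 52 is the shallowest level of the second cell.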

Tile size and extent are per-dataset:

- Spatial tile edge length is `DatasetConfig::tile_degrees` (5° for BSOSE).
- Depth pages are the dataset's discrete `levels` (52 brackets for BSOSE).
- The tile sequence is clipped to `DatasetConfig::coverage_bbox`, an
optional rectangle that tells the generator where the dataset has
data. For BSOSE that's `[-180,-90]→[180,-30]` (south of 30°S); for
datasets without an a-priori coverage bound, it can be `None` and
the generator walks the whole globe.
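The effect of coverage clipping on the spatial part of the sequence can be sketched like this. It is an illustration under assumptions: the `BoundingBox` field types, the lon-outer iteration order, and the choice to keep cells that merely touch the coverage edge are all guesses, and `spatial_cells` is a hypothetical name rather than the real tile generator.

```rust
/// Assumed shape of the bounding box: `[lon, lat]` corners.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BoundingBox {
    pub sw: [f64; 2],
    pub ne: [f64; 2],
}

/// Returns the SW corners of all `tile_degrees`-sized cells on the globe
/// that overlap `coverage` (or every cell when `coverage` is `None`).
pub fn spatial_cells(tile_degrees: f64, coverage: Option<BoundingBox>) -> Vec<[f64; 2]> {
    let n_lon = (360.0 / tile_degrees).ceil() as usize;
    let n_lat = (180.0 / tile_degrees).ceil() as usize;
    let mut cells = Vec::new();
    for i in 0..n_lon {
        for j in 0..n_lat {
            let sw = [
                -180.0 + i as f64 * tile_degrees,
                -90.0 + j as f64 * tile_degrees,
            ];
            // Ragged last row/column is clamped to the globe's edge.
            let ne = [
                (sw[0] + tile_degrees).min(180.0),
                (sw[1] + tile_degrees).min(90.0),
            ];
            let keep = match coverage {
                None => true,
                // Inclusive overlap: a cell touching the coverage edge is kept,
                // so a doc exactly on the boundary stays reachable.
                Some(c) => {
                    sw[0] <= c.ne[0] && ne[0] >= c.sw[0] && sw[1] <= c.ne[1] && ne[1] >= c.sw[1]
                }
            };
            if keep {
                cells.push(sw);
            }
        }
    }
    cells
}
```

Under these assumptions, BSOSE's 5° tiles and `[-180,-90]→[180,-30]` coverage reduce the spatial walk from 72 × 36 = 2592 cells to 72 × 13 = 936 (the thirteenth row is the one whose south edge sits on 30°S).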

Each HTTP request serves at most **one** non-empty tile. The server
**probes forward** from the requested `tile_index`, opening a small
cursor per candidate tile and advancing past empties until it finds one
that yields output (or runs out of tiles). `next_url` carries
`tile_index = served_idx + 1`, so the next request resumes one tile past
the one we just emitted. When the server runs out of tiles, `next_url`
is `null`.
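Probe-forward can be sketched as a linear scan over candidate tiles. The names `Served` and `probe_forward` are hypothetical, the `has_output` callback stands in for opening a cursor on a tile, and returning no next index when the served tile is the last candidate is an assumption (the server could equally hand out `served_idx + 1` and answer the follow-up with an empty page).

```rust
/// Outcome of one request that found a non-empty tile.
#[derive(Debug, PartialEq)]
pub struct Served {
    pub served_idx: usize,
    /// `tile_index` to put in `next_url`, or `None` for `next_url: null`.
    pub next_tile_index: Option<usize>,
}

/// Scans tiles `start..n_tiles` and stops at the first one for which
/// `has_output` returns true. Returns `None` when every remaining tile is
/// empty, which corresponds to an empty page with `next_url: null`.
pub fn probe_forward<F>(start: usize, n_tiles: usize, mut has_output: F) -> Option<Served>
where
    F: FnMut(usize) -> bool,
{
    (start..n_tiles)
        .find(|&idx| has_output(idx))
        .map(|served_idx| Served {
            served_idx,
            // Resume one tile past the one just emitted, if any remain.
            next_tile_index: if served_idx + 1 < n_tiles {
                Some(served_idx + 1)
            } else {
                None
            },
        })
}
```

Because the scan starts at the requested index and skips empties, the `tile_index` values a client sees in successive `next_url`s need not be consecutive.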

The coverage bbox is the cheap way to keep probe-forward sane: tiles
that fall entirely outside the coverage are never probed at all, so
e.g. a BSOSE whole-globe walk doesn't have to confirm that the entire
Northern Hemisphere is empty before terminating. Probe-forward is still
linear in the number of *candidate* tiles after coverage filtering, so
sparse datasets within their coverage area can still incur empty
probes — a denser secondary mask (e.g. land/ocean per cell) would help
here but isn't implemented.

## Tile membership

Tiles are **half-open**, `[sw, ne)` on both the lon and lat axes, so a
grid point on a shared edge or corner is owned by exactly one tile (the
one whose south edge, west edge, or SW corner it sits on). Without this,
a doc at the corner where four tiles meet would be
emitted four times. The half-open behaviour is implemented by shrinking
each tile's NE corner inward by a sub-cm epsilon, *except* at the global
east meridian (`lon=180`) and the north pole (`lat=90`), where there's
no neighbouring tile to overlap with — those edges remain inclusive so
antimeridian / north-pole docs aren't lost.
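The ownership rule can be expressed as a membership predicate. Note that this sketch uses strict comparisons instead of the epsilon-shrunk NE corner described above; it is meant to show the intended semantics, not the actual implementation, and `tile_contains` is a hypothetical name.

```rust
/// Half-open membership test for a tile with corners `sw` and `ne`
/// (`[lon, lat]` each). The east and north edges are exclusive, except at
/// lon = 180 and lat = 90, where no neighbouring tile exists.
pub fn tile_contains(sw: [f64; 2], ne: [f64; 2], point: [f64; 2]) -> bool {
    let [lon, lat] = point;
    let east_ok = if ne[0] >= 180.0 { lon <= ne[0] } else { lon < ne[0] };
    let north_ok = if ne[1] >= 90.0 { lat <= ne[1] } else { lat < ne[1] };
    lon >= sw[0] && lat >= sw[1] && east_ok && north_ok
}
```

A point at the meeting corner of four tiles passes this test only for the tile that has it as its SW corner.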

## Special spatial modes

| Mode | Tile sequence |
|------|---------------|
| `id` | A single passthrough tile — no spatial or level constraint is added. |
| `center + radius` | No spatial tiling; pagination is level-only. Radius must satisfy `radius ≤ max_radius_meters` (100 km for BSOSE today). |
| `polygon` | Tile the polygon's bounding box. Mongo `$geoWithin` does the actual polygon intersection per tile. |
| `box` | Tile the box. A dateline-crossing box (`sw_lon > ne_lon`) is split into east and west sub-boxes; tile generation runs on each. |
| no spatial param | Tile the whole globe. |
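The dateline split for `box` can be illustrated with a short sketch. `split_at_dateline` is a hypothetical helper; the real code may represent boxes differently, and this version only covers the single-crossing case described in the table.

```rust
/// Splits a box given as `[lon, lat]` corners into one or two boxes that do
/// not cross the antimeridian. A box with `sw_lon > ne_lon` is treated as
/// wrapping the dateline.
pub fn split_at_dateline(sw: [f64; 2], ne: [f64; 2]) -> Vec<([f64; 2], [f64; 2])> {
    if sw[0] > ne[0] {
        vec![
            // Part in the eastern hemisphere, up to lon = 180.
            (sw, [180.0, ne[1]]),
            // Part in the western hemisphere, starting at lon = -180.
            ([-180.0, sw[1]], ne),
        ]
    } else {
        vec![(sw, ne)]
    }
}
```

Tile generation would then run independently on each returned box.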

## Mode flags

- `compression=minimal` — each doc is serialized as a compact 5-element
array `[_id, lon, lat, level, metadata]` rather than the full
measurement document.
- `batchmeta` — instead of measurement docs, return the *metadata*
documents referenced by the matching docs (looked up in
`timeseriesMeta`). Metadata is aggregated per page, so clients must union the results across pages.
Takes precedence over `compression=minimal` if both are set.
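The precedence between the two flags can be sketched as a small decision function. `OutputMode` and `output_mode` are hypothetical names, and treating any `compression` value other than `minimal` as "full documents" is an assumption, since the text does not say how other values are handled.

```rust
/// How documents on a page are rendered.
#[derive(Debug, PartialEq)]
pub enum OutputMode {
    Full,
    Minimal,
    BatchMeta,
}

/// Resolves the output mode from the raw query parameters.
/// `batchmeta` counts as set when present with any value.
pub fn output_mode(compression: Option<&str>, batchmeta: Option<&str>) -> OutputMode {
    if batchmeta.is_some() {
        // batchmeta wins even when compression=minimal is also set.
        OutputMode::BatchMeta
    } else if compression == Some("minimal") {
        OutputMode::Minimal
    } else {
        OutputMode::Full
    }
}
```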

## Query parameters

| Param | Type | Notes |
|-------|------|-------|
| `id` | string | Exact match on `_id`. |
| `box` | JSON `[[sw_lon, sw_lat], [ne_lon, ne_lat]]` | Bounding box. Wraps the dateline if `sw_lon > ne_lon`. |
| `polygon` | JSON `[[lon, lat], ...]` | Closed ring of vertices (first = last), ≥ 4 points. |
| `center` + `radius` | JSON `[lon, lat]` + meters | Disk query. Radius capped at the dataset's `max_radius_meters`. |
| `verticalRange` | JSON `[lo, hi]` | Half-open depth range applied on top of tile-level pagination. |
| `startDate` / `endDate` | RFC-3339 string | Slices each doc's timeseries to this window. |
| `data` | comma-separated | Variables to include. `all` keeps everything. `except_data_values` keeps the schema but clears values. |
| `compression` | `minimal` | See mode flags. |
| `batchmeta` | any | See mode flags. |
| `tile_index` | non-negative integer | Pagination cursor. Default `0`. Almost always supplied by the previous response's `next_url`. |

## Validation errors (HTTP 400)

- More than one of `polygon` / `box` / `center` set.
- `center` set without `radius`, or vice versa.
- `radius` non-numeric, negative, non-finite, or above the dataset's cap.
- `polygon` malformed: fewer than 4 points, not closed, or any vertex
that isn't a 2-element pair.
- `startDate` or `endDate` not RFC-3339.
- `tile_index` present but not a non-negative integer.
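A few of the checks above can be sketched as standalone validators. These are illustrative only: the function names, error strings, and the decision to accept a radius of exactly zero are assumptions, and the real handlers may parse and report errors differently.

```rust
/// Rejects requests that set more than one of polygon / box / center.
pub fn check_single_spatial_mode(polygon: bool, has_box: bool, center: bool) -> Result<(), String> {
    let set = [polygon, has_box, center].iter().filter(|&&b| b).count();
    if set > 1 {
        return Err("only one of polygon, box, center may be set".to_string());
    }
    Ok(())
}

/// Parses `radius` and enforces: numeric, finite, non-negative, under the cap.
pub fn validate_radius(raw: &str, max_radius_meters: f64) -> Result<f64, String> {
    let radius: f64 = raw
        .trim()
        .parse()
        .map_err(|_| format!("radius is not numeric: {}", raw))?;
    if !radius.is_finite() || radius < 0.0 {
        return Err("radius must be finite and non-negative".to_string());
    }
    if radius > max_radius_meters {
        return Err(format!("radius exceeds cap of {} m", max_radius_meters));
    }
    Ok(radius)
}

/// Checks that a polygon ring has 2-element vertices, at least 4 points,
/// and is closed (first vertex equals last).
pub fn validate_polygon(ring: &[Vec<f64>]) -> Result<(), String> {
    if ring.iter().any(|v| v.len() != 2) {
        return Err("every vertex must be a [lon, lat] pair".to_string());
    }
    if ring.len() < 4 {
        return Err("polygon needs at least 4 points".to_string());
    }
    if ring.first() != ring.last() {
        return Err("polygon ring must be closed (first = last)".to_string());
    }
    Ok(())
}

/// Parses `tile_index` as a non-negative integer.
pub fn validate_tile_index(raw: &str) -> Result<u64, String> {
    raw.parse::<u64>()
        .map_err(|_| format!("tile_index is not a non-negative integer: {}", raw))
}
```

Each `Err` here would correspond to an HTTP 400 in the API.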

## Known limitations

- **Polygons spanning more than a hemisphere or with multiple
antimeridian crossings.** Single-crossing antimeridian polygons are
detected and split into east + west sub-bboxes (no globe-spanning
over-tile). Polygons with two or more antimeridian crossings, or
polygons covering more than half the sphere, may over-tile — the
result is still correct (Mongo's `$geoWithin` does the actual
polygon intersection), just slower than ideal.
- **Grid-aligned user box NE corner.** A user-supplied box whose NE
corner sits exactly on a tile grid line (e.g.
`box=[[20,10],[40,30]]`) will lose docs at that exact NE corner,
because the rightmost/topmost tile's NE is shrunk by the half-open
mechanism. Workaround on the client: pad the NE by a tiny amount.
- **Antimeridian docs at `lon=±180` stored as distinct values.** Docs at
`lon=+180` land in the easternmost tile, docs at `lon=-180` in the
westernmost — same physical meridian, two different pages. Data
providers should normalise to one convention on insertion.
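The client-side workaround for the grid-aligned NE corner can be sketched as below. `pad_ne` is a hypothetical helper, and the pad size is a guess: it needs to exceed the server's sub-cm epsilon, and 1e-6° (roughly 0.1 m of latitude) is used here only as a plausible example.

```rust
/// Pads a box's NE corner outward by `pad_degrees`, clamped to the globe,
/// so docs sitting exactly on a grid-aligned NE edge are not dropped.
pub fn pad_ne(sw: [f64; 2], ne: [f64; 2], pad_degrees: f64) -> ([f64; 2], [f64; 2]) {
    let padded = [
        (ne[0] + pad_degrees).min(180.0),
        (ne[1] + pad_degrees).min(90.0),
    ];
    (sw, padded)
}
```

For example, `pad_ne([20.0, -50.0], [40.0, -40.0], 1e-6)` leaves the SW corner alone and nudges the NE corner just past the 40° and -40° grid lines.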

## Per-dataset configuration

`api/src/helpers/dataset_config.rs` defines a `DatasetConfig` struct
with the dataset's `tile_degrees`, `max_radius_meters`, the discrete
`levels` array, and an optional `coverage_bbox`. The BSOSE handler
binds `BSOSE_CONFIG` directly; adding a new dataset means defining its
config there and wiring its handler through the same `tile_generator` /
`filter_composer` machinery. `coverage_bbox: None` for a new dataset
gives global-walk semantics; setting it to a bounding rectangle tells
the tile generator to skip everything outside the rectangle.
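As an illustration of what a second dataset's config might look like, here is a sketch with the struct shapes copied from this PR so the block stands alone. `EXAMPLE_SURFACE_CONFIG`, its values, and the `[f64; 2]` corner type are invented for the example and do not describe any real dataset.

```rust
/// Assumed shape of `geometry::BoundingBox`: `[lon, lat]` corners.
pub struct BoundingBox {
    pub sw: [f64; 2],
    pub ne: [f64; 2],
}

/// Mirrors the `DatasetConfig` struct defined in `dataset_config.rs`.
pub struct DatasetConfig {
    pub tile_degrees: f64,
    pub max_radius_meters: f64,
    pub levels: &'static [f64],
    pub coverage_bbox: Option<BoundingBox>,
}

/// A dataset without a vertical dimension passes a single-element slice.
pub const EXAMPLE_SURFACE_LEVELS: &[f64] = &[0.0];

/// Hypothetical global surface dataset: coarser tiles, no coverage bound.
pub const EXAMPLE_SURFACE_CONFIG: DatasetConfig = DatasetConfig {
    tile_degrees: 10.0,
    max_radius_meters: 50_000.0,
    levels: EXAMPLE_SURFACE_LEVELS,
    // `None` gives global-walk semantics: every tile is a probe candidate.
    coverage_bbox: None,
};
```

A config like this would satisfy the same invariants the BSOSE tests check: a positive tile size that divides 180° and 360°, a sub-hemispheric radius cap, and a non-empty, increasing level list.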
8 changes: 4 additions & 4 deletions api/fixtures/bsose.json
@@ -3,7 +3,7 @@
"_id": "bsose_doc_001",
"metadata": ["bsose-profile-meta-2020"],
"basin": 1.0,
- "geolocation": { "type": "Point", "coordinates": [20.0, 10.0] },
+ "geolocation": { "type": "Point", "coordinates": [20.0, -50.0] },
"level": 10.0,
"cell_vertical_fraction": 1.0,
"sea_binary_mask_at_t_locaiton": true,
@@ -23,7 +23,7 @@
"_id": "bsose_doc_002",
"metadata": ["bsose-profile-meta-2020"],
"basin": 1.0,
- "geolocation": { "type": "Point", "coordinates": [40.0, 30.0] },
+ "geolocation": { "type": "Point", "coordinates": [40.0, -40.0] },
"level": 10.0,
"cell_vertical_fraction": 1.0,
"sea_binary_mask_at_t_locaiton": true,
@@ -43,7 +43,7 @@
"_id": "bsose_doc_003",
"metadata": ["bsose-profile-meta-2020"],
"basin": 2.0,
- "geolocation": { "type": "Point", "coordinates": [-170.0, 50.0] },
+ "geolocation": { "type": "Point", "coordinates": [-170.0, -55.0] },
"level": 20.0,
"cell_vertical_fraction": 1.0,
"sea_binary_mask_at_t_locaiton": true,
@@ -63,7 +63,7 @@
"_id": "bsose_doc_004",
"metadata": ["bsose-profile-meta-2020"],
"basin": 1.0,
- "geolocation": { "type": "Point", "coordinates": [20.0, 10.0] },
+ "geolocation": { "type": "Point", "coordinates": [20.0, -50.0] },
"level": 50.0,
"cell_vertical_fraction": 1.0,
"sea_binary_mask_at_t_locaiton": true,
129 changes: 129 additions & 0 deletions api/src/helpers/dataset_config.rs
@@ -0,0 +1,129 @@
//! Per-dataset configuration governing request-size limits.
//!
//! This is the seam where pagination decisions hang off the dataset
//! identity. `tile_degrees` drives spatial tile generation; `levels`
//! defines the discrete depth pages within each spatial tile;
//! `max_radius_meters` caps `center + radius` queries (which go through
//! MongoDB `$near` and aren't spatially tiled, so the cap is the only thing
//! preventing a runaway disk-of-most-of-the-globe); `coverage_bbox`
//! tells the tile generator the lat/lon rectangle the dataset's data
//! actually lives inside, so we skip probing tiles outside it.

use super::geometry::BoundingBox;

/// Per-dataset request-size policy.
///
/// `tile_degrees`: edge length (degrees of longitude and latitude) of one
/// spatial pagination tile.
///
/// `max_radius_meters`: hard upper bound on the `radius` query parameter
/// for `center + radius` requests. These bypass tile pagination because
/// Mongo's `$near` enforces its own bound; we cap the bound so a
/// malicious or naive caller can't ask for a half-globe disk.
///
/// `levels`: the discrete vertical levels the dataset is sampled at, in
/// strictly increasing order (shallowest first). Pagination treats each
/// level as a separate page within a spatial tile. Datasets without a
/// vertical dimension can pass a single-element slice (effectively a
/// single "level" per tile).
///
/// `coverage_bbox`: optional rectangle the dataset's data is known to
/// live inside. The tile generator drops any spatial tile that doesn't
/// overlap this rectangle, so probe-forward never has to walk through
/// regions that *can't* contain data. `None` means "no a-priori bound"
/// — tile generation falls back to walking the whole globe. The
/// rectangle is treated as inclusive on its edges; a doc lying exactly
/// on the coverage boundary is preserved.
pub struct DatasetConfig {
    pub tile_degrees: f64,
    pub max_radius_meters: f64,
    pub levels: &'static [f64],
    pub coverage_bbox: Option<BoundingBox>,
}

/// BSOSE's 52 vertical levels, in metres (positive-downward), shallowest
/// first. From the dataset's published grid; should be updated if BSOSE
/// re-releases with a different vertical discretisation.
pub const BSOSE_LEVELS: &[f64] = &[
    2.1, 6.7, 12.15, 18.55, 26.25, 35.25, 45.0, 55.0, 65.0, 75.0,
    85.0, 95.0, 105.0, 115.0, 125.0, 135.0, 146.5, 161.5, 180.0, 200.0,
    220.0, 240.0, 260.0, 280.0, 301.0, 327.0, 361.0, 402.5, 450.0, 500.0,
    551.5, 614.0, 700.0, 800.0, 900.0, 1000.0, 1100.0, 1225.0, 1400.0, 1600.0,
    1800.0, 2010.0, 2270.0, 2610.0, 3000.0, 3400.0, 3800.0, 4200.0, 4600.0, 5000.0,
    5400.0, 5800.0,
];

/// Configuration for the BSOSE timeseries dataset.
///
/// 5° tiles × 12 grid cells/degree = 60 × 60 = 3600 cells per (tile,
/// level) at most, usually fewer because of land/coastlines.
/// `max_radius_meters` is
/// intentionally tight: BSOSE produces many docs even in a small disk
/// since `$near` isn't spatially tiled. `coverage_bbox` reflects that
/// BSOSE only has data south of 30°S — no point in probing northern
/// tiles that will never contain anything.
pub const BSOSE_CONFIG: DatasetConfig = DatasetConfig {
    tile_degrees: 5.0,
    max_radius_meters: 100_000.0, // 100 km — bump if users complain
    levels: BSOSE_LEVELS,
    coverage_bbox: Some(BoundingBox {
        sw: [-180.0, -90.0],
        ne: [180.0, -30.0],
    }),
};

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn bsose_tile_degrees_is_positive_and_divides_a_hemisphere() {
        assert!(BSOSE_CONFIG.tile_degrees > 0.0);
        // We don't strictly require integer-divisibility of 180/360 by
        // tile_degrees (the tile generator will handle ragged remainders),
        // but a divisor is a useful invariant to flag if someone bumps the
        // value to something exotic like 7.0.
        assert!(
            (180.0_f64 % BSOSE_CONFIG.tile_degrees).abs() < 1e-9,
            "tile_degrees should evenly divide 180° for clean global coverage"
        );
        assert!(
            (360.0_f64 % BSOSE_CONFIG.tile_degrees).abs() < 1e-9,
            "tile_degrees should evenly divide 360° for clean global coverage"
        );
    }

    #[test]
    fn bsose_max_radius_is_positive_and_subhemispheric() {
        assert!(BSOSE_CONFIG.max_radius_meters > 0.0);
        // Earth's mean radius is ~6.371e6 m; a half-circumference is ~2.0e7 m.
        // We want our cap well under that so we never approach the
        // antipode-degenerate case that breaks Mongo geo queries.
        assert!(BSOSE_CONFIG.max_radius_meters < 1.0e7);
    }

    #[test]
    fn bsose_levels_is_non_empty() {
        // Pagination treats each level as a page; an empty level list would
        // produce a dataset with zero pages, which is almost certainly a
        // misconfiguration rather than an intentional state.
        assert!(!BSOSE_CONFIG.levels.is_empty());
    }

    #[test]
    fn bsose_levels_is_strictly_increasing() {
        // The tile generator will rely on level order to map a level index
        // to a (lower, upper) depth bracket. If two levels collide or the
        // sequence reverses, that mapping is ambiguous.
        for w in BSOSE_CONFIG.levels.windows(2) {
            assert!(
                w[0] < w[1],
                "levels must be strictly increasing; found {} not < {}",
                w[0],
                w[1]
            );
        }
    }

    #[test]
    fn bsose_levels_are_all_non_negative() {
        // Depth is conventionally positive-downward in oceanographic data;
        // a negative value would indicate a sign-convention bug we'd want
        // to catch early.
        for &d in BSOSE_CONFIG.levels {
            assert!(d >= 0.0, "level depths should be non-negative; got {}", d);
        }
    }
}