55 changes: 29 additions & 26 deletions .github/workflows/Python-CMD-check.yaml
@@ -2,13 +2,15 @@ on:
push:
paths:
- python-package/**
- .github/workflows/Python-CMD-check.yaml
branches:
- main
- master
- dev
pull_request:
paths:
- python-package/**
- .github/workflows/Python-CMD-check.yaml
branches:
- main
- master
@@ -20,48 +22,49 @@ jobs:
Python-CMD-check:
runs-on: ${{ matrix.os }}

name: ${{ matrix.os }} (${{ matrix.python-version }})
name: ${{ matrix.os }} (${{ matrix.python-version }}${{ matrix.extras != '' && format(', {0}', matrix.extras) || '' }})

strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macOS-latest, windows-latest]
python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
# Python 3.8 support ends in 2024-10
# Python 3.12 support starts in 2023-10
# Check Python maintenance status at: https://www.python.org/downloads/

env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
extras: [""]
include:
# Optional DuckDB extra: ubuntu only to keep CI cost reasonable.
- os: ubuntu-latest
python-version: "3.11"
extras: all
- os: ubuntu-latest
python-version: "3.12"
extras: all

steps:
- name: Check out geobr
uses: actions/checkout@v3
uses: actions/checkout@v4

- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: "pip"

- name: Install uv
run: |
python -m pip install --upgrade pip
curl -LsSf https://astral.sh/uv/install.sh | sh
uses: astral-sh/setup-uv@v5
with:
enable-cache: true

- name: Install dependencies with uv
run: uv sync
- name: Install dependencies
shell: bash
working-directory: python-package

- name: Run tests with uv
run: |
uv run pytest -n auto ./tests
working-directory: python-package
uv sync
if [ -n "${{ matrix.extras }}" ]; then
uv pip install -e ".[${{ matrix.extras }}]"
else
uv pip install -e .
fi

- name: Upload check results
if: always()
uses: actions/upload-artifact@v3
with:
name: test-results
path: python-package/test-results.txt
if-no-files-found: warn
- name: Run tests
shell: bash
working-directory: python-package
run: uv run pytest -n auto ./tests -m "not network"
76 changes: 76 additions & 0 deletions appveyor.yml
@@ -0,0 +1,76 @@
# AppVeyor: Windows R CMD check for r-package only.
# Python CI runs in GitHub Actions (.github/workflows/Python-CMD-check.yaml).
# R also runs on Windows via GitHub Actions (.github/workflows/R-CMD-check.yaml).

only_commits:
  files:
    - r-package/**

skip_commits:
  files:
    - python-package/**
    - .github/**
    - mcp-server/**

environment:
  global:
    R_REMOTES_STANDALONE: true
    PKGDIR: r-package
  matrix:
    - R_VERSION: release
      R_ARCH: x64

init:
  ps: |
    $ErrorActionPreference = "Stop"
    Get-Date

install:
  ps: |
    $ErrorActionPreference = "Stop"
    if (-not (Test-Path r-appveyor-scripts)) {
      New-Item -ItemType Directory -Force -Path r-appveyor-scripts | Out-Null
    }
    if (-not (Test-Path r-appveyor-scripts/appveyor-tool.ps1)) {
      Invoke-WebRequest -UseBasicParsing `
        -Uri "https://raw.githubusercontent.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1" `
        -OutFile "r-appveyor-scripts/appveyor-tool.ps1"
    }
    Import-Module .\r-appveyor-scripts\appveyor-tool.ps1
    Bootstrap

build_script:
  ps: |
    $ErrorActionPreference = "Stop"
    Push-Location $env:PKGDIR
    try {
      travis-tool.sh install_deps
    } finally {
      Pop-Location
    }

test_script:
  ps: |
    $ErrorActionPreference = "Stop"
    Push-Location $env:PKGDIR
    try {
      travis-tool.sh run_tests
    } finally {
      Pop-Location
    }

on_failure:
  - 7z a failure.zip *.Rcheck\*
  - appveyor PushArtifact failure.zip

artifacts:
  - path: r-package\*.Rcheck\**\*.log
    name: Logs
  - path: r-package\*.Rcheck\**\*.out
    name: Logs
  - path: r-package\*.Rcheck\**\*.fail
    name: Logs
  - path: r-package\*.Rcheck\**\*.Rout
    name: Logs
  - path: r-package\*_*.zip
    name: Bits
30 changes: 30 additions & 0 deletions python-package/CHANGELOG.md
@@ -2,6 +2,36 @@

-------------------------------------------------------

# 0.3.0 (unreleased)

## Foundation (Phase 0)
* Core dependencies include `pyarrow` and `rapidfuzz` (Arrow output and fuzzy `lookup_muni`)
* Optional extra: `geobr[duckdb]` (alias `geobr[all]`)
* Parquet v2.0.0 download pipeline (`download_metadata_v2`, `download_parquet`, disk cache)
* Shared helpers: `_filter`, `_output`, `_cache`, `read_geobr_v2`, `read_geobr_hybrid`

### Phase 1 — Agent 1
* `read_capitals`, `read_favela`, `read_polling_places`, `read_quilombola_land`
* `cep_to_state`, `remove_islands`

### Phase 1 — Agent 2
* `code_muni` filtering: `read_schools`, `read_health_facilities`, `read_neighborhood`, `read_disaster_risk_area`, `read_statistical_grid`
* `keep_areas_operacionais` on `read_municipality`

### Phase 1 — Agent 3
* `code_state` filtering: `read_indigenous_land`, `read_metro_area`, `read_pop_arrangements`, `read_urban_concentrations`, `read_conservation_units`
* Default year 2010 for pop arrangements / urban concentrations

### Phase 1 — Agent 4
* `lookup_muni(year=...)`, fuzzy name match via rapidfuzz
* `list_geobr(wide=)` returns DataFrame
* `read_health_region(geometry_level=, code_state=)`

### Phase 1 — Agent 5
* `output="duckdb"` and `output="arrow"` via `convert_output`

-------------------------------------------------------

# 0.1.10
* Enforces correct data types to certain variables (issue #260)
* Changes package manager to poetry
33 changes: 33 additions & 0 deletions python-package/geobr/_cache.py
@@ -0,0 +1,33 @@
"""Disk-backed cache helpers for geobr parquet downloads."""

from __future__ import annotations

import os
from pathlib import Path


def cache_dir() -> Path:
    """Return the geobr cache directory (~/.cache/geobr or temp fallback)."""
    base = os.environ.get("XDG_CACHE_HOME")
    if base:
        path = Path(base) / "geobr"
    else:
        path = Path.home() / ".cache" / "geobr"
    try:
        path.mkdir(parents=True, exist_ok=True)
    except OSError:
        import tempfile

        path = Path(tempfile.gettempdir()) / "geobr"
        path.mkdir(parents=True, exist_ok=True)
    return path


def cached_path(filename: str) -> Path:
    """Full path for a cached parquet file."""
    return cache_dir() / filename


def is_cached(filename: str) -> bool:
    path = cached_path(filename)
    return path.exists() and path.stat().st_size > 0
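
Reading note on `_cache.py`: the lookup order and the "non-empty file" check can be tried out in isolation with the standard library. The sketch below is illustrative only. `resolve_cache_dir` and `is_nonempty_file` are hypothetical names that are not part of geobr; they mirror `cache_dir()` and `is_cached()` under the assumption that the environment is passed in as a plain dict, and the temp-directory fallback is left out.

```python
import tempfile
from pathlib import Path


def resolve_cache_dir(env: dict) -> Path:
    """Hypothetical mirror of cache_dir(): pick the directory, create nothing."""
    base = env.get("XDG_CACHE_HOME")
    if base:  # an unset or empty variable falls through to ~/.cache
        return Path(base) / "geobr"
    return Path.home() / ".cache" / "geobr"


def is_nonempty_file(path: Path) -> bool:
    """Hypothetical mirror of is_cached(): a zero-byte file does not count."""
    return path.exists() and path.stat().st_size > 0


with tempfile.TemporaryDirectory() as tmp:
    cache = resolve_cache_dir({"XDG_CACHE_HOME": tmp})
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / "example.parquet"

    target.touch()                      # simulate an interrupted download
    empty_is_cached = is_nonempty_file(target)

    target.write_bytes(b"PAR1")         # any non-empty content
    filled_is_cached = is_nonempty_file(target)
    cache_parent_matches = cache.parent == Path(tmp)
```

The size check means a zero-byte file left behind by an interrupted download is treated as a cache miss, although a partially written file would still pass it.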
90 changes: 90 additions & 0 deletions python-package/geobr/_duckdb_backend.py
@@ -0,0 +1,90 @@
"""Optional DuckDB backend for lazy parquet reads."""

from __future__ import annotations

from pathlib import Path
from typing import Any, Optional, Union

_CONN: Optional[Any] = None


def _require_duckdb():
    try:
        import duckdb
    except ImportError as e:
        raise ImportError(
            "Optional dependency 'duckdb' is required for output='duckdb'. "
            "Install with: pip install geobr[duckdb]"
        ) from e
    return duckdb


def _setup_connection(conn) -> None:
    for stmt in ("INSTALL spatial", "LOAD spatial", "INSTALL httpfs", "LOAD httpfs"):
        try:
            conn.execute(stmt)
        except Exception:
            pass


def duckdb_connection():
    """Return the shared DuckDB connection."""
    global _CONN
    if _CONN is None:
        duckdb = _require_duckdb()
        _CONN = duckdb.connect()
        _setup_connection(_CONN)
    return _CONN


def register_dataset(
    name: str,
    parquet_path: Union[str, Path],
    *,
    connection: Optional[Any] = None,
) -> Any:
    """Register a parquet file as a DuckDB view."""
    conn = connection or duckdb_connection()
    path_str = str(Path(parquet_path).resolve()).replace("'", "''")
    safe_name = name.replace('"', '""')
    conn.execute(
        f'CREATE OR REPLACE VIEW "{safe_name}" AS '
        f"SELECT * FROM read_parquet('{path_str}')"
    )
    return conn.sql(f'SELECT * FROM "{safe_name}"')


def read_parquet_relation(
    path: Union[str, Path],
    filter_code: Any = "all",
    *,
    connection: Optional[Any] = None,
    view_name: Optional[str] = None,
) -> Any:
    """Return a DuckDB relation over a parquet file."""
    conn = connection or duckdb_connection()
    if view_name:
        register_dataset(view_name, path, connection=conn)
        source = f'"{view_name.replace(chr(34), chr(34) * 2)}"'
    else:
        path_str = str(Path(path).resolve()).replace("'", "''")
        source = f"read_parquet('{path_str}')"

    if filter_code == "all" or filter_code is None:
        return conn.sql(f"SELECT * FROM {source}")

    codes = filter_code if isinstance(filter_code, (list, tuple)) else [filter_code]
    code = codes[0] if len(codes) == 1 else filter_code

    if isinstance(code, str) and len(code) == 2 and code.isalpha():
        return conn.sql(f"SELECT * FROM {source} WHERE abbrev_state = '{code}'")
    if str(code).isdigit() and len(str(code)) == 7:
        return conn.sql(
            f"SELECT * FROM {source} WHERE CAST(code_muni AS BIGINT) = {int(code)}"
        )
    if str(code).isdigit() and len(str(code)) <= 2:
        return conn.sql(
            f"SELECT * FROM {source} WHERE CAST(code_state AS INTEGER) = {int(code)}"
        )

    return conn.sql(f"SELECT * FROM {source}")
Comment on lines +57 to +90
@camilagb (Collaborator), May 21, 2026


Following the suggestion in https://github.com/ipeaGIT/geobr/pull/418/changes#r3283548653, this function could be changed to something like the one below. The filters would be applied in an earlier step, on the Arrow table.

I also included the ST_GeomFromWKB function to correctly convert the geometry column into a DuckDB spatial column (thanks for the heads-up on this, @rafapereirabr!)

Suggested change
def read_parquet_relation(
    path: Union[str, Path],
    filter_code: Any = "all",
    *,
    connection: Optional[Any] = None,
    view_name: Optional[str] = None,
) -> Any:
    """Return a DuckDB relation over a parquet file."""
    conn = connection or duckdb_connection()
    if view_name:
        register_dataset(view_name, path, connection=conn)
        source = f'"{view_name.replace(chr(34), chr(34) * 2)}"'
    else:
        path_str = str(Path(path).resolve()).replace("'", "''")
        source = f"read_parquet('{path_str}')"
    if filter_code == "all" or filter_code is None:
        return conn.sql(f"SELECT * FROM {source}")
    codes = filter_code if isinstance(filter_code, (list, tuple)) else [filter_code]
    code = codes[0] if len(codes) == 1 else filter_code
    if isinstance(code, str) and len(code) == 2 and code.isalpha():
        return conn.sql(f"SELECT * FROM {source} WHERE abbrev_state = '{code}'")
    if str(code).isdigit() and len(str(code)) == 7:
        return conn.sql(
            f"SELECT * FROM {source} WHERE CAST(code_muni AS BIGINT) = {int(code)}"
        )
    if str(code).isdigit() and len(str(code)) <= 2:
        return conn.sql(
            f"SELECT * FROM {source} WHERE CAST(code_state AS INTEGER) = {int(code)}"
        )
    return conn.sql(f"SELECT * FROM {source}")
def read_relation(
    arrow_table: pyarrow.Table,
    connection: Optional[Any] = None
) -> Any:
    """Return a DuckDB relation over an Arrow table."""
    conn = connection or duckdb_connection()
    conn.register("arrow_table", arrow_table)
    return conn.sql("""
        SELECT
            * EXCLUDE (geometry),
            ST_GeomFromWKB(geometry) AS geometry
        FROM arrow_table
    """)

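To make the "filter in an earlier step" idea concrete, here is a minimal standard-library sketch of the dispatch that `read_parquet_relation` currently performs in SQL. It is an illustration, not the proposed implementation: `classify_filter_code` and `filter_rows` are hypothetical names, and a list of dicts stands in for the Arrow table. The branch order follows the diff (two-letter abbreviation, then seven-digit municipality code, then state code of up to two digits). In the diff as written, a list containing more than one code matches none of the branches and returns all rows; the sketch handles single values only.

```python
from typing import Any, Dict, List, Optional, Tuple


def classify_filter_code(code: Any) -> Optional[Tuple[str, Any]]:
    """Hypothetical helper: map one filter value to (column, value).

    Returns None when no filter applies, matching the diff's fallback of
    returning every row for "all", None, or an unrecognized value.
    """
    if code is None or code == "all":
        return None
    text = str(code)
    if isinstance(code, str) and len(code) == 2 and code.isalpha():
        return ("abbrev_state", code)
    if text.isdigit() and len(text) == 7:
        return ("code_muni", int(text))
    if text.isdigit() and len(text) <= 2:
        return ("code_state", int(text))
    return None


def filter_rows(rows: List[Dict[str, Any]], code: Any) -> List[Dict[str, Any]]:
    """Apply the classified filter to plain dict rows (stand-in for a table)."""
    rule = classify_filter_code(code)
    if rule is None:
        return list(rows)
    column, value = rule
    return [row for row in rows if row.get(column) == value]


# Two sample rows with made-up layout but real IBGE-style code shapes.
rows = [
    {"abbrev_state": "SP", "code_state": 35, "code_muni": 3550308},
    {"abbrev_state": "RJ", "code_state": 33, "code_muni": 3304557},
]
by_abbrev = filter_rows(rows, "RJ")
by_state = filter_rows(rows, 35)
by_muni = filter_rows(rows, "3550308")
unfiltered = filter_rows(rows, "all")
```

Keeping the classification separate from the query would let the same rule drive an Arrow-side filter before the table is handed to `read_relation`, so no SQL string has to be built from user input.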