Add v2 parquet pipeline foundation for Python geobr #418
Introduce cached parquet downloads, filtering, multi-format output (sf/arrow/duckdb relation), and shared read_geobr_v2/hybrid helpers to align Python with the R v2.0.0 data path. Co-authored-by: Cursor <cursoragent@cursor.com>
Upgrade deprecated GitHub Actions, use astral-sh/setup-uv cross-platform, and skip network-dependent list_geobr test while testing filters via read_geobr_v2. Co-authored-by: Cursor <cursoragent@cursor.com>
AppVeyor is not required for Python (GitHub Actions Python-CMD-check covers all platforms). Path filters skip builds when only python-package or .github change. Co-authored-by: Cursor <cursoragent@cursor.com>
camilagb left a comment
hey @JoaoCarabetta! I'm Camila, a new team member here at IPEA. I started translating the new features for the Python package, but since you've already opened the PRs, I'll just leave some suggestions :)
| import geopandas as gpd | ||
| import pyarrow.parquet as pq | ||
|
|
||
| OutputType = Literal["sf", "duckdb", "arrow"] |
Since we are dealing with output formats, what do you think about using gpd instead of sf?
| OutputType = Literal["sf", "duckdb", "arrow"] | |
| OutputType = Literal["gpd", "duckdb", "arrow"] |
| from geobr._cache import cached_path, is_cached | ||
|
|
||
| MIRRORS = ["https://github.com/ipeaGIT/geobr/releases/download/v1.7.0/"] | ||
| GEOBR_DATA_RELEASE = "v2.0.0" |
I was talking to @rafapereirabr about avoiding a hardcoded release tag. If anything changes in geobr_prep_data and a new tag is released, we'd have to release a new geobr version as well. Since both libraries are managed internally, I suggest pointing it to latest.
| api_url = ( | ||
| "https://api.github.com/repos/ipea/geobr_prep_data/releases/tags/" | ||
| f"{GEOBR_DATA_RELEASE}" | ||
| ) |
as mentioned in https://github.com/ipeaGIT/geobr/pull/418/changes#r3283451283
| api_url = ( | |
| "https://api.github.com/repos/ipea/geobr_prep_data/releases/tags/" | |
| f"{GEOBR_DATA_RELEASE}" | |
| ) | |
| api_url = ( | |
| "https://api.github.com/repos/ipea/geobr_prep_data/releases/latest" | |
| ) |
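A minimal sketch of how both options could coexist, assuming an optional tag override is acceptable: with no tag the URL points at the latest release, and a pinned tag stays available for reproducibility. The helper names `release_api_url` and `asset_urls` and the example payload are hypothetical; only the repository path comes from the diff above, and the `assets` / `name` / `browser_download_url` fields are the ones the GitHub releases API returns.

```python
from typing import Optional

# Repository path copied from the diff above.
RELEASES_API = "https://api.github.com/repos/ipea/geobr_prep_data/releases"


def release_api_url(tag: Optional[str] = None) -> str:
    """Return the API URL for a pinned tag, or for the latest release."""
    if tag is None:
        return f"{RELEASES_API}/latest"
    return f"{RELEASES_API}/tags/{tag}"


def asset_urls(release: dict) -> dict:
    """Map asset file names to download URLs from a release payload."""
    return {
        asset["name"]: asset["browser_download_url"]
        for asset in release.get("assets", [])
    }


# Hypothetical, trimmed-down payload with only the fields used above.
example_release = {
    "tag_name": "v2.0.0",
    "assets": [
        {
            "name": "states_2020.parquet",
            "browser_download_url": "https://example.org/states_2020.parquet",
        }
    ],
}

print(release_api_url())
print(release_api_url("v2.0.0"))
print(asset_urls(example_release))
```

One trade-off worth weighing: files already in the local cache were downloaded from whatever release was "latest" at the time, so the cache key may need to include the resolved tag.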
| if year is None: | ||
| year = int(max(years_available)) |
In the new R package, one of the breaking changes concerns the year parameter, which is now mandatory:
The year and date arguments can no longer be NULL; they must be explicitly
specified. This change is intentional and is meant to encourage users to be more
mindful of historical changes in the data.
| if year is None: | |
| year = int(max(years_available)) |
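If the default is dropped, the function still needs to tell the user what to pass. A small sketch of such a check, assuming `years_available` is an iterable of years as in the hunk above; the helper name `validate_year` and the error wording are hypothetical:

```python
from typing import Iterable, Optional


def validate_year(year: Optional[int], years_available: Iterable[int]) -> int:
    """Require an explicit year and check it against the available ones."""
    years = sorted(int(y) for y in years_available)
    if year is None:
        # No silent fallback to the most recent year.
        raise ValueError(
            f"`year` must be specified explicitly. Available years: {years}"
        )
    if int(year) not in years:
        raise ValueError(f"Invalid `year`: {year}. Available years: {years}")
    return int(year)


print(validate_year(2010, [2000, 2010, 2020]))
```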
| return enforce_types(gdf) | ||
|
|
||
| if output == "arrow": | ||
| return pq.read_table(path) |
arrow is missing the filters!
On that note, instead of reading the file in each format and then filtering it, I suggest always reading it as an arrow table, filtering it, and only then converting the result to the requested output. What do you think?
The download_parquet function in utils can return an arrow table instead of a path, and the filter_by_code function in _filter can apply the filters lazily on the arrow table before the type conversions. That way we only convert what is necessary, avoid rewriting the same filter code for each format, and this function is left with only the output conversion.
There was a problem hiding this comment.
I agree the filters performed internally in the function should be done out-of-memory before passing the output to the user. This is the behavior in R
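The ordering proposed here can be sketched without any of the real backends. In the snippet below a list of dicts stands in for the arrow table and the converters are placeholders, so every name is hypothetical; the point is only that filtering happens once, before the format-specific step, which makes a "missing filter" in one output format impossible by construction.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]  # stand-in for pyarrow.Table in this sketch


def filter_rows(table: Table, column: str, codes: List[Any]) -> Table:
    """Single filtering step shared by every output format."""
    wanted = set(codes)
    return [row for row in table if row[column] in wanted]


# Hypothetical converters: each only changes the container, never filters.
CONVERTERS: Dict[str, Callable[[Table], Any]] = {
    "arrow": lambda table: table,
    "gpd": lambda table: {"kind": "GeoDataFrame", "rows": table},
    "duckdb": lambda table: {"kind": "relation", "rows": table},
}


def read_filtered(table: Table, column: str, codes: List[Any], output: str) -> Any:
    """Filter once, then convert only the rows that are left."""
    filtered = filter_rows(table, column, codes)
    return CONVERTERS[output](filtered)


states = [
    {"abbrev_state": "SP", "code_state": 35},
    {"abbrev_state": "RJ", "code_state": 33},
]
print(read_filtered(states, "abbrev_state", ["SP"], "arrow"))
```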
| def read_parquet_relation( | ||
| path: Union[str, Path], | ||
| filter_code: Any = "all", | ||
| *, | ||
| connection: Optional[Any] = None, | ||
| view_name: Optional[str] = None, | ||
| ) -> Any: | ||
| """Return a DuckDB relation over a parquet file.""" | ||
| conn = connection or duckdb_connection() | ||
| if view_name: | ||
| register_dataset(view_name, path, connection=conn) | ||
| source = f'"{view_name.replace(chr(34), chr(34) * 2)}"' | ||
| else: | ||
| path_str = str(Path(path).resolve()).replace("'", "''") | ||
| source = f"read_parquet('{path_str}')" | ||
|
|
||
| if filter_code == "all" or filter_code is None: | ||
| return conn.sql(f"SELECT * FROM {source}") | ||
|
|
||
| codes = filter_code if isinstance(filter_code, (list, tuple)) else [filter_code] | ||
| code = codes[0] if len(codes) == 1 else filter_code | ||
|
|
||
| if isinstance(code, str) and len(code) == 2 and code.isalpha(): | ||
| return conn.sql(f"SELECT * FROM {source} WHERE abbrev_state = '{code}'") | ||
| if str(code).isdigit() and len(str(code)) == 7: | ||
| return conn.sql( | ||
| f"SELECT * FROM {source} WHERE CAST(code_muni AS BIGINT) = {int(code)}" | ||
| ) | ||
| if str(code).isdigit() and len(str(code)) <= 2: | ||
| return conn.sql( | ||
| f"SELECT * FROM {source} WHERE CAST(code_state AS INTEGER) = {int(code)}" | ||
| ) | ||
|
|
||
| return conn.sql(f"SELECT * FROM {source}") |
Following the suggestion in https://github.com/ipeaGIT/geobr/pull/418/changes#r3283548653, this function could change to something like the version below. The filters would be applied in a previous step, on the arrow table.
I also included the ST_GeomFromWKB function to correctly convert the geometry column into a DuckDB spatial column (thx for the heads up regarding this @rafapereirabr!)
| def read_parquet_relation( | |
| path: Union[str, Path], | |
| filter_code: Any = "all", | |
| *, | |
| connection: Optional[Any] = None, | |
| view_name: Optional[str] = None, | |
| ) -> Any: | |
| """Return a DuckDB relation over a parquet file.""" | |
| conn = connection or duckdb_connection() | |
| if view_name: | |
| register_dataset(view_name, path, connection=conn) | |
| source = f'"{view_name.replace(chr(34), chr(34) * 2)}"' | |
| else: | |
| path_str = str(Path(path).resolve()).replace("'", "''") | |
| source = f"read_parquet('{path_str}')" | |
| if filter_code == "all" or filter_code is None: | |
| return conn.sql(f"SELECT * FROM {source}") | |
| codes = filter_code if isinstance(filter_code, (list, tuple)) else [filter_code] | |
| code = codes[0] if len(codes) == 1 else filter_code | |
| if isinstance(code, str) and len(code) == 2 and code.isalpha(): | |
| return conn.sql(f"SELECT * FROM {source} WHERE abbrev_state = '{code}'") | |
| if str(code).isdigit() and len(str(code)) == 7: | |
| return conn.sql( | |
| f"SELECT * FROM {source} WHERE CAST(code_muni AS BIGINT) = {int(code)}" | |
| ) | |
| if str(code).isdigit() and len(str(code)) <= 2: | |
| return conn.sql( | |
| f"SELECT * FROM {source} WHERE CAST(code_state AS INTEGER) = {int(code)}" | |
| ) | |
| return conn.sql(f"SELECT * FROM {source}") | |
| def read_relation( | |
|     arrow_table: pyarrow.Table, | |
|     connection: Optional[Any] = None, | |
| ) -> Any: | |
|     """Return a DuckDB relation over an Arrow table.""" | |
|     conn = connection or duckdb_connection() | |
|     conn.register("arrow_table", arrow_table) | |
|     return conn.sql(""" | |
|         SELECT | |
|             * EXCLUDE (geometry), | |
|             ST_GeomFromWKB(geometry) AS geometry | |
|         FROM arrow_table | |
|     """) |
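If read_relation should also honour a caller-supplied view_name, as read_parquet_relation does, the name has to be quoted before it is spliced into the SQL text. A minimal sketch of that string-building step only; `quote_identifier` and `geometry_query` are hypothetical helpers, and the query assumes the DuckDB spatial extension is already loaded on the connection, since ST_GeomFromWKB comes from it.

```python
def quote_identifier(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def geometry_query(view_name: str) -> str:
    """Build the SELECT that decodes the WKB geometry column of a registered view."""
    return (
        "SELECT * EXCLUDE (geometry), "
        "ST_GeomFromWKB(geometry) AS geometry "
        f"FROM {quote_identifier(view_name)}"
    )


print(geometry_query("states_2020"))
```

The table would then be registered under `view_name` and the string passed to `conn.sql(...)`, in place of the fixed `arrow_table` name.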
| def filter_by_code( | ||
| gdf: gpd.GeoDataFrame, | ||
| code: Any = "all", | ||
| ) -> gpd.GeoDataFrame: | ||
| """Filter a GeoDataFrame by state abbrev, state code, municipality code, or other code_* column. | ||
|
|
||
| Mirrors R ``filter_arrw()`` behavior for in-memory GeoDataFrames. | ||
| """ | ||
| if gdf is None or len(gdf) == 0: | ||
| return gdf | ||
|
|
||
| if code == "all" or code is None: | ||
| return gdf | ||
|
|
||
| codes = _normalize_code(code) | ||
| if not isinstance(codes, list): | ||
| codes = [codes] | ||
|
|
||
| filter_col = None | ||
|
|
||
| if all(c in ALL_ABBREV_STATE for c in codes): | ||
| if "abbrev_state" in gdf.columns: | ||
| filter_col = "abbrev_state" | ||
| elif all( | ||
| _numbers_only(str(c)) and len(str(c)) <= 2 | ||
| and (str(c).zfill(2) in ALL_CODE_STATE or str(c) in ALL_CODE_STATE) | ||
| for c in codes | ||
| ): | ||
| if "code_state" in gdf.columns: | ||
| filter_col = "code_state" | ||
| codes = [int(c) if str(c).isdigit() else c for c in codes] | ||
| elif all(_numbers_only(str(c)) and len(str(c)) == 7 for c in codes): | ||
| if "code_muni" in gdf.columns: | ||
| filter_col = "code_muni" | ||
| codes = [int(c) for c in codes] | ||
| elif all(_numbers_only(c) and len(str(c)) > 3 for c in codes): | ||
| code_cols = [c for c in gdf.columns if c.startswith("code_")] | ||
| if code_cols: | ||
| filter_col = code_cols[0] | ||
|
|
||
| if filter_col is None: | ||
| raise ValueError("Invalid value to argument `code_` / `code_muni` / `code_state`.") | ||
|
|
||
| if filter_col == "code_state": | ||
| gdf = gdf.copy() | ||
| gdf[filter_col] = pd.to_numeric(gdf[filter_col], errors="coerce") | ||
| codes_num = [int(c) for c in codes] | ||
| result = gdf[gdf[filter_col].isin(codes_num)] | ||
| elif filter_col == "code_muni": | ||
| gdf = gdf.copy() | ||
| gdf[filter_col] = pd.to_numeric(gdf[filter_col], errors="coerce").astype("Int64") | ||
| codes_num = [int(c) for c in codes] | ||
| result = gdf[gdf[filter_col].isin(codes_num)] | ||
| if len(result) == 0: | ||
| result = gdf[gdf[filter_col].astype(str).isin([str(c) for c in codes_num])] | ||
| else: | ||
| result = gdf[gdf[filter_col].isin(codes)] | ||
|
|
||
| if len(result) == 0: | ||
| raise ValueError("Invalid value to argument `code_` / `code_muni` / `code_state`.") | ||
|
|
||
| return result |
Following the suggestion in https://github.com/ipeaGIT/geobr/pull/418/changes#r3283548653, we could translate this function to filter an arrow table.
p.s.: the function was translated using AI, so a deeper review is needed
| def filter_by_code( | |
| gdf: gpd.GeoDataFrame, | |
| code: Any = "all", | |
| ) -> gpd.GeoDataFrame: | |
| """Filter a GeoDataFrame by state abbrev, state code, municipality code, or other code_* column. | |
| Mirrors R ``filter_arrw()`` behavior for in-memory GeoDataFrames. | |
| """ | |
| if gdf is None or len(gdf) == 0: | |
| return gdf | |
| if code == "all" or code is None: | |
| return gdf | |
| codes = _normalize_code(code) | |
| if not isinstance(codes, list): | |
| codes = [codes] | |
| filter_col = None | |
| if all(c in ALL_ABBREV_STATE for c in codes): | |
| if "abbrev_state" in gdf.columns: | |
| filter_col = "abbrev_state" | |
| elif all( | |
| _numbers_only(str(c)) and len(str(c)) <= 2 | |
| and (str(c).zfill(2) in ALL_CODE_STATE or str(c) in ALL_CODE_STATE) | |
| for c in codes | |
| ): | |
| if "code_state" in gdf.columns: | |
| filter_col = "code_state" | |
| codes = [int(c) if str(c).isdigit() else c for c in codes] | |
| elif all(_numbers_only(str(c)) and len(str(c)) == 7 for c in codes): | |
| if "code_muni" in gdf.columns: | |
| filter_col = "code_muni" | |
| codes = [int(c) for c in codes] | |
| elif all(_numbers_only(c) and len(str(c)) > 3 for c in codes): | |
| code_cols = [c for c in gdf.columns if c.startswith("code_")] | |
| if code_cols: | |
| filter_col = code_cols[0] | |
| if filter_col is None: | |
| raise ValueError("Invalid value to argument `code_` / `code_muni` / `code_state`.") | |
| if filter_col == "code_state": | |
| gdf = gdf.copy() | |
| gdf[filter_col] = pd.to_numeric(gdf[filter_col], errors="coerce") | |
| codes_num = [int(c) for c in codes] | |
| result = gdf[gdf[filter_col].isin(codes_num)] | |
| elif filter_col == "code_muni": | |
| gdf = gdf.copy() | |
| gdf[filter_col] = pd.to_numeric(gdf[filter_col], errors="coerce").astype("Int64") | |
| codes_num = [int(c) for c in codes] | |
| result = gdf[gdf[filter_col].isin(codes_num)] | |
| if len(result) == 0: | |
| result = gdf[gdf[filter_col].astype(str).isin([str(c) for c in codes_num])] | |
| else: | |
| result = gdf[gdf[filter_col].isin(codes)] | |
| if len(result) == 0: | |
| raise ValueError("Invalid value to argument `code_` / `code_muni` / `code_state`.") | |
| return result | |
| def filter_by_code( | |
|     table: pyarrow.Table, | |
|     code: Any = "all", | |
| ) -> pyarrow.Table: | |
|     """Filter an arrow table by state abbrev, state code, municipality code, or other code_* column. | |
|     Mirrors R ``filter_arrw()`` behavior for in-memory arrow table. | |
|     """ | |
|     # 1. Early exits for empty tables or 'all' filter | |
|     if table is None or table.num_rows == 0: | |
|         return table | |
|     if code == "all" or code is None: | |
|         return table | |
|     # 2. Normalize codes input | |
|     codes = _normalize_code(code) | |
|     if not isinstance(codes, list): | |
|         codes = [codes] | |
|     filter_col = None | |
|     # 3. Identify the correct filtering column using the schema | |
|     if all(c in ALL_ABBREV_STATE for c in codes): | |
|         if "abbrev_state" in table.schema.names: | |
|             filter_col = "abbrev_state" | |
|     elif all( | |
|         _numbers_only(str(c)) and len(str(c)) <= 2 | |
|         and (str(c).zfill(2) in ALL_CODE_STATE or str(c) in ALL_CODE_STATE) | |
|         for c in codes | |
|     ): | |
|         if "code_state" in table.schema.names: | |
|             filter_col = "code_state" | |
|             codes = [int(c) if str(c).isdigit() else c for c in codes] | |
|     elif all(_numbers_only(str(c)) and len(str(c)) == 7 for c in codes): | |
|         if "code_muni" in table.schema.names: | |
|             filter_col = "code_muni" | |
|             codes = [int(c) for c in codes] | |
|     elif all(_numbers_only(c) and len(str(c)) > 3 for c in codes): | |
|         code_cols = [c for c in table.schema.names if c.startswith("code_")] | |
|         if code_cols: | |
|             filter_col = code_cols[0] | |
|     if filter_col is None: | |
|         raise ValueError("Invalid value to argument `code_` / `code_muni` / `code_state`.") | |
|     # 4. Handle Column Casting and Filtering | |
|     if filter_col == "code_state": | |
|         # Cast column to int64 (safe=False skips overflow/truncation checks) and match against integer array | |
|         casted_col = pc.cast(table[filter_col], pa.int64(), safe=False) | |
|         codes_arr = pa.array([int(c) for c in codes], type=pa.int64()) | |
|         expr = pc.field(filter_col).isin(codes_arr) | |
|         # We apply the filter using an updated table containing the casted type | |
|         table_to_filter = table.set_column(table.schema.get_field_index(filter_col), filter_col, casted_col) | |
|         result = table_to_filter.filter(expr) | |
|     elif filter_col == "code_muni": | |
|         # Try numeric matching first | |
|         casted_col = pc.cast(table[filter_col], pa.int64(), safe=False) | |
|         codes_arr = pa.array([int(c) for c in codes], type=pa.int64()) | |
|         expr = pc.field(filter_col).isin(codes_arr) | |
|         table_to_filter = table.set_column(table.schema.get_field_index(filter_col), filter_col, casted_col) | |
|         result = table_to_filter.filter(expr) | |
|         # Fallback to string matching if no rows were matched numerically | |
|         if result.num_rows == 0: | |
|             str_col = pc.cast(table[filter_col], pa.string()) | |
|             codes_str = pa.array([str(int(c)) for c in codes], type=pa.string()) | |
|             expr_str = pc.field(filter_col).isin(codes_str) | |
|             table_to_filter_str = table.set_column(table.schema.get_field_index(filter_col), filter_col, str_col) | |
|             result = table_to_filter_str.filter(expr_str) | |
|     else: | |
|         # Default string/exact match filtering | |
|         codes_arr = pa.array(codes) | |
|         expr = pc.field(filter_col).isin(codes_arr) | |
|         result = table.filter(expr) | |
|     # 5. Validate output row count | |
|     if result.num_rows == 0: | |
|         raise ValueError("Invalid value to argument `code_` / `code_muni` / `code_state`.") | |
|     return result |
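Since the function above needs a deeper review, one option is to pull the column-selection rules (step 3) into a pure function that can be unit-tested without arrow. The sketch below is a simplified restatement of those rules, not the author's code: `pick_filter_column` is a hypothetical name, the two sets are small stand-ins for ALL_ABBREV_STATE and ALL_CODE_STATE, and `str.isdigit` stands in for `_numbers_only`.

```python
from typing import Any, Optional, Sequence

# Hypothetical subsets; the real ALL_ABBREV_STATE / ALL_CODE_STATE live in geobr.
ABBREV_STATE = {"AC", "RJ", "SP"}
CODE_STATE = {"12", "33", "35"}


def pick_filter_column(codes: Sequence[Any], columns: Sequence[str]) -> Optional[str]:
    """Return the column a list of codes should be matched against, or None."""
    as_text = [str(c) for c in codes]
    if all(c in ABBREV_STATE for c in as_text):
        candidate = "abbrev_state"
    elif all(c.isdigit() and len(c) <= 2 and c.zfill(2) in CODE_STATE for c in as_text):
        candidate = "code_state"
    elif all(c.isdigit() and len(c) == 7 for c in as_text):
        candidate = "code_muni"
    elif all(c.isdigit() and len(c) > 3 for c in as_text):
        # Fall back to the first generic code_* column, if any.
        return next((col for col in columns if col.startswith("code_")), None)
    else:
        return None
    # The chosen column must actually exist in the table schema.
    return candidate if candidate in columns else None


print(pick_filter_column(["SP", "RJ"], ["abbrev_state", "code_state"]))
print(pick_filter_column([3304557], ["code_muni", "name_muni"]))
```

A `None` result corresponds to the `ValueError` branch in the function above.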
| def download_parquet( | ||
| filename_to_download: str, | ||
| show_progress: bool = True, | ||
| cache: bool = True, | ||
| ) -> Path: | ||
| """Download a parquet file from geobr_prep_data v2.0.0. Returns local path.""" | ||
| dest = cached_path(filename_to_download) | ||
| if cache and is_cached(filename_to_download): | ||
| return dest | ||
| urls = [ | ||
| f"{GEOBR_PREP_DATA_BASE}/{filename_to_download}", | ||
| f"{IPEA_FALLBACK_BASE}/{filename_to_download}", | ||
| ] | ||
| if not _download_file(urls, dest, show_progress=show_progress): | ||
| raise ConnectionError( | ||
| "A file may have been corrupted during download. " | ||
| "Please try again or report at https://github.com/ipeaGIT/geobr/issues" | ||
| ) | ||
| return dest |
Following the suggestion in https://github.com/ipeaGIT/geobr/pull/418/changes#r3283548653, this function would return an arrow table
| def download_parquet( | |
| filename_to_download: str, | |
| show_progress: bool = True, | |
| cache: bool = True, | |
| ) -> Path: | |
| """Download a parquet file from geobr_prep_data v2.0.0. Returns local path.""" | |
| dest = cached_path(filename_to_download) | |
| if cache and is_cached(filename_to_download): | |
| return dest | |
| urls = [ | |
| f"{GEOBR_PREP_DATA_BASE}/{filename_to_download}", | |
| f"{IPEA_FALLBACK_BASE}/{filename_to_download}", | |
| ] | |
| if not _download_file(urls, dest, show_progress=show_progress): | |
| raise ConnectionError( | |
| "A file may have been corrupted during download. " | |
| "Please try again or report at https://github.com/ipeaGIT/geobr/issues" | |
| ) | |
| return dest | |
| def download_parquet( | |
|     filename_to_download: str, | |
|     show_progress: bool = True, | |
|     cache: bool = True, | |
| ) -> pyarrow.Table: | |
|     """Download a parquet file from latest geobr_prep_data. Returns an arrow table.""" | |
|     dest = cached_path(filename_to_download) | |
|     if cache and is_cached(filename_to_download): | |
|         arrow_table = pq.read_table(dest) | |
|         return arrow_table | |
|     urls = [ | |
|         f"{GEOBR_PREP_DATA_BASE}/{filename_to_download}", | |
|         f"{IPEA_FALLBACK_BASE}/{filename_to_download}", | |
|     ] | |
|     if not _download_file(urls, dest, show_progress=show_progress): | |
|         raise ConnectionError( | |
|             "A file may have been corrupted during download. " | |
|             "Please try again or report at https://github.com/ipeaGIT/geobr/issues" | |
|         ) | |
|     arrow_table = pq.read_table(dest) | |
|     return arrow_table |
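The body of `_download_file` is not shown in this hunk, so the following is only a sketch of how a mirror fallback like the `urls` list above could behave, not a description of the existing helper. `download_with_fallback` and the injected `fetch` callable are hypothetical; injecting the fetcher keeps the logic testable offline, which fits the `not network` test marker used in this PR.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def download_with_fallback(
    urls: Iterable[str],
    dest: Path,
    fetch: Callable[[str], bytes],
) -> str:
    """Try each URL in order, write the first payload to dest, return the URL used."""
    errors = []
    for url in urls:
        try:
            payload = fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            errors.append(f"{url}: {exc}")
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name first so a partial file never looks cached.
        tmp = dest.with_name(dest.name + ".part")
        tmp.write_bytes(payload)
        tmp.replace(dest)
        return url
    raise ConnectionError("All mirrors failed: " + "; ".join(errors))


def fake_fetch(url: str) -> bytes:
    """Offline stand-in for a real HTTP fetch: the primary mirror is down."""
    if "primary" in url:
        raise OSError("unreachable")
    return b"PAR1"


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "states.parquet"
    used = download_with_fallback(
        ["https://primary.invalid/states.parquet", "https://fallback.invalid/states.parquet"],
        target,
        fake_fetch,
    )
    print(used, target.read_bytes())
```

Collecting the per-mirror errors also allows a more specific message than "a file may have been corrupted" when every mirror is unreachable.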
| path = download_parquet( | ||
| row["file_name"], | ||
| show_progress=show_progress, | ||
| cache=cache, | ||
| ) | ||
| if output == "duckdb" and view_name is None: | ||
| view_name = f"{geography}_{year}" | ||
| return convert_output( | ||
| path, | ||
| output=output, | ||
| filter_code=code, | ||
| connection=connection, | ||
| view_name=view_name, | ||
| ) |
Following the suggestion in https://github.com/ipeaGIT/geobr/pull/418/changes#r3283548653, we would add the filter_by_code step
| path = download_parquet( | |
| row["file_name"], | |
| show_progress=show_progress, | |
| cache=cache, | |
| ) | |
| if output == "duckdb" and view_name is None: | |
| view_name = f"{geography}_{year}" | |
| return convert_output( | |
| path, | |
| output=output, | |
| filter_code=code, | |
| connection=connection, | |
| view_name=view_name, | |
| ) | |
|     table = download_parquet( | |
|         row["file_name"], | |
|         show_progress=show_progress, | |
|         cache=cache, | |
|     ) | |
|     table = filter_by_code(table, code) | |
|     if output == "duckdb" and view_name is None: | |
|         view_name = f"{geography}_{year}" | |
|     return convert_output( | |
|         table, | |
|         output=output, | |
|         connection=connection, | |
|         view_name=view_name, | |
|     ) |
| gdf.to_parquet(path) | ||
| rel = convert_output(path, output="duckdb") | ||
| df = rel.df() | ||
| assert len(df) == 1 |
There was a problem hiding this comment.
| assert len(df) == 1 | |
| column_types = dict(zip(rel.columns, rel.types)) | |
| assert len(df) == 1 | |
| assert isinstance(rel, duckdb.DuckDBPyRelation) | |
| assert "geometry" in column_types | |
| assert str(column_types["geometry"]) == "GEOMETRY" |
|
@JoaoCarabetta, are you OK with @camilagb's suggestions above? Once you give us the green light, @camilagb will accept this PR and proceed to review the next one.
|
|
||
|
|
||
| def convert_output( | ||
| parquet_path: Union[str, Path], |
I added the possibility of receiving a GeoDataFrame here so that the read_health_region function in #419 can work properly
| parquet_path: Union[str, Path], | |
| table: pyarrow.Table | gpd.GeoDataFrame, |
| output: OutputType = "sf", | ||
| filter_code: str = "all", |
There was a problem hiding this comment.
| output: OutputType = "sf", | |
| filter_code: str = "all", | |
| output: OutputType = "gpd", |
| """Load parquet and return in the requested format. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| parquet_path : path to local parquet file | ||
| output : ``"sf"`` (default), ``"duckdb"``, or ``"arrow"`` | ||
| filter_code : passed to ``filter_by_code`` when output is ``"sf"`` |
There was a problem hiding this comment.
| """Load parquet and return in the requested format. | |
| Parameters | |
| ---------- | |
| parquet_path : path to local parquet file | |
| output : ``"sf"`` (default), ``"duckdb"``, or ``"arrow"`` | |
| filter_code : passed to ``filter_by_code`` when output is ``"sf"`` | |
|     """Receive an Arrow table or GeoDataFrame and return it in the requested format. | |
|     Parameters | |
|     ---------- | |
|     table : pyarrow.Table or geopandas.GeoDataFrame | |
|     output : ``"gpd"`` (default), ``"duckdb"``, or ``"arrow"`` |
| if output == "sf": | ||
| gdf = gpd.read_parquet(path) | ||
| if filter_code != "all": | ||
| from geobr._filter import filter_by_code | ||
|
|
||
| gdf = filter_by_code(gdf, filter_code) | ||
| from geobr.utils import enforce_types | ||
|
|
||
| return enforce_types(gdf) | ||
|
|
||
| if output == "arrow": | ||
| return pq.read_table(path) | ||
|
|
||
| if output == "duckdb": | ||
| from geobr._duckdb_backend import read_parquet_relation | ||
|
|
||
| return read_parquet_relation( | ||
| path, | ||
| filter_code=filter_code, | ||
| connection=connection, | ||
| view_name=view_name, | ||
| ) |
There was a problem hiding this comment.
| if output == "sf": | |
| gdf = gpd.read_parquet(path) | |
| if filter_code != "all": | |
| from geobr._filter import filter_by_code | |
| gdf = filter_by_code(gdf, filter_code) | |
| from geobr.utils import enforce_types | |
| return enforce_types(gdf) | |
| if output == "arrow": | |
| return pq.read_table(path) | |
| if output == "duckdb": | |
| from geobr._duckdb_backend import read_parquet_relation | |
| return read_parquet_relation( | |
| path, | |
| filter_code=filter_code, | |
| connection=connection, | |
| view_name=view_name, | |
| ) | |
|     if output == "arrow": | |
|         if isinstance(table, pa.Table): | |
|             return table | |
|         # geopandas >= 1.0: to_arrow() returns an Arrow-compatible wrapper, so build a pyarrow.Table from it | |
|         return pa.table(table.to_arrow()) | |
|     if output == "gpd": | |
|         if isinstance(table, gpd.GeoDataFrame): | |
|             return table | |
|         df = table.to_pandas() | |
|         df["geometry"] = gpd.GeoSeries.from_wkb(df["geometry"]) | |
|         gdf = gpd.GeoDataFrame(df, geometry="geometry") | |
|         from geobr.utils import enforce_types | |
|         return enforce_types(gdf) | |
|     if output == "duckdb": | |
|         if isinstance(table, gpd.GeoDataFrame): | |
|             table = pa.table(table.to_arrow()) | |
|         from geobr._duckdb_backend import read_relation | |
|         # read_relation as suggested above takes no view_name argument | |
|         return read_relation( | |
|             table, | |
|             connection=connection, | |
|         ) |
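As far as this hunk shows, a chain of `if output == ...` checks falls through and returns None when the value matches none of the branches, for example a misspelled "gdp". A small guard at the top of the function would surface that immediately. The sketch below assumes the `OutputType` literal proposed earlier in this review; `validate_output` is a hypothetical helper name.

```python
from typing import Literal, get_args

OutputType = Literal["gpd", "duckdb", "arrow"]


def validate_output(output: str) -> str:
    """Raise early on an unknown output name instead of silently returning None."""
    allowed = get_args(OutputType)  # the values listed in the Literal
    if output not in allowed:
        raise ValueError(f"Invalid `output`: {output!r}. Expected one of {allowed}.")
    return output


print(validate_output("gpd"))
```

Deriving the allowed values from the `Literal` keeps the type hint and the runtime check from drifting apart if a format is renamed.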
Summary
- read_geobr_v2 / read_geobr_hybrid pipeline
- _filter, _output (sf/arrow/duckdb relation), and minimal DuckDB parquet backend
- pyarrow and rapidfuzz to core deps; add CI matrix for optional DuckDB extra

Test plan
- pytest -m "not network" passes for foundation tests (test_filter*, test_output, test_utils_v2, test_arrow_output, test_duckdb_output)

Made with Cursor