
Improve list_geobr catalog and lookup_muni fuzzy matching#421

Open
JoaoCarabetta wants to merge 4 commits into ipeaGIT:master from JoaoCarabetta:python-catalog-lookup

Conversation

@JoaoCarabetta
Collaborator

Summary

  • list_geobr() returns a DataFrame joined with live v2 metadata years
  • lookup_muni() adds year parameter and rapidfuzz-based fuzzy name matching
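
The fuzzy name matching described above can be illustrated with a minimal sketch. It is not the code in this PR: it uses `difflib` from the standard library as a stand-in for rapidfuzz, and `normalize`, `fuzzy_lookup`, the `cutoff` value, and the sample names are all hypothetical.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lowercase and strip accents so 'São Paulo' and 'sao paulo' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.lower().strip()


def fuzzy_lookup(query, names, cutoff=0.8):
    """Return the closest official name, or None if nothing is similar enough.

    Hypothetical helper: difflib stands in for rapidfuzz here.
    """
    # Map each normalized name back to its official spelling.
    index = {normalize(n): n for n in names}
    matches = difflib.get_close_matches(
        normalize(query), list(index), n=1, cutoff=cutoff
    )
    return index[matches[0]] if matches else None


# Illustrative sample only, not the geobr lookup table.
MUNICIPALITIES = ["São Paulo", "Rio de Janeiro", "Belo Horizonte", "Florianópolis"]

print(fuzzy_lookup("sao paolo", MUNICIPALITIES))
print(fuzzy_lookup("xyz", MUNICIPALITIES))
```

Normalizing accents and case before scoring keeps the similarity threshold focused on real typos rather than on encoding differences.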

Test plan

  • test_list_geobr, test_lookup_muni, test_lookup_muni_v2

Depends on #418

Made with Cursor

JoaoCarabetta and others added 3 commits May 21, 2026 12:55
Introduce cached parquet downloads, filtering, multi-format output (sf/arrow/duckdb relation), and shared read_geobr_v2/hybrid helpers to align Python with the R v2.0.0 data path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Join live v2 metadata in list_geobr and add year-aware fuzzy municipality lookup using rapidfuzz.

Co-authored-by: Cursor <cursoragent@cursor.com>
Cherry-pick CI workflow upgrade from python-v2-pipeline; keep PR4 list_geobr tests unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rafapereirabr rafapereirabr requested a review from camilagb May 21, 2026 17:19
AppVeyor is not required for Python (GitHub Actions Python-CMD-check covers all platforms). Path filters skip builds when only python-package or .github change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +42 to +51
"alias": [
    "country", "regions", "states", "mesoregions", "microregions",
    "intermediateregions", "immediateregions", "municipalities",
    "municipalseats", "weightingareas", "censustracts", "statsgrid",
    "metroarea", "urbanareas", "amazonialegal", "biomes",
    "conservationunits", "disasterriskareas", "indigenousland",
    "semiarid", "healthfacilities", "healthregions", "neighborhoods",
    "schools", "amc", "urbanconcentrations", "poparrengements",
    "favelas", "pollingplaces", "quilombolalands",
],
Collaborator

@camilagb camilagb May 28, 2026


Following the R package

Suggested change
- "alias": [
-     "country", "regions", "states", "mesoregions", "microregions",
-     "intermediateregions", "immediateregions", "municipalities",
-     "municipalseats", "weightingareas", "censustracts", "statsgrid",
-     "metroarea", "urbanareas", "amazonialegal", "biomes",
-     "conservationunits", "disasterriskareas", "indigenousland",
-     "semiarid", "healthfacilities", "healthregions", "neighborhoods",
-     "schools", "amc", "urbanconcentrations", "poparrengements",
-     "favelas", "pollingplaces", "quilombolalands",
- ],
+ "alias": [
+     "country", "regions", "states", "mesoregions", "microregions",
+     "intermediateregions", "immediateregions", "municipalities",
+     "municipalseats", "weightingareas", "censustracts", "statsgrid",
+     "metroarea", "urbanareas", "amazonialegal", "biomes",
+     "conservationunits", "disasterriskareas", "indigenouslands",
+     "semiarid", "healthfacilities", "healthregions", "neighborhoods",
+     "schools", "amc", "urbanconcentrations", "poparrangements",
+     "favelas", "pollingplaces", "quilombolalands",
+ ],
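
Alias lists like this one are easy to get subtly wrong by hand. The sketch below shows one possible sanity check for duplicates and misspellings. `check_aliases` and both sample lists are hypothetical. The lists are a short illustrative subset, not the actual catalog from the R or Python package.

```python
from collections import Counter


def check_aliases(aliases, expected):
    """Report duplicated, unexpected, and missing aliases (hypothetical helper)."""
    counts = Counter(aliases)
    return {
        "duplicates": sorted(a for a, c in counts.items() if c > 1),
        "unexpected": sorted(set(aliases) - set(expected)),
        "missing": sorted(set(expected) - set(aliases)),
    }


# Illustrative subset only.
expected = ["states", "municipalities", "biomes"]
aliases = ["states", "biomes", "biomes", "municipalites"]

report = check_aliases(aliases, expected)
print(report)
```

A check along these lines could run in the test suite against whichever reference list the project treats as canonical.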

Comment on lines +87 to +101
rows = []
for _, row in out.iterrows():
    raw = row.get("years_available")
    if raw is None or (isinstance(raw, float) and pd.isna(raw)):
        years = []
    else:
        years = str(raw).split(", ")
    if not years or years == [""]:
        rows.append(row.to_dict())
    else:
        for y in years:
            r = row.to_dict()
            r["year"] = y.strip()
            rows.append(r)
return pd.DataFrame(rows)
Collaborator


Suggested change
- rows = []
- for _, row in out.iterrows():
-     raw = row.get("years_available")
-     if raw is None or (isinstance(raw, float) and pd.isna(raw)):
-         years = []
-     else:
-         years = str(raw).split(", ")
-     if not years or years == [""]:
-         rows.append(row.to_dict())
-     else:
-         for y in years:
-             r = row.to_dict()
-             r["year"] = y.strip()
-             rows.append(r)
- return pd.DataFrame(rows)
+ out["year"] = out["years_available"].fillna("").str.split(", ")
+ out_expandido = out.explode("year")
+ out_expandido["year"] = out_expandido["year"].str.strip()
+ out_expandido = out_expandido.drop(columns=["years_available"])
+ return out_expandido
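
To see what the vectorized version produces, here is a small self-contained sketch on toy data. The `expand_years` wrapper, the `function` column, and the sample rows are assumptions for illustration. `ignore_index=True` is an addition not present in the suggestion, used so the result gets a fresh 0..n-1 index.

```python
import pandas as pd


def expand_years(catalog: pd.DataFrame) -> pd.DataFrame:
    """Turn a comma-separated 'years_available' column into one row per year."""
    out = catalog.copy()
    # Missing values become "", which splits into [""] and survives explode.
    out["year"] = out["years_available"].fillna("").str.split(", ")
    out = out.explode("year", ignore_index=True)
    out["year"] = out["year"].str.strip()
    return out.drop(columns=["years_available"])


# Toy catalog, not real geobr metadata.
catalog = pd.DataFrame(
    {
        "function": ["read_state", "read_biomes", "read_schools"],
        "years_available": ["2010, 2020", "2019", None],
    }
)

expanded = expand_years(catalog)
print(expanded)
```

In this sketch a row with no listed years ends up with an empty-string `year`, and `years_available` is dropped. The loop version shown in the diff instead appends such rows without setting `year` and keeps the `years_available` column, so the two are not strictly equivalent. Which behavior is wanted is worth confirming against the tests.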

immediate regions. You should not select both code_muni and name_muni

def lookup_muni(
    year: int = 2010,
Collaborator


Suggested change
-     year: int = 2010,
+     year: int,
