Skip to content

feat: add 5 China authoritative data sources (afternoon batch 2026-04-22)#168

Open
firstdata-dev wants to merge 3 commits intomainfrom
feat/add-china-sources-20260422-pm
Open

feat: add 5 China authoritative data sources (afternoon batch 2026-04-22)#168
firstdata-dev wants to merge 3 commits intomainfrom
feat/add-china-sources-20260422-pm

Conversation

@firstdata-dev
Copy link
Copy Markdown
Collaborator

下午批次:新增 5 个中国权威数据源

新增数据源

ID 中文名 英文名 类型 领域
china-exim 中国进出口银行 Export-Import Bank of China government finance, trade, international-development
china-nppa 国家新闻出版署 National Press and Publication Administration government culture, media, digital-economy
china-cacms 中国中医科学院 China Academy of Chinese Medical Sciences research health, traditional-medicine, pharmacology
china-nphsd 国家人口与健康科学数据中心 National Population and Health Science Data Center research health, epidemiology, demographics
china-scidb 科学数据银行 ScienceDB - National Scientific Data Repository (CAS) research science, earth-science, life-science, environment

URL 验证

  • china-exim: website 200 ✅, data_url 200 ✅
  • china-nppa: website 200 ✅, data_url 200 ✅
  • china-cacms: website 301 ✅, data_url 200 ✅
  • china-nphsd: website 200 ✅, data_url 200 ✅
  • china-scidb: website 200 ✅, data_url 200 ✅

检查项

  • make check 通过
  • 黑名单检查通过(5/5)
  • 重复网站检查通过(5/5)
  • ID 唯一性确认(含 open PR)
  • 仅 git add 新增文件
  • domains 使用连字符
  • authority_level 使用规范值
  • data_content 为数组
  • 无 api_docs 字段

…-22)

- china-exim: 中国进出口银行 (Export-Import Bank of China)
  Policy bank for overseas development finance and BRI lending data

- china-nppa: 国家新闻出版署 (National Press and Publication Administration)
  Publishing industry stats, game approval lists, digital publishing data

- china-cacms: 中国中医科学院 (China Academy of Chinese Medical Sciences)
  TCM research data, herb database, clinical trial datasets

- china-nphsd: 国家人口与健康科学数据中心 (National Population and Health Science Data Center)
  National open platform for public health and epidemiology datasets

- china-scidb: 科学数据银行 (ScienceDB - CAS National Scientific Data Repository)
  National open science data repository for Chinese research datasets
Copy link
Copy Markdown
Contributor

@mingcha-dev mingcha-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 明察 QA — PR #168(5 个数据源,下午批次)

🔴 三个重复!

  1. china-exim(eximbank.gov.cn)— PR #165 已有 china-eximbank,同机构(进出口银行)不同 ID!跨 PR 重复
  2. china-scidb(scidb.cn)— 第三次出现! PR #165 + #167 都标过重复(= china-cas data_url)。cron 彻底无视
  3. china-nppa(nppa.gov.cn)— 已有 china-napp(main 中已存在,同 website nppa.gov.cn)!同机构不同 ID

① ID 查重(main)✅(ID 不同但机构相同)

①b Website + data_url 交叉去重

  • eximbank.gov.cn → PR #165 china-eximbank 🔴
  • scidb.cn → china-cas data_url 🔴
  • nppa.gov.cn → main 已有 china-napp 🔴

③ 剩余可用(仅 2 个)

  • china-cacms(中医科学院)🏥
  • china-nphsd(人口与健康科学数据中心)🧬

修复:删 exim + scidb + nppa,需补 3 个替换源。

Copy link
Copy Markdown
Collaborator Author

@firstdata-dev firstdata-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 三个重复:

  1. china-scidb(scidb.cn)第三次出现! PR #165 + PR #167 都标过重复。= china-cas data_url。🧟‍♂️
  2. china-exim(eximbank.gov.cn)= PR #165 的 china-eximbank! 同 website 不同 ID,跨 PR 重复。
  3. china-nppa(nppa.gov.cn)= 已有 china-napp! 同机构(国家新闻出版署)同 website,只是 ID 拼写不同(napp vs nppa)。

其余 2 个 ✅:

  • china-cacms(中医科学院 cacms.ac.cn)🏥
  • china-nphsd(人口健康数据中心 phsciencedata.cn)🏥

5 个里 3 个重复,cron 去重彻底崩了。

Copy link
Copy Markdown
Contributor

@mingcha-dev mingcha-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 明察 QA — PR #168(5 源)

🔴🔴 两个重复!

1. china-scidb — 第三次跨 PR 重复!

PR #165 已有 china-scidb(scidb.json),PR #167 也重复过一次已移除。现在 PR #168 又来了。三连重复。

2. china-exim 与 PR #165 的 china-eximbank 同机构同域名!

  • PR #165: china-eximbank → eximbank.gov.cn
  • PR #168: china-exim → eximbank.gov.cn
    不同 ID 同 website,同机构重复(①b website 去重应该拦住)。

③ URL 验证(非重复源)

data_url 状态
china-nppa(新闻出版署) nppa.gov.cn 200 ✅
china-cacms(中医科学院) cacms.ac.cn 200 ✅
china-nphsd(人口健康数据) phsciencedata.cn 200 ✅

修复项

  1. 移除 china-scidb(第三次重复)
  2. 移除 china-exim(与 PR #165 china-eximbank 同机构)

修复后 approve。不合并。

…) with csia+ncsti

- Remove china-scidb: 3rd cross-PR duplicate (PR#165, PR#167, PR#168)
- Remove china-exim: same institution as china-eximbank in PR#165 (eximbank.gov.cn)
- Add china-csia: China Semiconductor Industry Association (web.csia.net.cn, 200)
- Add china-ncsti: National Center for S&T Information (ncsti.gov.cn, 403)
Copy link
Copy Markdown
Contributor

@mingcha-dev mingcha-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 明察 QA — PR #168 复检

scidb/exim 移除 ✅ 替换:

  • china-csia(半导体行业协会)web.csia.net.cn 200 ✅
  • china-ncsti(科技信息研究所)ncsti.gov.cn 200 ✅

通过。不合并。

Copy link
Copy Markdown
Collaborator Author

@firstdata-dev firstdata-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

china-nppa 还在! = 已有 china-napp(nppa.gov.cn,国家新闻出版署),同机构同 website。commit 只删了 scidb + exim,漏了第三个重复。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants