Python library for capturing GitHub repository metadata and README files to a local SQLite database.
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
For development with the test suite:
pip install -e '.[dev]'
pytest
Installing capxure adds a cap console script. GitHub commands live under cap git:
export GITHUB_TOKEN=ghp_... # or GH_TOKEN
cap git owner/repo # capture by shorthand
cap git https://github.com/owner/repo # or by full URL
cap git capture owner/repo # explicit form (alias of the above)
cap git owner/repo --data-dir ~/caps # override storage location
cap git stars # bulk-capture your starred repos
Progress events print to stderr (info: Fetching metadata…, success: owner/repo: captured successfully). stdout is reserved for read commands (ls) that produce structured output.
Exit codes: 0 success (including dedup-skip), 1 library-reported failure or missing token, 2 usage error, 3 malformed target, 130 Ctrl-C.
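When driving the CLI from a script, the stream split and exit codes above can be handled roughly as follows. This is a minimal sketch: `run_cap` and `EXIT_MEANINGS` are hypothetical helper names, not part of capxure, and the mapping simply restates the exit codes listed above.

```python
import subprocess

# Exit-code meanings as documented above for the capture commands.
EXIT_MEANINGS = {
    0: "success (including dedup-skip)",
    1: "library-reported failure or missing token",
    2: "usage error",
    3: "malformed target",
    130: "interrupted (Ctrl-C)",
}


def run_cap(argv: list[str]) -> tuple[str, str, str]:
    """Run a command and return (exit-code meaning, stdout, stderr).

    stdout and stderr are captured separately because progress events
    go to stderr while structured output goes to stdout.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    meaning = EXIT_MEANINGS.get(
        proc.returncode, f"unexpected exit code {proc.returncode}"
    )
    return meaning, proc.stdout, proc.stderr


# Example call (requires capxure to be installed and a token exported):
# meaning, out, err = run_cap(["cap", "git", "owner/repo"])
```

Passing `check=False` keeps non-zero exits from raising, so the caller can branch on the meaning instead of catching an exception.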
Quick-capture notes live under cap note:
cap note "fleeting thought" # quickest path; smart-dispatch to add
cap note "great quote" -s "book:Atomic Habits" -L "p.42" -k quote
echo "from a pipe" | cap note add -a "via stdin"
cap note ls # cards on a TTY, TSV when piped
cap note ls --format plain # force TSV (7 fields per note)
cap git ls # pretty table on a TTY, TSV when piped
cap git ls -s stars -r # least-starred first
cap git ls -t ml -t nlp -l 25 # repos tagged ml OR nlp, top 25
cap git ls topics # topic counts, descending
cap git ls --format plain # force TSV (9 fields per repo)
Plain output is tab-separated and intended for scripts; the description field has tabs/newlines collapsed to spaces so downstream awk / cut / fzf parse cleanly. Use CAPXURE_DATA_DIR to target a non-default db.
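A script consuming the plain format might split and validate it like this. The sketch assumes only what is stated above (one record per line, tab-separated, 9 fields per repo and 7 per note); the field order is not documented here, so rows are kept as plain lists, and whether a header row is emitted is not stated, so every non-empty line is treated as a record. `parse_plain` is a hypothetical helper, not a capxure function.

```python
REPO_FIELD_COUNT = 9  # fields per line from `cap git ls --format plain`
NOTE_FIELD_COUNT = 7  # fields per line from `cap note ls --format plain`


def parse_plain(text: str, expected_fields: int) -> list[list[str]]:
    """Split plain TSV output into rows, checking the field count."""
    rows: list[list[str]] = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # skip the trailing empty line after the last newline
        fields = line.split("\t")
        if len(fields) != expected_fields:
            raise ValueError(
                f"line {line_no}: expected {expected_fields} fields, "
                f"got {len(fields)}"
            )
        rows.append(fields)
    return rows
```

Because tabs and newlines inside the description are already collapsed to spaces, a plain `split("\t")` is enough and no quoting-aware CSV parser is needed.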
Exit codes for ls: 0 success (including empty results), 2 usage error, 130 Ctrl-C.
cap mcp runs a read-only Model Context Protocol server over stdio that exposes captured repos and notes to AI consumers like Claude Code. It only reads; capture still happens through the regular cap git and cap note CLIs.
Wire it into Claude Code's mcp_servers configuration:
{
  "mcpServers": {
    "capxure": {
      "command": "cap",
      "args": ["mcp"]
    }
  }
}

Or with a custom data directory:
{
  "mcpServers": {
    "capxure": {
      "command": "cap",
      "args": ["mcp", "--data-dir", "/path/to/capxure-data"]
    }
  }
}

- search_repos(query, topics?, language?, k?): FTS5 search across captured repos.
- get_repo(owner, name): full structured metadata for one repo.
- get_readme(owner, name): full README text.
- list_topics(prefix?, min_count?, max_count?, order?, limit?): discovery (what topics exist).
- search_notes(query, sources?, k?): FTS5 search across notes.
- list_sources(prefix?, min_count?, max_count?, order?, limit?): discovery (what sources you've taken notes from).
Searches return lean hits with snippets; use get_readme / get_repo to pull full content for the hits worth investigating. This composes cheaply — searches stay fast and small, and full-document fetches happen only after you've decided you want one.
The server is stdio-only (no HTTP) and read-only (no capture tools — use the regular cap git / cap note CLIs for ingest).
For programmatic use, import directly. Your consumer code is responsible for obtaining a GitHub personal-access token (e.g., via python-dotenv, your shell environment, or a secrets manager) and passing it to GitHubClient.
import asyncio
import os

from capxure import Database, GitHubClient, process_repo, Severity


async def main() -> None:
    def log(message: str, severity: Severity) -> None:
        print(f"[{severity}] {message}")

    with Database() as db:
        async with GitHubClient(os.environ["GITHUB_TOKEN"]) as gh:
            await process_repo(
                "https://github.com/owner/repo",
                github=gh,
                repos=db.repos,
                on_status=log,
            )


asyncio.run(main())

The notes domain works the same way:
from capxure import Database

with Database() as db:
    note = db.notes.add("fleeting thought", source="twitter", kind_hint="quote")
    for n in db.notes.list_notes():
        print(n.id, n.content)

Capxure persists captured repos to a single SQLite database. The default location resolves in this order:

1. $CAPXURE_DATA_DIR environment variable (if set and non-empty)
2. platformdirs.user_data_dir("capxure"), e.g. ~/.local/share/capxure/ on Linux, ~/Library/Application Support/capxure/ on macOS
The database file inside that directory is capxure.db. Override the full path with Database(db_path=Path(...)).
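The resolution order can be sketched as a small pure function. This is an illustration of the rule described above, not capxure's actual implementation: `resolve_db_path` is a hypothetical name, and the platform default is passed in as a string so the sketch does not depend on platformdirs being installed.

```python
from pathlib import Path
from typing import Mapping


def resolve_db_path(environ: Mapping[str, str], platform_default: str) -> Path:
    """Return the database path following the documented resolution order.

    1. CAPXURE_DATA_DIR, if set and non-empty.
    2. Otherwise the platform default directory supplied by the caller.
    """
    override = environ.get("CAPXURE_DATA_DIR", "")
    base = Path(override) if override else Path(platform_default)
    return base / "capxure.db"
```

In real use the arguments would be `os.environ` and the result of `platformdirs.user_data_dir("capxure")`.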
WAL mode is enabled, so while a connection is open you'll see capxure.db-wal and capxure.db-shm sidecar files next to the database. These are cleaned up on a normal close and do not need to be backed up separately.
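The sidecar behaviour is a property of SQLite itself and can be observed with the standard library alone. The sketch below uses a throwaway database in a temporary directory (the file name `example.db` is arbitrary) and reports whether the `-wal` file exists while the connection is open and after it is closed.

```python
import sqlite3
import tempfile
from pathlib import Path


def wal_sidecar_lifecycle() -> tuple[bool, bool]:
    """Return (wal exists while open, wal exists after close)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "example.db"
        wal_path = Path(tmp) / "example.db-wal"
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.execute("INSERT INTO t VALUES (1)")
            conn.commit()  # the write lands in the WAL file
            present_while_open = wal_path.exists()
        finally:
            conn.close()  # closing the last connection checkpoints the WAL
        present_after_close = wal_path.exists()
    return present_while_open, present_after_close
```

If a process is killed rather than closed cleanly, the sidecar files can remain on disk; SQLite recovers from them the next time the database is opened.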
The schema is a documented public contract — you may run arbitrary SQL against it via the db.connection escape hatch.
- Table repos: one row per captured GitHub repo. Includes denormalized columns for common query hotspots (language, stars, forks, pushed_at, is_fork, is_archived), an inline readme_content column (nullable; NULL means "no README"), and the full GitHub API response preserved as JSON in metadata.
- Table repo_topics: junction table for many-to-many topics. Composite primary key (repo_id, topic) provides insert-dedup; a secondary index on topic supports WHERE topic = ? filtering.
- Table notes: append-only quick-capture inbox. content is the only required column; annotation, source, source_locator, and kind_hint are optional free-form strings; captured_at is set by SQLite default. No indexes; minimal cap note ls doesn't need them yet.
The full DDL lives in src/capxure/db.py under _SCHEMA_SQL.
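To illustrate how the junction table supports topic filtering and insert-dedup, here is a sketch against a simplified stand-in schema in an in-memory database. It is not the real DDL: only a few of the columns described above are included, the `id` primary-key column name and the index name are assumptions, and the repo names are made up.

```python
import sqlite3

# Stand-in schema: a small subset of the documented columns.
STAND_IN_SCHEMA = """
CREATE TABLE repos (
    id INTEGER PRIMARY KEY,          -- assumed column name
    full_name TEXT NOT NULL,
    stars INTEGER
);
CREATE TABLE repo_topics (
    repo_id INTEGER NOT NULL REFERENCES repos(id),
    topic TEXT NOT NULL,
    PRIMARY KEY (repo_id, topic)     -- composite key gives insert-dedup
);
CREATE INDEX idx_repo_topics_topic ON repo_topics(topic);
"""


def demo_topic_query() -> tuple[list[str], int]:
    """Return (repos tagged ml OR nlp by stars desc, topic row count)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAND_IN_SCHEMA)
    conn.executemany(
        "INSERT INTO repos (id, full_name, stars) VALUES (?, ?, ?)",
        [(1, "example/alpha", 10), (2, "example/beta", 50), (3, "example/gamma", 99)],
    )
    # The last pair repeats (1, "ml"); the composite key makes it a no-op.
    conn.executemany(
        "INSERT OR IGNORE INTO repo_topics (repo_id, topic) VALUES (?, ?)",
        [(1, "ml"), (1, "nlp"), (2, "nlp"), (3, "web"), (1, "ml")],
    )
    names = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT r.full_name, r.stars FROM repos r "
            "JOIN repo_topics t ON t.repo_id = r.id "
            "WHERE t.topic IN (?, ?) ORDER BY r.stars DESC LIMIT 25",
            ("ml", "nlp"),
        )
    ]
    topic_rows = conn.execute("SELECT COUNT(*) FROM repo_topics").fetchone()[0]
    conn.close()
    return names, topic_rows
```

The `DISTINCT` matters because a repo tagged with both topics would otherwise appear once per matching topic row.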
Existing capxure databases at schema version 1 auto-upgrade to version 2 on next open (the upgrade adds the notes table; no data is touched).
Database owns the connection and schema; RepoStore (accessed via db.repos) owns repo queries. This split keeps each unit small and lets future capture domains (e.g. notes) drop in alongside without touching repo code.
from capxure import Database, UpsertOutcome

with Database() as db:
    outcome = db.repos.upsert(github_metadata_dict, readme_content)
    # outcome is one of: NEW, UPDATED, RENAMED, UNCHANGED, LOCAL_IS_NEWER

    repo = db.repos.get_repo("sindresorhus", "awesome-nodejs")
    if repo is not None:
        print(repo.stars, repo.topics)

    # Default order is last_synced_at DESC; pass sort=/reverse=/topics=/limit=
    # to customize. Zero-arg call still returns every row.
    all_repos = db.repos.list_repos()

Database.notes returns a NoteStore for the catch-all notes inbox; Database.repos returns a RepoStore for git captures. Both share db.connection so a single transaction can span both domains via the SQL escape hatch.
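Since both stores share one sqlite3.Connection, a cross-domain transaction can rely on the standard connection context manager, which commits on success and rolls back on an exception. The sketch below shows this with plain sqlite3 and stand-in tables; the two-column schema and the helper names `make_stand_in_db` and `capture_both` are assumptions for illustration, not capxure's schema or API.

```python
import sqlite3


def make_stand_in_db() -> sqlite3.Connection:
    """Create an in-memory database with minimal stand-in tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE repos (full_name TEXT NOT NULL UNIQUE);"
        "CREATE TABLE notes (content TEXT NOT NULL);"
    )
    return conn


def capture_both(conn: sqlite3.Connection, full_name: str, note_text: str) -> None:
    """Insert a repo row and a note row atomically."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("INSERT INTO repos (full_name) VALUES (?)", (full_name,))
        conn.execute("INSERT INTO notes (content) VALUES (?)", (note_text,))
```

If the second insert fails, for example because the note content violates the NOT NULL constraint, the repo row from the same call is rolled back as well.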
The db.connection property exposes the underlying sqlite3.Connection as an escape hatch for ad-hoc SQL:
with Database() as db:
    for row in db.connection.execute(
        "SELECT full_name, stars FROM repos WHERE language = ? ORDER BY stars DESC",
        ("Python",),
    ):
        print(row["full_name"], row["stars"])

- New domain: capxure.note for low-friction quick-capture inbox. Add via cap note "<text>" (smart-dispatches to cap note add) with optional -a/--annotation, -s/--source, -L/--loc, -k/--kind flags. Stdin pipe supported when no positional given. List with cap note ls (pretty cards on TTY, TSV when piped).
- Python: Database.notes returns a NoteStore with add(...), list_notes(), count_notes(). New Note dataclass.
- Schema bumped to v2 (added notes table, no indexes). Existing v1 databases auto-upgrade on next open. Forward-incompatible dbs still raise UnsupportedSchemaError.
- Fix: Database.__exit__ now commits pending writes (and rolls back on exception) before closing the connection; this is required for cross-connection visibility within the same process.
Breaking: CLI and Python API both reorganized around per-domain subpackages to make room for additional capture surfaces.
- CLI: GitHub commands move under cap git. cap owner/repo, cap ls, cap stars, and cap capture no longer work; use cap git owner/repo, cap git ls, cap git stars, cap git capture instead. cap with no arguments prints usage and exits 2.
- Python API: Storage is removed. Use Database (connection + schema lifecycle) plus db.repos (a RepoStore instance for repo queries). process_repo now takes repos: RepoStore instead of storage: Storage.
- Module reorganization: capxure.github → capxure.git.client; capxure.processor → capxure.git.processor. The top-level capxure package re-exports everything previously available there (minus Storage), so from capxure import GitHubClient etc. keeps working.
- Schema is unchanged. Existing databases on disk continue to work without migration.
- New cap console script (installed via [project.scripts]). First subcommand: cap <target> captures a repo. Targets accept full URLs or bare owner/repo shorthand.
- capxure.github.parse_github_url regex broadened so bare owner/repo also parses (the github.com/ prefix is now optional). No behavior change for inputs the old regex accepted.
- New public module capxure.cli exposing main() and build_parser(). Not re-exported from the top-level package; users of the library directly should keep importing from capxure as before.
Breaking: Storage layer rewritten from JSON-files-on-disk to SQLite.
- DeduplicationResult enum removed. Replaced by UpsertOutcome with the values NEW, UPDATED, RENAMED, UNCHANGED, LOCAL_IS_NEWER.
- Storage constructor parameter renamed from data_dir to db_path. The new default resolves to {CAPXURE_DATA_DIR or platformdirs.user_data_dir("capxure")}/capxure.db.
- Old Storage methods removed: load_metadata, save_metadata, check_dedup, save_readme, upsert_entry, find_key_by_id, ensure_directories, make_key.
- New Storage methods: upsert, diff, get_repo, get_repo_by_github_id, list_repos, count_repos, get_metadata_json, and the connection property (raw SQL escape hatch).
- New types exported: Repo (frozen dataclass), UpsertOutcome, DuplicateRepoNameError, UnsupportedSchemaError.
- The processor.process_repo signature is unchanged, but ProcessResult.outcome is now UpsertOutcome | None instead of DeduplicationResult | None.
- Data previously captured at data/metadata.json + data/readmes/*.md is not auto-migrated. Re-capture via process_repo or hand-load via storage.upsert.