Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
name: CI

on:
push:
branches: [main]
pull_request:
branches: [main]

jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13"]

steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install uv
uses: astral-sh/setup-uv@v4

- name: Install dependencies
run: uv sync --group dev

- name: Lint
run: uv run ruff check chatgpt_export_tool tests

- name: Format check
run: uv run ruff format --check chatgpt_export_tool tests

- name: Test
run: uv run pytest --cov=chatgpt_export_tool --cov-report=term-missing
28 changes: 28 additions & 0 deletions .github/workflows/publish.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
name: Publish to PyPI

on:
release:
types: [published]

permissions:
id-token: write

jobs:
publish:
runs-on: ubuntu-latest
environment: pypi
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: "3.13"

- name: Install build tools
run: pip install build

- name: Build package
run: python -m build

- name: Publish to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
233 changes: 81 additions & 152 deletions Fields.md
Original file line number Diff line number Diff line change
@@ -1,232 +1,161 @@
# Field Selection Reference

This document describes the field-selection and metadata-selection features that the current CLI actually supports.
Practical reference for the `--fields` and `--include`/`--exclude` options in `chatgpt-export`.

It is intentionally practical rather than exhaustive. The goal is to document the fields, groups, and selectors you can use with `chatgpt-export`, not to guess every field that might appear in every historical `conversations.json` file.
---

## Structural Levels
## How Data Is Structured

The tool understands conversation data at these nested levels:
ChatGPT exports nest data at these levels:

```text
```
conversation
└── mapping node
└── message
├── author
├── content
└── metadata
├── author (role, name)
├── content (content_type, parts, text, ...)
└── metadata (model_slug, message_type, ...)
```

The field selector can retain or remove fields across those levels while preserving the containers needed to reach nested selected fields.

Text export is transcript-oriented: it follows the active branch defined by `current_node` and `parent` links, then applies transcript visibility rules from the TOML config passed to `export`.

## `--fields`
The field selector retains or removes fields across these levels while preserving the containers needed to reach any selected nested field.

The `--fields` argument accepts one field-selection spec.
Text/Markdown export is **transcript-oriented** — it follows the active branch via `current_node` and `parent` links, then applies visibility rules from the TOML config.

Supported forms:
---

```text
all
none
include field1,field2
exclude field1,field2
groups group1,group2
```
## `--fields`

Examples:
Controls which structural fields survive before formatting.

```bash
chatgpt-export export data.json --fields all
chatgpt-export export data.json --fields none
chatgpt-export export data.json --fields "include title,create_time,mapping"
chatgpt-export export data.json --fields "exclude moderation_results,plugin_ids"
chatgpt-export export data.json --fields "groups minimal"
chatgpt-export export data.json --fields "groups conversation,message"
--fields all # keep everything (default)
--fields none # empty structure
--fields "include title,create_time,mapping" # whitelist
--fields "exclude moderation_results" # blacklist
--fields "groups minimal" # named group
--fields "groups conversation,message" # combine groups
```

Multi-word specs must be quoted.
> Multi-word specs must be quoted.

## Field Groups
---

The current built-in field groups are:
## Built-in Field Groups

### `conversation`

Includes:

- `_id`
- `conversation_id`
- `create_time`
- `update_time`
- `title`
- `type`
`_id` · `conversation_id` · `create_time` · `update_time` · `title` · `type`

### `message`

Includes:

- `author`
- `content`
- `status`
- `end_turn`
`author` · `content` · `status` · `end_turn`

### `metadata`

Includes:

- `model_slug`
- `message_type`
- `is_archived`
`model_slug` · `message_type` · `is_archived`

### `minimal`

Includes:
`title` · `create_time` · `message`

- `title`
- `create_time`
- `message`
---

## Known Structural Fields
## All Known Structural Fields

These are the structural fields the tool currently categorizes by level.
<details>
<summary><strong>Conversation level</strong></summary>

### Conversation
`title` · `create_time` · `update_time` · `mapping` · `moderation_results` · `current_node` · `plugin_ids` · `_id` · `conversation_id` · `type`

- `title`
- `create_time`
- `update_time`
- `mapping`
- `moderation_results`
- `current_node`
- `plugin_ids`
- `_id`
- `conversation_id`
- `type`
</details>

### Mapping Node
<details>
<summary><strong>Mapping node level</strong></summary>

- `id`
- `parent`
- `children`
- `message`
`id` · `parent` · `children` · `message`

### Message
</details>

- `author`
- `content`
- `status`
- `end_turn`
- `weight`
- `recipient`
- `channel`
- `create_time`
- `update_time`
<details>
<summary><strong>Message level</strong></summary>

### Author
`author` · `content` · `status` · `end_turn` · `weight` · `recipient` · `channel` · `create_time` · `update_time`

- `role`
- `name`
</details>

### Content
<details>
<summary><strong>Author</strong></summary>

- `content_type`
- `parts`
- `language`
- `response_format_name`
- `text`
- `user_profile`
- `user_instructions`
`role` · `name`

Unknown names are still allowed in `include` and `exclude` field specs, but the validator may warn about them.
</details>

## Metadata Filtering
<details>
<summary><strong>Content</strong></summary>

Metadata filtering is separate from `--fields`.
`content_type` · `parts` · `language` · `response_format_name` · `text` · `user_profile` · `user_instructions`

Use:
</details>

- `--include PATTERN [PATTERN ...]`
- `--exclude PATTERN [PATTERN ...]`
> Unknown field names are allowed in `include`/`exclude` specs, but the validator may warn about them.

These apply to known metadata names inside nested `message.metadata` dictionaries after structural field filtering.
---

Examples:
## Metadata Filtering

Separate from `--fields`. Applies only to keys inside `message.metadata` dictionaries, *after* structural filtering.

```bash
chatgpt-export export data.json --include model_slug
chatgpt-export export data.json --include "model*" --exclude plugin_ids
chatgpt-export export data.json --fields "groups message" --include is_archived
```

Pattern matching supports:

- exact matches
- substring matches
- shell-style wildcards such as `model*`

## Known Metadata Names
**Pattern matching:** exact, substring, and shell-style globs (`model*`).

The current metadata filter recognizes these names:
**Known metadata names:** `model_slug` · `message_type` · `plugin_ids` · `is_archived`

- `model_slug`
- `message_type`
- `plugin_ids`
- `is_archived`
---

## How Filtering Combines
## Filtering Pipeline

Filtering happens in this order:

1. structural field selection through `--fields`
2. metadata filtering through `--include` and `--exclude`
3. formatting to text or JSON

This means:
```
conversations.json
1. Structural field selection (--fields)
2. Metadata filtering (--include / --exclude)
3. Formatting (md / txt / json)
→ output
```

- `--fields` decides whether structural containers like `mapping`, `message`, `author`, `content`, and `metadata` survive
- `--include` and `--exclude` decide which metadata keys remain inside metadata dictionaries
`--fields` decides whether containers like `mapping`, `message`, and `metadata` survive.
`--include`/`--exclude` decides which keys remain inside those metadata containers.

## Practical Recipes
---

Keep only a small readable subset:
## Recipes

```bash
# Minimal readable export
chatgpt-export export data.json --fields "groups minimal"
```

Keep titles and timestamps but drop plugin noise:

```bash
# Drop noise, keep structure
chatgpt-export export data.json --fields "exclude plugin_ids,moderation_results"
```

Keep only message-oriented structure and model metadata:

```bash
# Message structure + model info only
chatgpt-export export data.json --fields "groups message" --include "model*"
```

Write one file per conversation with a minimal payload:

```bash
# One file per conversation, minimal payload
chatgpt-export export data.json --split subject --output-dir exports --fields "groups minimal"
```

---

## Notes

- `analyze --fields` reports field coverage; it does not accept the export-style field-selection spec.
- `export` can load defaults from a TOML file via `--config PATH`.
- The repo ships `chatgpt_export.toml.example` as a template; copy it to a local file before use.
- `export --split single` writes to stdout unless `--output` is provided.
- Subject split files are named from the source conversation title plus identifier.
- Split modes such as `subject`, `date`, and `id` write to `--output-dir`.
- Text export follows the active conversation branch and is configurable through the `[transcript]` and `[text_output]` TOML sections.
- Default text export shows user text, assistant text, assistant thoughts, and a compact preview of `user_editable_context`.
- Default text export hides assistant code, reasoning recap, and tool plumbing unless the transcript policy explicitly enables them.
- Advanced transcript controls include `user_editable_context_mode`, `show_visually_hidden_content_types`, `include_content_types`, and `exclude_content_types`.
- Text layout controls include `layout_mode`, `heading_style`, `include_turn_count_in_header`, `include_turn_numbers`, `turn_separator`, `strip_chatgpt_artifacts`, and `wrap_width`.
- A practical default is `layout_mode = "reading"` with `turn_separator = "---"` and artifact stripping enabled.
- For tighter exports, use `layout_mode = "compact"` and disable turn counts.
- For notes-oriented output, use `heading_style = "markdown"`.
- `analyze --fields` reports field *coverage* — it does not use the export-style field-selection spec.
- `export --split single` writes to stdout unless `--output` is given.
- Subject split files are named `Title_ID.md` (or `.txt`/`.json`).
- Default Markdown export shows user text, assistant text, thoughts, and compact context previews.
- Hidden by default: assistant code, reasoning recap, tool plumbing.
- All transcript visibility is configurable via `[transcript]` in the TOML config.
- Text layout is configurable via `[text_output]`: layout mode, heading style, wrap width, separators, turn numbering.
Loading
Loading