diff --git a/docs/migration/elasticsearch-to-doris.md b/docs/migration/elasticsearch-to-doris.md new file mode 100644 index 0000000000000..b9aa80e2f90f4 --- /dev/null +++ b/docs/migration/elasticsearch-to-doris.md @@ -0,0 +1,137 @@ +--- +{ + "title": "Elasticsearch to Doris", + "language": "en", + "description": "Comprehensive guide to migrating data from Elasticsearch to Apache Doris" +} +--- + +This guide covers migrating data from Elasticsearch to Apache Doris. Doris can serve as a powerful alternative to Elasticsearch for log analytics, full-text search, and general OLAP workloads, often with better performance and lower operational complexity. + +## Why Migrate from Elasticsearch to Doris? + +| Aspect | Elasticsearch | Apache Doris | +|--------|---------------|--------------| +| Query Language | DSL (JSON-based) | Standard SQL | +| JOINs | Limited | Full SQL JOINs | +| Storage Efficiency | Higher storage usage | Columnar compression | +| Operational Complexity | Complex cluster management | Simpler operations | +| Full-text Search | Native inverted index | Inverted index support | +| Real-time Analytics | Good | Excellent | + +## Considerations + +1. **Full-text Search**: Doris supports [Inverted Index](../table-design/index/inverted-index/overview.md) for full-text search capabilities similar to Elasticsearch. + +2. **Index to Table Mapping**: Each Elasticsearch index typically maps to a Doris table. + +3. **Nested Documents**: Elasticsearch nested types map to Doris [VARIANT](../data-operate/import/complex-types/variant.md) type for flexible schema handling. + +4. **Array Handling**: Elasticsearch doesn't have explicit array types. To read arrays correctly via the ES Catalog, configure array field metadata in the ES index mapping using `_meta.doris.array_fields`. + +5. **Date Types**: Elasticsearch dates can have multiple formats. Ensure consistent date handling when migrating — use explicit casting to DATETIME. + +6. 
**_id Field**: To preserve Elasticsearch document `_id`, enable `mapping_es_id` in the ES Catalog configuration. + +7. **Performance**: For better ES Catalog read performance, enable `doc_value_scan`. Note that `text` fields don't support doc_value and will fall back to `_source`. + +## Data Type Mapping + +| Elasticsearch Type | Doris Type | Notes | +|--------------------|------------|-------| +| null | NULL | | +| boolean | BOOLEAN | | +| byte | TINYINT | | +| short | SMALLINT | | +| integer | INT | | +| long | BIGINT | | +| unsigned_long | LARGEINT | | +| float | FLOAT | | +| half_float | FLOAT | | +| double | DOUBLE | | +| scaled_float | DOUBLE | | +| keyword | STRING | | +| text | STRING | Consider inverted index in Doris | +| date | DATE or DATETIME | See Date Types consideration above | +| ip | STRING | | +| nested | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) for flexible schema | +| object | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) | +| flattened | VARIANT | Supported since Doris 3.1.4, 4.0.3 | +| geo_point | STRING | Store as "lat,lon" string | +| geo_shape | STRING | Store as GeoJSON string | + +## Migration Options + +### Option 1: ES Catalog (Direct Query and Migration) + +The [ES Catalog](../lakehouse/catalogs/es-catalog.md) provides direct access to Elasticsearch data from Doris, enabling both querying and migration. + +**Prerequisites**: Elasticsearch 5.x or higher; network connectivity between Doris FE/BE nodes and Elasticsearch. + +### Option 2: Logstash Pipeline + +Use Logstash to read from Elasticsearch and write to Doris via HTTP (Stream Load). This approach gives you transformation capabilities during migration. + +### Option 3: Custom Script with Scroll API + +For more control, use a custom script with Elasticsearch Scroll API to read data and load it into Doris via Stream Load. 
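As a concrete sketch of Option 1, a migration can be a catalog definition plus a single `INSERT ... SELECT`. The host, credentials, index, and table names below are illustrative; the two catalog properties follow the considerations above:

```sql
-- ES Catalog pointing at the source cluster
CREATE CATALOG es_source PROPERTIES (
    "type" = "es",
    "hosts" = "http://es-host:9200",
    "user" = "elastic",
    "password" = "password",
    "doc_value_scan" = "true",   -- faster columnar reads where doc_value exists
    "mapping_es_id" = "true"     -- expose the document _id as a column
);

-- Copy one index (ES indexes appear under the catalog's default_db)
-- into a pre-created Doris table
INSERT INTO target_db.app_logs
SELECT `_id`, CAST(`ts` AS DATETIME), level, message
FROM es_source.default_db.app_logs;
```

For large indexes, run the `INSERT` per time range rather than in one statement, so each batch can be validated independently.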
+ +## Full-text Search in Doris + +Doris's [Inverted Index](../table-design/index/inverted-index/overview.md) provides full-text search capabilities similar to Elasticsearch. + +### DSL to SQL Conversion Reference + +| Elasticsearch DSL | Doris SQL | +|-------------------|-----------| +| `{"match": {"title": "doris"}}` | `WHERE title MATCH 'doris'` | +| `{"match_phrase": {"content": "real time"}}` | `WHERE content MATCH_PHRASE 'real time'` | +| `{"term": {"status": "active"}}` | `WHERE status = 'active'` | +| `{"terms": {"tag": ["a", "b"]}}` | `WHERE tag IN ('a', 'b')` | +| `{"range": {"price": {"gte": 10}}}` | `WHERE price >= 10` | +| `{"bool": {"must": [...]}}` | `WHERE ... AND ...` | +| `{"bool": {"should": [...]}}` | `WHERE ... OR ...` | +| `{"exists": {"field": "email"}}` | `WHERE email IS NOT NULL` | + +## Feature Compatibility + +### VARIANT Type vs ES Dynamic Mapping + +Doris [VARIANT](../data-operate/import/complex-types/variant.md) type provides comparable functionality to Elasticsearch Dynamic Mapping for flexible schema handling. + +| Feature | Doris VARIANT | ES Dynamic Mapping | Status | +|---------|--------------|-------------------|--------| +| Dynamic schema inference | Auto-infer JSON field types | Dynamic Mapping | Compatible | +| Predefined field types | `MATCH_NAME 'field': type` | Explicit Mapping | Compatible | +| Pattern-based type matching | `MATCH_NAME_GLOB 'pattern*': type` | dynamic_templates | Compatible | +| Field index configuration | `INDEX ... 
PROPERTIES("field_pattern"=...)` | Mapping + Index Settings | Compatible | +| Custom analyzer | `CREATE INVERTED INDEX ANALYZER` | Custom Analyzer | Compatible | +| Sub-column count limit | `variant_max_subcolumns_count` | `mapping.total_fields.limit` | Compatible | +| Sparse column optimization | `variant_enable_typed_paths_to_sparse` | N/A | Doris-specific | +| Nested array objects | Flattened handling | Nested Type | Partial | + +### Search Function vs ES Query String + +Doris `search()` function provides Lucene-compatible query string syntax similar to Elasticsearch `query_string`. + +| Feature | Doris search() | ES query_string | Status | +|---------|---------------|----------------|--------| +| Query string syntax | Lucene mode | query_string query | Compatible | +| Multi-field search | `fields` parameter | multi_match / fields | Supported | +| best_fields mode | Supported | Supported | Supported | +| cross_fields mode | Supported | Supported | Supported | +| VARIANT sub-column search | `variant.field:term` | Object/Nested search | Supported | +| Boolean queries | AND/OR/NOT | AND/OR/NOT | Supported | +| Phrase queries | `"exact phrase"` | `"exact phrase"` | Supported | +| Wildcards | `*`, `?` | `*`, `?` | Supported | +| Regular expressions | `/pattern/` | `/pattern/` | Supported | +| Relevance scoring | Disabled | BM25 | Not supported | +| Fuzzy queries | Not supported | `term~2` | Not supported | +| Range queries | Not supported | `[a TO z]` | Not supported | +| Proximity queries | Not supported | `"foo bar"~5` | Not supported | + +## Next Steps + +- [Inverted Index](../table-design/index/inverted-index/overview.md) - Full-text search in Doris +- [ES Catalog](../lakehouse/catalogs/es-catalog.md) - Complete ES Catalog reference +- [Log Storage Analysis](../log-storage-analysis.md) - Optimizing log analytics in Doris diff --git a/docs/migration/mysql-to-doris.md b/docs/migration/mysql-to-doris.md new file mode 100644 index 0000000000000..aaf5dedfddc4c --- 
/dev/null +++ b/docs/migration/mysql-to-doris.md @@ -0,0 +1,150 @@ +--- +{ + "title": "MySQL to Doris", + "language": "en", + "description": "Comprehensive guide to migrating data from MySQL to Apache Doris" +} +--- + +This guide covers migrating data from MySQL to Apache Doris. MySQL is one of the most common migration sources, and Doris provides excellent compatibility with MySQL protocol, making migration straightforward. + +## Considerations + +1. **Protocol Compatibility**: Doris is MySQL protocol compatible, so existing MySQL clients and tools work with Doris. + +2. **Real-time Requirements**: If you need real-time synchronization, Flink CDC supports automatic table creation and schema changes. + +3. **Full Database Sync**: The Flink Doris Connector supports synchronizing entire MySQL databases including DDL operations. + +4. **Auto Increment Columns**: MySQL AUTO_INCREMENT columns can map to Doris's auto-increment feature. When migrating, you can preserve original IDs by explicitly specifying column values. + +5. **ENUM and SET Types**: MySQL ENUM and SET types are migrated as STRING in Doris. + +6. **Binary Data**: Binary data (BLOB, BINARY) is typically stored as STRING. Consider using HEX encoding for binary data during migration. + +7. **Large Table Performance**: For tables with billions of rows, consider increasing Flink parallelism, tuning Doris write buffer, and using batch mode for initial load. 
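To illustrate consideration 4 above, a Doris table can declare an auto-increment column and still accept the original MySQL IDs when the column is listed explicitly. This is a sketch; the table, column, and catalog names are illustrative:

```sql
CREATE TABLE orders (
    id          BIGINT NOT NULL AUTO_INCREMENT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2)
)
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

-- Migration: name `id` explicitly so source values are preserved
-- rather than regenerated
INSERT INTO orders (id, customer_id, amount)
SELECT id, customer_id, amount FROM mysql_catalog.source_db.orders;
```

Rows inserted later without an explicit `id` receive generated values; verify that generated values cannot collide with the migrated ID range before relying on this.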
+ +## Data Type Mapping + +| MySQL Type | Doris Type | Notes | +|------------|------------|-------| +| BOOLEAN / TINYINT(1) | BOOLEAN | | +| TINYINT | TINYINT | | +| SMALLINT | SMALLINT | | +| MEDIUMINT | INT | | +| INT / INTEGER | INT | | +| BIGINT | BIGINT | | +| FLOAT | FLOAT | | +| DOUBLE | DOUBLE | | +| DECIMAL(P, S) | DECIMAL(P, S) | | +| DATE | DATE | | +| DATETIME | DATETIME | | +| TIMESTAMP | DATETIME | Stored as UTC, converted on read | +| TIME | STRING | Doris does not support TIME type | +| YEAR | INT | | +| CHAR(N) | CHAR(N) | | +| VARCHAR(N) | VARCHAR(N) | | +| TEXT / MEDIUMTEXT / LONGTEXT | STRING | | +| BINARY / VARBINARY | STRING | | +| BLOB / MEDIUMBLOB / LONGBLOB | STRING | | +| JSON | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) | +| ENUM | STRING | | +| SET | STRING | | +| BIT | BOOLEAN / BIGINT | BIT(1) maps to BOOLEAN | + +## Migration Options + +### Option 1: Flink CDC (Real-time Sync) + +Flink CDC captures MySQL binlog changes and streams them to Doris. This method is suited for: + +- Real-time data synchronization +- Full database migration with automatic table creation +- Continuous sync with schema evolution support + +**Prerequisites**: MySQL 5.7+ or 8.0+ with binlog enabled; Flink 1.15+ with Flink CDC 3.x and Flink Doris Connector. + +For detailed setup, see the [Flink Doris Connector](../ecosystem/flink-doris-connector.md) documentation. + +### Option 2: JDBC Catalog + +The [JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) allows direct querying and batch migration from MySQL. This is the simplest approach for one-time or periodic batch migrations. + +### Option 3: Streaming Job (Built-in CDC Sync) + +Doris's built-in [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) can directly synchronize full and incremental data from MySQL to Doris without external tools like Flink. 
It uses CDC under the hood to read MySQL binlog and automatically creates target tables (UNIQUE KEY model) with primary keys matching the source. + +This option is suited for: + +- Real-time multi-table sync without deploying a Flink cluster +- Environments where you prefer Doris-native features over external tools +- Full + incremental migration with a single SQL command + +**Prerequisites**: MySQL with binlog enabled (`binlog_format = ROW`); MySQL JDBC driver deployed to Doris. + +#### Step 1: Enable MySQL Binlog + +Ensure `my.cnf` contains: + +```ini +[mysqld] +log-bin = mysql-bin +binlog_format = ROW +server-id = 1 +``` + +#### Step 2: Create Streaming Job + +```sql +CREATE JOB mysql_sync +ON STREAMING +FROM MYSQL ( + "jdbc_url" = "jdbc:mysql://mysql-host:3306", + "driver_url" = "mysql-connector-j-8.0.31.jar", + "driver_class" = "com.mysql.cj.jdbc.Driver", + "user" = "root", + "password" = "password", + "database" = "source_db", + "include_tables" = "orders,customers,products", + "offset" = "initial" +) +TO DATABASE target_db ( + "table.create.properties.replication_num" = "3" +) +``` + +Key parameters: + +| Parameter | Description | +|-----------|-------------| +| `include_tables` | Comma-separated list of tables to sync | +| `offset` | `initial` for full + incremental; `latest` for incremental only | +| `snapshot_split_size` | Row count per split during full sync (default: 8096) | +| `snapshot_parallelism` | Parallelism during full sync phase (default: 1) | + +#### Step 3: Monitor Sync Status + +```sql +-- Check job status +SELECT * FROM jobs(type=insert) WHERE ExecuteType = "STREAMING"; + +-- Check task history +SELECT * FROM tasks(type='insert') WHERE jobName = 'mysql_sync'; + +-- Pause / Resume / Drop +PAUSE JOB WHERE jobname = 'mysql_sync'; +RESUME JOB WHERE jobname = 'mysql_sync'; +DROP JOB WHERE jobname = 'mysql_sync'; +``` + +For detailed reference, see the [Streaming Job Multi-Table 
Sync](../data-operate/import/streaming-job/streaming-job-multi-table.md) documentation. + +### Option 4: DataX + +[DataX](https://github.com/alibaba/DataX) is a widely-used data synchronization tool that supports MySQL to Doris migration via the `mysqlreader` and `doriswriter` plugins. + +## Next Steps + +- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - Detailed connector documentation +- [Loading Data](../data-operate/import/load-manual.md) - Alternative import methods +- [Data Model](../table-design/data-model/overview.md) - Choose the right table model diff --git a/docs/migration/other-olap-to-doris.md b/docs/migration/other-olap-to-doris.md new file mode 100644 index 0000000000000..fbf29ac139701 --- /dev/null +++ b/docs/migration/other-olap-to-doris.md @@ -0,0 +1,151 @@ +--- +{ + "title": "Other OLAP Systems to Doris", + "language": "en", + "description": "Guide to migrating data from ClickHouse, Greenplum, Hive, Iceberg, Hudi and other OLAP systems to Apache Doris" +} +--- + +This guide covers migrating data from various OLAP systems to Apache Doris, including ClickHouse, Greenplum, and data lake technologies like Hive, Iceberg, and Hudi. + +## Migration Methods Overview + +| Source System | Migration Method | Notes | +|---------------|-------------------|-------| +| ClickHouse | JDBC Catalog + SQL Convertor | Schema and SQL syntax conversion needed | +| Greenplum | JDBC Catalog | PostgreSQL-compatible | +| Hive | Multi-Catalog (Hive Catalog) | Direct metadata integration | +| Iceberg | Multi-Catalog (Iceberg Catalog) | Table format native support | +| Hudi | Multi-Catalog (Hudi Catalog) | Table format native support | +| Spark/Flink Tables | Spark/Flink Doris Connector | Batch or streaming | + +## ClickHouse + +ClickHouse and Doris are both columnar OLAP databases with some similarities but different SQL dialects and data types. 
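A small worked example of the dialect gap, using the function mappings detailed below (table and columns are illustrative):

```sql
-- ClickHouse: daily approximate unique users
SELECT toDate(event_time) AS day, uniq(user_id) AS users
FROM events
GROUP BY day;

-- Doris equivalent
SELECT DATE(event_time) AS day, APPROX_COUNT_DISTINCT(user_id) AS users
FROM events
GROUP BY day;
```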
+ +### Data Type Mapping + +| ClickHouse Type | Doris Type | Notes | +|-----------------|------------|-------| +| Int8 | TINYINT | | +| Int16 | SMALLINT | | +| Int32 | INT | | +| Int64 | BIGINT | | +| UInt8 | SMALLINT | Unsigned to signed | +| UInt16 | INT | | +| UInt32 | BIGINT | | +| UInt64 | LARGEINT | | +| Float32 | FLOAT | | +| Float64 | DOUBLE | | +| Decimal(P, S) | DECIMAL(P, S) | | +| String | STRING | | +| FixedString(N) | CHAR(N) | | +| Date | DATE | | +| DateTime | DATETIME | | +| DateTime64 | DATETIME(precision) | | +| UUID | VARCHAR(36) | | +| Array(T) | ARRAY | | +| Tuple | STRUCT | | +| Map(K, V) | MAP | | +| Nullable(T) | T (nullable) | | +| LowCardinality(T) | T | No special handling needed | +| Enum8/Enum16 | TINYINT/SMALLINT or STRING | | + +### SQL Syntax Conversion + +Common ClickHouse to Doris SQL conversions: + +| ClickHouse | Doris | +|------------|-------| +| `toDate(datetime)` | `DATE(datetime)` | +| `toDateTime(string)` | `CAST(string AS DATETIME)` | +| `formatDateTime(dt, '%Y-%m')` | `DATE_FORMAT(dt, '%Y-%m')` | +| `arrayJoin(arr)` | `EXPLODE(arr)` with LATERAL VIEW | +| `groupArray(col)` | `COLLECT_LIST(col)` | +| `argMax(col1, col2)` | `MAX_BY(col1, col2)` | +| `argMin(col1, col2)` | `MIN_BY(col1, col2)` | +| `uniq(col)` | `APPROX_COUNT_DISTINCT(col)` | +| `uniqExact(col)` | `COUNT(DISTINCT col)` | +| `JSONExtract(json, 'key', 'String')` | `JSON_EXTRACT(json, '$.key')` | +| `multiIf(cond1, val1, cond2, val2, default)` | `CASE WHEN cond1 THEN val1 WHEN cond2 THEN val2 ELSE default END` | + +### Table Engine Mapping + +| ClickHouse Engine | Doris Model | Notes | +|-------------------|-------------|-------| +| MergeTree | DUPLICATE | Append-only analytics | +| ReplacingMergeTree | UNIQUE | Deduplication by key | +| SummingMergeTree | AGGREGATE | Pre-aggregation | +| AggregatingMergeTree | AGGREGATE | Complex aggregations | +| CollapsingMergeTree | UNIQUE | With delete support | + +### Migration + +Use the [JDBC 
Catalog](../lakehouse/catalogs/jdbc-catalog.md) to connect to ClickHouse and migrate data via `INSERT INTO ... SELECT`. + +## Greenplum + +Greenplum is PostgreSQL-based, so migration is similar to PostgreSQL. See the [PostgreSQL to Doris](./postgresql-to-doris.md) guide for general principles. + +### Data Type Mapping + +Use the [PostgreSQL type mapping](./postgresql-to-doris.md#data-type-mapping) as reference. Additional Greenplum-specific types: + +| Greenplum Type | Doris Type | Notes | +|----------------|------------|-------| +| INT2/INT4/INT8 | SMALLINT/INT/BIGINT | | +| FLOAT4/FLOAT8 | FLOAT/DOUBLE | | +| NUMERIC | DECIMAL | | +| TEXT | STRING | | +| BYTEA | STRING | | +| TIMESTAMP | DATETIME | | +| INTERVAL | STRING | | + +### Migration + +Use the [JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) with the PostgreSQL driver to connect to Greenplum and migrate data. For large tables, consider parallel export via `gpfdist` followed by file-based loading into Doris. + +## Data Lake (Hive, Iceberg, Hudi) {#data-lake} + +Doris's Multi-Catalog feature provides native integration with data lake table formats. + +### Hive + +Use the [Hive Catalog](../lakehouse/catalogs/hive-catalog.md) to directly query and migrate data from Hive. Supports both HDFS and S3-based storage. + +### Iceberg + +Use the [Iceberg Catalog](../lakehouse/catalogs/iceberg-catalog.md) to query and migrate Iceberg tables. Supports HMS and REST catalog types, as well as time travel queries. + +### Hudi + +Use the Hive Catalog to query and migrate Hudi tables (read-optimized view). + +## Spark/Flink Connector Migration + +For systems not directly supported by catalogs, use the [Spark Doris Connector](../ecosystem/spark-doris-connector.md) or [Flink Doris Connector](../ecosystem/flink-doris-connector.md) to read from any Spark/Flink-supported source and write to Doris. + +## Schema Design Principles + +When migrating from other OLAP systems: + +1. 
**Choose the right data model**: + - DUPLICATE for append-only event data + - UNIQUE for dimension tables with updates + - AGGREGATE for pre-aggregated metrics + +2. **Partition strategy**: + - Time-based partitioning for time-series data + - Match source partitioning when possible + +3. **Bucket count**: + - Start with 8-16 buckets per partition + - Scale based on data volume and query patterns + +## Next Steps + +- [Lakehouse Overview](../lakehouse/lakehouse-overview.md) - Multi-Catalog capabilities +- [Hive Catalog](../lakehouse/catalogs/hive-catalog.md) - Hive integration details +- [Iceberg Catalog](../lakehouse/catalogs/iceberg-catalog.md) - Iceberg integration +- [Spark Doris Connector](../ecosystem/spark-doris-connector.md) - Spark integration +- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - Flink integration diff --git a/docs/migration/overview.md b/docs/migration/overview.md new file mode 100644 index 0000000000000..258e67c7e777b --- /dev/null +++ b/docs/migration/overview.md @@ -0,0 +1,93 @@ +--- +{ + "title": "Migration Overview", + "language": "en", + "description": "Guide to migrating data from various databases and data systems to Apache Doris" +} +--- + +Apache Doris provides multiple methods to migrate data from various source systems. This guide helps you choose the best migration approach based on your source system and requirements. 
+ +## Migration Paths + +| Source System | Migration Method | Sync Modes | +|---------------|-------------------|------------| +| [PostgreSQL](./postgresql-to-doris.md) | Streaming Job / JDBC Catalog / Flink CDC | Full, CDC | +| [MySQL](./mysql-to-doris.md) | Streaming Job / JDBC Catalog / Flink CDC | Full, CDC | +| [Elasticsearch](./elasticsearch-to-doris.md) | ES Catalog | Full | +| [ClickHouse](./other-olap-to-doris.md#clickhouse) | JDBC Catalog | Full | +| [Greenplum](./other-olap-to-doris.md#greenplum) | JDBC Catalog | Full | +| [Hive/Iceberg/Hudi](./other-olap-to-doris.md#data-lake) | Multi-Catalog | Full, Batch Incremental | + +## Choosing a Migration Method + +### Catalog-Based Migration + +Doris's [Multi-Catalog](../lakehouse/lakehouse-overview.md) feature allows you to directly query external data sources without data movement. This approach is suited for: + +- **Initial exploration**: Query source data before deciding on migration strategy +- **Hybrid queries**: Join data across Doris and external sources +- **Incremental migration**: Gradually move data while keeping source accessible + +### Streaming Job (Built-in CDC Sync) + +Doris's built-in [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) can directly synchronize full and incremental data from MySQL and PostgreSQL without external tools. It reads binlog/WAL natively, auto-creates target tables, and keeps data in sync — all with a single SQL command. + +- **No external dependencies**: No Flink cluster or other middleware required +- **Full + incremental sync**: Initial snapshot followed by continuous CDC +- **Multi-table support**: Sync multiple tables in one job + +### Flink CDC (Real-time Synchronization) + +[Flink CDC](../ecosystem/flink-doris-connector.md) provides CDC-based sync via an external Flink cluster. 
Choose this over Streaming Job when you need: + +- **Schema evolution**: Automatic DDL propagation +- **Complex transformations**: Flink SQL processing during sync +- **Broader source support**: Sources beyond MySQL/PostgreSQL + +### Export-Import Method + +For scenarios where direct connectivity is limited: + +1. Export data from source system to files (CSV, Parquet, JSON) +2. Stage files in object storage (S3, GCS, HDFS) +3. Load into Doris using [S3 Load](../data-operate/import/data-source/amazon-s3.md) or [Broker Load](../data-operate/import/import-way/broker-load-manual.md) + +## Migration Planning Principles + +Before migrating, consider the following: + +1. **Data Volume Assessment** + - Total data size and row count + - Daily/hourly data growth rate + - Historical data retention requirements + +2. **Schema Design** + - Choose appropriate [Data Model](../table-design/data-model/overview.md) (Duplicate, Unique, Aggregate) + - Plan [Partitioning](../table-design/data-partitioning/data-distribution.md) strategy + - Define [Bucketing](../table-design/data-partitioning/data-bucketing.md) keys + +3. **Data Type Mapping** + - Review type compatibility (see migration guides for specific mappings) + - Handle special types (arrays, JSON, timestamps with timezone) + +4. **Performance Requirements** + - Query latency expectations + - Concurrent query load + - Data freshness requirements + +## Best Practices + +- **Start with a pilot table**: Before migrating your entire database, test with a representative table to validate schema design, type mappings, and data correctness. +- **Batch large migrations**: For tables with billions of rows, migrate in batches (e.g., by date range) to manage resource usage and allow for validation between batches. +- **Monitor migration progress**: Use `SHOW LOAD` to track active and completed load jobs. +- **Validate after migration**: Compare row counts, spot-check records, and verify data types between source and target. 
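For example, when the source stays reachable through a JDBC catalog during migration, the row-count check is a single statement (catalog, database, and table names are illustrative):

```sql
SELECT
    (SELECT COUNT(*) FROM mysql_catalog.source_db.orders) AS source_rows,
    (SELECT COUNT(*) FROM internal.target_db.orders)      AS doris_rows;
```

A mismatch narrows the investigation to spot-checking records within the affected range.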
+ +## Next Steps + +Choose your source system to see detailed migration instructions: + +- [PostgreSQL to Doris](./postgresql-to-doris.md) +- [MySQL to Doris](./mysql-to-doris.md) +- [Elasticsearch to Doris](./elasticsearch-to-doris.md) +- [Other OLAP Systems to Doris](./other-olap-to-doris.md) diff --git a/docs/migration/postgresql-to-doris.md b/docs/migration/postgresql-to-doris.md new file mode 100644 index 0000000000000..7d600b37c49c9 --- /dev/null +++ b/docs/migration/postgresql-to-doris.md @@ -0,0 +1,161 @@ +--- +{ + "title": "PostgreSQL to Doris", + "language": "en", + "description": "Comprehensive guide to migrating data from PostgreSQL to Apache Doris" +} +--- + +This guide covers migrating data from PostgreSQL to Apache Doris. You can choose from several migration methods depending on your requirements for real-time sync, data volume, and operational complexity. + +## Considerations + +1. **Schema Design**: Before migration, select an appropriate Doris [Data Model](../table-design/data-model/overview.md) and plan your [Partitioning](../table-design/data-partitioning/data-distribution.md) and [Bucketing](../table-design/data-partitioning/data-bucketing.md) strategies. + +2. **Data Types**: Review the type mapping table below. Some PostgreSQL types require special handling (arrays, timestamps with timezone, JSON). + +3. **Primary Keys**: PostgreSQL's serial/identity columns map to Doris INT/BIGINT types. For unique constraints, use Doris's UNIQUE KEY model. + +4. **Timezone Handling**: PostgreSQL `timestamptz` stores timestamps in UTC and converts to session timezone on read. Doris `DATETIME` does not carry timezone information. Convert timestamps explicitly during migration and ensure JVM timezone consistency in Doris BE (`be.conf`). + +5. **Array Handling**: PostgreSQL arrays map to Doris ARRAY type, but dimension detection requires existing data. If array dimension cannot be determined, use explicit casting. + +6. 
**JSON/JSONB**: PostgreSQL JSON/JSONB maps to Doris VARIANT type, which supports flexible schema and efficient JSON operations. + +7. **Large Table Migration**: For tables with hundreds of millions of rows, partition the migration by time ranges or ID ranges, use multiple INSERT statements concurrently, and monitor Doris BE memory and disk usage. + +## Data Type Mapping + +| PostgreSQL Type | Doris Type | Notes | +|-----------------|------------|-------| +| boolean | BOOLEAN | | +| smallint / int2 | SMALLINT | | +| integer / int4 | INT | | +| bigint / int8 | BIGINT | | +| decimal / numeric | DECIMAL(P,S) | Numeric without precision maps to STRING | +| real / float4 | FLOAT | | +| double precision | DOUBLE | | +| smallserial | SMALLINT | | +| serial | INT | | +| bigserial | BIGINT | | +| char(n) | CHAR(N) | | +| varchar / text | STRING | | +| timestamp | DATETIME | | +| timestamptz | DATETIME | Converted to local timezone; see Timezone Handling above | +| date | DATE | | +| time | STRING | Doris does not support TIME type | +| interval | STRING | | +| json / jsonb | VARIANT | See [VARIANT type](../data-operate/import/complex-types/variant.md) for flexible schema | +| uuid | STRING | | +| bytea | STRING | | +| array | ARRAY | See Array Handling above | +| inet / cidr / macaddr | STRING | | +| point / line / polygon | STRING | Geometric types stored as strings | + +## Migration Options + +### Option 1: JDBC Catalog (Batch Migration) + +The [JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) provides direct access to PostgreSQL data from Doris. This is the simplest approach for both querying and migrating data. + +**Prerequisites**: PostgreSQL 11.x or higher; [PostgreSQL JDBC Driver](https://jdbc.postgresql.org/) version 42.5.x or above; network connectivity between Doris FE/BE nodes and PostgreSQL (port 5432). + +### Option 2: Flink CDC (Real-time Sync) + +Flink CDC captures changes from PostgreSQL WAL (Write-Ahead Log) and streams them to Doris in real-time. 
This is ideal for continuous synchronization. + +**Prerequisites**: PostgreSQL with logical replication enabled (`wal_level = logical`); Flink 1.15+ with Flink CDC and Flink Doris Connector; a replication slot in PostgreSQL. + +For detailed setup, see the [Flink Doris Connector](../ecosystem/flink-doris-connector.md) documentation. + +### Option 3: Streaming Job (Built-in CDC Sync) + +Doris's built-in [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) can directly synchronize full and incremental data from PostgreSQL to Doris without external tools like Flink. It uses CDC under the hood to read PostgreSQL WAL and automatically creates target tables (UNIQUE KEY model) with primary keys matching the source. + +This option is suited for: + +- Real-time multi-table sync without deploying a Flink cluster +- Environments where you prefer Doris-native features over external tools +- Full + incremental migration with a single SQL command + +**Prerequisites**: PostgreSQL with logical replication enabled (`wal_level = logical`); PostgreSQL JDBC driver deployed to Doris. 
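On an existing PostgreSQL instance, you can first check whether the WAL prerequisite is already met. These are standard PostgreSQL commands, run against the source database rather than Doris:

```sql
SHOW wal_level;   -- must return 'logical' for CDC to work

-- Replication slots currently defined on the source
SELECT slot_name, active FROM pg_replication_slots;
```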
+ +#### Step 1: Enable Logical Replication + +Ensure `postgresql.conf` contains: + +```ini +wal_level = logical +``` + +#### Step 2: Create Streaming Job + +```sql +CREATE JOB pg_sync +ON STREAMING +FROM POSTGRES ( + "jdbc_url" = "jdbc:postgresql://pg-host:5432/source_db", + "driver_url" = "postgresql-42.5.6.jar", + "driver_class" = "org.postgresql.Driver", + "user" = "postgres", + "password" = "password", + "database" = "source_db", + "schema" = "public", + "include_tables" = "orders,customers,products", + "offset" = "initial" +) +TO DATABASE target_db ( + "table.create.properties.replication_num" = "3" +) +``` + +Key parameters: + +| Parameter | Description | +|-----------|-------------| +| `include_tables` | Comma-separated list of tables to sync | +| `offset` | `initial` for full + incremental; `latest` for incremental only | +| `snapshot_split_size` | Row count per split during full sync (default: 8096) | +| `snapshot_parallelism` | Parallelism during full sync phase (default: 1) | + +#### Step 3: Monitor Sync Status + +```sql +-- Check job status +SELECT * FROM jobs(type=insert) WHERE ExecuteType = "STREAMING"; + +-- Check task history +SELECT * FROM tasks(type='insert') WHERE jobName = 'pg_sync'; + +-- Pause / Resume / Drop +PAUSE JOB WHERE jobname = 'pg_sync'; +RESUME JOB WHERE jobname = 'pg_sync'; +DROP JOB WHERE jobname = 'pg_sync'; +``` + +For detailed reference, see the [Streaming Job Multi-Table Sync](../data-operate/import/streaming-job/streaming-job-multi-table.md) documentation. + +### Option 4: Export and Load + +For air-gapped environments or when direct connectivity is not possible: + +1. Export data from PostgreSQL to files (CSV, Parquet) +2. Upload to object storage (S3, HDFS) +3. 
Load into Doris using [S3 Load](../data-operate/import/data-source/amazon-s3.md) or [Broker Load](../data-operate/import/import-way/broker-load-manual.md) + +## Validation Checklist + +After migration, validate: + +- Row counts match between source and target +- Sample records are identical +- Null values are preserved correctly +- Numeric precision is maintained +- Date/time values are correct (check timezone) +- Array and JSON fields are queryable + +## Next Steps + +- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - Detailed connector documentation +- [Loading Data](../data-operate/import/load-manual.md) - Alternative import methods +- [Data Model](../table-design/data-model/overview.md) - Choose the right table model diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/elasticsearch-to-doris.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/elasticsearch-to-doris.md new file mode 100644 index 0000000000000..3b99dc6cd8f62 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/elasticsearch-to-doris.md @@ -0,0 +1,137 @@ +--- +{ + "title": "Elasticsearch 迁移到 Doris", + "language": "zh-CN", + "description": "从 Elasticsearch 迁移数据到 Apache Doris 的完整指南" +} +--- + +本指南介绍如何将数据从 Elasticsearch 迁移到 Apache Doris。Doris 可以作为 Elasticsearch 的强大替代方案,用于日志分析、全文搜索和通用 OLAP 工作负载,通常具有更好的性能和更低的运维复杂度。 + +## 为什么从 Elasticsearch 迁移到 Doris? + +| 方面 | Elasticsearch | Apache Doris | +|------|---------------|--------------| +| 查询语言 | DSL(基于 JSON) | 标准 SQL | +| JOIN | 有限支持 | 完整 SQL JOIN | +| 存储效率 | 存储使用较高 | 列式压缩 | +| 运维复杂度 | 集群管理复杂 | 运维更简单 | +| 全文搜索 | 原生倒排索引 | 支持倒排索引 | +| 实时分析 | 良好 | 优秀 | + +## 注意事项 + +1. **全文搜索**:Doris 支持[倒排索引](../table-design/index/inverted-index/overview.md),提供类似 Elasticsearch 的全文搜索能力。 + +2. **索引到表映射**:每个 Elasticsearch 索引通常映射到一个 Doris 表。 + +3. **嵌套文档**:Elasticsearch nested 类型映射到 Doris [VARIANT](../data-operate/import/complex-types/variant.md) 类型,支持灵活的 Schema 处理。 + +4. 
**数组处理**:Elasticsearch 没有显式的数组类型。通过 ES Catalog 正确读取数组时,需要在 ES 索引映射中使用 `_meta.doris.array_fields` 配置数组字段元数据。 + +5. **日期类型**:Elasticsearch 日期可以有多种格式。迁移时确保一致的日期处理 — 使用显式转换到 DATETIME。 + +6. **_id 字段**:要保留 Elasticsearch 文档 `_id`,在 ES Catalog 配置中启用 `mapping_es_id`。 + +7. **性能**:为了获得更好的 ES Catalog 读取性能,启用 `doc_value_scan`。注意 `text` 字段不支持 doc_value,会回退到 `_source`。 + +## 数据类型映射 + +| Elasticsearch 类型 | Doris 类型 | 说明 | +|--------------------|------------|------| +| null | NULL | | +| boolean | BOOLEAN | | +| byte | TINYINT | | +| short | SMALLINT | | +| integer | INT | | +| long | BIGINT | | +| unsigned_long | LARGEINT | | +| float | FLOAT | | +| half_float | FLOAT | | +| double | DOUBLE | | +| scaled_float | DOUBLE | | +| keyword | STRING | | +| text | STRING | 考虑在 Doris 中使用倒排索引 | +| date | DATE 或 DATETIME | 参见上方日期类型 | +| ip | STRING | | +| nested | VARIANT | 参见 [VARIANT 类型](../data-operate/import/complex-types/variant.md),支持灵活 Schema | +| object | VARIANT | 参见 [VARIANT 类型](../data-operate/import/complex-types/variant.md) | +| flattened | VARIANT | Doris 3.1.4、4.0.3 起支持 | +| geo_point | STRING | 存储为 "lat,lon" 字符串 | +| geo_shape | STRING | 存储为 GeoJSON 字符串 | + +## 迁移选项 + +### 选项 1:ES Catalog(直接查询和迁移) + +[ES Catalog](../lakehouse/catalogs/es-catalog.md) 提供从 Doris 直接访问 Elasticsearch 数据的能力,支持查询和迁移。 + +**前提条件**:Elasticsearch 5.x 或更高版本;Doris FE/BE 节点与 Elasticsearch 之间的网络连接。 + +### 选项 2:Logstash Pipeline + +使用 Logstash 从 Elasticsearch 读取数据,通过 HTTP(Stream Load)写入 Doris。此方法在迁移过程中提供数据转换能力。 + +### 选项 3:自定义脚本 + Scroll API + +需要更多控制时,使用自定义脚本结合 Elasticsearch Scroll API 读取数据,通过 Stream Load 加载到 Doris。 + +## Doris 中的全文搜索 + +Doris 的[倒排索引](../table-design/index/inverted-index/overview.md)提供类似 Elasticsearch 的全文搜索能力。 + +### DSL 到 SQL 转换参考 + +| Elasticsearch DSL | Doris SQL | +|-------------------|-----------| +| `{"match": {"title": "doris"}}` | `WHERE title MATCH 'doris'` | +| `{"match_phrase": {"content": "real time"}}` | `WHERE content MATCH_PHRASE 'real time'` | +| `{"term": {"status": 
"active"}}` | `WHERE status = 'active'` | +| `{"terms": {"tag": ["a", "b"]}}` | `WHERE tag IN ('a', 'b')` | +| `{"range": {"price": {"gte": 10}}}` | `WHERE price >= 10` | +| `{"bool": {"must": [...]}}` | `WHERE ... AND ...` | +| `{"bool": {"should": [...]}}` | `WHERE ... OR ...` | +| `{"exists": {"field": "email"}}` | `WHERE email IS NOT NULL` | + +## 功能兼容性 + +### VARIANT 类型与 ES Dynamic Mapping 对比 + +Doris [VARIANT](../data-operate/import/complex-types/variant.md) 类型提供与 Elasticsearch Dynamic Mapping 相当的灵活 Schema 处理功能。 + +| 功能 | Doris VARIANT | ES Dynamic Mapping | 状态 | +|------|--------------|-------------------|------| +| 动态 Schema 推断 | 自动推断 JSON 字段类型 | Dynamic Mapping | 对齐 | +| 预定义字段类型 | `MATCH_NAME 'field': type` | Explicit Mapping | 对齐 | +| 模式匹配指定类型 | `MATCH_NAME_GLOB 'pattern*': type` | dynamic_templates | 对齐 | +| 字段索引配置 | `INDEX ... PROPERTIES("field_pattern"=...)` | Mapping + Index Settings | 对齐 | +| 自定义分析器 | `CREATE INVERTED INDEX ANALYZER` | Custom Analyzer | 对齐 | +| 子列数量限制 | `variant_max_subcolumns_count` | `mapping.total_fields.limit` | 对齐 | +| 稀疏列优化 | `variant_enable_typed_paths_to_sparse` | N/A | Doris 特有 | +| Nested 数组对象 | 扁平化处理 | Nested Type | 部分支持 | + +### Search 函数与 ES Query String 对比 + +Doris `search()` 函数提供与 Elasticsearch `query_string` 兼容的 Lucene 查询字符串语法。 + +| 功能 | Doris search() | ES query_string | 状态 | +|------|---------------|----------------|------| +| Query String 语法 | Lucene mode | query_string query | 兼容 | +| 多字段搜索 | `fields` 参数 | multi_match / fields | 支持 | +| best_fields 模式 | 支持 | 支持 | 支持 | +| cross_fields 模式 | 支持 | 支持 | 支持 | +| VARIANT 子列搜索 | `variant.field:term` | Object/Nested 搜索 | 支持 | +| 布尔查询 | AND/OR/NOT | AND/OR/NOT | 支持 | +| 短语查询 | `"exact phrase"` | `"exact phrase"` | 支持 | +| 通配符 | `*`, `?` | `*`, `?` | 支持 | +| 正则表达式 | `/pattern/` | `/pattern/` | 支持 | +| 评分排序 | 已禁用 | BM25 | 不支持 | +| 模糊查询 | 不支持 | `term~2` | 不支持 | +| 范围查询 | 不支持 | `[a TO z]` | 不支持 | +| 近似查询 | 不支持 | `"foo bar"~5` | 不支持 | + +## 下一步 + +- 
[倒排索引](../table-design/index/inverted-index/overview.md) - Doris 中的全文搜索 +- [ES Catalog](../lakehouse/catalogs/es-catalog.md) - 完整的 ES Catalog 参考 +- [日志存储分析](../log-storage-analysis.md) - 优化 Doris 中的日志分析 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/mysql-to-doris.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/mysql-to-doris.md new file mode 100644 index 0000000000000..de6ca8a60df26 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/mysql-to-doris.md @@ -0,0 +1,150 @@ +--- +{ + "title": "MySQL 迁移到 Doris", + "language": "zh-CN", + "description": "从 MySQL 迁移数据到 Apache Doris 的完整指南" +} +--- + +本指南介绍如何将数据从 MySQL 迁移到 Apache Doris。MySQL 是最常见的迁移源之一,Doris 对 MySQL 协议有很好的兼容性,使迁移变得简单。 + +## 注意事项 + +1. **协议兼容**:Doris 兼容 MySQL 协议,因此现有的 MySQL 客户端和工具可以与 Doris 配合使用。 + +2. **实时需求**:如果需要实时同步,Flink CDC 支持自动建表和 Schema 变更。 + +3. **全库同步**:Flink Doris Connector 支持同步整个 MySQL 数据库,包括 DDL 操作。 + +4. **自增列**:MySQL AUTO_INCREMENT 列可以映射到 Doris 的自增功能。迁移时可以通过显式指定列值来保留原始 ID。 + +5. **ENUM 和 SET 类型**:MySQL ENUM 和 SET 类型在 Doris 中作为 STRING 迁移。 + +6. **二进制数据**:二进制数据(BLOB、BINARY)通常存储为 STRING。迁移时考虑使用 HEX 编码。 + +7. 
**大表性能**:对于数十亿行的表,考虑增加 Flink 并行度、调整 Doris 写缓冲以及使用批量模式进行初始加载。 + +## 数据类型映射 + +| MySQL 类型 | Doris 类型 | 说明 | +|------------|------------|------| +| BOOLEAN / TINYINT(1) | BOOLEAN | | +| TINYINT | TINYINT | | +| SMALLINT | SMALLINT | | +| MEDIUMINT | INT | | +| INT / INTEGER | INT | | +| BIGINT | BIGINT | | +| FLOAT | FLOAT | | +| DOUBLE | DOUBLE | | +| DECIMAL(P, S) | DECIMAL(P, S) | | +| DATE | DATE | | +| DATETIME | DATETIME | | +| TIMESTAMP | DATETIME | 以 UTC 存储,读取时转换 | +| TIME | STRING | Doris 不支持 TIME 类型 | +| YEAR | INT | | +| CHAR(N) | CHAR(N) | | +| VARCHAR(N) | VARCHAR(N) | | +| TEXT / MEDIUMTEXT / LONGTEXT | STRING | | +| BINARY / VARBINARY | STRING | | +| BLOB / MEDIUMBLOB / LONGBLOB | STRING | | +| JSON | VARIANT | 参见 [VARIANT 类型](../data-operate/import/complex-types/variant.md) | +| ENUM | STRING | | +| SET | STRING | | +| BIT | BOOLEAN / BIGINT | BIT(1) 映射为 BOOLEAN | + +## 迁移选项 + +### 选项 1:Flink CDC(实时同步) + +Flink CDC 捕获 MySQL binlog 变更并流式传输到 Doris。此方法适用于: + +- 实时数据同步 +- 自动建表的全库迁移 +- 支持 Schema 演进的持续同步 + +**前提条件**:MySQL 5.7+ 或 8.0+,启用 binlog;Flink 1.15+ 配合 Flink CDC 3.x 和 Flink Doris Connector。 + +详细设置请参考 [Flink Doris Connector](../ecosystem/flink-doris-connector.md) 文档。 + +### 选项 2:JDBC Catalog + +[JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) 允许从 MySQL 直接查询和批量迁移。这是一次性或定期批量迁移最简单的方法。 + +### 选项 3:Streaming Job(内置 CDC 同步) + +Doris 内置的 [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) 可以直接从 MySQL 同步全量和增量数据到 Doris,无需部署 Flink 等外部工具。底层使用 CDC 读取 MySQL binlog,并自动创建目标表(UNIQUE KEY 模型),主键与源表保持一致。 + +此选项适用于: + +- 无需部署 Flink 集群的实时多表同步 +- 偏好使用 Doris 原生功能而非外部工具的环境 +- 通过单条 SQL 命令实现全量 + 增量迁移 + +**前提条件**:MySQL 启用 binlog(`binlog_format = ROW`);MySQL JDBC 驱动已部署到 Doris。 + +#### 步骤 1:启用 MySQL Binlog + +确保 `my.cnf` 包含: + +```ini +[mysqld] +log-bin = mysql-bin +binlog_format = ROW +server-id = 1 +``` + +#### 步骤 2:创建 Streaming Job + +```sql +CREATE JOB mysql_sync +ON STREAMING +FROM MYSQL ( + "jdbc_url" = "jdbc:mysql://mysql-host:3306", + 
"driver_url" = "mysql-connector-j-8.0.31.jar", + "driver_class" = "com.mysql.cj.jdbc.Driver", + "user" = "root", + "password" = "password", + "database" = "source_db", + "include_tables" = "orders,customers,products", + "offset" = "initial" +) +TO DATABASE target_db ( + "table.create.properties.replication_num" = "3" +) +``` + +关键参数: + +| 参数 | 说明 | +|------|------| +| `include_tables` | 逗号分隔的待同步表列表 | +| `offset` | `initial` 全量 + 增量;`latest` 仅增量 | +| `snapshot_split_size` | 全量同步时每个分片的行数(默认:8096) | +| `snapshot_parallelism` | 全量同步阶段的并行度(默认:1) | + +#### 步骤 3:监控同步状态 + +```sql +-- 查看 Job 状态 +SELECT * FROM jobs(type=insert) WHERE ExecuteType = "STREAMING"; + +-- 查看 Task 历史 +SELECT * FROM tasks(type='insert') WHERE jobName = 'mysql_sync'; + +-- 暂停 / 恢复 / 删除 +PAUSE JOB WHERE jobname = 'mysql_sync'; +RESUME JOB WHERE jobname = 'mysql_sync'; +DROP JOB WHERE jobname = 'mysql_sync'; +``` + +详细参考请见 [Streaming Job 多表同步](../data-operate/import/streaming-job/streaming-job-multi-table.md) 文档。 + +### 选项 4:DataX + +[DataX](https://github.com/alibaba/DataX) 是一个广泛使用的数据同步工具,通过 `mysqlreader` 和 `doriswriter` 插件支持 MySQL 到 Doris 的迁移。 + +## 下一步 + +- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - 详细的连接器文档 +- [数据导入](../data-operate/import/load-manual.md) - 其他导入方法 +- [数据模型](../table-design/data-model/overview.md) - 选择正确的表模型 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/other-olap-to-doris.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/other-olap-to-doris.md new file mode 100644 index 0000000000000..11648fd82c681 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/other-olap-to-doris.md @@ -0,0 +1,151 @@ +--- +{ + "title": "其他 OLAP 系统迁移到 Doris", + "language": "zh-CN", + "description": "从 ClickHouse、Greenplum、Hive、Iceberg、Hudi 等 OLAP 系统迁移数据到 Apache Doris 的指南" +} +--- + +本指南介绍如何从各种 OLAP 系统迁移数据到 Apache Doris,包括 ClickHouse、Greenplum 以及数据湖技术如 Hive、Iceberg 和 Hudi。 + +## 迁移方法概述 + +| 源系统 | 迁移方法 | 说明 | 
+|--------|---------|------| +| ClickHouse | JDBC Catalog + SQL 转换 | 需要 Schema 和 SQL 语法转换 | +| Greenplum | JDBC Catalog | 兼容 PostgreSQL | +| Hive | Multi-Catalog(Hive Catalog) | 直接元数据集成 | +| Iceberg | Multi-Catalog(Iceberg Catalog) | 原生表格式支持 | +| Hudi | Multi-Catalog(Hudi Catalog) | 原生表格式支持 | +| Spark/Flink 表 | Spark/Flink Doris Connector | 批量或流式 | + +## ClickHouse + +ClickHouse 和 Doris 都是列式 OLAP 数据库,有一些相似之处,但 SQL 方言和数据类型不同。 + +### 数据类型映射 + +| ClickHouse 类型 | Doris 类型 | 说明 | +|-----------------|------------|------| +| Int8 | TINYINT | | +| Int16 | SMALLINT | | +| Int32 | INT | | +| Int64 | BIGINT | | +| UInt8 | SMALLINT | 无符号转有符号 | +| UInt16 | INT | | +| UInt32 | BIGINT | | +| UInt64 | LARGEINT | | +| Float32 | FLOAT | | +| Float64 | DOUBLE | | +| Decimal(P, S) | DECIMAL(P, S) | | +| String | STRING | | +| FixedString(N) | CHAR(N) | | +| Date | DATE | | +| DateTime | DATETIME | | +| DateTime64 | DATETIME(precision) | | +| UUID | VARCHAR(36) | | +| Array(T) | ARRAY | | +| Tuple | STRUCT | | +| Map(K, V) | MAP | | +| Nullable(T) | T(可空) | | +| LowCardinality(T) | T | 无需特殊处理 | +| Enum8/Enum16 | TINYINT/SMALLINT 或 STRING | | + +### SQL 语法转换 + +常见的 ClickHouse 到 Doris SQL 转换: + +| ClickHouse | Doris | +|------------|-------| +| `toDate(datetime)` | `DATE(datetime)` | +| `toDateTime(string)` | `CAST(string AS DATETIME)` | +| `formatDateTime(dt, '%Y-%m')` | `DATE_FORMAT(dt, '%Y-%m')` | +| `arrayJoin(arr)` | `EXPLODE(arr)` 配合 LATERAL VIEW | +| `groupArray(col)` | `COLLECT_LIST(col)` | +| `argMax(col1, col2)` | `MAX_BY(col1, col2)` | +| `argMin(col1, col2)` | `MIN_BY(col1, col2)` | +| `uniq(col)` | `APPROX_COUNT_DISTINCT(col)` | +| `uniqExact(col)` | `COUNT(DISTINCT col)` | +| `JSONExtract(json, 'key', 'String')` | `JSON_EXTRACT(json, '$.key')` | +| `multiIf(cond1, val1, cond2, val2, default)` | `CASE WHEN cond1 THEN val1 WHEN cond2 THEN val2 ELSE default END` | + +### 表引擎映射 + +| ClickHouse 引擎 | Doris 模型 | 说明 | +|-----------------|------------|------| +| MergeTree | 
DUPLICATE | 仅追加分析 | +| ReplacingMergeTree | UNIQUE | 按键去重 | +| SummingMergeTree | AGGREGATE | 预聚合 | +| AggregatingMergeTree | AGGREGATE | 复杂聚合 | +| CollapsingMergeTree | UNIQUE | 支持删除 | + +### 迁移 + +使用 [JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) 连接 ClickHouse,通过 `INSERT INTO ... SELECT` 迁移数据。 + +## Greenplum + +Greenplum 基于 PostgreSQL,因此迁移与 PostgreSQL 类似。参见 [PostgreSQL 迁移到 Doris](./postgresql-to-doris.md) 指南了解通用原则。 + +### 数据类型映射 + +参考 [PostgreSQL 类型映射](./postgresql-to-doris.md#数据类型映射)。Greenplum 特有的类型: + +| Greenplum 类型 | Doris 类型 | 说明 | +|----------------|------------|------| +| INT2/INT4/INT8 | SMALLINT/INT/BIGINT | | +| FLOAT4/FLOAT8 | FLOAT/DOUBLE | | +| NUMERIC | DECIMAL | | +| TEXT | STRING | | +| BYTEA | STRING | | +| TIMESTAMP | DATETIME | | +| INTERVAL | STRING | | + +### 迁移 + +使用 [JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) 配合 PostgreSQL 驱动连接 Greenplum 并迁移数据。对于大表,考虑通过 `gpfdist` 并行导出,然后基于文件加载到 Doris。 + +## 数据湖(Hive、Iceberg、Hudi) {#data-lake} + +Doris 的 Multi-Catalog 功能提供与数据湖表格式的原生集成。 + +### Hive + +使用 [Hive Catalog](../lakehouse/catalogs/hive-catalog.md) 直接查询和迁移 Hive 数据。支持 HDFS 和基于 S3 的存储。 + +### Iceberg + +使用 [Iceberg Catalog](../lakehouse/catalogs/iceberg-catalog.md) 查询和迁移 Iceberg 表。支持 HMS 和 REST catalog 类型,以及时间旅行查询。 + +### Hudi + +使用 Hive Catalog 查询和迁移 Hudi 表(读优化视图)。 + +## Spark/Flink Connector 迁移 + +对于 catalog 不直接支持的系统,使用 [Spark Doris Connector](../ecosystem/spark-doris-connector.md) 或 [Flink Doris Connector](../ecosystem/flink-doris-connector.md) 从任何 Spark/Flink 支持的源读取数据并写入 Doris。 + +## Schema 设计原则 + +从其他 OLAP 系统迁移时: + +1. **选择正确的数据模型**: + - DUPLICATE 用于仅追加的事件数据 + - UNIQUE 用于带更新的维表 + - AGGREGATE 用于预聚合指标 + +2. **分区策略**: + - 对时间序列数据使用基于时间的分区 + - 尽可能匹配源分区 + +3. 
**分桶数**: + - 每个分区从 8-16 个桶开始 + - 根据数据量和查询模式扩展 + +## 下一步 + +- [湖仓一体概述](../lakehouse/lakehouse-overview.md) - Multi-Catalog 功能 +- [Hive Catalog](../lakehouse/catalogs/hive-catalog.md) - Hive 集成详情 +- [Iceberg Catalog](../lakehouse/catalogs/iceberg-catalog.md) - Iceberg 集成 +- [Spark Doris Connector](../ecosystem/spark-doris-connector.md) - Spark 集成 +- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - Flink 集成 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/overview.md new file mode 100644 index 0000000000000..5a3766a27a75f --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/overview.md @@ -0,0 +1,93 @@ +--- +{ + "title": "迁移概述", + "language": "zh-CN", + "description": "从各种数据库和数据系统迁移数据到 Apache Doris 的指南" +} +--- + +Apache Doris 提供多种方法从各种源系统迁移数据。本指南帮助您根据源系统和需求选择最佳的迁移方式。 + +## 迁移路径 + +| 源系统 | 迁移方法 | 同步模式 | +|--------|---------|---------| +| [PostgreSQL](./postgresql-to-doris.md) | Streaming Job / JDBC Catalog / Flink CDC | 全量、CDC | +| [MySQL](./mysql-to-doris.md) | Streaming Job / JDBC Catalog / Flink CDC | 全量、CDC | +| [Elasticsearch](./elasticsearch-to-doris.md) | ES Catalog | 全量 | +| [ClickHouse](./other-olap-to-doris.md#clickhouse) | JDBC Catalog | 全量 | +| [Greenplum](./other-olap-to-doris.md#greenplum) | JDBC Catalog | 全量 | +| [Hive/Iceberg/Hudi](./other-olap-to-doris.md#data-lake) | Multi-Catalog | 全量、批量增量 | + +## 选择迁移方法 + +### 基于 Catalog 的迁移 + +Doris 的 [Multi-Catalog](../lakehouse/lakehouse-overview.md) 功能允许您直接查询外部数据源而无需数据移动。此方法适用于: + +- **初步探索**:在决定迁移策略之前查询源数据 +- **混合查询**:跨 Doris 和外部源进行 JOIN 查询 +- **增量迁移**:在保持源可访问的同时逐步迁移数据 + +### Streaming Job(内置 CDC 同步) + +Doris 内置的 [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) 可以直接从 MySQL 和 PostgreSQL 同步全量和增量数据,无需外部工具。它原生读取 binlog/WAL,自动创建目标表,并保持数据同步 — 只需一条 SQL 命令。 + +- **无外部依赖**:无需部署 Flink 集群或其他中间件 +- **全量 + 增量同步**:先全量快照,后持续 CDC +- **多表支持**:一个 Job 同步多张表 
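+
+上述"一条 SQL 命令"的大致形式如下(示意:以 MySQL 源为例,主机、账号等连接信息均为占位符,完整参数说明请参考各迁移指南中的 Streaming Job 小节):
+
+```sql
+-- 示意:全量 + 增量同步 source_db 中的三张表到 target_db
+CREATE JOB demo_sync
+ON STREAMING
+FROM MYSQL (
+    "jdbc_url" = "jdbc:mysql://mysql-host:3306",
+    "driver_url" = "mysql-connector-j-8.0.31.jar",
+    "driver_class" = "com.mysql.cj.jdbc.Driver",
+    "user" = "root",
+    "password" = "password",
+    "database" = "source_db",
+    "include_tables" = "orders,customers,products",
+    -- initial:先全量快照,再持续增量;latest:仅增量
+    "offset" = "initial"
+)
+TO DATABASE target_db (
+    "table.create.properties.replication_num" = "3"
+)
+```
+
+作业创建后会自动完成全量快照并进入增量同步,目标表(UNIQUE KEY 模型)自动创建;监控与暂停/恢复方式见各迁移指南的监控小节。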
+ +### Flink CDC(实时同步) + +[Flink CDC](../ecosystem/flink-doris-connector.md) 通过外部 Flink 集群提供基于 CDC 的同步。在以下场景选择 Flink CDC 而非 Streaming Job: + +- **Schema 演进**:自动 DDL 传播 +- **复杂转换**:同步过程中使用 Flink SQL 处理 +- **更广泛的源支持**:支持 MySQL/PostgreSQL 之外的数据源 + +### 导出-导入方法 + +对于直接连接受限的场景: + +1. 从源系统导出数据到文件(CSV、Parquet、JSON) +2. 将文件存储到对象存储(S3、GCS、HDFS) +3. 使用 [S3 Load](../data-operate/import/data-source/amazon-s3.md) 或 [Broker Load](../data-operate/import/import-way/broker-load-manual.md) 加载到 Doris + +## 迁移规划原则 + +迁移前,请考虑以下事项: + +1. **数据量评估** + - 总数据大小和行数 + - 每日/每小时数据增长率 + - 历史数据保留要求 + +2. **Schema 设计** + - 选择合适的[数据模型](../table-design/data-model/overview.md)(Duplicate、Unique、Aggregate) + - 规划[分区](../table-design/data-partitioning/data-distribution.md)策略 + - 定义[分桶](../table-design/data-partitioning/data-bucketing.md)键 + +3. **数据类型映射** + - 检查类型兼容性(参见各迁移指南的具体映射) + - 处理特殊类型(数组、JSON、带时区的时间戳) + +4. **性能要求** + - 查询延迟预期 + - 并发查询负载 + - 数据新鲜度要求 + +## 最佳实践 + +- **从试点表开始**:在迁移整个数据库之前,先用一个代表性的表进行测试,验证 Schema 设计、类型映射和数据正确性。 +- **批量大规模迁移**:对于数十亿行的表,分批迁移(例如按日期范围),以管理资源使用并在批次之间进行验证。 +- **监控迁移进度**:使用 `SHOW LOAD` 跟踪活动和已完成的加载任务。 +- **迁移后验证**:比较源和目标的行数、抽查记录、验证数据类型。 + +## 下一步 + +选择您的源系统查看详细的迁移说明: + +- [PostgreSQL 迁移到 Doris](./postgresql-to-doris.md) +- [MySQL 迁移到 Doris](./mysql-to-doris.md) +- [Elasticsearch 迁移到 Doris](./elasticsearch-to-doris.md) +- [其他 OLAP 系统迁移到 Doris](./other-olap-to-doris.md) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/postgresql-to-doris.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/postgresql-to-doris.md new file mode 100644 index 0000000000000..b769d35af5eea --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/migration/postgresql-to-doris.md @@ -0,0 +1,161 @@ +--- +{ + "title": "PostgreSQL 迁移到 Doris", + "language": "zh-CN", + "description": "从 PostgreSQL 迁移数据到 Apache Doris 的完整指南" +} +--- + +本指南介绍如何将数据从 PostgreSQL 迁移到 Apache Doris。您可以根据实时同步需求、数据量和运维复杂度选择多种迁移方法。 + +## 注意事项 + +1. 
**Schema 设计**:迁移前,选择合适的 Doris [数据模型](../table-design/data-model/overview.md)并规划您的[分区](../table-design/data-partitioning/data-distribution.md)和[分桶](../table-design/data-partitioning/data-bucketing.md)策略。 + +2. **数据类型**:查看下面的类型映射表。某些 PostgreSQL 类型需要特殊处理(数组、带时区的时间戳、JSON)。 + +3. **主键**:PostgreSQL 的 serial/identity 列映射到 Doris 的 INT/BIGINT 类型。对于唯一约束,使用 Doris 的 UNIQUE KEY 模型。 + +4. **时区处理**:PostgreSQL `timestamptz` 以 UTC 存储时间戳并在读取时转换为会话时区。Doris `DATETIME` 不携带时区信息。迁移时显式转换时间戳,并确保 Doris BE 中 JVM 时区一致(`be.conf`)。 + +5. **数组处理**:PostgreSQL 数组映射到 Doris ARRAY 类型,但维度检测需要现有数据。如果无法确定数组维度,使用显式转换。 + +6. **JSON/JSONB**:PostgreSQL JSON/JSONB 映射到 Doris VARIANT 类型,支持灵活 Schema 和高效的 JSON 操作。 + +7. **大表迁移**:对于数亿行的表,按时间范围或 ID 范围分区迁移,同时运行多个 INSERT 语句,并监控 Doris BE 内存和磁盘使用情况。 + +## 数据类型映射 + +| PostgreSQL 类型 | Doris 类型 | 说明 | +|-----------------|------------|------| +| boolean | BOOLEAN | | +| smallint / int2 | SMALLINT | | +| integer / int4 | INT | | +| bigint / int8 | BIGINT | | +| decimal / numeric | DECIMAL(P,S) | 无精度的 Numeric 映射为 STRING | +| real / float4 | FLOAT | | +| double precision | DOUBLE | | +| smallserial | SMALLINT | | +| serial | INT | | +| bigserial | BIGINT | | +| char(n) | CHAR(N) | | +| varchar / text | STRING | | +| timestamp | DATETIME | | +| timestamptz | DATETIME | 转换为本地时区;参见上方时区处理 | +| date | DATE | | +| time | STRING | Doris 不支持 TIME 类型 | +| interval | STRING | | +| json / jsonb | VARIANT | 参见 [VARIANT 类型](../data-operate/import/complex-types/variant.md),支持灵活 Schema | +| uuid | STRING | | +| bytea | STRING | | +| array | ARRAY | 参见上方数组处理 | +| inet / cidr / macaddr | STRING | | +| point / line / polygon | STRING | 几何类型存储为字符串 | + +## 迁移选项 + +### 选项 1:JDBC Catalog(批量迁移) + +[JDBC Catalog](../lakehouse/catalogs/jdbc-catalog.md) 提供从 Doris 直接访问 PostgreSQL 数据的能力。这是查询和迁移数据最简单的方法。 + +**前提条件**:PostgreSQL 11.x 或更高版本;[PostgreSQL JDBC 驱动](https://jdbc.postgresql.org/) 42.5.x 或更高版本;Doris FE/BE 节点与 PostgreSQL 之间的网络连接(端口 5432)。 + +### 选项 2:Flink CDC(实时同步) + +Flink CDC 从 PostgreSQL 
WAL(预写日志)捕获变更并实时流式传输到 Doris。这适用于持续同步场景。 + +**前提条件**:启用逻辑复制的 PostgreSQL(`wal_level = logical`);Flink 1.15+ 配合 Flink CDC 和 Flink Doris Connector;PostgreSQL 中的复制槽。 + +详细设置请参考 [Flink Doris Connector](../ecosystem/flink-doris-connector.md) 文档。 + +### 选项 3:Streaming Job(内置 CDC 同步) + +Doris 内置的 [Streaming Job](../data-operate/import/streaming-job/streaming-job-multi-table.md) 可以直接从 PostgreSQL 同步全量和增量数据到 Doris,无需部署 Flink 等外部工具。底层使用 CDC 读取 PostgreSQL WAL,并自动创建目标表(UNIQUE KEY 模型),主键与源表保持一致。 + +此选项适用于: + +- 无需部署 Flink 集群的实时多表同步 +- 偏好使用 Doris 原生功能而非外部工具的环境 +- 通过单条 SQL 命令实现全量 + 增量迁移 + +**前提条件**:PostgreSQL 启用逻辑复制(`wal_level = logical`);PostgreSQL JDBC 驱动已部署到 Doris。 + +#### 步骤 1:启用逻辑复制 + +确保 `postgresql.conf` 包含: + +```ini +wal_level = logical +``` + +#### 步骤 2:创建 Streaming Job + +```sql +CREATE JOB pg_sync +ON STREAMING +FROM POSTGRES ( + "jdbc_url" = "jdbc:postgresql://pg-host:5432/source_db", + "driver_url" = "postgresql-42.5.6.jar", + "driver_class" = "org.postgresql.Driver", + "user" = "postgres", + "password" = "password", + "database" = "source_db", + "schema" = "public", + "include_tables" = "orders,customers,products", + "offset" = "initial" +) +TO DATABASE target_db ( + "table.create.properties.replication_num" = "3" +) +``` + +关键参数: + +| 参数 | 说明 | +|------|------| +| `include_tables` | 逗号分隔的待同步表列表 | +| `offset` | `initial` 全量 + 增量;`latest` 仅增量 | +| `snapshot_split_size` | 全量同步时每个分片的行数(默认:8096) | +| `snapshot_parallelism` | 全量同步阶段的并行度(默认:1) | + +#### 步骤 3:监控同步状态 + +```sql +-- 查看 Job 状态 +SELECT * FROM jobs(type=insert) WHERE ExecuteType = "STREAMING"; + +-- 查看 Task 历史 +SELECT * FROM tasks(type='insert') WHERE jobName = 'pg_sync'; + +-- 暂停 / 恢复 / 删除 +PAUSE JOB WHERE jobname = 'pg_sync'; +RESUME JOB WHERE jobname = 'pg_sync'; +DROP JOB WHERE jobname = 'pg_sync'; +``` + +详细参考请见 [Streaming Job 多表同步](../data-operate/import/streaming-job/streaming-job-multi-table.md) 文档。 + +### 选项 4:导出和加载 + +适用于网络隔离环境或无法直接连接的情况: + +1. 从 PostgreSQL 导出数据到文件(CSV、Parquet) +2. 上传到对象存储(S3、HDFS) +3. 
使用 [S3 Load](../data-operate/import/data-source/amazon-s3.md) 或 [Broker Load](../data-operate/import/import-way/broker-load-manual.md) 加载到 Doris + +## 验证清单 + +迁移后,验证: + +- 源和目标的行数匹配 +- 样本记录相同 +- NULL 值正确保留 +- 数值精度保持 +- 日期/时间值正确(检查时区) +- 数组和 JSON 字段可查询 + +## 下一步 + +- [Flink Doris Connector](../ecosystem/flink-doris-connector.md) - 详细的连接器文档 +- [数据导入](../data-operate/import/load-manual.md) - 其他导入方法 +- [数据模型](../table-design/data-model/overview.md) - 选择正确的表模型 diff --git a/sidebars.ts b/sidebars.ts index ee8854b2c2c0c..4321fec4a2ce6 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -20,6 +20,18 @@ const sidebars: SidebarsConfig = { }, ], }, + { + type: 'category', + label: 'Migration', + collapsed: false, + items: [ + 'migration/overview', + 'migration/postgresql-to-doris', + 'migration/mysql-to-doris', + 'migration/elasticsearch-to-doris', + 'migration/other-olap-to-doris', + ], + }, { type: 'category', label: 'Guides',