# schema package

## Enumerations

| Enumeration | Description |
| --- | --- |
| BinStrategy | Describes the binning technique to use. See numpy for detailed definitions: https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html |
| BooleanComparisonOperator | |
| BooleanOperator | |
| CodebookStrategy | |
| ComparisonStrategy | Indicates the comparison type used for a filter operation. This is applied on a row-by-row basis. |
| DataFormat | Base format the data is stored in. This will expand to include additional formats such as Arrow and Parquet over time. TODO: we've seen a number of examples in the wild using JSON Lines: https://jsonlines.org/ |
| DataNature | Indicates the expected general layout of the data. This could be used to provide validation hints; for example, microdata must have one row per subject. TODO: "timeseries" as distinct from "panel"? Others? |
| DataOrientation | Indicates the orientation of the data within the file. See the notes and example JSON formats below this table. |
| DataType | Explicit data type of the value (i.e., for a column or property). TODO: clarify/update null/undefined. |
| DateComparisonOperator | |
| ErrorCode | |
| FieldAggregateOperation | This is the subset of aggregate functions that can operate on a single field, so we don't accommodate additional args. See https://uwdata.github.io/arquero/api/op#aggregate-functions |
| FileType | These are the available formats for the snapshot verb. |
| JoinStrategy | |
| KnownProfile | |
| KnownRel | |
| MathOperator | |
| MergeStrategy | |
| NumericComparisonOperator | |
| ParseType | This is a subset of the data types available for parsing operations. |
| SetOp | Indicates the type of set operation to perform across two collections. |
| SortDirection | |
| StringComparisonOperator | |
| VariableNature | Describes the semantic shape of a variable. This has a particular effect on how we display and compare data, such as using line charts for continuous variables versus bar charts for categorical ones. This mostly applies to numeric variables, but strings, for instance, can also be categorical. |
| Verb | |
| WindowFunction | These are operations that perform windowed compute. See https://uwdata.github.io/arquero/api/op#window-functions |

### DataOrientation notes

Most CSV data files are 'values' (row-oriented). JSON files can commonly be either: 'records' are probably more common, though they require more space due to the replication of keys. Apache Arrow and Parquet are columnar. This nearly aligns with pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#json

A key difference (which probably needs to be resolved) is that we don't yet support the notion of an index. See the pandas examples for the "columns" or "index" orientation, which use a nested structure.

Example JSON formats:

- records: `[{ colA: valueA1, colB: valueB1 }, { colA: valueA2, colB: valueB2 }]`
- columns: `{ colA: [valueA1, valueA2], colB: [valueB1, valueB2] }`
- array: `["value1", "value2"]`
- values: `[["colA", "colB"], ["valueA1", "valueB1"], ["valueA2", "valueB2"]]`
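The orientations map onto one another mechanically. Below is a minimal TypeScript sketch of converting the 'records' orientation to the 'columns' orientation; the `Cell` type and `recordsToColumns` helper are local illustrations, not exports of this package:

```typescript
// Local simplification of a cell value; not the package's Value alias.
type Cell = string | number | boolean | null

// Convert row-oriented 'records' data into column-oriented 'columns' data.
function recordsToColumns(
  records: Record<string, Cell>[],
): Record<string, Cell[]> {
  const columns: Record<string, Cell[]> = {}
  for (const row of records) {
    for (const [key, value] of Object.entries(row)) {
      columns[key] ??= []
      columns[key].push(value)
    }
  }
  return columns
}

// The 'records' example above becomes the 'columns' example:
recordsToColumns([
  { colA: 'valueA1', colB: 'valueB1' },
  { colA: 'valueA2', colB: 'valueB2' },
])
// => { colA: ['valueA1', 'valueA2'], colB: ['valueB1', 'valueB2'] }
```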

## Functions

| Function | Description |
| --- | --- |
| createCodebookSchemaObject(input) | |
| createDataPackageSchemaObject(input) | |
| createDataTableSchemaObject(input) | |
| createSchemaValidator() | |
| createTableBundleSchemaObject(input) | |
| createWorkflowSchemaObject(input) | |
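The create*SchemaObject factories pair naturally with createSchemaValidator. A hedged usage sketch follows; this page does not document the parameter shapes, so the input fields shown are assumptions:

```typescript
import {
  createCodebookSchemaObject,
  createSchemaValidator,
} from '@datashaper/schema'

// Assumption: the factory fills in schema boilerplate (e.g. the schema
// identifier) around whatever partial input is passed.
const codebook = createCodebookSchemaObject({
  name: 'example-codebook', // hypothetical input fields
  fields: [],
})

// Assumption: the validator checks instances against the published schemas.
const validator = createSchemaValidator()
```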

## Interfaces

| Interface | Description |
| --- | --- |
| AggregateArgs | |
| BasicInput | Single-input, single-output step I/O. |
| Bin | Describes a data bin in terms of its inclusive lower bound and the count of values in the bin. |
| BinArgs | |
| BinarizeArgs | |
| BooleanArgs | |
| BundleSchema | A schema for defining custom bundle types. |
| Category | Describes a nominal category in terms of category name and the count of values in the category. |
| CodebookSchema | Contains all of the field-level details for interpreting a dataset, including data types, mapping, and metadata. Note that with persisted metadata and field examples, a dataset can often be visualized and described to the user without actually loading the source file. Resource profile: 'codebook'. |
| Constraints | Validation constraints for a field. |
| ConvertArgs | |
| CopyArgs | |
| Criteria | |
| DataPackageSchema | Defines a Data Package, which is a collection of data resources such as files and schemas. Loosely based on the Frictionless spec, but modified where needed to meet our needs: https://specs.frictionlessdata.io/data-package/ |
| DataShape | Defines parameters for understanding the logical structure of data contents. |
| DataTableSchema | Defines the table-containing resource type. A dataset can be embedded directly using the data property, or it can be linked to a raw file using the path. If the latter, optional format and parsing options can be applied to aid in interpreting the file contents. Resource profile: 'datatable'. |
| DeriveArgs | |
| DestructureArgs | |
| DualInput | Dual-input, single-output step I/O. |
| EncodeDecodeArgs | |
| EraseArgs | |
| Field | Contains the full schema definition and metadata for a data field (usually a table column). This includes the required data type, various data-nature and rendering properties, potential validation rules, and mappings from a data dictionary. |
| FieldError | |
| FieldMetadata | Holds core metadata/stats for a data field. |
| FillArgs | |
| FilterArgs | |
| FoldArgs | |
| ImputeArgs | |
| InputColumnArgs | |
| InputColumnListArgs | Base interface for a number of operations that work on a column list. |
| InputColumnRecordArgs | |
| InputKeyValueArgs | |
| JoinArgs | |
| JoinArgsBase | |
| LookupArgs | |
| MergeArgs | |
| Named | Base interface for sharing properties of named resources/objects. |
| OnehotArgs | |
| OrderbyArgs | |
| OrderbyInstruction | |
| OutputColumnArgs | |
| ParserOptions | Parsing options for delimited files. This is a mix of the options from pandas and Spark. |
| PivotArgs | |
| PrintArgs | |
| RecodeArgs | |
| RelationshipConstraint | |
| ResourceSchema | Parent class for any resource type understood by the system. Any object type that extends from Resource is expected to have a standalone schema published. For project state, this can be left as generic as possible for now. |
| RollupArgs | |
| SampleArgs | |
| SnapshotArgs | |
| SourcePreservingArgs | |
| SpreadArgs | |
| StepJsonCommon | Common step properties. |
| StringsArgs | |
| StringsReplaceArgs | |
| TableBundleSchema | A table bundle encapsulates table-specific resources into a single resource with a prescribed workflow. A table bundle requires a source entry with rel="input" for the source table, and may also include source entries with rel="codebook" and rel="workflow" for interpretation and processing of the source data table. See the sketch after this table. |
| TypeHints | Configuration values for interpreting data types when parsing a delimited file. By default, all values are read as strings; applying these type hints can derive primitive types from the strings. |
| UnhotArgs | |
| UnknownInput | |
| ValidationResult | |
| VariadicInput | Multi-input, single-output step I/O. |
| WindowArgs | |
| WorkflowArgs | |
| WorkflowSchema | The root wrangling workflow specification. Resource profile: 'workflow'. |
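Based on the TableBundleSchema description above, a table bundle ties its inputs together through rel-tagged source entries. A hedged sketch; only the rel values ('input', 'codebook', 'workflow') come from this page, and the surrounding property names are assumptions:

```typescript
// Hypothetical table bundle resource; property names other than the
// rel values are assumptions for illustration.
const tableBundle = {
  profile: 'tablebundle',
  name: 'example-bundle',
  sources: [
    { rel: 'input', source: 'data.csv' }, // required source table
    { rel: 'codebook', source: 'codebook.json' }, // optional: interpretation
    { rel: 'workflow', source: 'workflow.json' }, // optional: processing
  ],
}
```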

## Variables

| Variable | Description |
| --- | --- |
| DataTableSchemaDefaults | |
| LATEST_CODEBOOK_SCHEMA | |
| LATEST_DATAPACKAGE_SCHEMA | |
| LATEST_DATATABLE_SCHEMA | |
| LATEST_TABLEBUNDLE_SCHEMA | |
| LATEST_WORKFLOW_SCHEMA | |
| ParserOptionsDefaults | |
| TypeHintsDefaults | A collection of default string values for inferring strict types from strings. These replicate the defaults from pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#csv-text-files (see the sketch after this table). |
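As TypeHintsDefaults suggests, parsing behavior can be tuned per table. A sketch of a datatable resource carrying type hints; the specific hint property names (trueValues, falseValues, naValues) are assumptions modeled on the pandas defaults the description references:

```typescript
// Hypothetical datatable resource with type hints. By default every
// parsed value is a string; these hints tell the parser which strings
// to read back as booleans or nulls.
const datatable = {
  profile: 'datatable',
  name: 'example-table',
  path: 'data.csv',
  typeHints: {
    trueValues: ['TRUE', 'True', 'true'],
    falseValues: ['FALSE', 'False', 'false'],
    naValues: ['', 'NA', 'NaN', 'null'],
  },
}
```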

## Type Aliases

| Type Alias | Description |
| --- | --- |
| DedupeArgs | |
| DropArgs | |
| FactoryInput | |
| GroupbyArgs | |
| InputBinding | |
| Profile | Resources must have a profile, which is a key defining how the resource should be interpreted. Profiles are essentially shorthand for a schema URL. The core profiles for DataShaper are defined here, but any application can define one as a string. |
| Rel | A rel is a string that describes the relationship between a resource and its child. |
| RenameArgs | |
| SelectArgs | |
| Step | Specification for step items. |
| UnfoldArgs | |
| UnrollArgs | |
| ValidationFunction | |
| Value | A cell/property value of any type. |
| WorkflowInput | |
| WorkflowStepId | The id of the step to which the input is bound. |
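Putting several of the pieces above together, a workflow resource is a list of verb-driven steps. A final hedged sketch; the step shape follows the Step, Verb, and WorkflowSchema entries above, but the specific argument names are assumptions:

```typescript
// Hypothetical workflow resource with a single rename step. The
// column-mapping argument shape is an assumption for illustration.
const workflow = {
  profile: 'workflow',
  name: 'example-workflow',
  steps: [
    {
      verb: 'rename',
      args: {
        columns: { colA: 'renamedA' }, // old name -> new name
      },
    },
  ],
}
```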