# schema package

## Enumerations

| Enumeration | Description |
| --- | --- |
| BinStrategy | Describes the binning technique to use. See numpy for detailed definitions: https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html |
| BooleanComparisonOperator | |
| BooleanOperator | |
| CodebookStrategy | |
| ComparisonStrategy | Indicates the comparison type used for a filter operation. This is applied on a row-by-row basis. |
| DataFormat | Base format the data is stored in. This will expand to include additional formats such as Arrow and Parquet over time. TODO: we've seen a number of examples in the wild using JSON Lines: https://jsonlines.org/ |
| DataNature | Indicates the expected general layout of the data. This could be used to provide validation hints; for example, microdata must have one row per subject. TODO: "timeseries" as distinct from "panel"? Others? |
| DataOrientation | Indicates the orientation of the data within the file. See the notes and example JSON formats below this table. |
| DataType | Explicit data type of the value (i.e., for a column or property). TODO: clarify/update null/undefined. |
| DateComparisonOperator | |
| ErrorCode | |
| FieldAggregateOperation | This is the subset of aggregate functions that can operate on a single field, so we don't accommodate additional args. See https://uwdata.github.io/arquero/api/op#aggregate-functions |
| FileType | These are the available formats for the snapshot verb. |
| JoinStrategy | |
| KnownProfile | |
| KnownRel | |
| MathOperator | |
| MergeStrategy | |
| NumericComparisonOperator | |
| ParseType | This is a subset of the data types available for parsing operations. |
| SetOp | Indicates the type of set operation to perform across two collections. |
| SortDirection | |
| StringComparisonOperator | |
| VariableNature | Describes the semantic shape of a variable. This has a particular effect on how we display and compare data, such as using line charts for continuous variables versus bar charts for categorical ones. This mostly applies to numeric variables, but strings, for instance, can also be categorical. |
| Verb | |
| WindowFunction | These are operations that perform windowed compute. See https://uwdata.github.io/arquero/api/op#window-functions |

### DataOrientation notes

Most CSV data files are 'values' (row-oriented). JSON files can commonly be either: 'records' are probably more common, though they require more space due to the replication of keys. Apache Arrow and Parquet are columnar. This nearly aligns with pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#json

A key difference (which probably needs to be resolved) is that we don't yet support the notion of an index. See the pandas examples for the "columns" or "index" orientation, which use a nested structure.

Example JSON formats:

- records: `[{ colA: valueA1, colB: valueB1 }, { colA: valueA2, colB: valueB2 }]`
- columns: `{ colA: [valueA1, valueA2], colB: [valueB1, valueB2] }`
- array: `["value1", "value2"]`
- values: `[["colA", "colB"], ["valueA1", "valueB1"], ["valueA2", "valueB2"]]`
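The orientations map onto one another mechanically. Below is a minimal TypeScript sketch of converting the 'records' orientation to the 'columns' orientation; the `Cell` type and `recordsToColumns` helper are local illustrations, not exports of this package:

```typescript
// Local simplification of a cell value; not the package's Value alias.
type Cell = string | number | boolean | null

// Convert row-oriented 'records' data into column-oriented 'columns' data.
function recordsToColumns(
  records: Record<string, Cell>[],
): Record<string, Cell[]> {
  const columns: Record<string, Cell[]> = {}
  for (const row of records) {
    for (const [key, value] of Object.entries(row)) {
      columns[key] ??= []
      columns[key].push(value)
    }
  }
  return columns
}

// The 'records' example above becomes the 'columns' example:
recordsToColumns([
  { colA: 'valueA1', colB: 'valueB1' },
  { colA: 'valueA2', colB: 'valueB2' },
])
// => { colA: ['valueA1', 'valueA2'], colB: ['valueB1', 'valueB2'] }
```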

## Functions

| Function | Description |
| --- | --- |
| createCodebookSchemaObject(input) | |
| createDataPackageSchemaObject(input) | |
| createDataTableSchemaObject(input) | |
| createSchemaValidator() | |
| createTableBundleSchemaObject(input) | |
| createWorkflowSchemaObject(input) | |
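The create*SchemaObject factories pair naturally with createSchemaValidator. A hedged usage sketch follows; this page does not document the parameter shapes, so the input fields shown are assumptions:

```typescript
import {
  createCodebookSchemaObject,
  createSchemaValidator,
} from '@datashaper/schema'

// Assumption: the factory fills in schema boilerplate (e.g. the schema
// identifier) around whatever partial input is passed.
const codebook = createCodebookSchemaObject({
  name: 'example-codebook', // hypothetical input fields
  fields: [],
})

// Assumption: the validator checks instances against the published schemas.
const validator = createSchemaValidator()
```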

## Interfaces

| Interface | Description |
| --- | --- |
| AggregateArgs | |
| BasicInput | Single-input, single-output step I/O. |
| Bin | Describes a data bin in terms of its inclusive lower bound and the count of values in the bin. |
| BinArgs | |
| BinarizeArgs | |
| BooleanArgs | |
| BundleSchema | A schema for defining custom bundle types. |
| Category | Describes a nominal category in terms of category name and the count of values in the category. |
| CodebookSchema | Contains all of the field-level details for interpreting a dataset, including data types, mapping, and metadata. Note that with persisted metadata and field examples, a dataset can often be visualized and described to the user without actually loading the source file. Resource profile: 'codebook'. |
| Constraints | Validation constraints for a field. |
| ConvertArgs | |
| CopyArgs | |
| Criteria | |
| DataPackageSchema | Defines a Data Package, which is a collection of data resources such as files and schemas. Loosely based on the Frictionless spec, but modified where needed to meet our needs: https://specs.frictionlessdata.io/data-package/ |
| DataShape | Defines parameters for understanding the logical structure of data contents. |
| DataTableSchema | Defines the table-containing resource type. A dataset can be embedded directly using the data property, or it can be linked to a raw file using the path. If the latter, optional format and parsing options can be applied to aid in interpreting the file contents. Resource profile: 'datatable'. |
| DeriveArgs | |
| DestructureArgs | |
| DualInput | Dual-input, single-output step I/O. |
| EncodeDecodeArgs | |
| EraseArgs | |
| Field | Contains the full schema definition and metadata for a data field (usually a table column). This includes the required data type, various data-nature and rendering properties, potential validation rules, and mappings from a data dictionary. |
| FieldError | |
| FieldMetadata | Holds core metadata/stats for a data field. |
| FillArgs | |
| FilterArgs | |
| FoldArgs | |
| ImputeArgs | |
| InputColumnArgs | |
| InputColumnListArgs | Base interface for a number of operations that work on a column list. |
| InputColumnRecordArgs | |
| InputKeyValueArgs | |
| JoinArgs | |
| JoinArgsBase | |
| LookupArgs | |
| MergeArgs | |
| Named | Base interface for sharing properties of named resources/objects. |
| OnehotArgs | |
| OrderbyArgs | |
| OrderbyInstruction | |
| OutputColumnArgs | |
| ParserOptions | Parsing options for delimited files. This is a mix of the options from pandas and Spark. |
| PivotArgs | |
| PrintArgs | |
| RecodeArgs | |
| RelationshipConstraint | |
| ResourceSchema | Parent class for any resource type understood by the system. Any object type that extends from Resource is expected to have a standalone schema published. For project state, this can be left as generic as possible for now. |
| RollupArgs | |
| SampleArgs | |
| SnapshotArgs | |
| SourcePreservingArgs | |
| SpreadArgs | |
| StepJsonCommon | Common step properties. |
| StringsArgs | |
| StringsReplaceArgs | |
| TableBundleSchema | A table bundle encapsulates table-specific resources into a single resource with a prescribed workflow. A table bundle requires a source entry with rel="input" for the source table, and may also include source entries with rel="codebook" and rel="workflow" for interpretation and processing of the source data table. See the sketch after this table. |
| TypeHints | Configuration values for interpreting data types when parsing a delimited file. By default, all values are read as strings; applying these type hints can derive primitive types from the strings. |
| UnhotArgs | |
| UnknownInput | |
| ValidationResult | |
| VariadicInput | Multi-input, single-output step I/O. |
| WindowArgs | |
| WorkflowArgs | |
| WorkflowSchema | The root wrangling workflow specification. Resource profile: 'workflow'. |
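Based on the TableBundleSchema description above, a table bundle ties its inputs together through rel-tagged source entries. A hedged sketch; only the rel values ('input', 'codebook', 'workflow') come from this page, and the surrounding property names are assumptions:

```typescript
// Hypothetical table bundle resource; property names other than the
// rel values are assumptions for illustration.
const tableBundle = {
  profile: 'tablebundle',
  name: 'example-bundle',
  sources: [
    { rel: 'input', source: 'data.csv' }, // required source table
    { rel: 'codebook', source: 'codebook.json' }, // optional: interpretation
    { rel: 'workflow', source: 'workflow.json' }, // optional: processing
  ],
}
```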

## Variables

| Variable | Description |
| --- | --- |
| DataTableSchemaDefaults | |
| LATEST_CODEBOOK_SCHEMA | |
| LATEST_DATAPACKAGE_SCHEMA | |
| LATEST_DATATABLE_SCHEMA | |
| LATEST_TABLEBUNDLE_SCHEMA | |
| LATEST_WORKFLOW_SCHEMA | |
| ParserOptionsDefaults | |
| TypeHintsDefaults | A collection of default string values for inferring strict types from strings. These replicate the defaults from pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#csv-text-files (see the sketch after this table). |
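As TypeHintsDefaults suggests, parsing behavior can be tuned per table. A sketch of a datatable resource carrying type hints; the specific hint property names (trueValues, falseValues, naValues) are assumptions modeled on the pandas defaults the description references:

```typescript
// Hypothetical datatable resource with type hints. By default every
// parsed value is a string; these hints tell the parser which strings
// to read back as booleans or nulls.
const datatable = {
  profile: 'datatable',
  name: 'example-table',
  path: 'data.csv',
  typeHints: {
    trueValues: ['TRUE', 'True', 'true'],
    falseValues: ['FALSE', 'False', 'false'],
    naValues: ['', 'NA', 'NaN', 'null'],
  },
}
```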

## Type Aliases

| Type Alias | Description |
| --- | --- |
| DedupeArgs | |
| DropArgs | |
| FactoryInput | |
| GroupbyArgs | |
| InputBinding | |
| Profile | Resources must have a profile, which is a key defining how the resource should be interpreted. Profiles are essentially shorthand for a schema URL. The core profiles for DataShaper are defined here, but any application can define one as a string. |
| Rel | A rel is a string that describes the relationship between a resource and its child. |
| RenameArgs | |
| SelectArgs | |
| Step | Specification for step items. |
| UnfoldArgs | |
| UnrollArgs | |
| ValidationFunction | |
| Value | A cell/property value of any type. |
| WorkflowInput | |
| WorkflowStepId | The id of the step to which the input is bound. |
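Putting several of the pieces above together, a workflow resource is a list of verb-driven steps. A final hedged sketch; the step shape follows the Step, Verb, and WorkflowSchema entries above, but the specific argument names are assumptions:

```typescript
// Hypothetical workflow resource with a single rename step. The
// column-mapping argument shape is an assumption for illustration.
const workflow = {
  profile: 'workflow',
  name: 'example-workflow',
  steps: [
    {
      verb: 'rename',
      args: {
        columns: { colA: 'renamedA' }, // old name -> new name
      },
    },
  ],
}
```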