NewsDoc

This package provides type declarations for NewsDoc as Go types, protobuf messages, and a JSON schema. Protobuf and JSON schemas are generated from the Go type declarations.

NewsDoc was created to be a convenient and type-safe document format for editorial data, such as articles and concept metadata, that minimises the need to evolve the schema when new types of data are introduced. It achieves this by not using the data structure itself to express relationships ({categories:['a', 'b'], seeAlso:['c', 'd']}) or the type and identity of the data ({articleMetadata:{teaserHeadline:"v", teaserText:"w"}, headline:"x", "lead_in":"y", paragraphs:["z"]}). An example of a hypothetical format that does encode these in its structure:

{
    "categories": [
        "28b94216-77d7-41e9-be08-a6bfbe59f1d5",
        "a23528b7-31af-4ae2-bbca-0c78f1cbc959"
    ],
    "readMore": [
        "6dd826dd-d866-459b-a07e-0da4bad7bce0",
        "043c248f-92ac-4e0b-b0ec-76cc26323634"
    ],
    "articleMetadata": {
        "teaserHeadline": "v",
        "teaserText": "w"
    },
    "headline": "x",
    "lead_in": "y",
    "paragraphs": ["z"],
    "image": "https://example.com/an-image.jpg",
    "image_width": 128,
    "image_height": 128,
    "image_alt_text": "desc"
}

Instead, it treats a document as a set of links that express relationships to other entities, a set of typed metadata blocks, and a list of typed content blocks that represent the actual content of, for example, an article. The article from the example above would instead look like this:

{
    "type": "example/article",
    "links": [
        {"rel":"category", "uuid":"28b94216-77d7-41e9-be08-a6bfbe59f1d5"},
        {"rel":"category", "uuid":"a23528b7-31af-4ae2-bbca-0c78f1cbc959"},
        {
            "rel":"see-also", "type":"example/article",
            "uuid":"6dd826dd-d866-459b-a07e-0da4bad7bce0"
        },
        {
            "rel":"see-also", "type":"example/article",
            "uuid":"043c248f-92ac-4e0b-b0ec-76cc26323634"
        }
    ],
    "meta": [
        {
            "type": "example/teaser",
            "title": "v",
            "data": {
                "text": "w"
            }
        }
    ],
    "content": [
        {
            "type": "example/headline",
            "data": {
                "text": "x"
            }
        },
        {
            "type": "example/image",
            "url": "https://example.com/an-image.jpg",
            "data": {
                "width": "128",
                "height": "128",
                "alt": "desc"
            }
        },
        {
            "type": "example/lead-in",
            "data": {
                "text": "y"
            }
        },
        {
            "type": "example/paragraph",
            "data": {
                "text": "z"
            }
        }
    ]
}

This kind of structure allows a system that uses NewsDoc to recognise, for example, that there is a link to another entity, or a content element with text, without knowing the specific type of relationship or content. Conversely, it is easy to ignore, for example, a metadata block with a type that you don't recognise.
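
As a rough illustration of this point, the Go sketch below decodes a shortened article and collects text and linked UUIDs without inspecting block types or link relations. The Block and Document types, the sample document, and the helpers parse, collectText and linkedUUIDs are simplified stand-ins made up for this example; they are not the type declarations or API of this package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Block and Document are simplified stand-ins used only for this example;
// they are not the type declarations shipped by this package.
type Block struct {
	Type string            `json:"type,omitempty"`
	Rel  string            `json:"rel,omitempty"`
	UUID string            `json:"uuid,omitempty"`
	Data map[string]string `json:"data,omitempty"`
}

type Document struct {
	Type    string  `json:"type"`
	Links   []Block `json:"links"`
	Meta    []Block `json:"meta"`
	Content []Block `json:"content"`
}

// sample is a shortened article with one content type ("example/poll")
// that the code below knows nothing about.
const sample = `{
  "type": "example/article",
  "links": [
    {"rel": "category", "uuid": "28b94216-77d7-41e9-be08-a6bfbe59f1d5"},
    {"rel": "see-also", "type": "example/article", "uuid": "6dd826dd-d866-459b-a07e-0da4bad7bce0"}
  ],
  "content": [
    {"type": "example/headline", "data": {"text": "x"}},
    {"type": "example/poll", "data": {"question": "q"}},
    {"type": "example/paragraph", "data": {"text": "z"}}
  ]
}`

// parse decodes a JSON document into the stand-in types.
func parse(raw string) (Document, error) {
	var doc Document
	err := json.Unmarshal([]byte(raw), &doc)
	return doc, err
}

// collectText returns data["text"] from every content block that carries
// one, without looking at the block type.
func collectText(doc Document) []string {
	var texts []string
	for _, b := range doc.Content {
		if text, ok := b.Data["text"]; ok {
			texts = append(texts, text)
		}
	}
	return texts
}

// linkedUUIDs returns the UUID of every link, whatever its rel is.
func linkedUUIDs(doc Document) []string {
	var ids []string
	for _, l := range doc.Links {
		if l.UUID != "" {
			ids = append(ids, l.UUID)
		}
	}
	return ids
}

func main() {
	doc, err := parse(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(collectText(doc))
	fmt.Println(len(linkedUUIDs(doc)), "linked entities")
}
```

The unknown "example/poll" block is simply passed over because it has no "text" key, which is the behaviour the paragraph above describes.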

One thing is lost in translation in the NewsDoc version of the article: the "data" object of a block is a string->string key-value structure, so the width 128 becomes "128". We sacrifice the specific types of some data in order to have a largely static type system. The "type contract" between content producers and consumers in a system like this is still that "width" and "height" must always be integers. Revisor is our attempt to formalise and enforce these type contracts.

A revisor schema for the above format could look like this:

{"documents":[{
  "name": "News article",
  "description": "A basic news article example",
  "declares": "example/article",
  "links": [
    {
      "name": "Category",
      "description": "A category assigned to the article",
      "declares": {"rel":"category"},
      "attributes": {"uuid": {}}
    },
    {
      "name": "Read more",
      "description": "A link to other articles that are interesting",
      "declares": {"rel":"see-also", "type": "example/article"},
      "attributes": {"uuid": {}}
    }
  ],
  "meta": [
    {
      "name": "Teaser",
      "declares": {"type":"example/teaser"},
      "attributes": {"title": {}},
      "data": {"text": {}},
      "count": 1
    }
  ],
  "content": [
    {
      "name": "Headline",
      "declares": {"type":"example/headline"},
      "data": {"text": {}}
    },
    {
      "name": "Lead-in",
      "declares": {"type":"example/lead-in"},
      "data": {"text": {}}
    },
    {
      "name": "Paragraph",
      "declares": {"type":"example/paragraph"},
      "data": {"text": {}}
    },
    {
      "name": "Image",
      "declares": {"type":"example/image"},
      "attributes": {
        "url": {"glob":"https://**"}
      },
      "data": {
        "width": {"format":"int"},
        "height": {"format":"int"},
        "alt": {}
      }
    }
  ]
}]}

This schema can then be used to validate documents and so ensure the data quality of stored documents. It also serves as documentation, and automated systems such as a full-text index can use it as a hint about the correct way to index the data.
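
To make the "format":"int" constraint concrete, the following Go sketch shows what checking such a contract by hand against a string-valued data map could look like. It is only an illustration and says nothing about how Revisor is implemented; intValue is a hypothetical helper that exists only in this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// intValue is a hypothetical helper: it reads key from a string->string
// data map and parses the value as an integer, returning an error when the
// key is missing or the value does not parse.
func intValue(data map[string]string, key string) (int, error) {
	raw, ok := data[key]
	if !ok {
		return 0, fmt.Errorf("missing data key %q", key)
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("data key %q: %q is not an integer", key, raw)
	}
	return n, nil
}

func main() {
	// The data map of the example/image block from the article above.
	data := map[string]string{"width": "128", "height": "128", "alt": "desc"}

	width, err := intValue(data, "width")
	fmt.Println(width, err)

	// "alt" is free text, so treating it as an integer fails.
	_, err = intValue(data, "alt")
	fmt.Println(err)
}
```

A schema moves this kind of check out of every consumer and into one shared declaration.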

Value extractor expressions

The ValueExtractor provides a way to extract values from documents using a selector expression language. An expression consists of a chain of block selectors followed by a value specifier that determines what to extract from the matched blocks.

Selectors

Selectors navigate the block hierarchy of a document. Each selector targets a block list (meta, links, or content) and can optionally filter by block attributes:

.meta                              -- all meta blocks
.links(rel='category')             -- links with rel "category"
.meta(type='core/note').links      -- links inside meta blocks of type "core/note"
.content(type='core/text' role='heading')  -- content blocks matching both type and role

Selectors can be chained to navigate into nested blocks. The available filter attributes are: id, uuid, uri, url, type, rel, role, name, value, contenttype, and sensitivity. Attribute values are single-quoted; use \' to escape a literal quote inside a value.
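
The attribute filtering described above can be pictured with a small Go sketch. It is not the ValueExtractor implementation: Block, Cond, attr and filter are names invented for this example, and only four of the listed attributes are covered.

```go
package main

import "fmt"

// Block is a stand-in carrying a few of the filterable attributes.
type Block struct {
	Type  string
	Role  string
	Rel   string
	Value string
}

// Cond is one name='value' condition from inside the parentheses.
type Cond struct {
	Name  string
	Value string
}

// attr returns the named attribute and whether this sketch knows about it.
func attr(b Block, name string) (string, bool) {
	switch name {
	case "type":
		return b.Type, true
	case "role":
		return b.Role, true
	case "rel":
		return b.Rel, true
	case "value":
		return b.Value, true
	}
	return "", false
}

// matchesAll reports whether b satisfies every condition.
func matchesAll(b Block, conds []Cond) bool {
	for _, c := range conds {
		got, known := attr(b, c.Name)
		if !known || got != c.Value {
			return false
		}
	}
	return true
}

// filter keeps the blocks that satisfy all conditions; with no conditions
// every block is kept, like a bare .content selector.
func filter(blocks []Block, conds ...Cond) []Block {
	var out []Block
	for _, b := range blocks {
		if matchesAll(b, conds) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	content := []Block{
		{Type: "core/text", Role: "heading"},
		{Type: "core/text"},
		{Type: "core/image"},
	}
	// Roughly what .content(type='core/text' role='heading') selects.
	fmt.Println(filter(content, Cond{"type", "core/text"}, Cond{"role", "heading"}))
}
```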

Data filters

In addition to block attributes, selectors can filter on values in the block's data map using the data. prefix inside the parentheses. Three modes are supported:

data.key='value'   -- exact match: the data key must exist with this value
data.key?          -- exists: the data key must be present (even if empty)
data.key??         -- non-empty: the data key must be present and non-empty

Data filters can be mixed freely with attribute filters:

.meta(type='core/event' data.date?? data.status='confirmed').data{date}
.links(rel='item' data.date_timezone='Asia/Shanghai').data{date}
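
The difference between the three modes is easiest to see on a data map with an empty value. The Go sketch below models them directly; Mode, DataFilter and Match are hypothetical names for this example and do not reflect the internals of the ValueExtractor.

```go
package main

import "fmt"

// Mode selects how a data filter compares a key.
type Mode int

const (
	Exact    Mode = iota // data.key='value'
	Exists               // data.key?
	NonEmpty             // data.key??
)

// DataFilter is one data.* condition; Value is only used by Exact.
type DataFilter struct {
	Key   string
	Mode  Mode
	Value string
}

// Match reports whether the data map satisfies the filter.
func (f DataFilter) Match(data map[string]string) bool {
	v, ok := data[f.Key]
	switch f.Mode {
	case Exact:
		return ok && v == f.Value
	case Exists:
		return ok
	case NonEmpty:
		return ok && v != ""
	}
	return false
}

func main() {
	// "date" is present but empty; "status" has a value.
	data := map[string]string{"date": "", "status": "confirmed"}

	fmt.Println(DataFilter{Key: "date", Mode: Exists}.Match(data))
	fmt.Println(DataFilter{Key: "date", Mode: NonEmpty}.Match(data))
	fmt.Println(DataFilter{Key: "status", Mode: Exact, Value: "confirmed"}.Match(data))
}
```

With this data map the exists filter on "date" passes, the non-empty filter on "date" fails, and the exact match on "status" passes.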

Combining conditions with `or` and grouping

By default, multiple conditions inside a selector are combined with implicit AND — a block must satisfy all of them. Use the or keyword to match blocks satisfying at least one alternative:

.meta(value='text' or value='picture')

AND binds tighter than or, so conditions separated by spaces are grouped together before or is applied. To control precedence, use parentheses:

.meta(type='core/thing' (value='a' or value='b'))

This matches meta blocks with type='core/thing' AND either value='a' or value='b'. Without the inner parentheses, the expression would be parsed as (type='core/thing' value='a') or value='b'.

Parenthesized groups can be nested and combined freely with attribute and data filters:

-- OR between two AND groups
.meta((type='a' value='x') or (type='b' value='y'))

-- OR between data filters
.meta(data.status='draft' or data.status='review')

-- Nested groups
.meta((type='a' (value='x' or value='y')) or (type='b' value='z'))

-- Three-way OR
.meta(value='text' or value='picture' or value='video')
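
The precedence rule can be checked with ordinary boolean combinators. The Go sketch below builds the grouped and the ungrouped reading of the type='core/thing' example as predicates and evaluates both against one block. Pred, and, or, typeIs and valueIs are illustrative helpers, not part of the package.

```go
package main

import "fmt"

// Block is a stand-in with just the two attributes used in the example.
type Block struct {
	Type  string
	Value string
}

// Pred is a condition evaluated against a single block.
type Pred func(Block) bool

func typeIs(t string) Pred  { return func(b Block) bool { return b.Type == t } }
func valueIs(v string) Pred { return func(b Block) bool { return b.Value == v } }

// and is true when every condition holds, like space-separated conditions.
func and(ps ...Pred) Pred {
	return func(b Block) bool {
		for _, p := range ps {
			if !p(b) {
				return false
			}
		}
		return true
	}
}

// or is true when at least one alternative holds.
func or(ps ...Pred) Pred {
	return func(b Block) bool {
		for _, p := range ps {
			if p(b) {
				return true
			}
		}
		return false
	}
}

// grouped mirrors .meta(type='core/thing' (value='a' or value='b')).
func grouped() Pred {
	return and(typeIs("core/thing"), or(valueIs("a"), valueIs("b")))
}

// ungrouped mirrors .meta(type='core/thing' value='a' or value='b'),
// where AND binds tighter than or.
func ungrouped() Pred {
	return or(and(typeIs("core/thing"), valueIs("a")), valueIs("b"))
}

func main() {
	b := Block{Type: "core/other", Value: "b"}
	fmt.Println(grouped()(b), ungrouped()(b))
}
```

For a block with type "core/other" and value "b" this prints false true: the grouped form rejects it because the type does not match, while the ungrouped form accepts it through the value='b' alternative.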

Child selectors

Use # to filter blocks by their descendants without navigating into them. The selectors after # form a child selector chain — the parent block is only matched if it has descendants satisfying the chain. The extraction targets the parent block, not the descendants:

.meta(type='core/assignment')#.links(rel='deliverable' uuid='...')

This selects core/assignment meta blocks that contain a link with rel='deliverable' and the given UUID. The result is the assignment block itself. Compare with the non-child version which would navigate into and return the link:

.meta(type='core/assignment').links(rel='deliverable' uuid='...')

Child selectors can be chained to match deeper descendants, and support the same attribute and data filters as regular selectors:

assignment=.meta(type='core/assignment')#.links(rel='deliverable' data.status='active'):label
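
The contrast between filtering by a child and navigating into it can be sketched in Go as two functions over the same data. This is a simplification that only looks at direct links rather than arbitrary descendant chains; Block, selectParents, selectLinks and the "doc-1"/"doc-2" identifiers are placeholders invented for the example.

```go
package main

import "fmt"

// Block is a stand-in with the attributes the example needs.
type Block struct {
	Type  string
	Rel   string
	UUID  string
	Title string
	Links []Block
}

// isDeliverable matches links with rel "deliverable" and the given UUID.
func isDeliverable(uuid string) func(Block) bool {
	return func(l Block) bool { return l.Rel == "deliverable" && l.UUID == uuid }
}

// selectParents is the # form: it keeps parents that have at least one
// matching direct link and returns the parents themselves.
func selectParents(parents []Block, match func(Block) bool) []Block {
	var out []Block
	for _, p := range parents {
		for _, l := range p.Links {
			if match(l) {
				out = append(out, p)
				break
			}
		}
	}
	return out
}

// selectLinks is the plain chained form: it navigates into the parents and
// returns the matching links.
func selectLinks(parents []Block, match func(Block) bool) []Block {
	var out []Block
	for _, p := range parents {
		for _, l := range p.Links {
			if match(l) {
				out = append(out, l)
			}
		}
	}
	return out
}

// sampleAssignments returns two assignment blocks with placeholder UUIDs.
func sampleAssignments() []Block {
	return []Block{
		{Type: "core/assignment", Title: "first", Links: []Block{{Rel: "deliverable", UUID: "doc-1"}}},
		{Type: "core/assignment", Title: "second", Links: []Block{{Rel: "deliverable", UUID: "doc-2"}}},
	}
}

func main() {
	match := isDeliverable("doc-2")
	fmt.Println(selectParents(sampleAssignments(), match)[0].Type)
	fmt.Println(selectLinks(sampleAssignments(), match)[0].Rel)
}
```

The first call yields the core/assignment block, the second yields the deliverable link inside it.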

Extracting data values

Use .data{} to extract values from the matched blocks' data maps. Values are space-separated (commas are also accepted):

.meta(type='core/planning-item').data{start_date end_date}

Each matched block must have all specified data keys for the extraction to succeed. Append ? to make a value optional:

.meta(type='core/planning-item').data{start_date date_tz?}
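
The required and optional behaviour can be sketched as follows in Go. The sketch assumes that a block lacking a required key is skipped as a whole and that "missing" means the key is absent from the data map; Key and extract are hypothetical names and the dates are made-up sample values.

```go
package main

import "fmt"

// Key names one data value to extract; Optional corresponds to a trailing ?.
type Key struct {
	Name     string
	Optional bool
}

// extract returns one row per data map that has every required key.
// Optional keys are included when present and left out otherwise.
func extract(blocks []map[string]string, keys []Key) []map[string]string {
	var out []map[string]string
	for _, data := range blocks {
		row := map[string]string{}
		complete := true
		for _, k := range keys {
			v, ok := data[k.Name]
			if !ok {
				if k.Optional {
					continue
				}
				complete = false
				break
			}
			row[k.Name] = v
		}
		if complete {
			out = append(out, row)
		}
	}
	return out
}

func main() {
	blocks := []map[string]string{
		{"start_date": "2024-05-01", "date_tz": "Europe/Stockholm"},
		{"start_date": "2024-05-02"},
		{"date_tz": "Asia/Shanghai"},
	}
	// Roughly .data{start_date date_tz?}
	keys := []Key{{Name: "start_date"}, {Name: "date_tz", Optional: true}}
	for _, row := range extract(blocks, keys) {
		fmt.Println(row)
	}
}
```

The third block produces no row because it has no start_date, while the second block still produces a row without date_tz.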

Extracting block attributes

Use @{} to extract block attribute values:

.content(type='core/text')@{value}
.links(rel='author')@{uuid title}

When no selectors are provided, @{} extracts document-level attributes (uuid, type, uri, url, title, language):

@{title language}

Combining attribute and data extraction

An expression can combine @{} and .data{} to extract both block attributes and data values from the same matched blocks:

.meta(type='core/assignment')@{title}.data{start_date date_tz}

This extracts the title attribute and the start_date and date_tz data values from each matched block. The same all-or-nothing semantics apply: if any required value is missing, the block is skipped.

Annotations and roles

Values can be annotated with a type hint using :, and given a role using = as a prefix:

.meta(type='core/event').data{date:date tz=date_timezone?}

Here date has the annotation date, and date_timezone is extracted with the role tz. Annotations and roles are passed through in the extracted results and can be used by the caller to interpret the values.
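
To show how the pieces of a single value specifier relate, here is a small Go sketch that splits one into role, name, annotation and optional flag. It is not the parser used by the ValueExtractor. It only handles the forms shown in the example above, and how it treats a specifier that combines ? with an annotation is an assumption. ValueSpec and parseValueSpec are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// ValueSpec is a hypothetical breakdown of one entry inside .data{...}.
type ValueSpec struct {
	Role       string
	Name       string
	Annotation string
	Optional   bool
}

// parseValueSpec splits tokens such as "date:date" or "tz=date_timezone?".
// Assumption: a trailing ? is stripped before the :annotation is split off.
func parseValueSpec(tok string) ValueSpec {
	var spec ValueSpec
	// role=name: everything before the first = is the role.
	if role, rest, ok := strings.Cut(tok, "="); ok {
		spec.Role = role
		tok = rest
	}
	// name?: a trailing question mark marks the value as optional.
	if strings.HasSuffix(tok, "?") {
		spec.Optional = true
		tok = strings.TrimSuffix(tok, "?")
	}
	// name:annotation: everything after the first : is the annotation.
	if name, annotation, ok := strings.Cut(tok, ":"); ok {
		tok = name
		spec.Annotation = annotation
	}
	spec.Name = tok
	return spec
}

func main() {
	for _, tok := range []string{"date:date", "tz=date_timezone?"} {
		fmt.Printf("%+v\n", parseValueSpec(tok))
	}
}
```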

Extracting full blocks

If no .data{} or @{} value specifier is present, the expression extracts the full matched blocks. Block extraction requires a name prefix and optionally accepts an annotation:

name=.selectors
name=.selectors:annotation

Examples:

items=.meta(type='core/collection').links(rel='item')
event=.links(rel='event' type='core/event'):calendar

The name is used as the key in the extracted results and populates the Name field of the ExtractedValue. The matched block is available in the Block field.
