Faster JSON parser #112

@kuriwaki

For dataset retrieval, we download and parse JSON metadata multiple times.
For example, in get_dataframe_by_name, get_fileid.character would first find the dataset id via

jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))[["data"]][["id"]]

and then fetch the list of ids for each file in the dataset via
out <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"), simplifyDataFrame = FALSE)$data

It turns out that the time this takes is non-trivial. Most of it is spent downloading the JSON from the URL; only a small fraction (< 1%) goes to parsing the JSON itself. We could still get a minor speedup by switching to a faster parser, RcppSimdJson (https://github.com/eddelbuettel/rcppsimdjson), which is about 2-10x faster in my tests, per the benchmark below. jsonlite::fromJSON seems well suited to data science pipelines that work with the data itself, but here we are only interested in bits of metadata. A bigger gain would come from downloading the metadata only once.

Switching packages will require changes in at least 20 places where jsonlite is used.

library(jsonlite) # currently used
library(RcppSimdJson) # potential replacement

# sample: https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O
js_url <- "https://demo.dataverse.org/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.70122/FK2/PPIAXE"

# download once
tmp <- tempfile()
download.file(js_url, tmp)

microbenchmark::microbenchmark(
  statusquo = jsonlite::fromJSON(js_url), # what is currently being called
  dl = curl::curl_download(js_url, tempfile()), # separating download from parsing
  jsonlite = jsonlite::fromJSON(tmp),  # parsing, without download
  RcppJson = RcppSimdJson::fload(tmp), # same parse with RcppSimdJson
  RcppJson_file = RcppSimdJson::fload(tmp, query = "/datasetVersion/files"), # only files data
  RcppJson_id = RcppSimdJson::fload(tmp, query = "/id"),  # stop at dataset /id
  times = 30
)
#> Unit: microseconds
#>           expr        min         lq        mean      median         uq        max neval
#>      statusquo 365097.709 371235.626 374774.8021 373752.4175 378357.084 387006.459    30
#>             dl 361154.168 364100.750 369091.1201 369528.3965 371835.459 378629.667    30
#>       jsonlite   1487.834   2743.500   3248.0424   2994.1465   3270.959   8380.876    30
#>       RcppJson    186.876    262.001    438.1298    345.3130    468.042   2335.417    30
#>  RcppJson_file    136.292    224.001    334.5173    301.6465    409.376    688.001    30
#>    RcppJson_id    138.459    177.876    287.7714    263.3965    362.792    586.750    30

Created on 2022-01-05 by the reprex package (v2.0.1)
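
The "download metadata only once" idea could look roughly like the sketch below: parse the response text a single time and read both the dataset id and the per-file ids from the same object. The JSON string is a made-up stand-in for what httr::content(r, as = "text") returns, and the nesting under latestVersion/files/dataFile is an assumption for illustration, not the confirmed response layout. The RcppSimdJson step is optional and only runs if the package is installed.

```r
library(jsonlite)

# Made-up stand-in for the response text; the field layout is an assumption.
txt <- '{"status":"OK","data":{"id":42,"latestVersion":{"files":[
  {"label":"a.tab","dataFile":{"id":101}},
  {"label":"b.tab","dataFile":{"id":102}}]}}}'

# Parse once, keeping nested lists instead of data frames.
meta <- jsonlite::fromJSON(txt, simplifyDataFrame = FALSE)$data

# Both lookups reuse the already-parsed object, so nothing is fetched twice.
dataset_id <- meta[["id"]]
file_ids <- vapply(
  meta$latestVersion$files,
  function(f) f$dataFile$id,
  numeric(1)
)

# Optional: pull a single field from the same text with a JSON Pointer query.
if (requireNamespace("RcppSimdJson", quietly = TRUE)) {
  dataset_id_simd <- RcppSimdJson::fparse(txt, query = "/data/id")
}
```

In the package, this would mean passing the parsed metadata (or the raw text) between helpers rather than having each one issue its own request.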
