Description
For dataset retrieval, we download and parse JSON metadata multiple times.
For example, in get_dataframe_by_name, get_fileid.character first finds the dataset id via
Line 29 in 4775a92
jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))[["data"]][["id"]]
and then retrieves the list of ids for each file in the dataset at
dataverse-client-r/R/get_dataset.R
Line 101 in 4775a92
out <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"), simplifyDataFrame = FALSE)$data
It turns out the time this takes is non-trivial. Almost all of it is spent downloading the JSON from the URL; only a small fraction (< 1%) goes to parsing the JSON itself. We could make a minor improvement in speed by switching to a faster parser, RcppSimdJson (https://github.com/eddelbuettel/rcppsimdjson), which is about 2-10x faster at parsing in my tests, per below. The current jsonlite::fromJSON seems well suited to data science pipelines where we work with the data itself, but here we are only interested in bits of metadata. A much larger speedup would come from downloading the metadata only once.
Switching packages will require changes in at least 20 places where jsonlite is used.
library(jsonlite) # currently used
library(RcppSimdJson) # potential replacement
# sample: https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O
js_url <- "https://demo.dataverse.org/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.70122/FK2/PPIAXE"
# download once
tmp <- tempfile()
download.file(js_url, tmp)
microbenchmark::microbenchmark(
statusquo = jsonlite::fromJSON(js_url), # what is currently being called
dl = curl::curl_download(js_url, tempfile()), # separating download from parsing
jsonlite = jsonlite::fromJSON(tmp), # parsing, without download
RcppJson = RcppSimdJson::fload(tmp), # replace with Rcpp
RcppJson_file = RcppSimdJson::fload(tmp, query = "/datasetVersion/files"), # only files data
RcppJson_id = RcppSimdJson::fload(tmp, query = "/id"), # stop at dataset /id
times = 30
)
#> Unit: microseconds
#>           expr        min         lq        mean      median         uq        max neval
#>      statusquo 365097.709 371235.626 374774.8021 373752.4175 378357.084 387006.459    30
#>             dl 361154.168 364100.750 369091.1201 369528.3965 371835.459 378629.667    30
#>       jsonlite   1487.834   2743.500   3248.0424   2994.1465   3270.959   8380.876    30
#>       RcppJson    186.876    262.001    438.1298    345.3130    468.042   2335.417    30
#>  RcppJson_file    136.292    224.001    334.5173    301.6465    409.376    688.001    30
#>    RcppJson_id    138.459    177.876    287.7714    263.3965    362.792    586.750    30
Created on 2022-01-05 by the reprex package (v2.0.1)
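
Since the download dominates the timings, the "download metadata only once" idea could be sketched roughly as follows. This is only an illustration under assumptions, not how the package works today: metadata_cache, get_metadata_cached, and the injectable fetch argument are hypothetical names, and the JSON payload is a made-up toy that merely mimics the [["data"]][["id"]] access seen in the package code.

```r
library(jsonlite)

# Hypothetical in-memory cache keyed by URL (not part of the package).
metadata_cache <- new.env(parent = emptyenv())

# Hypothetical helper: download and parse the metadata at `url` at most once.
# `fetch` is injectable so the sketch can be exercised without network access;
# the default simply reads the URL as text.
get_metadata_cached <- function(
    url,
    fetch = function(u) paste(readLines(u, warn = FALSE), collapse = "\n")) {
  if (!exists(url, envir = metadata_cache, inherits = FALSE)) {
    parsed <- jsonlite::fromJSON(fetch(url), simplifyDataFrame = FALSE)
    assign(url, parsed, envir = metadata_cache)
  }
  get(url, envir = metadata_cache, inherits = FALSE)
}

# Offline demonstration with a stand-in fetcher that counts its calls.
# The JSON is a toy payload, not the real API response shape.
n_downloads <- 0
fake_fetch <- function(u) {
  n_downloads <<- n_downloads + 1
  '{"data": {"id": 42, "files": [{"id": 1, "label": "a.tab"}, {"id": 2, "label": "b.tab"}]}}'
}

demo_url <- "https://example.org/hypothetical-dataset"

# First lookup triggers the (fake) download; the second reuses the cached parse.
dataset_id <- get_metadata_cached(demo_url, fetch = fake_fetch)[["data"]][["id"]]
file_ids <- vapply(
  get_metadata_cached(demo_url, fetch = fake_fetch)[["data"]][["files"]],
  function(f) f[["id"]],
  numeric(1)
)
```

Both lookups share one parsed object, so the dataset id and the per-file ids come from a single download. A real implementation would also need to decide when cached metadata becomes stale (for example, per session or per dataset version), which this sketch does not address.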