This repository was archived by the owner on Mar 20, 2024. It is now read-only.

Add UTF-8 support in data (lookup tables) #95

@reinholdsson

Need to find a solution for handling non-ASCII characters such as å, ä and ö in lookup tables. See the example below:

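library(Coldbir)
library(data.table)  # provides data.table()

a <- cdb()  # create a Coldbir database object
dt <- data.table(
    x = c('a', 'b', 'a', 'o', 'a', 'o', 'o'),  # ASCII only
    y = c('a', 'b', 'å', 'ö', 'a', 'ö', 'ö')   # contains non-ASCII characters
)
a[] <- dt  # write the table to the database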
# Warning message:
# In `[.data.table`(y, xkey, nomatch = ifelse(all.x, NA, 0), allow.cartesian = allow.cartesian) :
#   A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes 
# currently, so doesn't support *mixed* encodings well; i.e., using both latin1 and UTF-8, or if any 
# unknown encodings are non-ascii and some of those are marked known and others not. But if either 
# latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. 
# In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this 
# without impacting performance for ascii-only cases.

a[]
#    x y
#1: a a
#2: b b
#3: a  
#4: o  
#5: a a
#6: o  
#7: o  
# Warning message:
# In `levels<-`(`*tmp*`, value = c("a", "b", "", "")) :
#   duplicated levels in factors are deprecated
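
The warning above is about mixed encodings, so one possible direction is to normalise every character column to UTF-8 before it is written. The following is only a sketch in base R, not Coldbir's implementation; the variable names are made up for illustration, and the latin1 copy is built with `iconv()` to simulate mixed input.

```r
# Hypothetical data: the same letter stored once as UTF-8 and once as latin1.
a_ring_utf8   <- "\u00e5"
a_ring_latin1 <- iconv(a_ring_utf8, from = "UTF-8", to = "latin1")

mixed <- c("a", a_ring_utf8, a_ring_latin1)

# Encoding() shows the declared encoding mark of each element;
# pure-ASCII strings are never marked.
marks <- Encoding(mixed)

# The two non-ASCII elements are the same letter but different bytes,
# which is what trips up a byte-wise comparison.
bytes_differ <- !identical(charToRaw(mixed[2]), charToRaw(mixed[3]))

# enc2utf8() converts every element to UTF-8, so equal letters
# end up as equal bytes.
normalized <- enc2utf8(mixed)
bytes_equal_after <- identical(charToRaw(normalized[2]), charToRaw(normalized[3]))
```

Whether Coldbir should do this on write, or require UTF-8 input from the caller, is a design choice this sketch does not settle.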

lookup.txt for variable y:

1       a
2       b
3
4
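
The empty values for keys 3 and 4 suggest that the non-ASCII labels are lost when the lookup file is written or read. As a hedged illustration only, the base R sketch below writes a tab-separated lookup file as UTF-8 bytes and reads it back with the labels intact. The file layout and the names `keys`, `values` and `path` are assumptions made for this example, not Coldbir's actual file handling.

```r
# Hypothetical lookup table mirroring variable y from the issue.
keys   <- 1:4
values <- enc2utf8(c("a", "b", "\u00e5", "\u00f6"))

path <- tempfile(fileext = ".txt")

# useBytes = TRUE writes the strings byte for byte (here UTF-8)
# instead of re-encoding them to the native locale.
writeLines(paste(keys, values, sep = "\t"), path, useBytes = TRUE)

# encoding = "UTF-8" marks the lines as UTF-8 without converting them.
lines <- readLines(path, encoding = "UTF-8")

# Split each line at the tab and keep the label part.
parts       <- strsplit(lines, "\t", fixed = TRUE)
back_keys   <- as.integer(vapply(parts, `[`, character(1), 1))
back_values <- vapply(parts, `[`, character(1), 2)
```

Reading and writing bytes explicitly keeps the result independent of the session locale, which may matter when the same database is used on machines with different default encodings.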
