Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/collate: Compare and CompareString broken in "ka-shifted" collation mode #68166

Open
danderson opened this issue Jun 25, 2024 · 6 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@danderson
Copy link
Contributor

Go version

go version go1.22.3 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/dave/.cache/go-build'
GOENV='/home/dave/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dave/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dave/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.3'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/dave/hack/psl/tools/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1869424909=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Tried to compare strings with the "shifted" variable weighting option, which you can select with e.g. language tag en-u-ka-shifted.

Demo: https://go.dev/play/p/afAOLzqh3Oe

What did you see happen?

99% of Compare() and CompareString() calls on different strings return 0.

What did you expect to see?

Comparison results that match the configured collation, or at least a courtesy panic letting me know it's known broken (spoiler alert...)

@gopherbot gopherbot added this to the Unreleased milestone Jun 25, 2024
@gabyhelp
Copy link

Similar Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@danderson
Copy link
Contributor Author

Pretty simple bug: https://cs.opensource.google/go/x/text/+/refs/tags/v0.16.0:collate/collate.go;l=166 . When the collator is in the shifted alternate mode, compare() silently skips the entire primary level, and thus the comparisons are == 0 always when comparing ascii. They're occasionally != 0 if you get lucky and the secondary/tertiary levels have tie-breakers.

Looking at the history, x/text/collate seems unmaintained (last change that wasn't fmts or Unicode upgrades was 2016 :( ). Given this, my suggestion tactically would be to rip out the custom codepaths for Compare/CompareString, and have them generate and bytes.Compare full collation keys. It's less efficient for one-off compares, but I think it would be correct, and a future maintainer can always make it faster again since it's all internal package gubbins.

@ianlancetaylor
Copy link
Contributor

CC @mpvl

@joedian joedian added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 25, 2024
@danderson
Copy link
Contributor Author

I hacked up a more comprehensive debugging tool: https://go.dev/play/p/GcsuboreVW8 . For a bunch of comparisons and collations, it prints:

  • The result of bytes.Compare() over the collation keys obtained from coll.KeyFromString()
  • The result of coll.CompareString()
  • The raw collation keys, formatted to line up nicely and highlight where the different collation levels are.

There's also main() in there suitable for running from the CLI as well.

What I reported above can be seen in the FULL vs. ITER of collations en and en-u-ka-shifted: the latter reports all inputs as equivalent.

Further weirdness I discovered while writing the tool:

  • If you also request numeric sorting, with en-u-kn-true-u-ka-shifted, this somehow fixes iterative collation to match key compares??
  • ... but only if you request numeric collation before shifted variable weights. If you use en-u-ka-shifted-u-kn-true, iterative collation continues to be wrong.
  • Additionally, if you request ka=shifted before kn=true this breaks numeric sorting for both the sort keys and iterative compare.

Summary:

  • en and en-u-kn-true-u-ka-shifted seem to work as specified by Unicode.
  • en-u-ka-shifted and en-u-ka-shifted-u-kn-true silently breaks iterative comparison
  • en-u-ka-shifted-u-kn-true additionally silently doesn't do numeric sorting

I have no idea where these new behaviors come from, I haven't paged in enough of the code to understand what's happening.

@danderson
Copy link
Contributor Author

Looking more closely at UTS 35, the multi-key language tags above are not well formed because multiple -u- extensions is ill-formed (I would have hoped that language.Parse would tell me :( ), although based on the lang.String() output it looks like the parser is graciously accepting the ill-formed inputs.

I tried collations en-u-ka-shifted-kn-true, en-u-ka-shifted-kn, en-u-ka-shifted-u-kn-true, en-u-ka-shifted-u-kn, en-u-kn-ka-shifted, en-u-kn-true-ka-shifted, en-u-kn-true-u-ka-shifted, all of which are trying to request the same behavior (ka=shifted, kn=true) in various degrees of ill-formed or non-canonicality (the canonical tag is en-u-ka-shifted-kn). All of them break both iterative comparison and numeric sorting, except for the ill-formed and non-canonical en-u-kn-true-u-ka-shifted, where both comparison styles are correct and both use numeric sorting.

@danderson
Copy link
Contributor Author

Correction: en-u-kn-true-u-ka-shifted is also broken, it produces a numeric sort but fails to ignore punctuation. It's hard to see until you start sorting whole slices with these comparators, but afaict there is just no way to enable kn=true and ka=shifted at the same time, even if you avoid Compare/CompareString and use collation keys explicitly, you can only get kn=true xor ka=shifted no matter how you phrase your request.

Given that numeric sorting is a wrapper around the standard weight, I'm assuming that it lacks awareness of ka=shifted and so breaks the weight adjustments that need to happen, but again I haven't paged in enough of the code to be sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants