Dumpling: support specifying the character set of the output files #54217

Open
kennytm opened this issue Jun 26, 2024 · 1 comment
Labels
component/dumpling This is related to Dumpling of TiDB. found/customer Customers have encountered this bug. type/feature-request This is a feature requests on the product

kennytm commented Jun 26, 2024

Feature Request

Is your feature request related to a problem? Please describe:

Currently Dumpling assumes the output is always utf8mb4. A user who wants the result in a different character set such as GBK must either convert the files with iconv or change @@character_set_results. The former requires twice the storage space, while the latter may cause unexpected transformations such as the insertion of unwanted \ and ".

Describe the feature you'd like:

Add a flag that specifies the output encoding (naming and options follow that of Lightning):

  • --data-character-set=«encoding», should support:
    • utf8mb4
    • gb18030
    • (optional) gbk
    • (optional) latin1
  • (optional) --schema-character-set=«encoding», defaulting to the value of --data-character-set if unspecified
  • (optional) --data-invalid-char-replace='?', the replacement to emit for characters outside the specified charset.
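
To make the intended semantics of the replacement option concrete, here is a minimal Python sketch of "encode, and substitute a fixed string for anything the target charset cannot represent". It is only an illustration of the behavior, not Dumpling code; the helper names `register_replacement` and `encode_for_dump` are hypothetical.

```python
import codecs


def register_replacement(replacement: str) -> str:
    """Register a codec error handler that substitutes `replacement`.

    Returns the handler name to pass as `errors=` when encoding.
    (Hypothetical helper, for illustration only.)
    """
    name = "dump-replace:" + replacement

    def handler(err: UnicodeError):
        if not isinstance(err, UnicodeEncodeError):
            raise err
        # Emit one replacement per unencodable character, then resume
        # encoding right after the offending run.
        return replacement * (err.end - err.start), err.end

    codecs.register_error(name, handler)
    return name


def encode_for_dump(text: str, charset: str, replacement: str = "?") -> bytes:
    """Encode `text`, replacing characters outside `charset`."""
    return text.encode(charset, errors=register_replacement(replacement))


# An emoji is outside GBK, so it is replaced; GB18030 covers all of
# Unicode, so nothing needs replacing there.
gbk_bytes = encode_for_dump("a😀b", "gbk")
gb18030_bytes = encode_for_dump("a😀b", "gb18030")
```

Note that with gb18030 the replacement never triggers, which is one reason the option only matters for the narrower charsets in the list.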

Both CSV and SQL output formats should be supported, and the output should round-trip with the corresponding Lightning settings (including cases with a custom CSV separator, terminator, etc.).
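
As a rough illustration of what "round-trip" would have to guarantee with a custom separator, the Python sketch below writes CSV with `|` as the separator, encodes it as GB18030, and reads it back. It assumes the reader decodes before splitting; `dump_csv` and `load_csv` are hypothetical names and this is not how Dumpling or Lightning are implemented.

```python
import csv
import io


def dump_csv(rows, charset, delimiter=","):
    """Serialize rows as CSV text, then encode the whole text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue().encode(charset)


def load_csv(data, charset, delimiter=","):
    """Decode first, then split into fields on the decoded text."""
    text = data.decode(charset)
    return [row for row in csv.reader(io.StringIO(text, newline=""),
                                      delimiter=delimiter)]


# Pick any CJK character whose GBK encoding ends in 0x7C, the byte for "|".
tricky = next(
    ch for ch in map(chr, range(0x4E00, 0x9FA6))
    if ch.encode("gbk", "ignore")[-1:] == b"|"
)

rows = [["1", tricky], ["2", "plain"]]
data = dump_csv(rows, "gb18030", delimiter="|")
restored = load_csv(data, "gb18030", delimiter="|")

# Splitting the raw bytes on the separator sees a spurious extra field.
naive_fields = data.split(b"\n")[0].split(b"|")
```

The design point is that separators and terminators have to be matched on decoded characters (or with charset-aware scanning), because in GBK and GB18030 an ASCII-range byte can be the second byte of a multi-byte character.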

Describe alternatives you've considered:

Do nothing: document that Dumpling can only output utf8mb4, and require users to perform the conversion with iconv.
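
For reference, the post-processing step this alternative pushes onto users amounts to a streaming transcode of every dumped file. A minimal Python sketch of that step follows; the function name `transcode` and its parameters are hypothetical, and it stands in for what iconv would do.

```python
import codecs
import io


def transcode(src, dst, src_charset="utf-8", dst_charset="gb18030",
              chunk_size=65536):
    """Stream bytes from `src` to `dst`, converting the character set.

    Incremental codecs keep multi-byte sequences intact even when a
    chunk boundary falls in the middle of a character.
    """
    decoder = codecs.getincrementaldecoder(src_charset)()
    encoder = codecs.getincrementalencoder(dst_charset)()
    while chunk := src.read(chunk_size):
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush any state left in the decoder and encoder.
    dst.write(encoder.encode(decoder.decode(b"", final=True), final=True))


statement = "INSERT INTO test_table VALUES (1, '玕');\n"
source = io.BytesIO(statement.encode("utf-8"))
target = io.BytesIO()
transcode(source, target, chunk_size=1)  # tiny chunks to exercise boundaries
```

Because the converted copy is written alongside the original dump, both exist on disk at once, which is the 2x storage cost mentioned in the problem description.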

Teachability, Documentation, Adoption, Migration Strategy:

kennytm commented Jun 26, 2024

changing @@character_set_results ... may cause unexpected transformation like insertion of unwanted \ and ".

and to clarify, this is the PoC:

CREATE TABLE test_table (
  id INT PRIMARY KEY,
  value VARCHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
);
INSERT INTO test_table (id, value) VALUES (1, '玕');

tiup dumpling:v7.1.5 \
  --sql 'select * from test.test_table' \
  --params 'character_set_client=gbk,character_set_connection=gbk,character_set_results=gbk' ...

If you check the produced file in GBK encoding, you will see the value 玕\ (note the extra backslash). The GBK encoding of 玕 is AB 5C, so --escape-backslash escapes the 5C in the middle of this multi-byte character.
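
The mechanism can be reproduced outside Dumpling with a few lines of Python. This is only a model of the behavior described above, under the assumption that the escaping runs over already-encoded GBK bytes; the function names are hypothetical.

```python
def escape_bytes(raw: bytes) -> bytes:
    """Charset-unaware escaping: doubles every 0x5C byte it sees."""
    return raw.replace(b"\\", b"\\\\")


def escape_then_encode(text: str, charset: str) -> bytes:
    """Charset-aware order: escape real backslashes, then encode."""
    return text.replace("\\", "\\\\").encode(charset)


value = "玕"
gbk = value.encode("gbk")                   # AB 5C, as stated in the issue

broken = escape_bytes(gbk)                  # AB 5C 5C
correct = escape_then_encode(value, "gbk")  # AB 5C, nothing to escape

# Reading the broken bytes back as GBK shows the stray backslash.
seen_by_reader = broken.decode("gbk")
```

Escaping on the utf8mb4 text and encoding afterwards avoids the problem, which is what a dedicated output-encoding flag would make possible.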
