Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected result using like clause with fts index #33

Open
HowePa opened this issue Jun 4, 2024 · 11 comments
Open

unexpected result using like clause with fts index #33

HowePa opened this issue Jun 4, 2024 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@HowePa
Copy link

HowePa commented Jun 4, 2024

Describe the unexpected behaviour
When I filter String using like clause, MyScale gets different results.

How to reproduce
version: myscale/myscaledb:1.5
case 1: filter with fts
image

case 2: filter without fts
image

Expected behavior
The results need to be uniform.

@lqhl
Copy link
Contributor

lqhl commented Jun 4, 2024

Currently, ClickHouse built-in functions only return the same results as without a Full-Text Search (FTS) index when using the raw tokenizer. We will update our documentation to reflect this.

We have not discovered a method to achieve consistent results with other tokenizers. Perhaps we should consider disabling FTS indexing with other tokenizers in built-in functions? @MochiXu @chtlp

@HowePa
Copy link
Author

HowePa commented Jun 5, 2024

While result from testsearch is correct. Maybe it caused by bad query_text passed to Tantivy when use fts skip index? 😵
image

@chtlp
Copy link
Contributor

chtlp commented Jun 5, 2024

Thank you for bringing this issue to our attention @HowePa ! textSearch is the preferred way to use FTS index. LIKE relies on regex matching, which is messed up with tokenizers other than raw.

We will resolve it by disabling the FTS index for the like, notLike, startWith, and endsWith functions when used with other tokenizers. Additionally, we will update the documentation to reflect this change. While this may reduce some performance, it will ensure consistent results.

In the meantime, you can still use textSearch, hasToken, multiSearchAny, and other string functions on FTS with any tokenizer.

@HowePa
Copy link
Author

HowePa commented Jun 5, 2024

It is true that different tokenizers may lead to inconsistent results.
However, when I do regex match on single term, only the first term of each doc fail.
image

@MochiXu
Copy link
Collaborator

MochiXu commented Jun 6, 2024

@HowePa The issue you're encountering is quite specific. To address this, we first need to understand the behavior of the Tantivy tokenizer. By default, the Tantivy tokenizer converts all text to lowercase. This means that in the index file, 'Ancient' will be stored as 'ancient'. As a result, when performing a '%Ancient%' match, nothing will be found.

If you want Tantivy to be case-sensitive during tokenization, you need to configure the tokenizer with the case_sensitive parameter. You can do this by executing the following SQL commands:

ALTER TABLE simple_table ADD INDEX doc_idx doc TYPE fts('{"doc":{"tokenizer":{"type":"simple","case_sensitive":true}}}') GRANULARITY 1;

ALTER TABLE simple_table MATERIALIZE INDEX doc_idx;

This will ensure that the tokenizer respects the case sensitivity of the text.

@HowePa
Copy link
Author

HowePa commented Jun 6, 2024

Forgive me if I'm a little wordy @MochiXu . According to your description, if I set case_sensitive=false, then I can use 'ancient' to search, but the result is still fail.
image

Besides, why case_sensitive=false != case_insensitive=true ?

@HowePa
Copy link
Author

HowePa commented Jun 6, 2024

Well, I can use ilike to avoid this problem, but it's a little strange using like with case_sensitive=false ...
image

@MochiXu
Copy link
Collaborator

MochiXu commented Jun 6, 2024

If you would like to gain a deeper understanding of how Tantivy indexing works, it would be beneficial to first learn about ClickHouse's Skip Index, as Tantivy is integrated based on SkipIndex. For example, if there are 10 granules in a table, without using the Tantivy index, ClickHouse would need to access all the granules to perform a search. However, when using the Tantivy index, ClickHouse will only access a subset of granules when executing the LIKE function. These granules are those that Tantivy identifies as matching the search criteria. The role of the Tantivy index is to filter these granules, preventing ClickHouse from accessing unnecessary ones.

Regarding your earlier question, when you set case_sensitive to false, the Tantivy tokenizer converts all characters to lowercase during tokenization. Consequently, the string indexed by the Tantivy tokenizer becomes 'ancient'. The LIKE function you use will match this string, so the granule containing row 0 will be marked as a hit. However, when ClickHouse retrieves all data within this granule, it determines that '%ancient%' does not match 'Ancient', resulting in an empty return.

@HowePa
Copy link
Author

HowePa commented Jun 6, 2024

So, when I want to do textSearch with case-insensitive, and to match with case-sensitive, unexpected results happened.
Is there any way to do regex match using textSearch, or I can only do regex match using like '%XXX%' settings using_skip_indexes = 0? @MochiXu

@MochiXu
Copy link
Collaborator

MochiXu commented Jun 6, 2024

Currently, there isn't an effective solution to achieve this, unless you configure a case-sensitive tokenizer. We plan to optimize the search logic for functions like LIKE in the future. Specifically, regarding your mention of Tantivy using case-insensitive tokenization while the LIKE function performs case-sensitive '%XXX%' searches, we will convert the regex matches to lowercase within Tantivy. This approach will help avoid filtering out the desired granules, thereby allowing ClickHouse to return the expected results. Additionally, we are considering adding a query setting option to enable users to decide whether to activate the FTS index during searches.

@HowePa
Copy link
Author

HowePa commented Jun 6, 2024

Many thanks for your patience. 👍

@lqhl lqhl added the bug Something isn't working label Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants