Update Embeddings.ipynb to show output_dimensionality parameter. #82

Open · wants to merge 6 commits into main
149 changes: 111 additions & 38 deletions quickstarts/Embeddings.ipynb
@@ -35,18 +35,18 @@
},
shilpakancharla marked this conversation as resolved.
@markmcd (Member) commented on Apr 19, 2024

Line #12.        output_dimensionality=10)

It sounds like this change is to add this parameter but it's hidden in this code snippet comparing task types without explanation.

Maybe add a section after this block with a short note? e.g.:

## Truncating embeddings

The text-embedding-004 model also supports lower embedding dimensions. Specify output_dimensionality to truncate the output.


```python
result1 = genai.embed_content(
    model="models/text-embedding-004",
    content="Hello world")

result2 = genai.embed_content(
    model="models/text-embedding-004",
    content="Hello world",
    output_dimensionality=10)

(len(result1['embedding']), len(result2['embedding']))
```

Can we talk about the relationship between the index and specificity? It'd be great to add a statement like "When using text-embedding-004, each dimension adds diminishing value, so truncating may be effective in constrained environments." - but I haven't verified if this is true.


Reply via ReviewNB
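markmcd's diminishing-value point can at least be sketched locally without calling the API. The following is a hypothetical illustration (the vector values are made up, not real model output); it only demonstrates the mechanics of truncating an embedding and re-normalizing it, not the model's actual quality behavior:

```python
import math

# Hypothetical embedding values -- NOT real text-embedding-004 output.
embedding = [0.21, -0.14, 0.09, -0.05, 0.03, -0.02, 0.01, 0.005]

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Re-normalizing after truncation keeps cosine-similarity
    comparisons between truncated vectors meaningful.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

short = truncate_and_renormalize(embedding, 4)
print(len(short))                           # 4
print(round(sum(x * x for x in short), 6))  # 1.0
```

Whether the leading dimensions actually carry most of the information is a property of how the model was trained, so the claim in the comment above would still need verification.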

Collaborator (Author):
@markmcd I haven't verified it either but I think it would be a great add! @MarkDaoust what do you think?

@MarkDaoust (Contributor) commented on Apr 19, 2024

Use `output_dimensionality=4` here

Include a comment explaining it.


Reply via ReviewNB

Contributor:

(That way you wouldn't need the '... trimmed')
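A quick local illustration of that point (hypothetical values, no API call): with only four dimensions the full vector fits on one line, so the `'... TRIMMED'` string-slicing trick is no longer needed.

```python
# Hypothetical embedding values -- NOT real API output.
full_768d = [0.001 * i for i in range(768)]        # stand-in for a 768-d embedding
truncated_4d = [0.0131, -0.0087, -0.0467, 0.0007]  # stand-in for output_dimensionality=4

# A 768-d vector needs trimming to stay readable:
print(str(full_768d)[:50], '... TRIMMED]')

# A 4-d vector prints cleanly as-is:
print(truncated_4d)
```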

shilpakancharla marked this conversation as resolved.
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
"metadata": {
"id": "YD6urJjWGVDf"
},
"outputs": [],
"source": [
- "!pip install -U -q google.generativeai # Install the Python SDK"
+ "!pip install -q google-generativeai"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 3,
"metadata": {
"id": "yBapI259C99C"
},
@@ -68,7 +68,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 4,
"metadata": {
"id": "Zey3UiYGDDzU"
},
@@ -87,42 +87,51 @@
"source": [
"## Embed content\n",
"\n",
- "Call the `embed_content` method with the `models/embedding-001` model to generate text embeddings."
+ "Call the `embed_content` method with the `models/text-embedding-004` model to generate text embeddings."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 5,
"metadata": {
- "id": "J76TNa3QDwCc"
+ "id": "J76TNa3QDwCc",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "outputId": "ab2eaa5e-21b8-4ae9-db4a-a19ee008a175"
},
"outputs": [
{
- "name": "stdout",
"output_type": "stream",
+ "name": "stdout",
"text": [
- "[0.04703258, -0.040190056, -0.029026963, -0.026809 ... TRIMMED]\n"
+ "[0.013168523, -0.008711934, -0.046782676, 0.000699 ... TRIMMED]\n"
]
}
],
"source": [
"text = \"Hello world\"\n",
- "result = genai.embed_content(model=\"models/embedding-001\", content=text)\n",
+ "result = genai.embed_content(model=\"models/text-embedding-004\", content=text)\n",
shilpakancharla marked this conversation as resolved.
"\n",
"# Print just a part of the embedding to keep the output manageable\n",
"print(str(result['embedding'])[:50], '... TRIMMED]')"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
"metadata": {
- "id": "rU6XX33547Ll"
+ "id": "rU6XX33547Ll",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "182bec79-016c-46ed-8910-1010a8c765f1"
},
"outputs": [
{
- "name": "stdout",
"output_type": "stream",
+ "name": "stdout",
"text": [
"768\n"
]
@@ -145,24 +154,29 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
"metadata": {
- "id": "Hzz-7Heuf4tV"
+ "id": "Hzz-7Heuf4tV",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 69
+ },
+ "outputId": "42e1ae40-fb91-47fd-84c4-cc0e1a68d467"
},
"outputs": [
{
- "name": "stdout",
"output_type": "stream",
+ "name": "stdout",
"text": [
- "[-0.0002620658, -0.05592018, -0.012463195, -0.0206 ... TRIMMED]\n",
- "[-0.0151748555, -0.050790474, -0.032357067, -0.058 ... TRIMMED]\n",
- "[0.025271073, -0.064161226, -0.025818137, -0.00611 ... TRIMMED]\n"
+ "[-0.010632277, 0.019375855, 0.0209652, 0.000770642 ... TRIMMED]\n",
+ "[0.018467998, 0.0054281196, -0.017658804, 0.013859 ... TRIMMED]\n",
+ "[0.05808907, 0.020941721, -0.108728774, -0.0403925 ... TRIMMED]\n"
]
}
],
"source": [
"result = genai.embed_content(\n",
- " model=\"models/embedding-001\",\n",
+ " model=\"models/text-embedding-004\",\n",
" content=[\n",
" 'What is the meaning of life?',\n",
" 'How much wood would a woodchuck chuck?',\n",
Expand All @@ -178,7 +192,7 @@
"id": "sSKcLGIpo8yc"
},
"source": [
- "## Use `task_type` to provide a hint to the model how you'll use the embeddings"
+ "## Specify `task_type`"
]
},
{
@@ -187,12 +201,13 @@
"id": "bz0zq1_shk98"
},
"source": [
- "Let's look at all the parameters the `embed_content` method takes. There are four:\n",
+ "Let's look at all the parameters the `embed_content` method takes. There are five:\n",
"\n",
- "* `model`: Required. Must be `models/embedding-001`.\n",
+ "* `model`: Required. Must be `models/text-embedding-004` or `models/embedding-001`.\n",
"* `content`: Required. The content that you would like to embed.\n",
- "*`task_type`: Optional. The task type for which the embeddings will be used. See below for possible values.\n",
+ "*`task_type`: Optional. The task type for which the embeddings will be used.\n",
"* `title`: Optional. You should only set this parameter if your task type is `retrieval_document` (or `document`).\n",
+ "* `output_dimensionality`: Optional. Reduced dimension for the output embedding. If set, excessive values in the output embedding are truncated from the end. This is supported by `models/text-embedding-004`, but cannot be specified in `models/embedding-001`.\n",
"\n",
"`task_type` is an optional parameter that provides a hint to the API about how you intend to use the embeddings in your application.\n",
"\n",
@@ -203,38 +218,96 @@
"* `retrieval_document` (or `document`): The given text is a document from a corpus being searched. Optionally, also set the `title` parameter with the title of the document.\n",
"* `semantic_similarity` (or `similarity`): The given text will be used for Semantic Textual Similarity (STS).\n",
"* `classification`: The given text will be classified.\n",
- "* `clustering`: The embeddings will be used for clustering.\n"
+ "* `clustering`: The embeddings will be used for clustering.\n",
+ "* `question_answering`: The given text will be used for question answering.\n",
+ "* `fact_verification`: The given text will be used for fact verification."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
"metadata": {
- "id": "LFjMapMV91es"
+ "id": "LFjMapMV91es",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 52
+ },
+ "outputId": "8b534c70-b880-4614-aa90-b0b4b337d3d1"
},
"outputs": [
{
- "name": "stdout",
"output_type": "stream",
+ "name": "stdout",
"text": [
- "[0.04703258, -0.040190056, -0.029026963, -0.026809 ... TRIMMED]\n",
- "[0.05889487, -0.004501751, -0.067298084, -0.012740 ... TRIMMED]\n"
+ "[0.013168523, -0.008711934, -0.046782676, 0.00069968984]\n",
+ "[0.023399517, -0.00854715, -0.052534223, -0.012143112]\n"
]
}
],
"source": [
"# Notice the API returns different embeddings depending on `task_type`\n",
"result1 = genai.embed_content(\n",
- " model=\"models/embedding-001\",\n",
+ " model=\"models/text-embedding-004\",\n",
" content=\"Hello world\",\n",
" output_dimensionality=4) # Set output_dimensionality to truncate the dimensions of the embeddings.\n",
"\n",
"result2 = genai.embed_content(\n",
" model=\"models/text-embedding-004\",\n",
" content=\"Hello world\",\n",
" task_type=\"document\",\n",
" output_dimensionality=4)\n",
"\n",
"print(str(result1['embedding']))\n",
"print(str(result2['embedding']))"
]
},
{
"cell_type": "markdown",
"source": [
"## Truncating embeddings\n",
"\n",
"The `text-embedding-004` model also supports lower embedding dimensions. Specify `output_dimensionality` to truncate the output."
],
"metadata": {
"id": "r0r0dt958QQg"
}
},
{
"cell_type": "code",
"source": [
"result1 = genai.embed_content(\n",
" model=\"models/text-embedding-004\",\n",
" content=\"Hello world\")\n",
"\n",
"\n",
"result2 = genai.embed_content(\n",
- " model=\"models/embedding-001\",\n",
+ " model=\"models/text-embedding-004\",\n",
" content=\"Hello world\",\n",
- " task_type=\"document\",)\n",
+ " output_dimensionality=10)\n",
"\n",
"\n",
- "print(str(result1['embedding'])[:50], '... TRIMMED]')\n",
- "print(str(result2['embedding'])[:50], '... TRIMMED]')"
+ "(len(result1['embedding']), len(result2['embedding']))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "bX_AjfMx8PvV",
"outputId": "738afb36-ae11-4aae-a3be-047a098f9559"
},
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(768, 10)"
]
},
"metadata": {},
"execution_count": 10
}
]
},
{
@@ -265,8 +338,8 @@
],
"metadata": {
"colab": {
- "name": "Embeddings.ipynb",
- "toc_visible": true
+ "toc_visible": true,
+ "provenance": []
},
"kernelspec": {
"display_name": "Python 3",
@@ -275,4 +348,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}