Thursday, December 7, 2023

What is an embedding? What is a token? How to generate embeddings with Azure OpenAI

     


An embedding is a way to represent data for vector databases, machine learning models, and algorithms. Its format is similar to spatial data; rather than a latitude and a longitude, we have a vector of many floating-point numbers representing the data. Just like spatial data, it is not really readable by users. Here is an example of an embedding. The following numbers represent the text "Hello World!"


It takes 1536 floating-point numbers to represent "Hello World!". Now you may be wondering whether the number of floating points increases with the size of the text. How many floating points do I need for "This is not my first rodeo. I have season tickets."?



As you can see, the length of the text does not change the number of floating points. This is good news: with this model, the embedding represents any given input with 1536 floating-point numbers. You need to tell Azure OpenAI which model to use to generate the embedding, and the model you select determines the limit on the input text. For example, I used the recommended text-embedding-ada-002 (Version 2) model to generate this embedding. The current input limit for this model is 8192 tokens. What is a token? A token can be a whole word or a chunk of characters, depending on the words in the text. Here are some helpful rules of thumb for tokens from OpenAI.

1 token ~= 4 chars in English
1 token ~= 3/4 words
100 tokens ~= 75 words
1-2 sentences ~= 30 tokens
1 paragraph ~= 100 tokens
1500 words ~= 2048 tokens
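
To make these rules of thumb concrete, here is a minimal sketch that estimates a token count from the character count and compares it with the 8192-token limit mentioned above. The class and method names (TokenEstimate, EstimateTokens, LikelyFits) are made up for this illustration, and the "4 characters per token" rule is only an approximation; the real count depends on the tokenizer, so treat the result as a rough pre-check.

```csharp
using System;

public static class TokenEstimate
{
    // Input limit quoted in this post for text-embedding-ada-002.
    public const int MaxInputTokens = 8192;

    // Rule of thumb only: roughly 4 characters per token in English text.
    // This is NOT the real tokenizer; it gives a ballpark figure.
    public static int EstimateTokens(string text) =>
        (int)Math.Ceiling(text.Length / 4.0);

    // True when the estimate is at or below the model's input limit.
    public static bool LikelyFits(string text) =>
        EstimateTokens(text) <= MaxInputTokens;

    public static void Main()
    {
        var text = "This is not my first rodeo. I have season tickets.";
        Console.WriteLine($"Estimated tokens: {EstimateTokens(text)}");
        Console.WriteLine($"Likely fits the limit: {LikelyFits(text)}");
    }
}
```

If the estimate is close to the limit, it is safer to split the text into smaller pieces than to rely on this approximation.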
  
If you want to run a vector search in a database like Azure Cosmos DB for MongoDB vCore, you must first convert your search parameter into an embedding/vector. To generate embeddings, you need an Azure OpenAI resource. Then, you need to deploy the text-embedding-ada-002 model. The following screenshot shows you my deployed models. You must wait 5 to 10 minutes after the deployment before you can use a new model.
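
Why does the search parameter have to be an embedding? Because a vector search compares vectors with each other. One common way to compare two embeddings is cosine similarity, sketched below. This is only an illustration of the idea with tiny made-up vectors; the database does the comparison internally, and the metric it uses depends on how the vector index is configured. The VectorMath class and its method name are my own for this example, not part of any SDK.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
    // Assumes neither vector is all zeros (that would divide by zero).
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    public static void Main()
    {
        // Toy 3-dimensional vectors; real embeddings here have 1536 dimensions.
        var query = new float[] { 1f, 0f, 0f };
        var docA = new float[] { 2f, 0f, 0f };  // same direction as the query
        var docB = new float[] { 0f, 1f, 0f };  // orthogonal to the query

        Console.WriteLine(CosineSimilarity(query, docA));
        Console.WriteLine(CosineSimilarity(query, docB));
    }
}
```

The document whose embedding scores highest against the query embedding is the closest match.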


Next, you need the Azure.AI.OpenAI NuGet package, version 1.0.0-beta9 or later. Check the "Include prerelease" checkbox to find it in Visual Studio.


You will need the endpoint URL and the key. You can find them in the Azure Portal or under the Sample Code link in Azure AI Studio.



  1. Use the endpoint and the key when you declare the OpenAIClient.
  2. DeploymentName is the name you gave your text-embedding-ada-002 model deployment.
  3. Input is the text you want to convert into an embedding/vector.

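using System;
using Azure;
using Azure.AI.OpenAI;

// Create the client with the endpoint and the key from the Azure Portal.
var client = new OpenAIClient(new Uri("endpoint goes here"),
                              new AzureKeyCredential("key goes here"));

// DeploymentName is the name of your text-embedding-ada-002 deployment.
var options = new EmbeddingsOptions()
{
    DeploymentName = "embedding",
    Input = { "Generate this text into embedding" }
};

// Request the embedding and print each floating-point value.
var response = await client.GetEmbeddingsAsync(options);
foreach (var item in response.Value.Data[0].Embedding.ToArray())
{
    Console.WriteLine(item);
}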
