Creating vector embeddings

Overview

To use Aerospike Vector Search (AVS), you must build an application that generates vector embeddings. This page outlines some general approaches for generating vector embeddings using Python and a machine learning model.

Generate embeddings using a hosted service

Using a hosted embedding model from a provider such as OpenAI offers ease of use, quick deployment, and access to current models without significant infrastructure investment. The provider handles scaling, updates, and maintenance, so your organization can focus on application development rather than on managing the underlying model infrastructure.

  1. Install the OpenAI Python client library if you haven't already:

    pip install openai
  2. Use the following Python code to generate a vector embedding:

    # Requires openai>=1.0, which uses a client object instead of module-level calls
    from openai import OpenAI

    # Create a client with your OpenAI API key
    client = OpenAI(api_key="your-api-key-here")

    # Define the text chunk for which you want to generate an embedding
    text_chunk = "OpenAI's GPT-4 is a powerful language model capable of performing a wide range of natural language processing tasks."

    # Generate the embedding
    response = client.embeddings.create(
        input=text_chunk,
        model="text-embedding-ada-002",
    )

    # Extract the embedding vector (a list of floats)
    embedding_vector = response.data[0].embedding

    # Print the embedding vector
    print(embedding_vector)
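
Hard-coding an API key in source code makes it easy to leak. As a minimal sketch, assuming the key has been exported in an environment variable named `OPENAI_API_KEY` (the helper name `load_api_key` is hypothetical and not part of any library), the key can be read at startup instead:

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable before running this script")
    return api_key


# Example (commented out so the snippet does not fail when the variable is unset):
# api_key = load_api_key()
```

The returned string can be used wherever the example above expects the API key.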

Self-host an open-source model

Self-hosting a machine learning model gives you more control over data privacy, security, and the runtime environment, which can make it easier to comply with regulatory requirements and optimize performance. For high-usage workloads it can also be more cost-effective, and it removes the dependency on a third-party provider and the network round trip to a remote API.

The following example shows how you can generate a vector embedding from a chunk of text using the LLaMA model. You can use the generated vector for various downstream tasks such as similarity searches and other vector computations.

  1. Install the required libraries:

    pip install transformers
    pip install torch
  2. Use the following Python code to generate a vector embedding:

    from transformers import AutoTokenizer, AutoModel
    import torch

    # Load the LLaMA model and tokenizer
    model_name = "facebook/llama-7b"  # Placeholder: replace with the model ID of a LLaMA checkpoint you have access to
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Define the text chunk for which you want to generate an embedding
    text_chunk = "LLaMA is a powerful language model capable of performing a wide range of natural language processing tasks."

    # Tokenize the text chunk
    inputs = tokenizer(text_chunk, return_tensors="pt")

    # Generate the embeddings
    with torch.no_grad():
        outputs = model(**inputs)

    # The per-token embeddings are typically in the 'last_hidden_state' tensor
    embeddings = outputs.last_hidden_state

    # Average the token embeddings to get a single vector representation
    embedding_vector = torch.mean(embeddings, dim=1).squeeze().numpy()

    # Print the embedding vector
    print(embedding_vector)
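
Similarity search compares vectors using a distance or similarity measure. As a minimal sketch, assuming cosine similarity is the measure of interest (the function name `cosine_similarity` and the sample vectors are illustrative and are not part of AVS), two embedding vectors can be compared in plain Python:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative 3-dimensional vectors; real embeddings have far more dimensions
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: no shared direction
```

The function accepts any sequence of numbers, so an `embedding_vector` produced by either example on this page can be passed in directly.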

Additional resources

  1. Hugging Face Model Hub: A comprehensive repository of pre-trained models for various natural language processing (NLP) tasks, computer vision, and more.

  2. TensorFlow Hub: A library of reusable machine learning models for TensorFlow, offering models for text, image, and audio processing.

  3. PyTorch Hub: A repository of pre-trained PyTorch models, facilitating easy integration and deployment for various machine learning tasks.

  4. Kaggle: An online community for data scientists and machine learning practitioners, providing datasets, notebooks, and pre-trained models.

  5. GitHub: A vast platform hosting a multitude of open-source projects, including repositories for machine learning models, tools, and frameworks.

  6. OpenVINO Model Zoo: A collection of pre-trained models optimized for Intel hardware, supporting various AI tasks.

  7. Model zoos for Caffe, TensorFlow, PyTorch, and MXNet: Collections of pre-trained models specific to different deep learning frameworks, offering models for diverse applications.

  8. NVIDIA NGC: A platform offering GPU-optimized deep learning frameworks and pre-trained models for various AI tasks.