Use Generative AI to Enhance your Product Catalog using only SQL

Using Vertex AI LLM natively in Google Cloud Spanner

Cloud Jake
Google Cloud - Community


Unlock AI for your product catalog without complex code. Vertex AI’s Large Language Model (LLM), integrated seamlessly within Google Cloud Spanner, empowers developers to leverage Generative AI using only familiar SQL queries.

Image source: Generative AI, prompt: “generating content for a catalog with AI”

In this article, we’ll explore how to democratize Generative AI tools by allowing developers with little to no AI experience to incorporate Generative AI into their commerce and marketing applications using only SQL.

We’ll establish a sample environment in Cloud Spanner, Google’s fully managed relational database service offering global consistency, high availability, and virtually unlimited scale. Then we’ll use Spanner’s native integration with Google’s text-bison Large Language Model (LLM) in Vertex AI, which lets developers leverage the power of Generative AI using only SQL.

Prerequisites

Let’s assume that your product catalog is stored in a Cloud Spanner database. We’ll walk through a few steps in the Google Cloud Console to create the Spanner instance and database, load the sample data, and run a few SQL queries to validate our data. The commands below are best executed through the Google Cloud Shell, accessed from the Cloud Console. If you’d prefer, you can run the same commands from your local machine, provided that you have the Google Cloud SDK installed and configured.
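
If you go the local route, make sure the SDK is authenticated and pointed at the correct project before you begin. A minimal setup looks like this (my-project-id below is just a placeholder for your own project ID):

# Authenticate the SDK and set the active project
gcloud auth login
gcloud config set project my-project-id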

TL;DR: for those of you who are impatient and would rather run a single script than walk through the Prerequisites section, scroll down to the Quickstart section below.

Create Spanner Instance & Database

Start by creating a Cloud Spanner instance in your Google Cloud project. First, we’ll need to enable the Cloud Spanner API.

gcloud services enable spanner.googleapis.com

You should receive a confirmation that the operation finished successfully.

To create the Cloud Spanner instance, we’ll specify the smallest and most cost-effective configuration that supports ML functions. While Cloud Spanner supports fractional node deployments known as processing units, ML functions require full nodes. For production deployments, you’ll want at least 1 node (1,000 processing units) in a multi-regional configuration.

gcloud spanner instances create productcatalog \
--config=regional-us-central1 \
--description="productcatalog instance" \
--nodes=1

Ensure that your instance was created successfully by issuing the following command:

gcloud spanner instances list

You should see something like the following:

Output of command “gcloud spanner instances list” showing the Spanner instance created with the specified configuration

Now that we have a Spanner instance, let’s create our first database to hold our product catalog data.

gcloud spanner databases create productdb \
--instance=productcatalog

To confirm that the empty database was created successfully, issue the following command:

gcloud spanner databases list --instance productcatalog

Again, we should see a similar confirmation of our empty database:

Output of command “gcloud spanner databases list --instance productcatalog” showing the Spanner database created with the specified configuration

Prepare Sample Dataset

Now that we have a Spanner instance (productcatalog) and empty productdb database, we’ll prepare some sample data from a public product dataset available on Kaggle.
https://www.kaggle.com/datasets/thedevastator/the-home-depot-products-dataset/

Download the ZIP file from the link above to your local computer and unzip it to access the dataset file named home_depot_data_1_2021_12.csv.

The file includes a header row, which we’ll need to remove before importing. Open the file locally and delete the header row. **IMPORTANT**: If you skip this step, your Cloud Dataflow job will fail.
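
If you’d rather handle this from a terminal, a one-line sed command will do the job. This is a sketch that assumes the CSV sits in your current directory; keep a backup copy if you’re cautious:

# Delete the first (header) line in place
# (on macOS, use: sed -i '' '1d' home_depot_data_1_2021_12.csv)
sed -i '1d' home_depot_data_1_2021_12.csv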

Now that you’ve removed the header row, we’ll need to upload the CSV file for the sample dataset to Google Cloud Storage before we can import it into Spanner. First, let’s create a Google Cloud Storage (GCS) bucket with the same name as our project ID in the region us-central1. From the Cloud Shell, issue the following commands to create the GCS bucket:

export PROJECT_ID=$(gcloud config get-value project)
echo $PROJECT_ID
gsutil mb -l us-central1 gs://${PROJECT_ID}

Confirm that your bucket was created successfully by issuing the following command:

gsutil ls

You should see any buckets that you have previously created, including the bucket with the same name as your project ID, prefixed with “gs://”.

Finally, we need to upload the CSV file that we downloaded above into our new bucket. Unless you have the Google Cloud SDK installed on your local machine, the simplest way to upload the CSV file is through the Google Cloud Console.
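
If you do have the SDK installed locally, a single gsutil command accomplishes the same upload; this assumes you run it from the directory containing the CSV:

# Upload the CSV to the bucket named after your project ID
export PROJECT_ID=$(gcloud config get-value project)
gsutil cp home_depot_data_1_2021_12.csv gs://${PROJECT_ID}

Otherwise, continue in the Console as follows.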

In a web browser, navigate to https://console.cloud.google.com/storage/browser

Screenshot of the Google Cloud Console showing a listing of Cloud Storage buckets. There is a bucket with the same name as the project ID.

You’ll see the name of the bucket that you just created (same name as your project ID). Click on the bucket name to access the Bucket details screen.

On the Bucket details screen, click UPLOAD FILES and select the CSV file that we downloaded earlier, home_depot_data_1_2021_12.csv.

You will receive a confirmation that the file was uploaded successfully.

Navigate back to the Cloud Shell and issue the following command to confirm that your file was successfully uploaded:

gsutil ls gs://${PROJECT_ID}

You should now see the file named home_depot_data_1_2021_12.csv in your bucket.

In the Cloud Shell, using your favorite text editor, create a file named home_depot_data_1_2021_12.temp and paste in the following content, an import manifest that describes our sample data file:

{
  "tables": [
    {
      "table_name": "homedepot",
      "file_patterns": [
        "gs://BUCKET/home_depot_data_1_2021_12.csv"
      ]
    }
  ]
}

Save and exit the file.

Run the following commands to substitute the BUCKET placeholder with the name of the GCS bucket that we created earlier and copy the resulting file to Cloud Storage:

export PROJECT_ID=$(gcloud config get-value project)
sed 's|'BUCKET'|'"$PROJECT_ID"'|g' home_depot_data_1_2021_12.temp > home_depot_data_1_2021_12.json
gsutil cp home_depot_data_1_2021_12.json gs://${PROJECT_ID}

Confirm the presence of home_depot_data_1_2021_12.json in our GCS bucket by running the following command in the Cloud Shell:

gsutil ls gs://${PROJECT_ID}

Load Sample Dataset with Cloud Dataflow

Now that we’ve prepared the sample dataset and copied it to Cloud Storage, we can finally load it into our Spanner database using Cloud Dataflow. Start by enabling the Dataflow API with the following command in the Cloud Shell:

gcloud services enable dataflow.googleapis.com

In order to load data from a CSV file through Cloud Dataflow, we’ll first need to create the table in Cloud Spanner. Given the schema of the CSV file from the Kaggle dataset, issue the following command in the Cloud Shell to create the table homedepot in the Cloud Spanner database that we created earlier:

gcloud spanner databases ddl update productdb --instance=productcatalog \
--ddl="CREATE TABLE homedepot (\
index INT64 NOT NULL,\
url STRING(4096),\
title STRING(4096),\
images STRING(4096),\
description STRING(4096),\
product_id FLOAT64,\
sku FLOAT64,\
gtin13 FLOAT64,\
brand STRING(1024),\
price FLOAT64,\
currency STRING(1024),\
availability STRING(1024),\
uniq_id STRING(1024),\
scraped_at STRING(1024)\
) PRIMARY KEY (index);"

With all of the prep work done, we can load the CSV file to our new Cloud Spanner table using Cloud Dataflow by issuing the following commands in Cloud Shell:

export PROJECT_ID=$(gcloud config get-value project)
gcloud dataflow jobs run loadhomedepot \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner \
--region us-central1 \
--parameters \
instanceId=productcatalog,\
databaseId=productdb,\
importManifest=gs://${PROJECT_ID}/home_depot_data_1_2021_12.json

The job will take several minutes to process. To check on the status of the job in a visual graph, open a browser window to the following section of the Cloud Console: https://console.cloud.google.com/dataflow/jobs

Click on the job named loadhomedepot to see the graph of the job being performed.
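
If you’d rather watch from the terminal, you can check the job state with the Dataflow CLI; the filter below assumes the job name loadhomedepot that we used above:

gcloud dataflow jobs list --region us-central1 \
--filter="name=loadhomedepot"

The load is complete when the job’s state shows Done.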

Quickstart (Prerequisites)

Alternate Method to complete Prerequisites

Instead of completing the prerequisite steps above, you can download and run the following commands to perform all of the prerequisite tasks:

git clone https://github.com/cloud-jake/spanner-llm-catalog.git
cd spanner-llm-catalog/
./install_prerequisites.sh

After the script has completed, scroll through the output to verify that there were no errors. You should now have a Cloud Spanner instance and database created, with the table homedepot populated with the sample data.
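
If you’d like to spot-check the load yourself, you can run an ordinary read query directly from the Cloud Shell, for example:

# Confirm the sample rows landed in the homedepot table
gcloud spanner databases execute-sql productdb \
--instance=productcatalog \
--sql="SELECT COUNT(*) AS row_count FROM homedepot"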

Now on to the fun part!

Register a Generative AI model in a Spanner schema

Now that we’ve completed all of the prerequisite tasks of creating our Cloud Spanner environment, we’ll need to register a Generative AI model in Cloud Spanner.

First, we’ll need to enable the Vertex AI API in our project. Issue the following command in the Cloud Shell:

gcloud services enable aiplatform.googleapis.com

Next, we’ll run a script to create the model TextBison in the Cloud Spanner database productdb that was created earlier.

LOCATION="us-central1"
PROJECT_ID=$(gcloud config get-value project)

gcloud spanner databases ddl update productdb --instance=productcatalog \
--ddl="CREATE MODEL TextBison \
INPUT (prompt STRING(MAX)) \
OUTPUT (content STRING(MAX)) \
REMOTE \
OPTIONS ( \
endpoint = '//aiplatform.googleapis.com/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/text-bison' \
);"

Schema discovery and validation isn’t available for Generative AI models. Therefore, you must provide INPUT and OUTPUT clauses that match the model's schema. You can find the full schema of the text-bison model on the Vertex AI Model API reference page.

As long as both the database and endpoints are in the same project, Spanner should grant appropriate permissions automatically. Otherwise, review the model endpoint access control section of the CREATE MODEL reference page.

If you did not complete the quickstart, you can get a copy of this script from the following git repo:

git clone https://github.com/cloud-jake/spanner-llm-catalog.git
cd spanner-llm-catalog/

Then run the script to create the TextBison model:

./create_model.sh

To verify that the model was created successfully, open your web browser and access the Cloud Console to navigate to Spanner Studio from the left menu:

DATABASES → Spanner → productcatalog instance → productdb → Spanner Studio (from the left menu)

Or click the following link:

https://console.cloud.google.com/spanner/instances/productcatalog/databases/productdb/details/query

If you are familiar with BigQuery, you should see a familiar interface similar to BigQuery Studio.

screenshot of Spanner Studio with Explorer menu on the left and SQL editor on the right.

On the left, you should see the Table and Model that we previously created. To start testing your Model, open an Editor window on the right and try out this sample query:

SELECT content
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT "What's the most populated state in the US?" AS prompt),
  STRUCT(256 AS maxOutputTokens, 0.2 AS temperature, 40 AS topK, 0.95 AS topP)
) @{remote_udf_max_rows_per_rpc=1};

If your query runs successfully, it should look something like this:

Google Cloud Spanner Studio Editor pane showing the query code and output using the ML.PREDICT method of calling the TextBison LLM model that we previously created. The question “What is the most populated state in the US?” is answered with “California”

The model expects a single STRING column named prompt; you can use a Spanner subquery to generate that column. The TextBison model requires you to specify the maxOutputTokens model parameter; the rest are optional and allow you to further tune the model’s output. Common model parameters include the following:

  • maxOutputTokens — Maximum number of tokens that can be generated in the response. A token is approximately four characters.
  • temperature — The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection.
  • topK — Top-K changes how the model selects tokens for output. For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P with the final token selected using temperature sampling.
  • topP — Top-P changes how the model selects tokens for output. Tokens are selected from the most (see top-K) to least probable until the sum of their probabilities equals the top-P value.

The Vertex AI text-bison model doesn't support batching, so you must use the @{remote_udf_max_rows_per_rpc=1} parameter to set the batch size to 1.
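
Since maxOutputTokens is the only required parameter, a minimal call can omit the sampling parameters entirely and accept the model’s defaults. Here’s a sketch (the prompt is just an illustrative example):

SELECT content
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT "Name three popular paint colors for kitchens." AS prompt),
  STRUCT(128 AS maxOutputTokens)
) @{remote_udf_max_rows_per_rpc=1};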

Using Generative AI Models in Spanner SQL Queries

Now that we’ve established a way to access Vertex AI models from within Spanner, we can incorporate these models into our queries using only SQL! Let’s try out a few use cases with our sample product catalog. Note that we are using the LIMIT clause to avoid billing and quota overruns while keeping the output easy to consume for these demonstrations.

Enhance Product Descriptions

One of the most common use cases for Generative AI in retail is to generate or enhance product descriptions.

-- Enhance Product Descriptions
SELECT
  index, url, title, description, product_id, sku,
  gtin13, brand, price, currency, availability, uniq_id,
  content AS enhanced_product_description
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT
    index, url, title, description, product_id, sku,
    gtin13, brand, price, currency, availability, uniq_id,
    CONCAT("Provide a concise product description emphasizing key features", "\n",
      "Product Name: ", title, "\n",
      "Product Description: ", description, "\n"
    ) AS prompt
  FROM homedepot
  LIMIT 5),
  STRUCT(
    1024 AS maxOutputTokens,
    0.2 AS temperature,
    40 AS topK,
    0.8 AS topP
  )
) @{remote_udf_max_rows_per_rpc=1};

Output:

Output table showing product info and a column for enhanced product description created by Generative AI

Product Classification or Categorization

Another common use case for Generative AI, in addition to enhancing product descriptions, is product classification or categorization.

-- Product Classification or Categorization
SELECT
  index, url, title, description, product_id, sku,
  gtin13, brand, price, currency, availability, uniq_id,
  content AS category
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT
    index, url, title, description, product_id, sku,
    gtin13, brand, price, currency, availability, uniq_id,
    CONCAT("Provide a comma delimited list of categories for the following product:", "\n",
      "Product Name: ", title, "\n",
      "Product Description: ", description, "\n"
    ) AS prompt
  FROM homedepot
  LIMIT 5),
  STRUCT(
    256 AS maxOutputTokens,
    0.2 AS temperature,
    40 AS topK,
    0.8 AS topP
  )
) @{remote_udf_max_rows_per_rpc=1};

Output:

Output table with product information and column for product categories created by generative AI

Describe Products for Social Media and Email Marketing

Oftentimes, different media and audiences require different content and tone, and may be limited to a certain number of characters. Leveraging prompt engineering, developers can specify the style in which content is generated.

-- Describe Products for Social Media and Email Marketing
SELECT
  index, url, title, description, product_id, sku,
  gtin13, brand, price, currency, availability, uniq_id,
  content AS marketing_description
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT
    index, url, title, description, product_id, sku,
    gtin13, brand, price, currency, availability, uniq_id,
    CONCAT("Provide a description of the product for use in a Twitter post of less than 140 characters", "\n",
      "Product Name: ", title, "\n",
      "Product Description: ", description, "\n"
    ) AS prompt
  FROM homedepot
  LIMIT 5),
  STRUCT(
    50 AS maxOutputTokens,
    0.2 AS temperature,
    40 AS topK,
    0.8 AS topP
  )
) @{remote_udf_max_rows_per_rpc=1};

Output:

Generate Hashtags for Social Media Posts

Similar to tailoring style, tone, and message length, a complementary use case is to generate hashtags for product posts.

-- Generate Hashtags for Social Media Posts
SELECT
  index, url, title, description, product_id, sku,
  gtin13, brand, price, currency, availability, uniq_id,
  content AS hashtags
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT
    index, url, title, description, product_id, sku,
    gtin13, brand, price, currency, availability, uniq_id,
    CONCAT("Generate a comma delimited list of 3-5 hashtags for the product below:", "\n",
      "Product Name: ", title, "\n",
      "Product Description: ", description, "\n"
    ) AS prompt
  FROM homedepot
  LIMIT 5),
  STRUCT(
    20 AS maxOutputTokens,
    0.2 AS temperature,
    40 AS topK,
    0.8 AS topP
  )
) @{remote_udf_max_rows_per_rpc=1};

Output:

Output table showing the product information and the hashtags created by Generative AI

Answer Questions about Products

In addition to content generation about a product, Generative AI can be used to answer common questions about a product. Here are a few examples:

-- Answer questions about a product
SELECT
  index, url, title, description, product_id, sku,
  gtin13, brand, price, currency, availability, uniq_id,
  content AS gift
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT
    index, url, title, description, product_id, sku,
    gtin13, brand, price, currency, availability, uniq_id,
    CONCAT("Would this product be a good gift? Answer with YES or NO", "\n",
      "Product Name: ", title, "\n",
      "Product Description: ", description, "\n"
    ) AS prompt
  FROM homedepot
  LIMIT 5),
  STRUCT(
    5 AS maxOutputTokens,
    0.2 AS temperature,
    40 AS topK,
    0.8 AS topP
  )
) @{remote_udf_max_rows_per_rpc=1};

Output:

Output table showing product information and the Generative AI result as to whether or not the product would be a good gift.

Limiting the query to a single product, we can ask a more specific question:

-- Answer questions about a product
SELECT
  index, url, title, description, product_id, sku,
  gtin13, brand, price, currency, availability, uniq_id,
  content AS accessories
FROM ML.PREDICT(
  MODEL TextBison,
  (SELECT
    index, url, title, description, product_id, sku,
    gtin13, brand, price, currency, availability, uniq_id,
    CONCAT("What are some common accessories purchased with this type of product?", "\n",
      "Product Name: ", title, "\n",
      "Product Description: ", description, "\n"
    ) AS prompt
  FROM homedepot
  WHERE index = 11
  LIMIT 1),
  STRUCT(
    1024 AS maxOutputTokens,
    0.2 AS temperature,
    40 AS topK,
    0.8 AS topP
  )
) @{remote_udf_max_rows_per_rpc=1};

Output:

Output table showing common accessories for a product that is a quart of paint.

Summary

In this article, we explored several use cases for leveraging Generative AI from Google Cloud Spanner using only SQL. Please comment below with use cases that you’ve found helpful or ways in which you’ve incorporated Generative AI into your applications!

References:

1 — Dataset: https://www.kaggle.com/datasets/thedevastator/the-home-depot-products-dataset/

2 — Create Spanner Database: https://cloud.google.com/spanner/docs/create-manage-databases

3 — Vertex AI on Spanner Tutorial: https://cloud.google.com/spanner/docs/ml-tutorial-generative-ai
