Exploring the new Data Quality Dashboard in Google Discovery AI

And how to use Generative AI in BigQuery to Generate Product Catalog Descriptions and Improve Data Quality

Cloud Jake
Google Cloud - Community


users gathered around computer screens showing an ecommerce site.

Discovery AI for Retail gives organizations the ability to provide Google-quality search, browse and recommendations on their own digital properties, helping to increase conversions and reduce search abandonment.

As with any data project, starting with a reliable and accurate corpus of data is critical to providing great results. Google recently released an enhanced Data Quality Dashboard for Discovery AI to help identify data issues that lead to poor search and recommendation results. Garbage in = garbage out. A common issue with product catalog data is incomplete, inaccurate, or otherwise unsatisfactory product descriptions.

In this article, we’ll explore Google’s Discovery AI suite of products from the catalog data quality perspective and leverage Google’s Generative AI tools to generate meaningful item descriptions for our product catalog and boost our data quality scores.

Start with a Discovery AI Instance

If you don’t already have an instance of Discovery AI (aka Retail API), check out the last article Test Drive Google Discovery AI — which outlines the steps to create a Discovery AI instance and preload it with sample data from the Movielens open dataset.

→ Create a Discovery AI Instance

Data Quality Dashboard

The Data Quality Dashboard for Discovery AI details Catalog and User Event data quality issues for search and browse queries. The dashboard breaks down performance issues into four cumulative tiers (data requirements from previous tiers also apply to higher tiers). Within each tier, data requirements are classified as Compliant, Performance Critical, or Upgrade Blocking. We’ll start by exploring the Data Quality Dashboard and the current data quality state for the Movielens database example.

Data Quality Tool Overview

To access the Data Quality Dashboard in the Google Cloud Console, click the Retail menu and navigate to Data Quality.

Retail menu in the Google Cloud Console showing the menu items in a vertical list with “Data Quality” highlighted

On the Data Quality Dashboard page, you’ll notice the option to choose between TEXT SEARCH and BROWSE SEARCH. For the purpose of this example, we’ll focus on TEXT SEARCH queries. At the top of the page, a summary of the 4 tiers is shown, including the tier that is currently in use.

Screenshot of the Retail API Data Quality page showing the summary table at the top of the page.

Below the tier summary are the 4 tiers listed individually. To see the details for each tier, click the VIEW link to expand the section.

Screenshot of the Data Quality Dashboard showing the tier summary and individual tiers listed in a table.

Product documentation: Unlock search performance tiers

Data Quality Tiers

Tier 1 — Relevance
Results are returned based only on relevance to the query

Screenshot of the Data Quality Dashboard table showing the status of each data quality metric for Tier 1 — Relevance.

Tier 2 — Relevance & Popularity
Results are returned based on relevance to the query. Equally relevant products are ranked by the popularity of the product on your site

Screenshot of the Data Quality Dashboard table showing the status of each data quality metric for Tier 2 — Relevance & Popularity.

Tier 3 — Revenue Optimized
Relevant results are ranked by the product’s likelihood of being purchased based on site-wide activity

Screenshot of the Data Quality Dashboard table showing the status of each data quality metric for Tier 3 — Revenue Optimized.

Tier 4 — Personalized Revenue Optimized
Relevant results are ranked by the individual user’s preferences and the product’s likelihood of being purchased based on site-wide activity

Screenshot of the Data Quality Dashboard table showing the status of each data quality metric for Tier 4 — Personalized Revenue Optimized.

Generally, the highest tier that has satisfied all of the requirements will be labeled “In use”. Tier requirements are cumulative: for a tier to be in use, all requirements of the lower tiers must also be satisfied.

  • If it is labeled “In use”: You have satisfied all blocking issues for that tier and have unlocked it.
  • If there are 0 blocking issues: You have satisfied the data checks for that tier. If there are 0 blocking issues for this tier and all previous tiers, it takes about 24 hours to train and prepare the model and activate the newly unlocked tier.
  • If there are any blocking issues: Check that tier’s metrics to see which data issues to address to unlock that tier.

Meeting Data Quality Requirements

For each metric listed in the data quality scorecard for each tier, a status is shown along with the current and threshold values. The status will be one of the following:

  • Upgrade blocking: Identifies data issues that prevent Retail Search from upgrading your use case to the next performance tier. To unlock a tier, satisfy all upgrade blocking data checks for that performance tier (and those of its previous tiers).
  • Performance critical: Identifies data issues that do not block an upgrade, but can have a significant impact on text search or browse search performance.
  • Compliant: Indicates that this data check has passed.
  • Unavailable: Indicates that a non-upgrade blocking data check is not yet completed. Values for these metrics are displayed as N/A. It can take up to 24 hours after importing data to compute some data checks.

To get a full explanation, click the DETAILS link for any of the metrics listed. Let’s take a look at the details card for “Percentage of products with a description”.

Detail card for the data quality metric “percentage of products with a description” showing the description of what needs to be done in order to satisfy this requirement.

In the next section, we’ll leverage Google’s Generative AI tooling in Vertex AI to generate product descriptions for each movie in our product catalog in order to satisfy the “Percentage of products with a description” requirement.

Boost Data Quality Scores with Generative AI

One emerging use case for Generative AI and the associated large language models (LLMs) is generating product descriptions and other product metadata. In this example, we’ll use Google’s Vertex AI LLMs to generate a movie description from a movie title.

Since our data catalog is stored in BigQuery, let’s find a way to call Vertex AI LLM models directly from BigQuery. Fortunately, Google’s data cloud provides a unified, open approach to data-driven transformation with AI built in. We’ll take advantage of BigQuery’s ability to connect to external data sources — in this case a machine learning (ML) model in Vertex AI, Google’s unified machine learning platform.

Enable BigQuery Connection API

Let’s start by enabling the BigQuery Connection API, which will allow BigQuery to call ML models in Vertex AI. In the same project where you created your Discovery AI environment, open the following URL:

https://console.cloud.google.com/marketplace/product/google/bigqueryconnection.googleapis.com

Alternatively, you can search for “bigqueryconnection.googleapis.com” in the Cloud Console.

Click the blue ENABLE button to enable the BigQuery Connection API.

Screenshot of the API explorer page with a prompt to enable the BigQuery Connection API

Once the API has been enabled, navigate to BigQuery in the Cloud Console.

Add External Connection

In the BigQuery section of the Cloud Console, click the “+ADD” link at the top left of the screen.

Explorer pane in BigQuery with the “+ADD” button highlighted

Under Popular Sources, select Connections to external data sources

Screenshot of the add external connection dialog in BigQuery

In the Connection type dropdown, select BigLake and remote functions and provide llm-conn as Connection ID. Click the blue CREATE CONNECTION button. You can leave all other values as the default.

Screenshot of external data source dialog in BigQuery with the values noted in the paragraph above.

Once the connection is created, make a note of the Service Account generated from the connection configuration details. To see the connection details, locate your project in the Explorer pane and expand the External connections section to reveal the connection that we just created (us.llm-conn). A Connection info section will appear in the right window. Locate the Service account id.

BigQuery in the Google Cloud Console showing the “External connection” “Connection info” for the connection “llm-conn” that we created in the previous step. The value for Service account id is highlighted.

Grant Service Account Permissions to Vertex AI

To call Vertex AI models from BigQuery, we need to grant permission to the service account that was just created. You can do this by granting the “Vertex AI User” role to the service account in the IAM console, or by issuing the following commands in Cloud Shell.

First, define the variable “SA” by pasting the Service account id that we just noted. Enter the following command in Cloud Shell and hit enter (replacing the sample text with your Service account id):

SA="paste YOUR Service Account ID from llm-conn"

Now paste and run the following commands:


PROJECT_ID=`gcloud config get project`
gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:${SA}" --role="roles/aiplatform.user"

You should receive several lines of text as output that starts with “Updated IAM policy for project [your-project-id].”

Create a remote model in BigQuery

To call the text-bison LLM in Vertex AI, we’ll create a BQML remote model using the remote connection that we created earlier. In the BigQuery console, issue the following query, noting the placeholder values for the model and connection that we created earlier:

DATASET_ID.MODEL_NAME = movielens.llm_model

REGION.CONNECTION_NAME = us.llm-conn

CREATE OR REPLACE MODEL movielens.llm_model
REMOTE WITH CONNECTION `us.llm-conn`
OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');
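
Before pointing the model at the catalog, it’s worth a quick smoke test with a single hand-written prompt. Here’s a minimal sketch, assuming the movielens.llm_model remote model above was created successfully:

-- Quick smoke test of the remote model (assumes movielens.llm_model was created above)
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text
FROM
  ML.GENERATE_TEXT(
    MODEL `movielens.llm_model`,
    (SELECT 'Generate a logline for the following movie: Toy Story (1995)' AS prompt),
    STRUCT(0.2 AS temperature, 256 AS max_output_tokens));

If the call succeeds, you should see a single row containing a short generated logline.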

Generate Product Descriptions

Once the model is created, we can use it to generate text. Since our product catalog is already stored in BigQuery, we can join our existing table when we call the model to generate a new table that includes product descriptions.

Let’s start with a sample query to refine our Prompt and Parameters.

SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  ml_generate_text_result['predictions'][0]['safetyAttributes'] AS safety_attributes,
  * EXCEPT (ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `DATASET_ID.MODEL_NAME`,
    (
      SELECT
        CONCAT('Generate a logline for the following movie: ', title) AS prompt
      FROM `DATASET_ID.PRODUCT_TABLE`
      LIMIT 10
    ),
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens));

Replace the following variables in the SQL code listed above. If you followed the instructions above, you would use the following values:

DATASET_ID.MODEL_NAME = movielens.llm_model

DATASET_ID.PRODUCT_TABLE = movielens.products

The full query with the variables listed above for reference:

SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  ml_generate_text_result['predictions'][0]['safetyAttributes'] AS safety_attributes,
  * EXCEPT (ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `movielens.llm_model`,
    (
      SELECT
        CONCAT('Generate a logline for the following movie: ', title) AS prompt
      FROM `movielens.products`
      LIMIT 10
    ),
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens,
      40 AS top_k,
      0.8 AS top_p));

Run this query in BigQuery to generate loglines for 10 movies selected from the “movielens.products” table (view). Review the output:

BigQuery results from generating a logline for the movies listed in the query. Focus on the column “generated_text”.

If you’d like to explore updating the variables listed in the STRUCT section at the bottom of the query, see the “[optional] Refine your prompt and parameters in Generative AI Studio” section at the end of this article.
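
As a quick illustration, here is a variation of the query above (same model and table) with a lower temperature and a smaller token limit, which tends to produce shorter, more deterministic loglines; the exact values are just a starting point to experiment with:

SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text
FROM
  ML.GENERATE_TEXT(
    MODEL `movielens.llm_model`,
    (
      SELECT
        CONCAT('Generate a logline for the following movie: ', title) AS prompt
      FROM `movielens.products`
      LIMIT 10
    ),
    STRUCT(
      0.1 AS temperature,        -- less random output
      256 AS max_output_tokens,  -- shorter responses
      20 AS top_k,
      0.5 AS top_p));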

Once we’re satisfied with the output, we can generate a logline for all of the rows in the products table (view), then update our product catalog in Discovery AI to address the data quality requirement for product descriptions.

Generate Product Descriptions for all Products

Now that we have a query that will reliably generate a logline or summary of the movies in our products table (view), let’s modify the query so we can use it to generate a products_description table that will include the product id and the generated text that we can join back to our products table (view).


SELECT
  id,
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  * EXCEPT (id)
FROM
  ML.GENERATE_TEXT(
    MODEL `movielens.llm_model`,
    (
      SELECT
        id,
        CONCAT('Generate a logline for the following movie: ', title) AS prompt
      FROM `movielens.products`
      LIMIT 10
    ),
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens,
      40 AS top_k,
      0.8 AS top_p))

To create a new table from the output, we’ll add a few lines to our query. We’ll also update the number of products (movies) from 10 to 100 for this example.

CREATE OR REPLACE TABLE movielens.products_description
AS (
  SELECT
    id,
    ml_generate_text_result['predictions'][0]['content'] AS generated_text,
    * EXCEPT (id)
  FROM
    ML.GENERATE_TEXT(
      MODEL `movielens.llm_model`,
      (
        SELECT
          id,
          CONCAT('Generate a logline for the following movie: ', title) AS prompt
        FROM `movielens.products`
        LIMIT 100
      ),
      STRUCT(
        0.2 AS temperature,
        1024 AS max_output_tokens,
        40 AS top_k,
        0.8 AS top_p))
)
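
One thing to watch: ml_generate_text_result is a JSON column, so the generated_text values above may be stored as JSON (with surrounding quotes) rather than as plain strings. If that’s the case in your project, a variant of the same query can extract the text as a STRING using JSON_VALUE; this is a sketch using the same model and table names as above:

CREATE OR REPLACE TABLE movielens.products_description
AS (
  SELECT
    id,
    -- Extract the generated logline as a plain STRING instead of a JSON value
    JSON_VALUE(ml_generate_text_result, '$.predictions[0].content') AS generated_text
  FROM
    ML.GENERATE_TEXT(
      MODEL `movielens.llm_model`,
      (
        SELECT
          id,
          CONCAT('Generate a logline for the following movie: ', title) AS prompt
        FROM `movielens.products`
        LIMIT 100
      ),
      STRUCT(
        0.2 AS temperature,
        1024 AS max_output_tokens,
        40 AS top_k,
        0.8 AS top_p))
)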

It took about a minute to generate the requested text for 100 products (movies). You should see the following message once completed:

Query results in BigQuery noting that a new table has been generated based on the query that we just ran.

Click on the blue GO TO TABLE button to check out our results.
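
You can also preview a few rows with a simple query (assuming the table name used above):

SELECT id, generated_text
FROM `movielens.products_description`
ORDER BY id
LIMIT 10;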

We could repeat the process for our entire products table (view), but for now, let’s include the results in the products view, load the updated information into Discovery AI, and verify the added product descriptions in the Data Quality Dashboard.

Update the Products Table (View) with Generated Product Descriptions

In the BigQuery console, locate the Products view in the left explorer pane. Next, click the DETAILS tab in the right window to see the SQL code that generates this view.

Schema of the view named products in BigQuery

Scroll to the bottom of the DETAILS tab and select EDIT QUERY

Details of the view named products in BigQuery showing the SQL code that generates the view.

Update the query to join our newly created products_description table so that the generated_text is included as the description field. You can replace the entire query with the SQL code below:

SELECT
  CAST(movies.movieId AS string) AS id,
  SUBSTR(title, 0, 128) AS title,
  prod_desc.generated_text AS description,
  SPLIT(genres, "|") AS categories,
  CONCAT("https://www.imdb.com/title/tt", links.imdbId) AS uri
FROM `movielens.movies` movies
LEFT JOIN `movielens.links` links
  ON movies.movieId = links.movieId
LEFT JOIN `movielens.products_description` prod_desc
  ON CAST(movies.movieId AS string) = prod_desc.id

Verify that the updated query works as expected by clicking the blue RUN button. Once you have verified that the new query works as expected, click SAVE VIEW to update and save the products view.

Updated SQL code for the products view in BigQuery, now joined with the products_description table
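
To sanity-check coverage before re-importing, a quick count over the updated view (using the names above) shows how many products now have a description:

SELECT
  COUNT(*) AS total_products,
  COUNTIF(description IS NOT NULL) AS products_with_description
FROM `movielens.products`;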

Reload Updated Product Catalog in Discovery AI

Finally, let’s take the updated products catalog and update Discovery AI.

Open the Discovery AI section of the Cloud Console by navigating to, or searching for, Retail. You can also enter the following URL: https://console.cloud.google.com/ai/retail/

In the Retail menu, navigate to the Data section, then click IMPORT at the top of the screen.

To perform an update to the data catalog, select the following values in the Import Data pane:

Import data dialog showing the values that are spelled out below

Import Type: Product Catalog

Source of Data: BigQuery

Import Branch: Branch 0

Schema of data: Retail Product Catalogs Schema

BigQuery table: select the products table in your project

Click the blue IMPORT button.

You should receive a pop-up confirmation message once your import has started. Go back to the Data page and click Activity Status at the top of the page to monitor the progress of your import.

Discovery AI Data import screen with the ACTIVITY STATUS button highlighted

In the Activity status pane, click the refresh activity icon at the top right of the screen until the Status column reads “Succeeded”.

Activity Status pane showing the status of the catalog import that we just processed. Status now shows “Succeeded”

Once the import has completed, navigate to the Data Quality section and wait for the metric Percentage of products with a description to be updated. Data quality checks are triggered by data imports, and it will take some time for all checks to complete. The date when all checks were last completed is printed at the top of the page.

Discovery AI Data Quality table showing the effect of adding the generated product descriptions to the new data catalog import.

Since we updated only 100 of the 59,098 products, we should now see about 0.17% of products with a valid description — still well below the 90% threshold. ;)

[optional] Refine your prompt and parameters in Generative AI Studio

Want to tweak the prompt and parameters that we used to generate product descriptions? It’s easy!! Check out Generative AI Studio in Vertex AI.

Vertex AI’s Generative AI Studio provides an easy-to-use interface for prompt design and tuning. Choose from a variety of tuning options based on your ML expertise, data, and budget, including tuning with text prompts or state-of-the-art capabilities like Reinforcement Learning from Human Feedback (RLHF).

Summary

In this article, we explored Google Discovery AI’s new Data Quality Dashboard and how to unlock search performance tiers by reconciling data quality issues.

We also explored how to leverage Generative AI within BigQuery to generate missing product descriptions in a product catalog.

References

1 — SQL-only LLM for text generation using Vertex AI model in BigQuery
2 — Test Drive Google Discovery AI
3 — Unlock Search Performance Tiers
4 — Generate text content from BigQuery data
