Implement Multimodal Vector Search with BigQuery: Challenge Lab
PROJECT_ID | `qwiklabs-gcp-03-b5156984b274` |
CONNECTION_REGION | `us-east1` |
CONNECTION_NAME | `vector_conn` |
DATASET_NAME | `gcc_bqml_dataset` |
OBJECT_TABLE_NAME | `gcc_image_object_table` |
MODEL_NAME | `gcc_embedding` |
DEFINE_ENDPOINT | `ENDPOINT = 'multimodalembedding@001'` |
EMBEDDINGS_TABLE_NAME | `gcc_retail_store_embeddings` |
SEARCH_RESULTS_TABLE | `gcc_vector_search_table` |
STATEMENT_TO_SELECT_TOP_2_RESULTS | `top_k => 2` |
VECTOR_SEARCH_FUNCTION | `VECTOR_SEARCH` |
EMBEDDINGS_FUNCTION | `ML.GENERATE_EMBEDDING` |
Task 1. Create a source connection and grant IAM permissions
BigQuery Vertex AI에서 `us-east1` Region으로 `vector_conn`이라는 외부 소스 연결
IAM Permissions 부여
- BigQuery Data Owner
- Storage Object Viewer
- Vertex AI User
Task 2. Create an object table
Google Cloud Storage에 저장된 이미지 및 동영상과 같은 비정형 데이터를 쿼리하기 위해 `gcc_bqml_dataset`라는 미리 만들어진 BigQuery 데이터 세트에서 객체 테이블 `gcc_image_object_table`을 생성
Create or replace external table `[PROJECT_ID].[DATASET_NAME].[OBJECT_TABLE_NAME]`
with connection `[PROJECT_ID].[CONNECTION_REGION].[CONNECTION_NAME]`
options
(
object_metadata='SIMPLE',
uris=['gs://[PROJECT_ID]/*']
)
Create or replace external table `qwiklabs-gcp-03-b5156984b274.gcc_bqml_dataset.gcc_image_object_table`
with connection `qwiklabs-gcp-03-b5156984b274.us-east1.vector_conn`
options
(
object_metadata='SIMPLE',
uris=['gs://qwiklabs-gcp-03-b5156984b274/*']
)
Task 3. Generate embeddings
Connect to the multimodal embeddings model
원격 임베딩 모델에 연결하기 위해 `gcc_bqml_dataset`라는 미리 만들어진 BigQuery 데이터 세트에서 BigQuery의 새 모델 `gcc_embedding`을 만들고 엔드포인트(모델 이름)를 `multimodalembedding@001`로 지정
Create or replace model
`[PROJECT_ID].[DATASET_NAME].[MODEL_NAME]`
remote with connection `[PROJECT_ID].[CONNECTION_REGION].[CONNECTION_NAME]`
options(
[DEFINE_ENDPOINT]
);
Create or replace model
`qwiklabs-gcp-03-b5156984b274.gcc_bqml_dataset.gcc_embedding`
remote with connection `qwiklabs-gcp-03-b5156984b274.us-east1.vector_conn`
options(
ENDPOINT = 'multimodalembedding@001'
);
Generate embeddings
이미지의 임베딩을 생성할 때 `gcc_bqml_dataset`라는 미리 만들어진 BigQuery 데이터 세트의 `gcc_retail_store_embeddings` 테이블에 임베딩을 저장
Create or replace table `[PROJECT_ID].[DATASET_NAME].[EMBEDDINGS_TABLE_NAME]`
as select *, REGEXP_EXTRACT(uri, r'[^/]+$') as product_name
from [EMBEDDINGS_FUNCTION]
(
MODEL `[PROJECT_ID].[DATASET_NAME].[MODEL_NAME]`,
TABLE `[PROJECT_ID].[DATASET_NAME].[OBJECT_TABLE_NAME]`
)
Create or replace table `qwiklabs-gcp-03-b5156984b274.gcc_bqml_dataset.gcc_retail_store_embeddings`
as select *, REGEXP_EXTRACT(uri, r'^/+$') as product_name
from ML.GENERATE_EMBEDDING
(
MODEL `qwiklabs-gcp-03-b5156984b274.gcc_bqml_dataset.gcc_embedding`,
TABLE `qwiklabs-gcp-03-b5156984b274.gcc_bqml_dataset.gcc_image_object_table`
)
임베딩 결과를 검토하려면,
SELECT * FROM `[PROJECT_ID].[DATASET_NAME].[EMBEDDINGS_TABLE_NAME]`
임베딩 결과는 부동 소수점 숫자로 표현되므로 바로 해석하지 못할 수 있음
Task 4. Run a vector search
벡터 검색을 실행하여 검색구문에 가장 유사한 이미지를 찾을 때 `gcc_bqml_dataset`라는 미리 만들어진 BigQuery 데이터 세트의 `gcc_vector_search_table` 테이블에 검색 결과를 저장
Create or replace table `[PROJECT_ID].[DATASET_NAME].[SEARCH_RESULTS_TABLE]` AS
select base.uri,
base.product_name,
base.content_type,
distance
from
[VECTOR_SEARCH_FUNCTION](table [DATASET_NAME].[EMBEDDINGS_TABLE_NAME],'ml_generate_embedding_result',
(
SELECT ml_generate_embedding_result as embedding_col
FROM
[EMBEDDINGS_FUNCTION]
(
MODEL `[DATASET_NAME].[MODEL_NAME]`,
(select 'Men Sweaters' as content),
STRUCT(TRUE AS flatten_json_output)
)
),
[STATEMENT_TO_SELECT_TOP_2_RESULTS],
distance_type => 'COSINE'
);
Create or replace table `qwiklabs-gcp-03-b5156984b274.gcc_bqml_dataset.gcc_vector_search_table` AS
select base.uri,
base.product_name,
base.content_type,
distance
from
VECTOR_SEARCH(table gcc_bqml_dataset.gcc_retail_store_embeddings,'ml_generate_embedding_result',
(
SELECT ml_generate_embedding_result as embedding_col
FROM
ML.GENERATE_EMBEDDING
(
MODEL `gcc_bqml_dataset.gcc_embedding`,
(select 'Men Sweaters' as content),
STRUCT(TRUE AS flatten_json_output)
)
),
top_k => 2,
distance_type => 'COSINE'
);
벡터 검색 결과를 검토하려면,
SELECT * FROM `[PROJECT_ID].[DATASET_NAME].[SEARCH_RESULTS_TABLE]`