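Building a Pinecone Index with OpenAI Embedding Models

🧠 1️⃣ What Is an Embedding?

An embedding converts text into a vector (an array of numbers) so that
a program can compare texts by meaning rather than by exact wording.

Item	Example	Description
Input text	“기생충은 사회 계급을 다룬 영화이다.” ("Parasite is a film about social class.")	Natural-language sentence
Output vector	[0.11, -0.02, 0.87, …]	1536-dimensional vector of real numbers
Uses	Similarity scoring, search, classification, recommendation	Enables meaning-based processing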

💡 OpenAI's text-embedding-3-small model

Outputs 1536-dimensional vectors by default
Handles multilingual input, including Korean
Is well suited to RAG and search systems
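
To make "comparing by meaning" concrete, here is a minimal sketch of cosine similarity, the measure used later in this guide. The three-dimensional vectors and their names are made up for illustration; real embeddings from text-embedding-3-small have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
class_drama = [0.9, 0.1, 0.2]
class_satire = [0.8, 0.2, 0.3]
space_opera = [0.1, 0.9, 0.1]

sim_close = cosine_similarity(class_drama, class_satire)
sim_far = cosine_similarity(class_drama, space_opera)
print(f"related texts:   {sim_close:.3f}")
print(f"unrelated texts: {sim_far:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score lower. That ordering is what a vector search relies on.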

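🗃️ 2️⃣ What Is a Pinecone Index?

Pinecone is a vector database built to store vectors and search them efficiently. An index holds records, each made up of the following components.

Component	Description
ID	Unique key that identifies each vector
Vector (embedding)	Numeric representation of the text
Metadata	Additional information such as title, genre, and year

🧩 Core roles

Storing vectors at large scale
Similarity search (cosine, dot product, and so on)
Metadata filtering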
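
The record structure and the roles above can be pictured with a small in-memory model. This is only a conceptual sketch: `Record` and `ToyIndex` are hypothetical names, the search is a brute-force scan, and none of it reflects how Pinecone is implemented internally.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    id: str                      # unique key
    values: list                 # the embedding vector
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


class ToyIndex:
    """Brute-force stand-in for a vector index (illustration only)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.records = {}

    def upsert(self, record):
        # "Upsert" = insert, or overwrite if the ID already exists.
        if len(record.values) != self.dimension:
            raise ValueError("vector length does not match index dimension")
        self.records[record.id] = record

    def query(self, vector, top_k=3):
        scored = [(cosine(vector, r.values), r) for r in self.records.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


toy = ToyIndex(dimension=3)
toy.upsert(Record("movie-1", [0.9, 0.1, 0.2], {"title": "A"}))
toy.upsert(Record("movie-2", [0.1, 0.9, 0.1], {"title": "B"}))
toy.upsert(Record("movie-2", [0.1, 0.8, 0.2], {"title": "B (updated)"}))

best_score, best_record = toy.query([0.85, 0.15, 0.25], top_k=1)[0]
print(best_record.id, best_record.metadata["title"])
```

Upserting the same ID twice leaves one record, which is why stable IDs matter when a load script is re-run.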

⚙️ 3️⃣ Pinecone Environment Setup

.env

OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_ENVIRONMENT=us-east-1-aws

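🔑 Keeping keys in a .env file, and out of version control, is safer than hard-coding them. The code in this guide reads only the two API keys; it never reads PINECONE_ENVIRONMENT, because a serverless index takes its cloud and region from ServerlessSpec.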
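
A missing key usually surfaces later as a confusing authentication error, so it can help to fail early. The sketch below assumes the two variable names from the .env file above; `missing_keys` and `require_keys` are hypothetical helper names, not part of any library.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are absent or blank in env."""
    return [name for name in required if not (env.get(name) or "").strip()]


def require_keys(env=None):
    """Raise a readable error if any required key is missing."""
    env = os.environ if env is None else env
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

One possible use is to call `require_keys()` right after `load_dotenv()`, before any client is created.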

🏗️ 4️⃣ Creating the Pinecone Index

import os

import pinecone
from dotenv import load_dotenv

# ✅ 1️⃣ Load environment variables from .env
load_dotenv()

# ✅ 2️⃣ Initialize the Pinecone client
pc = pinecone.Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create the index only if it does not exist yet (a spec is required)
index_name = "movie-index"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1536,    # must match the embedding model's output size
        metric="cosine",
        spec=pinecone.ServerlessSpec(
            cloud="aws",       # or "gcp"
            region="us-east-1" # replace with the region you use
        ),
    )

# Get a handle to the index
index = pc.Index(index_name)
print(f"✅ Pinecone index ready → {index_name}")

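💡 Note

dimension must match the output size of the embedding model (1536 for text-embedding-3-small). Pinecone rejects upserts whose vectors have a different length.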
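
A small guard can catch a mismatch before any API call is made. The sketch below hard-codes the default output sizes of three OpenAI embedding models; `check_dimension` is a hypothetical helper, and the table would need updating if a reduced size is requested through the embeddings API's `dimensions` option.

```python
# Default output sizes of OpenAI embedding models.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_dimension(model, index_dimension):
    """Return the model's dimension, or raise if it differs from the index."""
    expected = MODEL_DIMENSIONS.get(model)
    if expected is None:
        raise KeyError(f"unknown embedding model: {model}")
    if expected != index_dimension:
        raise ValueError(
            f"{model} outputs {expected} dimensions, "
            f"but the index expects {index_dimension}"
        )
    return expected


print(check_dimension("text-embedding-3-small", 1536))
```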

🎬 5️⃣ Korean Movie Metadata Schema

Field	Description
title	Title
year	Release year
genre	Genre(s)
director	Director
actors	List of cast members
rating	Rating
synopsis	Plot summary

Example JSON structure:

{
  "title": "기생충",
  "year": 2019,
  "genre": ["드라마", "스릴러"],
  "director": "봉준호",
  "actors": ["송강호", "조여정", "이선균"],
  "rating": 8.6,
  "synopsis": "가난한 가족과 부유한 가족 사이의 계급을 그린 블랙코미디."
}
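
Validating records against this schema before embedding them avoids paying for API calls on malformed data. The dataclass below is one possible sketch; the `Movie` class and the 0 to 10 rating range are assumptions made for illustration, not something the schema above specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    title: str
    year: int
    genre: list
    director: str
    actors: list
    rating: float
    synopsis: str

    def __post_init__(self):
        if not self.title or not self.synopsis:
            raise ValueError("title and synopsis must be non-empty")
        if not isinstance(self.year, int):
            raise ValueError("year must be an integer")
        # Assumed rating scale: 0 to 10.
        if not 0.0 <= float(self.rating) <= 10.0:
            raise ValueError("rating must be between 0 and 10")
        for name in ("genre", "actors"):
            values = getattr(self, name)
            if not isinstance(values, list) or not all(
                isinstance(v, str) for v in values
            ):
                raise ValueError(f"{name} must be a list of strings")


record = {
    "title": "기생충",
    "year": 2019,
    "genre": ["드라마", "스릴러"],
    "director": "봉준호",
    "actors": ["송강호", "조여정", "이선균"],
    "rating": 8.6,
    "synopsis": "가난한 가족과 부유한 가족 사이의 계급을 그린 블랙코미디.",
}
movie = Movie(**record)
print(movie.title, movie.year)
```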

🧩 6️⃣ Designing Fields for Hybrid Search

Search condition	Field	Filter example
Genre filter	genre	`{"genre": {"$in": ["드라마", "스릴러"]}}`
Year filter	year	`{"year": {"$gte": 2020}}`
Rating filter	rating	`{"rating": {"$gt": 8.0}}`
Actor filter	actors	`{"actors": {"$in": ["송강호"]}}`

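💡 Hybrid search

In this guide, hybrid search means combining vector similarity with metadata conditions,
so that results are both semantically close to the query and restricted to records that satisfy the filter.
Pinecone's documentation uses the same term for combining dense and sparse vectors, which is a different technique.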
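
To see what the filter expressions in the table select, the sketch below evaluates them locally against plain metadata dictionaries. It roughly mimics the `$in`, `$gte`, and `$gt` operators only; `matches` is a hypothetical helper, and the real filtering is done server-side by Pinecone.

```python
def matches(metadata, flt):
    """Return True if metadata satisfies every condition in flt."""
    for field_name, condition in flt.items():
        value = metadata.get(field_name)
        if value is None:
            return False
        for op, operand in condition.items():
            if op == "$in":
                # A list-valued field matches if any element is in the operand.
                values = value if isinstance(value, list) else [value]
                if not any(v in operand for v in values):
                    return False
            elif op == "$gte":
                if not value >= operand:
                    return False
            elif op == "$gt":
                if not value > operand:
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
    return True


as_list = {"genre": ["드라마", "스릴러"], "year": 2019, "rating": 8.6}
as_joined = {"genre": "드라마, 스릴러", "year": 2019, "rating": 8.6}

thriller = {"genre": {"$in": ["스릴러"]}}
print(matches(as_list, thriller))    # list field: one element matches
print(matches(as_joined, thriller))  # joined string: no exact match
print(matches(as_list, {"year": {"$gte": 2015}, "rating": {"$gt": 8.0}}))
```

The second call shows why genre and actors are best stored as lists of strings: a single comma-joined string such as "드라마, 스릴러" is compared as a whole and never equals "스릴러". With LangChain, a filter dictionary like these can be passed through the `filter` argument of `similarity_search_with_score`; the exact behavior is worth checking against the installed version.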

🧮 7️⃣ Generating Embeddings and Upserting to Pinecone

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document

# ✅ 3️⃣ Define the movie data
data = [
    {
        "title": "응답하라 1988",
        "year": 2015,
        "genre": ["드라마", "코미디"],
        "director": "신원호",
        "actors": ["혜리", "박보검"],
        "rating": 9.2,
        "synopsis": "1988년 서울 쌍문동 이웃들의 우정과 가족애를 그린 드라마."
    },
    {
        "title": "기생충",
        "year": 2019,
        "genre": ["드라마", "스릴러"],
        "director": "봉준호",
        "actors": ["송강호", "조여정"],
        "rating": 8.6,
        "synopsis": "가난한 가족과 부유한 가족의 계급 격차를 다룬 영화."
    }
]

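# ✅ 4️⃣ Convert to LangChain Document objects
# The synopsis is the text that gets embedded; the other fields become metadata.
# genre and actors stay as lists of strings so that "$in" filters
# can match individual values.
documents = [
    Document(
        page_content=item["synopsis"],
        metadata={
            "title": item["title"],
            "year": item["year"],
            "genre": item["genre"],
            "director": item["director"],
            "actors": item["actors"],
            "rating": item["rating"]
        }
    )
    for item in data
]
print(f"✅ Document conversion complete ({len(documents)} documents)")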

# ✅ 5️⃣ Initialize the OpenAI embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# ✅ 6️⃣ Embed the documents and upsert them into the Pinecone index
vector_store = PineconeVectorStore.from_documents(
    documents=documents,
    embedding=embeddings,
    index_name=index_name
)
print("✅ Pinecone VectorStore upsert complete!")
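
When no IDs are supplied, the vector store generates them itself, so running the load script twice can leave duplicate records in the index. One way to avoid that is to derive a stable ID from each movie. The sketch below is an assumption-laden example: `movie_id` is a hypothetical helper, and hashing title plus year presumes that pair is unique in the dataset.

```python
import hashlib


def movie_id(title, year):
    """Derive a deterministic record ID from the title and release year."""
    key = f"{title}|{year}".encode("utf-8")
    return "movie-" + hashlib.sha256(key).hexdigest()[:16]


sample = [("응답하라 1988", 2015), ("기생충", 2019)]
ids = [movie_id(title, year) for title, year in sample]
print(ids)
```

The resulting list could then be passed as `ids=ids` to `PineconeVectorStore.from_documents`, assuming the installed langchain_pinecone version forwards that argument, so that a re-run overwrites existing records instead of adding new ones.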

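✅ Checking the Results

# ✅ 7️⃣ Search test
query = "가난과 부유함의 차이를 다룬 영화"  # "a film about the gap between rich and poor"
results = vector_store.similarity_search_with_score(query, k=2)

print("\n🔍 Search results:")
for doc, score in results:
    # Pinecone returns numeric metadata as floats, so the year prints as 2019.0
    print(f"🎬 {doc.metadata['title']} ({doc.metadata['year']})")
    print(f"💬 {doc.page_content}")
    print(f"🔢 Similarity score: {score:.4f}\n")

Example output (most similar match first):

🔍 Search results:
🎬 기생충 (2019.0)
💬 가난한 가족과 부유한 가족의 계급 격차를 다룬 영화.
🔢 Similarity score: 0.6398

🎬 응답하라 1988 (2015.0)
💬 1988년 서울 쌍문동 이웃들의 우정과 가족애를 그린 드라마.
🔢 Similarity score: 0.2984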
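
With k=2 and only two records, every record is returned regardless of relevance, so the score is what separates a real match from filler. A minimal sketch of post-filtering by score follows; `keep_relevant` is a hypothetical helper and the 0.5 cutoff is an arbitrary value chosen for this example, not a recommended threshold.

```python
def keep_relevant(results, min_score):
    """Sort (item, score) pairs by descending score and drop weak matches."""
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [(item, score) for item, score in ranked if score >= min_score]


# Titles and scores taken from the search output shown in this section.
sample_results = [("응답하라 1988", 0.2984), ("기생충", 0.6398)]
relevant = keep_relevant(sample_results, min_score=0.5)
print(relevant)  # [('기생충', 0.6398)]
```

A suitable cutoff depends on the embedding model and the data, so it is usually tuned on real queries.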

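🔍 8️⃣ Summary

Item	Description
Embedding model	OpenAI `text-embedding-3-small`
Record structure	ID + vector + metadata
Storage	Pinecone serverless index
Search method	Cosine similarity + metadata filter
Use cases	RAG, recommendation systems, content search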