Group Relative Policy Optimization (GRPO)

Discover Group Relative Policy Optimization (GRPO). Learn how this memory-efficient, critic-free RL algorithm improves LLM reasoning and reduces training costs.

Group Relative Policy Optimization (GRPO) is a memory-efficient reinforcement learning algorithm developed to enhance the reasoning capabilities of Large Language Models (LLMs) and broader Artificial Intelligence (AI) systems. First introduced in the 2024 DeepSeekMath paper, GRPO improves upon traditional optimization methods by removing the need for a separate value network (critic model). Instead, it samples a group of responses to the same prompt and normalizes their rewards within that group. By evaluating each response relative to its peers, GRPO reduces memory and compute overhead while improving performance on complex reasoning tasks in modern Deep Learning (DL) architectures.

Differences Between GRPO and PPO

While GRPO shares similarities with Proximal Policy Optimization (PPO), a standard optimization algorithm often used in reinforcement learning from human feedback (RLHF), the two differ significantly in architecture. PPO requires a secondary "critic" model that is trained alongside the main policy network to estimate the value of a given state. Because the critic is typically comparable in size to the policy, this can nearly double the memory needed for trainable model state during training.
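To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision Adam training at about 16 bytes per trainable parameter, a hypothetical 7B-parameter policy, and a critic of the same size. It ignores activations and any frozen reference or reward models, so treat the numbers as illustrative only.

```python
# Assumption: mixed-precision Adam training needs about 16 bytes per trainable
# parameter (2 for weights, 2 for gradients, 12 for fp32 optimizer state).
BYTES_PER_PARAM = 16


def trainable_state_gb(num_params: int) -> float:
    """Rough memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


policy_params = 7_000_000_000  # hypothetical 7B-parameter policy
critic_params = policy_params  # assumption: critic is the same size as the policy

ppo_gb = trainable_state_gb(policy_params) + trainable_state_gb(critic_params)
grpo_gb = trainable_state_gb(policy_params)  # no critic to train

print(f"PPO trainable state:  ~{ppo_gb:.0f} GB")
print(f"GRPO trainable state: ~{grpo_gb:.0f} GB")
```

Under these assumptions, dropping the critic halves the trainable state; the real saving depends on the critic's size and the rest of the training setup.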

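GRPO, by contrast, is a critic-free algorithm. It samples multiple outputs for a single prompt, scores them with a rule-based reward system or a verifier, and computes each output's advantage by normalizing the scores within that group. This relative comparison serves as the baseline, which saves the memory a value network would otherwise occupy and speeds up overall model training.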
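As a minimal sketch of the scoring step described above, the snippet below scores a group of sampled answers with a rule-based check and uses the group mean as the baseline. The `exact_match_reward` function and the sample answers are hypothetical and chosen only for illustration; real verifiers are usually more elaborate. Division by the group standard deviation is left out here to keep the focus on the baseline.

```python
from statistics import mean


def exact_match_reward(answer: str, reference: str) -> float:
    """Hypothetical rule-based verifier: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def baseline_relative_scores(answers: list[str], reference: str) -> list[float]:
    """Score every sampled answer, then subtract the group mean (the baseline)."""
    rewards = [exact_match_reward(a, reference) for a in answers]
    baseline = mean(rewards)  # plays the role a critic's value estimate plays in PPO
    return [r - baseline for r in rewards]


# Four hypothetical samples for one prompt whose reference answer is "42"
group = ["42", "41", " 42 ", "7"]
scores = baseline_relative_scores(group, "42")
print(scores)
```

Answers that beat the group average receive a positive score and the rest a negative one, with no learned value network involved.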

Real-World Applications of GRPO

GRPO has driven several recent advances in generative AI and natural language processing. Two notable applications are:

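Mathematical reasoning models: In DeepSeekMath and the widely cited DeepSeek-R1 release, GRPO was used to encourage models to develop chain-of-thought reasoning and self-verification, yielding results comparable to proprietary models such as OpenAI's o1. By rewarding correct final answers and proper formatting, the algorithm let models discover advanced problem-solving strategies organically, without extensive fine-tuning on human-annotated data.
Code generation and agentic logic: For models that write code or drive autonomous agent workflows, absolute correctness is hard to evaluate. GRPO lets a model learn by executing code variants and scoring them relative to one another based on compilation success or the number of test cases passed, which accelerates the deployment of reliable AI coding assistants.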
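The test-based scoring mentioned for code generation could be sketched as follows. This is an illustrative assumption rather than a description of any particular system: the `pass_rate` helper, the toy task, and the three candidate functions are hypothetical, and a real pipeline would execute generated code in a sandbox instead of calling in-process functions.

```python
from typing import Callable


def pass_rate(candidate: Callable[[int], int], tests: list[tuple[int, int]]) -> float:
    """Fraction of (input, expected) cases the candidate passes; exceptions count as failures."""
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate earns no credit for this case
    return passed / len(tests)


# Hypothetical task: return the square of an integer
tests = [(0, 0), (2, 4), (-3, 9)]

# Three hypothetical "generated" variants for the same prompt
candidates = [
    lambda x: x * x,   # correct on every case
    lambda x: x + x,   # right only for 0 and 2
    lambda x: 1 // x,  # wrong, and raises ZeroDivisionError on 0
]

rewards = [pass_rate(c, tests) for c in candidates]
print(rewards)
```

These per-candidate rewards form one group; normalizing them within the group gives the relative signal GRPO trains on.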

Implementing GRPO Concepts in PyTorch

At its core, GRPO calculates the relative advantage of responses by normalizing their rewards. Here is a basic PyTorch implementation demonstrating this normalization using standard tensor operations:



import torch

def compute_grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # 'rewards' is a float tensor of shape (batch_size, group_size), with group_size > 1
    group_mean = rewards.mean(dim=1, keepdim=True)
    group_std = rewards.std(dim=1, keepdim=True)

    # Normalize rewards within each group; eps avoids division by zero
    # when every response in a group receives the same reward
    advantages = (rewards - group_mean) / (group_std + eps)
    return advantages
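For illustration, the sketch below shows how group-relative advantages like these might feed a PPO-style clipped surrogate loss, the kind of objective the DeepSeekMath paper builds GRPO on. It is a simplified, sequence-level version under stated assumptions: the function name `grpo_clipped_loss`, the `clip_eps` default, and the toy tensors are hypothetical, and the KL penalty against a reference model is omitted.

```python
import torch


def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate on sequence-level log-probabilities.

    All inputs have shape (batch_size, group_size).
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the pessimistic (minimum) objective, i.e. minimize its negative
    return -torch.min(unclipped, clipped).mean()


# Toy example: one prompt, four sampled responses
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
advantages = (rewards - rewards.mean(dim=1, keepdim=True)) / (
    rewards.std(dim=1, keepdim=True) + 1e-8
)

logp_old = torch.tensor([[-2.0, -2.5, -1.5, -3.0]])
logp_new = logp_old.clone().requires_grad_(True)  # first update step: ratio == 1

loss = grpo_clipped_loss(logp_new, logp_old, advantages)
loss.backward()  # a descent step raises log-probs of above-average responses
```

On the first step the ratio is 1 everywhere, so the gradient on each log-probability is proportional to the negative of its advantage. Gradient descent therefore pushes probability toward responses that scored above their group mean.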

Advancing AI Through Smart Optimization

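Just as GRPO redefines the efficiency of text generation, advanced machine learning (ML) techniques continue to transform visual perception. By optimizing architectures and loss functions, developers can build lighter, faster models across domains.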

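For state-of-the-art computer vision tasks, exploring end-to-end optimization is equally important. For example, Ultralytics YOLO26 introduces a natively NMS-free architecture and a hybrid optimizer inspired by LLM research, which significantly improves edge deployment. Developers who want efficient computer vision workflows can use the Ultralytics Platform to build, train, and deploy models. This cloud-based tool simplifies complex dataset management and hyperparameter tuning to support robust real-time vision applications.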
