
Are LLMs Performing Well Enough?

by Leonardo.ai

How do we know whether the Large Language Models (LLMs) we have developed perform well enough?

This blog covers how to evaluate the performance of LLMs, with usage examples and Python code.

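import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

# Sample data
reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "The fast brown fox leaps over the sleepy dog."

# 1. BLEU Score
def calculate_bleu(reference, hypothesis):
    # Whitespace tokenization keeps the example simple; punctuation stays attached to words.
    reference_tokens = reference.split()
    hypothesis_tokens = hypothesis.split()
    # These two sentences share no 3-grams or 4-grams, so unsmoothed 4-gram BLEU
    # collapses to (nearly) zero with a warning. Smoothing keeps short texts comparable.
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference_tokens], hypothesis_tokens, smoothing_function=smoothing)

bleu_score = calculate_bleu(reference, hypothesis)
print(f"BLEU Score: {bleu_score}")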

# 2. ROUGE Score
def calculate_rouge(reference, hypothesis):
    rouge = Rouge()
    scores = rouge.get_scores(hypothesis, reference)
    return scores[0]

rouge_scores = calculate_rouge(reference, hypothesis)
print(f"ROUGE Scores: {rouge_scores}")

# 3. Perplexity
# Note: Computing perplexity on real text requires the probability the model assigns to each token.
# This simplified example uses made-up per-token probabilities instead.
def calculate_perplexity(probabilities):
    return np.exp(-np.mean(np.log(probabilities)))

sample_probabilities = [0.1, 0.2, 0.3, 0.4]
perplexity = calculate_perplexity(sample_probabilities)
print(f"Perplexity: {perplexity}")

# 4. Accuracy (for classification tasks)
def calculate_accuracy(true_labels, predicted_labels):
    return accuracy_score(true_labels, predicted_labels)

true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 1, 0, 0]
accuracy = calculate_accuracy(true_labels, predicted_labels)
print(f"Accuracy: {accuracy}")

# 5. Precision, Recall, F1-Score
def calculate_prf(true_labels, predicted_labels):
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='binary')
    return precision, recall, f1

precision, recall, f1 = calculate_prf(true_labels, predicted_labels)
print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")

# 6. Human Evaluation
# This typically involves manual scoring by human evaluators
# Example scoring rubric:
# Fluency: 1-5 scale
# Coherence: 1-5 scale
# Relevance: 1-5 scale

human_scores = {
    "Fluency": 4,
    "Coherence": 5,
    "Relevance": 4
}
average_human_score = sum(human_scores.values()) / len(human_scores)
print(f"Average Human Evaluation Score: {average_human_score}")

Explanation of the evaluation metrics and their use cases

BLEU Score (Bilingual Evaluation Understudy):
- Use case: Used mainly for machine translation, but it can also be applied to other text generation tasks.
- Interpretation: Measures how closely machine-generated text matches the reference text, based on n-gram overlap. Scores range from 0 to 1, where 1 is a perfect match.
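
To see what the BLEU score is built from, the sketch below computes the clipped n-gram precisions by hand for the two sample sentences from the code above. It is a simplified illustration (whitespace tokenization, a single reference, no brevity penalty), not a replacement for `sentence_bleu`. The helper names `ngrams` and `modified_precision` are made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference_tokens, hypothesis_tokens, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = Counter(ngrams(hypothesis_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "The fast brown fox leaps over the sleepy dog."
ref_tokens = reference.split()
hyp_tokens = hypothesis.split()

# Precision for 1-grams up to 4-grams, the orders used by standard BLEU.
precisions = {n: modified_precision(ref_tokens, hyp_tokens, n) for n in range(1, 5)}
for n, p in precisions.items():
    print(f"{n}-gram precision: {p:.3f}")
```

The 3-gram and 4-gram precisions are zero for this pair, so standard 4-gram BLEU comes out at (or very near) zero unless smoothing is applied, even though the sentences are clearly similar. This is one reason BLEU is usually reported over a whole test set rather than a single short sentence.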
ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation):
- Use case: Commonly used to evaluate summarization and other text generation tasks.
- Interpretation: Measures the n-gram overlap between the generated text and the reference text. It reports separate Precision, Recall, and F1 scores.
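
As a rough illustration of the Precision, Recall, and F1 breakdown, here is ROUGE-1 (unigram overlap) computed by hand. This is a minimal sketch that assumes lowercasing and whitespace tokenization; the `rouge` package may preprocess text differently, so its numbers need not match exactly. The function name `rouge_1` and the two sentences are made up for this example.

```python
from collections import Counter

def rouge_1(reference, hypothesis):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & hyp_counts).values())
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / hyp_total if hyp_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"p": precision, "r": recall, "f": f1}

# Hypothetical reference summary and a shorter generated summary.
scores = rouge_1("the cat sat on the mat", "the cat sat")
print(scores)
```

Every word of the short summary appears in the reference, so precision is perfect, but it covers only half of the reference words, so recall is 0.5. Recall is the component that penalizes a summary for leaving content out.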
Perplexity:
- Use case: Used to evaluate language models, especially on tasks such as next-word prediction.
- Interpretation: Measures how well a probability model predicts a sample. Lower perplexity indicates better performance.
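
Written out, for a sequence of N tokens where the model assigns probability p_i to the i-th token given the tokens before it (notation chosen here for illustration), perplexity is the exponential of the average negative log-probability:

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_i \right)
```

For the four sample probabilities 0.1, 0.2, 0.3, 0.4 used in the code, this is (0.1 × 0.2 × 0.3 × 0.4)^(-1/4) ≈ 4.52. Roughly speaking, the model is as uncertain as if it were choosing uniformly among about 4.5 options at each step.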
Accuracy:
- Use case: Used for classification tasks such as sentiment analysis or topic classification.
- Interpretation: The proportion of correct predictions out of all cases examined.
Precision, Recall, F1-Score:
- Use case: Used for classification tasks, especially when dealing with imbalanced datasets.
- Interpretation (TP, FP, and FN are the counts of true positives, false positives, and false negatives):
  - Precision = TP / (TP + FP)
  - Recall = TP / (TP + FN)
  - F1-Score: the harmonic mean of Precision and Recall
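
The formulas above can be checked by hand on the small label lists from the code. The sketch below counts TP, FP, FN, and TN directly, treating label 1 as the positive class (an assumption that matches `average='binary'` in scikit-learn's default setup). `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(true_labels, predicted_labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
accuracy = (tp + tn) / len(true_labels)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}, Accuracy={accuracy:.3f}")
```

Here the model never raises a false alarm (precision 1.0) but misses one of the three positive cases (recall about 0.667), a difference that accuracy alone would hide.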
Human Evaluation:
- Use case: Used to assess subjective qualities such as language quality, coherence, and relevance.
- Interpretation: Human evaluators typically score the outputs against several criteria. The scores are usually averaged to produce an overall rating.
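
In practice, more than one person usually rates each output. The sketch below averages scores per criterion across several evaluators and then takes an overall mean. The ratings are invented for illustration, and the three criteria simply mirror the rubric from the code above.

```python
from statistics import mean

# Hypothetical ratings: three evaluators score the same model output on a 1-5 scale.
ratings = [
    {"Fluency": 4, "Coherence": 5, "Relevance": 4},
    {"Fluency": 5, "Coherence": 4, "Relevance": 3},
    {"Fluency": 3, "Coherence": 4, "Relevance": 5},
]

criteria = list(ratings[0].keys())

# Average each criterion across evaluators, then average the criteria.
per_criterion = {c: mean(r[c] for r in ratings) for c in criteria}
overall = mean(per_criterion.values())

for criterion, score in per_criterion.items():
    print(f"{criterion}: {score:.2f}")
print(f"Overall: {overall:.2f}")
```

Keeping the per-criterion averages alongside the overall score makes it easier to see where a model is weak, and large gaps between evaluators on one criterion can signal that the rubric needs clearer guidance.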

Importantly, the choice of evaluation metrics depends on the task and the goals of the LLM. Several metrics are often used together to get a comprehensive evaluation.

Additional considerations

Task-specific metrics: Depending on the application, such as question answering or dialogue systems, task-specific evaluation metrics may be needed.
Standard benchmarks: Several standard benchmark datasets (e.g. GLUE, SuperGLUE) are used to evaluate and compare LLMs across tasks.
Bias and fairness evaluation: It is very important to evaluate LLMs for potential bias and for fairness across different populations.
Computational efficiency: Measures such as inference time and model size are important considerations for real-world deployment.
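
For the computational efficiency point, inference time can be measured with a simple timing loop. This is a minimal sketch: `fake_generate` is a hypothetical stand-in that only sleeps, and it should be swapped for a real model call. The warm-up run is there because the first call to a real model is often slower than later ones.

```python
import time
from statistics import mean

def fake_generate(prompt):
    """Hypothetical stand-in for a model call; replace with real inference."""
    time.sleep(0.01)  # simulate some work
    return prompt.upper()

def measure_latency(generate_fn, prompts, warmup=1):
    """Return the wall-clock time in seconds for each prompt."""
    # Warm-up calls are executed but not timed.
    for prompt in prompts[:warmup]:
        generate_fn(prompt)
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies

prompts = ["Hello", "Explain BLEU", "Summarize this text"]
latencies = measure_latency(fake_generate, prompts)
print(f"Mean latency: {mean(latencies) * 1000:.1f} ms, max: {max(latencies) * 1000:.1f} ms")
```

Reporting the maximum (or a high percentile) next to the mean is useful because occasional slow responses matter for real users even when the average looks fine.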

This blog was co-written with Claude.ai, using the prompt:

Please explain the metrics to evaluate LLMs.
