Data Engineering

Data Engineering SFT: NEFTune & SemDeDup | SLM Playbook

In the era of LLMs/SLMs, the classic data science proverb “Garbage In, Garbage Out” has never been more relevant. When performing Supervised Fine-Tuning (SFT) for Small Language Models (SLMs), data quality and format dictate over 90% of the model’s downstream capabilities. Feeding millions of raw, web-scraped dialogue pairs or low-quality synthetic data directly into your model will overfit it to repetitive phrasing, restrict its reasoning capabilities, and waste thousands of GPU hours. ...

Executive Summary: The Disruption of Naive RAG and the GraphRAG Era

If you have ever built an internal chatbot for your company by chunking documents, creating embeddings, and stuffing them into Pinecone or Milvus, you have undoubtedly encountered this scenario. User: “What was the Q3 revenue for product A, and how does it affect the Q4 strategy?” Bot: (replies hesitantly, outputs last year’s Q2 figures, and completely loses the context of the strategy question). Welcome to the disruption of Naive RAG (Retrieval-Augmented Generation). ...