SCORM to RAG Pipeline: Train Your LLM on E-Learning Content
TL;DR. Most enterprise L&D departments sit on hundreds of legacy SCORM courses: compliance training, onboarding modules, product tutorials. In 2026, that's the highest-value untapped training corpus for internal LLM assistants. This guide walks through converting SCORM 1.2 / 2004 packages into Markdown chunks, embedding them in a vector store, and exposing them to your team via a retrieval-augmented generation (RAG) chatbot. Total cost: ~$0.10 per course converted, plus standard embedding costs.
Why SCORM is an underrated training corpus
A typical Fortune 500 company has 200-2000 SCORM courses in its LMS — covering compliance (anti-bribery, harassment, GDPR), onboarding (HR policies, IT security), and product training. Each course is 30-200 slides, professionally authored, fact-checked, and aligned with internal policy. It's a higher-quality training corpus than Wikipedia for any internal-facing LLM.
The catch: a SCORM package is not a text document. Each ZIP contains 50-500 files of JavaScript, JSON, HTML, and binary media, so you can't just cat its content into a text file. You need a SCORM parser per authoring tool: Articulate Rise uses one format, Storyline another, iSpring a third. SCORM Converter ships 16 dedicated parsers for the major tools.
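To see why a generic unzip is not enough, it helps to look at what the package itself declares. Every SCORM package carries an imsmanifest.xml at the root of the ZIP. The sketch below (standard library only; the function name and the in-memory demo package are illustrative assumptions) lists the resources the manifest points at. It only tells you where the launch files are; the slide text still sits inside tool-specific JavaScript or JSON, which is what the per-tool parsers handle.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def list_scorm_resources(package) -> list[dict]:
    """Return identifier/href/type for each <resource> in imsmanifest.xml.

    `package` is a path or a file-like object holding the SCORM ZIP.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("imsmanifest.xml"))
    resources = []
    for el in root.iter():
        # Compare the local tag name so any manifest namespace is accepted.
        if el.tag.rsplit("}", 1)[-1] == "resource":
            resources.append({
                "identifier": el.get("identifier"),
                "href": el.get("href"),
                "type": el.get("type"),
            })
    return resources


# Tiny in-memory package, built only to demonstrate the call.
_manifest = """<?xml version="1.0"?>
<manifest identifier="demo" xmlns="http://www.imsproject.org/xsd/imscp_rootv1p1p2">
  <resources>
    <resource identifier="r1" type="webcontent" href="index.html"/>
  </resources>
</manifest>"""
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as zf:
    zf.writestr("imsmanifest.xml", _manifest)
    zf.writestr("index.html", "<html></html>")
_buf.seek(0)
demo_resources = list_scorm_resources(_buf)
print(demo_resources)
```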
Pipeline overview
SCORM library (ZIP files)
↓
[1] Extract per package via SCORM Converter
↓
Markdown (one .md per course)
↓
[2] Chunk (semantic, header-aware, 500-1000 tokens)
↓
Chunks with metadata (course_id, section, slide_n)
↓
[3] Embed (text-embedding-3-large or Cohere embed-v4.0)
↓
Vector store (Pinecone, Qdrant, pgvector)
↓
[4] RAG query loop (LLM with cite-as-you-go citations)
↓
Internal chatbot / Slack assistant
Step 1: extract SCORM to Markdown
Markdown is the right intermediate representation for RAG: it preserves headings (which power chunk boundaries), tables (GFM-compatible), and image/video references (which can be skipped or processed separately). With SCORM Converter:
# Bulk extract via API (when API launches)
for course in scorm_library/*.zip; do
  curl -X POST https://api.scormconverter.com/v1/extract \
    -H "Authorization: Bearer $TOKEN" \
    -F "file=@$course" \
    -F "format=markdown" \
    > "out/$(basename "$course" .zip).md"
done
Each output .md follows a predictable structure: # Course title, then ## Module, then ### Slide N: title, with body text underneath. This makes chunking trivial.
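To make that structure concrete, here is a minimal, dependency-free sketch that walks the three heading levels and emits one section per slide. The function name and the {"text", "metadata"} dict shape are illustrative assumptions, not part of SCORM Converter's output contract, and the sample document is invented for the demo.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")
LEVELS = {1: "course", 2: "module", 3: "slide"}


def parse_sections(markdown: str) -> list[dict]:
    """Split converter output into one section per heading, with header metadata."""
    sections, meta, body = [], {}, []

    def flush():
        # Emit the text collected since the previous heading, if any.
        text = "\n".join(body).strip()
        if text:
            sections.append({"text": text, "metadata": dict(meta)})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        meta[LEVELS[level]] = match.group(2).strip()
        # A new course or module heading invalidates the deeper levels.
        for deeper in range(level + 1, 4):
            meta.pop(LEVELS[deeper], None)
    flush()
    return sections


doc = (
    "# Travel Policy\n"
    "## Module 2\n"
    "### Slide 4: Approvals\n"
    "Expenses need manager approval.\n"
    "### Slide 7: Receipts\n"
    "Submit receipts within 30 days.\n"
)
sections = parse_sections(doc)
print(sections[1]["metadata"])
```

The sketch treats any line starting with one to three # characters as a heading, so Markdown that contains code samples with # comments would need a fence-aware parser instead.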
Step 2: semantic chunking
The chunking step is where naive RAG pipelines fail. Splitting by raw character count (e.g. every 500 chars) breaks slides mid-sentence. Splitting by token count is better but still ignores topical boundaries.
For SCORM, use header-aware chunking:
# Python: using LangChain's MarkdownHeaderTextSplitter
# (older LangChain releases expose it as langchain.text_splitter)
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "course"),
        ("##", "module"),
        ("###", "slide"),
    ],
)
chunks = splitter.split_text(markdown_content)
# Each chunk is a Document whose metadata holds course, module, slide
If individual slides exceed ~1000 tokens, fall back to a recursive splitter inside each header section. As long as the fallback splits the header-level documents rather than their raw text, the metadata carries over to every sub-chunk, which is crucial for citations later.
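The fallback can be sketched without any library. The version below is a simplified stand-in for a recursive splitter, not LangChain's implementation: it approximates tokens by whitespace-separated words (an assumption; a real tokenizer will count differently), packs paragraphs greedily up to the budget, and copies the header metadata onto every piece. The function name and the dict shape are illustrative.

```python
def split_oversized(section: dict, max_tokens: int = 1000) -> list[dict]:
    """Split one header section into pieces that aim for <= max_tokens each.

    Token counts are approximated by whitespace-separated words (assumption).
    A single paragraph longer than the budget is kept whole; a true recursive
    splitter would go on to split it by sentence.
    """
    def n_tokens(text: str) -> int:
        return len(text.split())

    if n_tokens(section["text"]) <= max_tokens:
        return [section]

    pieces, current = [], []
    for para in section["text"].split("\n\n"):
        # Start a new piece when adding this paragraph would exceed the budget.
        if current and n_tokens("\n\n".join(current + [para])) > max_tokens:
            pieces.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        pieces.append("\n\n".join(current))

    # Copy the header metadata onto every piece so citations still resolve.
    return [{"text": p, "metadata": dict(section["metadata"])} for p in pieces]


# Demo: three 400-word paragraphs under one slide heading.
section = {
    "text": "\n\n".join(["word " * 400] * 3),
    "metadata": {"course": "Travel Policy", "module": "Module 2", "slide": "Slide 4"},
}
pieces = split_oversized(section, max_tokens=1000)
print([len(p["text"].split()) for p in pieces])
```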
Step 3: embed and store
For internal-facing assistants, text-embedding-3-large (OpenAI) or embed-v4.0 (Cohere) are the current sweet spots. Both produce 1024-3072 dimensional vectors at $0.13-0.20 per 1M tokens.
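# Python: pgvector with OpenAI embeddings
import os
import psycopg2
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# DATABASE_URL is an assumed environment variable holding the Postgres DSN
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()
for chunk in chunks:
    embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunk.page_content,
    ).data[0].embedding
    # .get() avoids a KeyError for text that sits above the first slide heading
    cur.execute(
        "INSERT INTO scorm_chunks (course, module, slide, text, embedding) "
        "VALUES (%s, %s, %s, %s, %s)",
        (chunk.metadata.get("course"), chunk.metadata.get("module"),
         chunk.metadata.get("slide"), chunk.page_content, embedding),
    )
conn.commit()
Step 4: RAG query loop with citations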
The L&D chatbot pattern: a user asks "what's our policy on travel expenses?"; the assistant retrieves the top-K chunks (here, from the compliance course) and sends them to the LLM with an instruction to cite each fact:
# Pseudo-code RAG query
query_embedding = embed(user_question)
chunks = vector_db.similarity_search(query_embedding, k=5)
prompt = f"""
You are an L&D assistant. Answer the user question using ONLY the
following course excerpts. Cite each fact with [course:module:slide].
User question: {user_question}
Excerpts:
{format_chunks_with_metadata(chunks)}
Answer (with citations):
"""
response = llm.complete(prompt)
# "Travel expenses up to $X require manager approval [Travel Policy:Module 2:Slide 4].
# Receipts must be submitted within 30 days [...:...:Slide 7]."
Special considerations for SCORM
Multimodal: images and video
SCORM courses are heavily visual. Markdown export preserves image references (relative paths to the extracted assets), but doesn't embed images inline. For RAG, you have three options:
- Skip images. Most policy / compliance content is text-driven. You lose nothing important.
- Image captioning. Run each extracted image through Gemini or GPT-4-Vision to produce a caption, embed alongside text. ~$0.002 per image.
- Native multimodal embedding. Use Voyage AI Multimodal or Cohere Embed v4 (multimodal) to embed images as vectors directly. Best for technical/visual content.
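For the first option, skipping images can be as simple as removing the image references before chunking. A minimal sketch, assuming the export uses standard inline Markdown image syntax; the function name and the sample sentence are illustrative:

```python
import re

# Matches inline Markdown images: ![alt text](path)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\([^)]*\)")


def strip_images(markdown: str, keep_alt_text: bool = True) -> str:
    """Remove Markdown image references, optionally keeping their alt text."""
    replacement = r"\1" if keep_alt_text else ""
    return IMAGE_REF.sub(replacement, markdown)


text = "Submit receipts here. ![Expense form](assets/form.png) Keep copies."
print(strip_images(text))
```

Keeping the alt text is usually worthwhile when the course author wrote meaningful descriptions, since it adds retrievable words at no extra cost.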
Quiz extraction
Every authoring tool stores quiz questions slightly differently. SCORM Converter's parsers extract them into a structured ExtractedQuiz schema (question, options[], correctAnswer). For RAG, these are gold: they let your assistant verify employee understanding without exposing answers prematurely.
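One way to keep answers out of reach is to embed only the question and options, and carry the correct answer in metadata. The sketch below mirrors the three field names above, but the field types, the helper name, and the chunk shape are assumptions rather than SCORM Converter's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ExtractedQuiz:
    # Field names follow the schema described above; the types are assumed.
    question: str
    options: list[str]
    correctAnswer: str


def quiz_to_chunk(quiz: ExtractedQuiz, course: str) -> dict:
    """Build an embeddable chunk whose text does not mark the correct option."""
    lines = [f"Quiz question: {quiz.question}"]
    lines += [f"- {opt}" for opt in quiz.options]
    return {
        "text": "\n".join(lines),
        # The answer stays in metadata, so it is not part of the embedded text
        # and is only shown if the application chooses to reveal it.
        "metadata": {"course": course, "type": "quiz",
                     "correct_answer": quiz.correctAnswer},
    }


quiz = ExtractedQuiz(
    question="How soon must receipts be submitted?",
    options=["7 days", "30 days", "90 days"],
    correctAnswer="30 days",
)
chunk = quiz_to_chunk(quiz, course="Travel Policy")
print(chunk["text"])
```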
Multi-language corpora
~18% of enterprise SCORM is non-English. text-embedding-3-large and embed-v4.0 are both multilingual and produce comparable retrieval quality across English, Italian, Spanish, French, German, Portuguese, Russian, and Chinese. You can mix languages in the same vector store.
Production deployment
A small enterprise (500 employees, 50 courses) can deploy a full SCORM-RAG assistant in a week with these components:
- Vector store: pgvector on Supabase or Postgres ($25/mo)
- Embeddings: OpenAI text-embedding-3-large (~$5 one-time, ~$0.50/mo retrieval)
- LLM: Claude Haiku or GPT-4o-mini (~$0.50/query × 1000 queries/month = $500/mo)
- SCORM conversion: ~$0.10/course × 50 courses = $5 one-time
- Slack bot or web UI: open-source kits available
Total: ~$525/mo running cost for an assistant that answers internal policy and product questions using your real training content with citations.
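The arithmetic behind that total, using only the figures from the list above, can be written out as a short sketch. The dictionary keys are illustrative labels, and the Slack bot or web UI is left out because the list gives no price for it.

```python
# Monthly and one-time figures taken from the component list (USD).
monthly = {
    "vector_store": 25.00,
    "embedding_retrieval": 0.50,
    "llm": 0.50 * 1000,  # $0.50/query x 1000 queries/month
}
one_time = {
    "embedding_corpus": 5.00,
    "scorm_conversion": 0.10 * 50,  # $0.10/course x 50 courses
}

monthly_total = sum(monthly.values())
one_time_total = sum(one_time.values())
print(f"monthly: ${monthly_total:.2f}, one-time: ${one_time_total:.2f}")
# prints: monthly: $525.50, one-time: $10.00
```

The LLM line accounts for almost the entire running cost, so the per-query figure and the query volume are the two numbers worth revisiting for your own deployment.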
Pitfalls to avoid
- Don't embed the entire course as one chunk. Retrieval quality crashes. Split per slide / per heading.
- Don't skip the metadata. Without {course, module, slide} metadata, citations are useless and your users won't trust the assistant.
- Don't use SCORM courses with remote-launcher iframes. About 12% of SCORM packages are just iframe loaders pointing to a vendor cloud (Coassemble, Genially, MathAlea, Elucidat). There's nothing to extract; the content lives on the vendor's server.
- Don't skip the quality gate. A SCORM converter that silently returns "loading..." placeholders for unsupported tools will poison your vector store. Use one with explicit silent-failure detection.
- Don't re-embed monthly. Embeddings are stable; only re-embed when you change content or upgrade the embedding model.
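As a second line of defence for the quality gate, a cheap check on the converted Markdown can run before anything is embedded. This is a minimal sketch, not a replacement for converter-side detection: the placeholder strings, the 50-word threshold, and the function name are all assumptions to adapt to your own corpus.

```python
# Placeholder strings are assumptions; extend with whatever your converter emits.
PLACEHOLDERS = ("loading...", "please wait", "javascript is required")


def passes_quality_gate(markdown: str, min_words: int = 50) -> tuple[bool, str]:
    """Return (ok, reason) for one converted course."""
    # Look only at body text: skip blank lines and heading lines.
    body_lines = [ln for ln in markdown.splitlines()
                  if ln.strip() and not ln.lstrip().startswith("#")]
    body = " ".join(body_lines)
    lowered = body.lower()
    for marker in PLACEHOLDERS:
        if marker in lowered:
            return False, f"placeholder text found: {marker!r}"
    if len(body.split()) < min_words:
        return False, "too little body text"
    return True, "ok"


good = ("# Travel Policy\n## Module 2\n### Slide 4: Approvals\n"
        + "Expenses require manager approval. " * 20)
bad = "# Course\n### Slide 1: Intro\nLoading..."
print(passes_quality_gate(good), passes_quality_gate(bad))
```

Courses that fail the gate are best set aside for manual review rather than dropped silently, so gaps in the corpus stay visible.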