AI Engineering · Updated June 2026

SCORM to RAG Pipeline: Train Your LLM on E-Learning Content

14 min read · Updated June 22, 2026

TL;DR. Most enterprise L&D departments sit on hundreds of legacy SCORM courses: compliance training, onboarding modules, product tutorials. In 2026, that's the highest-value untapped training corpus for internal LLM assistants. This guide walks through converting SCORM 1.2 / 2004 packages into Markdown chunks, embedding them in a vector store, and exposing them to your team via a retrieval-augmented generation (RAG) chatbot. Total cost: ~$0.10 per course converted, plus standard embedding costs.

Why SCORM is an underrated training corpus

A typical Fortune 500 company has 200-2000 SCORM courses in its LMS, covering compliance (anti-bribery, harassment, GDPR), onboarding (HR policies, IT security), and product training. Each course is 30-200 slides, professionally authored, fact-checked, and aligned with internal policy. For an internal-facing LLM, that makes it a higher-quality corpus than general-purpose sources such as Wikipedia.

The catch: a SCORM package is not a single readable document. Each ZIP contains 50-500 files of JavaScript, JSON, HTML, and binary media, with the course text spread across tool-specific files. You can't just cat its content into a text file. You need a SCORM parser per authoring tool: Articulate Rise uses one format, Storyline another, iSpring a third. SCORM Converter ships 16 dedicated parsers for the major tools.
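To get a feel for what a package contains before picking a parser, a minimal standard-library sketch can read the manifest and count files by extension. It relies only on the fact that SCORM 1.2 and 2004 packages carry an imsmanifest.xml at the ZIP root; the function name `summarize_scorm` and the demo file names are made up for illustration, and this is not how SCORM Converter works internally.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET


def summarize_scorm(zip_source):
    """Return the manifest's resource hrefs and a count of files by extension."""
    with zipfile.ZipFile(zip_source) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
        # imsmanifest.xml sits at the package root in SCORM 1.2 and 2004.
        manifest = ET.fromstring(zf.read("imsmanifest.xml"))

    # Compare local tag names so any manifest namespace is accepted.
    hrefs = [
        el.attrib["href"]
        for el in manifest.iter()
        if el.tag.rsplit("}", 1)[-1] == "resource" and "href" in el.attrib
    ]

    by_ext = {}
    for name in names:
        ext = posixpath.splitext(name)[1].lower()
        by_ext[ext] = by_ext.get(ext, 0) + 1
    return {"launch_files": hrefs, "files_by_extension": by_ext}


def _demo_package():
    """Build a tiny in-memory package (hypothetical file names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "imsmanifest.xml",
            '<manifest xmlns="http://www.imsproject.org/xsd/imscp_rootv1p1p2">'
            '<resources><resource identifier="r1" href="index.html"/></resources>'
            "</manifest>",
        )
        zf.writestr("index.html", "<html></html>")
        zf.writestr("lib/player.js", "// player")
    buf.seek(0)
    return buf


summary = summarize_scorm(_demo_package())
print(summary)
```

The manifest only tells you which file launches the course. The slide text itself still has to be pulled out of the tool-specific files, which is where the per-tool parsers come in.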

Pipeline overview

SCORM library (ZIP files)
       ↓
  [1] Extract per package via SCORM Converter
       ↓
  Markdown (one .md per course)
       ↓
  [2] Chunk (semantic, header-aware, 500-1000 tokens)
       ↓
  Chunks with metadata (course_id, section, slide_n)
       ↓
  [3] Embed (text-embedding-3-large or Cohere embed-v4.0)
       ↓
  Vector store (Pinecone, Qdrant, pgvector)
       ↓
  [4] RAG query loop (LLM with cite-as-you-go citations)
       ↓
  Internal chatbot / Slack assistant

Step 1: extract SCORM to Markdown

Markdown is the right intermediate representation for RAG: it preserves headings (which power chunk boundaries), tables (GFM-compatible), and image/video references (which can be skipped or processed separately). With SCORM Converter:

# Bulk extract via API (when API launches)
mkdir -p out
for course in scorm_library/*.zip; do
  # --fail makes curl exit non-zero on HTTP errors instead of saving the error body
  curl --fail -X POST https://api.scormconverter.com/v1/extract \
    -H "Authorization: Bearer $TOKEN" \
    -F "file=@$course" \
    -F "format=markdown" \
    > "out/$(basename "$course" .zip).md"
done

Each output .md follows a predictable structure: # Course title, then ## Module, then ### Slide N: title with body text underneath. This makes chunking straightforward.
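Because the structure is predictable, it can be checked before anything is chunked. The sketch below is one possible sanity check, assuming the heading layout described above; `outline` and `looks_like_course` are hypothetical helper names, and the rule "exactly one course title, at least one slide" is an assumption you may want to loosen.

```python
import re

# One to three '#' characters, at least one space, then the heading title.
HEADING = re.compile(r"^(#{1,3}) +(.*\S)\s*$")


def outline(markdown_text):
    """Return (level, title) pairs for every #, ## and ### heading."""
    pairs = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            pairs.append((len(match.group(1)), match.group(2)))
    return pairs


def looks_like_course(markdown_text):
    """True if there is exactly one course title and at least one slide heading."""
    levels = [level for level, _ in outline(markdown_text)]
    return levels.count(1) == 1 and levels.count(3) >= 1


sample = """# Travel Policy
## Module 1
### Slide 1: Overview
Body text.
### Slide 2: Approvals
More text.
"""

print(outline(sample))
print(looks_like_course(sample))
```

A file that fails this check (for example, one that contains only a placeholder string) is a candidate for manual review rather than embedding.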

Step 2: semantic chunking

The chunking step is where naive RAG pipelines fail. Splitting by raw character count (e.g. every 500 chars) breaks slides mid-sentence. Splitting by token count is better but still ignores topical boundaries.

For SCORM, use header-aware chunking:

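# Python: LangChain's MarkdownHeaderTextSplitter
# (pip install langchain-text-splitters)
from pathlib import Path

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Hypothetical file name; use any .md produced in step 1
markdown_content = Path("out/travel-policy.md").read_text(encoding="utf-8")

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "course"),
        ("##", "module"),
        ("###", "slide"),
    ],
)

# Returns Document objects: text in .page_content, headers in .metadata
chunks = splitter.split_text(markdown_content)
# Each chunk now has metadata: course, module, slide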

If individual slides exceed ~1000 tokens, fall back to a recursive splitter inside each header section. When you pass the header-split documents to the recursive splitter's split_documents method, each sub-chunk keeps its parent's metadata, which is crucial for citations later.
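The fallback idea can be sketched without any dependency. The version below is a simple paragraph-boundary splitter, not LangChain's recursive splitter: it assumes chunks are plain dicts with "text" and "metadata" keys, approximates tokens by whitespace-separated words (which undercounts real tokenizer output), and leaves a single paragraph whole even if it exceeds the budget. The name `split_oversized` and the extra "part" key are illustrative.

```python
def split_oversized(chunk, max_tokens=1000):
    """Split one chunk dict into smaller chunks on paragraph boundaries.

    `chunk` is {"text": str, "metadata": dict}. Token counts are approximated
    by whitespace-separated words.
    """
    paragraphs = [p.strip() for p in chunk["text"].split("\n\n") if p.strip()]
    pieces, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new piece when adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        pieces.append("\n\n".join(current))
    # Every piece carries a copy of the parent's metadata for citations.
    return [
        {"text": text, "metadata": {**chunk["metadata"], "part": i}}
        for i, text in enumerate(pieces, start=1)
    ]


slide = {
    "text": "one two three four\n\nfive six seven eight\n\nnine ten eleven twelve",
    "metadata": {"course": "Travel Policy", "module": "Module 2", "slide": "Slide 4"},
}
parts = split_oversized(slide, max_tokens=8)
print(len(parts), [p["metadata"]["part"] for p in parts])
```

Copying the metadata into every piece is the important part: whichever splitter you use, each sub-chunk should still know which course, module, and slide it came from.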

Step 3: embed and store

For internal-facing assistants, text-embedding-3-large (OpenAI) and embed-v4.0 (Cohere) are the current sweet spots. Both produce 1024- to 3072-dimensional vectors at $0.13-0.20 per 1M tokens.

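# Python: pgvector with OpenAI embeddings
import os

import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# DATABASE_URL is an assumed environment variable holding the Postgres DSN
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

for chunk in chunks:
    embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunk.page_content,
    ).data[0].embedding

    # .get() avoids a KeyError for text that sits above the first slide heading
    cur.execute(
        "INSERT INTO scorm_chunks (course, module, slide, text, embedding) "
        "VALUES (%s, %s, %s, %s, %s)",
        (chunk.metadata.get("course"), chunk.metadata.get("module"),
         chunk.metadata.get("slide"), chunk.page_content, embedding)
    )

conn.commit()  # without this, psycopg2 discards the inserts when the connection closes
cur.close()
conn.close()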
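The loop above makes one API request per chunk. OpenAI's embeddings endpoint also accepts a list of inputs, so for a large library it may be worth batching. The sketch below shows only the batching logic with a stand-in embedder so it runs offline; `batched`, `embed_all`, `fake_embed_batch`, and the batch size of 100 are illustrative choices, not recommendations from the vendor.

```python
def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(texts, embed_batch, batch_size=100):
    """Embed `texts` in batches.

    `embed_batch` is any callable mapping a list of strings to a list of
    vectors in the same order.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch):
    """Stand-in embedder for offline runs: one 2-d vector per text."""
    return [[float(len(text)), 0.0] for text in batch]


vectors = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

In a real run, `embed_batch` would wrap a call such as client.embeddings.create(model="text-embedding-3-large", input=batch) and return the embedding of each item in response.data.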

Step 4: RAG query loop with citations

The L&D chatbot pattern: a user asks "what's our policy on travel expenses?", the pipeline retrieves the top-K chunks (here, from the compliance course), and sends them to the LLM with an instruction to cite each fact:

# Pseudo-code RAG query
query_embedding = embed(user_question)
chunks = vector_db.similarity_search(query_embedding, k=5)

prompt = f"""
You are an L&D assistant. Answer the user question using ONLY the
following course excerpts. Cite each fact with [course:module:slide].

User question: {user_question}

Excerpts:
{format_chunks_with_metadata(chunks)}

Answer (with citations):
"""

response = llm.complete(prompt)
# "Travel expenses up to $X require manager approval [Travel Policy:Module 2:Slide 4].
#  Receipts must be submitted within 30 days [...:...:Slide 7]."
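The pseudo-code leaves format_chunks_with_metadata undefined. One possible shape for it is sketched below, assuming each retrieved chunk is a dict with "text" and "metadata" keys (your vector store client may return a different type). The "Excerpt N" numbering is an illustrative choice; the [course:module:slide] tag mirrors the citation format the prompt asks for.

```python
def format_chunks_with_metadata(chunks):
    """Render retrieved chunks as numbered excerpts tagged [course:module:slide]."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        # Missing keys become "?" so a gap in the metadata stays visible.
        tag = ":".join(meta.get(key, "?") for key in ("course", "module", "slide"))
        blocks.append(f"Excerpt {i} [{tag}]\n{chunk['text'].strip()}")
    return "\n\n".join(blocks)


retrieved = [
    {
        "text": "Receipts are due within 30 days.",
        "metadata": {"course": "Travel Policy", "module": "Module 2", "slide": "Slide 7"},
    },
    {
        "text": "Ask your manager before booking.",
        "metadata": {"course": "Travel Policy", "module": "Module 2"},
    },
]
excerpts = format_chunks_with_metadata(retrieved)
print(excerpts)
```

Putting the tag right next to each excerpt gives the model the exact string to copy into its citations.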

Special considerations for SCORM

Multimodal: images and video

SCORM courses are heavily visual. Markdown export preserves image references (relative paths to the extracted assets), but doesn't embed images inline. For RAG, you have three options:

Skip images. Most policy / compliance content is text-driven. You lose nothing important.
Image captioning. Run each extracted image through Gemini or GPT-4-Vision to produce a caption, embed alongside text. ~$0.002 per image.
Native multimodal embedding. Use Voyage AI Multimodal or Cohere Embed v4 (multimodal) to embed images as vectors directly. Best for technical/visual content.
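The first two options can share one preprocessing step. The sketch below either drops Markdown image references or swaps each one for a caption. It assumes standard ![alt](path) syntax in the exported Markdown; `replace_images` and the `caption_for` callback are hypothetical names, and the callback is where a call to a captioning model would go.

```python
import re

# Matches ![alt](path) and ![alt](path "title"); group 2 is the path.
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def list_images(markdown_text):
    """Return (alt_text, path) pairs for every image reference."""
    return [(m.group(1), m.group(2)) for m in IMAGE_REF.finditer(markdown_text)]


def replace_images(markdown_text, caption_for=None):
    """Drop image references, or replace each with '[Image: <caption>]'.

    `caption_for` takes the image path and returns a caption string.
    """
    def _swap(match):
        if caption_for is None:
            return ""
        return f"[Image: {caption_for(match.group(2))}]"

    return IMAGE_REF.sub(_swap, markdown_text)


slide_md = "Submit receipts.\n\n![Approval flow](assets/flow.png)\n\nAsk your manager."
print(list_images(slide_md))
print(replace_images(slide_md, caption_for=lambda path: "approval flow diagram"))
```

Running the replacement before chunking keeps captions inside the same chunk as the surrounding slide text, so they inherit the slide's metadata.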

Quiz extraction

Every authoring tool stores quiz questions slightly differently. SCORM Converter's parsers extract them into a structured ExtractedQuiz schema (question, options[], correctAnswer). For RAG, these are gold: they let your assistant verify employee understanding without exposing answers prematurely.
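As a rough illustration of how such a record could be used, the sketch below models the three fields named above as a Python dataclass. The field types are assumptions (the real schema may store correctAnswer as an index rather than a string), and the two methods are hypothetical helpers: one produces text that is safe to index, the other checks a learner's reply.

```python
from dataclasses import dataclass


@dataclass
class ExtractedQuiz:
    question: str
    options: list[str]
    correct_answer: str  # called correctAnswer in the schema; type is assumed

    def as_retrieval_text(self):
        """Question and options only, so an indexed chunk never contains the answer."""
        lines = [f"Question: {self.question}"]
        lines += [f"- {option}" for option in self.options]
        return "\n".join(lines)

    def check(self, answer):
        """Compare a learner's answer with the stored one, ignoring case and spacing."""
        return answer.strip().lower() == self.correct_answer.strip().lower()


quiz = ExtractedQuiz(
    question="Within how many days must receipts be submitted?",
    options=["7 days", "30 days", "90 days"],
    correct_answer="30 days",
)
print(quiz.as_retrieval_text())
print(quiz.check(" 30 Days "))
```

Keeping the answer out of the embedded text and checking it in application code is one way to avoid exposing answers prematurely.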

Multi-language corpora

~18% of enterprise SCORM is non-English. text-embedding-3-large and embed-v4.0 are both multilingual and produce comparable retrieval quality across English, Italian, Spanish, French, German, Portuguese, Russian, and Chinese. You can mix languages in the same vector store.

Production deployment

A small enterprise (500 employees, 50 courses) can deploy a full SCORM-RAG assistant in a week with these components:

Vector store: pgvector on Supabase or Postgres ($25/mo)
Embeddings: OpenAI text-embedding-3-large (~$5 one-time, ~$0.50/mo retrieval)
LLM: Claude Haiku or GPT-4o-mini (~$0.50/query × 1000 queries/month = $500/mo)
SCORM conversion: ~$0.10/course × 50 courses = $5 one-time
Slack bot or web UI: open-source kits available

Total: ~$525/mo running cost for an assistant that answers internal policy and product questions using your real training content with citations.
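The total can be reproduced from the line items above. The snippet uses only the figures already quoted in this section; the dictionary keys are just labels.

```python
# Recurring costs per month, in USD, as listed above.
monthly = {
    "vector_store": 25.00,
    "embedding_retrieval": 0.50,
    "llm": 0.50 * 1000,  # $0.50 per query x 1000 queries per month
}

# One-time setup costs, in USD.
one_time = {
    "embeddings": 5.00,
    "scorm_conversion": 0.10 * 50,  # $0.10 per course x 50 courses
}

monthly_total = sum(monthly.values())
one_time_total = sum(one_time.values())
print(f"monthly: ${monthly_total:.2f}, one-time: ${one_time_total:.2f}")
```

The LLM line dominates the monthly figure, so the per-query cost and query volume are the two numbers worth re-estimating for your own deployment.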

Pitfalls to avoid

Don't embed the entire course as one chunk. Retrieval quality crashes. Split per slide / per heading.
Don't skip the metadata. Without {course, module, slide} metadata, citations are useless and your users won't trust the assistant.
Don't use SCORM courses with remote-launcher iframes. About 12% of SCORM packages are just iframe loaders pointing to a vendor cloud (Coassemble, Genially, MathAlea, Elucidat). There's nothing to extract; the content lives on the vendor's server.
Don't skip the quality gate. A SCORM converter that silently returns "loading..." placeholders for unsupported tools will poison your vector store. Use one with explicit silent-failure detection.
Don't re-embed monthly. Embeddings are stable; only re-embed when you change content or upgrade the embedding model.
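The last point can be enforced mechanically by storing a content hash next to each vector. The sketch below is one way to decide what needs re-embedding, assuming each chunk is a dict with a stable "key" (for example course, module, and slide joined together) and its "text". `content_hash` and `chunks_to_embed` are hypothetical names.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_embed(chunks, stored_hashes, model, stored_model):
    """Return the chunks that need a fresh embedding.

    `stored_hashes` maps a chunk key to the hash saved at the last embedding run.
    """
    if model != stored_model:
        return list(chunks)  # a new embedding model invalidates every vector
    return [
        chunk for chunk in chunks
        if stored_hashes.get(chunk["key"]) != content_hash(chunk["text"])
    ]


current = [
    {"key": "travel:m2:s4", "text": "Expenses need manager approval."},
    {"key": "travel:m2:s7", "text": "Receipts are due within 30 days."},
]
stored = {
    "travel:m2:s4": content_hash("Expenses need manager approval."),
    "travel:m2:s7": content_hash("Receipts are due within 60 days."),
}
stale = chunks_to_embed(current, stored, "text-embedding-3-large", "text-embedding-3-large")
print([chunk["key"] for chunk in stale])
```

Only the edited slide is returned here; switching the model name returns every chunk, which matches the two re-embedding triggers named above.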
