AI Engineering · Updated June 2026

SCORM to RAG Pipeline: Train Your LLM on E-Learning Content

14 min read · Updated June 22, 2026

TL;DR. Most enterprise L&D departments sit on hundreds of legacy SCORM courses: compliance training, onboarding modules, product tutorials. In 2026, that's the highest-value untapped training corpus for internal LLM assistants. This guide walks through converting SCORM 1.2 / 2004 packages into Markdown chunks, embedding them in a vector store, and exposing them to your team via a retrieval-augmented generation (RAG) chatbot. Total cost: ~$0.10 per course converted, plus standard embedding costs.

Why SCORM is an underrated training corpus

A typical Fortune 500 company has 200-2000 SCORM courses in its LMS — covering compliance (anti-bribery, harassment, GDPR), onboarding (HR policies, IT security), and product training. Each course is 30-200 slides, professionally authored, fact-checked, and aligned with internal policy. It's a higher-quality training corpus than Wikipedia for any internal-facing LLM.

The catch: SCORM is a package format, not a text format. Each ZIP contains an imsmanifest.xml plus 50-500 files of JavaScript, JSON, HTML, and binary media. You can't just cat its content into a text file. You need a SCORM parser per authoring tool: Articulate Rise uses one format, Storyline another, iSpring a third. SCORM Converter ships 16 dedicated parsers for the major tools.

Pipeline overview

SCORM library (ZIP files)
       ↓
  [1] Extract per package via SCORM Converter
       ↓
  Markdown (one .md per course)
       ↓
  [2] Chunk (semantic, header-aware, 500-1000 tokens)
       ↓
  Chunks with metadata (course_id, section, slide_n)
       ↓
  [3] Embed (text-embedding-3-large or Cohere embed-v4.0)
       ↓
  Vector store (Pinecone, Qdrant, pgvector)
       ↓
  [4] RAG query loop (LLM with cite-as-you-go citations)
       ↓
  Internal chatbot / Slack assistant

Step 1: extract SCORM to Markdown

Markdown is the right intermediate representation for RAG: it preserves headings (which power chunk boundaries), tables (GFM-compatible), and image/video references (which can be skipped or processed separately). With SCORM Converter:

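# Bulk extract via API (when API launches)
mkdir -p out   # make sure the output directory exists
for course in scorm_library/*.zip; do
  # --fail makes curl exit non-zero on HTTP errors instead of saving the error body
  curl --fail -X POST https://api.scormconverter.com/v1/extract \
    -H "Authorization: Bearer $TOKEN" \
    -F "file=@$course" \
    -F "format=markdown" \
    > "out/$(basename "$course" .zip).md"
done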

Each output .md follows a predictable structure: # Course title, then ## Module, then ### Slide N: title with body text underneath. This makes chunking trivial.
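To make that structure concrete, here is a minimal sketch that walks a converted file and recovers the course, module, and slide number for each slide. The exact heading shape ("### Slide 4: Approvals") is an assumption based on the description above, and the helper names parse_slide_heading and outline are hypothetical, not part of any converter API.

```python
import re

# Assumed heading shape, taken from the structure described above:
# "### Slide 12: Reporting a concern"
SLIDE_HEADING = re.compile(r"^###\s+Slide\s+(\d+)\s*:\s*(.*)$")


def parse_slide_heading(line: str):
    """Return (slide_number, title) for a slide heading, else None."""
    match = SLIDE_HEADING.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def outline(markdown: str):
    """List (course, module, slide_number, slide_title) for every slide."""
    course = module = None
    rows = []
    for line in markdown.splitlines():
        if line.startswith("# "):
            course = line[2:].strip()
        elif line.startswith("## "):
            module = line[3:].strip()
        else:
            parsed = parse_slide_heading(line)
            if parsed:
                rows.append((course, module, *parsed))
    return rows
```

An outline like this is a cheap sanity check before chunking: a course that yields zero rows probably did not convert into the expected layout.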

Step 2: semantic chunking

The chunking step is where naive RAG pipelines fail. Splitting by raw character count (e.g. every 500 chars) breaks slides mid-sentence. Splitting by token count is better but still ignores topical boundaries.

For SCORM, use header-aware chunking:

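# Python: using LangChain's MarkdownHeaderTextSplitter
# (pip install langchain-text-splitters; older releases expose the same
# class as langchain.text_splitter.MarkdownHeaderTextSplitter)
from pathlib import Path

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Example path: one Markdown file produced in step 1
markdown_content = Path("out/course.md").read_text(encoding="utf-8")

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "course"),
        ("##", "module"),
        ("###", "slide"),
    ],
)

chunks = splitter.split_text(markdown_content)
# Each chunk is a Document whose metadata holds the headers above it:
# course, module, slide (a key is absent if that header level is missing)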

If individual slides exceed ~1000 tokens, fall back to a recursive splitter inside each header section. The metadata is preserved automatically, which is crucial for citations later.
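The fallback can be sketched without any library, which makes the mechanics easier to see. The version below splits an oversized section on paragraph breaks and copies the section metadata onto every piece. It is an illustration under stated assumptions, not LangChain's implementation: the 4-characters-per-token estimate is a rough heuristic for English, and the "part" metadata key is a hypothetical addition.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English."""
    return max(1, len(text) // 4)


def split_oversized(text: str, max_tokens: int = 1000):
    """Split text on paragraph breaks so each piece stays under max_tokens.

    A single paragraph longer than max_tokens is kept whole rather than
    cut mid-sentence.
    """
    pieces, current = [], []
    for paragraph in text.split("\n\n"):
        candidate = "\n\n".join(current + [paragraph])
        if current and approx_tokens(candidate) > max_tokens:
            # Adding this paragraph would overflow: close the current piece.
            pieces.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        pieces.append("\n\n".join(current))
    return pieces


def split_section(section_text: str, metadata: dict, max_tokens: int = 1000):
    """Return chunk dicts; every piece carries a copy of the section metadata."""
    return [
        {"text": piece, "metadata": {**metadata, "part": index}}
        for index, piece in enumerate(split_oversized(section_text, max_tokens))
    ]
```

Because every piece inherits the same course, module, and slide values, a citation built from any piece still points at the right slide.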

Step 3: embed and store

For internal-facing assistants, text-embedding-3-large (OpenAI) or embed-v4.0 (Cohere) are the current sweet spots. Both produce 1024-3072 dimensional vectors at $0.13-0.20 per 1M tokens.

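# Python: pgvector with OpenAI embeddings
import os

import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# DATABASE_URL is an assumed environment variable holding the Postgres DSN
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

for chunk in chunks:
    embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunk.page_content,
    ).data[0].embedding

    # .get() because a chunk above the first ## or ### has no module/slide key
    cur.execute(
        "INSERT INTO scorm_chunks (course, module, slide, text, embedding) "
        "VALUES (%s, %s, %s, %s, %s)",
        (chunk.metadata.get("course"), chunk.metadata.get("module"),
         chunk.metadata.get("slide"), chunk.page_content, embedding)
    )

conn.commit()  # without a commit, psycopg2 discards the inserts on close
cur.close()
conn.close()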
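One request per chunk is simple but slow for a large library. Embedding endpoints generally accept a list of inputs, so a batching helper along the following lines can cut the number of round trips. This is a sketch: embed_batch is a hypothetical callable you supply, and the batch size of 100 is an arbitrary default, not a provider limit.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch, preserving input order.

    `embed_batch` is any callable mapping a list of strings to a list of
    vectors, for example a thin wrapper around an embeddings API call.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors
```

With the OpenAI client, embed_batch could wrap a single embeddings.create call whose input is the whole batch and return the embedding of each item in the response's data list.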

Step 4: RAG query loop with citations

The L&D chatbot pattern: a user asks "what's our policy on travel expenses?", the pipeline retrieves the top-K chunks (here, from the compliance course), and sends them to the LLM with an instruction to cite each fact:

# Pseudo-code RAG query
query_embedding = embed(user_question)
chunks = vector_db.similarity_search(query_embedding, k=5)

prompt = f"""
You are an L&D assistant. Answer the user question using ONLY the
following course excerpts. Cite each fact with [course:module:slide].

User question: {user_question}

Excerpts:
{format_chunks_with_metadata(chunks)}

Answer (with citations):
"""

response = llm.complete(prompt)
# "Travel expenses up to $X require manager approval [Travel Policy:Module 2:Slide 4].
#  Receipts must be submitted within 30 days [...:...:Slide 7]."
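The format_chunks_with_metadata helper is left undefined above. One possible shape, assuming retrieved chunks expose page_content and a metadata dict as in step 2, is sketched below. The Chunk dataclass is only a stand-in so the example runs on its own, and the "?" fallback for missing keys is an arbitrary choice.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a retrieved chunk: text plus header metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def citation_tag(metadata: dict) -> str:
    """Build the [course:module:slide] tag; missing keys become '?'."""
    parts = [str(metadata.get(key, "?")) for key in ("course", "module", "slide")]
    return "[" + ":".join(parts) + "]"


def format_chunks_with_metadata(chunks) -> str:
    """Render each excerpt under its citation tag, separated by blank lines."""
    blocks = [
        f"{citation_tag(chunk.metadata)}\n{chunk.page_content.strip()}"
        for chunk in chunks
    ]
    return "\n\n".join(blocks)
```

Putting the tag directly above each excerpt gives the model the exact string it is asked to reproduce, which tends to keep citations copy-accurate.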

Special considerations for SCORM

Multimodal: images and video

SCORM courses are heavily visual. Markdown export preserves image references (relative paths to the extracted assets), but doesn't embed images inline. For RAG, you have three options:

  • Skip images. Most policy / compliance content is text-driven, so you usually lose little.
  • Image captioning. Run each extracted image through Gemini or GPT-4-Vision to produce a caption, and embed the caption alongside the text. ~$0.002 per image.
  • Native multimodal embedding. Use Voyage AI Multimodal or Cohere Embed v4 (multimodal) to embed images as vectors directly. Best for technical/visual content.
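Whichever option you pick, the first step is finding the image references in the exported Markdown. The sketch below assumes the export uses standard Markdown image syntax; the function names are hypothetical. It either lists the references (to route them to captioning or multimodal embedding) or strips them down to their alt text (the "skip images" option).

```python
import re

# Standard Markdown image syntax: ![alt text](path "optional title")
IMAGE_REF = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def extract_image_refs(markdown: str):
    """Return (alt_text, path) for every image reference in the text."""
    return [(m.group(1), m.group(2)) for m in IMAGE_REF.finditer(markdown)]


def strip_image_refs(markdown: str) -> str:
    """Replace each image reference with its alt text, or drop it if empty."""
    return IMAGE_REF.sub(lambda m: m.group(1), markdown)
```

Keeping the alt text when stripping preserves whatever description the course author already wrote, at no extra cost.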

Quiz extraction

Every authoring tool stores quiz questions slightly differently. SCORM Converter's parsers extract them into a structured ExtractedQuiz schema (question, options[], correctAnswer). For RAG, these are gold: they let your assistant verify employee understanding without exposing answers prematurely.
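One way to use such records, sketched under assumptions: the dataclass below mirrors the three fields named above, but the real output shape may differ, and quiz_to_chunk and grade are hypothetical helpers. The idea is to embed only the question and options, and keep the correct answer in metadata that is never sent to the model as excerpt text.

```python
from dataclasses import dataclass


@dataclass
class ExtractedQuiz:
    """Assumed shape, based on the fields named above."""
    question: str
    options: list
    correctAnswer: str


def quiz_to_chunk(quiz: ExtractedQuiz, course: str) -> dict:
    """Embed the question and options; keep the answer key out of the text."""
    lettered = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(quiz.options)]
    text = quiz.question + "\n" + "\n".join(lettered)
    return {
        "text": text,
        "metadata": {"course": course, "type": "quiz",
                     "correct_answer": quiz.correctAnswer},
    }


def grade(chunk: dict, reply: str) -> bool:
    """Compare a learner's reply with the stored answer, ignoring case."""
    expected = chunk["metadata"]["correct_answer"].strip().lower()
    return reply.strip().lower() == expected
```

Grading in application code rather than in the prompt means the model never needs to see which option is correct.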

Multi-language corpora

~18% of enterprise SCORM is non-English. text-embedding-3-large and embed-v4.0 are both multilingual and produce comparable retrieval quality across English, Italian, Spanish, French, German, Portuguese, Russian, and Chinese. You can mix languages in the same vector store.

Production deployment

A small enterprise (500 employees, 50 courses) can deploy a full SCORM-RAG assistant in a week with these components:

  • Vector store: pgvector on Supabase or Postgres ($25/mo)
  • Embeddings: OpenAI text-embedding-3-large (~$5 one-time, ~$0.50/mo retrieval)
  • LLM: Claude Haiku or GPT-4o-mini (~$0.50/query × 1000 queries/month = $500/mo)
  • SCORM conversion: ~$0.10/course × 50 courses = $5 one-time
  • Slack bot or web UI: open-source kits available

Total: ~$525/mo running cost for an assistant that answers internal policy and product questions using your real training content with citations.
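The arithmetic behind that total is easy to rerun with your own numbers. The sketch below simply encodes the article's estimates as default arguments (they are not measured prices), which gives about $525.50 per month recurring and $10 one-time. The function names are hypothetical.

```python
def monthly_cost(vector_store=25.00, embedding_retrieval=0.50,
                 cost_per_query=0.50, queries_per_month=1000):
    """Recurring cost in USD per month, using the estimates listed above."""
    return vector_store + embedding_retrieval + cost_per_query * queries_per_month


def one_time_cost(initial_embedding=5.00, cost_per_course=0.10, courses=50):
    """Setup cost in USD: initial embedding plus SCORM conversion."""
    return initial_embedding + cost_per_course * courses


if __name__ == "__main__":
    print(f"Monthly:  ${monthly_cost():.2f}")
    print(f"One-time: ${one_time_cost():.2f}")
```

The LLM line dominates the total, so the per-query figure is the one worth measuring against your actual prompt sizes and model pricing before budgeting.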

Pitfalls to avoid

  • Don't embed the entire course as one chunk. Retrieval quality crashes. Split per slide / per heading.
  • Don't skip the metadata. Without {course, module, slide} metadata, citations are useless and your users won't trust the assistant.
  • Don't use SCORM courses with remote-launcher iframes. About 12% of SCORM packages are just iframe loaders pointing to a vendor cloud (Coassemble, Genially, MathAlea, Elucidat). There's nothing to extract; the content lives on the vendor's server.
  • Don't skip the quality gate. A SCORM converter that silently returns "loading..." placeholders for unsupported tools will poison your vector store. Use one with explicit silent-failure detection.
  • Don't re-embed monthly. Embeddings are stable; only re-embed when you change content or upgrade the embedding model.
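If your converter does not flag failed extractions itself, a small gate in front of the embedding step can catch the obvious cases. The sketch below is illustrative only: the placeholder patterns, the 50-word threshold, and the iframe check are hypothetical heuristics to tune against your own corpus, not a description of any tool's detection logic.

```python
import re

# Hypothetical markers of a failed extraction; extend for your own corpus.
PLACEHOLDER_PATTERNS = [
    re.compile(r"^\s*loading(\.{3}|…)?\s*$", re.IGNORECASE),
    re.compile(r"please enable javascript", re.IGNORECASE),
]


def quality_issues(markdown: str, min_words: int = 50):
    """Return a list of reasons to reject this extraction (empty if it passes)."""
    issues = []
    # Body lines are non-empty lines that are not Markdown headings.
    body_lines = [ln for ln in markdown.splitlines()
                  if ln.strip() and not ln.lstrip().startswith("#")]
    if any(p.search(ln) for ln in body_lines for p in PLACEHOLDER_PATTERNS):
        issues.append("placeholder text found")
    word_count = sum(len(ln.split()) for ln in body_lines)
    if word_count < min_words:
        issues.append(f"only {word_count} body words")
    if "<iframe" in markdown.lower():
        issues.append("iframe launcher markup found")
    return issues
```

Running this before step 2 and skipping any course with a non-empty result keeps placeholder text and empty launcher shells out of the vector store.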
