Determine LMS Content Data Model Design #1

@ormsbee

This is a discovery ticket that would end in multiple ADRs in this repo. The goal is to determine the basic data structures and relationships needed to store content data for the LMS with the following goals:

  1. Handle all existing course use cases.
  2. Handle v2 content libraries, and library resources shared across multiple courses, with different policies/overrides.
  3. Handle potentially non-XBlock content.
  4. Allow for fast publishing.
  5. Allow for atomic publishing.
  6. Allow for easy export.
  7. Allow third party applications to build their own advanced data structures on top of it.
  8. Allow for efficient querying.
  9. Minimize the space wasted on nearly identical versions created by minor edits.

There will likely be some prototyping involved in this, as well as a lot of discussion.

These ADRs would include:

  • Determine high level approach to splitting reusable content and learning context specific policy.
  • Determine high level approach to versioning and incremental publishing.
  • Determine leaf XBlock-level modeling.
  • Determine unit-level modeling.
  • Determine sequence-level modeling.

Open edX's ModuleStore has explicit versioning capabilities (though the LMS rarely uses them). Blockstore has versioning deeply baked into the design, but we've generally avoided encoding multi-version support into the post-publish data stores we build during course publish (e.g. CourseOverview, Block Transformers/Course Blocks API, etc.).

I think we've come to a point where we really do need to cross that divide and start introducing content versioning concepts to the LMS more generally. Some motivating reasons:

  1. It's difficult to preview changes before publishing if the LMS can only display the published version.
  2. The design of content libraries assumes that multiple versions are available simultaneously.
  3. Building all these separate stores of published data takes time, and parts may fail. When this happens, our course is in an inconsistent half-published state. For example, the rendered course content may be updated but the course outline generation may have failed so that nobody can see the new content. Ideally, we'd want to give these systems time to build up the data that they need, and then change everything atomically at once.
  4. Other systems like scoring and state storage would benefit from being able to store the actual content version that their data was created against.
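
To make reason 3 concrete, here is a minimal sketch of "let every system finish building, then switch everything at once". Nothing in it comes from existing code: `PublishGate`, `mark_ready`, `try_go_live`, and the app names are all hypothetical, and a real version would live in the database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set


@dataclass
class PublishGate:
    """Hypothetical tracker of which apps have built their data for a version."""

    required_apps: FrozenSet[str]
    ready: Dict[int, Set[str]] = field(default_factory=dict)
    live_version: Optional[int] = None

    def mark_ready(self, version: int, app: str) -> None:
        """Record that `app` has finished building its data for `version`."""
        if app not in self.required_apps:
            raise ValueError(f"unknown app: {app}")
        self.ready.setdefault(version, set()).add(app)

    def try_go_live(self, version: int) -> bool:
        """Move the live pointer only if every required app has reported in."""
        if self.ready.get(version, set()) >= self.required_apps:
            self.live_version = version
            return True
        return False


# Assumed app names, purely for illustration.
gate = PublishGate(required_apps=frozenset({"outline", "block_structure"}))
gate.mark_ready(2, "block_structure")
first_attempt = gate.try_go_live(2)   # outline has not reported yet
gate.mark_ready(2, "outline")
second_attempt = gate.try_go_live(2)  # every required app is ready
```

Until the last app reports in, readers keep seeing the previous live version, so a failed outline build delays the publish instead of leaving the course half-published.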

Storage Scaling Issues

SplitMongo ModuleStore wasted a lot of space with old version data, eventually leading us to create a separate cleanup script for it. There were a number of reasons why disk usage got so bad with this system:

  1. Structural data was stored inefficiently.
  2. New versions were being published for tiny changes.
  3. There was no cleanup.
  4. Historical course data had very little use.

Every time there was even a small change (e.g. the title of a Unit), we ended up writing a document with all settings-scoped data for the course. This happened all the time in Studio, so that for every one version that is of interest to us for viewing or preview purposes, there would be dozens of almost-identical intermediate versions. We ended up in a place where the majority of course storage was wasted in this way.

There are a few ways we can make this much better:

  • Isolate changes better
    Most content doesn't change very much from one version to the next, so we should break up the course into more granular pieces and track their changes individually.
  • Support simple cleanup policies
    There will be intermediate versions that can be almost immediately discarded (like the last state of the Studio draft). We should have obvious cleanup facilities for getting rid of those as soon as they are not needed.
  • Support a simplified data model for clients
    Thinking about versions is hard and unintuitive. We should have a set of primitives that help people build version awareness into their content without having to overly complicate their data models.
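
As a rough illustration of the first two points, the toy store below versions each piece of a course separately and has an explicit cleanup step. It is a sketch under my own assumptions, not a proposed design: `ComponentStore`, `save_snapshot`, and `cleanup` are hypothetical names, and keying data by a SHA-256 content hash is just one way to avoid rewriting unchanged pieces.

```python
import hashlib
import json


class ComponentStore:
    """Toy store: each component is stored once per distinct content."""

    def __init__(self):
        self.blobs = {}      # content hash -> serialized component data
        self.snapshots = {}  # snapshot id -> {component key: content hash}
        self._next_id = 1

    def save_snapshot(self, components):
        """Record a full course state, writing only data not already stored."""
        manifest = {}
        for key, data in components.items():
            payload = json.dumps(data, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            self.blobs.setdefault(digest, payload)
            manifest[key] = digest
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = manifest
        return snapshot_id

    def cleanup(self, keep):
        """Drop snapshots not in `keep`, then blobs that nothing references."""
        self.snapshots = {s: m for s, m in self.snapshots.items() if s in keep}
        referenced = {d for m in self.snapshots.values() for d in m.values()}
        self.blobs = {d: p for d, p in self.blobs.items() if d in referenced}


store = ComponentStore()
course = {"unit1": {"title": "Intro"}, "problem1": {"markdown": "2 + 2 = ?"}}
v1 = store.save_snapshot(course)
course["unit1"] = {"title": "Introduction"}  # a tiny edit to one Unit
v2 = store.save_snapshot(course)
blobs_before_cleanup = len(store.blobs)      # the problem is stored only once
store.cleanup(keep={v2})                     # discard the intermediate version
blobs_after_cleanup = len(store.blobs)
```

Renaming the Unit adds one small record instead of a new copy of every setting in the course, and discarding the intermediate snapshot frees only the data that no kept snapshot still points to.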

Modeling Versioning in a way that scales (i.e. in both features and users)

The following is a disorganized set of thoughts:

Entities (the names are bad, I'm just trying to get down the ideas)

  • LearningObject
  • LearningObjectVersion (with sub-types made with joined tables that represent things like Units, Blocks).
  • Bundles (?) of LearningObjectVersions. I wanted to resist adding a separate layer here, but I realized that without this, we'd have to echo out all the (LearningObjectVersion/LearningContextVersion) entries with each new version, even if the libraries that a course is using aren't changing at all, which would be really bad for encouraging library use. I really don't like the reuse of a Blockstore term and concept though.
  • LearningContextVersion can contain multiple Bundles (e.g. a Course consists of versioned bundles for its "own" resources, as well as the Bundles of various other things like Libraries).
  • LearningContextBranch enforces that at any given point, there is one live version per branch (e.g. "draft", "live")
  • Some sort of registry where any app that has data related to a version can record the status of that data (i.e. is it ready?). Open question: how do we handle the case where a process dies?
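
To make the relationships between these entities easier to discuss, here they are as plain Python dataclasses. This is only a sketch: the real thing would be relational tables, the field names are guesses, `BundleVersion` is my own name for a versioned Bundle, and `point_to` is a hypothetical method.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class LearningObjectVersion:
    learning_object_id: str  # stable identity of the LearningObject
    version_num: int


@dataclass(frozen=True)
class BundleVersion:
    """An immutable set of LearningObjectVersions that are published together."""
    bundle_id: str
    version_num: int
    object_versions: FrozenSet[LearningObjectVersion]


@dataclass(frozen=True)
class LearningContextVersion:
    """A course version only references bundle versions; it copies nothing."""
    context_id: str
    version_num: int
    bundle_versions: Tuple[BundleVersion, ...]


@dataclass
class LearningContextBranch:
    context_id: str
    name: str                     # e.g. "draft" or "live"
    head: LearningContextVersion  # the single current version of this branch

    def point_to(self, version: LearningContextVersion) -> None:
        """Switch the branch to another version of the same learning context."""
        if version.context_id != self.context_id:
            raise ValueError("version belongs to a different learning context")
        self.head = version


# A course made of its own bundle plus one library bundle (made-up ids).
lib_v1 = BundleVersion("lib:physics", 1, frozenset({LearningObjectVersion("problem-a", 1)}))
own_v1 = BundleVersion("course:demo", 1, frozenset({LearningObjectVersion("unit-1", 1)}))
own_v2 = BundleVersion("course:demo", 2, frozenset({LearningObjectVersion("unit-1", 2)}))

course_v1 = LearningContextVersion("course:demo", 1, (own_v1, lib_v1))
course_v2 = LearningContextVersion("course:demo", 2, (own_v2, lib_v1))

live = LearningContextBranch("course:demo", "live", head=course_v1)
live.point_to(course_v2)  # publishing is a single pointer change
```

Note that `course_v2` reuses `lib_v1` as-is: editing the course's own Unit creates a new course bundle version, but nothing about the library has to be written again.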

I'm fiddling with some of these ideas in the learning_publishing app's models.py file at the moment.

Concern: How is it different from Blockstore? A: It's going to be a much more relational data model that you hang other relational models (e.g. XBlock content, scheduling information) off of. Also, it's going to have zero intelligence about cycle detection or dependencies. It's also going to have explicit measures for cleanup of unused versions, which is not a thing in Blockstore.

Status: Done
