Description
Archive-based content types (H5P, HTML5 zip, IMSCP) may contain references to external URLs — images, videos, fonts, stylesheets, scripts — that won't be available in offline Kolibri deployments. The pipeline conversion handlers need to scan archive contents for these references, download the resources, bundle them into the archive, and rewrite the references to point to the local copies.
Context
Existing logic in downloader.py
We already have most of this URL extraction and rewriting logic in downloader.py. download_static_assets() and its inner functions extract URLs from:
- HTML attributes: img[src], link[href], script[src], source[src], img[srcset], [style*="background-image"]
- CSS: url() references for fonts, images, etc.
- Recursive resource following (CSS that references fonts, etc.)
And ArchiveDownloader downloads pages with all their resources, rewrites paths, and creates ZIP archives.
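However, this logic is not reusable by the pipeline as written: it lives in inner functions of download_static_assets() and assumes a live HTTP session, whereas the pipeline needs functions that operate on the contents of files already inside an archive.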
kolibri-zip as reference spec
Kolibri's kolibri-zip package handles the runtime side of this: when rendering ZIP-based content, it extracts files, resolves internal path references, and rewrites them to blob URLs. Its fileUtils.js provides a comprehensive spec for which reference types need handling:
HTML/XML files (src, href, srcset, inline style, <style> blocks):
<img src="images/photo.jpg">
<link href="styles/main.css">
<script src="https://cdn.example.com/lib.js"></script>
<img srcset="img-300.jpg 300w, img-600.jpg 600w">
<div style="background: url('bg.png')">
CSS files (url(), @import — both url() and bare string forms):
@import 'fonts/custom.css';
background-image: url('../images/bg.png');
@font-face { src: url('https://fonts.example.com/font.woff2'); }
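The CSS forms above can be matched with two small regular expressions. This is a minimal sketch: the pattern names and `extract_css_urls` are hypothetical, and they are not the actual `_CSS_URL_RE` from downloader.py or `_CSS_IMPORT_RE` from PR #639.

```python
import re

# Hypothetical patterns for illustration only.
# url(...) with optional matching quotes; this also covers "@import url(...)".
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# @import followed directly by a quoted string (the "bare string" form).
CSS_BARE_IMPORT_RE = re.compile(r"""@import\s+(['"])(.*?)\1""", re.IGNORECASE)


def extract_css_urls(css_text):
    """Return url() references first, then bare-string @import targets."""
    urls = [m.group(2) for m in CSS_URL_RE.finditer(css_text)]
    urls += [m.group(2) for m in CSS_BARE_IMPORT_RE.finditer(css_text)]
    # data: URIs embed their payload, so there is nothing to download.
    return [u for u in urls if u and not u.startswith("data:")]


css = """
@import 'fonts/custom.css';
body { background-image: url('../images/bg.png'); }
@font-face { src: url('https://fonts.example.com/font.woff2'); }
"""
print(extract_css_urls(css))
# ['../images/bg.png', 'https://fonts.example.com/font.woff2', 'fonts/custom.css']
```

A regex pass like this does not understand CSS comments or escaped characters, so a real implementation may need a stricter tokenizer for edge cases.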
H5P JSON (path attributes in content/content.json):
{
  "video": {
    "files": [
      { "path": "https://h5p.org/sites/default/files/h5p/iv.mp4", "mime": "video/mp4" }
    ]
  }
}
kolibri-zip handles internal references at runtime, but cannot fetch external URLs (especially offline). That's ricecooker's job at import time.
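For the H5P case, one way to find external references is to walk the parsed JSON and collect every `path` value that starts with http or https. The sketch below assumes that only string values stored under a `path` key matter; `find_external_paths` is a hypothetical name, not an existing ricecooker function.

```python
import json


def find_external_paths(node, found=None):
    """Recursively collect http(s) URLs stored under a "path" key."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str):
                if value.startswith(("http://", "https://")):
                    found.append(value)
            else:
                find_external_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            find_external_paths(item, found)
    return found


content = json.loads("""
{"video": {"files": [
  {"path": "https://h5p.org/sites/default/files/h5p/iv.mp4", "mime": "video/mp4"},
  {"path": "videos/local.mp4", "mime": "video/mp4"}
]}}
""")
print(find_external_paths(content))
# ['https://h5p.org/sites/default/files/h5p/iv.mp4']
```

The relative path `videos/local.mp4` is left alone, since it already points inside the archive.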
Key architectural constraint
External resource downloading and reference rewriting must happen before create_predictable_zip is called in ArchiveProcessingBaseHandler.handle_file() (convert.py). Once create_predictable_zip runs, the archive is sealed — it iterates existing files and can only transform them via file_converter (currently used for media compression), not add new ones.
The flow in handle_file() would become:
validate_archive(path)
path = download_and_rewrite_external_refs(path) # <-- new step
create_predictable_zip(path, file_converter=...) # existing step
Where download_and_rewrite_external_refs would:
- Extract the archive to a temp directory
- Scan text-based files for external URL references
- Download those resources into the temp directory
- Rewrite references in the source files to point to local copies
- Return the temp directory path (which create_predictable_zip already accepts, since it handles both directories and zip files)
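The steps above can be sketched as a single function. This is an illustration under stated assumptions, not the proposed implementation: `find_urls` (text to external URLs) and `fetch` (URL to bytes) are hypothetical callables standing in for the real scanning and downloading logic, and the `_external` directory name is arbitrary.

```python
import os
import posixpath
import tempfile
import zipfile
from urllib.parse import urlparse

TEXT_EXTENSIONS = (".html", ".htm", ".css", ".json", ".xml")


def download_and_rewrite_external_refs(path, find_urls, fetch):
    """Extract the zip at `path`, localize external references, return the temp dir."""
    tempdir = tempfile.mkdtemp()
    with zipfile.ZipFile(path) as zf:
        zf.extractall(tempdir)

    # Snapshot the text files up front so newly downloaded files are not rescanned.
    text_files = [
        os.path.join(root, name)
        for root, _dirs, names in os.walk(tempdir)
        for name in names
        if name.lower().endswith(TEXT_EXTENSIONS)
    ]

    targets = {}  # external URL -> absolute path of the local copy
    for file_path in text_files:
        with open(file_path, encoding="utf-8") as f:
            text = f.read()
        new_text = text
        for url in find_urls(text):
            if url not in targets:
                basename = posixpath.basename(urlparse(url).path) or "resource"
                # "_external" is an arbitrary directory name chosen for this sketch.
                target = os.path.join(
                    tempdir, "_external", "%d_%s" % (len(targets), basename)
                )
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with open(target, "wb") as out:
                    out.write(fetch(url))
                targets[url] = target
            # References must be relative to the file that contains them.
            rel = os.path.relpath(targets[url], os.path.dirname(file_path))
            new_text = new_text.replace(url, rel.replace(os.sep, "/"))
        if new_text != text:
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(new_text)
    return tempdir
```

Injecting `find_urls` and `fetch` keeps the archive walk testable without network access. The sketch assumes UTF-8 text files and does not follow references inside downloaded resources.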
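Related issues and PRs
- #303 (Refactor downloader.py to make more functions unit testable): refactor downloader.py to make URL detection and rewriting unit-testable. This issue supersedes #303: the extraction of URL logic into shared utilities serves both the testability goal and the pipeline integration goal.
- #636: fixes link tag filtering ("rel" in node vs "rel" in node.attrs) and extensionless URL path placement. By @jaltekruse.
- #639: handles @import with bare strings (not wrapped in url()). By @jaltekruse.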
Approach
Phase 1: Land bug fixes from #636 and #639
Merge the open PRs first to preserve @jaltekruse's contributor attribution in git history before the refactor changes the code structure. These fix real bugs in the URL extraction logic that the shared utilities will need to carry forward.
Phase 2: Extract shared utilities from downloader.py (supersedes #303)
Extract the URL extraction and rewriting logic from download_static_assets() inner functions into standalone, testable utility functions that operate on file contents (strings/bytes) rather than requiring a live HTTP session:
- URL extraction: Given HTML/CSS/JSON content, return a list of referenced URLs
- URL rewriting: Given content and a URL mapping (old → new), return rewritten content
- External URL filtering: Distinguish external (http/https) from internal (relative paths already in archive) references
This directly addresses #303's concern about untestable inner functions. The extracted functions can be unit tested with plain strings — no HTTP server, no filesystem, no platform-specific path issues. This also resolves the Windows test failures in #636, since the core logic tests won't depend on filesystem paths.
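As a rough picture of what string-only utilities could look like, here is a sketch of the three pieces for HTML. It uses only the standard library. All names are hypothetical, and it covers src, link[href], and srcset but not inline style or <style> blocks.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _RefCollector(HTMLParser):
    """Collect src, link[href], and srcset values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "src" or (name == "href" and tag == "link"):
                self.urls.append(value.strip())
            elif name == "srcset":
                # Each candidate is "<url> <descriptor>"; splitting on commas
                # is a simplification that breaks on URLs containing commas.
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self.urls.append(parts[0])


def extract_html_urls(html):
    """URL extraction: return referenced URLs in document order."""
    collector = _RefCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def is_external(url):
    """External URL filtering: only explicit http/https schemes count."""
    return urlparse(url).scheme in ("http", "https")


def rewrite_urls(content, url_map):
    """URL rewriting: replace longer URLs first so prefixes do not clobber them."""
    for old in sorted(url_map, key=len, reverse=True):
        content = content.replace(old, url_map[old])
    return content


html = '<img src="images/photo.jpg"><script src="https://cdn.example.com/lib.js"></script>'
external = [u for u in extract_html_urls(html) if is_external(u)]
print(external)
# ['https://cdn.example.com/lib.js']
print(rewrite_urls(html, {external[0]: "_external/lib.js"}))
```

Because every function takes and returns plain strings, each can be tested without an HTTP server or filesystem, which is the property Phase 2 is after.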
Phase 3: Create archive processing utility
Build on the Phase 2 utilities to create an archive-level processor:
- Open an archive and iterate its text-based files (HTML, CSS, JSON, XML)
- Use Phase 2 extractors to find external URL references
- Download external resources into the extracted archive directory
- Use Phase 2 rewriters to update references to local paths
- Loop detection for recursive references (à la kolibri-zip's visitedPaths)
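Loop detection can be as simple as a visited set consulted before following each reference. The sketch below is in the spirit of that idea rather than a port of kolibri-zip's code: `collect_reachable` and `IMPORT_RE` are hypothetical, and `files` is an in-memory stand-in for the extracted archive.

```python
import posixpath
import re

# Hypothetical pattern: matches both @import 'x.css' and @import url('x.css').
IMPORT_RE = re.compile(r"""@import\s+(?:url\()?\s*['"]?([^'")\s;]+)""")


def collect_reachable(files, entry):
    """Return every file reachable from `entry` via @import, visiting each once.

    `files` maps archive-relative paths to CSS text.
    """
    visited = set()
    stack = [entry]
    while stack:
        path = stack.pop()
        if path in visited or path not in files:
            continue  # already processed, or not present in the archive
        visited.add(path)
        for ref in IMPORT_RE.findall(files[path]):
            # Resolve the reference relative to the importing file.
            base = posixpath.dirname(path)
            stack.append(posixpath.normpath(posixpath.join(base, ref)))
    return visited


files = {
    "css/a.css": "@import 'b.css';",
    "css/b.css": "@import url('../css/a.css');",  # points back at a.css
}
print(sorted(collect_reachable(files, "css/a.css")))
# ['css/a.css', 'css/b.css']
```

Normalizing each path before the membership check matters here: without it, `css/../css/a.css` and `css/a.css` would count as different files and the cycle would go undetected.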
Phase 4: Integrate into pipeline conversion handlers
Wire the archive processor into the existing handlers, running before create_predictable_zip:
- H5PConversionHandler: Scan content/content.json for external path values, plus HTML/CSS in content (highest priority, since videos and images are commonly external)
- HTML5ConversionHandler: Scan HTML/CSS files for external references
- IMSCPConversionHandler: Scan entry point HTML files and their CSS for external references
Reference types to handle
From kolibri-zip's fileUtils.js and existing downloader.py logic:
| File type | Reference patterns | Source |
|---|---|---|
| HTML/XML | src, href, srcset attributes; inline style; <style> blocks | kolibri-zip DOMMapper, downloader.py download_static_assets() |
| CSS | url(), @import (both url() and bare string forms) | kolibri-zip CSSMapper, downloader.py _CSS_URL_RE, PR #639 _CSS_IMPORT_RE |
| H5P JSON | path attributes in content/content.json | H5P-specific |
References
- ricecooker/utils/pipeline/convert.py
- ricecooker/utils/downloader.py