Skip to content

Add HuggingFaceInferenceOpDesc with dispatcher + per-task codegen architecture (text-generation) #5277

@PG1204

Description

@PG1204

Feature Summary

The HuggingFace inference operator (#5041) needs to cover ~20 HF pipeline tasks (text-generation, image-classification, ASR, text-to-image, …). To land it cleanly and let the per-task work proceed in parallel, the operator is introduced via a dispatcher + per-task codegen architecture: a thin HuggingFaceInferenceOpDesc selects a TaskCodegen based on the configured task, and the selected codegen contributes the per-task Python payload + parse snippets. Shared infrastructure (provider fallback, HTTP loop, response-parsing framework) lives in PythonCodegenBase.

This issue covers shipping the dispatcher pattern + the first task family (text-generation) end-to-end. Subsequent child issues add the image, audio / media-generation, and QA / ranking task families by introducing new *Codegen objects and registering them in the dispatcher map. The architecture lets each task-family PR stay focused: a new task family means one new file plus one entry in the dispatcher map — no surgery on the shared infrastructure or other codegens.

Concretely, landing this would enable:

  • A working HuggingFace operator on the workspace for text-generation tasks against HF Hub and any OpenAI-compatible third-party provider (Cerebras, Groq, Sambanova, Together, …).
  • A clean extension point for the image / audio / QA task families to plug into via subsequent PRs without modifying the operator class or the shared Python infrastructure.

Proposed Solution or Design

  1. New files under common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/:
    • HuggingFaceInferenceOpDesc.scala — thin (~180-line) dispatcher holding the @JsonProperty fields and the registeredCodegens map.
    • codegen/TaskCodegen.scala — trait + CodegenContext case class; default tasks: Set[String] = Set(task) for single-task codegens, overridable by multi-task codegens.
    • codegen/PythonCodegenBase.scala — shared provider-fallback (HF router + OpenAI-compatible third-party providers), process_table loop, _parse_response framework, with two holes for the per-task payload + parse snippets.
    • codegen/TextGenCodegen.scala — text-generation's chat-completions payload and body["choices"][0]["message"]["content"] parse.
  2. Register HuggingFaceInferenceOpDesc in LogicalOp.scala's @JsonSubTypes.
  3. Design constraints baked into the codegen:
    • Safe codegen via EncodableString + pyb"...": user-input string fields are typed as EncodableString (String @EncodableStringAnnotation); the pyb macro emits them as self.decode_python_template('<base64>') runtime expressions instead of raw Python literals, so they never appear in the generated source as-is. This is what satisfies PythonCodeRawInvalidTextSpec's leakage check.
    • Constants in open(self): per-instance attributes (self.MODEL_ID, self.PROMPT_COLUMN, …) are assigned in the lifecycle method so self is in scope for the decode call.
    • Codegen totality: generatePythonCode never throws on arbitrary @JsonProperty values — unknown task strings fall back to TextGenCodegen, and the generated Python's else branch produces a generic {"inputs": prompt_value} payload, matching the original monolithic operator's behavior. Required by the regression test contract.
    • Defensive MODEL_ID validation at runtime: generated Python rejects malformed model IDs (path-traversal segments, query strings, fragments, control characters) with a clear ValueError before any HF URL is composed.

References:

Impact / Priority

(P2) Medium — required for the HuggingFace inference operator (#5041) to function. Does not affect existing functionality.

Affected Area

Workflow Engine (Amber) — operator descriptor + Python codegen.

Task Type

  • Refactor / Cleanup
  • DevOps / Deployment / CI
  • Testing / QA
  • Documentation
  • Performance
  • Other

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No fields configured for Task.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions