🌐 Course: 20465 – System Programming Lab | 👨🏫 Instructor: Danny Calfon | 📅 Submitted: 28.03.2022 | 📑 Spec: docs/spec.pdf
A complete assembler implemented in C for a custom assembly language targeting a fictional 20-bit machine.
Translates `.as` source files into encoded machine output through a staged pipeline: macro expansion → parsing → symbol resolution → memory-image construction → output generation.
- Macro Expansion: Replaces macro usages with their corresponding definitions
- Symbol Management: Resolves and defines labels across two passes
- Machine Code Generation: Converts assembly instructions into 20-bit binary words
- Auxiliary File Creation: Produces `.ob`, `.ent`, and `.ext` output files
- Robust Error Handling: Detects and logs all errors at every stage without premature termination
- What It Does
- Target Machine
- Pipeline Architecture
- Why Two Passes?
- Processing Stages
- Core Data Structures
- Error Handling & Diagnostics
- File Structure
- Design Decisions
- Subtle Challenges
- Possible Improvements
- Summary
- Usage
- License
This project is a complete assembler, not just a parser. For each source file, it:
- Expands macros into an intermediate `.am` file
- First pass: parses, validates, builds the symbol table, and tracks the instruction and data counters
- Address finalization: adjusts data symbol addresses once the instruction image size is known
- Second pass: resolves symbol references and encodes the final machine words
- Output generation: produces `.ob`, `.ent`, and `.ext` files if no fatal errors occurred
What makes it especially interesting is that it combines architecture, custom data structures, stateful multi-stage processing, machine-oriented encoding, and robust edge-case handling.
This project implements the assembler stage only. It does not implement linking or loading.
| Property | Value |
|---|---|
| Memory | 8192 words |
| Word size | 20 bits |
| Registers | 16 general-purpose |
| Instruction set | 16 operations |
| Addressing modes | 4 |
This project is not just about parsing text — it is a translator tied to a concrete machine representation. The internal design had to match instruction encoding rules, word structure, register model, addressing-method rules, and base/offset address representation.
Source file (.as)
│
▼
┌─────────────────────┐
│ Macro Preprocessing │ → Expands macros, generates .am file
└─────────────────────┘
│
▼
┌─────────────────────┐
│ First Pass │ → Parses, validates, builds symbol table
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Address Finalization │ → Adjusts data symbol addresses
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Second Pass │ → Encodes machine words, resolves symbols
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Output Generation │ → .ob / .ent / .ext
└─────────────────────┘
Each stage has a single focused responsibility and prepares the data needed by the next one.
Assembly code may reference a label before the line that defines it, so the address of an operand can still be unknown when its instruction is first read. Final encoding therefore cannot be completed in a single read-through.
First pass — discovery phase:
- Parse and classify every line
- Validate syntax and structure
- Detect labels and directives
- Build the symbol table
- Update instruction counter (IC) and data counter (DC)
- Collect metadata needed for final encoding
Address finalization — once the instruction image size is known, data symbol addresses are adjusted to their correct final values.
Second pass — encoding phase:
- Resolve all symbol references
- Validate symbol-dependent semantics
- Encode final machine words
- Write binary words into the memory image
- Track external symbol usage locations
- Prepare output artifacts
This separation keeps forward references manageable and cleanly divides discovery from final encoding.
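The forward-reference problem can be illustrated with a minimal sketch. It assumes a heavily simplified line model (one word per line, at most one label operand); `SketchLine`, `SketchLabel`, and `two_pass_resolve` are hypothetical names for illustration and are not the project's actual types or functions.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LABELS 16

/* Hypothetical simplified line: an optional label definition and an
   optional label operand. Not the project's parsed-line type. */
typedef struct {
    const char *label;   /* label defined on this line, or NULL */
    const char *operand; /* label referenced by this line, or NULL */
} SketchLine;

typedef struct {
    const char *name;
    int address;
} SketchLabel;

/* Resolves every operand to an address. resolved[i] is -1 when line i
   has no operand or its label is undefined. Returns the number of
   undefined references. */
int two_pass_resolve(const SketchLine *lines, int n, int start, int *resolved)
{
    SketchLabel labels[MAX_LABELS];
    int label_count = 0, undefined = 0, i, j;

    /* First pass: discover label definitions (one word per line here). */
    for (i = 0; i < n; i++) {
        if (lines[i].label != NULL && label_count < MAX_LABELS) {
            labels[label_count].name = lines[i].label;
            labels[label_count].address = start + i;
            label_count++;
        }
    }

    /* Second pass: the table is complete, so forward references resolve. */
    for (i = 0; i < n; i++) {
        resolved[i] = -1;
        if (lines[i].operand == NULL)
            continue;
        for (j = 0; j < label_count; j++) {
            if (strcmp(labels[j].name, lines[i].operand) == 0) {
                resolved[i] = labels[j].address;
                break;
            }
        }
        if (resolved[i] < 0)
            undefined++;
    }
    return undefined;
}
```

In this sketch, a line that uses `END` before `END` is defined still resolves correctly, because no lookup happens until every definition has been recorded.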
Before assembly begins, the source goes through a macro-expansion stage. It detects macro definitions, stores their metadata, expands all usages, and generates an intermediate macro-free .am file. Separating macro handling from the main assembler keeps the rest of the pipeline simpler and more predictable.
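A rough sketch of range-based macro expansion follows. It assumes, purely for illustration, that a definition opens with a line `macro NAME` and closes with a line `endm`, that lines are already trimmed, and that a usage is a line consisting only of the macro name. The keywords, `MacroRange`, and `expand_macros` are assumptions and not taken from the project or its spec.

```c
#include <stdio.h>
#include <string.h>

#define MAX_MACROS 8
#define MAX_NAME 32

/* Hypothetical macro metadata: the body is the line range [first, last). */
typedef struct {
    char name[MAX_NAME];
    int first;
    int last;
} MacroRange;

/* Copies `in` to `out` with definitions removed and usages replaced by
   the recorded body lines. Returns the number of output lines. */
int expand_macros(const char **in, int n, const char **out, int max_out)
{
    MacroRange macros[MAX_MACROS];
    int macro_count = 0, out_count = 0, open = -1, i, j, k;

    for (i = 0; i < n; i++) {
        if (open >= 0) {
            /* Inside a definition: skip body lines until the closer. */
            if (strcmp(in[i], "endm") == 0) {
                macros[open].last = i;
                open = -1;
            }
            continue;
        }
        if (strncmp(in[i], "macro ", 6) == 0 && macro_count < MAX_MACROS
            && sscanf(in[i] + 6, "%31s", macros[macro_count].name) == 1) {
            macros[macro_count].first = i + 1;
            macros[macro_count].last = i + 1;
            open = macro_count++;
            continue;
        }
        for (j = 0; j < macro_count; j++)
            if (strcmp(in[i], macros[j].name) == 0)
                break;
        if (j < macro_count) {
            /* Usage: emit the stored body range instead of the name. */
            for (k = macros[j].first; k < macros[j].last && out_count < max_out; k++)
                out[out_count++] = in[k];
        } else if (out_count < max_out) {
            out[out_count++] = in[i];
        }
    }
    return out_count;
}
```

Storing a range rather than a copied body mirrors the idea of keeping macro metadata about the source instead of duplicating text; the real preprocessor also validates names and nesting, which this sketch omits.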
The first pass builds the semantic foundation of the program: it parses each line, classifies statements, validates syntax, detects labels, inserts symbols into the symbol table, and updates IC/DC. By the end, the assembler knows the structure of the program and has enough information to finalize layout.
After the first pass, symbol values are adjusted based on final instruction-image size. This is especially important for data symbols, whose final positions depend on where the code image ends.
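As a sketch of this adjustment, assume that data symbols are recorded during the first pass as offsets into the data image, and that the data image is placed directly after the code image. `SketchSymbol`, `finalize_data_addresses`, and the `is_data` field are hypothetical names for illustration only.

```c
/* Hypothetical symbol record for this sketch only. */
typedef struct {
    const char *name;
    int address;  /* code: final address; data: offset in the data image */
    int is_data;
} SketchSymbol;

/* final_ic is the first address after the code image. Data symbols are
   shifted so they land after the code; code symbols are left untouched. */
void finalize_data_addresses(SketchSymbol *symbols, int count, int final_ic)
{
    int i;
    for (i = 0; i < count; i++) {
        if (symbols[i].is_data)
            symbols[i].address += final_ic;
    }
}
```

With a code image that ends at address 120, a data symbol recorded at data offset 4 would end up at address 124 under these assumptions.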
The second pass completes the translation: it resolves label references, encodes instructions and operands, writes final machine words into the memory image, tracks external symbol positions, and validates symbol-dependent cases. This is where parsed source data becomes final encoded output.
If no fatal errors are found, output generation produces the object file (`.ob`), entry file (`.ent`), and external file (`.ext`), alongside the expanded intermediate `.am` file written by the macro stage. This makes the assembler a full translation stage rather than only an in-memory analyzer.
A major part of the project design was choosing internal representations that match both the machine model and the translation workflow. These were architectural choices, not just implementation details.
Dedicated structs model a single bit, a 20-bit binary word, and a printable 5-part hex-style word. Rather than treating encoded output as plain integers, the memory representation reflects the target machine directly. This makes memory-image construction clearer, makes encoded output easier to reason about, and keeps the code aligned with the 20-bit word model.
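To make the relationship between a 20-bit word and its five printable parts concrete, here is a small sketch that masks a value to 20 bits and splits it into five 4-bit groups. It uses a plain integer for brevity, unlike the project's `BinaryWord` and `HexWord` structs, and it does not reproduce the exact print format, which is defined by the spec. `encode_signed` and `split_word` are hypothetical helper names.

```c
#define WORD_MASK 0xFFFFFUL /* low 20 bits */

/* Two's-complement encoding of a signed value into a 20-bit word.
   Conversion to unsigned long is modular, so negatives wrap as expected. */
unsigned long encode_signed(long value)
{
    return (unsigned long)value & WORD_MASK;
}

/* Splits a 20-bit word into five 4-bit groups, most significant first. */
void split_word(unsigned long word, unsigned groups[5])
{
    int i;
    word &= WORD_MASK;
    for (i = 0; i < 5; i++)
        groups[i] = (unsigned)((word >> (16 - 4 * i)) & 0xFUL);
}
```

Each group maps to one hex digit, which is why a 20-bit word prints naturally as five parts.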
Each operation stores its opcode, funct, keyword, and the legal addressing modes for both source and destination operands. This makes instruction validation data-driven — instead of scattering operation rules across long conditional chains, the legal structure of the instruction set lives in the operation model itself. Benefits: clearer validation logic, easier operand compatibility checks, better separation between language definition and parser logic.
Allowed addressing methods are represented explicitly per operation and per operand position. The assembler must validate not only whether an operand is recognizable, but also whether its addressing mode is legal for a specific operation and position (source vs. destination). This produces stronger semantic validation with less duplicated logic.
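A data-driven legality check of this kind can be sketched with one bitmask per operand position. The mode names, the two table entries, and their opcode/funct values below are illustrative assumptions, not a copy of the project's `Operation` struct or of the spec's instruction table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for the four addressing modes, one bit each. */
enum {
    MODE_IMMEDIATE = 1 << 0,
    MODE_DIRECT    = 1 << 1,
    MODE_INDEX     = 1 << 2,
    MODE_REGISTER  = 1 << 3
};

typedef struct {
    const char *keyword;
    int opcode;
    int funct;
    unsigned src_modes; /* legal source modes; 0 means "no source operand" */
    unsigned dst_modes; /* legal destination modes; 0 means "no destination" */
} OperationSketch;

/* Illustrative entries only. */
static const OperationSketch OPS[] = {
    { "mov", 0, 0,
      MODE_IMMEDIATE | MODE_DIRECT | MODE_INDEX | MODE_REGISTER,
      MODE_DIRECT | MODE_INDEX | MODE_REGISTER },
    { "jmp", 9, 10, 0, MODE_DIRECT | MODE_INDEX }
};

const OperationSketch *find_operation(const char *keyword)
{
    size_t i;
    for (i = 0; i < sizeof OPS / sizeof OPS[0]; i++)
        if (strcmp(OPS[i].keyword, keyword) == 0)
            return &OPS[i];
    return NULL;
}

/* actual == 0 means the operand is absent in the source line. */
static int mode_ok(unsigned allowed, unsigned actual)
{
    if (actual == 0)
        return allowed == 0;
    return (allowed & actual) != 0;
}

int operands_are_legal(const OperationSketch *op, unsigned src, unsigned dst)
{
    return mode_ok(op->src_modes, src) && mode_ok(op->dst_modes, dst);
}
```

Because the rules live in the table, adding an operation means adding a row rather than another branch in the parser.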
The symbol table and macro table share the same underlying hash-table implementation with chained collision handling. Each item stores a name, a type-specific payload via a union, and a pointer for chaining. Both symbols and macros are primarily accessed by name, so hash-based lookup is a natural fit: it gives fast lookup, keeps the core logic reusable, and avoids repeating the same table code for each kind of item.
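The shape of such a shared table can be sketched as follows. The field names, the payload contents, the table size, and the hash function are assumptions chosen for illustration; the project's actual `Item` struct may differ.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64

typedef enum { ITEM_SYMBOL, ITEM_MACRO } ItemKind;

/* Hypothetical item: one node type serves both tables via a union. */
typedef struct ItemSketch {
    char *name;
    ItemKind kind;
    union {
        struct { int address; unsigned flags; } symbol;
        struct { long start; long end; } macro;
    } payload;
    struct ItemSketch *next; /* chain for colliding names */
} ItemSketch;

typedef struct {
    ItemSketch *buckets[TABLE_SIZE];
} TableSketch;

unsigned hash_name(const char *s)
{
    unsigned h = 5381;
    while (*s != '\0')
        h = h * 33u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

ItemSketch *table_find(const TableSketch *t, const char *name)
{
    ItemSketch *it = t->buckets[hash_name(name)];
    while (it != NULL && strcmp(it->name, name) != 0)
        it = it->next;
    return it;
}

/* Returns the new item, or NULL on duplicate name or allocation failure. */
ItemSketch *table_insert(TableSketch *t, const char *name, ItemKind kind)
{
    unsigned h = hash_name(name);
    ItemSketch *it;

    if (table_find(t, name) != NULL)
        return NULL;
    it = malloc(sizeof *it);
    if (it == NULL)
        return NULL;
    it->name = malloc(strlen(name) + 1);
    if (it->name == NULL) {
        free(it);
        return NULL;
    }
    strcpy(it->name, name);
    it->kind = kind;
    memset(&it->payload, 0, sizeof it->payload);
    it->next = t->buckets[h]; /* push onto the front of the chain */
    t->buckets[h] = it;
    return it;
}
```

Rejecting duplicates inside `table_insert` gives both the symbol table and the macro table the same duplicate-definition check for free.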
Each symbol stores more than just a name and address. It includes:
- Full address value
- Base address and offset (for the target machine's base+offset encoding)
- Attribute flags: `code`, `data`, `entry`, `external`
Storing base/offset directly in the symbol model avoids repeated decomposition in later stages and simplifies both encoding and entry/external export logic.
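One way to picture the decomposition is the sketch below, which assumes that the base is the address rounded down to a multiple of some block size and the offset is the remainder. The block size is passed in as a parameter because the actual rule comes from the spec and is not restated here; `split_address` is a hypothetical helper.

```c
/* Splits an address into base + offset for a given block size.
   Assumes address >= 0 and block > 0. base + offset == address always. */
void split_address(int address, int block, int *base, int *offset)
{
    *offset = address % block;
    *base = address - *offset;
}
```

Computing the pair once and storing it in the symbol means later stages read two fields instead of repeating the arithmetic.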
Instead of assigning each symbol one rigid category, properties are stored as flags. A symbol may accumulate meaning across stages — it may be code or data, later marked as entry, or conflict with external/local rules. Some combinations are legal; some are contradictions. Flags make this flexible to represent and straightforward to validate.
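Flag accumulation with validation can be sketched as below. The specific contradiction rules (an external symbol cannot also be locally defined or exported as an entry, and a symbol cannot be both code and data) are assumptions made for illustration; the spec defines the real rules. The `SYM_*` names are hypothetical.

```c
/* Hypothetical flag bits for symbol attributes. */
enum {
    SYM_CODE     = 1 << 0,
    SYM_DATA     = 1 << 1,
    SYM_ENTRY    = 1 << 2,
    SYM_EXTERNAL = 1 << 3
};

/* Returns 1 when the combination is allowed under the assumed rules. */
int flags_are_consistent(unsigned flags)
{
    if ((flags & SYM_EXTERNAL) && (flags & (SYM_CODE | SYM_DATA | SYM_ENTRY)))
        return 0; /* external conflicts with any local role */
    if ((flags & SYM_CODE) && (flags & SYM_DATA))
        return 0; /* one definition cannot be both code and data */
    return 1;
}

/* Adds a flag only if the result stays consistent. Returns 1 on success
   and leaves *flags unchanged on failure. */
int add_flag(unsigned *flags, unsigned flag)
{
    unsigned merged = *flags | flag;
    if (!flags_are_consistent(merged))
        return 0;
    *flags = merged;
    return 1;
}
```

A symbol can then pick up `SYM_ENTRY` in a later stage without losing its earlier `SYM_CODE` or `SYM_DATA` role, while a contradictory update is rejected at the point where it happens.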
Macros are represented with dedicated metadata describing their relevant source range. Macro expansion is treated as a structured preprocessing stage, not as loose text replacement, which keeps the logic clearer, the bookkeeping simpler, and the stages separate.
A dedicated structure tracks every location where an external symbol is used. For each external symbol, the assembler stores the name and one or more usage locations, each with base and offset information. External symbols have a one-to-many relationship with usage positions, and the model reflects that directly, so every usage location is represented explicitly and external references can be exported cleanly.
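The one-to-many relationship might be modeled as a name plus a linked list of usage records, as in this sketch. `ExtUse`, `ExternalRef`, and the function names are hypothetical and are not the project's actual definitions.

```c
#include <stdlib.h>

/* One usage location of an external symbol in the memory image. */
typedef struct ExtUse {
    int base;
    int offset;
    struct ExtUse *next;
} ExtUse;

/* One external symbol and all of its usage locations, in source order. */
typedef struct {
    const char *name;
    ExtUse *head;
    ExtUse *tail;
    int count;
} ExternalRef;

/* Appends a usage location. Returns 1 on success, 0 on allocation failure. */
int external_add_use(ExternalRef *ref, int base, int offset)
{
    ExtUse *use = malloc(sizeof *use);
    if (use == NULL)
        return 0;
    use->base = base;
    use->offset = offset;
    use->next = NULL;
    if (ref->tail != NULL)
        ref->tail->next = use;
    else
        ref->head = use;
    ref->tail = use;
    ref->count++;
    return 1;
}

void external_free(ExternalRef *ref)
{
    ExtUse *use = ref->head;
    while (use != NULL) {
        ExtUse *next = use->next;
        free(use);
        use = next;
    }
    ref->head = ref->tail = NULL;
    ref->count = 0;
}
```

Appending at the tail keeps usages in the order they appear in the program, which is a convenient order for writing them out later.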
A large portion of the project's complexity is in handling malformed input and subtle semantic errors. The diagnostic system provides:
- File and line-number context on every message
- Separate warning and error flows
- Output to both `stderr` and `errors.log`
- No premature termination: all errors in a stage are collected before stopping
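The collect-then-decide behavior can be sketched with a small reporter that writes each message to `stderr` and, if one is open, to a log stream, while only counting errors rather than exiting. The `Diagnostics` struct, `report`, `may_write_output`, and the message layout are assumptions for illustration, not the interface of `errors.c`.

```c
#include <stdarg.h>
#include <stdio.h>

typedef enum { SEV_WARNING, SEV_ERROR } Severity;

/* Hypothetical diagnostic state: an optional log stream plus counters. */
typedef struct {
    FILE *log; /* e.g. an opened errors.log, or NULL */
    int errors;
    int warnings;
} Diagnostics;

static void emit(FILE *out, const char *tag, const char *file, int line,
                 const char *fmt, va_list args)
{
    fprintf(out, "%s:%d: %s: ", file, line, tag);
    vfprintf(out, fmt, args);
    fputc('\n', out);
}

/* Reports a message with file/line context and keeps going. */
void report(Diagnostics *d, Severity sev, const char *file, int line,
            const char *fmt, ...)
{
    const char *tag = (sev == SEV_ERROR) ? "error" : "warning";
    va_list args;

    va_start(args, fmt);
    emit(stderr, tag, file, line, fmt, args);
    va_end(args);

    if (d->log != NULL) {
        /* A va_list cannot be reused after vfprintf, so restart it. */
        va_start(args, fmt);
        emit(d->log, tag, file, line, fmt, args);
        va_end(args);
    }

    if (sev == SEV_ERROR)
        d->errors++;
    else
        d->warnings++;
}

/* Output files are only worth writing when no error was recorded. */
int may_write_output(const Diagnostics *d)
{
    return d->errors == 0;
}
```

Since `report` never terminates the program, a stage can process every line and only check `may_write_output` at the end, so one run reports all problems.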
| Family | Examples |
|---|---|
| Macro preprocessing | Unnamed macro, nested macros, closing without opening, duplicate names, reserved-word names |
| Labels & symbols | Empty/illegal label, overlong name, illegal characters, collision with reserved keywords or register names, duplicate definitions, local/external role conflicts |
| Line structure | Illegal leading characters, extra trailing tokens, missing spaces in critical positions, invalid leading token, excessive line length |
| Comma & argument formatting | Missing/extra commas, comma before first arg, comma after last arg, illegal comma placement |
| Operand & immediate values | Wrong operand count, missing source/destination operand, too many operands, non-numeric where expected, malformed signed-number syntax |
| Addressing-mode legality | Illegal source/destination operand type, incompatible with current operation, illegal indexed-addressing usage, undefined register reference |
| Symbol resolution | Undefined labels, entry-declared but never defined, illegal symbol-state combinations, role contradictions |
| System-level | Input file could not be opened, output file could not be created, memory allocation failure |
| File | Role |
|---|---|
| `compiler.c` | Entry point; reads input files, calls `handleSingleFile()` for each |
| `preProccesor.c` | Macro expansion; reads `.as`, writes `.am` |
| `parse.c` | Central parsing logic, dispatches to first/second pass based on global state |
| `firstRun.c` | First pass: syntax/semantic validation, IC/DC tracking |
| `secondRun.c` | Second pass: machine-code encoding, label resolution, external tracking |
| `tables.c` | Symbol table, macro table, external operands (hash-table based) |
| `memory.c` | IC/DC counters, memory-image construction, `.ob` file writing |
| `operations.c` | Operation definitions (opcode, funct, addressing modes) and lookups |
| `exportFiles.c` | Generates `.ob`, `.ent`, `.ext` output files |
| `errors.c` | Centralized error/warning logging to `stderr` and `errors.log` |
| `sharedStates.c` | Global assembler state (active pass, current file, line number) with getters/setters for controlled access |
| `helpers.c` | String utilities: trimming, cloning, binary-to-hex conversion |
| `utils.c` | Validation helpers: instructions, macros, registers, label names |
data.h acts as a master aggregator — a single include that pulls in standard libraries (lib.h), all typedefs/enums/flags (variables.h), and all function prototypes (functions.h). Individual headers under headers/functions/ mirror each .c file.
| File | Contents |
|---|---|
| `complex_typedef.h` | Core structs: `BinaryWord`, `HexWord`, `Operation`, `Item` |
| `constants.h` | `BINARY_WORD_SIZE`, `MEMORY_START`, register codes, etc. |
| `flags.h` | Enums for errors, warnings, assembler states, booleans |
| `variables.h` | Aggregates the above into one header |
- `errors/`: intentional error cases
- `mixed/`: valid/invalid combinations
- `valid/`: fully valid assembler programs
- `errors.log`: error log from test runs
| Decision | Rationale |
|---|---|
| Two-pass architecture | Symbol resolution and final memory layout depend on information not available in a single read |
| Explicit 20-bit word modeling | Stays close to the target machine — clearer encoding, easier reasoning |
| Data-driven instruction definitions | Legality checks live in structured models, not scattered procedural conditions |
| Shared hash-table for symbols & macros | Both are name-keyed; sharing the infrastructure reduces duplication |
| Rich symbol metadata with base/offset | Supports base+offset encoding, symbol-role validation, and output generation without repeated decomposition |
| Symbol attribute flags | Symbols accumulate roles across stages; flags are more flexible than rigid single-category assignment |
| Dedicated external-usage tracking | External symbols require both declaration tracking and usage-location tracking — one-to-many by nature |
| Centralized diagnostics | Consistent, contextual reporting across all stages without premature termination |
Some of the most interesting problems lived at the boundary between stages.
Syntax vs. semantics — Some failures are local to one line. Others only become detectable after enough global context has been collected across the full file.
Symbol consistency across stages — A symbol may be declared early, updated in the first pass, marked as entry later, and potentially conflict with external/local rules — all of which must be tracked coherently across both passes.
Addressing-mode validation — Recognizing an operand is not enough. It must also be legal for the specific operation and position (source vs. destination). An operand can be syntactically valid and still be semantically illegal for a specific instruction.
Code/data address relationship — Final data symbol addresses depend on the final instruction-image size. Both images must stay consistent throughout both passes, and data addresses can only be finalized after the code image is complete.
External reference tracking — The assembler must record not only that an external symbol exists, but every individual usage location in the final memory image, each with its own base/offset pair.
If revisiting the project today:
Clearer phase separation — Separate lexical, syntactic, and semantic analysis even more explicitly into distinct layers, rather than interleaving them in the two-pass flow.
Less shared mutable state — Reduce reliance on global state in sharedStates.c and pass more context explicitly between stages.
More focused automated testing — Add dedicated tests for label validation, operand parsing, addressing legality, symbol resolution, and macro edge cases — not just full-file integration tests.
More formal diagnostics layering — Organize diagnostics more systematically by stage and severity, making it easier to triage failures during development.
This project sits in a strong middle ground:
- Small enough to understand fully
- Large enough to require real architectural thinking
- Low-level enough to feel like genuine systems programming
- Structured enough to have meaningful discussions about staging, data structures, symbol handling, and edge cases
It was a strong exercise in building a complete translation system — not just isolated parsing functions, but a full pipeline from symbolic source code to encoded machine-level output.
gcc -o assembler compiler.c errors.c exportFiles.c firstRun.c secondRun.c \
preProccesor.c tables.c helpers.c memory.c operations.c parse.c \
sharedStates.c utils.c
./assembler file1.as file2.as ...

Output files (`.ob`, `.ent`, `.ext`) are generated in the same directory as the source files. Errors are written to both `stderr` and `errors.log`.
MIT License — see LICENSE for details.