v1t3ls0n/assembler

🏆 Assembler Project


🌐 Course: 20465 – System Programming Lab  |  👨‍🏫 Instructor: Danny Calfon  |  📅 Submitted: 28.03.2022  |  📑 Spec: docs/spec.pdf

A complete assembler implemented in C for a custom assembly language targeting a fictional 20-bit machine.
Translates .as source files into encoded machine output through a staged pipeline of macro expansion → parsing → symbol resolution → memory-image construction → output generation.

Features at a Glance

  • Macro Expansion — Replaces macro usages with their corresponding definitions
  • Symbol Management — Resolves and defines labels across two passes
  • Machine Code Generation — Converts assembly instructions into 20-bit binary words
  • Auxiliary File Creation — Produces .ob, .ent, and .ext output files
  • Robust Error Handling — Detects and logs all errors at every stage without premature termination

🎯 What It Does

This project is a complete assembler, not just a parser. For each source file, it:

  1. Expands macros into an intermediate .am file
  2. First pass — parses, validates, builds the symbol table, tracks instruction/data counters
  3. Address finalization — adjusts data symbol addresses once the instruction image size is known
  4. Second pass — resolves symbol references, encodes final machine words
  5. Output generation — produces .ob, .ent, and .ext files if no fatal errors occurred

The work combines pipeline architecture, custom data structures, stateful multi-stage processing, machine-oriented encoding, and edge-case handling in a single program.

This project implements the assembler stage only. It does not implement linking or loading.


🖥️ Target Machine

| Property | Value |
|---|---|
| Memory | 8192 words |
| Word size | 20 bits |
| Registers | 16 general-purpose |
| Instruction set | 16 operations |
| Addressing modes | 4 |

This project is not just about parsing text — it is a translator tied to a concrete machine representation. The internal design had to match instruction encoding rules, word structure, register model, addressing-method rules, and base/offset address representation.


🏗️ Pipeline Architecture

Source file (.as)
      │
      ▼
┌─────────────────────┐
│  Macro Preprocessing │  → Expands macros, generates .am file
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│     First Pass       │  → Parses, validates, builds symbol table
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│ Address Finalization │  → Adjusts data symbol addresses
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│     Second Pass      │  → Encodes machine words, resolves symbols
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│   Output Generation  │  → .ob  /  .ent  /  .ext
└─────────────────────┘

Each stage has a single focused responsibility and prepares the data needed by the next one.
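
The staged flow above can be sketched as a list of stage functions run in order, stopping at the first stage that reports errors. This is a minimal illustration, not the project's actual driver: the `Stage` type, `run_pipeline`, and the placeholder stages are all hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stage signature: returns true when the stage finished
   without errors for the given source base name. */
typedef bool (*StageFn)(const char *base_name);

typedef struct {
    const char *name;  /* label used only for reporting */
    StageFn run;
} Stage;

/* Runs the stages in order and returns how many completed successfully.
   A failing stage stops the pipeline, so later stages never consume
   data that an earlier stage rejected. */
size_t run_pipeline(const Stage *stages, size_t count, const char *base_name)
{
    size_t done = 0;
    while (done < count && stages[done].run(base_name)) {
        done++;
    }
    return done;
}

/* Placeholder stages, for illustration only. */
static bool stage_ok(const char *base_name)   { (void)base_name; return true; }
static bool stage_fail(const char *base_name) { (void)base_name; return false; }
```

If `run_pipeline` returns a value smaller than the stage count, a caller can treat that as "skip output generation for this file".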


🔁 Why Two Passes?

Assembly code may reference a label before the line that defines it, so the final encoding of such an instruction cannot be produced in a single read-through.

First pass — discovery phase:

  • Parse and classify every line
  • Validate syntax and structure
  • Detect labels and directives
  • Build the symbol table
  • Update instruction counter (IC) and data counter (DC)
  • Collect metadata needed for final encoding

Address finalization — once the instruction image size is known, data symbol addresses are adjusted to their correct final values.

Second pass — encoding phase:

  • Resolve all symbol references
  • Validate symbol-dependent semantics
  • Encode final machine words
  • Write binary words into the memory image
  • Track external symbol usage locations
  • Prepare output artifacts

This separation keeps forward references manageable and cleanly divides discovery from final encoding.
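
As a rough illustration of why the split handles forward references, the sketch below reduces each source line to "defines a label" and "references a label". The `Line` and `Label` types and both function names are assumptions made for this example, and line indices stand in for real addresses.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *label;   /* label defined on this line, or NULL */
    const char *target;  /* label referenced by this line, or NULL */
} Line;

typedef struct {
    const char *name;
    int address;         /* here: simply the index of the defining line */
} Label;

/* First pass: collect every label definition. Returns the label count.
   The caller must provide a table with room for n entries. */
size_t first_pass(const Line *lines, size_t n, Label *table)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (lines[i].label != NULL) {
            table[count].name = lines[i].label;
            table[count].address = (int)i;
            count++;
        }
    }
    return count;
}

/* Second pass: resolve each reference against the finished table.
   resolved[i] receives the target address, or -1 when the line has no
   reference or the label is undefined. Returns the number of
   undefined references. */
int second_pass(const Line *lines, size_t n,
                const Label *table, size_t label_count, int *resolved)
{
    int undefined = 0;
    for (size_t i = 0; i < n; i++) {
        resolved[i] = -1;
        if (lines[i].target == NULL) {
            continue;
        }
        for (size_t k = 0; k < label_count; k++) {
            if (strcmp(table[k].name, lines[i].target) == 0) {
                resolved[i] = table[k].address;
                break;
            }
        }
        if (resolved[i] < 0) {
            undefined++;
        }
    }
    return undefined;
}
```

A reference on line 0 to a label defined on line 3 resolves normally here, because the table is complete before any reference is looked up.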


🧩 Processing Stages

1️⃣ Macro Preprocessing

Before assembly begins, the source goes through a macro-expansion stage. It detects macro definitions, stores their metadata, expands all usages, and generates an intermediate macro-free .am file. Separating macro handling from the main assembler keeps the rest of the pipeline simpler and more predictable.

2️⃣ First Pass

Builds the semantic foundation of the program — parses each line, classifies statements, validates syntax, detects labels, inserts symbols into the symbol table, and updates IC/DC. By the end, the assembler knows the structure of the program and has enough information to finalize layout.

3️⃣ Address Finalization

After the first pass, symbol values are adjusted based on final instruction-image size. This is especially important for data symbols, whose final positions depend on where the code image ends.
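
A minimal sketch of this adjustment, assuming data symbols hold an offset relative to the data image until the first pass ends. The `Symbol` layout, the flag names, and `finalize_data_addresses` are hypothetical, and the addresses in the usage note are illustrative values only.

```c
#include <stddef.h>

/* Hypothetical attribute flags for this sketch. */
enum { ATTR_CODE = 1 << 0, ATTR_DATA = 1 << 1 };

typedef struct {
    const char *name;
    unsigned value;  /* code: final address; data: offset in the data image */
    unsigned attrs;
} Symbol;

/* Shifts every data symbol by the final instruction counter so the data
   image lands directly after the code image. Code symbols are untouched. */
void finalize_data_addresses(Symbol *symbols, size_t count, unsigned final_ic)
{
    for (size_t i = 0; i < count; i++) {
        if (symbols[i].attrs & ATTR_DATA) {
            symbols[i].value += final_ic;
        }
    }
}
```

For example, if the code image ends at address 120, a data symbol recorded at data offset 7 ends up at 127, while a code symbol at 100 stays at 100.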

4️⃣ Second Pass

Completes the translation — resolves label references, encodes instructions and operands, writes final machine words into the memory image, tracks external symbol positions, and validates symbol-dependent cases. This is where parsed source data becomes final encoded output.

5️⃣ Output Generation

If no fatal errors are found, this stage generates the object file (.ob), the entry file (.ent), and the external file (.ext), alongside the expanded intermediate .am file produced during macro preprocessing. Writing these files makes the assembler a full translation stage rather than only an in-memory analyzer.


🧱 Core Data Structures

A major part of the project design was choosing internal representations that match both the machine model and the translation workflow. These were architectural choices, not just implementation details.

🔹 Machine Word Representation

Dedicated structs model a single bit, a 20-bit binary word, and a printable 5-part hex-style word. Rather than treating encoded output as plain integers, the memory representation reflects the target machine directly. This gives clearer memory-image construction, easier reasoning about encoded output, and strong alignment with the 20-bit word model.
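
One way to picture the relationship between a 20-bit word and its five printable parts is to split the word into five 4-bit groups. The sketch below uses a plain integer and a hypothetical `HexGroups` type for brevity; it is not the project's `BinaryWord` or `HexWord` definition, and the exact printed layout is left to docs/spec.pdf.

```c
#include <stdint.h>

#define WORD_BITS 20
#define GROUPS 5   /* 20 bits = 5 groups of 4 bits */

typedef struct {
    unsigned char digit[GROUPS];  /* most significant group first */
} HexGroups;

/* Keeps only the low 20 bits, matching the machine's word size. */
uint32_t mask_word(uint32_t value)
{
    return value & ((UINT32_C(1) << WORD_BITS) - 1u);
}

/* Splits a 20-bit word into five hex digits, most significant first. */
HexGroups to_hex_groups(uint32_t word)
{
    HexGroups g;
    word = mask_word(word);
    for (int i = 0; i < GROUPS; i++) {
        int shift = 4 * (GROUPS - 1 - i);
        g.digit[i] = (unsigned char)((word >> shift) & 0xFu);
    }
    return g;
}
```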

🔹 Operation Table

Each operation stores its opcode, funct, keyword, and the legal addressing modes for both source and destination operands. This makes instruction validation data-driven — instead of scattering operation rules across long conditional chains, the legal structure of the instruction set lives in the operation model itself. Benefits: clearer validation logic, easier operand compatibility checks, better separation between language definition and parser logic.

🔹 Addressing-Mode Capability Model

Allowed addressing methods are represented explicitly per operation and per operand position. The assembler must validate not only whether an operand is recognizable, but also whether its addressing mode is legal for a specific operation and position (source vs. destination). This produces stronger semantic validation with less duplicated logic.
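
The data-driven approach described for the operation table and the capability model can be sketched with one bitmask per operand position. Everything here is illustrative: the mode names, the `OpSketch` type, and the two table rows (including their opcode and funct values and allowed modes) are assumptions, not the machine's real instruction set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed names for the four addressing modes. */
typedef enum { MODE_IMMEDIATE, MODE_DIRECT, MODE_INDEX, MODE_REGISTER } AddrMode;

#define M(mode) (1u << (mode))

typedef struct {
    const char *keyword;
    unsigned opcode;
    unsigned funct;
    unsigned src_modes;  /* bitmask of legal source addressing modes */
    unsigned dst_modes;  /* bitmask of legal destination addressing modes */
} OpSketch;

/* Illustrative rows only; real values come from the machine spec. */
static const OpSketch OPS[] = {
    { "mov", 0, 0,
      M(MODE_IMMEDIATE) | M(MODE_DIRECT) | M(MODE_INDEX) | M(MODE_REGISTER),
      M(MODE_DIRECT) | M(MODE_INDEX) | M(MODE_REGISTER) },
    { "lea", 4, 0,
      M(MODE_DIRECT) | M(MODE_INDEX),
      M(MODE_DIRECT) | M(MODE_INDEX) | M(MODE_REGISTER) },
};

/* Looks an operation up by keyword; returns NULL when unknown. */
const OpSketch *find_op(const char *keyword)
{
    for (size_t i = 0; i < sizeof OPS / sizeof OPS[0]; i++) {
        if (strcmp(OPS[i].keyword, keyword) == 0) {
            return &OPS[i];
        }
    }
    return NULL;
}

/* Legality is a table lookup, not a chain of per-instruction conditions. */
bool operands_legal(const OpSketch *op, AddrMode src, AddrMode dst)
{
    return (op->src_modes & M(src)) != 0 && (op->dst_modes & M(dst)) != 0;
}
```

Adding an operation in this style means adding a table row; the validation function does not change.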

🔹 Hash-Table Infrastructure

The symbol table and macro table share the same underlying hash-table implementation with chained collision handling. Each item stores a name, a type-specific payload via a union, and a pointer for chaining. Both symbols and macros are primarily accessed by name, so hash-based lookup is a natural fit — fast lookup, reusable core logic, less repeated code, better abstraction in C.
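
A minimal sketch of a chained hash table whose entries carry a union payload, in the spirit of the shared infrastructure described above. The `Entry` layout, the bucket count, and the hash function are assumptions for illustration and are not taken from tables.c; cleanup is omitted to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

#define BUCKETS 64  /* arbitrary size for the sketch */

typedef enum { ITEM_SYMBOL, ITEM_MACRO } ItemKind;

typedef struct Entry {
    char *name;
    ItemKind kind;
    union {
        struct { unsigned address; unsigned attrs; } symbol;
        struct { long start; long end; } macro;
    } as;                 /* type-specific payload */
    struct Entry *next;   /* next entry in the same bucket */
} Entry;

typedef struct {
    Entry *bucket[BUCKETS];
} Table;

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash_name(const char *s)
{
    unsigned h = 5381u;
    while (*s != '\0') {
        h = h * 33u + (unsigned char)*s++;
    }
    return h % BUCKETS;
}

/* Returns the entry with the given name, or NULL when absent. */
Entry *table_find(const Table *t, const char *name)
{
    for (Entry *e = t->bucket[hash_name(name)]; e != NULL; e = e->next) {
        if (strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Inserts a new zero-initialized entry at the head of its chain.
   Returns NULL on a duplicate name or on allocation failure. */
Entry *table_insert(Table *t, const char *name, ItemKind kind)
{
    if (table_find(t, name) != NULL) {
        return NULL;
    }
    Entry *e = calloc(1, sizeof *e);
    if (e == NULL) {
        return NULL;
    }
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->name, name);
    e->kind = kind;
    unsigned h = hash_name(name);
    e->next = t->bucket[h];
    t->bucket[h] = e;
    return e;
}
```

Because rejecting duplicates is part of insertion, a "duplicate definition" diagnostic can be raised at the call site when `table_insert` returns NULL for a name that is already present.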

🔹 Rich Symbol Model

Each symbol stores more than just a name and address. It includes:

  • Full address value
  • Base address and offset (for the target machine's base+offset encoding)
  • Attribute flags: code, data, entry, external

Storing base/offset directly in the symbol model avoids repeated decomposition in later stages and simplifies both encoding and entry/external export logic.
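
The decomposition can be sketched as follows, assuming the base is the address rounded down to a multiple of a fixed block size and the offset is the remainder. The block size of 16 and the `SplitAddress` type are assumptions for illustration; docs/spec.pdf defines the actual rule.

```c
#define BLOCK_SIZE 16u  /* assumed block size, see lead-in */

typedef struct {
    unsigned address;  /* full address */
    unsigned base;     /* largest multiple of BLOCK_SIZE not above address */
    unsigned offset;   /* address - base */
} SplitAddress;

/* Computes base and offset once so later stages can read them directly. */
SplitAddress split_address(unsigned address)
{
    SplitAddress s;
    s.address = address;
    s.offset = address % BLOCK_SIZE;
    s.base = address - s.offset;
    return s;
}
```

Under that assumption, address 107 splits into base 96 and offset 11.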

🔹 Symbol Attribute Flags

Instead of assigning each symbol one rigid category, properties are stored as flags. A symbol may accumulate meaning across stages — it may be code or data, later marked as entry, or conflict with external/local rules. Some combinations are legal; some are contradictions. Flags make this flexible to represent and straightforward to validate.
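
A small sketch of flag-based validation. The flag names and the specific rules (code and data are mutually exclusive; an external symbol can be neither locally defined nor an entry) are assumptions chosen to illustrate the idea, not a statement of the project's exact rule set.

```c
#include <stdbool.h>

/* Hypothetical attribute flags; a symbol may carry several at once. */
enum {
    SYM_CODE     = 1 << 0,
    SYM_DATA     = 1 << 1,
    SYM_ENTRY    = 1 << 2,
    SYM_EXTERNAL = 1 << 3
};

/* Returns true when the combination of flags is not contradictory. */
bool attrs_consistent(unsigned attrs)
{
    bool is_code = (attrs & SYM_CODE) != 0;
    bool is_data = (attrs & SYM_DATA) != 0;
    bool is_entry = (attrs & SYM_ENTRY) != 0;
    bool is_external = (attrs & SYM_EXTERNAL) != 0;

    if (is_code && is_data) {
        return false;  /* a symbol lives in one image only */
    }
    if (is_external && (is_code || is_data || is_entry)) {
        return false;  /* external symbols are defined elsewhere */
    }
    return true;
}
```

Marking a code symbol as an entry later is then just `attrs |= SYM_ENTRY` followed by a consistency check.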

🔹 Macro Metadata Model

Macros are represented with dedicated metadata describing their relevant source range. Macro expansion is treated as a structured preprocessing stage, not as loose text replacement — clearer logic, easier bookkeeping, better stage separation.
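
To illustrate range-based macro metadata, the sketch below records each macro as a range of line indices and replays that range at every usage. It works on an in-memory array of pre-trimmed lines rather than on files, and the `macro` / `endm` keywords, the `MacroRange` type, and the sample lines in the usage note are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_MACROS 8  /* arbitrary cap; further definitions are copied as plain lines */

typedef struct {
    const char *name;
    size_t first;  /* index of the first body line */
    size_t last;   /* index one past the last body line */
} MacroRange;

/* Copies input lines to out, dropping macro definitions and replacing
   each line that equals a macro name with that macro's body lines.
   Returns the number of lines written (at most out_cap). */
size_t expand_macros(const char *const *in, size_t n,
                     const char **out, size_t out_cap)
{
    MacroRange macros[MAX_MACROS];
    size_t macro_count = 0;
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (strncmp(in[i], "macro ", 6) == 0 && macro_count < MAX_MACROS) {
            MacroRange *m = &macros[macro_count++];
            m->name = in[i] + 6;
            m->first = i + 1;
            while (i < n && strcmp(in[i], "endm") != 0) {
                i++;  /* skip the body; it is replayed at each usage */
            }
            m->last = i;
            continue;
        }
        const MacroRange *hit = NULL;
        for (size_t k = 0; k < macro_count; k++) {
            if (strcmp(in[i], macros[k].name) == 0) {
                hit = &macros[k];
            }
        }
        if (hit != NULL) {
            for (size_t j = hit->first; j < hit->last && written < out_cap; j++) {
                out[written++] = in[j];
            }
        } else if (written < out_cap) {
            out[written++] = in[i];
        }
    }
    return written;
}
```

Given the lines `macro m1`, `inc r2`, `mov r1, r2`, `endm`, `m1`, `stop`, `m1`, the output holds the two body lines, then `stop`, then the two body lines again.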

🔹 External Usage Tracking

A dedicated structure tracks every location where an external symbol is used. For each external symbol, the assembler stores the name and one or more usage locations, each with base and offset information. External symbols have a one-to-many relationship with usage positions, and the model reflects that directly — clean export of external references, explicit representation of all usage locations.
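
The one-to-many relationship can be sketched as a name plus a growable array of usage sites. The `UseSite` and `ExternalUses` types and `record_use` are hypothetical names; the project's real structure may differ, for instance by using a linked list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned base;
    unsigned offset;
} UseSite;

typedef struct {
    const char *name;  /* external symbol name */
    UseSite *sites;    /* every location where the symbol is used */
    size_t count;
    size_t capacity;
} ExternalUses;

/* Appends one usage location, growing the array when full.
   Returns false on allocation failure, leaving existing data intact. */
bool record_use(ExternalUses *ext, unsigned base, unsigned offset)
{
    if (ext->count == ext->capacity) {
        size_t new_cap = ext->capacity != 0 ? ext->capacity * 2 : 4;
        UseSite *grown = realloc(ext->sites, new_cap * sizeof *grown);
        if (grown == NULL) {
            return false;
        }
        ext->sites = grown;
        ext->capacity = new_cap;
    }
    ext->sites[ext->count].base = base;
    ext->sites[ext->count].offset = offset;
    ext->count++;
    return true;
}
```

Writing the external file then amounts to iterating over `sites` for each external symbol.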


⚠️ Error Handling & Diagnostics

A large portion of the project's complexity is in handling malformed input and subtle semantic errors. The diagnostic system provides:

  • File and line-number context on every message
  • Separate warning and error flows
  • Output to both stderr and errors.log
  • No premature termination — all errors in a stage are collected before stopping
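
The properties listed above can be sketched with a small reporter that counts errors and warnings separately, prefixes each message with file and line, and writes to two streams. The `Diagnostics` type, the message format, and the function names are assumptions for illustration; they do not reproduce errors.c.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { SEV_WARNING, SEV_ERROR } Severity;

typedef struct {
    const char *file;  /* file currently being processed */
    int line;          /* current line number */
    unsigned errors;
    unsigned warnings;
} Diagnostics;

/* Writes one formatted message to a stream; a NULL stream is skipped. */
static void emit(FILE *out, const Diagnostics *d, Severity sev, const char *msg)
{
    if (out != NULL) {
        fprintf(out, "%s:%d: %s: %s\n", d->file, d->line,
                sev == SEV_ERROR ? "error" : "warning", msg);
    }
}

/* Records the diagnostic and keeps going, so every problem in a stage
   is reported instead of only the first one. */
void report(Diagnostics *d, FILE *console, FILE *log,
            Severity sev, const char *msg)
{
    if (sev == SEV_ERROR) {
        d->errors++;
    } else {
        d->warnings++;
    }
    emit(console, d, sev, msg);
    emit(log, d, sev, msg);
}

/* Warnings do not block output; any error does. */
bool may_generate_output(const Diagnostics *d)
{
    return d->errors == 0;
}
```

A caller would typically pass `stderr` as the console stream and a handle opened on errors.log as the log stream, then consult `may_generate_output` once the stage has finished.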

Error Families

| Family | Examples |
|---|---|
| Macro preprocessing | Unnamed macro, nested macros, closing without opening, duplicate names, reserved-word names |
| Labels & symbols | Empty/illegal label, overlong name, illegal characters, collision with reserved keywords or register names, duplicate definitions, local/external role conflicts |
| Line structure | Illegal leading characters, extra trailing tokens, missing spaces in critical positions, invalid leading token, excessive line length |
| Comma & argument formatting | Missing/extra commas, comma before first arg, comma after last arg, illegal comma placement |
| Operand & immediate values | Wrong operand count, missing source/destination operand, too many operands, non-numeric where expected, malformed signed-number syntax |
| Addressing-mode legality | Illegal source/destination operand type, incompatible with current operation, illegal indexed-addressing usage, undefined register reference |
| Symbol resolution | Undefined labels, entry-declared but never defined, illegal symbol-state combinations, role contradictions |
| System-level | Input file could not be opened, output file could not be created, memory allocation failure |

🏗️ File Structure

Main Source Files (.c)

| File | Role |
|---|---|
| compiler.c | Entry point; reads input files, calls handleSingleFile() for each |
| preProccesor.c | Macro expansion: reads .as, writes .am |
| parse.c | Central parsing logic, dispatches to first/second pass based on global state |
| firstRun.c | First pass: syntax/semantic validation, IC/DC tracking |
| secondRun.c | Second pass: machine-code encoding, label resolution, external tracking |
| tables.c | Symbol table, macro table, external operands (hash-table based) |
| memory.c | IC/DC counters, memory-image construction, .ob file writing |
| operations.c | Operation definitions (opcode, funct, addressing modes) and lookups |
| exportFiles.c | Generates .ob, .ent, .ext output files |
| errors.c | Centralized error/warning logging to stderr and errors.log |
| sharedStates.c | Global assembler state (active pass, current file, line number), with getters/setters for controlled access |
| helpers.c | String utilities: trimming, cloning, binary-to-hex conversion |
| utils.c | Validation helpers: instructions, macros, registers, label names |

Header Files (headers/)

data.h acts as a master aggregator — a single include that pulls in standard libraries (lib.h), all typedefs/enums/flags (variables.h), and all function prototypes (functions.h). Individual headers under headers/functions/ mirror each .c file.

Variables (variables/)

| File | Contents |
|---|---|
| complex_typedef.h | Core structs: BinaryWord, HexWord, Operation, Item |
| constants.h | BINARY_WORD_SIZE, MEMORY_START, register codes, etc. |
| flags.h | Enums for errors, warnings, assembler states, booleans |
| variables.h | Aggregates the above into one header |

Test Files (__test_files/)

  • errors/ — intentional error cases
  • mixed/ — valid/invalid combinations
  • valid/ — fully valid assembler programs
  • errors.log — error log from test runs

💡 Design Decisions

| Decision | Rationale |
|---|---|
| Two-pass architecture | Symbol resolution and final memory layout depend on information not available in a single read |
| Explicit 20-bit word modeling | Stays close to the target machine: clearer encoding, easier reasoning |
| Data-driven instruction definitions | Legality checks live in structured models, not scattered procedural conditions |
| Shared hash-table for symbols & macros | Both are name-keyed; sharing the infrastructure reduces duplication |
| Rich symbol metadata with base/offset | Supports base+offset encoding, symbol-role validation, and output generation without repeated decomposition |
| Symbol attribute flags | Symbols accumulate roles across stages; flags are more flexible than rigid single-category assignment |
| Dedicated external-usage tracking | External symbols require both declaration tracking and usage-location tracking, a one-to-many relationship by nature |
| Centralized diagnostics | Consistent, contextual reporting across all stages without premature termination |

🧠 Subtle Challenges

Some of the most interesting problems lived at the boundary between stages.

Syntax vs. semantics — Some failures are local to one line. Others only become detectable after enough global context has been collected across the full file.

Symbol consistency across stages — A symbol may be declared early, updated in the first pass, marked as entry later, and potentially conflict with external/local rules — all of which must be tracked coherently across both passes.

Addressing-mode validation — Recognizing an operand is not enough. It must also be legal for the specific operation and position (source vs. destination). An operand can be syntactically valid and still be semantically illegal for a specific instruction.

Code/data address relationship — Final data symbol addresses depend on the final instruction-image size. Both images must stay consistent throughout both passes, and data addresses can only be finalized after the code image is complete.

External reference tracking — The assembler must record not only that an external symbol exists, but every individual usage location in the final memory image, each with its own base/offset pair.


🛠️ Possible Improvements

If revisiting the project today:

Clearer phase separation — Separate lexical, syntactic, and semantic analysis even more explicitly into distinct layers, rather than interleaving them in the two-pass flow.

Less shared mutable state — Reduce reliance on global state in sharedStates.c and pass more context explicitly between stages.

More focused automated testing — Add dedicated tests for label validation, operand parsing, addressing legality, symbol resolution, and macro edge cases — not just full-file integration tests.

More formal diagnostics layering — Organize diagnostics more systematically by stage and severity, making it easier to triage failures during development.


📌 Summary

This project sits in a strong middle ground:

  • Small enough to understand fully
  • Large enough to require real architectural thinking
  • Low-level enough to feel like genuine systems programming
  • Structured enough to have meaningful discussions about staging, data structures, symbol handling, and edge cases

It was a strong exercise in building a complete translation system — not just isolated parsing functions, but a full pipeline from symbolic source code to encoded machine-level output.


🚀 Usage

gcc -o assembler compiler.c errors.c exportFiles.c firstRun.c secondRun.c \
    preProccesor.c tables.c helpers.c memory.c operations.c parse.c \
    sharedStates.c utils.c

./assembler file1.as file2.as ...

Output files (.ob, .ent, .ext) are generated in the same directory as the source files. Errors are written to both stderr and errors.log.


⚖️ License

MIT License — see LICENSE for details.
