🏆 Assembler Project

🌐 Course: 20465 – System Programming Lab | 👨‍🏫 Instructor: Danny Calfon | 📅 Submitted: 28.03.2022 | 📑 Spec: docs/spec.pdf

A complete assembler implemented in C for a custom assembly language targeting a fictional 20-bit machine.
Translates .as source files into encoded machine output through a staged pipeline of macro expansion → parsing → symbol resolution → memory-image construction → output generation.

Features at a Glance

Macro Expansion — Replaces macro usages with their corresponding definitions
Symbol Management — Resolves and defines labels across two passes
Machine Code Generation — Converts assembly instructions into 20-bit binary words
Auxiliary File Creation — Produces .ob, .ent, and .ext output files
Robust Error Handling — Detects and logs all errors at every stage without premature termination

🎯 What It Does

This project is a complete assembler, not just a parser. For each source file, it:

1. Macro expansion: expands macros into an intermediate .am file
2. First pass: parses, validates, builds the symbol table, tracks instruction/data counters
3. Address finalization: adjusts data symbol addresses once the instruction image size is known
4. Second pass: resolves symbol references, encodes final machine words
5. Output generation: produces .ob, .ent, and .ext files if no fatal errors occurred

The project combines a staged architecture, custom data structures, stateful multi-pass processing, bit-level machine encoding, and extensive edge-case handling in a single program.

This project implements the assembler stage only. It does not implement linking or loading.

🖥️ Target Machine

| Property | Value |
| --- | --- |
| Memory | 8192 words |
| Word size | 20 bits |
| Registers | 16 general-purpose |
| Instruction set | 16 operations |
| Addressing modes | 4 |

The assembler is tied to a concrete machine representation. Its internal design has to match the instruction encoding rules, word structure, register model, addressing-mode rules, and base/offset address representation.

🏗️ Pipeline Architecture

Source file (.as)
      │
      ▼
┌─────────────────────┐
│  Macro Preprocessing │  → Expands macros, generates .am file
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│     First Pass       │  → Parses, validates, builds symbol table
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│ Address Finalization │  → Adjusts data symbol addresses
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│     Second Pass      │  → Encodes machine words, resolves symbols
└─────────────────────┘
      │
      ▼
┌─────────────────────┐
│   Output Generation  │  → .ob  /  .ent  /  .ext
└─────────────────────┘

Each stage has a single focused responsibility and prepares the data needed by the next one.
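
The hand-off between stages can be sketched as a small driver loop. This is only an illustration under assumptions: `PipelineState`, `StageFn`, `run_pipeline`, and the two stand-in stages are hypothetical names, not the project's API. The one behaviour it mirrors from this README is that a stage runs to completion and later stages are skipped once errors have been recorded.

```c
#include <stddef.h>

/* Hypothetical shared state handed from stage to stage. */
typedef struct {
    int errors;      /* total errors reported so far */
    int stages_run;  /* how many stages actually executed */
} PipelineState;

/* A stage inspects or updates the state and returns the number of errors it found. */
typedef int (*StageFn)(PipelineState *state);

/* Stand-in stages, for illustration only. */
static int stage_clean(PipelineState *state) { (void)state; return 0; }
static int stage_two_errors(PipelineState *state) { (void)state; return 2; }

/* Runs the stages in order. Each stage runs to completion, but the pipeline
   stops before the next stage once any errors have been recorded.
   Returns 1 if every stage ran cleanly, 0 otherwise. */
static int run_pipeline(const StageFn *stages, size_t count, PipelineState *state) {
    size_t i;
    for (i = 0; i < count; i++) {
        state->errors += stages[i](state);
        state->stages_run++;
        if (state->errors > 0)
            return 0;
    }
    return 1;
}
```

A caller would fill an array of `StageFn` pointers in pipeline order and write output files only when `run_pipeline` returns 1.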

🔁 Why Two Passes?

Assembly code may reference a label before the line that defines it (a forward reference). When such a reference is first read, the label's address is not yet known, so final encoding cannot be completed in a single read-through.

First pass — discovery phase:

Parse and classify every line
Validate syntax and structure
Detect labels and directives
Build the symbol table
Update instruction counter (IC) and data counter (DC)
Collect metadata needed for final encoding

Address finalization — once the instruction image size is known, data symbol addresses are adjusted to their correct final values.

Second pass — encoding phase:

Resolve all symbol references
Validate symbol-dependent semantics
Encode final machine words
Write binary words into the memory image
Track external symbol usage locations
Prepare output artifacts

This separation keeps forward references manageable and cleanly divides discovery from final encoding.
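
A toy version of the two passes shows how a forward reference gets resolved. It assumes a deliberately simplified model (every line occupies one word, and a line either defines a label, references one, or both); `ToyLine`, `ToySymbol`, and both functions are hypothetical and far simpler than the project's real parser.

```c
#include <stddef.h>
#include <string.h>

/* Toy line: an optional label defined here and an optional label referenced here. */
typedef struct {
    const char *label; /* NULL if the line defines no label */
    const char *ref;   /* NULL if the line references no label */
} ToyLine;

typedef struct {
    const char *name;
    int address;
} ToySymbol;

/* First pass: record the address of every label.
   'table' must have room for n entries. Returns the number of symbols stored. */
static size_t toy_first_pass(const ToyLine *lines, size_t n, int start, ToySymbol *table) {
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        if (lines[i].label != NULL) {
            table[count].name = lines[i].label;
            table[count].address = start + (int)i;
            count++;
        }
    }
    return count;
}

/* Second pass: resolve each reference against the completed table.
   resolved[i] is 0 for lines without a reference and -1 for an undefined label. */
static void toy_second_pass(const ToyLine *lines, size_t n,
                            const ToySymbol *table, size_t count, int *resolved) {
    size_t i, s;
    for (i = 0; i < n; i++) {
        resolved[i] = 0;
        if (lines[i].ref == NULL)
            continue;
        resolved[i] = -1;
        for (s = 0; s < count; s++) {
            if (strcmp(table[s].name, lines[i].ref) == 0) {
                resolved[i] = table[s].address;
                break;
            }
        }
    }
}
```

If the first line references a label defined three lines later, the first pass has already stored that label's address by the time the second pass looks it up, so the order of definition and use no longer matters.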

🧩 Processing Stages

1️⃣ Macro Preprocessing

Before assembly begins, the source goes through a macro-expansion stage. It detects macro definitions, stores their metadata, expands all usages, and generates an intermediate macro-free .am file. Separating macro handling from the main assembler keeps the rest of the pipeline simpler and more predictable.

2️⃣ First Pass

Builds the semantic foundation of the program — parses each line, classifies statements, validates syntax, detects labels, inserts symbols into the symbol table, and updates IC/DC. By the end, the assembler knows the structure of the program and has enough information to finalize layout.

3️⃣ Address Finalization

After the first pass, symbol values are adjusted based on final instruction-image size. This is especially important for data symbols, whose final positions depend on where the code image ends.
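
As a rough sketch of this adjustment, assume that during the first pass each data symbol holds its offset within the data image, and that the data image is placed directly after the code image. Both points are assumptions for illustration, as are the names `SketchSymbol`, `SYM_DATA`, and `finalize_data_addresses`.

```c
#include <stddef.h>

/* Hypothetical attribute bits. */
enum { SYM_CODE = 1, SYM_DATA = 2 };

typedef struct {
    const char *name;
    unsigned value;   /* address; for data symbols, a data-image offset until finalized */
    unsigned flags;
} SketchSymbol;

/* Shifts every data symbol by the final instruction counter so that the
   data image starts where the code image ends. Code symbols are untouched. */
static void finalize_data_addresses(SketchSymbol *symbols, size_t count, unsigned final_ic) {
    size_t i;
    for (i = 0; i < count; i++) {
        if (symbols[i].flags & SYM_DATA)
            symbols[i].value += final_ic;
    }
}
```

The shift can only happen once, after the first pass, because `final_ic` is unknown until every instruction has been counted.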

4️⃣ Second Pass

Completes the translation — resolves label references, encodes instructions and operands, writes final machine words into the memory image, tracks external symbol positions, and validates symbol-dependent cases. This is where parsed source data becomes final encoded output.

5️⃣ Output Generation

If no fatal errors are found, the assembler's outputs are the expanded intermediate file (.am, written during macro preprocessing), the object file (.ob), the entry file (.ent), and the external file (.ext). This makes the assembler a full translation stage rather than only an in-memory analyzer.

🧱 Core Data Structures

A major part of the project design was choosing internal representations that match both the machine model and the translation workflow. These were architectural choices, not just implementation details.

🔹 Machine Word Representation

The project defines dedicated structs for a single bit, a 20-bit binary word, and a printable 5-part hex-style word. Rather than treating encoded output as plain integers, the memory representation mirrors the target machine directly, which makes memory-image construction clearer and encoded output easier to reason about within the 20-bit word model.
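
One possible shape for these three structs is sketched below. The layouts and names (`SketchBit`, `SketchBinaryWord`, `SketchHexWord`) are assumptions, not the definitions in `complex_typedef.h`, and the printable form here is simply five hex digits with the most significant first.

```c
#define WORD_BITS 20
#define HEX_PARTS 5

/* A single bit, stored as a 1-bit field. */
typedef struct { unsigned int on : 1; } SketchBit;

/* A 20-bit word; digit[0] is the least significant bit. */
typedef struct { SketchBit digit[WORD_BITS]; } SketchBinaryWord;

/* Five hex digits, most significant first, NUL-terminated for printing. */
typedef struct { char part[HEX_PARTS + 1]; } SketchHexWord;

/* Stores the low 20 bits of 'value'; higher bits are discarded. */
static SketchBinaryWord word_from_value(unsigned long value) {
    SketchBinaryWord w;
    int i;
    for (i = 0; i < WORD_BITS; i++)
        w.digit[i].on = (unsigned int)((value >> i) & 1UL);
    return w;
}

/* Groups the 20 bits into five 4-bit parts and renders each as a hex digit. */
static SketchHexWord word_to_hex(const SketchBinaryWord *w) {
    static const char digits[] = "0123456789abcdef";
    SketchHexWord h;
    int p, b;
    for (p = 0; p < HEX_PARTS; p++) {
        unsigned nibble = 0;
        for (b = 0; b < 4; b++) {
            int bit_index = WORD_BITS - 1 - (p * 4 + b);
            nibble = (nibble << 1) | w->digit[bit_index].on;
        }
        h.part[p] = digits[nibble];
    }
    h.part[HEX_PARTS] = '\0';
    return h;
}
```

Keeping bits individually addressable costs memory compared with a packed integer, but it lets encoding code set a field bit by bit without shift-and-mask arithmetic at every call site.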

🔹 Operation Table

Each operation stores its opcode, funct, keyword, and the legal addressing modes for both source and destination operands. This makes instruction validation data-driven — instead of scattering operation rules across long conditional chains, the legal structure of the instruction set lives in the operation model itself. Benefits: clearer validation logic, easier operand compatibility checks, better separation between language definition and parser logic.
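
A data-driven table of this kind might look like the sketch below. The mnemonics (`copy`, `incr`, `halt`), their opcode and funct numbers, and the addressing-mode names are placeholders invented for illustration; the real values live in the project's `operations.c` and the spec.

```c
#include <stddef.h>
#include <string.h>

/* One bit per addressing mode (mode names are assumed, not taken from the spec). */
enum {
    MODE_IMMEDIATE = 1 << 0,
    MODE_DIRECT    = 1 << 1,
    MODE_INDEX     = 1 << 2,
    MODE_REGISTER  = 1 << 3,
    MODE_ALL       = (1 << 4) - 1
};

typedef struct {
    const char *keyword;
    unsigned opcode;
    unsigned funct;
    unsigned src_modes;  /* bitmask of legal source modes; 0 means no source operand */
    unsigned dst_modes;  /* bitmask of legal destination modes; 0 means no destination */
} SketchOperation;

/* Placeholder rows: the whole legal structure of each operation is data. */
static const SketchOperation sketch_ops[] = {
    { "copy", 1, 0, MODE_ALL, MODE_DIRECT | MODE_INDEX | MODE_REGISTER },
    { "incr", 2, 1, 0,        MODE_DIRECT | MODE_INDEX | MODE_REGISTER },
    { "halt", 3, 0, 0,        0 }
};

/* Linear lookup by keyword; returns NULL for an unknown mnemonic. */
static const SketchOperation *find_operation(const char *keyword) {
    size_t i;
    for (i = 0; i < sizeof sketch_ops / sizeof sketch_ops[0]; i++) {
        if (strcmp(sketch_ops[i].keyword, keyword) == 0)
            return &sketch_ops[i];
    }
    return NULL;
}
```

With this layout, adding or changing an instruction means editing one table row rather than touching the parser.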

🔹 Addressing-Mode Capability Model

Allowed addressing modes are represented explicitly per operation and per operand position. The assembler must validate not only whether an operand is recognizable, but also whether its addressing mode is legal for a specific operation and position (source vs. destination). This produces stronger semantic validation with less duplicated logic.
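
The position-aware check can be reduced to a single mask test, as in this sketch. `AddrMode`, `OperandPos`, `ModeRules`, and `is_mode_legal` are hypothetical names, and the four mode names are assumed rather than taken from the spec.

```c
/* The four addressing modes, numbered so each maps to one bit. */
typedef enum { ADDR_IMMEDIATE, ADDR_DIRECT, ADDR_INDEX, ADDR_REGISTER } AddrMode;

typedef enum { POS_SOURCE, POS_DEST } OperandPos;

/* Per-operation rules: one bitmask of allowed modes for each operand position. */
typedef struct {
    unsigned src_allowed;
    unsigned dst_allowed;
} ModeRules;

#define MODE_BIT(m) (1u << (m))

/* Returns 1 if 'mode' is allowed at 'pos' for the given operation rules. */
static int is_mode_legal(const ModeRules *rules, OperandPos pos, AddrMode mode) {
    unsigned allowed = (pos == POS_SOURCE) ? rules->src_allowed : rules->dst_allowed;
    return (allowed & MODE_BIT(mode)) != 0;
}
```

For an operation that accepts an immediate value as a source but not as a destination, the same operand text is legal in one position and rejected in the other, which is exactly the case a purely syntactic check would miss.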

🔹 Hash-Table Infrastructure

The symbol table and macro table share the same underlying hash-table implementation with chained collision handling. Each item stores a name, a type-specific payload via a union, and a pointer for chaining. Both symbols and macros are primarily accessed by name, so hash-based lookup is a natural fit — fast lookup, reusable core logic, less repeated code, better abstraction in C.
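
A minimal sketch of such a shared table follows. The bucket count, the hash function, the payload fields, and all names (`SketchItem`, `SketchTable`, `table_find`, `table_insert`) are assumptions made for illustration; the project's `Item` struct may differ. Freeing the table is omitted for brevity.

```c
#include <stdlib.h>
#include <string.h>

#define BUCKETS 64

typedef enum { ITEM_SYMBOL, ITEM_MACRO } ItemKind;

/* One entry: a name, a type-specific payload, and a chain pointer. */
typedef struct SketchItem {
    char *name;
    ItemKind kind;
    union {
        struct { unsigned value; unsigned flags; } symbol;
        struct { long start; long end; } macro;
    } payload;
    struct SketchItem *next;  /* next item in the same bucket */
} SketchItem;

typedef struct { SketchItem *bucket[BUCKETS]; } SketchTable;

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash_name(const char *s) {
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Walks the chain of the name's bucket; returns NULL if the name is absent. */
static SketchItem *table_find(const SketchTable *t, const char *name) {
    SketchItem *it = t->bucket[hash_name(name)];
    while (it != NULL && strcmp(it->name, name) != 0)
        it = it->next;
    return it;
}

/* Inserts a new zero-initialized item at the head of its chain.
   Returns NULL if the name already exists or allocation fails. */
static SketchItem *table_insert(SketchTable *t, const char *name, ItemKind kind) {
    unsigned h;
    SketchItem *item;
    if (table_find(t, name) != NULL)
        return NULL;
    item = calloc(1, sizeof *item);
    if (item == NULL)
        return NULL;
    item->name = malloc(strlen(name) + 1);
    if (item->name == NULL) {
        free(item);
        return NULL;
    }
    strcpy(item->name, name);
    item->kind = kind;
    h = hash_name(name);
    item->next = t->bucket[h];
    t->bucket[h] = item;
    return item;
}
```

The same two functions serve both tables: a symbol table and a macro table are just two `SketchTable` instances whose items use different members of the union.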

🔹 Rich Symbol Model

Each symbol stores more than just a name and address. It includes:

Full address value
Base address and offset (for the target machine's base+offset encoding)
Attribute flags: code, data, entry, external

Storing base/offset directly in the symbol model avoids repeated decomposition in later stages and simplifies both encoding and entry/external export logic.
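
The decomposition could be done once, when the address is assigned, as sketched here. This README does not state the splitting rule, so the sketch assumes 16-word blocks: the base is the address rounded down to a multiple of 16 and the offset is the remainder. That rule, and the names `SketchSymbol`, `BLOCK_SIZE`, and `set_symbol_address`, are assumptions to check against docs/spec.pdf.

```c
#define BLOCK_SIZE 16u  /* assumed block size, not confirmed by this README */

typedef struct {
    const char *name;
    unsigned value;   /* full address */
    unsigned base;    /* assumed: address rounded down to a multiple of BLOCK_SIZE */
    unsigned offset;  /* assumed: distance from base to the address */
    unsigned flags;
} SketchSymbol;

/* Stores the address and its base/offset split together, so later stages
   read the parts directly instead of recomputing them. */
static void set_symbol_address(SketchSymbol *s, unsigned address) {
    s->value = address;
    s->offset = address % BLOCK_SIZE;
    s->base = address - s->offset;
}
```

Under this assumed rule, `base + offset` always equals `value`, which is a cheap invariant to assert in tests.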

🔹 Symbol Attribute Flags

Instead of assigning each symbol one rigid category, properties are stored as flags. A symbol may accumulate meaning across stages — it may be code or data, later marked as entry, or conflict with external/local rules. Some combinations are legal; some are contradictions. Flags make this flexible to represent and straightforward to validate.
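
A flag-based model makes the legality check a few mask tests. The rules encoded below (code and data are mutually exclusive; an external symbol cannot also be code, data, or entry) are assumptions chosen for illustration, and the names `ATTR_*`, `attrs_are_legal`, and `add_attr` are hypothetical.

```c
/* One bit per attribute, so a symbol can carry several at once. */
enum {
    ATTR_CODE     = 1 << 0,
    ATTR_DATA     = 1 << 1,
    ATTR_ENTRY    = 1 << 2,
    ATTR_EXTERNAL = 1 << 3
};

/* Returns 1 if the combination is consistent under the assumed rules. */
static int attrs_are_legal(unsigned attrs) {
    if ((attrs & ATTR_CODE) && (attrs & ATTR_DATA))
        return 0;  /* a symbol cannot be both code and data */
    if ((attrs & ATTR_EXTERNAL) && (attrs & (ATTR_CODE | ATTR_DATA | ATTR_ENTRY)))
        return 0;  /* external conflicts with any local role */
    return 1;
}

/* Adds a flag only if the result stays legal. Returns 1 on success and
   leaves *attrs unchanged on failure. */
static int add_attr(unsigned *attrs, unsigned flag) {
    unsigned combined = *attrs | flag;
    if (!attrs_are_legal(combined))
        return 0;
    *attrs = combined;
    return 1;
}
```

A data symbol later marked as entry simply gains a second bit, while an attempt to also mark it external is rejected and can be reported as a role conflict.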

🔹 Macro Metadata Model

Macros are represented with dedicated metadata describing their relevant source range. Macro expansion is treated as a structured preprocessing stage, not as loose text replacement — clearer logic, easier bookkeeping, better stage separation.
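
One way to read "source range" is a pair of offsets into the source text, so that expansion copies that span verbatim. The sketch below works on an in-memory buffer for simplicity; whether the project stores file offsets, line numbers, or something else is not stated here, and `SketchMacro` and `expand_macro` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* A macro is described by where its body sits in the source, not by a copy of it. */
typedef struct {
    const char *name;
    size_t start;  /* offset of the first body character */
    size_t end;    /* offset one past the last body character */
} SketchMacro;

/* Copies the macro body from 'source' into 'out' as a NUL-terminated string.
   Returns 1 on success, 0 if the range is invalid or 'out' is too small. */
static int expand_macro(const char *source, const SketchMacro *m,
                        char *out, size_t out_size) {
    size_t len;
    if (m->end < m->start)
        return 0;
    len = m->end - m->start;
    if (len + 1 > out_size)
        return 0;
    memcpy(out, source + m->start, len);
    out[len] = '\0';
    return 1;
}
```

Storing a range keeps the macro table small and guarantees that every expansion reproduces the body exactly as written.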

🔹 External Usage Tracking

A dedicated structure tracks every location where an external symbol is used. For each external symbol, the assembler stores the name and one or more usage locations, each with base and offset information. External symbols have a one-to-many relationship with usage positions, and the model reflects that directly — clean export of external references, explicit representation of all usage locations.
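
The one-to-many relationship maps naturally onto a list of usage records hanging off each external symbol. This is a sketch under assumptions: `ExtUse`, `ExtSymbol`, and `ext_record_use` are hypothetical names, and the caller decides what the stored base and offset values mean. Freeing the list is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* One place in the memory image where an external symbol is used. */
typedef struct ExtUse {
    unsigned base;
    unsigned offset;
    struct ExtUse *next;
} ExtUse;

/* An external symbol and every recorded usage, kept in order of appearance. */
typedef struct {
    const char *name;
    ExtUse *head;
    ExtUse *tail;
    size_t count;
} ExtSymbol;

/* Appends one usage record. Returns 1 on success, 0 on allocation failure. */
static int ext_record_use(ExtSymbol *ext, unsigned base, unsigned offset) {
    ExtUse *u = malloc(sizeof *u);
    if (u == NULL)
        return 0;
    u->base = base;
    u->offset = offset;
    u->next = NULL;
    if (ext->tail != NULL)
        ext->tail->next = u;
    else
        ext->head = u;
    ext->tail = u;
    ext->count++;
    return 1;
}
```

Appending at the tail keeps usages in source order, so writing the .ext file is a single walk over each list.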

⚠️ Error Handling & Diagnostics

A large portion of the project's complexity is in handling malformed input and subtle semantic errors. The diagnostic system provides:

File and line-number context on every message
Separate warning and error flows
Output to both stderr and errors.log
No premature termination — all errors in a stage are collected before stopping
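
These four properties fit in a small reporting helper, sketched below. The `Diag` struct, `diag_report`, and the message layout are assumptions for illustration rather than the interface of `errors.c`. The key point is that reporting only counts and prints; it never exits, so the caller can keep scanning the rest of the file.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical diagnostic context for the line currently being processed. */
typedef struct {
    const char *file;
    int line;
    int errors;
    int warnings;
    FILE *log;  /* e.g. an open errors.log; may be NULL */
} Diag;

/* Prints "file:line: error|warning: message" to stderr and to the log
   (if one is open), then bumps the matching counter. Never terminates. */
static void diag_report(Diag *d, int is_error, const char *fmt, ...) {
    const char *tag = is_error ? "error" : "warning";
    FILE *targets[2];
    int i;
    targets[0] = stderr;
    targets[1] = d->log;
    for (i = 0; i < 2; i++) {
        va_list args;
        if (targets[i] == NULL)
            continue;
        fprintf(targets[i], "%s:%d: %s: ", d->file, d->line, tag);
        va_start(args, fmt);  /* restart the argument list for each target */
        vfprintf(targets[i], fmt, args);
        va_end(args);
        fputc('\n', targets[i]);
    }
    if (is_error)
        d->errors++;
    else
        d->warnings++;
}
```

A stage would call `diag_report` for each problem it finds and check `errors` only at the end to decide whether the next stage may run.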

Error Families

| Family | Examples |
| --- | --- |
| Macro preprocessing | Unnamed macro, nested macros, closing without opening, duplicate names, reserved-word names |
| Labels & symbols | Empty/illegal label, overlong name, illegal characters, collision with reserved keywords or register names, duplicate definitions, local/external role conflicts |
| Line structure | Illegal leading characters, extra trailing tokens, missing spaces in critical positions, invalid leading token, excessive line length |
| Comma & argument formatting | Missing/extra commas, comma before first arg, comma after last arg, illegal comma placement |
| Operand & immediate values | Wrong operand count, missing source/destination operand, too many operands, non-numeric where expected, malformed signed-number syntax |
| Addressing-mode legality | Illegal source/destination operand type, incompatible with current operation, illegal indexed-addressing usage, undefined register reference |
| Symbol resolution | Undefined labels, entry-declared but never defined, illegal symbol-state combinations, role contradictions |
| System-level | Input file could not be opened, output file could not be created, memory allocation failure |

🏗️ File Structure

Main Source Files (`.c`)

| File | Role |
| --- | --- |
| `compiler.c` | Entry point; reads input files, calls `handleSingleFile()` for each |
| `preProccesor.c` | Macro expansion: reads `.as`, writes `.am` |
| `parse.c` | Central parsing logic, dispatches to first/second pass based on global state |
| `firstRun.c` | First pass: syntax/semantic validation, IC/DC tracking |
| `secondRun.c` | Second pass: machine-code encoding, label resolution, external tracking |
| `tables.c` | Symbol table, macro table, external operands; hash-table based |
| `memory.c` | IC/DC counters, memory-image construction, `.ob` file writing |
| `operations.c` | Operation definitions (opcode, `funct`, addressing modes) and lookups |
| `exportFiles.c` | Generates `.ob`, `.ent`, `.ext` output files |
| `errors.c` | Centralized error/warning logging to `stderr` and `errors.log` |
| `sharedStates.c` | Global assembler state (active pass, current file, line number), with getters/setters for controlled access |
| `helpers.c` | String utilities: trimming, cloning, binary-to-hex conversion |
| `utils.c` | Validation helpers: instructions, macros, registers, label names |

Header Files (`headers/`)

`data.h` acts as a master aggregator: a single include that pulls in standard libraries (`lib.h`), all typedefs/enums/flags (`variables.h`), and all function prototypes (`functions.h`). Individual headers under `headers/functions/` mirror each `.c` file.

Variables (`variables/`)

| File | Contents |
| --- | --- |
| `complex_typedef.h` | Core structs: `BinaryWord`, `HexWord`, `Operation`, `Item` |
| `constants.h` | `BINARY_WORD_SIZE`, `MEMORY_START`, register codes, etc. |
| `flags.h` | Enums for errors, warnings, assembler states, booleans |
| `variables.h` | Aggregates the above into one header |

Test Files (`__test_files/`)

errors/ — intentional error cases
mixed/ — valid/invalid combinations
valid/ — fully valid assembler programs
errors.log — error log from test runs

💡 Design Decisions

| Decision | Rationale |
| --- | --- |
| Two-pass architecture | Symbol resolution and final memory layout depend on information not available in a single read |
| Explicit 20-bit word modeling | Stays close to the target machine, giving clearer encoding and easier reasoning |
| Data-driven instruction definitions | Legality checks live in structured models, not scattered procedural conditions |
| Shared hash-table for symbols & macros | Both are name-keyed; sharing the infrastructure reduces duplication |
| Rich symbol metadata with base/offset | Supports base+offset encoding, symbol-role validation, and output generation without repeated decomposition |
| Symbol attribute flags | Symbols accumulate roles across stages; flags are more flexible than rigid single-category assignment |
| Dedicated external-usage tracking | External symbols require both declaration tracking and usage-location tracking, which is one-to-many by nature |
| Centralized diagnostics | Consistent, contextual reporting across all stages without premature termination |

🧠 Subtle Challenges

Some of the most interesting problems lived at the boundary between stages.

Syntax vs. semantics — Some failures are local to one line. Others only become detectable after enough global context has been collected across the full file.

Symbol consistency across stages — A symbol may be declared early, updated in the first pass, marked as entry later, and potentially conflict with external/local rules — all of which must be tracked coherently across both passes.

Addressing-mode validation — Recognizing an operand is not enough. It must also be legal for the specific operation and position (source vs. destination). An operand can be syntactically valid and still be semantically illegal for a specific instruction.

Code/data address relationship — Final data symbol addresses depend on the final instruction-image size. Both images must stay consistent throughout both passes, and data addresses can only be finalized after the code image is complete.

External reference tracking — The assembler must record not only that an external symbol exists, but every individual usage location in the final memory image, each with its own base/offset pair.

🛠️ Possible Improvements

If revisiting the project today:

Clearer phase separation — Separate lexical, syntactic, and semantic analysis even more explicitly into distinct layers, rather than interleaving them in the two-pass flow.

Less shared mutable state — Reduce reliance on global state in sharedStates.c and pass more context explicitly between stages.

More focused automated testing — Add dedicated tests for label validation, operand parsing, addressing legality, symbol resolution, and macro edge cases — not just full-file integration tests.

More formal diagnostics layering — Organize diagnostics more systematically by stage and severity, making it easier to triage failures during development.

📌 Summary

This project sits in a strong middle ground:

Small enough to understand fully
Large enough to require real architectural thinking
Low-level enough to feel like genuine systems programming
Structured enough to support meaningful discussion of staging, data structures, symbol handling, and edge cases

It was a strong exercise in building a complete translation system — not just isolated parsing functions, but a full pipeline from symbolic source code to encoded machine-level output.

🚀 Usage

gcc -o assembler compiler.c errors.c exportFiles.c firstRun.c secondRun.c \
    preProccesor.c tables.c helpers.c memory.c operations.c parse.c \
    sharedStates.c utils.c

./assembler file1.as file2.as ...

Output files (.ob, .ent, .ext) are generated in the same directory as the source files. Errors are written to both stderr and errors.log.

⚖️ License

MIT License — see LICENSE for details.
