Assembler, Part 1

April 15, 2014

This exercise is trickier than it looks, since we must write a two-pass assembler to handle labels that appear after their first reference; the first pass reads the program and saves the line numbers referred to by labels, and the second pass writes a memory image containing the assembled program.

We keep the line numbers referred to by labels in a global a-list, and define an a-list of opcodes:

(define labels (list))

(define opcodes '(("const" . 0) ("get" . 1) ("put" . 2)
  ("ld" . 3) ("st" . 4) ("add" . 5) ("sub" . 6) ("jpos" . 7)
  ("jz" . 8) ("j" . 9) ("halt" . 10)))

The first pass reads the program, writes an intermediate form of the program, and collects the a-list of label/line pairs:

(define (asm1 file-name) ; first-pass
  (set! labels (list))
  (with-input-from-file file-name (lambda ()
    (let loop ((k 0) (line (read-line)) (lines (list)))
      (if (eof-object? line) (reverse lines)
        (if (or (string=? line "") (char=? (string-ref line 0) #\#))
           (loop k (read-line) lines)
            (call-with-values
              (lambda () (split (string->list line)))
              (lambda (lbl opc obj cmt)
                (when (not (string=? lbl ""))
                  (set! labels (cons (cons lbl k) labels)))
                (loop (+ k 1) (read-line)
                      (cons (vector k lbl opc obj cmt) lines))))))))))

The output of the first pass is a list of five-slot vectors: line number, label, opcode, object, comment. Everything at this point remains in symbolic form; we keep the comment because we intend to use it in a future exercise. The split function extracts up to four fields from each line:

(define (split line)
  (define (skip-blanks)
    (let loop ()
      (if (or (null? line) (not (char-whitespace? (car line)))) #f
        (begin (set! line (cdr line)) (loop)))))
  (define (get-string)
    (let loop ((str (list)))
      (if (or (null? line) (char-whitespace? (car line)))
          (list->string (reverse str))
          (let ((c (car line)))
            (set! line (cdr line))
            (loop (cons c str))))))
  (let* ((lbl (get-string)) (_ (skip-blanks))
         (opc (get-string)) (_ (skip-blanks))
         (obj (get-string)) (_ (skip-blanks)))
    (values lbl opc obj (list->string line))))

The split function runs through the input line, extracting fields as it sees them and discarding the parts of the line that it has already processed. Our function isn’t perfect — ill-formed input can make it crash — but our exercise is about assembling, not parsing, and our function does work properly on well-formed input. The list output by the first pass is fed to the second pass, which writes a memory image:

(define (asm2 lines) ; second pass
  (let ((mem (make-vector 1000 0)))
    (do ((lines lines (cdr lines))) ((null? lines) mem)
      (let ((num (vector-ref (car lines) 0))
            (opc (vector-ref (car lines) 2))
            (obj (vector-ref (car lines) 3)))
        (vector-set! mem num (+ (* (cdr (assoc opc opcodes)) 1000)
          (if (assoc obj labels) (cdr (assoc obj labels))
            (if (not (string=? obj "")) (string->number obj)
              0))))))))

The last line (which is spread over four physical lines) that performs the vector-set! operation is rather intense. One of three things is stored in the memory location. If obj is a label, memory is set with the numeric value of the opcode in the two high-order digits and the label location in the three low-order digits. If obj is a number, it is necessarily the object of a const opcode, and is written literally to memory; the numeric value of the const opcode is 0, so the two high-order digits are not special, and are just part of the constant. Otherwise, for those opcodes that have no object, the numeric value of the opcode is written to the two high-order digits and the three low-order digits are set to 0.

If our sample program is stored in file program.asm, it is assembled like this:

> (asm2 (asm1 "program.asm"))
#1000(3010 4011 1000 8007 5011 4011 9002 3011 2000 10000 0)

Note that Scheme elides the trailing zeros from the vector.

You can run the program at http://programmingpraxis.codepad.org/3KFpWaoQ. We’ll write a simulator for our hypothetical computer in the next exercise.

Advertisement

Pages: 1 2

4 Responses to “Assembler, Part 1”

  1. Adolfo said

    Solution in Racket (https://gist.github.com/adolfopa/11063222#file-asm-1-notests-rkt)

    #lang racket/base
    
    (require racket/dict
             racket/match
             racket/sequence
             racket/set
             racket/string)
    
    (define instructions
      (hash "const" 00
            "get"   01
            "put"   02
            "ld"    03
            "st"    04
            "add"   05
            "sub"   06
            "jpos"  07
            "jz"    08
            "j"     09
            "halt"  10))
    
    (define instr-regexp
      (string-join (map regexp-quote
                        (sort (hash-keys instructions) > #:key string-length))
                   "|"
                   #:before-first "("
                   #:after-last ")"))
    
    (define stmt-regexp
      (pregexp
       (string-append "[ \t]*"                 ; leading whitespace
                      "(?:"                    ; (1)
                      "(?:(\\w+)[ \t]+)??"     ; optional label
                      instr-regexp             ; instruction mnemonic
                      "(?:[ \t]+(\\d+|\\w+))?" ; optional argument
                      ")?"                     ; (1)
                      "(?:#.*)?")))            ; optional comment
    
    (struct stmt (label op arg) #:transparent)
    
    (define (read-stmt line)
      (match (regexp-match stmt-regexp line)
        [(list _ fst snd thd)
         (when (and snd (not (dict-ref instructions snd #f)))
           (error "syntax error: unknown mnemonic" line))
         (and (or fst snd thd)
              (stmt fst snd thd))]
        [#f
         (error "syntax error: ill formed statement" line)]))
    
    (define (read-asm in)
      (for*/list ([line (in-lines in)]
                  [st (in-value (read-stmt line))]
                  #:when st)
        st))
    
    (define (make-symbol-table xs)
      (for/hash ([x xs]
                 [n (in-naturals)]
                 #:when (stmt-label x))
        (values (stmt-label x) n)))
    
    (define (resolve arg labels)
      (cond [(not arg) 0]
            [(regexp-match #px"^\\d+$" arg)
             (string->number arg)]
            [else
             (dict-ref labels arg)]))
    
    (define (encode op arg labels)
      (+ (* (dict-ref instructions op) 1000)
         (if arg (resolve arg labels) 0)))
    
    (define (assemble prog [mem (make-vector 1000 0)])
      (define stmts (read-asm prog))
      (define labels
        (make-symbol-table stmts))
      (for ([x stmts]
            [i (in-naturals)])
        (match x
          [(stmt _ "const" arg)
           (vector-set! mem i (resolve arg labels))]
          [(stmt _ op arg)
           (vector-set! mem i (encode op arg labels))]))
      mem)
    
    
  2. Mike said

    Both one and two pass assemblers in Python.

    import collections
    import itertools as it
    import re
    
    mnemonic = "const get put ld st add sub jpos jz j halt".split()
    opcode = {m:c for c,m in enumerate(mnemonic)}
    
    regex = re.compile(r"""(?x)            # verbose mode regex
        \A                                 # beginning of line
        (?P<label>[a-zA-Z_]\w*)?           # optional label--no leading whitespace
        (?:\s+(?P<opcode>[a-zA-Z]+)        # optional opcode after 1 or more ws
              (?:\s+(?P<operand>\w+))?)?   #   followed by an optional operand
        \s*                                # trailing ws
        (?:\#.*)?                          # optional comment
        \Z                                 # end of line
        """)
    
    EBadArg = "Error: '{label}' is referenced but not defined."
    EBadLabel = "Error: Duplicate label '{label}' at line {line_no}."
    EBadOp = "Error: Unrecognized instruction mnuemonic: '{op}'."
    ETooLong = "Error: Program is too long."
    WUnusedLabel = "Warning: Unused label: '{label}'."
    
    def perror(fmt, *args, **kwds):
        print(fmt.format(*args, **kwds))
        
    def one_pass(source, listing=False):
        class Symbol:
            def __init__(self):
                self.addr = None
                self.xref = []
    
        symtab = collections.defaultdict(Symbol)
        memory = [0] * 1000
    
        warnings = 0
        errors = 0
        
        pc = 0
        
        for line_no, line in enumerate(src):
            mo = regex.match(line)
                  
            label, op, arg = mo.groups()
            
            if label:
                if symtab[label].addr is not None:
                    perror(EBadLabel, label=label, line_no=line_no)
                    errors += 1
    
                # save label address and fix earlier references
                symtab[label].addr = pc
                for addr in symtab[label].xref:
                    memory[addr] += pc
                    
            if op:
                instruction = opcode.get(op, None)
                
                if instruction is None:
                    perror(EBadOp, op=op)
                    errors += 1
    
                if arg:
                    if arg.isdigit():
                        val = int(arg)
    
                    else:
                        symtab[arg].xref.append(pc)
                        val = symtab[arg].addr or 0
    
                else:
                    val = 0
                        
                if pc > 999 and not toolong:
                    perror(ETooLong)
                    errors += 1
                    toolong = True
    
                if not errors:
                    memory[pc] = instruction * 1000 + val
    
                pc += 1
    
        for label, info in symtab.items():
            if info.addr is None:
                perror(EBadArg, label=label)
                errors += 1
                
            if info.xref == []:
                perror(WUnusedLabel, label=label)
                warnings += 1
    
        if errors or warnings:
            print("There were {} errors and {} warnings.".format(errors, warnings))
            return None
        
        else:
            return memory
    
    
    def two_pass(source, listing=False):
        parsed  = []
        address = {}
        memory  = [0] * 1000
        
        # first pass
        pc = 0
        for line_no, line in enumerate(source):
            label, op, arg = regex.match(line).groups()
            parsed.append((label, op, arg, line_no, line))
            
            if label and label not in address:
                address[label] = pc
                    
            if op:
                pc += 1
    
        # second pass
        pc = 0
        toolong = False
        errors = []
        nerrors = 0
    
        if listing:
            print("pc   code   line   source\n---  -----  -----  ----------------")
        
        for label, op, arg, line_no, line in parsed:
            if label and address[label] != pc:
                errors.append((EBadLabel, {'label':label, 'line_no':line_no}))
                
            if op:
                instruction = opcode.get(op, None)
                
                if arg:
                    val = int(arg) if arg.isdigit() else address.get(arg, None)
                else:
                    val = 0
                        
                if instruction is None:
                    errors.append((EBadOp, {'op':op}))
    
                if val is None:
                    errors.append((EBadArg, {'label':arg}))
                    
                if pc > 999 and not toolong:
                    errors.append((ETooLong, {}))
                    toolong = True
    
                if not errors:
                    memory[pc] = instruction * 1000 + val
    
                if listing:
                    fmt = "{:03}  {:05}  {:05}  {}"
                    print(fmt.format(pc, memory[pc], line_no, line))
    
                if errors:
                    nerrors += len(errors)
                    
                    if not listing:
                        fmt = "{:05}  {}"
                        print(fmt.format(line_no, line))
                        
                    for error, kws in errors:
                        perror(error, **kws)
                    errors = []
                    print()
                    
                pc += 1
                
            else:
                if listing:
                    fmt = "            {:05}  {}"
                    print(fmt.format(line_no, line))
                        
        return None if nerrors > 0 else memory
    
  3. (def C
    {:M (make-array String 1000)
    :acc 0
    :opc (apply hash-map
    (mapcat list
    [“const” “get” “put” “ld” “st” “add” “sub” “jpos” “jz” “j” “halt”]
    (map #(format “%02d” %) (range))))})

    (defn reader
    (let [re #”^(\w\S*)?\s+(\w+)\s*(\w*)”
    whitespace #”^(#.*)?$”
    code-lines (remove #(re-matches whitespace %)
    (clojure.string/split code #”\n”))]
    (map #(apply vector (rest (re-find re %))) code-lines)))

    (defn label-def? [line] (first line))

    (defn get-labels
    (reduce
    (fn [r [[label opc arg] pos]]
    (assoc r label pos))
    {“” 0}
    (filter #(label-def? (first %))
    (map list code (range)))))

    (defn translate [opcodes source-code]
    (let

    (reduce
    (fn [r [_ opc obj]]
    (conj r
    (if (= opc “const”) obj
    (str (opcodes opc)
    (format “%03d” (labels obj))))))
    []
    code)))

    (defn load-asm [computer code]
    (doseq [[pos word] (map list (range) (translate (:opc C) code))]
    (aset (:M computer) pos word))
    computer)

    ; invoke with (load-asm C )

  4. (def C
      {:M (make-array String 1000)
       :acc 0
       :opc (apply hash-map 
                   (mapcat list 
                           ["const" "get" "put" "ld" "st" "add" "sub" "jpos" "jz" "j" "halt"]
                           (map #(format "%02d" %) (range))))})
    (defn reader [code]
      (let [re #"^(\w\S*)?\s+(\w+)\s*(\w*)"
            whitespace #"^(#.*)?$"
            code-lines (remove #(re-matches whitespace %)
                               (clojure.string/split code #"\n"))]
        (map #(apply vector (rest (re-find re %))) code-lines)))
    
    (defn label-def? [line] (first line))
    
    (defn get-labels [code]
      (reduce
        (fn [r [[label opc arg] pos]]
          (assoc r label pos))
        {"" 0}
        (filter #(label-def? (first %))
                (map list code (range)))))
    
    (defn translate [opcodes source-code]
      (let
        [code (reader source-code)
         labels (get-labels code)]
        (reduce
          (fn [r [_ opc obj]]
            (conj r 
                  (if (= opc "const") obj
                    (str (opcodes opc) 
                         (format "%03d" (labels obj))))))
          []
          code)))
    
    (defn load-asm [computer code]
      (doseq [[pos word] (map list (range) (translate (:opc C) code))]
        (aset (:M computer) pos word))
      computer)
    
    ; invoke with (load-asm C <code>)
    

    PS: sorry for the previous comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: