Xref
April 22, 2011
In the old days, when programs were written on punch cards or paper tape and editors didn’t provide regular-expression searches, most compilers provided a cross-reference utility that showed each use of each identifier in the program. You don’t often see cross-referencers any more, and that’s too bad, because a cross-referencer can provide good clues about the structure of a program; looking at a cross-reference listing can be a good way to start examining an unfamiliar program. Here’s some sample output from a Scheme cross-referencer, with the identifier as the first word on each line followed by a list of line numbers where the identifier appears:
0 28
1 18
< 27
= 30
?!.+-*/<=>:$%^&_~@ 5
a 24 25 26 27
add1 20
and 26 30
b 24 25 26 27
c 3 4 5 8 9 10 13
caar 30 32 36 37
car 25 26
cdar 30 33 34 36 37
cdr 27 31 34 37
char-alphabetic? 4
char-in-ident? 3 10
char-numeric? 4
char=? 13
cond 9 19 29
cons 11 21
cs 8 9 11 12 14
define 3 7 16 23 24
display 33 36
else 14 21 35
eof-object? 9 19
file 16 17
get-ident 7 18 20 21
identifiers 3
if 9
lambda 17
let 8 11 18 28
line 18 20 21
list->string 9 12
loop 8 11 14 18 20 21 28 31 34 37
lt? 24 28
member 5
newline 13 29 35
not 35
null? 9 29
or 4 25
pair? 12
peek-char 8 11 14
prev-line 28 30 31
prev-word 28 30 31 32 34 35
read-char 11 13 14
reverse 9 12
scheme 3
sort 28
string->list 5
string<? 25
string=? 20 26 30 32 35
w 18 19 20 21
when 35
with-input-from-file 17
ws 18 19 20 21 23 28 29 30 31 32 33 34 36 37
x 11
xref 1 16
xref-out 19 23
A quick look suggests that this program reads a text file (with-input-from-file, read-char, eof-object?) and does some kind of processing on its characters (char-alphabetic?, char-numeric?). There is some list-handling (car, cdr, string->list), loop is used conventionally, and ws seems to be some kind of key to understanding what the program does. And there is an unholy mess on line 5.
Your task is to write a cross-referencer for your chosen language. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Solution in C++: github
Outputs any alphanumeric word which does not start with digit and which is neither part of comment nor literal string
You can see my code at http://codepad.org/XXLZhua1
My Haskell solution (see http://bonsaicode.wordpress.com/2011/04/22/programming-praxis-xref/ for a version with comments):
import Data.List import qualified Data.List.Key as K import Language.Haskell.Lexer main :: IO () main = do file <- readFile "test.hs" mapM_ putStrLn . map ((\((n:_), ls) -> unwords $ n : nub ls) . unzip) . K.group fst $ K.sort fst [(s, show $ line p) | (tok, (p,s)) <- lexerPass0 file, elem tok [Varid, Conid, Varsym, Consym]]For lack of a good knowledge of Python lexical parsers:
#!/usr/bin/env python from collections import defaultdict from re import compile def xref(file_name): exclude = compile(r"[\\\"'\(\)\[\]:?.,]") # unwanted characters comment = compile(r"\#.*") # comments with open(file_name) as f: lines = f.readlines() d = defaultdict(list) for line_num in xrange(len(lines)): line = exclude.sub(' ', lines[line_num]) line = comment.sub('', line).strip() for word in line.split(): d[word].append(line_num + 1) return d def output(x_d): for entry in sorted(str(k) + "\t" + ", ".join(str(i) for i in v) for (k, v) in x_d.iteritems()): print entry return None if __name__ == "__main__": import sys output(xref(sys.argv[1] if sys.argv[1:] else "xref.py"))@pp : I don’t think 0 can be considered identifier in scheme.
Your first line output 0 as identifier.
Veer: I don’t know if 0 is officially a Scheme identifier, or not, but it doesn’t matter. The cross-referencer is more useful if it includes 0 in the list, and that’s good enough for me.
Here’s a quick one in ruby …
xref = Hash.new linecount = 0 IO.foreach(ARGV[0]) do |line| linecount += 1 line.sub!(/#.*/, "") tokens = line.split(/[^A-Za-z0-9]/) tokens.each do |token| (xref[token] ||= []) << linecount if token != "" end end xref.each_pair do |t, l| puts "#{t}: #{l}" endSomeone mentioned taking out the literal strings, but my feeling was that they should be left in as part of the cross reference. For splitting, I just split on anything that wasn’t a letter, number, or underscore. I think that should get most everything correct. There are probably some edge cases though (floating point numbers come to mind).
If anyone’s wondering, you can’t do Hash.new([]) to get an empty array as the default value of the new hash element. That will create a single array that’s shared by each hash element. Hence the funny looking line 8 which a) creates a new empty array if the hash element is empty and then b) adds the token to it (or the existing array).
Here is one in Factor. A bit more complicated since any non-space printing character can be used in identifiers. Therefore it is necessary to filter out anything that is not an identifier and whatever is left must be an identifier. Lots of edge cases here, but I chose to filter out string constants, integers, end of line comments, stack comments and words ending with “:” (e.g. CONSTANT:, :, USE:, etc.), and sequence constructors ({ }, H{ }, etc.)
! very simple xref USING: kernel math math.order regexp ascii io.encodings.ascii formatting sorting io io.files assocs sequences strings ; IN: xref CONSTANT: xref-dict H{ } CONSTANT: content-filter R/ \s!.*$|^!.*$||\b[0-9]+\b|\b"[^"]*"\b|\b[HV]?{\b|[][};]|\b.*:\b|\(.*--.*\)\b/ : filter-content ( str -- str ) content-filter "" re-replace ; : addref ( symbol line -- ) over xref-dict at [ rot drop ] [ V{ } clone [ rot xref-dict set-at ] keep ] if* swap suffix! drop ; : gobble-input ( -- ) 1 [ filter-content R/ \s+/ re-split [ >string dup length 0 > [ over addref ] [ drop ] if ] each 1 + ] each-line drop ; : report ( -- ) xref-dict keys [ <=> ] sort [ dup xref-dict at swap "%s " printf [ "%d " printf ] each nl ] each ; : xref ( file -- ) xref-dict clear-assoc ascii <file-reader> [ gobble-input ] with-input-stream report ;( scratchpad ) “xref.factor” xref
+ 26
30
36
> 24
>string 24
R/ 23
\b|\\b/ 9
\s+/ 23
addref 14 25
ascii 3 36
assocs 4
at 15 31
clear-assoc 35
clone 17
content-filter 12
drop 16 18 25 27
dup 24 31
each 25 32 32
each-line 27
filter-content 11 23
formatting 3
gobble-input 20 36
if 25
if* 17
io 4
io.encodings.ascii 3
io.files 4
keep 17
kernel 3
keys 30
length 24
math 3
math.order 3
nl 32
over 15 25
printf 32 32
re-replace 12
re-split 23
regexp 3
report 29 37
rot 16 17
sequences 4
set-at 17
sort 30
sorting 3
strings 4
suffix! 18
swap 18 32
with-input-stream 36
xref 5 34
xref-dict 7 15 17 30 31 35
Ooh,
The output above looks wrong, but is correct, problem is that first three lines was interpreted as HTML. Should be:
+ 26
<=> 30
<file-reader> 36
…
My try in REXX, this script analyzes itself
sym = '' asym. = '' nbr_zeilen = sourceline() znr = 0 do znr = 1 to nbr_zeilen zeile = sourceline(znr) zeile = translate(zeile,' ','()') do while length(zeile) > 0 parse value zeile with first zeile if pos(first, sym) == 0 then sym = sym first asym.first = asym.first znr end end sym = bubble(sym) do while length(sym) > 0 parse value sym with first sym say first asym.first end exit bubble: procedure parse arg text t. = '' i = 0 do while length(text) > 0 parse value text with first text i = i + 1 t.i = first end chg = 1 do while chg == 1 chg = 0 do j = 1 to i-1 do k = j+1 to i if t.j > t.k then do tmp = t.j t.j = t.k t.k = tmp chg = 1 end end end end text = '' do j = 1 to i text = text t.j end return strip(text)