Xref
April 22, 2011
In the old days, when programs were written on punch cards or paper tape and editors didn’t provide regular-expression searches, most compilers provided a cross-reference utility that showed each use of each identifier in the program. You don’t often see cross-referencers any more, and that’s too bad, because a cross-referencer can provide good clues about the structure of a program; looking at a cross-reference listing can be a good way to start examining an unfamiliar program. Here’s some sample output from a Scheme cross-referencer, with the identifier as the first word on each line followed by a list of line numbers where the identifier appears:
0 28
1 18
< 27
= 30
?!.+-*/<=>:$%^&_~@ 5
a 24 25 26 27
add1 20
and 26 30
b 24 25 26 27
c 3 4 5 8 9 10 13
caar 30 32 36 37
car 25 26
cdar 30 33 34 36 37
cdr 27 31 34 37
char-alphabetic? 4
char-in-ident? 3 10
char-numeric? 4
char=? 13
cond 9 19 29
cons 11 21
cs 8 9 11 12 14
define 3 7 16 23 24
display 33 36
else 14 21 35
eof-object? 9 19
file 16 17
get-ident 7 18 20 21
identifiers 3
if 9
lambda 17
let 8 11 18 28
line 18 20 21
list->string 9 12
loop 8 11 14 18 20 21 28 31 34 37
lt? 24 28
member 5
newline 13 29 35
not 35
null? 9 29
or 4 25
pair? 12
peek-char 8 11 14
prev-line 28 30 31
prev-word 28 30 31 32 34 35
read-char 11 13 14
reverse 9 12
scheme 3
sort 28
string->list 5
string<? 25
string=? 20 26 30 32 35
w 18 19 20 21
when 35
with-input-from-file 17
ws 18 19 20 21 23 28 29 30 31 32 33 34 36 37
x 11
xref 1 16
xref-out 19 23
A quick look suggests that this program reads a text file (with-input-from-file
, read-char
, eof-object?
) and does some kind of processing on its characters (char-alphabetic?
, char-numeric?
). There is some list-handling (car
, cdr
, string->list
), loop
is used conventionally, and ws
seems to be some kind of key to understanding what the program does. And there is an unholy mess on line 5.
Your task is to write a cross-referencer for your chosen language. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Solution in C++: github
Outputs any alphanumeric word which does not start with digit and which is neither part of comment nor literal string
You can see my code at http://codepad.org/XXLZhua1
My Haskell solution (see http://bonsaicode.wordpress.com/2011/04/22/programming-praxis-xref/ for a version with comments):
For lack of a good knowledge of Python lexical parsers:
@pp : I don’t think 0 can be considered identifier in scheme.
Your first line output 0 as identifier.
Veer: I don’t know if 0 is officially a Scheme identifier, or not, but it doesn’t matter. The cross-referencer is more useful if it includes 0 in the list, and that’s good enough for me.
Here’s a quick one in ruby …
Someone mentioned taking out the literal strings, but my feeling was that they should be left in as part of the cross reference. For splitting, I just split on anything that wasn’t a letter, number, or underscore. I think that should get most everything correct. There are probably some edge cases though (floating point numbers come to mind).
If anyone’s wondering, you can’t do Hash.new([]) to get an empty array as the default value of the new hash element. That will create a single array that’s shared by each hash element. Hence the funny looking line 8 which a) creates a new empty array if the hash element is empty and then b) adds the token to it (or the existing array).
Here is one in Factor. A bit more complicated since any non-space printing character can be used in identifiers. Therefore it is necessary to filter out anything that is not an identifier and whatever is left must be an identifier. Lots of edge cases here, but I chose to filter out string constants, integers, end of line comments, stack comments and words ending with “:” (e.g. CONSTANT:, :, USE:, etc.), and sequence constructors ({ }, H{ }, etc.)
( scratchpad ) “xref.factor” xref
+ 26
30
36
> 24
>string 24
R/ 23
\b|\\b/ 9
\s+/ 23
addref 14 25
ascii 3 36
assocs 4
at 15 31
clear-assoc 35
clone 17
content-filter 12
drop 16 18 25 27
dup 24 31
each 25 32 32
each-line 27
filter-content 11 23
formatting 3
gobble-input 20 36
if 25
if* 17
io 4
io.encodings.ascii 3
io.files 4
keep 17
kernel 3
keys 30
length 24
math 3
math.order 3
nl 32
over 15 25
printf 32 32
re-replace 12
re-split 23
regexp 3
report 29 37
rot 16 17
sequences 4
set-at 17
sort 30
sorting 3
strings 4
suffix! 18
swap 18 32
with-input-stream 36
xref 5 34
xref-dict 7 15 17 30 31 35
Ooh,
The output above looks wrong, but is correct, problem is that first three lines was interpreted as HTML. Should be:
+ 26
<=> 30
<file-reader> 36
…
My try in REXX, this script analyzes itself