Double Double Words

October 13, 2015

The biggest problem with today’s task is working out the specification. In some cases, it would be very, very bad to notify the user of doubled words, as in this sentence, where the words are obviously intentionally doubled. One idea is to be strict about the doubling by including attached punctuation in the comparison, but that can backfire, so we decide to simply report all doubled words and let the user sort out any that are intentional.
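To make the trade-off concrete, here is a small sketch in Python (not the Scheme used for the solution below; the helper name `adjacent_doubles` is made up for illustration). It compares neighbouring tokens either with their punctuation attached or with everything but letters stripped. One way the strict rule might backfire is that it skips the intentional "very, very" but also misses any accidental double whose second word ends a clause.

```python
def adjacent_doubles(text, strict):
    """Return each word that equals the word immediately before it.

    strict=True keeps attached punctuation when comparing tokens;
    strict=False drops every non-alphabetic character first.
    """
    tokens = text.split()
    if not strict:
        tokens = ["".join(c for c in t if c.isalpha()) for t in tokens]
        tokens = [t for t in tokens if t]  # discard tokens that had no letters
    return [b for a, b in zip(tokens, tokens[1:]) if a.lower() == b.lower()]

text = "a very, very good good example"
print(adjacent_doubles(text, strict=True))   # ['good']
print(adjacent_doubles(text, strict=False))  # ['very', 'good']

# The strict rule misses an accidental double at the end of a sentence.
print(adjacent_doubles("that is what it is is.", strict=True))   # []
print(adjacent_doubles("that is what it is is.", strict=False))  # ['is']
```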

The other problem with today’s task is checking that a word that ends one line is not doubled at the beginning of the next line. The simplest solution here is to ignore line breaks and look only at words. But for reporting, it would be nice to report the line number where the doubling occurs. So we’ll keep track of line numbers as we read in the text.

Our solution uses global variables to track the current line and line number and a function that gets words from the input in order, updating the current line number as it goes. Here’s the function that gets words:

(define (read-word)
  (if (pair? line)
      (let ((word (car line)))
        (set! line (cdr line))
        word)
      (let ((input (read-line)))
        (if (eof-object? input)
            input
            (begin
              (set! line (string-split #\space (cleanup input)))
              (set! number (+ number 1))
              (read-word))))))
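For comparison, the same idea of handing out words one at a time while counting lines can be sketched in Python with a generator, which keeps the state that read-word stores in global variables. This is only an illustration; the name `words_with_line_numbers` is hypothetical and punctuation is left untouched here.

```python
import io

def words_with_line_numbers(stream):
    """Yield (line_number, word) pairs in reading order.

    Line numbers start at 1; blank lines are counted but yield no words.
    """
    for number, line in enumerate(stream, start=1):
        for word in line.split():
            yield number, word

sample = io.StringIO("one two\n\nthree\n")
print(list(words_with_line_numbers(sample)))
# [(1, 'one'), (1, 'two'), (3, 'three')]
```

Because the generator never resets between lines, a caller that remembers the previous word sees the last word of one line followed directly by the first word of the next.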

It calls an auxiliary function to remove punctuation:

(define (cleanup str)
  (let loop ((cs (string->list str)) (zs (list)))
    (cond ((null? cs) (list->string (reverse zs)))
          ((char-alphabetic? (car cs))
            (loop (cdr cs) (cons (car cs) zs)))
          ((char-whitespace? (car cs))
            (loop (cdr cs) (cons #\space zs)))
          (else (loop (cdr cs) zs)))))
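A rough Python equivalent of cleanup, offered as a sketch rather than a translation, shows what the three branches do to a sample string. Note one consequence of the rule: because every non-letter other than whitespace is simply dropped, a hyphenated word collapses into a single word and digits disappear.

```python
def cleanup(s):
    """Keep letters, turn each whitespace character into a space, drop the rest."""
    out = []
    for c in s:
        if c.isalpha():
            out.append(c)
        elif c.isspace():
            out.append(" ")
        # any other character (punctuation, digits) is discarded
    return "".join(out)

print(cleanup("very,\tvery good-natured."))
# very very goodnatured
```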

The global variables are initially #f and are initialized in the driver function:

(define line #f)
(define number #f)

And here’s the driver function that reads words, reports doubles, and controls processing; notice that we ignore case in the string comparison:

(define (double file-name)
  (with-input-from-file file-name
    (lambda ()
      (set! line "") (set! number 0)
      (let loop ((prev "") (word (read-word)))
        (when (not (eof-object? word))
          (when (string-ci=? prev word)
            (display number) (display " ")
            (display word) (newline))
          (loop word (read-word)))))))

Called on a file that contains the three lines

    This is a very, very good
    good example of doubled
    doubled words.

the function reports three instances of doubled words:

> (double "sample")
1 very
2 good
3 doubled

Note that in the case where the word pair spans two lines, it is the second line number that is reported.
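The run above can be reproduced with a short Python sketch of the whole procedure. It is not the Scheme program; the function name `find_doubles` is an assumption made for this example, and it reads from any iterable of lines instead of a named file. Words are compared with case ignored, and the line number recorded is that of the second word of each pair.

```python
import io

def find_doubles(stream):
    """Return (line_number, word) for every word equal to the word before it."""
    found = []
    prev = None
    for number, line in enumerate(stream, start=1):
        # Keep letters and whitespace only, so "very," compares equal to "very".
        kept = "".join(c for c in line if c.isalpha() or c.isspace())
        for word in kept.split():
            if prev is not None and word.casefold() == prev.casefold():
                found.append((number, word))
            prev = word  # carried across line breaks on purpose
    return found

sample = io.StringIO(
    "This is a very, very good\n"
    "good example of doubled\n"
    "doubled words.\n"
)
for number, word in find_doubles(sample):
    print(number, word)
# 1 very
# 2 good
# 3 doubled
```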

We used read-line and string-split from the Standard Prelude. You can run the program at http://ideone.com/idcTOo, where it has been modified slightly to read from standard input instead of a filename.


6 Responses to “Double Double Words”

  1. Perfect example where perl rocks!

    while(<>){$l++;foreach(split/\W+/,lc$_){printf "%4d %s\n",$l,$_ if$x eq$_;$x=$_;}}
    
  2. Rutger said

    Python

    from collections import Counter
    import re
    
    text = """   Assassin beef noodles savant human chrome order-flow 
    lights neural physical render-farm post-stimulate fluidity skyscraper 
    8-bit. Free-market physical vinyl towards nano-Tokyo sign render-farm. 
    Decay digital katana disposable apophenia modem dissident narrative. 
    Soul-delay euro-pop vinyl pre-ablative market bridge sunglasses dead 
    youtube hotdog rebar claymore mine. """
    
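    # Counts every word in the text, so this reports words repeated anywhere, not only adjacent pairs.
    c = Counter(split for line in text.splitlines() for split in re.sub(r"[^\w]", " ", line).split())
    print([word for word in c if c[word] > 1])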
    
  3. mcmillhj said

    Alternate Perl solution:

    use strict; 
    use warnings; 
    
    my $text = do {
       local $/ = undef;
       <>;
    };
    
    my $line_no = 1;
    while ( my ($w1,$sep,$w2) = $text =~ m/(\w+)(\W+)(\w+)/ ) {
       $text =~ s/$w1\W+//;
       $line_no++ if $sep eq "\n";
       printf "%04d %s\n", $line_no, $w1 if $w1 eq $w2;
    }
    
  4. Mike said

    Here’s my Python version:

    Uses fileinput from the standard library to handle opening and closing files provided on the command line. It also keeps track of the name of the file and the line number. Uses regexes to find the words in a line.

    If a repeated word is found, the program prints the word, the line number(s), and a portion of the line(s) surrounding the repeated word for context.

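    import fileinput
    import re

    # The comment as posted omits this definition; a plain word pattern is assumed here.
    pat = re.compile(r"\w+")

    with fileinput.input() as f: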
        for line in f:
            line = line.rstrip()
    
            if fileinput.isfirstline():
                print(fileinput.filename())
                prevline = ''
                prevword = None
    
            firstword = True
            for match in pat.finditer(line):
                word = match.group().lower()
                if word == prevword:
                    b, e = match.span()
                    lineno = fileinput.filelineno()
                    fmt = "\t'{}' at {}: ...{}..."
                    if firstword:
                        context = prevline[-15:] + ' ' + line[:e+10]
                        where = "lines {}-{}".format(lineno-1, lineno)
                    else:
                        # Clamp the start so a match near the line start does not yield a negative index.
                        context = line[max(b-15, 0):e+10]
                        where = "line {}".format(lineno)
    
                    print(fmt.format(word, where, context))
    
                prevword = word
                firstword = False
    
            prevline = line
    

    Example output:

    C:/projects/testdata.txt
    	'a' at lines 2-3: ...upon a a time. The...
    	'of' at lines 4-5: ...was a test of of the emerg...
    	'if' at lines 6-7: ...cast system. If if there had...
    	'been' at line 7: ...there had been been...
    
  5. maroonedsia said

    string content = File.ReadAllText("file.txt");
    string[] words = content.Split(' ', '\t', '\n');
    string output = "";

    for (int i = 0; i < words.Length - 1; i++)
    {
        if (words[i] == words[i + 1])
        {
            output += "Word Index: " + i.ToString() + ", " +
                "Word: " + words[i] + "\n";
        }
    }

    MessageBox.Show(output);

  6. maroonedsia said

    1. I don’t know how to format the text as code,
    2. For some reason, the “+” was removed from some lines when I posted; for example, the correct code is: if (words[i] == words[i+1])
    3. Why can’t I edit my comment to correct it?! :D
