Extract Number From String
January 18, 2019
We limit our function to integers, both positive and negative. It is tedious but straightforward to extend the function to real numbers, scientific notation, and complex numbers.
Our algorithm scans the string until it finds a digit, then collects digits and returns a number:
(define (extract-number str)
  (let loop ((cs (string->list str)) (prev #\X) (n 0))
    (cond ((null? cs)
            (* (if (char=? prev #\-) -1 1) n))
          ((char-numeric? (car cs))
            (loop (cdr cs) prev
                  (+ (* n 10)
                     (- (char->integer (car cs)) (char->integer #\0)))))
          ((positive? n)
            (* (if (char=? prev #\-) -1 1) n))
          (else (loop (cdr cs) (car cs) n)))))
Here are some examples:
> (extract-number "-123")
-123
> (extract-number "-123junk")
-123
> (extract-number "junk-123")
-123
> (extract-number "junk-123junk")
-123
> (extract-number "junk-123junk456")
-123
> (extract-number "junk123junk456")
123
You can run the program at https://ideone.com/cpk12l.
This is a piece of cake for Perl, as it is its raison d’être. It uses some later Perl 5 features, like the “r” flag on a substitution (which returns the result rather than modifying the string in place) and say to print with a “\n” at the end (hence using the -E switch rather than -e).
Well, we could just use a regex like James’ solution, but it’s more fun to roll our own recognizer. Also, we should take cognizance of internationalization and check things work fine with non-Latin numerics. Here’s some Python 3:
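The listing for this comment is not preserved here. As a rough illustration only, and not the commenter's code, a hand-rolled recognizer that accepts non-Latin decimal digits might look like the sketch below. The name extract_digits is hypothetical; the sketch relies on the standard library's unicodedata.decimal and deliberately ignores any sign.

```python
import unicodedata


def extract_digits(s):
    """Return the value of the first run of decimal digits in s, or None.

    Hypothetical helper for illustration: any character that Unicode
    classifies as a decimal digit counts, and signs are ignored.
    """
    value = None
    for ch in s:
        # decimal() returns the default (None) for non-decimal characters.
        d = unicodedata.decimal(ch, None)
        if d is not None:
            value = d if value is None else value * 10 + d
        elif value is not None:
            break  # the first run of digits has ended
    return value


print(extract_digits("junk123junk456"))
print(extract_digits("abc\u0664\u0665\u0666def"))  # Arabic-Indic digits 4, 5, 6
```

Tracking "no digits seen yet" as None rather than 0 lets the sketch stop after a run that evaluates to zero, as in "junk0junk456".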
A Haskell version.
@matthew, your solution doesn’t handle negative numbers.
Here’s a solution in C, supporting ASCII digits 0-9.
Examples:
Here’s a less readable version of extract that uses compiler built-ins to handle overflow.
[sourcecode lang="c"]
int extract(char* input, int* output) {
    int start_idx = -1;
    int end_idx = -1;
    /* Find the first digit; fail if the string contains none. */
    for (int i = 0; ; ++i) {
        char c = input[i];
        if (c == '\0') return 0;
        if (c < '0' || c > '9') continue;
        start_idx = i;
        break;
    }
    /* Find the last digit of that run (the terminating '\0' ends it). */
    for (int i = start_idx + 1; ; ++i) {
        char c = input[i];
        if (c >= '0' && c <= '9') continue;
        end_idx = i - 1;
        break;
    }
    int result = 0;
    int multiplier = 1;
    /* A '-' immediately before the digits makes the number negative. */
    if (start_idx > 0 && input[start_idx - 1] == '-')
        multiplier = -1;
    /* Accumulate from the least significant digit, carrying the sign in
       the multiplier; return 0 if any step overflows an int. */
    for (int i = end_idx; i >= start_idx; --i) {
        if (i < end_idx && __builtin_mul_overflow(10, multiplier, &multiplier))
            return 0;
        int addend = input[i] - '0';
        if (__builtin_mul_overflow(addend, multiplier, &addend))
            return 0;
        if (__builtin_add_overflow(result, addend, &result))
            return 0;
    }
    *output = result;
    return 1;
}
[/sourcecode]
I had a typo (“soucecode”) that messed up the formatting of my solution that accommodates overflow. Here’s the properly formatted version.
@Daniel: Good point. I was reading your solution (incidentally, you need to initialize result on line 35 of the original) and wondering how you were going to handle minus signs, so I liked your trick when I came to it. It seems unreasonable not to allow the Unicode minus sign (U+2212), which is not supported by the Python int function.
@matthew, thanks. My intent was to initialize the result in extract. Here’s the updated code.
Regular expressions are clearly the way to go, but I also implemented another solution based on groupby: split the string into groups of (possible) signs, digits, and the rest, and return the (signed) digits.
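The listing for this comment is not preserved here either. A minimal sketch of the groupby idea, assuming Python's itertools.groupby and only ASCII digits and the ASCII minus sign, could look like this. The names _kind and extract_number are hypothetical, and treating any run of "-" directly before the digits as a single negation is an assumption, not necessarily what the commenter did.

```python
from itertools import groupby


def _kind(ch):
    """Classify a character as 'sign', 'digit', or 'other' (hypothetical helper)."""
    if ch == "-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"


def extract_number(s):
    """Return the first integer in s, negated if a '-' run directly precedes it.

    Returns None when s contains no digits.
    """
    prev_kind = None
    # groupby yields maximal runs of consecutive characters of the same kind.
    for kind, group in groupby(s, key=_kind):
        if kind == "digit":
            value = int("".join(group))
            return -value if prev_kind == "sign" else value
        prev_kind = kind
    return None


print(extract_number("junk-123junk456"))
print(extract_number("junk123junk456"))
```

Because groupby yields maximal runs, the first "digit" group is the whole number, and only the group directly before it needs to be remembered to decide the sign.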
Mumps version