Extract Number From String
January 18, 2019
We have another string-handling exercise today:
Given a string containing a number, extract the number from the string. For instance, the strings “-123”, “-123junk”, “junk-123”, “junk-123junk” and “junk-123junk456” should all evaluate to the number -123.
Your task is to write a program to extract a number from a string. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
This is a piece of cake for Perl – as it is its raison d’etra… uses some later Perl 5 features like “r” flag on regexp to replace in place and say to print with a “\n” at the end of it… {hence using -E rather than -e switch}
> perl -E 'say shift=~s{^.*?(-?\d+).*$}{$1}r;' a-b-123cd456 -123Well, we could just use a regex like James’ solution, but it’s more fun to roll our own recognizer. Also, we should take cognizance of internationalization and check things work fine with non-Latin numerics. Here’s some Python 3:
def findseq(s,pred): p = 0 for i,c in enumerate(s): if not pred(c): if i != p: yield(s[p:i]) p = i+1 if p != len(s): yield s[p:] def findnums(s): return (int(n) for n in findseq(s,str.isdigit)) s = """ The 100th Fibonacci number is 354224848179261915075 The ١٠٠th Fibonacci number is ٣٥٤٢٢٤٨٤٨١٧٩٢٦١٩١٥٠٧٥ The १००th Fibonacci number is ३५४२२४८४८१७९२६१९१५०७५ The ௧௦௦th Fibonacci number is ௩௫௪௨௨௪௮௪௮௧௭௯௨௬௧௯௧௫௦௭௫ The ໑໐໐th Fibonacci number is ໓໕໔໒໒໔໘໔໘໑໗໙໒໖໑໙໑໕໐໗໕ The 𝟙𝟘𝟘th Fibonacci number is 𝟛𝟝𝟜𝟚𝟚𝟜𝟠𝟜𝟠𝟙𝟟𝟡𝟚𝟞𝟙𝟡𝟙𝟝𝟘𝟟𝟝 """ assert(list(findnums(s)) == [100,354224848179261915075]*6)A Haskell version.
{-# LANGUAGE OverloadedStrings #-} import Data.Either (rights) import Data.Text (Text, tails) import Data.Text.Read (decimal, signed) firstNum :: Integral a => Text -> Maybe a firstNum str = case rights . map (signed decimal) . tails $ str of [] -> Nothing ((n,_):_) -> Just n main :: IO () main = do print $ firstNum "" print $ firstNum "abcdef" print $ firstNum "123" print $ firstNum "ab123cde456f" print $ firstNum "ab-123cde456f" print $ firstNum "ab+123cde456f"@matthew, your solution doesn’t handle negative numbers.
Here’s a solution in C, supporting ASCII digits 0-9.
#include <stdio.h> #include <stdlib.h> int extract(char* input, int* output) { int start_idx = -1; int end_idx = -1; for (int i = 0; ; ++i) { char c = input[i]; if (c == '\0') return 0; if (c < '0' || c > '9') continue; start_idx = i; break; } for (int i = start_idx + 1; ; ++i) { char c = input[i]; if (c >= '0' && c <= '9') continue; end_idx = i - 1; break; } int multiplier = 1; if (start_idx > 0 && input[start_idx - 1] == '-') multiplier = -1; for (int i = end_idx; i >= start_idx; --i) { *output += multiplier * (input[i] - '0'); multiplier *= 10; } return 1; } int main(int argc, char* argv[]) { if (argc != 2) { fprintf(stderr, "Usage: %s STR\n", argv[0]); return EXIT_FAILURE; } int result; if (!extract(argv[1], &result)) return EXIT_FAILURE; printf("%d\n", result); return EXIT_SUCCESS; }Examples:
Here’s a less readable version of
extractthat uses compiler built-ins to handle overflow.[soucecode lang=”c”]
int extract(char* input, int* output) {
int start_idx = -1;
int end_idx = -1;
for (int i = 0; ; ++i) {
char c = input[i];
if (c == ‘\0’) return 0;
if (c < ‘0’ || c > ‘9’) continue;
start_idx = i;
break;
}
for (int i = start_idx + 1; ; ++i) {
char c = input[i];
if (c >= ‘0’ && c <= ‘9’) continue;
end_idx = i – 1;
break;
}
int result = 0;
int multiplier = 1;
if (start_idx > 0 && input[start_idx – 1] == ‘-‘)
multiplier = -1;
for (int i = end_idx; i >= start_idx; –i) {
if (i < end_idx && __builtin_mul_overflow(10, multiplier, &multiplier))
return 0;
int addend = input[i] – ‘0’;
if (__builtin_mul_overflow(addend, multiplier, &addend))
return 0;
if (__builtin_add_overflow(result, addend, &result))
return 0;
}
*output = result;
return 1;
}
[/sourcecode]
I had a typo (“soucecode”) that messed up the formatting of my solution that accommodates overflow. Here’s the properly formatted version.
int extract(char* input, int* output) { int start_idx = -1; int end_idx = -1; for (int i = 0; ; ++i) { char c = input[i]; if (c == '\0') return 0; if (c < '0' || c > '9') continue; start_idx = i; break; } for (int i = start_idx + 1; ; ++i) { char c = input[i]; if (c >= '0' && c <= '9') continue; end_idx = i - 1; break; } int result = 0; int multiplier = 1; if (start_idx > 0 && input[start_idx - 1] == '-') multiplier = -1; for (int i = end_idx; i >= start_idx; --i) { if (i < end_idx && __builtin_mul_overflow(10, multiplier, &multiplier)) return 0; int addend = input[i] - '0'; if (__builtin_mul_overflow(addend, multiplier, &addend)) return 0; if (__builtin_add_overflow(result, addend, &result)) return 0; } *output = result; return 1; }@Daniel: Good point. I was reading your solution (incidentally, you need to initialize
resulton line 35 of the original) and wondering how you were going to handle minus signs, so liked your trick when I came to it. It seems unreasonable not to allow the Unicode minus sign (U+2212) which is not supported by the Pythonintfunctiondef getnum(s,start,end): # Include numeric minus as well as hyphen minuschars = "-−" n = int(s[start:end]) if start > 0 and s[start-1] in minuschars: return -n else: return n def findseq(s,pred): p = 0 for i,c in enumerate(s): if not pred(c): if i != p: yield getnum(s,p,i) p = i+1 if p != len(s): yield getnum(s,p,len(s))@matthew, thanks. My intent was to initialize the result in
extract.Here’s the updated code.
int extract(char* input, int* output) { int start_idx = -1; int end_idx = -1; for (int i = 0; ; ++i) { char c = input[i]; if (c == '\0') return 0; if (c < '0' || c > '9') continue; start_idx = i; break; } for (int i = start_idx + 1; ; ++i) { char c = input[i]; if (c >= '0' && c <= '9') continue; end_idx = i - 1; break; } *output = 0; int multiplier = 1; if (start_idx > 0 && input[start_idx - 1] == '-') multiplier = -1; for (int i = end_idx; i >= start_idx; --i) { *output += multiplier * (input[i] - '0'); multiplier *= 10; } return 1; }Regular expressions are clearly the way to go, but I implemented also another solution based on groupby. Split the string in groups of (possible) sign(s) digits and the rest and return the (signed) digits.
import re from itertools import groupby OTHER, SIGN, DIGIT = range(3) integer = re.compile(r"[-+]?\d+") def extract_int(s): def group(c): if c in "+-": return SIGN if c.isdigit(): return DIGIT return OTHER sign = 1 for k, g in groupby(s, group): if k == DIGIT: yield sign * int("".join(g)) sign = -1 if k == SIGN and "".join(g)[-1] == "-" else 1 print(list(extract_int("junk1--123junk456pp"))) # -> [1, -123, 456] print(list(integer.findall("junk1--123junk456pp"))) # -> ['1', '-123', '456']Mumps version
Mumps version