Extract Number From String

January 18, 2019

We limit our function to integers, both positive and negative. It is tedious but straight forward to extend the function to real numbers, scientific notation, and complex numbers.

Our algorithm scans the string until it finds a digit, then collects digits and returns a number:

(define (extract-number str)
  (let loop ((cs (string->list str)) (prev #\X) (n 0))
    (cond ((null? cs)
            (* (if (char=? prev #\-) -1 1) n))
          ((char-numeric? (car cs))
            (loop (cdr cs) prev
                  (+ (* n 10)
                     (- (char->integer (car cs))
                        (char->integer #\0)))))
          ((positive? n)
            (* (if (char=? prev #\-) -1 1) n))
          (else (loop (cdr cs) (car cs) n)))))

Here are some examples:

> (extract-number "-123")
-123
> (extract-number "-123junk")
-123
> (extract-number "junk-123")
-123
> (extract-number "junk-123junk")
-123
> (extract-number "junk-123junk456")
-123
> (extract-number "junk123junk456")
123

You can run the program at https://ideone.com/cpk12l.

Posted by programmingpraxis

Filed in Exercises

10 Comments »

10 Responses to “Extract Number From String”

James Curtis-Smith said
January 18, 2019 at 10:18 AM
This is a piece of cake for Perl – as it is its raison d’etra… uses some later Perl 5 features like “r” flag on regexp to replace in place and say to print with a “\n” at the end of it… {hence using -E rather than -e switch}
```
> perl -E 'say shift=~s{^.*?(-?\d+).*$}{$1}r;' a-b-123cd456
-123
```

matthew said

January 18, 2019 at 12:43 PM

Well, we could just use a regex like James’ solution, but it’s more fun to roll our own recognizer. Also, we should take cognizance of internationalization and check things work fine with non-Latin numerics. Here’s some Python 3:

def findseq(s,pred):
    p = 0
    for i,c in enumerate(s):
        if not pred(c):
            if i != p: yield(s[p:i])
            p = i+1
    if p != len(s): yield s[p:]

def findnums(s): return (int(n) for n in findseq(s,str.isdigit))

s = """
The 100th Fibonacci number is 354224848179261915075
The ١٠٠th Fibonacci number is ٣٥٤٢٢٤٨٤٨١٧٩٢٦١٩١٥٠٧٥
The १००th Fibonacci number is ३५४२२४८४८१७९२६१९१५०७५
The ௧௦௦th Fibonacci number is ௩௫௪௨௨௪௮௪௮௧௭௯௨௬௧௯௧௫௦௭௫
The ໑໐໐th Fibonacci number is ໓໕໔໒໒໔໘໔໘໑໗໙໒໖໑໙໑໕໐໗໕
The 𝟙𝟘𝟘th Fibonacci number is 𝟛𝟝𝟜𝟚𝟚𝟜𝟠𝟜𝟠𝟙𝟟𝟡𝟚𝟞𝟙𝟡𝟙𝟝𝟘𝟟𝟝
"""
assert(list(findnums(s)) == [100,354224848179261915075]*6)

Globules said

January 19, 2019 at 6:34 AM

A Haskell version.

{-# LANGUAGE OverloadedStrings #-}

import Data.Either (rights)
import Data.Text (Text, tails)
import Data.Text.Read (decimal, signed)

firstNum :: Integral a => Text -> Maybe a
firstNum str = case rights . map (signed decimal) . tails $ str of
  [] -> Nothing
  ((n,_):_) -> Just n

main :: IO ()
main = do
  print $ firstNum ""
  print $ firstNum "abcdef"
  print $ firstNum "123"
  print $ firstNum "ab123cde456f"
  print $ firstNum "ab-123cde456f"
  print $ firstNum "ab+123cde456f"

$ ./extractnum
Nothing
Nothing
Just 123
Just 123
Just (-123)
Just 123

Daniel said
January 19, 2019 at 6:51 AM
@matthew, your solution doesn’t handle negative numbers.

Here’s a solution in C, supporting ASCII digits 0-9.
```
#include <stdio.h>
#include <stdlib.h>

int extract(char* input, int* output) {
  int start_idx = -1;
  int end_idx = -1;
  for (int i = 0; ; ++i) {
    char c = input[i];
    if (c == '\0') return 0;
    if (c < '0' || c > '9') continue;
    start_idx = i;
    break;
  }
  for (int i = start_idx + 1; ; ++i) {
    char c = input[i];
    if (c >= '0' && c <= '9') continue;
    end_idx = i - 1;
    break;
  }
  int multiplier = 1;
  if (start_idx > 0 && input[start_idx - 1] == '-')
    multiplier = -1;
  for (int i = end_idx; i >= start_idx; --i) {
    *output += multiplier * (input[i] - '0');
    multiplier *= 10;
  }
  return 1;
}

int main(int argc, char* argv[]) {
  if (argc != 2) {
    fprintf(stderr, "Usage: %s STR\n", argv[0]);
    return EXIT_FAILURE;
  }
  int result;
  if (!extract(argv[1], &result))
    return EXIT_FAILURE;
  printf("%d\n", result);
  return EXIT_SUCCESS;
}
```
Examples:
```
$ ./a.out -123
-123
$ ./a.out -123junk
-123
$ ./a.out junk-123
-123
$ ./a.out junk-123junk
-123
$ ./a.out junk-123junk456
-123
```
Here’s a less readable version of extract that uses compiler built-ins to handle overflow.

[soucecode lang=”c”]
int extract(char* input, int* output) {
int start_idx = -1;
int end_idx = -1;
for (int i = 0; ; ++i) {
char c = input[i];
if (c == ‘\0’) return 0;
if (c < ‘0’ || c > ‘9’) continue;
start_idx = i;
break;
}
for (int i = start_idx + 1; ; ++i) {
char c = input[i];
if (c >= ‘0’ && c <= ‘9’) continue;
end_idx = i – 1;
break;
}
int result = 0;
int multiplier = 1;
if (start_idx > 0 && input[start_idx – 1] == ‘-‘)
multiplier = -1;
for (int i = end_idx; i >= start_idx; –i) {
if (i < end_idx && __builtin_mul_overflow(10, multiplier, &multiplier))
return 0;
int addend = input[i] – ‘0’;
if (__builtin_mul_overflow(addend, multiplier, &addend))
return 0;
if (__builtin_add_overflow(result, addend, &result))
return 0;
}
*output = result;
return 1;
}
[/sourcecode]

Daniel said

January 19, 2019 at 6:54 AM

I had a typo (“soucecode”) that messed up the formatting of my solution that accommodates overflow. Here’s the properly formatted version.

int extract(char* input, int* output) {
  int start_idx = -1;
  int end_idx = -1;
  for (int i = 0; ; ++i) {
    char c = input[i];
    if (c == '\0') return 0;
    if (c < '0' || c > '9') continue;
    start_idx = i;
    break;
  }
  for (int i = start_idx + 1; ; ++i) {
    char c = input[i];
    if (c >= '0' && c <= '9') continue;
    end_idx = i - 1;
    break;
  }
  int result = 0;
  int multiplier = 1;
  if (start_idx > 0 && input[start_idx - 1] == '-')
    multiplier = -1;
  for (int i = end_idx; i >= start_idx; --i) {
    if (i < end_idx && __builtin_mul_overflow(10, multiplier, &multiplier))
      return 0;
    int addend = input[i] - '0';
    if (__builtin_mul_overflow(addend, multiplier, &addend))
      return 0;
    if (__builtin_add_overflow(result, addend, &result))
      return 0;
  }
  *output = result;
  return 1;
}

matthew said

January 19, 2019 at 10:31 AM

@Daniel: Good point. I was reading your solution (incidentally, you need to initialize result on line 35 of the original) and wondering how you were going to handle minus signs, so liked your trick when I came to it. It seems unreasonable not to allow the Unicode minus sign (U+2212) which is not supported by the Python int function

def getnum(s,start,end):
    # Include numeric minus as well as hyphen
    minuschars = "-−"
    n = int(s[start:end])
    if start > 0 and s[start-1] in minuschars: return -n
    else: return n
    
def findseq(s,pred):
    p = 0
    for i,c in enumerate(s):
        if not pred(c):
            if i != p: yield getnum(s,p,i)
            p = i+1
    if p != len(s): yield getnum(s,p,len(s))

Daniel said

January 19, 2019 at 4:26 PM

@matthew, thanks. My intent was to initialize the result in extract.

Here’s the updated code.

int extract(char* input, int* output) {
  int start_idx = -1;
  int end_idx = -1;
  for (int i = 0; ; ++i) {
    char c = input[i];
    if (c == '\0') return 0;
    if (c < '0' || c > '9') continue;
    start_idx = i;
    break;
  }
  for (int i = start_idx + 1; ; ++i) {
    char c = input[i];
    if (c >= '0' && c <= '9') continue;
    end_idx = i - 1;
    break;
  }
  *output = 0;
  int multiplier = 1;
  if (start_idx > 0 && input[start_idx - 1] == '-')
    multiplier = -1;
  for (int i = end_idx; i >= start_idx; --i) {
    *output += multiplier * (input[i] - '0');
    multiplier *= 10;
  }
  return 1;
}

Paul said

January 22, 2019 at 9:40 AM

Regular expressions are clearly the way to go, but I implemented also another solution based on groupby. Split the string in groups of (possible) sign(s) digits and the rest and return the (signed) digits.

import re
from itertools import groupby

OTHER, SIGN, DIGIT = range(3)
integer = re.compile(r"[-+]?\d+")

def extract_int(s):
    def group(c):
        if c in "+-": return SIGN
        if c.isdigit(): return DIGIT
        return OTHER
    sign = 1
    for k, g in groupby(s, group):
        if k == DIGIT: yield sign * int("".join(g))
        sign = -1 if k == SIGN and "".join(g)[-1] == "-" else 1

print(list(extract_int("junk1--123junk456pp")))  # -> [1, -123, 456]
print(list(integer.findall("junk1--123junk456pp")))  # -> ['1', '-123', '456']

Steve said

June 7, 2019 at 5:21 AM

Mumps version

Code:
findnum(n) ;
 ;
 n i,num
 i n'?.e.1"-"1.n.e q "Does not contain valid number"
 f i=1:1 i $e(n,i,$l(n))?.1"-"1.n.e q
 q +$e(n,i,$l(n))

---

Run:
YDB>f i="-123","-123junk","junk-123","junk-123junk","junk-123junk456","junk123junk456" w !,i," --> ",$$findnum^findnum(i)
abc --> Does not contain valid number
-123 --> -123
-123junk --> -123
junk-123 --> -123
junk-123junk --> -123
junk-123junk456 --> -123
junk123junk456 --> 123

Steve said

June 7, 2019 at 5:24 AM

Mumps version


Code:
findnum(n) ;
 ;
 n i,num
 i n'?.e.1"-"1.n.e q "Does not contain valid number"
 f i=1:1 i $e(n,i,$l(n))?.1"-"1.n.e q
 q +$e(n,i,$l(n))

---

Run:
YDB>f i="-123","-123junk","junk-123","junk-123junk","junk-123junk456","junk123junk456" w !,i," --> ",$$findnum^findnum(i)

-123 --> -123
-123junk --> -123
junk-123 --> -123
junk-123junk --> -123
junk-123junk456 --> -123
junk123junk456 --> 123
YDB>zp ^findnum

Programming Praxis