Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537
Interesting! It boils down to this:
    pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
        use self::IntErrorKind::*;
        use self::ParseIntError as PIE;
        // guard: radix must be 2..=36
        if 2 > radix || radix > 36 {
            from_ascii_radix_panic(radix);
        }
        if src.is_empty() {
            return Err(PIE { kind: Empty });
        }
        // Strip leading '+' or '-', detect sign
        // (a bare '+' or '-' with nothing after it is an error)
        // accumulate digits, checking for overflow
        Ok(result)
    }

Ugly (and not performant if in a hot path) but it works.
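    int ret = 0;
    /* walk the NUL-terminated string; note there is no overflow check here */
    for (size_t i = 0; characters[i] != '\0'; i++)
    {
        /* accept ASCII digits only; anything else is an error */
        if (characters[i] >= '0' && characters[i] <= '9')
        {
            ret = ret * 10 + (characters[i] - '0');
        }
        else
        {
            return ERROR;
        }
    }
    return ret;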
Adjust until it actually works, but you get the picture.

    #include <limits.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: require one numeric argument\n");
            return 1;
        }
        const char *nump = argv[1];
        unsigned neg = 0;
        unsigned long long ures = 0;
        if (*nump == '-') {
            neg = 1;
            nump = nump + 1;
        }
        if (!*nump) {
            fprintf(stderr, "require non empty string\n");
            return 1;
        }
        char b;
        while ((b = *nump++) != '\0') {
            if (b >= '0' && b <= '9') {
                unsigned long long digit = (unsigned long long) (b - '0');
                /* check before multiplying: a wrapped product is not always smaller */
                if (ures > (ULLONG_MAX - digit) / 10) {
                    fprintf(stderr, "overflow in '%s'\n", argv[1]);
                    return 1;
                }
                ures = ures * 10 + digit;
            } else {
                if (b >= ' ') {
                    fprintf(stderr, "invalid char '%c' in '%s'\n", b, argv[1]);
                } else {
                    fprintf(stderr, "invalid byte '%d' in '%s'\n", b, argv[1]);
                }
                return 1;
            }
        }
        /* like the hex constants below, this assumes a 64-bit long long */
        long long res;
        if (neg) {
            if (ures > 0x8000000000000000ULL) {
                fprintf(stderr, "underflow in '%s'\n", argv[1]);
                return 1;
            }
            /* negating the magnitude of LLONG_MIN as a signed value would overflow */
            res = (ures == 0x8000000000000000ULL) ? LLONG_MIN : -(long long) ures;
        } else {
            if (ures > 0x7FFFFFFFFFFFFFFFULL) {
                fprintf(stderr, "overflow in '%s'\n", argv[1]);
                return 1;
            }
            res = (long long) ures;
        }
        fprintf(stdout, "result: %lld\n", res);
        return 0;
    }

That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.
For integers, you're faster (in both development time and runtime) to write your own parser than to try and assemble the pieces in this pile of shit into a half-working one.
C++17 from_chars excluded. Incidentally, 2022 seems about right for the year that ONE open source implementation finally actually implemented the float part of that. Or was it more like 2024?
Ok, having a method to do that for you would be nice, but the post reads as if it's a problem that the standard library doesn't provide a method that behaves exactly as you want.
Perhaps the right title should be "No way to parse pathological edge cases in 'C'"
And then see how other languages do.
In every language, the standard library makes some assumptions about this. In JavaScript, Number("") evaluates to 0, while parseInt("") gives NaN.
The standard C library, which dates back to the stone age, does the simplest thing you can do without range checking, because, well, that's kinda the C paradigm. If you want parsing that handles edge cases in a specific way, you do it yourself. It's just digits.
None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.
These functions are all designed to avoid making any assumptions about the syntax of the larger document the numeric lexeme might be embedded in. You might, after all, be using a syntax where numbers can come with units on the end. Or you might be reading numbers as comma-separated values.
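To make the "scan a lexeme, leave the rest to the caller" usage concrete, here's a rough sketch built on strtol's end pointer. The name scan_number and its return convention are made up for illustration; only strtol, errno and ERANGE are standard C.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: scan one decimal lexeme from the front of `s`
 * and report where scanning stopped, so the caller can interpret the
 * remainder (a unit, a comma, nothing at all) by its own rules. */
int scan_number(const char *s, long *value, const char **rest)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s)            /* no digits were consumed */
        return -1;
    if (errno == ERANGE)     /* value was clamped to LONG_MIN/LONG_MAX */
        return -1;
    *value = v;
    *rest = end;             /* e.g. "kg", ",17" or "" */
    return 0;
}
```

Given "12kg" this stores 12 and leaves rest pointing at "kg"; whether "kg" is a valid unit or a syntax error is the caller's decision, which is the whole point.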
And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:
- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),
- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;
- nor (especially on early memory-constrained systems) can you allocate heap memory for an unbounded growable buffer that retains the current lexeme until you reach the end of it (which, if you could, would let you use a scanner state machine that doubles as a parser/validator, returning either a parsed value or an error);
- so instead, to deal with (1) unbounded input (2) in a textual encoding (3) in constant memory, you must eagerly scan the input stream: synchronously reduce over each received byte (or at most each fixed-length N-byte chunk, using a static or stack-allocated fixed-length buffer), discarding the original bytes once they have been reduced over, to produce lexically decoded (but not parsed or validated) lexemes. You then do this again on a higher level, feeding your stream of lexemes into a fixed-size ring buffer of sum-typed entries (i.e. an array of tagged-union lexeme structs), over which you can invoke a function that attempts to scan and consume them. Unlike the original stream-scanning function, that function doesn't consume the buffer unless it succeeds, so it functions not as a scanner per se but as an LR parser.
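As a rough illustration of the first stage only (reducing over each byte in constant memory, not the ring buffer or the LR parser on top), something like the following would do. The struct and function names are hypothetical, and treating every non-digit byte as a separator is a simplifying assumption.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical constant-memory scanner state: no copy of the input is kept. */
struct num_scanner {
    unsigned long value;   /* digits folded in so far */
    bool in_number;        /* currently inside a run of digits? */
    bool overflow;         /* the current run no longer fits */
};

/* Feed one byte (or EOF). Returns 1 when a lexeme has just ended and
 * its value was stored in *out, -1 when a lexeme ended but overflowed,
 * and 0 when there is nothing to report yet. */
int num_scanner_feed(struct num_scanner *s, int byte, unsigned long *out)
{
    if (byte >= '0' && byte <= '9') {
        unsigned long digit = (unsigned long)(byte - '0');
        if (s->value > (ULONG_MAX - digit) / 10)
            s->overflow = true;          /* value * 10 + digit would wrap */
        else
            s->value = s->value * 10 + digit;
        s->in_number = true;
        return 0;
    }
    if (!s->in_number)
        return 0;                        /* separator between lexemes */
    int status = s->overflow ? -1 : 1;
    if (status == 1)
        *out = s->value;
    s->value = 0;                        /* reset for the next lexeme */
    s->in_number = false;
    s->overflow = false;
    return status;
}
```

The caller would loop over getchar() and pass each result in, including the final EOF so that a trailing number gets flushed. Note that the scanner never needs to know how long the input is.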
If you're not writing UNIX CLI tools, direct use of the C-stdlib numeric-lexeme scan functions is operating on the wrong abstraction layer. What you want, if you have pre-framed strings that are "either valid numbers or parse errors", is to implement an actual parsing function... that can then invoke these numeric-lexer functions to do the majority of its work.
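A minimal sketch of such a wrapper, assuming "valid" means one decimal integer with an optional sign and nothing else in the string. The name parse_long_strict is invented here; strtol still does the digit work.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical strict parser: the whole of `s` must be one decimal integer. */
bool parse_long_strict(const char *s, long *out)
{
    /* strtol silently skips leading whitespace; a strict parser rejects it. */
    if (*s == '\0' || isspace((unsigned char)*s))
        return false;
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')   /* no digits, or trailing junk */
        return false;
    if (errno == ERANGE)            /* out of range for long */
        return false;
    *out = v;
    return true;
}
```

With this, "42" and "-7" parse, while "", " 42", "42abc" and a 30-digit number are all rejected instead of being quietly accepted or clamped.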
And if you're writing C, and yet you're not in UNIX-pipeline unbounded-text-stream land, but rather are parsing well-defined bounded-length "documents" (like, say, C source files)... then you probably want to use a real lexer-generator (like flex) to feed a parser-generator (like yacc/bison). Where:
- you'd validate the token in context, in the parsing phase;
- and your lexing rules would make certain classes of input invalid at lexing time. (E.g. you can write your lexeme matching rules such that multi-digit numbers with leading zeroes, or floating-point values with no digits before/after the decimal place, simply aren't "numbers" from your lexer's perspective.)
...which means that, once again, you can "get away with" invoking the regular C numeric-lexeme scanner functions; i.e. `yylval = atoi(yytext);` in bison terms. (And you'd want to, since doing so saves memory vs. keeping the numbers around as strings.)
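For a sense of what such a lexing rule amounts to, here is a hand-written C equivalent of a pattern like 0|[1-9][0-9]* (both the pattern and the name is_decimal_lexeme are my own illustration, not something flex generates).

```c
#include <stdbool.h>

/* Hand-written equivalent of the (hypothetical) lexer pattern 0|[1-9][0-9]*:
 * unsigned decimal digits with no leading zeroes. */
bool is_decimal_lexeme(const char *text)
{
    if (text[0] == '0')
        return text[1] == '\0';          /* "0" alone is fine, "007" is not */
    if (text[0] < '1' || text[0] > '9')
        return false;                    /* empty, signed, or not a digit */
    for (const char *p = text + 1; *p != '\0'; p++) {
        if (*p < '0' || *p > '9')
            return false;
    }
    return true;
}
```

Once the lexer has only ever matched text of this shape, the later conversion call has nothing left to validate except range.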
:)
Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.
String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.
C is not the C standard library, ffs.