What is the best way of working out if some column value is truly numeric?
Suppose I create a table like this:
create table hjr (col1 varchar2(30)); begin for x in 1 .. 17000000 loop insert into hjr values (x); end loop; commit; end; /
So there are now 17 million records in that table, all of which have a VARCHAR2 column filled with data which is actually, if you looked carefully at it, purely numeric data. But now I now do this:
update hjr set col1 = '186298342kjdfvf9' where col1=327642; commit;
So, now one row has some non-numeric data in the COL1 column: what’s the quickest way of getting Oracle to find which row has non-numeric data? Well, just to make sure I wasn’t re-inventing the wheel, I did a Google (actually a quackquackgo… but that just sounds a bit weird!) and these are the top two searches that came up:
- http://www.techonthenet.com/oracle/questions/isnumeric.php
- http://cwbrian.wordpress.com/2007/03/30/oracle-check-for-any-non-numeric-value-in-a-column/
Both sites suggest using Oracle’s TRANSLATE function to swap blank space for any known numeric values found in the input string. If a string were truly numeric, therefore, it would end up containing nothing, and its length would be zero. The second site is a bit tricky to get to these days for some reason -I had to check the Google Cache of it, and the page’s source code to be sure.
But Technonthenet, in particular, is a site I’ve used many times and the quality of whose writings I have no qualms about. So to see it suggesting the ‘translate’ route is interesting and reassuring -because that’s exactly the way of doing things I was going to suggest using anyway!
Here’s a little function, for example:
create or replace function isnumeric (p_string IN varchar2) return number deterministic as begin if nvl(length(trim(translate(p_string,'1234567890.',' '))),0)>0 then return 0; else return 1; end if; end; /
The translate function swaps found characters for something else. In this case, I’m asking to swap the numbers 1, 2, 3 and so on for a space (‘ ‘). I then use the TRIM function to strip out spaces. I then measure the length of anything that’s left. So, the number ’667′ would have been translated to ‘ ‘ (3 spaces); trimmed to nothing, and thus caused a ’1′ to return (1 means “it’s a number”). But if the input being tested was ’667xf’, then there are no instructions as to what to replace the letters ‘x’ and ‘f’ with, so that would have been translated to ‘ xf’, trimmed to just ‘xf’ and that would have caused a 0 return (0 meaning ‘it’s NOT a number’).
As I say, this was the way I thought I’d go independently, before finding out that respectable sources had been there a long time before. It’s good to know you’re not alone!!
But here’s a second idea I had:
create or replace function isnumber(p_string IN varchar2) return number deterministic as v_number number; begin v_number := p_string; return 1; exception when others then return 0; end; /
This one’s much cruder. It creates a numeric variable (v_number) and then attempts to assign the input string to it. It’s therefore relying on Oracle’s ability to implicitly convert between data types -and, in this case specifically, on its ability to cast between number and string data types. Should the implicit cast work, the function returns a ’1′: it can only have worked if the input string really did contain only-numeric numbers. Should the implicit cast fail, the function will error -but we have an error handler to deal with that. The error handler simply returns 0, indicating that something in the input string could not be implicitly converted to a number.
I think there’s no doubt that this is a much riskier proposition than before. You are, after all, relying on Oracle’s implicit (and not extensively documented) behaviour, which might change between database versions. You’re also relying on a ‘when others’ error handler -so you’ll be told something’s non-numeric if the code ever errors out for completely spurious reasons.
That said, I know this code has worked on every Oracle version from 8.0.6 through to 11.2.0.3, so if we’re worrying about Oracle changing its product’s behaviour, we’ve been worrying for a long time! Moreover, this code doesn’t seem especially complex, so the chances of triggering the ‘when others’ exception unreasonably seems pretty remote. On both grounds, therefore, maybe we can ‘get away with it’?
So: we have an ISNUMERIC function, which is well-behaved and accords with what others have written; and there’s an ISNUMBER function, which depends on implicit datatype conversions and some other assumptions about Oracle behaviour.
We obviously need some speed and efficiency comparisons. So, first up, here’s the average runtime for ISNUMBER, and the main statistics associated with it:
SQL> select count(*) from hjr where isnumber(col1)=0;
Elapsed: 00:00:27.62
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
31234 consistent gets
31229 physical reads
0 redo size
526 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
And here’s the equivalent for ISNUMERIC (as before, the elapsed time shown is actually the average of five runs, with the longest and shortest discarded):
SQL> select count(*) from hjr where isnumeric(col1)=0;
Elapsed: 00:02:00.29
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
31234 consistent gets
31229 physical reads
0 redo size
526 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
That’s quite a difference! The method involving implicit casts and error handlers copes in about 27 seconds instead of 129… it is, in other words, about five times as fast in returning a result -and the statistics don’t seem to indicate anything particularly different about either method (same consistent gets, same SQL round trips and so on).
So this is a vote for doing things the dirty way?
Well, let’s alter the experiement a little before we agree to that. Let’s make rather more non-numerics than before:
delete from hjr where isnumber(col1)=0; update hjr set col1=col||'u8#' where mod(col1,2)=0; commit;
Thanks to a bit of help from the MOD(2) function, half the rows now contain non-numeric data in the COL1 column. Let’s now re-run our tests. First, the ISNUMBER function, relying on implicit data type casting and error handling:
SQL> select count(*) from hjr where isnumber(col1)=0;
Elapsed: 00:00:55.45
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
32111 consistent gets
17789 physical reads
0 redo size
529 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Interesting that the number of physical reads has reduced so dramatically now that half the data is non-numeric! Note, too, that this now completes in about 55 seconds, which is about twice as long than the 27 seconds it took to find one bad row out of 17 million good ones.
But anyway: here’s the ISNUMERIC version, relying on translating away any numeric characters and counting the length of what’s left:
SQL> select count(*) from hjr where isnumeric(col1)=0;
Elapsed: 00:02:01.85
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
32111 consistent gets
17791 physical reads
0 redo size
529 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
These are the same sorts of statistics as with the other function, including a big reduction in physical reads-but the elapsed time is still just over 2 minutes… which is almost exactly the same time as it took to find just one non-numeric row first time around. The ‘translate away’ technique is slow, therefore, but at least it’s consistently slow!
The technique that relies on implicit data conversions and error handling remains much faster -but the more and more exceptions it has to throw, the slower it gets.
If you are looking for the proverbial needle-in-a-haystack, then, I reckon the implicit-cast-and-catch might be the way to go. But if you are unsure of your data -specifically, if you are not sure if a significant part of it might be ‘bad’- then the ‘trim(translate))’ function might be more reliable.
