Dowemo
0 0 0 0


Question:

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:

grep -e "[x{00FF}-x{FFFF}]" file.xml

But this returns every line in the file, regardless of whether the line contains a character in the range specified.

Do I have the syntax wrong or am I doing something else wrong? I've also tried:

egrep "[x{00FF}-x{FFFF}]" file.xml 

(with both single and double quotes surrounding the pattern).


Best Answer:


Searching for non-printable chars.

I agree with Harvey above buried in the comments, it is often more useful to search for non-printable characters OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests "use this: "[^n -~]". Add r for DOS text files. That translates to "[^x0Ax020-x07E]" and add x0D for CR"

Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.

I found adding range 0-8 and 0x0e-0x1f (to the 0x80-0xff range) is a useful pattern. This excludes the TAB, CR and LF and one or two more uncommon printable chars. So IMHO a quite a useful (albeit crude) grep pattern is THIS one:

grep -c -P -n "[x00-x08x0E-x1Fx80-xFF]" *

breakdown:

x00-x08 - non-printable control chars 0 - 7 decimal
x0E-x1F - more non-printable control chars 14 - 31 decimal
x80-1xFF - non-printable chars > 128 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches

E.g. practical example of use find to grep all files under current directory:

find . -type f -exec grep -c -P -n "[x00-x08x0E-x1Fx80-xFF]" {} + 

You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.

Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that is useful to exclude sometimes
Dec   Hex Ctrl Char description           Dec Hex Ctrl Char description
0     00  ^@  NULL                        16  10  ^P  DATA LINK ESCAPE (DLE)
1     01  ^A  START OF HEADING (SOH)      17  11  ^Q  DEVICE CONTROL 1 (DC1)
2     02  ^B  START OF TEXT (STX)         18  12  ^R  DEVICE CONTROL 2 (DC2)
3     03  ^C  END OF TEXT (ETX)           19  13  ^S  DEVICE CONTROL 3 (DC3)
4     04  ^D  END OF TRANSMISSION (EOT)   20  14  ^T  DEVICE CONTROL 4 (DC4)
5     05  ^E  END OF QUERY (ENQ)          21  15  ^U  NEGATIVE ACKNOWLEDGEMENT (NAK)
6     06  ^F  ACKNOWLEDGE (ACK)           22  16  ^V  SYNCHRONIZE (SYN)
7     07  ^G  BEEP (BEL)                  23  17  ^W  END OF TRANSMISSION BLOCK (ETB)
8     08  ^H  BACKSPACE (BS)**            24  18  ^X  CANCEL (CAN)
9     09  ^I  HORIZONTAL TAB (HT)**       25  19  ^Y  END OF MEDIUM (EM)
10    0A  ^J  LINE FEED (LF)**            26  1A  ^Z  SUBSTITUTE (SUB)
11    0B  ^K  VERTICAL TAB (VT)**         27  1B  ^[  ESCAPE (ESC)
12    0C  ^L  FF (FORM FEED)**            28  1C  ^  FILE SEPARATOR (FS) RIGHT ARROW
13    0D  ^M  CR (CARRIAGE RETURN)**      29  1D  ^]  GROUP SEPARATOR (GS) LEFT ARROW
14    0E  ^N  SO (SHIFT OUT)              30  1E  ^^  RECORD SEPARATOR (RS) UP ARROW
15    0F  ^O  SI (SHIFT IN)               31  1F  ^_  UNIT SEPARATOR (US) DOWN ARROW



Copyright © 2011 Dowemo All rights reserved.    Creative Commons   AboutUs