How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?

Asked on January 8, 2019 in XML.


  • 11 Answer(s)

    The following code removes characters that are not valid in XML from a string:

    // Requires: using System.Text; using System.Xml;
    /// <summary>
    /// Removes characters that are not valid in XML 1.0, such as most control characters
    /// </summary>
    /// <param name="inString">The string to process</param>
    /// <returns>The input string with every invalid XML character removed</returns>
    public static string RemoveTroublesomeCharacters(string inString)
    {
       if (inString == null) return null;
     
       StringBuilder newString = new StringBuilder();
       char ch;
     
       for (int i = 0; i < inString.Length; i++)
       {
     
          ch = inString[i];
          // keep only characters that XML 1.0 allows; this drops all control characters
          // except tabs and new lines
          //if ((ch >= 0x0020 && ch <= 0xD7FF) || (ch >= 0xE000 && ch <= 0xFFFD) || ch == '\t' || ch == '\n' || ch == '\r')
          //if using .NET version prior to 4, use above logic
          if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
          {
             newString.Append(ch);
          }
       }
       return newString.ToString();
     
    }
    
    Answered on January 8, 2019.

    The following version supports every character that XML allows, not only those up to 0x00FD. The XML specification defines a valid character as:

    Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    In .NET, a char is a 16-bit UTF-16 code unit, so we cannot "allow" 0x10000-0x10FFFF explicitly; those code points are stored as surrogate pairs. The XML spec explicitly disallows the surrogate code points, which start at 0xD800, from appearing on their own. However, it is possible that if we allowed these surrogate code points in our whitelist, UTF-8 encoding the string would still produce valid XML, as long as the surrogate pairs in the .NET string were well formed and encoded properly. The whitelist below takes the conservative route and rejects all surrogates.

    The characters we are excluding are not valid in XML, but they are perfectly valid Unicode code points. We are not removing "non-UTF-8 characters"; we are removing characters that may not appear in well-formed XML documents.

    public static string XmlCharacterWhitelist( string in_string ) {
       if( in_string == null ) return null;
     
       StringBuilder sbOutput = new StringBuilder();
       char ch;
     
       for( int i = 0; i < in_string.Length; i++ ) {
          ch = in_string[i];
          if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
             ( ch >= 0xE000 && ch <= 0xFFFD ) ||
             ch == 0x0009 ||
             ch == 0x000A ||
             ch == 0x000D ) {
             sbOutput.Append( ch );
          }
       }
       return sbOutput.ToString();
    }
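
    The whitelist above drops every surrogate, so it also strips characters outside the Basic Multilingual Plane (emoji, for example). As a minimal sketch of the alternative discussed in this answer, the variant below keeps well-formed surrogate pairs and still removes lone surrogates. The class and method names are hypothetical and are not part of the original answer.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Hypothetical helper: keeps valid XML characters and well-formed
    // surrogate pairs, drops everything else.
    public static string RemoveInvalidXmlCharsKeepingSurrogatePairs(string input)
    {
        if (input == null) return null;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];

            // A high surrogate followed by a low surrogate encodes a code point
            // in 0x10000-0x10FFFF, which the XML Char production allows.
            if (char.IsHighSurrogate(ch) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                sb.Append(ch).Append(input[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if (!char.IsSurrogate(ch) && XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
            // Lone surrogates and invalid characters are dropped.
        }
        return sb.ToString();
    }
}
```

    The pair check uses char.IsHighSurrogate and char.IsLowSurrogate rather than hand-written ranges, which avoids mistakes in the boundary values.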
    
    Answered on January 8, 2019.

    One way to remove invalid XML characters is to use the XmlConvert.IsXmlChar method.

    Simple example:

    void Main() {
       string content = "\v\f\0";
       Console.WriteLine(IsValidXmlString(content)); // False
     
       content = RemoveInvalidXmlChars(content);
       Console.WriteLine(IsValidXmlString(content)); // True
    }
     
    static string RemoveInvalidXmlChars(string text) {
       char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
       return new string(validXmlChars);
    }
     
    static bool IsValidXmlString(string text) {
       try {
          XmlConvert.VerifyXmlChars(text);
          return true;
       } catch {
          return false;
       }
    }
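
    To tie this back to the question, here is a rough sketch of feeding the cleaned string into an XmlReader and an XPathDocument. It assumes the whole document fits in memory as a string; the SanitizedXmlLoader class is a hypothetical name used only for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.XPath;

public static class SanitizedXmlLoader
{
    // Hypothetical helper: strips invalid XML characters, then parses the result.
    public static XPathDocument Load(string rawXml)
    {
        // Keep only characters accepted by XmlConvert.IsXmlChar.
        string clean = new string(rawXml.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());

        // XPathDocument reads everything in its constructor,
        // so the reader can be disposed right away.
        using (XmlReader reader = XmlReader.Create(new StringReader(clean)))
        {
            return new XPathDocument(reader);
        }
    }
}
```

    Because the input is already a string, its character encoding has been decided before this code runs; the sketch does not address detecting the encoding of a raw Stream.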
    
    Answered on January 8, 2019.
    • The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding at the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
    • The removal of invalid hexadecimal characters should only remove hexadecimal-encoded values, because data often contains href values that happen to include a string that would match a hexadecimal character.

    Background:

    I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but want to be able to consume data sources that have been published which contain invalid hexadecimal characters per the XML specification.

    In .NET if you have a Stream that represents the XML data source, and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised due to the inclusion of invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.
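
    If the invalid data arrives as hexadecimal character references (for example &#x1F;) rather than as raw characters, one option that avoids pre-processing is to relax the reader instead. The sketch below sets XmlReaderSettings.CheckCharacters to false; as far as I understand the documentation, this only turns off checking of character entity references, so raw invalid characters in the text would still raise an exception. The LenientXml class name is hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class LenientXml
{
    // Hypothetical helper: loads a document while tolerating character
    // references that point at characters XML does not allow.
    public static XPathDocument Load(Stream stream)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };

        // Passing the Stream itself lets the reader pick the encoding from
        // the byte order mark or the XML declaration.
        using (XmlReader reader = XmlReader.Create(stream, settings))
        {
            return new XPathDocument(reader);
        }
    }
}
```

    This tolerates the offending references rather than removing them, so downstream code still sees the control characters in node values.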

    Answered on February 28, 2019.
    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;
    
        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter) : base(fileName)
        {
            this._replacementCharacter = replacementCharacter;
        }
    
        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }
    
        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }
    
        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = base.Read(buffer, index, count);
            for (int i = index; i < readCount + index; i++)
            {
                char ch = buffer[i];
                if (IsInvalidChar(ch))
                {
                    buffer[i] = this._replacementCharacter;
                }
            }
            return readCount;
        }
    
        private static bool IsInvalidChar(int ch)
        {
            return (ch < 0x0020 || ch > 0xD7FF) &&
                   (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D;
        }
    }
    Answered on February 28, 2019.
    public static string RemoveInvalidXmlChars(string input)
    {
        var isValid = new Predicate<char>(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D);
    
        return new string(Array.FindAll(input.ToCharArray(), isValid));
    }
    Answered on February 28, 2019.
    // Requires: using System.IO;
    public class InvalidXmlCharacterReplacingStreamReader : TextReader
    {
        private readonly StreamReader implementingStreamReader;
        private readonly char replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter)
        {
            implementingStreamReader = new StreamReader(stream);
            this.replacementCharacter = replacementCharacter;
        }

        // Close() on TextReader calls Dispose(true), so the wrapped reader
        // is released through either path.
        protected override void Dispose(bool disposing)
        {
            if (disposing)
            {
                implementingStreamReader.Dispose();
            }
            base.Dispose(disposing);
        }

        public override int Peek()
        {
            int ch = implementingStreamReader.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            int ch = implementingStreamReader.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = implementingStreamReader.Read(buffer, index, count);
            for (int i = index; i < readCount + index; i++)
            {
                if (IsInvalidChar(buffer[i]))
                {
                    buffer[i] = replacementCharacter;
                }
            }
            return readCount;
        }

        // ReadBlock, ReadLine, ReadToEnd and their async variants are not overridden:
        // the TextReader base implementations call the Read/Peek overrides above,
        // so they return filtered text instead of throwing NotImplementedException.

        // True for any character outside the XML 1.0 Char ranges that fit in 16 bits.
        private static bool IsInvalidChar(int ch)
        {
            return (ch < 0x0020 || ch > 0xD7FF) &&
                   (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D;
        }
    }
    Answered on February 28, 2019.

