What is the best way to encode text data for XML in Java?


hot-time 2020. 9. 10. 19:01

This is very similar to this question, except that it is about Java.

What is the recommended way to encode strings for XML output in Java? The strings may contain characters such as "&", "<", etc.

Very simply: use an XML library. That way it will actually be right, instead of requiring detailed knowledge of parts of the XML spec.
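As a rough illustration of what "use an XML library" can look like with only the JDK, the sketch below writes one element through the StAX XMLStreamWriter and leaves the escaping to the writer. The class name StaxEscapeSketch, the element name item and the attribute name label are made up for this example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StaxEscapeSketch {

    // Writes <item label="...">text</item>; the writer escapes markup characters itself.
    static String writeItem(String label, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("item");
            writer.writeAttribute("label", label);   // raw value, no manual escaping
            writer.writeCharacters(text);            // raw text, no manual escaping
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeItem("R&D", "1 < 2 & 3 > 2"));
    }
}
```

The caller passes plain strings and never builds entity references by hand.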

As others have mentioned, using an XML library is the easiest way. If you do want to escape the text yourself, you could look at StringEscapeUtils in the Apache Commons Lang library.

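Just use:

<![CDATA[ your text here ]]>

This allows any characters except the terminator:

]]>

So you can include characters that would otherwise be illegal, such as & and >. For example:

<element><![CDATA[ characters such as & and > are allowed ]]></element>

However, attributes still need to be escaped, because CDATA blocks cannot be used for them.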
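The one sequence a CDATA section cannot hold is its own terminator. A common workaround, sketched below, is to close the section in the middle of any "]]>" and reopen it straight away. CdataSketch and toCdata are hypothetical names chosen for this example.

```java
final class CdataSketch {

    // Wraps text in a CDATA section, splitting every "]]>" across two sections
    // so that the terminator never appears inside the content.
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(toCdata("characters such as & and > are allowed"));
        System.out.println(toCdata("a]]>b"));
    }
}
```

This only deals with the terminator. Characters that XML 1.0 forbids outright, such as most control characters, are not allowed inside CDATA either.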

Try this:

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}

This has worked well for me to produce an escaped version of a text string:

public class XMLHelper {

/**
 * Returns the string where all non-ascii and <, &, > are encoded as numeric entities. I.e. "&lt;A &amp; B &gt;"
 * .... (insert result here). The result is safe to include anywhere in a text field in an XML-string. If there were
 * no characters to protect, the original string is returned.
 * 
 * @param originalUnprotectedString
 *            original string which may contain characters either reserved in XML or with different representation
 *            in different encodings (like 8859-1 and UTF-8)
 * @return
 */
public static String protectSpecialCharacters(String originalUnprotectedString) {
    if (originalUnprotectedString == null) {
        return null;
    }
    boolean anyCharactersProtected = false;

    StringBuffer stringBuffer = new StringBuffer();
    for (int i = 0; i < originalUnprotectedString.length(); i++) {
        char ch = originalUnprotectedString.charAt(i);

        boolean controlCharacter = ch < 32;
        boolean unicodeButNotAscii = ch > 126;
        boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';

        if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
            stringBuffer.append("&#" + (int) ch + ";");
            anyCharactersProtected = true;
        } else {
            stringBuffer.append(ch);
        }
    }
    if (anyCharactersProtected == false) {
        return originalUnprotectedString;
    }

    return stringBuffer.toString();
}

}

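This question is eight years old and still does not have a fully correct answer! No, you should not have to import an entire third-party API to do this simple task. Bad advice.

The following method will:

correctly handle characters outside the Basic Multilingual Plane
escape the characters that XML requires to be escaped
escape any non-ASCII characters, which is optional but common
replace characters that are illegal in XML 1.0 with the Unicode replacement character. There is no best option here; removing them is just as valid.

I have tried to optimise for the most common case, while still making sure you could pipe /dev/random through this and get a string that is valid in XML.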

public static String encodeXML(CharSequence s) {
    StringBuilder sb = new StringBuilder();
    int len = s.length();
    for (int i=0;i<len;i++) {
        int c = s.charAt(i);
        if (Character.isHighSurrogate((char) c) && i + 1 < len
                && Character.isLowSurrogate(s.charAt(i + 1))) {
            c = Character.toCodePoint((char) c, s.charAt(++i));    // UTF-16 decode of a valid surrogate pair
        }
        if (c < 0x80) {      // ASCII range: test most common case first
            if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
                // Illegal XML character, even encoded. Skip or substitute
                sb.append("&#xfffd;");   // Unicode replacement character
            } else {
                switch(c) {
                  case '&':  sb.append("&amp;"); break;
                  case '>':  sb.append("&gt;"); break;
                  case '<':  sb.append("&lt;"); break;
                  // Uncomment next two if encoding for an XML attribute
//                  case '\'':  sb.append("&apos;"); break;
//                  case '\"':  sb.append("&quot;"); break;
                  // Uncomment next three if you prefer, but not required
//                  case '\n':  sb.append("&#10;"); break;
//                  case '\r':  sb.append("&#13;"); break;
//                  case '\t':  sb.append("&#9;"); break;

                  default:   sb.append((char)c);
                }
            }
        } else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
            // Illegal XML character, even encoded. Skip or substitute
            sb.append("&#xfffd;");   // Unicode replacement character
        } else {
            sb.append("&#x");
            sb.append(Integer.toHexString(c));
            sb.append(';');
        }
    }
    return sb.toString();
}
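To see why the first point in the list matters: a character outside the Basic Multilingual Plane is stored as two char values in a Java String, and escaping those one at a time yields two references to surrogate code points, which XML does not accept. The sketch below contrasts the two approaches. The class and method names are invented for illustration, and only non-ASCII characters are escaped to keep the comparison short.

```java
final class CodePointSketch {

    // Escapes each UTF-16 code unit separately (this breaks surrogate pairs).
    static String perChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7e) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escapes whole code points, so a surrogate pair becomes a single reference.
    static String perCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 0x7e) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";             // U+1F600, one code point, two chars
        System.out.println(perChar(emoji));        // &#55357;&#56832; (two surrogate references)
        System.out.println(perCodePoint(emoji));   // &#128512; (one reference)
    }
}
```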

Edit: for those who continue to insist it is foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.

StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.

To escape control characters with Apache commons-lang, use

NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))
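If the output has to be XML 1.0, keep in mind that a numeric reference to a character outside the Char production of XML 1.0 is not well-formed either, so another option is to replace or drop such characters before escaping. The following is a minimal sketch without any dependency; Xml10FilterSketch, isXml10Char and replaceInvalid are hypothetical names, and substituting U+FFFD is just one possible policy.

```java
final class Xml10FilterSketch {

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 does not allow with U+FFFD.
    static String replaceInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(isXml10Char(cp) ? cp : 0xFFFD));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The control character U+0001 is replaced; the tab is kept.
        System.out.println(replaceInvalid("a\u0001b\tc"));
    }
}
```

The result still needs the usual escaping of &, < and > before it goes into a document.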

While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance say template it all the way. It's arguably more readable too. That said, using the escaping routines of a library is probably a good idea.

Consider this: XML was meant to be written by humans.

Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.

Edit: as for how to actually escape XML in templates, CDATA and escapeXml(string) from JSTL are two good solutions. escapeXml(string) can be used like this:

<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>

<item>${fn:escapeXml(value)}</item>

The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0. It now no longer escapes Unicode characters greater than 0x7f.

This is a good thing; the old method was a bit too eager to escape characters that could simply be inserted into a UTF-8 document.
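A small round trip shows why leaving such characters alone is fine when the document is serialized as UTF-8: the bytes parse back to the same text without any numeric references. This sketch uses the JDK's DOM parser; the class name, the element name and the sample string are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

final class Utf8RoundTripSketch {

    // Escapes only the markup characters, encodes the document as UTF-8,
    // then parses it again and returns the element text.
    static String roundTrip(String text) {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><item>" + escaped + "</item>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Hi L\u00e2rry & M\u00f4e!"));
    }
}
```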

The new escapers to be included in Google Guava 11.0 also seem promising: http://code.google.com/p/guava-libraries/issues/detail?id=799

public String escapeXml(String s) {
    // String.replace works on literal text, so no regular expressions are involved
    return s.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;").replace("\"", "&quot;").replace("'", "&apos;");
}

Note: Your question is about escaping, not encoding. Escaping is using &lt;, etc. to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc).
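The difference can be made concrete in a few lines: escaping maps text to text, while encoding maps text to bytes. In the sketch below (class and method names are invented, and only three characters are escaped for brevity) the escaped string is the same in both cases, but the byte count depends on the encoding chosen.

```java
import java.nio.charset.StandardCharsets;

final class EscapeVersusEncodeSketch {

    // Escaping: a text-to-text step that protects markup characters.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String escaped = escape("M\u00f4e & Co");
        // Encoding: a text-to-bytes step chosen when the document is written out.
        byte[] utf8 = escaped.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = escaped.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(escaped);
        System.out.println(utf8.length + " bytes as UTF-8, " + latin1.length + " bytes as ISO-8859-1");
    }
}
```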

First of all, like everyone else said, use an XML library. XML looks simple but the encoding+escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "full width digits" (&#xFF11; is the full-width digit 1)). Keeping XML human readable is a Sisyphean task.

I suggest never to try to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).

That said, if you use only UTF-8, to make things more readable you can consider this strategy:

If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text doesn't contain these three characters, don't wrap it.

I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.
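The two rules above could be sketched roughly as follows. ConditionalCdataSketch and wrapIfNeeded are hypothetical names, and the handling of an embedded "]]>" is an addition that the strategy itself does not mention.

```java
final class ConditionalCdataSketch {

    // Wraps the text in CDATA only when it contains '<', '>' or '&'.
    static String wrapIfNeeded(String text) {
        boolean needsWrapping =
                text.indexOf('<') >= 0 || text.indexOf('>') >= 0 || text.indexOf('&') >= 0;
        if (!needsWrapping) {
            return text;
        }
        // Assumption: a "]]>" inside the text is split across two CDATA sections.
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(wrapIfNeeded("SELECT name FROM t"));
        System.out.println(wrapIfNeeded("SELECT name FROM t WHERE a < 5 AND b > 2"));
    }
}
```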

While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar the two functions to escape/unescape a simple value (attribute or tag, not full document) are not available in the standard XML libraries included with Java.

As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):

  public final static String ESCAPE_CHARS = "<>&\"\'";
  public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
      "&lt;"
    , "&gt;"
    , "&amp;"
    , "&quot;"
    , "&apos;"
  }));

  private static String UNICODE_LOW =  "" + ((char)0x20); //space
  private static String UNICODE_HIGH = "" + ((char)0x7f);

  //should only use for the content of an attribute or tag      
  public static String toEscaped(String content) {
    String result = content;

    if ((content != null) && (content.length() > 0)) {
      boolean modified = false;
      StringBuilder stringBuilder = new StringBuilder(content.length());
      for (int i = 0, count = content.length(); i < count; ++i) {
        String character = content.substring(i, i + 1);
        int pos = ESCAPE_CHARS.indexOf(character);
        if (pos > -1) {
          stringBuilder.append(ESCAPE_STRINGS.get(pos));
          modified = true;
        }
        else {
          if (    (character.compareTo(UNICODE_LOW) > -1)
               && (character.compareTo(UNICODE_HIGH) < 1)
             ) {
            stringBuilder.append(character);
          }
          else {
            stringBuilder.append("&#" + ((int)character.charAt(0)) + ";");
            modified = true;
          }
        }
      }
      if (modified) {
        result = stringBuilder.toString();
      }
    }

    return result;
  }

The above accommodates several different things:

avoids using char based logic until it absolutely has to - improves unicode compatibility
attempts to be as efficient as possible, given that the second "if" condition is likely the most used pathway
is a pure function; i.e. is thread-safe
optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned

At some point, I will write the inverse of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come back and update this answer with the code. :)
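Until then, here is one way such an inverse might look. This is not the author's code, only a sketch under stated assumptions: it decodes the five predefined entities plus decimal and hexadecimal character references, and leaves anything it does not recognise untouched. The class name XmlUnescapeSketch and the helper decode are made up.

```java
final class XmlUnescapeSketch {

    static String toUnescaped(String content) {
        if (content == null || content.indexOf('&') < 0) {
            return content;                       // nothing to decode
        }
        StringBuilder out = new StringBuilder(content.length());
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            int end = (c == '&') ? content.indexOf(';', i) : -1;
            String decoded = (end > i) ? decode(content.substring(i + 1, end)) : null;
            if (decoded != null) {
                out.append(decoded);
                i = end + 1;                      // skip the whole reference
            } else {
                out.append(c);                    // plain character or unknown reference
                i++;
            }
        }
        return out.toString();
    }

    // Returns the text for an entity name such as "lt", "#226" or "#x1F600", or null if unknown.
    private static String decode(String name) {
        switch (name) {
            case "lt":   return "<";
            case "gt":   return ">";
            case "amp":  return "&";
            case "quot": return "\"";
            case "apos": return "'";
            default:     break;
        }
        try {
            if (name.startsWith("#x")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
            }
            if (name.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return null;                          // malformed number or invalid code point
        }
        return null;
    }
}
```

A real parser is stricter than this about what counts as a valid reference, so this is only suitable for values produced by a matching escape routine.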

For those looking for the quickest-to-write solution: use methods from apache commons-lang:

StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() is now deprecated, but was used commonly in the past

Remember to include dependency:

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.5</version> <!--check current version! -->
</dependency>

To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/

The class is this: org.apache.commons.lang3.StringEscapeUtils;

It has a method named "escapeXml", that will return an appropriately escaped String.

If you are looking for a library to get the job done, try:

Guava 26.0 documented here

return XmlEscapers.xmlContentEscaper().escape(text);

Note: There is also an xmlAttributeEscaper()

Apache Commons Text 1.4 documented here

StringEscapeUtils.escapeXml11(text)

Note: There is also an escapeXml10() method

Here's an easy solution and it's great for encoding accented characters too!

String in = "Hi Lârry & Môe!";

StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
    char c = in.charAt(i);
    if(c < 32 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {  // c < 32 so that U+001F counts as a control character too
        out.append("&#" + (int) c + ";");
    } else {
        out.append(c);
    }
}

System.out.printf("%s%n", out);

Outputs

Hi L&#226;rry &#38; M&#244;e!

You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.

Use JAXP and forget about text handling; it will be done for you automatically.
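As a sketch of what that can look like, the following builds a one-element DOM document and serializes it with the JDK's identity Transformer. The raw strings are handed to the DOM and the serializer takes care of the escaping. JaxpSketch, the element name item and the attribute name label are assumptions made for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class JaxpSketch {

    static String serialize(String attributeValue, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element item = doc.createElement("item");
            item.setAttribute("label", attributeValue);   // raw value, no manual escaping
            item.setTextContent(text);                     // raw text, no manual escaping
            doc.appendChild(item);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("R&D", "1 < 2 & 3"));
    }
}
```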

Try to encode the XML using Apache XML serializer

//Serialize DOM
OutputFormat format    = new OutputFormat (doc); 
// as a String
StringWriter stringOut = new StringWriter ();    
XMLSerializer serial   = new XMLSerializer (stringOut, 
                                          format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());

Just replace

 & with &amp;

And for other characters:

> with &gt;
< with &lt;
\" with &quot;
' with &apos;
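The order of these replacements matters: the ampersand has to be handled first, otherwise the "&" that the other replacements introduce gets escaped a second time. A minimal sketch of both orders, with hypothetical class and method names:

```java
final class ReplaceOrderSketch {

    // '&' first: the ampersands introduced by the later replacements are left alone.
    static String ampersandFirst(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // '&' last: the "&" of "&lt;" is escaped a second time.
    static String ampersandLast(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(ampersandFirst("a < b & c"));   // a &lt; b &amp; c
        System.out.println(ampersandLast("a < b"));        // a &amp;lt; b
    }
}
```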

Reference URL: https://stackoverflow.com/questions/439298/best-way-to-encode-text-data-for-xml-in-java
