How to Recognize HTML Escape Sequences (Character Entities) in Code

2019-10-26 17:14:25

Source: reposted

Contributed by: a site user

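Every so often, data contains character sequences such as &#39; (an escaped apostrophe). They take one of two forms:

Starts with &#, followed by a run of digits, and ends with ;
Starts with &, followed by a name made of letters, and ends with ;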

The most common example is &nbsp;, or its numeric equivalent &#160;.
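The two forms described above can be recognized with a regular expression. What follows is only a minimal sketch and is not taken from Commons Lang: the class name EntityScanner and the method findEntities are hypothetical, and the pattern deliberately covers just the decimal and named forms listed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class EntityScanner {
    // Form 1: "&#" + decimal digits + ";"
    // Form 2: "&"  + a name (letters, optionally followed by digits) + ";"
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#\\d+|[A-Za-z][A-Za-z0-9]*);");

    /** Returns every substring of the input that looks like an HTML entity. */
    static List<String> findEntities(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The bare "&" before " bar" is not followed by a name and ";", so it is skipped.
        System.out.println(findEntities("Tom&#39;s&nbsp;caf&eacute; & bar"));
    }
}
```

A match only means the text has the shape of an entity. Whether a name such as eacute is actually defined still has to be checked against a lookup table.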

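A browser converts these escape sequences back into the original characters when it renders the page, but how do you decode them in code? The unescapeHtml method of org.apache.commons.lang.StringEscapeUtils is a good illustration.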

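In the first case, where the middle part is a number, the number is a Unicode code point and is converted directly to a char.
In the second case, where the middle part is a name, the only option is a lookup table: find the number that corresponds to the name in the table, then convert that number to a char. Reading the code makes this clear.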
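The two cases can be sketched in a few lines of Java. This is an illustrative sketch under stated assumptions, not the Commons Lang implementation: the class SimpleUnescaper, its unescape method, and the five-entry NAMED table are all hypothetical, and anything that does not decode cleanly is copied through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical decoder, for illustration only.
final class SimpleUnescaper {
    // Deliberately tiny lookup table: entity name -> code point.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("quot", 34);
        NAMED.put("amp", 38);
        NAMED.put("lt", 60);
        NAMED.put("gt", 62);
        NAMED.put("nbsp", 160);
    }

    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int semi = (c == '&') ? s.indexOf(';', i + 1) : -1;
            if (semi == -1) {                 // ordinary character, or "&" with no ";"
                out.append(c);
                i++;
                continue;
            }
            String body = s.substring(i + 1, semi);
            int codePoint = -1;
            if (body.matches("#\\d{1,7}")) {  // case 1: decimal number
                codePoint = Integer.parseInt(body.substring(1));
            } else {                          // case 2: name, needs the table
                Integer value = NAMED.get(body);
                if (value != null) {
                    codePoint = value;
                }
            }
            if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
                out.append(c);                // unknown entity: keep the "&" literally
                i++;
            } else {
                out.appendCodePoint(codePoint);
                i = semi + 1;                 // skip past the ";"
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&lt;b&gt;Tom&#39;s&lt;/b&gt;"));
    }
}
```

The sketch uses appendCodePoint rather than a plain cast to char so that values above U+FFFF are also handled. That is a choice made for this example.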

First, here is how HTML40 is defined.

The code is as follows:
static {
    HTML40 = new Entities();
    fillWithHtml40Entities(HTML40);
}

static void fillWithHtml40Entities(Entities entities) {
    entities.addEntities(BASIC_ARRAY);
    entities.addEntities(ISO8859_1_ARRAY);
    entities.addEntities(HTML40_ARRAY);
}
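To make the layering concrete, here is a rough sketch of a table that is filled from several String[][] arrays in turn, in the same spirit as the addEntities calls above. It is an assumption-laden illustration rather than the library's code: the class EntityTable and its internals are hypothetical, and only the method name addEntities is borrowed from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a name <-> value entity table, for illustration only.
final class EntityTable {
    private final Map<String, Integer> nameToValue = new HashMap<>();
    private final Map<Integer, String> valueToName = new HashMap<>();

    /** Adds every {name, decimalValue} pair of the array to the table. */
    void addEntities(String[][] pairs) {
        for (String[] pair : pairs) {
            int value = Integer.parseInt(pair[1]);
            nameToValue.put(pair[0], value);
            valueToName.put(value, pair[0]);
        }
    }

    /** Returns the numeric value for a name, or -1 if the name is unknown. */
    int entityValue(String name) {
        Integer value = nameToValue.get(name);
        return value == null ? -1 : value;
    }

    /** Returns the entity name for a value, or null if there is none. */
    String entityName(int value) {
        return valueToName.get(value);
    }

    public static void main(String[] args) {
        EntityTable table = new EntityTable();
        // Each call layers another group of entities onto the same table.
        table.addEntities(new String[][] {{"lt", "60"}, {"gt", "62"}});
        table.addEntities(new String[][] {{"nbsp", "160"}});
        System.out.println((char) table.entityValue("lt"));
        System.out.println(table.entityName(160));
    }
}
```

Keeping both directions in the table means the same data can serve unescaping (name to value) and escaping (value to name).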

Next, look at what BASIC_ARRAY, ISO8859_1_ARRAY, and HTML40_ARRAY each contain.

BASIC_ARRAY

The code is as follows:
private static final String[][] BASIC_ARRAY = {{"quot", "34"}, // " - double-quote
    {"amp", "38"}, // & - ampersand
    {"lt", "60"}, // < - less-than
    {"gt", "62"}, // > - greater-than
};

ISO8859_1_ARRAY

The code is as follows:
static final String[][] ISO8859_1_ARRAY = {{"nbsp", "160"}, // non-breaking space
    {"iexcl", "161"}, // inverted exclamation mark
    {"cent", "162"}, // cent sign
    {"pound", "163"}, // pound sign
    {"curren", "164"}, // currency sign
    {"yen", "165"}, // yen sign = yuan sign
    {"brvbar", "166"}, // broken bar = broken vertical bar
    {"sect", "167"}, // section sign
    {"uml", "168"}, // diaeresis = spacing diaeresis
    {"copy", "169"}, // © - copyright sign
    {"ordf", "170"}, // feminine ordinal indicator
    {"laquo", "171"}, // left-pointing double angle quotation mark = left pointing guillemet
    {"not", "172"}, // not sign
    {"shy", "173"}, // soft hyphen = discretionary hyphen
    {"reg", "174"}, // ® - registered trademark sign
    {"macr", "175"}, // macron = spacing macron = overline = APL overbar
    {"deg", "176"}, // degree sign
    {"plusmn", "177"}, // plus-minus sign = plus-or-minus sign
    {"sup2", "178"}, // superscript two = superscript digit two = squared
    {"sup3", "179"}, // superscript three = superscript digit three = cubed
    {"acute", "180"}, // acute accent = spacing acute
    {"micro", "181"}, // micro sign
    {"para", "182"}, // pilcrow sign = paragraph sign
    {"middot", "183"}, // middle dot = Georgian comma = Greek middle dot
    {"cedil", "184"}, // cedilla = spacing cedilla
    {"sup1", "185"}, // superscript one = superscript digit one
    {"ordm", "186"}, // masculine ordinal indicator
    {"raquo", "187"}, // right-pointing double angle quotation mark = right pointing guillemet
    {"frac14", "188"}, // vulgar fraction one quarter = fraction one quarter
    {"frac12", "189"}, // vulgar fraction one half = fraction one half
    {"frac34", "190"}, // vulgar fraction three quarters = fraction three quarters
    {"iquest", "191"}, // inverted question mark = turned question mark
    {"Agrave", "192"}, // À - uppercase A, grave accent
    {"Aacute", "193"}, // Á - uppercase A, acute accent
    {"Acirc", "194"}, // Â - uppercase A, circumflex accent
    {"Atilde", "195"}, // Ã - uppercase A, tilde
    {"Auml", "196"}, // Ä - uppercase A, umlaut
    {"Aring", "197"}, // Å - uppercase A, ring
    {"AElig", "198"}, // Æ - uppercase AE
    {"Ccedil", "199"}, // Ç - uppercase C, cedilla
    {"Egrave", "200"}, // È - uppercase E, grave accent
    {"Eacute", "201"}, // É - uppercase E, acute accent
    {"Ecirc", "202"}, // Ê - uppercase E, circumflex accent
    {"Euml", "203"}, // Ë - uppercase E, umlaut
    {"Igrave", "204"}, // Ì - uppercase I, grave accent
    {"Iacute", "205"}, // Í - uppercase I, acute accent
    {"Icirc", "206"}, // Î - uppercase I, circumflex accent
    {"Iuml", "207"}, // Ï - uppercase I, umlaut
    {"ETH", "208"}, // Ð - uppercase Eth, Icelandic
    {"Ntilde", "209"}, // Ñ - uppercase N, tilde
    {"Ograve", "210"}, // Ò - uppercase O, grave accent
    {"Oacute", "211"}, // Ó - uppercase O, acute accent
    {"Ocirc", "212"}, // Ô - uppercase O, circumflex accent
    {"Otilde", "213"}, // Õ - uppercase O, tilde
    {"Ouml", "214"}, // Ö - uppercase O, umlaut
    {"times", "215"}, // multiplication sign
    {"Oslash", "216"}, // Ø - uppercase O, slash
    {"Ugrave", "217"}, // Ù - uppercase U, grave accent
    {"Uacute", "218"}, // Ú - uppercase U, acute accent
    {"Ucirc", "219"}, // Û - uppercase U, circumflex accent
    {"Uuml", "220"}, // Ü - uppercase U, umlaut
    {"Yacute", "221"}, // Ý - uppercase Y, acute accent
    {"THORN", "222"}, // Þ - uppercase THORN, Icelandic
    {"szlig", "223"}, // ß - lowercase sharps, German
    {"agrave", "224"}, // à - lowercase a, grave accent
    {"aacute", "225"}, // á - lowercase a, acute accent
    {"acirc", "226"}, // â - lowercase a, circumflex accent
    {"atilde", "227"}, // ã - lowercase a, tilde
    {"auml", "228"}, // ä - lowercase a, umlaut
    {"aring", "229"}, // å - lowercase a, ring
    {"aelig", "230"}, // æ - lowercase ae
    {"ccedil", "231"}, // ç - lowercase c, cedilla
    {"egrave", "232"}, // è - lowercase e, grave accent
    {"eacute", "233"}, // é - lowercase e, acute accent
    {"ecirc", "234"}, // ê - lowercase e, circumflex accent
    {"euml", "235"}, // ë - lowercase e, umlaut
    {"igrave", "236"}, // ì - lowercase i, grave accent
    {"iacute", "237"}, // í - lowercase i, acute accent
    {"icirc", "238"}, // î - lowercase i, circumflex accent
    {"iuml", "239"}, // ï - lowercase i, umlaut
    {"eth", "240"}, // ð - lowercase eth, Icelandic
    {"ntilde", "241"}, // ñ - lowercase n, tilde
    {"ograve", "242"}, // ò - lowercase o, grave accent
    {"oacute", "243"}, // ó - lowercase o, acute accent
    {"ocirc", "244"}, // ô - lowercase o, circumflex accent
    {"otilde", "245"}, // õ - lowercase o, tilde
    {"ouml", "246"}, // ö - lowercase o, umlaut
    {"divide", "247"}, // division sign
    {"oslash", "248"}, // ø - lowercase o, slash
    {"ugrave", "249"}, // ù - lowercase u, grave accent
    {"uacute", "250"}, // ú - lowercase u, acute accent
    {"ucirc", "251"}, // û - lowercase u, circumflex accent
    {"uuml", "252"}, // ü - lowercase u, umlaut
    {"yacute", "253"}, // ý - lowercase y, acute accent
    {"thorn", "254"}, // þ - lowercase thorn, Icelandic
    {"yuml", "255"}, // ÿ - lowercase y, umlaut
};
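
The values in this array run without gaps from 160 to 255, which is the upper half of ISO-8859-1. In that encoding each byte value equals the Unicode code point of the character it represents, so the number stored for an entity can be cast straight to a char. The short check below illustrates this. It is only a sketch, and the class name Latin1Check and its two helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check, for illustration only.
final class Latin1Check {
    /** Case 1 of the article: treat the entity value as a Unicode code point. */
    static char fromEntityValue(int value) {
        return (char) value;
    }

    /** Decodes the same number as a single ISO-8859-1 byte. */
    static char fromLatin1Byte(int value) {
        byte[] single = {(byte) value};
        return new String(single, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        // Both routes should agree for every value in the table's range.
        for (int value = 160; value <= 255; value++) {
            if (fromEntityValue(value) != fromLatin1Byte(value)) {
                throw new IllegalStateException("mismatch at " + value);
            }
        }
        System.out.println("160..255 agree; 233 -> " + fromEntityValue(233));
    }
}
```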